Intermittent CAA SERVFAILs Across Two DNS Providers

2021-07-04T05:55:48Z  Error accepting authorization: acme: authorization error for sauron-retailer-app.preprod.k8.atcloud.io:
    400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for preprod.k8.atcloud.io
    - the domain''s nameservers may be malfunctioning

2021-07-15T10:14:07Z  Error accepting authorization: acme: authorization error for autoconvert-callback-service.preprod.k8.atcloud.io:
    400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for k8.atcloud.io
    - the domain''s nameservers may be malfunctioning

2021-07-17T06:16:59Z  Error accepting authorization: acme: authorization error for sauron-auth-app.preprod.k8.atcloud.io:
    400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for k8.atcloud.io
    - the domain''s nameservers may be malfunctioning

2021-07-17T06:28:17Z  Error accepting authorization: acme: authorization error for sweepr.preprod.k8.atcloud.io:
    400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for atcloud.io
    - the domain''s nameservers may be malfunctioning
    
2021-07-17T07:03:07Z  Error accepting authorization: acme: authorization error for api-admin.preprod.k8.atcloud.io:
    400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for preprod.k8.atcloud.io
    - the domain''s nameservers may be malfunctioning

2021-07-17T07:10:56Z  Error accepting authorization: acme: authorization error for forecourt-reporting.preprod.k8.atcloud.io:
    400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for forecourt-reporting.preprod.k8.atcloud.io
    - the domain''s nameservers may be malfunctioning

2021-07-19T14:32:11Z  Error accepting authorization: acme: authorization error for conversation-bot.preprod.k8.atcloud.io:
    400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for conversation-bot.preprod.k8.atcloud.io
    - the domain''s nameservers may be malfunctioning

My domain is: e.g. sweepr.preprod.k8.atcloud.io

We're issuing Let's Encrypt certs via cert-manager on Kubernetes using https://acme-v02.api.letsencrypt.org/directory.
We have hundreds of certificates but have noticed an increasing rate of the errors above when renewing certificates for existing domains. It doesn't happen for every renewal, but often enough to be noticeable.

All of these DNS names had existed for months before the errors appeared. What's interesting is that the CAA SERVFAIL is returned for different parts of the domain: for some it's the full domain (e.g. conversation-bot.preprod.k8.atcloud.io), for others it's a higher-level domain (e.g. atcloud.io).
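
To see at a glance which level of the name each SERVFAIL was reported for, the error lines can be tallied. This is only a rough sketch: the sample strings are the relevant fragments of the log excerpts above, and `tally_failures` is a made-up helper name, not part of cert-manager or any ACME client.

```python
import re
from collections import Counter

# Fragments copied from the seven log excerpts above (one per failure).
LOG_LINES = [
    "DNS problem: SERVFAIL looking up CAA for preprod.k8.atcloud.io",
    "DNS problem: SERVFAIL looking up CAA for k8.atcloud.io",
    "DNS problem: SERVFAIL looking up CAA for k8.atcloud.io",
    "DNS problem: SERVFAIL looking up CAA for atcloud.io",
    "DNS problem: SERVFAIL looking up CAA for preprod.k8.atcloud.io",
    "DNS problem: SERVFAIL looking up CAA for forecourt-reporting.preprod.k8.atcloud.io",
    "DNS problem: SERVFAIL looking up CAA for conversation-bot.preprod.k8.atcloud.io",
]

# Captures the record type (CAA, TXT, ...) and the name that was being looked up.
PATTERN = re.compile(r"SERVFAIL looking up (\w+) for (\S+)")

def tally_failures(lines):
    """Count failures per (record type, queried name) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.groups()] += 1
    return counts

failures = tally_failures(LOG_LINES)
for (record_type, name), count in failures.most_common():
    print(f"{count}  {record_type}  {name}")
```

Grouping this way makes it easier to tell whether the failures concentrate on names answered by one provider or are spread across both.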

For these names, atcloud.io is hosted by Cloudflare, but preprod.k8.atcloud.io is delegated to Google Cloud DNS.
I would have started by raising a ticket with one of the DNS providers, but since this seemingly affects both I wasn't so sure. Or could the SERVFAILs for e.g. conversation-bot.preprod.k8.atcloud.io still be coming from Cloudflare, as part of the recursive lookup to find the authoritative nameservers for that domain?

Any pointers in the correct direction would be greatly appreciated

Cheers

I'm not familiar with cert-manager on Kubernetes (and I'm trying not to be), and I'm not sure this is actually related to the CAA SERVFAILs, but at what time does cert-manager initiate its renewals? Does it start at the stroke of the clock at xx:00, or does it wait a random number of minutes?

I'm asking because in the past we've seen DNS lookup errors caused by excessive load on the Let's Encrypt servers at peak times, on the stroke of the hour, because cronjobs and systemd timers were set up to renew at exactly xx:00.

From memory, the renewal period is configurable, but the default is to renew when the certificate has fewer than 30 days left. I believe the renewal is scheduled from the exact expiry date on the certificate, so the time of day would be whatever time the certificate was initially issued.

For the example above of autoconvert-callback-service.preprod.k8.atcloud.io, the renewal was attempted and failed at 2021-07-15 10:14:07. The certificate it was trying to renew is not valid after:

Saturday, 14 August 2021 at 11:14:05 British Summer Time, so that lines up with my understanding (the hour difference is due to timezones).
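
The arithmetic can be checked directly. This is a minimal sketch that assumes the 30-day renewal window mentioned above and treats British Summer Time as a fixed UTC+1 offset; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

BST = timezone(timedelta(hours=1))   # British Summer Time is UTC+1
RENEW_BEFORE = timedelta(days=30)    # assumed default, as recalled in the post above

# "Not valid after" time of the certificate being renewed.
not_after = datetime(2021, 8, 14, 11, 14, 5, tzinfo=BST)

# When a renewal scheduled 30 days before expiry would fire, in UTC.
expected_renewal = (not_after - RENEW_BEFORE).astimezone(timezone.utc)

# Timestamp of the failed attempt from the log excerpt.
observed_failure = datetime(2021, 7, 15, 10, 14, 7, tzinfo=timezone.utc)

print(expected_renewal.isoformat())          # 2021-07-15T10:14:05+00:00
print(observed_failure - expected_renewal)   # 0:00:02
```

The failed attempt lands two seconds after the computed renewal time, which is consistent with renewals being driven by each certificate's own expiry rather than by a fixed schedule.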

In that case it's probably not due to peak load on the LE systems. I'll leave debugging your issue to my colleague volunteers now :wink:

Hey @Osiris

Do you think there's anyone who could take a quick look at this and weigh in on whether it's likely to be an issue on Let's Encrypt's side or at the DNS providers' end, just so I know where to focus my attention?

We're still seeing intermittent certificate issuance failures due to SERVFAILs being returned on challenges/CAA checks. Some more examples below:

2021-07-25T02:16:52Z  Error accepting authorization: acme: authorization error for traffic-mirror-cws.preprod.k8.atcloud.io: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for k8.atcloud.io - the domain's nameservers may be malfunctioning

2021-07-28T09  Failed to finalize Order: 403 urn:ietf:params:acme:error:caa: Error finalizing order :: While processing CAA for www.bikeshistorycheck.com: DNS problem: SERVFAIL looking up CAA for www.bikeshistorycheck.com - the domain's nameservers may be malfunctioning

2021-08-05T10  Accepting challenge authorization failed: acme: authorization error for prometheus-pushgateway.data-platform.preprod.k8.atcloud.io: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up TXT for _acme-challenge.prometheus-pushgateway.data-platform.preprod.k8.atcloud.io - the domain's nameservers may be malfunctioning

As mentioned before, we are hitting issues for domains hosted on both Google Cloud DNS and Cloudflare DNS.

Cheers!

Welcome to the Let's Encrypt Community, Mike :slightly_smiling_face:

Try running the domains in question through:

https://dnsviz.net/

Hosting isn't done by DNS servers.
I take it that your DNS zones are being served by Google Cloud DNS & Cloudflare DNS.
That said, I don't see any DNSSEC applied to zone bikeshistorycheck.com - which could have explained this strange behaviour.
See: DNS Spy report for bikeshistorycheck.com

So that we may better test, please give example domains for each DSP.

I think there may be a misfiring at zone: k8.atcloud.io
Notice the very different response given along that path:

nslookup -q=ns atcloud.io carol.ns.cloudflare.com
Server:  carol.ns.cloudflare.com
Address:  172.64.32.80
atcloud.io      nameserver = carol.ns.cloudflare.com
atcloud.io      nameserver = lloyd.ns.cloudflare.com

nslookup -q=ns k8.atcloud.io carol.ns.cloudflare.com
Server:  carol.ns.cloudflare.com
Address:  172.64.32.80
atcloud.io
        primary name server = carol.ns.cloudflare.com
        responsible mail addr = dns.cloudflare.com
        serial  = 2037966860
        refresh = 10000 (2 hours 46 mins 40 secs)
        retry   = 2400 (40 mins)
        expire  = 604800 (7 days)
        default TTL = 3600 (1 hour)

nslookup -q=ns preprod.k8.atcloud.io carol.ns.cloudflare.com
Server:  carol.ns.cloudflare.com
Address:  172.64.32.80
preprod.k8.atcloud.io   nameserver = ns-cloud-b1.googledomains.com
preprod.k8.atcloud.io   nameserver = ns-cloud-b2.googledomains.com
preprod.k8.atcloud.io   nameserver = ns-cloud-b3.googledomains.com
preprod.k8.atcloud.io   nameserver = ns-cloud-b4.googledomains.com

nslookup -q=ns api-admin.preprod.k8.atcloud.io carol.ns.cloudflare.com
Server:  carol.ns.cloudflare.com
Address:  172.64.32.80
preprod.k8.atcloud.io   nameserver = ns-cloud-b1.googledomains.com
preprod.k8.atcloud.io   nameserver = ns-cloud-b2.googledomains.com
preprod.k8.atcloud.io   nameserver = ns-cloud-b3.googledomains.com
preprod.k8.atcloud.io   nameserver = ns-cloud-b4.googledomains.com

The response from CAROL is not as expected when asked about K8 name servers.
[which if followed as instructed would create a loop - CAROL says go ask CAROL]

Thanks @rg305

This is correct: we don't have DNSSEC enabled for any of our zones, whether they're in Google or Cloudflare. I didn't think this would cause any issues, as I wasn't aware that DNSSEC was a requirement; we issue/renew certificates 90% of the time without issue.

An example purely on Cloudflare would be bikeshistorycheck.com

An example with Google Cloud DNS would be sweepr.preprod.k8.atcloud.io. In this example preprod.k8.atcloud.io is delegated to Google Cloud DNS, while the atcloud.io zone is served by Cloudflare.
The majority of our certificates are on domains delegated to Google, which is why most of the errors are for these domains. However, you can see in my original examples that occasionally the SERVFAILs came back when looking up a domain whose zone is in Cloudflare, e.g.

2021-07-17T06:28:17Z  Error accepting authorization: acme: authorization error for sweepr.preprod.k8.atcloud.io:
    400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for atcloud.io
    - the domain''s nameservers may be malfunctioning

I may be misunderstanding but I believe this is the correct behaviour.

We have atcloud.io served by Cloudflare, and preprod.k8.atcloud.io is delegated to Google. So atcloud.io points to Cloudflare nameservers, k8.atcloud.io then also points to Cloudflare nameservers, and preprod.k8.atcloud.io, which is delegated to Google, points to Google nameservers.
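
As a rough illustration of that layout, the provider answering for a given name is the one serving the closest enclosing zone. The sketch below hard-codes only the two zones described in this thread; `ZONES` and `serving_zone` are hypothetical names, and a real resolver discovers zone cuts by following referrals rather than consulting a table.

```python
# Zone cuts as described in this thread (not an exhaustive list).
ZONES = {
    "atcloud.io": "Cloudflare",
    "preprod.k8.atcloud.io": "Google Cloud DNS",
}

def serving_zone(name, zones=ZONES):
    """Return (zone, provider) for the longest zone suffix that encloses `name`."""
    labels = name.rstrip(".").lower().split(".")
    # Try the full name first, then strip one leading label at a time.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in zones:
            return candidate, zones[candidate]
    return None, None

for name in ("atcloud.io", "k8.atcloud.io", "sweepr.preprod.k8.atcloud.io"):
    print(name, "->", serving_zone(name))
```

Under this model a query for k8.atcloud.io is answered from the atcloud.io zone on Cloudflare, while anything at or below preprod.k8.atcloud.io is answered by Google Cloud DNS.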

It's worth saying that for all these domains we managed to issue a certificate initially and they generally renew OK; we're just hitting these SERVFAILs every so often during renewals. When we try again later, they work without any changes being made.

Appreciate your help on this

If you cut to the chase, yes, it seems that FQDN sweepr.preprod.k8.atcloud.io is indeed serviced by Google Cloud DNS.
But that's NOT how things are done.
You have to go step by step.
Literally:
.
io.
atcloud.io.
k8.atcloud.io.
preprod.k8.atcloud.io.
sweepr.preprod.k8.atcloud.io.

And each line must pass before proceeding to the next one.

Unlike this:
[which points you forward]

nslookup -q=ns -d preprod.k8.atcloud.io. carol.ns.cloudflare.com.
------------
Got answer:
    HEADER:
        opcode = QUERY, id = 1, rcode = NOERROR
        header flags:  response, auth. answer, want recursion
        questions = 1,  answers = 1,  authority records = 0,  additional = 0

    QUESTIONS:
        80.192.162.108.in-addr.arpa, type = PTR, class = IN
    ANSWERS:
    ->  80.192.162.108.in-addr.arpa
        name = carol.ns.cloudflare.com
        ttl = 1800 (30 mins)

------------
Server:  carol.ns.cloudflare.com
Address:  108.162.192.80

------------
Got answer:
    HEADER:
        opcode = QUERY, id = 2, rcode = NOERROR
        header flags:  response, want recursion
        questions = 1,  answers = 0,  authority records = 4,  additional = 0

    QUESTIONS:
        preprod.k8.atcloud.io, type = NS, class = IN
    AUTHORITY RECORDS:
    ->  preprod.k8.atcloud.io
        nameserver = ns-cloud-b1.googledomains.com
        ttl = 300 (5 mins)
    ->  preprod.k8.atcloud.io
        nameserver = ns-cloud-b2.googledomains.com
        ttl = 300 (5 mins)
    ->  preprod.k8.atcloud.io
        nameserver = ns-cloud-b3.googledomains.com
        ttl = 300 (5 mins)
    ->  preprod.k8.atcloud.io
        nameserver = ns-cloud-b4.googledomains.com
        ttl = 300 (5 mins)

------------
preprod.k8.atcloud.io
        nameserver = ns-cloud-b1.googledomains.com
        ttl = 300 (5 mins)
preprod.k8.atcloud.io
        nameserver = ns-cloud-b2.googledomains.com
        ttl = 300 (5 mins)
preprod.k8.atcloud.io
        nameserver = ns-cloud-b3.googledomains.com
        ttl = 300 (5 mins)
preprod.k8.atcloud.io
        nameserver = ns-cloud-b4.googledomains.com
        ttl = 300 (5 mins)

This points you backwards:

nslookup -q=ns -d k8.atcloud.io. lloyd.ns.cloudflare.com.
------------
Got answer:
    HEADER:
        opcode = QUERY, id = 1, rcode = NOERROR
        header flags:  response, auth. answer, want recursion
        questions = 1,  answers = 1,  authority records = 0,  additional = 0

    QUESTIONS:
        197.59.245.173.in-addr.arpa, type = PTR, class = IN
    ANSWERS:
    ->  197.59.245.173.in-addr.arpa
        name = lloyd.ns.cloudflare.com
        ttl = 1800 (30 mins)

------------
Server:  lloyd.ns.cloudflare.com
Address:  173.245.59.197

------------
Got answer:
    HEADER:
        opcode = QUERY, id = 2, rcode = NOERROR
        header flags:  response, auth. answer, want recursion
        questions = 1,  answers = 0,  authority records = 1,  additional = 0

    QUESTIONS:
        k8.atcloud.io, type = NS, class = IN
    AUTHORITY RECORDS:
    ->  atcloud.io
        ttl = 3600 (1 hour)
        primary name server = carol.ns.cloudflare.com
        responsible mail addr = dns.cloudflare.com
        serial  = 2037966860
        refresh = 10000 (2 hours 46 mins 40 secs)
        retry   = 2400 (40 mins)
        expire  = 604800 (7 days)
        default TTL = 3600 (1 hour)

------------
atcloud.io
        ttl = 3600 (1 hour)
        primary name server = carol.ns.cloudflare.com
        responsible mail addr = dns.cloudflare.com
        serial  = 2037966860
        refresh = 10000 (2 hours 46 mins 40 secs)
        retry   = 2400 (40 mins)
        expire  = 604800 (7 days)
        default TTL = 3600 (1 hour)
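
One way to read the two debug outputs above is to classify each response from its header fields. The sketch below follows the usual conventions (RFC 2308 describes a NOERROR response with no answers and an SOA in the authority section as NODATA); `classify_response` is an illustrative helper, and the inputs are transcribed by hand from the nslookup output, so treat it as an interpretation rather than a diagnosis of the SERVFAILs.

```python
def classify_response(rcode, authoritative, answers, authority_types):
    """Classify a DNS response from its rcode, AA flag and section contents."""
    if rcode == "NXDOMAIN":
        return "nxdomain"
    if rcode != "NOERROR":
        return "error"
    if answers > 0:
        return "answer"
    # Non-authoritative, no answers, NS records in authority: a delegation.
    if not authoritative and "NS" in authority_types:
        return "referral"
    # No answers, SOA in authority: the name exists but has no such record.
    if "SOA" in authority_types:
        return "nodata"
    return "unknown"

# preprod.k8.atcloud.io NS via carol: no "auth. answer" flag, 4 NS authority records.
preprod = classify_response("NOERROR", False, 0, ["NS"] * 4)

# k8.atcloud.io NS via lloyd: "auth. answer" flag set, 1 SOA authority record.
k8 = classify_response("NOERROR", True, 0, ["SOA"])

print(preprod, k8)
```

Read this way, the first response is a referral to Google's nameservers and the second is an authoritative "no NS records at this name" from the atcloud.io zone.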

You may be right... I may be crazy.
But I think if the delegation for preprod.k8.atcloud.io.
to Google was done at: k8.atcloud.io.
instead of at: atcloud.io.
this might not be a problem (any longer).

The reason the delegation is done at preprod.k8.atcloud.io is that we have several subdomains like this (dev, preprod, prod etc.) that are delegated to Google Cloud DNS in different Google projects for environment separation. That means they're all on different nameservers, so we can't simply delegate at k8.atcloud.io. I imagine we could set up k8.atcloud.io as a separate delegated zone in Cloudflare, which would give us unique nameservers, different from those of atcloud.io, but I don't see this solving anything.

This setup has been in place for 2+ years, and we have about 1000 certificates through Let's Encrypt on these domains, and have only started seeing these occasional (but frequent enough) issues in the past couple of months.
If it was a configuration issue I'd expect the issue to be consistent, and broader.

As I understand it, the CAA check is done backwards up the DNS tree, so for the below example:

2021-07-15T10:14:07Z  Error accepting authorization: acme: authorization error for autoconvert-callback-service.preprod.k8.atcloud.io:
    400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for k8.atcloud.io
    - the domain''s nameservers may be malfunctioning

the ordering would be:

CAA autoconvert-callback-service.preprod.k8.atcloud.io
CAA preprod.k8.atcloud.io
CAA k8.atcloud.io
CAA atcloud.io

You can see from the error that the initial CAA lookups have presumably succeeded; it's the one against atcloud.io that has then returned a SERVFAIL. At this point, the delegation bits are surely out of the equation.
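
The climbing order listed above can be sketched as follows. It mirrors the algorithm in RFC 8659, which keeps climbing until it finds a non-empty CAA record set or reaches the root, so the TLD is included as a final step. `caa_climb` and `relevant_caa_set` are illustrative names, and the `lookup` callable stands in for whatever resolver is used; how Let's Encrypt schedules these queries internally is not something this sketch claims to reproduce.

```python
def caa_climb(fqdn):
    """Names checked for CAA, from the full name up to the TLD."""
    labels = fqdn.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]

def relevant_caa_set(fqdn, lookup):
    """Return (name, records) for the first name with a non-empty CAA set.

    `lookup(name)` is a caller-supplied function returning a list of records.
    A SERVFAIL at any step would surface as an error for that specific name.
    """
    for name in caa_climb(fqdn):
        records = lookup(name)
        if records:
            return name, records
    return None, []

order = caa_climb("autoconvert-callback-service.preprod.k8.atcloud.io")
for name in order:
    print("CAA", name)

# Hypothetical stub resolver: pretends only atcloud.io publishes a CAA record.
stub = lambda name: ['0 issue "letsencrypt.org"'] if name == "atcloud.io" else []
found_at, records = relevant_caa_set(order[0], stub)
```

Because each step is a separate query, the name in the error message identifies which step failed.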

That we're also seeing it with e.g.

2021-07-28T09  Failed to finalize Order: 403 urn:ietf:params:acme:error:caa: Error finalizing order :: While processing CAA for www.bikeshistorycheck.com: DNS problem: SERVFAIL looking up CAA for www.bikeshistorycheck.com - the domain's nameservers may be malfunctioning

which also has no such delegation makes me think it's a different issue than the setup of those specific <env>.k8.atcloud.io zones

I read the failure at K8:

And the reason being:

Apologies, I copied in the wrong example.

Here's one where the failure is with the full domain:

2021-07-17T07:10:56Z  Error accepting authorization: acme: authorization error for forecourt-reporting.preprod.k8.atcloud.io:
    400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for forecourt-reporting.preprod.k8.atcloud.io
    - the domain''s nameservers may be malfunctioning

And one where the failure is at atcloud.io:

2021-07-17T06:28:17Z  Error accepting authorization: acme: authorization error for sweepr.preprod.k8.atcloud.io:
    400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for atcloud.io
    - the domain''s nameservers may be malfunctioning

Finally one for a completely different domain that is just a normal zone in Cloudflare:

2021-07-28T09  Failed to finalize Order: 403 urn:ietf:params:acme:error:caa: Error finalizing order :: While processing CAA for www.bikeshistorycheck.com: DNS problem: SERVFAIL looking up CAA for www.bikeshistorycheck.com - the domain's nameservers may be malfunctioning

Any news?

@rg305 Thanks for checking in

Other than more examples of this popping up, I've not made much progress in figuring out what's causing these SERVFAILs.

Recent example:

2021-08-12T11:06:45Z  Accepting challenge authorization failed: acme: authorization error for dpw-20210812-<redacted>.dev.k8.atcloud.io: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for dpw-20210812-<redacted>.dev.k8.atcloud.io - the domain's nameservers may be malfunctioning

I also set up some external monitoring doing frequent CAA lookups against a bunch of different domains, and have been unable to replicate the SERVFAILs Let's Encrypt is getting.
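
For reference, a monitoring probe along these lines can be written with the standard library alone. This is a sketch of the general idea and not the monitoring actually used here: the function names are made up, the resolver address is whatever recursive resolver you choose to test through, and the probe only reads the response code rather than parsing records.

```python
import socket
import struct

CAA_TYPE = 257       # CAA resource record type
RCODE_SERVFAIL = 2   # DNS response code for "server failure"

def build_query(name, qtype=CAA_TYPE, query_id=0x1234):
    """Encode a single-question DNS query with the recursion-desired bit set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN

def rcode_of(response):
    """Return the 4-bit RCODE from the flags word of a DNS response."""
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F

def caa_rcode(name, resolver_ip, timeout=3.0):
    """Send one CAA query over UDP to a recursive resolver and return its RCODE."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver_ip, 53))
        response, _ = sock.recvfrom(4096)
    return rcode_of(response)
```

Calling `caa_rcode(name, resolver_ip)` in a loop and logging any result equal to `RCODE_SERVFAIL` gives a basic probe. A probe like this sees the path from its own resolver to the authoritative servers, which may differ from the path Let's Encrypt's resolvers take, so a clean result does not rule out a provider-side problem.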

My next port of call is going to be seeing if these errors are occurring when multiple certificates are being renewed/issued, but either way I'm struggling to see the same things that LE is seeing

Are the renewals running at the top of the hour?
If so, that has been known to cause problems due to spikes.

I wish I knew why... but I still feel like this is part of the problem:

nslookup -q=ns atcloud.io. carol.ns.cloudflare.com. [PERFECT RESPONSE]
atcloud.io      nameserver = carol.ns.cloudflare.com
atcloud.io      nameserver = lloyd.ns.cloudflare.com

nslookup -q=ns k8.atcloud.io. carol.ns.cloudflare.com. [UNEXPECTED RESPONSE]
atcloud.io
        primary name server = carol.ns.cloudflare.com
        responsible mail addr = dns.cloudflare.com
        serial  = 2037966860
        refresh = 10000 (2 hours 46 mins 40 secs)
        retry   = 2400 (40 mins)
        expire  = 604800 (7 days)
        default TTL = 3600 (1 hour)

nslookup -q=ns dev.k8.atcloud.io. carol.ns.cloudflare.com. [PERFECT RESPONSE]
dev.k8.atcloud.io       nameserver = ns-cloud-e1.googledomains.com
dev.k8.atcloud.io       nameserver = ns-cloud-e2.googledomains.com
dev.k8.atcloud.io       nameserver = ns-cloud-e3.googledomains.com
dev.k8.atcloud.io       nameserver = ns-cloud-e4.googledomains.com

"Are the renewals running at the top of the hour?"

Renewals happen at effectively random times, determined by when each certificate was issued. Nothing triggers renewals on the hour, at midnight, etc., and the times we're seeing issues are pretty random.
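
A quick check of the timestamps from the first post supports this. The sketch below simply extracts the minute of each failure; the "near the top of the hour" window of one minute either side is an arbitrary choice for illustration.

```python
from datetime import datetime

# Failure timestamps from the log excerpts in the first post.
FAILURE_TIMES = [
    "2021-07-04T05:55:48Z", "2021-07-15T10:14:07Z", "2021-07-17T06:16:59Z",
    "2021-07-17T06:28:17Z", "2021-07-17T07:03:07Z", "2021-07-17T07:10:56Z",
    "2021-07-19T14:32:11Z",
]

def minutes_past_hour(stamps):
    """Return the minute-of-hour for each ISO 8601 UTC timestamp."""
    return [datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ").minute for s in stamps]

minutes = minutes_past_hour(FAILURE_TIMES)
# Failures within one minute of xx:00 (arbitrary window).
near_top = [m for m in minutes if m <= 1 or m >= 59]

print(minutes)    # [55, 14, 16, 28, 3, 10, 32]
print(near_top)   # []
```

None of these seven failures falls within a minute of the hour.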

"I wish I knew why... but I still feel like this is part of the problem"

Interestingly, when I run the same commands as you I actually get a slightly different response:

$ nslookup -q=ns atcloud.io. carol.ns.cloudflare.com.
Server:		carol.ns.cloudflare.com.
Address:	173.245.58.80#53

atcloud.io	nameserver = carol.ns.cloudflare.com.
atcloud.io	nameserver = lloyd.ns.cloudflare.com.

$ nslookup -q=ns k8.atcloud.io. carol.ns.cloudflare.com.
Server:		carol.ns.cloudflare.com.
Address:	108.162.192.80#53

*** Can't find k8.atcloud.io.: No answer

$ nslookup -q=ns dev.k8.atcloud.io. carol.ns.cloudflare.com.
Server:		carol.ns.cloudflare.com.
Address:	108.162.192.80#53

Non-authoritative answer:
*** Can't find dev.k8.atcloud.io.: No answer

Authoritative answers can be found from:
dev.k8.atcloud.io	nameserver = ns-cloud-e1.googledomains.com.
dev.k8.atcloud.io	nameserver = ns-cloud-e2.googledomains.com.
dev.k8.atcloud.io	nameserver = ns-cloud-e3.googledomains.com.
dev.k8.atcloud.io	nameserver = ns-cloud-e4.googledomains.com.

I do have a couple of other examples on some other domains:

2021-08-07T08:02:17Z  Error accepting authorization: acme: authorization error for autotrader.co.uk: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for autotrader.co.uk - the domain's nameservers may be malfunctioning

2021-08-07T08:02:17Z  Error accepting authorization: acme: authorization error for www.autotrader.co.uk: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for www.autotrader.co.uk - the domain's nameservers may be malfunctioning

The first one is autotrader.co.uk, which is configured as a CNAME in Cloudflare but, as far as Let's Encrypt and other clients are concerned, is just a single A record.

I can look into whether it's feasible to change how k8.atcloud.io is configured so it's properly delegated to separate nameservers; that way we can at least rule it in or out as the issue. The fact that other domains are affected, though, leads me to think it's something else.

I don't see the CNAME...

Me too.
This domain does look better but still has the same problem!
I can only find Cloudflare DNS as the common denominator.
