2021-07-04T05:55:48Z Error accepting authorization: acme: authorization error for sauron-retailer-app.preprod.k8.atcloud.io:
400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for preprod.k8.atcloud.io
- the domain''s nameservers may be malfunctioning
2021-07-15T10:14:07Z Error accepting authorization: acme: authorization error for autoconvert-callback-service.preprod.k8.atcloud.io:
400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for k8.atcloud.io
- the domain''s nameservers may be malfunctioning
2021-07-17T06:16:59Z Error accepting authorization: acme: authorization error for sauron-auth-app.preprod.k8.atcloud.io:
400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for k8.atcloud.io
- the domain''s nameservers may be malfunctioning
2021-07-17T06:28:17Z Error accepting authorization: acme: authorization error for sweepr.preprod.k8.atcloud.io:
400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for atcloud.io
- the domain''s nameservers may be malfunctioning
2021-07-17T07:03:07Z Error accepting authorization: acme: authorization error for api-admin.preprod.k8.atcloud.io:
400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for preprod.k8.atcloud.io
- the domain''s nameservers may be malfunctioning
2021-07-17T07:10:56Z Error accepting authorization: acme: authorization error for forecourt-reporting.preprod.k8.atcloud.io:
400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for forecourt-reporting.preprod.k8.atcloud.io
- the domain''s nameservers may be malfunctioning
2021-07-19T14:32:11Z Error accepting authorization: acme: authorization error for conversation-bot.preprod.k8.atcloud.io:
400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for conversation-bot.preprod.k8.atcloud.io
- the domain''s nameservers may be malfunctioning
We're issuing Let's Encrypt certs via cert-manager on Kubernetes using https://acme-v02.api.letsencrypt.org/directory.
We have hundreds of certificates and have noticed an increasing rate of the above errors when renewing certificates for existing domains. It doesn't happen for every renewal, but often enough to be noticeable.
All of the DNS records had existed for months before the errors above. What's interesting is that the CAA SERVFAIL is returned for different parts of the domain: for one it's the full domain, conversation-bot.preprod.k8.atcloud.io, for others it's a higher-level domain such as atcloud.io.
For these addresses atcloud.io is hosted by Cloudflare, but preprod.k8.atcloud.io is delegated to Google Cloud DNS.
I would have started by raising a ticket with one of the DNS providers, but since this seemingly affects more than one of them I wasn't so sure. Unless the SERVFAILs for e.g. conversation-bot.preprod.k8.atcloud.io are still coming from Cloudflare as part of the recursive lookup to find the authoritative nameservers for that domain?
Any pointers in the right direction would be greatly appreciated.
I'm not familiar with cert-manager on Kubernetes (and I'm trying not to be..), and I'm not sure whether this is actually related to the CAA SERVFAILs, but at what time does cert-manager initiate its renewals? Does it start them exactly on the hour at xx:00, or does it wait a random number of minutes?
I'm asking because in the past we've seen DNS lookup errors caused by excessive load on the Let's Encrypt servers at peak times exactly on the hour, because cronjobs and systemd timers were set up to renew at xx:00 exactly.
From memory, the renewal period is configurable, but the default is to renew when the certificate has fewer than 30 days left. I believe the renewal is scheduled exactly from the expiry date on the certificate, so it falls at whatever time of day the certificate was originally issued.
For the example above of autoconvert-callback-service.preprod.k8.atcloud.io, the renewal was attempted and failed at 2021-07-15 10:14:07, the certificate it is trying to renew is not valid after:
Saturday, 14 August 2021 at 11:14:05 British Summer Time, so that lines up with my understanding (the hour difference is due to timezones)
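The arithmetic behind that can be checked with a few lines of Python. This is only a sketch: the 30-day window is the default recalled above rather than a value read from cert-manager, and `RENEW_BEFORE` is a name made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: renewal is scheduled a fixed window before the certificate's
# "not valid after" time. 30 days is the default recalled in the post above.
RENEW_BEFORE = timedelta(days=30)

# "Not valid after" from the certificate: 14 August 2021 11:14:05 BST (UTC+1).
BST = timezone(timedelta(hours=1))
not_after = datetime(2021, 8, 14, 11, 14, 5, tzinfo=BST)

# Expected renewal time, expressed in UTC to match the log timestamps.
renew_at = (not_after - RENEW_BEFORE).astimezone(timezone.utc)

# Timestamp of the failed attempt taken from the log line above.
observed = datetime(2021, 7, 15, 10, 14, 7, tzinfo=timezone.utc)

print(renew_at.isoformat())   # 2021-07-15T10:14:05+00:00
print(observed - renew_at)    # 0:00:02
```

The failed attempt was logged two seconds after the computed time, which fits renewals being driven by each certificate's own expiry rather than by a shared clock tick.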
Do you think there's anyone who could take a quick look at this and weigh in on whether it's likely to be an issue on Let's Encrypt's side or at the DNS providers' end? Just so I know where to focus my attention.
We're still seeing intermittent issues on certificate issuance due to SERVFAIL's being returned on challenges/CAA checks, some more examples below:
2021-07-25T02:16:52Z Error accepting authorization: acme: authorization error for traffic-mirror-cws.preprod.k8.atcloud.io: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for k8.atcloud.io - the domain's nameservers may be malfunctioning
2021-07-28T09 Failed to finalize Order: 403 urn:ietf:params:acme:error:caa: Error finalizing order :: While processing CAA for www.bikeshistorycheck.com: DNS problem: SERVFAIL looking up CAA for www.bikeshistorycheck.com - the domain's nameservers may be malfunctioning
2021-08-05T10 Accepting challenge authorization failed: acme: authorization error for prometheus-pushgateway.data-platform.preprod.k8.atcloud.io: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up TXT for _acme-challenge.prometheus-pushgateway.data-platform.preprod.k8.atcloud.io - the domain's nameservers may be malfunctioning
As mentioned before, we are hitting issues for domains hosted on both Google Cloud DNS and Cloudflare DNS.
Hosting isn't done by DNS servers.
I take it that your DNS zones are being served by Google Cloud DNS & Cloudflare DNS.
That said, I don't see any DNSSEC applied to zone bikeshistorycheck.com - which could have explained this strange behaviour.
See: DNS Spy report for bikeshistorycheck.com
So that we may better test, please give example domains for each DSP.
The response from CAROL is not as expected when asked about K8 name servers.
[which if followed as instructed would create a loop - CAROL says go ask CAROL]
This is correct, we don't have DNSSEC enabled for any of our zones, whether they're in Google or Cloudflare. I didn't think this would cause any issues, as I wasn't aware that DNSSEC was a requirement; we issue and renew certificates without issue 90% of the time.
An example purely on Cloudflare would be bikeshistorycheck.com
An example with Google DNS would be sweepr.preprod.k8.atcloud.io. In this example preprod.k8.atcloud.io is delegated to Google DNS, the atcloud.io zone is served by Cloudflare.
The majority of our certificates are on domains delegated to Google which is why most of the errors are for these domains, however, you can see in my original examples occasionally the SERVFAILs were coming back when looking up against a domain whose zone is in Cloudflare, e.g.
2021-07-17T06:28:17Z Error accepting authorization: acme: authorization error for sweepr.preprod.k8.atcloud.io:
400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for atcloud.io
- the domain''s nameservers may be malfunctioning
I may be misunderstanding but I believe this is the correct behaviour.
We have atcloud.io served by Cloudflare, and preprod.k8.atcloud.io is delegated to Google. So atcloud.io will point to Cloudflare nameservers, k8.atcloud.io will then also point to Cloudflare nameservers, preprod.k8.atcloud.io which is delegated to Google points to Google nameservers
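As a rough illustration of that layout, the sketch below maps a name to the zone that would answer for it by longest-suffix match. The `ZONES` table and the `serving_zone` helper are hypothetical and only encode what is described in this thread; they do not query DNS.

```python
# Hypothetical zone table, based only on the setup described in this thread.
ZONES = {
    "atcloud.io": "Cloudflare",
    "preprod.k8.atcloud.io": "Google Cloud DNS",
}


def serving_zone(name: str) -> str:
    """Return the most specific zone in ZONES that contains `name`."""
    labels = name.rstrip(".").lower().split(".")
    # Try the full name first, then strip one leading label at a time.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in ZONES:
            return candidate
    raise LookupError(f"no configured zone contains {name}")


for name in (
    "atcloud.io",
    "k8.atcloud.io",
    "preprod.k8.atcloud.io",
    "sweepr.preprod.k8.atcloud.io",
):
    zone = serving_zone(name)
    print(f"{name:30} -> {zone} ({ZONES[zone]})")
```

Under these assumptions k8.atcloud.io has no zone of its own, so queries for it are answered from the atcloud.io zone in Cloudflare, while anything at or below preprod.k8.atcloud.io is answered by Google Cloud DNS.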
It's worth saying that for all of these domains we managed to issue a certificate initially, and they generally renew OK; we're just hitting these SERVFAILs every so often during renewals. When we try again later, they work without us making any changes.
If you cut to the chase, yes, it seems that FQDN sweepr.preprod.k8.atcloud.io is indeed serviced by Google Cloud DNS.
But that's NOT how things are done.
You have to go step by step.
Literally: . io. atcloud.io. k8.atcloud.io. preprod.k8.atcloud.io. sweepr.preprod.k8.atcloud.io.
And each line must pass before proceeding to the next one.
You may be right... I may be crazy.
But I think if the delegation for preprod.k8.atcloud.io.
to Google was done at: k8.atcloud.io.
instead of at: atcloud.io.
this might not be a problem (any longer).
The reason the delegation is done at preprod.k8.atcloud.io is that we have several subdomains like this (dev, preprod, prod etc.) that are delegated to Google Cloud DNS in different Google projects for environment separation. That means they're all on different nameservers, so we can't simply delegate at k8.atcloud.io. I imagine we could set up k8.atcloud.io as a separate delegated zone in Cloudflare, which would give it unique nameservers different from those of atcloud.io, but I don't see this solving anything.
This setup has been in place for 2+ years, and we have about 1000 certificates through Let's Encrypt on these domains, and have only started seeing these occasional (but frequent enough) issues in the past couple of months.
If it was a configuration issue I'd expect the issue to be consistent, and broader.
As I understand it, the CAA check is done backwards up the DNS tree, so for the below example:
2021-07-15T10:14:07Z Error accepting authorization: acme: authorization error for autoconvert-callback-service.preprod.k8.atcloud.io:
400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for k8.atcloud.io
- the domain''s nameservers may be malfunctioning
You can see from the error that the initial CAA lookups have presumably succeeded; it's the one against k8.atcloud.io, which is served from the atcloud.io zone in Cloudflare, that has then returned a SERVFAIL. At that point the delegation is surely out of the equation.
That we're also seeing it with e.g.
2021-07-28T09 Failed to finalize Order: 403 urn:ietf:params:acme:error:caa: Error finalizing order :: While processing CAA for www.bikeshistorycheck.com: DNS problem: SERVFAIL looking up CAA for www.bikeshistorycheck.com - the domain's nameservers may be malfunctioning
which also has no such delegation makes me think it's a different issue than the setup of those specific <env>.k8.atcloud.io zones
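The tree-climbing order I'm describing can be sketched like this. It assumes the lookup order from the CAA specification (start at the full name and move one label at a time towards the root); `caa_search_order` and `answered_before` are names made up for this example, and the code does not claim to reproduce how Let's Encrypt's validation servers actually schedule their queries.

```python
def caa_search_order(fqdn: str) -> list[str]:
    """Names checked for CAA records, from the full name up to the TLD."""
    labels = fqdn.rstrip(".").lower().split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def answered_before(fqdn: str, failed_name: str) -> list[str]:
    """Names earlier in the search order than the one reported as failing.

    Assumption: the error names the first lookup in the order that failed,
    so everything before it must have returned an answer (with no CAA record).
    """
    order = caa_search_order(fqdn)
    return order[: order.index(failed_name.rstrip(".").lower())]


fqdn = "autoconvert-callback-service.preprod.k8.atcloud.io"
print(caa_search_order(fqdn))
print(answered_before(fqdn, "k8.atcloud.io"))
```

For the 2021-07-15 error this gives the full name and preprod.k8.atcloud.io as the lookups that must already have been answered before the failure at k8.atcloud.io.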
Here's one where the failure is with the full domain:
2021-07-17T07:10:56Z Error accepting authorization: acme: authorization error for forecourt-reporting.preprod.k8.atcloud.io:
400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for forecourt-reporting.preprod.k8.atcloud.io
- the domain''s nameservers may be malfunctioning
And one where the failure is at atcloud.io:
2021-07-17T06:28:17Z Error accepting authorization: acme: authorization error for sweepr.preprod.k8.atcloud.io:
400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for atcloud.io
- the domain''s nameservers may be malfunctioning
Finally one for a completely different domain that is just a normal zone in Cloudflare:
2021-07-28T09 Failed to finalize Order: 403 urn:ietf:params:acme:error:caa: Error finalizing order :: While processing CAA for www.bikeshistorycheck.com: DNS problem: SERVFAIL looking up CAA for www.bikeshistorycheck.com - the domain's nameservers may be malfunctioning
Other than more examples of this popping up, I've not made much progress in figuring out what's causing these SERVFAILs.
Recent example:
2021-08-12T11:06:45Z Accepting challenge authorization failed: acme: authorization error for dpw-20210812-<redacted>.dev.k8.atcloud.io: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for dpw-20210812-<redacted>.dev.k8.atcloud.io - the domain's nameservers may be malfunctioning
I also set up some external monitoring that does frequent CAA lookups against a bunch of different domains, and I have been unable to replicate the SERVFAILs Let's Encrypt is getting.
My next port of call is to see whether these errors occur when multiple certificates are being renewed or issued at the same time, but either way I'm struggling to see the same things that LE is seeing.
Renewals happen at random times according to when the certificates were issued. Nothing triggers renewals on the hour or at midnight etc., and the times we're seeing issues are pretty random.
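To check whether failures cluster around concurrent renewals, something like the following could be run over the collected error lines. It is a rough sketch: the regular expression is my own guess at the single-line message shape seen in this thread (wrapped lines would need joining first), and the sample lines are copied from the errors posted here.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed shape of a single-line error message, based on the logs in this thread.
PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z .*?"
    r"authorization error for (?P<cert_name>\S+?): .*?"
    r"SERVFAIL looking up (?P<rrtype>\w+) for (?P<failed_name>\S+)"
)

SAMPLE = [
    "2021-07-25T02:16:52Z Error accepting authorization: acme: authorization error for traffic-mirror-cws.preprod.k8.atcloud.io: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for k8.atcloud.io - the domain's nameservers may be malfunctioning",
    "2021-08-07T08:02:17Z Error accepting authorization: acme: authorization error for autotrader.co.uk: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for autotrader.co.uk - the domain's nameservers may be malfunctioning",
    "2021-08-07T08:02:17Z Error accepting authorization: acme: authorization error for www.autotrader.co.uk: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for www.autotrader.co.uk - the domain's nameservers may be malfunctioning",
]


def parse(lines):
    """Yield (timestamp, certificate name, record type, failing name) tuples."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S")
            yield ts, m["cert_name"], m["rrtype"], m["failed_name"]


events = list(parse(SAMPLE))

# Minutes in which more than one failure was logged.
per_minute = Counter(ts.strftime("%Y-%m-%dT%H:%M") for ts, *_ in events)
bursts = {minute: n for minute, n in per_minute.items() if n > 1}

# How often each name shows up as the one the SERVFAIL was reported for.
by_failed_name = Counter(failed for *_, failed in events)

print(bursts)
print(by_failed_name.most_common())
```

With only these three sample lines the two autotrader.co.uk failures land in the same minute; run over the full set of errors, it would show whether that is typical or an exception.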
I wish I knew why... but I still feel like this is part of the problem
Interestingly when I run the same as you I actually get a slightly different response:
I do have a couple of other examples on some other domains:
2021-08-07T08:02:17Z Error accepting authorization: acme: authorization error for autotrader.co.uk: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for autotrader.co.uk - the domain's nameservers may be malfunctioning
2021-08-07T08:02:17Z Error accepting authorization: acme: authorization error for www.autotrader.co.uk: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for www.autotrader.co.uk - the domain's nameservers may be malfunctioning
The first one is autotrader.co.uk, which is configured as a CNAME in Cloudflare but, as far as Let's Encrypt etc. are concerned, is just a single A record.
I can look into whether it's feasible to change how k8.atcloud.io is configured so that it's properly delegated to separate nameservers; that way we can at least rule it in or out as the issue. The other domains being affected, though, leads me to think it's something else.