Intermittent SERVFAIL during distributed CAA rechecks despite healthy authoritative DNS responses

Hi Let's Encrypt team,

We are currently investigating repeated ACME CAA validation failures during certificate issuance for a large SAN certificate (~46 SANs) being processed through Akamai CPS using Let's Encrypt.

The errors returned are intermittent SERVFAIL responses during CAA rechecks at finalization time, for example:

"Error finalizing order :: rechecking caa: While processing CAA for app.env.example-domain.io: DNS problem: SERVFAIL looking up CAA for example-domain.io - the domain's nameservers may be malfunctioning"

Current observations:

  • Authoritative DNS is hosted on AWS Route53.
  • Manual and repeated direct authoritative queries consistently return NOERROR responses.
  • DNSSEC validation currently appears healthy.
  • Multiple public resolvers (8.8.8.8 / 1.1.1.1 / 9.9.9.9) return healthy responses.
  • We performed repeated direct authoritative CAA checks (hundreds of queries across all NS) without reproducing SERVFAIL or timeout behavior.
  • Force Early Renewal and re-submission attempts have already been performed multiple times.
  • We also tested issuance/renewal for another certificate using the same DNS infrastructure, and that validation completed successfully.

Example successful responses observed during testing:

dig @ns-xxxx.awsdns-xx.net CAA example-domain.io

;; ->>HEADER<<- opcode: QUERY, status: NOERROR

dig @ns-yyyy.awsdns-yy.com CAA app.env.example-domain.io

;; ->>HEADER<<- opcode: QUERY, status: NOERROR

Repeated validation loops across all authoritative NSes also consistently returned NOERROR responses without SERVFAIL, REFUSED, or timeout conditions.

We understand that Let's Encrypt validators are globally distributed and validation behavior may differ from localized manual testing.

We wanted to check whether:

  • there are known intermittent resolver behaviors or validation edge conditions that could explain transient SERVFAIL during distributed CAA rechecks,
  • large SAN counts could contribute to distributed resolver edge behavior,
  • there is any additional visibility recommended for troubleshooting cases where authoritative DNS appears healthy but Boulder intermittently receives SERVFAIL responses.

Any guidance would be appreciated.

Thank you.

https://dnsviz.net/ would be my recommendation for diagnosing DNS issues. Just make sure to select the CAA record type in the advanced options (it's an extra type).

Without knowing your full domain name, we can't help you further.

@Jdevasah, welcome to the community! :slightly_smiling_face:

One has to take into account that the Route 53 DNS service is an anycast service. The view of the service quality might be different from one location to another. I do not think that is the cause in your case, though; I just mention it for completeness.

A large SAN count certainly can be a contributing factor. Let's Encrypt needs to check CAA for all subparts of the domain name up to the root, for each domain name, and requests that from multiple places. That can lead to there being a lot of requests at once. Some providers might interpret that as some sort of "attack" and perform some rate-limiting, or might just not be able to keep up with the level of traffic and drop some packets. And all the names need to validate in order for Let's Encrypt to be able to issue a certificate. I wouldn't expect Route 53 specifically to have problems like that, but you're certainly not the first one who finds larger certificates more difficult to implement. And while it's the kind of thing that should "just work", you may find that splitting your requests into multiple smaller certificates makes things more reliable (and may be easier to manage in general).
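To get a feel for how many distinct names are involved, here is a minimal Python sketch of the CAA "tree climbing" described above. It only builds the list of candidate names; it sends no DNS queries. The function names are made up for illustration, the SAN list is a placeholder, and the real number of queries Let's Encrypt sends will differ, since the climb stops at the first name that has a CAA record and resolvers cache answers.

```python
from __future__ import annotations


def caa_candidate_names(san: str) -> list[str]:
    """Return the names a CAA check may walk through for one SAN,
    from the full name up to the TLD (hypothetical helper)."""
    name = san.strip().rstrip(".").lower()
    # For a wildcard SAN the CAA lookup starts at the parent name.
    if name.startswith("*."):
        name = name[2:]
    labels = name.split(".")
    # Drop one leading label at a time: a.b.c -> a.b.c, b.c, c
    return [".".join(labels[i:]) for i in range(len(labels))]


def unique_candidate_names(sans: list[str]) -> list[str]:
    """Worst-case set of distinct names to look up across all SANs."""
    unique: set[str] = set()
    for san in sans:
        unique.update(caa_candidate_names(san))
    # Sort shortest names first so the shared parents are easy to spot.
    return sorted(unique, key=lambda n: (n.count("."), n))


if __name__ == "__main__":
    # Placeholder SANs; substitute the real list from the certificate.
    sans = ["app.env.example-domain.io", "*.example-domain.io"]
    for name in unique_candidate_names(sans):
        print(name)
```

Every SAN under the same zone shares the upper part of the climb (here `example-domain.io` and `io`), so those names get asked for repeatedly, and from each validation location.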

I do want to emphasize this. We have seen around here multiple times where people thought that their DNS servers were fine, but in fact there were some underlying problems that a lot of resolvers managed to work around, but Let's Encrypt's resolver either couldn't, or couldn't within the timeout limits that their resolution system uses. Route 53 specifically uses a lot of different authoritative name servers, and it's important to make sure that for each domain name the ones they've assigned for your hosted zone are both configured in the registrar (for use on the TLD nameservers) and configured in the NS record in the hosted zone itself.
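As a rough way to check that last point, the sketch below compares two NS lists that you have collected yourself, for example with `dig NS example-domain.io @<a TLD nameserver>` for the delegation and `dig NS example-domain.io @<one of your Route 53 nameservers>` for the zone apex. It is only a comparison helper under those assumptions: the function name and the name server values are hypothetical, and it does no DNS lookups of its own.

```python
from __future__ import annotations


def _normalize(ns: str) -> str:
    """Lower-case a name server name and drop the trailing dot."""
    return ns.strip().rstrip(".").lower()


def compare_ns_sets(delegation: list[str], zone_apex: list[str]) -> dict:
    """Compare the NS names at the parent (registrar/TLD) with the NS
    record in the hosted zone itself (hypothetical helper)."""
    parent = {_normalize(n) for n in delegation if n.strip()}
    child = {_normalize(n) for n in zone_apex if n.strip()}
    return {
        "match": parent == child,
        "only_in_delegation": sorted(parent - child),
        "only_in_zone": sorted(child - parent),
    }


if __name__ == "__main__":
    # Made-up values; paste in the names returned by your own dig queries.
    delegation = ["ns-1.example.net.", "ns-2.example.org."]
    zone_apex = ["ns-1.example.net.", "ns-3.example.com."]
    print(compare_ns_sets(delegation, zone_apex))
```

Any name that shows up on only one side is worth a closer look, since a resolver that follows the stale entry can end up at a server that does not answer for the zone.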