Hi Let's Encrypt team,
We are currently investigating repeated ACME CAA validation failures during certificate issuance for a large SAN certificate (~46 SANs) being processed through Akamai CPS using Let's Encrypt.
The errors returned are intermittent SERVFAIL responses during CAA rechecks at finalization time, for example:
"Error finalizing order :: rechecking caa: While processing CAA for app.env.example-domain.io: DNS problem: SERVFAIL looking up CAA for example-domain.io - the domain's nameservers may be malfunctioning"
Current observations:
- Authoritative DNS is hosted on AWS Route53.
- Manual and repeated direct authoritative queries consistently return NOERROR responses.
- DNSSEC validation currently appears healthy.
- Multiple public resolvers (8.8.8.8 / 1.1.1.1 / 9.9.9.9) return healthy responses.
- We performed repeated direct authoritative CAA checks (hundreds of queries across all NS) without reproducing SERVFAIL or timeout behavior.
- Force Early Renewal and re-submission attempts have already been performed multiple times.
- We also tested issuance/renewal for another certificate using the same DNS infrastructure, and that validation completed successfully.
Example successful responses observed during testing:
dig @ns-xxxx.awsdns-xx.net CAA example-domain.io
;; ->>HEADER<<- opcode: QUERY, status: NOERROR
dig @ns-yyyy.awsdns-yy.com CAA app.env.example-domain.io
;; ->>HEADER<<- opcode: QUERY, status: NOERROR
Repeated validation loops across all authoritative NSes also consistently returned NOERROR responses without SERVFAIL, REFUSED, or timeout conditions.
We understand that Let's Encrypt validators are globally distributed and validation behavior may differ from localized manual testing.
We wanted to check whether:
- there are known intermittent resolver behaviors or validation edge conditions that could explain transient SERVFAIL during distributed CAA rechecks,
- whether large SAN counts could contribute to distributed resolver edge behavior,
- or whether there is any additional visibility recommended for troubleshooting cases where authoritative DNS appears healthy but Boulder intermittently receives SERVFAIL responses.
Any guidance would be appreciated.
Thank you.