Since September 9th we've occasionally been seeing an issue where Let’s Encrypt fails to issue wildcard certificates for subdomains of deno.net because it can’t look up CAA records for the net. TLD itself. The failures come back as urn:ietf:params:acme:error:dns with messages like:
DNS problem: SERVFAIL looking up CAA for net – the domain's nameservers may be malfunctioning
These errors happen before Let’s Encrypt even checks our authoritative nameservers; it appears to be failing at the TLD-level CAA lookup.
This in itself would be fine if the returned error were somehow marked as retryable, but it is not. As a result we don't know to retry it, and instead surface it as a non-retryable configuration error.
So two questions:
Why did this lookup start failing? We first saw the error on September 9th, but we've been issuing certificates with the exact same setup since the start of 2025. We had already issued tens of thousands of certs without ever seeing this before September 9th.
Could you implement retries on the TLD-level lookup in Boulder? If that is already the case and it is still not enough, can you return an error code on the order that suggests retrying the order may help?
I've attached all occurrences below (together with approximate timestamps):
2025-09-10 12:31:32+00: urn:ietf:params:acme:error:dns: While processing CAA for *.sabo28.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-09-15 01:10:41+00: urn:ietf:params:acme:error:dns: While processing CAA for *.lokou14.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-10-16 13:23:30+00: urn:ietf:params:acme:error:dns: While processing CAA for *.he.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-10-19 21:20:41+00: urn:ietf:params:acme:error:dns: While processing CAA for *.mytechsupport.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-10-21 00:10:02+00: urn:ietf:params:acme:error:dns: While processing CAA for *.fffngzzj.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-10-23 17:57:32+00: urn:ietf:params:acme:error:dns: While processing CAA for *.bastidood.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-10-29 06:26:49+00: urn:ietf:params:acme:error:dns: While processing CAA for *.seri-f.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-11-08 13:46:25+00: urn:ietf:params:acme:error:dns: While processing CAA for *.cufeyyc.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-11-09 10:55:39+00: urn:ietf:params:acme:error:dns: While processing CAA for *.anhnv02.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-11-10 08:57:42+00: urn:ietf:params:acme:error:dns: While processing CAA for *.0x76.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-11-14 18:58:10+00: urn:ietf:params:acme:error:dns: While processing CAA for *.1677bb77f047.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-11-14 21:20:04+00: urn:ietf:params:acme:error:dns: While processing CAA for *.1d49479472ca.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
net to deno.net: Authoritative AAAA records exist for ns-907.awsdns-49.net, but there are no corresponding AAAA glue records. See RFC 1034, Sec. 4.2.2.
I don't think that is the problem here. The warning is about IPv6 glue records, which in practice do not matter because LE is not IPv6-only, but instead does DNS lookups over IPv4 as necessary (as demonstrated by the fact that this works most of the time).
Also, this is the standard and only supported configuration of AWS Route53, so I do not think this can be related because it would mean this would be happening to everyone on Route53?
No, it is not standard on Route53 to be missing glue records. I use Route53 myself. Not often but we have seen cases before where the DNS records get in a bad state. Not sure if AWS changes the DNS servers for some reason or whether people have copied the wrong ones.
Generally, when there is a fault in the DNS config and you get problems related to DNS queries, the best first step is to correct your DNS config.
As to some of your other comments ...
The error message about .net CAA check is odd but Let's Encrypt walks the DNS tree. Problems in that tree can manifest in peculiar ways.
Saying the missing record must be OK because it sometimes works isn't a fair conclusion either. LE chooses a path in your tree, and every path must lead to the correct conclusion. LE does IPv4 fallback when sending HTTP challenge requests, but the IPv6/IPv4 choices for DNS queries aren't published.
If .net were failing regularly there would be vast numbers of reported failures, and there are not. Similarly, if there were general Route53 problems we would also see vast numbers of failures. LE issues over 7 million certs per day. General problems with a commonly used TLD and/or Route53 would cause numerous problems, as you might imagine. Yours is the only one reported, and it has been 7 days since you reported it.
@MikeMcQ I hear you, but we have not made any infrastructure changes in the last 6 months on this system, and yet the CAA lookup for net. (not for deno.net.) started failing on September 10th. This is important: the error occurs not while looking up the CAA for deno.net, but while doing the CAA lookup for net itself. This lookup is expected, as not just the CAA on deno.net has to allow LE, but all parent domains too. It is, however, not within our control?
This glue record issue also does not seem fixable by us: our NS servers are set up correctly, and deno.net has no glue records (which is correct). The link that you sent also does not provide any further details than "add the NS records to your registrar". The fact that AWS does not have glue records on their domain seems odd, I agree, but I don't see how it could be related here. The glue records should not be consulted here anyway, because the name servers that deno.net points to are not subdomains of deno.net itself, but rather are on other domains. As such no glue records are needed, because no recursive name lookup takes place (glue records would only be needed if deno.net's NS pointed to something like ns1.deno.net), right?
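The rule of thumb described in the post above (glue is strictly required only when a nameserver's name lies inside the zone being delegated) can be sketched as follows. This is a minimal illustration, not a DNS validator: `is_subdomain` and `glue_required` are hypothetical helper names, and `ns1.deno.net` is the made-up nameserver from the post. It does not model the "sibling" glue a parent zone may also serve for nameservers elsewhere under the same TLD, which is what the AAAA warning quoted earlier is about.

```python
def is_subdomain(name: str, zone: str) -> bool:
    """Return True if `name` equals `zone` or sits below it in the DNS tree."""
    n = name.rstrip(".").lower().split(".")
    z = zone.rstrip(".").lower().split(".")
    return len(n) >= len(z) and n[-len(z):] == z


def glue_required(zone: str, ns_name: str) -> bool:
    """Glue is strictly required only if the nameserver lives inside the delegated zone."""
    return is_subdomain(ns_name, zone)


# Route53-style nameserver on a different domain: no glue strictly required.
print(glue_required("deno.net", "ns-907.awsdns-49.net"))
# Hypothetical in-zone nameserver: the parent must supply glue.
print(glue_required("deno.net", "ns1.deno.net"))
```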
If you think I am mistaken, please let me know. But the fact that this started failing without our intervention, randomly, one day, seems odd. We issue hundreds of certs daily, and have done so before September 10th, but it only started occasionally failing on that day.
Sure, I can see how it would. Clearly then something has changed. LE uses unbound for DNS queries. Perhaps something in that changed. Unbound is complex. Doesn't mean it is wrong now but perhaps it is not working for odd cases where it worked before.
It is worth checking with your registrar or AWS about the missing glue records. Perhaps there are other problems with the DNS config and that is only one symptom. It is not normal to see missing glue for Route53.
You said you only started seeing occasional problems about 2.5 months ago. Are these problems all with this same domain? If not, what other domains exhibit failure for the same CAA query problem? Do they fail for some other DNS query problem? Are they all for domains with .net tld?
I understand the other parts of what you say. But, my experience tells me that if you have a DNS query problem the first thing to fix is any config problem in the DNS tree. If nothing else it eliminates a potential cause.
Perhaps some other volunteers with deeper DNS and/or unbound expertise will offer advice.
The problems are all with subdomains of deno.net. Do note though, that over 90% of certificates that we manage are wildcard certificates for <slug>.deno.net subdomains, so the fact that these are failing (rather than other domains), may just be luck rather than correlation.
They all fail with the same SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning.
We'll look more into the glue record thing.
If anyone else has ideas, particularly someone who works at LE and could run a query to see how many orders fail with SERVFAIL looking up CAA for net errors, that'd be super helpful I think.
It is not clear to me whether an error closer to the TLD would be reported if there was a valid CAA record closer to the domain name. I am not proficient in Boulder.
@lucacasonato What is your failure rate for that domain? Are we talking like 50% or 1%?
The failure rate between 2025-09-10 12:53:26.845864+00 (first occurrence) and 2025-11-25 13:17:23.415427+00 (most recent issued certificate) is ~0.084% of all certificates requested in that period by us and issued. It's ~0.09% of orders involving deno.net.
In addition to the 12 failures I posted in the original post, 4 additional failures have occurred since:
2025-11-19 16:42:20+00: urn:ietf:params:acme:error:dns: While processing CAA for *.digitoolmedia.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-11-23 19:57:58+00: urn:ietf:params:acme:error:dns: While processing CAA for *.purpleee.deno.net: DNS problem: SERVFAIL looking up CAA for purpleee.deno.net - the domain's nameservers may be malfunctioning
2025-11-24 01:14:01+00: urn:ietf:params:acme:error:dns: While processing CAA for *.luanmm.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
2025-11-24 04:05:04+00: urn:ietf:params:acme:error:dns: While processing CAA for *.israelsantander77.deno.net: DNS problem: SERVFAIL looking up CAA for net - the domain's nameservers may be malfunctioning
Then that's a bad design decision and should be changed. It creates unnecessary traffic, load and problems. There's just no need to process higher-level CAAs if one exists further down the hierarchy.
It has been that way for over 8 years as noted in the post I linked. Please start a new thread about that if you wish to discuss it further. Although, you'll find the comments in the Boulder code helpful for further background.
Discussing that here is not helpful and will only clutter responses for this person's issues.
All CAA lookups (e.g. for sabo28.deno.net, deno.net, and net) are launched in parallel for the sake of latency. However, the results of those later (closer to the DNS root) queries only matter if the earlier queries return no CAA records.
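As a rough illustration of the behaviour described above, the sketch below lists the names that get queried for a wildcard order and then picks the result that actually decides the outcome. It is a simplified model under stated assumptions, not Boulder's code: `caa_lookup_names`, `relevant_caa`, and `LookupFailed` are hypothetical names, lookup results are passed in as a plain dict instead of being fetched in parallel, and `None` stands in for a SERVFAIL.

```python
from typing import Dict, List, Optional, Tuple


class LookupFailed(Exception):
    """Raised when a lookup that still matters did not succeed (e.g. SERVFAIL)."""


def caa_lookup_names(fqdn: str) -> List[str]:
    """List the names to query, from most specific up to the TLD."""
    labels = fqdn.rstrip(".").lower().split(".")
    if labels[0] == "*":  # a wildcard is checked starting at its base name
        labels = labels[1:]
    return [".".join(labels[i:]) for i in range(len(labels))]


def relevant_caa(
    fqdn: str, results: Dict[str, Optional[List[str]]]
) -> Tuple[Optional[str], List[str]]:
    """Walk from the most specific name upward; the first non-empty set wins.

    `results` maps each queried name to its CAA records; None models a failed
    lookup. A failure only matters if no closer name returned records.
    """
    for name in caa_lookup_names(fqdn):
        records = results[name]
        if records is None:
            raise LookupFailed(f"SERVFAIL looking up CAA for {name}")
        if records:
            return name, records
    return None, []  # no CAA records anywhere in the tree


print(caa_lookup_names("*.sabo28.deno.net"))
```

In this model, an empty result at both `sabo28.deno.net` and `deno.net` combined with a failed `net` lookup raises `LookupFailed`, whereas any non-empty record set at `deno.net` makes the `net` result irrelevant.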
Does deno.net have CAA records? If yes, that points to a misconfiguration at deno.net (e.g. missing glue records) causing us to sometimes not retrieve those records.
Ok, then it's likely that the issue really does lie with the .net TLD nameservers. We'll look into our logs and see if we can confirm that other domains under net are having this same problem.
However, you can mitigate this problem by adding a CAA record at deno.net, so we never have to care whether the net lookup is successful.
Yes, we do see this regularly with CAA lookups for .net (and for .com, which is operated by the same authoritative name servers) across many registered domains, not just yours. We see it at somewhat small rates: about 0.04% of our CAA checks for .net result in SERVFAIL. That said, this error rate is much higher than the error rate for .org, which is about 0.002%.
Most clients recover by simply retrying the ACME issuance flow the next time the client wakes up, usually an hour or so later.
Thanks, that’s very helpful. Our ACME client retries certain errors, specifically those with a Retry-After. We don’t configure our client to retry this error automatically, because it (generally) points to a user DNS configuration problem.
Ideally LE would retry the lookup from the root servers by itself a few times, knowing that the error rate is comparatively high. That would likely incur the least additional load on LE, because we wouldn’t have to create new orders, etc.
Alternatively, if this specific error (lookup failures on the root name server) returned a different error type or a Retry-After header from LE, we could auto retry it.
Or do you think we should match on the error message to try to determine whether the error was this specific TLD nameserver error?
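For reference, message matching of that kind could look roughly like the sketch below. It assumes the detail text keeps the "SERVFAIL looking up CAA for <name>" shape seen in the errors quoted in this thread; that text is not a documented, stable interface, so this is brittle by nature. `failed_caa_name` and `is_tld_caa_servfail` are hypothetical helper names, and "the failing name has a single label" is used here as a stand-in for "the failure was at the TLD".

```python
import re
from typing import Optional

# Pattern based on the error details quoted in this thread (an assumption,
# not a stable API of the CA).
_CAA_SERVFAIL = re.compile(r"SERVFAIL looking up CAA for (?P<name>\S+)")

DNS_ERROR_TYPE = "urn:ietf:params:acme:error:dns"


def failed_caa_name(detail: str) -> Optional[str]:
    """Extract the name whose CAA lookup failed, or None if the text does not match."""
    match = _CAA_SERVFAIL.search(detail)
    if match is None:
        return None
    return match.group("name").rstrip(".").lower()


def is_tld_caa_servfail(error_type: str, detail: str) -> bool:
    """True if the failure was a CAA SERVFAIL at a single-label name such as `net`."""
    if error_type != DNS_ERROR_TYPE:
        return False
    name = failed_caa_name(detail)
    return name is not None and "." not in name


tld_failure = (
    "While processing CAA for *.sabo28.deno.net: DNS problem: SERVFAIL looking up "
    "CAA for net - the domain's nameservers may be malfunctioning"
)
print(is_tld_caa_servfail(DNS_ERROR_TYPE, tld_failure))
```

A client could treat a `True` result as transient and retry the order later, while still surfacing failures for names such as `purpleee.deno.net` as possible configuration problems.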