I'm seeing random errors renewing certs with many SANs (50-100).
The errors include the somewhat nonsensical "error looking up CAA for de", but also errors looking up A/AAAA records etc. Full response listed below for reference.
This is reminiscent of an error reported in Nov '23:
Is anyone else seeing these errors? Is there a good mitigation strategy? Every retry seems to result in an error for a different domain. Certificates with many domains may hit a rate limit before all the domains validate in one go.
For a domain like cmp.daskochrezept.de, we do a CAA lookup for cmp.daskochrezept.de first. If that's not found, we then check daskochrezept.de, and finally check de.
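To make that lookup order concrete, here is a minimal sketch of the tree climbing described above. The function names are hypothetical and this is not the CA's actual code. It only lists the names that would be queried and assumes no CAA record is found at any level, so the count is a worst case.

```python
def caa_lookup_order(name: str) -> list[str]:
    """Return the names checked for CAA, most specific first, ending at the TLD."""
    name = name.rstrip(".").lower()
    # For a wildcard name, the CAA search starts at the name without the "*." label.
    if name.startswith("*."):
        name = name[2:]
    labels = name.split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def worst_case_caa_lookups(names: list[str]) -> int:
    """Upper bound on CAA queries if no name has a CAA record.

    Assumes no caching or de-duplication between names.
    """
    return sum(len(caa_lookup_order(n)) for n in names)


print(caa_lookup_order("cmp.daskochrezept.de"))
```

With 100 three-label names and no CAA records anywhere, that comes to as many as 300 lookups sharing one overall timeout, which is why a CAA record on the name itself (or its parent) shortens the chain.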
There is an overall timeout, so if your DNS server is a little bit slow, large validations can fail. Because de is checked last, an error there means the earlier lookups for that name succeeded, but you're hitting the timeout.
The best mitigation, if you can, is to just add CAA records allowing issuance as far down as you can.
The Nov 23 thread was due to a broken DNS implementation, and probably isn’t related.
Another workaround to consider is having fewer SANs per certificate. In many cases it can be easier to manage more certificates with fewer names each, if only because there are fewer things that can go wrong during validation, but also because the certificates that need to be sent to each of your users are smaller.
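If you do want to experiment with smaller certificates, a minimal sketch of splitting a name list into fixed-size groups could look like this. The helper name and the group size of 25 are made up for illustration. How you group names (per customer, per registered domain, etc.) is up to your tooling.

```python
def split_sans(names: list[str], max_per_cert: int) -> list[list[str]]:
    """Split a list of SANs into consecutive groups of at most max_per_cert names."""
    if max_per_cert < 1:
        raise ValueError("max_per_cert must be at least 1")
    return [names[i:i + max_per_cert] for i in range(0, len(names), max_per_cert)]


# Example: 100 hypothetical names become 4 certificate requests of 25 names each.
names = [f"site{i}.example.com" for i in range(100)]
groups = split_sans(names, 25)
print(len(groups), [len(g) for g in groups])
```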
It would be difficult for us to begin splitting SANs up into multiple certs. We have numerous certs at the 100 SAN limit and generally update and renew them without any difficulty.
I'd like to get a better idea of where the timeouts are actually coming from. Some CAA lookups do take over 500ms on a couple domains, but the vast majority are under 200ms.
I've attached 2 graphs showing the lookup times in milliseconds for 2 different certs: the problem cert, and one with even more domains that renewed on Jan 8th. If anything, the problem cert looks a little better in terms of CAA lookup delays.
The error "DNS problem: networking error looking up CAA for de" that is coming back for the TLD of random domains almost looks like a system error (perror()?) more so than an application-level timeout of the overall validation process.
Does anyone have a more granular idea of what the timeouts are on a per domain-level CAA lookup, the overall CAA validation process timeout and the domain validation as a whole timeout?
I'd like to understand which specific domains or DNS servers are to blame. The validation process fails on a different domain each time, so it's really hard to nail down which one is at fault.
Hmm. Are you saying that some combinations of names are working fine, but other combinations aren't, but it's not clear which TLDs or name servers are giving trouble? Can you try repeating some issuances in the staging environment and see if it's consistent whether a set of names works for you or not?
I don't know if their overall timeout and other "production configuration" settings are public.
I mean yes, that's what they use and their source code is public. Though as I said, the configuration settings it's using might not be. And it looks like they are working on some improvements to their DNS querying and reporting of error messages.
Another workaround you might want to try is using some other CAs. There are several free ones that use ACME (like ZeroSSL and Google) and should be relatively easy to try switching to. It might not fix your issues, but another CA might have different timeouts or at least give you different error messages to explore.
We maintain ~150 Let's Encrypt certs, many of which have 50-100 SANs in them. We hadn't run into any repeatable unexpected validation errors until these "DNS problem: networking error looking up CAA for some-tld" errors started appearing. The error appears for a random domain each time during validation.
The error is sometimes reproducible enough that I stop retrying so as not to hit the rate limit. Once in a while a quick retry works.
I was able to generate a cert with pki.goog that LetsEncrypt was failing to generate.
In summary, it appears Let's Encrypt domain validation has an unknown issue (a timeout?) when validating certs with many domains where the authoritative servers are maybe a little slow.
I'll be updating our tooling to allow the use of pki.goog in these sorts of situations.
It's a good idea to build in CA fallback in general. It's still a little rare in most tooling, but the obvious benefit is being able to retry a failed order on a different CA. A couple of the existing CAs have had intermittent trouble with their APIs, and building for such intermittent failures is increasingly beneficial, especially if you are working at scale.
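As a rough illustration of the fallback idea, here is a minimal sketch that tries each CA in order and returns the first success. `issue_with_fallback`, `IssuanceError` and the `issue` callable are hypothetical stand-ins for whatever your ACME client exposes. The directory URLs are the ones I believe Let's Encrypt and Google Trust Services publish, so check each CA's documentation before relying on them (some CAs also require external account binding).

```python
from typing import Callable, Sequence

# Assumed ACME directory URLs; verify against each CA's documentation.
CA_DIRECTORIES = [
    "https://acme-v02.api.letsencrypt.org/directory",
    "https://dv.acme-v02.api.pki.goog/directory",
]


class IssuanceError(Exception):
    """Raised by the (hypothetical) issuer callable when an order fails."""


def issue_with_fallback(
    names: Sequence[str],
    directories: Sequence[str],
    issue: Callable[[str, Sequence[str]], str],
) -> tuple[str, str]:
    """Try each CA in order; return (directory_url, certificate) for the first success."""
    errors = []
    for url in directories:
        try:
            return url, issue(url, names)
        except IssuanceError as exc:
            # Remember why this CA failed, then move on to the next one.
            errors.append(f"{url}: {exc}")
    raise IssuanceError("all CAs failed: " + "; ".join(errors))


# Usage with a fake issuer that simulates the first CA failing on a CAA lookup.
def fake_issue(url: str, names: Sequence[str]) -> str:
    if "letsencrypt" in url:
        raise IssuanceError("DNS problem: networking error looking up CAA for de")
    return "FAKE-CERT for " + ",".join(names)


used_ca, cert = issue_with_fallback(["cmp.daskochrezept.de"], CA_DIRECTORIES, fake_issue)
print(used_ca)
```

Falling back per order rather than per retry keeps you from burning through one CA's failed-validation rate limit before trying the next.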