It's going to be hard for people here to diagnose much further without having the actual list of domains you're trying. The error message "query timed out" really just means what it says, that it took too long to get a response. Perhaps your authoritative DNS servers are responding slowly, or are a long way (network-round-trip-time-wise) from where Let's Encrypt's validation servers are checking from. Perhaps your DNS server (or a firewall in front of it) sees a bunch of requests all coming at once from Let's Encrypt's various servers (as they check from 4 locations at once, for each domain name) and interprets it as an "attack" and throttles or discards the traffic.
If having fewer names works, then maybe it's because there is less total traffic for those validation requests. If I understand you correctly, the entire list of domain names works when a certificate is requested in the Let's Encrypt staging environment (which is what --dry-run uses), meaning that your servers can respond in time, sometimes. But perhaps, as the production Let's Encrypt system is under more load, it's a bit less forgiving of longer response times.
The main approaches I think you can take are:
- Look through your firewall and DNS server logs, to see if you can figure out whether the validation traffic is passing through correctly and how long it is taking to send a response.
- Double-check that your DNS servers are configured correctly (handling EDNS, supporting TCP, not having broken DNSSEC, etc.). As recommended above, DNSViz and UnboundTest can be helpful. You might also want to try the ISC EDNS Compliance Checker, and maybe even spin up a VM from a cloud provider (or several) to run something like
dig +dnssec +bufsize=512 CAA «your-domain-name»
to see how long requests are taking.
- If your DNS setup is only going to be able to respond in time for a couple of names, for whatever reason, switch from using one certificate with 40 names to multiple certificates with fewer names each. Most TLS server software handles SNI and presenting the right certificate for the right hostname really easily, and while it might not help you understand the underlying problem, it might be a workaround that can get you going for now. (And many people prefer using separate certificates for separate names anyway, for other reasons.)
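If you want to run that dig check across the whole list of names rather than one at a time, here's a rough sketch in Python. It assumes dig is installed on the machine you run it from, and it relies on the ";; Query time: N msec" line that dig prints when it gets a response. The function names (parse_query_time, time_caa_lookup, report) and the optional server argument are just names I made up for illustration, not part of any tool.

```python
import re
import subprocess

# dig prints a line such as ";; Query time: 23 msec" when it gets a response.
QUERY_TIME = re.compile(r";; Query time: (\d+) msec")


def parse_query_time(dig_output):
    """Return the query time in milliseconds from dig's output, or None if absent."""
    match = QUERY_TIME.search(dig_output)
    return int(match.group(1)) if match else None


def time_caa_lookup(name, server=None):
    """Run the dig command from above for one name and return its query time.

    `server` is an optional (hypothetical) convenience argument: pass the
    address of one of your authoritative servers to query it directly.
    """
    cmd = ["dig", "+dnssec", "+bufsize=512", "CAA", name]
    if server:
        cmd.append("@" + server)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return parse_query_time(result.stdout)


def report(names, server=None):
    """Print one line per name: the query time, or a note that none was reported."""
    for name in names:
        ms = time_caa_lookup(name, server)
        print(name, "no query time reported" if ms is None else f"{ms} msec")
```

Calling something like report(["example.com", "www.example.com"]) with your real names, from a couple of different networks, should show whether some names are consistently slow or never answer. It only measures from wherever you run it, so it won't exactly match what Let's Encrypt's validation servers see.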
There have been times when Let's Encrypt staff have managed to look at their own logs and traceroutes and whatnot to try to diagnose connectivity issues from their servers (such as this thread a couple months ago where there was some kind of packet loss), but that was with a lot of packet captures and information provided on what the affected DNS servers were seeing. If you can return with a lot more detail and it starts to look like it might be something like that, perhaps we can then ask them to take a look, too.
I hope that's all somewhat helpful info.