Timeouts when validating multiple domains

When running the command certbot certonly --apache --apache-server-root /etc/httpd --apache-vhost-root /etc/httpd/conf.d --apache-challenge-location /etc/httpd -d domain#1 -d domain#2 ......

I get a timeout "Detail: DNS problem: query timed out looking up CAA for"

I run this with about 40 domains (so -d is repeated 40 times). When I reduce this to about 20, I don't get the timeout.

All the vhosts and DNS records are in place; it's just the number of domains run together that gives me problems.

How can I bypass this?

It's probably not the number of domains. It might be a single domain with misbehaving DNS that causes the timeout, which is always a risk when validating multiple domains at once.

Is it always the same domain name failing? Or same registrar / authoritative DNS servers?

Others here may be able to help if you provide the actual domain names. Or try these sites to help diagnose failing DNS:

https://dnsviz.net/
https://unboundtest.com/

I did a dry-run with sets of the domains to pinpoint the domain that may be faulty and every dry-run succeeded.

I then tried a dry-run with all the domains together (the same set that fails without --dry-run), and the dry-run succeeded.

But when I remove the dry-run and run the same domains it fails with the same errors I posted in the original post. How is this possible?

It's going to be hard for people here to diagnose much further without having the actual list of domains you're trying. The error message "query timed out" really just means what it says, that it took too long to get a response. Perhaps your authoritative DNS servers are responding slowly, or are a long way (network-round-trip-time-wise) from where Let's Encrypt's validation servers are checking from. Perhaps your DNS server (or a firewall in front of it) sees a bunch of requests all coming at once from Let's Encrypt's various servers (as they check from 4 locations at once, for each domain name) and interprets it as an "attack" and throttles or discards the traffic.

If having fewer names works, then maybe it's because there is less total traffic for those validation requests. If I understand you correctly, the entire list of domain names works when a certificate is requested in the Let's Encrypt staging environment (which is what --dry-run uses), meaning that your servers can respond in time, at least sometimes. But perhaps, as the production Let's Encrypt system is under more load, it's a bit less forgiving of longer response times.

The main approaches I think you can take are:

  1. Look through your firewall and DNS server logs to see whether the validation traffic is getting through correctly and how long your servers take to send a response.
  2. Double-check that your DNS servers are configured correctly (handling EDNS, supporting TCP, not having broken DNSSEC, etc.). As recommended above, DNSViz and UnboundTest can be helpful. You might also want to try the ISC EDNS Compliance Checker and maybe even spinning up a VM from a cloud provider (or several) to run stuff like dig +dnssec +bufsize=512 CAA «your-domain-name» to see how long requests are taking.
  3. If your DNS setup is only going to be able to respond for a couple of names in time, for whatever reason, switch from using one certificate with 40 names to multiple certificates with fewer names each. Most TLS server software handles SNI and presents the right certificate for the right hostname really easily, and while this might not help you understand the underlying problem, it might be a workaround that can get you going for now. (And many people prefer using separate certificates for separate names anyway, for other reasons.)
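
As a rough sketch of approach 3, a small bash helper could split the domain list into batches and print one certbot command per batch. The function name print_batched_commands, the CERTBOT_BASE variable, the batch size, and the domains.txt file are assumptions for illustration; the certbot flags are the ones from the original post. It only prints the commands, so nothing is requested until you review and run them yourself.

```shell
#!/usr/bin/env bash
# Sketch only: prints certbot commands, does not execute them.

# Base command taken from the original post; adjust paths for your system.
CERTBOT_BASE="certbot certonly --apache --apache-server-root /etc/httpd --apache-vhost-root /etc/httpd/conf.d --apache-challenge-location /etc/httpd"

# print_batched_commands BATCH_SIZE DOMAIN...
# Groups the given domains into batches of BATCH_SIZE and prints one
# certbot command (one certificate) per batch.
print_batched_commands() {
  local size="$1"
  shift
  local args=()
  local count=0
  local domain
  for domain in "$@"; do
    args+=("-d" "$domain")
    count=$((count + 1))
    if [ "$count" -eq "$size" ]; then
      echo "$CERTBOT_BASE ${args[*]}"
      args=()
      count=0
    fi
  done
  # Print the final, partially filled batch, if any.
  if [ "$count" -gt 0 ]; then
    echo "$CERTBOT_BASE ${args[*]}"
  fi
}

# Example usage (domains.txt is a hypothetical file, one domain per line):
#   mapfile -t domains < domains.txt
#   print_batched_commands 10 "${domains[@]}"
```

The batch size is a guess to tune; the thread only establishes that about 20 names worked and about 40 did not.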

There have been times when Let's Encrypt staff have managed to look at their own logs and traceroutes and whatnot to try to diagnose connectivity issues from their servers (such as this thread a couple months ago where there was some kind of packet loss), but that was with a lot of packet captures and information provided on what the affected DNS servers were seeing. If you can return with a lot more detail and it starts to look like it might be something like that, perhaps we can then ask them to take a look, too.

I hope that's all somewhat helpful info.

Could you share the DNS names mentioned in the "DNS problem" error message, as well as a list of the domains in question?

It is possible that one or more of the domains' nameservers are either very slow to respond or drop queries when contacted by Let's Encrypt's DNS resolver.
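
One way to look for a slow nameserver is to time a CAA lookup for each name with dig, using the options suggested earlier in the thread. This is only a sketch: the function names extract_query_time and check_caa_times are made up for illustration, and the lookup goes through whatever resolver your machine uses, not Let's Encrypt's, so cached answers can look faster than what the validation servers see.

```shell
#!/usr/bin/env bash
# Sketch only: reports dig's own "Query time" for a CAA lookup per domain.

# extract_query_time: reads dig output on stdin and prints the query time
# in milliseconds, or "no-answer" if dig reported no query time at all
# (for example after a timeout).
extract_query_time() {
  awk '/^;; Query time:/ { print $4; found = 1 }
       END { if (!found) print "no-answer" }'
}

# check_caa_times DOMAIN...
# Prints "domain<TAB>msec" for each domain given.
check_caa_times() {
  local domain
  for domain in "$@"; do
    printf '%s\t%s\n' "$domain" \
      "$(dig +dnssec +bufsize=512 CAA "$domain" 2>&1 | extract_query_time)"
  done
}

# Example usage (placeholder names):
#   check_caa_times example.com example.org
```

Names that consistently show a high time or "no-answer" would be the first ones to check with DNSViz or UnboundTest.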

--dry-run uses the staging environment, which may use different resolver IPs and/or have a different load level, different sensitivity to timeouts, etc.
