DNS problem: query timed out

jelmd · November 11, 2020, 2:41am

For some weeks now, Let's Encrypt certificate renewal has changed: it really sucks a lot and is absolutely unreliable. Especially for CNAME lookups we get a lot of DNS-related errors like 'During secondary validation: DNS problem: query timed out looking up CAA for ...' or 'During secondary validation: No valid IP addresses found for ...'. I know for sure that the DNS service on our side has not changed, so I guess the LE validation software is buggy. Please fix it.

webprofusion · November 11, 2020, 2:45am

There could be a bug, but this usually means one of your nameservers is not responding. If you can share which domain/subdomain is having the problem, it might be possible to investigate.

petercooperjr · November 11, 2020, 2:48am

There are still well over a million certificates being issued per day, so it's unlikely to be something systematic on the Let's Encrypt side of things. Perhaps your DNS server hasn't changed, but maybe some firewall or network routing in front of it has?

If you're not willing to divulge your domain names here publicly, maybe you could try some online tools like dnsviz.net to see if they can spot any issues?

jelmd · November 11, 2020, 2:50am

E.g. cse.cs.uni-magdeburg.de, cse.iks.cs.uni-magdeburg.de, vecs.cs.ovgu.de, vecs.cs.uni-magdeburg.de - order 60838548/6146549384 .

_az · November 11, 2020, 2:51am

There were some issues with DNS validation that were addressed around the start of October, but there haven't been many complaints since then.

I'd echo @webprofusion and say that it would be very useful to also know:

What time of day you are hitting these validation failures, and
Whether there are any common authoritative nameservers that are involved in the validation failures (for example, Network Solutions?)

jelmd · November 11, 2020, 2:53am

I doubt that this is FW related: the zones/containers provide a single service only, so the FW rules are static and never need to change.

petercooperjr · November 11, 2020, 2:59am

DNSViz does report some issues, though I'm not enough of a DNS guru to know if they're related to the failures you're experiencing.

jelmd · November 11, 2020, 2:59am

The cronjob runs every day at 0000 UTC. It checks, whether any cert requires renewal and if so it asks for renewal one after another ...

For auth servers, just use dig to find out.

_az · November 11, 2020, 3:04am

That's useful to know, thanks. If you take a look at one of the previous threads (During secondary validation: No valid IP addresses found) you'll notice there is a pattern of integrators hitting this error at 00:00, 01:00, 02:00 UTC.

There is a thundering herd effect at those times and you can make your renewal more robust by avoiding them.

There is also some advice on this at https://letsencrypt.org/docs/integration-guide/#when-to-renew .

Even though this kind of error is not ideal in general, it might be helpful to see whether moving away from that time of day gives you an improvement.
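
One way to stay clear of the herd is to have the cron wrapper sleep for a random interval before it contacts the CA. A minimal Python sketch of that idea follows; the function names, the six-hour window, and the `job` callable are assumptions for illustration and not part of any ACME client.

```python
import random
import time

# Assumed window: spread the start of the renewal over up to six hours.
DEFAULT_MAX_DELAY_SECONDS = 6 * 3600


def random_delay_seconds(max_delay_seconds=DEFAULT_MAX_DELAY_SECONDS, rng=random):
    """Pick a delay in [0, max_delay_seconds) seconds."""
    return rng.randrange(max_delay_seconds)


def run_with_jitter(job, max_delay_seconds=DEFAULT_MAX_DELAY_SECONDS,
                    sleep=time.sleep, rng=random):
    """Sleep for a random delay, then run the (hypothetical) renewal job.

    `sleep` and `rng` are injectable so the behaviour can be checked
    without actually waiting.
    """
    delay = random_delay_seconds(max_delay_seconds, rng)
    sleep(delay)
    return job()
```

Calling `run_with_jitter(renew_all_certs)` from the daily cron entry, where `renew_all_certs` stands in for whatever function drives your client, would move each run to a different second inside the window.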

jelmd · November 11, 2020, 3:05am

ovgu.de delegates the cs subdomain to {ns,ns2}.cs.ovgu.de. Not sure either why DNSViz thinks there is a problem ...

jelmd · November 11, 2020, 3:18am

This seems to be a good hint. Moved it to 21:00 - all our LE driven clients run their check cronjob ~ 0300 on $Xday ...

_az · November 11, 2020, 3:20am

Great! Let us know how your next renewals go.

Choosing a minute other than :00 would be even better, if your environment allows for it.
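
If you would rather have a fixed schedule than a random one, a stable off-the-hour slot can be derived from the hostname, so a fleet of machines spreads out without coordination. This is only a sketch: the hashing scheme, the avoided hours, and the `/usr/local/bin/renew-certs` command are assumptions, and cron interprets the hour in the machine's local time zone, so the avoided hours only line up with UTC on hosts that run in UTC.

```python
import hashlib

# Hours reported in this thread as busy (UTC); an assumption, adjust as needed.
BUSY_HOURS = (0, 1, 2)


def cron_slot(hostname, avoid_hours=BUSY_HOURS):
    """Map a hostname to a stable (hour, minute) with minute never equal to 0."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # Minutes 1..59, so the job never fires exactly on the hour.
    minute = 1 + int.from_bytes(digest[:4], "big") % 59
    allowed_hours = [h for h in range(24) if h not in avoid_hours]
    hour = allowed_hours[int.from_bytes(digest[4:8], "big") % len(allowed_hours)]
    return hour, minute


def cron_line(hostname, command):
    """Render a daily crontab entry for the derived slot."""
    hour, minute = cron_slot(hostname)
    return f"{minute} {hour} * * * {command}"
```

For example, `cron_line("vecs.cs.ovgu.de", "/usr/local/bin/renew-certs")` yields a crontab line that is the same on every run for that host but differs between hosts.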

jelmd · November 11, 2020, 3:29am

OK, changed to 21:33. Basically not a problem, because a single service does all the real work, LE clients just redirect and "poll" it for a new cert (FWIW: details are here ...).

webprofusion · November 11, 2020, 3:44am

Yes, great idea, definitely try not to do stuff on the hour. Everyone is generally running in sync with internet time and if they all do stuff on the hour the spikes are going to be pretty brutal. That said, this seems like something that will only get worse so there's possibly a scaling issue which needs to be resolved within LE.

petercooperjr · November 11, 2020, 3:47am

Just to stay within the ovgu.de domain to make the graph a bit simpler, you can look at the one for vecs.cs.ovgu.de:

Just like the other one, it says there are errors with "malformed response" and "invalid RCODE (REFUSED)." It may not be what Let's Encrypt is getting hung up on, but it might be worth looking into.

jelmd · November 11, 2020, 3:51am

Yes. I guess everybody thinks the others are smart enough to choose a time != *:00, so it is a good idea to use ... ;-). Anyway, a temporary error would be nice, so that a renewal would have a chance to succeed on the following day(s). But IIRC the RFC does not allow this?
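
Regardless of how the CA classifies the failure, the daily check can provide its own retries if it starts early enough. A rough sketch, assuming a 30-day renewal margin (the figure the integration guide suggests for 90-day certificates) and hypothetical helper names:

```python
from datetime import datetime, timedelta, timezone

# Assumed margin: start trying 30 days before the certificate expires.
RENEW_BEFORE = timedelta(days=30)


def needs_renewal(not_after, now, renew_before=RENEW_BEFORE):
    """True once the certificate is inside the renewal window."""
    return now >= not_after - renew_before


def remaining_daily_attempts(not_after, now):
    """Whole days left before expiry, i.e. how many daily retries remain."""
    return max(0, (not_after - now).days)
```

With a daily cronjob, a failed validation on one night then just means the same check fires again the next night, as long as `needs_renewal` is still true and the certificate has not expired.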

_az · November 11, 2020, 3:55am

Yes, I agree. It seems like one of the auth nameservers that is listed:

$ dig +noall +answer @ns.cs.uni-magdeburg.de cs.ovgu.de ns
cs.ovgu.de.             86400   IN      NS      ns2.cs.ovgu.de.
cs.ovgu.de.             86400   IN      NS      ns.cs.ovgu.de.
cs.ovgu.de.             86400   IN      NS      luxator.cs.ovgu.de.

does not act authoritatively and gives REFUSED:

$ dig @luxator.cs.ovgu.de cs.ovgu.de soa

; <<>> DiG 9.16.1-Ubuntu <<>> @luxator.cs.ovgu.de cs.ovgu.de soa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 60626

That might be what dnsviz is complaining about.

However, it seems like Let's Encrypt's resolver is mostly overlooking this error.

Otherwise we would see a "permanent" error rather than an intermittent one.
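
To run the same check against every listed nameserver without eyeballing dig output, something like the following Python sketch could be used. It is only an illustration using the standard library: the helper names are made up, it speaks plain UDP over IPv4 only, and it reads nothing but the RCODE and the AA (authoritative answer) flag from the response header.

```python
import socket
import struct

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}
QTYPE_SOA = 6
QCLASS_IN = 1


def build_query(name, qtype=QTYPE_SOA, query_id=0x1234):
    """Build a minimal DNS query packet (one question, recursion not requested)."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, QCLASS_IN)


def parse_header(response):
    """Return (rcode_name, is_authoritative) from a raw DNS response."""
    if len(response) < 12:
        raise ValueError("short DNS response")
    _query_id, flags = struct.unpack("!HH", response[:4])
    rcode = flags & 0x000F               # low four bits of the flags word
    authoritative = bool(flags & 0x0400)  # AA bit
    return RCODES.get(rcode, f"RCODE{rcode}"), authoritative


def probe(server, zone, timeout=3.0):
    """Ask `server` for the SOA of `zone`; a timeout raises socket.timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(zone), (server, 53))
        response, _addr = sock.recvfrom(4096)
    return parse_header(response)
```

Looping `probe(ns, "cs.ovgu.de")` over the three names from the NS answer above should separate a server that answers authoritatively from one that replies REFUSED or does not reply at all.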

jelmd · November 11, 2020, 3:58am

Ahh, there are toolbox hints. Hmmm, need to check - the mentioned server is not under my control. Anyway, IIRC that's the purpose of secondaries - if one doesn't answer, just ask the other[s]. Is this a different thing for LE?

rg305 · November 11, 2020, 8:11am

Agreed (DNS is not in a state of agreement):

petercooperjr · November 11, 2020, 1:07pm

Again I'm no DNS guru, but it wouldn't surprise me if a "REFUSED" error was treated differently than a server just not responding. But if changing the renewal time works for you, then this probably isn't the issue.
