DNS problem: query timed out

For some weeks now, Let's Encrypt certificate renewal has changed: it really sucks and is absolutely unreliable. Especially for CNAME lookups we get a lot of DNS-related errors like 'During secondary validation: DNS problem: query timed out looking up CAA for ...' or 'During secondary validation: No valid IP addresses found for ...'. I know for sure that the DNS service on our side has not been changed, so I guess the LE validation software is buggy. Please fix it.

There could be a bug, but this usually means one of your nameservers is not responding. If you can share which domain/subdomain is having the problem, it might be possible to investigate.

There are still well over a million certificates being issued per day, so it's unlikely to be something systematic on the Let's Encrypt side of things. Perhaps your DNS server hasn't changed, but maybe some firewall or network routing in front of it has?

If you're not willing to divulge your domain names here publicly, maybe you could try some online tools like dnsviz.net to see if they can spot any issues?

E.g. cse.cs.uni-magdeburg.de, cse.iks.cs.uni-magdeburg.de, vecs.cs.ovgu.de, vecs.cs.uni-magdeburg.de (order 60838548/6146549384).

There were some issues with DNS validation that were addressed around the start of October, but there haven't been many complaints since then.

I'd echo @webprofusion and say that it would be very useful to also know:

  • What time of day you are hitting these validation failures, and
  • Whether there are any common authoritative nameservers involved in the validation failures (for example, Network Solutions?)

I doubt that this is firewall related: the zones/containers provide a single service only, so the FW rules are static and never need to change.

DNSViz does report some issues, though I'm not enough of a DNS guru to know if they're related to the failures you're experiencing.

The cronjob runs every day at 00:00 UTC. It checks whether any cert requires renewal and, if so, requests the renewals one after another ...

For auth servers, just use dig to find out.

That's useful to know, thanks. If you take a look at one of the previous threads (During secondary validation: No valid IP addresses found) you'll notice there is a pattern of integrators hitting this error at 00:00, 01:00, 02:00 UTC.

There is a thundering herd effect at those times and you can make your renewal more robust by avoiding them.

There is also some advice on this at https://letsencrypt.org/docs/integration-guide/#when-to-renew .

Even though this kind of error is not ideal in general, it might be helpful to see whether moving away from that time of day gives you an improvement.

ovgu.de delegates the cs subdomain to {ns,ns2}.cs.ovgu.de. Not sure either why DNSViz thinks there is a problem ...

This seems to be a good hint. Moved it to 21:00 - all our LE driven clients run their check cronjob ~ 0300 on $Xday ...

Great! Let us know how your next renewals go.

Choosing a minute other than :00 would be even better, if your environment allows for it.
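
A minimal sketch of that scheduling advice, assuming a POSIX-style shell with `cksum` available: it derives a stable minute between 1 and 59 from a seed string, so each machine keeps its own off-the-hour slot. The function names, the use of the hostname as seed, and the `/usr/local/bin/renew-certs` path are hypothetical placeholders, not something Let's Encrypt or any particular client provides.

```shell
# Derive a stable minute in the range 1..59 from a seed string (e.g. a hostname),
# so that the job never lands on :00 and different hosts tend to spread out.
pick_minute() {
    local seed="$1"
    local crc
    # cksum prints "<crc> <byte count>" for stdin; keep only the CRC.
    crc=$(printf '%s' "$seed" | cksum | cut -d ' ' -f 1)
    # crc % 59 is 0..58, so adding 1 gives 1..59 (minute 0 is excluded).
    echo $(( crc % 59 + 1 ))
}

# Print a crontab line that runs a command daily at <hour>:<derived minute>.
# The command passed in is a placeholder for your real renewal script.
cron_line() {
    local hour="$1"
    local seed="$2"
    local cmd="$3"
    printf '%d %d * * * %s\n' "$(pick_minute "$seed")" "$hour" "$cmd"
}

# Example: a daily 21:xx slot for this host (hour taken from this thread,
# command path is hypothetical).
cron_line 21 "${HOSTNAME:-localhost}" "/usr/local/bin/renew-certs"
```

The printed line can be pasted into a crontab. Deriving the minute from a seed rather than from a random number keeps the schedule the same across reinstalls, which may or may not matter in your setup.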

OK, changed to 21:33. Basically not a problem, because a single service does all the real work, LE clients just redirect and "poll" it for a new cert (FWIW: details are here ...).

Yes, great idea, definitely try not to do stuff on the hour. Everyone is generally running in sync with internet time and if they all do stuff on the hour the spikes are going to be pretty brutal. That said, this seems like something that will only get worse so there's possibly a scaling issue which needs to be resolved within LE.

Just to stay within the ovgu.de domain and keep the graph a bit simpler, you can look at the one for vecs.cs.ovgu.de:

Just like the other one, it says there are errors with "malformed response" and "invalid RCODE (REFUSED)." It may not be what Let's Encrypt is getting hung up on, but it might be worth looking into.

Yes. I guess everyone thinks the others are so smart and choose a time != *:00, so it is a good idea to use ... ;-). Anyway, a temporary error would be nice, so that a renewal would have a chance to succeed on the following day(s). But IIRC the RFC does not allow this?

Yes, I agree. It seems like one of the auth nameservers that is listed:

$ dig +noall +answer @ns.cs.uni-magdeburg.de cs.ovgu.de ns
cs.ovgu.de.             86400   IN      NS      ns2.cs.ovgu.de.
cs.ovgu.de.             86400   IN      NS      ns.cs.ovgu.de.
cs.ovgu.de.             86400   IN      NS      luxator.cs.ovgu.de.

does not act authoritatively and gives REFUSED:

$ dig @luxator.cs.ovgu.de cs.ovgu.de soa

; <<>> DiG 9.16.1-Ubuntu <<>> @luxator.cs.ovgu.de cs.ovgu.de soa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 60626

That might be what dnsviz is complaining about.

However, it seems like Let's Encrypt's resolver is mostly overlooking this error.

Otherwise we would see a "permanent" error rather than an intermittent one.
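
If you want to repeat this check for every listed nameserver, here is a rough sketch, assuming `dig` is installed and prints its usual `->>HEADER<<-` line. The helper names `dig_status` and `check_zone` are made up for illustration, and the 3-second timeout is an arbitrary choice. It only reports the RCODE each server returns for an SOA query; it does not reproduce what Let's Encrypt's resolver does.

```shell
# Read dig output on stdin and print the RCODE from the header line,
# e.g. NOERROR or REFUSED. Prints nothing if no header line is present
# (for instance when the server did not answer at all).
dig_status() {
    sed -n 's/.*->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Ask every NS listed for a zone for the zone's SOA, without recursion,
# and print one "<nameserver><TAB><status>" line per server.
check_zone() {
    local zone="$1"
    local ns
    local status
    for ns in $(dig +short NS "$zone"); do
        status=$(dig +norecurse +time=3 +tries=1 @"$ns" "$zone" SOA | dig_status)
        printf '%s\t%s\n' "$ns" "${status:-NO RESPONSE}"
    done
}

# Usage (performs live DNS queries):
#   check_zone cs.ovgu.de
```

A server that shows anything other than NOERROR here, or no response at all, is a candidate for removal from the NS set or for fixing on the server side.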

Ahh, there are toolbox hints. Hmmm, need to check - the mentioned server is not under my control. Anyway, IIRC that's the purpose of secondaries - if one doesn't answer, just ask the other[s]. Is this a different thing for LE?

Agreed (DNS is not in a state of agreement):


image

Again, I'm no DNS guru, but it wouldn't surprise me if a "REFUSED" error was treated differently than a server just not responding. But if changing the renewal time works for you, then this probably isn't the issue. :)
