For some weeks now, Let's Encrypt certificate renewal has changed: it really sucks, it is absolutely unreliable. Especially for CNAME lookups we get a lot of DNS-related errors like 'During secondary validation: DNS problem: query timed out looking up CAA for ...' or 'During secondary validation: No valid IP addresses found for ...'. I know for sure that the DNS service on our side has not changed, so I guess the LE validation software is buggy. Please fix it.
There could be a bug, but this usually means one of your nameservers is not responding. If you can share which domain/subdomain is having the problem, it might be possible to investigate.
There are still well over a million certificates being issued per day, so it's unlikely to be something systematic on the Let's Encrypt side of things. Perhaps your DNS server hasn't changed, but maybe some firewall or network routing in front of it has?
If you're not willing to divulge your domain names here publicly, maybe you could try some online tools like dnsviz.net to see if they can spot any issues?
E.g. cse.cs.uni-magdeburg.de, cse.iks.cs.uni-magdeburg.de, vecs.cs.ovgu.de, vecs.cs.uni-magdeburg.de - order 60838548/6146549384 .
There were some issues with DNS validation that were addressed around the start of October, but there haven't been many complaints since then.
I'd echo @webprofusion and say that it would be very useful to also know:
- What time of day you are hitting these validation failures, and
- Whether there are any common authoritative nameservers that are involved in the validation failures (for example, Network Solutions?)
I doubt that this is FW related: the zones/containers provide a single service only, so the FW rules are static and never need to change.
DNSViz does report some issues, though I'm not enough of a DNS guru to know if they're related to the failures you're experiencing.
The cronjob runs every day at 0000 UTC. It checks whether any cert requires renewal and, if so, requests the renewals one after another ...
For auth servers, just use dig to find out.
That's useful to know, thanks. If you take a look at one of the previous threads (During secondary validation: No valid IP addresses found) you'll notice there is a pattern of integrators hitting this error at 00:00, 01:00, 02:00 UTC.
There is a thundering herd effect at those times and you can make your renewal more robust by avoiding them.
There is also some advice on this at https://letsencrypt.org/docs/integration-guide/#when-to-renew .
Even though this kind of error is not ideal in general, it might be helpful to see whether moving away from that time of day gives you an improvement.
ovgu.de delegates the cs subdomain to {ns,ns2}.cs.ovgu.de. Not sure either why DNSViz thinks there is a problem ...
This seems to be a good hint. Moved it to 21:00 - all our LE driven clients run their check cronjob ~ 0300 on $Xday ...
Great! Let us know how your next renewals go.
Choosing a minute other than :00 would be even better, if your environment allows for it.
OK, changed to 21:33. Basically not a problem, because a single service does all the real work, LE clients just redirect and "poll" it for a new cert (FWIW: details are here ...).
Yes, great idea, definitely try not to do stuff on the hour. Everyone is generally running in sync with internet time and if they all do stuff on the hour the spikes are going to be pretty brutal. That said, this seems like something that will only get worse so there's possibly a scaling issue which needs to be resolved within LE.
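For anyone wanting to automate the "not on the hour" advice above, here is a minimal Python sketch that derives a stable, pseudo-random time of day from a host name, so that a fleet of clients spreads out instead of all firing at :00. The function names and the `/usr/local/bin/renew-certs` command are made up for illustration and are not part of any Let's Encrypt client.

```python
import hashlib
from typing import Tuple


def renewal_slot(hostname: str) -> Tuple[int, int]:
    """Derive a stable (hour, minute) pair from the host name.

    The minute is always in 1..59, so the job never runs exactly on the hour.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    hour = digest[0] % 24          # 0..23
    minute = 1 + digest[1] % 59    # 1..59, never :00
    return hour, minute


def cron_line(hostname: str, command: str) -> str:
    """Format a daily crontab entry for the derived slot."""
    hour, minute = renewal_slot(hostname)
    return f"{minute} {hour} * * * {command}"


if __name__ == "__main__":
    # Hypothetical renewal command; substitute whatever your setup runs.
    print(cron_line("vecs.cs.ovgu.de", "/usr/local/bin/renew-certs"))
```

Hashing the host name rather than calling a random generator keeps the slot the same across runs, so the cron entry does not move around every time it is regenerated.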
Just to stay within the ovgu.de domain to make the graph a bit simpler, you can look at the one of vecs.cs.ovgu.de:
Just like the other one, it says there are errors with "malformed response" and "invalid RCODE (REFUSED)." It may not be what Let's Encrypt is getting hung up on, but it might be worth looking into.
Yes. I guess everybody thinks the others are smart enough to choose a time != *:00, so it is a good idea to use ... ;-). Anyway, a temporary error would be nice, so that a renewal would have the chance to succeed on the next day(s). But IIRC the RFC does not allow this?
Yes, I agree. It seems like one of the auth nameservers that is listed:
$ dig +noall +answer @ns.cs.uni-magdeburg.de cs.ovgu.de ns
cs.ovgu.de. 86400 IN NS ns2.cs.ovgu.de.
cs.ovgu.de. 86400 IN NS ns.cs.ovgu.de.
cs.ovgu.de. 86400 IN NS luxator.cs.ovgu.de.
does not act authoritatively and gives REFUSED:
$ dig @luxator.cs.ovgu.de cs.ovgu.de soa
; <<>> DiG 9.16.1-Ubuntu <<>> @luxator.cs.ovgu.de cs.ovgu.de soa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 60626
That might be what dnsviz is complaining about.
However, it seems like Let's Encrypt's resolver is mostly overlooking this error.
Otherwise we would see a "permanent" error rather than an intermittent one.
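To check several nameservers in one go, the `dig` runs above could be scripted. A rough Python sketch, assuming you have captured the text output of `dig @<server> <zone> soa` for each listed server yourself; `dig_status` and `suspect_servers` are hypothetical helper names, and the sample text is just the header quoted above:

```python
import re
from typing import Dict, List, Optional

# Matches the status field in dig's header line, e.g.
# ";; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 60626"
HEADER_RE = re.compile(r"->>HEADER<<-.*?status:\s*(\w+)")


def dig_status(output: str) -> Optional[str]:
    """Return the response status from dig output, or None if no header was found."""
    match = HEADER_RE.search(output)
    return match.group(1) if match else None


def suspect_servers(outputs: Dict[str, str]) -> List[str]:
    """Names of servers whose query did not come back as NOERROR.

    A missing header (for example after a timeout) also counts as suspect.
    """
    return sorted(ns for ns, text in outputs.items() if dig_status(text) != "NOERROR")


SAMPLE = """\
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 60626
"""

if __name__ == "__main__":
    print(dig_status(SAMPLE))  # REFUSED
```

This only looks at the status in the header; it does not tell you whether Let's Encrypt's resolver would actually trip over that server.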
Ahh, there are toolbox hints. Hmmm, need to check - the mentioned server is not under my control. Anyway, IIRC that's the purpose of secondaries - if one doesn't answer, just ask the other[s]. Is this a different thing for LE?
Agreed (DNS is not in a state of agreement):
Again I'm no DNS guru, but it wouldn't surprise me if a "REFUSED" error was treated differently than a server just not responding. But if changing the renewal time works for you, then this probably isn't the issue.