Hello,
we are a webhosting provider trying to troubleshoot occasional certificate renewal failures, where the LE API responds with DNS-related error like this one:
{ "type":"urn:ietf:params:acme:error:dns",
"detail":"DNS problem: SERVFAIL looking up TXT for _acme-challenge.lidovapisen.cz - the domain's nameservers may be malfunctioning",
"status":400 }
We tried checking our servers for errors and outages, but in the end - after we came up with nothing - we just got packet dump of port 53 during the LE renewal process. From the packet dump it seems like everything worked correctly on our side, see example for one of the domains below:
Nameserver 1:
No. Time Source Destination Protocol Length Info
3378442 7369.380986 66.133.109.36 91.239.200.243 DNS 101 Standard query 0x5181 TXT _acme-challenge.lidovapisen.cz OPT
3378443 7369.381195 91.239.200.243 66.133.109.36 DNS 157 Standard query response 0x5181 TXT _acme-challenge.lidovapisen.cz TXT OPT
(repeated 3 times)
Nameserver 2:
3952689 7368.768877 66.133.109.36 82.100.6.2 DNS 101 Standard query 0xde97 TXT _acme-challenge.lidovapisen.cz OPT
3952690 7368.769044 82.100.6.2 66.133.109.36 DNS 236 Standard query response 0xde97 TXT _acme-challenge.lidovapisen.cz TXT NS ns1.thinline.cz NS ns2.thinline.cz NS
(repeated 2 times)
Nameserver 3:
3177831 7368.831524 66.133.109.36 91.239.202.18 DNS 101 Standard query 0x06d9 TXT _acme-challenge.lidovapisen.cz OPT
3177832 7368.831902 91.239.202.18 66.133.109.36 DNS 236 Standard query response 0x06d9 TXT _acme-challenge.lidovapisen.cz TXT NS ns1.thinline.cz NS ns2.thinline.cz NS ns3.cesky-hosting.eu OPT
(repeated 4 times)
(All times are in seconds relative to approximately 1:30 UTC on November 30th.)
Additionaly, there were other source IP addresses trying to request the same record, I presume those are different machines belonging to the LE infrastructure trying to check the records from different places.
The failure is not restricted to TXT records, it appears for CAA records as well:
{ "type":"urn:ietf:params:acme:error:dns",
"detail":"DNS problem: SERVFAIL looking up CAA for www.laysedlakova.cz - the domain's nameservers may be malfunctioning",
"status":400 }
Again, from the packet dump it seems like the server replied correctly - both for this variant of the name and for no-www variant laysedlakova.cz as well (shortened listing with nameservers merged):
No. Time Source Destination Protocol Length Info
8819581 18415.323444 3.137.221.195 91.239.200.243 DNS 86 Standard query 0x8f84 CAA LaysedLAKovA.CZ OPT
8819582 18415.323826 91.239.200.243 3.137.221.195 DNS 146 Standard query response 0x8f84 CAA LaysedLAKovA.CZ SOA ns1.thinline.CZ OPT
8822237 18420.615618 66.133.109.36 91.239.200.243 DNS 90 Standard query 0xe991 CAA www.laysedlakova.cz OPT
8822238 18420.615784 91.239.200.243 66.133.109.36 DNS 150 Standard query response 0xe991 CAA www.laysedlakova.cz SOA ns1.thinline.cz OPT
From what I can see, it seems like our nameservers send a reply properly and it gets lost in transit somewhere. If that is the case, it's definitively an intermittent problem, we are generally trying to renew few hundreds of certificates each day and only get these failures for 5-10 of them. Considering that one of the servers is in a different datacenter and the problem appears when checking all nameservers at the same time, it feels like a packet loss on some international line or further, ie. nothing we will be able to fix.
So, few questions:
-
Can you spot something we clearly missed while trying to solve this?
-
Is LE trying to evaluate all nameservers and returning failure when it doesn't get response from all of them? Would it be possible to re-try later?
-
How should we solve this? I mean retrying from our side is an obvious solution to this problem but that feels like doing it at an incorrect end of things. (That said, we are not opposed to retrying - that's what we've been doing to renew affected certificates anyway. But we would like a confirmation from LE side that this is the preferred way of handling the situation.)
Additional notes:
I tried to search the LE forum for similar problems but came up with nothing that felt relevant. The cases I found seemed related to oddly behaving nameservers or randomized capitalization not being supported on the nameserver side. None of these seem to apply in this case.
The machine handling the renewal process is a Debian (Buster, then Bullseye) Linux, client used is
GitHub - unixcharles/acme-client: A Ruby client for the letsencrypt's ACME protocol. (considering the type of failure it feels like this is not relevant but putting it here for completness)
EDIT: formatting