DNS problem - SERVFAIL for (seemingly) correctly replied names

I noticed that one as well but unless tcpdump/wireshark decided to lowercase those particular packets and not others, my conclusion was that for some reason the requests came in lowercased and were (correctly) replied to in lowercase.

Good catch on your side though - now that you pointed it out, I did some digging in that direction. I went over the logs and the packet dump again and looked up the packets for 10 failed domains of that batch. Each and every one of them was queried in lowercase for the record that failed. See an example for a domain that failed with "SERVFAIL looking up CAA for www.esterpavlu.cz":

No.	Time	Source	Destination	Protocol	Length	Info
6297496	13475.122192	66.133.109.36	91.239.200.243	DNS	104	Standard query 0x8907 TXT _aCME-cHAlleNgE.wWW.ESterpavlu.CZ OPT
6297498	13475.122562	91.239.200.243	66.133.109.36	DNS	160	Standard query response 0x8907 TXT _aCME-cHAlleNgE.wWW.ESterpavlu.CZ TXT OPT
6297528	13475.198045	54.201.180.224	91.239.200.243	DNS	88	Standard query 0x1b8b CAA www.eSTErPaVLu.CZ OPT
6297529	13475.198247	91.239.200.243	54.201.180.224	DNS	148	Standard query response 0x1b8b CAA www.eSTErPaVLu.CZ SOA ns1.thinline.CZ OPT
6299746	13479.040015	66.133.109.36	91.239.200.243	DNS	88	Standard query 0x3144 CAA www.esterpavlu.cz OPT
6299747	13479.040178	91.239.200.243	66.133.109.36	DNS	148	Standard query response 0x3144 CAA www.esterpavlu.cz SOA ns1.thinline.cz OPT
6300480	13480.697110	66.133.109.36	91.239.200.243	DNS	88	Standard query 0xef70 CAA www.esterpavlu.cz OPT
6300481	13480.697187	91.239.200.243	66.133.109.36	DNS	148	Standard query response 0xef70 CAA www.esterpavlu.cz SOA ns1.thinline.cz OPT
6300642	13481.005659	66.133.109.36	91.239.200.243	DNS	88	Standard query 0xa356 CAA www.esterpavlu.cz OPT
6300643	13481.005773	91.239.200.243	66.133.109.36	DNS	148	Standard query response 0xa356 CAA www.esterpavlu.cz SOA ns1.thinline.cz OPT

The TXT record is queried with case randomization by 66.133.109.36 (outbound1.letsencrypt.org). The same goes for the CAA query from IP 54.201.180.224 (unnamed AWS machine). However, 66.133.109.36 then queried for the CAA record in all-lowercase, three times in this capture.

This behaviour matches the error message returned by LE API in all 10 cases - the record that is resolved in all-lowercase is the record the message from the API names as unresolvable.
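For anyone who wants to repeat this check on their own capture, here is a minimal sketch of how the all-lowercase queries could be pulled out of a packet list exported from Wireshark as text. It assumes the same column layout as the dump above (No., Time, Source, Destination, Protocol, Length, Info); the function name `lowercase_queries` is made up for illustration and is not part of any tool.

```python
import re

# Matches the Info column of a query, e.g.
# "Standard query 0x3144 CAA www.esterpavlu.cz OPT".
# Responses ("Standard query response 0x...") do not match.
QUERY_RE = re.compile(r"Standard query (0x[0-9a-f]+) (\S+) (\S+)")


def lowercase_queries(lines):
    """Yield (source_ip, qtype, qname) for queries whose name has no uppercase letter."""
    for line in lines:
        match = QUERY_RE.search(line)
        if not match:
            continue
        # Assumed layout: No., Time, Source, Destination, ... so Source is field 2.
        source_ip = line.split()[2]
        _txid, qtype, qname = match.groups()
        if qname == qname.lower():
            yield source_ip, qtype, qname
```

Feeding it the lines of the exported dump (for example via `open(...)` on whatever file the export was saved to) lists only the queries that went out without any case randomization, which can then be compared against the names in the API error messages.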

By contrast, I randomly picked a few domains that got their certificate without any error and searched for packets from the same IP. An example:

No.	Time	Source	Destination	Protocol	Length	Info
6464370	13805.054637	66.133.109.36	91.239.200.243	DNS	106	Standard query 0xdbbe TXT _ACME-chALLEngE.wwW.foXmarKetINg.cZ OPT
6464371	13805.055080	91.239.200.243	66.133.109.36	DNS	162	Standard query response 0xdbbe TXT _ACME-chALLEngE.wwW.foXmarKetINg.cZ TXT OPT
6464435	13805.198432	66.133.109.36	91.239.200.243	DNS	86	Standard query 0xa69b CAA fOxmaRkeTinG.CZ OPT
6464436	13805.198641	91.239.200.243	66.133.109.36	DNS	146	Standard query response 0xa69b CAA fOxmaRkeTinG.CZ SOA ns1.thinline.CZ OPT

Randomized case, single attempt for both records, certificate issued.

I mean... if the CAA name has 15 letters, then with one bit of randomization per letter there's roughly a 1 in 33000 chance (2^15 = 32768) of getting all-lowercase after randomization. And since the client knows it is using randomization, it might be rejecting the reply to such a request based on the lowercase alone, without checking what went out? Thing is, even if that were the case, getting all-lowercase randomly with the TXT record (13 bits in acme-challenge alone) is far less likely, and getting it in 10 cases during one night should be statistically impossible.
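To put numbers on that estimate, here is a small sketch. It assumes each letter's case is chosen independently with a 50/50 chance, which is my assumption about how the randomization works rather than something I have confirmed for the LE resolver; the helper names are made up for illustration.

```python
def randomizable_letters(name):
    """Count the characters whose case can be flipped (letters only)."""
    return sum(ch.isalpha() for ch in name)


def odds_all_lowercase(name):
    """Return N such that the chance of an all-lowercase name is 1 in N.

    Assumes every letter is independently upper- or lowercased with
    probability 1/2 (an assumption, not a confirmed resolver behaviour).
    """
    return 2 ** randomizable_letters(name)


caa_name = "www.esterpavlu.cz"
txt_name = "_acme-challenge.www.esterpavlu.cz"

print(randomizable_letters(caa_name), odds_all_lowercase(caa_name))
print(randomizable_letters(txt_name), odds_all_lowercase(txt_name))
```

Under that assumption the CAA name has 15 randomizable letters (1 in 32768) and the TXT name has 28 (1 in 268435456), so a run of all-lowercase queries in a single night is not something chance would explain.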

I have only wild theories at this point... LE DNS client just randomly deciding to not randomize but to expect randomized reply? Some cleverbox en route randomly altering DNS packets?

If I remember correctly, this has been a problem for the past year, maybe two? I think it sort of crept in, with the number of failures slowly increasing over time from something that could be dealt with manually to a daily, time-consuming annoyance. It is certainly possible that the failure rate has been constant over time and only the number of certificates grew.

The setup has been in use since LE started. Except for the bi-yearly Debian upgrades, no other change comes to mind.
