Widespread SERVFAIL problem related to DNS 0x20


#1

A particular set of name servers fails DNS 0x20, but only when queried over TCP, and typically also causes Unbound’s capsforid fallback strategy to fail (I’m not sure why, but it may be related to the different case these particular servers return in DNSSEC records).

These servers are in the 194.0.1.0/24, 194.0.2.0/24, 2001:678:4::/48, 2001:678:5::/48 (anycast) address space, apparently operated by CommunityDNS.

Authoritative name servers for .be, .pl, .gr and perhaps other ccTLDs can be found in this space, so this potentially affects any domain therein. The probability of SERVFAIL increases when the delegation path points to name servers that are themselves under an affected TLD.

This is an example query that fails DNS 0x20 (using pydig for convenience):

;; TCP response from ('2001:678:4::a', 53), 644 bytes, in 0.089 sec
;; 0x20-hack qname: <Name: youtU.Be.>
;; rcode=0(NOERROR), id=64443
;; qr=1 opcode=0 aa=0 tc=0 rd=1 ra=0 z=0 ad=0 cd=0
;; question=1, answer=0, authority=8, additional=1
;; Size query=37, response=644, amp1=17.41 amp2=7.13

;; QUESTION SECTION:
youtu.be.	IN	NS
*** WARNING: Answer didn't match question!


;; AUTHORITY SECTION:
youtu.be.	86400	IN	NS	ns1.google.com.
youtu.be.	86400	IN	NS	ns2.google.com.
youtu.be.	86400	IN	NS	ns3.google.com.
youtu.be.	86400	IN	NS	ns4.google.com.
ba141snrnoe1rc9mddgrest23g657rir.be.	600	IN	NSEC3	1 1 5 1a4e9b6c BA175A6M75ITNTD2DO5RIQLCVM45GSMR NS SOA RRSIG DNSKEY NSEC3PARAM
ba141snrnoe1rc9mddgrest23g657rir.be.	600	IN	RRSIG	NSEC3 8 2 600 20190126161101 20190117165115 2478 be. EwIccnpBcEtGMPPkaIz1bW2I7FIhEtEZ+D8RL7JRkICXk2nZobgdKcVyTDD2fIth+5ZmLzzCkK5pyX/TpUNzVjvHlI3G5W3+Ui+BhMv3jAY+2qkuwr4/IRqy9spmSfhgi2ZbEJcMc0UojeisP8ERnTsVAGuLRD9qtDXPKIWDkeY=
jq78bsrkbnnvjo7nor8f2i20vl9k8cgo.be.	600	IN	NSEC3	1 1 5 1a4e9b6c JQ7RT3IFRO588SF81JDET9H3LLMBCU9K NS DS RRSIG
jq78bsrkbnnvjo7nor8f2i20vl9k8cgo.be.	600	IN	RRSIG	NSEC3 8 2 600 20190202170011 20190123164453 2478 be. SworU9I5MQUy0hty//rVo//yG916wuZFJyZb1O1/ii/Ueo4EZUZ5lzQ3XQkI6qmZMBMmFINebbAS7gJgVKNmbaVj4vJiZ2eeurnvmGTKXwHu4MYI/OPjoUOnNwo7KokhDCCbbCqRzVe1+BHWRJyZmdppp3awVzLD4ZZ4h5lWQ48=

;; ADDITIONAL SECTION:
;; OPT: edns_version=0, udp_payload=4096, flags=do, ercode=0(NOERROR)

The same query over UDP does not manifest the same issue:

;; UDP response from ('2001:678:4::a', 53, 0, 0), 644 bytes, in 0.044 sec
;; 0x20-hack qname: <Name: yOUtu.bE.>
;; rcode=0(NOERROR), id=38167
;; qr=1 opcode=0 aa=0 tc=0 rd=1 ra=0 z=0 ad=0 cd=0
;; question=1, answer=0, authority=8, additional=1
;; Size query=37, response=644, amp1=17.41 amp2=7.13

;; QUESTION SECTION:
yOUtu.bE.	IN	NS

;; AUTHORITY SECTION:
yOUtu.bE.	86400	IN	NS	ns1.google.com.
yOUtu.bE.	86400	IN	NS	ns2.google.com.
yOUtu.bE.	86400	IN	NS	ns3.google.com.
yOUtu.bE.	86400	IN	NS	ns4.google.com.
ba141snrnoe1rc9mddgrest23g657rir.be.	600	IN	NSEC3	1 1 5 1a4e9b6c BA175A6M75ITNTD2DO5RIQLCVM45GSMR NS SOA RRSIG DNSKEY NSEC3PARAM
ba141snrnoe1rc9mddgrest23g657rir.be.	600	IN	RRSIG	NSEC3 8 2 600 20190126161101 20190117165115 2478 be. EwIccnpBcEtGMPPkaIz1bW2I7FIhEtEZ+D8RL7JRkICXk2nZobgdKcVyTDD2fIth+5ZmLzzCkK5pyX/TpUNzVjvHlI3G5W3+Ui+BhMv3jAY+2qkuwr4/IRqy9spmSfhgi2ZbEJcMc0UojeisP8ERnTsVAGuLRD9qtDXPKIWDkeY=
jq78bsrkbnnvjo7nor8f2i20vl9k8cgo.be.	600	IN	NSEC3	1 1 5 1a4e9b6c JQ7RT3IFRO588SF81JDET9H3LLMBCU9K NS DS RRSIG
jq78bsrkbnnvjo7nor8f2i20vl9k8cgo.be.	600	IN	RRSIG	NSEC3 8 2 600 20190202170011 20190123164453 2478 be. SworU9I5MQUy0hty//rVo//yG916wuZFJyZb1O1/ii/Ueo4EZUZ5lzQ3XQkI6qmZMBMmFINebbAS7gJgVKNmbaVj4vJiZ2eeurnvmGTKXwHu4MYI/OPjoUOnNwo7KokhDCCbbCqRzVe1+BHWRJyZmdppp3awVzLD4ZZ4h5lWQ48=

;; ADDITIONAL SECTION:
;; OPT: edns_version=0, udp_payload=4096, flags=do, ercode=0(NOERROR)
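For context, this is a minimal sketch of the check a 0x20-validating resolver performs (pure Python; the function names are mine, not Unbound’s): the resolver randomizes the case of the letters in the qname and requires the server to echo the name back verbatim, so a lowercased echo, as seen in the TCP response above, fails the check.

```python
import random

def encode_0x20(qname: str, rng: random.Random) -> str:
    """Randomize the case of each letter in a query name (DNS 0x20 encoding)."""
    return "".join(
        c.upper() if c.isalpha() and rng.random() < 0.5 else c.lower()
        for c in qname
    )

def matches_0x20(query_qname: str, response_qname: str) -> bool:
    """A 0x20-validating resolver requires the response qname to echo the
    query's exact case; a case-insensitive match is not sufficient."""
    return query_qname == response_qname

# A conforming server echoes the mixed-case name back verbatim:
print(matches_0x20("yOUtu.bE.", "yOUtu.bE."))  # True
# A server that lowercases the qname (as over TCP above) fails the check:
print(matches_0x20("yOUtu.bE.", "youtu.be."))  # False
```

DNS names are case-insensitive on the wire, which is exactly why the query’s case bits are free to carry extra entropy against off-path spoofing.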

Considering the widespread impact of this problem, I think Let’s Encrypt should perhaps consider getting in touch with CDNS and/or investigating whether an IP blacklist could be implemented in Unbound, similar to caps-whitelist but for servers.


#2

Incidentally, CommunityDNS also operates ns3.gratisdns.dk, which causes capsforid fallback failures by serving malformed records with an unexpected class.


#3

Yes I noticed that as well.


#4

Also, for context, TCP fallback increased a couple of months ago.


#5

This is interesting, we’ll take a look! Thanks for the detailed report. How did this come to your attention? Are you the administrator for an affected domain name?


#6

Yes. A user report last week drew our attention to the problem:

https://letsdebug.net/snf-45288.vm.okeanos-global.grnet.gr/16112

I noticed the issue with gr-c.ics.forth.gr and reported it to the ccTLD operator.

I later realized the same issue affects a number of TLDs served by CDNS. I contacted them myself in order to explain the situation.


#7

Thanks for reaching out to them. If you’re able to find out what DNS software they’re using, that would be super useful.


#8

They have their own proprietary DNS server, according to their website.

http://www.cdns.net/CommunityDNS-Leaders-in-Security.html


#9

I know nothing further myself.


#10

Further tests showed 35 TLDs affected by this breakage. I posted the list to DNS-OARC, hoping someone from CDNS would take notice.

https://lists.dns-oarc.net/pipermail/dns-operations/2019-January/018359.html


#11

CommunityDNS has confirmed (by email) they are aware of the problem and working on a fix.


#12

As of today, the problem appears to have been mostly resolved (only 5 TLDs are still affected, down from 35).

https://lists.dns-oarc.net/pipermail/dns-operations/2019-February/018373.html

Thanks to whoever worked towards getting this fixed.


#13

Also worth noting:

@_az discovered and reported the bug triggering capsforid fallback failure, leading to SERVFAIL in these cases.

https://nlnetlabs.nl/pipermail/unbound-users/2019-January/011335.html

And then it was fixed:

https://nlnetlabs.nl/pipermail/unbound-users/2019-February/011349.html


#14

The fix made it into Unbound 1.9.0, which was just released.

https://nlnetlabs.nl/projects/unbound/download/

@jsha How quickly can Let’s Encrypt upgrade Unbound? I don’t know what the process is for testing and baking a new major release, but this issue is also serious.

Edit: This post is wrong. See @_az below.


#15

It should be noted that the fix apparently has not been rolled out yet to all anycast instances of affected name servers. So depending on which instance your DNS request is routed to, you may still see this problem, hopefully not for long.

In order to test this, I created a RIPE Atlas DNS measurement: TCP to 194.0.1.25 (gr-c.ics.forth.gr, an NS for .gr) with a mixed-case qname.

Parsing the results, decoding the abuf and looking for an all-lowercase qname in the answer shows that two probes (out of 50) reach instances which still manifest the 0x20 breakage.
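That parsing step can be sketched with only the Python standard library. RIPE Atlas results carry the raw DNS response as base64 in the `abuf` field; since the question section never uses name compression, a plain label walk past the 12-byte header suffices. Function names here are mine:

```python
import base64

def qname_from_abuf(abuf_b64: str) -> str:
    """Extract the question name, with its original case, from a
    base64-encoded DNS message ('abuf' field of a RIPE Atlas result)."""
    msg = base64.b64decode(abuf_b64)
    pos = 12  # skip the fixed-size DNS header
    labels = []
    while msg[pos] != 0:  # qnames in the question section are uncompressed
        length = msg[pos]
        labels.append(msg[pos + 1 : pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return ".".join(labels) + "."

def lost_0x20(query_qname: str, abuf_b64: str) -> bool:
    """True when the server did not echo the mixed-case query name exactly."""
    return qname_from_abuf(abuf_b64) != query_qname
```

In a real measurement you would loop over each probe’s result object, feed its `abuf` to `lost_0x20`, and count the probes that hit a broken anycast instance.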


#16

It did not. 1.9.0 is a re-labelled 1.9.0rc1; the fix came in a later commit.

Let’s Encrypt will have to wait for 1.9.1 or patch manually.

Well, that explains all the weirdness I experienced yesterday while testing haha.


#17

Argh! Thanks. Sorry for the misinformation. I thought I’d checked the changelog.


#18

Thanks for the clarification. Our current plan is to wait for 1.9.1, but I’m interested to hear your evaluation of the severity of this issue. Do you think it’s worth an out-of-band deploy? It sounds like CommunityDNS is well on their way to fixing the issue on their end, and I haven’t heard of other servers with the same problem.


#19

I think that patching or disabling capsforid until a patched release is deployed would be the responsible thing to do.

CDNS may have started rolling out a fix, but it is hard to tell when that will be complete. 0x20, and especially the fallback, still feel very fragile; no wonder the feature is still tagged as experimental. I am positive there is a lot more 0x20 breakage out in the wild. Debugging it is also quite hard, IMHO; Unbound’s debug log does not really help. More verbosity, such as what @_az added to showcase the fallback sorting bug, would come in handy.
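For reference, disabling or scoping 0x20 is a per-resolver setting in unbound.conf; a sketch using the documented options (the exempted zone is illustrative):

```
server:
    # Option A: turn DNS 0x20 off entirely until a patched release is deployed.
    use-caps-for-id: no

    # Option B: keep 0x20 on, but exempt zones served by the affected
    # name servers (caps-whitelist takes a zone name).
    # use-caps-for-id: yes
    # caps-whitelist: "be."
```

Option A trades away the extra anti-spoofing entropy for all queries, so Option B is the narrower workaround if the affected zones are known.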


#20

Hard to judge with anycast :man_shrugging:. From my perspective (2 locations), it’s been fully fixed by CDNS in the past 48h, but whether it still affects the two Let’s Encrypt facilities is probably a question for the VA logs.