Widespread SERVFAIL problem related to DNS 0x20


#1

A particular set of name servers fails DNS 0x20, but only when queried over TCP, and typically also causes Unbound’s capsforid fallback strategy to fail (I’m not sure why, but it may be related to the different case these particular servers return in DNSSEC records).

These servers are in the 194.0.1.0/24, 194.0.2.0/24, 2001:678:4::/48, 2001:678:5::/48 (anycast) address space, apparently operated by CommunityDNS.

Authoritative name servers for .be, .pl, .gr and perhaps other ccTLDs can be found in this space, so this potentially affects any domain therein. The probability of SERVFAIL increases when the delegation path points to name servers that are themselves under an affected TLD.

This is an example query that fails DNS 0x20 (using pydig for convenience):

;; TCP response from ('2001:678:4::a', 53), 644 bytes, in 0.089 sec
;; 0x20-hack qname: <Name: youtU.Be.>
;; rcode=0(NOERROR), id=64443
;; qr=1 opcode=0 aa=0 tc=0 rd=1 ra=0 z=0 ad=0 cd=0
;; question=1, answer=0, authority=8, additional=1
;; Size query=37, response=644, amp1=17.41 amp2=7.13

;; QUESTION SECTION:
youtu.be.	IN	NS
*** WARNING: Answer didn't match question!


;; AUTHORITY SECTION:
youtu.be.	86400	IN	NS	ns1.google.com.
youtu.be.	86400	IN	NS	ns2.google.com.
youtu.be.	86400	IN	NS	ns3.google.com.
youtu.be.	86400	IN	NS	ns4.google.com.
ba141snrnoe1rc9mddgrest23g657rir.be.	600	IN	NSEC3	1 1 5 1a4e9b6c BA175A6M75ITNTD2DO5RIQLCVM45GSMR NS SOA RRSIG DNSKEY NSEC3PARAM
ba141snrnoe1rc9mddgrest23g657rir.be.	600	IN	RRSIG	NSEC3 8 2 600 20190126161101 20190117165115 2478 be. EwIccnpBcEtGMPPkaIz1bW2I7FIhEtEZ+D8RL7JRkICXk2nZobgdKcVyTDD2fIth+5ZmLzzCkK5pyX/TpUNzVjvHlI3G5W3+Ui+BhMv3jAY+2qkuwr4/IRqy9spmSfhgi2ZbEJcMc0UojeisP8ERnTsVAGuLRD9qtDXPKIWDkeY=
jq78bsrkbnnvjo7nor8f2i20vl9k8cgo.be.	600	IN	NSEC3	1 1 5 1a4e9b6c JQ7RT3IFRO588SF81JDET9H3LLMBCU9K NS DS RRSIG
jq78bsrkbnnvjo7nor8f2i20vl9k8cgo.be.	600	IN	RRSIG	NSEC3 8 2 600 20190202170011 20190123164453 2478 be. SworU9I5MQUy0hty//rVo//yG916wuZFJyZb1O1/ii/Ueo4EZUZ5lzQ3XQkI6qmZMBMmFINebbAS7gJgVKNmbaVj4vJiZ2eeurnvmGTKXwHu4MYI/OPjoUOnNwo7KokhDCCbbCqRzVe1+BHWRJyZmdppp3awVzLD4ZZ4h5lWQ48=

;; ADDITIONAL SECTION:
;; OPT: edns_version=0, udp_payload=4096, flags=do, ercode=0(NOERROR)

The same query over UDP does not manifest the same issue:

;; UDP response from ('2001:678:4::a', 53, 0, 0), 644 bytes, in 0.044 sec
;; 0x20-hack qname: <Name: yOUtu.bE.>
;; rcode=0(NOERROR), id=38167
;; qr=1 opcode=0 aa=0 tc=0 rd=1 ra=0 z=0 ad=0 cd=0
;; question=1, answer=0, authority=8, additional=1
;; Size query=37, response=644, amp1=17.41 amp2=7.13

;; QUESTION SECTION:
yOUtu.bE.	IN	NS

;; AUTHORITY SECTION:
yOUtu.bE.	86400	IN	NS	ns1.google.com.
yOUtu.bE.	86400	IN	NS	ns2.google.com.
yOUtu.bE.	86400	IN	NS	ns3.google.com.
yOUtu.bE.	86400	IN	NS	ns4.google.com.
ba141snrnoe1rc9mddgrest23g657rir.be.	600	IN	NSEC3	1 1 5 1a4e9b6c BA175A6M75ITNTD2DO5RIQLCVM45GSMR NS SOA RRSIG DNSKEY NSEC3PARAM
ba141snrnoe1rc9mddgrest23g657rir.be.	600	IN	RRSIG	NSEC3 8 2 600 20190126161101 20190117165115 2478 be. EwIccnpBcEtGMPPkaIz1bW2I7FIhEtEZ+D8RL7JRkICXk2nZobgdKcVyTDD2fIth+5ZmLzzCkK5pyX/TpUNzVjvHlI3G5W3+Ui+BhMv3jAY+2qkuwr4/IRqy9spmSfhgi2ZbEJcMc0UojeisP8ERnTsVAGuLRD9qtDXPKIWDkeY=
jq78bsrkbnnvjo7nor8f2i20vl9k8cgo.be.	600	IN	NSEC3	1 1 5 1a4e9b6c JQ7RT3IFRO588SF81JDET9H3LLMBCU9K NS DS RRSIG
jq78bsrkbnnvjo7nor8f2i20vl9k8cgo.be.	600	IN	RRSIG	NSEC3 8 2 600 20190202170011 20190123164453 2478 be. SworU9I5MQUy0hty//rVo//yG916wuZFJyZb1O1/ii/Ueo4EZUZ5lzQ3XQkI6qmZMBMmFINebbAS7gJgVKNmbaVj4vJiZ2eeurnvmGTKXwHu4MYI/OPjoUOnNwo7KokhDCCbbCqRzVe1+BHWRJyZmdppp3awVzLD4ZZ4h5lWQ48=

;; ADDITIONAL SECTION:
;; OPT: edns_version=0, udp_payload=4096, flags=do, ercode=0(NOERROR)
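For context, this is a minimal sketch of the check a 0x20-validating resolver performs (pure Python; the function names are mine, not Unbound’s): the resolver randomizes the case of the letters in the qname and requires the server to echo the name back verbatim, so a lowercased echo, as seen in the TCP response above, fails the check.

```python
import random

def encode_0x20(qname: str, rng: random.Random) -> str:
    """Randomize the case of each letter in a query name (DNS 0x20 encoding)."""
    return "".join(
        c.upper() if c.isalpha() and rng.random() < 0.5 else c.lower()
        for c in qname
    )

def matches_0x20(query_qname: str, response_qname: str) -> bool:
    """A 0x20-validating resolver requires the response qname to echo the
    query's exact case; a case-insensitive match is not sufficient."""
    return query_qname == response_qname

# A conforming server echoes the mixed-case name back verbatim:
print(matches_0x20("yOUtu.bE.", "yOUtu.bE."))  # True
# A server that lowercases the qname (as over TCP above) fails the check:
print(matches_0x20("yOUtu.bE.", "youtu.be."))  # False
```

DNS names are case-insensitive on the wire, which is exactly why the query’s case bits are free to carry extra entropy against off-path spoofing.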

Considering the widespread impact of this problem, I think Let’s Encrypt should perhaps consider getting in touch with CDNS and/or investigating whether an IP blacklist could be implemented in Unbound, similar to caps-whitelist but for servers.


#2

Incidentally, CommunityDNS also operates ns3.gratisdns.dk, which causes capsforid fallback failures by serving malformed records with an unexpected class.


#3

Yes I noticed that as well.


#4

Also, for context, TCP fallback increased a couple of months ago.


#5

This is interesting, we’ll take a look! Thanks for the detailed report. How did this come to your attention? Are you the administrator for an affected domain name?


#6

Yes. A user report last week drew our attention to the problem:

https://letsdebug.net/snf-45288.vm.okeanos-global.grnet.gr/16112

I noticed the issue with gr-c.ics.forth.gr and reported it to the ccTLD operator.

I later realized the same issue affects a number of TLDs served by CDNS. I contacted them myself in order to explain the situation.


#7

Thanks for reaching out to them. If you’re able to find out what DNS software they’re using, that would be super useful.


#8

They have their own proprietary DNS server, according to their website.

http://www.cdns.net/CommunityDNS-Leaders-in-Security.html


#9

I know nothing further myself.


#10

Further tests showed 35 TLDs affected by this breakage. I posted the list to DNS-OARC, hoping someone from CDNS would take notice.

https://lists.dns-oarc.net/pipermail/dns-operations/2019-January/018359.html


#11

CommunityDNS has confirmed (by email) they are aware of the problem and working on a fix.


#12

As of today, the problem appears to have been mostly resolved (only 5 TLDs are still affected, down from 35).

https://lists.dns-oarc.net/pipermail/dns-operations/2019-February/018373.html

Thanks to whoever worked towards getting this fixed.


#13

Also worth noting:

@_az discovered and reported the bug triggering capsforid fallback failure, leading to SERVFAIL in these cases.

https://nlnetlabs.nl/pipermail/unbound-users/2019-January/011335.html

And then it was fixed:

https://nlnetlabs.nl/pipermail/unbound-users/2019-February/011349.html


#14

The fix made it into Unbound 1.9.0, which was just released.

https://nlnetlabs.nl/projects/unbound/download/

@jsha How quickly can Let’s Encrypt upgrade Unbound? I don’t know what the process is for testing and baking a new major release, but this issue is also serious.

Edit: This post is wrong. See @_az below.


#15

It should be noted that the fix apparently has not been rolled out yet to all anycast instances of affected name servers. So depending on which instance your DNS request is routed to, you may still see this problem, hopefully not for long.

In order to test this, I created a RIPE Atlas DNS measurement: TCP to 194.0.1.25 (gr-c.ics.forth.gr, an NS for .gr) with a mixed-case qname.

Parsing the results, decoding the abuf and looking for an all-lowercase qname in the answer shows that two probes (out of 50) reach instances which still manifest the 0x20 breakage.
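That parsing step can be sketched with only the Python standard library. RIPE Atlas results carry the raw DNS response as base64 in the `abuf` field; since the question section never uses name compression, a plain label walk past the 12-byte header suffices. Function names here are mine:

```python
import base64

def qname_from_abuf(abuf_b64: str) -> str:
    """Extract the question name, with its original case, from a
    base64-encoded DNS message ('abuf' field of a RIPE Atlas result)."""
    msg = base64.b64decode(abuf_b64)
    pos = 12  # skip the fixed-size DNS header
    labels = []
    while msg[pos] != 0:  # qnames in the question section are uncompressed
        length = msg[pos]
        labels.append(msg[pos + 1 : pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return ".".join(labels) + "."

def lost_0x20(query_qname: str, abuf_b64: str) -> bool:
    """True when the server did not echo the mixed-case query name exactly."""
    return qname_from_abuf(abuf_b64) != query_qname
```

In a real measurement you would loop over each probe’s result object, feed its `abuf` to `lost_0x20`, and count the probes that hit a broken anycast instance.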


#16

It did not. 1.9.0 is a re-labelled 1.9.0rc1; the fix came in a later commit.

Let’s Encrypt will have to wait for 1.9.1 or patch manually.

Well, that explains all the weirdness I experienced yesterday while testing haha.


#17

Argh! Thanks. Sorry for the misinformation. I thought I’d checked the changelog.


#18

Thanks for the clarification. Our current plan is to wait for 1.9.1, but I’m interested to hear your evaluation of the severity of this issue. Do you think it’s worth an out-of-band deploy? It sounds like CommunityDNS is well on their way to fixing the issue on their end, and I haven’t heard of other servers with the same problem.


#19

I think that patching or disabling capsforid until a patched release is deployed would be the responsible thing to do.

CDNS may have started rolling out a fix, but it is hard to tell when that will be complete. 0x20, and especially the fallback, still feel very fragile; no wonder the feature is still tagged as experimental. I am positive there is a lot more 0x20 breakage out in the wild. Debugging it is also quite hard, IMHO; Unbound’s debug log does not really help. More verbosity, such as what @_az added to showcase the fallback sorting bug, would come in handy.
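For reference, disabling or scoping 0x20 is a per-resolver setting in unbound.conf; a sketch using the documented options (the exempted zone is illustrative):

```
server:
    # Option A: turn DNS 0x20 off entirely until a patched release is deployed.
    use-caps-for-id: no

    # Option B: keep 0x20 on, but exempt zones served by the affected
    # name servers (caps-whitelist takes a zone name).
    # use-caps-for-id: yes
    # caps-whitelist: "be."
```

Option A trades away the extra anti-spoofing entropy for all queries, so Option B is the narrower workaround if the affected zones are known.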


#20

Hard to judge with anycast :man_shrugging:. From my perspective (2 locations), it’s been fully fixed by CDNS in the past 48h, but whether it still affects the two Let’s Encrypt facilities is probably a question for the VA logs.