SERVFAIL causing issuance failures, unable to reproduce in Unbound or locally

Hi @cpu, thanks for going through those - what command are you using to show that the servers are not handling 0x20 randomization?

I attempted to verify ns1.meganameservers.eu. is having mixed case issues using a little python.

First establishing a baseline with all lower case.

from subprocess import call
# Query 100 times; keep only lines showing SERVFAIL or the query time.
for i in range(100):
    call(r'dig www.faerykisses.co.uk @ns1.meganameservers.eu. | grep -e "SERVFAIL\|Query time"', shell=True)

The server typically responds within 90-99 msec (from Iowa) with no SERVFAIL

Then attempting mixed case:

import random
domain = 'www.faerykisses.co.uk'  # assumed: same name as the baseline
for i in range(100):
    call('dig ' + ''.join(x.upper() if random.randint(0, 1) else x for x in domain) + r' @ns1.meganameservers.eu. | grep -e "SERVFAIL\|Query time"', shell=True)

The server typically responds within 90-99 msec (from Iowa) with no SERVFAIL
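
For anyone repeating this test, here is a minimal self-contained sketch of just the case-mixing step. The function name `randomize_case` and the optional seeded `random.Random` argument are my own choices for illustration, not part of any DNS library, and the sketch only builds the query name; it does not run dig.

```python
import random


def randomize_case(name, rng=None):
    """Return `name` with each letter randomly upper- or lower-cased.

    This mimics the idea behind 0x20 query-name randomization. Digits,
    dots and hyphens pass through unchanged.
    """
    rng = rng or random.Random()
    return ''.join(
        (ch.upper() if rng.randint(0, 1) else ch.lower()) if ch.isalpha() else ch
        for ch in name
    )


if __name__ == '__main__':
    # Print a few variants of the name used in the baseline test.
    for _ in range(3):
        print(randomize_case('www.faerykisses.co.uk'))
```

Passing a seeded `random.Random` makes a run repeatable, which helps if a particular capitalization ever does trigger a failure.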

I cannot reproduce the 0x20 randomization issues you’re seeing @cpu. Can you share your methodology please? Thanks!

The SERVFAIL isn't from the authoritative nameserver, it's from a recursive resolver enforcing 0x20 case randomization. You won't be able to see the SERVFAIL response in this case by using dig against the nameserver directly.

I was using dig to query the authoritative nameservers, manually permuting the case in my query and observing the case in the returned answers.

For one example, try dig @ns1.meganameservers.eu BrIgHtGeN.CoM A. You'll see the answer section is returned with a lowercase domain name:

...
brightgen.com.		86400	IN	A	23.185.0.1
...

As compared to a nameserver respecting 0x20 case randomization: dig @ns1.binaryparadox.net bInArYpArAdOx.NeT A. You'll see the answer section is returned matching case:

...
bInArYpArAdOx.NeT.	86400	IN	A	198.50.205.8
...

brightgen.com is okay – for Unbound’s purposes, only the question section of the response has to reflect the random capitalization. It’s okay for the answer, authority and additional sections to be in lowercase.

$ dig @ns1.meganameservers.eu +norecurse BRIGHTGEN.COM
...
;; QUESTION SECTION:
;BRIGHTGEN.COM.                 IN      A

;; ANSWER SECTION:
brightgen.com.          86400   IN      A       23.185.0.1
...

Compare “dig @ns1.yahoo.com +norecurse YAHOO.COM”, which works the same way.

Edit: At least with Unbound 1.6.8, which is what I run.
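
To make that check concrete, here is a rough sketch that pulls the name out of the QUESTION SECTION of dig's text output and compares it, case-sensitively, with the name that was sent. The helper names `question_name` and `question_echoes_case` are made up for this example, and the parsing assumes dig's usual text layout as shown above; it is not how Unbound itself does the comparison.

```python
def question_name(dig_output):
    """Return the owner name from dig's QUESTION SECTION, or None if absent."""
    lines = dig_output.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == ';; QUESTION SECTION:':
            for candidate in lines[i + 1:]:
                if not candidate.strip():
                    continue  # skip blank lines
                # The question line starts with a single ';',
                # e.g. ";BRIGHTGEN.COM.   IN   A".
                if candidate.startswith(';;') or not candidate.startswith(';'):
                    return None
                parts = candidate[1:].split()
                return parts[0] if parts else None
    return None


def question_echoes_case(sent_name, dig_output):
    """True if the question section repeats `sent_name` with identical case."""
    got = question_name(dig_output)
    if got is None:
        return False
    # dig prints fully qualified names, so ignore the trailing dot.
    return got.rstrip('.') == sent_name.rstrip('.')


SAMPLE = """\
;; QUESTION SECTION:
;BRIGHTGEN.COM.                 IN      A

;; ANSWER SECTION:
brightgen.com.          86400   IN      A       23.185.0.1
"""

if __name__ == '__main__':
    print(question_echoes_case('BRIGHTGEN.COM', SAMPLE))
```

With the brightgen.com output above, the question section echoes the uppercase name even though the answer section is lowercase, so the check passes.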

Hi,

As @mnordhoff said, the random capitalization only has to appear in the QUESTION section. One thing about the authoritative nameservers for the domains you checked: all of them use DNS Cookies, and they all seem to run the same DNS server software. At least, all of them answer a version.bind query with “DNS SERVER”, which I suppose is BIND. I’m not saying this is what is causing the errors, but it is curious ;).

$ dig @ns1.meganameservers.eu www.faerykisses.co.UK +norec | grep COOKIE
; COOKIE: 44eb65e3bd9ac550bbbb7f355b61e432bd0d4b0df2156ddb (good)

$ dig @ns1.meganameservers.eu ch txt version.bind +short
"DNS SERVER"

$ dig @ns1.att-websites.com expresstrailer.nEt +norec | grep COOKIE
; COOKIE: a2b09dfb5f34af5b440a840b5b61e577ed77d7c635f8b8e6 (good)

$ dig @ns1.att-websites.com ch txt version.bind +short
"DNS SERVER"

$ dig @ns1.aplus.net www.thechartstore.BiZ +norec | grep COOKIE
; COOKIE: 0cb06504b9ce6d6a0a2dbfb65b61e5d668e163d74331f974 (good)

$ dig @ns1.aplus.net ch txt version.bind +short
"DNS SERVER"

$ dig @dns1.earthlink.net jrudman.CoM +norec | grep COOKIE
; COOKIE: d8be4c2231ea1681084a62e65b61e62bab857e59d5b53ab2 (good)

$ dig @dns1.earthlink.net ch txt version.bind +short
"DNS SERVER"

Cheers,
sahsanu

TIL. With that knowledge in hand I'm not certain what the problem is, only that it seems related to the authoritative nameservers and that this 0x20 behaviour is a commonality shared across the affected domains that doesn't appear to be shared by other non-problematic domains I've spot checked.

Another one:

www.westcoastfamilycentres.org

SOA ns1.meganameservers.com.

There's probably limited value in additional examples from the same family of authoritative nameservers since we can assume whatever problem is affecting these authoritative nameservers will likely affect all domains in zones they host.

Have you seen any additional SERVFAILs that you weren't able to debug with the usual means that are not using authoritative nameservers provided by meganameservers.eu, att-websites.com, aplus.net or earthlink.net?

@cpu
Thank you for finding the reason why LE is refusing issuance.

When was the requirement added that authoritative name servers respond with case from the original query? We've been using LE for quite some time and haven't observed this issue until the past few days. Unrelated to our system, @_az has reported a similar experience.

I tried to see whether these name servers are breaking spec, and it appears that returning non-matching case is allowed by RFC 1035. The draft draft-vixie-dnsext-dns0x20-00 (“Use of Bit 0x20 in DNS Labels to Improve Transaction Identity”) attempted to change that but was never ratified. If I understand RFC 4343 (“Domain Name System (DNS) Case Insensitivity Clarification”) correctly, it clarifies that name servers can be expected not to preserve case when employing name compression.

I recognize it's common practice to return the query's case in the answer section, but as you can see, not all name servers choose that implementation and this (new?) requirement is causing a lot of our customers to experience failure with LE (who were previously seeing success).

When we first integrated with LE, we documented the requirements for our customers here: Configure DNS and Provision HTTPS | Pantheon Global CDN

At the time, the relevant requirement was:

Authoritative Name Servers must serve mixed-case lookups, and must not fail CAA lookups

Was the change that required authoritative name server responses to be in matching case documented somewhere? Do you think LE may choose to roll it back?

Thank you for your time and efforts to secure the internet.

@kf6nux As far as I know, Let’s Encrypt has enabled the case randomization setting in Unbound since day 1. Certainly it’s been on for a very long time.

Unbound’s random capitalization mode doesn’t exactly require authoritative nameservers to preserve the case of the query name. If they don’t, it has a fallback mode: It repeats the query to all the nameservers and compares the responses. If they’re the same, it accepts the response(s). If they’re different, it returns SERVFAIL.
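
As a rough illustration of that fallback, here is a simplified model in Python. It is only a sketch of the behaviour described above, not Unbound's actual code: the function name `fallback_verdict`, the dict-of-sets input, and the rule that an unanswered query counts as a failure are all assumptions for the example.

```python
SERVFAIL = 'SERVFAIL'


def fallback_verdict(responses):
    """Simplified model of the fallback comparison.

    `responses` maps each nameserver to the set of answers it returned for
    the repeated query, or None if it never answered. Returns the agreed
    answer set, or SERVFAIL when the servers disagree.
    """
    if not responses:
        return SERVFAIL
    answers = list(responses.values())
    # Assumption: a server that never answered makes the fallback fail.
    if any(a is None for a in answers):
        return SERVFAIL
    distinct = {frozenset(a) for a in answers}
    if len(distinct) == 1:
        return set(answers[0])  # every server agreed
    return SERVFAIL  # servers returned different data


if __name__ == '__main__':
    # Hypothetical server names; 192.0.2.x are documentation addresses.
    print(fallback_verdict({'ns1.example': {'192.0.2.1'},
                            'ns2.example': {'192.0.2.1'}}))
    print(fallback_verdict({'ns1.example': {'192.0.2.1'},
                            'ns2.example': {'192.0.2.2'}}))
```

The second call models a load balancer or CDN that hands out different IPs per server, which is the case where the comparison ends in SERVFAIL.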

With most DNS servers preserving case, and fallback often working when they don’t, resolution usually works.

(Common reasons for fallback to fail might be that the zone has so many distant nameservers that the process takes too long, that different servers have different versions of the zone, or that they’re a load balancing or CDN platform that returns different IPs on purpose. That last is particularly unfortunate because load balancers and CDNs are often the source of the kind of third-rate hack-job DNS servers that cause problems.)

Edit: To conclude, whatever’s going on recently probably isn’t (directly) because of case randomization.

I agree.

Unless anyone has contrary evidence I still believe this is a problem with the authoritative nameserver providers mentioned in this thread since no unexplained failures have been reported with other nameserver providers. I don't know what the problem with these providers is - I thought the mixed case response may be it but have been convinced otherwise by @mnordhoff and @sahsanu.

sahsanu’s discovery that they have the same version.bind seems like a smoking gun.

meganameservers.eu, att-websites.com, aplus.net and earthlink.net all seem to be related to carrierzone, whose website describes it as having a “robust system of abuse management technologies and services”.

For example, almost all of them use nameservers with IPs in the same blocks as ns{1..4}.carrierzone.com.

earthlink.net is an outlier in that regard, but dns1.earthlink.net’s reverse DNS zone 149.29.64.in-addr.arpa uses carrierzone.com for its own nameservers, and if you whois the earthlink.net IPs, they belong to InternetNamesForBusiness like most of the others.

Maybe there’s a routing issue between Let’s Encrypt and carrierzone? Or they added your IPs to a block list? Or there’s some interoperability bug between recent versions of Unbound and their platform?

Edit:

ns1.carrierzone.com.     (insecure)  337    A     66.175.41.100
ns1.carrierzone.com.     (insecure)  337    AAAA  2001:1810:9980:3::10
ns1.att-websites.com.    (insecure)  305    A     66.175.41.73
ns3.meganameservers.eu.  (insecure)  431    A     66.175.41.102
ns3.meganameservers.eu.  (insecure)  431    AAAA  2001:1810:4000:7::10
ns2.aplus.net.           (insecure)  274    A     66.175.41.121
ns2.aplus.net.           (insecure)  274    AAAA  2001:1810:4000:4::10

dns1.earthlink.net.      (insecure)  766    A     64.29.149.110

149.29.64.in-addr.arpa.  (insecure)  86392  NS    ns1.carrierzone.com.
149.29.64.in-addr.arpa.  (insecure)  86392  NS    ns2.carrierzone.com.
149.29.64.in-addr.arpa.  (insecure)  86392  NS    ns3.carrierzone.com.
149.29.64.in-addr.arpa.  (insecure)  86392  NS    ns4.carrierzone.com.
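
If anyone wants to repeat the "same blocks" comparison, here is a small sketch using Python's standard `ipaddress` module. Treating a "block" as a /24 is my assumption (the actual allocations may be larger), and the helper names `group_by_block` and `reverse_zone_24` are made up for this example. The addresses are the A records listed above.

```python
import ipaddress
from collections import defaultdict

# A records copied from the listing above.
A_RECORDS = {
    'ns1.carrierzone.com': '66.175.41.100',
    'ns1.att-websites.com': '66.175.41.73',
    'ns3.meganameservers.eu': '66.175.41.102',
    'ns2.aplus.net': '66.175.41.121',
    'dns1.earthlink.net': '64.29.149.110',
}


def group_by_block(records, prefix_len=24):
    """Group hostnames by the network of the given prefix length."""
    groups = defaultdict(list)
    for host, addr in records.items():
        # strict=False masks off the host bits to get the enclosing network.
        net = ipaddress.ip_network(f'{addr}/{prefix_len}', strict=False)
        groups[str(net)].append(host)
    return dict(groups)


def reverse_zone_24(addr):
    """Return the /24 in-addr.arpa zone name for an IPv4 address."""
    octets = addr.split('.')
    return '.'.join(reversed(octets[:3])) + '.in-addr.arpa.'


if __name__ == '__main__':
    for block, hosts in group_by_block(A_RECORDS).items():
        print(block, hosts)
    print(reverse_zone_24('64.29.149.110'))
```

Under the /24 assumption, the four 66.175.41.x servers land in one group and dns1.earthlink.net sits alone, with its reverse zone being the 149.29.64.in-addr.arpa. zone shown above.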

I don't know whether LE's Unbound resolver supports or uses EDNS cookies, but since all those authoritative name servers support them, if LE is not using them, maybe the servers are applying some kind of rate limit (to prevent DoS attacks or DNS spoofing) that affects requests from LE's side... who knows :wink:

I’m pretty sure Unbound doesn’t support cookies. I can’t find any evidence that it does, anyway.

For what it’s worth, we’re having the same issue apparently.

domain www.myelement.org

Constantly gets these 2 errors from LE:

  • urn:acme:error:dns :: DNS problem: SERVFAIL looking up A for www.myelement.org
  • urn:acme:error:dns :: DNS problem: query timed out looking up A for www.myelement.org

The customer seems to be suffering serious business pains over this outage, and I believe we’re left to the mercy of LE or some DNS providers to resolve on their end? Not sure what else I can do.

Let’s Debug is no longer failing, for at least some domains:

https://letsdebug.net/brightgen.com/3182
https://letsdebug.net/www.myelement.org/3183

Has this issue magically been resolved?

Yes, it seems to magically resolve after many hours of retrying, sometimes > 6. This is what we’ve seen so far.

Well that’s good news. Kind of.

I asked our operations team to investigate a few days ago and they seemed to see some network inconsistencies yesterday. I'm hopeful they can provide an update in thread shortly but I do not believe anything has changed on our end to address the problem.

Yes, this domain uses aplus.net for authoritative DNS, one of the providers mentioned in the thread.
