SERVFAIL A/AAAA letsencrypt, letsdebug and unboundtest

thisisbroken · October 8, 2020, 8:58pm

We're having an issue where these three tools can resolve A/AAAA records for some of our domains but not others. One of our hosting providers (Pantheon) uses these tools to generate certs for the domains they host for us. However, they say they are unable to renew them because of SERVFAIL responses. All of our domains are hosted on the exact same nameservers (NIOS) with nearly the exact same configuration. Some work and some do not, but the failures appear to be limited to your environment. The only pattern I have noticed is that anything 'infoblox' fails while everything else works.

Not working:
infoblox.com
infoblox.ch
infoblox.se
blogs.infoblox.com 

Working:
dnsadvisor.com
snaproute.com
infobloxfederal.com

I have tried for a couple days to figure out what the problem is to no avail. For anything infoblox.* I never see the requests reach our NS. The same infoblox.* domains from unboundtest seem to get stuck in some referral loop and eventually return a SERVFAIL. But the fact remains they all fail across all 3 of these tools and don't appear to fail anywhere else.

I worked with @rg305 and he too couldn't find anything painfully obvious. Hoping one of the engineers here @jsha or @JamesLE can provide some insight as to what the problem is between these tools and these specific domains.

jsha · October 8, 2020, 9:00pm

Hi @thisisbroken,

Could you please share the exact error message you are receiving from Pantheon?

Thanks,
Jacob

thisisbroken · October 8, 2020, 9:03pm

Hey man, they never really gave me one. They just kept giving me links to your tools and screenshots like the one below. To me it's irrelevant, because we don't use CAA records, so that should be a non-issue.

blogs.infoblox.com: `acme: authorization error
for blogs.infoblox.com: 400 urn:ietf:params:acme:error:dns: DNS problem:
SERVFAIL looking up CAA for infoblox.com - the domain's nameservers may be
malfunctioning`

https://unboundtest.com/m/CAA/infoblox.com/L6YULUQY - Again, it's complaining about CAA records, but if you run this tool for A/AAAA it returns the same result.

jsha · October 8, 2020, 9:26pm

Thanks for sharing the message. That definitely helps a lot. Also it's really interesting that this fails on unboundtest consistently. That at least makes it easier to figure out than an inconsistent failure.

BTW, it's worth reading https://letsencrypt.org/docs/caa/ for details about why we look up CAA even for domains that don't use CAA.

Looking at https://unboundtest.com/m/CAA/infoblox.com/L6YULUQY, it seems that Unbound is repeatedly trying to resolve your nameservers (ns1.infoblox.com etc) and getting what it considers a referral.

However, it shouldn't need to resolve your nameservers, because they are in-domain, and the referral from <.com> contains glue for them:

$ dig a iNfobloX.com @c.gtld-servers.net

; <<>> DiG 9.16.1-Ubuntu <<>> a iNfobloX.com @c.gtld-servers.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4870
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 9
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;iNfobloX.com.                  IN      A

;; AUTHORITY SECTION:
iNfobloX.com.           172800  IN      NS      ns1.iNfobloX.com.
iNfobloX.com.           172800  IN      NS      ns2.iNfobloX.com.
iNfobloX.com.           172800  IN      NS      ns3.iNfobloX.com.
iNfobloX.com.           172800  IN      NS      ns4.iNfobloX.com.
iNfobloX.com.           172800  IN      NS      ns5.iNfobloX.com.
iNfobloX.com.           172800  IN      NS      ns6.iNfobloX.com.

;; ADDITIONAL SECTION:
ns1.iNfobloX.com.       172800  IN      A       207.47.7.140
ns2.iNfobloX.com.       172800  IN      A       205.234.19.211
ns2.iNfobloX.com.       172800  IN      AAAA    2620:10a:6001:fffe::11
ns3.iNfobloX.com.       172800  IN      A       205.234.19.10
ns3.iNfobloX.com.       172800  IN      AAAA    2620:10a:6001:fffe::10
ns4.iNfobloX.com.       172800  IN      A       207.47.7.139
ns5.iNfobloX.com.       172800  IN      A       52.21.154.140
ns6.iNfobloX.com.       172800  IN      A       23.99.82.199

My next thought was that some component might not be honoring caps-for-id, which our Unbound config uses, but that's clearly not the case as you can see above. Another thought is a DNSSEC misconfiguration, which can lead to SERVFAILs. But according to https://dnsviz.net/d/infoblox.com/dnssec/ your DNSSEC looks fine.
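
For readers unfamiliar with caps-for-id (Unbound's `use-caps-for-id` option, also called 0x20 encoding), here is a minimal Python sketch of the idea: the resolver randomizes the letter case of the query name and expects the server to echo it back byte for byte. The function names are hypothetical and this is not Unbound's actual implementation; it only illustrates why the mixed-case `iNfobloX.com` in the dig output above indicates the .com servers handle it correctly.

```python
import random


def randomize_case(name, rng=None):
    """Return `name` with each letter randomly upper- or lower-cased."""
    rng = rng or random.Random()
    return "".join(
        ch.upper() if ch.isalpha() and rng.random() < 0.5 else ch.lower()
        for ch in name
    )


def echo_matches(sent_qname, received_qname):
    """A 0x20-checking resolver wants the exact case it sent echoed back."""
    return sent_qname == received_qname


def same_name(a, b):
    """DNS names compare case-insensitively; ignore one trailing dot."""
    return a.lower().rstrip(".") == b.lower().rstrip(".")
```

A server that lowercases the question in its reply would pass `same_name` but fail `echo_matches`, which a resolver using this technique may treat as a suspect response.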

My next recommended step would be to do some winnowing down of the problem: Set up an Unbound instance with the default config, and see if it resolves your hostname. If it doesn't, we may have found a general problem with Unbound. If it does, the difference is probably between the default Unbound config and the config on unboundtest.com (which is pretty close to our prod config, modulo performance tweaks). Try changing individual settings and re-testing to see if you get a successful resolution. I'll let you know if I think of anything else. This is a bit of a stumper, I'm afraid.

_az · October 8, 2020, 9:38pm

Is this maybe because of Let's Encrypt setting the EDNS buffer size to 512? That prevents the glue records from being sent by the gTLD nameservers.

The difference between (glue sent):

dig +norecurse +dnssec  @j.gtld-servers.net infoblox.com

and (no glue):

dig +norecurse +dnssec +bufsize=512 @j.gtld-servers.net infoblox.com

According to my own log and packet capture, Unbound spends its time chasing its tail for those exact glue records, though I'm wary of misunderstanding what Unbound's intent is.

Maybe most domains avoid hitting this problem because they don't have 6 nameservers, so the glue fits in the response.
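
To make the buffer-size difference concrete, here is a minimal Python sketch of what `+bufsize=512` changes on the wire: the advertised UDP payload size is carried in the CLASS field of the EDNS OPT pseudo-record (RFC 6891). The helper names `encode_name` and `build_query` and the fixed query ID are illustrative assumptions, not anything taken from Let's Encrypt's or Unbound's code.

```python
import struct


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name, qtype=1, bufsize=512, dnssec_ok=True, query_id=0x1234):
    """Build a non-recursive query (like +norecurse) with an EDNS OPT record."""
    # Header: ID, flags (all zero: standard query, RD clear),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1 (the OPT record).
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 1)
    # Question: name, QTYPE (1 = A), QCLASS (1 = IN).
    question = encode_name(name) + struct.pack("!HH", qtype, 1)
    # OPT: root owner name, TYPE=41, CLASS=advertised UDP payload size,
    # TTL field holds extended flags (0x8000 = DO bit, like +dnssec), RDLENGTH=0.
    ttl_field = 0x8000 if dnssec_ok else 0
    opt = b"\x00" + struct.pack("!HHIH", 41, bufsize, ttl_field, 0)
    return header + question + opt
```

Sending `build_query("infoblox.com", bufsize=512)` and `build_query("infoblox.com", bufsize=4096)` over UDP to a gTLD server and comparing the ADDITIONAL counts of the replies should reproduce the glue/no-glue difference shown by the two dig commands, assuming the servers still behave as they did in this thread.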

jsha · October 8, 2020, 9:43pm

Ooh, excellent find! Yes, it looks like that is probably a big part of it. I think it's not the nameservers that push the AUTHORITY sections past the limit, but the DNSSEC records.

Normally, if a response is larger than edns-buffer-size, the authoritative server is supposed to set the TC (truncated) bit in its response. That would cause Unbound to fall back to TCP, which would succeed.

However, some authoritative servers have what, in my mind, is a bug: if the only records falling over the line are in the ADDITIONAL section (here, the glue), the server will drop them without setting the TC bit. Depending on your reading of the spec, this could be argued to be correct or incorrect, but it definitely breaks things.
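
The resolver-side half of this can be sketched in a few lines of Python. This is only an illustration of the TC check, with hypothetical function names, and not how Unbound is actually structured: the TC flag is bit 0x0200 of the 16-bit flags word in the DNS header, and a resolver retries over TCP only when it is set.

```python
import struct

TC_MASK = 0x0200  # "truncated" bit in the DNS header flags word


def is_truncated(response):
    """Return True if the TC bit is set in a raw DNS response."""
    if len(response) < 12:
        raise ValueError("DNS header is 12 bytes; response is too short")
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & TC_MASK)


def next_step(response):
    """Decide what to do with a UDP response, based only on the TC bit."""
    return "retry-over-tcp" if is_truncated(response) else "use-udp-response"
```

If a server silently drops records and leaves TC clear, `next_step` returns `"use-udp-response"`, so the resolver never learns that a TCP retry would have returned the missing data.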

jsha · October 8, 2020, 9:46pm

So, @thisisbroken, a quick fix for you would probably be to configure at least one of your nameservers to be under a different domain. I realize this is probably not your preferred solution, but it would be (I'm guessing) one way to get things working quickly.

thisisbroken · October 8, 2020, 9:59pm

@jsha DNSSEC and packet size may be an issue with infoblox.com and unbound but infoblox.ch doesn't have DNSSEC enabled and yet it returns a SERVFAIL as well. I know NS1 is missing from the delegation for infoblox.ch as well but would that cause a SERVFAIL from your end?

Are you saying that changing the domain of one of my nameservers may allow Let's Encrypt to successfully resolve these domains? I don't use Let's Encrypt or Unbound at all, and I certainly don't want to be changing my environment because these tools don't work.

Where is this problem exactly? Within Let's Encrypt's Unbound config?

_az · October 8, 2020, 10:35pm

infoblox.ch may not have DNSSEC enabled, but com does, and that's what drives the problem.

infoblox.ch is delegated to ns#.infoblox.com, so they're both affected.

thisisbroken · October 8, 2020, 10:41pm

@_az I agree about DNSSEC with .com, but I didn't realize that domains not doing DNSSEC would suffer the same fate. So the problem is with Unbound, and I assume it's a bug? Is that why @jsha recommended I change my environment to get this working? I just need to know for sure so I can go to my boss and explain the situation.

_az · October 8, 2020, 10:47pm

I think it's less of a bug and more an unfortunate confluence of circumstances, including Let's Encrypt's choice of EDNS buffer size, the behavior of the gTLD nameservers, and the number of records in your infoblox.com delegation.

You are the collateral damage of that.

If I may make some other suggestions, not having 4 DS records, and not having 6 NS records could also get you to an acceptable response size.

If you are really intent on not changing anything about your DNS, consider that there are other free ACME CAs available who may not have this issue with their recursors, but you'd need to convince Pantheon to use them.

rg305 · October 8, 2020, 10:50pm

It does seem that all the affected domain names are using the exact same set of authoritative servers.

At first I was inclined to discount the six nameservers as an issue, since I have seen several domains with as many as twelve nameservers, all using DNSSEC, and have never encountered the problem seen here.
BUT (and this is the great catch, @_az) they have all used nameservers from varied TLDs, so I haven't seen any single TLD server system replying with more than two glue entries (four if you count A & AAAA).

So this may very well be the straw that broke the DNS camel's back:
Six NS entries, all from the same TLD, all with both A & AAAA records and using DNSSEC.
[Can there even be anything else added to such an order? YEAH - make it fit into 512 bytes!]

_az · October 8, 2020, 10:56pm

Let's see.

infoblox.com is at 664 bytes right now, so they need to recover 152 bytes somehow.

Dropping 3x DS records and 2x NS records would do it for sure. That's probably cutting too many by 1 or 2 records.

But, I think the original suggestion, which is to mix up the domains of one or two of the nameservers, is much less invasive.
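
As a rough check on that arithmetic, here is a Python sketch of the per-record wire sizes. It rests on assumptions not stated in the thread: every owner name is compressed to a 2-byte pointer, nameserver names are one short label such as `ns1` plus a pointer to the zone name, each removed nameserver has a single A glue record that counts toward the 664 bytes, and the DS digest is 32 bytes (SHA-256). Treat the results as estimates only.

```python
LIMIT = 512    # EDNS buffer size discussed in the thread
CURRENT = 664  # response size reported in the thread

# Compressed owner name (2) + TYPE (2) + CLASS (2) + TTL (4) + RDLENGTH (2).
FIXED = 2 + 2 + 2 + 4 + 2


def ds_size(digest_len=32):
    """DS RDATA: key tag (2) + algorithm (1) + digest type (1) + digest."""
    return FIXED + 2 + 1 + 1 + digest_len


def ns_size(host_label_len=3):
    """NS RDATA: one length-prefixed label plus a 2-byte compression pointer."""
    return FIXED + 1 + host_label_len + 2


def glue_size(ipv6=False):
    """A glue carries 4 address bytes, AAAA glue carries 16."""
    return FIXED + (16 if ipv6 else 4)


def bytes_to_recover(current=CURRENT, limit=LIMIT):
    return max(0, current - limit)


def savings(ds_dropped, ns_dropped, digest_len=32):
    """Estimated bytes saved; each dropped NS also drops one A glue record."""
    return ds_dropped * ds_size(digest_len) + ns_dropped * (ns_size() + glue_size())
```

Under these assumptions, `savings(3, 2)` comfortably exceeds `bytes_to_recover()`, which is consistent with the guess above that dropping three DS and two NS records is more than strictly needed.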

Osiris · October 8, 2020, 11:01pm

Or just set the TC flag when you're truncating stuff

_az · October 8, 2020, 11:02pm

Yes, I suppose bitching about it to https://lists.dns-oarc.net/pipermail/dns-operations is an option too.

rg305 · October 8, 2020, 11:07pm

Most that I see lately are moving in this direction:

microsoft.com   nameserver = ns1-205.azure-dns.com
microsoft.com   nameserver = ns2-205.azure-dns.net
microsoft.com   nameserver = ns3-205.azure-dns.org
microsoft.com   nameserver = ns4-205.azure-dns.info

att.com nameserver = ns1.attdns.com
att.com nameserver = ns2.attdns.com
att.com nameserver = ns3.attdns.com
att.com nameserver = ns4.attdns.com
att.com nameserver = ns5.attdns.net
att.com nameserver = ns6.attdns.net

The worst case (and most outdated) is:

yahoo.com       nameserver = ns1.yahoo.com
yahoo.com       nameserver = ns2.yahoo.com
yahoo.com       nameserver = ns3.yahoo.com
yahoo.com       nameserver = ns4.yahoo.com
yahoo.com       nameserver = ns5.yahoo.com

Which is only five.
~~Wait, then we now have a new worst case = SIX~~
[That was short lived]
We have a new worst case = EIGHT

fox.com nameserver = eur1.akam.net
fox.com nameserver = eur3.akam.net
fox.com nameserver = use2.akam.net
fox.com nameserver = usw1.akam.net
fox.com nameserver = usw7.akam.net
fox.com nameserver = asia2.akam.net
fox.com nameserver = ns1-40.akam.net
fox.com nameserver = ns1-75.akam.net

griffin · October 8, 2020, 11:24pm

8 akamai servers

jsha · October 9, 2020, 12:01am

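infoblox.ch's nameservers are ns1.infoblox.com etc., so looking up those nameservers fails for the same reason as above.

Would changing the domain of one of your nameservers let Let's Encrypt resolve these domains? Probably! I'll outline a few options below.

You may not use Let's Encrypt directly, but it sounds like you're using it via Pantheon.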

@_az summarized this quite well. There's no bug in Unbound, but a confluence of settings between Let's Encrypt, the .com nameservers, and your DNS zone. Together, those settings are producing the error.

Some options for you:

1. As suggested above, change to some third-party nameserver for at least one of your nameservers.
2. Reduce the size of your zone. For instance:

   2a) Reduce the number of nameservers in your zone, or

   2b) Remove one of your DNSSEC DS records. According to DNSViz's page for infoblox.com, you have records for two different keys: 49879 and 33613. 33613 is not currently in use. Sometimes that happens because you're migrating towards or away from a given key. In either case, finishing the migration and removing the excess records should help.
3. Weirdly, increasing the size of your responses should help. Right now, the .com nameservers can get their answer below 512 bytes by dropping the additional section, so they don't set TC. However, if the answer were bigger than 512 bytes even after dropping the additional section, I believe they would set the TC bit, which would cause TCP fallback and everything to work happily. For instance, you could probably achieve this by adding even more unused DS records.
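
The behavior described in the last option can be modeled with a small Python sketch. This is a toy model of what the .com servers are believed to do based on this thread, not a confirmed description of their implementation; `respond` and its parameters are hypothetical names.

```python
def respond(required_bytes, glue_bytes, limit=512):
    """Model a server's choice for a referral.

    required_bytes: header + question + AUTHORITY section (NS, DS, RRSIG)
    glue_bytes:     ADDITIONAL section (A/AAAA glue)
    Returns (glue_included, tc_bit).
    """
    if required_bytes + glue_bytes <= limit:
        return (True, False)   # everything fits, normal referral
    if required_bytes <= limit:
        return (False, False)  # glue dropped silently, TC stays clear
    return (False, True)       # even without glue it is too big: set TC


def resolver_outcome(glue_included, tc_bit):
    """What a resolver that only reacts to TC would do with the response."""
    if tc_bit:
        return "retry over TCP and get the full referral"
    if glue_included:
        return "follow the referral using the glue"
    return "try to resolve in-domain nameservers without glue (loops)"
```

The middle case is the trap: the response is small enough to send but has lost the in-domain glue, while both shrinking and growing the required part move the response into a case the resolver can handle.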

By the way, I see according to https://crt.sh/?q=infoblox.com that you've been getting Let's Encrypt certificates for a while now. That suggests to me that you are only seeing this issue now because you happen to be in the middle of a DNSSEC key rollover while you're trying to renew, and the extra records introduced by the rollover push you right up to the edge of 512. That's what leads me to believe that (2b) is the most straightforward workaround.

thisisbroken · October 9, 2020, 12:13am

I guess we are just ahead of the game.

I went ahead and removed the 2 old DS keys and 2 NSs from infoblox.com, and unboundtest and letsdebug both work now. I assume Let's Encrypt will as well? I should have deleted the 2 DS keys and then tested, but I didn't. infoblox.ch (6 NS) and infoblox.se (4 NS) both resolve as well, so I will assume this is fixed. I will add the 2 NSs back and test again to see whether it makes any difference.

As for why there were 4 DS keys: I usually leave the old 2 in place for a couple of days when adding the new ones, but I simply forgot to remove them. That, along with the 2 additional NSs, seems to have pushed this over the limit.

Thanks @_az for figuring this out and thanks @jsha and @rg305 for all of your help.

Response:
;; opcode: QUERY, status: NOERROR, id: 44209
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;infoblox.com.	IN	 A

;; ANSWER SECTION:
infoblox.com.	0	IN	A	23.185.0.3

----- Unbound logs -----
Oct 09 00:07:44 unbound[15759:0] notice: init module 0: validator
Oct 09 00:07:44 unbound[15759:0] notice: init module 1: iterator
Oct 09 00:07:44 unbound[15759:0] info: start of service (unbound 1.10.1).
Oct 09 00:07:45 unbound[15759:0] info: 127.0.0.1 infoblox.com. A IN
Oct 09 00:07:45 unbound[15759:0] info: resolving infoblox.com. A IN
Oct 09 00:07:45 unbound[15759:0] info: priming . IN NS
Oct 09 00:07:45 unbound[15759:0] info: response for . NS IN
Oct 09 00:07:45 unbound[15759:0] info: reply from <.> 192.203.230.10#53
Oct 09 00:07:45 unbound[15759:0] info: query response was ANSWER
Oct 09 00:07:45 unbound[15759:0] info: priming successful for . NS IN
Oct 09 00:07:45 unbound[15759:0] info: response for infoblox.com. A IN
Oct 09 00:07:45 unbound[15759:0] info: reply from <.> 2001:500:2d::d#53
Oct 09 00:07:45 unbound[15759:0] info: query response was REFERRAL
Oct 09 00:07:45 unbound[15759:0] info: response for infoblox.com. A IN
Oct 09 00:07:45 unbound[15759:0] info: reply from <com.> 2001:500:d937::30#53

jsha · October 9, 2020, 12:26am

Congrats! Glad we were able to help you figure it out!
