I'm having an issue with these 3 tools resolving A/AAAA records for some of our domains but not others. One of our hosting providers (Pantheon) uses these tools to generate certs for the domains they host for us. However, they say they are unable to renew them because of SERVFAIL responses. All of our domains are hosted on the exact same NSs (NIOS) with nearly the exact same configuration; some work and some do not, but the problem appears to be limited to your environment. The only pattern I have noticed is that anything 'infoblox' fails while everything else works.
Not working:
infoblox.com
infoblox.ch
infoblox.se
blogs.infoblox.com
Working:
dnsadvisor.com
snaproute.com
infobloxfederal.com
I have tried for a couple days to figure out what the problem is to no avail. For anything infoblox.* I never see the requests reach our NS. The same infoblox.* domains from unboundtest seem to get stuck in some referral loop and eventually return a SERVFAIL. But the fact remains they all fail across all 3 of these tools and don't appear to fail anywhere else.
I worked with @rg305 and he too couldn't find anything painfully obvious. Hoping one of the engineers here @jsha or @JamesLE can provide some insight as to what the problem is between these tools and these specific domains.
Hey man, they never really gave me one, just kept giving me links to your tools and screenshots like this one. To me that's irrelevant because we don't use CAA records, so that should be a non-issue.
blogs.infoblox.com: `acme: authorization error
for blogs.infoblox.com: 400 urn:ietf:params:acme:error:dns: DNS problem:
SERVFAIL looking up CAA for infoblox.com - the domain's nameservers may be
malfunctioning`
Thanks for sharing the message. That definitely helps a lot. Also it's really interesting that this fails on unboundtest consistently. That at least makes it easier to figure out than an inconsistent failure.
However, it shouldn't need to resolve your nameservers, because they are in-domain, and the referral from <.com> contains glue for them:
$ dig a iNfobloX.com @c.gtld-servers.net
; <<>> DiG 9.16.1-Ubuntu <<>> a iNfobloX.com @c.gtld-servers.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4870
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 9
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;iNfobloX.com. IN A
;; AUTHORITY SECTION:
iNfobloX.com. 172800 IN NS ns1.iNfobloX.com.
iNfobloX.com. 172800 IN NS ns2.iNfobloX.com.
iNfobloX.com. 172800 IN NS ns3.iNfobloX.com.
iNfobloX.com. 172800 IN NS ns4.iNfobloX.com.
iNfobloX.com. 172800 IN NS ns5.iNfobloX.com.
iNfobloX.com. 172800 IN NS ns6.iNfobloX.com.
;; ADDITIONAL SECTION:
ns1.iNfobloX.com. 172800 IN A 207.47.7.140
ns2.iNfobloX.com. 172800 IN A 205.234.19.211
ns2.iNfobloX.com. 172800 IN AAAA 2620:10a:6001:fffe::11
ns3.iNfobloX.com. 172800 IN A 205.234.19.10
ns3.iNfobloX.com. 172800 IN AAAA 2620:10a:6001:fffe::10
ns4.iNfobloX.com. 172800 IN A 207.47.7.139
ns5.iNfobloX.com. 172800 IN A 52.21.154.140
ns6.iNfobloX.com. 172800 IN A 23.99.82.199
My next thought was that some component might not be honoring caps-for-id, which our Unbound config uses, but that's clearly not the case as you can see above. Another thought is a DNSSEC misconfiguration, which can lead to SERVFAILs. But according to https://dnsviz.net/d/infoblox.com/dnssec/ your DNSSEC looks fine.
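For anyone unfamiliar with caps-for-id (also called 0x20 encoding): the resolver randomizes the letter case of the query name and expects the server to echo the question back with exactly the same case, which is what the iNfobloX.com output above shows. Here is a minimal Python sketch of the idea. It is not Unbound's actual implementation, and the function names `randomize_case` and `echo_matches` are made up for illustration.

```python
import random


def randomize_case(name, rng):
    """Return `name` with each ASCII letter randomly upper- or lower-cased."""
    return "".join(
        rng.choice((ch.lower(), ch.upper())) if ch.isalpha() else ch
        for ch in name
    )


def echo_matches(sent_qname, echoed_qname):
    """A resolver using caps-for-id accepts a reply only if the question
    name comes back with exactly the case it sent."""
    return sent_qname == echoed_qname


if __name__ == "__main__":
    rng = random.Random()  # a real resolver would want a secure random source
    sent = randomize_case("infoblox.com", rng)
    print("sent:", sent)
    print("server echoed the same case:", echo_matches(sent, sent))
    print("server lower-cased the name:", echo_matches(sent, sent.lower()))
```

A server that normalizes the name to lower case would fail the second check unless the random draw happened to be all lower case.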
My next recommended step would be to do some winnowing down of the problem: Set up an Unbound instance with the default config, and see if it resolves your hostname. If it doesn't, we may have found a general problem with Unbound. If it does, the difference is probably between the default Unbound config and the config on unboundtest.com (which is pretty close to our prod config, modulo performance tweaks). Try changing individual settings and re-testing to see if you get a successful resolution. I'll let you know if I think of anything else. This is a bit of a stumper, I'm afraid.
According to my own log and packet capture, Unbound spends its time chasing its tail for those exact glue records, though I'm wary of misunderstanding what Unbound's intent is.
Maybe most domains avoid hitting this problem because they don't have 6 nameservers, so the glue fits in the response.
Ooh, excellent find! Yes, it looks like that is probably a big part of it. I think it's not the nameservers that push the AUTHORITY sections past the limit, but the DNSSEC records.
Normally, if a response is larger than edns-buffer-size, the authoritative server is supposed to set the TC (truncated) bit in its response. That would cause Unbound to fall back to TCP, which would succeed.
However, some authoritative servers have what, in my mind, is a bug: If the only records falling over the line are in the AUTHORITY section, the server will truncate them without setting the TC bit. Depending on your reading of the spec, this could be argued to be correct or incorrect, but it definitely breaks things.
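To make the TC bit concrete, here is a rough Python sketch that decodes the fixed 12-byte DNS header and reports whether a resolver would be expected to retry over TCP. The helper names are my own, chosen for illustration, and this is not how Unbound is implemented; the flag layout follows RFC 1035. The sample header reuses the id, flags, and section counts from the dig output earlier in the thread.

```python
import struct

QR_MASK = 0x8000     # response bit
TC_MASK = 0x0200     # truncation bit
RCODE_MASK = 0x000F  # response code (0 = NOERROR, 2 = SERVFAIL)


def parse_header(message):
    """Decode the fixed 12-byte header at the start of a DNS message."""
    if len(message) < 12:
        raise ValueError("DNS message is shorter than the 12-byte header")
    ident, flags, qd, an, ns, ar = struct.unpack("!6H", message[:12])
    return {
        "id": ident,
        "qr": bool(flags & QR_MASK),
        "tc": bool(flags & TC_MASK),
        "rcode": flags & RCODE_MASK,
        "qdcount": qd,
        "ancount": an,
        "nscount": ns,
        "arcount": ar,
    }


def should_retry_over_tcp(message):
    """True when the server marked its UDP reply as truncated."""
    return parse_header(message)["tc"]


if __name__ == "__main__":
    # Header only: id 4870, flags "qr rd", 1 question, 0 answers,
    # 6 authority records, 9 additional records.
    header = struct.pack("!6H", 4870, 0x8100, 1, 0, 6, 9)
    print(parse_header(header))
    print("retry over TCP:", should_retry_over_tcp(header))
```

The failure mode described above is the case where records were dropped from the reply but the TC bit stayed clear, so a check like `should_retry_over_tcp` never triggers the TCP fallback.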
So, @thisisbroken, a quick fix for you would probably be to configure at least one of your nameservers to be under a different domain. I realize this is probably not your preferred solution, but it would be (I'm guessing) one way to get things working quickly.
@jsha DNSSEC and packet size may be an issue with infoblox.com and unbound but infoblox.ch doesn't have DNSSEC enabled and yet it returns a SERVFAIL as well. I know NS1 is missing from the delegation for infoblox.ch as well but would that cause a SERVFAIL from your end?
Are you saying that changing the domain of one of my name servers may allow Let's Encrypt to successfully resolve these domains? I don't use Let's Encrypt or Unbound at all, and I certainly don't want to be changing my environment because these tools don't work.
Where exactly is this problem? Within the Let's Encrypt Unbound config?
@_az I agree about DNSSEC with .com but didn't realize that domains not doing DNSSEC would suffer the same fate. So the problem is with Unbound, and I assume it's a bug? Hence why @jsha recommended I change my environment to get this working? I just need to know for sure so I can go to my boss and explain the situation.
I think it's less of a bug and more an unfortunate confluence of circumstances, including Let's Encrypt's choice of EDNS buffer size, the behavior of the gTLD nameservers, and the number of records in your infoblox.com delegation.
You are the collateral damage of that.
If I may make some other suggestions, not having 4 DS records, and not having 6 NS records could also get you to an acceptable response size.
If you are really intent on not changing anything about your DNS, consider that there are other free ACME CAs available who may not have this issue with their recursors, but you'd need to convince Pantheon to use them.
It does seem that all the affected domain names are using the exact same set of authoritative servers.
I first thought to discount the six as being an issue; as I have seen several domains with as many as twelve name servers and all use DNSSEC and have never encountered the problem seen here.
BUT (and this is the great catch @_az) they have all used name servers from varied TLDs - so I haven't seen any single root server system replying with more than two glue entries (four if you count A & AAAA).
So this may very well be the straw that broke that DNS camel's back:
Six NS entries, all from the same TLD, all with both A & AAAA records and using DNSSEC.
[Can there even be anything else added to such an order? YEAH - make it fit into 512 bytes!]
infoblox.ch's nameservers are ns1.infoblox.com etc, so looking up those nameservers fails for the same reason above.
Probably! I'll outline a few options below.
It sounds like you're using Let's Encrypt via Pantheon.
@_az summarized this quite well. There's no bug in Unbound, but a confluence of settings between Let's Encrypt, the .com nameservers, and your DNS zone. Together, those settings are producing the error.
Some options for you:
1) As suggested above, change to some third-party nameserver for at least one of your nameservers.
2) Reduce the size of your zone. For instance:
2a) Reduce the number of nameservers in your zone, or
2b) Remove one of your DNSSEC DS records. According to infoblox.com | DNSViz, you have records for two different keys: 49879 and 33613. 33613 is not currently in use. Sometimes that happens because you're migrating towards or away from a given key. In either case, finishing the migration and removing the excess records should help.
3) Weirdly, increasing the size of your responses should help. Right now, the .com nameservers can get their answer below 512 bytes by dropping the additional section, so they don't set TC. However, if the answer was bigger than 512 bytes even after dropping the additional section, I believe they would set the TC bit, which would cause TCP fallback and everything to work happily. For instance, you could probably achieve this by adding even more unused DS records.
By the way, I see according to https://crt.sh/?q=infoblox.com that you've been getting Let's Encrypt certificates for a while now. That suggests to me that you are only seeing this issue now because you happen to be in the middle of a DNSSEC key rollover while you're trying to renew, and the extra records introduced by the rollover push you right up to the edge of 512. That's what leads me to believe that (2b) is the most straightforward workaround.
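If you want to play with the numbers, here is a back-of-the-envelope Python sketch that estimates the wire size of a signed referral with and without glue. Treat it as an illustration under assumptions, not a measurement: it assumes every owner name compresses to a 2-byte pointer, that each nameserver is a single label under the zone, and that the DS RRset carries one RRSIG. The 32-byte digests and the 128-byte signature in the example are placeholders I picked, not values read from the real infoblox.com delegation, and the function names are hypothetical.

```python
HEADER = 12       # fixed DNS header
RR_FIXED = 10     # type(2) + class(2) + ttl(4) + rdlength(2)
POINTER = 2       # compressed owner name pointing at an earlier name
OPT_RR = 11       # EDNS OPT pseudo-record carrying no options
RRSIG_FIXED = 18  # RRSIG rdata before the signer name and signature


def name_wire_len(name):
    """Uncompressed wire length of a domain name, including the root byte."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return sum(1 + len(label) for label in labels) + 1


def referral_size(zone, ns_labels, ds_digest_lens, sig_len, signer,
                  glue_a=0, glue_aaaa=0):
    """Estimate the size in bytes of a referral for `zone`."""
    question = name_wire_len(zone) + 4  # qname + qtype(2) + qclass(2)
    # NS rdata: length byte + label + pointer back to the zone name.
    ns = sum(POINTER + RR_FIXED + 1 + len(label) + POINTER
             for label in ns_labels)
    # DS rdata: key tag(2) + algorithm(1) + digest type(1) + digest.
    ds = sum(POINTER + RR_FIXED + 4 + n for n in ds_digest_lens)
    rrsig = 0
    if ds_digest_lens:
        # The signer name inside RRSIG rdata is not compressed.
        rrsig = (POINTER + RR_FIXED + RRSIG_FIXED
                 + name_wire_len(signer) + sig_len)
    glue = (glue_a * (POINTER + RR_FIXED + 4)
            + glue_aaaa * (POINTER + RR_FIXED + 16))
    return HEADER + question + ns + ds + rrsig + OPT_RR + glue


if __name__ == "__main__":
    ns = ["ns1", "ns2", "ns3", "ns4", "ns5", "ns6"]
    for ds_count in (4, 2):
        digests = [32] * ds_count  # placeholder digest length
        bare = referral_size("infoblox.com", ns, digests, 128, "com")
        full = referral_size("infoblox.com", ns, digests, 128, "com",
                             glue_a=6, glue_aaaa=2)
        print(f"{ds_count} DS: {bare} bytes without glue, {full} with glue")
```

Changing the NS count, the number of DS records, or the glue counts shows how each of the options above moves the estimate relative to the 512-byte line.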
I went ahead and removed the 2 old DS keys and 2 NSs from infoblox.com, and unboundtest and letsdebug both work now. I assume Let's Encrypt will as well? I should have deleted the 2 DS keys and then tested, but I didn't. Both infoblox.ch (6 NS) and infoblox.se (4 NS) resolve as well, so I will assume this is fixed. I will add the 2 NSs back and test again to see whether it makes any difference.
As for why there were 4 DS keys, it's because I usually leave the old 2 for a couple of days when adding the new ones but simply forgot to remove them. That, along with the 2 additional NSs, seems to have pushed this over the limit.
Thanks @_az for figuring this out and thanks @jsha and @rg305 for all of your help.
Response:
;; opcode: QUERY, status: NOERROR, id: 44209
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;infoblox.com. IN A
;; ANSWER SECTION:
infoblox.com. 0 IN A 23.185.0.3
----- Unbound logs -----
Oct 09 00:07:44 unbound[15759:0] notice: init module 0: validator
Oct 09 00:07:44 unbound[15759:0] notice: init module 1: iterator
Oct 09 00:07:44 unbound[15759:0] info: start of service (unbound 1.10.1).
Oct 09 00:07:45 unbound[15759:0] info: 127.0.0.1 infoblox.com. A IN
Oct 09 00:07:45 unbound[15759:0] info: resolving infoblox.com. A IN
Oct 09 00:07:45 unbound[15759:0] info: priming . IN NS
Oct 09 00:07:45 unbound[15759:0] info: response for . NS IN
Oct 09 00:07:45 unbound[15759:0] info: reply from <.> 192.203.230.10#53
Oct 09 00:07:45 unbound[15759:0] info: query response was ANSWER
Oct 09 00:07:45 unbound[15759:0] info: priming successful for . NS IN
Oct 09 00:07:45 unbound[15759:0] info: response for infoblox.com. A IN
Oct 09 00:07:45 unbound[15759:0] info: reply from <.> 2001:500:2d::d#53
Oct 09 00:07:45 unbound[15759:0] info: query response was REFERRAL
Oct 09 00:07:45 unbound[15759:0] info: response for infoblox.com. A IN
Oct 09 00:07:45 unbound[15759:0] info: reply from <com.> 2001:500:d937::30#53