Widespread SERVFAIL problem related to DNS 0x20

As of today, the problem appears to have been resolved, for the most part (only 5 TLDs still affected, rather than 35 previously).

https://lists.dns-oarc.net/pipermail/dns-operations/2019-February/018373.html

Thanks to whoever worked towards getting this fixed.

Also worth noting:

@_az discovered and reported the bug triggering capsforid fallback failure, leading to SERVFAIL in these cases.

https://nlnetlabs.nl/pipermail/unbound-users/2019-January/011335.html

And then it was fixed:

https://nlnetlabs.nl/pipermail/unbound-users/2019-February/011349.html


The fix made it into Unbound 1.9.0, which was just released.

https://nlnetlabs.nl/projects/unbound/download/

@jsha How quickly can Let’s Encrypt upgrade Unbound? I don’t know what the process is for testing and baking a new major release, but this issue is also serious.

Edit: This post is wrong. See @_az below.


It should be noted that the fix apparently has not been rolled out yet to all anycast instances of affected name servers. So depending on which instance your DNS request is routed to, you may still see this problem, hopefully not for long.

To test this, I created a RIPE Atlas DNS measurement over TCP to 194.0.1.25 (gr-c.ics.forth.gr, an NS for gr) with a mixed-case qname.

Parsing the results, decoding the abuf field, and looking for an all-lowercase qname in the answer shows that two probes (out of 50) reach instances that still exhibit the 0x20 breakage.
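For anyone who wants to repeat that check, here is a minimal sketch of the abuf decoding step. It assumes each result carries the raw DNS reply as a base64 string (the abuf field) and that the question name in the reply is not compressed. The function names and the `build_reply` helper are made up for illustration and are not part of any RIPE Atlas tooling.

```python
import base64
import struct


def read_qname(msg: bytes, offset: int = 12) -> str:
    """Read the question name that follows the 12-byte DNS header."""
    labels = []
    while True:
        length = msg[offset]
        if length == 0:
            break
        if length & 0xC0:
            # Assumption: the first question name is never compressed.
            raise ValueError("unexpected compression pointer in question name")
        offset += 1
        labels.append(msg[offset:offset + length].decode("latin-1"))
        offset += length
    return ".".join(labels) + "."


def echoes_case(abuf_b64: str, sent_qname: str) -> bool:
    """True if the reply repeats the qname with exactly the case we sent."""
    msg = base64.b64decode(abuf_b64)
    (qdcount,) = struct.unpack("!H", msg[4:6])
    if qdcount < 1:
        raise ValueError("reply has no question section")
    expected = sent_qname.rstrip(".") + "."
    return read_qname(msg) == expected


def build_reply(qname: str) -> str:
    """Hypothetical helper: fake a bare reply (header + question) for testing."""
    header = struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 0, 0, 0)
    body = b""
    for label in qname.rstrip(".").split("."):
        body += bytes([len(label)]) + label.encode("ascii")
    body += b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return base64.b64encode(header + body).decode("ascii")


if __name__ == "__main__":
    sent = "ExAmPlE.gr."
    print(echoes_case(build_reply(sent), sent))          # case preserved
    print(echoes_case(build_reply(sent.lower()), sent))  # case lost
```

A probe whose reply makes `echoes_case` return False is reaching an instance that lowercases the question, which is the breakage described above.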


It did not. 1.9.0 is a re-labelled 1.9.0rc1; the fix came in a later commit.

Let's Encrypt will have to wait for 1.9.1 or patch manually.

Well, that explains all the weirdness I experienced yesterday while testing haha.


Argh! Thanks. Sorry for the misinformation. I thought I'd checked the changelog.


Thanks for the clarification. Our current plan is to wait for 1.9.1, but I'm interested to hear your evaluation of the severity of this issue. Do you think it's worth an out-of-band deploy? It sounds like CommunityDNS is well on their way to fixing the issue on their end, and I haven't heard of other servers with the same problem.

I think that patching or disabling capsforid until a patched release is deployed would be the responsible thing to do.

CDNS may have started rolling out a fix, but it is hard to tell when that will be done. 0x20, and especially the fallback, still feel very fragile; no wonder it is still tagged as experimental. I am positive there is a lot more 0x20 breakage out there in the wild. Debugging it is also quite hard IMHO: the Unbound debug log does not really help, and more verbosity, such as what @_az added to showcase the fallback sorting bug, would come in handy.

Hard to judge with anycast :man_shrugging: . From my perspective (2 locations), it’s been fully fixed by CDNS in the past 48h, but whether it still affects the two Let’s Encrypt facilities is probably a question for the VA logs.

We probably won't disable capsforid - we've been running with it for three years now, and while it's occasionally unearthed some resolvers that break, it's been rare, and most of them have been willing to roll out fixes. It seems like in this case, it was actually the rollout of our edns-buffer-size: 512 change that triggered this bug in Unbound. If anything we might consider increasing that somewhat to a number that is still lower than most Internet path MTUs, but it's not immediately clear to me that this issue warrants that.

My understanding is that even before CommunityDNS rolled out their change, this was an intermittent error, so one attempt might fail but the next attempt might succeed. In a typical setup with renewal attempts starting at 30 days and retrying twice a day, I would expect that the vast majority of renewals would eventually succeed. Do you have any examples of certificates that are consistently failing renewal?
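To put rough numbers on that expectation, here is a back-of-the-envelope sketch. It assumes every attempt fails independently with the same probability, which may not hold if a client keeps getting routed to the same broken anycast instance. The 80% per-attempt failure rate is a made-up illustrative value, not something measured.

```python
def renewal_attempts(window_days: int, attempts_per_day: int) -> int:
    """Number of renewal attempts made before the certificate expires."""
    return window_days * attempts_per_day


def prob_all_attempts_fail(per_attempt_failure: float, attempts: int) -> float:
    """Chance that every attempt fails, assuming independent attempts."""
    if not 0.0 <= per_attempt_failure <= 1.0:
        raise ValueError("per_attempt_failure must be between 0 and 1")
    return per_attempt_failure ** attempts


if __name__ == "__main__":
    # 30-day renewal window, two attempts per day (from the post above).
    attempts = renewal_attempts(30, 2)
    # Hypothetical: 80% of attempts hit the intermittent SERVFAIL.
    p_fail = prob_all_attempts_fail(0.8, attempts)
    print(f"{attempts} attempts, P(all fail) = {p_fail:.2e}")
```

Even with a pessimistic per-attempt failure rate, the chance of exhausting the whole window is tiny under the independence assumption. Clients that only start renewing a few days before expiry, or that retry less often, get far fewer attempts.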

The validation logs from Boulder show a pretty typical base rate of SERVFAILs related to domains ending in ".be" over the last 14 days -- about 5 per hour.


Is that 5 per hour increasing, or is it a revolving pool of names, where some of those names (eventually) get served and drop off?

Nope, not increasing. Mostly it’s a revolving pool of names, but they are failing for unrelated reasons (e.g. the authoritative server returns SERVFAIL for CAA).


I dunno... I was worried about one user with a multi-domain .pl certificate. They were able to renew 2 days after posting here, but that was about 12 hours after their previous certificate expired.


Thanks for flagging that thread - I missed it previously, and I’m bummed to hear they had a 12-hour outage. If there are others that are still stuck on repeated retries, please do let me know.

A friend said they had to (manually) try 5 times before the renewal would pass, and this matched what we were seeing with letsdebug.net for .gr domains, but that was before CDNS started rolling out a fix. I don't have real data though and I would certainly trust your logs showing (presumably) a similarly low failure rate for .gr domains. Yet I doubt most affected users would come forward with such a problem, as they get the impression (also from the LE article on CAA) it is due to something on their side -- GRNET has seen reports from users (falsely) claiming we had even disabled LE.

In any case I do get your point about capsforid; I was only suggesting that, knowing the fallback suffers from trivial breakage, 0x20 could be disabled until you get the patch.


Sorry your users have been blaming you for the issues - I know that’s got to be frustrating!

We were notified that per CDNS the 0x20 fix has been rolled out to all instances as of 20:30 UTC yesterday.


Okay, 1.9.1rc1 is now out, and it really does have the fix.

https://nlnetlabs.nl/pipermail/unbound-users/2019-March/011403.html

(I hope.)


We've been tracking this release's progress and doing some smoke testing as well :+1:

