Widespread SERVFAIL problem related to DNS 0x20

jsha · February 6, 2019, 12:34am

We probably won't disable capsforid - we've been running with it for three years now, and while it's occasionally unearthed some resolvers that break, it's been rare, and most of them have been willing to roll out fixes. It seems like in this case, it was actually the rollout of our edns-buffer-size: 512 change that triggered this bug in Unbound. If anything we might consider increasing that somewhat to a number that is still lower than most Internet path MTUs, but it's not immediately clear to me that this issue warrants that.
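For readers unfamiliar with 0x20, here is a rough sketch of the general idea behind it (what Unbound's `use-caps-for-id` option turns on): the resolver randomizes the letter case of the query name and expects the answer to echo that exact casing. This is a simplified illustration and not Unbound's actual implementation; the helper names `randomize_case` and `response_matches` are hypothetical.

```python
import random
import string

def randomize_case(name, rng=random):
    """Return `name` with each ASCII letter randomly upper- or lower-cased."""
    return "".join(
        (ch.upper() if rng.random() < 0.5 else ch.lower())
        if ch in string.ascii_letters else ch
        for ch in name
    )

def response_matches(sent_qname, echoed_qname):
    """0x20 check: the echoed question name must match exactly, case included."""
    return sent_qname == echoed_qname

query = randomize_case("example.be")

# A server that copies the question section verbatim passes the check.
print(response_matches(query, query))

# A server that lowercases the name fails whenever the query had an uppercase letter.
print(response_matches("ExAmPlE.bE", "example.be"))
```

DNS names compare case-insensitively, so the randomized casing does not change which records are looked up; it only adds bits that an off-path spoofer would have to guess.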

My understanding is that even before CommunityDNS rolled out their change, this was an intermittent error, so one attempt might fail but the next attempt might succeed. In a typical setup with renewal attempts starting at 30 days and retrying twice a day, I would expect that the vast majority of renewals would eventually succeed. Do you have any examples of certificates that are consistently failing renewal?
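As a back-of-the-envelope check on that expectation, the sketch below computes the chance that every renewal attempt in the window fails. It assumes attempts fail independently with a fixed probability, which is a simplification, and the 0.8 per-attempt failure rate is a purely hypothetical number chosen to be pessimistic, not a measured one.

```python
def prob_all_attempts_fail(per_attempt_failure, days, attempts_per_day):
    """Probability that every attempt fails, assuming independent attempts."""
    attempts = days * attempts_per_day
    return per_attempt_failure ** attempts

# 30-day renewal window, two attempts per day, hypothetical 80% failure rate.
p = prob_all_attempts_fail(0.8, 30, 2)
print(f"{p:.2e}")
```

Under those assumptions the result is on the order of one in a million. If failures are correlated (for example, a name that always hits the same broken server), the independence assumption does not hold and the real figure could be much worse.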

The validation logs from Boulder show a pretty typical base rate of SERVFAILs related to domains ending in ".be" over the last 14 days -- about 5 per hour.

rg305 · February 6, 2019, 1:04am

Is that 5 per hour increasing? Or is it a revolving pool of names, where some of those names eventually get served and drop off?

jsha · February 6, 2019, 2:26am

Nope, not increasing. Mostly it’s a revolving pool of names, but they are failing for unrelated reasons (e.g. the authoritative server returns SERVFAIL for CAA).

mnordhoff · February 6, 2019, 3:40am

I dunno... I was worried about one user with a multi-domain .pl certificate. They were able to renew 2 days after posting here, but that was about 12 hours after their previous certificate expired.

jsha · February 6, 2019, 4:06am

Thanks for flagging that thread - I missed it previously, and I’m bummed to hear they had a 12-hour outage. If there are others that are still stuck on repeated retries, please do let me know.

zmousm · February 6, 2019, 6:38am

A friend said they had to (manually) try 5 times before the renewal would pass, and this matched what we were seeing with letsdebug.net for .gr domains, but that was before CDNS started rolling out a fix. I don't have real data though and I would certainly trust your logs showing (presumably) a similarly low failure rate for .gr domains. Yet I doubt most affected users would come forward with such a problem, as they get the impression (also from the LE article on CAA) it is due to something on their side -- GRNET has seen reports from users (falsely) claiming we had even disabled LE.

In any case I do get your point about capsforid; I was only suggesting that, knowing the fallback suffers from trivial breakage, 0x20 could be disabled until you get the patch.

jsha · February 6, 2019, 6:41am

Sorry your users have been blaming you for the issues - I know that’s got to be frustrating!

zmousm · February 6, 2019, 9:18am

We were notified that per CDNS the 0x20 fix has been rolled out to all instances as of 20:30 UTC yesterday.

mnordhoff · March 6, 2019, 10:56pm

Okay, now Unbound 1.9.1rc1 is out, which really has the fix.

https://nlnetlabs.nl/pipermail/unbound-users/2019-March/011403.html

(I hope.)

cpu · March 7, 2019, 2:14pm

We've been tracking this release's progress and doing some smoke testing as well.

system · April 6, 2019, 2:14pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.