Widespread SERVFAIL problem related to DNS 0x20


#21

We probably won’t disable capsforid - we’ve been running with it for three years now, and while it’s occasionally unearthed some resolvers that break, it’s been rare, and most of them have been willing to roll out fixes. It seems like in this case, it was actually the rollout of our edns-buffer-size: 512 change that triggered this bug in Unbound. If anything we might consider increasing that somewhat to a number that is still lower than most Internet path MTUs, but it’s not immediately clear to me that this issue warrants that.
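For anyone following along who hasn't run into capsforid (DNS 0x20) before, here is a rough sketch of the general idea: the resolver randomizes the letter case of the query name, and only accepts a response that echoes the name back with exactly the same case. This illustrates the technique in general, not Unbound's actual implementation. The function names are hypothetical, and `example.be` is just a placeholder name.

```python
import random
import string


def encode_0x20(qname: str, rng: random.Random) -> str:
    """Randomly flip the case of each ASCII letter in a query name.

    Hypothetical helper for illustration only; real resolvers do this
    on the wire-format name, not on a Python string.
    """
    out = []
    for ch in qname:
        if ch in string.ascii_letters:
            out.append(ch.upper() if rng.random() < 0.5 else ch.lower())
        else:
            # Dots, digits and hyphens have no case to randomize.
            out.append(ch)
    return "".join(out)


def response_matches(sent_qname: str, echoed_qname: str) -> bool:
    """Accept a response only if the echoed name preserves the exact case."""
    return sent_qname == echoed_qname


rng = random.Random()
sent = encode_0x20("example.be", rng)
print(sent, response_matches(sent, sent))
```

An authoritative server that lowercases or otherwise rewrites the name in its reply fails the exact-case comparison, and that is the kind of breakage capsforid occasionally unearths.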

My understanding is that even before CommunityDNS rolled out their change, this was an intermittent error, so one attempt might fail but the next attempt might succeed. In a typical setup, with renewal attempts starting 30 days before expiry and retrying twice a day, I would expect that the vast majority of renewals would eventually succeed. Do you have any examples of certificates that are consistently failing renewal?
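To put rough numbers on that expectation, here is a back-of-the-envelope sketch. It assumes each attempt fails independently with the same probability, which is a simplification: a name that fails for a persistent reason will not behave this way. The failure rates are illustrative guesses, not measured values.

```python
def prob_all_fail(p_fail: float, attempts: int) -> float:
    """Chance that every one of `attempts` independent tries fails."""
    return p_fail ** attempts


# 30-day renewal window, two attempts per day.
ATTEMPTS = 30 * 2

# Illustrative per-attempt failure rates (assumptions, not data).
for p_fail in (0.5, 0.8, 0.95):
    print(f"p_fail={p_fail}: P(never renews) = {prob_all_fail(p_fail, ATTEMPTS):.3g}")
```

Under these assumptions, even an 80% per-attempt failure rate leaves only about a one-in-a-million chance of all 60 attempts failing. The picture changes quickly as the rate approaches 1: at 95%, somewhere around 4 to 5 percent of renewals would never succeed within the window.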

The validation logs from Boulder show a pretty typical base rate of SERVFAILs related to domains ending in “.be” over the last 14 days – about 5 per hour.


#22

Is that 5 per hour increasing? Or is it a revolving pool of names?
[where some of those names (eventually) get served and drop off]


#23

Nope, not increasing. Mostly it’s a revolving pool of names, but they are failing for unrelated reasons (e.g. the authoritative nameserver returns SERVFAIL for CAA).


#24

I dunno… I was worried about one user with a multi-domain .pl certificate. They were able to renew 2 days after posting here, but that was about 12 hours after their previous certificate expired.

https://crt.sh/?q=mail.majchrowski.waw.pl


#25

Thanks for flagging that thread - I missed it previously, and I’m bummed to hear they had a 12 hour outage. If there are others that are still stuck on repeated retries, please do let me know.


#26

A friend said they had to (manually) try 5 times before the renewal would pass, and this matched what we were seeing with letsdebug.net for .gr domains, but that was before CDNS started rolling out a fix. I don’t have real data though and I would certainly trust your logs showing (presumably) a similarly low failure rate for .gr domains. Yet I doubt most affected users would come forward with such a problem, as they get the impression (also from the LE article on CAA) it is due to something on their side – GRNET has seen reports from users (falsely) claiming we had even disabled LE.

In any case I do get your point about capsforid; I was only suggesting that, knowing the fallback suffers from trivial breakage, 0x20 could be disabled until you get the patch.


#27

Sorry your users have been blaming you for the issues - I know that’s got to be frustrating!


#28

We were notified that per CDNS the 0x20 fix has been rolled out to all instances as of 20:30 UTC yesterday.