DNS failures (SERVFAIL, timeout) for domains using Network Solutions/Web.com/worldnic.com nameservers

jsha · February 20, 2020, 10:38pm

Thanks for the additional detail about your system! That is helpful to understand.

mnordhoff · February 20, 2020, 10:44pm

As a short-term measure, if a certificate has (say) 20 non-NetSol domains and 5 failing NetSol domains, could you start splitting it into:

One certificate with the 20 non-NetSol domains, and
Five certificates with one NetSol domain each.

It would reduce the risk for your non-NetSol customers.

If usually some NetSol domains successfully validate and some fail, the successful ones would get certificates.

It would increase the number of certificates you have to manage, but not by an extreme amount… It would significantly increase the number of renewals you’re processing now, though. (And every ~60 days forever after.)
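
A minimal Python sketch of that split, assuming you already track each domain's nameservers. The `nameservers` mapping, the suffix list, and the `split_for_issuance` helper are all hypothetical names for illustration, not part of Certbot:

```python
# Assumption: these suffixes identify Network Solutions nameservers.
NETSOL_SUFFIXES = (".worldnic.com", ".netsol.com")


def uses_netsol(ns_hosts):
    """Return True if any nameserver hostname ends with a NetSol suffix."""
    return any(
        ns.lower().rstrip(".").endswith(NETSOL_SUFFIXES) for ns in ns_hosts
    )


def split_for_issuance(domains, nameservers):
    """Split one SAN list into a shared group plus one group per NetSol domain.

    `nameservers` maps each domain to a list of NS hostnames (hypothetical
    input; gather it however your system already tracks DNS).
    """
    shared = [d for d in domains if not uses_netsol(nameservers.get(d, []))]
    isolated = [[d] for d in domains if uses_netsol(nameservers.get(d, []))]
    # The shared certificate comes first, followed by the single-domain ones.
    return ([shared] if shared else []) + isolated
```

Each returned group would then be requested as its own certificate, so a failing NetSol domain only blocks the certificate that contains it.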

kf6nux · February 20, 2020, 10:45pm

Our system is similar to lancedolan’s. Fortunately for us, we caught the behavior in time to prevent our customers from losing HTTPS, and it was pretty easy for us to modify our system to increase the number of retries.

It’s definitely affected some of our Service Level Objectives though (specifically, time to provision new certificates). This has had the largest impact on the portion of our customers that wait until the last minute (right before they want to launch their new website) to attempt to get a cert. Many of them have come to expect they’ll reliably have a cert issued and deployed under an hour. (I know we should do a better job of setting expectations lower, but that’s been a difficult conversation with my product owner).
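
For anyone making a similar change, here is a rough sketch of a bounded retry loop with doubling delays. The `issue` callable is hypothetical (it stands in for whatever your system does to request a certificate), and the attempt cap and delays are illustrative values that should be chosen against the documented Let's Encrypt rate limits:

```python
import time


def issue_with_retries(issue, max_attempts=5, base_delay=60.0, sleep=time.sleep):
    """Call `issue()` until it succeeds or attempts run out.

    `issue` is a hypothetical callable returning True on success and False
    on a validation failure (for example SERVFAIL or a timeout). The wait
    doubles after each failure: base_delay, 2*base_delay, 4*base_delay, ...
    """
    for attempt in range(max_attempts):
        if issue():
            return attempt + 1  # number of attempts used
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None  # gave up without success
```

Capping the attempts keeps a persistently failing domain from generating an unbounded stream of failed validations.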

lancedolan · February 21, 2020, 6:35pm

@mnordhoff’s solution is the worst-case fallback I’ve been considering. It would take a lot of custom development on our end. Also, those “netsol-only” certs would only be renewed (or have new NetSol domains added to them when we get new customers) by a process of rigorous retrying, which is too dangerous to do with our normal LE account for rate-limit reasons. The only way I know of to manage separate LE accounts is to have separate certbot servers, meaning we’d need to reconfigure our routing for /.well-known traffic each time we run one application server instead of the other… It gets really ugly really fast.

What we really need is for web.com to play nicely with Let’s Encrypt, like every other set of nameservers.

mnordhoff · February 21, 2020, 6:51pm

I'm not encouraging this, but FYI, Certbot can use multiple accounts. The sticking point is that Certbot won't voluntarily create multiple accounts. You can do something like:

Temporarily rename /etc/letsencrypt/accounts/. (certbot renew won't work until you fix it!)
Do something that will create an account, like running certbot register or creating a certificate.
Merge the two accounts directories together.

When creating certificates, Certbot will interactively prompt you to choose an account, or you can use the --account command line option (with one of the 128-bit hashes Certbot uses as local account IDs, I think) to pick one.

(certbot renew remembers which account it should use for each cert.)

lancedolan · February 21, 2020, 7:00pm

Thanks for the help

I actually forgot about the --account toggle. We did that once before during a major rate-limit outage, but it is definitely a hack. Manipulating the underlying directory structure based on knowledge of implementation details seems to me a violation of the client-service contract between the certbot CLI API and the user. If I’m right about that, I really don’t want a production system, especially one of our volume, operating on a potentially brittle hack that isn’t forward compatible as we upgrade certbot in the future.

Perhaps I can learn that the certbot dev team happily supports this style of usage, despite their not providing a CLI API for it.

We’re trying to come up with a long term solution that covers for the case that this network flakiness between LE and some problematic name servers goes unsolved long term, or even for it cropping up again with different/new name servers.

kf6nux · February 21, 2020, 8:18pm

@lancedolan, if your LE account is owned by a large provider, have you tried getting your rate limit increased via https://goo.gl/forms/plqRgFVnZbdGhE9n1 ? (from https://letsencrypt.org/docs/rate-limits/ ) Did you already get an increase and are still bumping into the rate limit?

lancedolan · February 21, 2020, 8:20pm

I was unaware of this form. Thanks I’ll look into it!

lancedolan · February 21, 2020, 11:40pm

It appears that form only allows for growing the max certs per week rate limit, which we’re not concerned about. My concern is that we’ll be failing during frequent retries and get locked out. We’ve been locked out for 7 days twice now, and each time had major business fallout, and both times were due to rate limits created while failing and retrying (pending authz with acme 1).

Perhaps because we’re acme 2 now, with modern certbot, that’s not a concern at all and we can start retrying every couple hours for these web.com domains without paranoia. I’ve started a separate forum post to confirm.

JamesLE · February 22, 2020, 1:37am

Since the problem is sporadic, it’s been difficult to collect useful data. This does not look like a general network/routing issue; we may indeed be being rate limited. We’re in touch with both Web.com (Network Solutions / Register.com) and F5 Silverline, their DDoS protection provider. We’ll keep investigating and trying to get this resolved.

jsha · February 22, 2020, 1:57am

@lancedolan It occurred to me that, if rate limiting turns out to be the issue, the style of 100-SAN certificates you issue may be exacerbating the problem. Because Certbot validates all the challenges for a single certificate at once, each 100-SAN certificate generates a bit over 200 DNS queries (one per hostname for HTTP-01, plus at least one per hostname for CAA, depending on how much of the DNS parent tree overlaps between the names being validated). That seems more likely to trigger rate limiting than a series of smaller certificates.
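
As a back-of-the-envelope check of that estimate, here is a small Python sketch. It assumes exactly one address lookup and one CAA lookup per hostname, so it is only a lower bound; CNAME chains, CAA lookups on parent names, and retries would all add to it. The function name is hypothetical:

```python
def min_validation_queries(num_hostnames, address_lookups=1, caa_lookups=1):
    """Lower bound on DNS queries needed to validate one certificate.

    Assumes one address lookup (for HTTP-01) and one CAA lookup per
    hostname; real validations can issue more.
    """
    return num_hostnames * (address_lookups + caa_lookups)


# A 100-SAN certificate versus a 10-SAN certificate, validated in one burst.
burst_100 = min_validation_queries(100)  # 200
burst_10 = min_validation_queries(10)    # 20
```

The total over many small certificates is the same, but each individual burst is an order of magnitude smaller.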

I think I’ve given you this feedback before, but: 100-SAN certificates introduce a lot of issues that smaller certificates don’t. I think you’d really benefit from spending engineering time on a migration to smaller certificates, rather than on mitigations to problems with 100-SAN certificates.

lancedolan · February 22, 2020, 2:30am

Thanks for thinking of us jsha!

Our maximum number of certs increased greatly recently, and new certs are typically being generated with 10 or fewer domains. The problem occurs when renewing older certs that still have 70 or 80 domains in a SAN. I assumed network traffic on the LE side is a function of the number of domains, as you described, so we did some goofy stuff to manually move some web.com domains into small SANs, but we don’t have a sustainable process yet.

jsha · February 22, 2020, 2:33am

Congrats! (I assume you meant decreased?)

lancedolan · February 23, 2020, 8:47pm

No no, the max number of certs our CDN (where our SSL termination is) allows us to install increased. So, fewer domains per cert. Our new domains are given fresh new certs rather than being added to existing certs that already have between 70 and 99 domains in them.

knas · February 24, 2020, 8:30pm

Just to give a status update, we are seeing even fewer approvals. At one point I was able to force a small percentage through, but that seems to have stopped.

jsha · February 24, 2020, 10:28pm

Thanks for the update!

JamesLE · February 25, 2020, 3:08am

The behavior we’re seeing looks consistent with rate limiting. We’re still collecting data and working with Web.com and F5 Silverline. If anyone has contacts inside either company who could help get more eyes on this issue, that would be great! Please PM me.

kf6nux · February 25, 2020, 3:28am

I’ve asked around, but I don’t have any contacts at NetSol to share at the moment.

Separately, while I was exploring a failed cert issuance, I noticed something odd. It seems worldnic is claiming it’s the SOA and NS of io. All other resolvers I’ve used say otherwise.

dig CAA www.aUdUbon.org @ns49.worldnic.com
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61707
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1
;; ANSWER SECTION:
www.aUdUbon.org.	7200	IN	CNAME	live-nas-national.pantheonsite.io.
;; AUTHORITY SECTION:
io.			3600	IN	SOA	ns41.worldnic.com. dns.worldnic.com. 2016010801 3600 600 1209600 3600

When querying that NS, it does in fact claim (non-authoritatively) it’s the SOA and NS for io.

dig +short SOA io. @ns41.worldnic.com 
ns41.worldnic.com. dns.worldnic.com. 2016010801 3600 600 1209600 3600
dig +short NS io. @ns41.worldnic.com 
ns41.worldnic.com.

All public resolvers say otherwise

for resolver in 1.1.1.1 8.8.8.8 64.6.64.6 208.67.222.222; do dig +short SOA io. @$resolver; done
a0.nic.io. noc.afilias-nst.info. 1497788345 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788347 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788345 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788347 10800 3600 2764800 900

And of course, all the NS provided by the root servers agree.

for ns in a0.nic.io. a2.nic.io. b0.nic.io. c0.nic.io.; do dig +short SOA io. @$ns; done
a0.nic.io. noc.afilias-nst.info. 1497788348 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788347 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788348 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788348 10800 3600 2764800 900

Could this erroneous unsolicited SOA record be causing trouble with Boulder or its resolver?

mnordhoff · February 25, 2020, 3:46am

It shouldn’t cause any problems. Issues like that, accidental or malicious, are common, and resolvers are designed to be careful and discard records like that.
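
To illustrate the kind of check involved, here is a simplified Python sketch of a bailiwick test: a record is only trusted if its owner name falls inside the zone the queried server was asked about. Real resolvers apply more detailed rules than this, and the `in_bailiwick` name is hypothetical:

```python
def in_bailiwick(owner, zone):
    """Return True if `owner` equals `zone` or is a subdomain of it.

    Comparison is case-insensitive and done label by label, so
    "notaudubon.org" is not treated as being inside "audubon.org".
    """
    owner_labels = owner.lower().rstrip(".").split(".")
    zone_labels = zone.lower().rstrip(".").split(".")
    if zone_labels == [""]:  # the root zone contains every name
        return True
    return owner_labels[-len(zone_labels):] == zone_labels


# The SOA for "io." arrived in a response about a name under audubon.org,
# so a check like this would lead a resolver to discard it.
keep_soa = in_bailiwick("io.", "audubon.org.")
keep_cname = in_bailiwick("www.aUdUbon.org.", "audubon.org.")
```

Here `keep_soa` is False and `keep_cname` is True.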

kf6nux · February 25, 2020, 10:10pm

Does Let’s Encrypt’s resolver set the z flag? I noticed the NetSol nameservers time out when the z flag is set:

dig +zflag A ns31.worldnic.com @ns1.netsol.com
;; connection timed out; no servers could be reached