DNS failures (SERVFAIL, timeout) for domains using Network Solutions/Web.com/worldnic.com nameservers

Did this eventually just resolve itself? I’m seeing the same issue.

@knas No, we’re still seeing issues with issuances for worldnic domains.

We’re facing the exact same issue here, also with a domain using the worldnic NS.

What I’ve also noticed:
DNS entries are not propagated to dns.watch (maybe for the same reason?)

And “Your nameserver do not include A records when asked for your NS records.” + a “mismatched glue” is reported by this tool: https://www.dnsqueries.com/en/domain_check.php

Same issue here, for several days now.

We have tens of thousands of domains. Several failures have cropped up recently, and they all have these things in common:

  • Previously successfully generated a cert for them
  • DNS appears correctly set up
  • Fails with the message urn:ietf:params:acme:error:dns :: DNS problem: SERVFAIL looking up A for <some-domain> - the domain's nameservers may be malfunctioning
  • Uses registrar Network Solutions

For those who are having trouble, we have had success by just retrying every 15 minutes or so until it’s successful.
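A minimal sketch of that retry approach, assuming a hypothetical `attempt_issuance` callable that wraps whatever ACME client you use and returns True on success. The names, the 15-minute base delay, the jitter and the attempt cap are all illustrative assumptions, not anything Let’s Encrypt prescribes. The jitter is there so that many clients do not all hit the same nameservers at the same moment.

```python
import random
import time
from typing import Callable, Iterator, Optional


def retry_delays(base_seconds: float = 900.0, jitter: float = 0.2,
                 max_retries: int = 8,
                 rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one delay (in seconds) per retry, roughly `base_seconds` apart."""
    rng = rng or random.Random()
    for _ in range(max_retries):
        # Vary each delay by +/- `jitter` so clients do not retry in lockstep.
        yield base_seconds * (1 + rng.uniform(-jitter, jitter))


def issue_with_retries(attempt_issuance: Callable[[], bool],
                       sleep: Callable[[float], None] = time.sleep,
                       **delay_options) -> bool:
    """Try once, then retry after each delay until success or retries run out."""
    if attempt_issuance():
        return True
    for delay in retry_delays(**delay_options):
        sleep(delay)
        if attempt_issuance():
            return True
    return False
```

Keeping the retry count capped matters here: unbounded retries add load to resolvers that may already be struggling and can eat into your own rate limits.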

I hope the error rate goes down, but I’m preparing myself to accept it as the new normal :crying_cat_face:

Note that some DNS failures for issuance that was previously successful could be a result of Let’s Encrypt’s new multiperspective validation.

(This isn’t a likely explanation for all of the problems that were mentioned on this thread, but it’s something that’s good to be aware of, especially if you’ve seen the behavior change very recently.)

This is interesting. Thanks for the detailed reports. I’ll ask our SRE team to look for any routing issues between us and worldnic.

We have a surprising number of NS customers, and this is starting to have a real impact. We’ve held off on renewals for several days now to prevent customers from being dropped from their SAN cert during renewal.

This should be impacting lots of big-name SaaS providers, right? Zendesk, etc.?

We also have a large number of customers using Network Solutions. I’ve been somewhat successful in getting a small portion of these to renew by just retrying, but it’s definitely not keeping up. Maybe 10% are renewing after some time?

Special Request
@jsha Could you share details on what exactly is failing between your system and Network Solutions, so that we can contact them and get corrective action moving? Right now we don’t understand the problem well enough to inform them on what to correct.

Reason For Urgency
We have Network Solutions customers who have lost SSL (and the ability to take payments and do business) at this very moment, and dozens (if not hundreds) of customers who will be in the same boat within a week or two if we don’t find a solution.

If NetSol/Web.com is rate limiting queries from Let’s Encrypt’s resolvers, it’s only going to get worse as more and more users retry frequently and further increase traffic. :grimacing:

Hi @lancedolan,

I’m afraid we don’t have a diagnosis yet, but if you have a contact at Network Solutions you can put us in touch with, that might be a help.

How long has this problem been manifesting for you? I would expect that we’d have 30 days from beginning of the problem before we started to see certificates expire.

How long has this problem been manifesting for you? I would expect that we’d have 30 days from beginning of the problem before we started to see certificates expire.

We do renew at 30 days, but we remove failing hostnames from their 100-domain SAN cert at 25 days in order to force the renewal to succeed and maintain at least a 25-day window. The failing hostnames usually belong to customers who are leaving us, so this is never a problem. Some of our NetSol customers were removed from their SAN cert by this process before I noticed. I’ve temporarily changed this “force renewal by stripping bad domains” threshold to 15 days.
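To make the thresholds concrete, here is a rough sketch of that decision. The function and action names are hypothetical, and `failing` is assumed to be the set of hostnames that failed validation on earlier attempts; this is not our actual code.

```python
from typing import List, Set, Tuple


def plan_renewal(days_left: int, hostnames: List[str], failing: Set[str],
                 renew_at_days: int = 30,
                 strip_at_days: int = 25) -> Tuple[str, List[str]]:
    """Return (action, hostnames to request) for one SAN certificate."""
    if days_left > renew_at_days:
        # Not yet inside the renewal window.
        return ("wait", hostnames)
    if days_left > strip_at_days or not failing.intersection(hostnames):
        # Inside the window: keep trying with the full hostname list.
        return ("renew", hostnames)
    # At or past the strip threshold: drop failing hostnames so the rest renew.
    healthy = [h for h in hostnames if h not in failing]
    return ("renew_stripped", healthy)
```

Lowering `strip_at_days` from 25 to 15 gives flaky hostnames ten more days of full-list retries before they are dropped, at the cost of a shorter safety window.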

As it stands, we have fewer than a dozen NetSol customers who lost their cert, and dozens (maybe hundreds?) more set to lose their SSL in 7 days, when we hit this 15-day threshold.

For a system as large as ours, and as prone to rate limits, I get very nervous about going much lower than 15 days. If we let ourselves go down to 0 days and then start chugging through renewals, I’m afraid we’ll get rate limited and not only lose the ability to issue certs for new customers. If we also fail to renew 7 days’ worth of certs before getting rate limited, we could be forced into letting live certs expire.

Thanks for the additional detail about your system! That is helpful to understand.

As a short-term measure, if a certificate has (say) 20 non-NetSol domains and 5 failing NetSol domains, could you start splitting it into:

  1. One certificate with the 20 non-NetSol domains, and
  2. Five certificates with one NetSol domain each.

It would reduce the risk for your non-NetSol customers.

If usually some NetSol domains successfully validate and some fail, the successful ones would get certificates.

It would increase the number of certificates you have to manage, though not by an extreme amount. It would significantly increase the number of renewals you’re processing now, though (and every ~60 days forever after).
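The split described above could be sketched roughly like this. `split_certificate` is a hypothetical helper, and it assumes you already know which hostnames have been failing validation.

```python
from typing import Iterable, List


def split_certificate(hostnames: List[str],
                      failing: Iterable[str]) -> List[List[str]]:
    """Group hostnames into one healthy cert plus one cert per failing name."""
    failing_set = set(failing)
    healthy = [h for h in hostnames if h not in failing_set]
    # One certificate for all hostnames that validate normally (if any).
    groups = [healthy] if healthy else []
    # One single-name certificate for each failing hostname, in original order.
    groups.extend([h] for h in hostnames if h in failing_set)
    return groups
```

With 20 healthy and 5 failing hostnames this returns six groups: one list of 20 names followed by five single-name lists, so one flaky nameserver can no longer block the other 20.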

Our system is similar to lancedolan’s. Fortunately for us, we caught the behavior in time to prevent our customers from losing HTTPS, and it was pretty easy for us to modify our system to increase the number of retries.

It’s definitely affected some of our Service Level Objectives, though (specifically, time to provision new certificates). This has had the largest impact on the portion of our customers who wait until the last minute (right before they want to launch their new website) to attempt to get a cert. Many of them have come to expect that they’ll reliably have a cert issued and deployed in under an hour. (I know we should do a better job of setting expectations lower, but that’s been a difficult conversation with my product owner.)

@mnordhoff’s solution is the worst-case fallback I’ve been considering. It would take a lot of custom development on our end. Also, those “NetSol-only” certs would only be renewed (or have new NetSol domains added to them when we get new customers) by a process of rigorous retrying, which is too dangerous to do with our normal LE account for rate-limit reasons. The only way I know of to manage separate LE accounts is to have separate certbot servers, meaning we would need to reconfigure our routing for /.well-known traffic each time we run one application server instead of the other… It gets really ugly really fast.

What we really need is for web.com to play nicely with Let’s Encrypt, like every other set of nameservers.

I’m not encouraging this, but FYI, Certbot can use multiple accounts. The sticking point is that Certbot won’t voluntarily create multiple accounts. You can do something like:

  • Temporarily rename /etc/letsencrypt/accounts/. (certbot renew won’t work until you fix it!)
  • Do something that creates a new account, such as running certbot register or creating a certificate.
  • Merge the two accounts directories together.

When creating certificates, Certbot will interactively prompt you to choose an account, or you can use the --account command line option (with one of the 128-bit hashes Certbot uses as local account IDs, I think) to pick one.

(certbot renew remembers which account it should use for each cert.)
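As a sketch of the --account usage, this builds the argument list for a certbot certonly call pinned to one local account. The account ID shown in the usage note is a made-up placeholder, `certbot_command` is a hypothetical helper, and authenticator options (for example, webroot settings) are deliberately left out, so you would need to add your own.

```python
import shlex
from typing import List


def certbot_command(account_id: str, domains: List[str]) -> List[str]:
    """Build argv for `certbot certonly` using a specific local account ID."""
    if not domains:
        raise ValueError("at least one domain is required")
    # --account selects which locally stored ACME account certbot uses.
    cmd = ["certbot", "certonly", "--account", account_id]
    for domain in domains:
        cmd += ["-d", domain]
    return cmd


def as_shell(cmd: List[str]) -> str:
    """Render the argv list as a shell-safe string for logging or review."""
    return shlex.join(cmd)
```

For example, `as_shell(certbot_command("0123456789abcdef0123456789abcdef", ["example.com"]))` gives a command line you can inspect before handing the list to `subprocess.run`.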

Thanks for the help :smiley:

I actually forgot about the --account option. We did that once before during a major rate-limit outage, but it is definitely a hack. Manipulating the underlying directory structure based on knowledge of implementation details looks to me like a violation of the client-service contract between the certbot CLI API and the user. If I’m right about that, I really don’t want a production system, especially one of our volume, operating on a potentially brittle hack that may not be forward compatible as we upgrade certbot in the future.

Perhaps I can learn that the certbot dev team happily supports this style of usage, despite their not providing a CLI API for it.

We’re trying to come up with a long-term solution that covers the case where this network flakiness between LE and some problematic name servers goes unsolved, or crops up again with different or new name servers.

@lancedolan, if your LE account is owned by a large provider, have you tried getting your rate limit increased via https://goo.gl/forms/plqRgFVnZbdGhE9n1 ? (from https://letsencrypt.org/docs/rate-limits/ ) Did you already get an increase and are still bumping into the rate limit?