DNS failures (SERVFAIL, timeout) for domains using Network Solutions/Web.com/worldnic.com nameservers

kf6nux · February 14, 2020, 9:52pm

Example authz:
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2818997246
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2819045748
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2819039333

We use a custom client to issue certs on our platform.

We have retried many of these authz/domains up to 5 times for more than an hour.

Most of these domains have previously been authorized successfully by LE.

I find no DNSSEC issues, no common NS, no SERVFAILs on public resolvers or authoritative NS.

_az · February 14, 2020, 9:57pm

Well, kinda.

They all share the same Network Solutions .worldnic.com nameservers.

This could suggest that something is going wrong between the respective networks of Network Solutions and Let's Encrypt.

Edit: If I pick a totally random domain from a different .worldnic.com nameserver (but not one used by your domains) such as royalgazette.com, it also produces a DNS failure: https://acme-v02.api.letsencrypt.org/acme/authz-v3/2819689481

I'd probably be hitting up Network Solutions' support.

kf6nux · February 14, 2020, 10:10pm

Sorry I missed that. Thanks! Any idea why it would just be LE that’s having difficulty getting a response from those NS?

_az · February 14, 2020, 10:13pm

Don’t really know.

I think that Let’s Encrypt’s resolvers tend to send a lot of traffic to authoritative nameservers (compared to normal resolvers) because they keep practically zero cache and query multiple record types at once. That could trigger some kind of rate limiting or firewall behavior on the Network Solutions side.

Or it might just be a regular old routing ****up between Viawest and Network Solutions.

Edit: I just tried again for the domain and it worked. Can you retry?

kf6nux · February 14, 2020, 10:58pm

I looked at https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820079854 and don’t see worldnic

dig NS +short www.weatherizationservices-wi.com

dig NS +short weatherizationservices-wi.com
dns2.registeredsite.com.
dns3.registeredsite.com.
dns1.registeredsite.com.

for i in {1..3}; do dig NS +short dns$i.registeredsite.com.; done

dig NS +short registeredsite.com
1.ns.web.com.
2.ns.web.com.
3.ns.web.com.

for i in {1..3}; do dig NS +short $i.ns.web.com.; done

dig NS +short ns.web.com.

dig NS +short web.com.
1.ns.web.com.
2.ns.web.com.

Any thoughts?
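
The manual walk above (query a name, get nothing, drop the leftmost label, query again) can be sketched as a small helper. This is only an illustration: `find_zone_ns` and `dig_ns` are hypothetical names, the default lookup shells out to the same `dig NS +short` command used above, and it asks whatever recursive resolver the machine is configured with, which is not necessarily what Let's Encrypt's resolvers see.

```python
import subprocess
from typing import Callable, List, Tuple


def dig_ns(name: str) -> List[str]:
    """Run `dig NS +short <name>` and return the non-empty answer lines."""
    out = subprocess.run(
        ["dig", "NS", "+short", name],
        capture_output=True, text=True, check=False,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def find_zone_ns(
    name: str, lookup: Callable[[str], List[str]] = dig_ns
) -> Tuple[str, List[str]]:
    """Drop leading labels until a name that has NS records is found.

    Returns (name_with_ns_records, sorted_nameservers), or ("", []) if
    nothing is found before reaching the bare TLD.
    """
    labels = name.rstrip(".").split(".")
    # Stop one short of the end so the bare TLD is never queried.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        servers = lookup(candidate)
        if servers:
            return candidate, sorted(servers)
    return "", []
```

Calling `find_zone_ns("www.weatherizationservices-wi.com")` would repeat the first two `dig` queries above. Passing a different `lookup` function makes it possible to try the logic without touching the network.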

_az · February 14, 2020, 11:35pm

I believe web.com is still just Network Solutions’ network, but I’m on phone rn and can’t check.

kf6nux · February 15, 2020, 12:07am

You’re right, they’re related. Web.com bought Network Solutions.

knas · February 18, 2020, 9:14pm

Did this eventually just resolve itself? I’m seeing the same issue.

kf6nux · February 19, 2020, 12:18am

@knas No, we’re still seeing issues with issuances for worldnic domains.

Elmervc · February 19, 2020, 8:42am

We’re facing the exact same issue here, also with a domain using the worldnic NS.

What I’ve also noticed:
DNS entries are not propagated to dns.watch (maybe for the same reason?)

This tool also reports “Your nameserver do not include A records when asked for your NS records.” and “mismatched glue”: https://www.dnsqueries.com/en/domain_check.php

lancedolan · February 19, 2020, 3:47pm

Same issue here, for several days now.

We have tens of thousands of domains. Several failures have cropped up recently, and they all have these things in common:

Previously successfully generated a cert for them
DNS appears correctly set up
fails with message urn:ietf:params:acme:error:dns :: DNS problem: SERVFAIL looking up A for <some-domain> - the domain's nameservers may be malfunctioning
uses registrar Network Solutions
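
For anyone who wants to group failures like these in their own logs, here is a rough sketch that pulls the response code, record type, and name out of the error text. It assumes the message has exactly the wording quoted above. That wording is not a stable API and may change, and other DNS errors (timeouts, for example) are worded differently and will simply not match. `parse_dns_problem` is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

# Matches only the wording quoted in this post:
#   "DNS problem: SERVFAIL looking up A for <name> - ..."
_DNS_PROBLEM = re.compile(
    r"DNS problem: (?P<rcode>\S+) looking up (?P<rrtype>\S+) for (?P<name>\S+)"
)


def parse_dns_problem(message: str) -> Optional[Dict[str, str]]:
    """Return {'rcode', 'rrtype', 'name'} for a matching message, else None."""
    match = _DNS_PROBLEM.search(message)
    return match.groupdict() if match else None
```

Counting the parsed results by `rcode` and by the nameservers of `name` is one way to check whether the failures really cluster on a single DNS provider.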

kf6nux · February 19, 2020, 6:44pm

For those who are having trouble, we have had success by just retrying every 15 minutes or so until it’s successful.

I hope the error rate goes down, but I’m preparing myself to accept it as the new normal.
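
A minimal sketch of that "retry every 15 minutes" workaround, for illustration only. `attempt` stands in for whatever your client does to request a new authorization and is purely hypothetical. The `max_attempts` limit is my own assumption, added so that the loop cannot run forever.

```python
import time
from typing import Callable


def retry_until_success(
    attempt: Callable[[], bool],
    interval_seconds: float = 15 * 60,  # "every 15 minutes or so"
    max_attempts: int = 8,              # assumed limit, not from this thread
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call attempt() until it returns True.

    Returns the number of attempts used, or 0 if every attempt failed.
    """
    for n in range(1, max_attempts + 1):
        if attempt():
            return n
        if n < max_attempts:
            # Wait before the next try; no sleep after the final failure.
            sleep(interval_seconds)
    return 0
```

The `sleep` parameter is injectable so the loop can be exercised in a test without actually waiting.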

schoen · February 19, 2020, 7:04pm

Note that some DNS failures for issuance that was previously successful could be a result of Let's Encrypt's new multiperspective validation.

(This isn't a likely explanation for all of the problems that were mentioned on this thread, but it's something that's good to be aware of, especially if you've seen the behavior change very recently.)

jsha · February 20, 2020, 1:30am

This is interesting. Thanks for the detailed reports. I’ll ask our SRE team to look for any routing issues between us and worldnic.

lancedolan · February 20, 2020, 2:31pm

We have a surprising number of Network Solutions customers, and this is starting to hurt. We’ve held off on renewals for several days now to prevent customers from being dropped from their SAN cert during renewal.

This should be impacting lots of big-name SaaS providers, right? Zendesk, etc.?

knas · February 20, 2020, 3:52pm

We also have a large number of customers using Network Solutions. I’ve been somewhat successful in getting a small portion of these to renew by just retrying, but it’s definitely not keeping up. Maybe 10% are renewing after some time?

lancedolan · February 20, 2020, 9:09pm

Special Request
@jsha Could you share details on what exactly is failing between your system and Network Solutions, so that we can contact them and get corrective action moving? Right now we don’t understand the problem well enough to tell them what to correct.

Reason For Urgency
We have Network Solutions customers who’ve lost SSL (and the ability to take payments and do business) at this very moment, and dozens (if not hundreds) of customers who will be in the same boat within a week or two if we don’t find a solution.

mnordhoff · February 20, 2020, 9:24pm

If NetSol/Web.com is rate limiting queries from Let’s Encrypt’s resolvers, it’s only going to get worse as more and more users retry frequently and further increase traffic.
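
If rate limiting is indeed the cause, clients that must retry could at least spread their retries out. Below is a sketch of exponential backoff with "equal jitter", which is a generic technique and not something Let's Encrypt or anyone in this thread has prescribed. The 15-minute base and 6-hour cap are assumed values, and `backoff_delay` is a hypothetical name.

```python
import random


def backoff_delay(
    attempt: int,
    base: float = 900.0,       # assumed: 15 minutes for the first retry
    cap: float = 6 * 3600.0,   # assumed: never wait longer than 6 hours
    rng=random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`. Half of it is a
    fixed wait and the other half is random, so many clients that failed
    at the same moment do not all retry at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return ceiling / 2 + rng.uniform(0, ceiling / 2)
```

With these defaults the first retry comes after 7.5 to 15 minutes and later ones are spaced progressively further apart, which keeps the total query volume toward the nameservers lower than a fixed short interval would.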

jsha · February 20, 2020, 9:31pm

Hi @lancedolan,

I’m afraid we don’t have a diagnosis yet, but if you have a contact at Network Solutions you can put us in touch with, that might help.

How long has this problem been manifesting for you? I would expect that we’d have 30 days from beginning of the problem before we started to see certificates expire.

lancedolan · February 20, 2020, 10:37pm

jsha wrote: “How long has this problem been manifesting for you? I would expect that we’d have 30 days from beginning of the problem before we started to see certificates expire.”

We do renew at 30 days, but we remove failing hostnames from their 100-domain SAN cert at 25 days in order to force the renewal to succeed and maintain at least a 25-day window. The failing hostnames usually belong to customers who are leaving us, so this is never a problem. Some of our NetSol customers were removed from their SAN cert by this process before I noticed. I've temporarily changed this "force renewal by stripping bad domains" threshold to 15 days.
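
To make the policy described above concrete, here is a rough sketch of the decision as a pure function. It is not the actual platform code. The function name, the action strings, and the exact handling of the boundary days are assumptions made for illustration. Only the 30-day renewal point and the 25-day (temporarily 15-day) stripping threshold come from the description.

```python
from typing import List, Set, Tuple


def plan_renewal(
    days_left: int,
    hostnames: List[str],
    failing: Set[str],
    renew_at: int = 30,   # start renewing with 30 days of validity left
    strip_at: int = 25,   # drop failing hostnames at this point (15 for now)
) -> Tuple[str, List[str]]:
    """Return (action, hostnames_to_request) for one SAN certificate."""
    if days_left > renew_at:
        return "wait", hostnames
    if not any(h in failing for h in hostnames):
        return "renew", hostnames
    if days_left > strip_at:
        # Some names fail validation, but there is still time to retry.
        return "retry_later", hostnames
    # Out of time: renew without the failing names so the rest stay covered.
    return "renew_without_failing", [h for h in hostnames if h not in failing]
```

For example, with `strip_at=15` a cert with 20 days left and one failing hostname yields `"retry_later"` with the full list, whereas with the default of 25 the failing hostname would already have been dropped.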

As it stands, we have fewer than a dozen NetSol customers who lost their cert, and dozens (maybe hundreds?) more set to lose their SSL in 7 days, when we hit this 15-day threshold.

For a system as large as ours, and as prone to rate limits, I get very nervous about going much below 15 days. If we let ourselves go down to 0 days and then start chugging through renewals, I'm afraid we'll get rate limited. That would cost us the ability to issue certs for new customers, and if we fail to renew 7 days' worth of certs before being rate limited, we could be forced into letting live certs expire.
