Numerous inexplicable challenge failures across disparate domains with unreproducible SERVFAILs

Example authz:
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2818997246
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2819045748
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2819039333

We use a custom client to issue certs on our platform.

We have retried many of these authz/domains up to 5 times for more than an hour.

Most of these domains have previously been authorized successfully by LE.

I find no DNSSEC issues, no common NS, no SERVFAILs on public resolvers or authoritative NS.
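
For anyone who wants to repeat the check against the authoritative NS, here is a rough sketch of one way to do it, assuming `dig` is installed. The function names `dig_status` and `check_authoritative` are made up for illustration, and a missing header is simply reported as NO_RESPONSE, which may hide other failure modes.

```bash
#!/usr/bin/env bash
# Sketch: ask each authoritative nameserver directly and print the DNS
# response status (NOERROR, SERVFAIL, ...). Function names are illustrative.

# Read dig output on stdin and print the status field of its header line,
# e.g. ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4242".
# Prints nothing if no header line is present (e.g. on a timeout).
dig_status() {
  sed -n 's/.*->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query every NS of a zone for the A record of a hostname, without recursion.
check_authoritative() {
  local zone="$1" host="$2" ns status
  for ns in $(dig NS +short "$zone"); do
    status=$(dig +norecurse +time=5 +tries=1 A "$host" "@$ns" | dig_status)
    echo "$ns ${status:-NO_RESPONSE}"
  done
}
```

Usage would look like `check_authoritative example.com www.example.com`, printing one line per nameserver. This only shows what the nameservers answer from your own network, which may differ from what Let's Encrypt's resolvers see.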

Well, kinda.

They all share the same Network Solutions .worldnic.com nameservers.

This could suggest that something is going wrong between the respective networks of Network Solutions and Let’s Encrypt.

Edit: If I pick a random domain on a .worldnic.com nameserver that none of your domains use, such as royalgazette.com, it also produces a DNS failure: https://acme-v02.api.letsencrypt.org/acme/authz-v3/2819689481

I’d probably be hitting up Network Solutions’ support.

Sorry I missed that. Thanks! Any idea why it would just be LE that’s having difficulty getting a response from those NS?

Don’t really know.

I think that Let’s Encrypt’s resolvers tend to send a lot of traffic to authoritative nameservers (compared to normal resolvers) because they keep practically zero cache and query multiple record types at once. That could trigger some kind of rate limiting or firewall behavior on the Network Solutions side.

Or it might just be a regular old routing ****up between Viawest and Network Solutions.

Edit: I just tried again for the domain and it worked. Can you retry?

Here are the results from our next cycle of authz. All appear to be timeouts/SERVFAILs:
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820166571
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820164939
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820157033
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820153144
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820143675
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820137190
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820125872
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820110414
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820107657
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820090154
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820089293
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820088527
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820088094
https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820079854

I looked at https://acme-v01.api.letsencrypt.org/acme/authz-v3/2820079854 and don’t see worldnic there.

dig NS +short www.weatherizationservices-wi.com

dig NS +short weatherizationservices-wi.com
dns2.registeredsite.com.
dns3.registeredsite.com.
dns1.registeredsite.com.

for i in {1..3}; do dig NS +short dns$i.registeredsite.com.; done

dig NS +short registeredsite.com
1.ns.web.com.
2.ns.web.com.
3.ns.web.com.

for i in {1..3}; do dig NS +short $i.ns.web.com.; done

dig NS +short ns.web.com.

dig NS +short web.com.
1.ns.web.com.
2.ns.web.com.

Any thoughts?

I believe web.com is still just Network Solutions’ network, but I’m on phone rn and can’t check.

You’re right, they’re related. Web.com bought Network Solutions. :thinking:

Did this eventually just resolve itself? I’m seeing the same issue.

@knas No, we’re still seeing issues with issuances for worldnic domains.

We’re facing the exact same issue here, also with a domain using the worldnic NS.

What I’ve also noticed:
DNS entries are not propagated to dns.watch (maybe for the same reason?)

This tool also reports “Your nameserver do not include A records when asked for your NS records.” and “mismatched glue”: https://www.dnsqueries.com/en/domain_check.php

Same issue here, for several days now.

We have tens of thousands of domains. We’ve had several failures crop up recently which all have these things in common:

  • Previously successfully generated a cert for them
  • DNS appears correctly set up
  • Fails with message urn:ietf:params:acme:error:dns :: DNS problem: SERVFAIL looking up A for <some-domain> - the domain's nameservers may be malfunctioning
  • Uses registrar Network Solutions

For those who are having trouble, we have had success by just retrying every 15 minutes or so until it’s successful.
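
A minimal sketch of that kind of retry loop, in case it helps. This is not our actual tooling: `retry_until_success` is an illustrative name, and the issuance command you pass to it is whatever your own client uses.

```bash
#!/usr/bin/env bash
# Sketch: rerun a command at a fixed interval until it succeeds or the
# attempt budget runs out. All names here are illustrative.

# usage: retry_until_success MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
retry_until_success() {
  local max_attempts="$1" interval="$2" attempt
  shift 2
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if "$@"; then
      echo "succeeded on attempt $attempt"
      return 0
    fi
    # Sleep between attempts, but not after the final failure.
    if ((attempt < max_attempts)); then
      sleep "$interval"
    fi
  done
  echo "gave up after $max_attempts attempts" >&2
  return 1
}
```

For a 15-minute interval you would call something like `retry_until_success 8 900 ./issue-cert.sh example.com`, where `./issue-cert.sh` is a hypothetical wrapper around your client. Capping the attempts keeps a permanently broken domain from retrying forever.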

I hope the error rate goes down, but I’m preparing myself to accept it as the new normal :crying_cat_face:

Note that some DNS failures for issuance that was previously successful could be a result of Let’s Encrypt’s new multiperspective validation.

(This isn’t a likely explanation for all of the problems that were mentioned on this thread, but it’s something that’s good to be aware of, especially if you’ve seen the behavior change very recently.)

This is interesting. Thanks for the detailed reports. I’ll ask our SRE team to look for any routing issues between us and worldnic.

We have a surprising number of NS customers. This is becoming impactful. We’ve held off on renewals for several days now to prevent customers from being dropped from their SAN cert during renewal.

This should be impacting lots of big-name SaaS providers, right? Zendesk, etc.?

We also have a large number of customers using Network Solutions. I’ve been somewhat successful in getting a small portion of these to renew by just retrying, but it’s definitely not keeping up. Maybe 10% are renewing after some time?

Special Request
@jsha Could you share details on what exactly is failing between your system and Network Solutions, so that we can contact them and get corrective action moving? Right now we don’t understand the problem well enough to inform them on what to correct.

Reason For Urgency
We have Network Solutions customers who’ve lost SSL (and the ability to take payments and do business) at this very moment, and dozens (if not hundreds) of customers who will be in the same boat within a week or two if we don’t find a solution.

If NetSol/Web.com is rate limiting queries from Let’s Encrypt’s resolvers, it’s only going to get worse as more and more users retry frequently and further increase traffic. :grimacing:
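
If rate limiting really is the cause (which is unconfirmed), spacing retries out with exponential backoff and random jitter might reduce the pile-up. A minimal sketch of computing such a delay follows. `backoff_delay`, the base, and the cap are all illustrative choices, not anything Let's Encrypt recommends.

```bash
#!/usr/bin/env bash
# Sketch: "full jitter" exponential backoff. Names and numbers are illustrative.

# usage: backoff_delay ATTEMPT BASE_SECONDS CAP_SECONDS
# Prints a random whole number of seconds between 0 and
# min(CAP, BASE * 2^(ATTEMPT-1)), inclusive.
backoff_delay() {
  local attempt="$1" base="$2" cap="$3" ceiling
  if ((attempt > 20)); then
    # Avoid huge shifts; by this point the cap applies anyway.
    ceiling=$cap
  else
    ceiling=$((base * (1 << (attempt - 1))))
  fi
  if ((ceiling > cap)); then
    ceiling=$cap
  fi
  # Combine two 15-bit RANDOM values so ceilings above 32767 still work.
  echo $(((RANDOM << 15 | RANDOM) % (ceiling + 1)))
}
```

For example, `sleep "$(backoff_delay 3 900 14400)"` would wait somewhere between 0 and 3600 seconds before the third retry. The randomness is the point: it keeps many clients from retrying at the same moment.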

Hi @lancedolan,

I’m afraid we don’t have a diagnosis yet, but if you have a contact at Network Solutions you can put us in touch with, that might be a help.

How long has this problem been manifesting for you? I would expect that we’d have 30 days from beginning of the problem before we started to see certificates expire.

We do renew at 30 days, but we remove failing hostnames from their 100 domain SAN cert at 25 days in order to force the renewal to succeed and maintain at least a 25 day window. This is usually customers leaving us and is never a problem. Some of our NetSol customers were removed from their SAN cert by this process before I noticed. I’ve temporarily changed this “force renewal by stripping bad domains” threshold to 15 days.
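
Roughly, the decision looks like the sketch below. This is a simplified illustration rather than our production code: `renewal_action` and its output labels are invented for this post, and the 30 and 25 day numbers are the ones described above.

```bash
#!/usr/bin/env bash
# Sketch of the renewal policy described above. Names are illustrative.

# usage: renewal_action DAYS_LEFT HAS_FAILING_HOSTNAMES(0|1) [STRIP_THRESHOLD]
# Prints one of: wait, renew, retry, strip-and-renew
renewal_action() {
  local days_left="$1" has_failing="$2" strip_threshold="${3:-25}"
  local renew_threshold=30
  if ((days_left > renew_threshold)); then
    echo "wait"             # not yet inside the renewal window
  elif ((has_failing == 0)); then
    echo "renew"            # every hostname validates, renew as-is
  elif ((days_left > strip_threshold)); then
    echo "retry"            # keep retrying with the full hostname list
  else
    echo "strip-and-renew"  # drop failing hostnames to force the renewal
  fi
}
```

With the default threshold, `renewal_action 25 1` prints `strip-and-renew`. With the temporary 15 day threshold, `renewal_action 20 1 15` prints `retry` instead.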

As it stands we have fewer than a dozen NetSol customers who lost their cert, and dozens (maybe hundreds?) more set to lose their SSL in 7 days, when we hit this 15 day threshold.

For a system as large as ours, and as prone to rate limits, I get very nervous about going much further than 15 days. If we let ourselves go down to 0 days and then start chugging through renewals, I’m afraid we’ll get rate limited and not only lose the ability to issue certs for new customers, but, if we fail to renew 7 days’ worth of certs before getting rate-limited, we could be forced into letting live certs expire.
