I think that Let’s Encrypt’s resolvers tend to send a lot of traffic to authoritative nameservers (compared to normal resolvers) because they keep practically zero cache and query multiple record types at once. That could trigger some kind of rate limiting or firewall behavior on the Network Solutions side.
Or it might just be a regular old routing ****up between Viawest and Network Solutions.
Edit: I just tried again for the domain and it worked. Can you retry?
We have tens of thousands of domains. We’ve had several failures crop up recently, all of which have these things in common:
Previously successfully generated a cert for them
DNS appears correctly set up
Fails with the message: urn:ietf:params:acme:error:dns :: DNS problem: SERVFAIL looking up A for <some-domain> - the domain's nameservers may be malfunctioning
Note that some DNS failures for issuance that was previously successful could be a result of Let's Encrypt's new multiperspective validation.
(This isn't a likely explanation for all of the problems that were mentioned on this thread, but it's something that's good to be aware of, especially if you've seen the behavior change very recently.)
We have a surprising number of NS customers. This is starting to have a real impact. We’ve held off on renewals for several days now to prevent customers from being dropped from their SAN cert during renewal.
This should be impacting lots of big-name SaaS providers, right? Zendesk etc.?
We also have a large number of customers using Network Solutions. I’ve been somewhat successful in getting a small portion of these to renew by just retrying, but it’s definitely not keeping up. Maybe 10% are renewing after some time?
Special Request @jsha Could you share details on what exactly is failing between your system and Network Solutions, so that we can contact them and get corrective action moving? Right now we don’t understand the problem well enough to inform them on what to correct.
Reason For Urgency
We have Network Solutions customers who’ve lost SSL (and the ability to take payments and do business) at this very moment, and dozens (if not hundreds) of customers who will be in the same boat within a week or two if we don’t find a solution.
If NetSol/Web.com is rate limiting queries from Let’s Encrypt’s resolvers, it’s only going to get worse as more and more users retry frequently and further increase traffic.
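If rate limiting really is the cause, spacing retries out should help more than hammering. Here is a rough sketch of capped exponential backoff with full jitter. The function name, the 5-minute base and the 6-hour cap are all assumptions for illustration, not values recommended by Let's Encrypt or Network Solutions.

```python
import random


def retry_delays(attempts, base_seconds=300.0, cap_seconds=6 * 3600.0, rng=None):
    """Return one wait time (in seconds) per retry attempt.

    Uses capped exponential backoff with full jitter: the ceiling doubles
    on every attempt up to cap_seconds, and the actual wait is drawn
    uniformly between 0 and that ceiling. base_seconds and cap_seconds
    are illustrative defaults only.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(retry_delays(6), start=1):
        print(f"retry {i}: wait about {delay / 60:.1f} minutes")
```

The jitter matters as much as the backoff: if every client retries on the same schedule, the nameservers see bursts instead of a steady trickle.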
I’m afraid we don’t have a diagnosis yet, but if you have a contact at Network Solutions you can put us in touch with, that might be a help.
How long has this problem been manifesting for you? I would expect that we’d have 30 days from beginning of the problem before we started to see certificates expire.
We do renew at 30 days, but we remove failing hostnames from their 100-domain SAN cert at 25 days in order to force the renewal to succeed and maintain at least a 25-day window. Normally the failing hostnames belong to customers who are leaving us, so this is never a problem. Some of our NetSol customers were removed from their SAN cert by this process before I noticed. I've temporarily changed this "force renewal by stripping bad domains" threshold to 15 days.
As it stands, we have fewer than a dozen NetSol customers who have lost their cert, and dozens (maybe hundreds?) more set to lose their SSL in 7 days, when we hit this 15-day threshold.
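To make the policy concrete, it boils down to roughly the following. This is a simplified sketch, not our production code; the function name, the action labels and the shape of the inputs are made up for illustration.

```python
RENEW_AT_DAYS = 30  # start attempting renewal at this many days before expiry
STRIP_AT_DAYS = 15  # drop failing hostnames at this point (normally 25)


def plan_renewal(days_left, hostnames, failing):
    """Decide what to do with one SAN cert.

    days_left: days until the current cert expires.
    hostnames: every hostname on the cert.
    failing:   set of hostnames whose validation is currently failing.
    Returns (action, hostnames_to_request).
    """
    if days_left > RENEW_AT_DAYS:
        return ("wait", list(hostnames))
    if not any(h in failing for h in hostnames):
        return ("renew", list(hostnames))
    if days_left > STRIP_AT_DAYS:
        # Keep retrying with the full list; the failure may be transient.
        return ("retry_later", list(hostnames))
    # Out of slack: renew without the failing hostnames.
    return ("renew_stripped", [h for h in hostnames if h not in failing])
```

With the threshold at 15, a cert with 20 days left and one failing NetSol hostname gets retried as-is instead of being stripped.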
For a system as large as ours, and as prone to rate limits, I get very nervous about going much lower than 15 days. If we let ourselves go down to 0 days and then start chugging through renewals, I'm afraid we'll get rate limited. We would not only lose the ability to issue certs for new customers; if we failed to renew 7 days' worth of certs before hitting the limit, we could be forced into letting live certs expire.
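The worry is easy to put into numbers. A back-of-the-envelope sketch follows; the backlog size and daily issuance rate in the example are hypothetical placeholders, not our real figures and not actual Let's Encrypt limits.

```python
import math


def days_to_clear_backlog(backlog_certs, certs_per_day):
    """Whole days needed to renew a backlog at a sustained daily rate."""
    if certs_per_day <= 0:
        raise ValueError("certs_per_day must be positive")
    return math.ceil(backlog_certs / certs_per_day)


def backlog_is_safe(backlog_certs, certs_per_day, days_until_first_expiry):
    """True if the backlog can be cleared before the first cert expires."""
    return days_to_clear_backlog(backlog_certs, certs_per_day) <= days_until_first_expiry


if __name__ == "__main__":
    # Hypothetical: 700 certs waiting, 100 renewals per day sustainable.
    print(days_to_clear_backlog(700, 100))
    print(backlog_is_safe(700, 100, days_until_first_expiry=7))
```

Anything that pushes the backlog up or the sustainable rate down while the threshold sits near zero leaves no margin at all, which is why I don't want to go much below 15 days.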