DNS failures (SERVFAIL, timeout) for domains using Network Solutions/Web.com/worldnic.com nameservers

Thanks for the additional detail about your system! That is helpful to understand.

As a short-term measure, if a certificate has (say) 20 non-NetSol domains and 5 failing NetSol domains, could you start splitting it into:

  1. One certificate with the 20 non-NetSol domains, and
  2. Five certificates with one NetSol domain each.

It would reduce the risk for your non-NetSol customers.

If usually some NetSol domains successfully validate and some fail, the successful ones would get certificates.
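
The split described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the input format ("domain nameserver" pairs on stdin), the function name `split_by_netsol`, and the idea of recognizing NetSol-hosted domains by a nameserver under worldnic.com are all hypothetical, not part of Certbot or of anyone's system in this thread.

```sh
# Hypothetical helper: reads "domain nameserver" pairs on stdin and prints
# one "cert:" line per certificate to request.
# Assumption: a NetSol-hosted domain has a nameserver under worldnic.com.
split_by_netsol() {
    awk '
        NF < 2 { next }                      # skip blank or malformed lines
        {
            ns = tolower($2)
            sub(/\.$/, "", ns)               # drop a trailing root dot
            if (ns == "worldnic.com" || ns ~ /\.worldnic\.com$/) {
                netsol[++n] = $1             # each NetSol domain gets its own cert
            } else {
                others = others (others == "" ? "" : ",") $1
            }
        }
        END {
            if (others != "") print "cert: " others   # one shared non-NetSol cert
            for (i = 1; i <= n; i++) print "cert: " netsol[i]
        }
    '
}
```

Feeding it a list with 20 non-NetSol and 5 NetSol domains would print six lines: one with the 20 names joined by commas, then one per NetSol domain.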

It would increase the number of certificates you have to manage, but not by an extreme amount… It would significantly increase the number of renewals you’re processing now, though. (And every ~60 days forever after.)

Our system is similar to lancedolan’s. Fortunately for us, we caught the behavior in time to prevent our customers from losing HTTPS, and it was pretty easy for us to modify our system to increase the number of retries.

It’s definitely affected some of our Service Level Objectives though (specifically, time to provision new certificates). This has had the largest impact on the portion of our customers that wait until the last minute (right before they want to launch their new website) to attempt to get a cert. Many of them have come to expect they’ll reliably have a cert issued and deployed in under an hour. (I know we should do a better job of setting expectations lower, but that’s been a difficult conversation with my product owner.)

@mnordhoff’s solution is the worst-case-scenario solution I’ve been considering. It would take a lot of custom development on our end. Also, those “netsol-only” certs would only be renewed (or have new NetSol domains added to them when we get new customers) by a process of rigorous retrying, which is too dangerous to do with our normal LE account for rate-limit reasons. The only way I know of to manage separate LE accounts is to have separate certbot servers, meaning we need to reconfigure our routing for /.well-known traffic each time we run one application server instead of the other… It gets really ugly really fast.

What we really need is for web.com to play ball nicely with Let’s Encrypt, like every other set of nameservers.

I'm not encouraging this, but FYI, Certbot can use multiple accounts. The sticking point is that Certbot won't voluntarily create multiple accounts. You can do something like:

  • Temporarily rename /etc/letsencrypt/accounts/. (certbot renew won't work until you fix it!)
  • Do something that creates an account, such as running certbot register or creating a certificate.
  • Merge the two accounts directories together.

When creating certificates, Certbot will interactively prompt you to choose an account, or you can use the --account command line option (with one of the 128-bit hashes Certbot uses as local account IDs, I think) to pick one.

(certbot renew remembers which account it should use for each cert.)
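
To find the value to pass to --account, something like the following sketch could list the local account IDs. It assumes the on-disk layout used by current Certbot versions (one directory per account under the accounts directory, each containing a regr.json file); that layout is an implementation detail rather than a documented interface, and the function name `list_account_ids` is hypothetical.

```sh
# Hypothetical helper: print the local account IDs found under a Certbot
# accounts directory (default: /etc/letsencrypt/accounts).
# Assumption: each account lives in its own directory containing regr.json,
# and the directory name is the local account ID.
list_account_ids() {
    accounts_dir=${1:-/etc/letsencrypt/accounts}
    find "$accounts_dir" -type f -name regr.json 2>/dev/null |
        while IFS= read -r regr; do
            basename "$(dirname "$regr")"    # directory name = account ID
        done | sort
}
```

Each ID printed this way is a candidate value for the --account option mentioned above.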

Thanks for the help :smiley:

I actually forgot about the --account toggle. We did that once before during a major rate limit outage, but it is definitely a hack. Manipulating the underlying directory structure based on knowledge of implementation details appears to me to be a violation of the client-service contract between the certbot CLI API and the user. If I’m right about that, I really don’t want a production system, especially one of our volume, operating on a potentially brittle hack that isn’t forward compatible as we upgrade certbot in the future.

Perhaps I can learn that the certbot dev team happily supports this style of usage, despite their not providing a CLI API for it.

We’re trying to come up with a long-term solution that covers the case where this network flakiness between LE and some problematic name servers goes unsolved long term, or even crops up again with different/new name servers.

@lancedolan, if your LE account is owned by a large provider, have you tried getting your rate limit increased via https://goo.gl/forms/plqRgFVnZbdGhE9n1 ? (from https://letsencrypt.org/docs/rate-limits/ ) Did you already get an increase and are still bumping into the rate limit?

I was unaware of this form. Thanks, I’ll look into it!

It appears that form only allows for growing the max certs per week rate limit, which we’re not concerned about. My concern is that we’ll be failing during frequent retries and get locked out. We’ve been locked out for 7 days twice now, and each time had major business fallout, and both times were due to rate limits created while failing and retrying (pending authz with acme 1).

Perhaps because we’re acme 2 now, with modern certbot, that’s not a concern at all and we can start retrying every couple hours for these web.com domains without paranoia. I’ve started a separate forum post to confirm.

Since the problem is sporadic, it’s been difficult to collect useful data. This does not look like a general network/routing issue; we may indeed be being rate limited. We’re in touch with both Web.com (Network Solutions / Register.com) and F5 Silverline, their DDoS protection provider. We’ll keep investigating and trying to get this resolved.

@lancedolan It occurred to me that, if rate limiting turns out to be the issue, the style of 100-SAN certificates you issue may be exacerbating the problem. Because Certbot validates all the challenges for a single certificate at once, each 100-SAN certificate generates a bit over 200 DNS queries (one per hostname for HTTP-01, plus at least one per hostname for CAA, depending on how much of the DNS parent tree overlaps between the names being validated). That seems more likely to trigger rate limiting than a series of smaller certificates.
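
As a back-of-the-envelope illustration of that arithmetic (a rough lower bound only; the function name `estimate_dns_queries` is hypothetical, and real counts vary with CNAMEs, CAA tree climbing, and retries):

```sh
# Rough lower bound on the DNS queries triggered by validating one
# certificate, following the reasoning above: one lookup per hostname for
# HTTP-01 plus at least one per hostname for CAA.
estimate_dns_queries() {
    san_count=$1
    echo $(( san_count * 2 ))
}

estimate_dns_queries 100   # one 100-SAN certificate: a single burst of about 200
estimate_dns_queries 10    # one 10-SAN certificate: a burst of about 20
```

Ten 10-SAN certificates add up to roughly the same total, but the queries arrive as ten smaller bursts rather than one large one.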

I think I’ve given you this feedback before, but: 100-SAN certificates introduce a lot of issues that smaller certificates don’t. I think you’d really benefit from spending engineering time on a migration to smaller certificates, rather than on mitigations to problems with 100-SAN certificates.

Thanks for thinking of us jsha!

Our maximum number of certs increased greatly recently and new certs are typically being generated with 10 or fewer domains :blush: The problem occurs when renewing older certs that still have 70 or 80 domains in a SAN. I assumed network traffic on the LE side is a function of the number of domains, as you described, so we did some goofy stuff to manually move some web.com domains into small SANs, but we don’t have a sustainable process yet.

Congrats! (I assume you meant decreased? :blush:)

No no, the max certs allowed to be installed by our CDN, where our SSL termination is, increased. So, fewer domains per cert. Our new domains are given fresh new certs rather than being added to existing certs that already have between 70 and 99 domains in them.

Just to give a status update, we are seeing even fewer approvals. At one point I was able to force a small percentage through, but that seems to have stopped.

Thanks for the update!

The behavior we’re seeing looks consistent with rate limiting. We’re still collecting data and working with Web.com and F5 Silverline. If anyone has contacts inside either company who could help get more eyes on this issue, that would be great! Please PM me.

I’ve asked around, but I don’t have any contacts at NetSol to share at the moment.

Separately, while I was exploring a failed cert issuance, I noticed something odd. It seems worldnic is claiming it’s the SOA and NS of io. All other resolvers I’ve used say otherwise.

dig CAA www.aUdUbon.org @ns49.worldnic.com
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61707
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1
;; ANSWER SECTION:
www.aUdUbon.org.	7200	IN	CNAME	live-nas-national.pantheonsite.io.
;; AUTHORITY SECTION:
io.			3600	IN	SOA	ns41.worldnic.com. dns.worldnic.com. 2016010801 3600 600 1209600 3600

When querying that NS, it does in fact claim (non-authoritatively) it’s the SOA and NS for io.

dig +short SOA io. @ns41.worldnic.com 
ns41.worldnic.com. dns.worldnic.com. 2016010801 3600 600 1209600 3600
dig +short NS io. @ns41.worldnic.com 
ns41.worldnic.com.

All public resolvers say otherwise

for resolver in 1.1.1.1 8.8.8.8 64.6.64.6 208.67.222.222; do dig +short SOA io. @$resolver; done
a0.nic.io. noc.afilias-nst.info. 1497788345 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788347 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788345 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788347 10800 3600 2764800 900

And of course, all NS provided by rootservers agree.

for ns in a0.nic.io. a2.nic.io. b0.nic.io. c0.nic.io.; do dig +short SOA io. @$ns; done
a0.nic.io. noc.afilias-nst.info. 1497788348 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788347 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788348 10800 3600 2764800 900
a0.nic.io. noc.afilias-nst.info. 1497788348 10800 3600 2764800 900

Could this erroneous unsolicited SOA record be causing trouble with Boulder or its resolver?

It shouldn’t cause any problems. Issues like that, accidental or malicious, are common, and resolvers are designed to be careful and discard records like that.
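
One common way resolvers guard against this is a bailiwick check: a record is only trusted if its owner name falls inside the zone the responding server was asked about. The sketch below is a simplified illustration of that idea, not the logic of any particular resolver, and the function name `in_bailiwick` is hypothetical.

```sh
# Hypothetical check: succeed if record owner name $1 is inside zone $2.
# Names are compared case-insensitively with any trailing root dot removed.
in_bailiwick() {
    name=$(printf '%s' "$1" | tr 'A-Z' 'a-z'); name=${name%.}
    zone=$(printf '%s' "$2" | tr 'A-Z' 'a-z'); zone=${zone%.}
    if [ -z "$zone" ]; then
        return 0                             # the root zone contains every name
    fi
    case "$name" in
        "$zone"|*."$zone") return 0 ;;       # the zone itself or a subdomain of it
        *) return 1 ;;
    esac
}

# The io. SOA returned while resolving a name under audubon.org is out of
# bailiwick, so a careful resolver would ignore it.
if in_bailiwick io. audubon.org; then echo keep; else echo discard; fi
```

By the same test, www.aUdUbon.org is inside audubon.org, so the CNAME answer itself would be kept.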

Does Let’s Encrypt’s resolver set the z flag? I noticed NetSol nameservers time out when the z flag is set:

dig +zflag A ns31.worldnic.com @ns1.netsol.com
;; connection timed out; no servers could be reached