We’re a hosting provider with several thousand sites, and until now we’ve created a separate Let’s Encrypt account for each of our customers. We recently switched to a single account as a hosting provider, but ran into rate limits on failed challenges during a migration early this morning.
The migration was a move from AWS to GCP. It involved a name server change, which I believe caused a percentage of DNS challenges to fail. Most of them went through on retry.
However, we don’t save or reuse challenges, so I believe the number of unsolved challenges in our account grew to the point where we hit a rate limit wall. I looked at the rate limit increase form for hosting providers, and it mentions allowing up to 300 unsolved challenges if you operate over 250,000 FQDNs.
I believe this is the limit we hit based on what I saw in the logs:
We recently (April 2017) introduced a Failed Validation limit of 5 failures per account, per hostname, per hour. This limit will be higher on staging so you can use staging to debug connectivity problems.
But at some point, all certificate issuance stopped. It sounds like the above rate limit applies per hostname, so I’m not sure why no new certificates could be issued for any hostname.
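For context, this is roughly how I read that limit, sketched as a client-side guard we could run before asking for another validation. The class and method names are made up for illustration, and the sliding-window behaviour is my assumption; only the numbers (5 failures, per hostname, per hour) come from the text quoted above. It tracks a single account, so the "per account" part is implicit.

```python
import time
from collections import defaultdict, deque


class FailedValidationTracker:
    """Hypothetical client-side guard: at most N failed validations
    per hostname within a sliding time window (assumed semantics)."""

    def __init__(self, max_failures=5, window_seconds=3600, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._clock = clock
        # hostname -> timestamps of recent failures, oldest first
        self._failures = defaultdict(deque)

    def _prune(self, hostname):
        """Drop failures that have aged out of the window."""
        cutoff = self._clock() - self.window_seconds
        recent = self._failures[hostname.lower()]
        while recent and recent[0] <= cutoff:
            recent.popleft()
        return recent

    def record_failure(self, hostname):
        """Call this whenever a validation for `hostname` fails."""
        self._prune(hostname).append(self._clock())

    def may_attempt(self, hostname):
        """True if another attempt would stay under the local limit."""
        return len(self._prune(hostname)) < self.max_failures
```

With a guard like this, a hostname that keeps failing would be skipped locally after five failures in an hour, while other hostnames stay unaffected. That is why I expected the limit to block individual hostnames rather than the whole account.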
I’m curious about:
1. Is there a way to coordinate with the Let’s Encrypt team on this specific account to understand exactly what went wrong?
2. Is saving and reusing challenges the way to keep the number of unsolved challenges from ballooning? Maybe a basic question, but it didn’t occur to me when going through the updated integration.
3. Do those unsolved challenges ever expire? For example, say we accidentally send over a wrong domain (typo or bug) that never gets resolved. Some of our customers have domains with registrars that aren’t pointed to us, which causes the DNS challenge to fail (the customer may never point the domain to us, leaving us with an unresolvable challenge). These cases are rare; I just want to understand the best practices.
4. How can I best resolve the current situation and get out from under these rate limits quickly? It may depend on #1 and #2, but I’m not sure how to finish the migration, which is currently stalled, or how to prevent this from happening in the future.
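To make the question about saving and reusing challenges concrete, here is a minimal sketch of what I imagine that would look like on our side. Everything in it is hypothetical: the store, the `PendingAuthorization` fields, and the `create` callback stand in for whatever our ACME client actually returns. It also assumes the server lets a client keep working on a still-pending challenge until some expiry time, which is part of what I’m asking.

```python
import time
from dataclasses import dataclass


@dataclass
class PendingAuthorization:
    hostname: str
    url: str           # hypothetical: where the CA exposes this authorization
    expires_at: float  # hypothetical: expiry as a Unix timestamp


class PendingAuthorizationStore:
    """Remember unsolved challenges per hostname so retries reuse them
    instead of opening a new one each time."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._by_host = {}

    def get_or_create(self, hostname, create):
        """Return the saved pending authorization for `hostname` if it has
        not expired; otherwise call `create(hostname)` and save the result."""
        key = hostname.lower()
        existing = self._by_host.get(key)
        if existing is not None and existing.expires_at > self._clock():
            return existing
        fresh = create(key)
        self._by_host[key] = fresh
        return fresh

    def resolve(self, hostname):
        """Forget the entry once the challenge is solved or abandoned."""
        self._by_host.pop(hostname.lower(), None)

    def pending_count(self):
        """Number of unexpired pending authorizations we know about."""
        now = self._clock()
        self._by_host = {h: a for h, a in self._by_host.items() if a.expires_at > now}
        return len(self._by_host)
```

In practice the store would need to be persisted (a database table rather than an in-memory dict), and `pending_count()` could be compared against the unsolved-challenge limit before starting new validations.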