We are using the xenolf/lego library. We previously had some domains pointing to us, but before this migration we required all of our clients to transfer those to our registrar so that we're in control of the nameservers. I was asking about some other scenarios to gain a better understanding of those pending challenges.
Lego has a default timeout of 60 seconds, and I'm seeing NXDOMAIN errors. Looking through the logs, I see identical authz URLs on subsequent DNS challenge retries, so it doesn't seem like new challenges are generated each time. I think what may have happened is the following:
- After queueing up domains in small increments, I queued up 1500 since things looked good. We have throttling in place, so there would be at most 20 concurrent requests (a limit I increased for this migration).
- In enough cases the xenolf/lego timeout of 60 seconds occurred before Let's Encrypt could verify all of the challenges, so the domain(s) go back on the message queue to be retried in ~5-10 minutes.
- While those domains were being retried, new domains started the auth process, and while some may have made it through, I'm assuming most encountered the same cycle until we accumulated 300 pending authorizations.
This is the error I saw after stepping away for a brief nap:
acme: Error 429 - urn:acme:error:rateLimited - Error creating new authz :: too many currently pending authorizations
This is one of the domains that hit the rate limit: thediscounters-place.com
If that sounds plausible to you, I think it makes sense to track the number of domains in the authorization phase and ensure that we don't start the "obtain certificate" flow unless that number is below, say, 250.
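Here is a minimal sketch of that gate in Go, since that is what lego is written in. It assumes we keep the count ourselves, because I don't know of a way to read it from the account. PendingBudget, TryStart, and Finish are hypothetical names of my own, not part of lego. The important detail is that a client-side timeout should not free a slot: judging by the identical authz URLs on retries, the authorization stays pending on the server.

```go
package main

import (
	"errors"
	"sync"
)

// ErrBudgetExhausted means starting another authorization would exceed
// our self-imposed limit (e.g. 250, safely under the 300 we hit).
var ErrBudgetExhausted = errors.New("pending authorization budget exhausted")

// PendingBudget is a hypothetical local tracker of domains whose
// authorizations we believe are still pending. It never talks to the CA.
type PendingBudget struct {
	mu      sync.Mutex
	limit   int
	pending map[string]struct{}
}

// NewPendingBudget creates a tracker that allows at most limit pending domains.
func NewPendingBudget(limit int) *PendingBudget {
	return &PendingBudget{limit: limit, pending: make(map[string]struct{})}
}

// TryStart reserves a slot for domain before the "obtain certificate" flow.
// A domain that is already tracked (a retry after a timeout) does not take
// a second slot, on the assumption that the retry reuses the same authz.
func (b *PendingBudget) TryStart(domain string) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if _, ok := b.pending[domain]; ok {
		return nil
	}
	if len(b.pending) >= b.limit {
		return ErrBudgetExhausted
	}
	b.pending[domain] = struct{}{}
	return nil
}

// Finish releases the slot. Call it only once the authorization has reached
// a final state (valid or invalid), not when the client merely timed out.
func (b *PendingBudget) Finish(domain string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.pending, domain)
}

// Pending reports how many domains currently hold a slot.
func (b *PendingBudget) Pending() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.pending)
}
```

The queue worker would call TryStart before handing a domain to lego and put the message back on the queue if it gets ErrBudgetExhausted. Since the count lives in memory, it would need to be persisted or rebuilt if the workers restart.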
Lastly, is there a way for you to provide the list of domains with pending authorizations on the account via email? If I can determine the 300 that are pending, I can run those through then proceed with the rest using the strategy above.
Thanks so much for the prompt and detailed replies!