As of right now we have dozens of customers awaiting SSL cert for nearly a week, but more importantly, we’re now 14 days away from expiring production certs on thousands of existing customers.
We occasionally run into this message with our old ACME v1 certbot client:
There were too many requests of a given type :: Error creating new authz :: too many currently pending authorizations: see https://letsencrypt.org/docs/rate-limits/
It usually can be solved with the clear-authz script. However, our logs seemed to have rolled away any of the relevant log messages necessary for clear-authz to function successfully. Therefore, we cannot clear these ourselves, and must wait for the full 7 day duration.
However, it’s now been 8 days.
We’ve been getting this “too many currently pending” error steadily since Dec 31st, without any apparent break. Today is Jan 8th. That is 8 full days without a break from this rate-limit. By the time we noticed it and tried to run our clear-authz script, it was already Jan 6th, and the logs clear-authz needs had rolled away.
You will undoubtedly criticize our use of an old ACME v1 client, which is fair criticism. We have finally gotten internal approval to seriously enhance or completely replace it with a modern client as a Q1 goal for 2020. Right now, however, we’re in serious trouble and just need to get past this rate-limit.
Note: we are hitting this rate-limit ONLY in your staging environment. Our system uses your staging environment for our production solution (yes yes, another thing we’ll have to change very soon). We’re currently scrambling to get it to use production only.
My primary befuddlement right now is that we’ve been locked out due to rate-limit, non-stop, for (apparently) 8 days, which should simply never happen. I’m wondering if something has changed in the logic calculating that rate-limit in the staging server recently.
... however, our logs seemed to have rolled away any of the relevant log messages necessary for clear-authz to function successfully
I'm no longer certain that clear-authz is failing due to logs being rolled away. We've just had a partially successful run that produced logs I would expect clear-authz to pick up on, and it does not.
Has anything changed in the last year or more that would cause clear-authz to stop working or to parse logs incorrectly?
It's possible. There was a change to authz URLs some time ago which could have broken the regex the tool uses. I don't have an active ACME v1 account so I couldn't tell you whether that's the case.
You might have to update this line to say /authz-v3/ rather than /authz/.
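For anyone wanting to check by hand whether their logs still contain authz URLs in either form, here is a rough Python sketch. It is not the tool's actual regex: the `/acme/` path prefix, the host pattern, and the ID character set are all assumptions made for illustration.

```python
import re

# Assumed URL shapes: older logs use .../acme/authz/<id>, newer ones
# .../acme/authz-v3/<id>. Host pattern and ID characters are guesses.
AUTHZ_URL = re.compile(r"https://[\w.-]+/acme/authz(?:-v3)?/[\w-]+")

def find_authz_urls(log_text):
    """Return the unique authz URLs found in log_text, in first-seen order."""
    seen = {}
    for match in AUTHZ_URL.finditer(log_text):
        seen.setdefault(match.group(0), None)
    return list(seen)
```

If this finds URLs that the tool does not, the tool's pattern is the likely culprit rather than missing logs.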
Thank you, that got us to a point where it's successfully parsing our logs and finding authz. However, we're unsure if the rest of the steps in this tool are working.
Our output right now is simply:
Checking 199 authzs to see if they are pending ...
And then clear-authz completes. I expected more output, so I'm not confident about how many of those authz are actually pending.
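One way to get a count independently of the tool is to fetch each authz URL and tally the `status` field. The sketch below assumes that a plain GET on an authz URL returns a JSON object with a `status` key, which may not hold for every server or API version. The function names are made up for this example.

```python
import json
import urllib.request
from collections import Counter

def fetch_status(url):
    """GET an authz URL and return its "status" field (assumes a JSON body)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp).get("status", "unknown")

def summarize(urls, fetch=fetch_status):
    """Count authzs by status; fetch or parse failures are counted as "error"."""
    counts = Counter()
    for url in urls:
        try:
            counts[fetch(url)] += 1
        except (OSError, ValueError):
            counts["error"] += 1
    return counts
```

Calling `summarize(urls)` on the URLs extracted from the logs and printing the result would show at a glance how many are still pending.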
There have been no changes to how the pending authorization rate limit is calculated in staging or production.
Sorry, but it sounds like there isn't anything that can be done here. You'll need to transition to ACME v2 or address the authz leak and wait 7 days for your existing pending authorizations to expire. Good luck.
You’ll need to ... wait 7 days for your pending authorizations to expire
That was our plan a week ago. I made this forum post on the 8th day, which is why I was so concerned. From our end it appeared very likely that your staging environment wasn't honoring the 7-day expiry, though I'm open to another explanation involving a fault in our system. We grepped all of our logs from the past several days, watching for our rate-limit to open up and potentially close again due to new failures and new pending authz, and never saw that happen. However, we did lose the first couple of days' worth of logs, so it may have happened then.
Anyhow, last night we lucked out and found a very old ACME v1 account sitting dormant on an unused server, and switched to that in production. We've now created certs for all our new customers, but haven't yet tested renewal. We've got 13 days to get that sorted out and will start testing today.
I have retried with staging again today and it is still blocked with “too many currently pending” message.
Today is January 9th and this started on Dec 31st. We did manually parse all of our logs all the way back through January 4th or 3rd (can’t recall and logs rolled away from us since then) and confirmed that no more pending authz were “leaked” during that time frame.
It would seem the most likely explanation is one of these:
Our quota opened up and we leaked again between Jan 1st and Jan 3rd, but we can’t determine that from the logs.
Our manual log parsing technique (used while clear-authz was broken) was flawed.
If the unlikely happens and we’re still rate limited in a few days, I’ll swing back in. Until then, we’re hitting production and potentially using up that quota, which is scary for us, like riding on a reserve parachute.
By January 15th, our staging account had become available again. The only reasonable explanation is that during those 2 or 3 days for which we had no logs, we happened to leak further authz and restart the 7-day wait (for a total of ~15 days of rate-limit lockout).
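To illustrate how later leaks push the unlock date out, here is a small sketch. It assumes every leaked authz stays pending for the full 7 days and that the account is blocked whenever the pending count is at or above some `limit`. The limit and dates in the usage note are made-up numbers, not Let's Encrypt's actual values.

```python
from datetime import datetime, timedelta

AUTHZ_LIFETIME = timedelta(days=7)  # pending-authz expiry mentioned in the thread

def unblocked_at(created, limit):
    """Return the time from which fewer than `limit` of these authzs can
    still be pending, or None if there were never `limit` of them.

    `created` holds the creation time of every leaked authz; none are
    assumed to be cleared early.
    """
    expiries = sorted(t + AUTHZ_LIFETIME for t in created)
    if len(expiries) < limit:
        return None
    # Enough of the earliest authzs must expire to leave limit - 1 pending.
    return expiries[len(expiries) - limit]
```

With a hypothetical limit of 3, three leaks on Dec 31st unblock on Jan 7th, but three more leaks on Jan 2nd move that to Jan 9th, even though no single authz lived longer than 7 days.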
We were able to survive by running on production account only, and running clear-authz script after EVERY run, rather than only when rate-limit is observed.
We’re beginning our effort to move to acme v2 this month.