Are rate limits ever spread across different accounts?

Hi,

I've got a user who is seeing `urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many currently pending authorizations: see https://letsencrypt.org/docs/rate-limits/` on all of their requests across multiple servers and hundreds of domains.

I believe the issue was probably caused by an old version of the ACME client (Certify) failing somewhere between starting a new order and submitting challenges for validation. It could also be caused by things like running out of disk space, permissions changes, etc.

My question is: if each ACME account is created independently, but they share the same email address, is there any pooling of rate limits? They're currently struggling to clear the limits and they're apparently also seeing the limits across multiple unrelated machines.

I can probably provide them a tool to just go through all recent orders and submit any pending challenges, but the weird thing is that the rate limit appears to span multiple accounts.

Also, I'm assuming pending authorizations clear after some time? If not, how do you fix an account in a situation where you can't submit the challenges?

This has been a problem maybe twice in 5 years with this app (and hundreds of thousands of users), so I'm lacking any pre-baked strategies :slight_smile:

Is it feasible that if all the machines in the environment are using effectively the same "bad" config, they'd all be suffering the same rate limits regardless of the account each one is using (as opposed to some sort of rate limit pooling across accounts)?

Yes, I can see from the logs that the account IDs are different, but otherwise my detective skills are failing me and I only have a limited set of logs to go on. If the pending auths don't get cleared in the next couple of days they're going to have to jump ship to ZeroSSL, which is a bit unnecessary as up until recently it was all OK.

One of my theories is that when POST-as-GET became mandatory on the v2 API, they hit a bunch of exceptions on an old version of the app, which then stopped the challenges from being submitted. I don't currently see evidence of that, though, and Certify has been using POST-as-GET for a long time. That would have been the easiest explanation for the original fault, but I don't think I'm going to get off that easily.

No, we don't pool rate limits in that way. It does sound like something with their integration is causing them to accumulate pending authzs, as @rmbolger suggested.

Yes, pending authzs have a lifetime of one week.

Cool, thanks for the clarification.

General auth rate limit question for you: if an order has one failed authorization (so the whole order will not be valid), do the remaining pending authorizations (not yet submitted) still need to be submitted, or do they no longer count against the account rate limit because one of the others already failed?

So basically, do all non-submitted authz count against the rate limit or just ones where the order is still pending (no auths failed)?

If I read the Boulder source code correctly:

It just counts all authz in the pending state for the account currently in use. The SQL query doesn't reference the order at all, and the code calling that function doesn't seem to contain any order-related logic either.
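To make the point above concrete, here is a minimal sketch of what "count pending authz per account, ignoring orders" looks like. This is not Boulder's actual schema or query: the `authz` table, its column names, and the expiry filter are all assumptions made for illustration, using an in-memory SQLite database.

```python
import sqlite3


def count_pending_authz(conn, account_id, now):
    """Count unexpired pending authz for one account.

    Note that order_id never appears in the WHERE clause: a pending
    authz counts whether or not a sibling authz in its order failed.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM authz "
        "WHERE account_id = ? AND status = 'pending' AND expires > ?",
        (account_id, now),
    ).fetchone()
    return row[0]


def build_demo_db():
    """Create a hypothetical authz table with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE authz ("
        "id INTEGER PRIMARY KEY, account_id INTEGER, "
        "order_id INTEGER, status TEXT, expires INTEGER)"
    )
    rows = [
        (1, 100, "invalid", 1000),  # the authz that failed order 100
        (1, 100, "pending", 1000),  # never-submitted sibling, still counted
        (1, 100, "pending", 1000),  # another never-submitted sibling
        (1, 101, "pending", 1000),  # pending authz from a different order
        (1, 102, "pending", 10),    # already expired, not counted
        (2, 200, "pending", 1000),  # belongs to a different account
    ]
    conn.executemany(
        "INSERT INTO authz (account_id, order_id, status, expires) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn
```

In this toy data, account 1 has three countable pending authz at `now=500`, two of which belong to an order that can no longer succeed. Account 2 is counted separately.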

Very interesting! Thanks.

I'd thought that the requirement for POST-as-GET in production was postponed indefinitely anyway, so it's probably not part of the problem here.

I vaguely recall a conversation with @jvanasco about pending authorizations being recycled across failed orders in my early testing. I think jsha chimed in on that topic and provided the necessary clarity. I wish I could recall enough to find that topic. That part of Boulder's code might be a little more challenging to locate.

Update:

I found it! :grinning:

You need to clear those pending authz by deleting or triggering them.

The authz are reused in future submissions for that user in 99.9% of cases. It’s easier on Boulder to do that.

Users with many domains and servers, like me and your client, are the only people who really have an issue with this. A smaller user will fix the issue and just retry the order immediately and clear them out. Larger users will move to another order in their batch, leaving them active. These errors compound upon one another and the limit is reached.

Usually this happens because the client does not have a cleanup routine after failed orders (to cancel challenges) or it does but that routine has a bug. It’s one of the more annoying things to handle in tests, because you have to fail a multi-domain order in many ways to ensure it triggers and runs correctly and on the right number of domains.
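A cleanup routine of the kind described above could look roughly like the sketch below. RFC 8555 (section 7.5.2) lets a client deactivate an authorization by POSTing `{"status": "deactivated"}` to the authorization URL. Everything else here is hypothetical: the `AcmeSession` interface, its `fetch` and `post` methods, and `FakeSession` are stand-ins for whatever signed-request layer a real client has, not the API of Certify or any other library.

```python
from typing import Protocol


class AcmeSession(Protocol):
    """Hypothetical wrapper around signed ACME requests (JWS, nonces)."""

    def fetch(self, url: str) -> dict: ...   # POST-as-GET of a resource
    def post(self, url: str, payload: dict) -> dict: ...


def cleanup_failed_order(session, authz_urls):
    """Deactivate every authz from a failed order that is still pending.

    Returns the URLs that the server confirmed as deactivated.
    """
    deactivated = []
    for url in authz_urls:
        authz = session.fetch(url)
        if authz.get("status") != "pending":
            continue  # valid, invalid, or expired authz need no cleanup
        # RFC 8555 section 7.5.2: request deactivation of the authorization.
        result = session.post(url, {"status": "deactivated"})
        if result.get("status") == "deactivated":
            deactivated.append(url)
    return deactivated


class FakeSession:
    """In-memory stand-in for a server, for demonstration only."""

    def __init__(self, authzs):
        self.authzs = authzs

    def fetch(self, url):
        return dict(self.authzs[url])

    def post(self, url, payload):
        self.authzs[url].update(payload)
        return dict(self.authzs[url])
```

The important design point is that the routine runs over every authz in the order, not just the one that failed, since the never-attempted siblings are the ones that linger.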

@webprofusion

Here is a tool to clear pending authorizations:

https://tools.letsdebug.net/clear-authz

Thanks! I've built a tool (doing pretty much the same thing) into the Certify CLI and given that to the user, so I think as long as they still have the logs it'll be fine.

@jvanasco yes, I think this is indeed the problem. There were still some situations where Certify would quit out of the order before attempting the authz, and they must have hit that. It's still weird that they report the same issue occurring across multiple servers without sharing accounts, but that must just be a coincidence. I suspect they put lots of SANs into one cert, let it fail a few times in a row, then moved on to try it on the next server.

It is really hard to get this right because, as you mentioned, the app can unexpectedly exit or crash even when you do program things right. One of my solutions was to log things into an autocommit SQL database to make detection and recovery easier. It’s not perfect, but it makes reconciliation much easier.
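As a rough illustration of the autocommit-logging idea, the sketch below journals each authz as soon as it is created, so that a crash between "order created" and "challenge submitted" still leaves a record to reconcile on the next run. The table name, column names, and state values are assumptions for this example, not the actual schema used by any client mentioned here. Passing `isolation_level=None` puts Python's `sqlite3` module in autocommit mode, so each statement is written without an explicit `commit()`.

```python
import sqlite3


def open_journal(path=":memory:"):
    """Open (or create) the journal in autocommit mode."""
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS authz_journal ("
        "authz_url TEXT PRIMARY KEY, "
        "order_url TEXT NOT NULL, "
        "state TEXT NOT NULL)"
    )
    return conn


def record(conn, authz_url, order_url, state):
    """Record the latest known state of an authz.

    Hypothetical states: 'created', 'submitted', 'deactivated'.
    """
    conn.execute(
        "INSERT OR REPLACE INTO authz_journal "
        "(authz_url, order_url, state) VALUES (?, ?, ?)",
        (authz_url, order_url, state),
    )


def unresolved(conn):
    """Return authz URLs that were created but never submitted or deactivated."""
    rows = conn.execute(
        "SELECT authz_url FROM authz_journal "
        "WHERE state = 'created' ORDER BY authz_url"
    ).fetchall()
    return [r[0] for r in rows]
```

On startup, a client could call `unresolved()` and either submit or deactivate whatever it returns before placing new orders.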

I am fairly certain this is what happened to you. This issue will never happen to 99.999% of users, and for the rest it is almost a race condition that only triggers under a confluence of bad circumstances.
