Can ARI-conforming clients be granted exemptions to relevant rate limits?

Osiris · March 31, 2023, 8:44pm

The window is paramount if ARI is used to smooth out loads on the ACME server side?

mholt · March 31, 2023, 8:49pm

That's what rate limits are for, I guess. Since ARI isn't actually enforced.

So that's my plan -- if the window changes, assume a revocation or some other sort of problem (or expected problem), so might as well renew right away to get ahead of it, or to have a longer period of time in which to ensure service.

I'm happy to alter my plan if the CA would like to ensure we can get the cert in that window (i.e. RL exemption). I'm just going based on what has been said in this thread.

Osiris · March 31, 2023, 8:58pm

If the window changes, it might be just as simple as a shift in load on the ACME server. As far as I know, it doesn't mean impending doom...

Nummer378 · March 31, 2023, 9:19pm

Yes, this is true. But the window also won't change dramatically - a CA saying via ARI "renew now" when you still have lots of lifetime left is a sure indicator that something is wrong, whatever that may be.

If I were in @mholt's situation I would probably consider something like this:

If current time is before the suggested renewal window start: Renewal has low priority, but could be considered, if the client deems it necessary for whatever reason (i.e. spreading out renewals)
If current time is inside the suggested renewal window: Renewal has medium priority. It should be done, if there is capacity for it.
If current time is after the suggested renewal window: Renewal has highest priority. Renewal should be done at the next available opportunity.

This does not exactly conform to the ARI specification, but it might work better in the rate-limited scenarios described by @mholt. If this still triggers rate limits or the spread becomes excessive, a rate limit override* is probably preferable.

*A multi-CA fallback is of course another viable option
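
For illustration, the three-tier idea above could be sketched roughly like this. This is only a sketch: the `Priority` names, the `pick_renewals` helper, the `capacity` parameter and the tie-break on the earlier window end are assumptions made for the example, not something from the ARI draft or any existing client.

```python
from datetime import datetime, timezone
from enum import IntEnum


class Priority(IntEnum):
    LOW = 1      # before the window: optional, e.g. to spread renewals out
    MEDIUM = 2   # inside the window: renew if there is capacity
    HIGHEST = 3  # after the window: renew at the next opportunity


def renewal_priority(now, window_start, window_end):
    """Classify a certificate by where `now` falls relative to its ARI window."""
    if now < window_start:
        return Priority.LOW
    if now <= window_end:
        return Priority.MEDIUM
    return Priority.HIGHEST


def pick_renewals(certs, now, capacity, include_low=False):
    """Return up to `capacity` certificate names, most urgent first.

    `certs` maps a name to its (window_start, window_end) pair.
    Hypothetical policy: LOW certs are skipped unless `include_low` is set,
    and ties are broken by the earlier window end.
    """
    candidates = [
        (name, window)
        for name, window in certs.items()
        if include_low or renewal_priority(now, *window) > Priority.LOW
    ]
    candidates.sort(key=lambda item: (-renewal_priority(now, *item[1]), item[1][1]))
    return [name for name, _ in candidates[:capacity]]
```

Here `capacity` stands in for however many orders the client thinks it can place in this run without tripping a rate limit.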

mholt · March 31, 2023, 9:34pm

Sure. But if they're expecting load, I want to get in before the load or risk not getting service and having to retry for a longer period of time. But there's only so long before the cert expires.

Might as well try sooner to ensure service.

mholt · March 31, 2023, 9:36pm

That's a better approach. But the problem with that algorithm is you don't know the future. So you want to avoid getting yourself into a situation where you can't renew in time.

And yes, we already do multi CA fallback

Osiris · March 31, 2023, 10:03pm

False. The draft RFC is relatively clear about this scenario:

Conforming clients MUST attempt renewal at a time of their choosing based on the suggested renewal window.

If I were a RFC writer, I would make it even less ambiguous, but I assume they mean that you MUST choose a time WITHIN the window. I'm not sure why they use the words "based on".

Subjectively, with the current wording, one could also reason that if the client always starts renewing 10 days before the suggested time window, it has chosen a time based on the suggested window and thus conforms to the "MUST" of the RFC. However, I'm preeeeeetty sure that's not what was envisioned.

I believe it's even forbidden by the ARI draft as specified above.

If you renew early, you might increase load on the systems. The way I see it, your cert has gotten a "slot". And other certificates also have "slots". Some after, some before your slot. So if you renew EARLY, you're messing up the entire "slotting" system of the ACME server's ARI system.

mholt · March 31, 2023, 10:33pm

You can't say that's "false" like that. You even say:

I think it's fair to say that we can make decisions that are outside the window.

Only if we don't get blocked by rate limits and load, seeing as the window is optional anyway.

mholt · March 31, 2023, 10:37pm

Where? The draft itself says the algorithm is a recommendation not a requirement.

(Sorry for spurious edits, I'm having trouble on mobile)

Osiris · March 31, 2023, 10:49pm

Fair enough, I meant: "In my opinion that's false."

What's the purpose of the "MUST" in the (draft) RFC if it's optional? Right, it wouldn't say "MUST". I'm quite sure "optional" is not what the RFC editors meant.

Their suggested example algorithm is all within the window, not outside. So staying within the window is mandatory IMO. That's based on the usage of the "MUST" terminology before the example.
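
As a rough illustration of what "within the window" means in practice, here is a minimal sketch of picking a uniformly random time inside a suggested window, which is the approach the thread attributes to the draft's example algorithm. The function name and the `rng` parameter are made up for this example; the draft's full algorithm has more steps than shown here.

```python
import random
from datetime import datetime, timedelta, timezone


def pick_renewal_time(window_start, window_end, rng=random):
    """Pick a uniformly random instant in [window_start, window_end].

    `rng` is a hypothetical hook so a seeded random.Random can be passed in.
    """
    span = (window_end - window_start).total_seconds()
    return window_start + timedelta(seconds=rng.uniform(0, span))


if __name__ == "__main__":
    start = datetime(2023, 4, 10, tzinfo=timezone.utc)
    end = datetime(2023, 4, 12, tzinfo=timezone.utc)
    print(pick_renewal_time(start, end))
```

A client that always renews a fixed number of days before the window would never call anything like this, which is the distinction being argued over.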

orangepizza · March 31, 2023, 11:20pm

Looks like he isn't willing to follow the draft as-is (he thinks the window is too small to fit all the certificates he started), but he still wants to watch ARI to look out for an impending revocation event.

mholt · March 31, 2023, 11:28pm

Exact wording:

Conforming clients MUST attempt renewal at a time of their choosing
based on the suggested renewal window.

So, fair enough. I will choose a time based on the suggested window as it states I "MUST" do. Not necessarily within, since I cannot be guaranteed to get the cert within that window. But the spec does not say within.

Exactly. As-is, ARI seems to be a great early renewal canary, so I will use it as such; but the precise window will be untenable for large deployments unless the CA is willing to guarantee I can get the cert within that window.

I'll try to spell this out as crystal-clear as I can. Try to keep up:

Large deployments may have more certificates to renew per day than LE rate limits will allow (I mean any/all rate limits combined, but I will yet again emphasize the "New Orders" rate limit since I mentioned it on my very first post but some readers clearly did not see that). And yes we have real, actual production experience with this. None of this "theoretical" stuff mentioned above by those who admit they do not have experience with it.
Current behavior is to spread out renewals over more time so that we can renew all of them before they expire. This means starting some earlier than 30 days out.
Implementing ARI implies a narrower window in which to renew certificates, ironically in an effort to spread out the load more. This is problematic for a few reasons:
- Server load will increase. Clients that were once spreading out and backing off their loads will be scheduled into narrower slots instead, increasing the server loads at those times, rather than letting clients spread out the load like they were doing.
- Clients will have less time to renew the same number of certificates (and likely even _more_ certs, as businesses grow) in the fixed timeframe of a cert's lifetime. This means more run-ins with rate limits, which block cert issuance, the very thing ARI is attempting to prevent.
- With less time to renew, there is less time to recover from errors, capacity limitations, and exceeded rate limits. If renewal starts earlier, there is more time to back off and retry gracefully.
The ARI spec recommends an algorithm that renews within the window, but CA policies such as rate limits can prevent certs from being renewed in that window. Fortunately, the spec's language states that renewal scheduling decisions "MUST" at least be "based on" the "suggested" window. Since we cannot be assured a new cert within that window, we can at least use ARI as an early warning signal for upcoming or predicted problems: revocation, availability, maintenance, etc. Now, let's try to figure out a best course of action based on possible window changes for a hypothetical certificate we just obtained:
- Hypothetical A: The window moves forward (earlier start time). LE does not offer configurable cert lifetimes, so there is a precise upper bound on how many certificates will need renewing by the time ours does. The CA ought to know at issuance time about how much congestion to expect as the cert approaches expiration, so the window should not change simply because the CA is like, "Oh actually more certs need to be renewed than we thought" later on. But there are some plausible scenarios I can imagine for moving a window earlier:
  - 1. Anticipated revocation. And if you know the cert will be revoked, it's as good as revoked. (Why continue to trust it when you know there's already a reason to stop trusting it?) Might as well renew right away.
  - 2. Anticipated maintenance. In this case, it's likely that everyone's window is moving up, and overlapping with others as more renewals are squished into the narrower time carved out by the maintenance window. Expect higher loads and connection errors no matter how perfectly clients behave. Plus, a portion of your backoff-and-retry window is eliminated due to the maintenance. Thus, start renewing earlier to give yourself a higher chance of success.
  - 3. Unexpected incident. Maybe an incident took down issuance during some people's renewal windows. Thus, there will be more clients trying to renew in a narrow timeframe. The CA is trying to spread out load, anticipating more congestion now that issuance is lagging, and it just so happens our window got moved up. Some others got moved back. Might as well try to renew early since there's congestion and we want the best chances of getting a cert before expiration.
- Hypothetical B: The window moves backward (later start time). Again, with fixed-lifetime certs, it's not like there's suddenly an influx of renewals that need to happen at your original window. And this change wouldn't make any sense for a revocation event. No... if the window moves back, I can only imagine it's the CA spreading out their load and this cert was one of the unlucky ones to be caught in scenario 3 above: some got moved forward, others got moved back, but either way, the CA is anticipating increased load and possibly availability outages. Higher chance of being denied service (due to connection errors), and with a standard, courteous exponential backoff algorithm, you'll have significantly less time to recover from errors and have a successful renewal before expiration. Thus try to renew as soon as possible to maximize chances of success.
These next two scenarios could happen in concert with the first two:
- Hypothetical C: The window expands in size. The CA is relying on the recommended algorithm in the spec to be correctly implemented by a significant majority of clients to spread out the load, as that algorithm recommends a uniform random time within the window.
- Hypothetical D: The window narrows in size. The CA is NOT relying on the recommended algorithm in the spec, maybe due to concerns that not a significant majority of clients properly implement it, and is instead micromanaging the clients a little more to spread out load. The CA does its own uniform random calculations to spread out clients into narrower time slots and sets the windows accordingly. I am having a hard time thinking of a reason why this action makes sense, since it implies both low confidence in the clients' scheduling abilities, but also high confidence in their scheduling abilities, at the same time.
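
To put rough numbers on the "less time to renew the same amount of certificates" point above, here is a small back-of-the-envelope sketch. All figures are hypothetical and chosen only for illustration; they are not actual Let's Encrypt limits or real deployment sizes.

```python
def days_needed(num_certs, orders_per_day):
    """Minimum whole days needed to renew `num_certs` at `orders_per_day`."""
    if orders_per_day <= 0:
        raise ValueError("orders_per_day must be positive")
    # Integer ceiling division.
    return (num_certs + orders_per_day - 1) // orders_per_day


def fits_in_window(num_certs, orders_per_day, window_days):
    """True if every cert can be renewed before the window closes."""
    return days_needed(num_certs, orders_per_day) <= window_days


if __name__ == "__main__":
    # Hypothetical: 20,000 certs sharing roughly the same window,
    # with a combined cap of 2,400 new orders per day.
    print(days_needed(20_000, 2_400))         # 9
    print(fits_in_window(20_000, 2_400, 2))   # False: a 2-day window is too short
    print(fits_in_window(20_000, 2_400, 30))  # True: a 30-day spread is enough
```

With these made-up numbers, the same batch that fits comfortably in a 30-day spread cannot fit in a 2-day window, and that ignores any time lost to errors and retries.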

Having large-scale production experience with lots of certs and LE rate limits, the thing that makes the most sense for me to do with ARI is to use it as an early warning signal. If the window changes, there is clearly some doubt as to system availability or certificate validity. Thus, start trying to renew the cert right away so the graceful backoff has as much chance as possible to procure a certificate before expiration (or revocation). This way, if we hit rate limits, it will back off and try again later, without running out of time, because we started early enough.
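
A minimal sketch of that "early warning" behavior, under some loud assumptions: the window is modeled as a plain `(start, end)` tuple, and the backoff is a generic doubling schedule with a hypothetical 1-minute first delay and 1-day cap. None of this is taken from Caddy or CertMagic; it only shows why starting earlier leaves more retries before expiry.

```python
from datetime import datetime, timedelta, timezone


def window_changed(previous, current):
    """True if the advertised (start, end) pair differs from the one last seen."""
    return previous is not None and previous != current


def retry_schedule(start, not_after,
                   first_delay=timedelta(minutes=1),
                   max_delay=timedelta(days=1)):
    """Attempt times from `start`, doubling the delay up to `max_delay`,
    stopping before the certificate's `not_after`."""
    attempts, when, delay = [], start, first_delay
    while when < not_after:
        attempts.append(when)
        when += delay
        delay = min(delay * 2, max_delay)
    return attempts


if __name__ == "__main__":
    expiry = datetime(2023, 6, 30, tzinfo=timezone.utc)
    early = retry_schedule(datetime(2023, 6, 1, tzinfo=timezone.utc), expiry)
    late = retry_schedule(datetime(2023, 6, 28, tzinfo=timezone.utc), expiry)
    print(len(early), len(late))  # starting earlier leaves more attempts
```

If `window_changed(...)` returns True on an ARI poll, the client would start the retry schedule immediately instead of waiting for the advertised window.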

Now... if this part I said:

Since we cannot be assured a new cert within that window

could be changed -- i.e. if we COULD in fact be assured a cert within that window -- then everything gets better. If that window basically meant, "You can for sure have the cert anytime within this window. (But just once.)" then I would absolutely have no problem renewing all certificates precisely within that window.

This means the client must be assured that no rate limits would block them from getting a cert within the window.

If we cannot be assured that, we have to do our best to get a cert on our own, which means starting renewals earlier.

Hence, if we want ARI to be useful, I am requesting a rate limit exemption for clients adhering to the ARI window.

Osiris · March 31, 2023, 11:37pm

As I've already stated earlier, the draft doesn't literally state that indeed, but I believe we must assume that's what was intended. Hopefully a new revision of the draft can make this more clear. (@aarongable)

Even if the window is DAYS long? What the...???

jsha · April 1, 2023, 12:33am

Thanks for bringing this up, @mholt! I think the question of how to spread load during a mass revocation event is key to ARI. I'm not sure what the exact answer should be here, but to this point:

We were actually just discussing how the Renewal Exemption should probably apply to the New Orders rate limit, regardless of whether a mass revocation is underway. It currently only applies to the Certificates per Registered Domain limit. But the same rationale applies to the New Orders limit: in general, if you already have a certificate, we want to prioritize allowing that certificate to be renewed.

For other rate limits, it has been the case in the past and we expect it to be the case in the future, that when we expect mass reissuance, we will manually raise various rate limits to make sure issuance can happen on time.

mholt · April 1, 2023, 12:46am

Hi Jacob, thanks for the reply! (I just finished adding very detailed thoughts to my most recent post above, and for some reason the forum didn't show your post until after I submitted my edits.)

That would be wonderful, and on behalf of many many site owners I thank you for considering this

I was getting a little frustrated by others saying that a problem doesn't even exist -- from people who admit to not having production experience. :-/

Because yeah, like I just said in the edit to my post above (I didn't see any other replies while I was editing... maybe I should have just made it a new post), if I can absolutely trust that we can get the cert in that window, I will strive to honor that window in the implementation.

It would be good for the spec to spell this out explicitly IMO -- for other CAs to follow -- as this simply will not work if clients can't be guaranteed a new certificate within the ARI window.

PS. You won't need to completely lift the rate limits, but just enough for 1 successful issuance of that cert during the window is all that should be needed. So you can continue to count additional ones against rate limits.

WouterTinus · April 2, 2023, 12:39pm

Just to chime in here, for my client win-acme (of which I'm not only the developer, but also a heavy user) - I also intend to follow the logic that renewals can happen earlier due to ARI suggestions, but I won't make them happen later. This is because my client has implemented several guardrails and a user-configurable random window, to spread load, but more importantly to prevent it from hitting rate limits, that I don't want to break for the sake of following the spec to the letter.

I would consider adding a "strict" ARI mode if there could be some guarantees like the ones being discussed here.
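
The "earlier but never later" rule boils down to taking the minimum of two times. A tiny sketch, with hypothetical names that are not taken from win-acme's code:

```python
from datetime import datetime, timezone


def effective_renewal_time(own_schedule, ari_suggestion=None):
    """ARI may pull a renewal earlier than the client's own schedule,
    but never push it later. `ari_suggestion` is None when ARI is unavailable."""
    if ari_suggestion is None:
        return own_schedule
    return min(own_schedule, ari_suggestion)
```

A "strict" mode, as floated above, would presumably return `ari_suggestion` unconditionally whenever one is available.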

jvanasco · April 4, 2023, 4:28pm

First, I 100% get everything that @mholt has posted above and agree with him. I honestly don't understand the pushback and criticism from some people above. When you are dealing with managing certificates at scale, you are already floating dangerously close to hitting rate limits - the result of the ARI query can easily create issues.

My main concern as I look into ARI support is the timing and load of the ARI query itself. Going back to the original comment and the situation with Certify the Web – concerns over rate limits and load are a bit crippling to large installations, as that can require tens of thousands of checks per day.

The concern I have for @mholt's exception request, is that client identification does not seem like a smart path as it could easily be forged/spoofed. It would make more sense to me for this to be account based, and perhaps there is some way to automate that - perhaps by an endpoint/service that allows a user to register their usage of a specific client to their account and request an exception. Basically, this would be a way to automate the current rate limit request form - using a client to send all the required information in a structured format to ISRG, who then gets it in Boulder and can then programmatically or manually review the request with full access to the account's history right there. That's my 2¢ on how I'd handle this.

WouterTinus · April 4, 2023, 6:02pm

Agreed that any exemption shouldn't be granted for a specific client, but I also don't think that the solution needs to be quite so complicated.

Rather I feel the logic should be that:

If a new order is created
And the account creating the order has an existing certificate with the same domains
And that certificate is currently within its ARI suggested renewal window
And this is the first order that matches those conditions
Then that specific order should not count towards the rate limit.

Alternatively (if this would entail too many database lookups, worsening load issues), the ARI GET-request might return some server signed, once-usable token (JWT?) which could be submitted along with the order to allow it to bypass the limit.
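
To make the token idea a bit more concrete, here is a rough sketch of a server-signed, once-usable token. It swaps in a plain HMAC over a random nonce rather than a real JWT, and every name in it (`ExemptionTokens`, `issue`, `redeem`, `cert_id`) is hypothetical. Nothing like this exists in ACME or Boulder as far as this thread shows.

```python
import hashlib
import hmac
import secrets


class ExemptionTokens:
    """Issue and redeem single-use tokens bound to one certificate (sketch)."""

    def __init__(self, key):
        self._key = key          # server-side secret, bytes
        self._redeemed = set()   # nonces already used

    def _sign(self, cert_id, nonce):
        msg = f"{cert_id}.{nonce}".encode()
        return hmac.new(self._key, msg, hashlib.sha256).hexdigest()

    def issue(self, cert_id):
        """Would be returned alongside the ARI response for `cert_id`."""
        nonce = secrets.token_hex(16)
        return f"{cert_id}.{nonce}.{self._sign(cert_id, nonce)}"

    def redeem(self, token, cert_id):
        """True exactly once for a valid, untampered token for `cert_id`."""
        try:
            token_cert, nonce, sig = token.rsplit(".", 2)
        except ValueError:
            return False
        if token_cert != cert_id or nonce in self._redeemed:
            return False
        expected = self._sign(token_cert, nonce)
        if not hmac.compare_digest(sig.encode(), expected.encode()):
            return False
        self._redeemed.add(nonce)
        return True
```

Note that "once-usable" still needs some record of redeemed nonces, so this does not remove server state entirely; embedding the window's end time in the signed payload would at least let old nonces be discarded.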

mholt · April 4, 2023, 6:06pm

Hey Jonathan, thanks for chiming in.

I'm not totally sure how client ID can be forged/spoofed, since ACME requests are authenticated with account private keys. (Maybe I'm misunderstanding though.)

To clarify, what I'm suggesting requires no extra endpoint or infrastructure. Before enforcing a rate limit during an ACME certificate issuance, the CA server does the following checks:

Is the certificate being obtained within the ARI window we're advertising?
Is the authenticated account on this request the same as the account for the certificate this is replacing (most recent with same set of SANs)?
Is this the first time the certificate is being renewed since advertising the ARI window?

If the answer to all those questions is yes, rate limits should be effectively ignored, and the client should be allowed to get a certificate right away.
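
Sketched as code, those three checks might look something like the following. The `PriorCert` record and its fields are invented for this example, and the lookup of the "most recent certificate with the same set of SANs" is assumed to have happened already; this is not how Boulder is actually structured.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PriorCert:
    """Hypothetical record of the certificate being replaced."""
    account_id: str
    sans: frozenset
    window_start: datetime
    window_end: datetime
    renewed_in_window: bool = False


def exempt_from_rate_limits(order_account, order_sans, prior, now):
    """Return True only when all three checks pass for this order."""
    if prior is None:
        return False  # nothing is being replaced, so no exemption
    in_window = prior.window_start <= now <= prior.window_end
    same_account = order_account == prior.account_id
    same_names = frozenset(order_sans) == prior.sans
    first_renewal = not prior.renewed_in_window
    return in_window and same_account and same_names and first_renewal
```

After a successful exempt issuance the server would set `renewed_in_window`, so further orders for the same names count against rate limits as usual.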

jvanasco · April 4, 2023, 6:32pm

This is the disconnect:

You are talking about Account/Subscriber ID.
I am talking about an identified Client Application, such as Certify The Web, Caddy, or CertMagic.

Having worked on large APIs with multiple clients that each support larger numbers of subscribers, after reading your post I immediately started to think of how ISRG could best identify all users of version X of Caddy and extend the increased rate limits to them. That would have the benefits of both supporting conforming clients, and also encourage other client developers to become conforming clients.

This can't rely on user-agent headers or similar, because that can easily be spoofed by lesser clients. A potential workaround is to allow subscribers to post to an endpoint notifying ISRG that a subscriber (AccountKey) is using a conforming client and requests a ratelimit exemption. If posted to Boulder, ISRG could potentially review the request programmatically - taking into account the number of domains/certificates on the account, past ARI activity, and other relevant metrics – which could remove the manual review/decision process for these rate-limit exemption requests. This could also be designed in an extensible way that allows for any account-based rate limits exemptions to be submitted via API for programmatic or a streamlined manual review.
