Can ARI-conforming clients be granted exemptions to relevant rate limits?

Exact wording:

Conforming clients MUST attempt renewal at a time of their choosing
based on the suggested renewal window.

So, fair enough. I will choose a time based on the suggested window, as it states I "MUST" do. Not necessarily within it, since I cannot be guaranteed to get the cert within that window. But the spec does not say "within".

Exactly. As-is, ARI seems to be a great early renewal canary, so I will use it as such; but the precise window will be untenable for large deployments unless the CA is willing to guarantee I can get the cert within that window.


I'll try to spell this out as crystal-clear as I can. Try to keep up:

  • Large deployments may have more certificates to renew per day than LE rate limits will allow (I mean any/all rate limits combined, but I will yet again emphasize the "New Orders" rate limit since I mentioned it in my very first post but some readers clearly did not see that). And yes, we have real, actual production experience with this. None of this "theoretical" stuff mentioned above by those who admit they do not have experience with it.

  • Current behavior is to spread out renewals over more time so that we can renew all of them before they expire. This means starting some earlier than 30 days out.

  • Implementing ARI implies a narrower window in which to renew certificates, ironically in an effort to spread out the load more. :man_shrugging: This is problematic for a few reasons:

    • Server load will increase. Clients that were once spreading out and backing off their loads will be scheduled into narrower slots instead, increasing the server loads at those times, rather than letting clients spread out the load like they were doing.
    • Clients will have less time to renew the same number of certificates (and likely even _more_ certs, as businesses grow) in the fixed timeframe of a cert's lifetime. This means more run-ins with rate limits, which block cert issuance, the very thing ARI is attempting to prevent.
    • With less time to renew, there is less time to recover from errors, capacity limitations, and exceeded rate limits. If renewal starts earlier, there is more time to back off and retry gracefully.
  • The ARI spec recommends an algorithm that renews within the window, but CA policies such as rate limits can prevent certs from being renewed in that window. Fortunately, the spec's language states that renewal scheduling decisions "MUST" at least be "based on" the "suggested" window. Since we cannot be assured a new cert within that window, we can at least use ARI as an early warning signal for upcoming or predicted problems: revocation, availability, maintenance, etc. Now, let's try to figure out a best course of action based on possible window changes for a hypothetical certificate we just obtained:

    • Hypothetical A: The window moves forward (earlier start time). LE does not offer configurable cert lifetimes, so there is a precise upper bound on how many certificates will need renewing by the time ours does. The CA ought to know at issuance time about how much congestion to expect as the cert approaches expiration, so the window should not change simply because the CA is like, "Oh actually more certs need to be renewed than we thought" later on. But there are some plausible scenarios I can imagine for moving a window earlier:
        1. Anticipated revocation. And if you know the cert will be revoked, it's as good as revoked. (Why continue to trust it when you know there's already a reason to stop trusting it?) Might as well renew right away.
        2. Anticipated maintenance. In this case, it's likely that everyone's window is moving up, and overlapping with others as more renewals are squished into the narrower time carved out by the maintenance window. Expect higher loads and connection errors no matter how perfectly clients behave. Plus, a portion of your backoff-and-retry window is eliminated due to the maintenance. Thus, start renewing earlier to give yourself a higher chance of success.
        3. Unexpected incident. Maybe an incident took down issuance during some people's renewal windows. Thus, there will be more clients trying to renew in a narrow timeframe. The CA is trying to spread out load, anticipating more congestion now that issuance is lagging, and it just so happens our window got moved up. Some others got moved back. Might as well try to renew early since there's congestion and we want the best chances of getting a cert before expiration.
    • Hypothetical B: The window moves backward (later start time). Again, with fixed-lifetime certs, it's not like there's suddenly an influx of renewals that need to happen at your original window. And this change wouldn't make any sense for a revocation event. No... if the window moves back, I can only imagine it's the CA spreading out their load and this cert was one of the unlucky ones to be caught in scenario 3 above: some got moved forward, others got moved back, but either way, the CA is anticipating increased load and possibly availability outages. Higher chance of being denied service (due to connection errors), and with a standard, courteous exponential backoff algorithm, you'll have significantly less time to recover from errors and have a successful renewal before expiration. Thus try to renew as soon as possible to maximize chances of success.

    These next two scenarios could happen in concert with the first two:

    • Hypothetical C: The window expands in size. The CA is relying on the recommended algorithm in the spec to be correctly implemented by a significant majority of clients to spread out the load, as that algorithm recommends a uniform random time within the window.
    • Hypothetical D: The window narrows in size. The CA is NOT relying on the recommended algorithm in the spec, maybe due to concerns that too few clients implement it properly, and is instead micromanaging the clients a little more to spread out load. The CA does its own uniform random calculations to spread out clients into narrower time slots and sets the windows accordingly. I am having a hard time thinking of a reason why this action makes sense, since it implies both low confidence and high confidence in the clients' scheduling abilities at the same time. :thinking_face:
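
For reference, here is roughly what "a uniform random time within the window" looks like in code. This is only a minimal sketch: `pick_renewal_time` and `should_renew_now` are names I made up for illustration, and treating a selected time that has already passed as "renew now" is how I read the draft, so take that part as an assumption.

```python
import random
from datetime import datetime, timedelta, timezone


def pick_renewal_time(window_start, window_end, rng=random):
    """Pick a uniformly random instant in [window_start, window_end].

    Hypothetical helper: the window bounds are whatever the CA's
    renewal info response suggested for this certificate.
    """
    span = (window_end - window_start).total_seconds()
    if span < 0:
        raise ValueError("window_end is before window_start")
    return window_start + timedelta(seconds=rng.uniform(0, span))


def should_renew_now(selected, now):
    """Assumption: a selected time that is already past means renew now."""
    return selected <= now


if __name__ == "__main__":
    start = datetime(2025, 3, 1, tzinfo=timezone.utc)
    end = start + timedelta(days=2)
    chosen = pick_renewal_time(start, end)
    print("renew at:", chosen.isoformat())
    print("renew now?", should_renew_now(chosen, datetime.now(timezone.utc)))
```

Note that nothing in this calculation knows about rate limits: it only tells you when to start trying, not whether the order will be accepted.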

Having large-scale production experience with lots of certs and LE rate limits, the thing that makes the most sense for me to do with ARI is to use it as an early warning signal. If the window changes, there is clearly some doubt as to system availability or certificate validity. Thus, start trying to renew the cert right away so the graceful backoff has as much chance as possible to procure a certificate before expiration (or revocation :exclamation:). This way, if we hit rate limits, it will back off and try again later, without running out of time, because we started early enough.
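
To make the "early warning signal" idea concrete, here is a rough sketch of the policy. Everything in it is hypothetical (the `Window` type, `next_attempt`, `backoff_delays`, and the backoff parameters); it is not how any particular client implements this, just the shape of the logic: a changed window means renew now, an unchanged window leaves our own spread-out schedule alone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Window:
    """A suggested renewal window as last reported by the CA (hypothetical type)."""
    start: datetime
    end: datetime


def next_attempt(now, last_seen, current, scheduled):
    """Decide when to attempt renewal next.

    If the window differs from the one we saw previously, treat that as a
    warning and renew immediately. Otherwise keep our own schedule.
    """
    if last_seen is not None and current != last_seen:
        return now
    return scheduled


def backoff_delays(base, cap, attempts):
    """Capped exponential backoff delays: base, 2*base, 4*base, ... up to cap."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
    old = Window(t0 + timedelta(days=60), t0 + timedelta(days=62))
    new = Window(t0 + timedelta(days=50), t0 + timedelta(days=52))
    ours = t0 + timedelta(days=45)  # our own, earlier, spread-out schedule
    print(next_attempt(t0, old, old, ours))  # unchanged: keep our schedule
    print(next_attempt(t0, old, new, ours))  # changed: renew right away
    print(backoff_delays(timedelta(minutes=1), timedelta(hours=1), 8))
```

The earlier the first attempt happens, the more of those backoff steps fit before expiration, which is the whole point.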

Now... if this part I said:

Since we cannot be assured a new cert within that window

could be changed -- i.e. if we COULD in fact be assured a cert within that window -- then everything gets better. If that window basically meant, "You can for sure have the cert anytime within this window. (But just once.)" then I would absolutely have no problem renewing all certificates precisely within that window.

This means the client must be assured that no rate limits would block them from getting a cert within the window.

If we cannot be assured that, we have to do our best to get a cert on our own, which means starting renewals earlier.
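
Some back-of-the-envelope arithmetic shows why. The numbers below are purely illustrative assumptions, not actual LE limits or anyone's real fleet size, and `days_needed` and `fits_in_window` are made-up helpers. The question is simply whether a batch of certs that all come due around the same time can be ordered inside the window at a given order rate.

```python
import math


def days_needed(num_certs, orders_per_day):
    """Whole days required to renew a batch at a fixed order rate."""
    if orders_per_day <= 0:
        raise ValueError("orders_per_day must be positive")
    return math.ceil(num_certs / orders_per_day)


def fits_in_window(num_certs, orders_per_day, window_days):
    """True if the whole batch can be renewed inside the window."""
    return days_needed(num_certs, orders_per_day) <= window_days


if __name__ == "__main__":
    certs_due = 10_000       # illustrative batch size (assumption)
    orders_per_day = 2_400   # illustrative allowance (assumption)
    window_days = 2          # illustrative window width (assumption)

    print(days_needed(certs_due, orders_per_day))                  # 5
    print(fits_in_window(certs_due, orders_per_day, window_days))  # False
```

With those made-up numbers, the batch needs 5 days but the window is 2, so the only options are starting early or missing renewals.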

Hence, if we want ARI to be useful, I am requesting a rate limit exemption for clients adhering to the ARI window.
