As I am implementing ARI I have another concern/question I was hoping to get some reassurance on.
Some Caddy/CertMagic instances manage tens of thousands of certificates, similar to Chris' situation with Certify the Web:
With our current logic we're able to mostly manage this many certificates within rate limits, but it involves spreading out renewals over a longer period of time to squeeze them all comfortably within the 90-day window (60 if possible).
My concern with ARI is that there is the potential that certificates will be unable to be renewed before expiration due to waiting for the ARI window before starting to try renewal. Of course we back off and retry when there are errors, but this can sometimes last for days and weeks as rate limits are hit. Hence, starting renewals sooner to spread them out more. But if we conform to ARI we cannot start sooner.
Also, retries will often end up going outside the ARI window anyway, as sometimes they take several days or weeks before succeeding. I know the spec says clients should follow our normal backoff and retry logic, but then what's the point of the "end" timestamp? If we're going to go outside the window, we might as well have started renewals earlier with higher chance of getting the cert before expiration.
I can think of a couple possibilities so far:
Ignore ARI and start renewing certificates as early as needed to be able to spread them out enough.
Get a guarantee from the CA that the first successful cert for a name within the ARI window will be allowed regardless of relevant rate limits (new orders for example).
As a client dev, I'd prefer the latter. Since the point of ARI is to make sure the CA isn't overwhelmed in the first place, I don't see the value in the New Orders rate limit for the first (successful) issuance of a cert within the ARI window. Basically, clients should be rewarded for conforming to ARI, not punished, especially when operating at scale, where load smoothing really matters and we are doing the CA a favor.
ARI can mean: "I'm busy right now, please come back later."
Well, then "later" could just become an even busier time to get all the certs done in time.
Things can only be put off for tomorrow so much - eventually "tomorrow" comes.
I know what ARI entails. But I'm preeeetty sure the Let's Encrypt validation servers don't want to amass all certificate renewals till a later time: Boulder wants all the certs renewed too, in an orderly fashion and as soon as possible I assume.
So the interests of the ACME client are also the interests of the ACME server; it's just some load displacement when loads are high.
Personally I don't really see the fuss, but I might be blind to it.
Ideally ARI wouldn't defer an entire account's requested certs for any significant amount of time.
I would hope no more than minutes [not even hours] of gaps.
And even then: renewals are exempted from the "certs per registered domain per week" rate limit. The only rate limit relevant for renewals is the duplicate rate limit of 5 per week. Which isn't really an issue with regard to ARI.
Like I said in the topic, the main rate limit that gets in the way of large deployments is New Orders. 300 per 3 hours (or ~100/hour) only allows 2400 per day, assuming no backoff (which there is, to be nice). This is problematic because new certificates aren't always spaced out so perfectly. Hence the start-early-and-backoff logic.
A domain name. And by "first" I mean the only renewal you ought to need within the given window. Repeated renewals can still be subject to rate limits to prevent abuse or bugs.
Why would you want multiple certs issued for the same (set of) domain name(s) anyway? That's just wrong in my opinion.
Anyway, my gut tells me this is more of a theoretical issue than a practical one, as ARI is used to smooth things out on the ACME server side and I don't really see how that would interfere much with the ACME client side. If you look at the example in draft-ietf-acme-ari you see a window of 4 days, with a recommendation (so not a MUST or SHOULD) to renew at a random time within that window. But as stated, there's nothing wrong with developing your own algorithm to accommodate appropriate renewal within the suggested ARI time window. That said, those 4 days are just an example; Let's Encrypt could recommend a single hour as the window, for example.
If you're talking about "first", I assume there's going to be a second? I just don't really understand what you were talking about earlier with the whole "first successful cert for a name", as it implies a second cert. And we were talking about renewals. So that would be a duplicate cert?
Only 1 is needed in that window, since there is only 1 cert that window applies to. If there's a second, that's either a bug or the beginning of abuse. Hence the "first" should be exempted from rate limits.
Yeah, no, never mind. I still don't get the actual issue or which rate limit that "first" (i.e. regular) renewal should be exempted from. I'll leave it to other users to discuss.
During past mass-revocation events, Let's Encrypt has temporarily adjusted or removed rate limits.
For example, during the TLS-ALPN revocations in January 2022, the New Orders limit was raised to 1000 orders per 3 hours:
Large integrators (who are most likely to be affected by rate limits during revocation events) were advised to contact Let's Encrypt.
If you regularly exceed the rate limits outside of mass-revocation events, you need to apply for a rate limit override anyway. This is how it was handled before ARI, and I personally don't see that changing with ARI.
ARI is a suggestion. The renewal time given by it is not a requirement in any way. Rather, it's what the CA recommends for uninterrupted communication (i.e. an advisory of an impending revocation).
Before ARI, Let's Encrypt suggested that subscribers always renew after 2/3 of the cert's lifetime has elapsed. If you were already spreading out your renewals over a much larger interval, you were already ignoring recommendations. In this case, ignoring ARI is the logical continuation of this approach.
You can still use ARI to be notified of impending revocations and perhaps for load avoidance, but in general your setup sounds incompatible with the ARI suggested window.
I have not, but I've also stated a few times that one of the goals of ARI is to smooth out renewals on the ACME server side, so I'm wondering why it would give you more trouble on the ACME client side than you currently already have.
And I've also talked about a fairly generous window example in the RFC draft of 4 days. So I'm also wondering why that would be an issue for you. What kind of windows does the Let's Encrypt production environment currently suggest? Let's start with that. IMO it doesn't make much sense to discuss a problem that doesn't actually exist.
I think this conversation has told me enough. My plan now is to check ARI and use it as a hint to start renewing right away if the window has been changed. (Since the window doesn't actually matter after all, but is at least a signal for potential problems.)