ARI recommendation may cause renewal outside Suggested Window

Should a cron-based Client adjust their ARI selected random renewal time to avoid being before the suggestedWindow?

Let me explain ...

Consider a cron-based client. Run frequency doesn't matter but let's use every 1H for example.

  • At 01:40, Client sees an ARI 48H-long suggestedWindow starting in 20min (02:00)
  • Client chooses random time (point #2 below) which happens to be 02:02
  • Per recommendation point #5, the Client renews immediately at 01:40, which is before the suggestedWindow, because the next cron run is at 02:40 and the selected random time falls before that

The longer the cron run interval the "earlier" it could be.

I vaguely recall some early ARI comments that to avoid Rate Limits you needed to renew within the window. The Rate Limit docs today don't mention the window, only that using "Replaces" avoids Rate Limits. Perhaps I misremember, or LE is more flexible with "Replaces" now.

I realize it is not always possible to renew within the suggestedWindow, for example with a fairly short Window and long cron intervals, or an extended outage where error retry pushes beyond the end of the Window. But the case I describe can occur even when all components are operating correctly.

ARI Draft

Clients MUST attempt renewal at a time of their choosing based on the suggested renewal window. The following algorithm is RECOMMENDED for choosing a renewal time:

  1. Query the renewalInfo resource to get a suggested renewal window.
  2. Select a uniform random time within the suggested window.
  3. If the selected time is in the past, attempt renewal immediately.
  4. Otherwise, if the client can schedule itself to attempt renewal
    at exactly the selected time, do so.
  5. Otherwise, if the selected time is before the next time that the
    client would wake up normally, attempt renewal immediately.
  6. Otherwise, sleep until the time indicated by the Retry-After
    header and return to Step 1.
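
A minimal Python sketch of the recommended steps, replaying the 01:40 example from this post. The function names (`pick_uniform`, `decide`) and the action strings are illustrative assumptions, not part of the draft or of any client.

```python
import random
from datetime import datetime, timedelta

def pick_uniform(window_start, window_end, rng=random):
    """Step 2: uniform random time within the suggested window."""
    span = (window_end - window_start).total_seconds()
    return window_start + timedelta(seconds=rng.uniform(0, span))

def decide(now, selected, next_wakeup, can_self_schedule=False):
    """Steps 3-6, returned as a hypothetical action name."""
    if selected <= now:
        return "renew_now"        # step 3: selected time is in the past
    if can_self_schedule:
        return "schedule_exact"   # step 4: client can wake itself up
    if selected < next_wakeup:
        return "renew_now"        # step 5: next wake-up would overshoot
    return "sleep"                # step 6: wait for Retry-After, re-query

# Hourly cron, run at 01:40, window opens at 02:00, selected time 02:02.
now = datetime(2024, 1, 1, 1, 40)
window_start = datetime(2024, 1, 1, 2, 0)
selected = datetime(2024, 1, 1, 2, 2)
action = decide(now, selected, next_wakeup=now + timedelta(hours=1))
print(action, now < window_start)  # step 5 fires while still before the window
```

With these inputs step 5 triggers at 01:40, which is the early renewal the post asks about.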
3 Likes

I don't interpret cron based clients as being able to "schedule itself" at all. Whether it's literally cron or some other scheduler, the point is that the client is being run by a process outside of itself and it doesn't know when it will be run next.

So steps 4 and 5 don't apply. And step 2 is unnecessary. In your example, the client would just sleep until the next run, which is still within the ARI window even if it's past the arbitrary time it might have chosen in the previous run.

On the next run, the client recognizes that it is now being run within the suggested window (or more accurately, after the window start time) and renews immediately.

The only way this simplified algorithm breaks is if the schedule is not occurring often enough such that the client could miss the ARI window entirely. But that's why recommendations for renewal schedules have dropped from once or twice daily to more frequent.

1 Like

They certainly do schedule themselves. It is just a fixed time rather than event-driven based on an original selectedTime within a suggestedWindow.

On a different thread Aaron suggested a 1H interval as being frequently enough for such clients.

A cron-based Client can know when it is next scheduled to run. My own client does that today.

I track state info anyway to honor ARI Retry-After. That is, if the cron is every 1H the LE Retry-After is currently 6H. So, only every 6th cron run refreshes RenewalInfo. This state info also ensures error backoff retry doesn't happen every cron run. Both for RenewalInfo query errors and cert request failures.

Think of it as the cron-job waking up and saying "What do I need to do this time"? Maybe nothing if no error recovery, the RenewalInfo is still fresh, and the saved random selectedTime from the ARI window is in the future. (this is per cert of course).

A simple drop-in replacement of ARI for OCSP in a cron based client is certainly the easy way to go. But, it isn't the most faithful to the spec. Or the friendliest in case of high frequency runs.

Your argument about no client knowing when it will run next would obsolete point #5 altogether. Is that your opinion?

2 Likes

Pretty much, yes. If the client is not responsible for maintaining its own schedule, it can't know when it will be run next. So the algorithm about when to renew comes down to:

  • Check for updated ARI window if appropriate based on previous Retry-After
  • If $now is after ARI start time
    • Renew
  • Else
    • End (for that particular cert)
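
The list above could be sketched roughly as follows in Python. `fetch_ari` is a hypothetical callable standing in for the renewalInfo query (assumed to return window start, window end, and Retry-After as a `timedelta`), and the `state` dict keys are made up for illustration.

```python
from datetime import datetime, timedelta

def cron_run(now, state, fetch_ari):
    """One externally scheduled run for a single cert: "renew" or "end"."""
    # Refresh the cached window only once the previous Retry-After has elapsed.
    if state.get("ari_refresh_at") is None or now >= state["ari_refresh_at"]:
        window_start, window_end, retry_after = fetch_ari()
        state["window_start"] = window_start
        state["window_end"] = window_end
        state["ari_refresh_at"] = now + retry_after
    # The window start itself is the trigger; no random time is picked.
    return "renew" if now >= state["window_start"] else "end"

# Hypothetical stand-in for the ARI query, counting how often it is called.
calls = []
def fake_fetch():
    calls.append(1)
    return datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 3, 2, 0), timedelta(hours=6)

state = {}
first = cron_run(datetime(2024, 1, 1, 1, 40), state, fake_fetch)
second = cron_run(datetime(2024, 1, 1, 2, 40), state, fake_fetch)
print(first, second, len(calls))
```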

It respects the ARI suggestion as best it can given limited information without missing it entirely (provided the schedule doesn't force it to miss the window).

2 Likes

Well, you can know when the next planned run is just by comparing the time you previously ran to $now. You have to keep state info anyway as I noted to honor Retry-After and avoid error retry floods.

I don't think your suggestion is unreasonable. I just think it isn't as faithful to the spec as it could be.

You could do (and I actually do) something like:

Immediately after getting first cert:

  • Query ARI
  • Select random time within ARI suggestedWindow. Save this as selectedRenewalTime
  • Save ARI Retry-After (plus $now) as ARIrefreshTime

Next cron run:

  • If $now past ARIrefreshTime, refresh ARI and choose/save new selectedRenewalTime
  • If $now past saved selectedRenewalTime, renew cert

NB: Similar "next time" values saved for ARI query errors and cert request errors.
Checking and handling these error retry sequences take precedence over above.
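
As a rough Python sketch of the state tracking described above (not the actual client code): `CertState`, `refresh_ari`, and `fetch_ari` are hypothetical names, and the error-retry timers mentioned in the NB are left out for brevity.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class CertState:
    selected_renewal_time: Optional[datetime] = None
    ari_refresh_time: Optional[datetime] = None

def refresh_ari(state, now, fetch_ari, rng=random):
    """Query ARI, pick and save a random time in the window, save the refresh time."""
    window_start, window_end, retry_after = fetch_ari()
    span = (window_end - window_start).total_seconds()
    state.selected_renewal_time = window_start + timedelta(seconds=rng.uniform(0, span))
    state.ari_refresh_time = now + retry_after

def cron_run(state, now, fetch_ari, rng=random):
    """Return True when the cert should be renewed on this run."""
    if state.ari_refresh_time is None or now >= state.ari_refresh_time:
        refresh_ari(state, now, fetch_ari, rng)
    return now >= state.selected_renewal_time

# Hypothetical fixed ARI response: 48H window, 6H Retry-After.
W_START = datetime(2024, 1, 1, 2, 0)
W_END = W_START + timedelta(hours=48)
def fake_fetch():
    return W_START, W_END, timedelta(hours=6)

state = CertState()
early = cron_run(state, W_START - timedelta(hours=1), fake_fetch)
late = cron_run(state, W_END + timedelta(seconds=1), fake_fetch)
print(early, late)
```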

Is all this tracking of state info more complicated? Certainly. But I believe it respects the spec fairly closely only subject to the granularity of the cron run frequency.

Which leads back to my original post. When I set the selectedRenewalTime I know how long it has been since the last cron run, as I noted that in the state info too. I could adjust this time to ensure I stay within the suggestedWindow by blocking out this amount of time from the start (and end). I mentioned a possible drift beyond the end anyway due to errors and whatnot, so I am less concerned about that.
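
One way that adjustment might look, sketched in Python. The function name and the midpoint fallback for windows shorter than two cron intervals are assumptions made for the example, not something stated in this thread.

```python
import random
from datetime import datetime, timedelta

def pick_within_margins(window_start, window_end, cron_interval, rng=random):
    """Pick a random time, keeping one cron interval clear at each end of the window."""
    lo = window_start + cron_interval
    hi = window_end - cron_interval
    if hi <= lo:
        # Window too short for the margins: fall back to the midpoint (assumption).
        return window_start + (window_end - window_start) / 2
    return lo + timedelta(seconds=rng.uniform(0, (hi - lo).total_seconds()))

start = datetime(2024, 1, 1, 2, 0)
end = start + timedelta(hours=48)
picked = pick_within_margins(start, end, timedelta(hours=1), random.Random(0))
print(picked)
```

Because the picked time is at least one cron interval after the window start, a run that renews early under step 5 is already inside the window.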

2 Likes

Only step 4 does not apply if an ACME client isn't able to schedule renewals by itself. Step 5 is when ACME clients can NOT do that, i.e., most of the ACME clients.

Assuming ACME clients will not run less frequently than the duration of the ARI window, and if the ARI team doesn't want this to happen, they should add some kind of exclusion for renewals that are too soon according to rule 5, so that those renewals wait until the next check.

I.e.: "… unless the current time is before the ARI window; then the renewal should be skipped until the next renewal check." (or something like that)
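
A sketch of how that exclusion could sit inside step 5, in Python. This is only an illustration of the wording proposed here; the function name and action strings are hypothetical and the draft itself contains no such guard.

```python
from datetime import datetime, timedelta

def decide_with_guard(now, selected, next_wakeup, window_start):
    """Step 5 with the proposed exclusion for runs before the window opens."""
    if selected <= now:
        return "renew_now"
    if selected < next_wakeup:
        if now < window_start:
            return "skip_until_next_check"  # proposed exclusion
        return "renew_now"
    return "sleep"

window_start = datetime(2024, 1, 1, 2, 0)
selected = datetime(2024, 1, 1, 2, 2)
hour = timedelta(hours=1)

run_0140 = datetime(2024, 1, 1, 1, 40)
run_0240 = datetime(2024, 1, 1, 2, 40)
at_0140 = decide_with_guard(run_0140, selected, run_0140 + hour, window_start)
at_0240 = decide_with_guard(run_0240, selected, run_0240 + hour, window_start)
print(at_0140, at_0240)
```

In the 01:40 scenario from the first post the renewal is deferred, and the 02:40 run renews inside the window.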

4 Likes

Yes, my point exactly.

2 Likes

You can guess, but you can't know for sure. The human controlling the schedule could change it between runs. Even unchanged schedules are supposed to have an element of randomness if possible. The server could be down during the next interval. But all that is irrelevant because the point of an externally scheduled client is that the client doesn't need to worry about the schedule.

Unless I'm misunderstanding something, I think the only difference between your proposed implementation and mine is trying to pick an explicit time for renewal instead of just using the start time of the window as a trigger. Picking an explicit time seems pointless since you can't ensure the renewal happens at that specific time because scheduled clients don't generally control when they run next. It also unnecessarily reduces the effective size of the renewal window by however long it is between the actual ARI start and the picked time.

In both cases:

  • The client will be attempting to renew within the ARI window (given a sufficiently frequent schedule)
  • The client won't be making excessive ARI calls or renewal requests because it tracks previous state.

What am I missing?

3 Likes

A while ago I opened an issue with a similar concern:

The answer being:

Yes, the situation you describe is acceptable -- if it wasn't, I wouldn't have specified it.

4 Likes

Sure, somewhat obscure but that's why I set a future time in the state info. The next cron run after that time takes the action. If someone changes their cron frequency from 1H to 24H the extra delay is on them :slight_smile:

You can't know how any system will behave in the future. Not even a dedicated thread waiting for the exact moment to do something. Users can kill or break tasks, modify needed subsystems and all that.

Yes, when the cert selectedRenewalTime has passed I induce skew for the cert request itself. My pseudo-code did not show every contingency. I further "move up" the end of the suggestedWindow when deriving the random selectedRenewalTime to account for this skew to stay within the window.

No, I don't target an explicit time.

Correct though I don't use the startTime in the suggestedWindow as the trigger. The spec says to choose a random time within that for renewal. So I do that.

Yes, but, the client does need to know about timing issues. It should honor Retry-After so needs to track that. If it has an ARI query failure it should do exponential backoff so needs to know what happened last time. Same for cert request failure.

A cron job running every 5min and continuously retrying failed cert requests or ARI queries is ill-behaved. So, a client should behave well regardless of the set cron frequency.
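
A small Python sketch of that kind of backoff, independent of cron frequency. The 5-minute base, the 6-hour cap, and the function names are arbitrary assumptions for illustration, not values from the spec or from any client.

```python
from datetime import datetime, timedelta

def next_retry_time(now, failures, base=timedelta(minutes=5), cap=timedelta(hours=6)):
    """Earliest time for the next attempt after `failures` consecutive failures."""
    exponent = min(max(failures - 1, 0), 16)  # bound the exponent to avoid overflow
    return now + min(base * (2 ** exponent), cap)

def may_attempt(now, next_attempt_at):
    """A cron run only retries once the saved time has passed."""
    return next_attempt_at is None or now >= next_attempt_at

now = datetime(2024, 1, 1, 0, 0)
retry_at = next_retry_time(now, failures=3)   # 5 min * 2^2 = 20 min
print(retry_at, may_attempt(now + timedelta(minutes=5), retry_at))
```

With a 5-minute cron, the runs between `now` and `retry_at` would see `may_attempt` return False and do nothing for that cert.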

Picking a random selected time in the window is just one (small-ish) element within this scheme.

Well, it is not unnecessary as that is what the spec calls for. But, yes, a cron based client needs to run frequently enough for the window size of a CA. As noted, Aaron suggested 1H and LE uses 48H as current window size.

2 Likes

Thanks much. Hadn't seen that so this thread is asked and answered.

I have been treating the suggestedWindow as fairly important. Sounds like there's far more slack allowed for a "well behaved" client.

I still think I'll try to stay within the window. But it sounds like I can use best judgement how to handle that.

3 Likes

ACME clients can always improve on the suggested algorithm of course, no rules against that :slight_smile:

2 Likes

We could use a complex AI model to predict the server temperatures at renewal, the weather patterns surrounding the networking equipment, and the lengths of all the paths to the validation servers (gotta predict those too). :smiley:

Or maybe roll a die hundreds of times and use computer vision and probability models to interpret the results. :crystal_ball: :thinking:

I kid. I kid. :wink:

2 Likes

You mean like the cloudflare lava lamp wall?

I once saw a Lego dice rolling machine that was mostly a belt elevator with a camera pointing to it and some CV model on a raspi.

3 Likes

My own approach is to take the ARI window as a literal suggestion. If my users want their certs to renew before the ARI window, then that's what they get. If they want to ignore the ARI suggestion and revert to a fixed renewal interval (which someone has asked for, because they have manual processes), they can do that as well, losing the benefits of ARI for mass revocation events. For everyone else, the renewal will happen at some randomly chosen point in the suggested window.

Note that it helps enormously to be checking for pending renewal jobs regularly. e.g. Certify The Web checks every 5 minutes to allow for things like very short lived certificates, and managing large collections of certificates (e.g. 100k+).

3 Likes

Just curious ... I can understand someone waiting to update their servers on a specific schedule. But what is the advantage of not getting them on a regular schedule?

That is, "deploy" certs separate from "acquire"

2 Likes

They probably want to use some of the built in deployment but also have manual stuff (like 3rd party vendors, old load balancers etc) they want to coordinate for some reason. Really though your guess is as good as mine because they didn't respond when I asked about their "workflow" to try to understand the requirement better. Sometimes that means there is a part of their process they are embarrassed about sharing. You can lead a horse to water..

4 Likes

I removed the Solved mark to re-open my original concern. Sorry @mholt :slight_smile: That gave Aaron's reply to your similar concern but that doesn't fully match with other LE comments. I should probably repost as a "Feature Request" for doc clarity but leaving it here for continuity.

There are a number of places that say Rate Limits won't apply when using Replaces that also fall within the Suggested Renewal Window. One example quote below from here: An Engineer’s Guide to Integrating ARI into Existing ACME Clients - Let's Encrypt

When Let’s Encrypt processes a new order request featuring a ‘replaces’ field, several important checks are conducted. First, it’s verified that the certificate indicated in this field has not been replaced previously. Next, ... If these criteria are met and the new order request is submitted within the ARI-suggested renewal window, the request qualifies for exemption from all rate limits.

There are a number of other threads with posts from LE staff saying the same.

In contrast, the current Rate Limit page is less explicit. It says ARI renewals occurring with "Replaces" and in the "optimal" suggested window are exempt from rate limits. It leads me to wonder, given the ACME ARI Draft item #5, whether that "optimal" window includes times outside the explicit Suggested Window. See: Rate Limits - Let's Encrypt

I plan to make best efforts to stay within the Suggested Window so will ignore item #5.

TL;DR:
It would be nice to have clarity from LE whether "optimal" allows falling outside "Suggested" window for Rate Limit purposes with "Replaces". And, if so, how far outside :slight_smile:

2 Likes

Fair point actually! As I've said elsewhere, I think trying to fall within the suggested renewal window kind of defeats the purpose if you don't get any benefits for doing so (i.e. no rate limit exemption). But if you can fall outside the suggested renewal window and still get the benefits, that seems to defeat the purpose of the renewal window! :face_with_spiral_eyes:

4 Likes

First, update Pebble if you are testing against that – it gave bad renewal info time until about a week or two ago. It now ensures the window is within the cert's notBefore/notAfter.

I decided two things with my client

  1. Check ARI hourly AND subtract 65 minutes from the cert notAfter and suggested window end. 60 minutes cover anything that would end before the next invocation; 5 minutes account for potential clock drift.

  2. Every 6 hours I run an "automated orders" routine. This handles queued orders, orders backup certificates, and then does renewals. Anything that would either expire or enter the renewal window within the next 365 minutes will be renewed: 360 minutes = the 6-hour window until the next run, plus 5 minutes for potential clock drift.

There is a small chance with this logic that certs could renew before the window starts. I decided that is acceptable for now. I am preparing for short-lived certs, and have come to the conclusion that renewing at the start of the window is currently the best option. I think you could easily have more lenient logic for long term certs. Maybe in the future I'll support two cases; I like having more overlap on short certs though.
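
The two margins above, as a Python sketch. The function and constant names are hypothetical; only the 60/360-minute intervals and the 5-minute drift allowance come from the description.

```python
from datetime import datetime, timedelta

CLOCK_DRIFT = timedelta(minutes=5)

def adjusted_deadlines(not_after, window_end, check_interval=timedelta(hours=1)):
    """Routine 1: pull notAfter and the window end in by 60 + 5 = 65 minutes."""
    margin = check_interval + CLOCK_DRIFT
    return not_after - margin, window_end - margin

def due_for_renewal(now, not_after, window_start, run_interval=timedelta(hours=6)):
    """Routine 2: renew anything expiring or entering its window within 365 minutes."""
    horizon = now + run_interval + CLOCK_DRIFT
    return not_after <= horizon or window_start <= horizon

now = datetime(2024, 1, 1, 0, 0)
not_after = datetime(2024, 2, 1, 0, 0)
opens_in_5h = now + timedelta(hours=5)
opens_in_7h = now + timedelta(hours=7)
print(due_for_renewal(now, not_after, opens_in_5h),
      due_for_renewal(now, not_after, opens_in_7h))
```

The first case shows the early renewal mentioned above: the window opens in 5 hours, inside the 365-minute horizon, so the cert is renewed on this run.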

I decided to keep the ARI and Renewals (mostly) separate, because I modeled a bottleneck on a uniform routing during stress testing. If need be, in the future multiple parallel task runners can be used if I hit bottlenecks.

On "mostly" separate: I can schedule cron jobs for each routine independently, but I put together an hourly task dispatcher so only one routine needs to be entered into cron. On generation it's assigned an hour+minute offset; the cron is set to the minute and internally it applies the offset to the schedule (so a 4x daily task can happen at any point, not 00:00)
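
One possible reading of the hour-offset part of that dispatcher, as a Python sketch. The `is_due` function is hypothetical, it assumes cron already fires at the assigned minute each hour, and it assumes the number of runs per day divides 24 evenly.

```python
def is_due(current_hour, runs_per_day, offset_hour):
    """True if a routine that runs `runs_per_day` times daily should fire this hour."""
    period = 24 // runs_per_day          # e.g. 4x daily -> every 6 hours
    return (current_hour - offset_hour) % period == 0

# A 4x daily routine with an hour offset of 3 fires at 03, 09, 15 and 21.
due_hours = [h for h in range(24) if is_due(h, runs_per_day=4, offset_hour=3)]
print(due_hours)
```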

4 Likes