ARI recommendation may cause renewal outside Suggested Window

QUESTION

This is a bit of a tangent - how have you all been handling issues with failed replaces? Does this happen often? As previously discussed here, there is no standard error across CAs for this situation – so many people here seem to be consolidating on a trick of attempting to resubmit an order with replaces if it fails the first time. I haven't been able to trigger this outside of tests.

In distributed management settings, some storage backends, which are supposed to synchronize operations, don't have good atomicity guarantees or have bugs. Think NFS, some AWS mounts, and S3 to name a few. People use them anyway, and sometimes this causes one instance to step on another.

So for example: two instances may go to renew the cert "at the same time" (one slightly after the other) but this fails for the late one, so it retries without replaces. (Actually, in this case, when it goes to retry, it should first see that the cert was already renewed, and thus avoid retrying altogether... but there is still code in CertMagic which drops replaces if it has to retry again for any other reason.)
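
A rough sketch of that retry flow, for illustration only. This is not CertMagic's code: `submit_order`, `already_renewed`, and `OrderRejected` are hypothetical stand-ins for whatever your client uses to place an order, check storage, and signal a rejected order.

```python
from typing import Callable, Optional


class OrderRejected(Exception):
    """Hypothetical error: the CA refused the newOrder request."""


def renew_with_replaces(
    submit_order: Callable[[Optional[str]], str],
    already_renewed: Callable[[], bool],
    replaces: str,
) -> Optional[str]:
    """Try an order with `replaces`; fall back to a plain order only if needed.

    Returns the new certificate ID, or None when another instance already
    renewed and no further order is required.
    """
    try:
        return submit_order(replaces)
    except OrderRejected:
        # CAs do not share a standard error for a failed `replaces`, so every
        # rejection is treated the same way here (an assumption).
        if already_renewed():
            # Another instance won the race; do not order again.
            return None
        # Last resort: retry once without `replaces`.
        return submit_order(None)
```

Checking storage before the fallback is what keeps the late instance from placing a second, unnecessary order.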

Thanks! I have been trying to deal with that by detecting and blocking competing orders at initialization, but may end up using some "dogpile" strategies if this proves insufficient.

That has worked well for me; the only time I've been able to trip the issue is when a "legacy" cert tries to renew (i.e. the ACME server no longer has the ARI info). This happens in testing a lot, as Pebble's data store is ephemeral.

I just dug into Boulder's source, and... ugh.

I am assuming that most Subscribers will be using cron to schedule something like Certbot daily, which will do its own task management and dispatch.

The current Boulder source requires the renewal to happen within a computed window to bypass rate limits. That window is generated by the same logic that powers the renewalInfo endpoint: it creates a buffered window around the 2/3 point of the lifespan for 90-day certs and the 1/2 point for short-lived ones. That code is here:

Based on the above, the "optimal" is the "suggested".

My current thoughts are this:

  • Clients should probably archive the initial renewalInfo and, if it ever changes, renew right at the start of the suggestedWindow. Either Boulder updated its logic or there was some issue. The margins are small, so if there is a change you want to maximize the time left to respond in case there is an error and a retry is needed.

  • Renewing within a 90-day cert's suggestedWindow shouldn't be too hard. It's roughly ±1 day from the 2/3 point, so this should get picked up on a daily run.

  • Renewing a short-lived cert will be hard. There is only a ±2h24m window around the halfway point in which to renew.

IMHO the renewal windows should be wider. I think giving a full ±24h for 90-day certs would be more useful, and at least ±12h for shorter certificates. To meet the current windows, short-lived clients will have to be running non-stop. A 90-day cert will have 1-2 opportunities to renew during the current window if the client only runs daily, but a 10-day cert will require a renewal program to run at least every 4 hours. Clients will absolutely need to run hourly to see if ARI checks are needed.
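
To make the numbers above concrete, here is a minimal sketch of the window arithmetic as described in this thread. It is not Boulder's code: the function name is made up, and `short_lived_cutoff` is an assumed threshold for when the halfway point is used instead of the 2/3 point.

```python
from datetime import datetime, timedelta, timezone


def suggested_window(not_before, not_after,
                     short_lived_cutoff=timedelta(days=10),
                     margin_percent=1):
    """Approximate the window described in the thread.

    The ideal point is 1/2 of the lifetime for short-lived certs and 2/3
    otherwise; the window extends `margin_percent` of the lifetime on either
    side. `short_lived_cutoff` is an assumed threshold, not a confirmed value.
    """
    lifetime = not_after - not_before
    if lifetime <= short_lived_cutoff:
        ideal = not_before + lifetime / 2
    else:
        ideal = not_before + lifetime * 2 / 3
    half_width = lifetime * margin_percent / 100
    return ideal - half_width, ideal + half_width


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)  # arbitrary example date
for days in (90, 10):
    start, end = suggested_window(issued, issued + timedelta(days=days))
    print(f"{days}-day cert: window width {end - start}")
```

Under these assumptions the 90-day window is 43.2 hours wide and the 10-day window is 4.8 hours wide (±2h24m).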

I don't see the harm but it seems unnecessarily complex. A similar question was answered by Aaron (or other staff) who said you should not assume anything just because a window changes. Simply choose a new random time within the new window and proceed. It could have changed for load balancing such as moving up renewals to cope with upcoming CA revocation or hopping past pending maintenance or any such thing.
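
A minimal sketch of "choose a new random time within the new window", assuming the client caches the last window it saw along with the time it picked. The function names and the caching scheme are illustrative, not taken from any particular client.

```python
import random
from datetime import datetime, timedelta, timezone


def pick_renewal_time(start, end, rng=random):
    """Pick a uniformly random instant inside [start, end]."""
    offset_seconds = rng.uniform(0, (end - start).total_seconds())
    return start + timedelta(seconds=offset_seconds)


def renewal_time_for(window, cached_window, cached_time):
    """Reuse the cached pick unless the CA has published a different window."""
    if window == cached_window and cached_time is not None:
        return cached_time
    # The window changed (or nothing was cached): pick again, and do not
    # try to infer why it changed.
    return pick_renewal_time(*window)
```

Keeping the pick stable while the window is unchanged avoids re-rolling the time on every run.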

Pretty sure the current window on 90d certs is 48H long. I am not sure if that 48H starts at the 2/3 mark or surrounds it though.

I agree more than once/day is best for 90d certs. At least twice and with a well-designed ACME Client more frequent than that does not induce larger load on ACME Server (or itself). By this I mean it honors ARI Retry-After and does suitable progressive backoff on any errors. So, this all needs to be in the state data.
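
As a sketch of what that state might drive: the function below decides when the next renewalInfo poll is allowed, given the last Retry-After value and a count of consecutive errors. The name, the 5-minute base, and the 6-hour cap are all assumptions for illustration; here errors take precedence over Retry-After.

```python
from datetime import datetime, timedelta, timezone


def next_ari_check(now, retry_after_seconds=None, consecutive_errors=0,
                   base=timedelta(minutes=5), cap=timedelta(hours=6)):
    """Return the earliest time the client should poll renewalInfo again."""
    if consecutive_errors > 0:
        # Progressive backoff: base, 2*base, 4*base, ... limited by `cap`.
        delay = min(base * (2 ** (consecutive_errors - 1)), cap)
    elif retry_after_seconds is not None:
        # Honor the server's Retry-After from the last successful poll.
        delay = timedelta(seconds=retry_after_seconds)
    else:
        # No guidance stored: fall back to the cap (an assumed default).
        delay = cap
    return now + delay
```

A client run every few minutes can then exit early whenever the current time is before the stored value, so frequent runs do not turn into frequent requests.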

Various strategies for short-lived are viable. Personally, I think an alternate CA cert (or even an LE 90day cert) should already be available and on "standby" if using short-lived certs. Even ignoring the ARI rate limits the window for correcting a problem is small. The problem might not even be an LE one but anything in the path.

Auto-deploy of the standby cert should happen unless error recovery promptly gets a fresh short-lived cert. Then switch back once the original problem has been researched and confidence in its reliability is restored.

I only see these as practical when used by very skilled admins along with appropriate monitoring systems and tooling.

And Let's Encrypt is going to offer 6-day certs, which will only have a window of 2.88 hours. So you'd need to run your ACME client at least 9 times a day to get at least one attempt within the window.

A little bit less: 43.2 hours. That's why Jonathan suggested plus and minus 24 hours to make a 48 hour window. I guess that would make a little bit more sense if e.g. clients were just running once a day: then the client would have 2 attempts while with 43.2 hours there's a chance the second attempt would be outside the window.

The Boulder code above subtracts and adds a 1% margin around the ideal point, making 2% of the lifetime in total.
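
The run-frequency arithmetic can be checked with a few lines, assuming evenly spaced runs and a window of 2% of the lifetime as described above (the function name is made up for this example):

```python
import math
from datetime import timedelta


def min_runs_per_day(lifetime, margin_percent=1):
    """Window width, and evenly spaced runs per day so one lands inside it."""
    window = lifetime * (2 * margin_percent) / 100
    # A run is guaranteed to land in the window only if the gap between
    # runs is no longer than the window itself.
    return window, math.ceil(timedelta(days=1) / window)


for days in (6, 90):
    window, runs = min_runs_per_day(timedelta(days=days))
    print(f"{days}-day cert: window {window}, at least {runs} run(s) per day")
```

For a 6-day cert this gives a 2h52m48s (2.88 hour) window and 9 runs per day; for a 90-day cert a single daily run is enough to get one attempt, but not necessarily two.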


Ideally, the ACME client would be an always running service/daemon with its own scheduler, not needing to rely on systemd timers or cronjobs. That way the ACME client could schedule in the renewal at any time,

Huh. Looks like a fairly recent change. As recently as a few weeks ago I was getting windows like this for staging:

ARI Suggested Window: Apr23 13:20:44<->Apr25 13:20:44

But just now I see

ARI Suggested Window: May15 19:41:54<->May17 14:52:43
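
For anyone who wants to compare the two widths, a small sketch that parses these log lines. The log omits the year, so `year` is a placeholder, and `%b` assumes an English locale.

```python
from datetime import datetime

FMT = "%Y %b%d %H:%M:%S"


def window_width(line, year=2025):
    """Width of an 'ARI Suggested Window: A<->B' log line.

    `year` is an assumed placeholder; it does not affect the width as long
    as both ends fall in the same year.
    """
    start_text, end_text = line.split(": ", 1)[1].split("<->")
    start = datetime.strptime(f"{year} {start_text}", FMT)
    end = datetime.strptime(f"{year} {end_text}", FMT)
    return end - start


print(window_width("ARI Suggested Window: Apr23 13:20:44<->Apr25 13:20:44"))
print(window_width("ARI Suggested Window: May15 19:41:54<->May17 14:52:43"))
```

The older line comes out at exactly 48 hours and the newer one at 43h10m49s, which is close to the 43.2 hours mentioned above.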

Yes, possibly. That's a considerable change for the ACME Client infrastructure.

And it is possible for cron/timer-based clients to achieve that just by running very frequently, like every few minutes. A Client being run that often should be well behaved so as not to create stress if lots of people use it or if it is used for large numbers of certs. It is certainly doable. I have a working one :slight_smile:

It requires a fair amount of state data that ACME Clients have not needed in the past when running every 12 or 24 hours for 90-day certs. Clients still do not need it if they continue to rely on that pattern and cert lifetime. They could just drop in ARI in place of OCSP and be fairly reasonable for such a use.

FWIW, I decided to keep the autorenewals simple in my recent update of CertSage: every time CertSage is run, it just checks whether the current time is greater than or equal to:

(validFrom_time_t + validTo_time_t * 2) / 3

I didn't want to work in the intricacies of ARI at this time.
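
For comparison with the ARI windows discussed above, the same check rendered in Python (CertSage itself is PHP; the function name here is made up for the example):

```python
import time


def renewal_due(valid_from, valid_to, now=None):
    """True once `now` reaches the 2/3 point of the validity period.

    All values are Unix timestamps in seconds. (from + 2*to) / 3 is
    algebraically the same as from + 2/3 * (to - from).
    """
    if now is None:
        now = time.time()
    return now >= (valid_from + valid_to * 2) / 3
```

For a 90-day cert this becomes true at day 60, which is the center of the suggested window rather than its start.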

Because the windows have relatively small margins, selecting an offset of 0 (the start of the window) maximizes the retry chances if an issue happens, as (based on the current logic) a second attempt is still possible within this window.

It's about -24h and +24h from the 2/3 mark. The code is above; the 2/3 (or halfway) point is calculated, then 1% is subtracted for the start and added for the end.

That is a great idea, and makes me glad that I recently wrote support for backup certs!

Certbot is powering the vast majority of installations. It could conceivably be strapped into something like Celery, but that seems like overkill -- and moving towards that sort of deployment would seem a bit user-antagonistic to me. My client is Python-based, and I specifically avoided task runners and messaging queues for ease of use. A handful of other existing clients could work on a daemonized model, but again that seems like overkill. This could cause issues for "control panel" clients too, like CertSage.

I've hit a sweet spot for my needs with an hourly run for ARI checks, but it seems like I'll need to handle orders in there too. I'll have to start logging how many ARI checks and ACME orders happen on each run to figure out if multiple threads/processes are needed. We only have dozens of domains on this system now, but it's supposed to scale to tens of thousands.

There's a bit of a weird "benefit" with CertSage's current implementation in that while there is a daily cron job to ensure some frequency of execution, search bots and the like (as well as manual loads) hitting certsage.php will cause renewal checks as well.

You mean… integrated directly into the server?? :thinking:

Or the core operating system?

That's one option, as long as it's not written in Go :rofl:

Modula 2 is a good memory safe language.
And Modula 2 has had coroutines since it was designed, not an addon afterthought.

Gotta be Rust...

:man_bowing: :crab: :gear: :woman_bowing:

Nay. Needs to be a memory-safe procedural programming language that has characters, bytes, and octets that are equal in size, with American Standard Code for Information Interchange as the character set, as that is what domain names are based on, for better or worse.

But the Rustaceans will come after us... :worried:

But didn't Alec Baldwin shoot and fatally wound a cinematographer in the production of that movie?

His execution was not body safe.
