What are you doing with ARI Retry-After?

I didn't see much about this apart from this very old thread ARI: Retry-After header describing startup problems.

How are ACME clients handling this? I can understand how a daemon-type client could use that. But what about cron- or timer-based CLI clients (Certbot, lego, ...)? They would loop if they honored the getARI retry-after.

LE currently replies with a 6H retry-after, the same as in the example from the draft. It seems helpful to surface this info so people can set an appropriate frequency for their cron or timer, and change their timing if the retry-after changes. Do you think clients should do that? Are any of them doing that?

HTTP/1.1 200 OK
Content-Type: application/json
Retry-After: 21600

{
  "suggestedWindow": {
    "start": "2021-01-03T00:00:00Z",
    "end": "2021-01-07T00:00:00Z"
  },
  "explanationURL": "https://example.com/docs/example-mass-reissuance-event"
}

The server SHOULD include a Retry-After header indicating the polling
interval that the ACME server recommends. Conforming clients SHOULD
query the renewalInfo URL again after the Retry-After period has
passed, as the server may provide a different suggestedWindow.
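
For anyone who wants to turn that header into a concrete "do not poll before" time, here is a minimal sketch. It assumes the header arrives as a plain string; per RFC 9110 the value is either a number of seconds or an HTTP-date. The function name `next_poll_time` is made up for illustration and is not part of any ACME client.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def next_poll_time(retry_after: str, now: datetime) -> datetime:
    """Return the earliest time to re-query renewalInfo (hypothetical helper)."""
    value = retry_after.strip()
    if value.isdigit():
        # delay-seconds form, e.g. "21600"
        return now + timedelta(seconds=int(value))
    # HTTP-date form, e.g. "Sun, 03 Jan 2021 06:00:00 GMT"
    parsed = parsedate_to_datetime(value)
    if parsed.tzinfo is None:
        # Treat a date without zone information as UTC (an assumption).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


now = datetime(2021, 1, 3, 0, 0, tzinfo=timezone.utc)
print(next_poll_time("21600", now))
print(next_poll_time("Sun, 03 Jan 2021 06:00:00 GMT", now))
```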

4 Likes

I don't know of any cron-based clients that do that, but technically it would be possible to statefully store the retry-after value on disk (together with other status information) and read that on renewals. This then allows the ACME client to determine whether to re-query ARI or not. Additionally, this technique may also allow the client to store rate-limit information independently of the cron interval. I think this is particularly relevant for short-lived certs, where you want to reduce the cron interval of your ACME client as much as possible without spamming the ACME server on each run.
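
A rough sketch of that idea, assuming a single JSON state file and a Retry-After already converted to seconds. The file layout, the `next_check` key, and the function names are all hypothetical; no existing client is known to work this way.

```python
import json
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read per-certificate ARI state; a missing file means no state yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def record_ari_response(path: Path, cert_id: str, retry_after_seconds: int, now: datetime) -> None:
    """Store the earliest time this certificate's ARI should be queried again."""
    state = load_state(path)
    next_check = now + timedelta(seconds=retry_after_seconds)
    state[cert_id] = {"next_check": next_check.isoformat()}
    path.write_text(json.dumps(state, indent=2))


def should_query_ari(path: Path, cert_id: str, now: datetime) -> bool:
    """True if there is no stored state or the stored Retry-After has elapsed."""
    entry = load_state(path).get(cert_id)
    if entry is None:
        return True
    return now >= datetime.fromisoformat(entry["next_check"])


# Example: a cron run at t0 stores the 6H Retry-After; later runs consult it.
state_file = Path(tempfile.mkdtemp()) / "ari_state.json"
t0 = datetime(2021, 1, 3, 0, 0, tzinfo=timezone.utc)
record_ari_response(state_file, "example-cert-id", 21600, t0)
print(should_query_ari(state_file, "example-cert-id", t0 + timedelta(hours=1)))
print(should_query_ari(state_file, "example-cert-id", t0 + timedelta(hours=6)))
```

With this in place the cron interval can be as short as you like: runs that land before `next_check` simply skip the ARI request.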

3 Likes

Sure, but why "hide" the info as a response header rather than in the response data?

This is especially awkward for any ACME API that now must surface both the response headers and data to the calling ACME Client so it can track this stateful data. Actually, awkward for any app as usually the http request "layer" is well below the one handling the ACME data.

Because it is a header, it seems intended to be ignored by such cron/timer-based ACME clients. And it sounds like that is what is happening. Hopefully an app's lower-level HTTP request function does not blindly see this header and retry the request :slight_smile:

3 Likes

I think ACME has always operated under the assumption that an ACME client has easy access to both the headers and the response. Various parts of ACME are tied to the headers, i.e. Replay-Nonce header, Link header, Location header and of course, Retry-After. This is the "HTTP way" I guess.

3 Likes

I've been in the process of updating the public version of our client, and migrating all internal operations to it. My client caches ARI results.

The new design has these two periodic routines, which I'm still ironing out:

1- Update ARI information - refresh stale ARI data

2- Renew Expiring Certs - renew certs that have exceeded ARI or internal policy limits

These are decoupled from a singular "renew" command, because the renewal process is inherently blocking, and I wanted to avoid a situation where an ARI "replace ASAP" is delayed from being queried/recognized due to the order fulfillment process on less critical Certificates.

I store everything in a database; tests are against sqlite3 but I use PostgreSQL in production.

I am doing daily ARI checks. The retry-after on the server might be 6H, but the nature of the ecosystem is that 24-48H (even 48-96H) should allow for a revoked certificate to be caught and replaced without issue.

5 Likes

Totally agree for today. Seems the point of having a retry-after is so we can automatically update to changes in these patterns. Both with a CA and between them.

Thanks for the background. Sounds like you are in the group ignoring getARI retry-after :slight_smile:

2 Likes

As a subscriber, yes. As a client author, no.

I’m running the routine to update ARI info every 24 hours, but that could be set to run hourly. Only stale data (based on the retry-after) will be updated due to how the logic is coded. I specifically implemented this as command line scripts that could be run by cron to allow this control. I considered daemonizing this (and some others) via Celery, and that may happen in the future, but a command line cron script has a significantly lower knowledge requirement and allows for nearly the same customization.
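
To illustrate the "only stale data gets refreshed" behaviour, here is a small sqlite3 sketch. The `ari_cache` table, its columns, and the `stale_cert_ids` function are hypothetical and are not taken from the client described above; times are stored as epoch seconds to keep the comparison simple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ari_cache ("
    " cert_id TEXT PRIMARY KEY,"
    " retry_after_at INTEGER NOT NULL"  # epoch seconds: earliest allowed re-query
    ")"
)
conn.executemany(
    "INSERT INTO ari_cache (cert_id, retry_after_at) VALUES (?, ?)",
    [("cert-a", 1000), ("cert-b", 5000), ("cert-c", 2000)],
)


def stale_cert_ids(conn: sqlite3.Connection, now_epoch: int) -> list:
    """Return certificates whose stored Retry-After time has already passed."""
    rows = conn.execute(
        "SELECT cert_id FROM ari_cache"
        " WHERE retry_after_at <= ?"
        " ORDER BY retry_after_at",
        (now_epoch,),
    )
    return [row[0] for row in rows]


# A cron run at epoch 2500 would refresh only cert-a and cert-c.
print(stale_cert_ids(conn, 2500))
```

Running the script hourly instead of daily changes nothing in the code; the query decides what is due.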

I’m simply choosing not to use the retry header as the “target time” at this point, and only as the earliest time. When things shift to short profiles, which have tighter revocation requirements, I will likely update the integration. The ARI retry after is currently a single value that is targeted to the revocation needs of (yet to be released) short lived certificates.

Right now it’s not worth the extra noise (logs, emails, etc.) for me, and the overwhelming majority of ACME clients do not use ARI information at all. If I had more personal bandwidth, this would be interesting to iterate on right now. I think we’re still at a time in ARI adoption where weird idiosyncrasies and production quirks are likely to surface, so constant monitoring and oversight is important. I feel better sifting through a once-daily log than through multiple ones.

(By quirks, I mean situations like when people first started realizing pending authz from multi domain certs could compile across runs and wedge an account)

3 Likes

I get it now. You save retry-after as state data but just don't check it very often.

I don't fully agree. ARI is more versatile than just that with options like CA load management.

Ideally we would just do what we're told by this info. That ultimately will give an optimized CA experience. Although, re-reading the draft (again), it says just to re-query ARI "after" the retry-after period, so I guess it is intended to avoid flooding. It doesn't say to retry "at" or "near" the retry-after period :slight_smile:

For the time being I will track the ARI retry-after and alert on any changes. Then I will review and fine-tune the cron schedule accordingly. I don't want a daemon for this use case. I'll revisit as ARI matures.

I still think it is odd to use a retry-after header with a 200 OK response to convey this. Retry-After usually means to retry the HTTP request (with some fault code). I'd prefer to see it in the response data, but that's a separate issue from how to react to the data. As a header it seems more geared to a daemon, which is what made me think about this at all.

1 Like

It is a little odd, and doesn't seem to really be what the specs had envisioned, but it may be the closest thing to a standard there is. It really wasn't always "retry" already, though. MDN and RFC 9110 that it references do say that it's a minimum amount of time to wait for retrying on errors, but also that it's a time to wait before fulfilling a 3xx-series redirect (which really doesn't seem like it fits a "retry"). I'm curious if there are any other standard use cases over HTTP that use Retry-After for a success response. It actually reminds me of the old Refresh header, which is very browser-centric and I don't think was ever really standardized as a part of HTTP, but indicates when one should poll again for the response to have changed.

(Some fun side notes I found while looking at this: Firefox has an open ticket to process Retry-After when getting a 503, which has been open for over 21 years now. I'm sure they'll get to it eventually. And the bottom of the MDN page on the Refresh header links to this mailing list post about the use of the header despite it never really being in the HTTP spec.)

2 Likes

I know you worked with ARI early on. Do you do anything with its retry-after?

1 Like

Just logged it. But that was really just a hobby client, and I've since moved on to just using lego for my systems. I don't think lego exposes or uses the Retry-After value at all, but just expects its user to run it on a cron job separately.

2 Likes

No, it does- emphasis added to the last sentence:

This protocol uses the Retry-After header [RFC9110] to indicate to clients how often to retry. Note that in other HTTP applications, Retry-After often indicates the earliest time to retry a request. In this protocol, it indicates both the earliest time and a target time.

IMHO the protocol implies using a daemon or very frequent runs to achieve the "target time", as the "retry-after" could shift at any moment and enrollment of certificates can happen at any time. Running a "recheck if needed" 1, 2, 3, 4, 12, or 24 times a day will eventually coordinate all the times into buckets, but some of the initial checks will be closer to a target while others will not.

Same here.

2 Likes

My bad, thanks.

Since the retry-after is a response to a specific ARI certId the value could be different for every cert. My imagination can't quite see that happening but ...

Yeah, I'm starting to realize the retry-after isn't something a cron-based client should worry about. It isn't the same as a suggested polling interval if we need to treat it as a target. Cheers

1 Like

I see where some of my confusion arose from:

IETF Draft 3 said:

  1. Otherwise, sleep until the next normal wake time, re-check ARI,
    and return to Step 1.

Draft 4 said:

  1. Otherwise, sleep until the Retry-After period has passed, or
    until the next normal wake time, and return to Step 1.

And the latest said a lot more starting with:

  1. Otherwise, sleep until the time indicated by the Retry-After
    header and return to Step 1.

For all the latest retry-after see: draft-ietf-acme-ari-07 - Automated Certificate Management Environment (ACME) Renewal Information (ARI) Extension

1 Like

A very large shift in the Retry-After header would be rather bad for ARI protocol implementations, as according to the draft RFC, clients should not have to fetch a "fresh" renewalInfo resource from the ACME server, until the Retry-After time has been reached. Thus if the ACME server were to serve Retry-After headers with long intervals, it would undermine the purpose of ARI.

The Retry-After intervals should be chosen so that they do not hamper sudden changes in renewalInfo (e.g., due to the cert being revoked or an upcoming revocation), so not too long, and do not load the ACME server too much, so not too short.

1 Like

The value is different for every cert; the retry header is currently coded to 6 hours from the ARI request, and the ARI response is 2/3 of the cert life. I believe ISRG caches the response to handle the potential of clients that do not respect the header (advised by the RFC).

I interpreted the docs about "sleep" to mean a process or messaging queue that is only concerned with a single certificate. For a client handling multiple certificates, that recommendation becomes difficult to follow, especially as the dominant model for current clients is a script invoked by cron.

I just wanted to share this other bit from the draft, again with emphasis added:

Server choice of Retry-After

Servers set the Retry-After header based on their requirements on how quickly to perform a revocation. For instance, a server that needs to revoke certificates within 24 hours of notification of a problem might choose to reserve twelve hours for investigation, six hours for clients to fetch RenewalInfo, and six hours for clients to perform a renewal. Setting a small value for Retry-After means that clients can respond more quickly, but also incurs more load on the server. Servers should estimate their expected load based on the number of clients, keeping in mind that third parties may also monitor RenewalInfo endpoints.

This 24 hour revocation period is the CA/B forum timeline for critical revocation (e.g. compromise), other methods (such as mass revocations) have 5 days. IMHO, due to implementation details of the TLS ecosystem, a certificate is unlikely to cause issues within 48 hours of revocation. A browser vendor may use proprietary channels to push this through quickly.

I've managed and advised highly trafficked websites that were constant security targets, and they should implement continuous ARI polling now to catch that retry window. Investing the resources on early adoption is a good option for them.

I think most other subscribers will be fine with a short delay. Remember, this is offering a new warning system so any improvements are a net benefit against the current status quo of absolutely no warning at all.

I think my point above got lost in the language. For a client managing multiple certificates, the Retry-After will be different for each certificate as they are based on the query time; and the server can change its recommended logic at any point, so a conformant client and integration needs to be prepared for that. LetsEncrypt might decide that 24 hours is fine, or drop it to 3.

Utilizing the "retry-after" as a target would mean deploying a messaging/dispatch system or using a daemonized process. Most ACME clients are invoked through commandline interfaces for a short lifecycle, so periodic task runs are the only option. This will start to segment the retry-afters into different buckets.

There is also the concern that retry-after could shift based on certificate type; we are expecting 6 hours because that is currently hardcoded into Boulder, but ISRG might decide on different retry protocols for different profiles.

TLDR; Actually utilizing the retry-after as the target is a lot of work with minimal payoff.

2 Likes

Yes, the target value derived from Retry-After will vary since it is an offset. But, what I meant was that Retry-After could be a different value itself for every cert. Perhaps very short in some cases and longer in others, for example. Or, given it can be a timestamp it could clump some together. It is completely up to the CA to do that.

Based on the latest draft I agree Retry-After is designed for a continuously running client, such as one that places the Retry-After derived target time into a scheduler queue and processes that. Not a FIFO queue. Not that you implied that ... just sayin.
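
To make the "scheduler queue, not FIFO" point concrete, here is a minimal sketch using a heap keyed by target time. The `AriScheduler` class is hypothetical and times are plain numbers (think epoch seconds); a real daemon would also sleep until the next target and re-schedule each cert after polling.

```python
import heapq


class AriScheduler:
    """Orders ARI re-checks by target time rather than by insertion order."""

    def __init__(self):
        self._heap = []  # entries are (target_time, cert_id)

    def schedule(self, target_time, cert_id):
        heapq.heappush(self._heap, (target_time, cert_id))

    def pop_due(self, now):
        """Remove and return every cert whose target time has arrived."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due


scheduler = AriScheduler()
scheduler.schedule(300, "cert-c")  # queued first, but due last
scheduler.schedule(100, "cert-a")
scheduler.schedule(200, "cert-b")
print(scheduler.pop_due(250))
```

A FIFO queue would hand back `cert-c` first even though its target is the furthest away, which is why the ordering key matters when each cert can carry a different Retry-After.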

Update: @jvanasco I didn't see your response to Osiris as I wrote the above. Seems like we agree including about this :slight_smile:

2 Likes

The idea with ARI is that cron-based ACME clients would need to increase the number of daily runs anyway, so I don't see any issue with storing the Retry-After header alongside the other ARI data, next to the certs on persistent storage. It doesn't make much sense to store the ARI and not store the Retry-After info. Might as well use it.

Then, when you run your ACME client 8 times a day you can actually use it :slight_smile:

The latest draft says to treat Retry-After as a target time.

To have a cron-based client able to react to every target time is impractical. It would have to run nearly every minute and check those times. The longer the interval between cron runs the longer you "miss" the target time.

That's what Jonathon is talking about (and I agree).

If you only run every 12H, or even 6H or whatever, it isn't much help to look at the prior target time from Retry-After. Just make a fresh ARI request and react to the suggested start time. That is, if your cron frequency doesn't align with the CA Retry-After intervals, you end up just adding extra delay for some of the cert ARI requests.
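
The extra delay can be worked out with a little arithmetic. This sketch assumes an idealized cron client: it runs exactly on its ticks, skips the ARI request whenever the stored Retry-After has not yet elapsed, and the Retry-After is measured from the tick on which the last request was made. The function name is made up for illustration.

```python
import math


def effective_poll_interval(retry_after_hours: int, cron_interval_hours: int) -> int:
    """Hours between ARI requests when polls can only happen on cron ticks.

    The first tick at or after the Retry-After time is the one that polls,
    so the interval is Retry-After rounded up to a multiple of the cron interval.
    """
    ticks_to_wait = math.ceil(retry_after_hours / cron_interval_hours)
    return ticks_to_wait * cron_interval_hours


for cron_hours in (1, 4, 6, 12, 24):
    interval = effective_poll_interval(6, cron_hours)
    print(f"cron every {cron_hours}h: poll every {interval}h, extra delay {interval - 6}h")
```

With a 6H Retry-After, a 4-hourly cron that honors the stored value ends up polling every 8 hours, while the same cron making a fresh request on every run polls every 4. In practice request latency and clock jitter can push a run that lands right on the boundary to the next tick.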

2 Likes

Currently that is true, but it's just an implementation detail. I still track the Retry-After and make decisions based on that. I just have the cron to check ARI data set to a 1x daily run instead of a 2/4/6/8/12/24x daily run. The outcome is guaranteed by the current situation, but I still persist the values and operate on them. Doing this once daily simplifies manual auditing of the logs, which I will continue to do until I feel comfortable that no serious edge cases are likely.

Basically, I have the tooling ready to do this - but there is no real benefit to activating it at this time, so I will slowly enable it as I feel comfortable.

I run ARI checks and Renewals separately, so ARI checks don't block a Renewal. Because I use separate processes, I also check ARI status about an hour before the renewal job runs.

In the event of a revocation, an "immediate renewal" response of Cron-AriCheck will mark the certificate for renewal by setting the "renew" date in the past; the subsequent Cron-Renewal process will sort certificates by expiry, so the recently marked certificate is fast-tracked to be the first replaced order.
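
A simplified sketch of that fast-tracking, not the actual client code. It assumes each certificate carries a single `renew_at` timestamp and that the renewal job orders its work by that field; the post above sorts by expiry, so treat the sort key here as an assumption made to keep the example short. All names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Any date safely in the past works as an "immediately" marker.
LONG_AGO = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass
class Cert:
    name: str
    renew_at: datetime


def mark_for_immediate_renewal(cert: Cert) -> None:
    """What the ARI check job would do on a 'replace now' response."""
    cert.renew_at = LONG_AGO


def renewal_queue(certs: list, now: datetime) -> list:
    """What the renewal job would process: due certs, earliest date first."""
    due = [c for c in certs if c.renew_at <= now]
    return sorted(due, key=lambda c: c.renew_at)


now = datetime(2021, 1, 5, tzinfo=timezone.utc)
certs = [
    Cert("routine", datetime(2021, 1, 4, tzinfo=timezone.utc)),
    Cert("revoked", datetime(2021, 2, 1, tzinfo=timezone.utc)),
    Cert("not-due", datetime(2021, 3, 1, tzinfo=timezone.utc)),
]
mark_for_immediate_renewal(certs[1])
print([c.name for c in renewal_queue(certs, now)])
```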

Edit:

Just to clarify, by "target" I mean "the time the ARI check will actually happen", and not the "earliest time to check after".

1 Like