Thoughts from starting to play with ARI

Probably, but I'm pretty sure certbot still just does 30 days before expiry no matter what kind of duration the CA gives out, unless the user has specifically configured it otherwise. (Could be wrong, maybe they've changed something in a recent version but I glanced through the changelog and didn't see anything.) Wouldn't shock me if most clients were similarly simplistic in their renewal logic.


Continuing from another thread:

For what it's worth, you can revoke your own certificate, and then the ARI information does get updated to a time in the past indicating to renew immediately. So some amount of testing can happen by client developers and other interested people themselves, though certainly it's not quite exactly the same as when the ARI changes "on its own" from the client's perspective like would happen in a CA-initiated event.


The "Updating Renewal Information" spec

I've done a first pass at implementing the "Updating Renewal Information" part of the spec, so now it's time for some thoughts on it.

First, my understanding of the general motivation for why this is part of the spec is this use case:

  1. The CA screws up in some way when issuing a certificate, not checking domain validation right or constructing the certificate wrong or something like that, meaning that the certificate is "bad" and The Rules require it to be revoked either within 1 day or 5 days, depending on various things.
  2. The CA marks the certificate as requiring immediate renewal.
  3. The well-behaving client checks ARI and creates a new certificate to replace the bad one.
  4. Once the new certificate is deployed and "working" (to whatever extent the client can determine that), it can send a message for this new "Update Renewal Info" endpoint to tell the CA that the bad certificate is no longer in use.
  5. The CA can now revoke the certificate even earlier than the deadline.

My main confusion is that I'm not sure about the value of step 5, which means I'm in turn not sure about the value of step 4. Once the certificate is no longer in use by anyone, is there really a problem if the CA waits until its originally-planned time to revoke the certificate? I suppose that it might be good for the CA to know that its subscribers aren't being inconvenienced, but can't it just take a good guess by looking if a certificate was issued for the same domain names by the same ACME account anyway?

What this whole process seems more useful for is unrelated to a CA incident: when a certificate is intentionally being replaced by a new one that has a slightly different set of names. That way the CA knows it doesn't need to send reminder emails for it as it expires. In that way, it's a lot like revoking with a reason of superseded (and I think I recall some discussion in the mailing list archive or the GitHub issues to that effect already), but without the obligation for the CA to distribute that knowledge via OCSP & CRL. That seems a useful concept, to say that "I'm done with this" but really nobody else needs to worry about it, since the certificate isn't in use anyway.

When looking at it that way, it seems a little weird that the message is posted to an endpoint called "renewalInfo". I found it a little awkward trying to describe the process in logging, as "updating renewal info" when really the "renewal info" isn't what's being "updated". It also seems a little weird that the revocation endpoint takes the entire certificate, whereas this renewal info endpoint takes this CertID structure instead. It almost seems like the "I'm done with this" notification to the CA should be something completely unrelated to the ARI spec, and be an extension to (or somehow mirror) the revocation functionality.

What about account key incidents?

So while this whole ARI spec is motivated by issues with certificates, it's also possible for there to be issues with ACME account keys. For instance, some common system has a bad RNG like the Debian Weak Key problem, or someone finds a break in RSA or ECDSA (or at least some implementations thereof), or even just a single account key got publicly exposed and compromised and needs to be added to the CA blocklist. It would be good if there were some way for the CA to notify the client that it needs to rotate its account key rather than the CA just deactivating the account. (This might be even more useful where the account is tied to some external state, like an external account binding or rate limit adjustment, rather than the usual case with Let's Encrypt where just making a new account isn't a big deal.)

After all, the ACME RFC says "Compromise of the private key of an account key pair has more serious consequences than compromise of a private key corresponding to a certificate."


The more I think about this the more it bugs me. If I need to talk to the ACME server about one of the certificates I have, it seems really weird that different APIs would want different things. I would expect everything to just use a hash (a standard cert "thumbprint"), but revoking needs me to send the whole entire cert for some reason while ARI (either getting the window or updating that it was replaced) needs me to send just the OCSP CertID structure. It kind of makes me want to just have ARI data require the same thing, of just encoding the entire cert like in the revoke request, just to have consistency. (Just because that'd be easier than allowing for other formats for revoking at this point.) I think it'd be easier in most languages to encode the whole cert than to build a partial OCSP structure, and existing ACME clients are more likely to have existing functionality to revoke a cert than existing functionality to call OCSP.

This is a bit of a "modest proposal"; I completely understand not wanting a full encoded certificate on a GET request, and that you'd want to keep it a GET in order to make it easy for CAs to implement caching and use their CDNs and so forth. Really it just makes me wonder why revoking needs the entire cert in the first place. I'm guessing that at the time, nobody considered that there might be more APIs added to ACME that would want to reference already-issued certificates. It's just the kind of thing that bugs me since it seems like there ought to be a more elegant solution.

Thanks for reading my rant; feel free to completely ignore this message.


But please don't :pray: :slight_smile:


Multiple CertIDs work for the same certificate

Continuing the discussion from another thread, Implementing ARI / POST issue:

So, I don't know what's an issue with the spec vs. an issue with Boulder's implementation of it, but I've confirmed that Boulder (staging at least) replies the same regardless of whether the AlgorithmIdentifier has the parameters left out entirely, has a null value there, or even has some other random data in there. I feel like in order for CAs to be able to cache responses appropriately, and for clients to be sure of interoperability with CAs, there should be exactly one clear representation for the certificate that is used. If some clients put in the null and others leave it out, then there will be twice as many possible URLs that need to be dealt with. (There's some discussion of the history of whether a NULL parameters value needs to be added to the algorithm for PKCS1 in this Stack Exchange post I found, but I have no idea what RFCs/etc. to chase down in order to figure out if it belongs there or not for OCSP CertID.)

Even more bizarre, Boulder doesn't seem to care what gets put in the issuerNameHash and issuerKeyHash at all! If I leave them as empty octet strings I still get the same response since all Boulder cares about is the serial number. (Like, it even makes sure that SHA-256 is specified, but then doesn't care what the actual hashed data is.) So only testing against Boulder doesn't tell me whether my implementation is actually hashing the right data or not, or if it would work with any other CA. Plus again, each client could be making its own different URL to test which would make it harder for caching and such to work. (Plus I'm curious if it's possible to inject tons of data or otherwise confuse something by putting stuff in those fields that doesn't belong.)

So, if this OCSP CertID is going to be what's used, I think the steps to map a certificate to the value need to be very clearly and explicitly specified, server implementations should make sure that they check for and only allow the one canonical representation, and there should be a bunch of test cases for clients to use to check their work beyond just the one in Appendix A.
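To make the "multiple representations" point concrete, here is a rough sketch of hand-encoding a CertID with and without the NULL parameters. It assumes the URL path segment is the unpadded base64url of the DER bytes, and that the issuer name and key hashes have already been computed elsewhere; the helper names (`der_tlv`, `der_integer`, `cert_id`, `url_segment`) are made up for illustration and are not from any library.

```python
import base64

# DER encoding of the SHA-256 object identifier 2.16.840.1.101.3.4.2.1
SHA256_OID = bytes.fromhex("0609608648016503040201")


def der_tlv(tag: int, content: bytes) -> bytes:
    """Wrap content in a DER tag-length-value triple."""
    n = len(content)
    if n < 0x80:
        length = bytes([n])  # short form length
    else:
        raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
        length = bytes([0x80 | len(raw)]) + raw  # long form length
    return bytes([tag]) + length + content


def der_integer(value: int) -> bytes:
    """Encode a non-negative integer as a DER INTEGER."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative serials")
    # One extra leading zero byte appears exactly when the high bit is set.
    return der_tlv(0x02, value.to_bytes(value.bit_length() // 8 + 1, "big"))


def cert_id(name_hash: bytes, key_hash: bytes, serial: int,
            null_params: bool = False) -> bytes:
    """Build CertID bytes; null_params toggles the explicit NULL parameters."""
    params = b"\x05\x00" if null_params else b""
    algorithm = der_tlv(0x30, SHA256_OID + params)
    body = (algorithm
            + der_tlv(0x04, name_hash)   # issuerNameHash OCTET STRING
            + der_tlv(0x04, key_hash)    # issuerKeyHash OCTET STRING
            + der_integer(serial))       # serialNumber INTEGER
    return der_tlv(0x30, body)


def url_segment(der: bytes) -> str:
    """Unpadded base64url, assumed here to be what goes in the URL path."""
    return base64.urlsafe_b64encode(der).rstrip(b"=").decode("ascii")
```

Calling `cert_id(...)` twice with the same hashes and serial, once with `null_params=True`, gives two byte strings that differ by two bytes and therefore two different URL segments for the same certificate, which is the caching and interoperability concern above.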


Okay, finally getting the time to reply to this thread. Apologies if the things I say here end up disjointed, I'm trying to reply to 26 posts all at the same time :slight_smile:

The URL construction is informed by two factors:

  1. Needing to unambiguously uniquely identify the certificate in question. We can't just use the serial, because that's only guaranteed to be unique on a per-issuer basis, and a single ACME server may issue from multiple issuers.
  2. Not wanting to invent something from whole cloth. An earlier form of the draft used a very different construction. Re-using something that many ACME clients can already do should reduce implementation effort.

It turns out that OCSP already has a mechanism which satisfies both of these criteria. I was told in no uncertain terms by the ACME working group that using the same mechanism as OCSP would be vastly preferable, so I changed the draft.

Interestingly, the desire to use the OCSP structure does not come from our end at all. It was selected based on the belief that it would be comparatively easy for client authors, while still conveying all of the necessary information. Having feedback from client authors that it is not easy is useful.

Now, I do have a personal ideal URL construction method: just take the URL from the Order's "certificate" field, append "/renewalInfo", et voila! This would be pleasantly RESTful. Unfortunately, ACME's URL discovery mechanism disallows this: there's nothing preventing a CA from populating the "certificate" field with a URL that already contains query parameters, so we can't specify anything that involves appending to or otherwise manipulating that value.

Good point! This is either a bug in the spec or a bug in our directory. I'll decide which, and change one of them. Thanks!

As @Nummer378 says, this is already possible simply by looking at the list of all affected certs that all CAs are required to provide as part of their incident reporting process. Also, the next point below is relevant here.

No. I strongly disagree.

This reading is simply incorrect on the security front. If a cert's ARI window changes, that means one thing and one thing only: you should renew during the new window. Maybe the cert will get revoked; maybe the CA is just experiencing a load spike; maybe the CA randomly perturbs renewal windows frequently. You cannot and should not draw security conclusions from the suggested renewal window.

And this reading is also off-base on the reliability front: the CA is the only entity that has a global perspective on its traffic. If it is suggesting that you renew at a certain time, that's because it believes it can handle the load at that time. Don't second-guess it and decide to renew right now instead! Maybe it's trying to get you to avoid a load-spike that's happening right now!
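As a rough sketch of "renew during the window" from the client side: this assumes the window arrives as a start and an end timestamp (my assumption about the response shape) and that the client picks a uniformly random instant inside it, falling back to "now" when the chosen instant or the whole window is already in the past. The function name `pick_renewal_time` is hypothetical.

```python
import random
from datetime import datetime, timedelta, timezone


def pick_renewal_time(start: datetime, end: datetime, now: datetime,
                      rng=random) -> datetime:
    """Choose when to attempt renewal, given a suggested window."""
    if end <= now:
        # The entire window has passed: renew as soon as possible.
        return now
    span = (end - start).total_seconds()
    candidate = start + timedelta(seconds=rng.uniform(0, span))
    # If the window has already opened, never schedule in the past.
    return max(candidate, now)
```

Spreading attempts across the window, rather than everyone renewing at the start of it, is what lets the CA's suggestion actually smooth out load.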

Having an unauthenticated endpoint was explicitly requested by other members of the working group, for the specific purpose of allowing third-party monitors to surface ARI data to Subscribers in ways that most (unmonitored) ACME clients cannot.

This is a good idea and something worth considering. I'm not sure about the batch method for updating, but a batch method for querying makes sense. The difficulty is this: Let's Encrypt diverges from RFC 8555 in not providing a list of all active orders for an account. This is because, in practice, these lists can be huge, expensive to query, and necessitate pagination which is a pain. The same would be true of an endpoint which (in practice) lists all certificates for an account.

Really, pagination sucks.

This is an interesting idea, but not something that can be standardized at the ACME protocol level.

(Continuing the off-topic aside: working on it! This is harder than it seems, and not just from a compliance standpoint. The ACME protocol literally doesn't give us a straightforward way to let clients select between (say) 90-day and 10-day certificate profiles. GTS does this by having clients append query parameters to their directory URL. Anyway, more on this in a different thread later.)

Now this is a fascinating idea. I really like this.

It turns out that revoking (and generating new OCSP responses for) 200,000 certificates takes a long time. Having some of that out of the way already is nice.

But more generally, this endpoint was originally envisioned to enable something that many ACME CAs want but don't currently have: tracking "renewals". The only indication that one certificate might be considered a "renewal" of another is that they are from the same subscriber account and share (most of) the same SANs. Having an endpoint that lets clients explicitly mark one cert as replaced by another would be really cool. Unfortunately, this endpoint in its current form doesn't actually support that, because it doesn't have a way to say "replaced by what". This capability was removed due to conversations with people who pointed out that many clients which manage hundreds of SANs randomly shuffle names between certs and don't have a clear sense of which certs are replacing which others.

I think it's clear that this update endpoint needs to evolve in one direction or the other, either being removed again or taking a "replaced by:" field, but I'm honestly not sure which direction to go at this time.

This is also a good point. But, like you, I think it's a point in favor of "the revocation endpoint should be changed", not a point in favor of "renewalInfo should take the whole cert as input". I've put this note in my big doc of "things I'd change in ACME-bis".

Yep, Boulder only cares that the algorithm is SHA-256. However, including that NULL for the parameters makes the request BER encoded, not DER encoded, so that's a non-conformant request (because the draft specifies DER).

Yep, as explained here, Boulder makes the guarantee that serials are globally unique, not just unique-per-issuer, so we can field the request correctly given only the serial. This is not true for all CAs, so the spec can't rely on that, it's just a short-cut we can take. Apologies that it makes your testing harder, we can certainly make Boulder (or Pebble) stricter.

  • Server: "Dang, I'm having a bad time with all these issuances, much pressure, so busy, let's put a few certs which have ample time to renew a few hours later in the queue..."
  • Server: sigh...

I'll make you a PR to consider this weekend!


No problem at all; I know it's been a busy week for you. :slight_smile:

Just, my first thought on something meeting those two criteria is the SHA-256 of the leaf certificate, which I'm guessing is much more easily available in pretty much every programming language. I'm very curious who thinks that using a part of OCSP would be preferable and would love to be pointed to that discussion so I can understand more about what I'm missing. (I can almost see trying to use some sort of OCSP extension to a different endpoint, or something, since it's getting some concept of certificate status. But using just the one piece out of the OCSP request structure seems really weird to me.) I'm not nearly as familiar with the ins and outs of certificate structures as many other people here, though.
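For comparison, this is about all it takes to get that kind of thumbprint with only the Python standard library. It's a sketch that assumes the input is a single PEM certificate block; `cert_thumbprint` is just a name I picked.

```python
import base64
import hashlib


def cert_thumbprint(pem_text: str) -> str:
    """Return the hex SHA-256 of the DER bytes inside one PEM block."""
    lines = [line.strip() for line in pem_text.strip().splitlines()]
    # Drop the BEGIN/END marker lines and join the base64 body.
    body = "".join(l for l in lines if l and not l.startswith("-----"))
    der = base64.b64decode(body)
    return hashlib.sha256(der).hexdigest()
```

No ASN.1 handling and no intermediate certificate are needed, which is the contrast with building a CertID.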

Well, you could certainly add another field to the order with a link to the renewal info. That means that clients need to store that URL separately, though, and can't just get the status from the certificate file itself like many clients and existing outside-of-a-client monitoring systems do. (Incidentally, this is one of the pain points of the current draft: In addition to having the leaf certificate, a system needs to have the intermediate certificate as well.)

If you want to get really fancy, you could use a RFC 6570 URL Template or the like in the directory, to have the server specify the method of constructing the URL, rather than relying on concatenating exact strings in places. Not sure it'd be worth the hassle of bringing in yet another spec just for this, though.

I think Matthew's point is more that from a client's perspective, if it's known that a CA is having some sort of capacity problems, attempting to renew now (even if there is a current load spike) doesn't leave the client any worse off than it was before (since worst case it just fails to renew and has to try again when originally planned), and may leave it better off (since if it does manage to grab a certificate, it's all set for another 90 days (or whatever the cert lifetime is) and can hopefully ride out whatever current load problems the CA may be experiencing). That's why I likened it to a "run on the bank", where by exposing the possibility of there being a minor problem, the game-theory-optimal play may be for everyone to try to get their certificate immediately, leading to an even worse problem than if the CA didn't "announce" that there could be a problem at all.

In practice it may not matter nearly as much as, say, current broken clients that just get a brand new cert every single day up to whatever the rate limits are, but I think it's worth considering how to ensure the CA server & client interests align as much as possible.

But would it still be hard to just not error upon a client sending the notAfter field in the order request, even if it was just ignored and did 90 days anyway? It's just a protocol divergence that means clients can't send the same request to all CAs.

That does help explain some of the motivation; thanks. (It's sometimes hard to appreciate the full scale that LE deals with.)

I don't suppose this document is public? (Totally fine if it isn't, just being nosy.)

Then shouldn't the server reject it as non-conformant?

It's not entirely about making testing harder (though that's certainly one thing and it'd be good if Pebble or something similar could give a bunch of strict test cases for it); I think it's also about making the CA's life harder if multiple requests for the same certificate need to be separately tracked/generated/cached/etc. when they should all in theory be the same.


This is how Draft 0 was structured, and we've explicitly moved away from that based on feedback, yeah.

Attempting to renew now does leave it worse off, because it exacerbates any current load spike. The whole point of ARI is to let the central authority mitigate the tragedy of the commons. Don't write clients which think they're special and therefore should disregard the suggestions and, in doing so, make things worse for everyone.

Let's Encrypt actually doesn't diverge from RFC 8555 here. We're implementing the spec:

notBefore (optional, string): The requested value of the notBefore field in the certificate, in the date format defined in [RFC3339].
notAfter (optional, string): The requested value of the notAfter field in the certificate, in the date format defined in [RFC3339].

The server MUST return an error if it cannot fulfill the request as
specified, and it MUST NOT issue a certificate with contents other
than those requested.

The notBefore and notAfter fields are optional. But if they are supplied, then they MUST be respected. We cannot respect them (because we have to carefully set the notBefore ourselves based on allowable backdating practices, and we have to set the notAfter based on that notBefore), so we cannot allow them to be supplied.

Sorry, it's literally just a note-to-self. No promises about actually kicking off work on an ACME-bis. But I'm personally kinda leaning that direction. It would be a huge project, and that alone makes it hard.


Thank you for the reply!

But what if you can't :no_mouth:

Maybe it's rate limits (having to squish lots of renewals together, since the client has relinquished control over its own scheduling, even though it knows its own load better than the CA), or because of CA unavailability during the window.

Both are quite plausible, especially if most clients don't support ARI in the first place. Then those which do get punished because they waited too long to try renewing.

And here's the thing, we ACME client developers get blasted -- HARD -- if our client doesn't do everything in its power to keep people's sites up.

And the client is the only entity that has a global perspective of which certificates it needs to renew, when. :thinking:

Just saying, it goes both ways.

I strongly think clients should be in charge of their scheduling.

See my IETF letter which suggests a refined version of this, where certain certificates need only be described by attributes, not necessarily needing to enumerate ALL certificates.

Interesting! I thought the CSR could include this information. Maybe not?

Oh, I think we can solve this -- see my email to the working group, specifically suggestion (A) at the bottom where we use Retry-After header and an expanded Order object.

Neat, I didn't know that. I suspect most developers don't either :sweat_smile:

@jvanasco @aarongable

So what about this, my suggestion in the IETF email:

Instead of a totally separate flow to obtain ARI, simply utilize a
Retry-After header in the flow of existing ACME responses. Upon finalizing
an order, the ACME server can respond with a Retry-After header which acts
as the current-draft Retry-After header for ARI responses. The client then
attempts renewal at/after the Retry-After time, but with the OCSP CertID
added to the NewOrder object; this indicates to the ACME server that the
client is asking if now is a good time to renew the certificate indicated
by the CertID. If it's not a good time, the ACME server can reply as such,
with another Retry-After, and the client then waits and repeats, until the
server actually issues the certificate. If the client needs the certificate
immediately, simply omit the CertID from the NewOrder and the normal,
"non-ARI" flow is assumed. This is backwards-compatible and requires no
additional infrastructure or endpoints.

This solves many problems, including the ability for the CA to know which certificate explicitly replaced which other certificate, and it's backwards-compatible.

This is on the right track I think, but I also think we could implement ARI semantics without needing a separate endpoint entirely. Just add Retry-After to the ACME response when an order is finalized, and have the ACME client try again at/after that time, and all the client has to do then is add the OCSP CertID to the Order object. If the CA sees that, and it's not a good time from the CA's perspective, it can just say "not right now, but Retry-After..." and repeat, until it's a good time to renew.
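The client side of that is small. Here's a sketch of turning a Retry-After value into a "don't try before" time, assuming the header carries either a number of seconds or an HTTP date (the two standard HTTP forms); `retry_after_deadline` is a hypothetical helper name, and the rest of the proposed flow is not modelled.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def retry_after_deadline(header_value: str, now: datetime) -> datetime:
    """Convert a Retry-After header value into an absolute time."""
    value = header_value.strip()
    if value.isdigit():
        # Delay-seconds form, relative to when the response was received.
        return now + timedelta(seconds=int(value))
    # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    return parsedate_to_datetime(value)
```

The client would store the returned time next to the certificate and only send the next order (with the CertID attached) once that time has passed, repeating if the server answers with another Retry-After.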

Still doesn't hurt to try right now, and if the request is rejected, we back off and try again later.


This reminds me of my convex optimization class in grad school. Linear programming to find the optimal point satisfying multiple constraints. I wonder if there's a way to model this.

Wouldn't the CA just return a 429?

I think that only works if ARI is required/enforced.

Not special, just competing. But we wouldn't have to compete if ARI was required or if everyone implemented it. (But everyone won't. Just like practically no servers actually implemented OCSP stapling.)

I'd like to suggest making the ARI draft less permissive with regard to ignoring the window. Currently, it can be read as just a light suggestion which may be easily ignored.


I'm not sure how you'd phrase it otherwise, though. I can make a client which is "ACME-compliant" but not "ARI-compliant" but still query the ARI endpoint to inform my renewal decisions, even if I want to use a different algorithm.


It might not hurt you, but if your unnecessary (because the cert has 25 more days left until expiry) renewal goes through, you might just have blocked issuance for a client which got a rejection from the server for a certificate with perhaps just a few more minutes left!

So what you're suggesting now is very selfish and goes against everything ARI tries to prevent.

First, take away any possible ambiguity. Currently, the draft states:

Conforming clients MUST attempt renewal at a time of their choosing based on the suggested renewal window.

I'd rephrase that as:

Conforming clients MUST attempt renewal at a time of their choosing within the provided renewal window.

Secondly, the whole name "suggestedWindow" is a misnomer if you ask me, as it leaves it open for interpretation that the client might try to renew earlier or later than the window. I'd rather call it "renewalWindow" or something similar.


That's basically what all current ACME clients do. I don't think it's good faith to call the behavior "selfish" when you yourself benefit from this. :wink:

Ok. That doesn't preclude attempting outside the renewal window, so that wording is fine with me.

I meant deliberately ignoring an ARI window. ARI-ignorant clients wouldn't know better. ARI-aware clients should, though. It's the developer's choice to ignore the ARI window for "the good of the client", which could ultimately hinder other clients, and that is what I would call selfish.

Fair point, not what I had in mind, so let's update it again:

Conforming clients MUST attempt renewal at a time of their choosing within the provided renewal window and MUST NOT attempt renewal outside of the provided window.


To be really, really clear: no one's intention is to implement ARI and then ignore the window.

We'll definitely be watching the window, and taking it into account when scheduling renewals.

Well, that can't really be what's meant, since after the window, if my prior renewals haven't worked, then I should still be allowed to keep trying. And I might want to explicitly renew early, if I know that my own network will likely be down or congested during the supplied window or something like that.

And as has been said earlier "renewal" isn't always a well-defined concept, such as if there are a lot of certificates mapping to a lot of sites, with names shuffled around depending on what sites need them at any given time.

On to something completely different:

A "Priority" concept

I think this might have been discussed in some other venues, but just to add to this already-too-long thread: I think it'd be good if there was a concept of "priority", where either

  1. A client can specify that it's fine with a lower-priority window
  2. The server can specify that the renewal time in the past isn't just a "suggestion", but that the cert will be revoked before expiration.
  3. Or maybe both :slight_smile:

The use case is for when I have a site where I'm confident I don't need the full 30 days to resolve a problem if renewal fails. Maybe I'm fine with just 25 days, or 20. For instance, if this is a "dev" or "testing" server, and I'm fine with helping the CA out by not needing as much overlap time. Or maybe even for a paid-for CA using ACME that will charge less if there's a smaller overlap.

If I previously based my logic on the cert being 25 days before expiration, and want to start using ARI, once the usual CA window passes (30 days before expiration, in Boulder's current implementation) then my client has no way of knowing whether the suggested window is in the past because that's just the CA using its default setting, or whether it's because there's an imminent revocation on the way.

So it might be good if I could tell the server that I want a "relaxed" window or a "conservative" window. And it might be good if the server could tell me that its suggestion is in the past due to "just because" or "because you need to renew now".