Thoughts from starting to play with ARI

So, seeing that the recent incident used ARI updates to tell affected people to renew early, I was inspired (even though I wasn't affected) to take some time this weekend and finally play around with getting the data.

At this point all I did was make a script to try to get the information; I haven't worked on integrating it into my weird custom workflow or anything yet. So here are a few random thoughts from first trying to see how to implement it.

Building the CertID

My first thought is, yikes, building the URL to hit is much more convoluted than it feels like it needs to be. Some of it may be that I tried my script in Node.js (and JavaScript isn't known for making it easy to deal with low-level certificate primitives), but I suspect it may apply to many languages, especially if people are trying to wrap something around their existing issuance infrastructure rather than integrating directly into a client. The core issue is that the URL you need to construct is based on an OCSP structure identifying the certificate, which requires taking one's existing certificate and parsing out the serial number and issuer, and also taking the intermediate certificate that signed it and getting its public key too. So rather than just, like, using the fingerprint of the existing leaf or something similarly simple that a lot of tooling can already give you, one needs to really dig into both the leaf and the intermediate, hash various pieces thereof, and then take all that to build a new ASN.1 structure.

I can only imagine that it's specified the way it is to make it easier for Let's Encrypt (and other CAs presumably) to reutilize some portion of what they already do for OCSP, since I can't imagine it makes it easier on any client authors. Though perhaps there are clients that already check OCSP that can pull out the CertID easily enough (maybe OpenSSL or some other common popular library gives it out of the box, I don't know, I just know that the framework I was using doesn't). I did get it working eventually, though (and got to learn a lot more about ASN.1 in the process).

Getting renewalInfo from the directory

Second: The spec says that "The full request URL is computed by concatenating the renewalInfo URL from the server's directory with a forward slash and the base64url-encoded bytes of a DER-encoded CertID…"

That is, renewalInfoUrl + '/' + encodedCertId. However, the actual directory entry in the Let's Encrypt directory (both staging and production) already ends in a slash. So I think that by the spec there should be an additional slash added to the end, like https://acme-staging-v02.api.letsencrypt.org/draft-ietf-acme-ari-01/renewalInfo//MGgwDQYJYIZ… with two slashes between the renewalInfo and the abomination of a path I needed to construct. In practice, though, it works without the extra slash as well. I think in this case the spec is probably right and the Let's Encrypt directory should be updated to remove the slash at the end, matching how the other entries in the directory generally don't have one either.
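A minimal sketch of one way a script could sidestep the question: strip a single trailing slash from the directory value before appending the CertID, so the result has exactly one slash either way. This assumes the server accepts the single-slash form (which is what was observed above). The function name ari_join_url and the example.test host are made up for illustration.

```shell
# Join the directory's renewalInfo URL and an encoded CertID with exactly one
# slash between them, whether or not the directory entry ends in a slash.
# (Hypothetical helper; both arguments are supplied by the caller.)
ari_join_url() {
    ari_base=${1%/}    # drop one trailing slash, if present
    printf '%s/%s\n' "$ari_base" "$2"
}

# Both calls print the same URL:
ari_join_url 'https://example.test/renewalInfo/' 'MGgwDQYJYIZ'
ari_join_url 'https://example.test/renewalInfo' 'MGgwDQYJYIZ'
```

Note that this deliberately deviates from a literal reading of the spec text, which would yield the double slash.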

"Security Considerations"

So there's none of the usual ACME nonces and signing and such, which certainly makes things easier, especially if one is trying to run some process outside of one's regular ACME client to just keep an eye on needing to renew early due to an incident. All the information needed to look up the renewal info for someone's certificate is in the public certificate transparency logs, and the spec specifically says that anyone can look up anybody's information "because the information contained in renewalInfo resources is not considered confidential". I wonder though, if it could be used to tell people whether someone else's certificate was affected by an incident. Usually this wouldn't be a huge deal, I agree, but it potentially exposes what validation method a system uses or the like. That is, if an incident happens that affects all users in subset X, anyone can then check whether anybody's site is in that subset X. And if that subset is based on running a particular vulnerable web server or ACME client, that might expose information to an attacker that the subscriber might prefer not be exposed so easily.

I'm not really objecting, or saying that anything should necessarily change, just that if it turns out that there's some incident along the lines of "ACME Client ABC uses a bad RNG and its certs should all be revoked", if a CA uses ARI to mark them all as needing renewal immediately, someone could find out which sites were using the vulnerable version of ACME Client ABC, and that should be something people are aware of as this starts to roll out.

I'm sure I'll have more thoughts as I play around with it more. Thank you!

9 Likes

I'm not sure how applicable this scenario is. Any CA incident that involves the CA/browser forum already requires a list of affected certificates to be made public. During past incidents (i.e. those predating ARI) Let's Encrypt has commonly made lists (often CSV files) available for all affected serials and/or crt.sh ids, like in this incident. The scenario you describe can only apply if an incident occurs that does not involve the CAB and where a legitimate interest exists in keeping this information undisclosed.

In general, probing for vulnerabilities can always be done some way or another. In your hypothetical example, I could just assume that all certificates use weak keys and try to bruteforce them. Thus, I would be able to figure this out no matter if affected certs are public or not. Note that this always assumes that you're targeting specific sites, because a full bruteforce of all serials is always infeasible.

You can also argue that having serials fully public (i.e. not only being able to probe a given site via ARI, but actually having the full list of affected certs) is useful in many cases. For example, during the above linked incident I actually contacted site owners I knew were affected but hadn't renewed yet. This was only feasible because I could search the list of affected certificates by FQDN, so I could scan that for domains I knew.

7 Likes

Yeah, that's all a good point. I somehow forgot that all the certs involved were generally public in an incident anyway.

I could imagine them wanting some sort of moratorium before releasing a public list, in the case of it being related to a vulnerable web server or ACME client or something, where in addition to needing to revoke there is some other remotely-exploitable vulnerability on the servers, and I guess all I'm saying is that ARI is one more place where such a list could be "public" and might need to be handled with care in some situations.

But the more likely case is something like the more recent incident, where really servers aren't at risk and it's just that anyone who happened to get a cert at the wrong time just needs to get a new one and that's the end of it. And it's really exciting to see how ARI might help with that.

And on a slightly different note, I think I've just managed to integrate it into my hacky homebrew certificate issuance solution, so I can try running my Lambda function every day rather than having it on a cron job to just run every 2 months. All I'm checking for now is whether the current time is after the start of the suggested window, rather than following the recommended algorithm of picking a random time within the window, but that's probably more than good enough for my couple of certs. Not bad for a few-hours-on-the-weekend project.
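For comparison, here is a rough sketch of the "pick a random time within the window" approach for a job that runs on a fixed interval. It follows my reading of the draft's recommended algorithm (renew if the chosen time is already past or would fall before the next run) and is not taken from any real client. The function names are made up, and all times are assumed to already be epoch seconds (for example via GNU date --date=... +%s).

```shell
# Pick a uniformly random instant (epoch seconds) inside [start, end].
# The ( ) function body runs in a subshell, so the variables stay local.
pick_renewal_time() (
    start=$1 end=$2
    span=$((end - start + 1))
    # awk's rand() is seeded from the clock: fine for spreading load, not for crypto
    offset=$(awk -v n="$span" 'BEGIN { srand(); printf "%d\n", int(rand() * n) }')
    echo $((start + offset))
)

# Succeed (exit 0) if a job that runs every `interval` seconds should renew now:
# the chosen instant is already past, or falls before the next scheduled run.
should_renew_now() (
    now=$1 start=$2 end=$3 interval=$4
    chosen=$(pick_renewal_time "$start" "$end")
    [ "$chosen" -lt $((now + interval)) ]
)
```

A daily job might call it as should_renew_now "$(date +%s)" "$start_ts" "$end_ts" 86400 && echo renew. A real job would probably want to store the chosen time rather than re-rolling it on every run, since re-rolling biases renewal toward the start of the window.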

8 Likes

From my own observations the ARI API design strongly lends itself to serving results without touching a database, as the current renewal could just exist as a json file per certid in some (geo-replicated, simple) file storage.

This is also why I think the update renewal info call exists. Obviously a CA would know a cert has been renewed, but providing a hook for clients to tell the backend to refresh its cached renewal info is handy because again you wouldn't have to touch a database.

7 Likes

Oh, your post strongly resonates with me!

I have the same thoughts. If ARI is intended to out-live revocation (and thus OCSP), I don't think it makes sense to model ARI after OCSP to maintain some sort of elegant parity, because once OCSP is gone, well now you just have a really complex algorithm just to make an HTTP request.

I don't understand the technical motivation either -- I'm sure @aarongable has some insights -- but if it's primarily elegance, I would like to file a feature request: easier construction of the URI. :slight_smile:

I struggled with this when I implemented ARI because my ACME client, ACMEz, does not do OCSP in and of itself: the package is purely an ACME implementation. However, CertMagic, one level higher (the package that does maintenance and renewals using ACMEz), does implement OCSP stapling. Indeed, I was a little frustrated at having to implement OCSP logic at multiple layers of abstraction.

Also -- and this is a Go-specific thing -- very special, error-prone OCSP code is unexported in Go's ocsp package, so I had to copy it out and modify it, which was a bit tedious. (I opened an issue to request it be exported, but I doubt that'll happen.)

I noticed this too but didn't say anything, so I'm glad you brought this up.

Aha, I've also wondered about this. I am certain we will see certificate monitoring services (e.g. CertSpotter) also scraping and monitoring ARI. We will 100% for sure have third-parties learning about ARI updates probably faster than the relevant ARI-conforming ACME clients.

This touches on some points I still have confusion about. I raised concerns in a previous topic regarding rate limits -- which are still concerns -- but more fundamental questions remain:

  • If we think the certificate is going to be revoked in the future, why continue to trust it? i.e. if there was a misissuance or a key was compromised, we should stop trusting it now, not later. (I've heard all the "well it's just policy most of the time, not a security concern" arguments -- but I'm not convinced, since the policies have to be enforced to maintain security.) Basically ARI becomes another form of revocation!
  • If we think the CA is going to experience congestion soon, then why wait to renew? A narrower renewal window lowers our chances of getting a certificate than does trying right away with the same well-mannered exponential backoff (that I assume most clients aren't doing anyway because they're cron jobs) -- especially at a time when we know the CA is expecting higher loads.

The point is, if the renewal window changes, something is wrong and for optimal reliability and security, renew now.

So, if my ACME clients do support ARI, we'll probably try renewing right away if we see the renewal window move.

I am also curious how many times ARI will be used (i.e. change the renewal window) for:

  • congestion
  • revocation
  • something else -- are there any other reasons?

Right now revocation leads 1-0.

And yeah, it will be interesting to see what transparency monitors do with ARI stats.


Overall, I will add my experience to yours: I found that implementing the basic ARI client code is not particularly pleasant; implementing ACME itself was about the same amount of complexity (in terms of constructing API calls), but with ACME the reward is much more significant. With ARI, it felt a little anticlimactic. :upside_down_face:

I like the idea of a way to know "you should renew your certificate now," but done differently.

Part of the reason this is complicated is because revocation is already broken. If we had short-lived certs, we likely wouldn't need ARI. Congestion would be a given (no matter what), and revocation wouldn't be useful.

My ARI wishlist:

  • An authenticated endpoint. This prevents clients and transparency monitors from using it as a signal for revocation; i.e. another form of revocation. It keeps ARI true to its purpose: to tell the ACME client when to renew, and to get a signal from the client when the cert has been replaced.
  • An easier way to get ARI info. Mentioned above already. It's too hard to craft the request, which isn't even authenticated.
  • Easier for clients to scale with reduced network traffic. Two API calls per certificate is tedious and noisy. Some clients of ours manage tens of thousands of certificates. I'd much rather see a single endpoint that lists certificates with renewal windows that have changed, and a batch method for clients to update ARI status for their certificates: maybe a JSON array with all the cert IDs / serial nos. in a single request. To keep things lightweight for the CA, this could be a static JSON doc that's updated every few minutes or hour.
  • Renewal window change should basically be "renew ASAP." For reasons mentioned above, it doesn't make much sense to say "there is or soon will be a problem with this cert" and continue serving a certificate that (a) isn't likely to renew successfully at first, or (b) already has reason to be distrusted. I think any certs appearing in the list at the ARI endpoint should be considered "at risk" and replaced right away. That way we're not serving certs that have a known compliance or security issue.
5 Likes

Yeah, I would love to understand the technical motivation here, and can only imagine it has something to do with how Let's Encrypt already handles having an OCSP entry for every cert and this was the easiest way to add it. I don't see how building an OCSP CertID gives you any better uniqueness/security/whatever over just using the leaf sha-256 thumbprint. I mean, I guess the CertID approach includes a way to update the hash algorithms if someone finds a break in SHA-256, but as it is there's no way to figure out which hash algorithms the server supports. And you could include the hash algorithm in the URL in some way …renewal-info/sha-256/1234abcdef… or the like to make it more forwards-compatible anyway, in a much easier way.
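To make the comparison concrete, here is a sketch of how little work a fingerprint-based path would take. To be clear, this URL scheme is purely the hypothetical floated above, not anything in the draft or supported by any CA. The function name is made up, and its input is the line printed by openssl x509 -noout -fingerprint -sha256 (the label before the "=" varies by OpenSSL version, so only the part after it is used).

```shell
# Turn an openssl fingerprint line such as "sha256 Fingerprint=12:34:AB:..."
# into a HYPOTHETICAL path segment of the form "sha-256/<lowercase hex>".
fingerprint_to_path() {
    fp_hex=$(printf '%s\n' "$1" | sed 's/^.*=//' | tr -d ':' | tr '[:upper:]' '[:lower:]')
    printf 'sha-256/%s\n' "$fp_hex"
}

# Shortened fake fingerprint, just to show the transformation:
fingerprint_to_path 'sha256 Fingerprint=12:34:AB:CD:EF'
```

Compare that with the CertID route, which needs both certificates, two hashes, and a DER encoder.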

As a concrete example, I currently have a daily cron script on my router (a Ubiquiti ER-X, which can do a lot, but as a small system, installing software on it can sometimes be a pain). It figures out whether it needs to send a message to my central renewal system by running an easy

if openssl x509 -noout -checkend 2592000 -in "${certpath}"
then
  #Nothing to do yet
  exit 0
fi

which sees if we're within 30 days of expiration. I'd love to change this to use ARI instead, but I don't think it'd be very easy to figure out the right incantation of openssl commands to build the request, if that's even possible. Once I had the path to use, using curl & jq to parse the result wouldn't be that bad. But something involving openssl x509 -fingerprint -sha256 or the like sure seems easier.

And being able to have a "sample code" or "reference implementation" or whatever you want to call it, in some common languages (shell script, Python, Java, Powershell, or whatever the cool kids are using these days) would be really helpful for client authors. Again, especially if it's something you can wrap around an existing client that can just tell that client to --force early.

Only if the window changes to a time in the past, if I understand the recommended algorithm in the draft spec correctly. I could certainly imagine the renewal window changing without it really indicating a problem. For instance, say the CA was expecting an extended (~24 hr.) issuance outage in the future to do some database maintenance, or one of the main datacenters was going to be offline for network maintenance so issuance capacity was going to be significantly reduced. The CA could then change windows so that clients were suggested to renew before or after that time, without it really indicating any sort of problem. I mean yeah, any well-behaved client could just try during the downtime window, get a 503 or fail to connect or whatever, and then try again later. But specifically telling the client that it should try earlier than its usual renewal date might be helpful.

But yeah, if the renewal window changes to a time in the past, then yes renew ASAP.

I see it kind of the other way around; without ARI it's harder for the CA to issue shorter-lived certs, since many current clients will renew 30 days from expiration (or maybe even worse, like 60 days from issuance) regardless of how long the certificate will be valid for. I think ARI becoming popular will be part of the process, since once it's common for clients to check ARI for renewal times, the CA can just change to 60-day or 30-day certificates without the client even caring about the difference. It might even be able to lower lifetimes only for well-behaved clients that are showing that they always renew near the suggested window, keeping 90-day certs for the cases where the CA can't figure out when they're set to renew.

6 Likes

My quick and dirty script (no error handling at all...) for checking ARI periodically in my monitoring system looks something like this:

#!/bin/bash

readonly USER_AGENT='ARI-Check/0.0.1.dev0'

lencr_ari_uri=$(curl -A "${USER_AGENT}" -Ls https://acme-v02.api.letsencrypt.org/directory | jq -r .renewalInfo)

now_ts=$(date +%s)
          # Get raw OCSP request without nonce and using SHA-256 hash (default SHA-1 is not supported by LE ARI)
cert_id=$(openssl ocsp -sha256 -issuer chain.pem -cert cert.pem -reqout /dev/stdout -no_nonce | \
                # Extract raw Cert ID from OCSP request (by skipping OCSP request wrapping)
                openssl asn1parse -in /dev/stdin -inform DER -strparse +8 -out /dev/stdout -noout | \
                # Encode raw Cert ID as URL-safe Base64
                base64 -w 0 | tr '/+' '_-' | tr -d '=')
ari_url="${lencr_ari_uri}/${cert_id}"
window_end_ts=$(date --date="$(curl -A "${USER_AGENT}" -Ls "${ari_url}" | jq -r .suggestedWindow.end)" +%s)
ts_diff=$((window_end_ts - now_ts))
if (( ts_diff > 0 )); then
        echo 'OK'
        exit 0
else
        echo 'CRITICAL - Immediate renewal needed'
        exit 1
fi
4 Likes

Thank you so much! I'm very happy to see that there is an openssl incantation to do it, especially for exactly these kinds of use cases of quick monitoring outside of one's "main" client.

6 Likes

Right, so this is the congestion control motivation for ARI: avoid a potential thundering herd of clients after the server resumes operations following maintenance.

However, the planned maintenance means that there's a smaller window of opportunity to renew a cert, and we can expect increased traffic load and higher likelihood of reduced availability due to load, especially if the CA is signaling resource limitations by moving ARI windows around.

Which leads me to:

Thus, from a client perspective, any change in the window indicates a problem of sorts -- whether forwards or backwards. The problem is either with the certificate or with the CA, but either way, you want the highest chance of getting your cert in time. If you wait until a later window, you have less time to renew it and less time to retry in case there is a problem. :bangbang:

So, my own clients probably will never intentionally move their window later in time. I just don't think that'll be helpful for uptime. The more chances you have to retry a failed renewal, the better. We use exponential backoff, of course -- as any well-behaved client does, and we do not run on a cron job, so there's not any extra burden on CA servers.
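For reference, the kind of capped exponential backoff mentioned here might look like the following sketch. The function name, base delay, and cap are arbitrary choices for illustration, not the values ACMEz or CertMagic actually use.

```shell
# Print the delay in seconds before retry number `attempt` (0-based):
# base * 2^attempt, capped at `cap`.
backoff_delay() (
    attempt=$1 base=$2 cap=$3
    delay=$base
    i=0
    # stop doubling as soon as the cap is reached, so the value cannot overflow
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt "$cap" ]; then
        delay=$cap
    fi
    echo "$delay"
)
```

With a 60-second base and a one-hour cap, backoff_delay 3 60 3600 gives 480 and backoff_delay 10 60 3600 gives 3600. Real clients usually add some random jitter on top so that retries from many clients don't line up.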

If clients are timing renewals based on how long since issuance instead of how long until expiration, I think that's a bug that clients need to fix. I don't think adding more infrastructure is going to solve that.

ARI as a renewal hint is basically a re-implementation of OCSP stapling.

Anyway, I'm really glad we're discussing this. Thanks for sharing your experiences and perspective!

4 Likes

Ah, I think I understand your point (also after having reread your post in your prior thread on rate limits discussing this). I think some thought has to be put into how to avoid a "tragedy of the commons" or "run on the bank" type situation, where any sort of notification that future renewals might be more challenging than normal (which as you say, any window change in any direction might mean) then means that everyone tries to renew immediately, to get their own certs while the getting is good, which can lead to the very clustering/capacity problem that telling clients a preferred time was trying to avoid. It might be tricky to figure out how to align everybody's incentives correctly.

Thinking on it more, with the current system, even without ARI, a "greedy" client is incentivized to basically renew all certificates as soon as possible, up to the CA's rate limits (like, renew each cert after only a day-and-a-half or so), and on failure to retry immediately and constantly (again, up to whatever rate limits are in place). That's what would always allow for the maximum time to resolve any problems and keep the highest chance of having a certificate renewed in time. But there's some sort of expectation that "well-behaved" clients wouldn't act that way, since it would be "abuse" of the CA's resources. So I feel like if a CA moves their suggested ARI window ahead or behind by 1 day or so (for a cert with plenty of time left before that window), a "well-behaved" client would want to follow that, just like it would want to do exponential backoff. But I can see how in the general case, what we really need to do is revamp how rate limits work and interact with clients, which they're working on but is certainly not a trivial problem (as it's much more than a technical issue but more of an ecosystem/society one).

5 Likes

True!

I think to avoid this, the CA could add affected certificates to the renewal window change list incrementally, rather than all at once (for large events).

Another thought: if ARI was a list instead of one-endpoint-per-cert, the list wouldn't have to enumerate serial numbers, it could also have a mechanism for describing ranges of NotAfter dates. ACME clients already have this information, and it wouldn't involve a list with a million entries.

Yep, and I guess that's one reason why rate limits exist. I do think there's a fairly intuitive/obvious balance between too greedy and too complacent; and best-effort renewal 30 days out is pretty well in the middle there.

I think ARI for short-lived certificates (say, 7 days or less) is an interesting thought experiment. Does ARI have real utility then?

Keeping with the standard "renew with ~1/3 lifetime remaining", certs would be renewed in the nominal case after about 4-5 days. At < 7 days total lifetime, revocation is nearly impractical since it would take half that time to notice and respond to an incident. The cert would be renewed a day later anyway. And even if ARI is used, it might still take a day or so for the client to check the ARI endpoint again.

Given that ARI is pulled/polled, and not pushed, I feel like once lifetimes get short, you actually don't need ARI. With revocation out, its only utility is congestion control. But you'd have to refresh ARI every few hours instead of every few days, and you'd only be able to slide the window a little bit. I think the current implementation of ARI definitely doesn't make sense for short-lived certs, but I could see a case for a single ARI endpoint that says, "You should renew these certs now" rather than a very, very narrow and concrete timeframe.

But either way, I think it's clear/obvious that ARI needs a batching approach.

2 Likes

Another random thought I just had:

It would be good if the recommended algorithm gave better guidance on handling errors. (Maybe this is more about ACME in general than ARI specifically.) It just occurred to me that if a certificate is expired, probably the workflow should just try to renew directly rather than querying ARI at all (which I'm guessing would give a 404 probably?). But other errors querying ARI probably mean that one should fall back to a "traditional" two-thirds-through-lifetime workflow. (Maybe only after getting a couple failures, with some sort of backoff?) I was looking forward to just offloading all my renewal timing logic to Let's Encrypt servers, but it looks like in order to really do it "right" one needs to do both ARI and look at the current certificate lifetime, and combine them in some intelligent way.
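Here is a rough sketch of what combining the two could look like. This is just one possible policy, not something the draft prescribes: skip ARI for an expired cert, trust the ARI window start when the lookup worked, and otherwise fall back to two-thirds of the way through the lifetime. The function name and the convention of passing an empty string when the ARI lookup failed are assumptions for illustration, and all times are epoch seconds.

```shell
# Print "renew" or "wait".
#   $1 now, $2 notBefore, $3 notAfter, $4 ARI window start ("" if lookup failed)
renewal_decision() (
    now=$1 not_before=$2 not_after=$3 ari_start=$4
    if [ "$now" -ge "$not_after" ]; then
        echo renew    # already expired: don't bother asking ARI
    elif [ -n "$ari_start" ]; then
        # ARI answered: renew once the suggested window has opened
        if [ "$now" -ge "$ari_start" ]; then echo renew; else echo wait; fi
    else
        # ARI unavailable: traditional two-thirds-through-lifetime rule
        fallback=$((not_before + (not_after - not_before) * 2 / 3))
        if [ "$now" -ge "$fallback" ]; then echo renew; else echo wait; fi
    fi
)
```

A more careful version would only fall back after a couple of failed ARI lookups, and would keep the lifetime check as a safety net even when ARI does answer.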

5 Likes

A I

4 Likes

These are all useful thoughts! It's a day off in the US today, so I'll provide full responses tomorrow hopefully. The one thing I'd say, though, is that a really good place to have these conversations is on the ACME mailing list where this standard is being discussed. Alternatively, just file bugs or PRs against the draft directly, and I'm more than happy to take a look!

7 Likes

Thank you, Aaron! I just might do some of that, but before jumping in and saying everything that's been done is wrong :wink:, I'd love to read through some of the design decisions behind how it got to this point. I've tried looking through some of the issues and some of the mailing list archive, but it's hard to see the big picture and I don't know what exactly to try to read through. It looks like some older versions did use a certificate thumbprint, so I'd love to know why that decision was changed to this OCSP-based structure. It also looks like older versions included a hint to the client that it would need to change its certificate key, and I'm curious why that was removed since it seems like it could be useful.

7 Likes

As an off-topic aside @aarongable, it would be great if LE could implement an optional shorter lifetime (notAfter) for Orders.

Implementing this would encourage more ACME clients to properly support short cert lifetimes and move more towards renewal based on percentage lifetime instead of days.

CA behavior currently varies: some return an error if notAfter is supplied in an order, BuyPass parses and validates it (but otherwise ignores it), and Google honors it, defaulting to 90 days if the value is out of range. CA fallback is made more difficult if CAs return an order error, because then you have to know the CA's capabilities in advance to avoid including it in the order.

7 Likes

Yeah, it'd be great if all an ACME client needed was the directory URL and optional external account binding info for it, but in practice an ACME client needs to know more specifics about the CA's implementation. Not just what fields can't be included, or how long a certificate one is likely to get, but things like if there's a "test" environment too. Sometimes when staging goes down, people post here confused since their client is using the LE Staging environment for a dry-run scenario even when they're trying to use a different CA for actual issuance.

It'd be great if whatever extensions get added to ACME try to minimize these kinds of problems. Like, there needs to be some way for the client to know that Let's Encrypt supports SHA-256 (and I assume not anything else?). I just tried it first because that's what the example in the draft spec says, and it's kind of the "default" hashing algorithm used nowadays, but if I try a request and get an error, it's hard to know if it's because I'm using an unsupported hashing algorithm, the certificate is expired, the certificate is from a different CA, or something else is wrong about the request.

On an entirely different note, I'm very amused that I sometimes get back a two-day-long suggested window, with nanosecond-level precision. (I'm assuming that LE is doing that on purpose, just to make sure that clients can handle the spec which literally says there can be any number of digits after the decimal point, if I read it correctly.) Like, anytime within these couple days would be great, but 00:50:46.321 would really be earlier than you need to.

suggestedWindow: {
  start: '2023-08-14T00:50:46.333333334Z',
  end: '2023-08-16T00:50:46.333333334Z'
}
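If the extra digits get in the way (GNU date parses them fine, but not every tool does), a small sketch like this truncates the timestamp to whole seconds before parsing. The function name is made up, and it assumes the only "." in the value is the one introducing the fractional seconds, as in the timestamps shown here.

```shell
# Drop the fractional-seconds part of an RFC 3339 timestamp,
# e.g. 46.333333334Z becomes 46Z. Timestamps without a fraction pass through.
strip_fraction() {
    printf '%s\n' "$1" | sed 's/\.[0-9][0-9]*//'
}

strip_fraction '2023-08-14T00:50:46.333333334Z'   # prints 2023-08-14T00:50:46Z
```

Truncating rounds the window start down by under a second, which doesn't matter for a window that's days long.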
6 Likes

I can't do another list right now. I was going to put this in a PR, but there are people here who may comment on this idea as being good or terrible:

It would be nice if the spec said a server SHOULD (or is RECOMMENDED to) extend the ACME Order object with an initial ARI payload when the certificate is first ready.

This would save an initial request to the ACME server for this information, and help publicize adoption of the standard as existing clients would at least be able to initially inform themselves of the suggested renewal window.

The field could be titled renewalInfo and contain the [suggestedWindow, explanationURL] fields, as well as a retryAfter field that contains the data provided in the header.

I think this would address at least one other suggestion people have had on the acme list:

  • 2021-09-29 acme list , Michael Richardson:

I am surprised that it's not something that is inside the certificate.
An in-certificate hint would be useful in RFC7030 situations for IoT
networks, particularly when a CA rollover is expected soon.

I understand concerns of clients that may break if fields in an Acme Order object change, but I've yet to read anything in the ACME spec that prohibits the addition of new fields.

4 Likes

Can't/shouldn't they be doing this already?

This is a great point. :100: I'd like to echo this.

2 Likes

I think this is just a side effect of how golang does timestamp arithmetic on UTC structures and LE didn't bother to truncate the timestamps before sending them off:

5 Likes