Thoughts from starting to play with ARI

I think the problem is that most schedules do not run sufficiently often :wink:

That's debatable, as the client is the one who chooses the time within the window.

I think we aren't disagreeing about the sense of having some mutually coordinated scheduling.

I think we're discussing practical problems from experience, and better implementation alternatives.

Yes, agreed.

I wrote about this together with some industry leaders I highly respect, here:

Not sure how much of this is based on math, rather than practicality though.

Hopefully a helpful resource...

Yep, glad you noticed that too. That's exactly one of the main problems we're discussing in this thread.

If we use a persistent random time within the window, I think there is no point in having a window; a single server-picked random time makes more sense. Then the client-side logic is super simple: if it's past the ARI time, renew; otherwise sleep until the next job; and if the ARI time falls between now and the next job, sleep X until the renewal time. The RNG here needn't be a crypto-safe one, so it should be extremely cheap, shouldn't it?
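A minimal sketch of that client-side logic, assuming the CA returns a single renewal time rather than a window. The function name `next_action` and the `job_interval` parameter are made up for illustration and are not part of ARI.

```python
from datetime import datetime, timedelta, timezone


def next_action(renew_at: datetime, now: datetime, job_interval: timedelta):
    """Decide what a scheduled job should do for one certificate.

    Returns ("renew", 0.0) or ("sleep", seconds_to_sleep).
    """
    if now >= renew_at:
        # The server-picked time has passed: renew right away.
        return ("renew", 0.0)

    until_renewal = renew_at - now
    if until_renewal < job_interval:
        # The renewal time falls before the next scheduled job,
        # so sleep only until the renewal time.
        return ("sleep", until_renewal.total_seconds())

    # Otherwise just wait for the next regular job.
    return ("sleep", job_interval.total_seconds())


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(next_action(now + timedelta(hours=6), now, timedelta(days=1)))
```

No randomness is needed on the client in this variant, since the server has already picked the time.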

They may not run sufficiently often yet because there's no reason to do so when using a client that doesn't yet support ARI. That will obviously change as clients are updated and admins are told to update their schedules in the release notes for new versions.

The client is doing exactly what the CA instructed...attempting to renew within the suggested window. Whether it's at the beginning or the end of the window shouldn't matter. If it does, the CA needs to be more specific. It shouldn't be the client's responsibility to second guess timing within the suggested window.

Ah, you're more optimistic than me about users updating their software and following release notes :sweat_smile:

Like I've written in another thread, I would be OK with dropping the window and simply having the CA recommend a time, as @orangepizza also suggests. :+1:

Some just implement bad practices. acme-tiny, for example, recommends it be run once per month. It just gets a certificate, without comparing any previously downloaded certificates or anything.

There are just so many clients out there that have a bad design or are terribly coded. Most clients have zero tests. That's why I thought it might be smart to add a boolean field renewImmediately. IMHO, even when ARI gets adopted it's going to fail to work as intended with most clients, because they're going to parse dates incorrectly or otherwise not handle the suggestedWindow correctly due to the absence of tests. The major clients will be fine as they use tests; the bulk of clients out there seem to be written by iterating changes against production/staging.

I mean...this applies to all software, not just ACME clients. But yeah, if clients don't update themselves and nothing breaks for the user, I totally agree users are pretty bad at staying current. But if nothing breaks, there's also no problem from the user's perspective. The first time their site breaks because their client didn't get a new cert following an unplanned revocation, they'll hopefully figure out why and update. Shame on them, lesson learned (maybe).

But aren't these a tiny minority in the overall firehose of traffic that the CA has to deal with such that lack of ARI conformance would be a non-issue? If a bad client gets sufficiently popular that the CA is negatively affected, there's plenty of historical precedent for LE reaching out to the client author(s) to get them to fix things or outright blocking the client.

This seems like more of "talking about unlikely edge cases as the norm".

On a per-client basis, I agree with everything you said. My concern is that the majority of clients are not properly designed or coded. The last time I checked the "recommended clients" list (ACME Client Implementations - Let's Encrypt) only a handful even had tests. Clients aren't really tracked, so information is limited (see The most seen ACME client - #25 by eva2000 for some info).

I'm not actually thinking about the impact on the CA. My concern is that end-users will be running clients that implement ARI incorrectly (and without tests), thinking they are fine because they see "Now supporting ARI!". Then their sites go offline during a mass revocation, because their code does not correctly parse or compare dates.

I recall LE had blocked certain clients in the past to protect greater availability. I think there was a firewall block on bitnami, because they did not randomize the runtime and just checked at midnight.

Maybe in some cases, but for say the recent attempt to enable asynchronous order finalization where clients just don't comply with the existing ACME spec at all and thus couldn't get a certificate from LE once it was turned on, it seems like LE just "indefinitely postponed" since they have higher priorities than enabling it, even though they did want it enabled to make some aspects of running the CA easier. I haven't read anything about them working to reach out to client authors about it, though certainly it's possible that they just haven't been advertising doing so (or that they have and I just didn't happen to see it).

In this particular case, though, I think that if they're handling ARI at all they'll probably be alright, specifically because all they're doing every time is telling whether a time is in the past, rather than checking a separate "renewImmediately" signal, which they could misspell or something if it didn't regularly show up. They don't even really need to parse the date; it's in ISO-8601 format, so a string comparison ought to be close enough.
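A rough sketch of the string-comparison idea, assuming both timestamps are UTC with a trailing "Z" and whole-second precision. The helper names are made up for illustration.

```python
from datetime import datetime, timezone


def now_as_utc_string(now: datetime) -> str:
    # Format "now" the same way as the timestamp it is compared against:
    # UTC, whole seconds, trailing "Z".
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def is_past_by_string(timestamp: str, now: datetime) -> bool:
    # Fixed-width, zero-padded fields sort lexicographically in time order.
    return timestamp <= now_as_utc_string(now)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_past_by_string("2020-01-02T04:00:00Z", now))
```

This is only "close enough": a timestamp with fractional seconds can be misjudged by up to a second, because "." sorts before "Z", and a timestamp with a numeric offset instead of "Z" can be off by hours. A client that wants to be exact should parse the timestamp.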

(Obviously, it'd be better if they had tests, and if there were good test servers for it, as stated earlier in this thread.)

I'm not sure if you're trying to suggest a solution for this case or just lamenting the state of software in general. Yes, bad clients exist, users will use them, and some will eventually get burned. Then they migrate to a better client (or not) and move on. The circle of (software) life. :clown_face:

Well, a little of both :man_shrugging:

Given that we know many clients are not well made and lack any testing at all, I think it is reasonable to assume many will incorrectly process the renewalInfo payload during a mass outage event.

Given the severity of issues that can create (taking sites offline), I think the service/spec should simplify things and defend users from this as much as possible. A few years ago, I started to take the position that in many situations one should try to design and document things for an average bad/new user, not the perfect use case or experienced expert.

IMHO, this is one of those situations, and adding a renewImmediately field which utilizes a boolean value would go a long way to that goal. It would eliminate the potential errors in both parsing and comparing the dates in the suggestedWindow payload. Maybe I am too defensive in this, or I am being too harsh on the bottom 95% of ACME clients. I just foresee a lot of problems on this.

Hey. It would be nice if ISRG were to email the account holders of clients that don't have tests or are seriously out of date, and advise them to migrate every time they order a cert/renewal, but that's a lot more work - and would create problems through opt-outs.

Wouldn't it make clients get lazy and not try to process ARI, but only look at that bit?

I think of it more as a failsafe. Pseudocode: if (r.suggestedWindow.enddate <= NOW) or (r.renewImmediately): handle_emergency()
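The pseudocode above could look roughly like this in Python. Note that `renewImmediately` is only the field proposed in this thread and does not exist in ARI; the sketch also assumes the window's end is read from `suggestedWindow.end` and that timestamps carry a "Z" or a numeric offset.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11,
    # so normalize it to an explicit offset first. Older Pythons are also
    # picky about the number of fractional-second digits.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_emergency(renewal_info: dict, now: datetime) -> bool:
    window_end = parse_timestamp(renewal_info["suggestedWindow"]["end"])
    # "renewImmediately" is hypothetical (proposed in this thread only).
    flagged = renewal_info.get("renewImmediately") is True
    return window_end <= now or flagged


if __name__ == "__main__":
    info = {
        "suggestedWindow": {
            "start": "2025-01-02T04:00:00Z",
            "end": "2025-01-03T04:00:00Z",
        }
    }
    print(is_emergency(info, datetime.now(timezone.utc)))
```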

If a lazy client did that, they wouldn't benefit from ARI's suggested window or ability to futureproof potentially shorter certificate lifespans.

Most importantly - If anyone were to encounter this, it would be a sign they should do an audit, and the clients should notify the user. Hitting a situation where the window is in the past implies:

  1. There is a configuration issue with the client. It is not checking ARI frequently enough or renewing certs frequently enough.
  2. There was a mass revocation event.
  3. The certificate was otherwise revoked.

IMHO, in these situations there should be something clear, concise, and overstressing to the users that immediate action is needed -- both a renewal and an audit.

What's wrong with a client running, say, daily and just renewing whenever the start date has passed?

It's just like today but instead of checking if 30 days remain (or 1/3 of life) you retrieve and look at the ARI window for a past date. There is no need to know the ARI window in advance to plan a schedule and adapt the schedule as ARI window changes. You just run, check ARI and react. You have to check the ARI window often anyway to learn of a changed window. We talk of ARI window moving sooner from CA revocations but I suppose ARI window could go later if CA was having trouble and needed to delay requests.

For this common kind of client setup, it gives the client the benefit of being able to auto-renew in response to unusual revocations, and for what could be fairly minimal work.
If they had been checking for a specific number of days remaining (rather than 1/3 of the lifetime), then they also now handle shorter cert lifetimes.
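A minimal sketch of that "run, check ARI, react" loop. The `fetch_renewal_info` and `renew` callables are hypothetical stand-ins for whatever the client uses to query the CA and to renew; they are not a real library API.

```python
from datetime import datetime, timezone


def should_renew(renewal_info: dict, now: datetime) -> bool:
    start_text = renewal_info["suggestedWindow"]["start"]
    # Normalize a trailing "Z" so older Pythons can parse it.
    start = datetime.fromisoformat(start_text.replace("Z", "+00:00"))
    return now >= start


def daily_run(cert_ids, fetch_renewal_info, renew, now=None):
    """Check every certificate once and renew those whose window has opened."""
    now = now or datetime.now(timezone.utc)
    renewed = []
    for cert_id in cert_ids:
        # Fetch fresh info on every run, since the window may have moved.
        info = fetch_renewal_info(cert_id)
        if should_renew(info, now):
            renew(cert_id)
            renewed.append(cert_id)
    return renewed
```

Nothing is planned ahead here: each run looks only at the current window, so a window that moves earlier (or later) is picked up on the next run.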

I appreciate that systems that manage 1,000s of certs may have complex scheduling requirements of their own. Those require tailored solutions.

Nothing. That sounds like great logic. I totally agree with all the points you make.

I still have a concern that clients will not parse the date correctly, or will fail to correctly compare the startdate to now() because, again, most clients do not have a test suite. I know about 10 clients that will implement this right, but I am pretty sure they are the only ones that have test suites that will ensure the client behaves correctly.

If the client encounters a situation where they are past the enddate, someone or something forked up, and this should trigger some alert to the user/admin. Either the client is not checking ARI enough or we're dealing with some sort of revocation.

If they can't compare an ISO-8601 string to now, then I wouldn't be confident that they can compare a JSON boolean to true either.

I don't know what the user/admin would do with such an alert, though? If they renew whenever they're past the start date or end date or whatever, and the renewal works, then they've done their job and everything is fine (whether that was due to a normal expiration, an imminent revocation, or a CA planned future outage).

Really where I think most clients are failing is that if a renewal fails (or maybe if a couple attempts fail), they need to alert their admin and they don't. But that's not really related to using ARI or not.

Perhaps I am traumatized by decades of bugs involving date comparisons. There are many ways people get dates wrong. A JSON boolean can be handled with simple string or pattern matching.

If they were affected by a mass cancellation or revocation, that needs to be investigated. Most do not require any action, but there are security implications.

If they were not affected by a security event, then one or more services may have failed to run correctly.

Not surfacing this information is like ignoring/hiding errors/exceptions in code. This is a sign that something went wrong, and may still be messed up.

I'm going to reply in this thread, since I think it's getting off-topic for the other one.

The current Boulder implementation does. I don't think that there's anything requiring a CA to do so. And, well, I think that the ideal logic may be more complicated than just "is it revoked". Consider a certificate revoked for "cessationOfOperation"; that might imply that the certificate is never supposed to be renewed (at least not by the same subscriber, or same ACME client, or something).

Really it's just an example that I think the current draft doesn't give nearly enough guidance to CAs on how to pick good suggested windows.

Reviving this as it's still a top result for ARI client dev stuff.

I fought with constructing the ARI url for way too long yesterday, as I finally implemented it on our internal systems. My issues stemmed from every Python library handling the information differently than expected. In case anyone else has issues:

  1. The server serial. The ARI spec requires encoding the hex bytes of the cert. Every Python library (and most CLI tools) I tried will fully decode the serial's hex bytes into an Integer - so you need to do a bit of extra magic encoding that you would not necessarily expect.

  2. The first four bytes of the AKID structure need to be tossed. I didn't realize this for a few hours, then finally noticed the tail of my vars were identical - but the initial bits were not. I tried a few different ways of parsing - no luck. Then I tried looking up some spec info, but couldn't find any. Eventually I looked to see if anyone had tackled this for Certbot and found this PR from @orangepizza

        #by nature of asn1 encoding single member sequence
        #we can strip first 4 bytes to get akid
        #seq/len/octetstring/len
    

Funnily, aside from them figuring out that little bit – we both implemented essentially the same code.

I understand the reasons why the ARI id was constructed this way, but it feels overly complicated to construct under many programming libraries/frameworks. It seems easier to construct in low level systems, and a bit of a headache in higher level systems.
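For anyone hitting the same two problems, here is a stdlib-only sketch of how the ID can be put together. It assumes you already have the serial as an integer (as, for example, `cryptography` exposes it via `serial_number`) and the raw AKID extension value as bytes. The function names are made up, and the four-byte strip only holds when the extension contains nothing but a short-form key identifier, so the sketch checks for that.

```python
import base64


def b64url(data: bytes) -> str:
    # base64url without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def serial_to_der_bytes(serial: int) -> bytes:
    # Re-encode the integer the way DER stores it: big-endian, with a
    # leading 0x00 byte when the high bit of the first byte would be set.
    if serial < 0:
        raise ValueError("expected a non-negative serial number")
    length = serial.bit_length() // 8 + 1
    return serial.to_bytes(length, "big")


def key_id_from_akid_extension(ext_value: bytes) -> bytes:
    # Expected layout: 0x30 <len> 0x80 <len> <key identifier>
    if len(ext_value) < 4 or ext_value[0] != 0x30 or ext_value[2] != 0x80:
        raise ValueError("unexpected AKID structure")
    if ext_value[1] >= 0x80 or ext_value[3] >= 0x80:
        raise ValueError("long-form lengths are not handled in this sketch")
    key_id = ext_value[4:]
    if len(key_id) != ext_value[3]:
        raise ValueError("AKID contains more than just a key identifier")
    return key_id


def ari_cert_id(key_id: bytes, serial: int) -> str:
    return f"{b64url(key_id)}.{b64url(serial_to_der_bytes(serial))}"


if __name__ == "__main__":
    key_id = bytes.fromhex("69885b6b87464041e1b37b847ba0ae2cde01c8d4")
    print(ari_cert_id(key_id, 0x87654321))
```

With those example values the result is `aYhba4dGQEHhs3uEe6CuLN4ByNQ.AIdlQyE`; note the leading zero byte that the integer form of the serial loses.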

I've said it before, but just for emphasis: anyone implementing ARI replaces should be super careful and assume that you will, for some reason, provide an invalid replaces cert ID (wrong account, different CA, different set of identifiers since last renewal, some other unforeseen thing).

If your new (renewal) order fails at all for anything, I'd suggest discarding the ARI replaces ID and starting fresh for the next attempt. It's not really in the spirit of things, but otherwise it's a potential self-own.
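One way that fallback could look, as a sketch only. `submit` is a hypothetical callable that sends the new-order payload and raises on failure, and retrying immediately is just one reading of "next attempt".

```python
from typing import Callable, Iterable, Optional


def build_new_order_payload(domains: Iterable[str], replaces: Optional[str] = None) -> dict:
    payload = {"identifiers": [{"type": "dns", "value": d} for d in domains]}
    if replaces:
        # Only include "replaces" when there is an ARI cert ID to send.
        payload["replaces"] = replaces
    return payload


def place_renewal_order(domains, replaces: Optional[str], submit: Callable[[dict], dict]) -> dict:
    domains = list(domains)
    try:
        return submit(build_new_order_payload(domains, replaces))
    except Exception:
        # A real client would catch its own error type here; this sketch
        # treats any failure the same way.
        if not replaces:
            raise
        # Drop the replaces ID and start fresh rather than risk a bad ID
        # blocking every future renewal.
        return submit(build_new_order_payload(domains, None))
```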