I also really like the idea of ARI using something like the SHA256 of the leaf cert/pre-cert. One consideration would be the difficulty of sharding databases for efficient indices. But I suspect with Key-Value stores/hash indexing this might not be that difficult to solve.
Wouldn't ARI be smart enough to offer a new window? I can live with a "MUST NOT attempt renewal earlier than the start of the window".
I thought about using "SHOULD NOT", but was afraid some client developers would take that as an excuse to renew whenever they want. "SHOULD NOT" leaves some room for doing the opposite of what's written, provided there are good reasons behind it.
Then I misinterpreted your "right now". It suggested to me, as you've (I believe) previously stated, that the client would try to renew immediately when ARI suggested a (new) window, which aaron already stated is the opposite of what a client should do.
Well, the current spec doesn't put much requirement at all on how the CA should pick a window. But for now, the CA can just create a file to deploy to a CDN upon certificate creation based on their current expected future load, and only need to change it when doing a revocation/incident, or similarly needing to update for some specific scheduled maintenance or other event. Whereas I think your suggestion of having the CA update the window after it has passed would require the CA to serve it dynamically.
I was just going to wonder what use cases there are, but then you read my mind:
I would be curious how many sites out there fit any of these criteria; i.e. sites that are deemed "testing" or maybe unimportant such that "it's OK if they go down." As for financial incentives, maybe -- "pay more, you're more likely to get a cert." It's basically you pay for an SLA, I guess. Huh...
I suspect most clients won't opt-in to lower priority. (QOS was a thing for TCP, for example, and look how that went.) At least not without a financial benefit.
If it's known that a certificate will be revoked in the future, I don't know why that's fundamentally any different than "this certificate is effectively revoked" i.e. can't be trusted anymore. Currently, impending revocations are announced to humans (policy enforcers) on mailing lists or bug trackers, but if this is announced in program-consumable ways, I think we've just reinvented CRLs or OCSP.
Anyway, I like this kind of the out-of-the-box thinking / discussion.
Rather than revocation, I probably should have used the example of a CA that knows it will be down for scheduled maintenance at some point, so renewals should happen sooner rather than later.
I think we're talking (off topic!) about different things; this is just using the existing ACME spec, where notAfter in the order payload is either honoured or not by the CA in preference to their default cert lifetime. Thanks, it would be a good discussion for another time.
Just as another data point, the other CA I'm aware of that implemented ARI is Google (though I haven't tried actually using it yet since I would need to make a Google Cloud account to do so), and their directory (in both their production and their test environments) has renewalInfo listed without a trailing slash.
Thanks for the explanation. Though apparently other CAs either have other interpretations or intentionally diverged from the spec and chose not to error on them even if they can't respect them.
Really maybe the answer here is that more of these choices should be available, maybe in the meta data from the directory. Understanding which parameters a CA allows to set (dates, must-staple, hash algorithms for CSRs, hash algorithms for ARI requests, and who knows what else), plus other things like applicable rate limits, server challenge retry behavior, and so forth, in a client-consumable way, I think would really help with clients being able to just accept a directory URL as all that's needed for CA configuration.
That's still mostly off-topic for talking about ARI, so let's try to get back on track, with the next part I'm focusing on:
Selecting a random time within the suggested window
I'm not sure that the "RECOMMENDED" algorithm in the draft gives the intended result, for the typical case of a cron-style "wake up and check things a couple times a day" client. In particular, if I understand it correctly, it wants the client to select a new random time within the suggested window for each time that it runs. This means that if the window is of a decent size (say, 7 days), and the client is checking twice a day, then the client is much more likely to initiate renewing earlier in the window rather than later in the window. On each run it has a chance to pick a random time, and so it would need to basically always pick a time near the end of the window, on every single run, in order to actually renew near the end of that window.
If my math is right, which it may very well not be, with a 7-day long suggested window, and a client checking twice a day, it would have better-than-50% odds of having done renewal by the end of the second day, and by the time it was halfway through the window (3½ days), the odds would be over 92%. So there would be a significant skew toward renewing at the beginning of the window.
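To make that estimate easy to check, here is a minimal sketch of the calculation. It assumes a model that is only my reading of the recommended algorithm: the client wakes exactly at the window start and at fixed intervals after that, draws a fresh uniform time on every run, and renews if the drawn time falls before its next wake-up. The function name is made up for illustration.

```python
from fractions import Fraction


def cumulative_renewal_probability(window_hours, interval_hours):
    """Return [(hours_into_window, P(renewed by then))] for each wake-up.

    Assumed model (not taken from the draft text): the client wakes at the
    window start and every interval_hours after that. On each run it draws
    a fresh uniform time in the window and renews if that time falls
    before its next wake-up.
    """
    not_renewed = Fraction(1)
    results = []
    elapsed = 0
    while elapsed < window_hours:
        # Chance that a fresh uniform draw lands before the next wake-up.
        p_renew = min(Fraction(elapsed + interval_hours, window_hours), Fraction(1))
        not_renewed *= 1 - p_renew
        results.append((elapsed, 1 - not_renewed))
        elapsed += interval_hours
    return results


if __name__ == "__main__":
    # 7-day window, client running every 12 hours.
    for hours, prob in cumulative_renewal_probability(7 * 24, 12):
        print(f"{hours / 24:4.1f} days: {float(prob):.1%}")
```

Under those assumptions the cumulative probability is roughly 55% after the run at 1.5 days and roughly 92% after the run at 3 days, so the front-loading is there regardless of the exact bookkeeping.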
I don't know if maybe that's the intended outcome, or if the idea is really that a CA wouldn't ever be giving out large windows like that. (But if windows are always going to be relatively small, then I'm not sure why you'd need both a start and end time, rather than only a start time.) Really I think that maybe the spec should give some sort of guidance to CAs about how they should be selecting suggested windows, rather than the current draft which looks to primarily be focused just on guidance for clients. It may be that some CAs would think of the responsibility being on the client to pick a random time well within a large suggested window, whereas other CAs would want to do a bit more "micromanaging" and always give out relatively small windows.
If the intention is for clients (which aren't able to schedule renewal at an exact particular time per step 3 of the recommended algorithm, though even if they can it's not entirely clear to me from what's written how that scheduling should interact with what happens if ARI is checked again before that specific time) to try to uniformly spread themselves out within the window, then I think there needs to be guidance to either
Store the random time and suggested window in some persistent storage, and for the client to use the same random time if it got the same suggested window, or
Do something with a repeatable pseudo-random generator, seeded with the cert and suggested window and maybe some other things too like hostname or account id, and use that to calculate a renewal time in a repeatable way from what the client has access to, so that on the next run if the server gives the same suggested window then the client will calculate the same renewal time. (Like, calculate SHA-256 of a concatenation of the suggested window, plus existing cert, plus hostname, take that giant number and divide by 2^256 to get a fraction between 0 and 1, and use that to scale how far through the suggested window to plan as the renewal time. Though I don't know if it needs to be that fancy.)
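The SHA-256 idea in the second option could look something like the sketch below. The function name, the "|" separator and the choice of inputs are my own assumptions for illustration; nothing here comes from the draft.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def deterministic_renewal_time(window_start, window_end, cert_der, hostname):
    """Map (window, cert, hostname) to a repeatable time inside the window.

    cert_der is the existing certificate as bytes; hostname is a str.
    """
    material = b"|".join([
        window_start.isoformat().encode(),
        window_end.isoformat().encode(),
        cert_der,
        hostname.encode(),
    ])
    digest = hashlib.sha256(material).digest()
    # Treat the 256-bit digest as an integer and scale it to a fraction
    # between 0 and 1.
    fraction = int.from_bytes(digest, "big") / 2**256
    return window_start + (window_end - window_start) * fraction


if __name__ == "__main__":
    start = datetime(2024, 1, 7, 6, tzinfo=timezone.utc)
    end = start + timedelta(days=7)
    print(deterministic_renewal_time(start, end, b"placeholder cert bytes", "example.com"))
```

As long as the server keeps returning the same window, every run lands on the same time without the client storing anything.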
That is all assuming, though, that CAs will usually not change suggested windows absent some precipitating event, rather than dynamically recalculating expected load constantly and tweaking windows by an hour or two at a time in one direction or another each time a client requests it.
Determining next wake-up time
The other thing about following the recommended algorithm which, while I wouldn't say is hard, I think can be at least a little tricky, after you've done the above to select a renewal time, is in step 4 where it says "if the selected time is before the next time that the client would wake up normally". I think most cron-style schedule systems don't have an easy (or at least standardized) way to specify to the application when the next time will be that the client would get run. Obviously one can pass it in as another parameter (say, having it run my-client renew --next-run=12h when set to run twice a day), but it means that switching to once-a-day or three-times-a-day or whatever involves more than just updating the scheduling itself, but also updating the command-line that gets run. And it may not be obvious to the system administrators exactly what that value is and why it needs to be accurate. (Well, if it's not completely accurate that wouldn't be a terrible thing, but it seems like it ought to be close.) Also, for scheduling systems that let one add some jitter to the schedule time (which we would generally want, wouldn't we?) it may be even trickier to figure out the right parameter to pass along for "run anytime the system has capacity between 6am and 8am, and then again anytime the system has capacity between 6pm and 8pm".
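For what it's worth, a rough sketch of the "pass it in as a parameter" approach follows. The my-client name and the --next-run flag are just the hypothetical ones from the paragraph above, and the accepted interval syntax (minutes, hours, days) is an assumption of mine.

```python
import argparse
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_interval(text):
    """Parse values such as '90m', '12h' or '1d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", text)
    if not match:
        raise argparse.ArgumentTypeError(f"bad interval: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def should_renew_now(selected_time, now, next_run):
    """Step 4 as paraphrased above: renew if the selected time falls
    before the next expected wake-up."""
    return selected_time < now + next_run


def build_parser():
    # Hypothetical command line: my-client renew --next-run=12h
    parser = argparse.ArgumentParser(prog="my-client")
    sub = parser.add_subparsers(dest="command", required=True)
    renew = sub.add_parser("renew")
    renew.add_argument("--next-run", type=parse_interval, default=timedelta(hours=12))
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["renew", "--next-run=12h"])
    now = datetime.now(timezone.utc)
    print(should_renew_now(now + timedelta(hours=5), now, args.next_run))
```

The weak point is exactly the one described: the value has to be kept in sync with the real schedule by hand, and a jittered schedule has no single right answer.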
Hmm, good points. I haven't considered the implications for cron-style clients because it's not relevant to me, but now that you mention it... I would be way more supportive of something like ARI if it squeezed out legacy cron-job clients.
Let's be honest. Clients using the OS's task scheduler are never going to go away. We can hope they're needed less often as more services integrate ACME natively. But there will always be scenarios where a simple client using a simple scheduler is all that is needed or wanted. I'd suspect a decent chunk of the services with native ACME integration will also just be using simple timers or schedulers via their language's class library as well.
All ARI is going to do for scheduled clients that support it is change the recommended wake-up window to be more often.
Why would it do that? Instead of checking how much lifetime remains, you could just check whether right now is past the start of the ARI window. If it is, renew.
Given Peter's point, this would likely skew them towards the front of the window but that depends on the freq of the cronjob and the duration of the window.
The benefit to cronjob clients is for situations like CA instigated revocations. They would have a chance to renew before being revoked.
The big difference with these, though, is they have control over their schedulers/timers.
And the problem with this, of course, is what is currently being talked about which is the frontal skew.
They would just need to run more often, and that still doesn't give them control over their timer to make them dynamic.
I feel like in order to be effective, ARI either needs to schedule the clients (which is not my preference, for reasons explained in other places) or the clients just need to control their own schedules, taking cues from the ACME server.
Having the timer out of the ACME client's control is not particularly effective.
Even for really smart clients with complete control over timing of everything that they want to do, they need to deal with how and whether to update their planned renewal time whenever they poll ARI for a suggested window. I'm not sure the current recommended algorithm is clear enough that if the window doesn't change (or at least if their current planned time remains within the new window) that they shouldn't update their planned time to a new random value within the window. (Assuming again, the expectation is that clients should endeavor to spread themselves out uniformly within the window, rather than the intention being that based on the server's retry-after and the size of the window that clients should be skewed toward the beginning of the window.)
So as an example, suppose a suggested window is Sunday 6am through Monday 6pm, and a client randomly picked Sunday 12pm from within it. Then, the next time the client polls for the information, the window is pushed back an hour, to be Sunday 7am through Monday 7pm. Should the client:
Keep its original time of Sunday 12pm because it's still within the window?
Change to Sunday 1pm because the window got pushed back an hour?
Pick a completely new random time from within the new window?
I could see good arguments for each of those. (And I'm just talking about the scenario where the client is worried about one certificate; it gets even trickier in the case where the client has a lot of certificates and is trying to spread out the load whilst still trying to take the suggested windows into account.)
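To make the three choices concrete, here they are as small functions. This is only a sketch; the names are mine and none of these policies is prescribed by the draft.

```python
import random
from datetime import datetime, timedelta


def pick_random(window, rng=random):
    """Option 3: a completely new uniform time from within the window."""
    start, end = window
    return start + (end - start) * rng.random()


def keep_if_still_valid(planned, new_window, rng=random):
    """Option 1: keep the planned time while it lies inside the new window."""
    start, end = new_window
    if start <= planned <= end:
        return planned
    return pick_random(new_window, rng)


def shift_with_window(planned, old_window, new_window):
    """Option 2: keep the same relative position within the window."""
    old_start, old_end = old_window
    new_start, new_end = new_window
    fraction = (planned - old_start) / (old_end - old_start)
    return new_start + (new_end - new_start) * fraction


if __name__ == "__main__":
    sunday = datetime(2024, 1, 7)  # 2024-01-07 is a Sunday
    old = (sunday.replace(hour=6), sunday + timedelta(days=1, hours=18))
    new = (sunday.replace(hour=7), sunday + timedelta(days=1, hours=19))
    planned = sunday.replace(hour=12)
    print(keep_if_still_valid(planned, new))     # stays at Sunday 12pm
    print(shift_with_window(planned, old, new))  # moves to about Sunday 1pm
    print(pick_random(new))                      # anywhere in the new window
```

Note that option 3 applied on every poll is exactly what produces the skew toward the start of the window.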
But I think that in any case, the client would need to use some sort of persistent storage to keep track of its previous planned renewal time and the previous suggested window, or calculate it in some pseudorandom-deterministic way from the inputs. And I think that the current draft doesn't give enough guidance on what the client should do.
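The persistent-storage variant could be as small as the sketch below. The JSON layout and the function name are invented for illustration; a real client would presumably keep this next to whatever per-certificate state it already has.

```python
import json
import random
from datetime import datetime
from pathlib import Path


def planned_renewal_time(state_path, window_start, window_end, rng=random):
    """Reuse the stored renewal time if the window is unchanged, else re-pick.

    Assumption: "unchanged" means the start and end timestamps are
    identical to the ones seen on the previous run.
    """
    window = [window_start.isoformat(), window_end.isoformat()]
    path = Path(state_path)
    if path.exists():
        state = json.loads(path.read_text())
        if state.get("window") == window:
            return datetime.fromisoformat(state["planned"])
    # New or changed window: pick a fresh uniform time and remember it.
    planned = window_start + (window_end - window_start) * rng.random()
    path.write_text(json.dumps({"window": window, "planned": planned.isoformat()}))
    return planned
```

With this, a twice-a-day cron client picks its time once per window instead of re-rolling on every run.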
With a schedule that runs sufficiently often, does it really matter? When scheduled clients wake up, they're not blindly renewing all of the orders they're responsible for. They're checking to see whether it's time to do so or not.
Prior to ARI, they'd use whatever window they had calculated based on the cert lifetime. With ARI, they update the window based on the ARI response. So it just becomes a matter of running often enough to not miss the potential ARI window.
I don't think worrying about frontal skew is the client's responsibility. As long as they end up trying to renew within the suggested window (or after, but before the cert expires or is revoked), it has done its job.
IMHO there is a bit of overthinking in this entire thread and people seem to be talking about unlikely edge cases as the norm.
The logic and utility of using ARI info seems pretty obvious to me.
Using Certbot as an example, it's set to run every day via cron.
The ARI payload and retry-after can be stashed and read from each certificate's /etc/letsencrypt/renewal/{certname}.conf file.
Certbot can read that file every day when invoked, and update the ARI payload, attempt a renewal, or do nothing based on the information in the file. I think an ARI window shorter than the daily interval recommended for cron schedulers is extremely unlikely. LE does mass revocation within 5 days, typically at the end. Even in the event of revocation, OCSP stapling can keep a cert around for 10 days. Google/Microsoft/Apple/Mozilla might do a private security update to push the revocation info into their clients, but that tends to be for illicit activity, not for mass revocation from CA compliance issues.
Given the behavior of all the components in the ecosystem, I don't think the ARI draft needs to be over-engineered.
Let's Encrypt has 50 million certs. This could easily be handled in a single Redis instance on a tiny cloud machine.
SHA256 and MD5 are really wonderful at even distribution and get around the distribution issues that serials have from Benford's law. I use them for partial indexing and sharding all the time with postgres. I've got a web crawler with a few billion URLs indexed. I stash the full md5 of the url as a column, and have a function index on the first 6 chars of that column. On searches, I calculate the md5 on the fly and do search on a substring of the md5 field to trigger the index hint. It makes 50s queries happen in 15ms. When sharding, I usually use the first 1-2 chars to determine the shard.
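A quick way to see the even distribution is to shard a batch of keys by the first hex character of their MD5 and count. This is a toy sketch; the URLs are made up and shard_for is a hypothetical helper, not anything from Postgres.

```python
import hashlib
from collections import Counter


def shard_for(key, prefix_chars=1):
    """Pick a shard from the first hex characters of the key's MD5."""
    return hashlib.md5(key.encode()).hexdigest()[:prefix_chars]


# Sequential, highly similar keys still spread across all 16 shards.
counts = Counter(shard_for(f"https://example.com/page/{n}") for n in range(16000))

if __name__ == "__main__":
    for shard, count in sorted(counts.items()):
        print(shard, count)
```

Each of the 16 shards should end up with somewhere near 1,000 keys, even though the inputs differ only in a trailing number.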
I'm bringing this up because of an old post from Instagram: their team realized you can use Redis hashes to get far more out of the available memory.
Combining the two concepts above, you can use the redis hashmap system to drop the footprint and speed the performance.
The SHA256 of this site's current cert is: E5 6E B6 B4 C4 BD 98 8F 53 F0 D8 02 62 0C B7 FA 9D C2 6E 98 29 78 D7 7D 7E 26 9F 5A 11 BB E8 9F
One could create a hash for E5 6E B6 B4 then set within that hash a key C4 BD 98 8F 53 F0 D8 02 62 0C B7 FA 9D C2 6E 98 29 78 D7 7D 7E 26 9F 5A 11 BB E8 9F
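Here is that split sketched in Python. A plain dict of dicts stands in for Redis so the example runs without a server; with a real Redis the set and get would correspond to HSET and HGET on the bucket key. The class and function names are made up for illustration.

```python
def split_fingerprint(fingerprint_hex, bucket_bytes=4):
    """Split a SHA-256 fingerprint into (bucket key, field within bucket)."""
    clean = fingerprint_hex.replace(" ", "").replace(":", "").lower()
    if len(clean) != 64:
        raise ValueError("expected a 32-byte SHA-256 fingerprint")
    cut = bucket_bytes * 2  # two hex characters per byte
    return clean[:cut], clean[cut:]


class BucketedStore:
    """In-memory stand-in for Redis hashes: bucket -> {field: value}."""

    def __init__(self):
        self._buckets = {}

    def set(self, fingerprint_hex, value):
        bucket, field = split_fingerprint(fingerprint_hex)
        self._buckets.setdefault(bucket, {})[field] = value

    def get(self, fingerprint_hex):
        bucket, field = split_fingerprint(fingerprint_hex)
        return self._buckets.get(bucket, {}).get(field)


if __name__ == "__main__":
    fp = ("E5 6E B6 B4 C4 BD 98 8F 53 F0 D8 02 62 0C B7 FA "
          "9D C2 6E 98 29 78 D7 7D 7E 26 9F 5A 11 BB E8 9F")
    store = BucketedStore()
    store.set(fp, "renewal info goes here")
    print(split_fingerprint(fp)[0], store.get(fp))
```

How many bytes to put in the bucket key is a tuning choice; 4 bytes just mirrors the example above.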
You could even do that with flat-file storage. Before S3 existed, I usually handled storing/serving user generated content by calculating an md5 of the file and then bucketing in directories of 2-4 chars. That kept every directory within the best-performance sizes for the filesystem. (IIRC, ext3 performance started to tank as you got over 4k items in a directory, even though it had a max of 32k items. ext4 allowed for unlimited items, but I recall performance dropping around some number that wasn't too large)
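The flat-file version is only a few lines. This sketch assumes two directory levels of two hex characters each, which is one choice within the 2-4 character range mentioned above; the function name and the /srv/ugc root are invented.

```python
import hashlib
from pathlib import Path


def bucketed_path(root, content, depth=2, chars=2):
    """Build root/ab/cd/<md5> from the MD5 of the file content (bytes)."""
    digest = hashlib.md5(content).hexdigest()
    # One directory level per slice of the digest prefix.
    parts = [digest[i * chars:(i + 1) * chars] for i in range(depth)]
    return Path(root, *parts, digest)


if __name__ == "__main__":
    print(bucketed_path("/srv/ugc", b"example upload"))
```

With two levels of two hex characters there are 65,536 leaf directories, so each one stays small even with many millions of files.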
Quite likely. I'm an engineer: if it isn't broken I might still want to take it apart and "fix" it.
That really is the crux of the problem, I think: there isn't any guidance on how a CA is supposed to pick a good window, taking into account what clients are going to want to do with that information. Let's Encrypt's current implementation seems to be "eh, a third of the lifetime remaining seems about right, give a window that's that point ± 1 day". (Though I do understand that the goal for now for Let's Encrypt is more beta-testing ARI, not solving all the optimization problems involving it quite yet.) But really I don't think that there's any math or statistics around that "a third of lifetime remaining" recommendation that seems common even without ARI.
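Spelled out as arithmetic, that rule of thumb would be something like the following. This is only my paraphrase of the observed behavior described above, not Let's Encrypt's actual code, and the function name is made up.

```python
from datetime import datetime, timedelta, timezone


def third_remaining_window(not_before, not_after, margin=timedelta(days=1)):
    """Window centered where one third of the lifetime remains, +/- margin."""
    lifetime = not_after - not_before
    center = not_after - lifetime / 3
    return center - margin, center + margin


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
    start, end = third_remaining_window(issued, issued + timedelta(days=90))
    print(start.isoformat(), end.isoformat())
```

For a 90-day certificate that puts the window from day 59 to day 61 after issuance.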
Some questions to think about when figuring out when a client should renew include:
Is certificate installation completely automated?
What alerting does the site operator have set up?
If there is a problem with renewing, what's the average time between it failing and the site operator fixing the problem? (And probably "average" isn't quite right, but maybe 95th and 99th percentile times, at least among cases where the problem does get fixed eventually?)
Is there any history of problems with this particular site, that means that we want to renew earlier before expiration than some other site which has been renewing reliably on time for years?
Really I don't know that it should be related much at all to how long the certificate lifetime is. If certs were 120 days, or 60 days, I don't think it affects much the reliability of renewal, or how much time one would want to leave as buffer in case of a problem. I'm actually really curious now where the third-of-lifetime recommendation comes from, since I doubt it comes from the traditional 1-year-ish-long certificates.
We have 250M+ certs. RAM isn't cheap, Redis is RAM-hungry, and this scheme ignores how and when to recalculate and push updated ARI statuses to the Redis datastore.
Thinking about the behavior of cron-based clients, isn't it heavily biased toward renewing at the start of the window?
Each cron job that wakes inside the window rolls a new die over the window, and if the selected time has already passed, it renews now.
For a hypothetical hourly cron-based client over a 48-hour window, 85% of renewals will happen in the first 12 hours of the window.
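A small simulation of that re-roll rule, as a sanity check. It assumes the client wakes exactly on the hour starting at the window start, and the function names and the fixed seed are my own choices.

```python
import random


def simulate_first_renewal_hour(window_hours=48, rng=random):
    """Hour at which an hourly client renews under the re-roll rule."""
    for hour in range(window_hours + 1):
        # Fresh uniform draw on every wake-up; renew if it is already past.
        if rng.uniform(0, window_hours) <= hour:
            return hour
    return window_hours  # safety net; the final hour always renews


def fraction_renewed_by(cutoff_hour, trials=20000, seed=1):
    """Monte Carlo estimate of the share of clients renewed by cutoff_hour."""
    rng = random.Random(seed)
    hits = sum(
        simulate_first_renewal_hour(rng=rng) <= cutoff_hour
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    print(f"renewed within 12 hours: {fraction_renewed_by(12):.1%}")
```

Under these assumptions the share renewed within the first 12 hours comes out a little above 80%, in the same ballpark as the figure quoted above.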