Hi, our team is evaluating whether we can handle certificate revocation cases automatically. We're excited about ARI, but all the exposed interfaces currently rely on polling.
Given that we may manage up to 640k certificates, periodically querying each domain isn't practical. Would it be possible to provide a callback API during the certificate creation process? This way, we could receive notifications for unexpected revocations directly for the relevant domains.
As far as I know, the ARI protocol is still in development, so maybe. There's also the Feature Requests category on the forum. While ARI is, strictly speaking, not a Let's Encrypt-only feature, I believe your question is better suited to Feature Requests, so I've moved it there.
Maybe @aarongable can chip in on this request. I too have often wondered what the load implications would be if one were to poll many, MANY certificates many times a day. While I don't have any numbers to back it up, my feeling is that polling in such a manner isn't very economical.
Such a callback API would make sense to me. However, the implementation would be another issue, especially for ACME clients that aren't active all the time, but just that one time during a renewal attempt.
I like this idea, but I immediately wonder if the implications of supporting it might be detrimental to it being deployed. Aside from the technical logistics, this could require a lot of work to prevent it from being a purposeful - or accidental - DDoS vector or other type of misuse.
I do think something like this was suggested before, and the official response was something like "you should poll frequently and not care, we can take care of load via caching/etc. if needed".
Thank you @Osiris and @jvanasco. I have also posted the same question in the link above. Polling puts significant strain on Let's Encrypt, and there is a risk that revocation information may not reach the client promptly. I believe implementing notifications would allow subscribers to receive revocation updates immediately, ensuring timely action. Additionally, this approach would help LE servers conserve substantial resources and energy.
It might just move the chokepoint of needing to make lots of requests from the ACME client side to the ACME server side. The ARI protocol was designed so that the CA could, if needed, publish a bunch of static files to a CDN, making serving responses relatively straightforward. If they needed to hit a bunch of HTTPS endpoints, around the world, as part of handling an incident, I could see that being more complicated. (Plus the need to handle the possible abusive cases.)
Certbot at least checks OCSP twice a day on its renewal schedule; checking ARI should basically work as a drop-in replacement doing roughly the same amount of load on both the client and server sides.
I'm not really objecting to the idea, I'm just not yet convinced that it really helps all that much.
I'm not sure that would be the case once an outage happens. While previous mass revocations have been as little as a few thousand certs, they have also spiked into millions of certs. Pushing to large numbers of callback URIs within the ideal time constraints - which would be days before the revocation - can be more burdensome than responding to inbound requests that use caching. Hundreds or thousands of requests would need to be processed simultaneously, and one would have to deal with timeouts and retries.
I do like the idea and I hope LE staff explores it - or finds other ways to work with large providers - I am just not fully convinced of its utility.
With polling, your servers will consistently face pressure, as they need to handle ongoing requests regardless of whether there are changes. In contrast, with push notifications, revocations—being rare events—might cause short-term spikes in traffic, but for the majority of the time, there would be minimal or no traffic. This would lead to more efficient resource usage overall.
And perhaps better adoption of ARI among a certain kind of user (those with lots and lots of certs).
For the small Certbot users of this world with just a "few" certs it wouldn't matter much, but if you've got tens or hundreds of thousands of certs, well...
It will not be "polling vs pushing": polling will still need to be supported, as most clients do not and will not have a persistent web capability to support the callbacks.
With polling, many requests can be handled in-memory and there are a lot of techniques for caching and sharding data. The requests are far more numerous, but they are light operations.
With pushing, the concern isn't a short term spike in traffic so much as a (likely necessary) capability to quickly scale out to handle a massive number of blocking requests. Even when doing this async, the blocking issues (timeouts, slow requests, dropped connections) will complicate batching that is usually needed for parallel operations like this.
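To make the timeout and retry concern concrete, here is a minimal sketch of a bounded push fan-out. It is a simulation only, not how any CA does or would do this: `fake_deliver`, the `delays` table, and the limits are all hypothetical stand-ins for real HTTPS POSTs to subscriber callback URLs.

```python
import asyncio

async def fake_deliver(url, delays):
    # Hypothetical stand-in for an HTTPS POST: sleeps for the configured
    # per-URL delay instead of touching the network.
    await asyncio.sleep(delays.get(url, 0))
    return True

async def notify_one(url, delays, sem, timeout, retries):
    # The semaphore caps how many deliveries are in flight at once.
    async with sem:
        for attempt in range(1 + retries):
            try:
                await asyncio.wait_for(fake_deliver(url, delays), timeout)
                return url, "delivered", attempt + 1
            except asyncio.TimeoutError:
                continue  # slow or dead endpoint: try again
        return url, "gave_up", 1 + retries

async def notify_all(urls, delays, max_in_flight=100, timeout=0.05, retries=2):
    sem = asyncio.Semaphore(max_in_flight)
    tasks = [notify_one(u, delays, sem, timeout, retries) for u in urls]
    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    urls = ["https://a.example/hook", "https://slow.example/hook"]
    delays = {"https://slow.example/hook": 1.0}
    for result in asyncio.run(notify_all(urls, delays, timeout=0.01, retries=1)):
        print(result)
```

Note that a slow endpoint holds its slot for `timeout * (1 + retries)` seconds, so a few percent of unresponsive callbacks can tie up a large share of the in-flight budget. That is the batching complication described above.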
I say this about a lot of things that are particular to large integrations and high-availability users: I really do think ISRG should consider a commercial tier of services that charges a small, reasonable amount for stuff like this. This type of feature is really most (only?) useful to a select number of clients. In the past, ISRG has indicated they do not want to offer commercial packages, and really do not like to develop features that won't be utilized by all subscribers. IMHO, implementing this for a small subset of paid users (vs a global rollout) might hit the sweet spot of making this technically possible for those who need it while eliminating some of the technical work that would be needed for a global rollout.
I agree it would be great to avoid checking status for every cert a couple of times per day, but that's what ARI currently amounts to. If you are managing 640k certs and want to use ARI, then polling their ARI status is currently the cost of doing business. If you batch poll every 5 minutes (288 batches per day) and only check each cert once per day, you would need a little over 2,000 checks per batch. Assuming you don't manage all your certs on one server/container, it then depends on how far your cert management is scaled out (for instance, I have test systems that scale across hundreds of nodes, and they each look after their own subset). So it's certainly an inconvenience, but not impractical.
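The batch arithmetic above can be checked with a short sketch. The function names and the hash-based bucketing are illustrative assumptions, not part of ARI or of any ACME client; the bucketing is just one way to spread certs evenly over the day without keeping a schedule table.

```python
import hashlib
import math

def checks_per_batch(total_certs, batch_interval_minutes, checks_per_cert_per_day=1):
    # Number of polling batches that fit in one day at the given interval.
    batches_per_day = (24 * 60) // batch_interval_minutes
    return math.ceil(total_certs * checks_per_cert_per_day / batches_per_day)

def batch_index(cert_id, batches_per_day):
    # Hypothetical helper: a stable hash assigns each cert to one batch slot,
    # so the same cert is always polled in the same slot each day.
    digest = hashlib.sha256(cert_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % batches_per_day

print(checks_per_batch(640_000, 5))     # 2223
print(checks_per_batch(640_000, 5, 2))  # 4445 (each cert checked twice a day)
```

Split across N nodes that each own a subset of the certs, the per-node figure is roughly this number divided by N.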
If you imagine a worst case short-notice mass revocation of almost every cert, the ARI polling is going to be the least of the problems regarding load, so at scale considerations such as CA fallback are also important (and there are certain things to watch out for with ARI there too!).
A callback API would be great (e.g., a request to /.well-known/acme-ari-notification with the certID and account public key), but I don't think it will happen any time soon. It also doesn't cover all uses of domains, since many services are not web servers, unless the callback could be nominated to happen on an account-specific domain (i.e., not just the cert's domain).
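As a rough illustration of what such a callback might carry, here is a sketch under loud assumptions: no such endpoint or message format exists in any specification, the path is simply the one floated above, and the `certID` and `accountKey` field names are invented for this example.

```python
import json
from urllib.parse import urlunsplit

# Path floated in this thread; NOT part of any standard.
NOTIFY_PATH = "/.well-known/acme-ari-notification"

def build_notification(callback_host, cert_id, account_jwk):
    # Hypothetical CA side: the URL and JSON body a notifier might send.
    url = urlunsplit(("https", callback_host, NOTIFY_PATH, "", ""))
    body = json.dumps({"certID": cert_id, "accountKey": account_jwk}, sort_keys=True)
    return url, body

def should_recheck(body, managed_cert_ids, expected_jwk):
    # Hypothetical subscriber side: accept the hint only if it names a cert
    # we manage and the account key we registered with.
    msg = json.loads(body)
    return msg.get("certID") in managed_cert_ids and msg.get("accountKey") == expected_jwk

jwk = {"kty": "EC", "crv": "P-256", "x": "example-x", "y": "example-y"}
url, body = build_notification("notify.example.com", "example-cert-id", jwk)
print(url)
print(should_recheck(body, {"example-cert-id"}, jwk))
```

In this sketch the notification is only a hint: a receiver would still poll ARI for that one cert before acting, so a forged or replayed callback costs at most one extra poll.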
This sounds like something one would have to "sign up for" and there also would need to be an agreed "specification" as to how CAs notify accounts with issues.
Possibly an opportunity for a "broker service" that would handle such required "specification" notifications and convert them to a menu of choices "email/sms/http(s)/voice mail/etc.".
There are a lot of people who rely on emails to renew, rather than having actual monitoring. An external monitoring service that also checks ARI/OCSP and notifies the main client that it should probably renew seems like a useful addition to the other external uptime checks that any "production" site should have anyway.
...as long as they actually do notify properly & on time
I imagine "uptime monitoring" services will implement this anyway, but still, having an external service do that seems kinda pointless since it defeats the purpose of automation.