Self-trigger 5 day (120 h) or 24 h revocation (for ARI validation)

Hello,

Recently there was discussion on mdsp about the delayed revocation problem, with one of the proposed remedies being random revocation of a sample of prod certificates. While not taking sides in this discussion, can I ask for a way for the subscriber to self-trigger a delayed revocation without going through revokeCert? To my understanding revokeCert would more or less immediately revoke the cert, causing downtime without giving ARI a chance to do its work, which is not the intention. The intention would be to commit LE to revoke within the specified 120 h or 24 h period without causing damage/downtime (if all layers of the stack work). I argue such a feature, being an option rather than a requirement for subscribers, would enable easy robustness testing of the actual production infrastructure.

This might possibly be a new endpoint in ACME?

(Asking as a maintainer of one of ARI implementations: GitHub - woju/systemd-dehydrated)


This is an aside:

To my understanding revokeCert would more or less immediately revoke the cert causing downtime without getting a chance for ARI to do its work, which is not the intention.

That is the goal of certificate revocation, but in practice a revoked certificate is likely to appear valid for several days. If you have OCSP stapling turned on, the server will cache the response for several days. CRLs aren't really checked much yet. LetsEncrypt is ending support of OCSP in the near future, but until then there is a fairly slim chance of downtime within the first 24 hours after revoking a certificate.

That being said, I suspect LetsEncrypt will respond with the following:

  • The engineering effort to develop and support this is better used elsewhere; and
  • Clients doing this would be unnecessarily burdening the CRL ecosystem with these requests.

You could ensure your production infrastructure is working by maintaining one or more secondary certificates from the LetsEncrypt staging server. That system supports ARI, although it is sometimes out of step with the Boulder release so a change might land there faster.

Your client shouldn't care that it's the staging server or not a publicly trusted cert, as it's just another ACME server that supports ARI. You can run it on a dedicated subdomain that only exists to authorize those challenges, or even on the same subdomain.

  • The engineering effort to develop and support this is better used elsewhere; and

I expect the engineering effort to be minimal, as the better part of the required infrastructure should already be in place, because LE supports timed revocation for policy reasons, doesn't it?

I'll start reading boulder source, maybe I could contribute a PR.

  • Clients doing this would be unnecessarily burdening the CRL ecosystem with these requests.

If it were too big a burden, then LE would be underprovisioned for a 100% revocation event, and I'm quite sure that's not the case. I've seen discussions about this very scenario, and IIRC CRL sharding in CCADB was requested by ISRG staff specifically because of calculations around this theoretical, but possible event. I'd be happy to engage in discussion about a rate limit for this new interface, maybe even something like 1/(week*account), but saying revoking certs is a burden is indirectly admitting to not planning for BR compliance. I'm five-nines sure it'd not be a problem given reasonable rate limits.

I'll wait for an answer from LE.

What do you mean by this, "timed revocation"?


I probably shouldn't interject but I believe they want to be able to test a CA revocation. At the moment the CA realizes one is required the cert is still valid. At some point the ARI object will have a time window indicating to renew. It might (will?) also have an "explanationURL" with the reason for this window.
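
For concreteness, here is a minimal Python sketch of how a client might act on such an ARI object. It assumes the RFC 9773 response shape (a `suggestedWindow` with `start` and `end` timestamps plus an optional `explanationURL`). The function names `parse_window` and `pick_renewal_time`, and the sample payload, are made up for illustration.

```python
import json
import random
from datetime import datetime, timedelta, timezone


def parse_window(renewal_info: dict) -> tuple:
    """Return (start, end) of the suggested renewal window as aware datetimes."""
    window = renewal_info["suggestedWindow"]
    # Older Python versions reject a trailing "Z", so normalise it first.
    start = datetime.fromisoformat(window["start"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(window["end"].replace("Z", "+00:00"))
    return start, end


def pick_renewal_time(renewal_info: dict, now: datetime, rng=random) -> datetime:
    """Pick a uniformly random instant in the window; if it is past, renew now."""
    start, end = parse_window(renewal_info)
    offset = rng.uniform(0, (end - start).total_seconds())
    chosen = start + timedelta(seconds=offset)
    return now if chosen <= now else chosen


# Illustrative payload only: a window that has already closed.
sample = json.loads("""{
  "suggestedWindow": {"start": "2025-01-02T04:00:00Z",
                      "end": "2025-01-03T04:00:00Z"},
  "explanationURL": "https://example.com/incident"
}""")
now = datetime(2025, 1, 5, tzinfo=timezone.utc)
when = pick_renewal_time(sample, now)
print("renew immediately" if when == now else f"renew at {when.isoformat()}")
```

A window that lies entirely in the past is how the server signals "renew now". The `explanationURL` is for the operator to read and is not parsed here.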

In many respects this is no different than an upcoming cert expiration that hasn't already been renewed, except that the explanationURL would be different, if not in value then in what it describes.


It's a new endpoint to build, CI-Test, and maintain in perpetuity.

I didn't say LE couldn't handle this, I said it would unnecessarily burden the CRL ecosystem. You are requesting that everything hooked into the CRL ecosystem - from CAs to browser vendors and clients - take on additional traffic, monitoring and data storage, for the sole reason of testing your production systems.

This is unnecessary, because you could accomplish the same task (testing that your client picks up ARI revocations on production deployments) by revoking against any non-trusted ACME CA that supports ARI.

but saying revoking certs is a burden is indirectly admitting to not planning for BR compliance.

No, it is reiterating the major criticism of the CRL system: inefficiency. ISRG and other groups have pushed for sharding to help alleviate some issues. Previous to that, Mozilla developed CRLite to improve some bottlenecks. All the major browser vendors use proprietary technologies to track and push CRL information to their clients as well. The push for short-lived certificates is also tied into this (though OCSP was probably a bigger factor).

I interpreted this as generally "testing ARI expirations in the production environment", which would cover both unexpected revocation and expected ARI behavior, but using a CA revocation to trigger it, and scheduling it in advance, so the ARI payload is interpreted as "renew immediately".

This exact integration testing could be accomplished on a production system by utilizing a second, non-public, CA / ACME Server. There isn't any fundamental need for this to happen against the publicly trusted ACME server. I actually think that's a really good idea, and I'm going to explore deploying that onto one of my servers next week.

I think if the primary motivation for a feature like this was not testing related, it would be a completely different story. But ISRG already maintains a public staging environment with generous rate limits for testing (Staging Environment - Let's Encrypt), and this exact test could be handled there.

In any event, I've clarified my original statements and someone from LE will eventually chime in with their response.


Doesn't Pebble now have some sort of ARI support as well that would be much easier to tweak the ARI response from for testing purposes?


Yes, it landed in this PR: Implement latest draft-ietf-acme-ari spec by pgporada · Pull Request #461 · letsencrypt/pebble · GitHub


What do you mean by this, "timed revocation"?

That in some cases LE is required to revoke not "now" (as in revokeCert), but after a specified time, either 5 days (120 h) or 1 day (24 h). The exact time differs between different kinds of incidents; those are described in various policy documents, incl. BR and LE's CPS.

I probably shouldn't interject but I believe they want to be able to test a CA revocation.

Pretty much yes, that's the scenario we want to test. AFAIK we can't cause the 24/120 h revocation without doing stupid things like posting the private key publicly.

It's a new endpoint to build, CI-Test, and maintain in perpetuity.

Yes, it probably is. Isn't it within scope of this Feature Requests forum category?

I didn't say LE couldn't handle this, I said it would unnecessarily burden the CRL ecosystem.

Ah, yes, sorry. The argument still holds, I mean, if any WebPKI participant didn't plan for 100% LE revocation, then said actor has underprovisioned infra. If you argue there are CRL consumers that don't have capacity to suddenly ingest a CRL with all of LE's unexpired certs, I'd certainly agree, but I'd argue it's not my expectation to fix things.

You are requesting that everything hooked into the CRL ecosystem - from CAs to browser vendors and clients - take on additional traffic, monitoring and data storage, for the sole reason of testing your production systems.

Well, it's the current proposal to change Mozilla's policy (Add section 6.1.3 for Delayed Revocation · mozilla/pkipolicy@efa8ac4 · GitHub) to include a sort of Russian roulette, where CAs will be rolling for a set of random certs to 120h-revoke. That would surely exercise "everything hooked into the CRL ecosystem", and ISTM Mozilla is fine with that.

If this random chance was to hit my infra, I'd prefer it hit at the time of my choosing before the real thing comes.

Also, every subscriber already has the capability to test for this exact scenario by posting the private key for a testing domain to the problem reporting email address in the CPS (for the record, I never did). I argue that would be a worse case of additional load (bordering on abuse, because the email needs to be processed by a human). I'm hoping to get a more civilised approach available.

This exact integration testing could be accomplished on a production system by utilizing a second, non-public, CA / ACME Server. There isn't any fundamental need for this to happen against the publicly trusted ACME server.

This is incorrect. There are revocation scenarios where you need to e.g. get government approval to replace a key (cf. recent discussions about delayed revocations). People tend to take their exercises more seriously when they know the delay will cause damage.

I'm under the impression ISRG is the place where ARI and ACME innovation happens, so that's why I'm asking LE for this feature, not some random CA elsewhere. I intend for this FR to be in line with promoting automation and the "keeping well oiled" philosophy.

Yes, of course it does, but that's good for unit-testing an ARI client, not the full flow. I'm after a different thing.


Let's Encrypt has stated in the past that they're not intending to ever revoke 100% of all active subscriber certificates - they would just revoke the intermediate in such a scenario (this is why they have backup intermediates).


Yes and no. From my outside knowledge, there are effectively two ways LE can revoke a certificate:

  • Instantaneously via the ACME protocol - this is fully automated.
  • They also have tooling that is executed by a human that can revoke an arbitrary number of certificates (also instantaneously). There's nothing in LE's infrastructure that "delays" a revocation (or if there is, it's not public). If there's a compliance issue with a certificate, an engineer will identify the affected certificates and run the tooling at some point, after which the certificates will be revoked. There's nothing in the database that says "revoke this certificate in X hours".

For incidents where Let's Encrypt provides early warning via ARI, there's an "incident" table in the database where affected certificates can be added to. ARI will then recommend immediate renewal for certificates in the incident table. There's also tooling to revoke all certs in the incident table, but there's no automatic revocation countdown or something like that.
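
The incident-table behaviour described above can be sketched roughly as follows. This is not Boulder's code: the `incident_serials` set, the `suggested_window` function, the one-hour back-dated window and the "final third of the lifetime" default are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def suggested_window(serial, not_before, not_after, incident_serials, now):
    """Return a hypothetical (start, end) ARI renewal window for a certificate.

    Certificates in the incident set get a window ending at `now`, which an
    ARI client treats as "renew immediately". Nothing here revokes anything;
    revocation stays a separate, manually triggered step.
    """
    if serial in incident_serials:
        return now - timedelta(hours=1), now
    # Assumed default policy: a two-day window in the final third of the lifetime.
    lifetime = not_after - not_before
    start = not_after - lifetime / 3
    return start, start + timedelta(days=2)


now = datetime(2025, 3, 1, tzinfo=timezone.utc)
not_before = datetime(2025, 2, 1, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=90)
incident_serials = {"04ab"}  # made-up serial number

print(suggested_window("04ab", not_before, not_after, incident_serials, now))
print(suggested_window("ffff", not_before, not_after, incident_serials, now))
```

The first call returns a window that has already closed, the second a window that still lies in the future.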


Hi @woju, and welcome to the forum!

@Nummer378 is correct -- it would take significant effort to build an automated delayed revocation system. We do not have any systems that do delayed revocation today, and queues are notoriously one of the hardest systems to implement correctly and safely. For example, today our servers could be wholly down for several (about 3.5) days without harming the Internet or violating any of our compliance requirements. If we implemented a system that automatically revoked certs 4 days after their revocation was requested, we'd be reducing that downtime window to just 1 day.

It's a very good idea, and one we've considered ourselves! You're absolutely correct that it would be valuable both for testing ARI, and for allowing smoother transitions to new certs when revocation occurs for any reason. It would even make our own lives easier, being able to load a bunch of certs into the queue instead of having to manually revoke them at the last second.

Unfortunately it's just not quite a high enough priority for us to tackle at this time. We have a lot of exciting public-facing stuff we're working on, and we need to make sure all of that work goes smoothly and on time.


I think this is an XY problem: what is actually needed is a way to force ARI into "renew now" mode. The cert doesn't need to actually be revoked, and explanationURL isn't meant to be parsed by the client, just shown to the operator. This also avoids the problem of the service having to remember to revoke later. It would need some rate limit, though, since it lets the user move the renewal window at will.


Hi, @aarongable, and thanks for the reply!

I appreciate that this is a non-zero engineering effort, and I accept ISRG's prioritisation of effort. However, I don't think this undertaking would have any bearing on compliance, or on the downtime window for that matter, because this API could be documented as best-effort. In my mind, using this API would not trigger any compliance requirements on LE's part, and missing the deadline (for whatever reason, even load management) would not be an incident, because there wasn't any incident in the first place (e.g. the private key is still held by the subscriber only, etc.). I'm unaware of any BR or root store policies/requirements regarding ARI responses.

Would LE consider, for the time being, some half-solution, like just adding the certificate to the incident table, thus triggering the ARI "immediate replacement" response, without building that other part (timed autorevoke)? That workflow would then rely on the subscriber to actually revoke the cert using the revokeCert ACME endpoint, mirroring the "at the last second" manual revocation, as you've explained.


If the semantics of the API are "POSTing to this endpoint means that Let's Encrypt will revoke the certificate in some number of hours", then that POST request counts as "the Subscriber request[ing] in writing... that the CA revoke the Certificate", for which we are then obligated to revoke within 24 hours per Section 4.9.1.1 of the Baseline Requirements. We already have a mechanism for requesting revocation: the ACME standardized revoke-cert endpoint. While we may in the future augment that endpoint to do a somewhat-delayed revocation and serve updated ARI data in the meantime, we do not have any desire to confuse the issue by adding a second revocation endpoint with different behavior.

If the semantics of the API are simply "POSTing to this endpoint means that LE will begin serving 'renew now' ARI responses", then you're right, we'd bear no compliance burden. But again, my last paragraph above applies: we've thought about this idea, and the amount of engineering work it requires coupled with the amount of utility it gains has not yet caused it to rise to the top of our priority list. I appreciate the request here, and we're taking it into account! Just don't expect this request to immediately tip us over into dropping our other projects to do this.


And while I certainly understand that there's nothing quite like testing production, I still feel that most of the use cases can be covered with testing against Pebble, or revoking in the staging environment (or production environment for a certificate that you don't consider "production") to get renew-immediately ARI responses.
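
As a rough illustration of why the same client code works against Pebble, staging, or production: the ARI lookup is keyed on the certificate alone. The sketch below builds the RFC 9773 certificate identifier (base64url of the Authority Key Identifier, a dot, then base64url of the DER-encoded serial number) and appends it to whatever `renewalInfo` URL the server's directory advertises. The function names are made up, the base URL is a placeholder, and the hex inputs are the example values from the RFC.

```python
import base64


def b64url(data: bytes) -> str:
    """Base64url-encode without trailing padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def ari_cert_id(aki_hex: str, serial_hex: str) -> str:
    """Build the ARI certificate identifier from colon-separated hex strings.

    The serial must keep any leading 00 byte from its DER encoding.
    """
    aki = bytes.fromhex(aki_hex.replace(":", ""))
    serial = bytes.fromhex(serial_hex.replace(":", ""))
    return f"{b64url(aki)}.{b64url(serial)}"


def renewal_info_url(renewal_info_base: str, cert_id: str) -> str:
    """Join the directory's renewalInfo URL and the certificate identifier."""
    return renewal_info_base.rstrip("/") + "/" + cert_id


cert_id = ari_cert_id(
    "69:88:5B:6B:87:46:40:41:E1:B3:7B:84:7B:A0:AE:2C:DE:01:C8:D4",
    "00:87:65:43:21",
)
print(cert_id)  # aYhba4dGQEHhs3uEe6CuLN4ByNQ.AIdlQyE

# Placeholder base URL; a real client reads it from the ACME directory.
print(renewal_info_url("https://acme.example/renewal-info", cert_id))
```

Pointing the client at a different ACME server only changes the base URL taken from that server's directory; the identifier is computed the same way.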


OK, I understand. I will wait until it gets on the radar. Thanks anyway!
