Just looking at these two issues, it seems like many users are radically unprepared for changes in intermediate certificates:
(I think Let's Encrypt has been very clear on the issues and best practices around this, so I don't think Let's Encrypt did anything wrong here—it's not unreasonable to expect people relying on Let's Encrypt services to read the documentation and/or subscribe to API Announcements and/or talk to someone from Let's Encrypt about whether they did it right.)
In a few other cases, such as the v1 API deprecation, Let's Encrypt has used a "brownout" approach to try to make sure that a change will come to subscribers' attention before it fully takes effect.
I wonder if a similar approach would be useful for (scheduled, non-emergency) intermediate certificate changes. For example, maybe the API at the regular URL would start issuing from the new intermediate on a certain date, while there would be a temporary "legacy" API that still issued from the old intermediate for, perhaps, 90 or 100 days afterward. It could be called acme-legacy.api.letsencrypt.org or acme-deprecated.api.letsencrypt.org or thisservicewillgoawayonjune7.api.letsencrypt.org or something.
In that case, people who were surprised by incompatibilities involving the new chain would get one additional certificate lifetime in which to act to remedy things in response to that surprise.
(I realize this could represent a lot of extra back-end engineering work, depending on how many things in the production API assume that all CPS-relevant issuance is coming from it—I remember that there were two different intermediates which were designed to operate in parallel at the time of Let's Encrypt's original launch, and that turned out not to be a convenient design.)
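As a rough sketch of the legacy-endpoint idea above: the regular endpoint switches intermediates on a set date, while a temporary legacy endpoint keeps issuing from the old one for one more certificate lifetime and then goes away. The endpoint labels, the function name, and the dates here are all made up for illustration; nothing below reflects how the production API actually behaves.

```python
from datetime import date, timedelta

# All names and dates below are hypothetical, for illustration only.
CERT_LIFETIME = timedelta(days=90)  # the "90 or 100 days" grace period floated above


def issuing_intermediate(endpoint: str, today: date, switch_date: date) -> str:
    """Return which intermediate ("old" or "new") a request would be issued from.

    endpoint: "regular" or "legacy" (made-up labels for the two APIs).
    Raises RuntimeError once the legacy endpoint has been retired.
    """
    if endpoint == "regular":
        # The regular URL flips to the new intermediate on the switch date.
        return "new" if today >= switch_date else "old"
    if endpoint == "legacy":
        # The legacy URL keeps the old intermediate for one more lifetime.
        if today >= switch_date + CERT_LIFETIME:
            raise RuntimeError("legacy endpoint has been retired")
        return "old"
    raise ValueError(f"unknown endpoint: {endpoint!r}")


switch = date(2021, 1, 1)  # placeholder date, not a real schedule
print(issuing_intermediate("regular", date(2021, 1, 15), switch))  # new
print(issuing_intermediate("legacy", date(2021, 1, 15), switch))   # old
```

Someone surprised by the new chain on the regular endpoint could point their client at the legacy one, get a single extra certificate from the old intermediate, and use that lifetime to fix their setup.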
It's a good idea in general! It's hard in practice.
In particular, I'm totally willing to say that it's still going to be another couple weeks before we're ready to issue from two different intermediates side-by-side (a feature we're going to use for RSA and ECDSA) at all. Having that feature in place in time for the X3-to-R3 transition would not have been practical or possible, thanks to the mythical man-month and all that.
In addition, there's simply the problem of lead time. Many of the transitions we make are time-constrained by certificate expirations, regulatory dates, or other factors. In this particular case, any brownout period would have had to end by the date on which we made the transition anyway. So the brownout period would have had to start much earlier, which would require the new cross signatures to be available earlier, which would require the new hierarchy to have been issued earlier, which would require the ceremony tool to have been feature-complete earlier, and so on and so forth. At the end of the day, planning long-term projects like that is always hard, and isn't made easier by being a very small team. And even though 90 days is a blink of the eye in terms of normal certificate lifetimes, it's still a whole quarter of a year in terms of software development times.
This is something that we're always thinking about. Brownouts are something that we always want to provide, whenever we can (be on the lookout for APIv1 brownouts coming soon!). Unfortunately it just wasn't reasonably possible in this particular case.
Could we have an "emergency escape" endpoint issuing from X4 with a shorter lifetime (like a week), using a different Boulder config? Or is it just an HSM sitting on a shelf, with no real server for it?
Just a couple random thoughts I've had along these lines:
Would it ever make sense to use different intermediates on different servers at the same time? My understanding is that there are two main data centers, so if there was one intermediate for one data center and one for the other, then at least people might have some more expectation that you can't predict which intermediate you're going to get. Or is @schoen's comment saying that this was tried and rejected?
I know that running a key ceremony has to require a lot of logistical planning and costs and such (opportunity cost of using the time for that instead of something else, at the very least, and I'm guessing more), but how hard would it be to rotate intermediates every year, or maybe every 18 months or so, rather than every 4 years or so? Especially if it were on a regular schedule, maybe people would pay more attention and expect it? (Or maybe not.) Or here's a crazy idea: Maybe, if the key ceremony is the pain (rather than the key transition itself), switch back and forth between the two "live" intermediates every 6 months (or whatever) directly (like swap from R3 to R4 and back)? I'm assuming that the not-currently-live intermediate could still serve its purpose as disaster recovery, which is perhaps a terrible assumption, but maybe making three intermediates in the same ceremony, switching between the first two regularly, and keeping the third as DR could almost make sense?
The recent switch exposed those that set their constraints so tight that even the smallest (and expected) change broke their connections. Lessons learned for those that put their hands in that fire.
But that won't stop others from doing that exact same thing (even as we speak, odds are someone is).
There is no way for LE to put out that fire. It is an eventuality in the design.
Alternating every X months will only catch them at their next scheduled renewal after each switch.
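To make the "constraints set too tight" failure concrete, here is a toy sketch. It assumes the tight constraint was a pin on the intermediate's fingerprint, which is one common way to get burned but is my assumption rather than something spelled out above. Certificates are modelled as plain byte strings and all names are hypothetical; this is not real X.509 path validation.

```python
import hashlib
from typing import Iterable, Set

# Toy model: a "certificate" is just a byte string, and a chain is a sequence
# of them, leaf first. It only illustrates where in the chain a pin sits.


def fingerprint(cert: bytes) -> str:
    """SHA-256 hex digest standing in for a certificate fingerprint."""
    return hashlib.sha256(cert).hexdigest()


def chain_matches_pin(chain: Iterable[bytes], pinned: Set[str]) -> bool:
    """True if any certificate in the chain has a pinned fingerprint."""
    return any(fingerprint(cert) in pinned for cert in chain)


ROOT = b"example root"
OLD_INTERMEDIATE = b"example intermediate A"
NEW_INTERMEDIATE = b"example intermediate B"

chain_before = [b"leaf 1", OLD_INTERMEDIATE, ROOT]
chain_after = [b"leaf 2", NEW_INTERMEDIATE, ROOT]  # first renewal after the switch

tight_pin = {fingerprint(OLD_INTERMEDIATE)}  # pinned to the intermediate
root_pin = {fingerprint(ROOT)}               # anchored further up the chain

print(chain_matches_pin(chain_before, tight_pin))  # True
print(chain_matches_pin(chain_after, tight_pin))   # False
print(chain_matches_pin(chain_after, root_pin))    # True
```

The client with the intermediate pin keeps working right up until the server's first renewal after the switch, which is why the breakage shows up at renewal time rather than on the switch date itself. Anchoring higher in the chain only moves the problem, since roots change too, just less often.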
Should LE alter/modify its plans based on this knowledge (that this will definitely happen again)?
I say: NO.
Should LE perhaps put out some information on this... like: Do's and don'ts to cover this situation? (more than what's already out there?)
Maybe; as this might not be an expensive/time consuming endeavor.
But you put the "Danger" sign immediately before anyone reaches "the danger".
Where would this sign get posted?
So, exactly where would you put the information? How would you index it? What keywords would ensure that people who are about to design their systems this incorrectly actually find it, and come to understand the unnecessary risks such a path leads to?
Don't get me wrong: I'm not saying "don't even try".
I'm saying "don't expect too much" from such an informational effort.
LE will feel this problem more because it has the majority of the issued certs.
And it would always feel it sooner because their certs expire sooner.
But the problem is fundamental to the security design - not the CA/intermediate used.
I think it is a problem that should not be taking up any more CPU, nor brain, cycles attempting to rectify.
You will be assimilated. Resistance is futile.
--the borg
In retrospect, it's been very confusing to people that both "DST Root X3" and "Let's Encrypt Authority X3" have "X3" in the name and expire around the same time. If nothing else, it'd be good to try to avoid that confusion as much as possible in the future. I might suggest that the next ISRG Root (as I'm assuming another one will be needed at some point) probably shouldn't have "X3" in the name despite it being the logical next choice.
Imagine if you went by the nickname Eduardo. To affirmatively reference who you really are I'd have to either know Eduardo is Tom or scan your identification documents. Since one less familiar with you won't have the luxury of the former, they're left with the latter. You are Tom, aren't you?
Having the information in the name makes referencing the certificate foolproof.
I must agree.
I'm all for saving space... but this is too much.
Two letters just isn't quite enough.
What's to stop anyone else from calling their next CA/Intermediates "R3" or "R4"?
There should be some uniformity in the name.
Is it easier or more difficult to distinguish an intermediate certificate from a root certificate if they have "intermediate" and "root" in their respective common names?
I'm pretty sure that if someone can seriously troubleshoot TLS issues, they also recognise a CommonName "R3" with Organisation "Let's Encrypt" as the correct certificate. And if they can't do the latter, they're probably also not that good at the former.
Riddle me this: what if you have a certificate functioning as an intermediate, but with "Root" in its common name? Now THAT's confusing! Note: this is going to happen pretty soon.
Fact is: a CommonName is pretty much a reference to the public/private key pair that does the signing (and whose public key sits in a certificate), and not so much to any one certificate itself.
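A toy illustration of that last point, with made-up names and a string standing in for the public key: the same CommonName and key can appear in a self-signed certificate and in a cross-signed one, and only the second makes it act as an intermediate. This is a simplified model and not how any X.509 library represents certificates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject_cn: str   # CommonName of the subject
    subject_key: str  # stands in for the subject's public key
    issuer_cn: str    # CommonName of whoever signed this certificate

    @property
    def self_signed(self) -> bool:
        # Simplification: compares names only, not actual signatures.
        return self.subject_cn == self.issuer_cn


# Two hypothetical certificates for the same name and key pair.
root_form = Cert("Example Root X", "key-1", "Example Root X")
cross_signed_form = Cert("Example Root X", "key-1", "Older Root Y")

same_identity = (root_form.subject_cn, root_form.subject_key) == (
    cross_signed_form.subject_cn,
    cross_signed_form.subject_key,
)

print(same_identity)                   # True
print(root_form == cross_signed_form)  # False
print(cross_signed_form.self_signed)   # False
```

Both certificates carry "Root" in the CommonName, but the cross-signed one chains up to another root and so functions as an intermediate, which is exactly the riddle posed above.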