I just wanted to say that Let's Encrypt's incident reports are a joy to read, and really set the standard for the rest of the industry. They explain how everything is automated, what failed, what they missed from a holistic process perspective, and how they're stopping it and similar problems from happening in the future.
(With other CA incident reports, I sometimes get the vibe along the lines of "Eh, a human clicked the wrong button; we updated the training to tell them not to click that button again" without really explaining why they have a process that lets a human click a wrong button in the first place.)
Furthermore, I second your compliment! Nice report. Although I did laugh a little bit about the caption of the "first factor of misissuance" being quite so positive in essence. It made me think of being at a job interview, when the interviewer asks about your biggest flaw and you somehow need to turn it from worst to best.
While reading the report, I was wondering about the statistic of "incidents per issued certificate during a certain time". Let's Encrypt prooooooooobably has the lowest number among the larger CAs.
Thanks! We work really hard on them. We probably spent more collective hours preparing the materials for and writing the incident report than we did actively responding to the incident. And we'll spend many more as we implement our remediation items.
It's important to understand not just how something failed, but also why! Our systems were put together in this particular way for very good reasons, and it's important to understand what decisions led to the circumstance that bit us.
[I see a double-edged sword]
Should we be doing something about that? LOL
[sabotage!!!!! only to read the report about how it happened and how to prevent it in the future]
Oh, I figured, and I suspect most good incident reports are like that. I know the timing was probably poor, given that the bulk of the time between incident and report was over what you said was a holiday weekend for you.
It really is surprising that this didn't bite you previously. (As it seems like something that's obvious in retrospect, though I can certainly see how it was missed.) I take it that the certificate duration isn't part of the "template", or you would have seen it two years ago when you reduced certificate durations by one second?
It wasn't completely clear to me from the report how you'd address Factor 1: even if your new check would have found that changing the template would lead to mismatching precertificates, wouldn't you still need to do a complete atomic start to deploy a similar change in the future, contrary to your usual approach?
Yeah, I was thinking of one recent report in particular from another CA, but I didn't want to be bashing other CAs on this forum.
We haven't decided either. With the new lint in place, we'd have 500ed and failed to issue -- not that big a deal in the grand scheme of things. So we'd just have to post a maintenance window and deploy one DC at a time.
We are discussing a few ideas for how to be able to seamlessly deploy changes like this as a rolling deploy, but without any further certificate profile changes immediately planned, it's not as urgent.
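For anyone curious what that consistency check amounts to conceptually, here's a minimal sketch. This is not Boulder's actual lint (Boulder is written in Go); the helper names and the use of Python's `cryptography` library are purely illustrative. The idea follows RFC 6962: the final certificate must match its precertificate once the CT poison extension (precert-only) and the SCT list extension (final-cert-only) are set aside.

```python
# Illustrative sketch only, NOT Boulder's actual lint.
from cryptography import x509
from cryptography.x509.oid import ExtensionOID
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

# Extensions that legitimately differ between precert and final cert.
IGNORED_OIDS = {
    ExtensionOID.PRECERT_POISON,                         # only in the precertificate
    ExtensionOID.PRECERT_SIGNED_CERTIFICATE_TIMESTAMPS,  # only in the final certificate
}

def comparable_extensions(cert: x509.Certificate):
    # Everything except the two CT-specific extensions must be identical.
    return [ext for ext in cert.extensions if ext.oid not in IGNORED_OIDS]

def spki(cert: x509.Certificate) -> bytes:
    # DER-encoded SubjectPublicKeyInfo, for a byte-exact key comparison.
    return cert.public_key().public_bytes(Encoding.DER, PublicFormat.SubjectPublicKeyInfo)

def final_matches_precert(precert: x509.Certificate, final: x509.Certificate) -> bool:
    """If this returns False, refuse to issue (surface as a 500 upstream)."""
    return (
        precert.serial_number == final.serial_number
        and precert.issuer == final.issuer
        and precert.subject == final.subject
        and precert.not_valid_before == final.not_valid_before
        and precert.not_valid_after == final.not_valid_after
        and spki(precert) == spki(final)
        and comparable_extensions(precert) == comparable_extensions(final)
    )
```

In other words, the check itself is cheap; the interesting part is that it can only fire at final-issuance time, which is why a profile change rolled out DC by DC would turn mismatches into failed issuances rather than misissued certificates.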
They are a random subset of the certificates which were issued during two one-minute-long periods as the profile change was deployed to each datacenter.
This random subset was 654 certificates, issued in about 8 minutes 30 seconds. Can ISRG share the percentage of total volume affected during this window? I'm sure us normal folk could pull it from the transparency logs, but if you have this number handy already, it would be really interesting to know.
It wasn't 8 minutes 30 seconds, really: it was two periods (one per DC), each just under 1 minute.
I did a quick script to munge some log files, and counted "good" issuances between the first and last mismatch in each DC. That's not quite right, since a mismatch could have occurred slightly outside that window, but it's a close approximation.
DC1: 617 good certs, 356 mismatched - 37%
DC2: 574 good certs, 289 mismatched - 33%
Sum: 1191 good certs, 645 mismatched certs - 35%
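(For what it's worth, the munging itself was nothing fancy. Here's a minimal sketch of the general approach, not the actual script; the newline-delimited JSON log format and the field names below are invented for illustration.)

```python
# Rough sketch: count "good" issuances between the first and last mismatch
# in each DC. Assumes made-up JSON log fields: dc, time (ISO 8601), mismatch.
# Usage: python3 count_mismatches.py < issuance.log
import json
import sys
from collections import defaultdict
from datetime import datetime

def parse(line):
    rec = json.loads(line)
    return rec["dc"], datetime.fromisoformat(rec["time"]), rec.get("mismatch", False)

def count_per_dc(lines):
    events = defaultdict(list)  # dc -> [(timestamp, mismatch), ...]
    for line in lines:
        dc, ts, mismatch = parse(line)
        events[dc].append((ts, mismatch))

    for dc, entries in sorted(events.items()):
        mismatch_times = [ts for ts, m in entries if m]
        if not mismatch_times:
            continue
        first, last = min(mismatch_times), max(mismatch_times)
        # Only counts good issuances inside the window bounded by the first and
        # last mismatch, so it's a close approximation rather than exact.
        good = sum(1 for ts, m in entries if not m and first <= ts <= last)
        bad = len(mismatch_times)
        print(f"{dc}: {good} good certs, {bad} mismatched - {bad / (good + bad):.0%}")

if __name__ == "__main__":
    count_per_dc(sys.stdin)
```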
Thanks! I knew about the two 1-minute periods. These numbers give great context for the severity of this issue: 35% mismatched within those 1-minute periods, but 100% valid in the other 6 minutes 30 seconds of the overall window.
Ah, I see now. If every certificate used the same NotBefore and NotAfter as part of the template, and you just updated them every week with your deployment, then you would have caught this much sooner. Someone should probably put together a pull request for that.