Let's Encrypt's 2023-06-15 incident report is great

Continuing the discussion from 2023.06.15 Certificate Policies Extension Mismatch:

I just wanted to say that Let's Encrypt's incident reports are a joy to read, and really set the standard for the rest of the industry. They explain how everything is automated, what failed, what they missed from a holistic process perspective, and how they're stopping it and similar problems from happening in the future.

(With other CA incident reports, I sometimes get a vibe along the lines of "Eh, a human clicked the wrong button; we updated the training to tell them not to click that button again", without really explaining why they have a process that lets a human click the wrong button in the first place.)

13 Likes

Yes they are; but they are very very infrequent too! :slight_smile:

5 Likes

Which is explicitly not allowed :wink:

Furthermore, I second your compliment! Nice report :slight_smile: Although I did laugh a little at the caption of the "first factor of misissuance" being quite so positive in essence :rofl: It made me think of being at a job interview, when the interviewer asks about your biggest flaw and you somehow need to spin it from worst to best :stuck_out_tongue:

While reading the report, I was wondering about the statistic of "incidents per issued certificate over a given period". Let's Encrypt prooooooooobably has the lowest number among the larger CAs.

6 Likes

Thanks! We work really hard on them. We probably spent more collective hours preparing all the materials for and writing the incident report than we did actively responding to the incident. And we'll spend many more as we implement our remediation items.

It's important to not just understand how something failed, but also why! Our systems were put together in this particular way for very good reason, and it's important to understand what decisions led to the circumstance that bit us. :smiley:

8 Likes

[I see a double-edged sword]
Should we be doing something about that? LOL
[sabotage!!!!! only to read the report about how it happened and how to prevent it in the future]

4 Likes

Oh, I figured, and I suspect most good incident reports are like that. I know the timing was probably poor, given that the bulk of the time between incident and report was over what you said was a holiday weekend for you.

It really is surprising that this didn't bite you previously, as it seems like something that's obvious in retrospect (though I can certainly see how it was missed). I take it that the certificate duration isn't part of the "template", or you would have seen it two years ago when you reduced certificate durations by one second?

It wasn't completely clear to me from the report how you'd address Factor 1: even if your new check would have caught that changing the template leads to mismatching precertificates, wouldn't you still need to do a complete atomic start to deploy a similar change in the future, contrary to your usual approach?

Yeah, I was thinking of one recent report in particular from another CA, but I didn't want to be bashing other CAs on this forum.

6 Likes

Then where [pray tell]?
LOL

4 Likes

We haven't decided either. With the new lint in place, we'd have 500ed and failed to issue -- not that big a deal in the grand scheme of things. So we'd just have to post a maintenance window and deploy one DC at a time.

We are discussing a few ideas for how to be able to seamlessly deploy changes like this as a rolling deploy, but without any further certificate profile changes immediately planned, it's not as urgent.
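
For anyone curious what such a check amounts to conceptually, here's a minimal sketch (the names and structure are made up for illustration; this is not Boulder's actual lint): it verifies that a final certificate's extensions match its precertificate's, ignoring only the CT poison and SCT-list extensions, which are the pair that is allowed to differ.

```go
package lint

import (
	"bytes"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/asn1"
	"fmt"
)

var (
	// The CT poison extension appears only in precertificates, and the
	// SCT list extension appears only in final certificates; everything
	// else must match exactly.
	oidCTPoison = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 3}
	oidSCTList  = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 2}
)

// checkPrecertCorrespondence returns an error if the final certificate's
// extensions differ from the precertificate's (modulo the poison/SCT pair).
// It assumes extensions appear in the same order in both, which holds when
// both are built from the same template.
func checkPrecertCorrespondence(precert, final *x509.Certificate) error {
	pre := filterExtensions(precert.Extensions)
	fin := filterExtensions(final.Extensions)
	if len(pre) != len(fin) {
		return fmt.Errorf("extension count mismatch: precert has %d, final cert has %d", len(pre), len(fin))
	}
	for i := range pre {
		if !pre[i].Id.Equal(fin[i].Id) || !bytes.Equal(pre[i].Value, fin[i].Value) {
			return fmt.Errorf("extension %v differs between precertificate and final certificate", pre[i].Id)
		}
	}
	return nil
}

// filterExtensions drops the two extensions that legitimately differ.
func filterExtensions(exts []pkix.Extension) []pkix.Extension {
	var out []pkix.Extension
	for _, e := range exts {
		if e.Id.Equal(oidCTPoison) || e.Id.Equal(oidSCTList) {
			continue
		}
		out = append(out, e)
	}
	return out
}
```

With something like this running as a pre-signing check, issuing a final certificate under a different profile than the one used for its precertificate fails closed instead of producing a mismatch.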

6 Likes

Correct. The NotBefore and NotAfter dates are passed as input (since they're not the same on every certificate!).
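
To make that concrete, here's a rough sketch of the split (hypothetical types, not Boulder's actual issuance code): the profile carries the fields that are fixed when a profile change is deployed, while the validity window arrives with each issuance request, which is why a duration change wouldn't surface at profile-deploy time.

```go
package issuance

import (
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/asn1"
	"time"
)

// Profile holds fields that are fixed at deploy time.
type Profile struct {
	Policies     []asn1.ObjectIdentifier // e.g. the certificate policies that changed in this incident
	KeyUsage     x509.KeyUsage
	ExtKeyUsages []x509.ExtKeyUsage
}

// IssuanceRequest carries the per-certificate inputs; the validity window is
// computed per order, so it is not part of the deployed profile.
type IssuanceRequest struct {
	Subject   pkix.Name
	DNSNames  []string
	NotBefore time.Time
	NotAfter  time.Time
}

// buildTemplate combines the fixed profile with the per-certificate inputs.
func buildTemplate(p Profile, req IssuanceRequest) *x509.Certificate {
	return &x509.Certificate{
		Subject:           req.Subject,
		DNSNames:          req.DNSNames,
		NotBefore:         req.NotBefore,
		NotAfter:          req.NotAfter,
		KeyUsage:          p.KeyUsage,
		ExtKeyUsage:       p.ExtKeyUsages,
		PolicyIdentifiers: p.Policies,
	}
}
```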

7 Likes

They are a random subset of the certificates which were issued during two one-minute-long periods as the profile change was deployed to each datacenter.

This random subset was 645 certificates, in about 8 minutes 30 seconds. Can ISRG share the percentage of total volume affected during this window? I'm sure we normal folk could pull it from the transparency logs, but if you have this number handy already, it would be really interesting to know.

5 Likes

It wasn't 8 minutes 30 seconds, really: It was two periods (one per DC), each just under 1 minute.

I wrote a quick script to munge some log files, and counted "good" issuances between the first and last mismatch in each DC. That's not quite right, since a mismatch could have occurred slightly outside that window, but it's a close approximation.

DC1: 617 good certs, 356 mismatched - 37%
DC2: 574 good certs, 289 mismatched - 33%
Sum: 1191 good certs, 645 mismatched certs - 35%
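
For anyone wanting to reproduce the rough idea against their own logs, here's a sketch of that kind of munging. The log format (one line per issuance: RFC 3339 timestamp, datacenter, and a good/mismatch flag) is invented for illustration and is nothing like what Boulder actually emits.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

type entry struct {
	ts       time.Time
	dc       string
	mismatch bool
}

func main() {
	// Assumed line format: "2023-06-15T15:04:05Z dc1 good" or "... dc1 mismatch".
	var entries []entry
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 3 {
			continue
		}
		ts, err := time.Parse(time.RFC3339, fields[0])
		if err != nil {
			continue
		}
		entries = append(entries, entry{ts: ts, dc: fields[1], mismatch: fields[2] == "mismatch"})
	}

	// The window for each DC runs from its first to its last mismatch.
	first, last := map[string]time.Time{}, map[string]time.Time{}
	for _, e := range entries {
		if !e.mismatch {
			continue
		}
		if f, ok := first[e.dc]; !ok || e.ts.Before(f) {
			first[e.dc] = e.ts
		}
		if l, ok := last[e.dc]; !ok || e.ts.After(l) {
			last[e.dc] = e.ts
		}
	}

	// Count good and mismatched issuances inside each DC's window.
	good, bad := map[string]int{}, map[string]int{}
	for _, e := range entries {
		f, ok := first[e.dc]
		if !ok || e.ts.Before(f) || e.ts.After(last[e.dc]) {
			continue
		}
		if e.mismatch {
			bad[e.dc]++
		} else {
			good[e.dc]++
		}
	}
	for dc := range first {
		total := good[dc] + bad[dc]
		fmt.Printf("%s: %d good certs, %d mismatched - %.0f%%\n",
			dc, good[dc], bad[dc], 100*float64(bad[dc])/float64(total))
	}
}
```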

9 Likes

Thanks! I knew about the two ~1-minute periods. These numbers give great context for the severity of this issue: 35% within those 1-minute periods, but 100% valid in the other ~6m30s of the overall window.

5 Likes

Ah, I see now. If every certificate used the same NotBefore and NotAfter as part of the template, and you just updated them every week with your deployment, then you would have caught this much sooner. Someone should probably put together a pull request for that. :slight_smile:

4 Likes

And credit goes to Andrew Ayer for a pretty rapid bug report; clearly it helps to have external monitoring and sense checking.

7 Likes

Huge thank you to Let's Encrypt for being an excellent example for the industry to follow. :pray:

You're the hero we need, not the hero we deserve :slight_smile: :bat:

5 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.