I just wanted to say that Let's Encrypt's incident reports are a joy to read, and really set the standard for the rest of the industry. They explain how everything is automated, what failed, what they missed from a holistic process perspective, and how they're stopping it and similar problems from happening in the future.
(With other CA incident reports, I sometimes get the vibe along the lines of "Eh, a human clicked the wrong button; we updated the training to tell them not to click that button again" without really explaining why they have a process that lets a human click a wrong button in the first place.)
Furthermore, I second your compliment! Nice report. Although I did laugh a little bit about the caption of the "first factor of misissuance" being quite so positive in essence. It made me think of being at a job interview, when the interviewer asks about your biggest flaw and you somehow need to turn it from worst to best.
While reading the report, I was wondering about the statistic of "incidents per issued certificate during a certain time". Let's Encrypt prooooooooobably has the lowest number among the larger CAs.
Thanks! We work really hard on them. We probably spent more collective hours preparing the materials for and writing the incident report than we did actively responding to the incident. And we'll spend many more as we implement our remediation items.
It's important to understand not just how something failed, but also why! Our systems were put together in this particular way for very good reasons, and it's important to understand what decisions led to the circumstance that bit us.
[I see a double-edged sword]
Should we be doing something about that? LOL
[sabotage!!!!! only to read the report about how it happened and how to prevent it in the future]
Oh, I figured, and I suspect most good incident reports are like that. I know the timing was probably poor, given that the bulk of the time between incident and report was over what you said was a holiday weekend for you.
It really is surprising that this didn't bite you previously. (As it seems like something that's obvious in retrospect, though I can certainly see how it was missed.) I take it that the certificate duration isn't part of the "template", or you would have seen it two years ago when you reduced certificate durations by one second?
It wasn't completely clear to me from the report how you'd address Factor 1: even if your new check would have found that changing the template would lead to mismatching precertificates, wouldn't you still need to do a complete atomic start to deploy a similar change in the future, contrary to your usual approach?
Yeah, I was thinking of one recent report in particular from another CA, but I didn't want to be bashing other CAs on this forum.
We haven't decided either. With the new lint in place, we'd have 500ed and failed to issue -- not that big a deal in the grand scheme of things. So we'd just have to post a maintenance window and deploy one DC at a time.
We are discussing a few ideas for how to be able to seamlessly deploy changes like this as a rolling deploy, but without any further certificate profile changes immediately planned, it's not as urgent.
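For anyone curious what that consistency check amounts to conceptually, here's a minimal sketch. This is not Boulder's actual lint (Boulder is written in Go); the helper names and the use of Python's `cryptography` library are purely illustrative. The idea follows RFC 6962: the final certificate must match its precertificate once the CT poison extension (precert-only) and the SCT list extension (final-cert-only) are set aside.

```python
# Illustrative sketch only, NOT Boulder's actual lint.
from cryptography import x509
from cryptography.x509.oid import ExtensionOID
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

# Extensions that legitimately differ between precert and final cert.
IGNORED_OIDS = {
    ExtensionOID.PRECERT_POISON,                         # only in the precertificate
    ExtensionOID.PRECERT_SIGNED_CERTIFICATE_TIMESTAMPS,  # only in the final certificate
}

def comparable_extensions(cert: x509.Certificate):
    # Everything except the two CT-specific extensions must be identical.
    return [ext for ext in cert.extensions if ext.oid not in IGNORED_OIDS]

def spki(cert: x509.Certificate) -> bytes:
    # DER-encoded SubjectPublicKeyInfo, for a byte-exact key comparison.
    return cert.public_key().public_bytes(Encoding.DER, PublicFormat.SubjectPublicKeyInfo)

def final_matches_precert(precert: x509.Certificate, final: x509.Certificate) -> bool:
    """If this returns False, refuse to issue (surface as a 500 upstream)."""
    return (
        precert.serial_number == final.serial_number
        and precert.issuer == final.issuer
        and precert.subject == final.subject
        and precert.not_valid_before == final.not_valid_before
        and precert.not_valid_after == final.not_valid_after
        and spki(precert) == spki(final)
        and comparable_extensions(precert) == comparable_extensions(final)
    )
```

In other words, the check itself is cheap; the interesting part is that it can only fire at final-issuance time, which is why a profile change rolled out DC by DC would turn mismatches into failed issuances rather than misissued certificates.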
They are a random subset of the certificates which were issued during two one-minute-long periods as the profile change was deployed to each datacenter.
This random subset was 654 certificates, issued in about 8 minutes 30 seconds. Can ISRG share the percentage of total volume affected during this window? I'm sure us normal folk could pull it from the transparency logs, but if you have this number handy already, it would be really interesting to know.
It wasn't 8 minutes 30 seconds, really: it was two periods (one per DC), each just under 1 minute.
I did a quick script to munge some log files, and counted "good" issuances between the first and last mismatch in each DC. That's not quite right, since a mismatch could have occurred slightly outside that window, but it's a close approximation.
DC1: 617 good certs, 356 mismatched - 37%
DC2: 574 good certs, 289 mismatched - 33%
Sum: 1191 good certs, 645 mismatched certs - 35%
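(For what it's worth, the munging itself was nothing fancy. Here's a minimal sketch of the general approach, not the actual script; the newline-delimited JSON log format and the field names below are invented for illustration.)

```python
# Rough sketch: count "good" issuances between the first and last mismatch
# in each DC. Assumes made-up JSON log fields: dc, time (ISO 8601), mismatch.
# Usage: python3 count_mismatches.py < issuance.log
import json
import sys
from collections import defaultdict
from datetime import datetime

def parse(line):
    rec = json.loads(line)
    return rec["dc"], datetime.fromisoformat(rec["time"]), rec.get("mismatch", False)

def count_per_dc(lines):
    events = defaultdict(list)  # dc -> [(timestamp, mismatch), ...]
    for line in lines:
        dc, ts, mismatch = parse(line)
        events[dc].append((ts, mismatch))

    for dc, entries in sorted(events.items()):
        mismatch_times = [ts for ts, m in entries if m]
        if not mismatch_times:
            continue
        first, last = min(mismatch_times), max(mismatch_times)
        # Only counts good issuances inside the window bounded by the first and
        # last mismatch, so it's a close approximation rather than exact.
        good = sum(1 for ts, m in entries if not m and first <= ts <= last)
        bad = len(mismatch_times)
        print(f"{dc}: {good} good certs, {bad} mismatched - {bad / (good + bad):.0%}")

if __name__ == "__main__":
    count_per_dc(sys.stdin)
```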
Thanks! I knew about the two 1-minute periods. These numbers give great context for the severity of this issue: 35% mismatched within those 1-minute periods, but 100% valid in the other 6 minutes 30 seconds of the overall window.
Ah, I see now. If every certificate used the same NotBefore and NotAfter as part of the template, and you just updated them every week with your deployment, then you would have caught this much sooner. Someone should probably put together a pull request for that.