What does a typical affected certificate look like?
I guess revocation only happens for certs without a CN?
Yep, that's correct -- the affected certificates are those that have no Subject Common Name (and equivalently, have Subject Alternative Names that are all greater than 64 characters long).
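To make that rule concrete, here is a minimal sketch of how one might decide whether a certificate ends up without a Subject Common Name. It assumes, purely for illustration, that the CN would be taken from the first SAN that fits within 64 characters; the function names are hypothetical and this is not Let's Encrypt's actual issuance code.

```python
from typing import Optional, Sequence

# Upper bound on commonName length (ub-common-name in RFC 5280).
MAX_CN_LENGTH = 64


def choose_common_name(sans: Sequence[str]) -> Optional[str]:
    """Return the first SAN short enough to serve as a CN, or None.

    Assumption: picking the *first* fitting SAN is an illustrative choice,
    not a documented rule.
    """
    for name in sans:
        if len(name) <= MAX_CN_LENGTH:
            return name
    return None


def has_empty_subject_cn(sans: Sequence[str]) -> bool:
    """True when every SAN is longer than 64 characters, so no CN fits."""
    return choose_common_name(sans) is None


long_name = "x" * 60 + ".example.org"  # 72 characters, too long for a CN
print(has_empty_subject_cn([long_name]))                 # no SAN fits
print(has_empty_subject_cn([long_name, "example.org"]))  # one SAN fits
```

Under that assumption, only a certificate whose SANs are all longer than 64 characters would fall into the affected group.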
How many certificates were revoked? Curious how many in the Caddy community might be experiencing the ARI renewal behavior.
Does anybody know how to search crt.sh for empty CNs?
crt.sh supports searching by the sha1(subject), which you could use to theoretically search for these, but I wasn't able to get that to work without timing out.
This Censys search appears to work properly:
parsed.issuer.organization=`Let's Encrypt` and not parsed.subject.common_name: * and labels=`ever-trusted`
which says:
158.91K unexpired
Hm, that's roughly 0.03 to 0.04 % of all issued certs. (Assuming 5⋅10^6 certs/day issued, which obviously is not an accurate number.)
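For anyone who wants to check that estimate, a rough sketch of the arithmetic follows. The daily issuance figure is the admittedly inaccurate guess from the post above, and the 90-day lifetime is an added assumption used to turn issuance per day into a count of unexpired certificates.

```python
# Figure reported by the Censys search above (158.91K unexpired).
unexpired_no_cn = 158_910

# Assumptions: the rough guess of 5 * 10^6 certs/day from the post,
# and a 90-day certificate lifetime.
assumed_daily_issuance = 5 * 10**6
lifetime_days = 90

# Certificates issued within one lifetime are the ones still unexpired.
unexpired_total = assumed_daily_issuance * lifetime_days

share_percent = 100 * unexpired_no_cn / unexpired_total
print(f"{share_percent:.3f} %")  # prints "0.035 %"
```

The result lands inside the 0.03 to 0.04 % range quoted above, but it is only as good as the guessed issuance rate.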
Now that 133613 certs were revoked, I'm curious how many were renewed with the assistance of ARI.
Just being nosy, what exactly is the "Let’s Encrypt Policy Management Authority" and how is it different than "Let's Encrypt" (or ISRG, I guess)? The CP/CPS says that the PMA is what approves the document and handles revisions. Is it just a committee of people within ISRG, or is it someone external?
While I'm definitely curious about that too (and about ARI adoption in general), I'm also curious about how many were renewed at all. Last I checked, certbot checked OCSP and not ARI, so it would be renewing within a day of it being revoked if run on the recommended schedule. (And of course, roughly a third had probably already been replaced hopefully just based on scheduled expiration anyway.)
And was there an email sent to affected subscribers too?
Yes, the "PMA" is a committee inside ISRG.
Thanks! I already saw it
Question though:
When LE decided to halt issuance, why was that 36 minutes after the incident was declared? I know it's a very small amount of time, but if one appreciates the fact there was just 19 minutes between the halting of issuance and restarting issuance, that latter amount of time is even smaller!
To me this indicates LE was and is very efficient in fixing the CP/CPS, which of course is a great thing. However, I don't have an answer myself as to why the latter window of no issuance was shorter than the time between the declaration of the incident and the halting of issuance.
Possible reasons I thought of:
- it simply takes a certain amount of time between the decision to halt issuance and the actual halting itself (buttons have to be pressed, things have to be set in motion, etc.; one does not simply halt issuance at the largest CA in the world);
- perhaps the decision to halt issuance was made a certain amount of time after the incident was declared;
- probably a combination of the above with perhaps a few other reasons I'm not familiar with.
Note that this is not intended as criticism, as I think LE acted very, very fast. Frankly too fast for my liking, because I was curious what kind of error the production server was generating during the incident, but look at that, I just got myself a worthless certificate because issuance had already been restarted. I'm just curious how these kinds of things work in such a crisis setting.
Second question: is the use of "CN=none" for an empty subject even valid? I guess so. But personally I would read that as producing an invalid Subject with literally an empty CN.
Thirdly: props to @lenaunderwood for her first incident report
PMA discovered the problem, declared an incident, communicated that fact to me (I was oncall).
I spent a few minutes understanding what the situation was, joining the incident video call, then a few more getting logged into production and flipping the switch.
In the meantime, they updated the CP/CPS - since they're the ones who can do that, it was much faster for them to edit some text, push and merge in github. They were already reviewing the document after all, so editing it was very fast.
Aah, I see, PMA themselves declared the incident. Following that, I can see it would indeed take some time.
I thought PMA notified some other body within Let's Encrypt and that body would declare the incident. My assumption about who declared the incident was thus incorrect.
Curious to know what "flipping the switch" actually is. Hopefully it's surrounded with lots of safeguards!
Oh, probably all AI-driven these days.
(Sorry!)
It’s nothing exciting, just a script which disables the API via load balancer configuration. Returns a static error message instead of load balancing.
Our production access is tightly controlled, so only a few people can run it from specially privileged laptops.
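As a purely illustrative sketch of that idea (not the actual script, whose details aren't given here), the snippet below models a front end that normally forwards requests to a backend but returns a static error once a switch is flipped. Every name in it, the 503 status, and the error text are assumptions made up for the example.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical switch; in a real deployment this would live in the
# load balancer configuration rather than in application code.
API_DISABLED = {"value": False}

STATIC_ERROR = b"Service temporarily unavailable\n"  # made-up message


def backend_app(environ, start_response):
    """Stand-in for the real API servers behind the load balancer."""
    body = b"ok\n"
    start_response("200 OK", [("Content-Length", str(len(body)))])
    return [body]


def front_end(environ, start_response):
    """Forward to the backend unless the switch is flipped."""
    if API_DISABLED["value"]:
        start_response(
            "503 Service Unavailable",
            [("Content-Type", "text/plain"),
             ("Content-Length", str(len(STATIC_ERROR)))],
        )
        return [STATIC_ERROR]
    return backend_app(environ, start_response)


def call(app):
    """Invoke a WSGI app once and return (status, body)."""
    captured = {}
    environ = {}
    setup_testing_defaults(environ)
    body = b"".join(app(environ, lambda s, h: captured.update(status=s)))
    return captured["status"], body


print(call(front_end))          # normal operation
API_DISABLED["value"] = True    # "flipping the switch"
print(call(front_end))          # static error instead of load balancing
```

The point of the design is that the backends never see traffic while the switch is on, so nothing can be issued regardless of what state they are in.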
One for you, one for the boss and one for the intern.
J/K, I'm sure the script has a meaningful filename which wouldn't get run accidentally.
Let's hope this incident report gets praised and dismissed without any fuss.
sudo ./stop_the_world.sh
I appreciate the fact that you think there are enough of us for there to be multiple bodies to be informed.
More seriously, anyone at LE can declare an incident, since anyone might be the person to discover one. In this case it just happened to be that the incident was discovered by PMA during document review.
"PMA" and "non-PMA" or perhaps some "operator" group, although I guess multiple people can have multiple functions