OCSP Reliability Shout Out

I spotted this paper, "Revocation Statuses on the Internet" by Korzhitskii and Carlsson, being discussed on Twitter. It covers measurements of PKI revocation generally and Let's Encrypt's response to the 2020.02.29 CAA Rechecking Bug specifically. It's an interesting read overall, but this part from §4.2 stood out to me as particularly cool and worthy of praise:

While we only had timeouts for 0.04% of the status requests, the differences between the number of affected certificates were substantial between CAs: only 0.07% of the Let’s Encrypt certificates had at least one timeout, compared to 13.98% of the other CAs’ certificates. These fractions are non-negligible, since most browsers soft-fail on an OCSP timeout and continue to establish a potentially insecure connection

It's not easy running a reliable OCSP service. Doubly so at the scale Let's Encrypt issues at. Kudos to everyone for delivering excellent reliability through a particularly tumultuous revocation event and beyond.

:tada::tada::tada:

17 Likes

@lestaff

:clap: :clap: :clap: :partying_face:

5 Likes

Just look at the daily issued certificates graph:

It's madness! We're almost at 2 million certificates issued daily on average! And Let's Encrypt needs to sign 13 OCSP responses per certificate over its lifetime! Taking just one day's batch of certificates, 2 million divided by 7 (as OCSP responses are valid for 7 days) works out to almost 286,000 extra signatures daily, and that is before counting the certificates issued on all the other days. All in all, that's a lot of work for those poor HSMs!
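For anyone who wants to check the arithmetic, here is a rough sketch. It assumes a 90-day certificate lifetime (not stated in this thread) and that each response is re-signed only once every 7 days, so treat the results as a lower bound rather than Let's Encrypt's real numbers. The function name is made up for illustration.

```python
import math

def ocsp_signing_load(certs_per_day, lifetime_days=90, response_validity_days=7):
    """Rough OCSP signing estimate (lifetime_days=90 is an assumption)."""
    # Responses needed to cover one certificate's whole lifetime.
    responses_per_cert = math.ceil(lifetime_days / response_validity_days)
    # Daily re-signing load caused by a single day's batch of certificates.
    per_batch_daily = certs_per_day / response_validity_days
    # Steady state: every unexpired certificate needs a fresh response
    # once per validity window.
    active_certs = certs_per_day * lifetime_days
    steady_state_daily = active_certs / response_validity_days
    return responses_per_cert, per_batch_daily, steady_state_daily

per_cert, per_batch, steady = ocsp_signing_load(2_000_000)
print(per_cert)          # responses per certificate
print(round(per_batch))  # daily signatures for one day's batch
print(round(steady))     # daily signatures across all unexpired certificates
```

Under these assumptions the roughly 286,000 figure is the load from one day's batch; summed over every batch that is still unexpired, the daily total comes out around 90 times larger.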

8 Likes

We sign faster than every 7 days to minimize the danger of a stale response should there be any interruption in the resigning service.
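To illustrate why re-signing early buys safety margin, here is a minimal sketch. The 3-day re-signing interval is a purely hypothetical number (the actual interval isn't given here), and the sketch ignores CDN caching.

```python
from datetime import timedelta

def outage_tolerance(validity, resign_interval):
    """How long re-signing can be down before a response goes stale.

    Returns (worst_case, best_case). Worst case: the outage starts just
    before a response was due to be re-signed. Best case: it starts
    right after a fresh response was signed.
    """
    if resign_interval > validity:
        raise ValueError("re-signing slower than validity leaves gaps")
    worst_case = validity - resign_interval
    best_case = validity
    return worst_case, best_case

# Hypothetical: 7-day validity, re-signed every 3 days.
worst, best = outage_tolerance(timedelta(days=7), timedelta(days=3))
print(worst, best)
```

Re-signing exactly every 7 days would leave a worst-case margin of zero, which is the danger described above.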

But yeah, signing certificates has never been the hard thing. In the very beginning (2014) the thing that raised eyebrows was that OCSP drove all the signing capacity, bandwidth, and CDN planning.

11 Likes

I'm curious, do you guys host your own infrastructure or do you use a cloud provider? If so, which one?

2 Likes

The r3.o.lencr.org endpoint is currently hosted on Akamai, which you can see if you do a DNS lookup for it. (That doesn't explain what all of the infrastructure behind it is, but the endpoint that OCSP clients talk to directly when they make OCSP queries for Let's Encrypt certificates is Akamai today.)
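One way to do that lookup from Python, as a small sketch: `socket.gethostbyname_ex` is in the standard library, while the `describe_host` helper and its `resolver` parameter are made up here so the function can be tried without network access. What the lookup returns depends on when and where you run it.

```python
import socket

def describe_host(hostname, resolver=socket.gethostbyname_ex):
    """Return the canonical name, aliases, and addresses for a hostname.

    When the name is a CNAME into a CDN, the canonical name usually
    reveals the CDN's own domain.
    """
    canonical, aliases, addresses = resolver(hostname)
    return {"canonical": canonical, "aliases": aliases, "addresses": addresses}

# Example call (requires network access, result varies over time):
#   describe_host("r3.o.lencr.org")
```

The same information is available from command-line tools such as `dig` or `nslookup`.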

4 Likes

Indeed, all requested OCSP statuses, signed by Let's Encrypt's own HSMs in their High Assurance Datacenter(s?), are cached by "a" CDN.

4 Likes

That's exactly my question

2 Likes

The answer to your question actually is: both :wink: LE has its own infrastructure ("High Assurance Datacenter 1" and "High Assurance Datacenter 2") and also uses a cloud provider (CDN).

5 Likes

Makes sense that they use Akamai; they were one of the founding companies.

2 Likes

Thanks @cpu! I was proud to read that stat. :slight_smile: One subtle thing to notice: This doesn't measure per-request availability, but "number of certificates with at least one timeout," which is a harsher measure. Our actual per-request availability is slightly better than 99.96%, and I imagine other CAs' per-request availability is significantly better than 86.02%.
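A quick sketch of why the per-certificate measure is harsher. It assumes timeouts are independent across requests, which is a simplification, and the request counts per certificate are hypothetical. The 0.04% rate is the overall figure from the quoted passage, used only as an illustration.

```python
def fraction_with_timeout(per_request_timeout_rate, requests_per_cert):
    """Chance that a certificate sees at least one timeout,
    assuming each request times out independently."""
    return 1.0 - (1.0 - per_request_timeout_rate) ** requests_per_cert

p = 0.0004  # 0.04% of status requests timed out overall
for n in (1, 10, 100):  # hypothetical number of requests per certificate
    print(n, fraction_with_timeout(p, n))
```

The more times each certificate is checked, the larger the share of certificates with at least one timeout, even though the per-request rate stays the same.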

6 Likes

Our PKI infrastructure resides in a couple of datacenters comprising just a few physical racks of space. We routinely execute failovers between the datacenters to perform maintenance, all the while continuing to serve traffic.

We use AWS for our Certificate Transparency infrastructure.

2 Likes

A little bit off-topic, but I'm curious: this implies "more than two" in my opinion, correct? If that's the case, what do the two terms "High Assurance Datacenter 1" and "High Assurance Datacenter 2" used on https://letsencrypt.status.io/ mean?

4 Likes