OCSP Reliability Shout Out

cpu · February 13, 2021, 7:22pm

I spotted this paper, "Revocation Statuses on the Internet" by Korzhitskii and Carlsson, being discussed on Twitter. It covers measurements of PKI revocation generally and Let's Encrypt's response to the 2020.02.29 CAA Rechecking Bug specifically. It's an interesting read overall, but this part from §4.2 stood out to me as particularly cool and worthy of praise:

While we only had timeouts for 0.04% of the status requests, the differences between the number of affected certificates were substantial between CAs: only 0.07% of the Let’s Encrypt certificates had at least one timeout, compared to 13.98% of the other CAs’ certificates. These fractions are non-negligible, since most browsers soft-fail on an OCSP timeout and continue to establish a potentially insecure connection

It's not easy running a reliable OCSP service. Doubly so at the scale Let's Encrypt issues at. Kudos to everyone for delivering excellent reliability through a particularly tumultuous revocation event and beyond.

griffin · February 13, 2021, 7:35pm

@lestaff

Osiris · February 13, 2021, 7:35pm

Just look at the daily issued certificates graph:

It's madness! We're almost at 2 million certificates issued daily on average! And per certificate, Let's Encrypt needs to sign 13 OCSP responses during the certificate's lifetime! Taking 2 million divided by 7 (as the OCSP responses are valid for 7 days), each day's worth of newly issued certificates alone adds almost 286,000 extra signatures daily. That's all in all a lot of work for those poor HSMs!
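For anyone who wants to check the arithmetic, here is a rough Python sketch. It assumes a 90-day certificate lifetime, the 7-day OCSP validity mentioned above, and a flat 2 million certificates per day. The steady-state line is only one way of reading these numbers (constant issuance, no revocations, each response used for its full validity), not an official figure.

```python
import math

CERT_LIFETIME_DAYS = 90      # assumed Let's Encrypt certificate lifetime
OCSP_VALIDITY_DAYS = 7       # OCSP response validity quoted in the post
ISSUED_PER_DAY = 2_000_000   # rounded daily issuance from the post

# Responses needed to cover one certificate's lifetime,
# if each response is used for its full validity period.
responses_per_cert = math.ceil(CERT_LIFETIME_DAYS / OCSP_VALIDITY_DAYS)

# The post's figure: one day's worth of certificates,
# each re-signed once per 7-day cycle.
per_day_for_one_cohort = ISSUED_PER_DAY / OCSP_VALIDITY_DAYS

# Steady-state estimate (assumption: every unexpired certificate
# is re-signed once per 7-day cycle).
active_certs = ISSUED_PER_DAY * CERT_LIFETIME_DAYS
steady_state_per_day = active_certs / OCSP_VALIDITY_DAYS

print(responses_per_cert)
print(round(per_day_for_one_cohort))
print(round(steady_state_per_day))
```

Under these assumptions, one day's cohort accounts for roughly 286,000 signatures per day, while the whole unexpired population comes to roughly 25.7 million per day. Re-signing more often than every 7 days would push both numbers up.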

jcjones · February 13, 2021, 7:56pm

We re-sign more often than every 7 days to minimize the danger of a stale response should there be any interruption in the re-signing service.

But yeah, signing certificates has never been the hard thing. In the very beginning (2014) the thing that raised eyebrows was that OCSP drove all the signing capacity, bandwidth, and CDN planning.

Litbelb · February 13, 2021, 9:07pm

I'm curious, do you guys host your own infrastructure or do you use a cloud provider? If so, which one?

schoen · February 13, 2021, 9:12pm

The r3.o.lencr.org endpoint is currently hosted on Akamai, which you can see if you do a DNS lookup for it. (That doesn't explain what all of the infrastructure behind that is, but the endpoint that OCSP clients talk to directly when they make OCSP queries for Let's Encrypt certificates will be Akamai today.)
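A minimal Python sketch of that kind of lookup, using only the standard library. The suffix list is an illustrative assumption (a few domains commonly seen in Akamai CNAME chains, not an exhaustive or official list), and both helper names are made up for this example.

```python
import socket

# Domain suffixes often seen in Akamai CNAME chains.
# Illustrative assumption only, not an exhaustive list.
AKAMAI_SUFFIXES = ("edgekey.net", "edgesuite.net", "akamaiedge.net", "akamai.net")


def resolved_names(host: str) -> list[str]:
    """Return the canonical name and aliases reported by the system resolver."""
    canonical, aliases, _addresses = socket.gethostbyname_ex(host)
    return [canonical, *aliases]


def looks_like_akamai(names: list[str]) -> bool:
    """Check whether any resolved name falls under one of the assumed suffixes."""
    for name in names:
        normalized = name.rstrip(".").lower()
        for suffix in AKAMAI_SUFFIXES:
            if normalized == suffix or normalized.endswith("." + suffix):
                return True
    return False
```

Calling `looks_like_akamai(resolved_names("r3.o.lencr.org"))` needs network access, and the result depends on where the name points at the time you run it, so treat it as a quick check rather than proof of who hosts what.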

Osiris · February 13, 2021, 9:17pm

Indeed, all requested OCSP statuses, signed by Let's Encrypt's own HSMs in their High Assurance Datacenter(s?), are cached by "a" CDN.

Litbelb · February 13, 2021, 9:45pm

That's exactly my question

Osiris · February 13, 2021, 9:48pm

The answer to your question is actually both: LE has its own infrastructure ("High Assurance Datacenter 1" and "High Assurance Datacenter 2") and also uses a cloud provider (the CDN).

Litbelb · February 13, 2021, 9:50pm

Makes sense that they use Akamai; they were one of the founding companies.

jsha · February 14, 2021, 12:37am

Thanks @cpu! I was proud to read that stat. One subtle thing to notice: This doesn't measure per-request availability, but "number of certificates with at least one timeout," which is a harsher measure. Our actual per-request availability is slightly better than 99.96%, and I imagine other CAs' per-request availability is significantly better than 86.02%.
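To illustrate why "at least one timeout per certificate" is the harsher measure, here is a small Python sketch. It assumes timeouts are independent and that every certificate is queried the same number of times. Both are simplifications, and the request count is a hypothetical parameter, not a number from the paper.

```python
def per_cert_failure_rate(per_request_failure: float, requests_per_cert: int) -> float:
    """Chance that a certificate sees at least one timeout across its requests."""
    return 1.0 - (1.0 - per_request_failure) ** requests_per_cert


def per_request_failure_rate(per_cert_failure: float, requests_per_cert: int) -> float:
    """Invert the formula above: per-request rate implied by a per-certificate rate."""
    return 1.0 - (1.0 - per_cert_failure) ** (1.0 / requests_per_cert)


# Hypothetical: each certificate was queried 10 times during the measurement.
HYPOTHETICAL_REQUESTS = 10
print(per_request_failure_rate(0.1398, HYPOTHETICAL_REQUESTS))
print(per_request_failure_rate(0.0007, HYPOTHETICAL_REQUESTS))
```

With more than one request per certificate, the implied per-request failure rate is always lower than the per-certificate rate, which is the point being made above.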

Phil · February 17, 2021, 5:28pm

Our PKI infrastructure resides in a couple of datacenters comprising just a few physical racks of space. We routinely execute datacenter failovers between them to perform maintenance, all the while continuing to serve traffic.

We use AWS for our Certificate Transparency infrastructure.

Osiris · February 17, 2021, 5:33pm

A little bit off-topic, but I'm curious: this implies "more than two" in my opinion, correct? If that's the case, what's the meaning of the two terms "High Assurance Datacenter 1" and "High Assurance Datacenter 2" used on https://letsencrypt.status.io/?

system · March 22, 2021, 12:31pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
