On December 12, Let's Encrypt started serving expired OCSP responses, where the NextUpdate field was in the past. This was reported here on the forum. We were serving expired responses for approximately 32 hours, with the result that visitors to some web sites that use OCSP stapling received warnings from their browsers.
The Let's Encrypt server software, Boulder, includes an OCSP Updater component that continually queries our database for a list of certificates that need a fresh OCSP status. That query is relatively slow, due to an expensive JOIN. Normally it completes in about 2 minutes, which is enough to make reasonable progress. However, starting on December 8, after two days of very high issuance rates and correspondingly high database load, this query became much slower, taking more than 50 minutes to run. This meant that we were not signing OCSP responses as fast as they were expiring, and we began to fall behind. Normally we sign fresh OCSP responses after 3 days, with NextUpdate set 7 days in the future. Because the OCSP Updater was making negligible progress, within 4 days some of our OCSP responses had expired.
To fix the problem, we did a few things: First, we adjusted the batch size, so that when OCSP Updater finished an expensive query, at least it would be able to request a large number of signatures before incurring the next expensive query. Next, we added a lower bound on which OCSP responses would be considered for updating, and deployed that as a hotfix. This decreased the query time from 50+ minutes to 7 minutes. At this point the OCSP Updater was catching up on the backlog, but we estimated that it would take several days to catch up entirely, which was not acceptable. We pushed a second hotfix, making the OCSP Updater request signatures in parallel. This reduced the impact of RPC round trip time and database storage time, allowing us to sign OCSP responses as fast as our HSM would allow. And we enabled OCSP Updater in both of our datacenters to double our total throughput. Once those changes were deployed, our catchup rate increased significantly and we estimated that we would be fully caught up in 18 hours, with all of the actually-expired OCSP responses caught up within 7 hours.
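To illustrate the second hotfix, here is a minimal sketch of requesting signatures for a batch in parallel with a bounded number of workers. It is not Boulder's actual code: `signFunc`, `signBatch`, and the `workers` parameter are hypothetical names chosen for this example, and the real signing RPC and database write are reduced to a single callback.

```go
package main

import (
	"sync"
)

// signFunc is a hypothetical stand-in for the RPC that asks the CA to sign
// a fresh OCSP response for one certificate serial and store the result.
type signFunc func(serial string) error

// signBatch requests a signature for every serial in the batch, using at
// most `workers` goroutines at a time, and returns the number of failures.
// Running requests concurrently hides per-request round trip and storage
// latency, so throughput is limited by the signer rather than by waiting.
func signBatch(serials []string, workers int, sign signFunc) int {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		failures int
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker pulls serials until the channel is closed.
			for serial := range jobs {
				if err := sign(serial); err != nil {
					mu.Lock()
					failures++
					mu.Unlock()
				}
			}
		}()
	}
	for _, s := range serials {
		jobs <- s
	}
	close(jobs)
	wg.Wait()
	return failures
}
```

With a large batch from one expensive query and a worker count sized to the signer's capacity, the cost of the query is amortized over many signatures. The right worker count is something to measure, not a value taken from this post.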
Additionally, when trying to improve our catchup rate, we were stymied by another issue: Our most recent HSM benchmark had revealed that we were not getting the full signing rate we had measured in earlier tests. We later resolved the issue by setting a higher GOMAXPROCS value. Go uses goroutines (similar to green threads), and defaults to creating only as many OS threads as there are processors. Normally this works fine, as Go's I/O code is good at yielding execution when it is blocked. However, when calling C code from Go, as we do when interacting with the PKCS#11 module we use to talk to the HSM, there is no such cooperation, so a blocked goroutine blocks a whole thread. In practice this means that if you have more cores in your HSM than on the box calling it, you need to instruct Go to create more OS threads so that work can proceed when other threads are blocked.
There were a number of failures involved here, but the main one was monitoring. We should have become aware of the problem long before it impacted users. We refresh OCSP responses once they are 3 days old so that we have plenty of time to fix issues before they result in an actual expiration. However, we didn't have an alert that would fire when we fell behind on updating. We did have a periodic monitor that would check for a well-formed OCSP response for a specific certificate, to ensure the OCSP service was running. But that monitor did not check whether the OCSP response was up-to-date. Even if it had, we want to alert based on the state of all OCSP responses in the system, not just for one cert. Implementing a broader check for this was a longstanding TODO item that we should have prioritized higher. We're going to add an alert for the presence of any OCSP response older than 3 days, along with a critical alert when any OCSP response is older than 7 days. We'll also fix our periodic monitor to check for freshness along with well-formedness. We're going to add alerts for database queries that are especially slow, and for database timeouts.
We're also deploying faster HSMs that will allow us to continue scaling up as we issue more certificates, and ensure that we can catch up faster if for any reason we fall behind again. And we'll be further improving the performance of the slow database query.
@tialaramex made some really good suggestions on the original thread. We'll be following the suggestion to give a pager address to a few community members, so that if they see an outage that we appear not to have noticed, they can generate an alert to our oncall staff. We'll also be considering how to expose more useful system stats publicly, although in most cases this may be more work than just ensuring we have good alerts for those stats.
Lastly, I want to apologize: We aspire to a very high standard of reliability, and we fell far short in this case, causing outages for some sites that use our services. We are going to work hard to do better in the future.