May 19, 2017: OCSP and Issuance Outage Postmortem

We've completed our full postmortem for last Friday's outage and want to provide some details to our community.

From 2017-05-18 17:25 UTC to 2017-05-19 06:05 UTC, Let's Encrypt had a minor OCSP outage, serving HTTP 400s to a subset of OCSP clients that were making well-formed requests. From 2017-05-19 06:05 UTC to 2017-05-19 22:58 UTC, this became a major outage of both OCSP and the ACME API used for certificate issuance. Approximately 80% of requests failed during this phase. Most users experienced either consistent failure or consistent success.

The initial cause was a code deploy aiming to fix a problem with slash collapsing. The OCSP protocol specifies that requests can be made via POST or GET. Since the request body is binary data (DER-encoded ASN.1), GET requests must be encoded in some ASCII-safe way. The original OCSP RFC, published in 1999, predated the common use of base64url encoding (first standardized in RFC 3548, 2003) by several years, so it defined the GET request as the standard base64 encoding of the binary OCSP request. Unfortunately, the standard base64 alphabet includes slash ("/"), and most HTTP implementations default to merging repeated slashes, which corrupts the base64 data and generally causes decoding to fail.

We noticed that a small number of our OCSP requests were failing due to this decoding error. Most (about 75%) of our OCSP requests arrive via POST and are unaffected; of the GET requests, only those whose base64 encoding contained a doubled slash were affected. Generally this happens because a random serial number happens to encode that way; an OCSP request nonce can also produce a double slash when base64 encoded. For a service to respond correctly to these requests, it must not merge repeated slashes.
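To make this concrete, here is a minimal, self-contained Go sketch (not our production code) of what slash merging does to one of these requests. The DER bytes below are invented purely so that their base64 encoding contains a doubled slash.

    // The standard base64 alphabet includes "/", so an encoded OCSP request
    // can contain "//"; collapsing that to "/" leaves data that no longer
    // decodes.
    package main

    import (
        "encoding/base64"
        "fmt"
        "strings"
    )

    func main() {
        // Hypothetical DER-encoded OCSPRequest bytes, chosen so that their
        // base64 encoding contains a doubled slash.
        der := []byte{0x30, 0x53, 0x30, 0x33, 0xff, 0xf3}

        encoded := base64.StdEncoding.EncodeToString(der) // "MFMwM//z"
        fmt.Println("client sends: GET /" + encoded)

        // What an intermediary that merges repeated slashes forwards instead:
        merged := strings.Replace(encoded, "//", "/", -1)

        if _, err := base64.StdEncoding.DecodeString(merged); err != nil {
            fmt.Println("after slash merging, decoding fails:", err)
        }
    }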

Merging repeated slashes is such a common behavior that we had to disable it in three separate places: in our Go code, in our internal web server, and at our CDN. Thursday's deploy included a fix to disable slash merging in our Go code; previous updates had already disabled it in our web server and at our CDN. Unfortunately, we had missed part of RFC 6960:

An OCSP request using the GET method is constructed as follows:

GET {url}/{url-encoding of base-64 encoding of the DER encoding of
the OCSPRequest}

In other words, simply join the two components with a slash. Since the OCSP URL embedded in our certificates ends in a slash (http://ocsp.int-x3.letsencrypt.org/), clients following the RFC strictly send a doubled slash at the beginning of the request, i.e. "GET //MFMw..." rather than the "GET /MFMw..." we were expecting. A little less than half of our OCSP GET requests start with a doubled slash; the rest start with a single slash. Most CAs don't end their OCSP URLs with a slash, and we will be removing the trailing slash from ours as well.
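For illustration, here is a short Go sketch of that construction, using a truncated placeholder in place of a real base64-encoded OCSPRequest. With the trailing slash in our responder URL, the resulting request path begins with "//".

    package main

    import (
        "fmt"
        "log"
        "net/url"
    )

    func main() {
        responder := "http://ocsp.int-x3.letsencrypt.org/" // note the trailing slash
        b64Request := "MFMwUTBPME0wSzAJ"                   // placeholder base64 OCSPRequest

        // RFC 6960: GET {url}/{url-encoding of base-64 encoding of the DER
        // encoding of the OCSPRequest}
        full := responder + "/" + url.QueryEscape(b64Request)

        u, err := url.Parse(full)
        if err != nil {
            log.Fatal(err)
        }
        // Prints "//MFMwUTBPME0wSzAJ": the doubled slash a strictly
        // conforming client sends, which our slash-merging fix exposed.
        fmt.Println("request path:", u.RequestURI())
    }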

Previously, slash merging had hidden this error. Once we deployed the change that disabled slash merging, we started failing base64 decoding for a much larger set of requests and responding with HTTP status 400. Since these responses were not cacheable, we weren't getting the offload benefit from our CDN. The rate of failed requests increased gradually over the course of 12 hours. Since our OCSP responses are cached at the CDN for that long, once we hit about 12 hours from the deploy (06:05 UTC), the responses cached before the deploy had all expired, and our origin servers were hit with a sudden flood of traffic from many CDN nodes connecting at once to fulfill client requests. Not only did this overwhelm our servers, it also caused our upstream ISPs to conclude we were under DDoS, and they turned on mitigation measures, scrubbing much of the traffic.
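To show why the 400s removed all of the CDN's offload, here is a rough sketch of the caching split described above; the handler and helper names are assumptions for illustration, not our actual responder code.

    package main

    import (
        "errors"
        "log"
        "net/http"
    )

    // lookupResponse is a hypothetical stand-in for decoding the OCSP request
    // (from the GET path or POST body) and fetching a pre-signed response.
    func lookupResponse(r *http.Request) ([]byte, error) {
        return nil, errors.New("not implemented in this sketch")
    }

    func serveOCSP(w http.ResponseWriter, r *http.Request) {
        resp, err := lookupResponse(r)
        if err != nil {
            // A 400 with no cache headers: every malformed request reaches
            // the origin servers, with no offload from the CDN.
            http.Error(w, "malformed OCSP request", http.StatusBadRequest)
            return
        }
        w.Header().Set("Content-Type", "application/ocsp-response")
        // Cacheable for roughly 12 hours, so the CDN absorbs repeat requests.
        w.Header().Set("Cache-Control", "public, max-age=43200")
        w.Write(resp)
    }

    func main() {
        // Serve the handler directly rather than through the default ServeMux,
        // which would itself redirect paths containing repeated slashes.
        log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(serveOCSP)))
    }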

Unfortunately, the problem initially looked like a DDoS to us as well. We spent several hours debugging the problem as if it were a DDoS and attempting various mitigations before linking it to the deploy 12 hours prior. At 2017-05-19 11:32 UTC, we rolled back to the previous version of our server software. At this point we were responding correctly to OCSP, but a large fraction of end-user traffic was getting errors from our CDN. Debugging with our CDN revealed that traffic was getting dropped between their nodes and our origin servers. Traceroutes further indicated that the problem was at our upstream ISP. We got their customer service on the phone, at which point we learned about the DDoS scrubbing and asked them to turn it off. They did, but our connectivity problems continued. After more phone calls and escalations, we learned that there was actually a second DDoS mitigation in place, at their upstream ISP. Once we reached the right people there and had that mitigation removed as well, traffic fully recovered at 2017-05-19 22:58 UTC.

This outage took much longer to diagnose, and longer to recover from, than it should have. Here are some of the steps we are taking to improve our processes, beyond fixing the root cause bug:

  • Better monitoring for OCSP. Our overall success-rate monitoring doesn't treat 4xx-series errors as failures. We're adding separate monitoring that would have spotted the sharp increase in 400s.
  • Monitoring for overall request volume. Even in the absence of the above, we should have received alerts on the sharp increase in overall request volume.
  • Improved monitoring of web server connection counts and of sudden increases in those counts.
  • Improved communication with upstream providers.
  • Better isolation between OCSP and API traffic.

We apologize to our community for the downtime, and as always, will strive to do better in the future.
