OCSP and issuance outage, 2017-05-19

jsha · May 19, 2017, 11:24pm

Hi all,

We recently had an outage, documented as two interrelated issues on our status.io page: 1, 2. I'm cross-posting Josh's post with an initial explanation. As he says, we haven't yet done our full post-mortem, but we want to get you the basic info early.

josh:

Josh from Let's Encrypt here. First, my apologies for the trouble this has caused.

I want to offer people here an early root cause analysis. I say early because we have not entirely completed our investigation or a post-mortem.

OCSP requests that use the GET method carry the request in standard base64 encoding, which can contain two or more consecutive slashes. While debugging why a small number of OCSP requests consistently failed, our engineers observed a rather odd, but standard, web server behavior: when a server receives a request with multiple consecutive slashes, it collapses them into a single slash. As a result, our OCSP responder treated requests with this encoding quirk as invalid and answered them with a '400 Bad Request' response. The fix seemed quite simple: disable the slash collapsing behavior.
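
To make the collapsing problem concrete, here is a minimal Python sketch. The payload bytes are an arbitrary stand-in chosen to produce consecutive slashes, not a real OCSPRequest, and `collapse_slashes` is a hypothetical helper that only mimics what a path-normalizing web server does.

```python
import base64
import binascii
import re


def collapse_slashes(path: str) -> str:
    """Mimic a web server that merges runs of '/' into a single '/'."""
    return re.sub(r"/{2,}", "/", path)


def try_decode(encoded: str):
    """Return the decoded bytes, or None if the text is not valid base64."""
    try:
        return base64.b64decode(encoded, validate=True)
    except binascii.Error:
        return None


# Stand-in bytes (assumption): picked so the encoding contains "///".
payload = b"\x00\xff\xff\xff\x00\x00"
encoded = base64.b64encode(payload).decode("ascii")
collapsed = collapse_slashes(encoded)

print(encoded, try_decode(encoded) == payload)
print(collapsed, try_decode(collapsed))
```

In this sketch the collapsed string no longer decodes at all, which corresponds to the responder rejecting the request as invalid.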

Unfortunately, disabling this behavior surfaced a more serious issue. The AIA extension that we include in the certificates we issue contains a URI for our OCSP server, and this URI has a trailing slash. According to RFC 6960, Appendix A.1, an OCSP request using the GET method is constructed as 'GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}', where the url 'may be derived from the value of the authority information access extension in the certificate being checked for revocation'. A number of user agents take this quite literally and construct the URL without inspecting the contents of the AIA extension, so they end up with a double slash between the host name and the base64-encoded OCSP request. Before we disabled slash collapsing this was fine, because the web server was silently fixing the problem. Once we stopped collapsing slashes, we started seeing problems.
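
The literal URL construction described above can be sketched as follows. The host `ocsp.example.org` is a placeholder rather than the real responder address, the request bytes are an arbitrary stand-in, and both function names are hypothetical.

```python
import base64
from urllib.parse import quote, urlsplit

# Placeholder AIA value (assumption); note the trailing slash.
AIA_OCSP_URL = "http://ocsp.example.org/"


def literal_get_url(base_url: str, der_request: bytes) -> str:
    """Apply the {url}/{url-encoded base64} template without inspecting base_url."""
    encoded = quote(base64.b64encode(der_request).decode("ascii"), safe="")
    return base_url + "/" + encoded


def careful_get_url(base_url: str, der_request: bytes) -> str:
    """Same template, but drop any trailing slash from base_url first."""
    return literal_get_url(base_url.rstrip("/"), der_request)


request_bytes = b"\x00\xff\xff\xff\x00\x00"  # stand-in, not a real OCSPRequest
literal = literal_get_url(AIA_OCSP_URL, request_bytes)
careful = careful_get_url(AIA_OCSP_URL, request_bytes)

print(urlsplit(literal).path)  # path begins with two slashes
print(urlsplit(careful).path)  # path begins with one slash
```

With slash collapsing enabled, a server would treat both paths the same; without it, the first path reaches the responder with an extra leading slash.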

From our OCSP server's perspective, a majority of the OCSP requests we received now began with an extra slash. We were unable to decode them, so we responded with '400 Bad Request' and moved on. This coincided with a large number of previously cached responses expiring on our CDN, which sent a large number of requests our way. Because the '400 Bad Request' responses carried explicit no-cache headers, our CDN offload rate fell to nearly 0% and the full brunt of our OCSP request load hit our origin servers. This bogged down our whole infrastructure.
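
As a rough illustration of why the extra slash turned into uncacheable errors, here is a toy model of the responder's decision. It is a sketch under stated assumptions, not Boulder's code: `respond` is a hypothetical function, and the `max-age` value on the success path is invented for the example.

```python
import base64
import binascii
from urllib.parse import unquote


def respond(path: str):
    """Toy responder: return (status, headers) for a GET request path."""
    # Drop the single slash that separates the host from the payload.
    raw = path[1:] if path.startswith("/") else path
    try:
        base64.b64decode(unquote(raw), validate=True)
    except binascii.Error:
        # Undecodable request: error response that the CDN must not cache.
        return 400, {"Cache-Control": "no-cache"}
    # Decodable request: cacheable response (max-age is a made-up value).
    return 200, {"Cache-Control": "max-age=3600"}


print(respond("/AP%2F%2F%2FwAA"))   # single leading slash
print(respond("//AP%2F%2F%2FwAA"))  # doubled leading slash
```

In this model every doubled-slash request yields a 400 that cannot be cached, so each such request goes back to the origin, which is the load pattern the paragraph above describes.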

To expand on this a bit: traffic to our OCSP responder rose gradually as cached entries expired and failed to be replaced in the cache (because the update request was getting a 400 status). Eventually the traffic grew large enough that an upstream ISP flagged it as a DDoS and started scrubbing some of the traffic between our CDN and our origin servers. This meant that what started as an OCSP outage eventually became a full API outage, which persisted even after we rolled back the Boulder deploy that had disabled slash collapsing. Fortunately, we were able to get in touch with the upstream ISP and they reverted their DDoS countermeasures. Service is now fully restored.

As always, we will be doing a full postmortem to learn from this incident and figure out ways to improve.
