We recently had an outage, documented as two inter-related issues on our status.io: 1, 2. I’m cross-posting Josh’s post with an initial explanation. As he says, we haven’t yet done our full post-mortem, but want to get you the basic info early.
To expand on this a bit: traffic to our OCSP responder gradually rose as cached entries expired and failed to be replaced in cache (because the update request was getting a 400 status). Eventually this amounted to a large enough amount of traffic that an upstream ISP flagged it as a DDoS and started scrubbing some of the traffic between our CDN and our origin servers. This meant that what started as an OCSP outage eventually became a full API outage, even when we rolled back the Boulder deploy that fixed slash collapsing. Fortunately, we were able to get in touch with the upstream ISP and they reverted their DDoS countermeasures. Service is now fully restored.
As always, we will be doing a full postmortem to learn from this incident and figure out ways to improve.