At 17:47 UTC on August 23rd, 2018, we deployed a configuration change to our OCSP responder service that caused 90% of traffic to our origin to incorrectly receive OCSP “unauthorized” statuses for valid OCSP requests. Most OCSP responses that were cached at our CDN prior to the incident were not affected. The change was reverted at 19:33 UTC the same day to resolve the problem, though CDN caching may have resulted in affected statuses being served for a limited period of time after resolution.
The root technical cause of this incident was a change developed during a previous incident in which malformed OCSP traffic was causing excessive strain on the OCSP responder. The change added validation of request serials against a set of configured serial prefixes. Unfortunately, a bug in the implementation improperly rejected OCSP requests unless they matched the last configured serial prefix, rather than any configured serial prefix. We have since fixed the bug.
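To make the failure mode concrete, here is a minimal sketch of how a "last prefix only" bug can arise, next to the intended "any prefix" check. This is an illustration under assumptions, not the responder's actual code: the function names, the hex-string representation of serials, and the example prefixes are all hypothetical, and the real bug may have had a different mechanism.

```python
from typing import Sequence


def matches_last_prefix_only(serial_hex: str, prefixes: Sequence[str]) -> bool:
    """Buggy variant (hypothetical): the result is overwritten on every
    iteration, so only the last configured prefix decides the outcome."""
    accepted = False
    for prefix in prefixes:
        accepted = serial_hex.startswith(prefix)
    return accepted


def matches_any_prefix(serial_hex: str, prefixes: Sequence[str]) -> bool:
    """Intended behavior: accept the serial if it starts with any
    configured prefix."""
    return any(serial_hex.startswith(prefix) for prefix in prefixes)


if __name__ == "__main__":
    configured = ["03", "04"]  # made-up prefixes for illustration
    serial = "03a1b2"
    print(matches_any_prefix(serial, configured))        # True
    print(matches_last_prefix_only(serial, configured))  # False: wrongly rejected
```

Note that with a single configured prefix the two functions agree on every input, so a test that configures only one prefix cannot tell them apart.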
We first became aware of the problem at 17:52 UTC after our internal alerting flagged invalid OCSP responses for certificates issued by our monitoring systems, though the scale of the issue was not immediately clear. We began investigating the root cause, identified the problem at 19:26 UTC and immediately disabled the prefix validation feature in staging and production.
The bug was not caught during testing because the unit test accompanying the initial PR did not cover the case of multiple acceptable prefixes. The bug was not caught in our staging environment for two reasons: (1) our internal OCSP monitoring looks for HTTP 500s but ignores OCSP “unauthorized” responses, because a large number of such responses can be triggered externally by misconfigured clients; (2) our end-to-end OCSP monitoring tests were working in production, but not in staging.
- Review our procedures for ensuring that all monitoring tools are applied to both production and staging environments.
- Extend OCSP monitoring to include OCSP statuses (unauthorized, revoked, ok, etc.) in addition to HTTP statuses.
- Add alerts when the fraction of unauthorized or revoked OCSP responses is extremely high.
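
As a rough sketch of the last two items, the check below computes the share of each OCSP status in a window of recent responses and reports the statuses that exceed a configured limit. Everything here is an assumption for illustration: the function names, the status strings, and the 0.5 thresholds are hypothetical, and a real deployment would more likely express this as a rule in the existing metrics and alerting system.

```python
from collections import Counter
from typing import Dict, Iterable, List


def status_fractions(statuses: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of responses carrying each OCSP status."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: n / total for status, n in counts.items()}


def statuses_over_threshold(
    statuses: Iterable[str], thresholds: Dict[str, float]
) -> List[str]:
    """Return the statuses whose share of the window exceeds their limit."""
    fractions = status_fractions(statuses)
    return sorted(
        status
        for status, limit in thresholds.items()
        if fractions.get(status, 0.0) > limit
    )


if __name__ == "__main__":
    # Hypothetical window: 9 of 10 responses are "unauthorized".
    window = ["unauthorized"] * 9 + ["ok"]
    limits = {"unauthorized": 0.5, "revoked": 0.5}  # made-up thresholds
    print(statuses_over_threshold(window, limits))  # ['unauthorized']
```

Alerting on a fraction rather than an absolute count tolerates the background of "unauthorized" responses caused by misconfigured clients, while still firing when such responses dominate the traffic.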
2018-08-23 01:43 UTC - feature configured in staging
2018-08-23 17:47 UTC - feature configured in production
2018-08-23 19:31 UTC - feature disabled in staging
2018-08-23 19:33 UTC - feature disabled in production