We've completed a full postmortem for our outage on July 17 and we’d like to provide some details to our community.
From 2017-07-17 20:43 UTC to 2017-07-17 21:54 UTC Let's Encrypt had an OCSP outage for non-cached responses. Concurrently, from 2017-07-17 20:43 UTC to 2017-07-17 23:24 UTC Let's Encrypt had an ACME API services outage.
The onset was somewhat graduated as an edge firewall became overloaded and progressively failed local and remote traffic. Let's Encrypt staff were alerted to the problem by internal monitoring of the staging environment at 20:48 and began to investigate. Due to a database repair in progress at our secondary datacenter, we were unable to simply fail over to our secondary datacenter. In the end, it was necessary to engage staff in the data center to reboot one of the redundant firewalls and enable the restoration of services.
The issue with load on the firewall was a known problem and remediation was already underway, including a plan to replace the current hardware. The High Availability (HA) failure pattern of the firewall in this situation did not flow as expected from testing and documentation, which led to the need for physical intervention and extended the outage time. A new HA arrangement for the firewalls is part of our remediation plan.
Let's Encrypt will be taking steps to reduce the load on the current firewalls until the new hardware and configuration can be put in place. Additionally, the response plan for this particular type of failure has been improved with lessons learned from the outage. We will be improving our internal documentation to reduce time to resolution in the future.
We apologize to our community for the downtime, and as always, will strive to do better in the future.