Outage: December 15, 2015


On 15 December 2015 at 15:40Z, one of Let’s Encrypt’s datacenters had a brief network disruption. The disruption triggered that datacenter’s failsafes, which temporarily halted the issuance of new certificates from all datacenters as well as the updating of OCSP responses. OCSP service itself was not disrupted, because Let’s Encrypt updates OCSP responses in batches so that service continues through such disruptions.

The operations team became aware of the failsafe triggering at 15:44Z and, after reviewing the logs, declared an incident on Status.io and Twitter at 15:55Z.

Status.io: https://letsencrypt.status.io/pages/incident/55957a99e800baa4470002da/567037f63399baba6800068e

Tweet: https://twitter.com/letsencrypt_ops/status/676792727946682369

Clearing the failsafe required two administrators. After self-tests, the operations team restarted the Boulder software in that datacenter at 16:24:12Z and began monitoring logs. At 16:24:29Z a certificate was successfully issued to a waiting client. After reviewing a few minutes of issuance, the operations team announced at 16:30Z that the incident was over.

Status.io: https://letsencrypt.status.io/pages/incident/55957a99e800baa4470002da/567037f63399baba6800068e

Tweet: https://twitter.com/letsencrypt_ops/status/676801598111088640
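
For readers who want the intervals between these events, the timeline can be worked out with a short Python sketch. The timestamps are the ones stated in this report. The event labels and the helper `at` are illustrative names chosen here, and every interval is measured from the reported 15:40Z disruption, which is assumed to approximate the moment issuance halted.

```python
from datetime import datetime, timezone

DAY = "2015-12-15"  # incident date, all times are UTC ("Z")


def at(hms: str) -> datetime:
    """Parse an HH:MM or HH:MM:SS time on the incident day (hypothetical helper)."""
    fmt = "%Y-%m-%d %H:%M:%S" if hms.count(":") == 2 else "%Y-%m-%d %H:%M"
    return datetime.strptime(f"{DAY} {hms}", fmt).replace(tzinfo=timezone.utc)


# Timestamps as stated in the report; the labels are illustrative.
events = [
    ("network disruption", at("15:40")),
    ("operations team aware", at("15:44")),
    ("incident declared", at("15:55")),
    ("Boulder restarted", at("16:24:12")),
    ("first certificate issued", at("16:24:29")),
    ("incident declared over", at("16:30")),
]

# Elapsed time of each event relative to the reported disruption.
start = events[0][1]
elapsed = {label: when - start for label, when in events}

for label, delta in elapsed.items():
    print(f"{label}: +{delta}")
```

Under that assumption, the team became aware 4 minutes after the disruption, the first certificate was issued 44 minutes 29 seconds after it (17 seconds after the Boulder restart), and the incident was declared over at the 50-minute mark.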

The exact reason the failsafe triggered has been investigated with the vendor. We have a mitigation that we believe will confine any recurrence of this issue to a single datacenter and so avoid an outage.