Validation outage (DNS problem: query timed out): March 28, 2016

From 28 Mar 2016 23:29 UTC to 29 Mar 2016 02:45 UTC, Let’s Encrypt failed to validate any newly requested authorizations. This prevented most certificate issuance during that time.

This was caused when our Ops team made a change to our firewall rules, locking down outbound DNS access from all of our hosts except those that are intended to do verifications. Unfortunately, the firewall change was overly aggressive and also locked down outbound DNS access from the verification hosts.

We discovered this thanks to people who pointed it out here on the forums and on Twitter. We should have spotted it much sooner on our own. We have monitoring for server failures (status 500), but we didn’t have sufficient monitoring for the rate of validation successes and failures. We’ll be making some changes to catch similar issues faster in the future:

  • Alert if the rate of successful validations drops below a threshold
  • Update our change procedures to include certificate issuance as a test step
  • Include validation success / failure in our dashboards

Sorry for the outage, and thanks for your patience as we improve our processes so we can produce an ever more reliable service.

Jacob