We'd like to provide some details to the community about the outage on December 16, 2025.
From 2025-12-16 15:45 UTC to 2025-12-16 16:30 UTC, Let's Encrypt experienced a complete ACME API outage in both production datacenters.
We triggered this incident by improperly applying an updated hypervisor network configuration to a subset of our fleet.
What triggered this?
The change switched the Open vSwitch bond failure detection mode from miimon to carrier. This was needed because miimon is an unusual choice and had earlier caused a smaller outage of a single hypervisor. This setting is normally configured once during OS installation and is not something we change on running hypervisors.
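For readers unfamiliar with this setting, the change looks roughly like the following ifupdown-style stanza for an Open vSwitch bond. This is an illustrative sketch only: the bridge, bond, and interface names are assumptions, not our actual configuration.

```
# /etc/network/interfaces fragment (names are hypothetical)
auto bond0
allow-br0 bond0
iface bond0 inet manual
    ovs_bridge br0
    ovs_type OVSBond
    ovs_bonds eno1 eno2
    # Previously: other_config:bond-detect-mode=miimon (polls the MII status)
    # Changed to carrier, which uses the kernel's link-carrier state directly.
    ovs_options bond_mode=active-backup other_config:bond-detect-mode=carrier
```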
For each hypervisor requiring this change, we edited the configuration and ran systemctl restart networking. At the time, we believed this would cause only a momentary disruption of VM networking.
The configuration itself was sound, but systemctl restart networking was not safe: after the restart, the Open vSwitch bridge network stayed down and would not come back up.
We applied the change to one third of our production fleet before halting the rollout. We had not rolled it out to staging hypervisors first because they already had the new configuration.
Losing a small subset of hypervisors and their VMs should not cause such a severe outage, and the number of machines lost was in fact within our tolerance. However, due to a misconfiguration, the database subsystem reacted unexpectedly to the lost machines.
Database Down
Unfortunately, the database orchestration VM in each region went down. These VMs are responsible for determining the topology of the primary and read replica databases. We normally expect them to be out of the hot path of service health; that turned out not to be the case.
When the orchestration VMs went down, a healthcheck failed on the ProxySQL instances we use to route traffic to our databases, which caused every job accessing the database to be terminated and rescheduled. The rescheduled jobs could not start without the orchestration VMs, because they had no way to learn the database topology.
This was a misconfiguration of the healthcheck: if database orchestration goes down, ProxySQL should continue to use the existing database topology.
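The intended "fail-static" behavior can be sketched as follows. This is a minimal illustration of the principle, not our actual healthcheck code; the function names, cache path, and topology format are all hypothetical.

```shell
#!/bin/sh
# Sketch of fail-static topology lookup: if the orchestration service is
# unreachable, keep serving the last known database topology instead of
# declaring the backends unhealthy. All names here are illustrative.

CACHE=/tmp/topology.cache

query_orchestrator() {
    # Stand-in for a call to the orchestration VM; a non-zero exit
    # simulates the VM being down, as it was during the incident.
    return 1
}

get_topology() {
    if fresh=$(query_orchestrator); then
        # Orchestrator reachable: refresh the cache with the new topology.
        printf '%s\n' "$fresh" | tee "$CACHE"
    elif [ -r "$CACHE" ]; then
        # Orchestrator down: fall back to the last known topology rather
        # than failing the healthcheck outright.
        cat "$CACHE"
    else
        echo "no topology available" >&2
        return 1
    fi
}

# Seed a last-known topology, then query while the orchestrator is "down":
printf 'primary=db1 replicas=db2,db3\n' > "$CACHE"
get_topology
```

With the orchestrator unreachable, the lookup still returns the cached topology, so existing routing keeps working.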
Our immediate fix, which ended the incident, was to restore both orchestration VMs in their respective regions, either by rebooting the dead hypervisor or by live-migrating the orchestration VM off of it.
Follow-Up Steps
As always, we’re committed to learning from incidents and improving our processes and systems. We have roadmapped several reliability improvements for the new year.
Our hypervisors are due for an operating system upgrade. As part of this, we are standardizing and source-controlling machine configuration so that changes like this one are safer to make. We're also re-evaluating our hypervisor networking strategy.
The current database orchestration system is also on borrowed time. We are actively replacing it with Vitess, a horizontally sharded MySQL clustering system that is much more resilient to machine failure.
We apologize to the community for the downtime. We strive to improve the reliability of Let's Encrypt so we can continue serving the internet as we have for the last ten years.