We'd like to provide some details to our community about the outage on July 21, 2025.
From 2025-07-21 18:35 UTC to 2025-07-22 02:22 UTC, Let's Encrypt had a complete ACME API outage in all datacenters. We partially restored service around 2025-07-21 21:00 UTC for subscribers with preexisting authorizations.
The incident was triggered by an operating system upgrade of the last server in a cluster of four DNS resolvers. These servers are authoritative for internal names and provide recursive resolution for external names. The upgrade was nominally successful, but afterward the entire cluster of resolvers stopped responding to queries for external names for which they are not authoritative. These resolvers are on the critical path for name resolution of several services that certificate issuance relies on, particularly database servers, remote multi-perspective issuance corroboration (MPIC) servers, and external certificate transparency (CT) logs. We identified the outage immediately from internal alerting and began an investigation.
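As a rough illustration of the kind of check this failure calls for, here is a minimal probe that sends a query directly to a single resolver and reports whether it answered, returned an error code such as SERVFAIL, or timed out. It uses the Go library github.com/miekg/dns, and every address and hostname in it is hypothetical; it is a sketch, not part of our actual tooling.

```go
// resolvercheck is a minimal, illustrative probe: it asks one resolver for a
// couple of names and reports success, a failure rcode, or a timeout.
// All addresses and names here are hypothetical.
package main

import (
	"fmt"
	"time"

	"github.com/miekg/dns"
)

func main() {
	resolver := "10.0.0.53:53" // hypothetical internal resolver address

	client := &dns.Client{Timeout: 3 * time.Second}

	for _, name := range []string{"db1.internal.example", "ct-log.external.example"} {
		m := new(dns.Msg)
		m.SetQuestion(dns.Fqdn(name), dns.TypeA)

		resp, rtt, err := client.Exchange(m, resolver)
		if err != nil {
			// Timeouts surface here (for example, a query that never gets an answer).
			fmt.Printf("%-28s error after %v: %v\n", name, rtt, err)
			continue
		}
		// Non-NOERROR response codes, such as SERVFAIL, surface here.
		fmt.Printf("%-28s rcode=%s answers=%d\n", name, dns.RcodeToString[resp.Rcode], len(resp.Answer))
	}
}
```

Run against each replica in turn, a probe like this distinguishes a resolver that answers for some names but not others from one that is down entirely.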
Diagnosing the problem was challenging. Ancillary systems for monitoring and debugging were also affected, making it difficult to observe whether our changes were having any effect. Our internal DNS infrastructure itself is not robust enough: in particular, it lacks a staging environment and stores its configuration in an internal database, which makes changes risky and hard to observe. Throughout the incident, we attempted numerous fixes, rollbacks, and debugging strategies.
Ultimately, we landed on two remediations that were effective.
The first was to stabilize the system by temporarily bypassing the internal resolvers. On the worker nodes where Boulder processes run, we hard-coded entries for our database servers in /etc/hosts. We moved name resolution of CT logs to the separate internal DNS infrastructure used for processing ACME challenges, and we added entries for the MPIC servers to our external DNS hosting. This restored certificate issuance.
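For illustration, the /etc/hosts workaround on each worker node amounted to a handful of static entries along these lines; the names and addresses below are hypothetical, not our real infrastructure.

```
# Temporary pinning of database hostnames so Boulder workers bypass the
# broken internal resolvers (illustrative; hypothetical names and addresses).
10.10.1.21   db-primary.internal.example
10.10.1.22   db-replica.internal.example
```

On a typical nsswitch configuration, entries in /etc/hosts are consulted before DNS, which is what made this a quick, if fragile, stopgap.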
Late into the night, engineers discovered and fixed the root problem. We upgraded each DNS server by tearing down the old instance and bringing up a brand new one in its place. The service configuration script incorrectly used a flag that automatically configured internal DNS forwarders, even though we have no need for them. The script chose another replica already in the cluster as the forwarder. As a result, as we gradually upgraded the replicas, each upgraded replica ended up forwarding to another replica, or to a replica that no longer existed.
When we upgraded the final replica, every machine in the cluster had an invalid resolver configuration. Queries for external names either dead-ended and returned SERVFAIL, or were forwarded in a loop and timed out. We removed the errant forwarder configuration from each server, which was buried in an obscure tab of the management UI. This fully restored service and allowed us to roll back the earlier series of fragile workarounds.
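To make those two failure modes concrete, here is a toy model in Go, with entirely hypothetical replica names, of what the cluster looked like once the last replica was upgraded: following the chain of forwarders either reaches a replica that no longer exists, which surfaces as SERVFAIL, or revisits a replica it has already passed through, which surfaces as a timeout.

```go
package main

import "fmt"

// A toy model of the root cause: each upgraded replica forwarded external
// queries to another replica, or to one that had already been torn down.
// Following the forwarder chain shows why queries either dead-ended or looped.
func main() {
	// forwarder maps each replica to the peer it forwards external queries to.
	// "old-ns4" was destroyed during the rolling upgrade and no longer exists.
	forwarder := map[string]string{
		"ns1": "ns2",
		"ns2": "ns1",     // ns1 and ns2 forward to each other: a loop
		"ns3": "old-ns4", // a replica that no longer exists: a dead end
		"ns4": "ns1",     // feeds into the ns1/ns2 loop
	}
	exists := map[string]bool{"ns1": true, "ns2": true, "ns3": true, "ns4": true}

	for _, start := range []string{"ns3", "ns4"} {
		fmt.Printf("query entering at %s: ", start)
		seen := map[string]bool{}
		cur := start
		for {
			if !exists[cur] {
				fmt.Printf("dead end at %s (resolver answers SERVFAIL)\n", cur)
				break
			}
			if seen[cur] {
				fmt.Printf("loop detected at %s (query is never answered; client times out)\n", cur)
				break
			}
			seen[cur] = true
			cur = forwarder[cur]
		}
	}
}
```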
We're taking action to prevent this kind of incident from happening again. We're moving critical-path internal name resolution to an architecture that addresses several gaps. We will split our DNS servers into independent clusters per datacenter and environment, to avoid whole-system outages and to better test changes in our staging environment. We will also move away from database-backed configuration to simpler configuration checked into source control, making our systems more straightforward to understand.
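As a purely hypothetical sketch of that direction, not a committed design, per-datacenter and per-environment resolver configuration kept in source control might look something like this, with changes reviewed and promoted through staging like any other code:

```
dns-config/            # hypothetical repository layout, not our actual systems
  staging/
    dc1/resolver.conf
    dc2/resolver.conf
  production/
    dc1/resolver.conf
    dc2/resolver.conf
```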
We also recognize we were uncommunicative with the community during the incident. Letsencrypt.status.io is and will continue to be our means of communicating during outages, but during this incident we went several hours without updates or estimated remediation timelines. We can do better.
We apologize to the community for the downtime. As always, we will learn from this incident and improve our processes to prevent a recurrence.