On December 7th, 2015, the Let’s Encrypt team was made aware of a bug in its boulder codebase, which handles certificate issuance. This bug allowed certificates to be issued for domains whose Certification Authority Authorization (CAA) records did not allow Let’s Encrypt to issue for them. The domains were still correctly determined to be under the control of the users that asked for the certificates. The Let’s Encrypt team deployed a fix in a few hours, found 6 certificates that should not have been issued, revoked them, and publicly disclosed their revocation later that same day.
CAA is a new technology that allows the controller of a domain to use DNS records to specify which CAs are allowed to issue certificates for that domain. The Let’s Encrypt team added CAA record checks to our Certification Practice Statement (CPS) and so, by the Baseline Requirements, we are required to perform them.
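To make the check concrete, here is a minimal Python sketch of the decision a CA makes once it has looked up a domain's CAA records. It is an illustration only, not boulder's implementation: the names `CAARecord`, `issuer_domain`, and `may_issue` are hypothetical, the CA identifier is passed in by the caller, and the sketch deliberately leaves out the DNS lookup itself, the search up the DNS tree for parent-domain records, and wildcard (`issuewild`) handling.

```python
from typing import NamedTuple, Sequence


class CAARecord(NamedTuple):
    """One CAA resource record (hypothetical container type for this sketch)."""
    flags: int   # bit 0x80 marks the record as critical
    tag: str     # "issue", "issuewild" or "iodef"
    value: str   # e.g. "letsencrypt.org", or ";" to forbid all issuance


KNOWN_TAGS = {"issue", "issuewild", "iodef"}


def issuer_domain(value: str) -> str:
    """Return the CA domain from an issue value, dropping any parameters."""
    return value.split(";", 1)[0].strip().lower()


def may_issue(records: Sequence[CAARecord], ca_domain: str) -> bool:
    """Decide whether the CA named ca_domain may issue for a non-wildcard name."""
    # No CAA records at all: the domain places no restriction on issuance.
    if not records:
        return True
    # A critical record with a tag we do not understand forbids issuance.
    for record in records:
        if record.flags & 0x80 and record.tag.lower() not in KNOWN_TAGS:
            return False
    issuers = [issuer_domain(r.value) for r in records if r.tag.lower() == "issue"]
    # Records exist but none is an "issue" record: no restriction applies here.
    if not issuers:
        return True
    # Otherwise the CA must be named by at least one "issue" record.
    return ca_domain.lower() in issuers
```

For example, `may_issue([CAARecord(0, "issue", "example-ca.test")], "letsencrypt.org")` evaluates to `False`, which is the kind of case the bug described here failed to reject.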
At 15:30 UTC, the Let’s Encrypt project was informed that its code was not rejecting issuance for domains with CAA records that did not allow Let’s Encrypt to issue. The issue was quickly confirmed.
At 19:40 UTC, the team merged a patch fixing the verification of CAA records and adding a test. In parallel, a search through our audit logs to find affected domains began. Multiple avenues of investigation were taken. Some team members used the CT logs and DNS queries; others stuck to using the audit trails logged by boulder.
At 21:15 UTC, the patched boulder services were deployed. There was some delay as we validated that the other changes that went out with the CAA fix were sufficiently low-risk. The deploy tools at the time took whatever was in the master branch, and master was not continuously deployed.
At 02:00 UTC, the team finished the investigation into certificates that needed to be revoked. Audit log data was used for the final determination, since other data like DNS records may have changed over time. Six certificates were determined to have been issued in error.
At 02:10 UTC, the team revoked all 6 certificates. It was discovered that the revocation tool would not invalidate outstanding authorizations, so the domains were blacklisted temporarily, outstanding authorizations were manually invalidated (02:50 UTC), then the domains were manually removed from the blacklist. Issues were filed to improve the revocation tool in order to make manual steps unnecessary next time. Subscribers for the revoked certificates were notified of the revocation.
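The ordering of those manual steps can be sketched with a toy in-memory model. Everything here is hypothetical (the `CAState` class, its fields, and `revoke_with_cleanup` are invented for illustration and do not reflect boulder's data model or tooling); the sketch assumes the temporary block was there to stop still-valid authorizations from being used while they were being invalidated.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CAState:
    """Toy stand-in for CA state; not boulder's actual data model."""
    revoked_serials: Set[str] = field(default_factory=set)
    blocked_domains: Set[str] = field(default_factory=set)
    # domain -> number of still-valid authorizations
    valid_authorizations: Dict[str, int] = field(default_factory=dict)

    def can_issue(self, domain: str) -> bool:
        """Issuance needs an unblocked domain with a valid authorization."""
        return (domain not in self.blocked_domains
                and self.valid_authorizations.get(domain, 0) > 0)


def revoke_with_cleanup(state: CAState, serial: str, domains: List[str]) -> None:
    """Perform the revocation and the follow-up steps as one operation."""
    # 1. Revoke the certificate.
    state.revoked_serials.add(serial)
    # 2. Temporarily block the domains so outstanding authorizations
    #    cannot be used during the cleanup.
    state.blocked_domains.update(domains)
    # 3. Invalidate every outstanding authorization for those domains.
    for domain in domains:
        state.valid_authorizations[domain] = 0
    # 4. Lift the temporary block; the subscriber must now re-validate.
    state.blocked_domains.difference_update(domains)
```

Folding the four steps into one function mirrors the stated goal of the filed issues: a revocation tool that leaves no manual steps.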
At 04:25 UTC, a public disclosure was made to the mozilla.dev.security.policy mailing list and the incident was declared resolved.
How This Happened
A bug was introduced into the boulder code base, making it past two reviewers. A lack of tests covering CAA checks allowed the incorrect behavior to go unnoticed.
What We Learned
The incident reinforced the importance of unit tests for all functionality. Writing tests is an important part of our development workflow, but clearly our coverage was not good enough. We are working to further improve it.
Deployment of the fix was slowed somewhat because we were deploying directly from the development trunk. This meant we had to review other recent changes before we could deploy the patch. We had intended to move to a release branch for deploys, but had not yet done so at the time. We have since moved to deploying from a release branch.
During the incident we identified shortcomings in the feature set of our administrative revocation tool. We had to execute some steps manually, which slowed down the completion of revocations. We are currently working on improvements to the tool which will allow us to carry out revocations in less time in the future.
What Went Well
Let’s Encrypt engineers were available and able to confirm the problem quickly so that work on a fix could start almost immediately. A fix and a test were written, reviewed by two engineers, and committed just over four hours after we were notified of the problem.
Our audit logs turned out to be thorough and easily searchable, allowing us to accurately identify which requests had failed CAA checks before resulting in issuance.
We believe quick and complete disclosure is important for CA incidents. We publicly disclosed the fully resolved incident less than 15 hours after the issue was reported.