The FULL post-mortem is included below.
On November 17, 2019 from 14:03 UTC to 19:31 UTC, all certificate issuance was unavailable because of a maxed out primary key on an auxiliary table we use when writing final certificates to the database. From 14:03 UTC to 16:27 UTC all calls to /finalize and /new-cert errored with a status 500 for approximately 820,000 requests. While investigating the problem we turned off all API access from 16:27 UTC to 19:31 UTC and returned status 503. Once services were restored, we had to fix rate limits for some users who tried to issue during that time. Rate limits were fixed at 20:55 UTC and the incident was resolved.
At 14:07 UTC alerts arrived indicating that our secondary datacenter had 50% success rate for requests made to the /finalize and /new-cert endpoints. Within minutes, alerts fired that there had been no successful issuances in the past 5 minutes. Our secondary datacenter serves 20% of Let’s Encrypt API traffic and we immediately determined that we should route that traffic to our primary datacenter and then investigate a root cause.
After re-routing traffic, we began reviewing graphs and logs and it became apparent that both datacenters were having problems. We had been misled at first because our initial alert grouped both datacenters’ alerts together. All API traffic to other endpoints was succeeding and we continued to search for a root cause. During the early investigation, we saw some metrics indicating that there were errors sending certificates to CT logs— a requirement for issuance. We remediated that by removing our Oak and Testflume 2020 logs from the configuration at 15:39 UTC and restarting all Boulder services. Those actions resolved the CT log submission errors but final issuance was still failing.
Next we found this log message from the boulder-ca service:
rpc error: code = Unknown desc = Error 167: Out of range value for column 'id' at row 1. One common cause of this error is that an AUTO_INCREMENT primary key has a type too small for the number of rows in the table— for instance INT instead of BIGINT. We started looking at the database for instances where an
id column might have hit its max id value, preventing issuance and generating the mysterious log line. We looked at the main table (called ‘certificates’) and there were no problems with the id type, but a team member noticed that the AUTO_INCREMENT value was less than the max id. It is expected that a table’s AUTO_INCREMENT value is higher than the maximum value of its primary key and we began exploring options for remediating that. At 16:27 UTC while writing up a fix, a team member expressed concerns that this could be a bug in MariaDB with possible database corruption. We decided to stop all API traffic to prevent additional writes to the database in case it was corrupted. Before attempting a manual fix to change the AUTO_INCREMENT value, we restarted MariaDB at 17:26 UTC. After the restart, we checked information_schema.tables in MariaDB and saw the certificate table’s AUTO_INCREMENT value was higher than the max id. We tested issuance and it failed so we ruled out issues with AUTO_INCREMENT as a root cause. Still fearing possible database corruption, we checked our replicas for their state and planned to role swap and freeze the current master for later debugging— our goal was to restore API traffic access.
While the team was writing and reviewing a change to promote a replica to a master and freeze the current master, another team member joined the conversation and reviewed the situation with fresh eyes. They realized that the issuedNames tables uses an INT(11) primary key column and we insert a row into this table in the same transaction as every insert into the certificates table. Our team discussed how to remediate this issue. We considered an ALTER TABLE migration to update each row on the table but it would have taken a long time for the size of the table, possibly leaving issuance down for hours up to days. A Boulder developer reviewed how the code uses this table and determined that it would be safe to rename the current table and recreate it where the primary key was a BIGINT(20). At 19:04 UTC, the database changes were rolled out and issuance tests succeeded. Since 820,000 certificates failed to be written to the database, the team waited for the boulder-ca to automatically re-insert them before restoring API access. At 19:31 UTC we restored services and within minutes realized some users would be rate limited because our system was counting those re-inserted certificates towards rate limits. Our team determined a fix and implemented it at 20:55 UTC. Within minutes our automatic issuance test that had been rate limited succeeded and we closed out the incident.
During this incident, our team was all hands on deck to investigate, remediate, and communicate the status. We would not have been as effective at communicating the problem without the help of our community forum members. We’d like to thank Juergen Auer (@JuergenAuer) for organizing responses to this problem in our community forum.
Basic Timeline (all times UTC)
2019-11-17 14:03 - the max id of the issuedNames table is reached and the table stops accepting writes
2019-11-17 14:07 - internal monitoring alerts for errors to /finalize and /new-cert
2019-11-17 14:15 - we believe it is a datacenter specific problem all traffic is routed to the primary datacenter
2019-11-17 15:00 - the problem is confirmed in both datacenters
2019-11-17 15:39 - the first fix is deployed under the assumption this is a problem with getting SCTs
2019-11-17 16:27 - traffic for all API calls is stopped while a second fix is discussed
2019-11-17 17:26 - MariaDB is restarted before implementing the second fix. The restart fixes the suspected root cause but does not resolve problems with issuance
2019-11-17 18:09 - the root cause is identified
2019-11-17 19:04 - the fix for the root cause is implemented and tested. The boulder-ca automatically re-inserts certificates into the database
2019-11-17 19:31 - access to the API is restored
2019-11-17 19:39 - work starts on fixing rate limits for users that attempted issuance while the API was available but final issuance was failing
2019-11-17 20:55 - the fixes for rate limits are deployed
What Went Well
- Monitoring quickly alerted us to a high error rate for /finalize and /new-cert.
- We communicated regularly to the public through our status.io page.
What Went Wrong
- Our alerting didn’t initially show the scope of the problem as both datacenters.
- It took us too long to discover the root cause.
Where We Got Lucky
- The database table that reached its max id was an auxiliary table and we could quickly rename and recreate it instead of doing an ALTER TABLE.
This incident was a table design issue and is unlikely to occur in the future. We have started work on the following remediation tasks to ensure it won’t happen with our current database design and to improve the speed at which we understand the problem and resolve it.
- Review current database design and update all primary key ids with AUTO_INCREMENT that are INT(11) to BIGINT(20).
- Fix and improve our alerting to better identify whether a problem impacts one datacenter or both.
- Add stats collection and monitoring of InnoDB metrics that would indicate an AUTO_INCREMENT value is close to the maximum id value.
- Update logging output to indicate which table or query is the source of a given error.