2019.11.17 Autoincrement maxed out

Today from 2019-11-17 14:03 to 19:31 UTC we had an outage for all issuance. The root cause was that one of the tables we rely on, with an autoincrement primary key, was not using a BIGINT type and hit its maximum value (2147483648). We write to this table as part of writing final certificates to our database.

We remediated by changing the table to have a BIGINT primary key.

After the remediation, we re-inserted all certificates issued during the outage (via automated tools). That caused those certificates to count against rate limits. For people that attempted issuance multiple times during the outage, this meant they were getting rate limit errors even though they never successfully received a certificate. We adjusted the timestamps in our rate limit-related tables so that people in this situation won’t hit rate limits anymore.

Thanks for your patience as we worked on this issue. We’ll be doing a full post-mortem and will post a more detailed incident report once that’s done.

14 Likes

The FULL post-mortem is included below.

Summary

On November 17, 2019 from 14:03 UTC to 19:31 UTC, all certificate issuance was unavailable because of a maxed out primary key on an auxiliary table we use when writing final certificates to the database. From 14:03 UTC to 16:27 UTC all calls to /finalize and /new-cert errored with a status 500 for approximately 820,000 requests. While investigating the problem we turned off all API access from 16:27 UTC to 19:31 UTC and returned status 503. Once services were restored, we had to fix rate limits for some users who tried to issue during that time. Rate limits were fixed at 20:55 UTC and the incident was resolved.

Details

At 14:07 UTC alerts arrived indicating that our secondary datacenter had 50% success rate for requests made to the /finalize and /new-cert endpoints. Within minutes, alerts fired that there had been no successful issuances in the past 5 minutes. Our secondary datacenter serves 20% of Let’s Encrypt API traffic and we immediately determined that we should route that traffic to our primary datacenter and then investigate a root cause.

After re-routing traffic, we began reviewing graphs and logs and it became apparent that both datacenters were having problems. We had been misled at first because our initial alert grouped both datacenters’ alerts together. All API traffic to other endpoints was succeeding and we continued to search for a root cause. During the early investigation, we saw some metrics indicating that there were errors sending certificates to CT logs— a requirement for issuance. We remediated that by removing our Oak and Testflume 2020 logs from the configuration at 15:39 UTC and restarting all Boulder services. Those actions resolved the CT log submission errors but final issuance was still failing.

Next we found this log message from the boulder-ca service: rpc error: code = Unknown desc = Error 167: Out of range value for column 'id' at row 1. One common cause of this error is that an AUTO_INCREMENT primary key has a type too small for the number of rows in the table— for instance INT instead of BIGINT. We started looking at the database for instances where an id column might have hit its max id value, preventing issuance and generating the mysterious log line. We looked at the main table (called ‘certificates’) and there were no problems with the id type, but a team member noticed that the AUTO_INCREMENT value was less than the max id. It is expected that a table’s AUTO_INCREMENT value is higher than the maximum value of its primary key and we began exploring options for remediating that. At 16:27 UTC while writing up a fix, a team member expressed concerns that this could be a bug in MariaDB with possible database corruption. We decided to stop all API traffic to prevent additional writes to the database in case it was corrupted. Before attempting a manual fix to change the AUTO_INCREMENT value, we restarted MariaDB at 17:26 UTC. After the restart, we checked information_schema.tables in MariaDB and saw the certificate table’s AUTO_INCREMENT value was higher than the max id. We tested issuance and it failed so we ruled out issues with AUTO_INCREMENT as a root cause. Still fearing possible database corruption, we checked our replicas for their state and planned to role swap and freeze the current master for later debugging— our goal was to restore API traffic access.

While the team was writing and reviewing a change to promote a replica to a master and freeze the current master, another team member joined the conversation and reviewed the situation with fresh eyes. They realized that the issuedNames tables uses an INT(11) primary key column and we insert a row into this table in the same transaction as every insert into the certificates table. Our team discussed how to remediate this issue. We considered an ALTER TABLE migration to update each row on the table but it would have taken a long time for the size of the table, possibly leaving issuance down for hours up to days. A Boulder developer reviewed how the code uses this table and determined that it would be safe to rename the current table and recreate it where the primary key was a BIGINT(20). At 19:04 UTC, the database changes were rolled out and issuance tests succeeded. Since 820,000 certificates failed to be written to the database, the team waited for the boulder-ca to automatically re-insert them before restoring API access. At 19:31 UTC we restored services and within minutes realized some users would be rate limited because our system was counting those re-inserted certificates towards rate limits. Our team determined a fix and implemented it at 20:55 UTC. Within minutes our automatic issuance test that had been rate limited succeeded and we closed out the incident.

During this incident, our team was all hands on deck to investigate, remediate, and communicate the status. We would not have been as effective at communicating the problem without the help of our community forum members. We’d like to thank Juergen Auer (@JuergenAuer) for organizing responses to this problem in our community forum.

Basic Timeline (all times UTC)

2019-11-17 14:03 - the max id of the issuedNames table is reached and the table stops accepting writes

2019-11-17 14:07 - internal monitoring alerts for errors to /finalize and /new-cert

2019-11-17 14:15 - we believe it is a datacenter specific problem all traffic is routed to the primary datacenter

2019-11-17 15:00 - the problem is confirmed in both datacenters

2019-11-17 15:39 - the first fix is deployed under the assumption this is a problem with getting SCTs

2019-11-17 16:27 - traffic for all API calls is stopped while a second fix is discussed

2019-11-17 17:26 - MariaDB is restarted before implementing the second fix. The restart fixes the suspected root cause but does not resolve problems with issuance

2019-11-17 18:09 - the root cause is identified

2019-11-17 19:04 - the fix for the root cause is implemented and tested. The boulder-ca automatically re-inserts certificates into the database

2019-11-17 19:31 - access to the API is restored

2019-11-17 19:39 - work starts on fixing rate limits for users that attempted issuance while the API was available but final issuance was failing

2019-11-17 20:55 - the fixes for rate limits are deployed

What Went Well

  • Monitoring quickly alerted us to a high error rate for /finalize and /new-cert.
  • We communicated regularly to the public through our status.io page.

What Went Wrong

  • Our alerting didn’t initially show the scope of the problem as both datacenters.
  • It took us too long to discover the root cause.

Where We Got Lucky

  • The database table that reached its max id was an auxiliary table and we could quickly rename and recreate it instead of doing an ALTER TABLE.

Remediation items

This incident was a table design issue and is unlikely to occur in the future. We have started work on the following remediation tasks to ensure it won’t happen with our current database design and to improve the speed at which we understand the problem and resolve it.

  • Review current database design and update all primary key ids with AUTO_INCREMENT that are INT(11) to BIGINT(20).
  • Fix and improve our alerting to better identify whether a problem impacts one datacenter or both.
  • Add stats collection and monitoring of InnoDB metrics that would indicate an AUTO_INCREMENT value is close to the maximum id value.
  • Update logging output to indicate which table or query is the source of a given error.
7 Likes