Database timeouts, October 13 2016

jsha · October 17, 2016, 8:39pm

Between 13 Oct 2016 10:00 UTC and 13 Oct 2016 22:40 UTC, Let’s Encrypt was serving Internal Server Errors (status 500) to a number of requests to the /acme/new-cert URL. The error rate started out small and increased gradually, hitting a peak of 42% at the height of the incident.

We determined that the problem was linked to a Boulder issue that would cause blocking inside Boulder when a database query for the CountCertificatesRange RPC times out. Specifically, all NewCertificate requests would attempt to acquire a lock in order to check the currently cached count of certificates, and inside that lock would potentially perform a CountCertificatesRange RPC. If the CountCertificatesRange RPC took 2 seconds, all currently pending NewCertificate requests would block for 2 seconds.

It turned out our CountCertificatesRange requests were regularly taking 2.8 seconds, just a hair under our 3 second timeout for the RPC. When those requests started taking just a little bit longer during times of increased load, the timeouts would pile up. Because the RA (Registration Authority) never saw an updated certificate count, each inbound NewCertificate request would attempt to call CountCertificatesRange again. All other pending NewCertificate requests would continue to block until they hit their own timeouts.

Once we identified the problem, we changed our rate limit configuration to remove the totalCertificates limit, which made Boulder stop calling CountCertificatesRange. This is a temporary measure based on the fact that we are not in danger of hitting our total certificate capacity in the next couple of weeks.

During deployment of that configuration change, there was a default value for the totalCertificates limit in our configuration manager that was much lower than the production value. For a period of a few minutes, all certificate requests were failing with status 429 and an error of “Certificate issuance limit reached.” We discovered this error and repaired it.

Planned fixes:

- Improve locking behavior for CountCertificatesRange (general lesson: don’t hold locks during RPCs).
- Adjust the CountCertificatesRange query so that it reliably finishes well within its timeout.
- Set max_statement_time and long_query_time to make future slow queries easier to debug.
