Between 13 Oct 2016 10:00 UTC and 13 Oct 2016 22:40 UTC, Let's Encrypt returned Internal Server Errors (status 500) for a portion of requests to the /acme/new-cert URL. The error rate started out small and increased gradually, peaking at 42%.
We determined that the problem was linked to a Boulder issue that caused requests to block when a database query for the CountCertificatesRange RPC timed out. Specifically, every NewCertificate request would acquire a lock in order to check the cached certificate count, and while holding that lock it might perform a CountCertificatesRange RPC. If that RPC took 2 seconds, all pending NewCertificate requests would block for 2 seconds.
It turned out our CountCertificatesRange requests were regularly taking 2.8 seconds, just a hair under our 3-second timeout for the RPC. When those requests started taking slightly longer during times of increased load, the timeouts would pile up. Because the RA never saw an updated certificate count, each inbound NewCertificate request would attempt to call CountCertificatesRange again. All other pending NewCertificate requests would continue to block until they hit their own timeouts.
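The failure mode described above can be illustrated with a minimal Go sketch. This is not Boulder's actual code: `lockedCache`, `simulate`, the TTL, and the sleep standing in for the database query are all hypothetical, chosen only to show how a failing RPC made under a lock causes every waiting request to repeat the RPC in turn.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// lockedCache mimics the problematic pattern: the lock is held across the RPC.
// fetch is a hypothetical stand-in for the CountCertificatesRange RPC.
type lockedCache struct {
	mu      sync.Mutex
	count   int64
	fetched time.Time
	ttl     time.Duration
	fetch   func() (int64, error)
}

func (c *lockedCache) get() (int64, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.fetched.IsZero() && time.Since(c.fetched) < c.ttl {
		return c.count, nil // fresh cached value, no RPC needed
	}
	// RPC made while holding the lock: every other caller waits here.
	n, err := c.fetch()
	if err != nil {
		return 0, err // cache is not updated, so the next caller retries
	}
	c.count, c.fetched = n, time.Now()
	return n, nil
}

// simulate runs the given number of concurrent requests and reports how
// many times the stand-in RPC was called.
func simulate(requests int, fail bool) int {
	calls := 0 // only touched inside fetch, which runs under the cache lock
	c := &lockedCache{ttl: time.Minute}
	c.fetch = func() (int64, error) {
		calls++
		time.Sleep(time.Millisecond) // stand-in for a slow query
		if fail {
			return 0, errors.New("timeout")
		}
		return 1000, nil
	}
	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = c.get()
		}()
	}
	wg.Wait()
	return calls
}

func main() {
	fmt.Println("RPCs when healthy:", simulate(5, false))
	fmt.Println("RPCs when timing out:", simulate(5, true))
}
```

In the healthy case a single RPC serves all five requests. When the RPC fails, each request performs its own RPC one after another, so the last request in line waits for all the earlier ones before it even starts.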
Once we identified the problem, we changed our rate limit configuration to remove the totalCertificates limit, which made Boulder stop calling CountCertificatesRange. This is a temporary measure based on the fact that we are not in danger of hitting our total certificate capacity in the next couple of weeks.
During deployment of that configuration change, there was a default value for the totalCertificates limit in our configuration manager that was much lower than the production value. For a period of a few minutes, all certificate requests were failing with status 429 and an error of "Certificate issuance limit reached." We discovered this error and repaired it.
Remediation items:
- Improve locking behavior for CountCertificatesRange (general lesson: don't hold locks during RPCs).
- Adjust the CountCertificatesRange query so that it reliably finishes well within its timeout.
- Tune long_query_time to make future slow queries easier to debug.
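One possible shape for the "don't hold locks during RPCs" lesson is sketched below in Go. It is an illustration under stated assumptions, not the fix that was applied to Boulder: `countCache`, `demo`, and the seeded value of 900 are hypothetical. The lock protects only the cached fields, a single caller performs the refresh with the lock released, and other callers get the last known value instead of waiting.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// countCache guards its fields with mu but never holds mu across the RPC.
// fetch is a hypothetical stand-in for the CountCertificatesRange RPC.
type countCache struct {
	mu         sync.Mutex
	count      int64
	fetched    time.Time
	refreshing bool
	ttl        time.Duration
	fetch      func() (int64, error)
}

// get returns the cached count. At most one caller refreshes at a time;
// everyone else immediately receives the last known value.
func (c *countCache) get() int64 {
	c.mu.Lock()
	stale := c.fetched.IsZero() || time.Since(c.fetched) >= c.ttl
	if !stale || c.refreshing {
		n := c.count
		c.mu.Unlock()
		return n
	}
	c.refreshing = true
	c.mu.Unlock()

	n, err := c.fetch() // slow call, made without holding the lock

	c.mu.Lock()
	defer c.mu.Unlock()
	c.refreshing = false
	if err == nil {
		c.count, c.fetched = n, time.Now()
	}
	return c.count
}

// demo calls get while a refresh is in flight and again after it finishes.
func demo() (during, after int64) {
	started := make(chan struct{})
	release := make(chan struct{})
	c := &countCache{ttl: time.Minute, count: 900} // 900: pretend last known count
	c.fetch = func() (int64, error) {
		close(started) // signal that the refresh is in flight
		<-release      // simulate a slow query
		return 1000, nil
	}
	done := make(chan struct{})
	go func() {
		c.get()
		close(done)
	}()
	<-started
	during = c.get() // returns at once with the last known value
	close(release)
	<-done
	after = c.get()
	return during, after
}

func main() {
	during, after := demo()
	fmt.Println(during, after)
}
```

The tradeoff in this sketch is that callers may act on a stale count while a refresh is pending or failing. Whether that is acceptable depends on how strictly the limit must be enforced.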