2018.11.30 Production Google CT Log Submission Failures



On November 30, 2018 from 18:05 UTC until 19:05 UTC our ACME API /new-cert and /finalize endpoints were partially unavailable because our services could not obtain the SCTs they needed from a trusted Google CT log to complete issuance. Certificate requests during that window failed.

At 2018-11-30 18:05 UTC, our internal alerting notified Let’s Encrypt staff that our primary datacenter was failing 20% of user requests to the /new-cert and /finalize endpoints. Initial investigation showed that our Boulder services were healthy but the Google CT logs Icarus and Argon2019 were returning errors. At 18:33 UTC, we concluded that Google was blocking our IP address range and failed over to our secondary datacenter, which was not experiencing problems. At 18:45 UTC, our secondary datacenter began experiencing the same problem, and a Google contact confirmed that the CT logs were limiting traffic. Because Google Chrome’s SCT embedding requirements made issuance impossible without those logs, we decided to stop all Boulder services; we also hoped that stopping services would reduce the traffic being sent to Google. At 19:05 UTC, we saw the Google logs recover and turned Boulder services back on.
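The reason losing only the Google logs halted all issuance is Chrome’s CT policy, which (among other requirements) expected embedded SCTs to include at least one from a Google-operated log and one from a non-Google log. A minimal sketch of that kind of policy check, assuming an illustrative log set and minimum count (this is not Boulder’s actual implementation):

```python
# Hedged sketch of a Chrome-style CT policy check: embedded SCTs must
# include at least one from a Google-operated log and one from a
# non-Google log. The log names and minimum count are illustrative.

GOOGLE_LOGS = {"Google 'Icarus'", "Google 'Argon2019'"}  # assumed example set

def satisfies_ct_policy(sct_log_names, minimum=2):
    """Return True if the SCT set meets a one-Google/one-non-Google rule."""
    has_google = any(name in GOOGLE_LOGS for name in sct_log_names)
    has_non_google = any(name not in GOOGLE_LOGS for name in sct_log_names)
    return len(sct_log_names) >= minimum and has_google and has_non_google

# With Google logs rejecting submissions, no compliant SCT set is possible:
print(satisfies_ct_policy(["DigiCert Log", "Cloudflare 'Nimbus'"]))  # False
print(satisfies_ct_policy(["Google 'Icarus'", "DigiCert Log"]))      # True
```

Under a policy like this, no combination of non-Google SCTs can satisfy issuance, which is why we stopped services rather than retrying against other logs.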

Timeline:

2018-11-30 18:05 UTC - internal alerting notified us of an issuance problem

2018-11-30 18:33 UTC - investigation suggests that Google CT Logs are blocking our primary datacenter traffic and we fail over to our secondary DC

2018-11-30 18:41 UTC - we contact Google

2018-11-30 18:45 UTC - the secondary DC experiences the same problem and we stop all services

2018-11-30 19:05 UTC - we successfully get STHs from the Google CT logs and turn services back on
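The recovery check at 19:05 corresponds to RFC 6962’s get-sth endpoint, which returns a log’s signed tree head as JSON; a non-error, well-formed STH is a reasonable signal that a log is serving traffic again. A minimal sketch of such a health probe, assuming the standard RFC 6962 response fields (the sample values are illustrative, not from the real Icarus log):

```python
import json

# RFC 6962 get-sth responses look like this. Sample values are
# illustrative, not actual data from the Icarus log.
SAMPLE_STH = json.dumps({
    "tree_size": 471234567,
    "timestamp": 1543604700000,  # ms since epoch (~2018-11-30 19:05 UTC)
    "sha256_root_hash": "3q2+7w==",
    "tree_head_signature": "BAMARg==",
})

def parse_sth(body):
    """Parse a get-sth response; raise if a required field is missing."""
    sth = json.loads(body)
    required = ("tree_size", "timestamp",
                "sha256_root_hash", "tree_head_signature")
    for field in required:
        if field not in sth:
            raise ValueError(f"malformed STH: missing {field}")
    return sth

sth = parse_sth(SAMPLE_STH)
print(sth["tree_size"])  # 471234567
```

A real probe would fetch the body over HTTPS from the log’s get-sth URL (e.g. `<log base URL>/ct/v1/get-sth`) before parsing it this way.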

Our primary datacenter handles approximately 80% of all issuance traffic and our secondary datacenter handles the remaining 20%. During the Google CT log outage, the primary datacenter’s IP was fully blocked, resulting in a 100% failure rate for that datacenter. Before failing over to our secondary datacenter, we returned 500s to users at a rate of 18 errors/second; this lasted about 30 minutes. After we failed over to our secondary datacenter, our issuance success rate was 100% for only a few minutes before we began returning errors to users again. At that point, we stopped all Boulder services, hoping to reduce the request rate to the Google CT logs. For about 20 minutes, we were not able to issue certificates at all.

Impact:
While our services were running, we served a confirmed total of 33,063 errors in response to certificate requests. Based on our normal issuance rates, an estimated 7,000 more certificate requests failed while our issuance service was shut down, for an estimated total of about 40,000 failed requests.
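The impact totals are consistent with the rates reported above; the durations and rates below are taken from this report, and the shutdown-window figure is the rough estimate quoted:

```python
# Cross-check of the impact numbers using the rates reported above.
primary_errors = 18 * 30 * 60         # 18 errors/s for ~30 min at the primary DC
print(primary_errors)                 # 32400, close to the 33,063 confirmed errors

confirmed = 33_063                    # errors served while Boulder was running
shutdown_estimate = 7_000             # estimated failures while Boulder was off
print(confirmed + shutdown_estimate)  # 40063, i.e. ~40,000 total failed requests
```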

Google has also posted a postmortem for their outage (caused by an incorrectly triggered DDoS mitigation system). Some of the discussion is happening on a separate thread regarding an older log outage.