2018.11.30 Production Google CT Log Submission Failures



On November 30, 2018 from 18:05 UTC until 19:05 UTC our ACME API /new-cert and /finalize endpoints were partially unavailable because our services could not obtain the SCTs they needed from a trusted Google CT log to complete issuance. Certificate requests during that window failed.

At 2018-11-30 18:05 UTC, our internal alerting notified Let’s Encrypt staff that our primary datacenter was failing 20% of user requests to the /new-cert and /finalize endpoints. Initial investigation showed that our Boulder services were healthy but the Google CT logs Icarus and Argon2019 were returning errors. At 18:33 UTC, we concluded that Google was blocking our IP address range and failed over to our secondary datacenter, which was not experiencing problems. At 18:45 UTC, our secondary datacenter began experiencing the same problem, and a Google contact confirmed that the CT logs were limiting traffic. Because Google Chrome’s SCT embedding requirements made issuance impossible without those logs, we decided to stop all Boulder services; we also hoped that stopping services would reduce the traffic being sent to Google. At 19:05 UTC, we saw the Google logs recover and turned Boulder services back on.
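The reason losing only the Google logs halted all issuance is Chrome’s CT policy, which (among other requirements) expected embedded SCTs to include at least one from a Google-operated log and one from a non-Google log. A minimal sketch of that kind of policy check, assuming an illustrative log set and minimum count (this is not Boulder’s actual implementation):

```python
# Hedged sketch of a Chrome-style CT policy check: embedded SCTs must
# include at least one from a Google-operated log and one from a
# non-Google log. The log names and minimum count are illustrative.

GOOGLE_LOGS = {"Google 'Icarus'", "Google 'Argon2019'"}  # assumed example set

def satisfies_ct_policy(sct_log_names, minimum=2):
    """Return True if the SCT set meets a one-Google/one-non-Google rule."""
    has_google = any(name in GOOGLE_LOGS for name in sct_log_names)
    has_non_google = any(name not in GOOGLE_LOGS for name in sct_log_names)
    return len(sct_log_names) >= minimum and has_google and has_non_google

# With Google logs rejecting submissions, no compliant SCT set is possible:
print(satisfies_ct_policy(["DigiCert Log", "Cloudflare 'Nimbus'"]))  # False
print(satisfies_ct_policy(["Google 'Icarus'", "DigiCert Log"]))      # True
```

Under a policy like this, no combination of non-Google SCTs can satisfy issuance, which is why we stopped services rather than retrying against other logs.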

Timeline:

2018-11-30 18:05 UTC - internal alerting notified us of an issuance problem

2018-11-30 18:33 UTC - investigation suggests that Google CT Logs are blocking our primary datacenter traffic and we fail over to our secondary DC

2018-11-30 18:41 UTC - we contact Google

2018-11-30 18:45 UTC - the secondary DC experiences the same problem and we stop all services

2018-11-30 19:05 UTC - we successfully get STHs from the Google CT logs and turn services back on
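The recovery check at 19:05 corresponds to RFC 6962’s get-sth endpoint, which returns a log’s signed tree head as JSON; a non-error, well-formed STH is a reasonable signal that a log is serving traffic again. A minimal sketch of such a health probe, assuming the standard RFC 6962 response fields (the sample values are illustrative, not from the real Icarus log):

```python
import json

# RFC 6962 get-sth responses look like this. Sample values are
# illustrative, not actual data from the Icarus log.
SAMPLE_STH = json.dumps({
    "tree_size": 471234567,
    "timestamp": 1543604700000,  # ms since epoch (~2018-11-30 19:05 UTC)
    "sha256_root_hash": "3q2+7w==",
    "tree_head_signature": "BAMARg==",
})

def parse_sth(body):
    """Parse a get-sth response; raise if a required field is missing."""
    sth = json.loads(body)
    required = ("tree_size", "timestamp",
                "sha256_root_hash", "tree_head_signature")
    for field in required:
        if field not in sth:
            raise ValueError(f"malformed STH: missing {field}")
    return sth

sth = parse_sth(SAMPLE_STH)
print(sth["tree_size"])  # 471234567
```

A real probe would fetch the body over HTTPS from the log’s get-sth URL (e.g. `<log base URL>/ct/v1/get-sth`) before parsing it this way.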

Our primary datacenter handles approximately 80% of all issuance traffic and our secondary datacenter handles the remaining 20%. During the Google CT log outage, the primary datacenter’s IP was fully blocked, resulting in a 100% failure rate for that datacenter. Before failing over to our secondary datacenter, we returned 500s to users at a rate of 18 errors/second; this lasted about 30 minutes. After we failed over to our secondary datacenter, our issuance success rate was 100% for only a few minutes before we began returning errors to users again. At that point, we stopped all Boulder services, hoping to reduce the request rate to the Google CT logs. For about 20 minutes, we were not able to issue certificates at all.

Impact:
While our services were running, we served a confirmed total of 33,063 errors in response to certificate requests. Based on our normal issuance rates, an estimated 7,000 more certificate requests failed while our issuance service was shut down, for an estimated total of about 40,000 failed requests.
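The impact totals are consistent with the rates reported above; the durations and rates below are taken from this report, and the shutdown-window figure is the rough estimate quoted:

```python
# Cross-check of the impact numbers using the rates reported above.
primary_errors = 18 * 30 * 60         # 18 errors/s for ~30 min at the primary DC
print(primary_errors)                 # 32400, close to the 33,063 confirmed errors

confirmed = 33_063                    # errors served while Boulder was running
shutdown_estimate = 7_000             # estimated failures while Boulder was off
print(confirmed + shutdown_estimate)  # 40063, i.e. ~40,000 total failed requests
```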

Google has also posted a postmortem for their outage (caused by an incorrectly triggered DDoS mitigation system). Some of the discussion is happening on a separate thread regarding an older log outage.