Beginning today, our ACME API endpoints will return a new response during times of extremely high request volume. The response will be an HTTP status code 503 (Service Unavailable) with the message "Service busy; retry later." and a Retry-After header suggesting how long ACME clients should wait before trying again.
Let's Encrypt experiences spikes of load at the first second of each hour, with the request volume particularly high at exactly 00:00:00 UTC, and even higher on the first day of each month. Currently, when Let's Encrypt's services are beyond capacity, some requests fail with an HTTP status code 500 indicating that an internal error occurred; ACME clients usually then have to start over from the beginning, as they're unsure of the state of their order.
Starting today, ACME clients can instead expect to be asked to retry after a certain number of seconds. This already happens when clients exceed certain rate limits, but in this case, the only thing the client has done wrong is choose a period of high load to send its request.
RFC 8555 suggests clients should always reveal the details of a problem document the ACME server returns. In this case, we're returning:
{
  "type": "urn:ietf:params:acme:error:rateLimited",
  "detail": "Service busy; retry later."
}
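As an illustration of surfacing that problem document to a user, here is a minimal Python sketch. It assumes the response body has already been read into a string; the function name describe_problem is hypothetical and not part of any particular ACME client.

```python
import json


def describe_problem(body: str) -> str:
    """Turn an ACME problem document (JSON) into a one-line message.

    Hypothetical helper for illustration only.
    """
    try:
        problem = json.loads(body)
    except json.JSONDecodeError:
        return "server returned an error body that is not valid JSON"
    if not isinstance(problem, dict):
        return "server returned an unexpected error body"
    # Fall back to placeholders when a field is absent from the document.
    kind = problem.get("type", "about:blank")
    detail = problem.get("detail", "(no detail provided)")
    return f"{kind}: {detail}"
```

Showing both the type and the detail lets a user tell an overload response apart from other failures by reading the message.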
Many ACME clients will automatically retry when presented with an HTTP status code 503 with a Retry-After header, but some might simply error and halt. That is no different than today, as clients that receive an HTTP status code 500 generally halt.
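For client authors who want to honor the header, the following Python sketch shows one way to work out how long to wait. It is an illustration under stated assumptions, not how any specific client behaves: the function name retry_delay_seconds, the 60-second fallback, and the one-hour cap are all hypothetical choices. HTTP allows Retry-After to carry either a number of seconds or an HTTP date, so the sketch handles both.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay_seconds(
    status: int,
    retry_after: Optional[str],
    now: Optional[datetime] = None,
    default: float = 60.0,   # assumed fallback when the header is missing or unparsable
    cap: float = 3600.0,     # assumed upper bound on how long we are willing to wait
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if this is not a 503."""
    if status != 503:
        return None
    if retry_after is None:
        return default
    value = retry_after.strip()
    if value.isascii() and value.isdigit():
        # Form 1: a non-negative integer number of seconds.
        delay = float(value)
    else:
        # Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        current = now or datetime.now(timezone.utc)
        delay = (when - current).total_seconds()
    # Never return a negative wait, and never exceed the cap.
    return min(max(delay, 0.0), cap)
```

A caller would sleep for the returned number of seconds and then resend the same request, rather than starting the order over.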
Overall, ACME clients should use randomness in deciding when to begin their renewals. This helps keep Let's Encrypt's service healthy, and keeps clients from accidentally contributing to unintentional, time-synchronized load spikes.
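One simple way to add that randomness is to pick a uniformly random moment inside whatever window the client considers acceptable for renewal, instead of a fixed time such as the top of the hour. The Python sketch below assumes the client already knows its window; the function name pick_renewal_time and the one-day window in the usage example are hypothetical.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional


def pick_renewal_time(
    window_start: datetime,
    window_end: datetime,
    rng: Optional[random.Random] = None,
) -> datetime:
    """Pick a uniformly random instant between window_start and window_end."""
    rng = rng or random.Random()
    span = (window_end - window_start).total_seconds()
    if span <= 0:
        raise ValueError("window_end must be later than window_start")
    # A continuous random offset makes landing exactly on 00:00:00 very unlikely.
    return window_start + timedelta(seconds=rng.uniform(0, span))


if __name__ == "__main__":
    start = datetime(2022, 10, 1, tzinfo=timezone.utc)
    print(pick_renewal_time(start, start + timedelta(days=1)))
```

Because each client draws its own offset, a large population of clients spreads its requests across the window rather than arriving in the same second.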
[Edited: On 26 September 2022, Let's Encrypt changed status codes in an overload case from HTTP 429 to HTTP 503, to avoid confusion between rate limits and overload conditions.]