New "Service Busy" responses beginning during high load

Beginning today, our ACME API endpoints will return a new response during times of extremely high request volume. The responses will be HTTP status code 503 (Service Unavailable) saying Service busy; retry later, with a Retry-After header suggesting how long ACME clients should wait before trying again.

Let's Encrypt experiences spikes of load at the first second of each hour, with the request volume particularly high at exactly 00:00:00 UTC, and even higher on the first day of each month. Currently, when Let's Encrypt's services are beyond capacity, some requests fail with an HTTP status code 500 indicating that an internal error occurred; ACME clients usually then have to start over from the beginning, as they're unsure of the state of their order.

Starting today, ACME clients can instead expect to be asked to retry after a certain number of seconds. This already happens when clients exceed certain rate limits, but in this case, the only thing the client has done wrong is choose a period of high load to send its request.

RFC 8555 suggests clients should always reveal the details of a problem document the ACME server returns. In this case, we're returning:

{
    "type": "urn:ietf:params:acme:error:rateLimited",
    "detail": "Service busy; retry later."
}
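As a rough illustration of surfacing the problem document to the user, here is a minimal Python sketch. The function name `describe_problem` and its fallback messages are hypothetical, not part of any particular ACME client; it only assumes the response body is the JSON shown above.

```python
import json


def describe_problem(body):
    """Build a one-line, human-readable summary of an ACME problem document.

    `body` is the raw response body as a string. The name and the fallback
    wording are illustrative assumptions, not taken from a real client.
    """
    try:
        doc = json.loads(body)
    except ValueError:
        return "ACME server returned an error that could not be parsed"
    if not isinstance(doc, dict):
        return "ACME server returned an unexpected error format"
    # Both fields are optional in a problem document, so fall back gracefully.
    problem_type = doc.get("type", "unknown error type")
    detail = doc.get("detail", "no detail provided")
    return f"{problem_type}: {detail}"
```

A client could log or print the returned string as-is, so that the "Service busy; retry later." text reaches whoever is reading the output.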

Many ACME clients will automatically retry when presented with an HTTP status code 503 with a Retry-After header, but some might simply report an error and halt. That is no different from today, as clients that receive an HTTP status code 500 generally halt.
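For client authors who want to handle this case, the following Python sketch shows one way to turn a 503 response into a wait time. It is an illustration under stated assumptions, not how any specific client works: the function name, the 60-second default, and the one-hour cap are all hypothetical choices, and `headers` is assumed to be a plain dict keyed by the exact string `Retry-After`. Retry-After may carry either a number of seconds or an HTTP date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(status, headers, now=None, default=60.0, cap=3600.0):
    """Return how many seconds to wait before retrying, or None if not a 503.

    `default` and `cap` are illustrative values, not ones the server mandates.
    """
    if status != 503:
        return None
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "Retry-After: 120".
        delay = float(value)
    else:
        # HTTP-date form, e.g. "Retry-After: Mon, 26 Sep 2022 00:00:30 GMT".
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        if now is None:
            now = datetime.now(timezone.utc)
        delay = (when - now).total_seconds()
    # Never return a negative wait, and never wait longer than the cap.
    return min(max(delay, 0.0), cap)
```

The caller would sleep for the returned number of seconds and then resend the same request, rather than restarting the whole order.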

Overall, ACME clients should use randomness in deciding when to begin their renewals. This helps keep Let's Encrypt's service healthy, and keeps clients from accidentally contributing to time-synchronized distributed load spikes.
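One simple way to add that randomness is to pick a uniformly random offset inside a renewal window instead of firing at a fixed wall-clock time. The Python sketch below assumes a hypothetical helper name and a 24-hour window; neither is a recommendation from Let's Encrypt, just an example of the idea.

```python
import random
from datetime import datetime, timedelta, timezone


def jittered_renewal_time(base, window=timedelta(hours=24), rng=random):
    """Return a random moment in the interval from `base` to `base + window`.

    The 24-hour default window is an illustrative assumption. `rng` can be a
    seeded `random.Random` instance to make the result reproducible in tests.
    """
    offset_seconds = rng.uniform(0, window.total_seconds())
    return base + timedelta(seconds=offset_seconds)


# Example: spread a renewal that would otherwise run at midnight UTC
# across the following day.
midnight = datetime(2022, 10, 1, 0, 0, 0, tzinfo=timezone.utc)
scheduled = jittered_renewal_time(midnight)
```

Because the offset is continuous rather than rounded to the hour or minute, the chosen time is very unlikely to land on the first second of an hour, which is where the load spikes occur.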

[Edited: On 26 September 2022, Let's Encrypt changed status codes in an overload case from HTTP 429 to HTTP 503, to avoid confusion between rate limits and overload conditions.]


In response to community feedback, we're planning to change the status code returned to 503 Service Unavailable, so as to avoid unintentionally conflating rate limits with API load. The Retry-After header and problem document will remain the same.


We are now serving HTTP status code 503 (Service Unavailable) during overload conditions. The top post has been updated.
