New "Service Busy" responses beginning during high load

jcjones · September 8, 2022, 7:30pm

Beginning today, our ACME API endpoints will return a new response during times of extremely high request volume. The response will be HTTP status code 503 (Service Unavailable) with the detail "Service busy; retry later.", plus a Retry-After header suggesting how long ACME clients should wait before trying again.

Let's Encrypt experiences load spikes at the first second of each hour. Request volume is particularly high at exactly 00:00:00 UTC, and even higher on the first day of each month. Currently, when Let's Encrypt's services are beyond capacity, some requests fail with HTTP status code 500, indicating an internal error; ACME clients then usually have to start over from the beginning, because they're unsure of the state of their order.

Starting today, ACME clients can instead expect to be asked to retry after a certain number of seconds. This already happens when clients exceed certain rate limits, but in this case the only thing the client has done wrong is send its request during a period of high load.

RFC 8555 suggests that clients should always display the details of any problem document the ACME server returns. In this case, we're returning:

{
    "type": "urn:ietf:params:acme:error:rateLimited",
    "detail": "Service busy; retry later."
}
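As a rough illustration, a client could tell this overload response apart from other errors by checking both the status code and the problem type, and surface the "detail" field to the user. This is a minimal sketch; the function name describe_problem is hypothetical and not part of any ACME client library. Because real rate limits use the same rateLimited problem type (with HTTP 429), the 503 status is what identifies the overload case.

```python
import json

# Problem type returned in the overload case (from the problem document above).
RATE_LIMITED = "urn:ietf:params:acme:error:rateLimited"


def describe_problem(status: int, body: str) -> tuple[bool, str]:
    """Return (is_service_busy, message) for an ACME error response.

    is_service_busy is True only for HTTP 503 combined with the rateLimited
    problem type. The message includes the "detail" field so it can be shown
    to the user, as RFC 8555 suggests.
    """
    try:
        problem = json.loads(body)
    except json.JSONDecodeError:
        return False, f"HTTP {status}: unparseable error body"
    if not isinstance(problem, dict):
        return False, f"HTTP {status}: unexpected error body"
    ptype = problem.get("type", "")
    detail = problem.get("detail", "")
    busy = status == 503 and ptype == RATE_LIMITED
    return busy, f"HTTP {status} {ptype}: {detail}"
```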

Many ACME clients will automatically retry when presented with HTTP status code 503 and a Retry-After header, but some might simply error and halt. That is no different from today, as clients that receive HTTP status code 500 generally halt.
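For client authors, a retry loop that honors Retry-After might look roughly like the sketch below. It is not taken from any particular client. The send callable is a hypothetical stand-in for "build and POST the request"; since ACME nonces are single-use, a real send() should sign each attempt with a fresh nonce rather than replaying the same request. Headers are a plain dict here, whereas real HTTP header names are case-insensitive. The default_delay and max_delay values are illustrative assumptions.

```python
import email.utils
import time
from datetime import datetime, timezone
from typing import Callable, Optional


def parse_retry_after(value: Optional[str], now: datetime) -> Optional[float]:
    """Parse Retry-After as delay-seconds or as an HTTP-date; None if invalid."""
    if value is None:
        return None
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)
    try:
        when = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - now).total_seconds())


def send_with_retry(
    send: Callable[[], tuple[int, dict[str, str]]],
    max_attempts: int = 5,
    default_delay: float = 10.0,   # assumed fallback when Retry-After is absent
    max_delay: float = 300.0,      # assumed cap on any single wait
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, dict[str, str]]:
    """Call send() until it returns a non-503 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        status, headers = send()
        if status != 503 or attempt >= max_attempts:
            return status, headers
        delay = parse_retry_after(headers.get("Retry-After"), datetime.now(timezone.utc))
        if delay is None:
            delay = default_delay
        sleep(min(delay, max_delay))
        attempt += 1
```

Injecting sleep keeps the loop testable; in production the default time.sleep is used.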

Overall, ACME clients should use randomness in deciding when to begin their renewals. This helps keep Let's Encrypt's service healthy and keeps clients from contributing to time-synchronized load spikes across many independent machines.
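One way to apply that randomness is to pick a uniformly random moment inside a renewal window instead of a fixed time such as midnight. In this minimal sketch, the window (between 2/3 and 3/4 of the certificate's lifetime) and the function name pick_renewal_time are assumptions for illustration, not a Let's Encrypt requirement.

```python
import random
from datetime import datetime, timedelta


def pick_renewal_time(
    not_before: datetime,
    not_after: datetime,
    rng: random.Random,
) -> datetime:
    """Return a random renewal time between 2/3 and 3/4 of the lifetime.

    The window fractions are illustrative. The offset is drawn uniformly at
    sub-second resolution, so renewals spread out instead of landing on the
    top of the hour or on 00:00:00 UTC.
    """
    lifetime = not_after - not_before
    start = not_before + lifetime * 2 / 3
    end = not_before + lifetime * 3 / 4
    offset = (end - start).total_seconds() * rng.random()
    return start + timedelta(seconds=offset)
```

Clients driven by cron can get a similar effect by scheduling the job at a random minute and second rather than at :00.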

[Edited: On 26 September 2022, Let's Encrypt changed the status code returned in overload conditions from HTTP 429 to HTTP 503, to avoid confusion between rate limits and overload conditions.]

jcjones · September 19, 2022, 7:36pm

In response to community feedback, we're planning to change the status code returned to 503 Service Unavailable, so as to avoid unintentionally conflating rate limits with API load. The Retry-After header and problem document will remain the same.

jcjones · September 26, 2022, 5:55pm

We are now serving HTTP status code 503 (Service Unavailable) during overload conditions. The top post has been updated.
