We have our own implementation of Let's Encrypt integration, with practically zero changes in the past months (years?). Since November 1st, all our integration tests have been getting the following response on staging:
{"type": "urn:ietf:params:acme:error:rateLimited", "detail": "Service busy; retry later."}
Full details here:
System.InvalidOperationException
Let's Encrypt client failed to send the request (with retries): Method: POST, RequestUri: 'https://acme-staging-v02.api.letsencrypt.org/acme/new-acct', Version: 1.1, Content: System.Net.Http.StringContent, Headers:
{
User-Agent: RavenDB/6.2
User-Agent: (Microsoft Windows 10.0.26100;X64;.NET 8.0.10;X64;pl-PL;en-US;6.2.2-custom-62)
Content-Type: application/jose+json
Content-Length: 2215
}. Problem: {"type": "urn:ietf:params:acme:error:rateLimited", "detail": "Service busy; retry later."}
at Raven.Server.Commercial.LetsEncryptClient.SendAsyncInternal(HttpMethod method, Uri uri, Object message, CancellationToken token) in D:\Workspaces\HR\ravendb_2\src\Raven.Server\Commercial\LetsEncryptClient.cs:line 246
at Raven.Server.Commercial.LetsEncryptClient.SendAsync[TResult](HttpMethod method, Uri uri, Object message, CancellationToken token) in D:\Workspaces\HR\ravendb_2\src\Raven.Server\Commercial\LetsEncryptClient.cs:line 154
at Raven.Server.Commercial.LetsEncryptClient.Init(String email, CancellationToken token) in D:\Workspaces\HR\ravendb_2\src\Raven.Server\Commercial\LetsEncryptClient.cs:line 127
at Raven.Server.Commercial.SetupWizard.LetsEncryptSetupUtils.Setup(SetupInfo setupInfo, SetupProgressAndResult progress, Boolean registerTcpDnsRecords, String acmeUrl, CancellationToken token) in D:\Workspaces\HR\ravendb_2\src\Raven.Server\Commercial\SetupWizard\LetsEncryptSetupUtils.cs:line 24
at SlowTests.Tools.SetupSecuredClusterUsingRvn.Should_Create_Secured_Cluster_From_Rvn_Using_Lets_Encrypt_Mode_One_Node() in D:\Workspaces\HR\ravendb_2\test\SlowTests\Tools\SetupSecuredClusterUsingRvn.cs:line 380
at Xunit.Sdk.TestInvoker`1.<>c__DisplayClass47_0.<<InvokeTestMethodAsync>b__1>d.MoveNext() in /_/src/xunit.execution/Sdk/Frameworks/Runners/TestInvoker.cs:line 259
--- End of stack trace from previous location ---
at Xunit.Sdk.ExecutionTimer.AggregateAsync(Func`1 asyncAction) in /_/src/xunit.execution/Sdk/Frameworks/ExecutionTimer.cs:line 48
at Xunit.Sdk.ExceptionAggregator.RunAsync(Func`1 code) in /_/src/xunit.core/Sdk/ExceptionAggregator.cs:line 90
System.Net.Http.HttpRequestException
Response status code does not indicate success: 503 (Service Temporarily Unavailable).
at System.Net.Http.HttpResponseMessage.EnsureSuccessStatusCode()
at Raven.Server.Commercial.LetsEncryptClient.SendAsyncInternal(HttpMethod method, Uri uri, Object message, CancellationToken token) in D:\Workspaces\HR\ravendb_2\src\Raven.Server\Commercial\LetsEncryptClient.cs:line 242
Any idea what is going on? Was something deployed on November 1st?
How frequently do you run these tests, and how many API calls are there? In the past, some people have aggressively overtaxed the staging system and been rate-limited for that.
We've noticed the same issue since this morning using a third-party ACME client. No test runs took place between November 1st and this morning, so I can't say exactly when it started.
For us, a test run consists of 4 staging cert generations, and we generally run the full test suite an (unscientific) average of 5-7 times a day. So 20-50 staging certs a day? Some days more, some days less.
Not having any issues with live cert generation, just staging.
We've been revising our rate limits, and our rate-limit documentation, recently. The version live on the website doesn't mention any requests-per-second limits, but our last version did, and we're going to add them back in.
That aside, the change on 1 Nov was to start clamping down on how many requests per second each IP address can make. I've been adjusting that in production datacenter 2 all day to try to get a good balance -- because 100 rps of new-nonce from one IP address is just abuse, for example, and that has to stop.
I'm looking at the 503 metrics out of staging right now, and I need to spend some more time with the limits there, too. As an example, in Staging I've got /acme/new-acct set to no more than 1 new-acct per second per IP per load balancer -- and people are hitting it a lot. A lot more than Friday's numbers. Probably you are, too. I like that limit because we have no way to delete old accounts right now, but it's even tighter than the 20 rps limit across new-order, new-acct, and the other endpoints that we had documented.
Let me do some adjusting and I'll see if we can fix this for you. And we're going to keep iterating on the docs, too.
@blocke: I've reduced the 503 count in Staging back down to zero through a few adjustments to make it look more like what I got set up in production this morning. Please @ me if you see more issues with your testing.
Thanks for fixing this! We use cert-manager, and it appears to have a backoff/retry mechanism which may have even exacerbated the issue on your end: if other folks were seeing failures too, you were suddenly getting hammered with many more requests than you would have seen previously.
For example, we had probably 10 pipelines fail with this issue today, and each one would have retried every few seconds for an hour; if those pipelines had been able to get their certs on the first attempt, they would have made only 10 requests in total... just food for thought. Happy to provide more details.
Yeah, that would be my expectation as well. It is a little funny that traffic to staging has more daily variation than production, but that's been true for a long while.
All good now. Many thanks for handling that. We have around 10 tests that use LE, but they run in parallel, which is why we hit the limit. We are planning to add some resilience to our LE integration with backoff and throttling policies in the near future.
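Roughly what we have in mind, as an illustrative C# sketch (names and numbers are placeholders, not our actual LetsEncryptClient code): cap concurrency with a semaphore and back off with jitter on 429/503.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch only -- not the actual RavenDB LetsEncryptClient code.
// Caps concurrent ACME calls with a semaphore and retries 429/503 responses
// with exponential backoff plus a little jitter.
static class AcmeResilience
{
    private static readonly SemaphoreSlim Throttle = new SemaphoreSlim(2); // at most 2 concurrent ACME requests

    public static async Task<HttpResponseMessage> SendWithBackoffAsync(
        HttpClient client, Func<HttpRequestMessage> requestFactory, CancellationToken token)
    {
        for (var attempt = 0; ; attempt++)
        {
            await Throttle.WaitAsync(token);
            HttpResponseMessage response;
            try
            {
                // A fresh HttpRequestMessage per attempt -- a request instance cannot be resent.
                response = await client.SendAsync(requestFactory(), token);
            }
            finally
            {
                Throttle.Release();
            }

            var retryable = response.StatusCode == HttpStatusCode.TooManyRequests ||
                            response.StatusCode == HttpStatusCode.ServiceUnavailable;
            if (!retryable || attempt >= 5)
                return response; // success, a non-retryable error, or out of retries

            response.Dispose();

            // Exponential backoff with jitter: ~1s, 2s, 4s, ... capped at 60s.
            var delaySeconds = Math.Min(60, Math.Pow(2, attempt)) + Random.Shared.NextDouble();
            await Task.Delay(TimeSpan.FromSeconds(delaySeconds), token);
        }
    }
}
```

The bounded retry count is the important part for the server side: it keeps a broken staging environment from turning every failed test into an hour-long retry storm.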
It surely is, and it sounds like a badly implemented client. You almost always get a Replay-Nonce in the HTTP response headers from the ACME server, so calling new-nonce is redundant in most cases.
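For reference, RFC 8555 puts a Replay-Nonce header on every response, so a well-behaved client only needs new-nonce as a fallback when its cache is empty. A rough sketch of that caching (illustrative names, not any particular client's code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Net.Http;
using System.Threading.Tasks;

// Every ACME response carries a Replay-Nonce header (RFC 8555, section 6.5),
// so new-nonce only needs to be hit when no cached nonce is available.
public class NonceCache
{
    private readonly ConcurrentQueue<string> _nonces = new ConcurrentQueue<string>();
    private readonly HttpClient _client;
    private readonly string _newNonceUrl; // the newNonce URL from the ACME directory

    public NonceCache(HttpClient client, string newNonceUrl)
    {
        _client = client;
        _newNonceUrl = newNonceUrl;
    }

    // Store the nonce returned with every ACME response.
    public void Remember(HttpResponseMessage response)
    {
        if (response.Headers.TryGetValues("Replay-Nonce", out var values))
            foreach (var nonce in values)
                _nonces.Enqueue(nonce);
    }

    // Only call new-nonce when the cache is empty.
    public async Task<string> GetAsync()
    {
        if (_nonces.TryDequeue(out var cached))
            return cached;

        // HEAD against newNonce returns a nonce with no other side effects.
        var response = await _client.SendAsync(new HttpRequestMessage(HttpMethod.Head, _newNonceUrl));
        Remember(response);

        if (_nonces.TryDequeue(out var fresh))
            return fresh;

        throw new InvalidOperationException("ACME server did not return a Replay-Nonce header.");
    }
}
```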
Exactly, and they're taking resources from the good clients.
On the other front, it's fun to realize how much of Staging's traffic varies with the workday. I don't look at Staging's traffic that critically very often, but there are definitely differences.
Sorry for the issues, glad we got it fixed (so far)!
I almost think there could be some value in intentionally having staging return a 503 Service Busy for a percentage of traffic, kind of like I think Pebble can, just to help ensure that clients don't see that as an "error" but as a normal thing they should be expecting and handling. Though I certainly understand there's also value in keeping staging's configuration as close to production as possible.
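If staging did inject occasional 503s, the well-behaved client response is to treat them as flow control: honor Retry-After when present and pause, otherwise fall back to a sensible default. A tiny illustrative helper (hypothetical, not tied to any specific ACME client):

```csharp
using System;
using System.Net.Http;

static class RetryAfterHelper // hypothetical helper, names are illustrative
{
    // Treat a 503 "Service busy" as expected flow control: if the server sent a
    // Retry-After header, honor it; otherwise use the caller's fallback delay.
    public static TimeSpan GetRetryDelay(HttpResponseMessage response, TimeSpan fallback)
    {
        var retryAfter = response.Headers.RetryAfter;

        if (retryAfter?.Delta is TimeSpan delta && delta > TimeSpan.Zero)
            return delta;                                // e.g. "Retry-After: 5"

        if (retryAfter?.Date is DateTimeOffset date)
        {
            var wait = date - DateTimeOffset.UtcNow;     // e.g. "Retry-After: <http-date>"
            if (wait > TimeSpan.Zero)
                return wait;
        }

        return fallback;
    }
}
```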