We have our own implementation of Let's Encrypt integration, with practically zero changes in the past months (years?). Since November 1st, all our integration tests have been getting the following response on staging:
{"type": "urn:ietf:params:acme:error:rateLimited", "detail": "Service busy; retry later."}
Full details here:
System.InvalidOperationException
Let's Encrypt client failed to send the request (with retries): Method: POST, RequestUri: 'https://acme-staging-v02.api.letsencrypt.org/acme/new-acct', Version: 1.1, Content: System.Net.Http.StringContent, Headers:
{
User-Agent: RavenDB/6.2
User-Agent: (Microsoft Windows 10.0.26100;X64;.NET 8.0.10;X64;pl-PL;en-US;6.2.2-custom-62)
Content-Type: application/jose+json
Content-Length: 2215
}. Problem: {"type": "urn:ietf:params:acme:error:rateLimited", "detail": "Service busy; retry later."}
at Raven.Server.Commercial.LetsEncryptClient.SendAsyncInternal(HttpMethod method, Uri uri, Object message, CancellationToken token) in D:\Workspaces\HR\ravendb_2\src\Raven.Server\Commercial\LetsEncryptClient.cs:line 246
at Raven.Server.Commercial.LetsEncryptClient.SendAsync[TResult](HttpMethod method, Uri uri, Object message, CancellationToken token) in D:\Workspaces\HR\ravendb_2\src\Raven.Server\Commercial\LetsEncryptClient.cs:line 154
at Raven.Server.Commercial.LetsEncryptClient.Init(String email, CancellationToken token) in D:\Workspaces\HR\ravendb_2\src\Raven.Server\Commercial\LetsEncryptClient.cs:line 127
at Raven.Server.Commercial.SetupWizard.LetsEncryptSetupUtils.Setup(SetupInfo setupInfo, SetupProgressAndResult progress, Boolean registerTcpDnsRecords, String acmeUrl, CancellationToken token) in D:\Workspaces\HR\ravendb_2\src\Raven.Server\Commercial\SetupWizard\LetsEncryptSetupUtils.cs:line 24
at SlowTests.Tools.SetupSecuredClusterUsingRvn.Should_Create_Secured_Cluster_From_Rvn_Using_Lets_Encrypt_Mode_One_Node() in D:\Workspaces\HR\ravendb_2\test\SlowTests\Tools\SetupSecuredClusterUsingRvn.cs:line 380
at Xunit.Sdk.TestInvoker`1.<>c__DisplayClass47_0.<<InvokeTestMethodAsync>b__1>d.MoveNext() in /_/src/xunit.execution/Sdk/Frameworks/Runners/TestInvoker.cs:line 259
--- End of stack trace from previous location ---
at Xunit.Sdk.ExecutionTimer.AggregateAsync(Func`1 asyncAction) in /_/src/xunit.execution/Sdk/Frameworks/ExecutionTimer.cs:line 48
at Xunit.Sdk.ExceptionAggregator.RunAsync(Func`1 code) in /_/src/xunit.core/Sdk/ExceptionAggregator.cs:line 90
System.Net.Http.HttpRequestException
Response status code does not indicate success: 503 (Service Temporarily Unavailable).
at System.Net.Http.HttpResponseMessage.EnsureSuccessStatusCode()
at Raven.Server.Commercial.LetsEncryptClient.SendAsyncInternal(HttpMethod method, Uri uri, Object message, CancellationToken token) in D:\Workspaces\HR\ravendb_2\src\Raven.Server\Commercial\LetsEncryptClient.cs:line 242
Any idea what is going on? Was something deployed on November 1st?
How frequently do you run these tests, and how many API calls are there? In the past, some people have aggressively overtaxed the staging system and been rate-limited for that.
We've noticed the same issue since this morning using a third-party ACME client. No test runs took place between November 1st and this morning, so I can't say exactly when it started.
For us, a test run consists of 4 staging cert generations, and we generally run the full test suite an (unscientific) average of 5-7 times a day. So 20-50 staging certs a day? Some days more, some days less.
Not having any issues with live cert generation, just staging.
We've been revising our rate limits, and our rate-limit documentation, recently. The version live on the website doesn't mention any requests-per-second limits, but our last version did, and we're going to add them back in.
That aside, the change on 1 Nov was to start clamping down on how many requests per second each IP address can make. I've been adjusting that in production datacenter 2 all day to try to get a good balance -- because 100 rps of new-nonce from one IP address is just abuse, for example, and that has to stop.
I'm looking at the 503 metrics out of staging right now, and I need to spend some more time with the limits there, too. As an example, in Staging I've got /acme/new-acct set to no more than 1 new-acct per second per IP per load balancer -- and people are hitting it a lot. A lot more than Friday's numbers. Probably you are, too. I like that limit because we have no way to delete old accounts right now, but it's even tighter than the 20 rps limit across new-order, new-acct, and the other endpoints that we had documented.
Let me do some adjusting and I'll see if we can fix this for you. And we're going to keep iterating on the docs, too.
@blocke: I've reduced the 503 count in Staging back down to zero through a few adjustments to make it look more like what I got set up in production this morning. Please @ me if you see more issues with your testing.
Thanks for fixing this! We use cert-manager, and it appears to have a backoff/retry mechanism which may have even exacerbated the issue on your end: if other folks were seeing failures too, you were suddenly getting hammered with many more requests than you would have seen previously.
For example, we had probably 10 pipelines fail with this issue today, and each one would have retried every few seconds for an hour; if those pipelines had been able to get their certs on the first attempt, they would have made only 10 requests in total... just food for thought. Happy to provide more details.
Yeah, that would be my expectation as well. It is a little funny that traffic to staging has more daily variation than production, but that's been true for a long while.
All good now. Many thanks for handling that. We have around 10 tests that use LE, but they run in parallel, which is why we hit the limit. We are planning to add some resilience to our LE integration with backoff and throttling policies in the near future.
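Roughly what we have in mind, as an illustrative C# sketch (names and numbers are placeholders, not our actual LetsEncryptClient code): cap concurrency with a semaphore and back off with jitter on 429/503.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch only -- not the actual RavenDB LetsEncryptClient code.
// Caps concurrent ACME calls with a semaphore and retries 429/503 responses
// with exponential backoff plus a little jitter.
static class AcmeResilience
{
    private static readonly SemaphoreSlim Throttle = new SemaphoreSlim(2); // at most 2 concurrent ACME requests

    public static async Task<HttpResponseMessage> SendWithBackoffAsync(
        HttpClient client, Func<HttpRequestMessage> requestFactory, CancellationToken token)
    {
        for (var attempt = 0; ; attempt++)
        {
            await Throttle.WaitAsync(token);
            HttpResponseMessage response;
            try
            {
                // A fresh HttpRequestMessage per attempt -- a request instance cannot be resent.
                response = await client.SendAsync(requestFactory(), token);
            }
            finally
            {
                Throttle.Release();
            }

            var retryable = response.StatusCode == HttpStatusCode.TooManyRequests ||
                            response.StatusCode == HttpStatusCode.ServiceUnavailable;
            if (!retryable || attempt >= 5)
                return response; // success, a non-retryable error, or out of retries

            response.Dispose();

            // Exponential backoff with jitter: ~1s, 2s, 4s, ... capped at 60s.
            var delaySeconds = Math.Min(60, Math.Pow(2, attempt)) + Random.Shared.NextDouble();
            await Task.Delay(TimeSpan.FromSeconds(delaySeconds), token);
        }
    }
}
```

The bounded retry count is the important part for the server side: it keeps a broken staging environment from turning every failed test into an hour-long retry storm.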
It surely is, and it sounds like a badly implemented client. You almost always get a Replay-Nonce in the HTTP response headers from the ACME server, so calling new-nonce is redundant in most cases.
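For reference, RFC 8555 puts a Replay-Nonce header on every response, so a well-behaved client only needs new-nonce as a fallback when its cache is empty. A rough sketch of that caching (illustrative names, not any particular client's code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Net.Http;
using System.Threading.Tasks;

// Every ACME response carries a Replay-Nonce header (RFC 8555, section 6.5),
// so new-nonce only needs to be hit when no cached nonce is available.
public class NonceCache
{
    private readonly ConcurrentQueue<string> _nonces = new ConcurrentQueue<string>();
    private readonly HttpClient _client;
    private readonly string _newNonceUrl; // the newNonce URL from the ACME directory

    public NonceCache(HttpClient client, string newNonceUrl)
    {
        _client = client;
        _newNonceUrl = newNonceUrl;
    }

    // Store the nonce returned with every ACME response.
    public void Remember(HttpResponseMessage response)
    {
        if (response.Headers.TryGetValues("Replay-Nonce", out var values))
            foreach (var nonce in values)
                _nonces.Enqueue(nonce);
    }

    // Only call new-nonce when the cache is empty.
    public async Task<string> GetAsync()
    {
        if (_nonces.TryDequeue(out var cached))
            return cached;

        // HEAD against newNonce returns a nonce with no other side effects.
        var response = await _client.SendAsync(new HttpRequestMessage(HttpMethod.Head, _newNonceUrl));
        Remember(response);

        if (_nonces.TryDequeue(out var fresh))
            return fresh;

        throw new InvalidOperationException("ACME server did not return a Replay-Nonce header.");
    }
}
```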
Exactly, and they're taking resources from the good clients.
On the other front, it's fun to realize how much of Staging's traffic varies with the workday. I don't look at Staging's traffic that critically very often, but there are definitely differences.
Sorry for the issues, glad we got it fixed (so far)!
I almost think there could be some value in intentionally having staging return a 503 Service Busy for a percentage of traffic, kind of like I think Pebble can, just to help ensure that clients don't see that as an "error" but as a normal thing they should be expecting and handling. Though I certainly understand there's also value in keeping staging's configuration as close to production as possible.
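If staging did inject occasional 503s, the well-behaved client response is to treat them as flow control: honor Retry-After when present and pause, otherwise fall back to a sensible default. A tiny illustrative helper (hypothetical, not tied to any specific ACME client):

```csharp
using System;
using System.Net.Http;

static class RetryAfterHelper // hypothetical helper, names are illustrative
{
    // Treat a 503 "Service busy" as expected flow control: if the server sent a
    // Retry-After header, honor it; otherwise use the caller's fallback delay.
    public static TimeSpan GetRetryDelay(HttpResponseMessage response, TimeSpan fallback)
    {
        var retryAfter = response.Headers.RetryAfter;

        if (retryAfter?.Delta is TimeSpan delta && delta > TimeSpan.Zero)
            return delta;                                // e.g. "Retry-After: 5"

        if (retryAfter?.Date is DateTimeOffset date)
        {
            var wait = date - DateTimeOffset.UtcNow;     // e.g. "Retry-After: <http-date>"
            if (wait > TimeSpan.Zero)
                return wait;
        }

        return fallback;
    }
}
```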