I ran this command: Used the Certes ACME client NuGet package for C# to create a certificate with 99 SANs
It produced this output:
"acme.error": "{"type":"urn:ietf:params:acme:error:rateLimited","detail":"Service busy; retry later.","status":0}"
My web server is (include version): IIS 10.0
The operating system my web server runs on is (include version): Windows Server 2022 DataCenter
My hosting provider, if applicable, is: AWS
I can login to a root shell on my machine (yes or no, or I don't know): Yes
I'm using a control panel to manage my site (no, or provide the name and version of the control panel): No
The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): Certes 3.0.3
=============================================
We have tried over several days to renew certificates, and this error seems to make renewals fail at random. Is anyone else experiencing this, or are there known issues with the service currently?
We have tried several times over the last 7 days. We usually renew several certificates one after another, each with 99 SANs, with a short 30-second pause in between. The failures seem sporadic, but we have never seen this error before now, so I wondered whether something changed or whether you are seeing higher load than usual.
I am not affiliated with Let's Encrypt and do not know the load on the ACME server, sorry. However, it is possible that the overall load has increased. Retrying will likely help.
What API call do you see the "service busy" in response to? Is it always the same API call?
Is the HTTP status code also available? Some status codes indicate that the response carries a Retry-After header saying when Certes should retry. I don't recall the details off-hand, but more info would be helpful.
Is there any more log info available before and after that error? It looks similar to a Let's Encrypt message, but not entirely. Could there be another service between your client and LE issuing that error?
Also, have you tried updating to Certes 3.0.4? I couldn't find the changelog and don't want to install it to find out. Again, just trying to get more info.
Are you able to try with fewer SAN names in one cert? Does that change the symptom?
No service interruptions are posted, and LE is issuing well over 4 million certs per day, so clearly people are getting certs issued. So far we haven't seen other similar problems reported. See Let's Encrypt Stats.
Just to be clear, I have seen a couple of other posts in the last month that involved a "Service busy" message.
Though I think there were generally other issues involved in those cases too, it may be that "Service busy" is happening more often than it used to. But agreed that most users are getting certs fine, and even if one attempt isn't working then the next attempt generally would. And I think the most common clients may be retrying automatically (as they should) rather than informing the user, so most people might not notice even if the message was happening more often to them.
Thank you everyone, we added some better handling of this error and are now 100% caught up with certificates! We are still seeing the error, but now just retrying after 15 minutes and this seems to mostly handle it OK.
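For anyone wanting to do something similar, the retry handling described above could look roughly like the sketch below. It is written in Python for brevity rather than C#, and the names `issue_with_retry`, `ServiceBusyError`, and the `issue` callable are hypothetical stand-ins, not part of Certes or any ACME client. The 15-minute delay is the only value taken from this thread; the attempt limit is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceBusyError(Exception):
    """Hypothetical exception standing in for a 'Service busy; retry later.' error."""


def issue_with_retry(
    issue: Callable[[], T],
    max_attempts: int = 5,            # assumed limit, not from the thread
    delay_seconds: float = 15 * 60,   # the 15-minute pause mentioned above
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call issue() until it succeeds, pausing after each 'service busy' failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return issue()
        except ServiceBusyError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay_seconds)
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is only there so the loop can be exercised without actually waiting; in production the default `time.sleep` would be used.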
If you're seeing "Service Busy" regularly, then I might be a little concerned that you're hitting the service too often or something. But it may just be that Let's Encrypt keeps getting busier.
Let's Encrypt's 429 & 503 errors should have a Retry-After header with a recommended delay before trying again, if you want to get really fancy.
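As a rough illustration of the "fancy" version, here is a Python sketch of turning a Retry-After header value into a wait time. HTTP allows the header to be either a number of seconds or an HTTP date, so the sketch handles both. The function name `retry_delay_seconds` and the 15-minute fallback are assumptions for illustration; how the header value is obtained from a particular ACME client is not shown, since that depends on the client.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

DEFAULT_DELAY_SECONDS = 15 * 60.0  # assumed fallback when the header is missing or unparseable


def retry_delay_seconds(retry_after: Optional[str], now: Optional[datetime] = None) -> float:
    """Return how many seconds to wait, given a Retry-After header value (or None)."""
    if retry_after is None or not retry_after.strip():
        return DEFAULT_DELAY_SECONDS
    value = retry_after.strip()
    # Form 1: delay in whole seconds, e.g. "120"
    if value.isascii() and value.isdigit():
        return float(value)
    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return DEFAULT_DELAY_SECONDS
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat a zone-less date as UTC
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative wait
    return max(0.0, (when - now).total_seconds())
```

The returned value could then replace a fixed pause between attempts, so the client waits as long as the server asks rather than guessing.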
We’re back to normal now (though keeping an eye on things). Our 503 rate never went above 1%, so most clients that retry on 503 should have been able to issue eventually.
I'm about to update status.io, but the fix took hold at 19:58 UTC, and we haven't served any 503s since then. I expect to serve some at midnight UTC, but it should be a "normal" amount, which I'll cross-check with historical values.
We have been queuing a lot of retries lately on our cPanel and DirectAdmin infrastructure setups, and they have increased in the last month. Maybe it is time to scale the infrastructure up flexibly at peak times, if there is a pattern?
Ideally there should be no "peak time", because requests should be randomised. Unfortunately, all too often some software triggers exactly on the hour or at 00:00, etc., which is of course bad for the ACME server infrastructure.
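To make the randomisation point concrete, here is a minimal Python sketch of picking a random start time inside a window instead of firing at a fixed clock time. The function name `next_run_time` and the 24-hour default window are assumptions for illustration, not something any particular client is known to do.

```python
import random
from datetime import datetime, timedelta
from typing import Optional


def next_run_time(
    base: datetime,
    window: timedelta = timedelta(hours=24),  # assumed window; pick what suits your renewals
    rng: Optional[random.Random] = None,
) -> datetime:
    """Pick a uniformly random moment between base and base + window."""
    if rng is None:
        rng = random.Random()
    offset_seconds = rng.uniform(0, window.total_seconds())
    return base + timedelta(seconds=offset_seconds)
```

A scheduler that wakes at midnight could call this once and sleep until the returned time, so that a fleet of servers spreads its requests over the day rather than all hitting the ACME server at 00:00.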
My personal opinion (also note that I'm not a LE staff member or something like that) would be NOT to scale up at peak times to discourage this behaviour.