I'm providing hosting for a large number of domains, some of them customer-provided domains, but many of them subdomains of a single domain, snikket.chat. All are sharing a single Let's Encrypt account.
Recently I've been sporadically seeing errors returned:
too many failed authorizations (5) for "snikket.chat" in the last 1h0m0s
The request in this case was for a certificate covering the following domains:
jmp-test-1727514904.snikket.chat
groups.jmp-test-1727514904.snikket.chat
share.jmp-test-1727514904.snikket.chat
There was no previous request for this set of domains at the time of the failure, as far as I am aware. This seems to contradict the documentation, which states that this rate limit is per hostname (confirmed by posts such as About Failed Validation Limit - #8 by schoen). The hostname in the error is the base domain, which was not part of the certificate request.
Is there any kind of log of failed verifications I can pull that would help me understand why Let's Encrypt is limiting these requests? Or any other suggestions, or pointers to things I may have overlooked?
The problem is affecting renewals too, so I need to get it resolved before it becomes a huge problem for existing customers, and not just an inconvenience for new domains.
I believe you're correct that our behavior with regard to this limit has changed -- previously it treated each hostname independently, and now it groups validation failures at the "eTLD+1" level (in your case, at the level of "snikket.chat"). I suspect this is simply a bug/oversight and we'll fix it, but either way we'll update you when we know more.
Yep, we've confirmed that this was an unintended behavior change. We're in the process of rolling back the deploy that caused it, we're fixing the bug in the code, and we expect to deploy that fix later this week.
As a suggestion, consider integration tests with added chaos if you're not already doing that, e.g. random numbers of subdomains, random invalidity, etc. When we used fixed-size data sets as test data, they often didn't reveal the edge cases, because static data will reliably pass. Unit tests are fine, but integration tests are the best canary.
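The randomized-input idea above can be sketched as a small test-data generator. This is only an illustration (the helper names and probabilities are made up, not part of any Let's Encrypt tooling): it produces a random number of random subdomains and occasionally injects a deliberately invalid name so the failure path gets exercised too.

```python
import random
import string


def random_label(rng: random.Random, max_len: int = 10) -> str:
    """Generate one random DNS label (lowercase letters and digits)."""
    length = rng.randint(1, max_len)
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def random_domain_set(rng: random.Random, base: str = "example.com") -> list[str]:
    """Build a random set of subdomains of `base`, sometimes adding an
    invalid name so tests cover failures, not just the happy path."""
    count = rng.randint(1, 5)
    domains = [f"{random_label(rng)}.{base}" for _ in range(count)]
    if rng.random() < 0.3:  # ~30% chance of one deliberately invalid label
        domains.append(f"-bad-.{base}")
    return domains


rng = random.Random(42)  # fixed seed makes a failing run reproducible
print(random_domain_set(rng))
```

Seeding the generator is the important design choice here: when a chaos test does find an edge case, the same seed reproduces the exact input that triggered it.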
We've been facing the same issue for several hours now; it seems the rollback didn't work in our case? Suddenly our rate-limit override doesn't work anymore? It's causing us operational issues.
Suddenly our rate-limit override doesn't work anymore?
There is not (and never was) an override for this particular limit (the one about too many failed authorizations). It has always been fixed at 5/hour since it was introduced. Are you sure this is the same limit you are encountering?
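For context, a fixed "5 failures per hour" limit like this one behaves as a sliding-window counter. Here's a minimal client-side sketch of that behavior (the class and method names are hypothetical, not Boulder's actual implementation), which can be useful for modelling when your own client should stop retrying:

```python
from collections import deque


class FailureWindow:
    """Track failure timestamps and report whether another attempt would
    exceed `limit` failures within the last `window` seconds."""

    def __init__(self, limit: int = 5, window: float = 3600.0):
        self.limit = limit
        self.window = window
        self.failures: deque[float] = deque()

    def record_failure(self, now: float) -> None:
        self.failures.append(now)

    def blocked(self, now: float) -> bool:
        # Drop failures that have aged out of the window, then compare.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        return len(self.failures) >= self.limit


w = FailureWindow()
for t in range(5):
    w.record_failure(float(t))
print(w.blocked(10.0))    # → True: five recent failures, blocked
print(w.blocked(4000.0))  # → False: the failures have aged out of the hour
```

The practical takeaway: once you hit the limit, retrying inside the window only keeps you blocked; back off until the oldest failure is more than an hour old.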
We're getting this error message
Our account should have a rate-limit override allowing us to issue 16,000 certificates per week:
{
  "type": "urn:ietf:params:acme:error:rateLimited",
  "detail": "Error creating new order :: too many certificates already issued for \"XXXXXX.com\". Retry after 2024-09-30T18:00:00Z: see Rate Limits - Let's Encrypt",
  "status": 429
}