Validation Failure Rate Limit - why?

I'm providing hosting for a large number of domains, some of them customer-provided domains, but many of them subdomains of a single domain, snikket.chat. All are sharing a single Let's Encrypt account.

Recently I've been sporadically seeing errors returned:

too many failed authorizations (5) for "snikket.chat" in the last 1h0m0s

The request in this case was for a certificate covering the following domains:

  • jmp-test-1727514904.snikket.chat
  • groups.jmp-test-1727514904.snikket.chat
  • share.jmp-test-1727514904.snikket.chat

As far as I am aware, there had been no previous request for this set of domains at the time of the failure. This seems to contradict the documentation, which states that this rate limit is per hostname (as confirmed by posts such as "About Failed Validation Limit", #8 by schoen). The hostname in the error is the base domain, which was not part of the certificate request.

Is there any kind of log of failed verifications I can pull that would help me understand why Let's Encrypt is limiting these requests? Or any other suggestions, or pointers to things I may have overlooked?

The problem is affecting renewals too, so I need to get it resolved before it becomes a huge problem for existing customers, and not just an inconvenience for new domains.


Hi @mwild1, thanks for raising this!

I believe you're correct that our behavior with regard to this limit has changed. Previously it treated each hostname independently, and now it groups validation failures at the "eTLD+1" level (in your case, at the level of "snikket.chat"). I suspect this is simply a bug/oversight and we'll fix it, but either way we'll update you when we know more.
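To illustrate the difference between the two groupings, here is a minimal Python sketch. It is not the actual server code: the hostnames are hypothetical placeholders, and `naive_etld_plus_one` simply assumes a single-label public suffix (true for ".chat"), whereas a real implementation would consult the Public Suffix List.

```python
from collections import Counter

LIMIT = 5  # failed validations allowed per key per hour, as in the error above


def naive_etld_plus_one(hostname: str) -> str:
    # Assumption: the public suffix is exactly one label (e.g. "chat").
    # Real code would look the suffix up in the Public Suffix List.
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def failures_by_key(failed_hostnames, key_fn):
    # Count failed validations under whatever grouping key_fn defines.
    return Counter(key_fn(h) for h in failed_hostnames)


# Hypothetical failures, one per hostname, all under the same base domain.
failed = [
    "alpha.snikket.chat",
    "groups.alpha.snikket.chat",
    "beta.snikket.chat",
    "share.beta.snikket.chat",
    "gamma.snikket.chat",
]

per_hostname = failures_by_key(failed, lambda h: h)
per_registered_domain = failures_by_key(failed, naive_etld_plus_one)

# Per hostname, nothing is near the limit; grouped, the base domain hits it.
print(max(per_hostname.values()), per_registered_domain["snikket.chat"])
```

Under per-hostname grouping, no single name comes close to the limit. Under eTLD+1 grouping, one failure each on five unrelated subdomains is enough to block every new request under the base domain.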

Thanks again!


Yep, we've confirmed that this was an unintended behavior change. We're in the process of rolling back the deploy that caused it, we're fixing the bug in the code, and we expect to deploy that fix later this week.


Thanks very much for the prompt response and fix!


This has been rolled back in production @mwild1.


As a suggestion, consider integration tests with added chaos if you're not already doing that, e.g. random numbers of subdomains, random invalidity, etc. When we used fixed-size data sets as test data, they often didn't reveal the edge cases, because static data will reliably pass. Unit tests are fine, but integration tests are the best canary.
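As a rough sketch of what such a randomized check could look like (in Python, with a toy limiter standing in for the real one; `FailureLimiter`, `fresh_name_blocked`, and the `example.test` base domain are all made up for illustration), the property under test is that a hostname which has never failed validation is never reported as blocked:

```python
import random
from collections import defaultdict, deque


class FailureLimiter:
    """Toy sliding-window limiter: `limit` failures per key per `window` seconds."""

    def __init__(self, limit=5, window=3600.0, key_fn=lambda h: h):
        self.limit = limit
        self.window = window
        self.key_fn = key_fn  # decides how hostnames are grouped
        self._events = defaultdict(deque)

    def record_failure(self, hostname, now):
        self._events[self.key_fn(hostname)].append(now)

    def is_blocked(self, hostname, now):
        events = self._events[self.key_fn(hostname)]
        # Drop failures that have aged out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        return len(events) >= self.limit


def random_hostname(rng, base):
    # Random depth (1 to 3 extra labels) under a shared base domain.
    labels = ["h%06d" % rng.randrange(10**6) for _ in range(rng.randint(1, 3))]
    return ".".join(labels + [base])


def fresh_name_blocked(limiter, rng, base="example.test", rounds=200):
    """Return True if a hostname with no recorded failures is ever blocked."""
    failed_before = set()
    now = 0.0
    for _ in range(rounds):
        now += rng.uniform(0, 60)
        host = random_hostname(rng, base)
        if host not in failed_before and limiter.is_blocked(host, now):
            return True
        if rng.random() < 0.5:  # random invalidity
            limiter.record_failure(host, now)
            failed_before.add(host)
    return False


per_hostname = FailureLimiter()
grouped = FailureLimiter(key_fn=lambda h: ".".join(h.split(".")[-2:]))

ok = fresh_name_blocked(per_hostname, random.Random(1234))
regressed = fresh_name_blocked(grouped, random.Random(1234))
```

With per-hostname keys the property holds by construction. With keys collapsed to the last two labels, failures on unrelated subdomains accumulate under one key and a fresh name gets blocked, which is the kind of regression a fixed, small data set can miss.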


We've been facing the same issue for several hours now. It seems the rollback didn't work in our case? Suddenly our rate-limit override doesn't work anymore? It's causing us operational issues.

Suddenly our rate-limit override doesn't work anymore?

There is not (and never was) an override for this particular limit (the one about too many failed authorizations). It has always been fixed at 5/hour since it was introduced. Are you sure this is the same limit you are encountering?


Our account should have a rate-limit override to issue 16,000 certificates per week, but we're getting this error message:

{
  "type": "urn:ietf:params:acme:error:rateLimited",
  "detail": "Error creating new order :: too many certificates already issued for \"XXXXXX.com\". Retry after 2024-09-30T18:00:00Z: see Rate Limits - Let's Encrypt",
  "status": 429
}
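For anyone wanting to tell these limits apart programmatically, here is a minimal Python sketch that parses a problem document like the one above. The `parse_rate_limit_problem` helper is hypothetical, and the wording of the "detail" string is not a stable interface, so the regular expressions are assumptions based only on the two messages quoted in this thread. If the server also sends a Retry-After HTTP header, that is likely a sturdier source for the retry time.

```python
import json
import re
from datetime import datetime, timezone

# The problem document from the post above, with inner quotes escaped.
problem_text = r'''{
  "type": "urn:ietf:params:acme:error:rateLimited",
  "detail": "Error creating new order :: too many certificates already issued for \"XXXXXX.com\". Retry after 2024-09-30T18:00:00Z: see Rate Limits - Let's Encrypt",
  "status": 429
}'''


def parse_rate_limit_problem(text):
    """Extract the limited name and retry time from a rateLimited problem."""
    problem = json.loads(text)
    if problem.get("type") != "urn:ietf:params:acme:error:rateLimited":
        return None
    detail = problem.get("detail", "")

    # Assumption: the affected name appears as: for "<name>"
    name = re.search(r'for "([^"]+)"', detail)
    # Assumption: the retry time appears as: Retry after <UTC timestamp>Z
    retry = re.search(r"Retry after (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z", detail)

    retry_at = None
    if retry:
        retry_at = datetime.strptime(
            retry.group(1), "%Y-%m-%dT%H:%M:%S"
        ).replace(tzinfo=timezone.utc)

    return {
        "name": name.group(1) if name else None,
        "retry_at": retry_at,
        # Distinguishes the failed-validation limit from other limits.
        "is_failed_validation": "failed authorizations" in detail,
    }


result = parse_rate_limit_problem(problem_text)
print(result)
```

Here `is_failed_validation` comes out false, which matches the point made in the replies: this is the certificates-issued limit, not the failed-authorizations limit this thread is about.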

That's a different rate limit. It may be related to this bug, or it may not be related at all. We prefer to have separate threads for different issues.
