This rate limit is documented to be one hour. A simple involuntary test of this rate limit produced the following result:
One hour is too high, as it induces user frustration.
Solution:
To prevent frustration and other adverse psychological effects on the user, I advise reducing the rate limit to a more reasonable value in the range of 5 to 15 minutes.
They're aware that their rate limits aren't the best tool for stopping abuse of their resources, and they are looking at ways to improve them.
In the meantime, the main challenge I see is that not enough people are aware of the staging environment. ACME clients should do more to push people in that direction when an initial attempt fails: debugging should happen in the staging environment, and one should only retry in the production environment once one is sure the configuration has been fixed to allow certificate issuance.
I'm curious: in what kind of non-dysfunctional situation would you frequently hit this rate limit?
The rate limit is 5 failures per account, per hostname, per hour. So in my mind, it's quite difficult to actually hit that rate limit, unless your ACME client is severely broken?
So I'm not sure what the problem really is: a rate limit problem or an ACME client problem?
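To make the limit described above concrete, here is a minimal sketch of a sliding-window failure counter keyed per account and hostname, as in "5 failures per account, per hostname, per hour". The class name, parameters, and in-memory storage are my own illustrative assumptions, not how Let's Encrypt actually implements it.

```python
import time
from collections import defaultdict, deque

class FailedValidationLimiter:
    """Sliding-window limiter: at most `limit` failures per
    (account, hostname) pair within `window` seconds.
    Illustrative sketch only, not Let's Encrypt's implementation."""

    def __init__(self, limit=5, window=3600):
        self.limit = limit
        self.window = window
        # (account, hostname) -> deque of failure timestamps
        self.failures = defaultdict(deque)

    def record_failure(self, account, hostname, now=None):
        now = time.time() if now is None else now
        self.failures[(account, hostname)].append(now)

    def is_limited(self, account, hostname, now=None):
        now = time.time() if now is None else now
        q = self.failures[(account, hostname)]
        # Drop failures that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) >= self.limit
```

With these numbers, an ACME client would need five distinct failed validations for the same hostname inside one hour before being blocked, which supports the point that a working client should rarely get there.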
Alternate Solution:
Test and debug first with the staging environment, as its rate limits are much higher. Then, when everything is working, move to the production environment.
All rate limits induce user frustration, because they prevent users from continuing to abuse the services they've been abusing. That isn't a reason to remove or change them.
I proposed a couple of years ago that a different failed-validation rate limit schedule would reduce frustration a lot, including one that was slightly more restrictive in order to draw users' attention to the problem earlier (e.g. 4 failed validations per 20 minutes, but 6 failed validations per 90 minutes, or something).
I think the problem with this from the Let's Encrypt side was just that it would entail slightly more complicated database logic, but I still think it would be helpful. The argument in favor is that some people don't know there is a rate limit at all, don't know they're consuming server resources, or don't know that repeating the same method won't suddenly start working. So it could be helpful to proactively stop them sooner, so that they start looking at the documentation or trying something different sooner, while also being somewhat more forgiving in case they then do fix the problem.
I would then also have two different documentation links, one for "failed validation limit - hey, did you know there is a limit?" and one for "failed validation limit - unfortunately, we're really going to stop you for a little while".
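The tiered schedule proposed above (a stricter short-horizon limit to get the user's attention early, plus a more forgiving long-horizon limit) could be sketched like this. The tier values are the hypothetical numbers from the post, and the `"soft"`/`"hard"` labels are my own names for the two documentation links:

```python
# Hypothetical tiered schedule from the post: trip a "soft" limit
# earlier (4 failures / 20 min) to draw attention to the problem,
# but be more forgiving over a longer horizon (6 failures / 90 min).
TIERS = [
    (4, 20 * 60, "soft"),   # (max failures, window in seconds, label)
    (6, 90 * 60, "hard"),
]

def check_tiers(timestamps, now, tiers=TIERS):
    """Return the label of the first tier whose limit is reached,
    or None if no tier is exceeded. `timestamps` are the times of
    past failed validations (seconds)."""
    for limit, window, label in tiers:
        recent = [t for t in timestamps if now - t <= window]
        if len(recent) >= limit:
            return label
    return None
```

Each label would then map to its own documentation page: the "soft" one for "hey, did you know there is a limit?", the "hard" one for "unfortunately, we're really going to stop you for a little while".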
Perhaps some sort of scaled multiplier...
If you have X active certs, then your wait time [from the most recent issuance date] is X times some number of minutes [or hours].
example:
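A minimal sketch of the scaled-multiplier idea: the wait time grows linearly with the number of active certificates. `BASE_MINUTES` is an assumed tuning knob for illustration, not a real Let's Encrypt parameter.

```python
# Assumed base wait per active certificate; purely illustrative.
BASE_MINUTES = 5

def wait_minutes(active_certs, base=BASE_MINUTES):
    """Wait time in minutes, counted from the most recent issuance
    date, scaled by the number of active certificates."""
    return active_certs * base
```

Under this assumption, an account with 3 active certificates would wait 15 minutes after its most recent issuance, while an account with 20 would wait 100.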
We're likely reimplementing rate limits this year, and we're definitely thinking about how we can make some of this better. There have been some good ideas here, and I'll make sure we take them into consideration!
I don't really see how different rate limits would make a difference in these cases. The current rate limit is an hour, i.e., 60 minutes. Yes, it would be frustrating to wait that long, but one could easily switch to the staging environment to continue tinkering with an experimental or failing system.
I understand the first "frustrated" reaction, but if you think about it, there really is no need to be frustrated to begin with. Just switch to the staging environment. By the time everything is sorted out, the 60-minute rate limit will probably have been lifted already. Shorter rate limits won't fix the initial frustration anyway, I'd say, and what is 60 minutes in an entire lifetime?