We have a lot of users who want to avoid rate limits, but want help knowing whether their DNS, network, server, and other things are set up properly before trying to get a certificate.
Many of these users are resorting to switching to the staging endpoint, testing it, and then switching back to production if it worked; and if it didn’t work, hopefully they can fix it and try it again.
I don’t love this idea, as it slows down transactions and adds more complexity. But users are constantly making mistakes and getting rate limited and are getting desperate for solutions.
Would it be good or bad from Let’s Encrypt’s perspective if ACME clients optionally tried each transaction (mainly just cert issuance) against the staging environment first, then used production if it succeeded? I’m mainly concerned about resource usage. If clients adopted this method universally, wouldn’t a 2x uptick in transactions stress LE’s servers to the point of defeating rate limits?
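To make the question concrete, here is a minimal sketch of the "staging first, then production" flow. The two directory URLs are Let's Encrypt's ACME v2 endpoints; `issue_with_staging_first` and the `obtain` hook are hypothetical names standing in for whatever a given ACME client uses to run a full order, not any real client's API.

```python
from typing import Callable, Sequence

# Let's Encrypt ACME v2 directory URLs.
STAGING_DIRECTORY = "https://acme-staging-v02.api.letsencrypt.org/directory"
PRODUCTION_DIRECTORY = "https://acme-v02.api.letsencrypt.org/directory"


def issue_with_staging_first(
    domains: Sequence[str],
    obtain: Callable[[str, Sequence[str]], bytes],
) -> bytes:
    """Try staging first; only contact production if staging succeeds.

    `obtain(directory_url, domains)` is a hypothetical hook supplied by the
    ACME client. It is assumed to return certificate bytes or raise.
    """
    # Dry run: if this raises, production is never contacted, so the
    # failure does not count against production rate limits.
    obtain(STAGING_DIRECTORY, domains)
    # Staging worked, so repeat the same order against production.
    return obtain(PRODUCTION_DIRECTORY, domains)
```

Every successful issuance costs two full orders in this sketch, which is exactly the 2x load I'm asking about.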
Are there any other tips that you recommend giving to users who are struggling to get it right on the first try, at scale?
(1) There is a Failed Validation limit of 5 failures per account, per hostname, per hour.
(2) There is a Duplicate Certificate limit of 5 per week.
(1) isn't really relevant: one hour later, it's gone.
(2) is critical. Most of the "I can't create a certificate because of a rate limit" topics have hit (2).
But (2) isn't a certificate creation problem, it's a certificate installation problem: no server restart, a Bitnami ... But then the certificates are deleted or revoked, and that's not a solution.
Normally, certificate creation may fail if it is the first certificate, or if there have been critical changes and the certificate has expired (adding HSTS, changing the vHost configuration ...).
But I'm sure of this: most renewals are no problem. There is no need to add an extra check.
PS:
No, a failed validation is done, not pending. A lot of pending authorizations points to a buggy client.
Hmm, okay – thanks both, for your input. That’s helpful.
Basically, there’s no way to help users who are already hitting rate limits, other than having them change their processes so that they don’t fail validations repeatedly… I’ll try to get more information and see what exactly they are doing wrong most of the time.
Do you do any preflight checks at all? How many of your users' issues would be solved with the most basic form of preflight?
e.g.
Do a naive local request to the domain to see whether you can see the challenge response (like cert-manager does)
Do a less naive (use external DNS) local request to the domain to see whether you can see the challenge response. (This is what I do).
Do an external request to the domain to see whether you can see the challenge response. There's obviously some privacy and consent implications to this one. This is what Certify the Web does:
If the local request fails (perhaps because the local server can't resolve itself via DNS etc) and if proxy API support is enabled, the app asks the https://api.certifytheweb.com server if it can access the resource instead (which also has the benefit of being external, just like the Let's Encrypt server is).
I considered this with Let's Debug, though it's just for HTTP-01 resources and I didn't want to be part of the critical path for any ACME clients.
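As a rough illustration of the first (naive, local) variant above, here is a minimal sketch of an HTTP-01 preflight. The `/.well-known/acme-challenge/<token>` path comes from the ACME spec; `fetch_http` and `preflight_http01` are hypothetical names, and the check assumes the client has already placed the challenge response where the web server will serve it.

```python
import http.client
import urllib.request
from typing import Callable, Optional


def fetch_http(url: str, timeout: float = 5.0) -> Optional[str]:
    """Fetch `url` and return the body as text, or None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failures, refused connections, timeouts and HTTP error
        # statuses all surface here.
        return None


def preflight_http01(
    domain: str,
    token: str,
    expected: str,
    fetch: Callable[[str], Optional[str]] = fetch_http,
) -> bool:
    """Return True if the challenge response is visible from this machine."""
    url = f"http://{domain}/.well-known/acme-challenge/{token}"
    body = fetch(url)
    return body is not None and body.strip() == expected
```

The default `fetch` uses the local resolver, so this is only the naive check. The less naive and external variants would swap in a different `fetch` (one that resolves through external DNS, or one that asks a remote service), and a pass here still does not guarantee that Let's Encrypt sees the same thing.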
Use backoffs for all issuance attempts that are not triggered directly by user action. That allows the user to come back and perform the required intervention without encountering rate limiting issues when they do.
What we do in our case is to begin increasing the interval between failed issuance attempts for any certificate, up to a maximum of 1 week. We don't apply any penalty for the first couple of attempts, to avoid penalizing initial setup hiccups.
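A minimal sketch of that kind of schedule, for illustration only. The 1 week cap and the free first attempts come from the description above; the doubling rule, the 10 minute base delay, and reading "a couple" as exactly two are assumptions, as is the name `retry_delay`.

```python
from datetime import timedelta

FREE_ATTEMPTS = 2                    # assumption: "first couple" means 2
BASE_DELAY = timedelta(minutes=10)   # assumption: first penalized delay
MAX_DELAY = timedelta(weeks=1)       # cap described above


def retry_delay(failures: int) -> timedelta:
    """How long to wait before the next automatic attempt.

    `failures` is the number of consecutive failed issuance attempts
    for one certificate.
    """
    if failures <= FREE_ATTEMPTS:
        return timedelta(0)
    # Double the delay for each failure past the free ones; the exponent
    # is capped so the multiplication stays small before clamping.
    exponent = min(failures - FREE_ATTEMPTS - 1, 32)
    return min(BASE_DELAY * (2 ** exponent), MAX_DELAY)
```

Attempts triggered directly by the user would bypass this delay, so someone who has just fixed their setup can retry immediately.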
Yeah, but we’ve found in practice that DNS lookups are a matter of perspective. Even authoritative lookups are seldom implemented the same way across clients.
DNS is only one factor – networks, firewalls, software infrastructure, OS configuration, and other moving parts have all been observed to cause ACME failures.
We also use backoffs already, but many users reset the internal rate limiter by clearing caches, restarting the processes, etc., contrary to the manual and recommended best practices.
It seems that the only way to know if a transaction will be successful is to try it. Oh well. Maybe it’s just a user education problem then.
We had this issue with users too, and it went away after we started persisting issuance attempts to BoltDB away from where the main certificate data is.
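The idea can be sketched roughly as follows, using SQLite from the Python standard library in place of BoltDB. `AttemptStore`, its methods, and the table layout are all hypothetical; the point is only that failure counts live in their own file, so restarting the process or clearing the certificate cache does not reset them.

```python
import sqlite3
import time
from typing import Optional


class AttemptStore:
    """Failure counters kept in a file separate from certificate data."""

    def __init__(self, path: str) -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS attempts ("
            "hostname TEXT PRIMARY KEY, "
            "failures INTEGER NOT NULL, "
            "last_failure REAL NOT NULL)"
        )
        self._db.commit()

    def failures(self, hostname: str) -> int:
        """Number of consecutive recorded failures for `hostname`."""
        row = self._db.execute(
            "SELECT failures FROM attempts WHERE hostname = ?", (hostname,)
        ).fetchone()
        return row[0] if row else 0

    def record_failure(self, hostname: str, now: Optional[float] = None) -> int:
        """Increment and persist the failure count; return the new count."""
        count = self.failures(hostname) + 1
        stamp = time.time() if now is None else now
        self._db.execute(
            "INSERT OR REPLACE INTO attempts VALUES (?, ?, ?)",
            (hostname, count, stamp),
        )
        self._db.commit()
        return count

    def record_success(self, hostname: str) -> None:
        """Clear the counter once issuance succeeds."""
        self._db.execute("DELETE FROM attempts WHERE hostname = ?", (hostname,))
        self._db.commit()

    def close(self) -> None:
        self._db.close()
```

A backoff calculation would then read `failures(hostname)` from this store rather than from in-memory state.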
I still think preflights are valuable despite their lack of reliability. Even Let's Encrypt staging is not completely reliable - I've seen plenty of threads where it produces a different result to production.
Could this be a UX issue as well? This problem is relatively easy to solve when there is a user interface/dashboard/command line that allows issuance attempts to be made interactively (with a "dry run" button, too), such as in my case. With a non-interactive daemon like Caddy, it seems much harder to meet the same need.