Using staging environment 1:1 before all production endpoint transactions

mholt · January 8, 2020, 6:36pm

We have a lot of users who want to avoid rate limits, but want help knowing whether their DNS, network, server, and other things are set up properly before trying to get a certificate.

Many of these users are resorting to switching to the staging endpoint, testing it, and then switching back to production if it worked; and if it didn’t work, hopefully they can fix it and try it again.

I don’t love this idea, as it slows down transactions and adds more complexity. But users are constantly making mistakes and getting rate limited and are getting desperate for solutions.

Would it be good or bad from Let’s Encrypt’s perspective if ACME clients optionally tried each transaction (mainly just cert issuance) against the staging environment first, then used production if it succeeded? I’m mainly concerned about resource usage. If clients adopted this method universally, wouldn’t a 2x uptick in transactions stress LE’s servers to the point of defeating rate limits?

Are there any other tips that you recommend giving to users who are struggling to get it right on the first try, at scale?
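
For illustration, the staging-first flow asked about above might look roughly like the sketch below. The `issue` callable is hypothetical (it stands in for whatever the ACME client does to complete an order against a given directory URL); only the two Let's Encrypt directory URLs are real.

```python
from typing import Callable, Sequence

# Let's Encrypt ACME v2 directory URLs.
STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"
PRODUCTION = "https://acme-v02.api.letsencrypt.org/directory"

# Hypothetical signature: issue(directory_url, domains) returns a certificate
# (any object) on success and raises an exception on failure.
IssueFn = Callable[[str, Sequence[str]], object]


def issue_staging_first(issue: IssueFn, domains: Sequence[str]) -> object:
    """Run the order against staging; touch production only if that passes."""
    # If staging fails, the exception propagates and production is never
    # contacted, so no production rate limit is consumed.
    issue(STAGING, domains)
    # Staging passed: repeat the same order against production. This doubles
    # the number of ACME transactions per certificate.
    return issue(PRODUCTION, domains)
```

The cost is visible in the last two lines: every successful issuance makes two full orders instead of one, which is the 2x load concern raised above.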

tdelmas · January 8, 2020, 6:52pm

Why not an intermediate approach?

Always first try with the production environment
If it fails, try with the staging environment until it succeeds, then try again with the production environment

That way you:

Don’t slow down all successful issuance
Don’t overload the staging server
Avoid bugs when the staging environment is ahead of the production one
Still avoid rate limits
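
A minimal sketch of this production-first variant, under the assumption that the client exposes some `issue(directory_url, domains)` callable (a hypothetical name) that raises on failure. The retry count and wait time are placeholder values, not recommendations.

```python
import time
from typing import Callable, Sequence

STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"
PRODUCTION = "https://acme-v02.api.letsencrypt.org/directory"


def issue_with_staging_fallback(
    issue: Callable[[str, Sequence[str]], object],  # hypothetical client hook
    domains: Sequence[str],
    max_staging_attempts: int = 5,    # assumed bound, so the loop terminates
    wait_seconds: float = 60.0,       # assumed pause between staging retries
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    # Step 1: try production directly, so the common success case is not slowed.
    try:
        return issue(PRODUCTION, domains)
    except Exception:
        pass  # this failed attempt still counts against production limits

    # Step 2: retry against staging until it succeeds (or we give up).
    for _ in range(max_staging_attempts):
        try:
            issue(STAGING, domains)
            break
        except Exception:
            sleep(wait_seconds)
    else:
        raise RuntimeError("staging never succeeded; not retrying production")

    # Step 3: staging passed, so try production once more.
    return issue(PRODUCTION, domains)
```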

mholt · January 8, 2020, 7:04pm

This attempt still counts against rate limits, unfortunately:

There is a Failed Validation limit of 5 failures per account, per hostname, per hour.

Actually, now that I read it again, I'm not sure if "per hostname" means subdomains, like in the other rate limits...

It also cuts this limit in half:

For users of the ACME v2 API you can create a maximum of 300 New Orders per account per 3 hours.

Which is not acceptable for our larger-scale deployments.

And, I am not sure, but would this rate limit apply:

You can have a maximum of 300 Pending Authorizations on your account.

Does a failed validation leave the authz pending?
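
One way a client could stay under the Failed Validation limit quoted above is to keep its own count of recent failures per hostname and refuse to try again once the budget is spent. This is only a sketch: the class name is made up, it tracks a single account, and it assumes a sliding one-hour window, which may not match how the server actually counts.

```python
import time
from collections import defaultdict, deque


class FailedValidationBudget:
    """Local, conservative count of failed validations per hostname."""

    def __init__(self, limit=5, window_seconds=3600.0, clock=time.monotonic):
        self.limit = limit              # 5 failures, per the quoted limit
        self.window = window_seconds    # 1 hour, per the quoted limit
        self.clock = clock              # injectable so it can be tested
        self._failures = defaultdict(deque)  # hostname -> failure timestamps

    def _prune(self, hostname):
        # Drop failures that are older than the window (assumed sliding).
        cutoff = self.clock() - self.window
        recent = self._failures[hostname]
        while recent and recent[0] <= cutoff:
            recent.popleft()
        return recent

    def record_failure(self, hostname):
        self._prune(hostname).append(self.clock())

    def may_attempt(self, hostname):
        # True while fewer than `limit` failures were recorded in the window.
        return len(self._prune(hostname)) < self.limit
```

Note that this state lives in memory, so it is lost on restart unless it is written somewhere durable.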

JuergenAuer · January 8, 2020, 7:13pm

Hi @mholt

I think that's not a good idea.

There are two critical limits:

(1) There is a Failed Validation limit of 5 failures per account, per hostname, per hour.
(2) but they are subject to a Duplicate Certificate limit of 5 per week

(1) isn't really relevant, one hour later it's gone.

(2) is critical. Most of the "I can't create a certificate because of a rate limit" topics have hit (2).

But (2) isn't a certificate creation problem, it's only a certificate installation problem: no server restart, a Bitnami setup ... But then certificates are deleted or revoked, and that's not a solution.

Normally, certificate creation may fail if it is the first certificate, or if there are critical changes and the certificate has expired (adding HSTS, changing the vHost configuration ...).

But I'm sure: most renewals are no problem. There is no need to add an extra check.

PS:

No, a failed validation is done, not pending. A lot of pendings -> buggy client.

mholt · January 8, 2020, 8:16pm

Hmm, okay – thanks both, for your input. That’s helpful.

Basically, there’s no way to help users who are hitting rate limits in their predicaments, other than they have to change their processes so that they don’t fail validations repeatedly… I’ll try to get more information and see what exactly they are doing wrong most of the time.

_az · January 8, 2020, 9:00pm

Do you do any preflight checks at all? How many of your users' issues would be solved with the most basic form of preflight?

e.g.

Do a naive local request to the domain to see whether you can see the challenge response (like cert-manager does)
Do a less naive (use external DNS) local request to the domain to see whether you can see the challenge response. (This is what I do).
Do an external request to the domain to see whether you can see the challenge response. There are obviously some privacy and consent implications to this one. This is what Certify the Web does:

If the local request fails (perhaps because the local server can't resolve itself via DNS etc) and if proxy API support is enabled, the app asks the https://api.certifytheweb.com server if it can access the resource instead (which also has the benefit of being external, just like the Let's Encrypt server is).

I considered this with Let's Debug, though it's just for HTTP-01 resources and I didn't want to be part of the critical path for any ACME clients.
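
As a rough sketch of the first, naive kind of preflight for HTTP-01: fetch the challenge URL yourself and compare the body with the key authorization you expect to serve. The function names are made up for illustration, and this uses whatever DNS resolver the local machine is configured with, so it can pass or fail for reasons the Let's Encrypt validators would not share.

```python
import urllib.request


def http_get(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def preflight_http01(domain, token, key_authorization, fetch=http_get) -> bool:
    """Return True if the HTTP-01 challenge response is visible from here."""
    # HTTP-01 challenge responses are served from this well-known path.
    url = f"http://{domain}/.well-known/acme-challenge/{token}"
    try:
        body = fetch(url)
    except Exception:
        # Unreachable, refused, timed out, 404, ...: treat all as "not ready".
        return False
    # The body should be exactly the key authorization for this token.
    return body.strip() == key_authorization
```

The `fetch` parameter is there so the request can be swapped out, for example for one that resolves through an external DNS server, or for one that goes through an external proxy.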

Use backoffs for all issuance attempts that are not triggered directly by user action. That allows the user to come back and perform the required intervention without encountering rate limiting issues when they do.

What we do in our case is to begin increasing the interval between failed issuance attempts for any certificate, up to a maximum of 1 week. We don't apply any penalty for the first couple of attempts, to avoid penalizing initial setup hiccups.
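
A backoff schedule along those lines could be sketched like this. The one-week cap and the penalty-free first couple of attempts come from the description above; the 10-minute base and the doubling are assumptions picked for the example.

```python
from datetime import timedelta

MAX_INTERVAL = timedelta(weeks=1)  # cap described above


def retry_interval(failed_attempts, free_attempts=2, base=timedelta(minutes=10)):
    """How long to wait before the next automatic issuance attempt."""
    # No penalty for the first couple of failures (initial setup hiccups).
    if failed_attempts <= free_attempts:
        return timedelta(0)
    # After that, double the wait for each further failure (assumed growth).
    exponent = failed_attempts - free_attempts - 1
    interval = base * (2 ** min(exponent, 20))  # clamp to keep numbers small
    return min(interval, MAX_INTERVAL)
```

With these example values the wait goes 0, 0, 10 minutes, 20 minutes, 40 minutes and so on, and stops growing once it reaches one week. Attempts triggered directly by the user would bypass this entirely.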

mholt · January 8, 2020, 9:24pm

Thanks for the reply.

Yeah, but we’ve found in practice that DNS lookups are a matter of perspective. Even authoritative lookups are seldom implemented the same across clients.

DNS is only one factor – networks, firewalls, software infrastructure, OS configuration, and other moving parts have all been observed to cause ACME failures.

We also use backoffs already, but many users reset the internal rate limiter by clearing caches or restarting the processes, etc, against the manual / recommended best practices.

It seems that the only way to know if a transaction will be successful is to try it. Oh well. Maybe it’s just a user education problem then.

_az · January 8, 2020, 9:37pm

We had this issue with users too, and it went away after we started persisting issuance attempts to BoltDB away from where the main certificate data is.

I still think preflights are valuable despite their lack of reliability. Even Let's Encrypt staging is not completely reliable - I've seen plenty of threads where it produces a different result to production.

Could this be a UX issue as well? This problem is relatively easy to solve when there is a user interface/dashboard/command line that allows issuance attempts to be made interactively ("dry run" button, too), such as in my case. With a non-interactive daemon like Caddy, it seems much harder to meet the same need.
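
To illustrate the persistence idea from the top of this post: keep the failure counters in their own small database file, separate from the certificate store, so that wiping the certificates does not also reset the backoff. The sketch below uses SQLite from the Python standard library purely as a stand-in for BoltDB; the class and table names are invented for the example.

```python
import sqlite3
import time


class AttemptLog:
    """Failure counters stored apart from the main certificate data."""

    def __init__(self, path):
        # `path` should point somewhere other than the certificate directory.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS attempts ("
            "name TEXT PRIMARY KEY, failures INTEGER NOT NULL, "
            "last_attempt REAL NOT NULL)"
        )
        self.db.commit()

    def record_failure(self, name, now=None):
        now = time.time() if now is None else now
        # Create the row if missing, then bump the counter.
        self.db.execute("INSERT OR IGNORE INTO attempts VALUES (?, 0, ?)", (name, now))
        self.db.execute(
            "UPDATE attempts SET failures = failures + 1, last_attempt = ? "
            "WHERE name = ?",
            (now, name),
        )
        self.db.commit()

    def record_success(self, name):
        # A successful issuance clears the history for that certificate.
        self.db.execute("DELETE FROM attempts WHERE name = ?", (name,))
        self.db.commit()

    def failures(self, name):
        row = self.db.execute(
            "SELECT failures FROM attempts WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else 0
```

This only helps if the file itself survives, so it does nothing for a container that runs without a persistent volume.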

mholt · January 8, 2020, 10:18pm

Ahh, yes, we also persist this stuff to storage, but many of our users clear it out or use Docker and don't mount persistent volumes.

Preflights only help with DNS issues... and only sometimes. I may consider doing them, but only if it proves to solve a significant number of problems.

orangepizza · January 8, 2020, 10:32pm

Since staging doesn’t share rate limits with the production server, even that won’t help with the Duplicate Certificate rate limit.

system · February 7, 2020, 10:32pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
