Proposal: Make staging environment stable

Currently, the instructions for the staging environment say not to use it for development environments and suggest using Pebble instead (see the "Staging Environment" page in the Let's Encrypt documentation).

However, I don't think this is a good solution to the problem in question. We want our development environment to match our production environment as closely as possible, and testing with Pebble does not seem useful with regard to that goal. If not for your rate limits, we would just use Let's Encrypt's production environment, but the kind of testing we have to do in this environment makes that infeasible.

I would like to propose that the staging environment be considered reasonably stable, and that the only differences between it and the production environment be the lack of rate limits and the use of a non-standard root. Let's Encrypt should also have an internal staging environment to test changes before they are released to the public staging and production environments, which would have helped prevent the issue that unfolded yesterday.

3 Likes

I concur with that completely. Most major software houses with critical customer bases (e.g. government entities) have internal sandbox environments (sometimes in multiple stages) that require changes to pass rigorous regression tests before being released for customer consumption.

3 Likes

Agreed.
The "staging" and the "upcoming beta/new release changes" are two very different things that should be done on two completely separate systems.

3 Likes

Years ago I agreed with this. I have since reversed my position, but I still do not think Pebble is a proper option. My rationale includes, but is not limited to:

  • Staging was supposed to be for testing integrations/deployments, not clients. Nevertheless, many client developers use it for development testing and test suites. Because of this, many clients have implemented anti-patterns and cannot handle edge cases, imperfect scenarios, or unexpected results. It has likely caused more problems for end users than it has prevented. The new mission of staging addresses this.

  • Pebble aims to differ from Boulder whenever it can. It remains compliant with the ACME spec but makes different design choices where possible. This has infuriated many developers, myself included. Pebble is great for running lots of tests and is useful during active development, but one can easily overfit a client to Pebble and make it incompatible with Boulder. You really need to understand both Pebble and the RFC to use Pebble properly.

It would make sense to me if ISRG offered a “development” version of Boulder for client developers that:

  1. Mimics production exactly, except fake roots
  2. Requires signup/registration, with some sort of unique identifying token sent in the headers. That would allow access to be determined before a request hits Boulder (i.e., in a gateway or load balancer, within nginx before a proxy, or in some sort of middleware solution).
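
To make the second point concrete, here is a minimal sketch of the kind of check a gateway or middleware layer could run before a request ever reaches Boulder. Everything in it is an assumption for illustration: the header name `X-Dev-Access-Token`, the in-memory token set, and the function name are hypothetical and are not part of ACME, Boulder, or any existing Let's Encrypt service.

```python
import hmac

# Hypothetical registry of tokens issued at signup. A real deployment would
# back this with a database or the gateway's own key store.
REGISTERED_TOKENS = {"dev-token-alice", "dev-token-bob"}

# Hypothetical header name; nothing in ACME or Boulder defines it.
TOKEN_HEADER = "X-Dev-Access-Token"


def is_request_allowed(headers, registered_tokens=REGISTERED_TOKENS):
    """Return True if the request headers carry a registered developer token."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    presented = normalized.get(TOKEN_HEADER.lower(), "")
    if not presented:
        return False
    # compare_digest avoids leaking token contents through timing differences.
    return any(
        hmac.compare_digest(presented.encode(), token.encode())
        for token in registered_tokens
    )
```

A request that fails this check would be rejected at the edge (for example with HTTP 403), so unregistered traffic never consumes capacity on the ACME server itself.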

The audience of people who need staging for deployment/integration testing is larger than the audience of client developers, though. IMHO, we should be in a situation where people use staging to test an integration and it fails because the client cannot handle a situation properly, so they know to switch clients.

4 Likes

I would also point out that Certbot has --dry-run, and users are typically told that this will give them a pretty accurate prediction of whether a renewal will succeed (diagnosing most potential problems with their ability to complete challenges). Having this almost always work when the production issuance would succeed is really valuable to Certbot users. When the staging environment has outages or failures that are significantly distinct from the production environment's, users may be confused and think that something is wrong with their own setups even when it isn't.

Because of that, I'd also support the idea of having a test environment as close as possible to the production environment in functionality and availability so users can have a meaningful way to simulate production issuance.

(Edit: From my side, this is about uptime and challenge verification, not anything about the staging server's certificate chain.)

5 Likes

As a client developer, I have relied on staging to have downtime occasionally and profited greatly, as the resulting client software has more robust error handling and is better suited for production in the event it has errors too.

A staging environment that never has problems or changes will lead to a production environment that DOES have problems.
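
As a rough illustration of the kind of error handling that intermittent staging outages tend to force on a client, here is a minimal retry-with-exponential-backoff sketch. The exception class and function name are hypothetical, and a real ACME client would also need to honor any `Retry-After` header the server sends.

```python
import time


class TransientServerError(Exception):
    """Hypothetical error raised when the ACME server is temporarily unavailable."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on TransientServerError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientServerError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Delay doubles on every retry: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable only so the delays can be observed in tests without actually waiting.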

When I absolutely need 100% uptime, Pebble has mirrored Boulder well enough for my needs. Just my $0.02.

4 Likes

Ideally there could be a test environment for developing/configuring ACME clients if an unstable staging environment is allowable. If someone wants to try to run pebble on a GoDaddy shared hosting instance for which my client is specialized and that I actually can afford, I might actually sell tickets and popcorn to watch the attempt. :wink: Personally, I believe that Let's Encrypt's wildly-successful proliferation is largely owed to the limited costs and technical requirements of developing and testing ACME clients with a functional staging environment being an integral part. I, for one, do not have the time or financial resources to dedicate towards a larger-scale endeavor of a non-profit nature. Most of us don't have funding from a for-profit enterprise. I am very grateful for the donations I have received in gratitude for my efforts though. Every little bit helps.

2 Likes

@mholt Do you mean as an ACME client developer? I am not an ACME client developer, but rather an ACME client client. Having staging randomly go down doesn't really help me at all. It just blocks all development. And if I were an ACME client developer, I don't think I'd want to be relying on the off chance that the staging environment goes down in order to find bugs.

To be clear, I'm not saying the staging environment needs 100% uptime. I just mean that it should be reasonably stable, and should not be used for deploying untested changes for the first time, which appears to be what went wrong on Friday.

In short:

  1. ACME client developers should use Pebble/Boulder (or similar) for thorough integration testing of their software.
  2. ACME client developers should use Let's Encrypt production or staging for general integration testing.
  3. ACME client clients should use Let's Encrypt production or staging.
  4. Let's Encrypt developers should use an internal staging environment to verify new changes before pushing them to public staging and production.
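
In practice, switching between these environments usually comes down to which ACME directory URL the client is pointed at. A minimal sketch follows; the production and staging URLs are Let's Encrypt's public ACME v2 endpoints, while the Pebble entry assumes a local instance running with its default settings, and the helper name is hypothetical.

```python
# Map each environment name to the ACME directory URL a client would use.
DIRECTORY_URLS = {
    "production": "https://acme-v02.api.letsencrypt.org/directory",
    "staging": "https://acme-staging-v02.api.letsencrypt.org/directory",
    # Assumption: a local Pebble instance listening on its default address.
    "pebble": "https://localhost:14000/dir",
}


def directory_url(environment):
    """Return the ACME directory URL for the named environment."""
    try:
        return DIRECTORY_URLS[environment]
    except KeyError:
        raise ValueError(f"unknown ACME environment: {environment!r}") from None
```

Most clients expose this as a single setting (a server or directory URL), so moving from staging to production should not require any other change.
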
3 Likes

@rg305 I went back and forth on that thought, since I do think it would be valuable to know if the ACME client we are using is compatible with whatever upcoming changes. But I think having them be separate as you described adds a lot of flexibility.

3 Likes

I see two separate concerns here; one I agree with, one I don't.

I agree the platform should be stable in terms of availability. It is unfortunate you lost the ability to do integration testing for a period of time.

However, in terms of how the ACME server operates, the "unexpected changes" potentially surfaced errors in the client that you use.

If you started experiencing issues not from availability or ISRG making a mistake, but from a client incompatibility with staging... IMHO the platform worked correctly and just saved you a future crisis.

I say this from personal experience. Previous changes to staging (around the v1 to v2 shift) alerted my team to major flaws in our client selection, leading us to build our own clients after the majority of clients we tested had similar problems at the time. There are a lot of bad ACME clients out there. I don't know what the current breakdown is, but at several points in the past few years I could safely say the vast majority of ACME clients had serious issues. Many pinned intermediates or roots. Most formatted requests in violation of the RFC, but within some lax rules that Let's Encrypt previously accepted.

3 Likes

@jvanasco The issue on Friday was a buggy release from Let's Encrypt. Our ACME client was working correctly. As I mentioned above, I do see value in using staging in a test environment to get early warning about issues with the client, but this is only effective if the staging environment is reasonably trustworthy. We spent much of Friday trying to figure out whether the issue was Let's Encrypt's fault or cert-manager's.

3 Likes

Please see our full reply, which we've posted to API Announcements for visibility, here:

We're happy to continue the discussion and keep listening to ideas and feedback in this thread, but that should provide a succinct statement of where we stand at this time.

8 Likes