Stability and Purpose of Staging

Over the last few days our staging environment has been the subject of some discussion. Most of that discussion has centered on what level of service -- and in particular, what level of stability -- it should provide. To arrive at a cohesive answer, we need to examine who the users of Staging are, and what their needs and desires are.

Let’s Encrypt Staff

For Let’s Encrypt staff, Staging provides an important piece of our pre-release validation process. It sits between “Dev” (our internal-only environments where code and configuration changes can be quickly and easily deployed, tested, and thrown away) and “Prod” (our real consumer-facing environment) in our deployment process. All changes that we intend to deploy to Prod must first go through tests, then Dev, and then Staging.

For us, Staging provides two features that are very valuable for maintaining a high-quality service:

  • It isn’t Prod, so breakages don’t affect hundreds of millions of certificates; and
  • It does get significant traffic, so we can examine behavior and performance under realistic conditions.

ACME Client Devs and Large Integrators

For folks who are working on their own ACME clients, or who are building ACME integrations for large products (which often involves developing a custom client), Staging can be an important part of the development and testing process. While it shouldn’t be part of their automated continuous integration tests (nothing that makes network calls should be!), it can still be very valuable as a manual test bed and proving ground.

For these folks, we’d say that Staging provides three valuable features:

  • Realism, in that it reflects the behavior of Prod much more closely than a local Pebble instance does, especially when it comes to exercising actual validation behavior;
  • Unrealism, in that it issues untrusted certificates and has much higher rate limits for issuing them; and
  • Sneak previews of upcoming features so new client features can be developed against them.

ACME Subscribers

For end-users, Staging is mostly invisible. But many ACME clients do offer a “dry run” mode which conducts issuance against Staging, and some clients instruct users to use the dry run mode when first getting set up.

For these users, we think Staging provides basically one feature:

  • A successful run against Staging is a strong indicator that runs against Prod will also be successful.
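As a rough illustration of what a “dry run” mode amounts to on the client side, here is a minimal sketch. The two directory URLs are the ones Let’s Encrypt publishes for its Staging and Prod ACME v2 endpoints; the function name and the flag are hypothetical and not taken from any particular client.

```python
# Public ACME v2 directory URLs for Let's Encrypt.
STAGING_DIRECTORY = "https://acme-staging-v02.api.letsencrypt.org/directory"
PRODUCTION_DIRECTORY = "https://acme-v02.api.letsencrypt.org/directory"


def directory_url(dry_run: bool) -> str:
    """Pick the ACME directory a client should talk to.

    Hypothetical helper: a dry run goes to Staging, which issues
    untrusted certificates; a real run goes to Prod.
    """
    return STAGING_DIRECTORY if dry_run else PRODUCTION_DIRECTORY
```

Everything else in the client (account creation, orders, challenges) can stay identical between the two modes, which is what makes a successful dry run a useful signal.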

The Balancing Act

Obviously, many of these benefits are in tension with each other. The more Staging is used as a test bed for the benefit of Let’s Encrypt staff, the less it Just Works for subscribers. The less realistic it is, the less valuable for client devs it is, so it gets less traffic and becomes a less-useful test bed. An unsuccessful run against Staging does not necessarily imply an unsuccessful run against Prod. Given the amount of discussion lately, it seems appropriate to re-examine the balance that has been struck thus far.

Today, we aim for Staging to be useful for everyone without making any promises about its reliability for anyone. It gets new features first, so client devs and end users can see them, but that also means that it gets new breakages first. We do get alerts and pages when it is misbehaving so we can fix it quickly. We don’t always post change announcements ahead of time because sometimes we just need to test something.

The first suggestion we’ve seen from a couple folks is that we should treat Staging a bit more like Prod, particularly when it comes to communicating about it. Keep it as stable as possible, post announcements days or weeks ahead of any major change, and maybe even have a separate community forum section for those Staging announcements.

This is a good suggestion, and one that we’re working on. Frankly, it matches how we try to view Staging already. We do always try to keep it as stable as possible. We don’t intend to create a new forum category, but we will be continuing to try to post Staging announcements as early as we can.

The other suggestion that has been surfaced multiple times is that the current role of Staging should be split into two separate services with different stability profiles. One (let’s call it “PreProd”) would be an unstable test bed for upcoming Let’s Encrypt changes, while the other (let’s call it “SideProd”) would be just as stable as Prod but issue from untrusted roots for the benefit of client developers and end users.

The issue with this suggestion is that it provides benefit for client developers and subscribers while both removing benefit from and adding cost to the Let’s Encrypt team. Running two separate “staging” environments would cost resources that we feel would be better spent elsewhere. And the default assumption is that the vast majority of clients would direct their non-Prod traffic at SideProd. That would remove all utility from PreProd and, because changes would then reach Prod without having seen realistic traffic, decrease the stability of Prod overall. It is worth noting that we cannot simply “tee” traffic from SideProd to PreProd, as the ACME protocol (and nonces in particular) is specifically designed to prevent exactly that.
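To make the “tee” point concrete, here is a toy model of why a mirrored request fails. It assumes only what ACME (RFC 8555) specifies: every signed request carries a single-use nonce issued by the server it is addressed to, and names its target URL inside the signed header. The class, the example.* hostnames, and the order of the checks are all hypothetical, and this is not how Boulder is implemented.

```python
import secrets


class ToyAcmeServer:
    """Toy stand-in for an ACME server's nonce and URL checks."""

    def __init__(self, base_url):
        self.base_url = base_url
        self._valid_nonces = set()

    def new_nonce(self):
        # Each server hands out its own nonces and remembers them.
        nonce = secrets.token_urlsafe(16)
        self._valid_nonces.add(nonce)
        return nonce

    def handle(self, request):
        # The signed header names the URL the request was made for,
        # so a copy delivered to a different server does not match.
        if not request["url"].startswith(self.base_url):
            return "unauthorized"
        # A nonce is accepted once, and only by the server that issued it.
        if request["nonce"] not in self._valid_nonces:
            return "badNonce"
        self._valid_nonces.discard(request["nonce"])
        return "ok"


def tee_demo():
    side = ToyAcmeServer("https://sideprod.example/")
    pre = ToyAcmeServer("https://preprod.example/")
    request = {
        "url": "https://sideprod.example/new-order",
        "nonce": side.new_nonce(),
    }
    original = side.handle(request)   # accepted by the intended server
    mirrored = pre.handle(request)    # the "teed" copy is rejected
    replayed = side.handle(request)   # the nonce cannot be reused either
    return original, mirrored, replayed
```

Because the nonce and the URL are both covered by the client’s signature, a proxy in the middle cannot rewrite them for the second environment without the client’s private key.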

For now, Staging is going to remain largely as it is: an environment in which we can stage upcoming Production changes, and get some amount of real-ish traffic to confirm whether or not those changes break anything. It has no SLAs, and no performance, uptime, or data integrity guarantees. It exists for the explicit purpose of breaking things, so that we don't accidentally break things in Prod. Staging does and will continue to have more downtime and errors than Prod, and we hope that client developers will continue to use these traits to improve their software’s resilience.
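One way a client might build in that resilience is to retry transient failures with exponential backoff and jitter. The sketch below is only an illustration: `TransientError`, `call_with_backoff`, and the default delays are hypothetical, and which failures count as transient is a decision each client has to make for itself.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, HTTP 5xx)."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller see the failure
            # The ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so the helper can be tested without real waiting. Exercising a wrapper like this against Staging is one way to find out how a client behaves when the server is briefly unavailable.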

We will continue to read feedback on this topic and consider the set of tradeoffs we’re making here. In the coming days, we will update the website’s description of Staging to include more explicit language to this effect. Thank you, everyone, for the great and civil discussion!
