I have a thought that I'd like to hear from client authors and large integrators about.
Let's Encrypt currently operates out of two datacenters, with database writes handled via one of them as the primary DC. That means that if we ever have an outage of the primary, it may take some time to ensure the secondary DC is ready for promotion. Depending on the nature of the situation, that could take a while and require human intervention.
One thing we could do is offer what I am going to call "client-managed high-availability": these would be ACME accounts which require targeting a specific DC via specific directory URLs. Individually, each of these accounts would have lower uptime. However, with a pair of accounts, one in DC1 and one in DC2, your overall uptime would be higher.
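To make that concrete, here's a rough sketch (in Go) of what the client side might look like. The directory URLs and account key paths are entirely made up for illustration; the only point is that the failover logic is the same shape as falling back to a different CA, just with a separate account per DC.

```go
// Minimal sketch of "client-managed HA" from the client side, assuming the
// hypothetical per-DC directory URLs and account key paths below (none of
// these exist today).
package main

import (
	"fmt"
	"net/http"
	"time"
)

// endpoint pairs a hypothetical per-DC directory URL with the account
// registered against that DC.
type endpoint struct {
	directoryURL string
	accountKey   string // path to the key for the account created at this DC
}

var endpoints = []endpoint{
	{"https://dc1.acme.example.org/directory", "/etc/acme/dc1-account.key"},
	{"https://dc2.acme.example.org/directory", "/etc/acme/dc2-account.key"},
}

// pickEndpoint returns the first DC whose directory responds, trying the next
// one otherwise; this is the same shape of logic a client already needs for
// falling back to an entirely different CA.
func pickEndpoint() (endpoint, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	for _, ep := range endpoints {
		resp, err := client.Get(ep.directoryURL)
		if err != nil {
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return ep, nil
		}
	}
	return endpoint{}, fmt.Errorf("no directory reachable")
}

func main() {
	ep, err := pickEndpoint()
	if err != nil {
		// Last resort: fail over to a different ACME CA entirely.
		fmt.Println("all DCs down:", err)
		return
	}
	fmt.Println("issuing via", ep.directoryURL, "with account key", ep.accountKey)
}
```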
This is putting more requirements on the client authors, though, for what is potentially a pretty marginal gain. I know some clients already support automatic failover between Let's Encrypt and other ACME CAs. Would being able to fail over explicitly between Let's Encrypt DC1 and DC2 be helpful?
Note that this is just me thinking about further-future planning. This isn't likely to be something we offer in the short term, and we may never offer it at all. It also doesn't help in cases where we deliberately stop issuance (e.g., due to an issuance bug being found), so failover across multiple ACME CAs is preferable to start with.
While I think for backend DR purposes there will almost always be some type of impact or ripple felt by frontend clients, my architect-sense tingles when it comes to where the demarcation points properly belong. To me, this comes down to a kind of distant-load-balancing scenario. While turning up k8s clusters and their associated infrastructure in Azure, I was surrounded by discussion about Availability Zones and ensuring that our nodes in any given cluster were evenly partitioned amongst three zones. That was a fairly static performance concern for us, with our client base being localized to Colorado.

I think that cloud infrastructure has come far enough today to allow for a "global-fault-tolerant" approach without a significant performance hit, especially given that performance impacts are, to a degree, negligible at a distance. What I'm saying is that I don't think "creative" splitting/sharding of traffic is necessary, whether implemented frontend (user-transparent) or backend (user-opaque). Just make sure the hydra has enough heads and redundancy, perhaps?
It's hard for me to picture this being all that helpful for clients that already use multiple CAs. If Let's Encrypt DC1 fails, and I need a cert now (such as for a new service spinning up rather than a renewal that still has some time before the current cert expires), I think failing over to another CA makes a lot more sense than trying DC2 next. Whether the failure is because (1) Let's Encrypt has intentionally stopped issuance entirely, or (2) Let's Encrypt is dealing with some sort of outage in only one datacenter, either way the status of Let's Encrypt is much more "unknown" and riskier than just using another CA. And I don't see why adding more load to Let's Encrypt at that point would really help anything.
(Those are fictional, placeholder URLs.) It would just so happen that both are backed by the same roots/intermediates.
The only really tricky bit here is account management for us: how do we guarantee that accounts stay in the "right" DC? For now we move that data around freely, so it would require some extra tooling or code, but it's not impossible. Or maybe we call it best-effort.
The "why not another CA" question is a good one, and we'd encourage folks to add multiple-CA redundancy first. That's also part of why I'm posting here: to hear whether anyone has compelling arguments for doing this.
The "why not global consensus storage in the cloud" question is also a good one, but we use an active/passive SQL database setup hosted in physical datacenters today, and that's the reality we live in.
They don't need to "stay" anywhere if you add a flag in the account DB which would indicate "DC validity", e.g. "blue", "green" or "both". That way you could replicate everything in your database without a worry.
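For illustration, here's a minimal sketch of that flag idea (names invented, not Boulder's actual schema): each account row carries a validity value, and each deployment checks it before serving the account, so the rows themselves can be replicated everywhere.

```go
// Rough sketch of the "DC validity" flag suggestion, with made-up names: a
// column on the account row says which deployment may serve it, so the data
// can be replicated to both DCs without ambiguity about ownership.
package main

import "fmt"

type DCValidity string

const (
	Blue  DCValidity = "blue"  // only the blue DC may serve this account
	Green DCValidity = "green" // only the green DC may serve this account
	Both  DCValidity = "both"  // either DC may serve it
)

type account struct {
	ID       int64
	Validity DCValidity
}

// servableHere reports whether the local DC (identified by its own color)
// should accept requests for this account, or reject/redirect them.
func servableHere(acct account, localDC DCValidity) bool {
	return acct.Validity == Both || acct.Validity == localDC
}

func main() {
	acct := account{ID: 42, Validity: Blue}
	fmt.Println(servableHere(acct, Blue))  // true
	fmt.Println(servableHere(acct, Green)) // false
}
```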
I guess my initial consultation question would be: to what degree of disaster do you want to be fault-tolerant?
The current blue-greenish proposal seems to me like "distributing the eggs amongst the colored baskets" rather than ensuring that multiple "clones" of each egg exist.
My gut instinct says that your accepted/required constants are going to dictate your fault-tolerance limitations. If they must be baskets and from a particular weaver, that in itself defines the box.
I know it's wishful thinking, but wouldn't it be lovely if there were some way to "blockchain" the whole thing, to decentralize and thus remove the centralized target/point of failure? Beyond the current scope, I realize.
Cool, so it sounds like it'll basically be the same thing (for us clients) as configuring another ACME CA.
And maybe I don't understand the problem on the CA side with the accounts, but from the client side, the accounts are already separated: different CA == different ACME account.
All ACME clients are required to be fault tolerant, so downtime should just be expected and handled. If downtime is a whole day (with current certificate lifetimes), that should be OK for clients; once you get a few days in, it could start getting problematic. The solution there is a well-practised disaster recovery plan that's been recently tested, so you know how long it takes to carry out.
Personally, if I were designing the same thing, I'd have the top-level API be a very thin reverse proxy/load balancer that can be quickly repointed at an alternative, and I suspect this is what you already have. I currently do this for some things just using Cloudflare-proxied DNS and Cloudflare Workers (which handle some things themselves and proxy work back to API servers if required); that was an API doing 350M requests per month for $49/month. There are lots of other ways to do the same thing.
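As a rough illustration of the thin-proxy idea (with made-up internal hostnames, not anything Let's Encrypt actually runs), the whole failover can reduce to swapping a single upstream pointer:

```go
// Sketch of a "very thin reverse proxy at the top level": repointing the API
// becomes one atomic swap rather than a database failover. Hostnames are
// placeholders.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

var upstream atomic.Value // holds *url.URL; swap it to repoint the whole API

func main() {
	primary, _ := url.Parse("https://dc1.internal.example.org")
	upstream.Store(primary)

	proxy := &httputil.ReverseProxy{
		// Rewrite each request to whatever upstream is currently selected.
		Director: func(req *http.Request) {
			target := upstream.Load().(*url.URL)
			req.URL.Scheme = target.Scheme
			req.URL.Host = target.Host
			req.Host = target.Host
		},
	}

	// An operator (or an automated health check) could later do:
	//   secondary, _ := url.Parse("https://dc2.internal.example.org")
	//   upstream.Store(secondary)

	log.Fatal(http.ListenAndServe(":8443", proxy))
}
```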
While I admire the idea of close-to-100% uptime, keep in mind that one of the top alternative ACME CAs merrily runs a service that fails to complete authorizations or finalize orders in a reasonable amount of time on a very regular basis. Not that we shouldn't all strive to be better!
I think this is mainly for obtaining new certificates, not for renewals. For renewals it's no problem even with way shorter lifetimes than now, say 7 days (I'd then start the renewal after roughly half that time). But if you automatically fire up a new VM and want to quickly obtain a certificate for it, or add a new hostname, etc., you don't want to wait a day until the CA is back up.
In any case, I agree with @mholt that for clients this just looks like another CA, so once you have a generic CA fallback there's nothing special to implement for this scheme.
It is a very interesting idea, but I would much rather have my client fail over to another CA than to another LE data center, because I think that would be less likely to suffer from a simultaneous outage. Unfortunately, none of the ACME clients I have used had support for that. I did a survey of client support over at Mozilla's dev-security-policy list some days ago, and I don't think the result was great. Maybe LE offering two ACME endpoints would help convince some client authors to add support for it?
@jesperkristensen Caddy and Certify The Web both support CA fallback to different degrees, but it's most obviously a feature when you have a client that supports multiple CAs and multiple accounts simultaneously. There is also an advantage if your client is a continuously (or very regularly) running service, because then you can dynamically respond to changes quite quickly, compared to daily cron jobs etc.
Stuff like this reminds me of a situation around 2005-2008. IIRC, a truck crash on an interstate somehow caused a fire that took out the power or data going into a Rackspace(?) facility that an extremely large number of VC backed startups were colocated in or had managed servers in. A lot of popular services were abruptly offline for a few days.
Anyways, I don't like the idea of retrying multiple API endpoints of the same provider. It overcomplicates the logic in clients (and potentially account setup, due to the server-side concerns mentioned above). I would rather just go to the next provider organization than risk another failure with the same org.
The two ways I imagine a client implementing this:

1. A CA can have failover API URL(s). Lots of logic is needed to handle using the right one.
2. A single account exists across multiple servers. Work is needed to keep them in sync (e.g. a rollover on one system affects the other).
Looking at how Certbot and a few other popular clients are designed, neither would be a fun PR to implement.
Perhaps there are other ways to leverage this concept. If the ACME server were able to return a response that basically says "this directory is having issues, please use the following backup directory: ___" and provide it in the payload, it might not be that difficult - and clients could either handle that natively or just log it for subscribers to consider.
That might not even need to be in the payload: I don't believe the URLs in the directory object need to be on the same server, so they could point to another server. I am assuming the ACME server can still return responses during an outage, which seems reasonable because DNS could just be switched to a minimal system that only returns error codes and the backup directory mapping.
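To illustrate the idea (and to be clear, RFC 8555 defines no such field; the "backupDirectory" member below is purely hypothetical), a degraded directory response could carry the hint in its "meta" object, and a client could pick it up from the document it already fetches:

```go
// Sketch of the imagined "backup directory" hint. The backupDirectory field
// is NOT part of ACME today; URLs are placeholders.
package main

import (
	"encoding/json"
	"fmt"
)

type directory struct {
	NewNonce   string `json:"newNonce"`
	NewAccount string `json:"newAccount"`
	NewOrder   string `json:"newOrder"`
	Meta       struct {
		TermsOfService  string `json:"termsOfService"`
		BackupDirectory string `json:"backupDirectory"` // hypothetical extension
	} `json:"meta"`
}

func main() {
	// What a degraded primary might return in this imagined scheme.
	body := []byte(`{
	  "newNonce":   "https://dc1.acme.example.org/acme/new-nonce",
	  "newAccount": "https://dc1.acme.example.org/acme/new-acct",
	  "newOrder":   "https://dc1.acme.example.org/acme/new-order",
	  "meta": {
	    "termsOfService":  "https://acme.example.org/terms",
	    "backupDirectory": "https://dc2.acme.example.org/directory"
	  }
	}`)

	var d directory
	if err := json.Unmarshal(body, &d); err != nil {
		panic(err)
	}
	if d.Meta.BackupDirectory != "" {
		fmt.Println("server suggests falling back to", d.Meta.BackupDirectory)
	}
}
```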
Something I want to note, though, is my perspective from having been an enterprise services subscriber in the past. While I love and trust ISRG/LetsEncrypt... I don't really have any interest in investing in implementing/supporting an HA system with a vendor that does not offer an SLA promise or guarantee. If I am spending my time on an HA system, it's going to be with a vendor that gives me some guarantee.
Perhaps LE could offer a paid high-availability directory endpoint for large customers. That would limit the concerns around syncing accounts on the server side (only a small subset would need to be in sync), and your team could test out different ways to handle the internal traffic routing, eventually rolling out those changes to the general public.