Disaster Recovery / High Availability

enemystand · March 20, 2018, 8:48am

Hello again!

I wanted to ask if Boulder supports DR/HA innately. If not, what are one's options when setting up a CA solution with Boulder, to make sure there is a failsafe in place when something goes wrong?

Thank you again.

schoen · March 20, 2018, 7:29pm

Hi @enemystand,

The Let's Encrypt CPS forbids use of Let's Encrypt certificates

For any application requiring fail-safe performance such as a) the operation of nuclear power facilities b) air traffic control systems c) aircraft navigation systems d) weapons control systems e) any other system in which failure could lead to injury, death, or environmental damage.

That obviously doesn't control what technology goes into Boulder, but I think it's a sign that Let's Encrypt hasn't had certain kinds of formal availability criteria in mind as requirements (while obviously trying very hard to ensure the availability of the service in practice).

One idea is that you can have several Boulder instances (conceivably with different intermediate certificates in order to allow a revocation of one without affecting the others) and then use them in a failover setup so that you can redirect your ACME API from one to another by changing DNS records. These instances can, if necessary, be located in different physical locations and have different kinds of network uplinks.
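To make the failover idea a bit more concrete, here is a minimal sketch in Python of the selection step: probe each instance's ACME directory URL in priority order and pick the first one that answers. The hostnames are hypothetical placeholders, and the sketch only decides which instance should be active. Actually repointing the DNS record depends on your DNS provider and is left out.

```python
from typing import Callable, Optional, Sequence
import urllib.request


def directory_is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the ACME directory URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; ValueError covers malformed URLs.
        return False


def pick_active_instance(
    directory_urls: Sequence[str],
    is_healthy: Callable[[str], bool] = directory_is_healthy,
) -> Optional[str]:
    """Return the first healthy directory URL in priority order, or None."""
    for url in directory_urls:
        if is_healthy(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical instance names, listed in order of preference.
    candidates = [
        "https://boulder-a.example.internal/directory",
        "https://boulder-b.example.internal/directory",
    ]
    print("active instance:", pick_active_instance(candidates))
```

Keep in mind that a DNS-based switch only takes effect as quickly as the record's TTL allows, so a short TTL on the ACME API hostname is probably worth considering.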

One thing to bear in mind about outages is that OCSP signing can only be done (1) by an instance that has the appropriate intermediate private key, and (2) by an instance that has access to a database to determine whether the subject certificate is still valid. So, in order to avoid OCSP-related outages if a CA becomes unavailable, you may want to use OCSP stapling and must-staple so that a CA outage or disappearance from the network won't require services that were certified by that CA to go offline (at least during the relevant OCSP validity period).

At the same time, we've heard on this forum that a lot of TLS server OCSP stapling implementations are kind of bad because they often fail to keep serving a cached, still-valid response when a new response is unavailable, which means that OCSP stapling can actually increase the chance of an outage on a TLS service with such a poor implementation. On the bright side, apparently all of the main web servers have some awareness of this and are supposedly eventually going to improve it. But if you take the OCSP stapling suggestion, you should probably also experiment with simulating network outages (and rebooting servers during the simulated outage) to determine how the services will actually perform in these cases.
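To illustrate the caching behavior that a well-behaved stapling implementation would need, here is a rough Python sketch: if a refresh fails, keep serving the previous response until its own validity period ends. This is not how any particular web server does it. `StapledResponse`, `StapleCache`, and the `fetch` callback are all hypothetical names, and real code would also have to parse and verify the OCSP response.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass
class StapledResponse:
    der_bytes: bytes        # raw OCSP response as it would be stapled
    next_update: datetime   # end of the response's validity period


class StapleCache:
    """Keeps the last good OCSP response while the responder is unreachable."""

    def __init__(self, fetch: Callable[[], StapledResponse]):
        self._fetch = fetch  # hypothetical callback that queries the responder
        self._cached: Optional[StapledResponse] = None

    def refresh(self) -> None:
        try:
            fresh = self._fetch()
        except Exception:
            # Responder is down: do NOT discard the old response.
            return
        self._cached = fresh

    def current(self, now: Optional[datetime] = None) -> Optional[bytes]:
        """Return the response to staple, or None if nothing valid is cached."""
        now = now or datetime.now(timezone.utc)
        if self._cached is not None and now < self._cached.next_update:
            return self._cached.der_bytes
        return None
```

Note that a cache held only in memory would not survive the reboot scenario mentioned above, so persisting the last good response to disk is something to check for when you test.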

In general, OCSP is potentially a very important issue for PKI availability because it can affect certificates even after issuance, unlike most other kinds of outages, which only affect new issuance events (and which can ideally be handled by failover to a different CA instance).

system · April 19, 2018, 7:30pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.