Installing the same Let's Encrypt account across an entire fleet (2000+ systems)

Hi Folks,

I want to use LE to generate x509 certs for a semi-large fleet of systems. Given the rate-limit restrictions and the form we submitted to have them lifted, I need to tie the account's private key to all systems (obviously not ideal to replicate the private key, but this is the only option I have). I plan to do this with a config management solution, placing any sensitive data, such as the private key, in a vault so it can be distributed to all systems that require it. Thus, my ask to the community is as follows:
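
For what it's worth, here is a minimal sketch of the distribution step I have in mind, assuming HashiCorp Vault and the hvac Python client (our actual config management tooling may differ, and the secret path, token, and field names are placeholders):

```python
# Hypothetical sketch: each node pulls the shared ACME account key material
# from a vault. Assumes HashiCorp Vault's KV v2 engine and the "hvac" client;
# the secret path, token, and field names are made up for illustration.
import os
import hvac

client = hvac.Client(url="https://vault.example.internal:8200",
                     token="s.xxxxxxxx")                 # placeholder token
secret = client.secrets.kv.v2.read_secret_version(path="letsencrypt/account")
account_files = secret["data"]["data"]   # e.g. {"private_key.json": "...", ...}

os.makedirs("/tmp/le-account", mode=0o700, exist_ok=True)
for name, contents in account_files.items():
    # Staged here; the real target is the Certbot account directory
    # discussed in question 1 below.
    path = os.path.join("/tmp/le-account", name)
    with open(path, "w") as fh:
        fh.write(contents)
    os.chmod(path, 0o600)
```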

  1. My primary system already holds the main account, and the relevant information lives under several paths within “/etc/letsencrypt/”: accounts, archive, csr, keys, live, renewal, renewal-hooks. Under “accounts”, I have both “acme-staging-v02.api.letsencrypt.org” and “acme-v02.api.letsencrypt.org”. Within “acme-v02.api.letsencrypt.org” there is a unique directory name, which contains the following files: meta.json, private_key.json, regr.json. The full path containing these three files is “/etc/letsencrypt/accounts/acme-v02.api.letsencrypt.org/directory/some random number”. With that said, do I simply deploy that directory and its three files (meta.json, private_key.json, regr.json) to all systems in order to use the account across the entire fleet?

  2. Considering I do not need to register a new account, do I have to add an additional parameter to my certbot command, or is certbot smart enough to check whether an account is already registered?

I’d appreciate any feedback the community has on this topic.

Cheers,
Col.

The relevant Certbot payload is as you guessed - /etc/letsencrypt/accounts/acme-v02.api.letsencrypt.org/directory/some random number

That path is computed from /etc/letsencrypt/accounts/{ACME-Server}/{Server Directory Path}/{Account ID}. The account key is JWK encoded in private_key.json.
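
If you want to sanity-check what is actually in there, it is plain JSON; a quick look with Python, assuming Certbot's default layout:

```python
# Peek at the account key: private_key.json holds a plain JSON Web Key (JWK).
# The path assumes Certbot's default layout; the trailing directory name is
# whatever account hash Certbot generated on the primary system.
import json
from pathlib import Path

acct_root = Path("/etc/letsencrypt/accounts/acme-v02.api.letsencrypt.org/directory")
for account in acct_root.iterdir():              # usually a single account hash
    jwk = json.loads((account / "private_key.json").read_text())
    print(account.name, jwk.get("kty"), sorted(jwk.keys()))
```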

You should be ready to go once you copy that info over; however, you are likely to run into rate-limit issues if 2000+ systems are all trying to use the account key.
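
To answer your second question: once the account directory is in place on a node, a normal issuance run should simply reuse it; as far as I know, Certbot checks /etc/letsencrypt/accounts for the configured ACME server and only registers a new account if none is found. A minimal invocation sketch (domain and webroot are placeholders):

```python
# Sketch of issuance on a fleet node after the account directory has been
# copied into place. No registration flag should be needed; Certbot reuses
# the existing account for acme-v02.api.letsencrypt.org on its own.
import subprocess

subprocess.run(
    ["certbot", "certonly", "--webroot",
     "-w", "/var/www/html",                 # placeholder webroot
     "-d", "node01.example.com",            # placeholder domain
     "--non-interactive"],
    check=True,
)
```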

One popular option in your scenario is to route all /.well-known traffic in your network to a single machine, and rsync any new certificates off it.
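
As a rough illustration of that idea (HTTP-01 validation follows redirects, so every node can bounce challenge requests to one central box; the hostname is a placeholder, and in practice you would do this in your existing web server config rather than a standalone script):

```python
# Minimal standalone sketch of the "send /.well-known to one machine" pattern,
# using only the Python standard library. Port 80 requires root privileges.
from http.server import BaseHTTPRequestHandler, HTTPServer

CENTRAL_HOST = "acme.example.internal"   # hypothetical central validation host

class ChallengeRedirect(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith("/.well-known/acme-challenge/"):
            # Let's Encrypt follows HTTP redirects during HTTP-01 validation,
            # so the central box ends up serving the challenge response.
            self.send_response(302)
            self.send_header("Location", f"http://{CENTRAL_HOST}{self.path}")
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 80), ChallengeRedirect).serve_forever()
```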

I had a similar need a few years ago and built an integrated Certificate Manager, ACME client and Dynamic SSL plugin for OpenResty – https://github.com/aptise/peter_sslers. The certificate manager/client runs on a single node and is driven by a web interface and/or API access. The OpenResty plugin uses a tiered caching system to fail over from Nginx Worker > Nginx Master > Redis > Web API when loading certificates.

Thanks for the feedback. We have worked with LE to increase our account limits, so I’m not concerned about hitting the rate limits. One quick question about the “rsyncing” you mentioned: how are you protecting the private keys for the x509 LE certificates you are installing on those systems? Are you generating them on the single machine and replicating the entire “cert bundle” to the target systems?

We just rely on OS-level permissions, like Certbot does. If you want to go overboard, there is a project from Netflix called "Lemur" that goes into this sort of certificate provisioning.

Yes. The ACME Client and core info live on a single machine, which is subject to its own security concerns; the AccountKeys never leave this machine either. However, the SignedCertificate+PrivateKey+Chain payloads are syndicated to the client machines. If a deployed machine is compromised, the fleet is considered compromised - but the AccountKey is not, so the certs can be revoked.

You can get a lot done with a centralized Certbot and rsync with a "push" model.
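
A hedged sketch of what the push side can look like as a Certbot deploy hook (the fleet host list is a placeholder; Certbot exposes the renewed lineage path to deploy hooks via the RENEWED_LINEAGE environment variable):

```python
#!/usr/bin/env python3
# Sketch of a deploy hook for the centralized "push" model: after each renewal,
# rsync the renewed lineage out to the fleet. Install it with --deploy-hook or
# drop it into /etc/letsencrypt/renewal-hooks/deploy/.
import os
import subprocess

lineage = os.environ["RENEWED_LINEAGE"]   # e.g. /etc/letsencrypt/live/example.com
FLEET = ["web01.example.internal", "web02.example.internal"]   # placeholders

for host in FLEET:
    subprocess.run(
        ["rsync", "-aL", "--chmod=D0700,F0600",
         f"{lineage}/", f"root@{host}:{lineage}/"],
        check=True,
    )
    # Then reload whatever terminates TLS on the node, e.g.:
    # subprocess.run(["ssh", f"root@{host}", "systemctl reload nginx"], check=True)
```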

At one point I needed to build a custom solution - we needed API-driven certificate generation and access - so the clients in our fleet just ask the central repository for the active certificate in a "pull" model. We also needed to support a scalable number of certificates, which meant replacing Certbot's approach of "a new PrivateKey for each Certificate" with re-using the same PrivateKey for all AcmeOrders made in a given day or week. With this approach, a secondary component of security is being prepared for a compromise: having a fleet that will quickly time out a compromised PrivateKey and cycle in the new one.
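
Purely as a hypothetical sketch of that "pull" side (the URL, JSON shape, and file paths below are assumptions for illustration, not the project's actual API):

```python
# Hypothetical pull client: ask the central service for the active bundle and
# swap it in only when the certificate actually changes.
import hashlib
import json
import urllib.request
from pathlib import Path

CENTRAL = "https://certs.example.internal/api/active/node01.example.com"  # made up

with urllib.request.urlopen(CENTRAL) as resp:
    bundle = json.load(resp)        # assumed keys: "fullchain_pem", "privkey_pem"

fullchain = Path("/etc/ssl/node01/fullchain.pem")
new_digest = hashlib.sha256(bundle["fullchain_pem"].encode()).hexdigest()
old_digest = (hashlib.sha256(fullchain.read_bytes()).hexdigest()
              if fullchain.exists() else None)

if new_digest != old_digest:
    fullchain.parent.mkdir(parents=True, exist_ok=True)
    fullchain.write_text(bundle["fullchain_pem"])
    privkey = fullchain.with_name("privkey.pem")
    privkey.write_text(bundle["privkey_pem"])
    privkey.chmod(0o600)
    # reload the local web server here so the new certificate is picked up
```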

If we know of a leak, the mitigation plan is:

  • All Services offline on Maintenance Mode
  • Revoke old Certificate
  • Clear Redis Cache
  • Generate new Certificate
  • Sites online after X minutes.

X is 7 minutes - the sum of a 60-second and a 6-minute tiered cache. The entire fleet will automatically update their caches when the records go stale, so we don't have to jump in and restart machines. There are a handful of things we could do to drop this time lower, but 7 minutes of downtime for a catastrophic scenario is a decent starting point for our needs.
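
Roughly automated, that checklist looks something like the sketch below (expressed with stock Certbot and redis-py calls rather than our web UI/API; every host, path, and key name is a placeholder):

```python
# Rough automation of the leak-mitigation checklist. The maintenance-mode and
# reload steps are site-specific and only noted as comments.
import subprocess
import redis    # third-party client: pip install redis

DOMAIN = "node01.example.com"                            # placeholder
CERT_PATH = f"/etc/letsencrypt/live/{DOMAIN}/cert.pem"   # placeholder

# 1. Take services offline / enable maintenance mode (site-specific)

# 2. Revoke the old certificate using the account key
subprocess.run(["certbot", "revoke", "--cert-path", CERT_PATH,
                "--reason", "keycompromise", "--non-interactive"], check=True)

# 3. Clear the Redis cache tier so nothing keeps serving the old certificate
r = redis.Redis(host="redis.example.internal")           # placeholder host
r.delete(f"cert:{DOMAIN}")                               # placeholder key scheme

# 4. Generate a new certificate (and a new private key)
subprocess.run(["certbot", "certonly", "--webroot", "-w", "/var/www/html",
                "-d", DOMAIN, "--non-interactive"], check=True)

# 5. Sites come back once the 60-second and 6-minute cache tiers expire (~7 min)
```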

Thanks. I’m familiar with Lemur and I don’t believe it meets my use cases. I’ll check your solution out as well, but one of my hard requirements is not duplicating private keys.
