I'm setting up a Docker swarm cluster to run my personal website/homelab, and the organization I've chosen requires a nonstandard certbot setup. I have something working against the Let's Encrypt staging CA but haven't yet tried to issue any real certificates; does this seem like a reasonable architecture, or will something here cause me trouble in the future?
I have one container with certbot installed alongside a lighttpd server with the following cli.ini:
Periodically¹, it is invoked as certbot certonly -vvv -c /home/certbot/cli.ini -n --test-cert -d $DOMAIN to issue single-domain certificates. The deploy hook sends the issued certificate to a cluster-management container which adds the cert as a Docker-managed secret.
Each host's port 80 is served by an haproxy instance that's configured to forward all requests for /.well-known/acme-challenge/* to the certbot container and redirect everything else to HTTPS. In particular, this means that the ACME challenge for any of these domains is also accessible on the others. The challenges are not visible from port 443, as that uses a completely different system.
haproxy also services port 443, using SNI to forward each connection to the appropriate backend container according to the given hostname without terminating TLS. Each backend container is then responsible for one subdomain, and has only that domain's certificate mounted.
¹ Currently, this is every 14 days per domain, but I plan to read the expiration date from the issued certificates instead.
I see some potential issues, but I don't know if they are a problem in practice:
The certbot container doesn't have any persistence: if it gets moved or restarted, it brings up a completely new certbot installation without any of the previously issued certificates or ACME account info.
Once the certificates, especially the private keys, have been put into secrets storage, I'd like to remove them from certbot's container, but the documentation seems to indicate this could cause certbot to malfunction.
I don't currently have a system in place to revoke certificates that have been rotated out.
¹ Currently, this is every 14 days per domain, but I plan to read the expiration date from the issued certificates instead.
I strongly recommend having some persistence and letting Certbot's own logic decide when to renew. Running the renewal check daily, or even twice daily, would be a good idea, since Certbot only requests a new certificate when one is actually due. Certbot bases that decision on the expiry date right now, but it just landed support for ARI (ACME Renewal Information) to get better information about when to renew.
I am not deeply familiar with the options for persisting Certbot's state on Docker Swarm, but you could plausibly store that state in a Docker secret (as it does have your ACME account credentials in it).
Revoking certificates just because they're "rotated out" isn't recommended, but of course you're welcome to do it if you want to.
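On the renewal timing point, here is a rough sketch of an expiry-based check, in case it helps to see it spelled out. It assumes the notAfter value is available as OpenSSL-style text and uses a 30-day window, which is the default Certbot has historically used for 90-day certificates; the function names are made up for illustration, and this is not Certbot's actual code.

```python
import ssl
from datetime import datetime, timezone

# Assumption: a 30-day window, matching Certbot's long-standing default.
RENEW_BEFORE_DAYS = 30


def days_remaining(not_after: str, now: datetime) -> float:
    """Days until expiry for a notAfter string such as 'Jun  1 12:00:00 2025 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    return (expires - now.timestamp()) / 86400


def should_renew(not_after: str, now: datetime,
                 window_days: int = RENEW_BEFORE_DAYS) -> bool:
    """True once fewer than window_days remain before the certificate expires."""
    return days_remaining(not_after, now) < window_days


if __name__ == "__main__":
    now = datetime.now(timezone.utc)  # must be timezone-aware
    print(should_renew("Jun  1 12:00:00 2025 GMT", now))
```

Running a check like this once or twice a day is cheap, because it only triggers an ACME order when the window has actually been reached.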
Thanks. It sounds like I'm headed down a decent path, but have a little bit more work to do.
I don't like the idea of all the private keys sitting at rest in a container that's just one hop away from the public internet.
But maybe I'm being a little too paranoid on this front: anyone who can get to them there can also write into lighttpd's webroot and run their own ACME challenge for my domains. And that requires them to either break out of an HAProxy container to the host OS or exploit an RCE in lighttpd via one of the requests that HAProxy lets through.
The quick summary is that the secrets system Docker provides is securely designed but has severe usability limitations: each secret is immutable and can only be accessed by mounting it into a container at startup. Storing a new secret or restarting a container with a different secret mounted effectively requires root privileges for the cluster, which is why I have a separate hardened container for that part of the job.
That could work reasonably well for storing account info, which I expect doesn't change that often, but it would be a right pain to use for tracking the cert reissuance status.
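To make the immutability problem concrete, this is roughly what rotating one certificate looks like: since a secret can't be updated in place, each new cert becomes a new, versioned secret and the service is updated to swap them. The sketch only builds the docker commands rather than running them; the naming scheme, the cert.pem target, and the function name are all invented for illustration.

```python
from datetime import datetime, timezone


def rotation_commands(service: str, domain: str, old_secret: str,
                      cert_path: str, now: datetime):
    """Build the docker CLI calls that replace old_secret with a fresh secret.

    Secrets are immutable, so the new secret gets a timestamped name
    (a hypothetical naming scheme) and the service is pointed at it.
    """
    new_secret = f"{domain.replace('.', '-')}-cert-{now:%Y%m%d%H%M%S}"
    create = ["docker", "secret", "create", new_secret, cert_path]
    update = [
        "docker", "service", "update",
        "--secret-rm", old_secret,
        # Keep the in-container file name stable across rotations.
        "--secret-add", f"source={new_secret},target=cert.pem",
        service,
    ]
    return new_secret, create, update


if __name__ == "__main__":
    name, create, update = rotation_commands(
        "web_blog", "blog.example.com", "blog-example-com-cert-old",
        "/tmp/fullchain.pem", datetime.now(timezone.utc))
    print(" ".join(create))
    print(" ".join(update))
```

Both commands need manager-level access to the cluster, and the service update restarts the service's tasks, which is the part that makes this awkward for anything that changes often.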
I am pretty sure that if you provide the CSR directly, you can no longer use Certbot's automated renewal.
If that's the preferred option, you'd probably be better off with a different ACME client, such as lego or even acme.sh.
A different client might be more tolerant of deleting the private key too. Certbot wants it around because it keeps the latest set of cert files as symlinks in its live/ folder, pointing to the most recent versions in its archive/ folder. It doesn't like not having the full set of files to re-symlink.
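If you do experiment with deleting keys, a quick check like the following can show which links in a lineage have been left dangling. It is only a sketch: it assumes the usual four file names that Certbot keeps per lineage (cert.pem, chain.pem, fullchain.pem, privkey.pem), and the function name is made up.

```python
from pathlib import Path

# The four files Certbot normally links for each certificate lineage.
LINEAGE_FILES = ("cert.pem", "chain.pem", "fullchain.pem", "privkey.pem")


def broken_lineage_links(live_dir: Path) -> list[str]:
    """Return the lineage file names that are missing or point at a deleted target."""
    broken = []
    for name in LINEAGE_FILES:
        link = live_dir / name
        # Path.exists() follows symlinks, so a dangling link reports False.
        if not link.exists():
            broken.append(name)
    return broken


if __name__ == "__main__":
    # Example path only; adjust to wherever the config directory lives.
    print(broken_lineage_links(Path("/etc/letsencrypt/live/example.com")))
```

An empty list means all four links resolve; anything else is a lineage that Certbot is likely to complain about on the next renewal.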
You should read about the Let's Encrypt rate limits. Needing new certs whenever you cycle the container can catch you out. The limit of 5 certificates per week for an identical set of names is the one you're most likely to hit. See: Rate Limits - Let's Encrypt
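As a guard against that, the issuing side could refuse to place an order when the same name has already been issued too often recently. The sketch below uses a simple sliding 7-day window, which is a simplified model and not how Let's Encrypt enforces the limit server-side; it also assumes the issuance timestamps are kept somewhere that survives container restarts, which the current setup doesn't have yet.

```python
from datetime import datetime, timedelta, timezone

DUPLICATE_LIMIT = 5          # identical certificates allowed per window
WINDOW = timedelta(days=7)   # simplified sliding window (an assumption)


def can_issue(previous_issuances: list[datetime], now: datetime) -> bool:
    """True if fewer than DUPLICATE_LIMIT issuances fall inside the last WINDOW."""
    recent = [t for t in previous_issuances if now - t < WINDOW]
    return len(recent) < DUPLICATE_LIMIT


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    history = [now - timedelta(days=d) for d in (1, 2, 3, 4, 5)]
    print(can_issue(history, now))
```

Testing against the staging CA, as you are doing now, doesn't count toward the production limits, so this only matters once real certificates are being issued.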