What is the best strategy to use Let's Encrypt with multiple servers under the same domain name if the servers are under DNS round-robin? I.e. there are multiple IP addresses associated with the same domain name.
I haven't tried this yet, but I'm trying to plan the transition from one server to many and make it as smooth as possible, i.e. with minimal or no downtime. My setup is based on Nginx.
Googling this question brings up some answers including some on this forum, but I still can't grasp the full picture.
My current understanding is: I can set up my new servers, then copy the nginx config files and the certbot files (under /etc/letsencrypt) referenced by the nginx directives (ssl_certificate, etc), from the original server to the new ones.
This will break autorenewal on the new servers, which is kind of OK as long as I can fix it somehow by running certbot again.
However, the acme-challenge thing won't work on the new servers, or at best it will be random due to the round-robin setup in place. Even if I copy the entire /etc/letsencrypt and the cron job, acme-challenge will fail.
So my question is, is there a better way of handling this? Or is it OK to let acme-challenge fail and retry until the right IP address is selected? Seems a bit ugly and unreliable to me.
Or is manual copying of everything on each renewal the only way?
I've used this approach in the past and it's pretty badass: Stateless Mode · acmesh-official/acme.sh Wiki · GitHub. You don't have to use acme.sh, the same approach will work with any client, provided you can get your account thumbprint.
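The gist is that every server can answer HTTP-01 challenges purely from configuration, with an nginx location along these lines (YOUR_ACCOUNT_THUMBPRINT is a placeholder for your own account thumbprint):

```nginx
# Answer HTTP-01 challenges statelessly: the expected response is just
# "<token>.<account thumbprint>", so no challenge files are needed on disk.
# Replace YOUR_ACCOUNT_THUMBPRINT with the thumbprint of your ACME account.
location ~ "^/\.well-known/acme-challenge/([-_a-zA-Z0-9]+)$" {
    default_type text/plain;
    return 200 "$1.YOUR_ACCOUNT_THUMBPRINT";
}
```

Deploy that on every server behind the round-robin record and it doesn't matter which IP the validation request lands on.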
There are a lot of options, but there are a lot of ways of running multiple systems so it's hard to have one-size-fits-all. Here are some that I think people use:
1. Have your load balancer handle TLS termination, getting and using the public-facing certificate.
2. Have one system handle all challenges via HTTP-01, then have a deploy hook copy the certificate to the other systems. This requires either having the load balancer always direct requests for .well-known/acme-challenge to that server, or having all other servers HTTP-redirect to the server that runs challenges, so that challenge requests always reach the system running the ACME client (see the sketch after this list).
3. Similar to 2, but use the DNS-01 challenge. One system automatically updates the DNS server, gets the cert, and then copies it to the other systems that need it via a deploy hook.
4. Have each server get its own separate certificate, using the DNS-01 challenge. This could run into some rate limit issues if you scale a lot, but for a couple of servers it can work fine.
And there are probably a lot of variations on the above you can try, especially around if you want to "push" configuration to each of your servers or have them each "pull" from one central place, and which one you prefer may depend on how you're handling keeping the rest of the configuration for your servers in sync with each other.
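For option 2 without a load balancer, a minimal nginx sketch of the redirect variant might look like this, assuming the ACME client runs on a host reachable under its own hostname (acme.example.com is a placeholder):

```nginx
# On every server that does NOT run the ACME client: send challenge
# requests to the one host that does. Let's Encrypt follows HTTP
# redirects during HTTP-01 validation.
location /.well-known/acme-challenge/ {
    return 302 http://acme.example.com$request_uri;
}
```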
In practice, this means that each of your webservers would need to be using the same ACME (Let's Encrypt) account. This can be achieved, e.g. with Certbot, by copying the /etc/letsencrypt/accounts/ directory from your first server to your other servers.
Then, any of the servers will be able to issue a certificate without actually having to deploy any challenge files.
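For example, something like this, where web2 is a placeholder for one of your other servers and the paths assume a default Certbot layout:

```bash
# Copy the ACME account (keys and registration metadata) to another server
# so it can issue certificates against the same account.
rsync -a /etc/letsencrypt/accounts/ web2:/etc/letsencrypt/accounts/
```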
Watch out for rate limits though. This works okay if you have 3 servers, not so much if you have 10.
So given the rate limits, and that Let's Encrypt is a free service after all, I think the best approach would be to have one server run certbot and copy everything to the other servers on each renewal.
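Something like this deploy hook is roughly what I have in mind (web2 and web3 are placeholders for my other servers):

```bash
#!/bin/sh
# Sketch of /etc/letsencrypt/renewal-hooks/deploy/push-certs.sh:
# after each successful renewal, copy the certificate material to the
# other servers and reload nginx there.
for host in web2 web3; do
    rsync -a /etc/letsencrypt/live /etc/letsencrypt/archive "$host":/etc/letsencrypt/ \
        && ssh "$host" systemctl reload nginx
done
```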
I wouldn't feel guilty about issuing duplicative certificates for a handful of servers. Copying a single certificate over is a perfectly valid approach as well. There's just a few more moving parts to worry about in that case.
Since you have DNS round-robin, very likely you do not have a load balancer in front of the server farm. It is possible to configure the nginx farm so that all servers are able to fulfill the HTTP-01 challenges. There is a configuration snippet that should be deployed on each web server; for a description see: https://github.com/bruncsak/ght-acme.sh#setup-challenge-response. That solution is not restricted to the ACME client I am maintaining. Somewhere (not necessarily on any of the web servers) you run the ACME client to get the certificate, then you deploy the certificate on all instances of the web servers.
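One generic way to achieve that (not necessarily the exact snippet from the README above) is to have every nginx instance forward challenge requests to the host running the ACME client; acme-runner.internal is a placeholder name:

```nginx
# On every web server in the farm: proxy HTTP-01 challenge requests to the
# machine where the ACME client runs and serves the challenge responses.
location /.well-known/acme-challenge/ {
    proxy_pass http://acme-runner.internal;
}
```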
Funny thing is, I didn't know where to get my account thumbprint from, so I built a script based on what I found on the Internet, but it turned out to be wrong. However, running certbot with a wrong ACME challenge response gives away the correct one in the error message! I copied it and voilà, everything worked.
I think certbot could have an option to print the account thumbprint for this kind of use case.
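In the meantime, this is roughly how the thumbprint can be computed from Certbot's account key (the account directory name is a placeholder, and this assumes an RSA account key plus jq and openssl being available):

```bash
# RFC 7638 JWK thumbprint: SHA-256 over the canonical {"e","kty","n"} JSON,
# base64url-encoded without padding. ACCOUNT_ID is a placeholder.
KEY=/etc/letsencrypt/accounts/acme-v02.api.letsencrypt.org/directory/ACCOUNT_ID/private_key.json

jq -cj '{e: .e, kty: .kty, n: .n}' "$KEY" \
  | openssl dgst -sha256 -binary \
  | openssl base64 -A \
  | tr '+/' '-_' | tr -d '='
```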
The best is the one which works best in your environment. Since, as @bruncsak observed, DNS round robin almost certainly precludes a single point for TLS termination, the second best choice is to use your configuration and/or content management system to distribute the certificates to the web servers. This may be something as sophisticated as Ansible playbooks, or as simple as shared storage containing the files*.
How best to request and update the certificates will depend on your configuration and/or content management system infrastructure, but an ACME client on a suitable host and DNS-01 challenge should make this fairly simple and robust.
*[In this case be careful to separate configuration from content at each layer.]
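As a purely illustrative sketch (assuming the Cloudflare DNS plugin and a hypothetical push-certs.sh distribution script; substitute whatever matches your DNS provider and configuration management):

```bash
# Issue/renew via DNS-01 on a management host; the deploy hook then hands the
# certificate to your configuration/content management layer for distribution.
certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /root/.secrets/cloudflare.ini \
  -d example.com -d www.example.com \
  --deploy-hook /usr/local/bin/push-certs.sh
```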