Let's Encrypt behind ACTIVE/BACKUP keepalived (VRRP)

Hi,

I'll start by giving a bit of context about my setup. I have a domain called example.com with a DNS A record pointing at 100.100.100.100. That 100.100.100.100 address is destination-NATed to a private 192.168.2.211 VRRP address. I have an Ubuntu 22.04 Apache2 Server-01 at 192.168.2.11 as the VRRP MASTER using the 192.168.2.211 address, and an Ubuntu 22.04 Apache2 Server-02 at 192.168.2.111 as the VRRP BACKUP that won't use the 192.168.2.211 address until Server-01 fails. This is intended to be an HA setup, so if Server-01 dies then Server-02 should take over and have a valid HTTPS certificate to serve example.com.

I have been able to use certbot to get a valid Let's Encrypt HTTPS certificate for the VRRP ACTIVE 192.168.2.11 Server-01, but when I try to use certbot to get a valid Let's Encrypt HTTPS certificate on the VRRP BACKUP 192.168.2.111 Server-02, I get an error message and no certificate is issued.

What is the best way to have a valid Let's Encrypt HTTPS certificate on both Server-01 and Server-02, so if Server-01 fails then Server-02 can serve the example.com website straight away? Do I need to use cronjobs to grab the valid Server-01 certificate and put it on Server-02? This doesn't sound like a great solution to me, because then the certificate would never renew on Server-02 without action on my part if Server-01 failed. Surely there is a simple setup for my HA scenario?

Welcome @ralphwatson

There are lots of ways to handle that. I am guessing the reason for the failure to get a cert on Server2 is that you are using an HTTP challenge and requesting it on Server2. Certbot prepared Server2 to reply to the Let's Encrypt server, but since Server1 is running, it was Server1 that saw the incoming HTTP challenge and replied "Not Found" (the challenge token only exists on Server2).

Assuming that, you could look at using the DNS Challenge. Each server could get its own cert through DNS TXT records. The main issue is you need to run server2 often enough to keep the certs fresh. Also, if server2 starts up and needs a fresh cert various technical reasons could cause the request to fail or take a long time (LE Servers down for example).

I think your better option is to have server1 copy the cert files to server2 each time it gets them. Or place them in a reliable place so server2 can grab them when it starts. Certs are renewed with 30 days remaining so server2 would not need to freshen the cert unless it is running longer than 30 days. I assume you would get server1 running by then.

If either server could be the active server for a long time then a DNS Challenge is probably more viable and just cope with the issues I described.

Anyway, these are just ideas. The details often vary among setups and there are other ideas too.

Any setup involving multiple servers handling the same domain name gets tricky fast. Adding High Availability to the mix adds even more complexity (such as not tolerating cert request failure on server2 startup and similar timing issues).

4 Likes

Thanks for putting in the effort to provide such a great response.

Yeah this is certainly a headache.

All those options you listed seem hard, and I think I had another idea that might simplify it. Server-02 is failing at the moment because it is the VRRP BACKUP. If I have a cronjob that tries to grab the certificate on Server-02 every 5 minutes or so, it should fail over and over, which in theory should be fine. When Server-02 becomes the VRRP ACTIVE, that cronjob will start working. The issue is that when Server-02 is VRRP ACTIVE it will be renewing the cert every 5 minutes, which isn't great. I think I read Let's Encrypt will rate-limit, so eventually that 5-minute cronjob will start failing due to the rate limit, and maybe that is fine?

The goal of all this is to have ZERO single points of failure and getting a valid HTTPS certificate on both Server-01 and Server-02 is the final battle to achieve that. Certainly proving to be a headache to achieve this and I'll keep on chugging away.

2 Likes

If you're using a reasonable client like certbot on Server 2, running the cron job every 5 minutes should be fine: if it has already gotten a certificate, it will see that its expiration is far in the future, and not bother issuing a new one.

That said, this approach has a drawback: for the five minutes between Server 1 failing and Server 2 getting its own cert, no one will be able to visit your site securely. Ideally, your backup server should be ready to go already, not have a 5-minute spin-up time.

My suggestion would be for Server 2 to have a shell script cron job which runs once each day:

  1. Check to see if it is the primary
  2. If yes, run certbot renew
  3. If no, rsync the certbot directory over from Server 1
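The steps above can be sketched as a small script. This assumes the VRRP address 192.168.2.211 only appears on an interface while this host is MASTER, and that root SSH/rsync access to Server-01 is already set up; the DRY_RUN guard is only for illustration (set it to 0 in the real cron job):

```shell
#!/bin/sh
# Daily cron job for Server-02 (sketch; IPs are from this thread).
VIP="192.168.2.211"
DRY_RUN="${DRY_RUN:-1}"     # set DRY_RUN=0 in the real cron job

run() {
    # Echo instead of executing while DRY_RUN=1, so the decision logic
    # can be tested without touching certbot or Server-01.
    LAST_CMD="$*"
    if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

if ip -4 addr show 2>/dev/null | grep -q "$VIP"; then
    # We hold the VIP, so we are the primary: renew normally.
    run certbot renew --quiet
else
    # We are the backup: pull the whole certbot state from Server-01.
    # rsync -a preserves the live/ -> archive/ symlinks certbot relies on.
    run rsync -a --delete root@192.168.2.11:/etc/letsencrypt/ /etc/letsencrypt/
fi
```

The `rsync -a` flag matters here: certbot's `live/` directory is symlinks into `archive/`, and a copy that flattens them breaks renewal bookkeeping.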
4 Likes

One thing to consider in general when doing this is that certbot renew checks OCSP to see if any existing certificate has been revoked by the CA. That means there is some delay per-certificate and also some use of network resources associated with each time it's run. In some contexts, that could be trivial or irrelevant.

I did see a server once with a really huge number of certificates managed by Certbot, and certbot renew ended up taking an unreasonably large amount of time because of the OCSP queries even when there were no renewals due. This and other phenomena have convinced me that Certbot doesn't scale up very well for managing large numbers of certificates.

I'm not suggesting that these considerations are relevant to @ralphwatson's situation, but I just want to note this whenever someone suggests that there's no reason not to run certbot renew on an arbitrarily short interval. With enough OCSP checks in play, you would actually start to miss some of the certbot renew invocations—they would give up immediately as an older renewal process still held the Certbot lock!

3 Likes

You guys are legends.

I'm ok with accepting an up-to-5-minute outage on Server-01 failure, and I think having a cronjob on Server-02 running certbot should be fine for my use case. I will look into the solution @aarongable suggested as my primary approach, but if it's too complicated for my little brain I'll stick with the cronjob solution.

In my use case I will only have 8 domains so I don't think I will hit that scalability issue @schoen is talking about, but it is interesting to know.

2 Likes

Make sure to avoid a one-hour outage though.

If the cert on Server2 needs renewing (too old, or revoked (*1)), your certbot renew running every 5 minutes on Server2 will fail while it is the backup.

But, Let's Encrypt has a rate limit of 5 failures per hour with a one hour lockout. Your 12 failures per hour will trigger this. See Rate Limits.

As I noted earlier, other factors can cause a cert request to fail. You should really design to have a valid cert on Server2 at all times so it can wake up instantly. Getting a fresh cert at startup is fragile, and it is difficult to test that Server2 will behave as you wish under all circumstances. It is much easier to test/monitor that it always has a valid cert.

(*1) As noted by @schoen, certbot renew does an OCSP check to see if your cert was revoked. Let's Encrypt may revoke certs and has had to do this (rarely). I think just last year a batch were revoked because they were issued in error. So, even though certbot renew doesn't usually request a fresh challenge/cert until after 60 days, it can happen. And there is talk of shortening cert lifetimes, and ARI is probably coming soon too. Lots of things affect how renewals work. Not a great thing to design around during an HA backup server wakeup.

2 Likes

On Ubuntu 22.04 and a standard Apache2 setup, any advice/guides on what directories/files I need to copy over to the BACKUP Server-02 for it to work? I tried copying Server-01's /etc/letsencrypt/* to /etc/letsencrypt/ on Server-02, copied the /etc/apache2/sites-available/example.com-le-ssl.conf file from Server-01 to Server-02, and then failed it over, but it wasn't loading for some reason. sudo a2ensite example.com.conf was done, and a sudo systemctl restart apache2 was done as well. I am by no means an Apache2 expert or a Let's Encrypt expert, so it could be something simple I am doing wrong.

You have to make sure the copy keeps the original symlinks.

What reason did it show?

Hard to say what went wrong without more details.

4 Likes

Still having issues, but you were right the symbolic links weren't being copied over and I had to TAR and then un-TAR the files to get the symbolic links to copy over too.

HTTP is working, but HTTPS is just saying "This site can’t be reached".

I don't think I am competent enough to do this copy solution and will probably resort to my cronjob solution and accept a 1 hour outage if/when Server-01 fails.

Unless someone is able to provide clarity on what exact files/folders need to be copied with a default Ubuntu LTS 22.04 and APACHE2 setup. I've tried copying /etc/letsencrypt/* on Server-01 to /etc/letsencrypt/ and have also made sure the /etc/apache2/sites-available/ stuff is the same on Server-01 and Server-02 and doing that still gets "This site can’t be reached" when Server-01 is failed over to Server-02.

Not even ChatGPT or Bard could give me clarity on what to do. Surely this is a common thing and there is a guide or video on what exact stuff you need to copy from Server-01 to Server-02 to get it to work :frowning: but I guess not.

Do you have symlinks from sites-enabled to sites-available on Server1 and did you create those on Server2?

On your local network, are you able to do https://(server2-local-IP) using, say, curl? You will get a cert name mismatch warning, but the HTTPS connection should work. If that fails, it points to an Apache setup problem. If it works, it points to a comms or port-routing problem.

Are you able to use HTTPS to Apache on Server2 using self-signed certs? If that works then you know it is just something about the /etc/letsencrypt folders.
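The two checks above can be sketched as concrete commands (IPs are the ones from this thread; the commands are built as strings rather than executed here, since they only make sense on the poster's network):

```shell
#!/bin/sh
# Debug commands for Server-02 (sketch; nothing actually connects here).
# 1. Hit Server-02 by IP: -k skips certificate verification, so the name
#    mismatch is fine; a completed TLS handshake means the SSL vhost is up.
CHECK_IP="curl -vk https://192.168.2.111/"
# 2. Force example.com to resolve to Server-02, so the correct VirtualHost
#    (and the copied Let's Encrypt cert) is exercised end to end.
CHECK_VHOST="curl -vk --resolve example.com:443:192.168.2.111 https://example.com/"
echo "$CHECK_IP"
echo "$CHECK_VHOST"
```

If the first command fails to complete a handshake, the problem is in Apache's SSL setup; if it succeeds but the second fails, the problem is in the name-based VirtualHost config.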

There are lots of ways to debug your problems and work through them. Doing high availability servers with failover and all that is a sophisticated endeavor. You may want to re-assess what you can accomplish given your current skill set.

3 Likes

Still having issues, but you were right the symbolic links weren't being copied over and I had to TAR and then un-TAR the files to get the symbolic links to copy over too.

Sorry I missed this thread. This is one of the setups that I am very familiar with.

I just want to go over a few things:

1- Do not use tar to copy/sync the folders. tar (with symlink support) is the correct tool for offline backups (encrypt the archive and toss it in the cloud, etc.), but you are overcomplicating things by using it for this situation.

2- The best thing to use is rsync as suggested by @aarongable. That's going to allow you to copy over only the changes as needed, and everything is automated via ssh. Here is one of the simpler guides: How To Copy Files With Rsync Over SSH | DigitalOcean

3- I would actually not use a cronjob for this. Instead I would use a --deploy-hook on Server1 to invoke the rsync. --deploy-hook only runs on success, so you're only running it when the certificate changes.

4- Server2 needs to restart to reflect the changes. That can either be done via the --deploy-hook on Server1 that invokes the rsync command OR via a daily cronjob on the Server2.
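Points 3 and 4 can be sketched as a deploy hook on Server-01. Scripts dropped into /etc/letsencrypt/renewal-hooks/deploy/ are run by certbot after each successful issuance or renewal; the hostnames are assumptions from this thread, and /tmp stands in for the real hook directory only so the sketch is harmless to run:

```shell
#!/bin/sh
# Write the deploy hook Server-01 runs after every successful renewal.
# Real location: /etc/letsencrypt/renewal-hooks/deploy/sync-to-server2.sh
HOOK=/tmp/sync-to-server2.sh
cat > "$HOOK" <<'EOF'
#!/bin/sh
# Push the whole certbot state to Server-02, preserving symlinks (-a),
# then reload Apache there so the new cert is picked up immediately.
rsync -a --delete /etc/letsencrypt/ root@192.168.2.111:/etc/letsencrypt/
ssh root@192.168.2.111 systemctl reload apache2
EOF
chmod +x "$HOOK"
```

Because the hook only fires on success, Server-02 is refreshed exactly when the certificate changes, with no polling needed.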

Personally, I really like using the Python library Fabric to write these automation scripts. I find Fabric to be very fast to write these scripts with, and the Python is exceptionally clear to read and understand when it comes to maintenance. it's just a few lines to write a cron script in Fabric that will use SSH to rsync and open a shell in the other server to restart it.

I don't foresee any availability issues from running this on --deploy-hook. The certs should have at least one month left on their lifespan when a failure occurs, so the last cert should still be valid even if failure happens mid-renewal. Even when the short-lifetime certs are offered, it will be a matter of days left over. The way certificate revocation works, there is around 7-10 days before most browsers will get that info, unless the browsers consider you a high-priority site and use their private channels to push the revocation info. Realistically, there should be no issue on failover, and you should have a minimum of 2 days to triage (10-day certs) and 30 days to triage with current 90-day certs.

Something you can also do in these situations is to use a daily cron on Server2 to check uptime on Server1 and toggle a semaphore file on disk. This is popular on nginx due to how it implements filesystem checks, but I'm not sure if Apache users do it much. Basically you just touch/delete a filepath, and the server checks whether the filepath exists and changes behavior. nginx implements this in a way that uses kernel memory, so there is almost no overhead, making it very popular to have a semaphore check on every request to toggle maintenance mode or inject a service degradation message into HTML pages.

4 Likes

Did you really expect them to think?
They just regurgitate what they find on the Internet.
[which is mostly filled with garbage]

3 Likes

Ok, I have got this HA ACTIVE/BACKUP setup working how I want it.

To get it working I had to brush up on my Ubuntu Apache2 fundamentals. The issue that stumped me was that certbot was doing some under-the-hood magic which I wasn't copying over to my Server-02. The way I got it working was to make Server-02 ACTIVE and then run the normal certbot Let's Encrypt setup, which does something I don't fully understand. The part of the certbot setup I don't understand is that it creates an example.com-le-ssl.conf file in /etc/apache2/sites-available, and I have no clue if it runs an a2ensite example.com-le-ssl.conf command under the hood to get it working or what. Maybe @schoen can shed some light on what commands certbot is running under the hood to set up that example.com-le-ssl.conf linking for HTTPS. If I had to set up another server without the ability to bring it ACTIVE, then I would need to understand what Ubuntu commands certbot runs under the hood when installing a certificate.

On Server-02 I have a cronjob that will check if it is the VRRP BACKUP and if it is it will do a rsync to Server-01 grabbing everything in the /etc/letsencrypt directory and putting it in the /etc/letsencrypt directory of Server-02. If Server-01 fails then Server-02 is currently taking over fine and I am very happy. Currently if Server-01 fails then there will be a few manual things I will have to do before bringing it back into production, but I am at peace with that scenario.

I'm a Network Engineer, so it has been fun diving into Ubuntu, Let's Encrypt, and Apache in more detail, and I appreciate everyone's help/advice. All the best.

It would only create the XXssl.conf file when using the --apache plugin without the certonly option. And only when running a command to issue a cert, not just a renew.

As for how it creates the symlink, I am not sure. But running a2ensite is the common command, though creating one manually is possible too.
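For reference, a2ensite amounts to a relative symlink from sites-enabled into sites-available. A sketch, using /tmp stand-ins for the real /etc/apache2 directories so it can be run harmlessly:

```shell
#!/bin/sh
# What `a2ensite example.com-le-ssl.conf` does under the hood (sketch;
# /tmp/apache2 stands in for /etc/apache2).
mkdir -p /tmp/apache2/sites-available /tmp/apache2/sites-enabled
touch /tmp/apache2/sites-available/example.com-le-ssl.conf
# a2ensite creates a relative symlink in sites-enabled pointing back
# into sites-available; Apache only loads what sites-enabled contains.
ln -sf ../sites-available/example.com-le-ssl.conf \
       /tmp/apache2/sites-enabled/example.com-le-ssl.conf
```

This is why a copy that drops symlinks leaves HTTP working but the SSL vhost unloaded: the file exists in sites-available, but nothing in sites-enabled references it.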

But it wouldn't create one at all if you had properly copied everything about Apache from Server1 to Server2 before running Certbot. It creates that ssl.conf file when you only have a VirtualHost for port 80; the --apache plugin then creates one for port 443. I doubt that's really what you want, as I am guessing you have custom config in your actual port 443 VHost, and Certbot would only be cloning your port 80 one.

Something seems amiss in your scheme.

Personally, I would not use the --apache plugin and would favor certonly --webroot. Using --webroot prevents Certbot from making any changes to your Apache config. You can also use certonly --apache, which won't make any permanent changes to the Apache config but will make temporary changes to satisfy the HTTP challenge. Doing this means certbot renew will also gracefully reload Apache to pick up the new cert. If using certonly --webroot, you need to add a --deploy-hook to that command or do something else to reload Apache.
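A sketch of such an invocation (the domain and webroot path are assumptions; the command is built as a string here rather than executed, since it needs a live ACME setup):

```shell
#!/bin/sh
# certonly --webroot issuance that never touches the Apache config
# (sketch; webroot path and domain are placeholders for this thread).
CMD='certbot certonly --webroot -w /var/www/example.com -d example.com --deploy-hook "systemctl reload apache2"'
echo "$CMD"
```

The --deploy-hook on the issuance command is remembered for renewals, so subsequent `certbot renew` runs will also reload Apache after a successful renewal.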

3 Likes

I checked the Certbot source code, and (when run with --apache) Certbot itself does explicitly create a symlink into sites-enabled on "Debian-like" systems; on other systems, it apparently adds an Include directive to the main Apache configuration file. I didn't check which systems are considered "Debian-like".

3 Likes