Multiple servers for same domain with multiple DNS servers - Best practice for issue/renew?

Hi,

Our websites and DNS are served by multiple sets of geographically diverse servers. We are wondering about the best way to achieve automated registration and renewal.
One critical factor is that, in this high-availability system, the failure of a DNS server or web server must not disrupt certificate issuance or renewal.

In essence we have:

2 DNS servers serving the same domains as authoritative nameservers.
2 web servers serving the same website.

As I understand it, the difficulty with automated issuance/renewal is that modifications need to be made to the webroot or to the DNS.

The problem is that any such change would obviously need to be made in both locations before letting the challenge occur, as it is not possible to determine which server the challenge will arrive at.
The DNS servers are not on the same machines as the web servers. Loss of any machine must not affect operations.

Is there any advice for the best way forward in such high availability scenarios?

How can we get automated issuance/renewal? Or is the answer a lot of scripting/unison/scp?

Cheers
Barry

I think the description of your system is too vague.

For example, when one of the web servers is down, is the associated A record also withdrawn? Is this automatic or manual? This would influence whether HTTP validation is even viable within your setup, since the validation servers will not ignore downed hosts.

What do your nameservers run? Can you update them with RFC2136 or a PowerDNS-compatible remote interface?
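If RFC2136 is available, placing the DNS-01 challenge record can be scripted from any host that holds the TSIG key. A minimal sketch using the dnspython library (zone name, key, token and server address are all hypothetical placeholders):

import dns.query
import dns.rcode
import dns.tsigkeyring
import dns.update

# Hypothetical TSIG key name and base64 secret - substitute your own.
keyring = dns.tsigkeyring.from_text({"acme-key.": "c2VjcmV0LWtleS1tYXRlcmlhbA=="})

# Build a dynamic update for the zone, signed with the TSIG key.
update = dns.update.Update("example.net", keyring=keyring, keyname="acme-key.")

# Replace any existing challenge record with the new validation token.
update.replace("_acme-challenge", 120, "TXT", "token-from-the-acme-server")

# Send the update to the primary nameserver over TCP.
response = dns.query.tcp(update, "192.0.2.1", timeout=10)
if response.rcode() != dns.rcode.NOERROR:
    raise RuntimeError(f"DNS update refused: {response.rcode()}")

Run it against whichever primary is reachable; with NOTIFY configured, zone transfers should carry the record to the secondaries quickly.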

Do you need the web servers to run unique SSL key pairs? Are you ok with authorizing your web servers to update your DNS zones? If not, you may wish to centralize your certificate renewals to a privileged host, pre-distribute the certificate private key to your web servers, and just fetch the certificate from the privileged host nightly.
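That nightly fetch can be very small. A sketch, assuming the privileged host serves the current chain at an internal HTTPS URL (all names and paths are hypothetical):

#!/usr/bin/env python3
import subprocess
import urllib.request
from pathlib import Path

# Hypothetical internal URL on the privileged renewal host.
CERT_URL = "https://renewer.internal.example.net/certs/example.net/fullchain.pem"
CERT_PATH = Path("/etc/ssl/local/example.net.fullchain.pem")

new_cert = urllib.request.urlopen(CERT_URL, timeout=30).read()

# Only rewrite the file and reload the web server when the
# certificate has actually changed.
if not CERT_PATH.exists() or CERT_PATH.read_bytes() != new_cert:
    CERT_PATH.write_bytes(new_cert)
    subprocess.run(["systemctl", "reload", "nginx"], check=True)

Since the private key never moves after the initial pre-distribution, only the public certificate crosses the wire each night.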


Also, how important is that? If you rarely create new certificates, and renew them with plenty of lead time, an outage isn't usually a problem.


I would actually push everything to a single point of failure with CNAMEs (DNS-01) and A records (HTTP-01) pointing at an unchanging host/IP.

Like @mnordhoff said, there should be plenty of lead time in case of an outage. You’re likely to create more issues with authorization by overcomplicating your deployment. If you try to renew every 60 days, the 90-day certificate lifetime leaves you 30 days to deal with the renewal.
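For DNS-01, that looks like a one-time alias in each zone (hypothetical names; acme-helper stands in for the single stable host that answers all challenges):

; In the example.net zone: delegate the challenge label to the helper
; host, so only that host ever needs dynamic TXT updates.
_acme-challenge.example.net.  IN  CNAME  _acme-challenge.acme-helper.example.org.

Let’s Encrypt follows the CNAME during validation, so the TXT record only ever has to be written on the helper.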

BIND 9, with RFC2136 updates available.

Yes, associated A records are withdrawn when the network/system of the target is not reachable.
There is no requirement for the web servers to run unique key pairs, but each location (not necessarily each server) needs to be able to obtain its certificates (even if the certificates themselves are shared between locations) - i.e. an extended, complete telecommunications failure is mitigated.

Each location runs a pair of DNS servers and a number of web servers.

Yes, the web servers could update DNS zones, including cross-location - all locations have an internal VPN available. The sites effectively provide mirroring - each site is completely autonomous, with information sharing between sites as required.

We do not mind using a privileged host per server farm, but not one serving the entire infrastructure. There must be no single point of failure, regardless of the timescale of the outage. This ensures that telecommunications outages of significant duration (it has happened - and it did not affect us) do not affect the business.

The avoidance of single points of failure has been a fundamental principle of the system's design for the last 15 years and has proven itself - we cannot compromise on that. If it takes a significant amount of software development to produce a system that works, we are happy to do the development.

Sounds like purchasing a few maximum-validity certificates and managing them out-of-band could be a safer route.

It also sounds like you could do anything that’s been suggested above.

Centralizing certificate management (whether globally or per PoP) may be required to avoid rate limits, which is a factor to consider when dealing with redundancy.

Have you also considered diversifying across certificate authorities? There have been a number of Let’s Encrypt OCSP server outages in the last couple of years which have affected availability of websites serving their certificates (because popular web servers handle OCSP outages poorly).

If you actually only have 2 web servers, some web servers make it easy to serve a local file or redirect to the other web server.

With Nginx you could do:

# Serve the challenge file if it exists locally; otherwise hand off.
location /.well-known/acme-challenge/ {
    try_files $uri @redirect;
}

# Redirect missing challenge files to the peer, preserving the path.
location @redirect {
    return 302 http://other-server.example.net$request_uri;
}

Then both servers could independently pass HTTP-01 validation.

If one server was down, validation would fail randomly 50% of the time until the broken IPs were withdrawn from DNS.

You can possibly do round-robin redirects - or reverse proxies - with more than 2 servers, but it gets more and more awkward.
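A sketch of the reverse-proxy variant with nginx (peer names are hypothetical; the upstream block lives in the http context, the locations in each server block; each server should list only the other peers so it never proxies to itself in a loop):

# Peers that might hold the challenge file; nginx round-robins these
# and retries the next one if a connection fails.
upstream acme_peers {
    server web2.internal.example.net;
    server web3.internal.example.net;
}

location /.well-known/acme-challenge/ {
    try_files $uri @peers;
}

# Proxy (rather than redirect) missing challenge files to a peer.
location @peers {
    proxy_pass http://acme_peers;
}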

