I'm trying to set up a mail server across 3 cloud regions, with one Postfix instance per region. Each instance needs a certificate for its regional name, e.g. out-$region.mail.syseleven.net, as well as one for the combined entry, out.mail.syseleven.net.
We have a certbot setup that spawns a DNS server locally; since _acme-challenge points to the server itself, it answers its own dns-01 challenge, so no credentials for our main DNS are required.
This relies on _acme-challenge.out.mail.syseleven.net and _acme-challenge.out-$region.mail.syseleven.net being delegated via NS records to the servers. With 3 regions, I have 3 NS records for _acme-challenge.out.mail.syseleven.net, one pointing to the server in each region.
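For concreteness, the delegation in the parent zone looks roughly like this (the per-region server hostnames are placeholders, not the actual ones):

```zone
; in the mail.syseleven.net zone (hostnames are examples)
_acme-challenge.out          IN NS   out-region1.mail.syseleven.net.
_acme-challenge.out          IN NS   out-region2.mail.syseleven.net.
_acme-challenge.out          IN NS   out-region3.mail.syseleven.net.
; plus one delegation per regional name:
_acme-challenge.out-region1  IN NS   out-region1.mail.syseleven.net.
```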
When certbot runs, 1 of the 3 NS records will start to "work" and return the TXT record. The problem: whether validation succeeds is random. It appears that either the Let's Encrypt resolvers only try 1 of the NS records and stop if that one times out, or Let's Encrypt has a timeout of around 30 seconds and certbot itself does not wait long enough for the resolvers to try all 3.
Can someone answer:
a) Does Let's Encrypt follow all NS delegations to reach the _acme-challenge subzone, iterating through the records even if all but 1 time out (not NXDOMAIN/error, just "no DNS server active there")?
b) Does Let's Encrypt iterate, but with a high timeout, so that certbot needs to wait more than the default 30 seconds for verification? (I have not found a --timeout option in certbot's help page.)
I'd assume a) is true, since that is the whole point of having multiple DNS servers: if one is down, the query times out and you try the others.
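The fallback behavior assumed in a) can be sketched as a toy model. This is not Let's Encrypt's actual code; `total_budget` is a hypothetical stand-in for a validator-side global time limit, which would explain the randomness observed above:

```python
# Toy model of an iterative resolver trying each delegated NS in turn.

class Timeout(Exception):
    pass

def query_with_fallback(nameservers, query, total_budget=3):
    """Try each nameserver; skip the ones that time out.

    `query` is a callable simulating one DNS query: it returns a TXT
    value or raises Timeout. `total_budget` caps how many servers we
    try overall (a stand-in for a validator's global timeout).
    """
    for i, ns in enumerate(nameservers):
        if i >= total_budget:          # global budget exhausted
            raise Timeout("gave up before trying all NS records")
        try:
            return query(ns)
        except Timeout:
            continue                   # dead server: try the next one
    raise Timeout("all nameservers timed out")

servers = ["region1", "region2", "region3"]

def fake_query(ns):
    # Two of three regions are "down" (no DNS server listening):
    if ns != "region3":
        raise Timeout(ns)
    return "acme-token-txt-value"

print(query_with_fallback(servers, fake_query))  # region3 answers last
```

With a generous budget the last live server is eventually reached; with `total_budget=1` the same query fails, which matches the behavior described in b).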
The fact that this doesn't work reliably is probably a good thing. And even if it did work reliably, you're abusing an implementation detail of the protocol which may very likely change over time as implementations and libraries change. It also feels like mild abuse of Let's Encrypt resources as DNS timeouts take longer for the validation servers to process and you're purposefully making those timeouts happen.
I have no answer for a) because I haven't looked at the code. For b), the timeout configuration on the validation servers (if there is one) is definitely not configurable by the ACME client.
I'd suggest re-thinking your architecture to avoid needing to purposefully have authoritative nameservers offline.
I see that the LE servers are "hogged" during that timeout; I had assumed an open socket waiting for replies would not be a limited resource, since it costs neither CPU time nor bandwidth.
I see both sides: it's "unnecessary" load for the LE servers when you already know the query will time out, but the DNS specification for NS records clearly says "try until one answers". So I disagree that this abuses an implementation detail, but I also think there could be a better way.
I will try to implement a solution with an always-on DNS server that resolves the query internally against the other NS records and returns the result. That way we get a cascade inside our network and don't consume public Let's Encrypt resources for it. Maybe this can be done with plain dnsmasq configuration, so that every server running certbot can also resolve _acme-challenge queries meant for the other servers.
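One way to sketch that cascade is dnsmasq's `server=/domain/ip` forwarding, which sends queries for a given name to specific upstreams (the IPs and port below are placeholders; dnsmasq probes the listed upstreams and prefers one that actually answers):

```
# /etc/dnsmasq.d/acme-cascade.conf on the region1 server (IPs are examples)
# Try the locally spawned challenge server first:
server=/_acme-challenge.out.mail.syseleven.net/127.0.0.1#5353
# Fall back to the servers in the other regions:
server=/_acme-challenge.out.mail.syseleven.net/203.0.113.2
server=/_acme-challenge.out.mail.syseleven.net/203.0.113.3
```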
Understand that a negative answer is also an answer.
Which could fail the process immediately.
Does it "work" about 1 out of every 3 times?
I wonder why?
You feel that the requester should have to try all authoritative name servers (if needed) before it can determine that something fails/times out?
I have domains with more than a dozen authoritative nameservers - that makes it very taxing to check all of them [could even become a DoS over time].
You should consider lowering the TTL on the CNAME and update them to only point to the region that can fulfill the TXT record request.
Running an acme-dns service on an auth subdomain (which your _acme-challenge CNAMEs point into) sounds like the simplest option in this case. Your _acme-challenge record would then not be its own zone; it's just a dynamically served TXT record in a domain/subdomain of your choice, which your ACME client knows how to update (using an acme-dns compatible plugin etc.).
Doing super-custom stuff with your own DNS servers feels like extra work, and, as others have pointed out, LE will not tolerate a SERVFAIL, timeout, or NXDOMAIN during any of its domain validation, and it won't retry another nameserver just because one is being problematic.
LE is not really trying to use your DNS in the conventional way; it is validating it (by checking responses from multiple vantage points etc.).
Note that you can also usually delegate your auth subdomain to a different zone on a different cloud provider (e.g. AWS Route 53 etc) and just use that zone for all your auth, which would avoid having to run acme-dns (which is a potential point of failure, although there is a heartbeat endpoint you can check).
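For illustration, whichever backend serves the auth zone, the delegation itself is just one static CNAME per name (the UUID target below is made up; with acme-dns it would be whatever the registration call returned):

```zone
_acme-challenge.out.mail.syseleven.net.  IN CNAME  d420c923-bbd7-4056-ab64-c3ca54c9b3cf.auth.example.org.
```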
As an aside, If you really want to build your own solution it's entirely possible to develop your own implementation of acme-dns (I've done this myself using a set of cloud based services) because it's only a couple of http API endpoints, a data store and a basic dynamic DNS service of your own making (OK, I'm simplifying a little). The hardest part for me was getting a cloud provided solution that would load balance UDP across the DNS server instances (they mostly just want to load balance TCP).
Now we can just point DNS (or tell customers to point their DNS) to a specific predictable subdomain on our acme-dns server (the fqdn in reverse order, dots converted to dashes). When we validate, I just run a hook after /register to update the acme-dns datastore to swap the uuid subdomain with our predictable subdomain.
That gave us everything we needed to get the job done, without having to build/test/maintain a new application - while still leveraging the benefit of the acme-dns community. If you have many future needs, that might be an option for you.
That'd be another interesting solution: return a CNAME to the next NS server if the record is not found. Conceptually it's the same as the alternative of resolving the record yourself: you need an NS server active even when certbot on your server is not, since it can be active on another server.
acme-dns indeed seems like a very nice tool for delegating _acme-challenge away, so your main DNS credentials aren't lying around everywhere and you don't have to deploy another fully-fledged DNS server.
But it is again a single server with a single point of failure, from the design perspective.
I am looking for a solution where all servers run independently, ideally without knowing about each other. You could then scale the system without touching already-deployed servers, because every server can answer Let's Encrypt challenges, be it by querying all NS delegations (where Let's Encrypt only queries 1) or by building a chain of CNAMEs that is followed until someone answers.
If every server has a dnsmasq running that CNAME-chains to the next one, the system forms a loop - but nobody will query it anyway. When certbot on any server in that loop starts a challenge, that server breaks the loop and returns the answer (from localhost), so we never need to update anything on the central DNS...
Then try throwing in DDNS.
All the authoritative NS need is to CNAME to a name that is dynamic.
Then you bring that name up at any location (wait for DDNS to sync the IP), do whatever cert actions you need from that IP, and shut it back down.
But is DDNS in this regard just "have the API credentials for a DNS server handy"? Isn't that back to square one, "give every server API credentials for the nameserver to CNAME themselves"? Or is DDNS specifically meant as "one credential with permission to update just 1 record"?
I always thought DDNS was just a DNS API with permission management (for non-techies / App-Integration)?
acme-dns as a single point of failure in this situation is actually a benefit, and the goals you noted are extreme over-engineering that are extremely likely to hide critical flaws.
Certificates are valid for 90 days, with renewal recommended to occur on day 60 - meaning a single point of failure should never create a tangible business issue and there are at least 4 full business weeks to fix any problems. It is a proven system, actively used by hundreds of organizations and recommended by most cloud hosting providers.
Should you have any concern for enrolling domains, it is still the best option currently available and one could use well known multi-zone and failover techniques to automatically switch DNS to a secondary installation.
It's honestly hard for me to view any of the above comments as anything other than attempts to abuse implementation details and specification minutiae to overengineer a solution to a simple problem.
No, the CNAME entry would be a one-time/permanent thing.
No; DDNS means: Dynamic DNS.
Where the permanent CNAMEd entry points to an entry with a dynamic IP (likely in another domain).
The IP is updated via client software - similar to making a DNS entry but technically nowhere near it, as you would NOT be touching your authoritative name servers.
For the Certify DNS product (which is what this became) I needed disposable/sacrificial fault tolerance on the DNS side (being a public DNS service is a very hostile environment), very scalable storage, and auth (which you could also do with a front-end reverse proxy) to match registrations against known user accounts. I also wanted a complete understanding of how it all worked at every level. It would have been nice to use an existing cloud DNS provider for the actual DNS part, but their update latency on records is far too slow for the job. The end result is still acme-dns compatible; it just relies on the client allowing basic auth in the service URL.
[The actual implementation is a Cloudflare worker for the API, some functions/data store stuff on a cloud provider and a few dynamically scaling/recycling container instances for the DNS servers, which are behind a load balanced IP forwarding UDP/TCP].
@jvanasco had the correct answer: "extreme over-engineering"
Thanks for telling me, you are right.
I tried using acme-dns as a single, separate ACME solver for all servers, but it requires delegating _acme-challenge.domain to the specific subdomain whose ID is generated on first registration against its API. It has no built-in way to bootstrap accounts, so it wasn't the right choice for an internal DNS server updated by certbot/lego with provisioned credentials.
So I went with setting up a separate PowerDNS with wildcard zones for all the domains I need, and I deploy the API key to my certbot/lego clients. We can improve that setup with PowerDNS-Admin and its per-zone API keys, but right now a stateless PowerDNS container is the perfect blank canvas for running ACME challenges.
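A rough sketch of that setup, assuming PowerDNS's `pdnsutil` on the server and lego's `pdns` provider on the clients (zone, host, and key names below are examples):

```shell
# On the PowerDNS server: create a zone for the challenge names
pdnsutil create-zone out.mail.syseleven.net ns1.out.mail.syseleven.net

# On each client: let lego write TXT records through the PowerDNS HTTP API
PDNS_API_URL=http://pdns.internal.example:8081 \
PDNS_API_KEY=provisioned-secret \
lego --dns pdns --domains out.mail.syseleven.net \
     --email postmaster@syseleven.net run
```

lego documents the `pdns` provider with the `PDNS_API_URL`/`PDNS_API_KEY` variables; the certbot equivalent would be a third-party PowerDNS DNS plugin.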