+ Adding the following to the zone definition of example.com:
_acme-challenge.example.com. 300 IN TXT "CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"
+ Updating the zone...
+ Signing the zone...
+ Checking the RR on the live DNS... OK
"CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"
+ Adding the following to the zone definition of example.com:
_acme-challenge.example.com. 300 IN TXT "Rflnf-GHKaZWuclGLf92LL8jkMKgpSvLxFIwGcUun1g"
+ Updating the zone...
+ Signing the zone...
+ Checking the RR on the live DNS... OK
"CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"
"Rflnf-GHKaZWuclGLf92LL8jkMKgpSvLxFIwGcUun1g"
Generally, when you use an ACME client like Certbot or dehydrated, the client will give you the final value you need, saving you the trouble of steps 1-4.
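For reference, here is a minimal sketch of how the dns-01 TXT value is derived, assuming the manual steps mentioned above follow RFC 8555 (SHA-256 over the key authorization, then base64url without padding). The function names and the thumbprint are made up for illustration; a real thumbprint comes from your ACME account key.

```python
import base64
import hashlib


def b64url(data: bytes) -> str:
    """Base64url-encode without '=' padding, as ACME expects."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """Return base64url(SHA-256(token + '.' + thumbprint))."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return b64url(digest)


if __name__ == "__main__":
    # The token is copied from the log; the thumbprint is a placeholder,
    # so the printed value will NOT match the TXT record in the log.
    token = "YdIkxG-2QznRkDUw7t_l-TMHX97ACkZdgXyiX3WCFMc"
    thumbprint = "hypothetical-account-key-thumbprint"
    print(dns01_txt_value(token, thumbprint))
```

The result is always 43 characters long, which matches the shape of the TXT values in the logs.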
This is the new log for the wildcard case. The token is $4 now. The TTL is down to 30 seconds, and the DNS tests are done for 10 seconds after the TTL. Google is slow to pick it up, but Cloudflare is spot on. As you can see, Let's Encrypt is also slow. It would be useful to have a Let's Encrypt diagnostic page showing the full log from the server side.
Processing example.com with alternative names: *.example.com
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting new certificate order from CA...
+ Received 2 authorizations URLs from the CA
+ Handling authorization for example.com
+ Handling authorization for example.com
+ 2 pending challenge(s)
+ Deploying challenge tokens...
fqdn = example.com
token 1 = YdIkxG-2QznRkDUw7t_l-TMHX97ACkZdgXyiX3WCFMc
token 2 = BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI
+ Adding the following to the zone definition of example.com:
_acme-challenge.example.com. 30 IN TXT "BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI"
+ Updating the zone...
+ Signing the zone...
+ Checking the RR on the live DNS... OK
"BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI"
+ sleeping 30sec, to allow the CA to pick it up...
fqdn = example.com
token 1 = NZYR87hqZgfKUJbU2RQICaTpxllciFazXkF0TwotTCo
token 2 = VhgxNC87qDBp9-HcKkATfCUkFb516stf4Mv0CPldM2w
+ Adding the following to the zone definition of example.com:
_acme-challenge.example.com. 30 IN TXT "VhgxNC87qDBp9-HcKkATfCUkFb516stf4Mv0CPldM2w"
+ Updating the zone...
+ Signing the zone...
+ Checking the RR on the live DNS... OK
"BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI"
"VhgxNC87qDBp9-HcKkATfCUkFb516stf4Mv0CPldM2w"
+ sleeping 30sec, to allow the CA to pick it up...
+ Responding to challenge for example.com authorization...
+ ERROR: invalid challenge for *.example.com
CA server response:
{
  "type": "dns-01",
  "status": "invalid",
  "error": {
    "type": "urn:ietf:params:acme:error:dns",
    "detail": "DNS problem: NXDOMAIN looking up TXT for _acme-challenge.example.com - check that a DNS record exists for this domain",
    "status": 400
  },
  "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/3351771990/XI1JWQ",
  "token": "YdIkxG-2QznRkDUw7t_l-TMHX97ACkZdgXyiX3WCFMc"
}
On Letsencrypt DNS
Is it possible to tell LE to read the token directly from the master, instead of from the slaves or third-party DNS servers? We use DNSSEC with DANE; each zone signature resets the SOA serial, and it takes time for the slaves to pick it up.
Let’s Encrypt queries your authoritative nameservers directly. Its recursive resolver cache is negligible (60s or your TTL, whichever is lower).
What seems likely is that one of your slaves was not yet serving the updated zone. That would also be consistent with Cloudflare picking it up and Google not - it’s just luck about which of your nameservers they hit.
Let’s Encrypt also tends to expose nameserver desynchronizations more often than common recursive resolvers, due to (under some circumstances) comparing responses between nameservers.
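One way to check whether a slave is lagging is to compare the SOA serial each nameserver returns. The sketch below assumes `dig` is installed and that the serial only ever increases. It ignores serial wraparound, and it would not be meaningful if a signing step resets the serial. The function names are illustrative.

```python
import subprocess


def parse_soa_serial(dig_short_line: str) -> int:
    """Extract the serial from one line of `dig +short SOA` output.

    The line has the form: MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM
    """
    fields = dig_short_line.split()
    if len(fields) != 7:
        raise ValueError(f"unexpected SOA line: {dig_short_line!r}")
    return int(fields[2])


def soa_serial(zone: str, nameserver: str) -> int:
    """Ask one nameserver for the zone's SOA serial (requires `dig`)."""
    result = subprocess.run(
        ["dig", "+short", "+norecurse", f"@{nameserver}", zone, "SOA"],
        capture_output=True, text=True, check=True, timeout=15,
    )
    lines = result.stdout.strip().splitlines()
    if not lines:
        raise RuntimeError(f"no SOA answer from {nameserver}")
    return parse_soa_serial(lines[0])


def lagging_nameservers(serials: dict[str, int]) -> list[str]:
    """Return the nameservers whose serial is behind the highest one seen."""
    newest = max(serials.values())
    return sorted(ns for ns, serial in serials.items() if serial < newest)
```

Calling `soa_serial("example.com", ns)` for each of your nameservers and passing the results to `lagging_nameservers` lists the servers still on the old zone.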
Our slaves are slow. Reading from the master is the only way to get past the verification. However, LE fails to read the master, as you can see from the log. The log shows the LAN address. The query from the public IP of the master is in sync. I raised the waiting time to 2x the TTL (30 sec), without joy.
You don't know what nameserver Let's Encrypt's resolver is taking its decision from. For all you know, it is checking all 3 and taking a quorum decision.
Anyway,
This isn't an option. SOA MNAME is not used as any kind of hint by recursive resolvers - only for dynamic DNS updates.
You need to wait for your slaves to update before responding to the challenge, or pull your slaves.
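A fixed sleep can be replaced by polling every authoritative nameserver until each one serves the expected TXT values. This is only a sketch: it assumes `dig` is installed, and the nameserver names in the usage note are hypothetical. The query function is injectable so the loop can be tried without touching the network.

```python
import subprocess
import time
from typing import Callable, Iterable


def dig_txt(name: str, nameserver: str) -> set[str]:
    """Return the TXT strings that one nameserver serves for `name`.

    Shells out to `dig`; a failed or timed-out query is treated as
    "nothing served yet".
    """
    try:
        result = subprocess.run(
            ["dig", "+short", "+norecurse", f"@{nameserver}", name, "TXT"],
            capture_output=True, text=True, timeout=15,
        )
    except subprocess.TimeoutExpired:
        return set()
    if result.returncode != 0:
        return set()
    # `dig +short` prints each TXT string in double quotes, one per line;
    # lines starting with ';' are diagnostics, not answers.
    return {
        line.strip().strip('"')
        for line in result.stdout.splitlines()
        if line.strip() and not line.startswith(";")
    }


def wait_until_synced(
    name: str,
    expected: Iterable[str],
    nameservers: Iterable[str],
    query: Callable[[str, str], set[str]] = dig_txt,
    interval: float = 10.0,
    timeout: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every nameserver serves all expected TXT values.

    Returns True once all servers agree, False if `timeout` seconds pass first.
    """
    wanted = set(expected)
    servers = list(nameservers)
    deadline = time.monotonic() + timeout
    while True:
        behind = [ns for ns in servers if not wanted <= query(name, ns)]
        if not behind:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

For the wildcard case above, a hook could call `wait_until_synced("_acme-challenge.example.com", [value1, value2], ["ns1.example.com", "ns2.example.com", "ns3.example.com"])` once after the second record is deployed, and only then let the client respond to the challenge.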
LE should always prefer the master (SOA MNAME), especially when its records are signed (DANE).
On resolving, a simple "dig @$master +dnssec +short -tTXT _acme-challenge.$fqdn" would do, with no need to wait for the dns global databases to pick up LE's temporary RRs.
1200 ; SOA Refresh: slaves must refresh (learn zone changes) after 1200--43200 seconds
7200 ; SOA Retry: slaves must retry contacting master up to 120-7200 seconds
604800 ; SOA Expire: slaves must revalidate after 604800--1209600 seconds
3600 ; SOA Minimum: slaves must flush negative responses after 3600--86400 seconds
I prefer my 20min to your 12h refresh.
I still find it unreasonable for LE to force me to wait SOA Refresh + some, especially because you are doing it twice, for the fqdn and for the wildcard.
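To put numbers on the wait, here is a rough estimate of worst-case slave lag from the SOA timers. It assumes the slaves learn of changes only by polling at the Refresh interval (no NOTIFY) and retry at the Retry interval after a failed attempt. The 60-second transfer margin is an arbitrary placeholder.

```python
def worst_case_lag(refresh: int, retry: int,
                   failed_refreshes: int = 0, margin: int = 60) -> int:
    """Upper bound in seconds before a polling slave serves a zone change.

    refresh: SOA Refresh, the interval between serial checks.
    retry: SOA Retry, the wait after each failed check.
    failed_refreshes: how many checks are assumed to fail in a row.
    margin: allowance for the zone transfer itself (an assumption).
    """
    return refresh + failed_refreshes * retry + margin


if __name__ == "__main__":
    print(worst_case_lag(1200, 7200))   # 20 min refresh -> 1260
    print(worst_case_lag(43200, 7200))  # 12 h refresh   -> 43260
```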
I am still not connecting the dots on wtf DANE has to do with how DNS recursors perform their queries. As far as I can tell, recursors don't care, and have never cared about SOA MNAME.
The idea of a primary master is only used in [RFC1996] and
[RFC2136]. A modern interpretation of the term "primary master"
is a server that is both authoritative for a zone and that gets
its updates to the zone from configuration (such as a master file)
or from UPDATE transactions.
RFC1996 and RFC2136 being DNS NOTIFY and DNS UPDATE, neither relevant for recursors.
You can play with unboundtest.com if you like, it's the same recursor + similar configuration to what Let's Encrypt use for their VA - lots of verbose logging.
Why is LE using a dns recursor when a simple “dig @$master +dnssec +short -tTXT _acme-challenge.$fqdn” would do, with no need to wait for the dns global databases to pick up LE’s temporary RRs?
With DANE, both ports 25 and 443 are signed in the DNS using a hash of their respective TLS certificates, which happen to be the ones you are updating from LE. LE could be smarter in this case, with no need for temporary acme RRs.
You presume that the master is accessible from the Internet - that is NOT a requirement.
It only needs to be accessible to the slaves.
Your plan puts “all (DNS) eggs in one (MASTER) basket”.
And would require the DNS resolver to do a series of “if then else” logic tests/steps.
[You are essentially rewriting DNS]
How does Let’s Encrypt even discover what your primary server is (in the hypothetical world where that means something)? It does it by descending from the DNS root zone (.), until it eventually finds your SOA record.
That’s why you use a recursor. It does the chain of lookups for you.
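As a rough illustration of that descent, the sketch below lists the names a recursor could pass through on its way from the root to the challenge record. It is a simplification: it treats every label as a possible zone cut, whereas real delegations exist only where NS records say so. The function name is made up.

```python
def delegation_chain(fqdn: str) -> list[str]:
    """List candidate zones from the root down to `fqdn`, root first."""
    labels = [label for label in fqdn.rstrip(".").split(".") if label]
    chain = ["."]
    # Add one label at a time, starting from the rightmost (the TLD).
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


if __name__ == "__main__":
    for zone in delegation_chain("_acme-challenge.example.com"):
        print(zone)
```

At each step the recursor gets back a set of NS records and may ask any one of them. Nothing in that set marks one server as the master.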
You have assigned meaning to SOA MNAME that simply doesn’t exist. Your three nameservers are completely equivalent to each other (in terms of priority and authoritativeness), to every recursor on the internet. You need to deal with the slave lag by sleeping.
This is what people who use e.g. Linode DNS hosting do - they literally put 20 minute sleeps into their renewal scripts, because Linode’s slave lag is so bad.
Or try speeding that up with change notifications.
From a top-down recursive view, all authoritative nameservers are provided by the level above via DNS Glue records [which are all created equal].
There is no "Super Glue" record that points to the Master.