DNS-01 problem with dehydrated

RuGa · March 13, 2020, 11:09pm

I have the same problem.

I use DNSSEC.
I wrote a hook for dehydrated with debugging notes.
In the example below, you can see:

the tokens provided by Letsencrypt, to be used in the TXT record;
the record added to the DNS, with the original token;
the test on our master DNS, returning the record above;
the propagation of the record to both Cloudflare and Google;
Letsencrypt responding that the record is not correct!

[example.com]

token 1 = CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM
token 2 = zFD6--UX3XYoGMpppLocbvxbYGCTo7SqoCqcptmfi-8

+ Adding the following to the zone definition of [example.com]:
_acme-challenge.[example.com]. 300 IN TXT "CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"
+ Updating the zone...
+ Signing the zone...
+ Checking the RR on the live DNS... OK
"CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"

192.168.1.6 (master): "CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"
1.1.1.1 (Cloudflare): "CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"
8.8.8.8 (google): "CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"

[*.example.com]

token 1 = Rflnf-GHKaZWuclGLf92LL8jkMKgpSvLxFIwGcUun1g
token 2 = 6aJSDNn-GBlqUOXjdm8NZSxL6PKFT3pRTOhCRRi4Lp0

+ Adding the following to the zone definition of [example.com]:
_acme-challenge.[example.com]. 300 IN TXT "Rflnf-GHKaZWuclGLf92LL8jkMKgpSvLxFIwGcUun1g"
+ Updating the zone...
+ Signing the zone...

+ Checking the RR on the live DNS... OK
"CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"
"Rflnf-GHKaZWuclGLf92LL8jkMKgpSvLxFIwGcUun1g"

...

192.168.1.6 (master): "CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"
"Rflnf-GHKaZWuclGLf92LL8jkMKgpSvLxFIwGcUun1g"
1.1.1.1 (Cloudflare): "CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"
8.8.8.8 (google): "CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"

Letsencrypt

+ Responding to challenge for [example.com] authorization...
+ ERROR: invalid challenge for *.[example.com]

CA server response:
{
"type": "dns-01",
"status": "invalid",
"error": {
"type": "urn:ietf:params:acme:error:unauthorized",
"detail": "Incorrect TXT record \"CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM\" (and 1 more) found at _acme-challenge.[example.com]",
"status": 403
},
"url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/3342861489/8qsaTQ",
"token": "CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM"
}

Summary

I have production domains with expired certificates, and cannot renew.

_az · March 13, 2020, 11:11pm

I moved your post to a new thread as it’s a separate issue.

_az · March 13, 2020, 11:19pm

So, from the look of things, you are taking the token from the challenge resource, and using it as the value of your TXT record.

This is not how the token is used.

For the DNS-01 challenge (RFC 8555 - Automatic Certificate Management Environment (ACME)), you:

1. Take the challenge token
2. Derive the key authorization value using (1)
3. Take the SHA-256 digest of the value from (2)
4. Take the base64url encoding of the value from (3)
5. Set your TXT record to the value from (4)
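
For reference, here is a minimal Python sketch of that derivation, assuming the RFC 8555 definition of the key authorization (the token, a dot, and the base64url-encoded RFC 7638 SHA-256 thumbprint of the account key). The function names and the JWK used in the usage lines are hypothetical placeholders, not a real account key and not code taken from dehydrated.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without '=' padding, the form ACME uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint: SHA-256 over the required members only,
    sorted by name and serialized without whitespace."""
    required = {"RSA": ("e", "kty", "n"), "EC": ("crv", "kty", "x", "y")}[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in required},
        sort_keys=True,
        separators=(",", ":"),
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def dns01_txt_value(token: str, jwk: dict) -> str:
    """Steps 1 to 4: token -> key authorization -> SHA-256 -> base64url."""
    key_authorization = f"{token}.{jwk_thumbprint(jwk)}"
    return b64url(hashlib.sha256(key_authorization.encode("ascii")).digest())


# Usage with a made-up account key (hypothetical values, for illustration only).
example_jwk = {"kty": "RSA", "e": "AQAB", "n": "example-modulus"}
print(dns01_txt_value("CaxlSTmwudKMcVH9R_-X0DTJWYdVRV0b7dPZiGGtAeM", example_jwk))
```

The result is another unpadded base64url string, so it looks much like the raw token, which makes the two easy to mix up.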

Generally, when you use an ACME client like Certbot or dehydrated, the client will give you the final value you need, saving you the trouble of steps 1-4.

Looking at https://github.com/dehydrated-io/dehydrated/blob/master/docs/dns-verification.md ,

$3 is a "challenge token" (which is not needed for dns-01), and
$4 is a token which needs to be inserted in a TXT record for the domain.

It sounds like you are using $3, but need to be using $4.
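
A rough sketch of what the argument handling could look like if the hook were written in Python rather than shell. It assumes the documented deploy_challenge argument order (operation, then domain, token filename, token value, repeated per challenge when hook chaining is enabled). The function names are hypothetical, and the sketch only prints the record instead of updating a zone.

```python
#!/usr/bin/env python3
import sys


def parse_challenges(args):
    """Split the arguments after the operation name into
    (domain, token_filename, token_value) triples."""
    if not args or len(args) % 3 != 0:
        raise ValueError("expected DOMAIN TOKEN_FILENAME TOKEN_VALUE triples")
    return [tuple(args[i:i + 3]) for i in range(0, len(args), 3)]


def txt_records(args, ttl=300):
    """Build one TXT record line per challenge, using the token VALUE
    (shell $4), never the token filename (shell $3)."""
    records = []
    for domain, _token_filename, token_value in parse_challenges(args):
        # Defensive: a wildcard name is validated at the base domain.
        name = domain[2:] if domain.startswith("*.") else domain
        records.append(f'_acme-challenge.{name}. {ttl} IN TXT "{token_value}"')
    return records


if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] == "deploy_challenge":
    for record in txt_records(sys.argv[2:]):
        print(record)  # a real hook would add this to the zone and re-sign it
```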

RuGa · March 14, 2020, 9:56am

You are right, my fault.

This is the new log for the wildcard case. The token is $4 now. The TTL is down to 30 sec, and the DNS tests are done for 10 seconds after the TTL. Google is slow to pick it up, but Cloudflare is spot on. As you can see, Letsencrypt is also slow. It would be useful to have a Letsencrypt diagnostic page, to see the full log from the server side.

Processing example.com with alternative names: *.example.com
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting new certificate order from CA...
+ Received 2 authorizations URLs from the CA
+ Handling authorization for example.com
+ Handling authorization for example.com
+ 2 pending challenge(s)
+ Deploying challenge tokens...

fqdn = example.com
token 1 = YdIkxG-2QznRkDUw7t_l-TMHX97ACkZdgXyiX3WCFMc
token 2 = BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI
+ Adding the following to the zone definition of example.com:
_acme-challenge.example.com. 30 IN TXT "BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI"
+ Updating the zone...
+ Signing the zone...
+ Checking the RR on the live DNS... OK
"BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI"
+ sleeping 30sec, to allow the CA to pick it up...

192.168.1.6 (master): "BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI"
1.1.1.1 (Cloudflare): "BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI"
8.8.8.8 (google):

fqdn = example.com
token 1 = NZYR87hqZgfKUJbU2RQICaTpxllciFazXkF0TwotTCo
token 2 = VhgxNC87qDBp9-HcKkATfCUkFb516stf4Mv0CPldM2w
+ Adding the following to the zone definition of example.com:
_acme-challenge.example.com. 30 IN TXT "VhgxNC87qDBp9-HcKkATfCUkFb516stf4Mv0CPldM2w"
+ Updating the zone...
+ Signing the zone...
+ Checking the RR on the live DNS... OK
"BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI"
"VhgxNC87qDBp9-HcKkATfCUkFb516stf4Mv0CPldM2w"
+ sleeping 30sec, to allow the CA to pick it up...

192.168.1.6 (master): "BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI"
"VhgxNC87qDBp9-HcKkATfCUkFb516stf4Mv0CPldM2w"
1.1.1.1 (Cloudflare): "VhgxNC87qDBp9-HcKkATfCUkFb516stf4Mv0CPldM2w"
"BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI"
8.8.8.8 (google):

+ Responding to challenge for example.com authorization...
+ ERROR: invalid challenge for *.example.com

CA server response:
{
"type": "dns-01",
"status": "invalid",
"error": {
"type": "urn:ietf:params:acme:error:dns",
"detail": "DNS problem: NXDOMAIN looking up TXT for _acme-challenge.example.com - check that a DNS record exists for this domain",
"status": 400
},
"url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/3351771990/XI1JWQ",
"token": "YdIkxG-2QznRkDUw7t_l-TMHX97ACkZdgXyiX3WCFMc"
}

On Letsencrypt DNS

Is it possible to tell LE to read the token directly from the master, instead of the slaves or third-party DNS servers? We use DNSSEC with DANE; each zone signature resets the SOA serial, and it takes time for the slaves to pick it up.

_az · March 14, 2020, 10:14am

Let’s Encrypt queries your authoritative nameservers directly; its recursive resolver cache is negligible (60s, or your TTL, whichever is lower).

What seems likely is that one of your slaves was not yet serving the updated zone. That would also be consistent with Cloudflare picking it up and Google not - it’s just luck about which of your nameservers they hit.

Let’s Encrypt also tends to expose nameserver desynchronizations more often than common recursive resolvers, due to (under some circumstances) comparing responses between nameservers.

RuGa · March 14, 2020, 10:17am

Our slaves are slow. Reading from the master is the only way to get past the verification. However, LE fails to read the master, as you can see from the log. The log shows the LAN address. The query from the public IP of the master is in sync. I raised the waiting time to 2x the TTL (30 sec), without joy.

_az · March 14, 2020, 10:22am

You don't know what nameserver Let's Encrypt's resolver is taking its decision from. For all you know, it is checking all 3 and taking a quorum decision.

Anyway,

This isn't an option. SOA MNAME is not used as any kind of hint by recursive resolvers - only for dynamic DNS updates.

You need to wait for your slaves to update before responding to the challenge, or pull your slaves.
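
One way to do that waiting is to poll every public authoritative nameserver directly and only continue once all of them return the new values. The sketch below assumes `dig` is installed. The function names, the 30-minute default timeout, and the nameserver names in the usage comment are hypothetical and would need adapting to the real zone.

```python
import subprocess
import time


def dig_txt(nameserver, name):
    """Query one nameserver directly (no recursion) for TXT records.
    Returns the record values as a set of strings without quotes."""
    out = subprocess.run(
        ["dig", f"@{nameserver}", "+short", "+norecurse", "+time=5", "+tries=1",
         "-t", "TXT", name],
        capture_output=True, text=True, check=False,
    ).stdout
    return {line.strip().strip('"') for line in out.splitlines() if line.strip()}


def wait_until_synced(nameservers, name, expected, query=dig_txt,
                      timeout=1800, interval=10,
                      sleep=time.sleep, clock=time.monotonic):
    """Return True once every nameserver serves every expected value,
    or False if the timeout passes first."""
    expected = set(expected)
    deadline = clock() + timeout
    while True:
        lagging = [ns for ns in nameservers if not expected <= query(ns, name)]
        if not lagging:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage (hypothetical nameserver names; list the public ones, not LAN addresses):
# ok = wait_until_synced(
#     ["ns1.example.com", "ns2.example.com", "ns3.example.com"],
#     "_acme-challenge.example.com",
#     ["BXbg0mcLMzdubtX3FoC-OhsEqYCIcu-d2J0f9q4pQqI",
#      "VhgxNC87qDBp9-HcKkATfCUkFb516stf4Mv0CPldM2w"],
# )
```

Checking all nameservers, rather than just the master or a public resolver, matters here because the validator may end up asking any of them.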

RuGa · March 14, 2020, 10:34am

This isn't an option.

LE should always prefer the master (SOA MNAME), especially when its records are signed (DANE).

On resolving, a simple "dig @$master +dnssec +short -tTXT _acme-challenge.$fqdn" would do, with no need to wait for the dns global databases to pick up LE's temporary RRs.

ndilieto · March 14, 2020, 12:12pm

Do you use NSD? If so these may be worth a try

RuGa · March 14, 2020, 1:16pm

12h ; refresh
2h ; retry
2w ; expire
1h ; min TTL

These are my RFC-sane settings:

1200 ; SOA Refresh: slaves must refresh (learn zone changes) after 1200--43200 seconds
7200 ; SOA Retry: slaves must retry contacting master up to 120-7200 seconds
604800 ; SOA Expire: slaves must revalidate after 604800--1209600 seconds
3600 ; SOA Minimum: slaves must flush negative responses after 3600--86400 seconds

I prefer my 20min to your 12h refresh.

I still find it unreasonable for LE to force me to wait SOA Refresh + some, especially because you are doing it twice, for the fqdn and for the wildcard.

9peppe · March 14, 2020, 1:29pm

If not SOA Expire...

You have Retry > Refresh, is it on purpose?
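
To put numbers on that question, here is a small Python sketch. It rests on a simplifying assumption: without working NOTIFY, a slave learns of a change at its next Refresh, and each failed refresh attempt adds one Retry interval. The function name and the returned keys are made up for illustration.

```python
def soa_timer_report(refresh, retry, failed_refreshes=0):
    """Estimate slave lag without NOTIFY and flag Retry >= Refresh.
    Simplified model: lag = refresh + failed_refreshes * retry."""
    return {
        "retry_not_below_refresh": retry >= refresh,
        "worst_case_lag_seconds": refresh + failed_refreshes * retry,
    }


# Timers from the post above: Refresh 1200, Retry 7200.
print(soa_timer_report(1200, 7200))
print(soa_timer_report(1200, 7200, failed_refreshes=1))
# Timers from the quoted suggestion: Refresh 12h, Retry 2h.
print(soa_timer_report(12 * 3600, 2 * 3600))
```

Under this model, a single failed refresh with Retry at 7200 pushes the lag well past the 20-minute Refresh.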

RuGa · March 14, 2020, 1:38pm

I am within the RFC timing boundaries.

_az · March 14, 2020, 8:10pm

According to who? (I genuinely don’t know)

RuGa · March 15, 2020, 6:56am

According to anybody who knows what DANE is and knows how to query it.

ndilieto · March 15, 2020, 7:09am

_az · March 15, 2020, 7:37am

I am still not connecting the dots on wtf DANE has to do with how DNS recursors perform their queries. As far as I can tell, recursors don't care, and have never cared about SOA MNAME.

To cite RFC 8499 - DNS Terminology ,

The idea of a primary master is only used in [RFC1996] and
[RFC2136]. A modern interpretation of the term "primary master"
is a server that is both authoritative for a zone and that gets
its updates to the zone from configuration (such as a master file)
or from UPDATE transactions.

RFC1996 and RFC2136 being DNS NOTIFY and DNS UPDATE, neither relevant for recursors.

You can play with unboundtest.com if you like, it's the same recursor + similar configuration to what Let's Encrypt use for their VA - lots of verbose logging.

RuGa · March 15, 2020, 7:48am

Why is LE using a dns recursor when a simple “dig @$master +dnssec +short -tTXT _acme-challenge.$fqdn” would do, with no need to wait for the dns global databases to pick up LE’s temporary RRs?

With DANE, both ports 25 and 443 are signed in the DNS using a hash of their respective TLS certificates, which happen to be the ones you are renewing through LE. LE could be smarter in this case, with no need for temporary acme RRs.

rg305 · March 15, 2020, 7:48am

You presume that the master is accessible from the Internet - that is NOT a requirement.
It only needs to be accessible to the slaves.

Your plan puts “all (DNS) eggs in one (MASTER) basket”.
And would require the DNS resolver to do a series of “if then else” logic tests/steps.
[You are essentially rewriting DNS]

_az · March 15, 2020, 7:48am

https://www.cloudflare.com/learning/dns/what-is-dns/ .

How does Let’s Encrypt even discover what your primary server is (in the hypothetical world where that means something)? It does it by descending from the DNS root zone (.), until it eventually finds your SOA record.

That’s why you use a recursor. It does the chain of lookups for you.

You have assigned meaning to SOA MNAME that simply doesn’t exist. Your three nameservers are completely equivalent to each other (in terms of priority and authoritativeness), to every recursor on the internet. You need to deal with the slave lag by sleeping.

This is what people who use e.g. Linode DNS hosting do - they literally put 20 minute sleeps into their renewal scripts, because Linode’s slave lag is so bad.

rg305 · March 15, 2020, 7:56am

Or try speeding that up with change notifications.

From a top-down recursive view, all authoritative nameservers are provided by the level above via DNS glue records [which are all created equal].
There is no "Super Glue" record that points to the Master.
