LetsEncrypt isn't reading my TXT RRs (RFC2136)

Hello, I'm using certbot with the RFC 2136 DNS plugin against a BIND9 server on Debian. Certbot correctly adds and removes the DNS challenge records, but the Let's Encrypt CA doesn't seem to see them.

I can run:

sudo certbot certonly --dns-rfc2136 --dns-rfc2136-credentials /path/to/rfc2136.ini --domain ki9.gf4.pw --domain '*.ki9.gf4.pw'

And the process starts. The named logs report that the TXT record has been set:

Oct 11 11:40:45 myhost named[1278889]: client @0x7f97f4045568 <LOCAL_IP>#58478/key letsencrypt: updating zone 'gf4.pw/IN': adding an RR at '_acme-challenge.ki9.gf4.pw' TXT "FShM6LHSWAu0n4J8Komjn764BlhsEj5phDPm_Fb89jk"

The certbot command then says "Waiting 60 seconds for DNS changes to propagate". During this time, I can log on to any machine on the internet (not necessarily the server running BIND) and dig the record:

$ dig _acme-challenge.ki9.gf4.pw TXT | grep TXT
; <<>> DiG 9.16.1-Ubuntu <<>> _acme-challenge.ki9.gf4.pw TXT
;_acme-challenge.ki9.gf4.pw.    IN      TXT
_acme-challenge.ki9.gf4.pw. 120 IN      TXT     "FShM6LHSWAu0n4J8Komjn764BlhsEj5phDPm_Fb89jk"

Cool! There it is, for the world to see. But after 65 seconds, certbot fails thusly:

Waiting for verification...
Challenge failed for domain ki9.gf4.pw
dns-01 challenge for ki9.gf4.pw
Cleaning up challenges
Some challenges have failed.

 - The following errors were reported by the server:

   Domain: ki9.gf4.pw
   Type:   unauthorized
   Detail: No TXT record found at _acme-challenge.ki9.gf4.pw

   To fix these errors, please make sure that your domain name was
   entered correctly and the DNS A/AAAA record(s) for that domain
   contain(s) the right IP address.

What do you mean "No TXT record found at _acme-challenge.ki9.gf4.pw"? We all fucking saw it.

Despite failing, certbot cleans up its old record successfully:

Oct 11 11:41:50 myhost named[1278889]: client @0x7f97f404b318 <LOCAL_IP>#58656/key letsencrypt: updating zone 'gf4.pw/IN': deleting an RR at _acme-challenge.ki9.gf4.pw TXT

The nameservers associated with your domain don't seem to agree (by a decent amount) on the SOA serial number. Are you sure your update process is getting the change to both of them?

>dig gf4.pw ns +noall +answer
gf4.pw.                 43136   IN      NS      krow.gf4.pw.
gf4.pw.                 43136   IN      NS      ksn.gf4.pw.

>dig gf4.pw soa @krow.gf4.pw +noall +answer
gf4.pw.                 604800  IN      SOA     ksn.gf4.pw. bind-ksn.gf4.pw. 129 604800 86400 2419200 604800

>dig gf4.pw soa @ksn.gf4.pw +noall +answer
gf4.pw.                 604800  IN      SOA     ksn.gf4.pw. bind-ksn.gf4.pw. 150 604800 86400 2419200 604800

Notice how the SOA serial is 129 when asking krow.gf4.pw and 150 when asking ksn.gf4.pw. I realize it's possible for different nameservers to serve the same zone data under different serials. But it's somewhat unusual and, considering the problem you're having, it would imply your TXT updates aren't making it to both copies of the zone.
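The comparison above can be scripted so it is easy to repeat after every change. This is only a rough sketch: it assumes `dig` is installed, the function names are made up for illustration, and "newest" is taken to mean the numerically largest serial, which ignores serial wraparound.

```python
import subprocess
from typing import Dict, List


def parse_soa_serial(answer_line: str) -> int:
    """Extract the serial from one line of `dig <zone> soa +noall +answer`."""
    fields = answer_line.split()
    soa_index = fields.index("SOA")
    # After "SOA" come the primary name, the contact, then the serial.
    return int(fields[soa_index + 3])


def find_stale_servers(serials: Dict[str, int]) -> List[str]:
    """Return the nameservers whose serial is behind the highest one seen."""
    newest = max(serials.values())
    return sorted(ns for ns, serial in serials.items() if serial != newest)


def query_soa_line(zone: str, nameserver: str) -> str:
    """Ask one nameserver directly for the zone's SOA record (needs dig)."""
    result = subprocess.run(
        ["dig", zone, "soa", f"@{nameserver}", "+noall", "+answer"],
        capture_output=True, text=True, check=True, timeout=15,
    )
    return result.stdout.strip()


# Using the two answers quoted above instead of live queries:
answers = {
    "krow.gf4.pw": "gf4.pw. 604800 IN SOA ksn.gf4.pw. bind-ksn.gf4.pw. 129 604800 86400 2419200 604800",
    "ksn.gf4.pw": "gf4.pw. 604800 IN SOA ksn.gf4.pw. bind-ksn.gf4.pw. 150 604800 86400 2419200 604800",
}
serials = {ns: parse_soa_serial(line) for ns, line in answers.items()}
print(find_stale_servers(serials))  # ['krow.gf4.pw']
```

For a live check, build `answers` with `query_soa_line("gf4.pw", ns)` for each nameserver instead of the pasted lines.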


@rmbolger Thanks. I think you're right; the issue is that the slave, krow.gf4.pw, isn't updating. I checked the propagation with dnschecker.org and saw only a few of the servers resolving the record; the rest were using the slave that isn't updating.

After running certbot again, it magically worked. I suppose this time it happened to use the ksn.gf4.pw nameserver.

So that must have been the problem. I have my certificates for now and will go fix the other ns.
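For anyone hitting the same thing: a recursive lookup only shows what one of the nameservers returned, so it can look fine while the other copy of the zone is stale. A rough sketch of asking each authoritative server directly is below. It assumes `dig` is available, the helper names are hypothetical, and the parsing only handles simple single-string TXT records.

```python
import subprocess
from typing import Dict, List, Set


def parse_txt_values(dig_answer: str) -> Set[str]:
    """Collect TXT values from `dig <name> TXT +noall +answer` output."""
    values = set()
    for line in dig_answer.splitlines():
        fields = line.split(None, 4)  # name, TTL, class, type, data
        if len(fields) == 5 and fields[3] == "TXT":
            values.add(fields[4].strip().strip('"'))
    return values


def servers_missing_token(answers: Dict[str, str], token: str) -> List[str]:
    """Return the nameservers whose answer does not contain the token."""
    return sorted(
        ns for ns, answer in answers.items()
        if token not in parse_txt_values(answer)
    )


def fetch_txt_answer(name: str, nameserver: str) -> str:
    """Query one nameserver directly for the TXT record (needs dig)."""
    result = subprocess.run(
        ["dig", name, "TXT", f"@{nameserver}", "+noall", "+answer"],
        capture_output=True, text=True, check=True, timeout=15,
    )
    return result.stdout


# Illustration with canned answers: one server has the record, one does not.
token = "FShM6LHSWAu0n4J8Komjn764BlhsEj5phDPm_Fb89jk"
answers = {
    "ksn.gf4.pw": f'_acme-challenge.ki9.gf4.pw. 120 IN TXT "{token}"',
    "krow.gf4.pw": "",
}
print(servers_missing_token(answers, token))  # ['krow.gf4.pw']
```

In a live check, each entry of `answers` would come from `fetch_txt_answer("_acme-challenge.ki9.gf4.pw", ns)` while certbot is waiting for propagation.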


Let's Encrypt validates from 4 vantage points: 1 primary and 3 secondary. With two nameservers of which only one is serving the record, and assuming each vantage point picks a nameserver at random and every one of them has to see the record, the chance that the whole validation succeeds is ½ · ½ · ½ · ½ = 1/16.

So statistically, one out of sixteen attempts would succeed.
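The arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch under the same simplifying assumptions: each vantage point hits a working nameserver independently with probability ½, and all four have to succeed. The function names are just for illustration.

```python
from fractions import Fraction


def all_pass_probability(p_good: Fraction, vantage_points: int = 4) -> Fraction:
    """Chance that every vantage point independently hits a good nameserver."""
    return p_good ** vantage_points


def at_least_one_success(p_attempt: Fraction, attempts: int) -> float:
    """Chance that at least one of several independent attempts succeeds."""
    return float(1 - (1 - p_attempt) ** attempts)


p = all_pass_probability(Fraction(1, 2))
print(p)      # 1/16
print(1 / p)  # 16, the expected number of attempts until one succeeds
print(round(at_least_one_success(p, 16), 3))  # about 0.644
```

So even sixteen retries in a row only succeed roughly two times out of three under these assumptions, which is another reason to fix the stale nameserver rather than rely on luck.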


Presuming that any one of them is actually correctly updated at all - LOL
[that remains to be proven]

In that case, I'll just set the renewal cronjob to run 16x more frequently and call it a day. 😄