My Let's Encrypt certificate randomly fails to renew

Please fill out the fields below so we can help you better. Note: you must provide your domain name to get help. Domain names for issued certificates are all made public in Certificate Transparency logs (e.g. https://crt.sh/?q=example.com), so withholding your domain name here does not increase secrecy, but only makes it harder for us to provide help.

My domain is: sebbe.eu

I ran this command: (custom script)

It produced this output:
First run:
root@sebastian-desktop:/etc/nsd# ./certrenew.pl
Creating challenge for sebbe.eu
Creating challenge for www.sebbe.eu
Creating challenge for dns1.sebbe.eu
Creating challenge for dns2.sebbe.eu
Creating challenge for printer.sebbe.eu
Creating challenge for mail.sebbe.eu
Creating challenge for smtp.sebbe.eu
Creating challenge for imap.sebbe.eu
Writing challenges to zone file
Signing DNSSEC data…
Submitting challenges for validation…
Getting validation results…
Failed authorization for “www.sebbe.eu”! at ./certrenew.pl line 156.

Second run:
root@sebastian-desktop:/etc/nsd# ./certrenew.pl
Creating challenge for sebbe.eu
Creating challenge for www.sebbe.eu
Creating challenge for dns1.sebbe.eu
Creating challenge for dns2.sebbe.eu
Creating challenge for printer.sebbe.eu
Creating challenge for mail.sebbe.eu
Creating challenge for smtp.sebbe.eu
Creating challenge for imap.sebbe.eu
Writing challenges to zone file
Signing DNSSEC data…
Submitting challenges for validation…
Getting validation results…
Failed authorization for “mail.sebbe.eu”! at ./certrenew.pl line 156.

My web server is (include version): NGINX (not applicable since I use dns-01)

The operating system my web server runs on is (include version):
Linux sebastian-desktop 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

My hosting provider, if applicable, is:

I can login to a root shell on my machine (yes or no, or I don’t know): yes

I’m using a control panel to manage my site (no, or provide the name and version of the control panel):
no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you’re using Certbot):
Script source:
https://pastebin.com/dTrqbeMq

zonefile (signed):
https://pastebin.com/rJ4EZxJn

zonefile (unsigned):
https://pastebin.com/Uievsynr

NOTE: The script has worked previously. The last good certificate from that script is:
https://crt.sh/?id=1488350275

First fail of above script: 2019-06-13

(Did something at Let's Encrypt change between 2019-05-13 and 2019-06-13?)

NOTE: dns1.sebbe.eu and dns2.sebbe.eu are the SAME physical machine. Thus there are no zone transfers or delays involved; all DNS changes go live immediately unless there's a cache in front of Let's Encrypt. (There's a reverse NAT in front of that machine that ensures requests for 2001:470:dff1:1:10::1, 2001:470:dff1:1:10::2, 193.187.91.106 and 185.86.106.232 are routed to the very same machine.)
The reason for the reverse NAT is to bypass a registrar limitation: you need TWO operational nameservers with different IPs to be able to set custom nameservers for the domain.

Let's Encrypt will respect your records' TTLs, up to a ceiling of 60 seconds.

Looking at your script, it looks like you use 3600s.

Maybe try setting a TTL of 0s or 1s and see whether that addresses your problem.
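
To make the arithmetic concrete, here is a minimal sketch of that clamping rule, assuming the resolver simply caches a record for the smaller of its TTL and 60 seconds. The function and constant names are made up for illustration and are not taken from Let's Encrypt's resolver configuration.

```python
# Hypothetical illustration: how long a resolver that caps cache lifetimes
# at 60 seconds might keep a record, given the record's TTL.
CACHE_CEILING_SECONDS = 60  # ceiling quoted above; the name is an assumption


def effective_cache_seconds(record_ttl: int) -> int:
    """Return the cache lifetime after applying the ceiling."""
    if record_ttl < 0:
        raise ValueError("TTL must be non-negative")
    return min(record_ttl, CACHE_CEILING_SECONDS)


for ttl in (3600, 60, 1, 0):
    print(ttl, "->", effective_cache_seconds(ttl))
```

Under that assumption, a 3600s TTL could leave a stale challenge value cached for up to 60 seconds between runs, while a TTL of 1s would shrink that window to about a second.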

I tried changing line 111 to:
print ZONEFILE "_acme-challenge.".$domain.". 1 IN TXT \"$b64\"\n";

and rerun:
root@sebastian-desktop:/etc/nsd# ./certrenew.pl
Creating challenge for sebbe.eu
Creating challenge for www.sebbe.eu
Creating challenge for dns1.sebbe.eu
Creating challenge for dns2.sebbe.eu
Creating challenge for printer.sebbe.eu
Creating challenge for mail.sebbe.eu
Creating challenge for smtp.sebbe.eu
Creating challenge for imap.sebbe.eu
Writing challenges to zone file
Signing DNSSEC data…
Submitting challenges for validation…
Getting validation results…
Failed authorization for “imap.sebbe.eu”! at ./certrenew.pl line 156.
root@sebastian-desktop:/etc/nsd#

The results are always random. It's a random domain failing at line 156.

Latest zonefile looks like this now (unsigned):
https://pastebin.com/JgfBuZzM

NOTE: You need to subtract 23 lines, due to the censored private key. Line 156 is 133 in the censored version.

If @jsha wants to debug/try my script with the uncensored private key, I can arrange a transfer of my LE private key and my certificate private key, plus root access to my server, to @jsha. What I need for this is an email address for @jsha, his PGP public key, and an authorized IP address for his personal computer, so I can add that IP to the firewall as permitted to access SSH.

Can you post the order URL as well? The actual problem details could help narrow down the nature of the problem (more than just status == invalid anyway).

Eh? Order URLs? This is a v1 client, not a v2 client.

Ah: End of Life Plan for ACMEv1

In that case, the authorization URL for the authz that failed?

I did fetch the authorization URL and got following:

{
  "identifier": {
    "type": "dns",
    "value": "mail.sebbe.eu"
  },
  "status": "invalid",
  "expires": "2019-07-30T01:43:17Z",
  "challenges": [
    {
      "type": "dns-01",
      "status": "invalid",
      "error": {
        "type": "urn:acme:error:dns",
        "detail": "DNS problem: SERVFAIL looking up TXT for _acme-challenge.mail.sebbe.eu",
        "status": 400
      },
      "uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/4n1LvH4klmTaNFE4AeFtSc4wPQQJcqZv0rHJWDkKCZY/18612767850",
      "token": "lb9CZBSTVMcHhNhg7zEbwP1RwF1ttIl3mRKoWDmh8_w"
    },
    {
      "type": "tls-alpn-01",
      "status": "invalid",
      "uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/4n1LvH4klmTaNFE4AeFtSc4wPQQJcqZv0rHJWDkKCZY/18612767851",
      "token": "Zj1qUYqa8wNpYz5R0yqpG6Qj5U6V2o-E5g0y-YAFuPs"
    },
    {
      "type": "http-01",
      "status": "invalid",
      "uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/4n1LvH4klmTaNFE4AeFtSc4wPQQJcqZv0rHJWDkKCZY/18612767853",
      "token": "9VQtaYwSQbzOEIaCSGDPl7b1Er-xpXqiaxAImUZ10kE"
    }
  ],
  "combinations": [
    [
      0
    ],
    [
      1
    ],
    [
      2
    ]
  ]
}

Seems to give random SERVFAIL (I guess it's DNSSEC failures, but why?)
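
For anyone who wants to pull the failure reason out of such an authorization object automatically instead of reading it by eye, here is a small sketch. It only assumes the field names visible in the JSON above; `challenge_errors` is a hypothetical helper name, and the embedded JSON is a trimmed copy of that response.

```python
import json

# Trimmed copy of the authorization object above (only the fields used here).
AUTHZ_JSON = """
{
  "identifier": {"type": "dns", "value": "mail.sebbe.eu"},
  "status": "invalid",
  "challenges": [
    {"type": "dns-01", "status": "invalid",
     "error": {"type": "urn:acme:error:dns",
               "detail": "DNS problem: SERVFAIL looking up TXT for _acme-challenge.mail.sebbe.eu",
               "status": 400}},
    {"type": "tls-alpn-01", "status": "invalid"},
    {"type": "http-01", "status": "invalid"}
  ]
}
"""


def challenge_errors(authz):
    """Return (challenge type, error type, detail) for each challenge that carries an error."""
    found = []
    for challenge in authz.get("challenges", []):
        error = challenge.get("error")
        if error:
            found.append((challenge["type"], error.get("type", ""), error.get("detail", "")))
    return found


authz = json.loads(AUTHZ_JSON)
for ctype, etype, detail in challenge_errors(authz):
    print(f"{authz['identifier']['value']}: {ctype} failed ({etype}): {detail}")
```

Only the dns-01 challenge carries an error object here; the other two are marked invalid simply because the authorization as a whole failed.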

Checking with an online checker gives a positive result:
https://dnslookup.org/_acme-challenge.mail.sebbe.eu/TXT/#delegation
https://dnslookup.org/_acme-challenge.mail.sebbe.eu/TXT/#dnssec

nameserver check returns positive result (except for some crying about nameserver IPs being in the same ASN):


Dnsstuff also returns positive result (except for the mailserver checks which seem to fail due to my phishing filter):

Hmm. I wonder if Let’s Encrypt’s Unbound deployment is wigging out on the combination of your specific and wildcard TXT records. Maybe some kind of interaction with 0x20 case randomization.

Edit: managed to hit a SERVFAIL on Unboundtest: https://unboundtest.com/m/TXT/_acme-challenge.mail.sebbe.eu/ZFQFJZSC . As you mention, it’s sporadic/random.

That's weird. I have not made any changes to the configuration, and NSD does support 0x20 randomization.

It seems Let's Encrypt has made changes to its configuration causing it to randomly fail the DS/DNSKEY validation. Why it fails, I don't understand.

Note that dns1.sebbe.eu and dns2.sebbe.eu ARE the exact same physical machine, so they should emit the same results every time.

I also did some packet checks against my IPs, and it seems my IPv4 IPs have slight, rare packet loss. Will Let's Encrypt fail if the DNSSEC validation packet gets lost? Shouldn't it resend a query or response if packets are lost, even if it's UDP?

To somewhat further complicate matters, when I try to authorize your domain from Certbot, I don't hit a SERVFAIL, I just hit a wrong TXT record (which means that DNSSEC and everything else passed).

Just for shits and giggles, could you stick a 60 second sleep immediately before your do_challenge loop?

$ sudo certbot-auto certonly -a manual --preferred-challenges dns -d mail.sebbe.eu --dry-run
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for mail.sebbe.eu

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: Y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.mail.sebbe.eu with the following value:

V1FSgSl7lVO6ttHPAn3rQyorVZ2mjBw0Y154aruMC2c

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Challenge failed for domain mail.sebbe.eu
dns-01 challenge for mail.sebbe.eu
Cleaning up challenges
Some challenges have failed.

IMPORTANT NOTES:
- The following errors were reported by the server:

  Domain: mail.sebbe.eu
  Type:   unauthorized
  Detail: Incorrect TXT record
  "V7k5TqVof83cpSQFSxvOsaw6LHf09AV_Ii10JbbgC8U" found at
  _acme-challenge.mail.sebbe.eu

  To fix these errors, please make sure that your domain name was
  entered correctly and the DNS A/AAAA record(s) for that domain
  contain(s) the right IP address.

There’s definitely some weird timing going on with your zone rectification.

Just now I was spamming you with authorizations, and they were mass-failing with SERVFAIL (DNSSEC failures).

Then all of a sudden, they started working all at once.

Weird. Now it worked with a 60 second delay. I tried first with 10 second delay but didn’t work. I will try again in a week or two and see if it still holds.
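
For anyone scripting the same workaround: instead of a fixed sleep, one option is to poll until some check passes and only then submit the challenges. This is only a sketch. `wait_until` and its parameters are hypothetical names and not part of certrenew.pl, the check used below is a stand-in, and whether polling would actually have avoided the failures in this thread is not established.

```python
import time


def wait_until(check, timeout=120.0, interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns True or `timeout` elapses.

    Returns True if check() succeeded, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() + interval > deadline:
            return False
        sleep(interval)


# Stand-in check for demonstration: succeeds on the third call. A real check
# would query the authoritative server for the expected TXT value.
attempts = {"n": 0}


def fake_txt_is_published():
    attempts["n"] += 1
    return attempts["n"] >= 3


print(wait_until(fake_txt_is_published, timeout=1.0, interval=0.01))
```

The `sleep` and `clock` arguments exist only so the helper can be exercised without real waiting; with the defaults it behaves like a sleep loop with a two-minute cap.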

Hi @sebastiannielsen

your configuration is unstable ( https://check-your-website.server-daten.de/?q=sebbe.eu ):

You have 4 ip addresses:

| Host | T | IP-Address | is auth. | ∑ Queries | ∑ Timeout |
|---|---|---|---|---|---|
| sebbe.eu | A | 185.86.106.232, Malmo/Skåne/Sweden (SE) - Obenetwork Network, Hostname: dns2.sebbe.eu | yes | 1 | 0 |
| | A | 193.187.91.106, Malmo/Skåne/Sweden (SE) - Obenetwork AB, Hostname: dns1.sebbe.eu | yes | 1 | 0 |
| | AAAA | 2001:470:dff1:1:10::1, Gothenburg/Västra Götaland/Sweden (SE) - Hurricane Electric LLC | yes | | |
| | AAAA | 2001:470:dff1:1:10::2, Gothenburg/Västra Götaland/Sweden (SE) - Hurricane Electric LLC | yes | | |

your name servers have the same ip addresses

| Domain | Nameserver | NS-IP |
|---|---|---|
| www.sebbe.eu | dns1.sebbe.eu | |
| sebbe.eu | dns1.sebbe.eu | 193.187.91.106, Malmo/Skåne/Sweden (SE) - Obenetwork AB |
| | | 2001:470:dff1:1:10::1, Gothenburg/Västra Götaland/Sweden (SE) - Hurricane Electric LLC |
| | dns2.sebbe.eu | 185.86.106.232, Malmo/Skåne/Sweden (SE) - Obenetwork Network |
| | | 2001:470:dff1:1:10::2, Gothenburg/Västra Götaland/Sweden (SE) - Hurricane Electric LLC |

but checking your urls the 185 ip address doesn't answer.

| Domainname | Http-Status | redirect | Sec. | G |
|---|---|---|---|---|
| http://sebbe.eu/ 185.86.106.232 | -2 | | 1.103 | V |
| http://sebbe.eu/ 193.187.91.106 | 200 | | 0.087 | H |
| http://sebbe.eu/ 2001:470:dff1:1:10::1 | 200 | | 0.077 | H |
| http://sebbe.eu/ 2001:470:dff1:1:10::2 | 200 | | 0.080 | H |

ConnectFailure - Unable to connect to the remote server No connection could be made because the target machine actively refused it 185.86.106.232:80

Same with /.well-known/acme-challenge.

So if a client tries to connect to your 185.* address as a name server, there may be the same timeout.

But it's random because your ipv6 works.

What's with your 185.* address? Routing problem?

I'm afraid our recursive resolver configuration hasn't been changed recently.

This is just random unresearched guessing, but have you made any changes to the DNSSEC configuration recently? Changed the DNSKEY or DS records? Rolled algorithms?

Is it possible some of the TLD nameservers might be returning out-of-date records?

If I remember correctly, it’s not a problem to use NSEC with algorithm 7 – it adds support for NSEC3 but doesn’t require that you use it. IIRC. [Edit: This was correct.]

Are you running the most recent version of NSD? Any chance it’s fixed relevant bugs recently?

That Unbound error message looks unusual, but I'm not sure whether it really is unusual, or when and why it happens.

validator/validator.c:

        /* If the key entry isBad or isNull, then we can move on to the next
         * state. */
        if(!key_entry_isgood(vq->key_entry)) {
                if(key_entry_isbad(vq->key_entry)) {
                        if(vq->restart_count < VAL_MAX_RESTART_COUNT) {
                                val_blacklist(&vq->chain_blacklist, 
                                        qstate->region, origin, 1);
                                qstate->errinf = NULL;
                                vq->restart_count++;
                                vq->key_entry = old;
                                return;
                        }
                        verbose(VERB_DETAIL, "Did not match a DS to a DNSKEY, "
                                "thus bogus.");
                        errinf(qstate, reason);
                        errinf_origin(qstate, origin);
                        errinf_dname(qstate, "for key", qinfo->qname);
                }

I think a comment elsewhere in Unbound said it would log why something was “bad” at a higher log level.

:confused:

Edit:

Is it just me, or does it look like unboundtest is resolving sebbe.eu./DNSKEY more than once, apparently discarding the result without logging why, and then trying again? That’s weird, right?

If I resolve my own domain, it only does “resolving example.com. DNSKEY IN” once.

Could the NAT stuff be corrupting DNS responses?

Could the network or DNS server be doing rate limiting, and sometimes replacing their responses with something useless?

(RRL?)

Aha. I checked and I had forgotten to open HTTP (port 80) on the 185.* IP. But should it have any effect, since I use the dns-01 challenge? Could the failure to connect on port 80 cause any issues in the validation process?

I have thoroughly checked the nameserver ports and they're fine; no problems there. I use http://zonemaster.iis.se and it reports my nameservers as fine.

I have now fixed ports 80 and 443, so it should be consistent.

No -- if you're using DNS validation, Let's Encrypt doesn't try to connect to your web server. You don't have to be running one. You don't even have to have A or AAAA records.

DNSViz also said it was fine, but there could be some intermittent or regional or unusual problem that has gone undetected.

Wait, I remember that I enabled rate limiting on NSD because some security scanner said I was vulnerable to DDoS.

Could these pose a problem?

rrl-ratelimit: 25
rrl-slip: 4

Plausibly.

It’s plausible that Let’s Encrypt could send you 25 queries in 1 second – though the resolver would probably send some using IPv4 and some using IPv6, so NSD would count it as at least two separate sources.

Dropping 75% of queries could plausibly lead to Unbound concluding the servers are all down and giving up.
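
A rough sketch of that counting argument, assuming the limiter counts queries per source over a one-second window. The per-address bucketing is a simplification made up for illustration (NSD's real RRL groups sources differently), and the query mix and addresses are invented.

```python
from collections import Counter

RATELIMIT_QPS = 25  # rrl-ratelimit value from the config above


def over_limit_sources(query_sources, limit=RATELIMIT_QPS):
    """Given the source address of each query seen in one second,
    return {source: number of queries beyond the limit}."""
    counts = Counter(query_sources)
    return {src: n - limit for src, n in counts.items() if n > limit}


# Invented example: 40 queries in one second, using documentation-range addresses.
one_source = ["192.0.2.1"] * 40
split_sources = ["192.0.2.1"] * 20 + ["2001:db8::1"] * 20

print(over_limit_sources(one_source))     # {'192.0.2.1': 15}
print(over_limit_sources(split_sources))  # {}
```

So the same burst can trip the limit or not, depending on how the resolver happens to spread its queries across IPv4 and IPv6.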

I might misunderstand the consequences of rrl-slip?
What I have understood is that when rrl-ratelimit IS hit, nsd will of course drop queries exceeding the ratelimit. When 3 such packets are dropped, rrl-slip=4 means the fourth packet will then get a “slipped” response (one that asks the recursive resolver to retry with TCP: basically a “fake” oversize response that tells the recursive resolver the answer would be too large to send over UDP, which prompts it to use TCP instead, even for answers of a good size).
If the recursive resolver then continues sending packets via UDP, the next 3 packets after the slipped response will be dropped and the eighth packet (counted from when rrl-ratelimit was hit) will get a new slipped response.
rrl-slip=1 then means that every packet above the ratelimit gets a “slipped” response, which might be bad as it means the DNS server can still be used as a DDoS tool by using the “slipped” packets themselves as the DoS; rrl-slip=0 disables the slip mechanism altogether, so 100% of packets are dropped after hitting the ratelimit, but then a legit resolver will never get a remediation when the ratelimit is hit.
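
The behaviour described above can be written down as a tiny simulation. It models only this post's understanding of rrl-slip (every slip-th over-limit packet gets a truncated reply, the rest are dropped), not NSD's actual source code, and `over_limit_actions` is a hypothetical name.

```python
def over_limit_actions(n_packets, slip):
    """Model the description above: for packets arriving after the rate limit
    is hit, every `slip`-th one gets a truncated ("slipped") reply and the
    rest are dropped. slip=0 means nothing is ever slipped."""
    actions = []
    for i in range(1, n_packets + 1):
        if slip > 0 and i % slip == 0:
            actions.append("slip")
        else:
            actions.append("drop")
    return actions


for slip in (4, 1, 0):
    actions = over_limit_actions(8, slip)
    dropped = actions.count("drop") / len(actions)
    print(f"rrl-slip={slip}: {actions} ({dropped:.0%} dropped)")
```

With rrl-slip=4 this gives three drops followed by one slipped reply, repeating, which is where the 75% figure for dropped over-limit queries comes from.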

Have now upped the ratelimit to 50 QPS, should be enough.