My Letsencrypt certificate fails to renew randomly

sebastiannielsen · July 23, 2019, 1:13am

Please fill out the fields below so we can help you better. Note: you must provide your domain name to get help. Domain names for issued certificates are all made public in Certificate Transparency logs (e.g. https://crt.sh/?q=example.com), so withholding your domain name here does not increase secrecy, but only makes it harder for us to provide help.

My domain is: sebbe.eu

I ran this command: (custom script)

It produced this output:
First run:
root@sebastian-desktop:/etc/nsd# ./certrenew.pl
Creating challenge for sebbe.eu
Creating challenge for www.sebbe.eu
Creating challenge for dns1.sebbe.eu
Creating challenge for dns2.sebbe.eu
Creating challenge for printer.sebbe.eu
Creating challenge for mail.sebbe.eu
Creating challenge for smtp.sebbe.eu
Creating challenge for imap.sebbe.eu
Writing challenges to zone file
Signing DNSSEC data…
Submitting challenges for validation…
Getting validation results…
Failed authorization for “www.sebbe.eu”! at ./certrenew.pl line 156.

Second run:
root@sebastian-desktop:/etc/nsd# ./certrenew.pl
Creating challenge for sebbe.eu
Creating challenge for www.sebbe.eu
Creating challenge for dns1.sebbe.eu
Creating challenge for dns2.sebbe.eu
Creating challenge for printer.sebbe.eu
Creating challenge for mail.sebbe.eu
Creating challenge for smtp.sebbe.eu
Creating challenge for imap.sebbe.eu
Writing challenges to zone file
Signing DNSSEC data…
Submitting challenges for validation…
Getting validation results…
Failed authorization for “mail.sebbe.eu”! at ./certrenew.pl line 156.

My web server is (include version): NGINX (not applicable since I use dns-01)

The operating system my web server runs on is (include version):
Linux sebastian-desktop 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

My hosting provider, if applicable, is:

I can login to a root shell on my machine (yes or no, or I don’t know): yes

I’m using a control panel to manage my site (no, or provide the name and version of the control panel):
no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you’re using Certbot):
Script source:
https://pastebin.com/dTrqbeMq

zonefile (signed):
https://pastebin.com/rJ4EZxJn

zonefile (unsigned):
https://pastebin.com/Uievsynr

NOTE: The script has worked previously. Last good certificate from that script is:
https://crt.sh/?id=1488350275

First fail of above script: 2019-06-13

(Did something at Let's Encrypt change between 2019-05-13 and 2019-06-13?)

NOTE: dns1.sebbe.eu and dns2.sebbe.eu are the SAME physical machine. Thus there are no zone transfers or delays involved; all DNS changes go live immediately unless there's a cache in front of Let's Encrypt. (There's a reverse NAT in front of that machine that ensures requests for 2001:470:dff1:1:10::1, 2001:470:dff1:1:10::2, 193.187.91.106 and 185.86.106.232 are routed to the very same machine.)
The reason for the reverse NAT is to work around a registrar limitation: you need TWO operational nameservers with different IPs to be able to set custom nameservers for the domain.

_az · July 23, 2019, 1:23am

Let's Encrypt will respect your records' TTLs, up to a ceiling of 60 seconds.

Looking at your script, it looks like you use 3600s.

Maybe try setting a TTL of 0s or 1s and see whether that addresses your problem.
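
As a rough illustration of the suggestion above, a challenge record with a short TTL could be written to the zone file along these lines. This is a minimal Python sketch, not the poster's Perl script; the helper name `challenge_record` and the sample value are made up for the example.

```python
def challenge_record(domain: str, txt_value: str, ttl: int = 1) -> str:
    """Format one zone-file line for a dns-01 challenge TXT record.

    Hypothetical helper: the name and the default TTL are illustrative only.
    """
    if ttl < 0:
        raise ValueError("TTL must not be negative")
    # Fully qualified owner name (trailing dot), explicit TTL, class IN.
    return f'_acme-challenge.{domain}. {ttl} IN TXT "{txt_value}"'


print(challenge_record("mail.sebbe.eu", "example-challenge-value"))
```

The explicit TTL sits between the owner name and the class, so it overrides whatever default TTL the zone file declares for that one record.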

sebastiannielsen · July 23, 2019, 1:26am

tried changing line 111 to:
print ZONEFILE "_acme-challenge.".$domain.". 1 IN TXT \"$b64\"\n";

and rerun:
root@sebastian-desktop:/etc/nsd# ./certrenew.pl
Creating challenge for sebbe.eu
Creating challenge for www.sebbe.eu
Creating challenge for dns1.sebbe.eu
Creating challenge for dns2.sebbe.eu
Creating challenge for printer.sebbe.eu
Creating challenge for mail.sebbe.eu
Creating challenge for smtp.sebbe.eu
Creating challenge for imap.sebbe.eu
Writing challenges to zone file
Signing DNSSEC data…
Submitting challenges for validation…
Getting validation results…
Failed authorization for “imap.sebbe.eu”! at ./certrenew.pl line 156.
root@sebastian-desktop:/etc/nsd#

The results are always random. It's a random domain failing at line 156.

Latest zonefile looks like this now (unsigned):
https://pastebin.com/JgfBuZzM

NOTE: You need to subtract 23 lines, due to the censored private key. Line 156 is 133 in the censored version.

If @jsha wants to debug/try my script with the uncensored private key, I can arrange for a transfer of my LE private key, my certificate private key, and the ability to access root on my server to @jsha. What I need for this is an email address for @jsha, his PGP public key, and an authorized IP address for his personal computer, so I can add that IP to the firewall as permitted to access SSH.

_az · July 23, 2019, 1:35am

Can you post the order URL as well? The actual problem details could help narrow down the nature of the problem (more than just status == invalid anyway).

sebastiannielsen · July 23, 2019, 1:36am

Eeh? Order URLs? This is a v1 client, not a v2 client.

_az · July 23, 2019, 1:38am

Ah: End of Life Plan for ACMEv1

In that case, the authorization URL for the authz that failed?

sebastiannielsen · July 23, 2019, 1:45am

I fetched the authorization URL and got the following:

{
  "identifier": {
    "type": "dns",
    "value": "mail.sebbe.eu"
  },
  "status": "invalid",
  "expires": "2019-07-30T01:43:17Z",
  "challenges": [
    {
      "type": "dns-01",
      "status": "invalid",
      "error": {
        "type": "urn:acme:error:dns",
        "detail": "DNS problem: SERVFAIL looking up TXT for _acme-challenge.mail.sebbe.eu",
        "status": 400
      },
      "uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/4n1LvH4klmTaNFE4AeFtSc4wPQQJcqZv0rHJWDkKCZY/18612767850",
      "token": "lb9CZBSTVMcHhNhg7zEbwP1RwF1ttIl3mRKoWDmh8_w"
    },
    {
      "type": "tls-alpn-01",
      "status": "invalid",
      "uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/4n1LvH4klmTaNFE4AeFtSc4wPQQJcqZv0rHJWDkKCZY/18612767851",
      "token": "Zj1qUYqa8wNpYz5R0yqpG6Qj5U6V2o-E5g0y-YAFuPs"
    },
    {
      "type": "http-01",
      "status": "invalid",
      "uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/4n1LvH4klmTaNFE4AeFtSc4wPQQJcqZv0rHJWDkKCZY/18612767853",
      "token": "9VQtaYwSQbzOEIaCSGDPl7b1Er-xpXqiaxAImUZ10kE"
    }
  ],
  "combinations": [
    [0],
    [1],
    [2]
  ]
}

Seems to give random SERVFAIL (I guess it's DNSSEC failures, but why?)
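
For anyone scripting this, the useful part of the authorization object is the "error" member of the challenge that was attempted. A minimal Python sketch of pulling it out follows; the function name `failed_challenges` is made up here, and the sample is a trimmed copy of the response above with tokens and URIs left out.

```python
import json


def failed_challenges(authz_json: str) -> list[tuple[str, str]]:
    """Return (challenge type, error detail) for each challenge carrying an error."""
    authz = json.loads(authz_json)
    failures = []
    for challenge in authz.get("challenges", []):
        error = challenge.get("error")
        if error:  # in the response above only dns-01 has an error object
            failures.append((challenge["type"], error.get("detail", "")))
    return failures


# Trimmed copy of the authorization response (tokens and URIs omitted).
sample = """
{
  "identifier": {"type": "dns", "value": "mail.sebbe.eu"},
  "status": "invalid",
  "challenges": [
    {"type": "dns-01", "status": "invalid",
     "error": {"type": "urn:acme:error:dns",
               "detail": "DNS problem: SERVFAIL looking up TXT for _acme-challenge.mail.sebbe.eu",
               "status": 400}},
    {"type": "tls-alpn-01", "status": "invalid"},
    {"type": "http-01", "status": "invalid"}
  ]
}
"""

for kind, detail in failed_challenges(sample):
    print(f"{kind}: {detail}")
```

Printing the detail string instead of only "Failed authorization" would have shown the SERVFAIL straight away.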

Checking with an online checker gives a positive result:
https://dnslookup.org/_acme-challenge.mail.sebbe.eu/TXT/#delegation
https://dnslookup.org/_acme-challenge.mail.sebbe.eu/TXT/#dnssec

nameserver check returns positive result (except for some crying about nameserver IPs being in the same ASN):

Dnsstuff also returns positive result (except for the mailserver checks which seem to fail due to my phishing filter):

_az · July 23, 2019, 1:56am

Hmm. I wonder if Let’s Encrypt’s Unbound deployment is wigging out on the combination of your specific and wildcard TXT records. Maybe some kind of interaction with 0x20 case randomization.

Edit: managed to hit a SERVFAIL on Unboundtest: https://unboundtest.com/m/TXT/_acme-challenge.mail.sebbe.eu/ZFQFJZSC . As you mention, it’s sporadic/random.

sebastiannielsen · July 23, 2019, 2:01am

That's weird. I have not made any changes to the configuration, and NSD does support 0x20 randomization.

It seems Let's Encrypt has made changes to their configuration, causing it to randomly fail the DS/DNSKEY validation. Why it fails, I don't understand.

Note that dns1.sebbe.eu and dns2.sebbe.eu ARE the exact same physical machine, so they should emit the same results every time.

I also did some packet checks against my IPs, and it seems my IPv4 IPs have slight/rare packet loss. Will Let's Encrypt fail if the DNSSEC validation packet gets lost? Shouldn't it resend a query or response if packets are lost, even if it's UDP?

_az · July 23, 2019, 2:13am

To somewhat further complicate matters, when I try to authorize your domain from Certbot, I don't hit a SERVFAIL, I just hit a wrong TXT record (which means that DNSSEC and everything else passed).

Just for shits and giggles, could you stick a 60 second sleep immediately before your do_challenge loop?

$ sudo certbot-auto certonly -a manual --preferred-challenges dns -d mail.sebbe.eu --dry-run
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for mail.sebbe.eu

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: Y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.mail.sebbe.eu with the following value:

V1FSgSl7lVO6ttHPAn3rQyorVZ2mjBw0Y154aruMC2c

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Challenge failed for domain mail.sebbe.eu
dns-01 challenge for mail.sebbe.eu
Cleaning up challenges
Some challenges have failed.

IMPORTANT NOTES:
- The following errors were reported by the server:

  Domain: mail.sebbe.eu
  Type:   unauthorized
  Detail: Incorrect TXT record
  "V7k5TqVof83cpSQFSxvOsaw6LHf09AV_Ii10JbbgC8U" found at
  _acme-challenge.mail.sebbe.eu

  To fix these errors, please make sure that your domain name was
  entered correctly and the DNS A/AAAA record(s) for that domain
  contain(s) the right IP address.

_az · July 23, 2019, 2:19am

There’s definitely some weird timing going on with your zone rectification.

Just now I was spamming you with authorizations, and they were mass-failing with SERVFAIL (DNSSEC failures).

Then all of a sudden, they started working all at once.

sebastiannielsen · July 23, 2019, 2:20am

Weird. Now it worked with a 60 second delay. I first tried a 10 second delay, but that didn't work. I will try again in a week or two and see if it still holds.
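
Instead of a fixed sleep, a script could poll until the new TXT value is actually being served before submitting the challenges. The following is only a sketch in Python under stated assumptions: `wait_for_txt` and the `lookup_txt` callable are hypothetical names, and the caller has to supply the actual DNS lookup (ideally through a validating resolver). Seeing the record from one vantage point does not guarantee Let's Encrypt's resolver sees the same thing.

```python
import time
from typing import Callable, Iterable


def wait_for_txt(
    name: str,
    expected: str,
    lookup_txt: Callable[[str], Iterable[str]],
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `expected` appears among the TXT values for `name`.

    `lookup_txt` is a hypothetical callable supplied by the caller. It
    should return the TXT strings currently served for `name` and may
    raise on lookup failure, which is treated here as "not yet".
    """
    waited = 0.0
    while True:
        try:
            if expected in set(lookup_txt(name)):
                return True
        except Exception:
            pass  # e.g. SERVFAIL while the zone is being re-signed
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

The `sleep` parameter exists only so the loop can be exercised without real waiting; in a renewal script the default `time.sleep` would be used.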

JuergenAuer · July 23, 2019, 6:54am

Hi @sebastiannielsen

Your configuration is unstable ( https://check-your-website.server-daten.de/?q=sebbe.eu ):

You have 4 ip addresses:

Host	T	IP-Address	is auth.	∑ Queries	∑ Timeout
sebbe.eu	A	185.86.106.232 Malmo/Skåne/Sweden (SE) - Obenetwork Network Hostname: dns2.sebbe.eu	yes	1	0
	A	193.187.91.106 Malmo/Skåne/Sweden (SE) - Obenetwork AB Hostname: dns1.sebbe.eu	yes	1	0
	AAAA	2001:470:dff1:1:10::1 Gothenburg/Västra Götaland/Sweden (SE) - Hurricane Electric LLC	yes
	AAAA	2001:470:dff1:1:10::2 Gothenburg/Västra Götaland/Sweden (SE) - Hurricane Electric LLC	yes

your name servers have the same ip addresses

Domain	Nameserver	NS-IP
www.sebbe.eu	dns1.sebbe.eu	
sebbe.eu	dns1.sebbe.eu	193.187.91.106 Malmo/Skåne/Sweden (SE) - Obenetwork AB
		2001:470:dff1:1:10::1 Gothenburg/Västra Götaland/Sweden (SE) - Hurricane Electric LLC
	dns2.sebbe.eu	185.86.106.232 Malmo/Skåne/Sweden (SE) - Obenetwork Network
		2001:470:dff1:1:10::2 Gothenburg/Västra Götaland/Sweden (SE) - Hurricane Electric LLC

But when checking your URLs, the 185.* IP address doesn't answer.

Domainname	Http-Status	Sec.	G
http://sebbe.eu/ (185.86.106.232)	-2	1.103	V
	ConnectFailure - Unable to connect to the remote server No connection could be made because the target machine actively refused it 185.86.106.232:80
http://sebbe.eu/ (193.187.91.106)	200	0.087	H
http://sebbe.eu/ (2001:470:dff1:1:10::1)	200	0.077	H
http://sebbe.eu/ (2001:470:dff1:1:10::2)	200	0.080	H

Same with /.well-known/acme-challenge.

So if a client tries to reach your 185.* address as a name server, there may be the same timeout.

But it's random because your IPv6 works.

What's with your 185.* address? Routing problem?
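
A quick way to compare the four addresses yourself is to attempt a TCP connection to each one on the port in question. This is a minimal Python sketch, with `probe_tcp` being a made-up helper name. It only covers TCP, so it says nothing about DNS queries sent over UDP.

```python
import socket
from typing import Callable, Dict, Iterable


def probe_tcp(
    addresses: Iterable[str],
    port: int,
    timeout: float = 3.0,
    connect: Callable = socket.create_connection,
) -> Dict[str, bool]:
    """Try a TCP connection to each address and record whether it succeeded."""
    results = {}
    for address in addresses:
        try:
            conn = connect((address, port), timeout)
        except OSError:
            results[address] = False  # refused, unreachable or timed out
        else:
            conn.close()
            results[address] = True
    return results
```

For example, `probe_tcp(["193.187.91.106", "185.86.106.232"], 80)` would show whether both IPv4 addresses accept connections on port 80, and the same call with port 53 checks DNS over TCP.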

cpu · July 23, 2019, 1:19pm

I'm afraid our recursive resolver configuration hasn't been changed recently.

mnordhoff · July 23, 2019, 1:50pm

This is just random unresearched guessing, but have you made any changes to the DNSSEC configuration recently? Changed the DNSKEY or DS records? Rolled algorithms?

Is it possible some of the TLD nameservers might be returning out-of-date records?

If I remember correctly, it’s not a problem to use NSEC with algorithm 7 – it adds support for NSEC3 but doesn’t require that you use it. IIRC. [Edit: This was correct.]

Are you running the most recent version of NSD? Any chance it’s fixed relevant bugs recently?

That Unbound error message looks unusual, but I’m not sure if it really is unusual, or when and why it happens.

validator/validator.c:

        /* If the key entry isBad or isNull, then we can move on to the next
         * state. */
        if(!key_entry_isgood(vq->key_entry)) {
                if(key_entry_isbad(vq->key_entry)) {
                        if(vq->restart_count < VAL_MAX_RESTART_COUNT) {
                                val_blacklist(&vq->chain_blacklist, 
                                        qstate->region, origin, 1);
                                qstate->errinf = NULL;
                                vq->restart_count++;
                                vq->key_entry = old;
                                return;
                        }
                        verbose(VERB_DETAIL, "Did not match a DS to a DNSKEY, "
                                "thus bogus.");
                        errinf(qstate, reason);
                        errinf_origin(qstate, origin);
                        errinf_dname(qstate, "for key", qinfo->qname);
                }

I think a comment elsewhere in Unbound said it would log why something was “bad” at a higher log level.

Edit:

Is it just me, or does it look like unboundtest is resolving sebbe.eu./DNSKEY more than once, apparently discarding the result without logging why, and then trying again? That’s weird, right?

If I resolve my own domain, it only does “resolving example.com. DNSKEY IN” once.

Could the NAT stuff be corrupting DNS responses?

Could the network or DNS server be doing rate limiting, and sometimes replacing their responses with something useless?

(RRL?)

sebastiannielsen · July 24, 2019, 1:20am

Aha. I checked and I had forgotten to open HTTP (port 80) on the 185.* IP. But should it have any effect? I use the dns-01 challenge. Could the failure to connect on port 80 cause any issues in the validation process?

I have thoroughly checked the nameserver ports and they're fine, no problems there. I use http://zonemaster.iis.se and it reports my nameservers as fine.

I have now fixed ports 80 and 443, so it should be consistent.

mnordhoff · July 24, 2019, 1:40am

No -- if you're using DNS validation, Let's Encrypt doesn't try to connect to your web server. You don't have to be running one. You don't even have to have A or AAAA records.

DNSViz also said it was fine, but there could be some intermittent or regional or unusual problem that has gone undetected.

sebastiannielsen · July 24, 2019, 1:48am

Wait, I remember that I enabled rate limiting on NSD because some security scanner said I was vulnerable to DDoS.

Could these pose a problem?

rrl-ratelimit: 25
rrl-slip: 4

mnordhoff · July 24, 2019, 2:11am

Plausibly.

It’s plausible that Let’s Encrypt could send you 25 queries in 1 second – though the resolver would probably send some using IPv4 and some using IPv6, so NSD would count it as at least two separate sources.

Dropping 75% of queries could plausibly lead to Unbound concluding the servers are all down and giving up.

sebastiannielsen · July 24, 2019, 2:21am

I might misunderstand the consequences of rrl-slip?
What I have understood is that when rrl-ratelimit IS hit, NSD will of course drop queries exceeding the rate limit. When 3 such packets have been dropped, rrl-slip=4 means the fourth packet will get a “slipped” response (one that asks the recursive resolver to retry over TCP: basically a “fake” oversize response telling the resolver that the answer would be too large to send over UDP, which nudges it to use TCP instead, even for answers that are a fine size).
If the recursive resolver then continues sending packets via UDP, the next 3 packets after the slipped response will be dropped and the eighth packet (counted from when rrl-ratelimit was hit) will get a new slipped response.
rrl-slip=1 then means that every packet above the rate limit gets a “slipped” response, which might be bad because the DNS server can still be used as a DDoS tool by using the “slipped” packets themselves as the DoS. rrl-slip=0 disables the slip mechanism altogether: 100% of packets are dropped after hitting the rate limit, but then a legitimate resolver never gets a remediation when the rate limit is hit.
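
To make the arithmetic concrete, here is a small Python model of the behaviour as described in this post. It is a simplification for illustration, not NSD's actual code, and `rrl_actions` is a made-up name.

```python
def rrl_actions(over_limit_queries: int, slip: int) -> list[str]:
    """Model what happens to each query that arrives above the rate limit.

    Simplified model of the behaviour described above, not NSD's code:
    slip == 0 drops everything, slip == 1 answers every query with a
    truncated ("slip") reply, slip == n answers every n-th query that way.
    """
    if slip < 0:
        raise ValueError("slip must be >= 0")
    actions = []
    for i in range(1, over_limit_queries + 1):
        if slip > 0 and i % slip == 0:
            actions.append("slip")
        else:
            actions.append("drop")
    return actions


actions = rrl_actions(8, slip=4)
print(actions)
# ['drop', 'drop', 'drop', 'slip', 'drop', 'drop', 'drop', 'slip']
dropped = actions.count("drop") / len(actions)
print(f"dropped: {dropped:.0%}")  # dropped: 75%
```

With rrl-slip=4, three out of every four over-limit queries get no answer at all, which is where the 75% figure mentioned earlier comes from.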

Have now upped the ratelimit to 50 QPS, should be enough.
