DNS Manual Failure Going on 4.75 hours so far

I have 8 certificates, 50 hostnames. I run my own DNS server (and have for the last 30+ years).

I don't have the option of running acme on my web-server, so I use the DNS challenge. I've been using Let's Encrypt for the last couple of years, and I am incredibly grateful for the Let's Encrypt service.

But wholly flapp… why should it take hours and hours of renewing, editing, reloading, checking, failing, renewing, editing, reloading, checking, failing, renewing, editing, reloading, checking, failing, over and over?

I just need the last of 50 hosts validated. I've got 49 of 50 validated. Why is this so hard? Why does this happen every time? I have to go through the same hours-long process of doing everything right and still failing. I am doing everything correctly. I've been at this for the last 4.75 hours.

Come on guys. This is so frustrating.

Why do I get punished for your servers having problems?

Renew, edit zone, reload server, verify acme-challenge TXT record, renew, FAIL, RINSE, REPEAT!

WOOHOO! FINALLY GOT THE LAST HOST VERIFIED 5 HOURS LATER!

Come on. There has to be a better way.

If you're needing to manually edit the TXT records every time you renew, then no, you are not "doing everything correctly."

I'm sure there is. Do you want help, or do you just want to rant? If you want help, you've given us literally no information to work with. Really, all of the questions you were asked when you opened this thread would be helpful, but at a minimum, a few of the affected domains, and the error messages you're getting, would be a start.

Couple of things:

  • Do not stuff many names into one cert. It is less reliable, and one broken domain will fail your entire renewal. Sometimes people do this because their mail server can't load more than one cert; in that case, fix the mail server and use multiple certs.

  • Only use automated DNS validation, not manual DNS edits. Allow enough time for the change to be served before the record is checked, which is generally at least 1 minute and up to 15 minutes on some DNS providers.

  • As a workaround, get a cert via the same account for the one domain that is failing. Let's Encrypt uses cached validations, so your overall order will then pass.
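
The "allow enough time" point above can be scripted rather than eyeballed. Here is a minimal sketch, assuming `dig` is installed and that the zone's NS records list every server that should hold the record. The function names `txt_present` and `wait_for_txt` are made up for illustration; the defaults (15 tries, 60 seconds apart) just mirror the 1 to 15 minute range mentioned above.

```shell
#!/bin/sh
# Sketch: wait until every authoritative nameserver returns the expected
# _acme-challenge TXT value before asking the CA to validate.
# txt_present and wait_for_txt are hypothetical helper names.

# txt_present NAME VALUE SERVER -> succeeds if SERVER serves VALUE for NAME
txt_present() {
    dig +short +norec TXT "$1" "@$3" | tr -d '"' | grep -qxF -- "$2"
}

# wait_for_txt ZONE NAME VALUE [TRIES] [DELAY_SECONDS]
wait_for_txt() {
    zone=$1 name=$2 value=$3 tries=${4:-15} delay=${5:-60}
    servers=$(dig +short NS "$zone")
    [ -n "$servers" ] || return 1        # no NS records found, nothing to poll
    i=0
    while [ "$i" -lt "$tries" ]; do
        missing=0
        for ns in $servers; do
            txt_present "$name" "$value" "$ns" || missing=1
        done
        [ "$missing" -eq 0 ] && return 0  # every server has the value
        i=$((i + 1))
        sleep "$delay"
    done
    return 1                              # gave up after TRIES rounds
}
```

Something like `wait_for_txt example.com _acme-challenge.www.example.com "$value" && <continue the renewal>` would then only proceed once all listed servers agree. It checks the authoritative servers directly, so it says nothing about what a validating resolver sees.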

Literally, the last hostname to be validated was for a certificate with just the domain name and the subdomain "www". Just two entries total. It took 15 attempts to get it validated.

Exactly how do I not need to edit the DNS record when, in order to get validated, it gives a new token? Every time validation failed, it prompted me to put a new token in the DNS record.

Here is a sample of the errors:

[Wed Sep 24 02:26:08 EDT 2025] response='{"identifier":{"type":"dns","value":"www.example.com"},"status":"invalid","expires":"2025-09-26T07:16:36Z","challenges":[{"type":"dns-01","url":"https://acme-v02.api.letsencrypt.org/acme/chall/...","status":"invalid","validated":"2025-09-24T06:26:06Z","error":{"type":"urn:ietf:params:acme:error:dns","detail":"DNS problem: SERVFAIL looking up TXT for _acme-challenge.www.example.com - the domain's nameservers may be malfunctioning","status": 400},"token":"ABCDEFG"}]}'

[Wed Sep 24 02:27:11 EDT 2025] response='{"identifier":{"type":"dns","value":"www.example.com"},"status":"invalid","expires":"2025-10-01T06:26:17Z","challenges":[{"type":"dns-01","url":"https://acme-v02.api.letsencrypt.org/acme/chall/...","status":"invalid","validated":"2025-09-24T06:27:07Z","error":{"type":"urn:ietf:params:acme:error:dns","detail":"During secondary validation: While processing CAA for www.example.com: DNS problem: SERVFAIL looking up CAA for example.com - the domain's nameservers may be malfunctioning"},"token":"ABCDEFG","validationRecord":[{"hostname":"www.example.com","addressUsed":""}]}]}'

[Wed Sep 24 02:27:12 EDT 2025] Please refer to libcurl - Error Codes for error code: 2

[Wed Sep 24 02:27:18 EDT 2025] response='{"status":"pending","expires":"2025-10-01T06:27:18Z","identifiers":[{"type":"dns","value":"example.com"},{"type":"dns","value":"www.example.com"}],"authorizations":["https://acme-v02.api.letsencrypt.org/acme/authz/...","https://acme-v02.api.letsencrypt.org/acme/authz/..."],"finalize":"https://acme-v02.api.letsencrypt.org/acme/finalize/..."}'

Those errors (sanitized here) were repeated multiple times. Literally thousands of lines of errors.

% cat /tmp/acme/example.com_cert/acme_issuecert.log | grep 'Wed Sep 24' | wc -l
10491

That's just for one of the eight certs, the one with just the domain name and "www".

No, the domain's nameserver was not malfunctioning. I ran:

$ dig _acme-challenge.www.example.com TXT +short @dns.nameserver.net +norec

from a remote (outside my network) terminal, for every challenge, to ensure that the DNS record was updated before I attempted to renew again. None of the digs failed to get a response, and the updated tokens appeared in the DNS record every time. No, there wasn't a spike in usage on the DNS server at 2 AM (I double checked) which would have caused an issue. The DNS server has never experienced a spike of usage that it couldn't comfortably handle. And again, none of the dozens of digs I performed from outside my network failed.

example.com. 21600 IN CAA 0 issue "letsencrypt.org"

It makes no sense that about half of the certs were issued the first time. Why would it take one of the simple certs (two entries) 5 hours to validate?

Both during primary and secondary validation, for txt and caa? Why is it responding servfail?

Check here: https://unboundtest.com/

And also here: https://dnsviz.net

Your DNS looks wrongly configured

From unboundtest.com (sanitized):

;; ANSWER SECTION:
example.com. 0 IN CAA 0 issue "letsencrypt.org"

;; ANSWER SECTION:
example.com. 0 IN A 198.51.1.100

"Your DNS looks wrongly configured"

Friend, that's a bold statement from someone who has never seen my DNS. I've been doing this for more than three decades.

Well, basically every time we've seen "the domain's nameservers may be malfunctioning" errors, and intermittent problems that sometimes are "During secondary validation" and sometimes not, it's because, in fact, the domain's nameservers are malfunctioning. Maybe a delegation that's inconsistent, maybe a glue record that was supposed to get updated but didn't, maybe giving bogus DNSSEC responses, sometimes not sending SOA records when it's supposed to, or maybe not responding over TCP. Lots of possibilities. And sometimes it's precisely those servers that have been running for three decades that aren't handling current DNS expectations correctly. If you actually want help understanding what's going wrong, rather than to just rant, please provide some actual domain names, and describe what DNS software you're using and how it's configured. I assure you, Let's Encrypt's DNS resolving systems aren't having problems with most people's DNS servers (they have pretty good monitoring around that), so if you're consistently having problems, it's a good bet that it's something on your end.
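
A couple of the possibilities listed above (a server not answering over TCP, or servers in the NS set answering differently) can be checked from outside with a short loop. This is only a rough sketch, assuming `dig` is available; `rcode_of` and `check_authoritative` are hypothetical helper names, and it does not cover delegation, glue, or DNSSEC problems.

```shell
#!/bin/sh
# Sketch: ask each authoritative nameserver directly, over UDP and TCP,
# and print the response code for the TXT and CAA lookups.
# rcode_of and check_authoritative are hypothetical helper names.

# rcode_of SERVER NAME TYPE TRANSPORT -> prints NOERROR, SERVFAIL, ... or NO_RESPONSE
# TRANSPORT is +notcp (UDP) or +tcp
rcode_of() {
    rc=$(dig "@$1" "$2" "$3" +norec +noall +comments +time=3 +tries=1 "$4" 2>/dev/null |
        sed -n 's/.*status: \([A-Z]*\).*/\1/p' | head -n 1)
    echo "${rc:-NO_RESPONSE}"
}

# check_authoritative ZONE CHALLENGE_NAME
check_authoritative() {
    zone=$1 name=$2
    for ns in $(dig +short NS "$zone"); do
        for transport in +notcp +tcp; do
            echo "$ns $transport TXT $(rcode_of "$ns" "$name" TXT "$transport")"
            echo "$ns $transport CAA $(rcode_of "$ns" "$zone" CAA "$transport")"
        done
    done
}
```

Running `check_authoritative example.com _acme-challenge.www.example.com` a number of times in quick succession, from more than one outside network, may be more telling than a single `dig`, since the failures described here are intermittent.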

@petercooperjr responded nicely.

When seeing SERVFAIL responses it is almost certainly some kind of DNS configuration issue. Please provide an actual domain name or run https://dnsviz.net test yourself and see.

% named -V
BIND 9.18.33-1~deb12u2-Debian

I run tests every time I make changes to any of my servers. There are no errors, no warnings, all tests passed.

Yeah, my first draft was more aggressive too. With that "why is it responding servfail?" I was hoping to prompt a "check the name server logs, since you're the one with access to them".

Great, go you. Your DNS is still reporting errors to Let's Encrypt. But since you refuse to share the information we'd need to diagnose the problem, you're going to be pretty much on your own.

It's also possible that it's not related to the authoritative servers for the domain directly, but to a misconfiguration at a higher level. We've seen everything from a TLD that specifically blocked Let's Encrypt's traffic to their DNS server, to a TLD that set a CAA record, to a TLD that didn't support TCP. But most commonly these intermittent issues come from something like inconsistent delegations, where some of the servers are not answering authoritatively, so if one of those servers gets picked then it won't work (or might time out before retrying enough times to find a server that does work).

But yeah, your options are either troubleshooting it yourself, or having other people help you troubleshoot it. Or not bothering and just hoping it continues to occasionally work by accident.

I am not at liberty to share hosts and domain names.

You asked for errors, I replied with errors.

You asked me to run tests. I ran tests and reported the (non) results.

You asked me what DNS server I was running and I posted that.

My server is the authoritative server, so I've got the glue records in the zone files.

I have no control over TLD and Root servers.

Thank you for trying to help.

Ok, but we still have no idea what causes the SERVFAIL -- and our experience tells us that's usually the nameserver making uncommon assumptions or having peculiar configs.

A quick workaround to reduce the impact (and maybe eliminate the problem altogether, if it's caused by some kind of rate-limiting mechanism on the nameserver side) would be to use more certificates with fewer names each, which was one of the first things we told you.

Also, nameservers have APIs and ACME clients have hooks; you should automate that.

Did you run the test at https://dnsviz.net?
Some of its messages are difficult to interpret. Let us know if you have questions on any Errors or Warnings.

The EDNS Compliance Tester site can be helpful too, but only once any other issues are resolved.
Its output can be especially difficult to interpret. Other tools (like dnsviz) are better for the first round of debugging, but this one is sometimes helpful.

Anyhoo, the main issue seems to be that you are using manual DNS domain validation, whereby the client prompts you to update a TXT record and you do it by hand. You should not do that, because it's very error-prone for humans to work with. [You'll need to fix the SERVFAIL stuff, but that's just part of having working DNS and might just need a restart.]

Instead you should use automated DNS updates via an API, e.g. RFC 2136 (nsupdate) if you are hosting your own DNS. There are ACME client plugins to help do that, and they make all the changes for you.
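
To give a feel for what an RFC 2136 update looks like, here is a small sketch that only builds the command script that `nsupdate` reads on stdin. The function names are hypothetical, and so are the server name and key path in the usage line below. It assumes the zone is already configured to accept dynamic updates signed with that key (in BIND, via `allow-update` or `update-policy`). In practice an ACME client's own RFC 2136 plugin would do the equivalent for you.

```shell
#!/bin/sh
# Sketch: build RFC 2136 update scripts for nsupdate instead of editing
# the zone file by hand. Function names are hypothetical.

# acme_txt_add SERVER ZONE FQDN VALUE
# Prints an nsupdate script that adds one TXT value (TTL 60) at FQDN.
acme_txt_add() {
    printf 'server %s\nzone %s\nupdate add %s. 60 IN TXT "%s"\nsend\n' \
        "$1" "$2" "${3%.}" "$4"
}

# acme_txt_del SERVER ZONE FQDN
# Prints an nsupdate script that removes every TXT value at FQDN.
acme_txt_del() {
    printf 'server %s\nzone %s\nupdate delete %s. TXT\nsend\n' \
        "$1" "$2" "${3%.}"
}
```

Usage would be along the lines of `acme_txt_add dns.nameserver.net example.com _acme-challenge.www.example.com "$value" | nsupdate -k /path/to/tsig.key`, followed by `acme_txt_del ...` piped the same way once validation has finished. Adding without deleting first means a second value can sit alongside the first at the same name.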

When getting a cert for apexdomain.com and *.apexdomain.com they will both need to update the same _acme-challenge record (that's just how it is in the ACME protocol), which either requires having 2 TXT record values or completing challenges one at a time.
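
The shared-record point can be seen by computing the challenge name for each identifier. A minimal sketch (the function names are made up for illustration):

```shell
#!/bin/sh
# challenge_name IDENTIFIER
# Prints the DNS name that holds the dns-01 TXT record. A wildcard
# identifier is validated at the same name as its base domain.
challenge_name() {
    printf '_acme-challenge.%s\n' "${1#\*.}"
}

# shared_challenges IDENTIFIER...
# Prints every challenge name that more than one identifier maps to.
shared_challenges() {
    for id in "$@"; do challenge_name "$id"; done | sort | uniq -d
}
```

For example, `shared_challenges apexdomain.com '*.apexdomain.com' www.apexdomain.com` prints only `_acme-challenge.apexdomain.com`, which is the one name that has to carry two TXT values at once (or be updated between challenges).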

Automated DNS validation typically takes about 15 seconds if done correctly.
Aside from the issues of using manual verification, it sounds like your nameserver(s) aren't updating correctly globally with the new challenge value.

Here is the pertinent section of my bind config.

// Options

match-clients { any; };
recursion no;
allow-recursion { none; };
notify yes;
rate-limit {
    slip 2;                      // Every other response truncated
    window 15;                   // Seconds to bucket
    responses-per-second 5;      // # of good responses per prefix-length/sec
    referrals-per-second 5;      // referral responses
    nodata-per-second 5;         // nodata responses
    nxdomains-per-second 5;      // nxdomain responses
    errors-per-second 5;         // error responses
    all-per-second 20;           // When we drop all
    log-only no;                 // Debugging mode
    qps-scale 250;               // x / query rate * per-second
                                 // = new drop limit
    exempt-clients { 127.0.0.1; };
    ipv4-prefix-length 24;       // Define the IPv4 block size
    ipv6-prefix-length 56;       // Define the IPv6 block size
    max-table-size 20000;        // 40 bytes * this number = max memory
    min-table-size 500;          // pre-allocate to speed startup
};

I verified that they were updated correctly using an external terminal to do an _acme-challenge TXT record pull.