Some Domains Suddenly Failing Renewal or New Cert Issue

We have a WHM/cPanel server with hundreds of domains on it. As of Friday, March 6th, some domains are unable to complete cert renewals. Some new sites are also not able to get issued a cert at all. For example:

DNSLookupFailed
A fatal issue occurred during the DNS lookup process for tylermobilemechanics.com/CAA.
DNS response for tylermobilemechanics.com/CAA did not have an acceptable response code: SERVFAIL 

We have spent several hours trying to debug this with cPanel Support to no avail. We first thought it was a block from the CSF firewall, but after disabling it we learned that is not the issue. There are no other external firewalls or blocking methods. DNS lookups work fine when checked with MXToolbox and other DNS tools.

It's super weird. If I check 10 domains on the same server with Let's Debug, 50% work and 50% don't. The ones that don't work come up with a DNS lookup issue. I've double-checked the failing sites on MXToolbox and there is no DNS issue. Besides, if it's the same server, you should see the same DNS lookup issues for all sites:

For example, these don't work:

But these work fine:

Why would some sites, on the same server, pass the DNS check while others do not?

Any help would be greatly appreciated as I have wasted almost the whole day trying to resolve this new issue.

Joe G

Is that the error you get from the Let's Encrypt server? If not, please post what you see from that. I don't know WHM very well, but it should have a log showing the cert request error messages.

The Let's Debug test is failing its own DNS queries. That may or may not reflect what is happening with Let's Encrypt itself.

Other tools we use: unboundtest.com, dnsviz.net, and ednscomp are all showing valid results for domains you show failing. Unboundtest is intended to mimic LE, but it doesn't always match exactly.

When we see sporadic failures for a large system, the first thought is that there is some kind of rate limit imposed by your DNS servers. But if the same domains consistently fail, it is not likely that.

I am sure other volunteers or staff will have better insights :slight_smile:


@joegold100 I spoke with the Let's Debug developer (@Nummer378) and he has quite a lot to share. He was going offline, so he asked me to share it. Mind you, these are results from Let's Debug so may not reflect exactly what Let's Encrypt sees. But I think this is likely pointing to the underlying problem. Note especially that he reproduced an error using @1.1.1.1 too.

Let's Debug and Let's Encrypt both use unbound for resolving. However, their configurations may well be slightly different. Further, LE validates from several locations around the world mostly simultaneously which can cause issues not seen by a system doing single queries.

This post focuses on the failures. The next post focuses on why some domains fail and not others.

=====================================================

(... some earlier log debugging omitted for brevity ...)

Why does it not have any nameservers? The logs clearly show it trying to query several nameservers:

Mar 09 23:16:33 letsdebug letsdebug-server[1448376]: [1773098193] libunbound[1448376:0] info: response for ns2.ecservers.net. AAAA IN
Mar 09 23:16:33 letsdebug letsdebug-server[1448376]: [1773098193] libunbound[1448376:0] info: reply from <net.> 192.26.92.30#53
Mar 09 23:16:33 letsdebug letsdebug-server[1448376]: [1773098193] libunbound[1448376:0] info: query response was REFERRAL
Mar 09 23:16:33 letsdebug letsdebug-server[1448376]: [1773098193] libunbound[1448376:0] info: processQueryTargets: ns2.ecservers.net. AAAA IN

Well, no real idea but there's this one suspicious line in the logs:

info: skipping target due to dependency cycle (harden-glue: no may fix some of the cycles) ns2.ecservers.net. A IN
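
For context, the hint in that log line refers to unbound's `harden-glue` option. A minimal configuration fragment (shown only to explain the message; this is not a fix, since Let's Encrypt's resolvers keep the hardening enabled) would be:

```
# unbound.conf fragment -- purely to explain the log hint.
# Disabling harden-glue makes unbound accept glue records without the
# extra validation described below, but Let's Encrypt runs with it
# enabled, so turning it off locally would only mask the problem.
server:
    harden-glue: no
```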

So something seems to go wrong while resolving the nameservers themselves. And then I suddenly could reproduce even on cloudflare's 1.1.1.1:

dig NS ecservers.net @1.1.1.1

; <<>> DiG 9.20.18-1~deb13u1-Debian <<>> NS ecservers.net @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 11697
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; EDE: 23 (Network Error): (54.67.108.165:53 returned REFUSED for ecservers.net NS)
;; QUESTION SECTION:
;ecservers.net.                 IN      NS

;; Query time: 344 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Mon Mar 09 23:27:54 UTC 2026
;; MSG SIZE  rcvd: 102
dig NS ns6.ecservers.net +dnssec @1.1.1.1

; <<>> DiG 9.20.18-1~deb13u1-Debian <<>> NS ns6.ecservers.net +dnssec @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50460
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
; EDE: 23 (Network Error): (54.67.108.165:53 returned REFUSED for ecservers.net DNSKEY)
;; QUESTION SECTION:
;ns6.ecservers.net.             IN      NS

;; ANSWER SECTION:
ns6.ecservers.net.      86400   IN      NS      ns1.ecservers.net.
ns6.ecservers.net.      86400   IN      NS      ns2.ecservers.net.

;; Query time: 340 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Mon Mar 09 23:28:54 UTC 2026
;; MSG SIZE  rcvd: 146

After some further digging, I found a severely broken configuration:

ecservers.net is hosted on ns{1,2}.ecservers.net according to the .net nameservers. That's a cycle (which unbound doesn't like), as one needs to know the nameservers for ecservers.net to get the addresses of ns1/2.ecservers.net. The .net nameservers return "glue records" with the A/AAAA IPs to solve this problem - many resolvers will just accept the glue and move on, if everything else is in order. But from experience, unbound likes to be picky and tries to validate the glue, which fails since the ns1/2.ecservers.net nameservers seem to drop all DNS queries for "ecservers.net":

dig A ecservers.net @54.67.108.165
;; communications error to 54.67.108.165#53: timed out
;; communications error to 54.67.108.165#53: timed out
;; communications error to 54.67.108.165#53: timed out

; <<>> DiG 9.20.18-1~deb13u1-Debian <<>> A ecservers.net @54.67.108.165
;; global options: +cmd
;; no servers could be reached

(At least from Let's Debug's perspective)

So the nameservers do not reply for their main domain, even though according to the delegation they should. This causes issues when unbound tries to validate the referral and to grab DNSSEC keys for ecservers.net. Therefore Let's Debug won't accept the nameserver IPs and thus can't query the final domain name, as it has no validated nameserver to ask.


Below is info directly from @Nummer378 with just mild formatting by me :slight_smile:

==================================================================

Next question: Why do some domains work while others don't?

Well, there seems to be a difference in how the domains are set up at the .net TLD nameservers. The working domains have "shortcut" glue records in place such that one can sidestep resolving *.ecservers.net altogether:

dig NS ascenciodesigns.net @a.gtld-servers.net

; <<>> DiG 9.20.18-1~deb13u1-Debian <<>> NS ascenciodesigns.net @a.gtld-servers.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30868
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ascenciodesigns.net.           IN      NS

;; AUTHORITY SECTION:
ascenciodesigns.net.    172800  IN      NS      ns5.ecservers.net.
ascenciodesigns.net.    172800  IN      NS      ns6.ecservers.net.

;; ADDITIONAL SECTION:
ns5.ecservers.net.      172800  IN      A       52.52.90.18
ns6.ecservers.net.      172800  IN      A       52.52.90.18

;; Query time: 32 msec
;; SERVER: 2001:503:a83e::2:30#53(a.gtld-servers.net) (UDP)
;; WHEN: Mon Mar 09 23:45:04 UTC 2026
;; MSG SIZE  rcvd: 126

But the broken ones apparently do not have good glue records in place:

dig NS a1commercialroofingnewyork.com @a.gtld-servers.net

; <<>> DiG 9.20.18-1~deb13u1-Debian <<>> NS a1commercialroofingnewyork.com @a.gtld-servers.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5435
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;a1commercialroofingnewyork.com.        IN      NS

;; AUTHORITY SECTION:
a1commercialroofingnewyork.com. 172800 IN NS    ns5.ecservers.net.
a1commercialroofingnewyork.com. 172800 IN NS    ns6.ecservers.net.

;; Query time: 32 msec
;; SERVER: 2001:503:a83e::2:30#53(a.gtld-servers.net) (UDP)
;; WHEN: Mon Mar 09 23:45:29 UTC 2026
;; MSG SIZE  rcvd: 108

Therefore the resolver needs to resolve ecservers.net first, which does have glue in place. But since the responsible nameserver does not reply properly, unbound doesn't accept this glue record.
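
The split between working and failing domains can be sketched as a toy model (Python; the names and IPs come from the dig outputs in this thread, and the resolver logic is a deliberate simplification of unbound's glue validation, not its real algorithm):

```python
# Toy model of the delegations seen in this thread. Names and IPs are
# taken from the dig outputs above; the "resolver" below is a deliberate
# simplification of unbound's glue validation, not its real algorithm.

# What the parent (TLD) servers return for each zone: the NS names,
# plus any glue (A records) in the additional section.
DELEGATIONS = {
    "ecservers.net": {
        "ns": ["ns1.ecservers.net", "ns2.ecservers.net"],
        "glue": {"ns1.ecservers.net": "54.67.108.165",
                 "ns2.ecservers.net": "54.67.108.165"},
    },
    "ascenciodesigns.net": {
        "ns": ["ns5.ecservers.net", "ns6.ecservers.net"],
        "glue": {"ns5.ecservers.net": "52.52.90.18",
                 "ns6.ecservers.net": "52.52.90.18"},
    },
    "a1commercialroofingnewyork.com": {
        "ns": ["ns5.ecservers.net", "ns6.ecservers.net"],
        "glue": {},  # no shortcut glue at the TLD
    },
}

# 54.67.108.165 was dropping queries for ecservers.net, so a strict
# resolver could never confirm that glue.
CONFIRMABLE_IPS = {"52.52.90.18"}

def glue_is_usable(delegation):
    """A picky resolver only keeps glue it can confirm by querying the IP."""
    return any(ip in CONFIRMABLE_IPS for ip in delegation["glue"].values())

def can_resolve(domain):
    delegation = DELEGATIONS[domain]
    if glue_is_usable(delegation):
        return True  # confirmed shortcut glue: no cycle to break
    # No usable glue: the NS names all live under ecservers.net, so the
    # resolver must first resolve that zone -- whose own glue is also
    # unconfirmable, leaving no validated nameserver to ask.
    return glue_is_usable(DELEGATIONS["ecservers.net"])

for domain in DELEGATIONS:
    print(domain, "->", "resolves" if can_resolve(domain) else "SERVFAIL")
```

In this model only domains whose TLD delegation carries glue pointing at a server that actually answers (52.52.90.18) resolve; everything forced through ecservers.net itself fails, matching the pattern in the dig outputs.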


I find this odd:

nslookup -q=ns ecservers.net a.gtld-servers.net
ecservers.net   nameserver = ns1.ecservers.net
ecservers.net   nameserver = ns2.ecservers.net
ns1.ecservers.net       internet address = 54.67.108.165
ns2.ecservers.net       internet address = 54.67.108.165
nslookup -q=ns ecservers.net 54.67.108.165
*** UnKnown can't find ecservers.net: Query refused

It appears that ns1.ecservers.net, ns2.ecservers.net, ns3.ecservers.net, ns4.ecservers.net, ns5.ecservers.net and ns6.ecservers.net have all been set up as separate zones, while ecservers.net has not been set up as a zone.

This means that while ns1.ecservers.net and ns2.ecservers.net (which resolve to the same IP address) are responsive to queries for ns*.ecservers.net, they refuse queries for ecservers.net.

Do you manage ecservers.net?


Thank you to all of you who responded on this issue, which I inherited from another team who set up these name servers and DNS years ago. It is not quite clear to me what the solution is from your analysis.

Here is what I can see:

The domain ecservers.net has DNS set up on GoDaddy. The domain itself uses name servers ns1.ecservers.net and ns2.ecservers.net. Those name servers point to one of our other WHM servers, 54.67.108.165. There is no DNS zone set up on 54.67.108.165 for the root domain ecservers.net, nor is there one on 52.52.90.18, which is the server that started having this issue with Let's Encrypt as of Friday. The other server, 54.67.108.165, does not seem to be having any issue at this time with Let's Encrypt DNS verification.

In GoDaddy, the domain ecservers.net has multiple host names set up that point to the name servers of 3 different WHM servers. Here are the important servers for this issue:

NS2 54.67.108.165
NS1 54.67.108.165
NS5 52.52.90.18 (server with issue)
NS6 52.52.90.18 (server with issue)

If I'm understanding your solution, I need to add a DNS zone for ecservers.net on 54.67.108.165 so that ecservers.net resolves. Is that correct?

Again, thank you all so much for your help! I truly appreciate it!

Joe

Note: I created a new account for the domain ecservers.net under 54.67.108.165 and now that domain resolves. However, I previously indicated that the other server had no issues verifying Let's Encrypt certs, but it is now popping the same error as 52.52.90.18:

1:03:38 PM ERROR TLS Status: Defective
Certificate expiry: 3/10/27, 8:02 PM UTC (365 days from now)
ERROR Defect: OPENSSL_VERIFY: The certificate chain failed OpenSSL’s verification (0:18:DEPTH_ZERO_SELF_SIGNED_CERT).
1:04:16 PM WARN “Let’s Encrypt™” HTTP DCV error (ecservers.net): Timeout after 30 seconds!
1:04:50 PM ERROR “Let’s Encrypt™” DNS DCV error (*.ecservers.net): 400 urn:ietf:params:acme:error:dns (There was a problem with a DNS query) (During secondary validation: DNS problem: query timed out looking up TXT for _acme-challenge.ecservers.net)
ERROR “Let’s Encrypt™” DNS DCV error (ecservers.net): 400 urn:ietf:params:acme:error:dns (There was a problem with a DNS query) (During secondary validation: DNS problem: query timed out looking up TXT for _acme-challenge.ecservers.net)

So now I have 2 servers not able to issue or renew some certs.

I really don't know what to do next....

The issue might be caused by missing name server records for ns6.ecservers.net in the ecservers.net zone.

To fix the issue, add these records to ecservers.net.

ns1.ecservers.net. NS ns1.ecservers.net.
ns1.ecservers.net. NS ns2.ecservers.net.
ns2.ecservers.net. NS ns1.ecservers.net.
ns2.ecservers.net. NS ns2.ecservers.net.
ns3.ecservers.net. NS ns1.ecservers.net.
ns3.ecservers.net. NS ns2.ecservers.net.
ns4.ecservers.net. NS ns1.ecservers.net.
ns4.ecservers.net. NS ns2.ecservers.net.
ns5.ecservers.net. NS ns1.ecservers.net.
ns5.ecservers.net. NS ns2.ecservers.net.
ns6.ecservers.net. NS ns1.ecservers.net.
ns6.ecservers.net. NS ns2.ecservers.net.

While this would work, I'd recommend:

  1. Including the contents of the ns*.ecservers.net zones in the zonefile for ecservers.net and removing the subzones.
  2. Moving to a primary/secondary setup where ns1.ecservers.net, ns3.ecservers.net and ns5.ecservers.net all serve the same content; this would add redundancy for when one of your servers goes offline.
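
For illustration, a merged ecservers.net zone along those lines might look roughly like this (a BIND-style sketch; the SOA values are placeholders, the IPs are the ones quoted in this thread, and ns3/ns4 would be added the same way with the third server's IP, which wasn't given here):

```
; Sketch of a single ecservers.net zone replacing the per-ns* subzones.
$TTL 86400
ecservers.net.      IN SOA ns1.ecservers.net. hostmaster.ecservers.net. (
                        2026031001 7200 3600 1209600 86400 )
ecservers.net.      IN NS  ns1.ecservers.net.
ecservers.net.      IN NS  ns2.ecservers.net.
ns1.ecservers.net.  IN A   54.67.108.165
ns2.ecservers.net.  IN A   54.67.108.165
ns5.ecservers.net.  IN A   52.52.90.18
ns6.ecservers.net.  IN A   52.52.90.18
; ns3/ns4 A records go here with the IP of the third WHM server
```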

Edit: I've fully qualified the domain names (put a dot after them).


I resolved this issue. It turns out that it was the firewall of the server hosting ecservers.net, which is the name servers' root domain (i.e. ns1.ecservers.net, ns3, ns4, etc.). After several hours of removing country code blocks one by one, waiting 10 minutes after each for the firewall to restart, I found that Let's Encrypt is using Finland (FI) as a polling country of origin. That country was the source of a brute force attack on our servers last week and I had banned it. Once I removed the block, the certs are now able to verify.

This brings up a serious issue. Let's Encrypt needs to publish the IP addresses of the servers that it uses to poll for DNS verification so that we can whitelist them in our firewalls.

We freakin' wasted about 12 hours between 4 people internally, not to mention all of the time you guys spent analyzing and responding. In addition, on cPanel Support there were 7 different analysts, all the way up to level V, who tried to solve this issue but could not! So that is a lot of people and a lot of time wasted.

It all could have been avoided if a list of server IPs to whitelist had been posted on Let's Encrypt's website....

  • Joe

Let's Encrypt does not publish their IP addresses, and for good reason: they can change at any time.


So as it turns out, both of our servers still can't issue new certs through Let's Encrypt. Today we added 2 new accounts and both easily clear the validation:

https://letsdebug.net/24hourfasttowing.com/2761727

https://letsdebug.net/excavationservicesvirginia.com/2761726

However, with or without the firewalls on, we can't issue a cert. Although the log seems to say it was a success, no cert is issued:

1:59:44 PM ERROR “Let’s Encrypt™” DNS DCV error (*.excavationservicesvirginia.com): 403 urn:ietf:params:acme:error:unauthorized (The client lacks sufficient authorization) (Incorrect TXT record "nyP4xuMmEcuSNx4oJJf44Ej1gWVsYEqpth1CjomzGpo" found at _acme-challenge.excavationservicesvirginia.com)

1:59:45 PM SUCCESS “Let’s Encrypt™” DNS DCV OK: excavationservicesvirginia.com

Retrying DCV without the failed wildcard domain …

1:59:46 PM SUCCESS Let’s Encrypt DCV for “excavationservicesvirginia.com” is valid until 4/11/26, 8:59 PM UTC.

SUCCESS “Let’s Encrypt™” DCV OK: excavationservicesvirginia.com

Any ideas now on what the problem is?

Thank you.

 1:59:53 PM ERROR “Let’s Encrypt™” DNS DCV error (www.excavationservicesvirginia.com): 400 urn:ietf:params:acme:error:dns (There was a problem with a DNS query) (During secondary validation: DNS problem: SERVFAIL looking up TXT for _acme-challenge.www.excavationservicesvirginia.com - the domain's nameservers may be malfunctioning)

Hi Mike,

As the team leader I hope you can relay my response back to the developers. We corrected the DNS issue that your developer pointed out and removed a country code block on Finland which was ultimately causing the cert not to verify. Now that all domains clear the DNS verification, for example:

https://letsdebug.net/24hourfasttowing.com/2761727

We still can't get a new cert to issue on the server.

I've attached the AutoSSL log as a txt file:
Log for the AutoSSL.txt (7.7 KB)

If there is any additional insight your developers can provide on what is causing this issue it would be greatly appreciated.

Thank you,

Joe

Forget it. The certs finally issued after 4 1/2 hours...

Let me address at least this much ...

I am not a "team leader". The "leader" next to my name is a trust level in this Discourse forum. It was granted to me by LE for my long-term contributions in this community and recommendations from other well-regarded contributors. It allows me some extra admin duties within this forum.

This forum is community based. Unless you see "LE Staff" next to their ID, they are volunteers, like me, who offer their time and experience for free.

As to Finland and your firewall ... Let's Encrypt does not have a validation center in Finland, last I checked. They do have one in Sweden (or did), but LE does not publish its infrastructure details, nor does it recommend people design around any particular infra. LE has 5 validation centers, of which 4 must succeed for a certificate to be granted. LE may change their location or number without notice.

Let's Debug is a helpful tool but is just that. A tool. As I have said a couple times already it may or may not produce the same result as Let's Encrypt itself. I think you should avoid over-fitting any fixes solely to get a proper result in Let's Debug.

All that said, the info from @Nummer378 indicated likely faulty glue records. And, info from @MaxHearnden suggested a fix to get your DNS to a more reliable state.

Both of these people have far more experience about DNS than I do and their suggestions look like good ones to follow to me.

That your cert request worked after 4.5 hours indicates there is still fragility in your setup.


Hi @joegold100,

maintainer of Let's Debug here. As @MikeMcQ pointed out, we are all volunteers here and not affiliated with Let's Encrypt. Let's Debug is also not run or sponsored by Let's Encrypt (as noted at the bottom of the website). If Let's Debug can successfully connect, then that's a good sign but not a guarantee that Let's Encrypt can as well. Let's Debug is hosted in Finland (as one can easily see by checking the IP to which letsdebug.net resolves - the outbound IP is equal to the inbound IP). Let's Encrypt uses multiple vantage points in several places in the world.

Let's Debug was previously unable to resolve the ecservers.net domain, as the nameservers indicated by the TLD were not answering queries for that domain (Let's Debug just got timeouts when querying 54.67.108.165; Cloudflare reported REFUSED errors). This resulted in resolution failures for all domains that required resolving this domain, because it prevented unbound (the resolver used by both Let's Debug and Let's Encrypt) from validating the glue records provided by the .net TLD servers. This is the glue record I'm referring to:

dig NS ecservers.net @g.gtld-servers.net
[...]
;; AUTHORITY SECTION:
ecservers.net.          172800  IN      NS      ns1.ecservers.net.
ecservers.net.          172800  IN      NS      ns2.ecservers.net.

;; ADDITIONAL SECTION:
ns1.ecservers.net.      172800  IN      A       54.67.108.165
ns2.ecservers.net.      172800  IN      A       54.67.108.165

There's a "glue record" here (the additional section) to resolve the cycle where resolving ns{1,2}.ecservers.net requires resolving the zone for ecservers.net (which in turn is normally required to resolve *.ecservers.net, thus forming a cycle). Some resolvers will blindly trust this glue, but unbound likes to confirm such glue* by querying the SOA for ecservers.net. That query wasn't being answered by your nameservers, causing the lookup failures.

Some of your domains had direct "shortcut glue" records on their TLDs that resolved to a different nameserver/IP pair:

dig NS ascenciodesigns.net @a.gtld-servers.net
[...]
;; AUTHORITY SECTION:
ascenciodesigns.net.    172800  IN      NS      ns5.ecservers.net.
ascenciodesigns.net.    172800  IN      NS      ns6.ecservers.net.

;; ADDITIONAL SECTION:
ns5.ecservers.net.      172800  IN      A       52.52.90.18
ns6.ecservers.net.      172800  IN      A       52.52.90.18

These nameservers, ns{5,6}.ecservers.net, use a different IP which was never blocked - Let's Debug could query it successfully. Hence all domains where the TLD provided this glue could be resolved successfully by both Let's Encrypt and Let's Debug.

But some of your (newer?) domains had no such shortcut glue records on the TLDs - this indicates that your nameserver was inaccessible to the TLD's nameservers when the nameservers were last changed on the TLD. Cloudflare's 1.1.1.1 resolver was also unable to resolve ecservers.net when I last checked 3 days ago, so it looked like a widespread issue, not an isolated block. Without the shortcut glue the resolver needs to query the NS for ecservers.net as above, which results in the resolver querying ns{1,2}.ecservers.net (both on 54.67.108.165) - that failed as no reply was received.

Since then you have made changes to the setup (IP blocking?), it seems, and now Let's Debug can successfully resolve that domain, as can 1.1.1.1. Whatever changes you made seem to have helped. If Let's Encrypt still has trouble, that may indicate that you're still blocking DNS responses where you shouldn't. I would advise hosting DNS servers in a redundant fashion on otherwise isolated systems to avoid damage in case of compromise, but refrain from blocking IPs on DNS services - this is a practice that will cause lots of issues, as the internet is a global thing, as you've probably noticed by now.


*Technically this is an optional feature of unbound which can be disabled, but it's enabled on Let's Encrypt's servers and thus on Let's Debug as well, as LD attempts to mimic the setup of Let's Encrypt.


Definitely not an A.I. reply. We have been going nuts for the last 7 weeks with new cert and cert renewal issues. We were first told it was a DNS configuration issue on the server, which it turned out it was not. We later found that the issue was possibly because we were blocking so many international countries from accessing our U.S. server. We went through country by country, unblocking and testing, until we found a country that was causing validation to fail. However, as you keep changing IP addresses/countries of origin, we are still having problems with new certs and renewals. Over the last 2 weeks we had over a hundred domains fail renewal. I disabled country code blocking on the firewall and tried running AutoSSL, but the certs were still not validating. That's when we started digging again and found that too many connections was likely triggering a temporary ban in CSF. Without knowing the IP addresses of the servers you are using, we can't whitelist them.

We never had this problem until around February 15th; on Feb 18 there was a blog post on your site stating the validation process had changed.

We are just going nuts on this.

@joegold100 I moved the new info about this situation from the thread about validation server IP addresses to here. We prefer the same problem to remain in one thread so we can see the history more easily.

I was replying to a post that is now hidden on the other thread you created. I plan to delete mine once it disappears completely.

The quote I used from you was to highlight that the now hidden post did not address your situation.

Sorry for the confusion

There is a blog post from Feb 18 regarding a new type of challenge: dns-persist-01

But, that would be the 4th type of challenge after http-01, tls-alpn-01, and dns-01. It does not replace any of them. And it is not even in production yet, only in the Staging system.


A broader solution is to use an application aware firewall, one that can inspect http.

That way you can explicitly allow all incoming HTTP requests to any domain where the GET path matches /.well-known/acme-challenge/*. All HTTP domain validation for any CA will then work with zero IP whitelisting, while all other HTTP conversations can be optionally blocked.
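
As a sketch of that idea (assuming an nginx front end; in a WHM setup the equivalent rule would live in whatever HTTP-aware layer sits in front of Apache, and the root path here is illustrative):

```
# Always answer ACME HTTP-01 challenge requests, regardless of any
# geo-blocks or allowlists applied elsewhere.
location ^~ /.well-known/acme-challenge/ {
    allow all;
    root /var/www/acme;   # directory the ACME client writes challenge files to
}

# All other traffic can then be filtered as strictly as needed.
```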
