DNS problem: SERVFAIL looking up A

My domains are:


I ran this command:
I’ve tried multiple times over the past 48 hours to issue new certs via the LetsEncrypt Plesk extension.

It produced this output:
Error: Could not issue a Let’s Encrypt SSL/TLS certificate for pheromoneadvantage.com. Authorization for the domain failed.
Invalid response from https://acme-v02.api.letsencrypt.org/acme/authz-v3/1712756888.
Details:
Type: urn:ietf:params:acme:error:dns
Status: 400
Detail: DNS problem: SERVFAIL looking up A for pheromoneadvantage.com


Invalid response from https://acme-v02.api.letsencrypt.org/acme/authz-v3/1706230720.
Details:
Type: urn:ietf:params:acme:error:dns
Status: 400
Detail: DNS problem: SERVFAIL looking up A for dramend.com

My web server is (include version):
VPS / CPU – QEMU Virtual CPU version 1.5.3 (2 core(s))
Running Apache / Nginx

The operating system my web server runs on is (include version):
CentOS Linux 7.7.1908 (Core)

My hosting provider, if applicable, is:
White Label IT Solutions

I can login to a root shell on my machine (yes or no, or I don’t know):
Yes

I’m using a control panel to manage my site (no, or provide the name and version of the control panel):
Plesk Obsidian v18.0.21_build1800191128.17 os_CentOS 7

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you’re using Certbot):
N/A ??

Just migrated 2 days ago to new hosting account.

Migrated from Plesk Onyx 17 to Plesk Obsidian 18.

Near as I can tell, the DNS fully propagated worldwide within an hour and a half.

That was over 48 hours ago.

Every tool I check with reports the new IP and nameservers, including whois records:
New IP: 199.38.245.235
Both sites are set up with “custom” nameservers NS1 and NS2.

I’ve triple checked the glue records are set up correctly at the registrar, GoDaddy.

Both sites are working properly on the new DNS - have been all along. There was no downtime at all during the propagation.

I manage several VPS accounts at White Label IT Solutions for my clients and have NEVER run into this issue with LetsEncrypt before. Typically, within an hour of initiating the DNS propagation, I can issue their new certs without any problem.

SIDE NOTE: I have found that the .well-known/acme-challenge/ folder and files are being blocked by the Nginx domain configuration, due to the dot prefix. It results in a 403 Forbidden Nginx error.

This part of the error output makes me think the domain configuration is the problem:
“Authorization for the domain failed…”

It seems to me LetsEncrypt cannot complete the acme challenge for domain name ownership validation.

At any rate, hosting tech support keeps telling me this is due to DNS propagation not yet completed and to wait another 24 hours . . . ??

Let me know if there is any further info you need to help me resolve this.

Thank you!!

I think that your Plesk server (which hosts both your website and your single nameserver) has blocked the Let’s Encrypt validation servers.

Let me walk you through why I think that’s the case:

First, the basic test sites unboundtest.com, letsdebug.net and check-your-website.server-daten.de don’t report any issues with DNS. Two of these three use a DNS configuration similar to the one Let’s Encrypt uses.

To confirm whether this is a networking issue, I pointed a random domain name (xxzx.fleetssl.com) to your server’s IP address (199.38.245.235).

What this does is take your domain name and its DNS server out of the equation.

At this point, I ask Let’s Encrypt to try to perform HTTP validation on my domain, which is indirectly asking Let’s Encrypt to connect to your Plesk server on port 80. The result:

- The following errors were reported by the server:

  Domain: xxzx.fleetssl.com
  Type:   connection
  Detail: Fetching
  http://xxzx.fleetssl.com/.well-known/acme-challenge/PP60k7j2tPjf016n5REss8jBT9k_l9KVUsRIQvRQo88:
  Timeout during connect (likely firewall problem)

A connection timeout to port 80.

We can confirm the error and the IP address by looking at the authz resource: https://acme-v02.api.letsencrypt.org/acme/authz-v3/1714887579

I would check your Plesk server’s firewall to see whether you have any blocked addresses in there. Try temporarily disabling it and seeing whether that makes a difference.

Awesome! Thank you for your reply. I’ll follow your suggestions and post back the results.

We’ve tried to drop all firewall rules but the issue with DNS still exists.

What if you install tcpdump on your server and run this over SSH:

tcpdump -i eth0 'host 66.133.109.36 or host 34.222.229.130 or host 52.15.254.228 or host 52.28.236.88 or host 35.162.100.107 or host 54.244.166.87 or host 3.133.161.228'

and in your browser, try to issue the certificate.

What do you see?
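
For anyone repeating this test, the long filter expression is easy to mistype. Below is a small sketch of a helper that builds it from a list of addresses. The function name build_host_filter is made up for illustration, and the addresses are simply the ones quoted in this thread; there is no guarantee they stay current.

```shell
#!/bin/sh
# Build a tcpdump filter of the form "host A or host B or ..."
# from the IP addresses given as arguments.
# build_host_filter is a hypothetical helper name, not part of tcpdump.
build_host_filter() {
    filter=""
    for ip in "$@"; do
        if [ -z "$filter" ]; then
            filter="host $ip"
        else
            filter="$filter or host $ip"
        fi
    done
    printf '%s\n' "$filter"
}

# Offline demonstration: only prints the filter string.
build_host_filter 66.133.109.36 64.78.149.164
# prints: host 66.133.109.36 or host 64.78.149.164

# Real capture (needs root and tcpdump; eth0 is the interface name used in
# this thread and may differ elsewhere). -n keeps addresses numeric, which
# makes it easier to match them against the list:
#   tcpdump -n -i eth0 "$(build_host_filter 66.133.109.36 64.78.149.164)"
```

Keeping the address list in one place should make it simpler to rerun the capture with a narrower set of hosts later on.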

Here is the dump file:

listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
21:16:21.308310 IP ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998 > localhost.localdomain.http: Flags [S], seq 145391042, win 26883, options [mss 1260,sackOK,TS val 3306994465 ecr 0,nop,wscale 7], length 0
21:16:21.308405 IP localhost.localdomain.http > ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998: Flags [S.], seq 2543198563, ack 145391043, win 24960, options [mss 1260,sackOK,TS val 60333544 ecr 3306994465,nop,wscale 7], length 0
21:16:21.326234 IP ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998 > localhost.localdomain.http: Flags [.], ack 1, win 211, options [nop,nop,TS val 3306994483 ecr 60333544], length 0
21:16:21.326407 IP ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998 > localhost.localdomain.http: Flags [P.], seq 1:275, ack 1, win 211, options [nop,nop,TS val 3306994484 ecr 60333544], length 274: HTTP: GET /.well-known/acme-challenge/QFoK6pimuWxj53aCeMQNWuqoBREDfxbL83TgybngaBw HTTP/1.1
21:16:21.326483 IP localhost.localdomain.http > ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998: Flags [.], ack 275, win 204, options [nop,nop,TS val 60333562 ecr 3306994484], length 0
21:16:21.326995 IP localhost.localdomain.http > ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998: Flags [P.], seq 1:250, ack 275, win 204, options [nop,nop,TS val 60333562 ecr 3306994484], length 249: HTTP: HTTP/1.1 200 OK
21:16:21.327217 IP localhost.localdomain.http > ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998: Flags [FP.], seq 250:337, ack 275, win 204, options [nop,nop,TS val 60333563 ecr 3306994484], length 87: HTTP
21:16:21.345142 IP ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998 > localhost.localdomain.http: Flags [.], ack 250, win 219, options [nop,nop,TS val 3306994502 ecr 60333562], length 0
21:16:21.345179 IP ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998 > localhost.localdomain.http: Flags [F.], seq 275, ack 338, win 219, options [nop,nop,TS val 3306994502 ecr 60333563], length 0
21:16:21.345219 IP localhost.localdomain.http > ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998: Flags [.], ack 276, win 204, options [nop,nop,TS val 60333581 ecr 3306994502], length 0
21:16:21.440673 IP ec2-52-28-236-88.eu-central-1.compute.amazonaws.com.48652 > localhost.localdomain.http: Flags [S], seq 1113256565, win 26883, options [mss 1260,sackOK,TS val 2732409239 ecr 0,nop,wscale 7], length 0
21:16:21.440823 IP localhost.localdomain.http > ec2-52-28-236-88.eu-central-1.compute.amazonaws.com.48652: Flags [S.], seq 3898670125, ack 1113256566, win 24960, options [mss 1260,sackOK,TS val 60333676 ecr 2732409239,nop,wscale 7], length 0
21:16:21.504605 IP ec2-35-162-100-107.us-west-2.compute.amazonaws.com.47240 > localhost.localdomain.domain: 63622% [1au] Type257? pHeroMOnEAdvaNtAgE.CoM. (51)
21:16:21.504704 IP ec2-35-162-100-107.us-west-2.compute.amazonaws.com.29430 > localhost.localdomain.domain: 48420% [1au] AAAA? Ns1.pHerOMoNeadvanTagE.COM. (55)
21:16:21.504835 IP ec2-35-162-100-107.us-west-2.compute.amazonaws.com.55014 > localhost.localdomain.domain: 33387% [1au] AAAA? ns2.pHeroMONEADvaNTaGe.coM. (55)
21:16:21.505253 IP localhost.localdomain.domain > ec2-35-162-100-107.us-west-2.compute.amazonaws.com.47240: 63622*- 0/1/1 (129)
21:16:21.505404 IP localhost.localdomain.domain > ec2-35-162-100-107.us-west-2.compute.amazonaws.com.29430: 48420*- 0/1/1 (133)
21:16:21.505472 IP localhost.localdomain.domain > ec2-35-162-100-107.us-west-2.compute.amazonaws.com.55014: 33387*- 0/1/1 (133)
21:16:21.525509 IP ec2-52-28-236-88.eu-central-1.compute.amazonaws.com.48652 > localhost.localdomain.http: Flags [.], ack 1, win 211, options [nop,nop,TS val 2732409324 ecr 60333676], length 0
21:16:21.525621 IP ec2-52-28-236-88.eu-central-1.compute.amazonaws.com.48652 > localhost.localdomain.http: Flags [P.], seq 1:275, ack 1, win 211, options [nop,nop,TS val 2732409324 ecr 60333676], length 274: HTTP: GET /.well-known/acme-challenge/QFoK6pimuWxj53aCeMQNWuqoBREDfxbL83TgybngaBw HTTP/1.1
21:16:21.525664 IP localhost.localdomain.http > ec2-52-28-236-88.eu-central-1.compute.amazonaws.com.48652: Flags [.], ack 275, win 204, options [nop,nop,TS val 60333761 ecr 2732409324], length 0
21:16:21.526009 IP localhost.localdomain.http > ec2-52-28-236-88.eu-central-1.compute.amazonaws.com.48652: Flags [P.], seq 1:250, ack 275, win 204, options [nop,nop,TS val 60333761 ecr 2732409324], length 249: HTTP: HTTP/1.1 200 OK
21:16:21.526349 IP localhost.localdomain.http > ec2-52-28-236-88.eu-central-1.compute.amazonaws.com.48652: Flags [FP.], seq 250:337, ack 275, win 204, options [nop,nop,TS val 60333762 ecr 2732409324], length 87: HTTP
21:16:21.610688 IP ec2-52-28-236-88.eu-central-1.compute.amazonaws.com.48652 > localhost.localdomain.http: Flags [.], ack 250, win 219, options [nop,nop,TS val 2732409409 ecr 60333761], length 0
21:16:21.610929 IP ec2-52-28-236-88.eu-central-1.compute.amazonaws.com.48652 > localhost.localdomain.http: Flags [F.], seq 275, ack 338, win 219, options [nop,nop,TS val 2732409409 ecr 60333762], length 0
21:16:21.610997 IP localhost.localdomain.http > ec2-52-28-236-88.eu-central-1.compute.amazonaws.com.48652: Flags [.], ack 276, win 204, options [nop,nop,TS val 60333846 ecr 2732409409], length 0
21:16:21.779321 IP ec2-34-222-229-130.us-west-2.compute.amazonaws.com.46030 > localhost.localdomain.http: Flags [S], seq 4123703496, win 26883, options [mss 1260,sackOK,TS val 3783957005 ecr 0,nop,wscale 7], length 0
21:16:21.779571 IP localhost.localdomain.http > ec2-34-222-229-130.us-west-2.compute.amazonaws.com.46030: Flags [S.], seq 576150643, ack 4123703497, win 24960, options [mss 1260,sackOK,TS val 60334015 ecr 3783957005,nop,wscale 7], length 0
21:16:21.862111 IP ec2-34-222-229-130.us-west-2.compute.amazonaws.com.46030 > localhost.localdomain.http: Flags [.], ack 1, win 211, options [nop,nop,TS val 3783957088 ecr 60334015], length 0
21:16:21.862332 IP ec2-34-222-229-130.us-west-2.compute.amazonaws.com.46030 > localhost.localdomain.http: Flags [P.], seq 1:275, ack 1, win 211, options [nop,nop,TS val 3783957088 ecr 60334015], length 274: HTTP: GET /.well-known/acme-challenge/QFoK6pimuWxj53aCeMQNWuqoBREDfxbL83TgybngaBw HTTP/1.1
21:16:21.862412 IP localhost.localdomain.http > ec2-34-222-229-130.us-west-2.compute.amazonaws.com.46030: Flags [.], ack 275, win 204, options [nop,nop,TS val 60334098 ecr 3783957088], length 0
21:16:21.862823 IP localhost.localdomain.http > ec2-34-222-229-130.us-west-2.compute.amazonaws.com.46030: Flags [P.], seq 1:250, ack 275, win 204, options [nop,nop,TS val 60334098 ecr 3783957088], length 249: HTTP: HTTP/1.1 200 OK
21:16:21.863040 IP localhost.localdomain.http > ec2-34-222-229-130.us-west-2.compute.amazonaws.com.46030: Flags [FP.], seq 250:337, ack 275, win 204, options [nop,nop,TS val 60334098 ecr 3783957088], length 87: HTTP
21:16:21.945218 IP ec2-34-222-229-130.us-west-2.compute.amazonaws.com.46030 > localhost.localdomain.http: Flags [.], ack 250, win 219, options [nop,nop,TS val 3783957171 ecr 60334098], length 0
21:16:21.945383 IP ec2-34-222-229-130.us-west-2.compute.amazonaws.com.46030 > localhost.localdomain.http: Flags [F.], seq 275, ack 338, win 219, options [nop,nop,TS val 3783957172 ecr 60334098], length 0
21:16:21.945445 IP localhost.localdomain.http > ec2-34-222-229-130.us-west-2.compute.amazonaws.com.46030: Flags [.], ack 276, win 204, options [nop,nop,TS val 60334181 ecr 3783957172], length 0

And the result of that attempt was the same as the error in your original post? The SERVFAIL specifically?

Could you also please tell me what this shows:

cat /proc/sys/net/ipv4/tcp_tw_recycle

Yes, the cert error was the same = SERVFAIL

cat shows this:

[root@localhost ~]# cat /proc/sys/net/ipv4/tcp_tw_recycle
0

@lestaff is the current multi-VA setup in production supposed to succeed if any of the VAs succeed? Or is there a specific one (i.e. the non-AWS VA) that has to succeed?

In the above pcap, you can see that in at least a couple of cases, the VA managed to resolve the domain and subsequently deliver the HTTP request:

21:16:21.326407 IP ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998 > localhost.localdomain.http: Flags [P.], seq 1:275, ack 1, win 211, options [nop,nop,TS val 3306994484 ecr 60333544], length 274: HTTP: GET /.well-known/acme-challenge/QFoK6pimuWxj53aCeMQNWuqoBREDfxbL83TgybngaBw HTTP/1.1

and from the perspective of the server, a response was sent:

21:16:21.326995 IP localhost.localdomain.http > ec2-52-15-254-228.us-east-2.compute.amazonaws.com.35998: Flags [P.], seq 1:250, ack 275, win 204, options [nop,nop,TS val 60333562 ecr 3306994484], length 249: HTTP: HTTP/1.1 200 OK

I'm not sure how we get from here to:

SERVFAIL looking up A

unless one of the VAs is more important than the others.
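
A rough way to read a capture like the one above is to tally, per remote host, how many challenge requests and DNS queries arrived and how many replies went back. The sketch below does that with awk on tcpdump's text output. The function name summarize_pcap and the output format are my own invention, and it assumes the local side shows up as localhost.localdomain, as it does in this dump (pass a different name or IP as the first argument otherwise).

```shell
#!/bin/sh
# Summarise tcpdump text output (read from stdin) per remote host.
# summarize_pcap is a hypothetical helper; $1 is the local host name as it
# appears in the capture (default: localhost.localdomain).
summarize_pcap() {
    awk -v local_host="${1:-localhost.localdomain}" '
    $2 == "IP" && $4 == ">" {
        src = $3
        dst = $5
        sub(/:$/, "", dst)
        # The last dot-separated component is the port or service name.
        n = split(src, s, "."); sport = s[n]
        n = split(dst, d, "."); dport = d[n]
        sub(/\.[^.]+$/, "", src)
        sub(/\.[^.]+$/, "", dst)
        if (dst == local_host) {
            remote = src; inbound = 1; port = dport
        } else if (src == local_host) {
            remote = dst; inbound = 0; port = sport
        } else {
            next
        }
        seen[remote] = 1
        if (port == "http" || port == "80") {
            if (inbound && $0 ~ /HTTP: GET /) get[remote]++
            if (!inbound && $0 ~ /HTTP: HTTP\/1\.[01] 200/) ok[remote]++
        } else if (port == "domain" || port == "53") {
            if (inbound) {
                dnsq[remote]++
            } else {
                dnsr[remote]++
            }
        }
    }
    END {
        for (r in seen)
            printf "%s http_get=%d http_200=%d dns_query=%d dns_reply=%d\n",
                   r, get[r], ok[r], dnsq[r], dnsr[r]
    }' | sort
}

# Usage, with the capture saved to a file (dump.txt is a placeholder name):
#   summarize_pcap < dump.txt
```

A validation host that never reached the server produces no line at all, so an address missing from the summary is as informative as the counts for the hosts that do appear.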

From "What is the current status of the implementation of multi-viewpoint validation?" (post #4 by cpu)

Hi @RowdyRhonda

Checking your domain dramend.com with "check your website" works; it finds the A record without any problem. But with the Uptrends Website Uptime Test, something looks wrong.

That's a tool that checks from several different locations.

Most of them are fine.

But Amsterdam and San Diego report a "connection failed" error (a TCP problem).

And Edinburg has a DNS lookup error.

It looks like your "special name server configuration" doesn't work completely.


Oh, what's that? Reading the complete output of https://check-your-website.server-daten.de/?q=dramend.com

Warning: Control chars (Ascii 11) found in Html-Content

Normally, that happens only if

  • the code is very buggy (wrong replacement) (or)
  • it's a spam bot detection: Placing some Control chars in the html code -> a spam bot crashes.

Is there such a system running?

Hi @JuergenAuer,
Thank you so much for your feedback on these issues! Much appreciated.

I will post this to tech support immediately and provide their reply to you soon as I hear back from them.

In the meantime, there is one particular, specific, issue that continues to stand out to me and I’d like to ask you about it, please.

In the output of check-your-website, I’m referencing this specific warning:

Warning: Not existing ACME-file, but Server sends 200, not 404 or redirect. May be a problem creating a Letsencrypt certificate. Checking /.well-known/acme-challenge/random-filename - a http status 404 - Not Found - is expected. If your server sends content and a http status 200, the validation file (87 bytes, token, dot and the hash of the public part of the account key) may be invisible, so Letsencrypt can’t validate your domain. If it is an application that sends this content, perhaps create an exception, so /.well-known/acme-challenge sends raw files. Or create a redirect to another domain and / or port 443, but your Letsencrypt client must support such a solution.

I have 2 questions in regards to this:

QUESTION 1 - Currently, the Nginx domain configuration is blocking the /.well-known/acme-challenge/ FOLDER with 403 Forbidden access, due to the dot prefix.

Is this contributing to the problem with LetsEncrypt not being able to complete the acme challenge for domain name ownership validation?

QUESTION 2 - Both sites are built on WordPress and both are running a basic 404 plugin that redirects all 404’s to the homepage.

When you attempt to access a FILE within the /.well-known/acme-challenge/ folder it does in fact trigger a 404 and redirects to the home page, which results in an HTTP status 200 OK instead of 404 Not Found.

Is this contributing to the problem with LetsEncrypt not being able to complete the acme challenge for domain name ownership validation?

It seems to me that one or both of these issues are contributing to the LetsEncrypt issues and I’d appreciate your feedback on this.

Thank you!

Hi @_az,

Thank you for your help on these issues!

I appreciate you escalating your concerns up to the staff for clarification.

Hi @Phil ,

Thank you for your clarification on the status of the multi-viewpoint validation. Much appreciated.

I don't know. First you must have an A-record. That may be the next problem.

In general, that's a wrong configuration. Google calls it a "soft 404":

Why does it matter?

Returning a success code, rather than 404/410 (not found) or 301 (moved), is a bad practice. A success code tells search engines that there’s a real page at that URL. As a result, the page may be listed in search results, and search engines will continue trying to crawl that non-existent URL instead of spending time crawling your real pages.

A 404 shouldn't redirect to another file (duplicated content) and shouldn't end in an HTTP status 200. A 200 says "the first URL is OK", but it's not.

If a URL is wrong: don't redirect, and don't send a wrong HTTP status code. It's possible to show a page with a search. But the answer should be 404, not 200.
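
To check this from a shell rather than a web tool, something like the following may help. It is only a sketch: both function names are hypothetical, the verdict wording is mine, and random-filename is the same placeholder path that check-your-website uses. It only looks at how a missing challenge file is answered and says nothing about the DNS error itself.

```shell
#!/bin/sh
# Map the HTTP status returned for a NON-EXISTENT challenge file to a short
# verdict. The function name and the verdict wording are illustrative only.
classify_missing_challenge_status() {
    case "$1" in
        404) echo "ok: missing file is reported as 404" ;;
        200) echo "warning: soft 404 (200 returned for a missing file)" ;;
        301|302|303|307|308) echo "note: redirected, check the final status too" ;;
        403) echo "warning: access to the challenge path is forbidden" ;;
        000) echo "error: no HTTP response at all" ;;
        *)   echo "unexpected: status $1" ;;
    esac
}

# Fetch only the status code for a made-up file name under the challenge path.
# curl prints 000 when it gets no HTTP response.
probe_missing_challenge() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
        "http://$1/.well-known/acme-challenge/random-filename"
}

# Usage (needs network access):
#   classify_missing_challenge_status "$(probe_missing_challenge dramend.com)"
```

Running it once with the 404 plugin enabled and once with it disabled should show whether the plugin is what turns the 404 into a 200.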

@JuergenAuer - Thank you for your response.

Yes, I’m aware of these http status issues. That’s why I’m asking if they might be contributing to the LetsEncrypt issues.

The A records are in place. Have been all along and “pass” all common DNS record lookups, digs and pings with the correct IP and NS records response.

Back to http status . . . this is an issue I’ve been battling with WordPress for a long time. I’ve initiated support conversations with several of the 404 type WP plugin developers asking if they can/will correct the 404 status code . . . pretty much to no avail so far.

There is another serious issue that comes into play on this as well . . . caching plugins :frowning: The first time an actual 404 fires, the caching plugins let it return correctly as a 404 status code . . . thereafter, though, those URLs are stored and served up from cache as though they were existing URLs, consequently returning a 200 OK status code. On these too, I’ve initiated support conversations with several of the cache type WP plugin developers asking if they can/will correct this fallacy . . . pretty much to no avail on these either . . . so far anyway.

I’m going to disable the 404 plugin and try issuing a cert again . . . wild shot in the dark to see if this is in fact causing the problem. – I’ll let you know if it works.

Well @JuergenAuer,

I disabled the 404 plugin, confirmed the URL .well-known/acme-challenge/random-filename sent the proper 404 Not Found status code and tried again to issue an LE SSL certificate.

It produced the same error output:

Error: Could not issue a Let's Encrypt SSL/TLS certificate for pheromoneadvantage.com. Authorization for the domain failed.
Details
Invalid response from https://acme-v02.api.letsencrypt.org/acme/authz-v3/1733744475.
Details:
Type: urn:ietf:params:acme:error:dns
Status: 400
Detail: DNS problem: SERVFAIL looking up A for pheromoneadvantage.com

Grrrrr . . . :face_with_symbols_over_mouth:

@JuergenAuer, @_az, and @Phil – This is a BUMP to see if any of you can help me with my last post (above) please?

Thank you!

I believe the issue is still the one I originally proposed: Let’s Encrypt’s main validation server (66.133.109.36, currently) can’t connect to your server (for both DNS lookups and for HTTP traffic).

Phil’s response helped narrow down the issue to just that one IP, not any of the non-primary validation servers.

The cause could be anything: firewall rules, routing issues, something else. But I strongly suspect that the blame lies with a firewall device on or in front of your server, because other servers hosted in the same /24 as your server do not suffer from the same issue.

Try the tcpdump again, but restrict the hosts we are interested in to only the primary VAs:

tcpdump -i eth0 'host 66.133.109.36 or host 64.78.149.164'

Check whether any output appears when you try to issue a certificate from Plesk.
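
If nothing shows up in the capture, one quick local check is whether the server's own iptables rules name either of those addresses. The sketch below filters `iptables -S` output for rules that mention a given address exactly. The function name rules_mentioning is made up, and the check is deliberately narrow: it will not notice a blocked range, an ipset, or a firewall sitting in front of the server.

```shell
#!/bin/sh
# Print the rules (in `iptables -S` format, read from stdin) that mention any
# of the addresses given as arguments, with or without a /mask suffix.
# rules_mentioning is a hypothetical helper name.
rules_mentioning() {
    awk -v ips="$*" '
    BEGIN { n = split(ips, want, " ") }
    {
        for (f = 1; f <= NF; f++) {
            addr = $f
            sub(/\/[0-9]+$/, "", addr)   # drop a trailing /32 etc.
            for (i = 1; i <= n; i++) {
                if (addr == want[i]) {
                    print
                    next
                }
            }
        }
    }'
}

# Usage (needs root; the addresses are the primary VA addresses quoted in
# this thread and may change over time):
#   iptables -S | rules_mentioning 66.133.109.36 64.78.149.164
```

An empty result only rules out an explicit host rule on this machine; it does not rule out filtering elsewhere on the path.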

Thank you so much @_az! :sunglasses:

I will keep working on it from this angle to see if I can get it sorted out.

Hi @_az, @JuergenAuer and @Phil,

Hope you guys had wonderful holidays :slight_smile:

I’m back to pick up where I left off with this task and currently working on dramend.com.

I’ve made a few minor changes on the server, so I’m running back through your instructions from the top down.

Here’s where I’m at now:
__

FIRST TCPDUMP TEST:
tcpdump -i eth0 'host 66.133.109.36 or host 34.222.229.130 or host 52.15.254.228 or host 52.28.236.88 or host 35.162.100.107 or host 54.244.166.87 or host 3.133.161.228'

Same output as above in the terminal and same error

Error: Could not issue a Let’s Encrypt SSL/TLS certificate for dramend.com. Authorization for the domain failed.
Details
Invalid response from https://acme-v02.api.letsencrypt.org/acme/authz-v3/2201140459.
Details:
Type: urn:ietf:params:acme:error:dns
Status: 400
Detail: DNS problem: SERVFAIL looking up A for dramend.com

__

CAT COMMAND:
cat /proc/sys/net/ipv4/tcp_tw_recycle

Same output

[root@localhost ~]# cat /proc/sys/net/ipv4/tcp_tw_recycle
0
__

SECOND TCPDUMP TEST:
tcpdump -i eth0 'host 66.133.109.36 or host 64.78.149.164'

No output in the terminal and a different error this time - query timed out

Error: Could not issue a Let’s Encrypt SSL/TLS certificate for dramend.com. Authorization for the domain failed.
Details
Invalid response from https://acme-v02.api.letsencrypt.org/acme/authz-v3/2201219670.
Details:
Type: urn:ietf:params:acme:error:dns
Status: 400
Detail: DNS problem: query timed out looking up A for dramend.com
__

Please let me know any additional information you need on this.

Thank you!
