DNS-01: Remote PerformValidation RPC failed

As part of my free hosting service InfinityFree, I integrated Let’s Encrypt in my panel for users to issue SSL certificates. However, since roughly February 27, many SSL requests have failed with errors like this:

During secondary validation: Remote PerformValidation RPC failed

The setup works by having users create a CNAME record for their _acme-challenge subdomain which points to {{ token }}.acme.infinityfree.net. When they request a certificate, I request a DNS-01 challenge for the domain and upload the challenge value to the DNS server as a TXT record. Let’s Encrypt should then follow the CNAME record and verify the TXT record under acme.infinityfree.net.
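For readers unfamiliar with the moving parts, here is a minimal sketch of the two values involved in such a CNAME-delegated DNS-01 setup. The TXT value follows the DNS-01 definition in RFC 8555 (base64url of the SHA-256 of the key authorization). The `delegation_label` parameter stands in for the per-user {{ token }} in the CNAME target and is a hypothetical name, not something taken from the InfinityFree panel or acmephp.

```python
import base64
import hashlib


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """Return the TXT record value for a DNS-01 challenge (RFC 8555, section 8.4)."""
    # Key authorization = challenge token + "." + account key thumbprint.
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url encoding with the trailing "=" padding stripped.
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def challenge_names(domain: str, delegation_label: str,
                    zone: str = "acme.infinityfree.net"):
    """Return (name the user CNAMEs, name where the TXT record is published).

    delegation_label is a hypothetical stand-in for the per-user {{ token }}.
    """
    return f"_acme-challenge.{domain}.", f"{delegation_label}.{zone}."
```

The validator queries the first name, follows the CNAME to the second, and expects to find the TXT value there.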

Other topics about this error message indicated IPv6 routing issues. But the only IPv6 involved here is the IPv6 support on Cloudflare’s and Amazon’s nameservers, which I doubt are (or hope are not) the issue.

Does anyone have any idea what’s causing this vague error message, and how to solve it?


My domain is: adema.io

I ran this command: N/A

It produced this output:

{
	"type": "dns-01",
	"status": "invalid",
	"error": {
		"type": "urn:ietf:params:acme:error:serverInternal",
		"detail": "During secondary validation: Remote PerformValidation RPC failed",
		"status": 500
	},
	"url": "https:\/\/acme-v02.api.letsencrypt.org\/acme\/chall-v3\/3091866796\/9KFYAw",
	"token": "nGV5Mlx_Cx1wENjYXowe1Z-lgETEZkTNqXvKjwbYyL8",
	"validationRecord": [
		{
			"hostname": "adema.io"
		}
	]
}

My web server is (include version): N/A

The operating system my web server runs on is (include version): N/A

My hosting provider, if applicable, is: InfinityFree / Google Cloud / Cloudflare / Amazon Web Service

I can login to a root shell on my machine (yes or no, or I don’t know): yes

I’m using a control panel to manage my site (no, or provide the name and version of the control panel): InfinityFree client area, latest version. Built with acmephp/core: 1.2.0.

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you’re using Certbot): acmephp/core: 1.2.0

When I see errors with “RPC” in them, I always think it might be an issue on Let’s Encrypt’s side. Not sure of course. Perhaps @lestaff can shed some light on this error message?

Hi @Grendel

Checking your TXT record manually, it's correct.

So InfinityFree doesn't work with the new Multi-Perspective Validation.

Let's Encrypt's own servers see your TXT entry; the other validation servers do not.

-->> InfinityFree has to fix that.

Thank you for the explanation. I own InfinityFree, so I will need to (and would like to) fix this.

I checked the article you linked to, but I don’t understand what exactly I need to do to support Multi-Perspective Validation. Is this something the client library needs to handle? Or would this require changes to the DNS setup?

Every DNS edge node just needs to have the right record when the TXT record is checked, so wait long enough for the record change to reach every authoritative DNS server for acme.infinityfree.net.
Also, Let's Encrypt caches DNS records for 60 seconds, which may be a problem for your setup: if the ACME TXT records for all subdomains are served from a single subdomain, a cached answer may be reused instead of being rechecked.

Thanks all for the replies.

I’ve increased the finalize delay from 60 seconds to 180 seconds. I hope this increase, which covers the 60s TTL, the 60s cache and 60s of internet randomness, is enough to mitigate this issue.
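As an alternative to a fixed delay, the same 180-second budget can be used as an upper bound for polling. The following is only a sketch: `is_visible` is a hypothetical callable that would perform the actual DNS check, and the sleep and clock functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def wait_for_record(is_visible: Callable[[], bool],
                    timeout: float = 180.0,
                    interval: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll is_visible() until it returns True or the timeout has elapsed."""
    deadline = clock() + timeout
    while True:
        if is_visible():
            return True          # record seen, no need to wait any longer
        if clock() >= deadline:
            return False         # budget used up, caller decides what to do
        sleep(interval)
```

Note that a successful check from one location does not guarantee that every validation perspective sees the record, so this can shorten the wait but does not replace fixing the underlying DNS issue.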

I did the same, but it has been really unreliable since the check was activated.

I have 4 DNS servers under my control and restart them to load the new TXT records. I’m sure the correct entries are active, but 8 out of 10 tries fail…

This is a good instinct in general. In this particular case: The "Remote PerformValidation RPC failed" error usually results from a timeout when our primary validation servers talk to the remote validation servers. Typically that's a result of the validation process taking too long.

We try to nest our timeouts so that the remote validation servers return a specific error like "SERVFAIL looking up XXX" before the primary validation server decides the RPC itself has timed out, but that hasn't turned out to be reliable in practice. We should definitely take a second look at the timeout nesting code and see if we can make it more robust so we deliver a more effective error message.

I looked in the logs for adema.io, and I do see that the remote perspectives are seeing timeouts when querying the Unbound instance for _acme-challenge.adema.io.

I see that the NS records for acme.infinityfree.net. point at Route53, which is generally fine and probably not the problem:

$ dig +short ns acme.infinityfree.net.
ns-1032.awsdns-01.org.
ns-1999.awsdns-57.co.uk.
ns-432.awsdns-54.com.
ns-580.awsdns-08.net.

However, I encountered some problems looking up the NS records for adema.io:

$ dig +short ns adema.io.
<no output>

If you look at the DNSViz report for _acme-challenge.adema.io, it has a warning:

io to adema.io: The following NS name(s) were found in the delegation NS RRset (i.e., in the io zone), but not in the authoritative NS RRset: ns1.epizy.com, ns2.epizy.com

I'm not positive this is the problem, but it's one avenue to investigate.
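The DNSViz warning boils down to a set comparison between the NS names in the parent zone's delegation and the NS names the zone itself serves. A small sketch of that comparison, assuming the two lists have already been collected (for example from dig output); the function names are illustrative only:

```python
def normalize(name: str) -> str:
    """Lower-case a DNS name and drop the trailing root dot."""
    return name.strip().rstrip(".").lower()


def compare_ns_sets(delegation, authoritative):
    """Report NS names present on only one side of the delegation."""
    parent = {normalize(n) for n in delegation}
    child = {normalize(n) for n in authoritative}
    return {
        "only_in_delegation": sorted(parent - child),
        "only_in_zone": sorted(child - parent),
    }
```

With the data above (ns1.epizy.com and ns2.epizy.com in the delegation, an empty NS answer from the zone), both names end up in `only_in_delegation`.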

If you've been having this problem for multiple domain names, would you mind providing a sample of others having the same problem?

FYI, I went to double-check our timeout settings based on this thread, and I found that our timeouts to the remote perspectives were actually set too low. That increases the likelihood that when your servers are slow, you would get the uninformative “PerformValidation RPC failed” error.

We’re working on a fix right now, but there’s a good chance that once the fix is deployed, the more-informative error you get will indicate that the DNS lookup timed out, or received a SERVFAIL.

One simple test that suggests this isn’t just a problem with our remote perspectives:

$ dig TXT _acme-challenge.adema.io @8.8.8.8

; <<>> DiG 9.11.5-P4-5.1ubuntu2.1-Ubuntu <<>> TXT _acme-challenge.adema.io @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31873
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;_acme-challenge.adema.io.      IN      TXT

;; Query time: 2297 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sun Mar 01 17:23:35 PST 2020
;; MSG SIZE  rcvd: 53

Note that I don’t always get SERVFAIL from 8.8.8.8; some of the time I get a success. But the fact that I get it some of the time suggests there really is some issue.
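Because the failure is intermittent, a single query is not very telling; repeating it and tallying the response codes gives a better picture. A rough sketch that extracts the status from dig's header line and counts outcomes over several captured outputs (the helper names are made up for this example):

```python
import re
from collections import Counter

# Matches the header line dig prints, e.g.
# ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31873"
STATUS_RE = re.compile(r"->>HEADER<<-.*?status: ([A-Z]+)")


def dig_status(output: str):
    """Return the response status from dig output, or None if no header is found."""
    match = STATUS_RE.search(output)
    return match.group(1) if match else None


def tally_statuses(outputs):
    """Count how often each status occurs across several dig outputs."""
    return Counter(dig_status(o) or "NO_HEADER" for o in outputs)
```

The outputs could be collected with something like `subprocess.run(["dig", "TXT", name, "@8.8.8.8"], capture_output=True, text=True).stdout` in a loop; a timeout produces no header and is counted as `NO_HEADER`.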

We’ve now deployed a fix so you should get a better error message. Want to try again?

It looks like it’s working today, thanks. I renewed 4 certificates today and it worked.

For everyone else: in my pre-DNS hook I use host -t txt against all my DNS servers to check that the entry is active, and I set a TTL of 0 to disable DNS caching.

The last test I do in the post-commit step is to wait until the entry is visible on Google’s DNS server 8.8.8.8:

host -t txt xxxxxx 8.8.8.8
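The per-server check described here can be sketched as a small helper that reports which servers do not yet return the expected TXT value. This is an illustration, not the poster's actual hook: `lookup_txt` is a hypothetical callable that would wrap `host -t txt` (or any resolver library) for a given server.

```python
from typing import Callable, Iterable, List


def servers_missing_record(servers: Iterable[str],
                           lookup_txt: Callable[[str], Iterable[str]],
                           expected: str) -> List[str]:
    """Return the servers that do not (yet) serve the expected TXT value."""
    missing = []
    for server in servers:
        try:
            values = set(lookup_txt(server))
        except Exception:
            # A timeout or SERVFAIL counts as "record not visible".
            values = set()
        if expected not in values:
            missing.append(server)
    return missing
```

The hook would only proceed once this returns an empty list for all authoritative servers plus 8.8.8.8.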

Thank you very much for the help!

I get different error messages now:

During secondary validation: DNS problem: SERVFAIL looking up TXT for _acme-challenge.www.modahealthquotes.com - the domain's nameservers may be malfunctioning

and

During secondary validation: DNS problem: SERVFAIL looking up CAA for www.promotechcomputersolutions.com - the domain's nameservers may be malfunctioning

This message is at least a lot more informative than the "RPC failed" one, and it does support your claim of this possibly being a problem with my nameservers.
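To see which record types and names fail most often across many requests, the new detail strings can be broken into fields. The pattern below is inferred only from the two messages quoted above; the detail text is meant for humans and may change, so treat this as a best-effort sketch.

```python
import re

# Pattern inferred from the two example messages; not a stable format.
DETAIL_RE = re.compile(
    r"^(?P<secondary>During secondary validation: )?"
    r"DNS problem: (?P<rcode>\w+) looking up (?P<rtype>\w+) for (?P<name>\S+)"
)


def parse_dns_problem(detail: str):
    """Split a 'DNS problem' error detail into its parts, or return None."""
    match = DETAIL_RE.match(detail)
    if match is None:
        return None
    return {
        "secondary": match.group("secondary") is not None,
        "rcode": match.group("rcode"),
        "rtype": match.group("rtype"),
        "name": match.group("name"),
    }
```

The older "Remote PerformValidation RPC failed" detail does not match the pattern and yields `None`.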

Annoyingly, I don't get any errors from 8.8.8.8. The nameservers typically used by my customers are on a geo distributed system which has not always worked terribly well. So if I had to guess, the POP which at least one of your secondary locations is connecting from could be malfunctioning.

In order to figure out whether this is the case, I have a few more questions which I hope you can answer.

  • Does Let's Encrypt always use Google Public DNS to resolve the domain names? Or do you use other resolvers as well or instead? Or maybe not use any resolvers at all?
  • Can you see which secondary locations are reporting SERVFAIL errors for the domains on my nameservers? Does the issue center around any specific POP(s)?
  • If the issue does center around any specific Let's Encrypt POPs, could you share a traceroute from that POP to ns1.epizy.com? From that, I should be able to derive our POP to check.

Here is a sample of a few more recent failed domains:

www.modahealthquotes.com
www.snoopbuy.com
www.mochinga.live
www.webamela.ga
www.khyrulkabir.ml
www.ozerileri.com.tr
www.promotechcomputersolutions.com

Checking that domain, there is a "not so good" configuration: https://check-your-website.server-daten.de/?q=promotechcomputersolutions.com

The two name servers:

promotechcomputersolutions.com
  • ns1.epizy.com / ns1.byet.org
    198.251.86.152
    Amsterdam/North Holland/Netherlands (NL) - FranTech Solutions
  • ns2.epizy.com / ns1.byet.org
    198.251.86.153
    Amsterdam/North Holland/Netherlands (NL) - FranTech Solutions

Not really different networks.

And delegation / zone is inconsistent:

Fatal: Inconsistency between delegation and zone. The set of NS records served by the authoritative name servers must match those proposed for the delegation in the parent zone.: ns1.epizy.com (198.251.86.152): Delegation: ns1.epizy.com,ns2.epizy.com, Zone: ns2.epizy.com

Neither problem is completely critical. Is there perhaps something like a firewall that blocks requests when multi-perspective validation is used?

Let's Encrypt runs their own resolvers (using an off-the-shelf recursive DNS server).

They are anycast IPs. Not in different subnets, sure, but changing that would be purely cosmetic. IMO the idea that using different subnets makes the service any more reliable is quite outdated.

That's one of the issues which @jsha also raised. I do agree that the lack of NS records is not correct or nice, but I haven't heard any complaints for the months or years where this has been the case.

To my knowledge, there is no classic firewall restricting access from IPs. But, come to think of it, there is a DDoS protection system which might interfere with the DNS queries.

Or it might be a broken POP. Which, given that Google's resolvers also returned errors for @jsha, sounds a bit more likely.

But that's just a guess on my end as long as I don't have any network diagnostic information.

That's the Unbound instance which @jsha was referring to? Too bad, it makes it harder to reproduce issues locally.

That happens sometimes. Let's Encrypt is especially capable of triggering paranoid DDoS systems because it can send a modestly large amount of queries, for a "weird" type (CAA), and for things that don't exist (CAA again, and misconfigured domains).

Yes.

https://unboundtest.com/ is a public service with an ~identical configuration, but it's totally unrelated infrastructure.

That may be the problem.

I'll just confirm what @mnordhoff and @JuergenAuer have said. They are, as always, spot on. :slight_smile:

I just spun up a couple more instances of unboundtest.com in different regions:

https://ams.unboundtest.com/
https://blr.unboundtest.com/

Try doing some lookups from there and let me know if it helps diagnose the problem. If people wind up finding these useful I'll update the HTML on the main page to link out to them.

Edit: I've removed ams.unboundtest.com and blr.unboundtest.com to reduce maintenance burden.

That's great, thanks!

And there is a result:

Checked promotechcomputersolutions.com

https://unboundtest.com/m/A/promotechcomputersolutions.com/N7QJAE3X

2 seconds.

https://blr.unboundtest.com/m/A/promotechcomputersolutions.com/TL3XQCTX

8 seconds.

Rechecked - 7 seconds.

@jsha

Nice! :smile:

It looks like blr doesn’t have working IPv6, though. :grimacing: E.g. @JuergenAuer’s test has a lot of:

Mar 02 20:36:02 unbound[1140:0] info: error sending query to auth server 2001:500:2d::d port 53
Mar 02 20:36:02 unbound[1140:0] info: error sending query to auth server 2001:500:2::c port 53
Mar 02 20:36:02 unbound[1140:0] info: error sending query to auth server 2001:dc3::35 port 53
Mar 02 20:36:02 unbound[1140:0] info: error sending query to auth server 2001:7fd::1 port 53
Mar 02 20:36:02 unbound[1140:0] info: error sending query to auth server 2001:7fd::1 port 53

(Those are root nameservers.)

(ams works perfectly, though!)
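When scanning a long unboundtest.com log for this pattern, it can help to count the send errors per address family. A small sketch that assumes the log line format shown above (the function names are invented for the example):

```python
import ipaddress
import re
from collections import Counter

# Matches lines such as:
# "... info: error sending query to auth server 2001:7fd::1 port 53"
ERR_RE = re.compile(r"error sending query to auth server (\S+) port (\d+)")


def failed_auth_servers(log_text: str) -> Counter:
    """Count 'error sending query' lines per authoritative server address."""
    return Counter(m.group(1) for m in ERR_RE.finditer(log_text))


def failures_by_ip_version(log_text: str) -> dict:
    """Count the same errors grouped by IP version (4 or 6)."""
    versions = Counter()
    for m in ERR_RE.finditer(log_text):
        versions[ipaddress.ip_address(m.group(1)).version] += 1
    return dict(versions)
```

If every error is for an IPv6 address, as in the excerpt above, broken IPv6 connectivity on the resolver host is the likely cause rather than the queried nameservers.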
