DNS-01: Remote PerformValidation RPC failed

Thanks all for the replies.

I’ve increased the finalize delay from 60 seconds to 180 seconds. I hope that this increase, together with the 60s TTL, 60s cache, and 60s of internet randomness, is enough to mitigate this issue.

I did the same, but it’s been really unreliable since the check was activated.

I have 4 DNS servers under my control. I restart them so they serve the new TXT records, and I’m sure that the correct entries are active, but out of 10 tries I get 8 failures…

1 Like

This is a good instinct in general. In this particular case: The “Remote PerformValidation RPC failed” error usually results from a timeout when our primary validation servers talk to the remote validation servers. Typically that’s a result of the validation process taking too long.

We try to nest our timeouts so that the remote validation servers return a specific error like “SERVFAIL looking up XXX” before the primary validation server decides the RPC itself has timed out, but that hasn’t turned out to be reliable in practice. We should definitely take a second look at the timeout nesting code and see if we can make it more robust so we deliver a more effective error message.
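
As a rough illustration of that nesting (the numbers and the nameserver address here are made up, not our actual code or settings): the inner lookup’s worst case has to fit inside the outer deadline, otherwise the outer deadline fires first and all you see is the generic RPC failure.

$ timeout 20 sh -c 'dig +time=5 +tries=3 TXT _acme-challenge.example.com @203.0.113.53 || echo "specific DNS error"' || echo "outer deadline hit: generic RPC failure"

Here the inner dig gives up after at most 3 tries x 5 seconds = 15 seconds and prints its own specific error, safely before the 20-second outer timeout would kill the whole command.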

I looked in the logs for adema.io, and I do see that the remote perspectives are seeing timeouts when querying the Unbound instance for _acme-challenge.adema.io.

I see that the NS records for acme.infinityfree.net. point at Route53, which is generally fine and probably not the problem:

$ dig +short ns acme.infinityfree.net.
ns-1032.awsdns-01.org.
ns-1999.awsdns-57.co.uk.
ns-432.awsdns-54.com.
ns-580.awsdns-08.net.

However, I encountered some problems looking up the NS records for adema.io:

$ dig +short ns adema.io.
<no output>

If you look at https://dnsviz.net/d/_acme-challenge.adema.io/dnssec/, it has a warning:

io to adema.io: The following NS name(s) were found in the delegation NS RRset (i.e., in the io zone), but not in the authoritative NS RRset: ns1.epizy.com, ns2.epizy.com

I’m not positive this is the problem, but it’s one avenue to investigate.

If you’ve been having this problem for multiple domain names, would you mind providing a sample of others having the same problem?

4 Likes

FYI, I went to double-check our timeout settings based on this thread, and I found that our timeouts to the remote perspectives were actually set too low. That increases the likelihood that, when your servers are slow, you would get the uninformative “PerformValidation RPC failed” error.

We’re working on a fix right now, but there’s a good chance that once the fix is deployed, the more-informative error you get will indicate that the DNS lookup timed out, or received a SERVFAIL.

One simple test that suggests this isn’t just a problem with our remote perspectives:

$ dig TXT _acme-challenge.adema.io @8.8.8.8

; <<>> DiG 9.11.5-P4-5.1ubuntu2.1-Ubuntu <<>> TXT _acme-challenge.adema.io @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31873
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;_acme-challenge.adema.io.      IN      TXT

;; Query time: 2297 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sun Mar 01 17:23:35 PST 2020
;; MSG SIZE  rcvd: 53

Note that I don’t always get SERVFAIL from 8.8.8.8; some of the time I get a success. But the fact that I get it some of the time suggests there really is some issue.
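
A crude way to quantify that (the loop count here is arbitrary) is to repeat the query and tally the response codes:

$ for i in $(seq 1 20); do dig +noall +comments TXT _acme-challenge.adema.io @8.8.8.8 | grep -o 'status: [A-Z]*'; done | sort | uniq -c

If a meaningful fraction of those 20 answers comes back SERVFAIL rather than NOERROR, the intermittent failure is reproducible outside our infrastructure.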

4 Likes

We’ve now deployed a fix so you should get a better error message. Want to try again?

6 Likes

Looks like it’s working today… thanks… I renewed 4 certificates today and it’s working…

For everyone else: in my pre-DNS hook I use host -t txt to check on all of my DNS servers whether the entry is active, and I set a TTL of 0 to disable DNS caching.

The last test I do in the post-commit hook is to wait until the entry is visible on the DNS server 8.8.8.8 (Google):

host -t txt xxxxxx 8.8.8.8
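
Roughly, the two checks together look like this (the record and the server names are placeholders, not my real zone):

#!/bin/sh
# pre-dns hook (sketch with placeholder names): require the TXT entry to be
# visible on every authoritative server before continuing
RECORD="_acme-challenge.example.com"
for ns in ns1.example.com ns2.example.com ns3.example.com ns4.example.com; do
  host -t txt "$RECORD" "$ns" | grep -q 'descriptive text' \
    || { echo "TXT not yet active on $ns"; exit 1; }
done

# post-commit check: wait until the entry is also visible via 8.8.8.8
until host -t txt "$RECORD" 8.8.8.8 | grep -q 'descriptive text'; do
  sleep 15
done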

1 Like

Thank you very much for the help!

I get different error messages now:

During secondary validation: DNS problem: SERVFAIL looking up TXT for _acme-challenge.www.modahealthquotes.com - the domain's nameservers may be malfunctioning

and

During secondary validation: DNS problem: SERVFAIL looking up CAA for www.promotechcomputersolutions.com - the domain's nameservers may be malfunctioning

This message is at least a lot more informative than the “RPC failed” one, and it does support your suggestion that this could be a problem with my nameservers.

Annoyingly, I don’t get any errors from 8.8.8.8. The nameservers typically used by my customers are on a geo-distributed system which has not always worked terribly well. So if I had to guess, the nameserver POP that at least one of your secondary validation locations is reaching could be malfunctioning.

In order to figure out whether this is the case, I have a few more questions which I hope you can answer.

  • Does Let’s Encrypt always use Google Public DNS to resolve the domain names? Or do you use other resolvers as well or instead? Or maybe not use any resolvers at all?
  • Can you see which secondary locations are reporting SERVFAIL errors for the domains on my nameservers? Does the issue center around any specific POP(s)?
  • If the issue does center around any specific Let’s Encrypt POPs, could you share a traceroute from that POP to ns1.epizy.com? From that, I should be able to derive our POP to check.

Here is a sample of a few more recent failed domains:

www.modahealthquotes.com
www.snoopbuy.com
www.mochinga.live
www.webamela.ga
www.khyrulkabir.ml
www.ozerileri.com.tr
www.promotechcomputersolutions.com

Checking that domain, there is a “not so good” configuration: https://check-your-website.server-daten.de/?q=promotechcomputersolutions.com

The two name servers:

promotechcomputersolutions.com
	• ns1.epizy.com / ns1.byet.org, 198.251.86.152, Amsterdam/North Holland/Netherlands (NL), FranTech Solutions
	• ns2.epizy.com / ns1.byet.org, 198.251.86.153, Amsterdam/North Holland/Netherlands (NL), FranTech Solutions

Not really different networks.

And delegation / zone is inconsistent:

Fatal: Inconsistency between delegation and zone. The set of NS records served by the authoritative name servers must match those proposed for the delegation in the parent zone.: ns1.epizy.com (198.251.86.152): Delegation: ns1.epizy.com,ns2.epizy.com, Zone: ns2.epizy.com
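
A sketch of how to see that mismatch yourself (a.gtld-servers.net is just one of the .com servers; any of them works):

$ dig +norecurse NS promotechcomputersolutions.com @a.gtld-servers.net
$ dig NS promotechcomputersolutions.com @ns1.epizy.com

The first answer shows the delegation the parent zone hands out (in the AUTHORITY section); the second shows the NS RRset the zone itself publishes. They should list the same names.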

Neither problem is completely critical on its own. Could there be something like a blocking firewall that interferes when multi-perspective validation is used?

2 Likes

Let’s Encrypt runs their own resolvers (Unbound, an off-the-shelf recursive DNS server) rather than Google Public DNS or other public resolvers.

3 Likes

They are anycast IPs. Not in different subnets, sure, but changing that would be purely cosmetic. IMO the idea that using different subnets makes the service any more reliable is quite outdated.

That’s one of the issues which @jsha also raised. I do agree that the lack of NS records is not correct or nice, but I haven’t heard any complaints for the months or years where this has been the case.

To my knowledge, there is no classic firewall restricting access from IPs. But, come to think of it, there is a DDoS protection system which might interfere with the DNS queries.

Or it might be a broken POP. Which, given that Google’s resolvers also returned errors for @jsha, sounds a bit more likely.

But it’s just a guess on my end, since I don’t have any network diagnostic information.

That’s the Unbound instance which @jsha was referring to? Too bad, it makes it harder to reproduce issues locally.

1 Like

That happens sometimes. Let’s Encrypt is especially capable of triggering paranoid DDoS systems because it can send a modestly large number of queries, for a “weird” type (CAA), and for things that don’t exist (CAA again, and misconfigured domains).
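
For a rough sense of the query pattern (illustrative only; the exact set depends on the name, and every validation perspective repeats it), a single dns-01 validation involves something like:

$ dig TXT _acme-challenge.www.example.com
$ dig CAA www.example.com
$ dig CAA example.com
$ dig CAA com

The CAA check climbs from the validated name toward the root, and those queries mostly hit names that have no such record, which is exactly the kind of traffic a paranoid rate limiter can decide to drop.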

Yes.

https://unboundtest.com/ is a public service with an ~identical configuration, but it’s totally unrelated infrastructure.

1 Like

That (the DDoS protection system interfering with DNS queries) may be the problem.

1 Like

I’ll just confirm what @mnordhoff and @JuergenAuer have said. They are, as always, spot on. :slight_smile:

I just spun up a couple more instances of unboundtest.com in different regions:

https://ams.unboundtest.com/
https://blr.unboundtest.com/

Try doing some lookups from there and let me know if it helps diagnose the problem. If people wind up finding these useful I’ll update the HTML on the main page to link out to them.

2 Likes

That’s great, thanks!

And there is a result:

Checked promotechcomputersolutions.com

https://unboundtest.com/m/A/promotechcomputersolutions.com/N7QJAE3X

2 seconds.

https://blr.unboundtest.com/m/A/promotechcomputersolutions.com/TL3XQCTX

8 seconds.

Rechecked - 7 seconds.

1 Like

@jsha

Nice! :smile:

It looks like blr doesn’t have working IPv6, though. :grimacing: E.g. @JuergenAuer’s test has a lot of:

Mar 02 20:36:02 unbound[1140:0] info: error sending query to auth server 2001:500:2d::d port 53
Mar 02 20:36:02 unbound[1140:0] info: error sending query to auth server 2001:500:2::c port 53
Mar 02 20:36:02 unbound[1140:0] info: error sending query to auth server 2001:dc3::35 port 53
Mar 02 20:36:02 unbound[1140:0] info: error sending query to auth server 2001:7fd::1 port 53
Mar 02 20:36:02 unbound[1140:0] info: error sending query to auth server 2001:7fd::1 port 53

(Those are root nameservers.)

(ams works perfectly, though!)

2 Likes

Thanks for letting me know! I tried adding IPv6 support and restarting, but it still seems to be having some troubles. I’ll give it another shot later - please ping me if I forget. :slight_smile:

4 posts were split to a new topic: Difficulty issuing for shintajim.ir

I believe I’ve fixed this. https://blr.unboundtest.com/ should have working IPv6 now.

2 Likes

I did some testing on my end. While my monitoring that probes the nameservers directly doesn’t show any errors, the monitoring I added that goes through Google’s resolvers does return quite a high error rate. So I’m starting to suspect that a paranoid DDoS system, or some other malfunction in the nameserver stack, is the issue.

The error rate is not high enough to have a really noticeable impact on normal browsing, but if you do multiple TXT and CAA lookups from multiple locations, and one failed query fails the entire validation, then it’s not really surprising that many certificate requests fail.

I’m already trying to contact the nameserver operators to get more info about this.

Thanks all for the help so far!

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.