PowerDNS: Can't find why CAA servfails

cpu · July 19, 2017, 12:58pm

As a quick note: In the future it would be easier to keep this sort of detail straight if you created a separate thread. CAA servfails are a DNS provider specific problem, if you don't have the same DNS provider as the original poster then I think it would be best to use a separate thread.

My mistake, you are the original poster as you point out. Apologies. This advice should be for @tgx.

rickjanssen · July 19, 2017, 12:59pm

@cpu I am the original poster…

cpu · July 19, 2017, 1:01pm

I apologize. You’re correct. Now there are two confused staff in here. I’ve updated my comment above for @tgx.

rickjanssen · July 19, 2017, 1:20pm

Thank you @jsha, but all responses say "0000" (NOERROR). Do you have any idea?

Edit: Can you send over the resolution of a DNSSEC-enabled .nl domain, like overheid.nl? And is this pcap the same as the log you gave? Was the pcap made when you got the SERVFAIL, or afterwards?

Edit2: Can you send over a debug-level log? Or your configuration, so I don't have to bother you with this.

cpu · July 19, 2017, 6:35pm

4 posts were split to a new topic: DNSimple CAA SERVFAIL

cpu · July 19, 2017, 2:09pm

Here's a log with Unbound verbosity=3. The previous log was at 2.

I suspect so, but just in case here is another base64 encoded pcap file taken at the exact same time as the verbosity 3 log above.

rickjanssen · July 19, 2017, 2:13pm

Hi @cpu

So it seems to me like a DNSSEC validation error, but why would it fail? According to every DNSSEC test tool I try, DNSSEC is correct. I’ve set up an Unbound recursor with a default config, but it gives NOERROR…

cpu · July 19, 2017, 2:17pm

Agreed. It does seem to boil down to an NSEC failure on the NODATA response. I'm also personally unsure why that's happening in this case. I think we need to do some more digging.

rickjanssen · July 19, 2017, 2:18pm

@cpu is there anything I can do? With my unbound setup, the requests do not fail, they give NOERROR

cpu · July 19, 2017, 2:21pm

I can't offer any advice at present. I can share a sanitized copy of the config from the server I'm testing against if you want to try and see if you can get your Unbound to match.

EDIT: here it is - note that I removed some access-control lines and I wouldn't recommend running this as-is since I suspect it will be an open resolver.

I'm going to try increasing the verbosity level from 3 to 4 to get algorithm data. Looking at validator/val_nsec3.c's nsec3_prove_nodata function I believe this will give us more information.

rickjanssen · July 19, 2017, 2:27pm

Which version are you running?

@weppos Thank you, I only get NXDOMAIN when I ask for a domain that really doesn't exist.

cpu · July 19, 2017, 2:29pm

1.5.8-1ubuntu1 from apt on Ubuntu 16.04

cpu · July 19, 2017, 2:37pm

I tried this, and the snipped results do give more information that I can match up to the if(!has_valid_nsec) branch of the validate_nodata_response() function from ~/unbound-1.5.8/validator/validator.c. The extra information from the qchase and chase_reply doesn't give me any immediate insights :-/

rickjanssen · July 19, 2017, 3:59pm

I really can’t figure out what is causing this to happen. I have reproduced the SERVFAIL now and hope to figure out what the problem is soon. Is it possible for you to “whitelist” us (skip the CAA check) until we figure this out? Lots of customers and lots of domains are failing to create and renew certificates right now…

Edit: seems to be better now, not sure.

Edit2: Nope, servfail is back.

Edit3: It seems like the query gives “NOERROR” after a restart of PowerDNS and keeps giving NOERROR until querycache times out

jsha · July 19, 2017, 5:41pm

@rickjanssen I’m still curious about whether your nameservers are anycast. If so, that could potentially explain random differences, if queries are routed to different servers. Also, sounds like you’re running PowerDNS. What version? I assume you’re running the same version on all nameservers?

Also, in what role are you experiencing failures? Are you acting as a hosting provider and doing issuances yourself that are failing? Or are you acting as a nameserver and getting customer reports that their renewals are failing? Our ideal recommendation is to handle renewal automatically 30 days in advance, and automatically retry errors. Do you have customers whose certificates are on the cusp of expiring? We have a temporary whitelisting mechanism, but it’s based on domain names rather than nameservers (because Unbound doesn’t make that information available to Boulder). We are currently using a list of domains that were showing SERVFAIL consistently, but if a domain was returning NOERROR some of the time it wouldn’t have made it onto the list.

weppos · July 19, 2017, 6:03pm

I use a Mac, so I'll need to give it a shot and install Unbound. Not sure how feasible it is.

I can also provide some logs & pcaps if you share an affected domain name the same way I did for @rickjanssen.

I sent you the error and the names via email.

roland · July 19, 2017, 6:09pm

One thing that stands out to me is that, in the snippet that @cpu posted, your nameserver returned an unexpected NSEC record that doesn’t prove the NODATA response: mail.gwvanpelt.nl. 86400 IN NSEC pop.gwvanpelt.nl. A AAAA RRSIG NSEC. That record would authenticate a NODATA response for mail.gwvanpelt.nl. but not one for pop.gwvanpelt.nl., which is why that specific query failed.

That said, I’m unable to replicate this: whenever I query your servers I instead get valid NSEC3 records back, which makes me think there is something funky going on behind the scenes. As far as I understand it, PowerDNS has to be explicitly configured to serve either NSEC or NSEC3, so is it possible you have a mix of servers and one is acting up or something?
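To illustrate the point about the NSEC record, here is a minimal Python sketch of the "matching NSEC" NODATA rule: the record's owner name must equal the queried name, and its type list must contain neither the queried type nor CNAME. The `Nsec` class and `proves_nodata` function are hypothetical names invented for this example, and the check is deliberately simplified (no signature verification, no wildcard or NSEC3 handling), so it is not what Unbound actually runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nsec:
    """Hypothetical, simplified view of an NSEC record."""
    owner: str         # name the NSEC record is attached to
    next_name: str     # next name in canonical zone order
    types: frozenset   # RR types that exist at the owner name


def canonical(name: str) -> str:
    """Lower-case a domain name and make sure it ends with a dot."""
    name = name.lower()
    return name if name.endswith(".") else name + "."


def proves_nodata(nsec: Nsec, qname: str, qtype: str) -> bool:
    """Simplified NODATA check: the NSEC owner must equal qname, and the
    type list must contain neither the queried type nor CNAME."""
    if canonical(nsec.owner) != canonical(qname):
        return False
    return qtype not in nsec.types and "CNAME" not in nsec.types


# The record quoted from the snippet in this thread.
record = Nsec(
    owner="mail.gwvanpelt.nl.",
    next_name="pop.gwvanpelt.nl.",
    types=frozenset({"A", "AAAA", "RRSIG", "NSEC"}),
)

print(proves_nodata(record, "mail.gwvanpelt.nl.", "CAA"))  # True
print(proves_nodata(record, "pop.gwvanpelt.nl.", "CAA"))   # False
```

The second call returns False because the record is owned by mail.gwvanpelt.nl., so it says nothing about which types exist at pop.gwvanpelt.nl.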

WinstonSmith · July 19, 2017, 6:11pm

This any help?

http://dnsviz.net/d/pop.gwvanpelt.nl/dnssec/?rr=257&a=all&ds=all&doe=on&ta=.&tk=

rickjanssen · July 19, 2017, 6:27pm

@jsha We do not use anycast, but I can explain a little further what happens now:

nsX had no questions about sub.domain.nl -> cache is clear
nsX receives a question about CAA of sub.domain.nl -> cache is filled with TTL of 300 seconds -> request verified
nsX receives after 45 seconds another call for CAA of sub.domain.nl -> cache is used -> request verified
nsX receives after 310 seconds another call for CAA of sub.domain.nl -> problem occurs -> request failed

Since round-robin selection among the DNS servers switches between ns1, ns2 and ns3, the seemingly random answers started to confuse us. We blocked ns1 and ns2 with iptables on a random Ubuntu 16.04 server running Unbound 1.5.8-1ubuntu1 and forced forwarding specifically to ns3.zxcs.nl. I then looped a CAA query against Unbound at 127.0.0.1 every second: NOERROR stayed for the length of the cache, followed by SERVFAIL when the cache expired. After restarting PowerDNS, NOERROR came back for exactly 301 seconds; lowering the query cache TTL to 60 made it 61 seconds.
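For anyone who wants to repeat this kind of test, a once-per-second polling loop could look roughly like the Python sketch below. It is not the script used here. It assumes a resolver listening on 127.0.0.1 port 53, uses only the standard library, and all function names (`build_query`, `rcode_of`, `query_rcode`, `watch`) are made up for the example. It sends a plain CAA query (type 257) and reads the RCODE from the low four bits of the flags field in the response header.

```python
import socket
import struct
import time

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}
TYPE_CAA = 257


def build_query(name: str, qtype: int = TYPE_CAA, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query: 12-byte header plus one question."""
    # Flags 0x0100 sets only RD (recursion desired); QDCOUNT is 1.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = [label for label in name.rstrip(".").split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def rcode_of(response: bytes) -> str:
    """Return the RCODE name from the low 4 bits of the flags field."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def query_rcode(name: str, server: str = "127.0.0.1", port: int = 53,
                timeout: float = 3.0) -> str:
    """Send one CAA query over UDP and return the RCODE of the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, port))
        data, _ = sock.recvfrom(4096)
    return rcode_of(data)


def watch(name: str, interval: float = 1.0, rounds: int = 600) -> None:
    """Poll once per interval and print a line whenever the RCODE changes."""
    last = None
    start = time.monotonic()
    for _ in range(rounds):
        try:
            current = query_rcode(name)
        except OSError as exc:  # includes timeouts
            current = f"ERROR({exc})"
        if current != last:
            print(f"{time.monotonic() - start:7.1f}s  {current}")
            last = current
        time.sleep(interval)
```

Calling `watch("pop.gwvanpelt.nl")` prints a timestamped line each time the answer changes, which makes a NOERROR to SERVFAIL flip at cache expiry easy to spot.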

We are running a web hosting platform with thousands of domains, and probably 90% of the SSL-enabled websites use Let’s Encrypt as their certificate provider. The customers request their SSL certificates themselves through the script in the web panel. However (cc @weppos), the problem is not with Let’s Encrypt, but with the Unbound DNS recursor or our PowerDNS setup.

I know about the 30 days, we still have 24 left or so, but this is about new domains too. We want them to be able to request the certificates and renew them. I’m working continuously to find a fix for our setup. We are about to start building a new nameserver with the latest version to see if this behavior will not occur on that authoritative nameserver.

Is it possible to whitelist certain IP addresses from which the requests come? That will be 185.104.29.0/24

@roland Thank you, I will look into this.

@WinstonSmith that is the problem: http://dnsviz.net/d/pop.gwvanpelt.nl/dnssec/?rr=257&a=all&ds=all&ta=.&tk=

cpu · July 19, 2017, 6:33pm

I replied there. Since I'm confident this is a distinct root cause from the one affecting @rickjanssen I'm going to split your posts into a separate thread to keep this one clear. Thanks!
