PowerDNS: Can't find why CAA servfails

As a quick note: in the future it would be easier to keep this sort of detail straight if you created a separate thread. CAA SERVFAILs are a DNS-provider-specific problem, so if you don't have the same DNS provider as the original poster, I think it would be best to use a separate thread.

My mistake, you are the original poster as you point out. Apologies. This advice should be for @tgx.

@cpu I am the original poster…

I apologize. You’re correct. Now there’s two confused staff in here :blush:. I’ve updated my comment above for @tgx.

Thank you @jsha, but all responses say “0000” (NOERROR). Do you have any idea?

Edit: Can you send over a resolve of a DNSSEC-enabled NL domain, like overheid.nl? And is this pcap the same as the log you gave? Was the pcap made when you got the SERVFAIL, or was it made after?

Edit2: Can you send over a debug-level log? Or your configuration, so I don’t have to bother you with this. :slight_smile:

4 posts were split to a new topic: DNSimple CAA SERVFAIL

Here's a log with Unbound verbosity=3. The previous log was at 2.

I suspect so, but just in case here is another base64 encoded pcap file taken at the exact same time as the verbosity 3 log above.

Hi @cpu

So it seems to me like a DNSSEC validation error, but why would it fail? According to every DNSSEC test tool I try, DNSSEC is correct. I’ve set up an Unbound recursor with a default config, but it gives NOERROR…

Agreed. It does seem to boil down to an NSEC failure on the NODATA response. I'm also personally unsure why that's happening in this case. I think we need to do some more digging.

@cpu is there anything I can do? With my unbound setup, the requests do not fail, they give NOERROR

I can't offer any advice at present. I can share a sanitized copy of the config from the server I'm testing against if you want to try and see if you can get your Unbound to match.

EDIT: here it is - note that I removed some access-control lines and I wouldn't recommend running this as-is since I suspect it will be an open resolver.

I'm going to try increasing the verbosity level from 3 to 4 to get algorithm data. Looking at validator/val_nsec3.c's nsec3_prove_nodata function I believe this will give us more information.

Which version are you running?

@weppos Thank you. I only get NXDOMAIN when I ask for a domain that really doesn’t exist.

1.5.8-1ubuntu1 from apt on Ubuntu 16.04

I tried this and the snipped results do give more information that I can match up to the if(!has_valid_nsec) branch of the validate_nodata_response() function from ~/unbound-1.5.8/validator/validator.c. The extra information from the qchase and chase_reply don’t give me any immediate insights :-/

I really can’t figure out what is causing this to happen. I have reproduced the SERVFAIL now and hope to figure out what the problem is soon. Is it possible for you to “whitelist” (skip the CAA check for) us until we figure this out? Lots of customers and lots of domains are failing to create and renew right now…

Edit: seems to be better now, not sure.

Edit2: Nope, servfail is back.

Edit3: It seems like the query gives “NOERROR” after a restart of PowerDNS and keeps giving NOERROR until the query cache times out.

@rickjanssen I’m still curious about whether your nameservers are anycast. If so, that could potentially explain random differences, if queries are routed to different servers. Also, sounds like you’re running PowerDNS. What version? I assume you’re running the same version on all nameservers?

Also, in what role are you experiencing failures? Are you acting as a hosting provider and doing issuances yourself that are failing? Or are you acting as a nameserver and getting customer reports that their renewals are failing? Our ideal recommendation is to handle renewal automatically 30 days in advance, and automatically retry errors. Do you have customers whose certificates are on the cusp of expiring? We have a temporary whitelisting mechanism, but it’s based on domain names rather than nameservers (because Unbound doesn’t make that information available to Boulder). We are currently using a list of domains that were showing SERVFAIL consistently, but if a domain was returning NOERROR some of the time it wouldn’t have made it onto the list.

I use a Mac, so I’ll need to give it a shot and install Unbound. Not sure how feasible it is.

I can also provide some logs & pcaps if you share an affected domain name the same way I did for @rickjanssen.

I sent you the error and the names via email.

One thing that stands out to me is that, in the snippet that @cpu posted, your nameserver returned an unexpected NSEC record that doesn’t prove the NODATA response, which is why that specific query failed. The record was mail.gwvanpelt.nl. 86400 IN NSEC pop.gwvanpelt.nl. A AAAA RRSIG NSEC, which would authenticate a NODATA response for mail.gwvanpelt.nl. but not one for pop.gwvanpelt.nl.

That said, I’m unable to replicate this: whenever I query your servers I get valid NSEC3 records back instead, which makes me think there is something funky going on behind the scenes. As far as I understand it, PowerDNS has to be explicitly configured to serve either NSEC or NSEC3, so is it possible you have a mix of servers and one is acting up?
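
To make the NSEC point concrete, here is a minimal Python sketch of the basic check a validator applies to a NODATA answer. It is a simplification and not Unbound's actual code: it ignores wildcards, NSEC3 hashing, and signature verification, and the `Nsec` class and `nsec_proves_nodata` function are names made up for this illustration.

```python
from dataclasses import dataclass


def _canon(name: str) -> str:
    """Lower-case a DNS name and make sure it ends with a dot."""
    name = name.lower()
    return name if name.endswith(".") else name + "."


@dataclass(frozen=True)
class Nsec:
    """Hypothetical, stripped-down view of an NSEC record."""
    owner: str          # name the record is attached to
    next_name: str      # next name in the zone's canonical order
    types: frozenset    # record types that exist at the owner name


def nsec_proves_nodata(nsec: Nsec, qname: str, qtype: str) -> bool:
    """Return True if this NSEC shows qname exists but has no qtype records.

    Simplified: the owner must match the query name exactly, and the type
    bitmap must list neither the queried type nor CNAME.
    """
    if _canon(nsec.owner) != _canon(qname):
        return False
    types = {t.upper() for t in nsec.types}
    return qtype.upper() not in types and "CNAME" not in types


# The record quoted from the snippet above.
record = Nsec(
    owner="mail.gwvanpelt.nl.",
    next_name="pop.gwvanpelt.nl.",
    types=frozenset({"A", "AAAA", "RRSIG", "NSEC"}),
)
```

With that record, `nsec_proves_nodata(record, "mail.gwvanpelt.nl.", "CAA")` holds, while the same call for `pop.gwvanpelt.nl.` does not, because the owner name does not match the query name.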

This any help?

http://dnsviz.net/d/pop.gwvanpelt.nl/dnssec/?rr=257&a=all&ds=all&doe=on&ta=.&tk=

@jsha We do not use anycast, but I can explain a little further what happens now:

nsX had no questions about sub.domain.nl -> cache is clear
nsX receives a question about CAA of sub.domain.nl -> cache is filled with TTL of 300 seconds -> request verified
nsX receives after 45 seconds another call for CAA of sub.domain.nl -> cache is used -> request verified
nsX receives after 310 seconds another call for CAA of sub.domain.nl -> problem occurs -> request failed

Since DNS round robin rotates between ns1, ns2 and ns3, the seemingly random answers started to confuse us. We blocked ns1 and ns2 in iptables on a random Ubuntu 16.04 server with Unbound 1.5.8-1ubuntu1 and forced forwarding specifically to ns3.zxcs.nl. I then looped, requesting CAA records from Unbound at 127.0.0.1 every second: NOERROR stayed for the length of the cache, followed by SERVFAIL when the cache expired. After restarting PowerDNS, NOERROR came back for exactly 301 seconds; lowering the query cache TTL to 60 made it 61 seconds.
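
For anyone who wants to repeat that loop, something along these lines should work. It is a rough sketch, not the exact script we ran: it assumes `dig` is installed and Unbound is listening on 127.0.0.1, and the function names are made up for this example.

```python
import re
import subprocess
import time

# dig prints the response code in its header comment, e.g. "status: SERVFAIL".
STATUS_RE = re.compile(r"status:\s*([A-Z]+)")


def parse_rcode(dig_output):
    """Pull the response code out of dig's header, or None if there is none."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else None


def query_caa(name, resolver="127.0.0.1"):
    """Ask the resolver for the CAA records of name and return the rcode."""
    result = subprocess.run(
        ["dig", f"@{resolver}", name, "CAA", "+noall", "+comments"],
        capture_output=True, text=True, check=False,
    )
    return parse_rcode(result.stdout)


def transitions(samples):
    """Reduce (seconds, rcode) samples to the points where the rcode changed."""
    changes = []
    previous = object()  # sentinel so the first sample is always recorded
    for elapsed, rcode in samples:
        if rcode != previous:
            changes.append((elapsed, rcode))
            previous = rcode
    return changes


def watch(name, duration=400, interval=1.0):
    """Query once per interval for duration seconds and report rcode changes."""
    start = time.monotonic()
    samples = []
    while (elapsed := time.monotonic() - start) < duration:
        samples.append((round(elapsed), query_caa(name)))
        time.sleep(interval)
    return transitions(samples)
```

Calling `watch("sub.domain.nl")` right after a PowerDNS restart should return the moment the answer flips, which can then be compared with the query cache TTL.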

We are running a web hosting platform with thousands of domains, and probably 90% of the SSL-enabled websites use Let’s Encrypt as certificate provider. The customers request their SSL certificates themselves through the script in the web panel. However (cc @weppos), the problem is not with Let’s Encrypt, but with the Unbound DNS recursor or our PowerDNS setup.

I know about the 30 days; we still have 24 left or so, but this is about new domains too. We want customers to be able to request certificates and renew them. I’m working continuously to find a fix for our setup. We are about to start building a new nameserver with the latest version to see whether this behavior also occurs on that authoritative nameserver.

Is it possible to whitelist certain IP addresses from which the requests come? That would be 185.104.29.0/24.

@roland Thank you, will look into this.

@WinstonSmith that is the problem: http://dnsviz.net/d/pop.gwvanpelt.nl/dnssec/?rr=257&a=all&ds=all&ta=.&tk=

I replied there. Since I’m confident this is a distinct root cause from the one affecting @rickjanssen I’m going to split your posts into a separate thread to keep this one clear. Thanks!
