PowerDNS: Can't find why CAA servfails

@cpu is there anything I can do? With my unbound setup, the requests do not fail, they give NOERROR

I can't offer any advice at present. I can share a sanitized copy of the config from the server I'm testing against if you want to try and see if you can get your Unbound to match.

EDIT: here it is - note that I removed some access-control lines and I wouldn't recommend running this as-is since I suspect it will be an open resolver.

I'm going to try increasing the verbosity level from 3 to 4 to get algorithm data. Looking at validator/val_nsec3.c's nsec3_prove_nodata function I believe this will give us more information.

Which version are you running?

@weppos Thank you. I only get NXDOMAIN when I query a domain that really does not exist.

1.5.8-1ubuntu1 from apt on Ubuntu 16.04

I tried this and the snipped results do give more information that I can match up to the if(!has_valid_nsec) branch of the validate_nodata_response() function from ~/unbound-1.5.8/validator/validator.c. The extra information from the qchase and chase_reply don’t give me any immediate insights :-/

I really can’t figure out what is causing this to happen. I have reproduced the SERVFAIL now and hope to figure out what the problem is soon. Is it possible for you to “whitelist” us (skip the CAA check) until we figure this out? Lots of customers and lots of domains are failing to create and renew certificates right now…

Edit: seems to be better now, not sure.

Edit2: Nope, servfail is back.

Edit3: It seems like the query gives “NOERROR” after a restart of PowerDNS and keeps giving NOERROR until the query cache times out.

@rickjanssen I’m still curious about whether your nameservers are anycast. If so, that could potentially explain random differences, if queries are routed to different servers. Also, sounds like you’re running PowerDNS. What version? I assume you’re running the same version on all nameservers?

Also, in what role are you experiencing failures? Are you acting as a hosting provider and doing issuances yourself that are failing? Or are you acting as a nameserver and getting customer reports that their renewals are failing? Our ideal recommendation is to handle renewal automatically 30 days in advance, and automatically retry errors. Do you have customers whose certificates are on the cusp of expiring? We have a temporary whitelisting mechanism, but it’s based on domain names rather than nameservers (because Unbound doesn’t make that information available to Boulder). We are currently using a list of domains that were showing SERVFAIL consistently, but if a domain was returning NOERROR some of the time it wouldn’t have made it onto the list.

I use a Mac, so I’ll need to give it a shot and install Unbound. Not sure how feasible that is.

I can also provide some logs & pcaps if you share an affected domain name the same way I did for @rickjanssen.

I sent you the error and the names via email.

One thing that stands out to me is that in the snippet that @cpu posted, your nameserver returned an unexpected NSEC record that doesn’t prove the NODATA response: mail.gwvanpelt.nl. 86400 IN NSEC pop.gwvanpelt.nl. A AAAA RRSIG NSEC. That record would authenticate a NODATA response for mail.gwvanpelt.nl. but not one for pop.gwvanpelt.nl., which is why that specific query failed.
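
To make the NSEC point concrete, here is a rough Python sketch of the check a validator applies. It is a simplification and not Unbound's actual code: the Nsec class and nsec_proves_nodata function are names I made up for illustration, and wildcard and empty-non-terminal proofs are deliberately left out.

```python
from dataclasses import dataclass


def canonical(name: str) -> str:
    """Lower-case a domain name and make sure it ends with a dot."""
    name = name.lower()
    return name if name.endswith(".") else name + "."


@dataclass(frozen=True)
class Nsec:
    # Hypothetical container for the fields of one NSEC record.
    owner: str          # owner name of the NSEC record
    next_name: str      # next owner name in the zone's canonical order
    types: frozenset    # RR types that exist at the owner name


def nsec_proves_nodata(nsec: Nsec, qname: str, qtype: str) -> bool:
    """Return True if this NSEC shows that qname exists but has no qtype RRset.

    Simplified: only the direct "owner name equals query name" case is handled.
    """
    if canonical(nsec.owner) != canonical(qname):
        return False  # the record describes a different name
    present = {t.upper() for t in nsec.types}
    # The queried type must be absent; a CNAME at the name would change the answer.
    return qtype.upper() not in present and "CNAME" not in present


# The record from the snippet discussed above.
record = Nsec(
    owner="mail.gwvanpelt.nl.",
    next_name="pop.gwvanpelt.nl.",
    types=frozenset({"A", "AAAA", "RRSIG", "NSEC"}),
)
```

With that record, nsec_proves_nodata(record, "mail.gwvanpelt.nl.", "CAA") is True, while the same call for "pop.gwvanpelt.nl." is False, which matches the failure seen for the pop query.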

That said, I’m unable to replicate this: whenever I query your servers I get valid NSEC3 records back, which makes me think there is something funky going on behind the scenes. As far as I understand it, PowerDNS has to be explicitly configured to serve either NSEC or NSEC3, so is it possible you have a mix of servers and one of them is acting up?

This any help?

http://dnsviz.net/d/pop.gwvanpelt.nl/dnssec/?rr=257&a=all&ds=all&doe=on&ta=.&tk=

@jsha We do not use anycast, but I can explain a little further what happens now:

nsX had no questions about sub.domain.nl -> cache is clear
nsX receives a question about CAA of sub.domain.nl -> cache is filled with TTL of 300 seconds -> request verified
nsX receives another query for CAA of sub.domain.nl after 45 seconds -> cache is used -> request verified
nsX receives another query for CAA of sub.domain.nl after 310 seconds -> problem occurs -> request failed

Since round-robin selection among the nameservers alternates between ns1, ns2 and ns3, the seemingly random answers started to confuse us. We blocked ns1 and ns2 with iptables on a random Ubuntu 16.04 server running Unbound 1.5.8-1ubuntu1 and forced forwarding specifically to ns3.zxcs.nl. I then queried Unbound at 127.0.0.1 for CAA records in a loop, once per second: NOERROR persisted for the lifetime of the cache, followed by SERVFAIL once the cache expired. After restarting PowerDNS, NOERROR came back for exactly 301 seconds; lowering the query cache TTL to 60 made it 61 seconds.
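
For anyone who wants to repeat this kind of measurement, below is a minimal sketch of such a polling loop in Python using only the standard library. It is not the script we actually ran; all function names are made up for illustration, and it assumes a resolver listening on 127.0.0.1 port 53 over UDP.

```python
import socket
import struct
import time

CAA = 257  # RR type number for CAA
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def build_query(name: str, qtype: int = CAA, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet with the RD (recursion desired) bit set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = [label for label in name.rstrip(".").split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def rcode_of(response: bytes) -> str:
    """Extract the RCODE from the low four bits of the header flags."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def ask(name: str, server: str = "127.0.0.1", timeout: float = 3.0) -> str:
    """Send one CAA query over UDP and return the response code as text."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        data, _ = sock.recvfrom(4096)
    return rcode_of(data)


def poll(name: str, interval: float = 1.0, duration: float = 600.0,
         server: str = "127.0.0.1"):
    """Query repeatedly and return a list of (elapsed_seconds, rcode) samples."""
    samples = []
    start = time.monotonic()
    while (elapsed := time.monotonic() - start) < duration:
        try:
            rcode = ask(name, server)
        except OSError:  # includes timeouts and refused connections
            rcode = "ERROR"
        samples.append((round(elapsed), rcode))
        time.sleep(interval)
    return samples


def first_noerror_streak(samples) -> int:
    """Seconds between the first and last sample of the first NOERROR run."""
    start = last = None
    for seconds, rcode in samples:
        if rcode == "NOERROR":
            if start is None:
                start = seconds
            last = seconds
        elif start is not None:
            break
    return 0 if start is None else last - start
```

Calling first_noerror_streak(poll("sub.domain.nl")) right after a PowerDNS restart should give a number close to the query cache TTL if the behavior described above is present.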

We are running a web hosting platform with thousands of domains, and probably 90% of the SSL-enabled websites use Let’s Encrypt as their certificate provider. Customers request their SSL certificates themselves through a script in the web panel. However (cc @weppos), the problem is not with Let’s Encrypt but with either the Unbound DNS recursor or our PowerDNS setup.

I know about the 30 days; we still have 24 or so left, but this is about new domains too. We want customers to be able to request certificates and renew them. I’m working continuously to find a fix for our setup. We are about to start building a new nameserver with the latest version to see whether this behavior still occurs on that authoritative nameserver.

Is it possible to whitelist certain IP addresses from which the requests come? That will be 185.104.29.0/24

@roland Thank you, will look into this.

@WinstonSmith that is the problem: http://dnsviz.net/d/pop.gwvanpelt.nl/dnssec/?rr=257&a=all&ds=all&ta=.&tk=

I replied there. Since I’m confident this is a distinct root cause from the one affecting @rickjanssen I’m going to split your posts into a separate thread to keep this one clear. Thanks!

I’ve made a pass to split off the folks that piled onto this thread so we can help individually. The root cause differs case-to-case. To help avoid more pile-ons I updated the title of this thread to mention PowerDNS, since presently it seems the root cause in this case will be related to that DNS server.

I made a tool that makes it easier to make queries against a DNSSEC-validating Unbound instance and see the debug logs: https://unboundtest.com/. Hopefully it’s helpful. @rickjanssen based on your comments about how you reproduced more reliably, I tried blocking ns1.zxcs.nl and ns2.zxcs.nl in iptables on that machine, and querying CAA pop.gwvanpelt.nl every 5 seconds for 10 minutes. I never saw one SERVFAIL unfortunately. Was that the domain you were able to reproduce with, or was there another?

unboundtest # iptables --list OUTPUT --line-numbers -v
Chain OUTPUT (policy ACCEPT 53 packets, 70499 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1       18  1302 REJECT     all  --  any    any     anywhere             ns1.zxcs.nl          reject-with icmp-port-unreachable
2     627K   53M REJECT     all  --  any    any     anywhere             ns2.zxcs.nl          reject-with icmp-port-unreachable

Also, one thing we noticed when talking with @weppos separately was that there appears to be a bug either in DNSimple’s name server or potentially in Unbound specifically with the combination of DNSSEC-signed zones, DNS 0x20 (which we use), and empty responses. We found that DNSSEC-signed responses that were non-empty worked fine, and disabling DNS 0x20 on the test instance fixed the empty responses (note: we’re not planning to disable DNS 0x20 in prod since that would reduce security).
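
For readers unfamiliar with DNS 0x20: the resolver randomizes the letter case of the query name and expects the answer to echo it exactly, which makes spoofed replies harder to forge. The Python sketch below only illustrates that idea. The function names are made up, and Unbound's real handling is more involved than a single string comparison.

```python
import random


def encode_0x20(name: str, rng: random.Random) -> str:
    """Randomly flip the case of each letter in the query name."""
    return "".join(
        (c.upper() if rng.random() < 0.5 else c.lower()) if c.isalpha() else c
        for c in name
    )


def echo_matches(sent: str, echoed: str) -> bool:
    """Accept a reply only if the question name comes back with identical case."""
    return sent == echoed


rng = random.Random()  # a real resolver would use a cryptographically strong source
sent_name = encode_0x20("pop.gwvanpelt.nl.", rng)
```

A server that lower-cases the question section before replying would fail echo_matches for almost every randomized name, even though the names are equal under normal case-insensitive DNS comparison.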

I’m pretty sure you’re not experiencing the exact same issue (for one thing, you are using different software), but there may be a similar confluence of confounding factors that includes caching. Do you find that all the domains that are having problems are DNSSEC-signed? Are you able to reproduce the same problem for TXT records? If you add CAA records to a domain that reproduces the problem, does the problem go away?

Is it possible to whitelist certain IP addresses from which the requests come? That will be 185.104.29.0/24

Unfortunately this isn’t possible with our software.

I’m still figuring out what causes this to happen.

Indeed, we found that adding a CAA record works around the problem, but we can’t add it for everyone. We plan on automatically adding the record when requesting a Let’s Encrypt certificate.

So far only CAA shows this. A works, and I haven’t tested TXT yet.

This is weird: I’m now unable to reproduce the SERVFAIL responses either, although nothing changed.

A post was merged into an existing topic: Help diagnosing CAA failures ns1.cyso.nl

Whoops, posted on the wrong thread. Moving that post to the right thread.

The reason I suggest TXT is that for most domains it will be an empty response, while the response for A is non-empty. It seems like there are potentially issues specifically around empty responses.

Will check on that, but for now, even CAA stopped sending SERVFAILs. Might be because of the low traffic at this moment.

Edit: it’s back, going to test some more after some sleep.

For what domain is it back? I don't see SERVFAILs for pop.gwvanpelt.nl right now.