As a quick note: In the future it would be easier to keep this sort of detail straight if you created a separate thread. CAA SERVFAILs are a DNS-provider-specific problem. If you don't have the same DNS provider as the original poster, I think it would be best to use a separate thread.
My mistake, you are the original poster as you point out. Apologies. This advice should be for @tgx.
Thank you @jsha, but all responses say "0000" (NOERROR). Do you have any idea?
Edit: Can you send over the result of resolving a DNSSEC-enabled NL domain, like overheid.nl? And is this pcap the same as the log you gave? Was the pcap made when you got the SERVFAIL, or was it made afterwards?
Edit2: Can you send over a debug-level log? Or your configuration, so I don't have to keep bothering you with this.
So it seems to me like a DNSSEC validation error, but why would it fail? According to every DNSSEC test tool I try, DNSSEC is correct. I've set up an Unbound recursor with a default config, but it gives NOERROR…
Agreed. It does seem to boil down to an NSEC failure on the NODATA response. I'm also personally unsure why that's happening in this case. I think we need to do some more digging.
I can't offer any advice at present. I can share a sanitized copy of the config from the server I'm testing against if you want to try and see if you can get your Unbound to match.
EDIT: here it is - note that I removed some access-control lines and I wouldn't recommend running this as-is since I suspect it will be an open resolver.
I'm going to try increasing the verbosity level from 3 to 4 to get algorithm data. Looking at validator/val_nsec3.c's nsec3_prove_nodata function I believe this will give us more information.
I tried this and the snipped results do give more information that I can match up to the if(!has_valid_nsec) branch of the validate_nodata_response() function from ~/unbound-1.5.8/validator/validator.c. The extra information from the qchase and chase_reply don't give me any immediate insights :-/
I really can't figure out what is causing this to happen. I have reproduced the SERVFAIL now and hope to figure out what the problem is soon. Is it possible for you to "whitelist" us (skip the CAA check) until we figure this out? Lots of customers and lots of domains are failing to create and renew certificates right now…
Edit: seems to be better now, not sure.
Edit2: Nope, servfail is back.
Edit3: It seems like the query gives "NOERROR" after a restart of PowerDNS and keeps giving NOERROR until the query cache times out.
@rickjanssen I'm still curious about whether your nameservers are anycast. If so, that could potentially explain random differences, if queries are routed to different servers. Also, sounds like you're running PowerDNS. What version? I assume you're running the same version on all nameservers?
Also, in what role are you experiencing failures? Are you acting as a hosting provider and doing issuances yourself that are failing? Or are you acting as a nameserver and getting customer reports that their renewals are failing? Our ideal recommendation is to handle renewal automatically 30 days in advance, and automatically retry errors. Do you have customers whose certificates are on the cusp of expiring? We have a temporary whitelisting mechanism, but it's based on domain names rather than nameservers (because Unbound doesn't make that information available to Boulder). We are currently using a list of domains that were showing SERVFAIL consistently, but if a domain was returning NOERROR some of the time it wouldn't have made it onto the list.
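To make the "renew 30 days ahead and retry on errors" recommendation a bit more concrete, here is a rough sketch. Everything in it is an assumption for illustration: `issue` is a hypothetical callable standing in for whatever your ACME client or panel script does, and the retry count and delays are made-up numbers, not anything Let's Encrypt prescribes.

```python
import time
from datetime import datetime, timedelta, timezone

RENEW_BEFORE = timedelta(days=30)  # renew once 30 days or less remain


def needs_renewal(not_after: datetime, now: datetime) -> bool:
    """Return True when the certificate expires within the renewal window."""
    return not_after - now <= RENEW_BEFORE


def renew_with_retry(issue, attempts=5, base_delay=60.0, sleep=time.sleep) -> bool:
    """Call the hypothetical issue() until it succeeds, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            issue()
            return True
        except Exception as exc:  # a real client would catch narrower errors
            print(f"attempt {attempt + 1} failed: {exc}")
            if attempt + 1 < attempts:
                sleep(base_delay * 2 ** attempt)
    return False


now = datetime(2017, 9, 20, tzinfo=timezone.utc)
print(needs_renewal(datetime(2017, 10, 14, tzinfo=timezone.utc), now))  # True (24 days left)
```

With something like this running daily, a transient SERVFAIL costs one failed attempt rather than an expired certificate.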
One thing that stands out to me is that in the snippet that @cpu posted, your nameserver returned an unexpected NSEC record which doesn't prove the NODATA response: mail.gwvanpelt.nl. 86400 IN NSEC pop.gwvanpelt.nl. A AAAA RRSIG NSEC. That record would authenticate a NODATA response for mail.gwvanpelt.nl. but not one for pop.gwvanpelt.nl., which is why that specific query failed.
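For anyone following along, the rule being applied to that NSEC record can be sketched roughly as below. This is a simplified illustration, not Unbound's actual validator logic: it assumes the RRSIG has already been verified and it ignores wildcards and other corner cases. The function names are made up for the example.

```python
def canonical(name: str) -> str:
    """Lower-case a domain name and make sure it ends with a dot."""
    name = name.strip().lower()
    return name if name.endswith(".") else name + "."


def nsec_proves_nodata(nsec_owner: str, nsec_types: set[str],
                       qname: str, qtype: str) -> bool:
    """Simplified check: does this NSEC record prove NODATA for qname/qtype?

    Assumption: signatures are already verified and wildcards are ignored.
    """
    # The NSEC owner must be exactly the name that was asked about.
    if canonical(nsec_owner) != canonical(qname):
        return False
    # The queried type must be absent from the type bitmap, and so must
    # CNAME (otherwise the answer should have been a CNAME, not NODATA).
    present = {t.upper() for t in nsec_types}
    return qtype.upper() not in present and "CNAME" not in present


record_owner = "mail.gwvanpelt.nl."
record_types = {"A", "AAAA", "RRSIG", "NSEC"}

print(nsec_proves_nodata(record_owner, record_types, "mail.gwvanpelt.nl.", "CAA"))  # True
print(nsec_proves_nodata(record_owner, record_types, "pop.gwvanpelt.nl.", "CAA"))   # False
```

The second call is the situation from the log: the record is perfectly valid, it just belongs to a different name than the one that was queried.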
That said, I'm unable to replicate this, as whenever I query your servers I instead get valid NSEC3 records back, which makes me think there is something funky going on behind the scenes. As far as I understand it, PowerDNS has to be explicitly configured to serve either NSEC or NSEC3, so is it possible you have a mix of servers and one is acting up or something?
@jsha We do not use anycast, but I can explain a little further what happens now:
nsX had no questions about sub.domain.nl -> cache is clear
nsX receives a question about CAA of sub.domain.nl -> cache is filled with TTL of 300 seconds -> request verified
nsX receives after 45 seconds another call for CAA of sub.domain.nl -> cache is used -> request verified
nsX receives after 310 seconds another call for CAA of sub.domain.nl -> problem occurs -> request failed
Since the round-robin selection among the nameservers alternates between ns1, ns2 and ns3, the seemingly random answers started to confuse us. We blocked ns1 and ns2 in iptables on a random Ubuntu 16.04 server with unbound 1.5.8-1ubuntu1 and configured forced forwarding specifically to ns3.zxcs.nl. I then requested CAA records from unbound at 127.0.0.1 in a loop, once every second: NOERROR stayed for the lifetime of the cache entry, followed by SERVFAIL when the cache expired. After restarting PowerDNS, NOERROR came back for exactly 301 seconds; lowering the query cache TTL to 60 made it 61 seconds.
We are running a web hosting platform with thousands of domains, and probably 90% of the SSL-enabled websites on it use Let's Encrypt as certificate provider. The customers request their SSL certificates themselves through the script in the web panel. However (cc @weppos), the problem is not with Let's Encrypt, but with the unbound DNS recursor or our PowerDNS setup.
I know about the 30 days; we still have 24 or so left, but this is about new domains too. We want them to be able to request certificates and renew them. I'm working continuously to find a fix for our setup. We are about to start building a new nameserver with the latest version to see whether this behavior also occurs on that authoritative nameserver.
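In case someone wants to repeat the once-per-second test, a loop like the one described could look roughly like this. It is a sketch using only the Python standard library, not the script that was actually used: the function names, the fixed query ID and the 3-second timeout are all assumptions, and it only reads the RCODE from the response header without parsing any records.

```python
import socket
import struct
import time

CAA_TYPE = 257  # RR type number for CAA
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, qtype: int = CAA_TYPE, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet with the RD (recursion desired) bit set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = [label for label in name.rstrip(".").split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def rcode_name(response: bytes) -> str:
    """Extract the 4-bit RCODE from the flags word of a DNS response."""
    flags = struct.unpack("!H", response[2:4])[0]
    rcode = flags & 0x000F
    return RCODES.get(rcode, f"RCODE{rcode}")


def poll(name: str, resolver: str = "127.0.0.1", rounds: int = 5, interval: float = 1.0) -> None:
    """Query the resolver repeatedly and print a timestamped RCODE each time."""
    query = build_query(name)
    for _ in range(rounds):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(3.0)
            try:
                sock.sendto(query, (resolver, 53))
                response, _addr = sock.recvfrom(4096)
                status = rcode_name(response)
            except OSError as exc:  # includes timeouts and refused connections
                status = f"error: {exc}"
        print(time.strftime("%H:%M:%S"), name, status)
        time.sleep(interval)
```

Calling `poll("sub.domain.nl", rounds=400)` against the local unbound should then show the point at which NOERROR flips to SERVFAIL, which can be compared with the PowerDNS query cache TTL.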
Is it possible to whitelist certain IP addresses from which the requests come? That will be 185.104.29.0/24
I replied there. Since I'm confident this is a distinct root cause from the one affecting @rickjanssen I'm going to split your posts into a separate thread to keep this one clear. Thanks!