I’ve made a pass to split off the folks who piled onto this thread so we can help each of them individually, since the root cause differs from case to case. To help avoid more pile-ons, I updated the title of this thread to mention PowerDNS, since at present the root cause in this case appears to be related to that DNS server.
I made a tool that makes it easier to run queries against a DNSSEC-validating Unbound instance and see the debug logs: https://unboundtest.com/. Hopefully it's helpful.
@rickjanssen, based on your comments about how you reproduced more reliably, I tried blocking ns1.zxcs.nl and ns2.zxcs.nl in iptables on that machine, and querying CAA for pop.gwvanpelt.nl every 5 seconds for 10 minutes. Unfortunately, I never saw a single SERVFAIL. Was that the domain you were able to reproduce with, or was there another?
unboundtest # iptables --list OUTPUT --line-numbers -v
Chain OUTPUT (policy ACCEPT 53 packets, 70499 bytes)
num pkts bytes target prot opt in out source destination
1 18 1302 REJECT all -- any any anywhere ns1.zxcs.nl reject-with icmp-port-unreachable
2 627K 53M REJECT all -- any any anywhere ns2.zxcs.nl reject-with icmp-port-unreachable
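For anyone who wants to repeat the polling described above, here is a minimal sketch. The helper names `extract_status` and `poll_status` are made up for illustration, and the resolver address is whatever validating resolver you are testing against. It is not the exact script used on the test machine.

```shell
#!/bin/sh
# Extract the DNS status (NOERROR, SERVFAIL, ...) from dig output on stdin.
extract_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}

# Hypothetical helper: query NAME for RTYPE at RESOLVER, COUNT times,
# sleeping INTERVAL seconds between queries, then print a tally per status.
# Queries that time out produce no status line and are not counted.
poll_status() {
    name=$1; rtype=$2; resolver=$3; count=$4; interval=$5
    i=0
    while [ "$i" -lt "$count" ]; do
        dig "$name" "$rtype" "@$resolver" | extract_status
        i=$((i + 1))
        sleep "$interval"
    done | sort | uniq -c
}
```

Running `poll_status pop.gwvanpelt.nl CAA 127.0.0.1 120 5` would approximate "every 5 seconds for 10 minutes" and print how many responses fell into each status.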
Also, one thing we noticed when talking with @weppos separately was that there appears to be a bug either in DNSimple's name server or potentially in Unbound specifically with the combination of DNSSEC-signed zones, DNS 0x20 (which we use), and empty responses. We found that DNSSEC-signed responses that were non-empty worked fine, and disabling DNS 0x20 on the test instance fixed the empty responses (note: we're not planning to disable DNS 0x20 in prod since that would reduce security).
I'm pretty sure you're not experiencing the exact same issue (for one thing, you are using different software), but there may be a similar confluence of confounding factors that includes caching. Do you find that all the domains that are having problems are DNSSEC-signed? Are you able to reproduce the same problem for TXT records? If you add CAA records to a domain that reproduces the problem, does the problem go away?
Is it possible to whitelist certain IP addresses from which the requests come? That would be 185.104.29.0/24.
Unfortunately this isn't possible with our software.
I’m still figuring out what causes this to happen.
Indeed, we found that adding a CAA record works around the problem, but we can’t add it for everyone. We plan on automatically adding the record when requesting a Let’s Encrypt certificate.
So far only CAA shows the problem: A works, and I haven’t tested TXT yet.
This is weird: as of now I can no longer reproduce the SERVFAIL responses either, even though nothing changed.
The reason I suggest TXT is that for most domains it will be an empty response, while the response for A is non-empty. It seems like there are potentially issues specifically around empty responses.
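One rough way to check the empty-versus-non-empty theory is to compare the status and answer count for several record types on the same name. This is only a sketch: `answer_count` and `compare_types` are hypothetical helper names, and it assumes dig's usual header format.

```shell
#!/bin/sh
# Extract the ANSWER count from dig's header flags line on stdin.
answer_count() {
    sed -n 's/.*ANSWER: \([0-9]*\),.*/\1/p'
}

# Hypothetical helper: for A, TXT and CAA, print the record type, the
# response status and the number of answers, so that empty responses
# (answers=0) can be compared against non-empty ones.
compare_types() {
    name=$1; resolver=$2
    for rtype in A TXT CAA; do
        out=$(dig "$name" "$rtype" "@$resolver")
        status=$(printf '%s\n' "$out" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p')
        answers=$(printf '%s\n' "$out" | answer_count)
        printf '%s %s answers=%s\n' "$rtype" "$status" "$answers"
    done
}
```

For example, `compare_types pop.gwvanpelt.nl 127.0.0.1` prints one line per record type. If the problem is tied to empty responses, the types with `answers=0` would be the ones that intermittently return SERVFAIL.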
Will check on that, but for now, even CAA stopped sending SERVFAILs. Might be because of the low traffic at this moment.
Edit: it’s back, going to test some more after some sleep.
For what domain is it back? I don't see SERVFAILs for pop.gwvanpelt.nl right now.
Try mail.bkbouw.nl @ns1.zxcs.nl (185.104.28.19). I've lowered the query cache, so it should start returning SERVFAIL almost instantly. ns3.zxcs.nl (178.62.208.8) has a different configuration right now.
Hm, I'm still not able to reproduce, even for this domain.
Aren’t you using a different setup than before? I see this happening:
blocked ns2 ns3, forward to ns1
root@ubuntu:/home/rick# while true; do sleep 1; dig mail.bkbouw.nl CAA @127.0.0.1 | grep status; done
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 43506
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5907
---- THE MOMENT I RESTART POWERDNS ----
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37054
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56267
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7804
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8234
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 47057
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 6974
I still can't reproduce it, even with this script. One thing to note: I don't have the ability to restart your PowerDNS instance (of course). Is it possible the issue only presents shortly after a restart?
Also, my current config rejects packets going to ns[23], and allows packets going to ns1. I didn't apply a forwarding rule because I figured it was unnecessary. Could you share your iptables rules? Here are mine:
# iptables --list OUTPUT -v --line-numbers
Chain OUTPUT (policy ACCEPT 39160 packets, 44M bytes)
num pkts bytes target prot opt in out source destination
1 804 55864 REJECT all -- any any anywhere ns3.zxcs.nl reject-with icmp-port-unreachable
2 629K 53M REJECT all -- any any anywhere ns2.zxcs.nl reject-with icmp-port-unreachable
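For reference, rules like the ones listed above can be created with one `iptables -A OUTPUT ... -j REJECT` command per name server. The sketch below only prints the commands rather than running them, so they can be reviewed first. `block_cmds` is a hypothetical helper name, and this is an assumption about how such rules could be added, not a record of what was actually run.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) one iptables command per host
# that rejects outgoing packets to that host. The default REJECT behaviour
# is reject-with icmp-port-unreachable, matching the listing above.
block_cmds() {
    for host in "$@"; do
        printf 'iptables -A OUTPUT -d %s -j REJECT\n' "$host"
    done
}
```

`block_cmds ns2.zxcs.nl ns3.zxcs.nl` prints two commands, which need to be run as root to take effect.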
Also, it would probably be easier if instead you set up a domain (or subdomain) that had only one NS record, so neither of us would have to mess around with iptables.
Also, I got a reply on the unbound-users mailing list suggesting a possible area to look at: Issues with DNSSEC, use-caps-for-id, and empty responses. I assume your PowerDNS instance does online signing? Can you check whether it downcases queries before signing NSEC responses?
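To poke at the case-handling question by hand, one option is to send a query with randomly mixed case, which roughly imitates what a resolver with use-caps-for-id (DNS 0x20) sends. This is a manual approximation, not how Unbound implements it, and `mix_case` is a hypothetical helper name.

```shell
#!/bin/sh
# Randomize the case of each letter in a DNS name, similar in spirit to
# DNS 0x20 encoding. Non-letters (dots, digits, hyphens) pass through
# unchanged. awk's srand() seeds from the clock, so two calls within the
# same second may produce the same pattern.
mix_case() {
    printf '%s\n' "$1" | awk 'BEGIN { srand() } {
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            out = out (rand() < 0.5 ? toupper(c) : tolower(c))
        }
        print out
    }'
}
```

A query such as `dig +dnssec "$(mix_case mail.bkbouw.nl)" CAA @ns1.zxcs.nl` can then be compared with the all-lowercase equivalent, looking at the case of the name in the question section and in the NSEC records and signatures that come back.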
I reported an issue to PowerDNS, and they tell me:
When I check the version of PowerDNS you're currently running, it looks like 4.0.4. Have you upgraded since we began the discussion, or have you been running 4.0.4 all along?
$ dig +short version.bind chaos txt @ns1.zxcs.nl
"PowerDNS Authoritative Server 4.0.4 (built Jun 22 2017 20:14:47 by buildbot@c1b965951e5b)"
Hi @jsha
Yes, I upgraded to pdns 4.0.4 yesterday evening. I’m currently on holiday, so I’m sorry for my slow answers.
The problem looks solved! Thanks for giving so much time and attention to this issue.