BIND regression in 9.16.10 triggers SERVFAIL on CAA

I signed up for the forum here to post a question, and in the process of gathering debugging info, figured out the problem. :slight_smile: But I could still use a few hints from the experts in how to narrow this down to report it "upstream" to ISC.

TL;DR summary: ISC BIND 9.16.10 introduces a bug that, when LE installs TXT _acme-challenge.<subdomain>, causes LE to report "SERVFAIL" even though Wireshark confirms that NOERROR is returned by all name servers. The previous BIND release, 9.16.9, is not affected.

My domain is: moonlit-rail.com
Nota Bene: there is no point in checking this domain at the moment, as I have downgraded to BIND 9.16.9 on the internal server, which mitigates the bug. And yes, I now have renewed certs from LetsEncrypt. :smile:

I ran this command: dehydrated -c
It's the older 0.6.5 version and runs against the ACME v2 DNS-01 API.

It produced this output:

External tests: letsdebug.net reported "OK", as did mxtoolbox, verisign and similar DNS checking sites. unboundtest.com also reported no errors initially, but I have been able to get it to repeatably report "sec_status_bogus" when running BIND 9.16.10 or 9.16.11, like so:

  • unbound info: resolving ns3.moonlit-rail.com. AAAA IN
  • unbound info: response for moonlit-rail.com. DNSKEY IN
  • unbound info: reply from <moonlit-rail.com.> 88.218.94.169 #53
  • unbound info: query response was ANSWER
  • unbound info: validated DNSKEY moonlit-rail.com. DNSKEY IN
  • unbound info: validate(nodata): sec_status_bogus

Digging deeper: Analyzing the raw DNS traffic via Wireshark was not enlightening, other than to show NOERROR from all servers. However, in the BIND debug logs, I saw the following with BIND 9.16.10 but not with BIND 9.16.9:

  • dnssec: warning: client @0x7fffb800cec8 66.133.109.36#34285 (wwW.mOoNlIt-rAil.Com): view external: expected a exact match NSEC3, got a covering record

There were several similar lines, one for each probe from LE. My assumption is that BIND is not properly maintaining the DNSSEC records when dynamic updates (from LE via dehydrated) are inserted, and that this causes a DNSSEC validation error, which in turn causes LE to report "SERVFAIL" even though the actual servers are answering queries just fine.
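To make the "exact match" versus "covering" wording in that log line concrete, here is a minimal Python sketch of the NSEC3 owner-name hash described in RFC 5155 (SHA-1 over the wire-format name plus salt, iterated, then base32hex-encoded). The function names are hypothetical, and the salt and iteration count in the usage line are placeholders, not the real parameters of the moonlit-rail.com zone.

```python
import base64
import hashlib

# base32hex ("extended hex") alphabet used by NSEC3, mapped from the standard one
_B32_STD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
_B32_HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"
_TO_B32HEX = str.maketrans(_B32_STD, _B32_HEX)


def nsec3_hash(name, salt_hex, iterations):
    """Return the NSEC3 hash (SHA-1, RFC 5155) of a domain name as lowercase base32hex."""
    # Canonical wire format: lowercase, length-prefixed labels, root terminator
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    wire = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    salt = bytes.fromhex(salt_hex)  # "" means no salt
    digest = hashlib.sha1(wire + salt).digest()
    for _ in range(iterations):  # additional iterations re-hash digest + salt
        digest = hashlib.sha1(digest + salt).digest()
    return base64.b32encode(digest).decode("ascii").translate(_TO_B32HEX).lower()


def classify(owner_hash, next_hash, target_hash):
    """Say how one NSEC3 record (owner_hash -> next_hash) relates to target_hash."""
    if target_hash == owner_hash:
        return "exact match"  # the hashed name exists in the chain
    if owner_hash < next_hash:
        covered = owner_hash < target_hash < next_hash
    else:  # last record in the chain wraps around to the first
        covered = target_hash > owner_hash or target_hash < next_hash
    return "covering" if covered else "unrelated"


# Placeholder parameters (assumption): salt "aabbccdd", 5 extra iterations
print(nsec3_hash("wwW.mOoNlIt-rAil.Com", "aabbccdd", 5))
```

A name that exists in the zone should have an NSEC3 record whose owner hash matches it exactly, while a covering record only proves that nothing hashes into the gap. Read that way, the warning suggests named found only a covering record for a name that does exist, which would be consistent with an incompletely maintained NSEC3 chain.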

Questions: All of the above seems simple enough to put into an issue on ISC's GitLab. But I'm wondering if there may be anything non-standard in how the moonlit-rail.com domain is organized that may also be to blame. None of the DNS checkers I ran complained. Nonetheless...

  • The certificate requested from LE contains the base domain name, plus the usual assortment of subdomains: www, smtp, ftp, and so forth. Each of these is implemented as a CNAME. But BIND tries hard to not allow you to add other RRs when there is a CNAME present. E.g., if in the zone for example.com there is a "foo CNAME bar", then BIND says there should not be a "foo CAA letsencrypt". But LE is probing for exactly that case.
  • Similarly, if there is a "foo CNAME bar" then should there also be a "_acme-challenge.foo TXT <something>"? Would it make more sense to implement "foo" not as a CNAME, but instead by cloning the A/AAAA/etc of the corresponding server, "bar"? I know that many sites use CNAMEs successfully, even LetsEncrypt's own API servers do so; but adding other RRs beside or below a CNAME is not something I see.
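On the two CNAME questions above: the "no other data" restriction applies only to records at the same owner name as the CNAME. A record at a name below it, such as _acme-challenge.foo, lives at a different owner name and is not affected. The following is a small Python sketch of that rule; the function name and the sample records are made up for illustration, and the allow-list is simplified to the DNSSEC types that may accompany a CNAME.

```python
from collections import defaultdict

# Simplified assumption: only these types may share an owner name with a CNAME
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def cname_conflicts(records):
    """Map each owner name that has a CNAME to the other types that clash with it.

    `records` is an iterable of (owner_name, rr_type) pairs.
    """
    types_at = defaultdict(set)
    for owner, rrtype in records:
        types_at[owner.lower().rstrip(".")].add(rrtype.upper())
    return {
        owner: sorted(types - ALLOWED_WITH_CNAME)
        for owner, types in types_at.items()
        if "CNAME" in types and types - ALLOWED_WITH_CNAME
    }


# Hypothetical zone contents mirroring the examples in the bullets above
zone = [
    ("foo.example.com", "CNAME"),
    ("foo.example.com", "CAA"),                  # same owner as the CNAME
    ("_acme-challenge.foo.example.com", "TXT"),  # different owner name
    ("bar.example.com", "A"),
]
print(cname_conflicts(zone))
```

Only the CAA record at foo is flagged; the TXT record under _acme-challenge.foo is not, because it sits at its own owner name.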

Thanks for any thoughts...
Kris Karas


Even with you running ISC BIND 9.16.9, your nameservers generate responses for non-existent names which don't pass DNSSEC validation.

nonexistent12345.moonlit-rail.com | DNSViz gives a nice view as well. I wonder, is your zone split-horizon?

Nicely written post by the way!


Thanks _az! :smile:

[edit] Wish I had known about DNSViz before posting the original. I'm running 9.16.11 on the public-facing DNS servers and 9.16.9 only on the internal master, figuring that the public-facing servers are not doing any DNSSEC maintenance, only copying from the internal master. But perhaps there's another bug in 9.16.10+ that is causing the NSEC3 lossage? Or perhaps the NSEC3 issues are also present in 9.16.9 and I haven't noticed them yet (with LE at least). DNSViz would seem to suggest that.

Yes, I'm running split horizon on the downstream from the master (ns3), which has quite a few views, one per VLAN plus an external one on the WAN. The tertiary servers (ns1 and ns2) only serve external, with no views.

I'm using the 9.16 branch to handle the dnssec-policy feature (automatic key rollover).


Just wanted to post an update. The issue is as-yet unresolved upstream. But working with the ISC devos, I have narrowed it down.

Improper Usage

Because my external DNS zone was not being automatically signed, as named was supposed to be doing, I had tried a workaround of adding an NSEC3PARAM RR directly into the zone. This caused named to sign the zone and add NSEC3 RRs. I went with this, and it worked fine up through BIND 9.16.9. But as @_az has noted, the NSEC3 RRs were improperly maintained, causing no-such-address.moonlit-rail.com to fail to prove its non-existence.

In talking with the devos at ISC, they said that adding NSEC3PARAM into the zone is not supported, and will cause the sort of "partially signed" zone issues I saw with the NSEC3. The only proper way of doing this, they say, is to include a "nsec3param" statement inside the dnssec-policy block that controls automatic zone maintenance. Say what?

Yes, the "nsec3param" statement was not added to BIND until 9.16.10! This change of API/configuration made in a minor patch revision is not going to make harried system administrators happy, when they have to upgrade BIND on a 3:00am page and find it now requires reading the Bv9ARM and editing their config files.
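For anyone who has not met the statement yet, here is a minimal sketch of what it looks like in named.conf on BIND 9.16.10 or later. The policy name, zone name, file name, and parameter values are illustrative assumptions, not my actual configuration and not a recommendation.

```
// Hypothetical policy name; needs BIND 9.16.10 or later.
dnssec-policy "nsec3-example" {
    // Illustrative values only.
    nsec3param iterations 0 optout no salt-length 0;
};

zone "example.com" {
    type primary;
    file "example.com.db";           // hypothetical file name
    inline-signing yes;              // assumption: zone is not dynamically updated
    dnssec-policy "nsec3-example";   // signing and NSEC3 handled by the policy
};
```

With this in place, NSEC3 is controlled entirely by the policy, and no NSEC3PARAM RR is added to the zone by hand.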

As it turns out, using ISC's approach (only controlling NSEC3 via the dnssec-policy mechanism) doesn't work in one common situation, which is why I had to implement the workaround noted above, and which led to the failure of LetsEncrypt to validate the challenge.

Zeroing in on the Bug

Although the reasons are not yet understood, it seems that adding this very common control block to your zone's definition in named.conf will cause BIND to fail to sign the zone, despite using ISC's mandated statement inside dnssec-policy:

update-policy {
    grant ddns-update-key zonesub ANY;
};

Remove that block, and automatic DNSSEC maintenance works just fine (and LetsEncrypt will be happy). But then, you cannot dynamically update your zone using, for example, dhcpd.

I'll update, of course, when I have a chance to work on this with ISC, upstream.


Just to keep everybody updated, ISC has worked this bug and released a fix that makes KASP (key and signing policy) work correctly for NSEC3. For reference, the commit is at: ISC Merge 4739; this is expected to debut sometime this month as part of the BIND 9.16.13 release.

There was a bit of a disagreement over the documentation. ISC says that NSEC3 was not supported until BIND 9.16.10, whereas the Bv9ARM reference manual has stated since 9.16.0 that rndc can convert NSEC to NSEC3 and named will maintain it. (Of course, BIND won't actually support NSEC3 properly in dynamically updated zones until 9.16.13.)

I have not seen any other users of BIND chime in on the bug report I made over there. And I have not seen any other "me too" comments over here at LetsEncrypt. So, perhaps I'm the only one to have stumbled upon this. :smile: Well, if anybody does, at least we now know the cause and fix.
