I signed up for the forum here to post a question, and in the process of gathering debugging info, figured out the problem. But I could still use a few hints from the experts on how to narrow this down so I can report it "upstream" to ISC.
TL;DR summary: ISC BIND 9.16.10 introduces a bug that, when LE installs TXT _acme-challenge.<subdomain>, causes LE to report "SERVFAIL" even though Wireshark confirms that NOERROR is returned by all name servers. The previous BIND release, 9.16.9, is not affected.
My domain is: moonlit-rail.com
Nota Bene: there is no point in checking this domain at the moment, as I have downgraded to BIND 9.16.9 on the internal server, which mitigates the bug. And yes, I now have renewed certs from LetsEncrypt.
I ran this command: dehydrated -c
This is the older dehydrated 0.6.5, running against the ACME v2 API with the DNS-01 challenge.
It produced this output:
- "Error finalizing order :: While processing CAA for www.moonlit-rail.com: DNS problem: SERVFAIL looking up CAA for www.moonlit-rail.com - the domain's nameservers may be malfunctioning"
External tests: letsdebug.net reported "OK", as did mxtoolbox, verisign and similar DNS checking sites. unboundtest.com also reported no errors initially, but I have been able to get it to repeatably report "sec_status_bogus" when running BIND 9.16.10 or 9.16.11, like so:
- unbound info: resolving ns3.moonlit-rail.com. AAAA IN
- unbound info: response for moonlit-rail.com. DNSKEY IN
- unbound info: reply from <moonlit-rail.com.> 22.214.171.124 #53
- unbound info: query response was ANSWER
- unbound info: validated DNSKEY moonlit-rail.com. DNSKEY IN
- unbound info: validate(nodata): sec_status_bogus
Digging deeper: Analyzing the raw DNS traffic in Wireshark was not enlightening, other than to show NOERROR from all servers. However, in the BIND debug logs, I saw the following with BIND 9.16.10 but not with BIND 9.16.9:
- dnssec: warning: client @0x7fffb800cec8 126.96.36.199#34285 (wwW.mOoNlIt-rAil.Com): view external: expected a exact match NSEC3, got a covering record
There were several similar lines, one for each probe from LE. My assumption is that BIND is not properly maintaining the DNSSEC records when dynamic updates (from LE via dehydrated) are inserted, and that this causes a DNSSEC validation error on LE's side, which in turn makes LE report "SERVFAIL" even though the authoritative servers themselves answer with NOERROR.
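For the upstream report it may help to state how many of these warnings were logged and for which names. Here is a minimal Python sketch that tallies them. The regular expression is inferred only from the single log line quoted above, so treat the field layout as an assumption; the function name tally_nsec3_warnings is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the one sample warning quoted above (an assumption,
# not taken from BIND documentation).
NSEC3_WARNING = re.compile(
    r"client @0x[0-9a-fA-F]+ (?P<addr>\S+)#(?P<port>\d+) "
    r"\((?P<qname>[^)]+)\): view (?P<view>[^:]+): "
    r"expected a exact match NSEC3, got a covering record"
)


def tally_nsec3_warnings(lines):
    """Count NSEC3 warnings per (view, query name)."""
    counts = Counter()
    for line in lines:
        match = NSEC3_WARNING.search(line)
        if match:
            # The probes arrive with mixed-case names, so fold case before
            # counting; also drop any trailing dot.
            qname = match["qname"].lower().rstrip(".")
            counts[(match["view"], qname)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "dnssec: warning: client @0x7fffb800cec8 126.96.36.199#34285 "
        "(wwW.mOoNlIt-rAil.Com): view external: "
        "expected a exact match NSEC3, got a covering record",
    ]
    for (view, qname), n in tally_nsec3_warnings(sample).items():
        print(f"{view} {qname} {n}")
```

Feeding it the log from a 9.16.9 run and from a 9.16.10 run would give a simple before/after comparison to attach to the issue.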
Questions: All of the above seems simple enough to file as an issue on ISC's GitLab. But I'm wondering whether anything non-standard in how the moonlit-rail.com domain is organized may also be to blame. None of the DNS checkers I ran complained. Nonetheless...
- The certificate requested from LE contains the base domain name, plus the usual assortment of subdomains: www, smtp, ftp, and so forth. Each of these is implemented as a CNAME. But BIND tries hard to not allow you to add other RRs when there is a CNAME present. E.g., if in the zone for example.com there is a "foo CNAME bar", then BIND says there should not be a "foo CAA letsencrypt". But LE is probing for exactly that case.
- Similarly, if there is a "foo CNAME bar" then should there also be a "_acme-challenge.foo TXT <something>" ? Would it make more sense to implement "foo" not as a CNAME, but instead by cloning the A/AAAA/etc of the corresponding server, "bar" ? I know that many sites use CNAMEs successfully, and even LetsEncrypt's own API servers do so; but adding other RRs beside or below a CNAME is not something I see.
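To make the two cases above concrete: as far as I understand the standard DNS rule, a CNAME may not share its owner name with other data (apart from DNSSEC records such as RRSIG and NSEC), while a record at a child name like _acme-challenge.foo is a different owner name and does not conflict. Here is a small Python sketch of that check. It is only an illustration of the rule, not how BIND implements it, and cname_conflicts and the sample zone are hypothetical.

```python
# Record types that may share an owner name with a CNAME (DNSSEC metadata).
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def cname_conflicts(records):
    """Return owner names that hold a CNAME plus another, disallowed type.

    `records` is an iterable of (owner, rtype) pairs.
    """
    by_owner = {}
    for owner, rtype in records:
        key = owner.lower().rstrip(".")
        by_owner.setdefault(key, set()).add(rtype.upper())
    return sorted(
        owner
        for owner, types in by_owner.items()
        if "CNAME" in types and types - ALLOWED_WITH_CNAME
    )


if __name__ == "__main__":
    zone = [
        ("foo.example.com", "CNAME"),
        ("_acme-challenge.foo.example.com", "TXT"),  # child name: no conflict
        ("baz.example.com", "CNAME"),
        ("baz.example.com", "CAA"),  # same owner as a CNAME: conflict
    ]
    print(cname_conflicts(zone))  # prints ['baz.example.com']
```

By this rule the "foo CAA" case is the one that clashes with the CNAME, while the challenge TXT record below "foo" does not.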
Thanks for any thoughts...