Yesterday, we started noticing an increased failure rate when completing certificate orders. This began at 16:53 UTC, November 6th, 2023 and is ongoing.
About 60% of all certificate orders generated have resulted in INVALID orders, supposedly due to DNS failures. We’ve checked that our nameserver is indeed receiving the requests, and there have been no code changes on our end. Below are some example failures:
DNS problem: SERVFAIL looking up CAA for [domain_name] - the domain's nameservers may be malfunctioning
While we are continuing to investigate on our end -- this smells like a networking issue between the Let’s Encrypt DNS challenge client and our nameserver. Our nameserver implementation does not use DNSSEC and we simply return NOERROR with empty responses (which had been working with close to 0% DNS failure rate for about 3 years now). Below is a specific example:
One device that was failing to fetch certificates kept retrying until it got out of this INVALID state, for domainName: *.wxolw6t3l6ana3eryxwm.device.stripe-terminal-local-reader.net. For this domain, we see 7 total certificates being ordered (between 18:05 - 19:37 UTC). The first 6 hit DNS failures and immediately became INVALID. The last one transitioned to PROCESSING and finally VALID.
device.stripe-terminal-local-reader.net zone: The server(s) did not respond authoritatively for the namespace. (54.185.11.235, 54.212.160.246, 54.212.216.243)
kvbiatr7m3pcshymkoh7.device.stripe-terminal-local-reader.net/CAA (NODATA): No SOA RR was returned with the NODATA response. (54.185.11.235, 54.212.160.246, 54.212.216.243, UDP_-_EDNS0_4096_D_KN)
kvbiatr7m3pcshymkoh7.device.stripe-terminal-local-reader.net/CAA (NODATA): The Authoritative Answer (AA) flag was not set in the response. (54.185.11.235, 54.212.160.246, 54.212.216.243, UDP_-_EDNS0_4096_D_KN)
stripe-terminal-local-reader.net to device.stripe-terminal-local-reader.net: No SOA RR was returned with the NODATA response. (54.185.11.235, UDP_-_EDNS0_4096_D_KN)
stripe-terminal-local-reader.net to device.stripe-terminal-local-reader.net: The Authoritative Answer (AA) flag was not set in the response. (54.185.11.235, UDP_-_EDNS0_4096_D_KN)
stripe-terminal-local-reader.net zone: The server(s) responded over TCP with a malformed response or with an invalid RCODE. (54.185.11.235)
stripe-terminal-local-reader.net zone: The server(s) responded over UDP with a malformed response or with an invalid RCODE. (54.185.11.235)
stripe-terminal-local-reader.net/DNSKEY: The response had an invalid RCODE (REFUSED). (54.185.11.235, UDP_-_EDNS0_4096_D_KN, UDP_-_EDNS0_512_D_KN)
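For anyone wanting to reproduce the AA-flag and NODATA findings above without DNSViz, here is a rough header-level check in Python. It is only a sketch: `check_nodata_header` is a made-up name, it looks at the 12-byte DNS header alone, and it can only tell that a SOA record is impossible (empty authority section), not that one is actually present.

```python
import struct

AA_FLAG = 0x0400      # Authoritative Answer bit in the header flags word
RCODE_MASK = 0x000F   # low 4 bits of the flags word carry the RCODE


def check_nodata_header(response: bytes) -> list[str]:
    """Return a list of problems found in the DNS header of a raw response."""
    if len(response) < 12:
        return ["response shorter than a DNS header"]
    # Header layout: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (all 16-bit).
    _id, flags, _qd, ancount, nscount, _ar = struct.unpack("!6H", response[:12])
    problems = []
    rcode = flags & RCODE_MASK
    if rcode != 0:
        problems.append(f"unexpected RCODE {rcode}")
    if not flags & AA_FLAG:
        problems.append("AA flag not set")
    # NODATA = NOERROR with an empty answer section. With an empty authority
    # section too, there is nowhere for a SOA record to be.
    if rcode == 0 and ancount == 0 and nscount == 0:
        problems.append("NODATA with empty authority section (no SOA)")
    return problems
```

Feeding it the raw bytes of a UDP response to a CAA query should flag the same two things DNSViz reports for the `device` zone, assuming the server answers NOERROR with nothing in the answer or authority sections.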
While there's no overall increase in SERVFAIL, I do see a correlation for your domain: the failures begin immediately after an upgrade to Unbound DNS 1.18.0.
We had upgraded about half our DNS servers, which may explain why it's 60%.
That update is rolling back right now while we investigate.
Our server is authoritative, but we have not been replying in a way an authoritative server should. Technically however, it is ns.stripe-terminal-local-reader.net.
and figure out whose bug this is
We think this is ours. We're working on remediating the following:
I think the question may have been them wanting to know what software you are using for the DNS server, in case this is a common configuration that other domain owners might run into.
In addition to DNSViz, you might want to try out the ISC EDNS Compliance Tester, especially if your server configuration is a bit off the beaten path.
We have built our own extremely lightweight nameserver. Its capabilities are described in the docstring below:
/**
* This service listens for DNS A record lookup requests made with a specifically formatted name,
* will parse out an IP that is within the reserved IPv4 ranges, and answers with the derived private IP address.
*
* This server is intended to be extremely lightweight and simple, as well as limited in functionality. Specifically,
* this server will only respond to DNS queries with certain conditions:
*
* - the DNS query can contain a single question for the A record of the provided name
* - the base of the provided name must match a specific domain (e.g. device.stripe-terminal-local-reader.net)
* - the lowest level subdomain must be formatted as four groups of numbers delimited by a hyphen (e.g. 10-2-3-4)
* - the numbers when joined by a '.' must be a valid IP address (e.g. 10.2.3.4)
* - this resulting IP must fall in the reserved ranges for private address spaces (https://tools.ietf.org/html/rfc1918)
*
* If all of the conditions above are met, the server will respond with the A record resolving to the derived private
* IP address. Due to the fact that the query result is deterministic, the answer will be returned with a high TTL to
* reduce the need for superfluous lookups.
*
* Additionally this server can also respond to TXT record requests for _acme-challenge.{valid-domain} where
* valid-domain meets the validity criteria for A record requests listed above.
* e.g. _acme-challenge.192-168-1-1.device.stripe-terminal-local-reader.net
**/
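The name-to-IP rules in that docstring can be sketched in a few lines of Python. This is not the actual server code, just an illustration of the described checks: `derive_private_ip` and the constants are hypothetical names, and it assumes exactly one label sits in front of the base domain and that names are compared case-insensitively.

```python
import ipaddress
import re

# Assumed base domain, taken from the example in the docstring.
BASE_DOMAIN = "device.stripe-terminal-local-reader.net"
# The three RFC 1918 private ranges. ipaddress's is_private is broader
# (loopback, link-local, ...), so the ranges are listed explicitly.
RFC1918_NETWORKS = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]
LABEL_RE = re.compile(r"^([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3})$")


def derive_private_ip(qname: str):
    """Return the private IPv4 address encoded in qname, or None if invalid."""
    name = qname.rstrip(".").lower()  # DNS names compare case-insensitively
    suffix = "." + BASE_DOMAIN
    if not name.endswith(suffix):
        return None
    label = name[: -len(suffix)]
    match = LABEL_RE.match(label)  # four hyphen-delimited groups of digits
    if match is None:
        return None
    try:
        ip = ipaddress.IPv4Address(".".join(match.groups()))
    except ipaddress.AddressValueError:
        return None  # e.g. an octet above 255
    if not any(ip in net for net in RFC1918_NETWORKS):
        return None
    return str(ip)
```

For example, `derive_private_ip("10-2-3-4.device.stripe-terminal-local-reader.net")` gives `"10.2.3.4"`, while a public address such as `8-8-8-8` under the same base domain gives `None`. Queries that fail these checks, or that ask for other record types such as CAA, still need a well-formed authoritative response.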
Yeah, as I think you're seeing, homebrewing your own DNS server is something that sounds a lot easier than it is. There's a lot to take care of: making sure it handles EDNS options (or at least doesn't break when EDNS is used), handles both UDP and TCP, gives the right responses to requests for records you're not expecting, and so forth. (And that's without throwing fancier things like DNSSEC in there.) If you're rolling your own DNS server, you might want to test against a wide variety of clients, as well as several online DNS server tests, plus whatever test cases you can think of (mixed case, subsets of the name, querying the wrong servers, etc.).
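As a starting point for that kind of testing, here is a small Python sketch that builds raw probe queries using only the standard library: a plain query, the same query with an EDNS0 OPT record (4096-byte payload, DO bit set, similar to what a validating resolver sends), and a mixed-case variant of the name. The function names are hypothetical, and the test names you feed it are up to you.

```python
import struct

QTYPE = {"A": 1, "TXT": 16, "CAA": 257}
OPT_TYPE = 41  # EDNS0 pseudo-record type


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in (part for part in name.rstrip(".").split(".") if part):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def mixed_case(name: str) -> str:
    """Alternate letter case; a server should treat this as the same name."""
    return "".join(c.upper() if i % 2 else c.lower() for i, c in enumerate(name))


def build_query(name, qtype, query_id=0x1234, edns_payload=None, dnssec_ok=False):
    """Build a single-question DNS query, optionally with an EDNS0 OPT record."""
    arcount = 1 if edns_payload is not None else 0
    # Flags are all zero: a standard query with recursion not requested.
    header = struct.pack("!6H", query_id, 0, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!2H", QTYPE[qtype], 1)  # class IN
    if edns_payload is None:
        return header + question
    # OPT RR: root name, type 41, CLASS field carries the UDP payload size,
    # TTL field carries ext-rcode/version/flags (DO bit = 0x8000), RDLEN 0.
    opt_flags = 0x8000 if dnssec_ok else 0
    opt = b"\x00" + struct.pack("!HHIH", OPT_TYPE, edns_payload, opt_flags, 0)
    return header + question + opt
```

The resulting bytes can be sent over a UDP socket as-is, or over TCP with a two-byte length prefix, to see how the server copes with each variant.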
Well, if I understand what it's doing correctly, then there really isn't a "zone file" to sync, as it's dynamically returning a result depending on if it meets the criteria, and wants to just be a name for all internal IPs. There may be more off-the-shelf software for that sort of thing, but it might not be any easier than this approach is.
@jcjones / @mcpherrinm Is staging running the new version of Unbound? We have made changes that we believe should resolve the issue we had but want to confirm this before y'all make any changes to the production environment.