Potential networking / client changes on DNS Challenges

Yesterday, we started noticing an increased failure rate when completing certificate orders. This began at 16:53 UTC, November 6th, 2023 and is ongoing.

About 60% of all certificate orders generated have resulted in INVALID orders, reportedly due to DNS failures. We’ve checked that our nameserver is indeed receiving the requests, and there have been no code changes on our end.

We see the following error:

DNS problem: SERVFAIL looking up CAA for [domain_name] - the domain's nameservers may be malfunctioning

While we continue to investigate on our end, this smells like a networking issue between the Let’s Encrypt DNS challenge client and our nameserver. Our nameserver implementation does not use DNSSEC and we simply return NOERROR with empty responses (which has worked with a DNS failure rate close to 0% for about 3 years now). Below is a specific example:

One device failing to fetch certificates was able to continuously retry to get out of this INVALID state for domainName: *.wxolw6t3l6ana3eryxwm.device.stripe-terminal-local-reader.net. For this domain, we see 7 total certificates being ordered (between 18:05 and 19:37 UTC). The first 6 hit DNS failures and immediately became INVALID. The last one finally transitioned to PROCESSING and then VALID.

DNSViz thinks that there are things wrong with your DNS setup:

https://dnsviz.net/d/kvbiatr7m3pcshymkoh7.device.stripe-terminal-local-reader.net/dnssec/?rr=257&a=all&ds=all&ta=.&tk=

  • device.stripe-terminal-local-reader.net zone: The server(s) did not respond authoritatively for the namespace. (54.185.11.235, 54.212.160.246, 54.212.216.243)
  • kvbiatr7m3pcshymkoh7.device.stripe-terminal-local-reader.net/CAA (NODATA): No SOA RR was returned with the NODATA response. (54.185.11.235, 54.212.160.246, 54.212.216.243, UDP_-_EDNS0_4096_D_KN)
  • kvbiatr7m3pcshymkoh7.device.stripe-terminal-local-reader.net/CAA (NODATA): The Authoritative Answer (AA) flag was not set in the response. (54.185.11.235, 54.212.160.246, 54.212.216.243, UDP_-_EDNS0_4096_D_KN)
  • stripe-terminal-local-reader.net to device.stripe-terminal-local-reader.net: No SOA RR was returned with the NODATA response. (54.185.11.235, UDP_-_EDNS0_4096_D_KN)
  • stripe-terminal-local-reader.net to device.stripe-terminal-local-reader.net: The Authoritative Answer (AA) flag was not set in the response. (54.185.11.235, UDP_-_EDNS0_4096_D_KN)
  • stripe-terminal-local-reader.net zone: The server(s) responded over TCP with a malformed response or with an invalid RCODE. (54.185.11.235)
  • stripe-terminal-local-reader.net zone: The server(s) responded over UDP with a malformed response or with an invalid RCODE. (54.185.11.235)
  • stripe-terminal-local-reader.net/DNSKEY: The response had an invalid RCODE (REFUSED). (54.185.11.235, UDP_-_EDNS0_4096_D_KN, UDP_-_EDNS0_512_D_KN)
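
Two of the findings above (AA flag not set, no SOA with a NODATA response) can be spotted from the fixed 12-byte DNS header alone. The following is a minimal sketch, assuming you already have the raw response bytes; the function names are hypothetical, and checking the authority record count only tells you whether an SOA could be present, not that the record there is actually an SOA.

```python
import struct

def summarize_dns_header(message: bytes) -> dict:
    """Decode the fixed 12-byte DNS header (RFC 1035, section 4.1.1)."""
    if len(message) < 12:
        raise ValueError("DNS message is shorter than the 12-byte header")
    _id, flags, _qdcount, ancount, nscount, _arcount = struct.unpack("!6H", message[:12])
    return {
        "is_response": bool(flags & 0x8000),    # QR bit
        "authoritative": bool(flags & 0x0400),  # AA bit
        "rcode": flags & 0x000F,                # 0 = NOERROR, 2 = SERVFAIL, 5 = REFUSED
        "answers": ancount,
        "authority_records": nscount,
    }

def nodata_problems(message: bytes) -> list:
    """List header-level problems with a NODATA (NOERROR, zero answers) response."""
    header = summarize_dns_header(message)
    problems = []
    if header["rcode"] == 0 and header["answers"] == 0:
        if not header["authoritative"]:
            problems.append("AA flag not set")
        if header["authority_records"] == 0:
            problems.append("authority section is empty, so no SOA was returned")
    return problems
```

A response like the one described in the first post (NOERROR, empty, no AA) would report both problems, while a response with AA set and one authority record would report none.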

While there's no overall increase in SERVFAIL, I do see a correlation for your domain: the failures begin immediately after an upgrade to Unbound DNS 1.18.0.

We had upgraded about half our DNS servers, which may explain why the failure rate is about 60%.

That update is rolling back right now while we investigate.

Rollback is completed.

Hey @mcpherrinm and @jcjones, feel free to contact me directly if you'd like some more insight / help into figuring out what's happening here.

We're definitely curious as well.

Oh and I missed mentioning -- the error rates after the rollback have subsided on our side as well.

Thanks. Can you share what your authoritative nameserver for these records is?

We’re going to gather more information in our staging environment to understand what’s happened here, and figure out whose bug this is.

Our server is authoritative, but we have not been replying in a way an authoritative server should. Technically however, it is ns.stripe-terminal-local-reader.net.

"and figure out whose bug this is"

We think this is ours. We're working on remediating the following:

  • Add SOA RR to Authority Section on DNS response
  • Add AA flag to all DNS answers
  • Create email target for SOA record
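
For readers following along, the first two items can be sketched as below. This is only an illustration under stated assumptions, not Stripe's actual fix: `encode_name` and `build_nodata_response` are hypothetical helpers, the SOA timer values are placeholders, and a real server would also need to copy the RD bit from the query and handle EDNS (OPT) records.

```python
import struct

TYPE_SOA = 6
CLASS_IN = 1

def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError(f"bad label: {label!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"

def build_nodata_response(query_id, qname, qtype, zone, mname, rname,
                          serial=1, negative_ttl=300):
    """Build a NOERROR/NODATA response with AA set and an SOA in the authority section."""
    flags = 0x8000 | 0x0400  # QR=1 (response), AA=1 (authoritative), RCODE=0
    # Counts: 1 question, 0 answers, 1 authority record, 0 additional records.
    header = struct.pack("!6H", query_id, flags, 1, 0, 1, 0)
    question = encode_name(qname) + struct.pack("!2H", qtype, CLASS_IN)
    # SOA RDATA: MNAME, RNAME, then SERIAL, REFRESH, RETRY, EXPIRE, MINIMUM.
    soa_rdata = (encode_name(mname) + encode_name(rname)
                 + struct.pack("!5I", serial, 3600, 600, 86400, negative_ttl))
    authority = (encode_name(zone)
                 + struct.pack("!2HIH", TYPE_SOA, CLASS_IN, negative_ttl, len(soa_rdata))
                 + soa_rdata)
    return header + question + authority
```

For a CAA query (type 257), `mname` would be the primary nameserver (ns.stripe-terminal-local-reader.net in this thread) and `rname` the mailbox from the third item, encoded as a domain name; any specific mailbox name is an assumption here.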

I think the question may have been them wanting to know what software you are using for the DNS server, in case this is something that might be a common configuration that other domain owners might run into.

In addition to DNSViz, you might want to try out the ISC EDNS Compliance Tester, especially if your server configuration is a bit off the beaten path.

We have built our own extremely lightweight nameserver. Its capabilities are described in the docstring below:

/**
 * This service listens for DNS A record lookup requests made with a specifically formatted name,
 * will parse out an IP that is within the reserved IPv4 ranges, and answers with the derived private IP address.
 *
 * This server is intended to be extremely lightweight and simple, as well as limited in functionality. Specifically,
 * this server will only respond to DNS queries with certain conditions:
 *
 * - the DNS query can contain a single question for the A record of the provided name
 * - the base of the provided name must match a specific domain (e.g. device.stripe-terminal-local-reader.net)
 * - the lowest level subdomain must be formatted as four groups of numbers delimited by a hyphen (e.g. 10-2-3-4)
 * - the numbers when joined by a '.' must be a valid IP address (e.g. 10.2.3.4)
 * - this resulting IP must fall in the reserved ranges for private address spaces (https://tools.ietf.org/html/rfc1918)
 *
 * If all of the conditions above are met, the server will respond with the A record resolving to the derived private
 * IP address. Due to the fact that the query result is deterministic, the answer will be returned with a high TTL to
 * reduce the need for superfluous lookups.
 *
 * Additionally this server can also respond to TXT record requests for _acme-challenge.{valid-domain} where
 * valid-domain meets the validity criteria for A record requests listed above.
 * e.g. _acme-challenge.192-168-1-1.device.stripe-terminal-local-reader.net
 **/
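
The A record conditions in the docstring could be sketched roughly as follows. This is an illustration only, not the actual server code: `derive_private_ip` is a hypothetical name, and it assumes exactly one label sits in front of the base domain.

```python
import ipaddress

BASE_DOMAIN = "device.stripe-terminal-local-reader.net"

# The three private ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

def derive_private_ip(qname: str, base_domain: str = BASE_DOMAIN):
    """Return the private IPv4 address encoded in qname, or None if it does not qualify."""
    # DNS names compare case-insensitively; a trailing root dot is optional.
    name = qname.rstrip(".").lower()
    suffix = "." + base_domain.lower()
    if not name.endswith(suffix):
        return None
    label = name[: -len(suffix)]
    if "." in label:
        return None  # assumption: only one label is allowed before the base domain
    parts = label.split("-")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        return None
    try:
        address = ipaddress.IPv4Address(".".join(parts))
    except ValueError:
        return None  # e.g. an octet above 255
    if any(address in network for network in RFC1918_NETWORKS):
        return address
    return None
```

With this sketch, `10-2-3-4.device.stripe-terminal-local-reader.net` maps to 10.2.3.4, while a public address such as `8-8-8-8` under the same base returns `None`.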

Yeah, as I think you're seeing, homebrewing your own DNS server is something that sounds a lot easier than it is. There's a lot to take care of with ensuring that it handles EDNS options (or at least not breaking when EDNS is used), handles both UDP and TCP, gives the right responses to requests for records you're not expecting, and so forth. (And that's without throwing fancier things like DNSSEC in there.) If you're rolling your own DNS server, you might want to test against a wide variety of clients, as well as several online DNS server tests, plus whatever test cases you can think of (Mixed case, subsets of the name, querying the wrong servers, etc.)
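
As one concrete case from that list, some resolvers randomize the letter case of the names they query (often called 0x20 encoding), so a server has to match names case-insensitively and should echo the question name back exactly as it was received. A rough sketch of generating such test names, with hypothetical helper names:

```python
import random

def randomize_case(name: str, rng: random.Random) -> str:
    """Return the name with each letter randomly upper- or lower-cased."""
    return "".join(ch.upper() if rng.random() < 0.5 else ch.lower() for ch in name)

def same_dns_name(a: str, b: str) -> bool:
    """Compare two ASCII DNS names case-insensitively, ignoring a trailing root dot."""
    return a.rstrip(".").lower() == b.rstrip(".").lower()

if __name__ == "__main__":
    rng = random.Random()
    original = "_acme-challenge.192-168-1-1.device.stripe-terminal-local-reader.net"
    for _ in range(3):
        variant = randomize_case(original, rng)
        print(variant, same_dns_name(variant, original))
```

Each generated variant could be sent as a query name to check that the server still answers and returns the question section unchanged.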

Good luck!

You should use that homebrewed server as an unlisted authoritative server.
And have the listed nameservers simply sync the zone whenever a change is made.

Well, if I understand what it's doing correctly, then there really isn't a "zone file" to sync, as it's dynamically returning a result depending on if it meets the criteria, and wants to just be a name for all internal IPs. There may be more off-the-shelf software for that sort of thing, but it might not be any easier than this approach is.

True.
Having to build a zone file and a compatible transfer mechanism [from scratch] may not be worth the effort.

@jcjones / @mcpherrinm Is staging running the new version of Unbound? We have made changes that we believe should resolve the issue we had but want to confirm this before y'all make any changes to the production environment.

Yes, Staging is running 1.18.0 and has been since 3 Nov at 22:23Z.

@gurjit Do you have any timelines you can share about your tests against Staging?

@jcjones Later today, or tomorrow in the worst case. I have confirmed that the existing implementation fails on staging and still need to test the new DNS implementation.

@jcjones We have tested the new DNS implementation on staging and everything looks to be working well. We will roll out the change next week.

Excellent. Then I'll plan to do our upgrade next week after you. 🙂
