I’ve seen that kind of missing DNS response be related to IPv6 MTU limits being set too low, or to overly strict ICMP blocking rules. Either can add to the overall response delays, or prevent a response from arriving at all.
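To put rough numbers on the MTU point: IPv6 only guarantees a 1280-byte minimum MTU, so a large DNS response over UDP may have to be fragmented. If the fragments are dropped, or the ICMPv6 "Packet Too Big" messages that drive path MTU discovery are blocked, the response never arrives. This is a back-of-the-envelope sketch only, assuming a plain 40-byte IPv6 header with no extension headers other than the Fragment header.

```python
# Rough arithmetic: how many IPv6 fragments a DNS-over-UDP response needs.
# Assumes no IPv6 extension headers other than the Fragment header.
IPV6_HEADER = 40      # fixed IPv6 header, bytes
FRAG_HEADER = 8       # IPv6 Fragment extension header, bytes
UDP_HEADER = 8        # UDP header, bytes
IPV6_MIN_MTU = 1280   # minimum MTU every IPv6 link must support (RFC 8200)


def max_unfragmented_dns_payload(mtu: int) -> int:
    """Largest DNS message that fits in a single IPv6 packet."""
    return mtu - IPV6_HEADER - UDP_HEADER


def fragments_needed(dns_payload: int, mtu: int) -> int:
    """Number of IPv6 packets needed to carry one DNS response of this size."""
    if dns_payload <= max_unfragmented_dns_payload(mtu):
        return 1
    # Every fragment repeats the IPv6 header and adds a Fragment header; the data
    # part of each fragment except the last must be a multiple of 8 bytes.
    per_fragment = (mtu - IPV6_HEADER - FRAG_HEADER) // 8 * 8
    datagram = dns_payload + UDP_HEADER  # the UDP header rides in the first fragment
    return -(-datagram // per_fragment)  # ceiling division


if __name__ == "__main__":
    print(max_unfragmented_dns_payload(IPV6_MIN_MTU))  # 1232
    print(fragments_needed(4096, IPV6_MIN_MTU))        # 4
```

So at the minimum MTU, anything above 1232 bytes of DNS payload needs fragmentation, and a 4096-byte response turns into four packets that all have to make it through. That 1232-byte figure is the EDNS buffer size DNS Flag Day 2020 recommended for this reason.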
EDNS allows UDP payloads well beyond the classic 512-byte limit, typically up to 4096 bytes.
Sadly, it is still not properly implemented across the Internet, even though it has been out since 1999: https://www.ietf.org/rfc/rfc2671.txt
From https://tools.ietf.org/html/rfc6891 (which obsoletes RFC 2671):
EDNS provides a mechanism to improve the scalability of DNS as its uses get more diverse on the Internet. It does this by enabling the use of UDP transport for DNS messages with sizes beyond the limits specified in RFC 1035 …
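For reference, the mechanism is a single OPT pseudo-record in the additional section of the query, whose CLASS field carries the UDP payload size the requester can accept (RFC 6891, section 6.1). A minimal sketch of building such a query by hand; the query ID and the RD flag are arbitrary choices here, and this is not a full DNS client.

```python
import struct


def encode_qname(name: str) -> bytes:
    """Encode a domain name as DNS wire-format labels (RFC 1035, section 3.1)."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"  # the root label terminates the name


def build_query(name: str, qtype: int, udp_size: int = 4096, query_id: int = 0x1234) -> bytes:
    """Build a DNS query that advertises udp_size via an EDNS(0) OPT record."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1 (the OPT record)
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    question = encode_qname(name) + struct.pack("!HH", qtype, 1)  # QCLASS=IN
    # OPT pseudo-RR: empty owner name, TYPE=41, CLASS=advertised UDP payload size,
    # TTL=extended RCODE/version/flags (all zero), RDLENGTH=0 (no options)
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_size, 0, 0)
    return header + question + opt
```

For example, `build_query("mx2.slxh.nl", 257)` builds a CAA query advertising a 4096-byte buffer; sending it with `socket.sendto()` to one of the nameservers on port 53 would show whether a large reply actually makes it back over a given path.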
So much for that theory … (though us.com does go through CentralNIC before it reaches nsX.whois.com, so maybe the theory that the extra levels of indirection cause deadlines to be exceeded has some merit).
Let’s try an experiment: can you take a sample of the domains that have failed, submit them to unboundtest.com every five minutes, and see whether you get consistent success or intermittent failures? It should be possible to script this with curl.
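The experiment boils down to a loop that checks every failed domain once per round and then labels each domain by its results. In this sketch, `check` is a hypothetical placeholder for whatever performs one lookup (for example, a script that submits the unboundtest.com form) and returns True on success; the labels are illustrative, not part of any existing tool.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def run_rounds(domains: Iterable[str], check: Callable[[str], bool],
               rounds: int, interval_s: float = 300.0) -> Dict[str, List[bool]]:
    """Check every domain once per round, sleeping interval_s (default 5 min) between rounds."""
    domains = list(domains)
    history: Dict[str, List[bool]] = defaultdict(list)
    for i in range(rounds):
        for domain in domains:
            try:
                history[domain].append(check(domain))
            except Exception:
                # A transport error (timeout, HTTP error, ...) counts as a failed attempt.
                history[domain].append(False)
        if i < rounds - 1:
            time.sleep(interval_s)
    return dict(history)


def classify(results: List[bool]) -> str:
    """Label one domain's sequence of results."""
    if all(results):
        return "consistent success"
    if not any(results):
        return "consistent failure"
    return "intermittent"
```

Twelve rounds at the default interval cover one hour; `classify(history[domain])` then separates the domains that always fail from the ones that only fail some of the time.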
No API reference, I’m afraid (it’s not really robust enough to be an API, but it’s decent enough for this one-off test). Simulating the form submit should be fine.
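A minimal sketch of simulating that form submit from Python rather than curl. The form's action URL and the field names (`qname`, `qtype`) are assumptions, not a documented interface, so inspect the real form (for example in the browser's developer tools) before relying on them. Deciding whether a returned page counts as success is also left open, since the page format isn't documented either.

```python
import urllib.parse
import urllib.request

# ASSUMPTION: placeholders for the form's action URL and field names;
# verify them against the actual form on unboundtest.com.
FORM_URL = "https://unboundtest.com/"   # hypothetical form action
NAME_FIELD = "qname"                    # hypothetical field name
TYPE_FIELD = "qtype"                    # hypothetical field name


def build_form_request(domain: str, qtype: str = "CAA") -> urllib.request.Request:
    """Build a POST request that mimics submitting the web form once."""
    body = urllib.parse.urlencode({NAME_FIELD: domain, TYPE_FIELD: qtype}).encode("ascii")
    return urllib.request.Request(
        FORM_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def submit(domain: str, qtype: str = "CAA", timeout: float = 60.0) -> str:
    """Send the form request and return the result page as text."""
    with urllib.request.urlopen(build_form_request(domain, qtype), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The CAA default matches the false CAA failures in this thread; pass a different `qtype` to test other record types.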
DNSSPY.io shows:
All IPv6 nameservers are hosted by the same provider (AS16509 - AMAZON-02 - Amazon.com, Inc., US). Consider spreading the nameservers across multiple DNS providers for increased redundancy.
I ran my own Unbound-based test for an hour at 1-minute intervals across those domains, with a fresh libunbound instance every interval, and didn’t get any resolver errors.
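A rough sketch of a similar test, assuming the pyunbound Python bindings for libunbound are installed (`import unbound`). Each call to `make_unbound_lookup()` creates a new `ub_ctx`, so no cache survives from one interval to the next. The CAA query type and the 5-second "slow" threshold are illustrative assumptions, not what the original test used.

```python
import time
from typing import Callable, List, Tuple

SERVFAIL = 2        # DNS RCODE for server failure (RFC 1035)
RR_TYPE_CAA = 257   # CAA record type (RFC 8659)


def make_unbound_lookup(rrtype: int = RR_TYPE_CAA) -> Callable[[str], int]:
    """Return a lookup function backed by a brand-new libunbound context (empty cache)."""
    import unbound  # pyunbound; imported lazily so probe() works without it

    ctx = unbound.ub_ctx()  # default context: full recursion from the root hints

    def lookup(name: str) -> int:
        status, result = ctx.resolve(name, rrtype, unbound.RR_CLASS_IN)
        if status != 0:
            raise RuntimeError(unbound.ub_strerror(status))
        return result.rcode

    return lookup


def probe(names: List[str], lookup: Callable[[str], int],
          slow_s: float = 5.0) -> List[Tuple[str, str, float]]:
    """Run one round and return (name, outcome, seconds) for every name."""
    rows = []
    for name in names:
        start = time.monotonic()
        try:
            # Any rcode other than SERVFAIL (including NXDOMAIN) is not a resolver error.
            outcome = "SERVFAIL" if lookup(name) == SERVFAIL else "ok"
        except Exception as exc:
            outcome = f"error: {exc}"
        elapsed = time.monotonic() - start
        if outcome == "ok" and elapsed > slow_s:
            outcome = "slow"
        rows.append((name, outcome, elapsed))
    return rows
```

Calling `probe(domains, make_unbound_lookup())` once a minute for an hour mirrors the setup described above, and logging the rows makes any SERVFAILs or slow queries easy to spot afterwards.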
Edit: 24h later, there are no SERVFAILs and no slow queries, apart from a single weird spike that could easily have been a local condition.
I also had a persistent false CAA failure for mx2.slxh.nl (again). The failure disappeared after I requested the cert from another machine, after which requesting a cert from the original machine also worked.
Maybe rare failures are cached somehow?
Edit: the same happens for a large set of other .slxh.nl domains: it works from one host but not from another.