So much for that theory … (though us.com does go through CentralNic first before it hits nsX.whois.com, so maybe the theory that the levels of indirection cause deadlines to be exceeded has some merit).
Let’s try this experiment: Can you take a sampling of your domains that have failed, and submit them every five minutes against unboundtest.com and see if you get consistent success vs intermittent failures? It should be possible to script this with curl.
No API reference, I’m afraid (it’s not really robust enough to be an API, but decent enough for this one-off test). Simulating the form submit should be fine.
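For what it's worth, here is a rough Python sketch of that polling loop (curl in a shell loop would work just as well). Since there is no API reference, `FORM_URL`, the form field names, and the "SERVFAIL in the response body" check are all placeholder assumptions: look at the actual form on unboundtest.com and adjust them before relying on the results.

```python
import time
import urllib.parse
import urllib.request
from collections import Counter

# Hypothetical values: inspect the real form and replace all three.
FORM_URL = "https://unboundtest.com/"  # assumption, not a documented endpoint
FORM_FIELDS = {"type": "CAA"}          # assumption: field names may differ
DOMAIN_FIELD = "qname"                 # assumption


def submit_form(domain, timeout=30):
    """POST the (assumed) form; True if the request worked and showed no SERVFAIL."""
    data = urllib.parse.urlencode({**FORM_FIELDS, DOMAIN_FIELD: domain}).encode()
    try:
        with urllib.request.urlopen(FORM_URL, data=data, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False
    # Assumption: a failed lookup mentions SERVFAIL somewhere in the output.
    return "SERVFAIL" not in body


def run_rounds(domains, check, rounds, interval_s, sleep=time.sleep):
    """Check every domain once per round and tally ok/fail counts per domain."""
    tally = {d: Counter() for d in domains}
    for i in range(rounds):
        for d in domains:
            tally[d]["ok" if check(d) else "fail"] += 1
        if i < rounds - 1:  # no need to wait after the last round
            sleep(interval_s)
    return tally
```

Calling `run_rounds(domains, submit_form, rounds=12, interval_s=300)` would cover an hour at five-minute intervals. A domain with both `ok` and `fail` counts is the intermittent case; all `fail` would point to something consistent instead.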
DNSSPY.io shows:
All IPv6 nameservers are hosted by the same provider (AS16509 - AMAZON-02 - Amazon.com, Inc., US). Consider spreading the nameservers across multiple DNS providers for increased redundancy.
I ran my own Unbound-based test for an hour at 1m intervals across those domains, with a fresh libunbound instance every interval, and didn’t get any resolver errors.
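In case anyone wants to repeat something similar, a test of this shape could look roughly like the sketch below. It is not the exact harness used here: it assumes the pyunbound bindings (the `unbound` module) are installed, and the 2-second "slow" threshold and the helper names are made up for illustration.

```python
import time

CAA = 257      # RR type number for CAA
IN = 1         # RR class IN
SERVFAIL = 2   # DNS RCODE for server failure


def new_unbound_context():
    """Create a fresh libunbound context so no cache carries over between intervals."""
    import unbound  # pyunbound bindings; assumed to be installed
    return unbound.ub_ctx()


def probe(ctx, name, slow_s=2.0, clock=time.monotonic):
    """Resolve CAA for name and return (verdict, elapsed_seconds)."""
    start = clock()
    status, result = ctx.resolve(name, CAA, IN)
    elapsed = clock() - start
    if status != 0 or result.rcode == SERVFAIL:
        return "error", elapsed
    # slow_s is an arbitrary illustrative threshold, not a Let's Encrypt deadline.
    return ("slow" if elapsed > slow_s else "ok"), elapsed


def run_interval(names, make_ctx=new_unbound_context):
    """One interval: build a fresh context, then probe every name once."""
    ctx = make_ctx()
    return {name: probe(ctx, name) for name in names}
```

Running `run_interval(names)` once a minute and logging anything that is not `"ok"` gives the same kind of record: resolver errors and slow lookups per interval.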
Edit: 24h later, no SERVFAILs and no slow queries apart from one odd spike, which could easily have been a local condition.
I also had a persistent false CAA failure for mx2.slxh.nl (again). The failure disappeared after I requested the cert from another machine, after which requesting a cert from the original machine also worked.
Maybe rare failures are cached somehow?
Edit: same for a large set of other .slxh.nl domains: works from one host, not from another.