DNS problem: query timed out looking up CAA (using Netregistry)

Thanks for the detailed post!

Our past investigations into reachability issues with CAA and NetRegistry have indicated that they are most likely routing-related. That is, some hop along the path from us to them seems to drop UDP DNS packets if they are of type CAA. Querying from some parts of the Internet results in a timeout; querying from other parts succeeds. Previously we were able to get around this with a big hack: routing DNS traffic to NetRegistry through one of our datacenters that seemed to be able to reach them reliably. It's possible that routing tables have changed in such a way that the hack no longer works. We'll look into it.
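For anyone who wants to check this from their own vantage point, here is a minimal sketch of a UDP probe using only the Python standard library. It is not what we run in production; the function names, the 3-second timeout, and the placeholder server address `192.0.2.53` are assumptions for illustration. Substitute one of the domain's authoritative nameserver IPs.

```python
import socket
import struct

CAA_TYPE = 257  # RR type number assigned to CAA
A_TYPE = 1      # RR type number for A records, used as a control


def build_query(name: str, qtype: int = CAA_TYPE, query_id: int = 0x1234) -> bytes:
    """Encode a minimal single-question DNS query in wire format."""
    # Header: ID, flags (all clear, as when asking an authoritative server),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # Each label is length-prefixed; the name ends with a zero-length label.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
        if label
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN
    return header + question


def probe_udp(server: str, name: str, qtype: int, timeout: float = 3.0) -> str:
    """Send one UDP query and report whether anything came back in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (server, 53))
        try:
            sock.recvfrom(4096)
            return "answered"
        except socket.timeout:
            return "timeout"


if __name__ == "__main__":
    server = "192.0.2.53"  # placeholder: use a real authoritative nameserver IP
    for label, qtype in (("A", A_TYPE), ("CAA", CAA_TYPE)):
        print(label, probe_udp(server, "example.com", qtype))
```

If the A query is answered but the CAA query times out, and the same test from a different network answers both, that is consistent with a type-specific drop somewhere along the path rather than a problem on the nameserver itself.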

Our Unbound is configured with the default behavior, which is to attempt TCP if UDP fails. Our past investigations showed TCP failing in the same way. However, it's possible the TCP fallback is not happening fast enough for the timeouts we have configured inside Boulder. We'll dig into this too. We've also been meaning to explore a "TCP first" lookup methodology, which might mitigate the CAA timeouts if it's currently true that TCP queries reliably succeed (see this post).
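To make the "TCP first" idea concrete, here is a rough sketch in Python of what that ordering could look like under a fixed overall time budget. This is not how Unbound or Boulder implement lookups; `lookup_tcp_first`, the even split of the budget, and the `transports` parameter are all hypothetical choices made for illustration. It takes an already-encoded DNS query as bytes.

```python
import socket
import struct
import time


def frame_tcp(message: bytes) -> bytes:
    """DNS over TCP prefixes each message with a 2-byte big-endian length."""
    return struct.pack("!H", len(message)) + message


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since TCP may deliver a response in pieces."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed mid-response")
        buf += chunk
    return buf


def query_tcp(server: str, message: bytes, timeout: float) -> bytes:
    # The timeout applies to each socket operation, not to the whole exchange.
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        sock.sendall(frame_tcp(message))
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)


def query_udp(server: str, message: bytes, timeout: float) -> bytes:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(message, (server, 53))
        data, _ = sock.recvfrom(4096)
        return data


def lookup_tcp_first(server, message, budget,
                     transports=(("tcp", query_tcp), ("udp", query_udp))):
    """Try each transport in order, sharing one overall time budget."""
    deadline = time.monotonic() + budget
    for i, (label, send) in enumerate(transports):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        # Split what is left evenly across the transports not yet tried,
        # so the first attempt cannot use up the entire budget.
        share = remaining / (len(transports) - i)
        try:
            return label, send(server, message, share)
        except OSError:  # covers timeouts, resets, and refused connections
            continue
    raise TimeoutError("no transport answered within the budget")
```

The point of splitting the budget is the concern described above: if the first transport is allowed to run until the caller's own timeout expires, the fallback never gets a chance to run, no matter which transport goes first.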

Looking at our logs, we do see an increase in CAA-related timeouts on April 13. That happens to be the same day that NetRegistry had a major outage (1 2 3). It's possible that as part of their recovery from that outage, some routing properties changed that are causing this recurrence of timeout problems.

Thanks for bringing this to our attention; we'll work on getting it fixed.
