False CAA failure when issuing certs

But the initial delegation (from the root zones) is to worldnic.com.

That the domain is further delegated may be irrelevant, because the query may die before it ever reaches Pantheon's nameservers.

For example:

$ dig rule.com ns +short
ns67.worldnic.com.
ns68.worldnic.com.

Ah, I see: the root zones. OK.

Unboundtest returns NOERROR for a CAA lookup of the apex domain: https://unboundtest.com/m/CAA/altamidtown.com/EO4F53SO

Here’s a non-worldnic example: SERVFAIL checking CAA on www.eduhub.us.com for www.eduhub.us.com

$ dig ns eduhub.us.com +short
ns1.whois.com.
ns2.whois.com.
ns3.whois.com.
ns4.whois.com.

So much for that theory... though us.com does go through CentralNIC before it reaches nsX.whois.com, so the idea that extra levels of delegation cause query deadlines to be exceeded may still have merit.

Let's try an experiment: can you take a sample of your domains that have failed, submit them to unboundtest.com every five minutes, and see whether you get consistent successes or intermittent failures? It should be possible to script this with curl.

Yeah, we can do that. Is there a convenient API reference for unboundtest anywhere, or should we simulate the form submission?

No API reference, I’m afraid (it’s not really robust enough to be an API, but decent enough for this one-off test). Simulating the form submit should be fine.

Ya, curl -L 'https://unboundtest.com/q?type=CAA&qname=pantheon.io' works fine.
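A minimal sketch of the five-minute polling experiment, built on the query URL above. It is not an official client (unboundtest has no API). Assumptions: bash and curl are available, the domains are passed as arguments, and the response body contains a dig-style "status: NOERROR" / "status: SERVFAIL" header line. That last point is an assumption about the output format, not something documented. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll unboundtest.com for CAA lookups of each domain
# every INTERVAL seconds and log the rcode, to separate consistent failures
# from intermittent ones.
# ASSUMPTION: the response body contains a dig-style "status: RCODE" line.
set -u

INTERVAL="${INTERVAL:-300}" # seconds between rounds (five minutes)

# Read an unboundtest response body on stdin and print the first rcode found
# in a "status: XXX" line (prints nothing if there is no such line).
extract_status() {
  grep -o 'status: [A-Z]*' | head -n 1 | cut -d' ' -f2
}

# Query unboundtest.com for the CAA records of one domain and print its rcode.
# Domain names are not URL-encoded; plain hostnames need no escaping.
check_caa() {
  local domain="$1" body status
  body=$(curl -sSL "https://unboundtest.com/q?type=CAA&qname=${domain}") || {
    echo "CURL_ERROR"
    return
  }
  status=$(printf '%s\n' "$body" | extract_status)
  echo "${status:-UNKNOWN}"
}

main() {
  local domain
  while true; do
    for domain in "$@"; do
      printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$domain" "$(check_caa "$domain")"
    done
    sleep "$INTERVAL"
  done
}

# Only start polling when domains are given on the command line.
if [ "$#" -gt 0 ]; then
  main "$@"
fi
```

For example, running it as "./poll-unboundtest.sh altamidtown.com www.eduhub.us.com >> caa.log" (script name hypothetical) produces one timestamped line per domain per round. Grepping the log for anything other than NOERROR shows whether failures repeat for the same domains or are scattered.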

DNSSPY.io shows:
All IPv6 nameservers are hosted by the same provider (AS16509 - AMAZON-02 - Amazon.com, Inc., US). Consider spreading the nameservers across multiple DNS providers for increased redundancy.

I ran my own Unbound-based test for an hour at 1m intervals across those domains, with a fresh libunbound instance every interval, and didn’t get any resolver errors.

Edit: 24 hours later, still no SERVFAILs, and no slow queries apart from a one-off spike that could easily have been a local condition:

[plot not reproduced]
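A rough approximation of the Unbound-based test described above. It is a sketch, not the script that was actually run. It assumes the unbound-host tool, which links libunbound, is installed. Each unbound-host invocation creates a new libunbound context with an empty cache and, with no forwarders configured, resolves iteratively from the root. This mimics "a fresh libunbound instance every interval". The SERVFAIL check relies on unbound-host's human-readable output (for example "Host example.com not found: 2(SERVFAIL).") and is an assumption. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: repeat CAA lookups once a minute for an hour, using a fresh
# libunbound context (a new unbound-host process) for every query.
# ASSUMPTION: unbound-host is installed and prints "SERVFAIL" on failure.
set -u

ROUNDS="${ROUNDS:-60}"     # 60 rounds...
INTERVAL="${INTERVAL:-60}" # ...at 60 s intervals = about one hour

# Succeed if the unbound-host output read on stdin reports a SERVFAIL.
is_servfail() {
  grep -q 'SERVFAIL'
}

main() {
  local failures=0 round domain out
  for ((round = 1; round <= ROUNDS; round++)); do
    for domain in "$@"; do
      # New process per lookup, so no cache is shared between queries.
      out=$(unbound-host -t CAA "$domain" 2>&1)
      if printf '%s\n' "$out" | is_servfail; then
        failures=$((failures + 1))
        echo "$(date -u +%H:%M:%S) SERVFAIL $domain"
      fi
    done
    sleep "$INTERVAL"
  done
  echo "total SERVFAILs: $failures"
}

# Only run the loop when domains are given on the command line.
if [ "$#" -gt 0 ]; then
  main "$@"
fi
```

Starting a new process per lookup is the simplest way to guarantee an empty cache. It also means every query pays the full delegation walk from the root, which is the path the deadline theory above is concerned with.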


I ran a test on these domains for 4 days, and got no SERVFAIL at all.

I also had a persistent false CAA failure for mx2.slxh.nl (again). It went away after I requested the certificate from another machine; after that, requesting it from the original machine worked too.

Maybe rare failures are cached somehow?

Edit: the same happens for a large set of other .slxh.nl domains: issuance works from one host but not from another.

@jsha Hi again. I've just poked around the latest version of https://github.com/golang/crypto, but there's no mention of a User-Agent. We can (and will) update, but it doesn't look like that will change anything: https://github.com/golang/crypto/blob/master/acme/http.go#L198

Where did you see the default user-agent change?

Turns out I was wrong about this. I saw an update on golang/go issue #24496 ("x/crypto/acme: Set a meaningful user-agent") indicating that a CL was ready, but it looks like it hasn't been merged yet (https://go-review.googlesource.com/c/crypto/+/86635).

