CNAME does not solve CAA timeout, but probably should?

I’ve found what may seem to be a weird corner case in the CAA query chain logic, but it’s really important for our use case here at search.usa.gov. I hope you’ll consider it :)

Background: at DigitalGov Search, we supply search services for a whole boatload of US federal government agencies. The way this works is that each agency CNAMEs a “vanity domain” such as search.someagency.gov to us, and we handle everything from there. One of the many things that we handle is the creation of SSL (SAN) certificates that secure traffic for all of the search.someagency.gov hostnames. To accomplish this we use Let’s Encrypt as described in this blog post - and we’re very grateful for the service!

Since we are bundling many hostnames together onto a single cert (to serve them from a single AWS ELB), creation or renewal of a cert is an all-or-nothing proposition for us: if any of the vanity hostnames fails CAA validation, we can’t renew our SAN cert. In our case we have ~80 hostnames that are passing CAA validation, but two hostnames that have begun failing due to CAA timeouts since the cutover to mandatory CAA validation. This prevents us from renewing our existing SAN cert.

However, after reading the RFC closely and considering this thread on GitHub, I’m not sure that LE is doing the right thing in our situation.

Here’s one example. First, we see that our agency customer has a DNS server that is (incorrectly) timing out on CAA queries:

[nick@nick-mbp ~]$ host -t caa search.invasivespeciesinfo.gov ns1.usda.gov
;; connection timed out; no servers could be reached

Getting the authoritative search.invasivespeciesinfo.gov DNS server to support CAA lookups might, in all actuality, require an Act of Congress. (Or it might not. But you can imagine why this is a non-trivial thing for us to change at the DNS configuration level.) However, they have CNAME’d search.invasivespeciesinfo.gov into our DNS zone:

[nick@nick-mbp ~]$ host -t cname search.invasivespeciesinfo.gov ns1.usda.gov
Using domain server:
Name: ns1.usda.gov
Address: 199.141.126.202#53
Aliases:

search.invasivespeciesinfo.gov is an alias for nisic.sites.infr.search.usa.gov.

And if you query the search.usa.gov authoritative nameservers, there is a valid CAA response for that hostname:

[nick@nick-mbp ~]$ host -t caa nisic.sites.infr.search.usa.gov search-ns1.usa.gov
Using domain server:
Name: search-ns1.usa.gov
Address: 52.203.57.242#53
Aliases:

nisic.sites.infr.search.usa.gov has CAA record 0 iodef "mailto:dgsearchops@rrsoft.co"
nisic.sites.infr.search.usa.gov has CAA record 0 issue "letsencrypt.org"

While I can completely understand the argument that “a CAA resolution that times out should not be considered successful” (because a malicious actor could flood the authoritative nameservers at the same time a request is being made for CAA validation), I don’t think that logic should apply when a CNAME exists for the hostname in question and it is returning a positive CAA validation result.

Here’s some additional defense for the argument I am making. The RFC says:

   Let CAA(X) be the record set returned in response to performing a CAA
   record query on the label X, P(X) be the DNS label immediately above
   X in the DNS hierarchy, and A(X) be the target of a CNAME or DNAME
   alias record specified at the label X.

   o  If CAA(X) is not empty, R(X) = CAA(X), otherwise

   o  If A(X) is not null, and R(A(X)) is not empty, then R(X) =
      R(A(X)), otherwise ...

Let’s Encrypt is choosing to treat a timeout during the CAA(X) phase as “CAA validation failure” rather than “empty response”. AFAICT this isn’t mandated by the RFC, but it does make sense in isolation for the DNS flooding security concern mentioned above and elsewhere on this forum.

However, the definition of CNAME suggests that we probably shouldn’t evaluate the CAA(X) lookup timeout in isolation. Specifically, RFC 2181 reminds us that

An alias name (label of a CNAME record) may, if DNSSEC is in use,
have SIG, NXT, and KEY RRs, but may have no other data.

Note that CAA records are not mentioned in the list of RRs for which the delegating zone retains authority. At one extreme, the resolution order in RFC 6844 might be considered wrong by adherents of the strict CNAME definition (which suggests that R(A(X)) should take precedence over R(X)), but at the very least I think that implementors of Let’s Encrypt’s backend validation mechanism should consider following the CAA query chain through a same-level CNAME/DNAME branch before treating a timeout as a CAA failure.
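To make the proposal concrete, here is a minimal Python sketch of the RFC 6844 lookup with an optional "follow the alias before giving up on a timeout" switch. It is only an illustration and not Boulder's code: the in-memory tables stand in for real DNS, and the names `LookupTimeout`, `relevant_rrset`, and `follow_alias_on_timeout` are hypothetical. The parent-climbing step is the part of the RFC algorithm that the excerpt above elides.

```python
from typing import Dict, List, Optional, Set


class LookupTimeout(Exception):
    """Raised by the toy resolver when a CAA query gets no answer."""


# Hypothetical in-memory stand-ins for DNS, seeded from the lookups above.
CAA_RECORDS: Dict[str, List[str]] = {
    "nisic.sites.infr.search.usa.gov": [
        '0 iodef "mailto:dgsearchops@rrsoft.co"',
        '0 issue "letsencrypt.org"',
    ],
}
ALIASES: Dict[str, str] = {
    "search.invasivespeciesinfo.gov": "nisic.sites.infr.search.usa.gov",
}
TIMES_OUT: Set[str] = {"search.invasivespeciesinfo.gov"}


def caa(name: str) -> List[str]:
    """CAA(X): the CAA record set at X, or a timeout."""
    if name in TIMES_OUT:
        raise LookupTimeout(name)
    return CAA_RECORDS.get(name, [])


def alias(name: str) -> Optional[str]:
    """A(X): the CNAME/DNAME target at X, if any."""
    return ALIASES.get(name)


def parent(name: str) -> Optional[str]:
    """P(X): the name one label up, or None for a top-level domain."""
    _, _, rest = name.partition(".")
    return rest or None


def relevant_rrset(name: str, follow_alias_on_timeout: bool = False) -> List[str]:
    """R(X) per RFC 6844, optionally trying A(X) before failing on a timeout."""
    timed_out = False
    try:
        records = caa(name)
    except LookupTimeout:
        if not follow_alias_on_timeout:
            raise  # strict behaviour: a timeout is a validation failure
        timed_out = True
        records = []
    if records:
        return records
    target = alias(name)
    if target is not None:
        via_alias = relevant_rrset(target, follow_alias_on_timeout)
        if via_alias:
            return via_alias
    if timed_out:
        # The alias did not help, so the timeout still counts as a failure.
        raise LookupTimeout(name)
    up = parent(name)
    if up is not None:
        return relevant_rrset(up, follow_alias_on_timeout)
    return []
```

With the switch off, `relevant_rrset("search.invasivespeciesinfo.gov")` raises `LookupTimeout`; with it on, the call returns the two CAA records found at the alias target. A timeout with no usable alias still fails, so the flooding concern is only relaxed where the alias target gives a positive answer.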

Thanks in advance for your consideration!

@jsha, would you care to opine on this?

Hi @nickmarden! Thanks for the detailed post.

We haven't made any recent changes to our CAA code. Validation has been mandatory since the beginning. There is one pending change: to reject domains that return SERVFAIL. That has been enabled in staging, but not prod, for a while now, until we find time to do some analysis of the switchover.

Is it possible something has recently changed about those two domains causing them to fail?

I believe this is the case given the current implementation. Currently, Boulder relies on Unbound as a recursive resolver to look up CAA records. The RFC 1034 recursive resolution algorithm will first check for a CNAME, and if there is one, will follow it.

So I think we already do what you're proposing. I can dig into our logs to see if I can find out anything more about this particular domain's failure. If you'd like to do some independent investigating, you can also set up a local Unbound instance and try querying through it for CAA for these domains.
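As a rough illustration of the CNAME-following behaviour described above, here is a small Python sketch of a resolver that checks for a CNAME at the queried name and restarts the query at the target. It is a toy model over an in-memory table, not Unbound's or Boulder's implementation; `ZONE_DATA`, `resolve`, and `MAX_CNAME_HOPS` are hypothetical names, and the hop limit is an arbitrary choice.

```python
from typing import Dict, List, Tuple

# Hypothetical record table keyed by (owner name, record type),
# seeded with the records shown earlier in this thread.
ZONE_DATA: Dict[Tuple[str, str], List[str]] = {
    ("search.invasivespeciesinfo.gov", "CNAME"): [
        "nisic.sites.infr.search.usa.gov",
    ],
    ("nisic.sites.infr.search.usa.gov", "CAA"): [
        '0 iodef "mailto:dgsearchops@rrsoft.co"',
        '0 issue "letsencrypt.org"',
    ],
}

MAX_CNAME_HOPS = 8  # arbitrary guard against alias loops


def normalize(name: str) -> str:
    """Lower-case a name and drop any trailing root dot."""
    return name.rstrip(".").lower()


def resolve(name: str, rtype: str) -> Tuple[str, List[str]]:
    """Return (canonical name, records), following CNAMEs along the way."""
    current = normalize(name)
    for _ in range(MAX_CNAME_HOPS + 1):
        cname = ZONE_DATA.get((current, "CNAME"))
        if cname and rtype != "CNAME":
            # Alias found: restart the same query at the canonical name.
            current = normalize(cname[0])
            continue
        return current, ZONE_DATA.get((current, rtype), [])
    raise RuntimeError("CNAME chain too long or looping")
```

In this model, `resolve("search.invasivespeciesinfo.gov", "CAA")` returns the canonical name `nisic.sites.infr.search.usa.gov` together with its two CAA records, so the caller never needs CNAME-specific logic of its own. The model assumes the alias itself can be looked up; it says nothing about what happens when the server holding the CNAME does not answer.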

We hadn't changed anything in our DNS prior to noticing this issue, and our client has been CNAME'ing to us without interruption for many months. So, no, nothing has changed.

After we noticed the timeout error during cert renewal, we added CAA support, but that was after the fact - it did not cause the original CAA timeout error.

What in the what? The cert renewal was failing last night and is not failing now, and nothing has been changed in the interim. That's weird to me because all of the DNS responses were at that time, and are now, what I've detailed in my post.

The only thing I can think of that might have changed is that last night your code might not have been querying the authoritative nameserver(s) for search.invasivespeciesinfo.gov or search.usa.gov, and was therefore receiving a cached SERVFAIL from some intermediate (caching) nameserver? Although the RFC does say

   Data cached by third parties MUST NOT be relied on but MAY be used to
   support additional anti-spoofing or anti-suppression controls.

I feel like "concluding that the CAA lookup is a SERVFAIL because of cached DNS data is a bad idea" would fall under this clause, if that's what was happening.

At any rate, thanks for your consideration and for having implemented this (either mostly or 100%) correctly.

Odd! Perhaps the servers were slow last night? However, that would make it odd that we saw timeouts for the CAA lookup and not the A lookup. Another possibility: We’ve found that a very small minority of routes appear to drop CAA DNS queries, due to some unknown filtering. Perhaps traffic between Let’s Encrypt’s resolver and yours briefly went through one of those routes last night?

At any rate, glad it’s working for you now!
