I’ve found what may seem to be a weird corner case in the CAA query chain logic, but it’s really important for our use case here at search.usa.gov. I hope you’ll consider it.
Background: at DigitalGov Search, we supply search services for a whole boatload of US federal government agencies. The way this works is that each agency CNAMEs a “vanity domain” such as search.someagency.gov to us, and we handle everything from there. One of the many things that we handle is the creation of SSL (SAN) certificates that secure traffic for all of the search.someagency.gov hostnames. To accomplish this we use Let’s Encrypt as described in this blog post, and we’re very grateful for the service!
Since we are bundling many hostnames together onto a single cert (to serve them from a single AWS ELB), creation or renewal of a cert is an all-or-nothing proposition for us: if any one of the vanity hostnames fails CAA validation, we can’t renew our SAN cert. In our case we have ~80 hostnames that are passing CAA validation, but two hostnames that have begun failing due to CAA timeouts since the cutover to mandatory CAA validation. This prevents us from renewing our existing SAN cert.
Here’s one example. First, we see that our agency customer has a DNS server that is (incorrectly) timing out on CAA queries:
[nick@nick-mbp ~]$ host -t caa search.invasivespeciesinfo.gov ns1.usda.gov
;; connection timed out; no servers could be reached
Getting the authoritative search.invasivespeciesinfo.gov DNS server to support CAA lookups might, in all actuality, require an Act of Congress. (Or it might not. But you can imagine why this is a non-trivial thing for us to change at the DNS configuration level.) However, they have CNAME’d search.invasivespeciesinfo.gov into our DNS zone:
[nick@nick-mbp ~]$ host -t cname search.invasivespeciesinfo.gov ns1.usda.gov
Using domain server:
Name: ns1.usda.gov
Address: 126.96.36.199#53
Aliases:
search.invasivespeciesinfo.gov is an alias for nisic.sites.infr.search.usa.gov.
And if you query the search.usa.gov authoritative nameservers, there is a valid CAA response for that hostname:
[nick@nick-mbp ~]$ host -t caa nisic.sites.infr.search.usa.gov search-ns1.usa.gov
Using domain server:
Name: search-ns1.usa.gov
Address: 188.8.131.52#53
Aliases:
nisic.sites.infr.search.usa.gov has CAA record 0 iodef "mailto:email@example.com"
nisic.sites.infr.search.usa.gov has CAA record 0 issue "letsencrypt.org"
While I can completely understand the argument that “a CAA resolution that times out should not be considered successful” (because a malicious actor could flood the authoritative nameservers at the same time a request is being made for CAA validation), I don’t think that logic should apply when a CNAME exists for the hostname in question and it is returning a positive CAA validation result.
Here’s some additional defense for the argument I am making. RFC 6844 says:
Let CAA(X) be the record set returned in response to performing a CAA record query on the label X, P(X) be the DNS label immediately above X in the DNS hierarchy, and A(X) be the target of a CNAME or DNAME alias record specified at the label X.
o If CAA(X) is not empty, R(X) = CAA(X), otherwise
o If A(X) is not null, and R(A(X)) is not empty, then R(X) = R(A(X)), otherwise
...
Let’s Encrypt is choosing to treat a timeout during the CAA(X) phase as “CAA validation failure” rather than “empty response”. AFAICT this isn’t mandated by the RFC, but it does make sense in isolation for the DNS flooding security concern mentioned above and elsewhere on this forum.
However, the definition of CNAME suggests that we probably shouldn’t evaluate the CAA(X) lookup timeout in isolation. Specifically, RFC 2181 reminds us that
An alias name (label of a CNAME record) may, if DNSSEC is in use, have SIG, NXT, and KEY RRs, but may have no other data.
Note that CAA records are not mentioned in the list of RRs for which the delegating zone retains authority. At one extreme, the resolution order in RFC 6844 might be considered wrong by adherents of the strict CNAME definition (which suggests that R(A(X)) should take precedence over R(X)), but at the very least I think that implementors of Let’s Encrypt’s backend validation mechanism should consider following the CAA query chain through a same-level CNAME/DNAME branch before treating a timeout as a CAA failure.
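To make the proposal concrete, here is a minimal Python sketch of the behavior I’m asking for. It is not Let’s Encrypt’s actual code: the in-memory tables, the function names, and the `follow_alias_on_timeout` flag are all hypothetical, and the tables just mirror the hostnames from the example above. It covers only the two RFC steps quoted above; the remaining steps of the RFC algorithm are left out.

```python
from typing import Dict, List, Optional, Set


class CAATimeout(Exception):
    """Raised when a (simulated) CAA query gets no answer."""


# Hypothetical in-memory stand-ins for DNS, mirroring the example above.
CAA_RECORDS: Dict[str, List[str]] = {
    "nisic.sites.infr.search.usa.gov": ['0 issue "letsencrypt.org"'],
}
ALIASES: Dict[str, str] = {
    "search.invasivespeciesinfo.gov": "nisic.sites.infr.search.usa.gov",
}
# Names whose authoritative server never answers CAA queries.
CAA_TIMES_OUT: Set[str] = {"search.invasivespeciesinfo.gov"}


def caa(name: str) -> List[str]:
    """CAA(X): the CAA record set at exactly this name (may be empty)."""
    if name in CAA_TIMES_OUT:
        raise CAATimeout(name)
    return CAA_RECORDS.get(name, [])


def alias(name: str) -> Optional[str]:
    """A(X): the CNAME/DNAME target at this name, or None."""
    return ALIASES.get(name)


def relevant_caa(name: str, follow_alias_on_timeout: bool, depth: int = 0) -> List[str]:
    """R(X), restricted to the two quoted steps of the RFC algorithm.

    With follow_alias_on_timeout=False, a timeout on CAA(X) is an
    immediate failure. With True, the alias branch is tried first and
    the timeout only becomes a failure if that branch yields nothing.
    """
    if depth > 8:  # arbitrary guard against alias loops
        raise RuntimeError("alias chain too long")

    timed_out = False
    try:
        records = caa(name)
    except CAATimeout:
        if not follow_alias_on_timeout:
            raise
        records, timed_out = [], True

    if records:  # CAA(X) is not empty
        return records

    target = alias(name)
    if target is not None:  # A(X) is not null
        via_alias = relevant_caa(target, follow_alias_on_timeout, depth + 1)
        if via_alias:  # R(A(X)) is not empty
            return via_alias

    if timed_out:
        # The alias branch did not help, so the timeout is still a failure.
        raise CAATimeout(name)
    return []  # remaining RFC steps omitted in this sketch
```

Calling `relevant_caa("search.invasivespeciesinfo.gov", follow_alias_on_timeout=True)` returns the `issue "letsencrypt.org"` record set from our zone, while the same call with `follow_alias_on_timeout=False` raises `CAATimeout`, which is the failure we are hitting today. A timeout with no alias, or with an alias that has no CAA records, still fails in both modes, so the flooding concern is only relaxed when the alias target gives a positive answer.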
Thanks in advance for your consideration!