Strict CAA checking does not implement tree climbing on CNAMEs

Hi,

I have some certificates for internal machines which have begun failing DNS-01 challenges since the CAA strict checking/no SERVFAIL changes.

My setup is that the domains I'm requesting certs for are CNAMEs pointing to DNS records which are hosted on firewalled DNS authorities. This allows a public name to point to an internal IP without revealing the network addresses or structure of my company's LAN.

e.g.

Thus when LetsEncrypt follows the CNAME from git.example.com and asks the internal.example.com nameserver, it gets SERVFAIL since it cannot contact 10.0.0.53.

This appears to be the reason LE returns

acme: Error 400 - urn:acme:error:connection - DNS problem: SERVFAIL looking up CAA for git.example.com
Error Detail:
Validation for git.example.com:
Resolved to:

Used:

Lego exit status: 1

However, per RFC 6844, Section 4 (DNS Certification Authority Authorization (CAA) Resource Record), it looks to me like LE should be implementing tree climbing to the CNAME record's parent name, and should thus eventually check the CAA record for example.com, even if it cannot reach the authority for gitserver.internal.example.com.

I have attempted to catch tree climbing by both:

On the theory that tree climbing from git.example.com should hit the former, while tree climbing from gitserver.internal.example.com should hit the latter.

However, my DNS-01 attempts are still failing.

Here's the RFC section I think says this is wrong:

Let CAA(X) be the record set returned in response to performing a CAA
record query on the label X, P(X) be the DNS label immediately above
X in the DNS hierarchy, and A(X) be the target of a CNAME or DNAME
alias record specified at the label X.

o If CAA(X) is not empty, R(X) = CAA (X), otherwise

o If A(X) is not null, and R(A(X)) is not empty, then R(X) =
R(A(X)), otherwise

o If X is not a top-level domain, then R(X) = R(P(X)), otherwise

o R(X) is empty.
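For anyone following along, the quoted definition can be read as a small recursive function. This is only a sketch over a hypothetical in-memory table (`CAA_RECORDS`, `ALIASES`, and the helper names are made up for illustration, and the CAA value is an example), not how any CA actually implements it. It also says nothing about lookup errors such as SERVFAIL, since the quoted text does not define them.

```python
from typing import Dict, List, Optional

# Hypothetical stand-ins for DNS data; not a real resolver.
CAA_RECORDS: Dict[str, List[str]] = {
    "example.com": ['0 issue "letsencrypt.org"'],
}
ALIASES: Dict[str, str] = {
    "git.example.com": "gitserver.internal.example.com",
}


def caa(name: str) -> List[str]:
    """CAA(X): the CAA record set at exactly this name (empty if none)."""
    return CAA_RECORDS.get(name, [])


def alias(name: str) -> Optional[str]:
    """A(X): the CNAME/DNAME target at this name, or None."""
    return ALIASES.get(name)


def parent(name: str) -> Optional[str]:
    """P(X): the name immediately above X, or None for a top-level domain."""
    _, _, rest = name.partition(".")
    return rest or None


def relevant_caa(name: str) -> List[str]:
    """R(X), following the four bullets of the quoted RFC 6844 text."""
    records = caa(name)
    if records:                      # bullet 1
        return records
    target = alias(name)
    if target is not None:           # bullet 2
        via_alias = relevant_caa(target)
        if via_alias:
            return via_alias
    up = parent(name)
    if up is not None:               # bullet 3
        return relevant_caa(up)
    return []                        # bullet 4
```

With this table, `relevant_caa("git.example.com")` ends up at the record set published at example.com. Note that under the text as quoted, bullet 2 climbs the target's tree before bullet 3 climbs the CNAME's own tree.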

Bullet (1) should fail, since we cannot have a CAA on the CNAMEd record.

Bullet (2) will SERVFAIL, meaning that the condition "A(X) is not null and R(A(X)) is not empty" is undefined, so we should proceed via the "otherwise" to

Bullet (3), which would cause a CAA lookup on the parent of X, in this case the parent of git.example.com, which is example.com, and should be retrievable.

Given the announcement of hard fails on SERVFAIL and the behavior I have observed, I believe the Let's Encrypt checker is likely incorrectly taking the SERVFAIL in the bullet (2) test as an overall permanent fail, while it seems to me it should simply pass on to the remainder of the rules.

Glancing at the IETF mailing lists on this subject, it looks like the tree climbing and CNAME rules in the RFC were intended to support non-publicly-accessible nameservers like I'm talking about.

My specific provider is Amazon Route53 for both the internal and external zones. I'm using the Lego client from a Linux host (which, incidentally, on the client side can access the internal DNS authority).

Will it be possible to fix the behavior of the verification tool on LE's servers?

I believe the issue is that your DNS should not be returning SERVFAIL, but rather NOERROR. The former is considered a failure condition that, as I understand it, is designed to prevent issuance.

Unfortunately, it’s not straightforward to distinguish a SERVFAIL due to an unavailable nameserver from a SERVFAIL due to tampered DNS, which is why we implement strict failing.

@jared.m’s solution is not quite right, since it’s not your authoritative resolver that is returning SERVFAIL. Your authoritative resolver, as you say, is not available from the public Internet, so Let’s Encrypt’s recursive resolver is returning SERVFAIL when trying to contact it.

I believe the best way to fix this situation is for your externally-visible nameserver to not return an internal-only delegation when queried from external hosts. In other words, when queried from the public Internet for a private name, your public nameserver should return NXDOMAIN or NOERROR; when queried from behind your firewall it should return the delegation to the internal-only nameserver.

Thank you for your quick reply.

@jsha, regarding your comments:

Unfortunately, it’s not straightforward to distinguish a SERVFAIL due to an unavailable nameserver from a SERVFAIL due to tampered DNS, which is why we implement strict failing.

I understand that SERVFAIL has a security implication, but I believe that the tree climbing portion of the RFC was specifically designed to address this SERVFAIL-from-RR due to network connectivity limitations.

I hate to reference a non-authoritative source, but looking at the IETF lists, the first result is a message from one of the RFC's authors clearly saying that the tree climbing behavior was intended to support exactly the sort of security controls I have implemented by keeping my authorities off the public Internet:

The reason the tree climbing is necessary is that MANY DNS host and service
names are not visible on the public DNS. So most of the time, a CA has no
way to validate the records for secrethost.example.com, the CAA record has
to be at example.com.

In other words, the four-bullet-point recursive algorithm I copied from the RFC in my initial post was specifically intended to mean if SERVFAIL-from-RR, then proceed to tree climbing (the P(X) portion), not short-circuit fail the entire algorithm.

Separately, I would point out that the failure to lookup this record originates from the authority of my CNAME's target, and the RFC is very clear that the tree-climbing should proceed off of the parent of the CNAME itself (i.e. NOT the target's parent). This is very sensible, since the cert is being issued for the CNAME, and thus for a name under the bailiwick of the CNAME's parent, not the target's parent.

Thus I would argue that a SERVFAIL on the CNAME target's resolution should certainly not preclude a more positive result from the CNAME's parent itself, and I believe this is implicit in the RFC's ruleset.

Supporting this interpretation, note that I could easily obtain an LE cert for my CNAME if I were to temporarily repoint it at a different, public name (not a practical solution, but part of the security model), then repoint it back to the secret name later on. Thus there is no real security advantage in ignoring the security policy of the CNAME's bailiwick in favor of an equivocal response from the target's authority.

I believe the best way to fix this situation is for your externally-visible nameserver to not return an internal-only delegation when queried from external hosts.

Split-horizoning, while a common technique, is effectively a violation of the domain name system, as now we have one name with two "authoritative" resolutions. Aside from being ugly and inelegant, it can practically lead to caching issues (as we have two authoritative answers possible depending on connection status) and other heisenbugs.

Your suggestion of a pure public NXDOMAIN would help limit the cache issue to 300 seconds (and is likely what I'll do on an emergency basis, though I'd point out it's more prudent and practical to split-horizon the private namespace with public NXDOMAINs, and not poison the public namespace), but it's still a non-DNS hack.

The reason for pointing publicly to a "hidden" authority is to accomplish the security goal of limiting network visibility while remaining within standard DNS rules, and avoiding split-horizon DNS entirely.

This is an intended use of the DNS, and as such is, per my reading, an intended and supported use case of the CAA RFC's checking rules.

Is this the right forum for discussing the security design of the ACME rule checker?

I think this particular issue is especially relevant to the DNS-01's checker design, since the primary motivation for the DNS-01 checker was to support corporate networks with non-publicly accessible web servers. This is exactly the circumstance where keeping the secret HTTP server's address itself secret would be both prudent and expected.

I would love for the checker to support these private names within the DNS spec as intended, rather than mandating a broken-DNS configuration by requiring split-horizoning.

Thanks for any pointers on working with the right groups to sort this out.

Re-reading your original post, I think you have an outdated view of how Let's Encrypt implements CAA. You're describing the legacy RFC 6844 with tree-climbing on CNAMEs. We used that method for a couple of weeks, but we're now back to implementing the erratum 5065 variant, which doesn't tree-climb on CNAMEs. When was the last time you got this error?

Also, your example doesn't really make sense to me: If git.example.com is a CNAME to gitserver.internal.example.com, then looking up the TXT record for DNS validation will fail because 10.0.0.53 is unreachable. Could you please show your real domain names so we can help debug further? As a reminder, all domain names in your certificates wind up in the public CT logs (e.g. at https://crt.sh/).

Thanks,
Jaco

@jsha, thanks for the response.

Re-reading your original post, I think you have an outdated view of how Let’s Encrypt implements CAA. You’re describing the legacy RFC 6844 with tree-climbing on CNAMEs. We used that method for a couple of weeks, but we’re now back to implementing the erratum 5065 variant, which doesn’t tree-climb on CNAMEs. When was the last time you got this error?

You're right, I hadn't seen those changes. I will look into those differences more deeply, but at first glance, it looks to me like the erratum clarifies the behavior I expected: namely, that CAA records should be checked on the CNAME record's tree, not the target's tree.
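To make the difference concrete, here is a sketch of the erratum 5065 variant as I read it: the CNAME target itself is queried for CAA, but the climb continues only from the parent of the original name, never up the target's tree. The table contents and helper names are hypothetical, and this ignores lookup errors entirely.

```python
from typing import Dict, List, Optional

# Hypothetical data: the target's parent zone publishes a restrictive policy,
# while the CNAME's own parent authorizes one CA.
CAA_RECORDS: Dict[str, List[str]] = {
    "example.com": ['0 issue "letsencrypt.org"'],
    "internal.example.net": ['0 issue ";"'],
}
ALIASES: Dict[str, str] = {
    "git.example.com": "host.internal.example.net",
}


def parent(name: str) -> Optional[str]:
    """Name immediately above this one, or None for a top-level domain."""
    _, _, rest = name.partition(".")
    return rest or None


def relevant_caa_erratum(name: str) -> List[str]:
    """R(X) under my reading of erratum 5065 (an assumption, not LE's code)."""
    records = CAA_RECORDS.get(name, [])
    if records:
        return records
    target = ALIASES.get(name)
    if target is not None:
        # Only CAA(A(X)) is consulted: no climbing from the target.
        at_target = CAA_RECORDS.get(target, [])
        if at_target:
            return at_target
    up = parent(name)
    return relevant_caa_erratum(up) if up is not None else []
```

Here the policy at internal.example.net is never consulted for git.example.com, because it sits above the target rather than at it; the result comes from example.com.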

Also, your example doesn’t really make sense to me: If git.example.com is a CNAME to gitserver.internal.example.com, then looking up the TXT record for DNS validation will fail because 10.0.0.53 is unreachable.

Exactly; thus, failing to get any answer for CAA gitserver.internal.example.com, my reading of the RFC and erratum says you should check the CAA for the parent of the CNAME, e.g. query CAA example.com.

Could you please show your real domain names so we can help debug further? As a reminder, all domain names in your certificates wind up in the public CT logs (e.g. at https://crt.sh/).

Absolutely.

The domain in question is git.fuwt.org, which is a CNAME to tools.mgmt.int.fuwt.org. At the time I opened the ticket, we also had:

Zone fuwt.org:

git.    IN CNAME tools.mgmt.int.fuwt.org
int.    IN NS ns1.int.fuwt.org ns2.int.fuwt.org
ns1.int IN A 10.0.0.53
ns2.int IN A 10.0.1.53
.       IN CAA 0 issue ";"

Zone int.fuwt.org:

tools.mgmt IN A 10.0.0.10

Please note that since our last communication, we've implemented split horizon DNS as an emergency workaround, and may be making other DNS changes, so it's not a 100% valid test of the system. However, it would be easy for me to make a replica with a test name if it helps your testing process.

One other thing: Even after adding a split horizon zone for int.fuwt.org (and thus removing those internal NS and glue records for int.fuwt.org from our public DNS), LetsEncrypt still won't verify. The error is no longer SERVFAIL; the error is that the CAA policies don't allow it.

However, if you query our DNS now, you should see:

Zone fuwt.org:

.    IN CAA 0 issue ";"
git. IN CNAME tools.mgmt.int.fuwt.org

Zone int.fuwt.org:

tools.mgmt IN CAA 0 issue ";"
tools.mgmt IN A 127.0.0.1

I've also tried with a public tools.mgmt CAA but without any A, which also fails (also contrary to the spec, I believe).

One final note: We're likely moving this address this evening to an Amazon ELB so as to use their ACM auto-issued certs as a stopgap while we debug LetsEncrypt, so the live DNS records may change.

At this point, I'm just confused about the exact checking ruleset that LetsEncrypt is implementing, since the RFC and erratum seem to permit both failure to resolve a CNAME and NXDOMAIN so long as there are valid CAA records somewhere in the DNS tree (as well as SERVFAIL due to failure to connect to the CNAME target's DNS authority).

It's possible I'm missing something, but it feels to me like any of these configurations should have been accepted. Thanks for any guidance on next steps working with LE.

Setting aside CAA for a moment: In order to perform domain validation for tools.mgmt.int.fuwt.org, Let's Encrypt does a lookup for _acme-challenge.tools.mgmt.int.fuwt.org. In your earlier configuration, the authoritative NSes for int.fuwt.org were unavailable, so that TXT record lookup should have failed. How were you planning to complete domain validation for this domain?

This means "forbid all issuance."

RFC 6844 doesn't actually talk about error handling at all, an unfortunate oversight. We've chosen to fail closed on errors because we think it's the only way to be secure. The algorithm in both the RFC and the erratum says that if there is no CAA record present at a given level (i.e. NOERROR with an empty RRset), then processing should continue to the next level.
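The distinction between an empty answer and an error can be sketched as follows. This is a simplified illustration with hypothetical names (`ServFail`, `climb`, `demo_lookup`) and made-up data, covering only the parent climb; it is not the code Let's Encrypt runs. The `fail_closed=False` branch models the soft-fail reading argued earlier in this thread.

```python
from typing import Callable, List


class ServFail(Exception):
    """Hypothetical marker for a lookup that ended in SERVFAIL."""


Lookup = Callable[[str], List[str]]


def climb(name: str, lookup: Lookup, fail_closed: bool = True) -> List[str]:
    """Walk from name toward the root; return the first non-empty CAA set.

    An empty list from lookup() models NOERROR with an empty RRset, which
    lets processing continue to the next level. ServFail either aborts the
    whole check (fail closed) or is treated like an empty answer (soft fail).
    """
    current = name
    while current:
        try:
            records = lookup(current)
        except ServFail:
            if fail_closed:
                raise
            records = []
        if records:
            return records
        _, _, current = current.partition(".")
    return []


def demo_lookup(name: str) -> List[str]:
    # Made-up data: the CAA query for the CNAME cannot be completed because
    # the target's nameserver is unreachable.
    if name == "git.example.com":
        raise ServFail(name)
    if name == "example.com":
        return ['0 issue "letsencrypt.org"']
    return []
```

With `fail_closed=True`, `climb("git.example.com", demo_lookup)` raises and issuance would be refused; with `fail_closed=False` it reaches the record at example.com. A name whose lookup returns an empty answer climbs in both modes.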

It still feels to me as though you’re not grasping the attack model here. Several times now you’ve referred to SERVFAIL as if it’s just no records found, but that’s explicitly not what it is.

This is clearer in the DNSSEC scenario. Active bad guys can trivially arrange SERVFAIL by dropping the queries or answers on the network. They can’t synthesise the lack of records because that answer is signed.

Yes, I understand that issue. However, lacking support for off-the-Internet CNAME target zones is a huge hole in the CAA specification, if in fact that was its intent.

I can see the case for rejecting a SERVFAIL on the name being signed, but it looks to me like the tree-climbing was explicitly intended to allow signing for unavailable CNAME targets (which is a name explicitly not being signed in the cert).

The reasons cited at the time were about how to prevent signing for unavailable targets, not how to allow it. For instance, if an attacker tried to issue for nonexistent.int.example.com, a lookup for CAA nonexistent.int.example.com would either error or return an empty NOERROR. In an early version of the draft, that was the only query that would be made, and an empty NOERROR would have meant "go ahead and issue." The goal of tree-climbing was to allow example.com to set a policy that would restrict issuance for its subdomains, whether those subdomains exist or not.

I'm still curious about the answer to this question, as I think it is key to understanding your planned deployment.

Setting aside CAA for a moment: In order to perform domain validation for tools.mgmt.int.fuwt.org, Let’s Encrypt does a lookup for _acme-challenge.tools.mgmt.int.fuwt.org. In your earlier configuration, the authoritative NSes for int.fuwt.org were unavailable, so that TXT record lookup should have failed. How were you planning to complete domain validation for this domain?

My understanding has always been (and my experience with LE has always been) that it looks for TXT _acme-challenge.git.fuwt.org.. If you look at crt.sh, you'll see over a year of certs issued by LE using exactly this configuration with private target zone nameservers using the DNS-01 challenge.

This means “forbid all issuance.”

Got it, my mistake, and probably what's blocking the split horizon config. Thank you.
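For reference, the effect of a `0 issue ";"` record can be sketched like this. The parsing is deliberately simplified (it assumes records written as `flags tag "value"` text), the function names are hypothetical, and treating a set with no `issue` entries as unrestricted is a simplifying assumption of this sketch.

```python
import shlex
from typing import List


def authorized_issuers(records: List[str]) -> List[str]:
    """Collect issuer domains named by 'issue' entries in a CAA record set."""
    issuers = []
    for record in records:
        _flags, tag, value = shlex.split(record)
        if tag.lower() != "issue":
            continue
        # The issuer domain is whatever precedes the first ';' (parameters
        # may follow). A bare ";" therefore names no issuer at all.
        domain = value.split(";", 1)[0].strip()
        if domain:
            issuers.append(domain.lower())
    return issuers


def may_issue(records: List[str], ca_domain: str) -> bool:
    """True if ca_domain is permitted by the given CAA record set."""
    has_issue = any(shlex.split(r)[1].lower() == "issue" for r in records)
    if not has_issue:
        return True  # assumption: no 'issue' entries means no restriction
    return ca_domain.lower() in authorized_issuers(records)
```

So `may_issue(['0 issue ";"'], "letsencrypt.org")` is false for every CA, which matches the "forbid all issuance" reading, while `may_issue(['0 issue "letsencrypt.org"'], "letsencrypt.org")` is true.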

RFC 6844 doesn’t actually talk about error handling at all, an unfortunate oversight. We’ve chosen to fail closed on errors because we think it’s the only way to be secure. The algorithm in both the RFC and the erratum says that if there is no CAA record present at a given level (i.e. NOERROR with an empty RRset), then processing should continue to the next level.

Yeah, that's what I'd figured. I'd just point out that failing closed on specifically CNAME resolutions creates a problem in the case I outlined above, and looks like a case considered in the drafting of the RFC.

The ACME spec's looking for a sibling record of the name being signed makes much more sense to me because it proves control of the name's DNS tree parent, and it avoids the CNAME issue entirely. I'm surprised the CAA spec doesn't follow this example explicitly, but it seems to me that treating failure to resolve the CNAME target as a soft error (while leaving all other errors as hard errors) would be entirely equivalent, and may be the intent. Doing otherwise would promote out-of-spec split-horizon DNS for any internal-only zones.

Since there is no clear answer in the RFC, is there a possibility of a feature change for LetsEncrypt?

Thanks,

Chris

Aha, now I understand; I had gotten confused about the names. You are correct, that validation for git.fuwt.org will do a lookup for _acme-challenge.git.fuwt.org, and that lookup won't have to hit your internal DNS servers. Glad we are on the same page now. Thanks for explaining.
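A tiny sketch of why the names matter here: the DNS-01 query name is built from the name being validated, and a CNAME applies only to its exact owner name, so the challenge name is not covered by the alias on git.fuwt.org. The helper names are hypothetical, and the alias table only mirrors the configuration described in this thread.

```python
from typing import Dict

# The alias described above; a CNAME applies only to this exact owner name.
ALIASES: Dict[str, str] = {"git.fuwt.org": "tools.mgmt.int.fuwt.org"}


def challenge_name(domain: str) -> str:
    """Return the TXT record name queried for a DNS-01 challenge."""
    return "_acme-challenge." + domain.rstrip(".")


def crosses_alias(query_name: str) -> bool:
    """True if a query for this exact name would be redirected by a CNAME."""
    return query_name.rstrip(".") in ALIASES
```

`challenge_name("git.fuwt.org")` gives `_acme-challenge.git.fuwt.org`, for which `crosses_alias` is false, so that lookup stays in the public fuwt.org zone.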

We're very unlikely to change this behavior anytime soon. There is currently a discussion on the LAMPS mailing list about the next revision of the CAA spec, which may involve a discussion of error handling, but I think it will be hard to devise any soft-fail error handling regime that is still secure.

Thanks for the reference to the LAMPS list!

Looks like the latest thread there exactly addresses my concern about CNAME CAA resolution, though coming at it from a different business case – they’re discussing how to treat CNAME CAA records to avoid interference from CDNs that are CNAME targets, and one of the proposals is to check _caa.git.fuwt.org instead of resolving the CNAME, as we just discussed. Hopefully the spec will evolve to cover more use cases like this.

Thanks,

Chris
