Different CAAs returned from two DNS servers

I found some slightly odd behavior while testing edge scenarios for my system and dns-01 challenges.

It is perfectly legal for a DNS server to return several CAA records for a domain. The ACME service should find the one matching the provider's policy. So, if I provide two records:

example.org         CAA 0 issue "letsencrypt.org"
example.org         CAA 0 issue "otherca.org"

Let's Encrypt should recognize the first CAA record as the one that allows the service to proceed with certificate generation, and indeed it does. So technically this is OR (alternative) logic within the scope of a single DNS server.
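The OR behavior within a single RRset can be sketched as below. This is only an illustration, not Let's Encrypt's actual code: the function name `rrset_permits` and the `(flags, tag, value)` tuple layout are assumptions made for the example, and it ignores `issuewild` and the critical flag.

```python
def rrset_permits(rrset, ca_domain):
    """Return True if the CAA RRset lets ca_domain issue.

    rrset is a list of (flags, tag, value) tuples (hypothetical layout).
    """
    # Take the issuer domain from every "issue" record, dropping any parameters.
    issuers = [
        value.split(";", 1)[0].strip().lower()
        for _flags, tag, value in rrset
        if tag == "issue"
    ]
    if not issuers:
        # No issue records at all: CAA places no restriction on issuance.
        return True
    # One matching record is enough; this is the OR within a single RRset.
    return ca_domain.lower() in issuers
```

With the two records above, `rrset_permits` accepts `letsencrypt.org` because one of the issue records names it, regardless of the second record.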

The second scenario I tested has two authoritative DNS servers.

DNS 1 provides this:

example.org         CAA 0 issue "letsencrypt.org"

DNS 2 provides nothing (no CAA, no TXT records for _acme-challenge).
In this scenario ACME queries both DNS servers, finds that one of them provides correct records, and issues the certificate. So it seems that the OR logic also works across multiple DNS servers. So far, so good.

Then I tested a third scenario, where DNS 1 provides this:

example.org         CAA 0 issue "letsencrypt.org"

but DNS 2 provides this:

example.org         CAA 0 issue "otherca.org"

And sadly, in this scenario ACME complains about a CAA record mismatch. So the OR logic does not really work across multiple DNS servers.

Why is it important?

Imagine building a multi-cluster environment of super-highly-available services.
There are many DNS servers in different locations. There are also many ACME clients renewing certificates. For security reasons, we don't want to share the account's private key between those instances, so we decided to create a separate account on each ACME client, each able to issue certificates for our domains (yes, I'm aware of the API limits, all is good). Each ACME client can connect to the DNS servers and instruct them to serve the appropriate CAA and TXT records.

First, we don't want to be attacked by someone exhausting our ACME API rate limits. That's why our DNS servers return a CAA 0 issue ";" response if there are no pending certificate orders. The correct CAA record (containing the appropriate account URI) is served only once our ACME client has submitted the challenge to the DNS servers.

As I said, we are building a super-highly-available cluster, so we must assume that things might (and will) go wrong. There might be network hiccups, some DNS servers might be malfunctioning, and so on. The result might be that only a subset of our DNS servers receives the challenge request from our ACME clients. Some of the DNS servers might then serve the correct CAA record, like:

example.org         CAA 128 issue "letsencrypt.org;accounturi=...;validationmethods=dns-01"

but others might serve this:

example.org         CAA 0 issue ";"

The current implementation of Let's Encrypt returns an error in such scenarios. Instead, the ACME service should notice that one of the authoritative DNS servers provides a valid CAA record and continue processing the order.
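For reference, the two issue values above differ only in the issuer name and its parameters. A minimal sketch of splitting such a value apart, assuming the `issuer;name=value;...` layout seen in the record above (the function name `parse_issue_value` is made up for this example, and no validation of parameter names is attempted):

```python
def parse_issue_value(value):
    """Split a CAA issue value into (issuer_domain, parameters)."""
    parts = [p.strip() for p in value.split(";")]
    issuer = parts[0]  # empty string when the value is just ";"
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # skip the empty piece after a trailing ";"
        key, _, val = part.partition("=")
        params[key.strip()] = val.strip()
    return issuer, params
```

For the deny-all record, `parse_issue_value(";")` returns `("", {})`: an empty issuer name that no CA can match.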

Otherwise there are only two bad options available:

  1. All the DNS servers must be updated (which opens the door to failures for many reasons).
  2. DNS servers cannot provide the default CAA 0 issue ";" response (which opens the door to attackers).

A DNS zone should be consistent across all authoritative DNS servers. I can understand this might be difficult with superfancy cluster shizzle, but it's just a fact that it should.

If authoritative nameservers contradict each other, there's no "OR" possible or even allowed.
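A quick way to spot such contradictions is to compare what each authoritative server returns. The sketch below only does the comparison; the input dict is hypothetical and would have to be filled by querying each nameserver directly with whatever DNS tooling you use.

```python
def group_by_caa_rrset(answers):
    """Group nameservers by the CAA RRset they returned.

    answers maps a nameserver name to a list of CAA record strings
    (hypothetical input, collected by querying each server directly).
    """
    groups = {}
    for server, records in answers.items():
        # frozenset ignores record order, which carries no meaning in an RRset.
        groups.setdefault(frozenset(records), []).append(server)
    return groups


def is_consistent(answers):
    """True when every nameserver served the same CAA RRset."""
    return len(group_by_caa_rrset(answers)) <= 1
```

If `is_consistent` returns False, the groups show which servers are out of step before an order is ever submitted.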

RFC 6844 - DNS Certification Authority Authorization (CAA) Resource Record

CAA authorizations are additive; thus, the result of specifying both
the empty issuer and a specified issuer is the same as specifying
just the specified issuer alone.

You can't add what you can't see. -- If you only see CAA 0 ";" issuance will fail.

On the other hand, I don't think the RFC says what to do in case of mismatched nameservers. This might be an implementation detail. (Note that even if at first you succeed it will most probably fail on secondary validation)

I understand where you guys are coming from and would love to live in such a beautiful world. But let's be realistic. It's the 21st century, and we live in a global world where IT systems are spread across the planet and made of thousands of microservices, any of which might be unavailable at any time. The DNS spec was proposed about 40 years ago; nowadays IT specialists know that expecting two IT systems to be in perfect sync all the time is just wishful thinking.

Let's say that there are 2 authoritative DNS servers for a domain. DNS 1 responds with

example.org         CAA 0 issue "letsencrypt.org"

which effectively means "I allow LE to issue certificates for my domain".
And there is DNS 2 saying:

example.org         CAA 0 issue ";"

which means "I don't allow any CA to issue certificate for my domain".

The question is what the standard says. Must all the authoritative servers allow issuance, or is it enough that any of them does? Is it specified at all? Is it AND or OR? It is crucial for the protocol to define this. And as I said, it's time to let go of the idea that all the servers are in perfect sync. We know that's not true.

I don't agree. ACME may query all the authoritative servers (it already does, from what I see) and accept the request if any of the retrieved CAA records grants permission.

The CA/B Baseline Requirements might be an interesting read on this.

I'd start on section 3.2.2.9. (and 3.2.2.8, too)

I don't see how Multi-Perspective Issuance Corroboration affects what I wrote before.
The goal there is to confirm authentication from different network scopes, to eliminate the risk of the result being affected by an attacker. But still, it does not specify how ACME should behave when DNS servers are not fully synced. Of course, everyone might agree that they must be fully synced, but then we are living in an illusion.

It does say how many non-corroborations are allowed. I'm not sure what counts as a non-corroboration for CAA, though; I didn't read that section very carefully.

Note that each perspective may query several authoritative dns servers, that's a different matter and you might be right about that.

Yes, exactly. It seems there was a silent assumption from the start that DNS servers are in perfect sync and responses are consistent. So... if querying all authoritative servers and doing an OR within each Network Perspective is not forbidden by the spec, I would love to propose this to the LE team. Otherwise people must choose whether their systems are resistant to attacks or highly available. But we need both.

It's more that inconsistency between the endpoints is itself a failure condition: perspectives are not allowed to peek at each other's results, so each one has to vote independently.

CA/B BR 3.2.2.9

Results or information obtained from one Network Perspective MUST NOT be reused or cached when performing validation through subsequent Network Perspectives (e.g., different Network Perspectives cannot rely on a shared DNS cache to prevent an adversary with control of traffic from one Network Perspective from poisoning the DNS cache used by other Network Perspectives).

Yes, so each NP could independently query all the authoritative servers, do the OR and vote based on that. No?

It can look up every server, but I'm not sure you can interpret the result as an OR of them:

There are 3 basic possible options:

  • query all (or subset) and do the OR
  • query all (or subset) and do the AND
  • query random one and take it as a result

Can you name one recursive resolver implementation that does not do this? AFAIK this is exactly how DNS works: query one server, which gives you an answer. If you don't get an answer, try another. The whole reason you want multiple nameservers in the first place is outages. If you suddenly require answers from multiple nameservers to succeed, it defeats the entire point of redundancy.
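As a rough sketch of that behavior: the first server that answers decides the result, and another server is tried only when there is no answer at all. The `query` callable is a hypothetical stand-in for an actual DNS lookup, and real resolvers pick servers in more sophisticated ways than a fixed order.

```python
def resolve_with_failover(nameservers, query):
    """Return the first answer obtained from any nameserver.

    query(ns) is a hypothetical callable that returns the answer from ns
    or raises OSError (for example TimeoutError) when ns does not respond.
    """
    for ns in nameservers:
        try:
            # Any answer ends the search, whatever it contains.
            return query(ns)
        except OSError:
            # No answer from this server: fall back to the next one.
            continue
    raise OSError("no nameserver answered")
```

Note that a restrictive answer is still an answer, so a second server is never consulted just because the first reply was unwelcome.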

I have tested only Let's Encrypt so far.

I haven't said that. If DNS fails to give an answer, ignore it. So the full algorithm looks like this:

  1. Are there any authoritative DNS servers left to test?
  2. If not, return failure.
  3. Take the next authoritative DNS server from the stack.
  4. Query it.
  5. If the query fails, go to 1.
  6. Do the CAA test.
  7. If the test fails, go to 1.
  8. Return success.

This is what I mean by "OR" algorithm. It handles DNS failures and out-of-sync servers.
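Written out as code, the proposed steps look roughly like this. It is a sketch of the proposal only, not something Let's Encrypt or any resolver implements; `query` and `permits` are hypothetical callables standing in for the DNS lookup and the CAA test.

```python
def any_server_permits(nameservers, query, permits):
    """Proposed "OR" check: succeed if any reachable server's CAA RRset permits.

    query(ns) returns the CAA RRset served by ns or raises OSError on failure;
    permits(rrset) returns True when that RRset allows issuance.
    Both are hypothetical placeholders.
    """
    for ns in nameservers:      # steps 1 and 3: take the next server, if any
        try:
            rrset = query(ns)   # step 4: query it
        except OSError:
            continue            # step 5: query failed, back to step 1
        if permits(rrset):      # step 6: do the CAA test
            return True         # step 8: success
        # step 7: test failed, back to step 1
    return False                # step 2: no servers left, failure
```

The difference from ordinary failover is step 7: a valid but restrictive answer does not end the search.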

This is logic that doesn't exist within DNS. Let's Encrypt uses standard, off-the-shelf recursive DNS resolvers. The nameserver selection is done by that recursive resolver. What you propose is certainly doable, but it implies writing custom resolver logic that probably no one else on the planet uses. This is usually against LE's design goals, which aim to use standardized software, logic and protocols wherever possible. LE simply does what every other resolver does.

The algorithm you propose also violates RFC 8659: DNS Certification Authority Authorization (CAA) Resource Record, and as such cannot be adopted by Let's Encrypt until a new CAA RFC is written and approved by the CA/BF. You can start such a motion with the IETF if you see a strong use case for your idea, but I wouldn't get my hopes up.

Which part exactly?

I don't know... maybe I'm the only guy on the planet who thinks that the current algorithm causes issues... Imagine this scenario:

  1. One of the DNS servers fails, so it doesn't accept updates but still serves old content.
  2. The ACME client tries to renew the certificate. The process fails, so it retries.
  3. At some point the quota limits are reached.
  4. At this point fixing DNS won't help because of the quota.
  5. Your website is left without a valid certificate.

Obviously, I will find some workarounds to make the system stable and reliable. But it would be good to fix the spec and the implementations, because the current ones expect everything to work all the time.

Section 3 mainly:

[...] until a CAA RRset is found.

-> implies that the first RRSet must be used, not any subsequent ones that a later queried nameserver may return.

Let CAA(X) be the RRset returned by performing a CAA record query for the FQDN X, according to the lookup algorithm specified in Section 4.3.2 of [RFC1034] (in particular, chasing aliases).

It specifies that "a CAA record query" is to be performed, not multiple queries whose results may be merged in some way. It also explicitly references the DNS resolution mechanism from RFC 1034, which makes no mention of multiple DNS responses being combined or merged.

(The RFC also goes on to say that the resolver terminates once a negative match has been found, so you cannot query again if a nameserver replied with an unacceptable response.)
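For context, the Section 3 search walks from the requested name toward the root and stops at the first non-empty CAA RRset. A minimal sketch of that tree climb follows; `caa_lookup` is a hypothetical callable standing in for a single CAA query through the resolver, so exactly one RRset is considered per name.

```python
def relevant_caa_set(fqdn, caa_lookup):
    """Return the first non-empty CAA RRset found while climbing toward the root.

    caa_lookup(name) is a hypothetical callable returning the CAA RRset
    for name as a list (empty when there is none).
    """
    labels = [label for label in fqdn.rstrip(".").split(".") if label]
    while labels:
        name = ".".join(labels)
        rrset = caa_lookup(name)  # one query per name, one RRset back
        if rrset:
            return rrset          # the search stops at the first RRset found
        labels = labels[1:]       # move up to the parent domain
    return []                     # root reached without finding any CAA records
```

Nothing in this loop revisits a name or asks a second server once an RRset has been returned, which is the point made above.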