Networking error when Let's Encrypt is looking up CAA

Around November 21st, we started experiencing a sustained 5x increase in the frequency of ACME authorization errors. A sampling of the errors has this message in common: "DNS problem: networking error looking up CAA for mongodb.net". We don't believe we have a CAA record published as part of our Let's Encrypt setup. Has anything changed on the LE side with regard to this?

My domain is: mongodb.net

I ran this command:

It produced this output:

My web server is (include version):

The operating system my web server runs on is (include version):

My hosting provider, if applicable, is: Amazon Route 53

I can login to a root shell on my machine (yes or no, or I don't know): yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):

Are you still getting these now? Because there was a change today to address an issue with DNS resolution.

This was caused by DNS servers that did not provide an SOA record in empty or "not found" responses. Let's Encrypt must look for a CAA record to ensure it is allowed to issue, and Unbound v1.18 got stricter about the kinds of "not found" responses it accepts.

If you use https://unboundtest.com and see that Unbound 1.18 gives a SERVFAIL for a CAA lookup but 1.16 is okay, then the upgrade to 1.19 should fix that.

Update: Sorry, I should have checked first. unboundtest does not show this kind of error for that domain name. You might try some others. I'm surprised to see any problem with Route 53 DNS servers, but I am stepping away, so someone else will have to continue.

4 Likes

This did seem to reduce in volume as of ~9pm UTC Dec. 20th. I'll check in again tomorrow and see where we're at.

1 Like

Assuming this affected particular subdomains and not the primary domain, would you be able to provide an example subdomain that failed?

Can you confirm it's Route53 all the way for your DNS or do you drop down to anything custom/self-managed for subdomains?

2 Likes

It looks like the reduction in error volume was transient. We're now back to the 5x increase from historical norms.

We are Route53 the entire way, yes.

Example failed subdomain: dcf71b8d7c64e8bdc2717b22.xnw6x.mongodb.net

What kind of authentication are you using for that domain? DNS, HTTP, TLS-ALPN?

Can you show the actual Certbot command?

I ask because I see NXDOMAIN responses at the xnw6x.mongodb.net level as well as for the full dcf... domain name.

Try https://unboundtest.com or https://dnsviz.net, for example.

I see that with various tools, so it is not just Let's Encrypt.

3 Likes

We create and delete sub-domains all the time as we provision and remove resources on behalf of customers. This is all done automatically via Java code, so there's no Certbot involved.

One of the other related sub-domains is still in use. You can see there are txt records published at cmp91803e1006434r3be591.xnw6x.mongodb.net.

What kind of authentication are you asking about?

We're asking about what client you are using (if not certbot that's fine, if it's custom Java code that's fine, but if there's some ACME library being used it'd be good to know what that is), and asking how that client is trying to authenticate that it owns the domain names you're getting certificates for (probably HTTP-01 by configuring a web server to reply, or DNS-01 configuring an _acme-challenge record in DNS).
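For readers unfamiliar with DNS-01, here is a minimal Python sketch of what the client has to publish, based on my reading of RFC 8555. The function names are made up for illustration, and the token and thumbprint values used with them would be placeholders rather than anything from this thread.

```python
import base64
import hashlib


def dns01_record_name(identifier: str) -> str:
    """Return the TXT record name checked for a DNS-01 challenge.

    For a wildcard identifier the leading "*." is dropped first, so
    "*.example.com" and "example.com" share the same record name.
    """
    base = identifier[2:] if identifier.startswith("*.") else identifier
    return "_acme-challenge." + base.rstrip(".")


def dns01_txt_value(token: str, account_key_thumbprint: str) -> str:
    """Return the TXT value: unpadded base64url of SHA-256(key authorization)."""
    # The key authorization is the challenge token and the account key
    # thumbprint joined by a single dot.
    key_authorization = f"{token}.{account_key_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The CA then queries that TXT name and compares the value, which is why wildcard certificates can only be validated this way.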

Are you still experiencing issues after Let's Encrypt upgraded to Unbound 1.19 on Dec. 20? If so, can you please give an exact error message you're getting from the Let's Encrypt servers?

4 Likes

Yeah, what Peter said.

Also, looking at the public CT logs I saw a large volume of certs for names including xnw6x.mongodb.net. The certs are for wildcard names so we now know you are using the DNS Challenge (because no other challenge allows wildcards).

I don't have an explanation for any CAA lookup problems. But have you tried adding a CAA record at the xnw6x level? Or even just at the mongodb.net level? This will eliminate some of the DNS queries that Let's Encrypt makes and might work around this problem.

Further, do all of your certs have that same xnw6x level or is that just one of many?

All of the (many) certs I saw had this same pattern of names. The only difference was the long alphanumeric qualifier in the first name.

X509v3 Subject Alternative Name: 
DNS:*.4e653804e00e382c4d616057.xnw6x.mongodb.net
DNS:*.xnw6x.mesh.mongodb.net
DNS:*.xnw6x.mongodb.net

This example cert on crt.sh at (this link)
And another one just for fun (link here)

I provide these details mostly for other volunteers. There are so many certs it is difficult to search the CT logs without exceeding their rate limits.

2 Likes

One suggestion as a workaround:

Try adding a CAA record. The compatibility problems with CAA are in the tree-climbing with absent records. If we find a CAA record, we can stop looking upward.
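To make the tree-climbing concrete, here is a rough Python sketch of the search order as I understand it from RFC 8659. It ignores CNAME handling and does no real DNS lookups; `published` is a hypothetical stand-in for whatever CAA records exist, and the function names are mine, not Boulder's.

```python
from typing import Dict, List, Optional, Tuple


def caa_search_path(fqdn: str) -> List[str]:
    """List the names checked for CAA, from the FQDN up to the TLD.

    A leading wildcard label is dropped, and the DNS root itself is
    not queried.
    """
    name = fqdn.rstrip(".").lower()
    if name.startswith("*."):
        name = name[2:]
    labels = name.split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def climb_for_caa(
    fqdn: str, published: Dict[str, List[str]]
) -> Tuple[List[str], Optional[List[str]]]:
    """Return (names queried, first CAA record set found or None)."""
    queried: List[str] = []
    for name in caa_search_path(fqdn):
        queried.append(name)
        records = published.get(name)
        if records:
            # The closest record set wins and the climb stops here.
            return queried, records
    return queried, None
```

With nothing published, a name like `2kyltgd.mesh.mongodb.net` is checked at four levels, ending at `net`. With a record at `mongodb.net`, the climb stops there and the TLD is never queried.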

3 Likes

"upward" is too subjective.
Where are the roots of this tree? [up or down]
Unless my "mental picture" of this "DNS tree" is upside down... :upside_down_face: I imagine the roots would be found beneath the tree.

My understanding: If there are multiple CAA records the one closest to the FQDN takes precedence.

So...
If we are "climbing upwards while going towards the roots" . . . [this confuses me]
"Climbing" and "upwards" seem to go hand-in-hand; Like "sitting" and "down" do.
But in this [DNS CAA] case, the "climbing" should be towards its roots [down the tree].

1 Like

Does adding a CAA record at the mongodb.net level help at all? Or would it have to be at the xnw6x level or even the FQDN? I vaguely recall a discussion saying that adding CAA at the registered domain level helps eliminate some "not found" queries, but I am not certain.

4e653804e00e382c4d616057.xnw6x.mongodb.net
2 Likes

In computer science, when we talk about a tree we don't mean this:

Where the roots are indeed on its bottom. Instead, we mean this:

Which is just a form of graph. In a "rooted connected acyclic undirected graph" (which is the usual type of tree we mean), there's a single well-defined root (1 in the picture above) and the directions "upward" and "downward" have formal definitions: upwards is towards the parent node (or, more generally, an ancestor), which is closer to the root, and downwards is towards a child node (or descendant), which is farther from the root. Visually these trees are usually depicted with the root sitting at the top, just like in the image above.
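If it helps, the same idea in a few lines of Python. The node numbers and the `PARENT` map are an invented example, not a transcription of the picture: each node stores only its parent, and "upward" simply means following those parent links until the root.

```python
from typing import Dict, List, Optional

# Hypothetical tree: node 1 is the root, so it has no parent.
PARENT: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}


def path_to_root(node: int, parent: Dict[int, Optional[int]]) -> List[int]:
    """Walk upward from a node, returning every node visited up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def depth(node: int, parent: Dict[int, Optional[int]]) -> int:
    """Number of upward steps needed to reach the root (the root has depth 0)."""
    return len(path_to_root(node, parent)) - 1
```

Going "up" from node 5 visits 5, then 2, then 1. In DNS terms, each step up removes the leftmost label.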

5 Likes

So.. all this time I've been upside down!
That explains sooooo much - LOL

3 Likes

It looks like the frequency of error started receding back to historical norms on the 22nd of December.

We use the org.shredzone.acme4j client to issue challenges. The authentication type is indeed DNS-01.

Since the errors have returned to normal levels, I'm not going to proceed with creating a CAA record.

The exact payload response is the following:

{
  "identifier": {
    "type": "dns",
    "value": "2kyltgd.mesh.mongodb.net"
  },
  "status": "invalid",
  "expires": "2024-01-03T16:17:01Z",
  "challenges": [
    {
      "type": "dns-01",
      "status": "invalid",
      "error": {
        "type": "urn:ietf:params:acme:error:dns",
        "detail": "DNS problem: networking error looking up CAA for net",
        "status": 400
      },
      "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/XXXXX",
      "token": "xxxxx",
      "validationRecord": [
        {
          "hostname": "2kyltgd.mesh.mongodb.net"
        }
        }
      ],
      "validated": "2023-12-27T16:17:40Z"
    }
  ],
  "wildcard": true
}
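When sampling many of these payloads, something like the following Python sketch could pull out just the failing challenge details. It assumes the authorization object has the shape shown above; `failed_challenges` and `SAMPLE` are illustrative names, and the sample is a trimmed copy of that payload.

```python
import json
from typing import Any, Dict, List


def failed_challenges(authz: Dict[str, Any]) -> List[Dict[str, str]]:
    """Collect the error type and detail of each invalid challenge."""
    failures = []
    for challenge in authz.get("challenges", []):
        error = challenge.get("error")
        if challenge.get("status") == "invalid" and error:
            failures.append({
                "identifier": authz["identifier"]["value"],
                "challenge": challenge["type"],
                "error_type": error.get("type", ""),
                "detail": error.get("detail", ""),
            })
    return failures


# Trimmed copy of the payload above, kept only to show the input shape.
SAMPLE = json.loads("""
{
  "identifier": {"type": "dns", "value": "2kyltgd.mesh.mongodb.net"},
  "status": "invalid",
  "challenges": [
    {
      "type": "dns-01",
      "status": "invalid",
      "error": {
        "type": "urn:ietf:params:acme:error:dns",
        "detail": "DNS problem: networking error looking up CAA for net",
        "status": 400
      }
    }
  ],
  "wildcard": true
}
""")

for failure in failed_challenges(SAMPLE):
    print(failure["identifier"], failure["error_type"], failure["detail"])
```

Grouping the `detail` strings by the name after "looking up CAA for" would show which level of the tree the lookups tend to fail at.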
1 Like

That's an interesting one. My understanding (and others please correct me if I'm wrong) is that each DNS query has a timeout, and there's a separate timeout for the overall process of all DNS queries. Usually, when it's talking about an issue looking up CAA for something like the TLD, it's that latter timeout being hit, where the whole process just took too long, and the error message is just what stage of the process it happened to be in when it ran out of time.
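Here is a toy Python model of that idea, with no real DNS involved. All of the timing numbers in the usage note are invented for illustration, I don't know Let's Encrypt's actual timeout values, and `query_seconds` is a hypothetical stand-in for how long each lookup takes.

```python
from typing import Callable, List, Optional


def name_where_time_runs_out(
    names: List[str],
    query_seconds: Callable[[str], float],
    per_query_timeout: float,
    overall_budget: float,
) -> Optional[str]:
    """Return the name being looked up when a timeout hits, or None.

    Each lookup must fit within its own timeout and within whatever
    is left of the overall budget shared by all lookups.
    """
    spent = 0.0
    for name in names:
        remaining = overall_budget - spent
        cost = query_seconds(name)
        if cost > per_query_timeout or cost > remaining:
            # The error would be reported against this name, even though
            # the earlier lookups are what used up most of the time.
            return name
        spent += cost
    return None
```

For example, if four lookups each take 3 seconds with a 5 second per-query timeout and a 10 second overall budget, no single query is too slow, yet the budget runs out during the last one, so the error names `net`.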

If you're just getting this some of the time (as you're talking about "normal levels"), then I think the main thing you could do is just double-check the performance of your DNS server and make sure that it isn't taking a really long time to reply that there aren't any CAA records at each of the levels. But if you're on Route 53 that probably isn't really the problem. Hmm… This may be a tricky one.

4 Likes

That's my understanding too. For that specific error adding a CAA record just to mongodb.net level might help. It would at least stop the need to walk the tree to the TLD during the CAA check. And is easy to try.

An intermittent DNS query problem being tricky ... surely you jest :slight_smile:

2 Likes
