Around November 21st, we started experiencing a sustained, 5x increase in the frequency of ACME authorization errors. A sampling of the errors has this message in common: "DNS problem: networking error looking up CAA for mongodb.net". We don't believe we have a CAA record published as part of our Let's Encrypt setup. Has anything changed on the LE side with regard to this?
Are you still getting these now? Because there was a change today to address an issue with DNS resolution.
This was caused by DNS servers that did not provide an SOA record for empty or not-found responses. Let's Encrypt must look for a CAA record to confirm it is allowed to issue, and Unbound v1.18 got stricter about the kind of "not found" responses that were allowed.
If you use https://unboundtest.com and see that unbound 1.18 gave a SERVFAIL for a CAA lookup but 1.16 was okay then the upgrade to 1.19 should fix that.
Update: Sorry, I should have checked first. unboundtest does not show this kind of error for that domain name. You might try some others. I'm surprised to see any problem with Route53 DNS server but I am stepping away so someone else will have to continue.
We create and delete sub-domains all the time as we provision and remove resources on behalf of customers. This is all done automatically via Java code, so there's no Certbot involved.
We're asking about what client you are using (if not certbot that's fine, if it's custom Java code that's fine, but if there's some ACME library being used it'd be good to know what that is), and asking how that client is trying to authenticate that it owns the domain names you're getting certificates for (probably HTTP-01 by configuring a web server to reply, or DNS-01 configuring an _acme-challenge record in DNS).
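For anyone following along, here is a minimal sketch of what a DNS-01 client has to publish, based on RFC 8555 section 8.4. The function name `dns01_record` and the key authorization string `"token.thumbprint"` are placeholders of mine, not anything taken from the poster's Java code; a real key authorization is the challenge token joined to the account key thumbprint.

```python
import base64
import hashlib


def dns01_record(identifier: str, key_authorization: str) -> tuple[str, str]:
    """Return the (name, value) of the TXT record for a DNS-01 challenge."""
    # A wildcard identifier is validated at its base name, so drop the "*." label.
    base = identifier[2:] if identifier.startswith("*.") else identifier
    name = f"_acme-challenge.{base}"
    # The value is the base64url-encoded SHA-256 digest of the key
    # authorization, with the trailing "=" padding removed.
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return name, value


# "token.thumbprint" is a hypothetical placeholder, not a real key authorization.
name, value = dns01_record("*.xnw6x.mongodb.net", "token.thumbprint")
print(name)  # _acme-challenge.xnw6x.mongodb.net
```

Note that this TXT lookup is separate from the CAA lookup that is failing in this thread; both have to succeed before a certificate is issued.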
Are you still experiencing issues after Let's Encrypt upgraded to Unbound 1.19 on Dec. 20? If so, can you please give an exact error message you're getting from the Let's Encrypt servers?
Also, looking at the public CT logs I saw a large volume of certs for names including xnw6x.mongodb.net. The certs are for wildcard names so we now know you are using the DNS Challenge (because no other challenge allows wildcards).
I don't have an explanation for any CAA lookup problems. But have you tried adding a CAA record at the xnw6x level? Or even just at the mongodb.net level? This will eliminate some of the DNS queries that Let's Encrypt makes and might work around this problem.
Further, do all of your certs have that same xnw6x level or is that just one of many?
All of the (many) certs I saw had this same pattern of names. The only difference was the long alphanumeric qualifier in the first name:
X509v3 Subject Alternative Name:
DNS:*.4e653804e00e382c4d616057.xnw6x.mongodb.net
DNS:*.xnw6x.mesh.mongodb.net
DNS:*.xnw6x.mongodb.net
This example cert on crt.sh at (this link)
And another one just for fun (link here)
I provide these details mostly for other volunteers. There are so many certs it is difficult to search the CT logs without exceeding their rate limits.
Try adding a CAA record. The compatibility problems with CAA are in the tree-climbing with absent records. If we find a CAA record, we can stop looking upward.
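To make the tree-climbing concrete, here is a rough sketch of the lookup order as I understand it from RFC 8659. It is not Let's Encrypt's actual code. The `zone_data` dictionary is a made-up in-memory stand-in for real DNS answers, and dropping the wildcard label before the climb is an assumption on my part.

```python
def caa_lookup_names(fqdn: str) -> list[str]:
    """Names checked for CAA, from the requested name up to (not including) the root."""
    name = fqdn.rstrip(".")
    if name.startswith("*."):
        name = name[2:]  # assumption: the wildcard label itself is not queried
    labels = name.split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def relevant_caa(fqdn: str, zone_data: dict[str, list[str]]) -> tuple[list[str], list[str]]:
    """Return (names queried, CAA records found), stopping at the first non-empty set."""
    queried = []
    for name in caa_lookup_names(fqdn):
        queried.append(name)
        records = zone_data.get(name, [])  # hypothetical stand-in for a DNS query
        if records:
            return queried, records
    return queried, []


# No CAA records anywhere: the climb goes all the way to the TLD.
print(relevant_caa("*.xnw6x.mongodb.net", {}))
# (['xnw6x.mongodb.net', 'mongodb.net', 'net'], [])

# A record at mongodb.net ends the climb one step earlier.
print(relevant_caa("*.xnw6x.mongodb.net", {"mongodb.net": ['0 issue "letsencrypt.org"']}))
# (['xnw6x.mongodb.net', 'mongodb.net'], ['0 issue "letsencrypt.org"'])
```

In this picture a record at the xnw6x level would cut the climb to a single query, while a record at mongodb.net only saves the query against the `net` TLD servers.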
"upward" is too subjective.
Where are the roots of this tree? [up or down]
Unless my "mental picture" of this "DNS tree" is upside down... I imagine the roots would be found beneath the tree.
My understanding: If there are multiple CAA records the one closest to the FQDN takes precedence.
So...
If we are "climbing upwards while going towards the roots" . . . [this confuses me]
"Climbing" and "upwards" seem to go hand-in-hand; Like "sitting" and "down" do.
But in this [DNS CAA] case, the "climbing" should be towards its roots [down the tree].
Does adding a CAA record at the mongodb.net level help at all? Or would it have to be at the xnw6x level or even the FQDN? I vaguely recall a discussion saying that adding a CAA record at the registered-domain level helps eliminate some "not found" queries. But I am not certain.
Which is just a form of graph. In a "rooted connected acyclic undirected graph" (which is the usual type of tree we mean), there's a single well-defined root (1 in the picture above) and the directions "upward" and "downward" have formal definitions: upwards is towards the parent node (or ancestor), which is closer to the root, and downwards is towards a child node (or descendant), which is farther from the root. Visually these trees are usually depicted with the root sitting at the top, just like in the image above.
That's an interesting one. My understanding (and others please correct me if I'm wrong) is that each DNS query has a timeout, and there's a separate timeout for the overall process of all DNS queries. Usually, when it's talking about an issue looking up CAA for something like the TLD, it's that latter timeout being hit, where the whole process just took too long, and the error message is just what stage of the process it happened to be in when it ran out of time.
If you're just getting this some of the time (as you're talking about "normal levels"), then I think the main thing you could do is just double-check the performance of your DNS server and make sure that it isn't taking a really long time to reply that there aren't any CAA records at each of the levels. But if you're on Route 53 that probably isn't really the problem. Hmm… This may be a tricky one.
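Here is a toy model of the two-timeout picture described above, in case it helps. It is only a sketch of that mental model, not how Let's Encrypt's resolver is implemented: the 2-second and 5-second limits, the sequential loop, and the `FakeResolver` class are all invented for illustration.

```python
class LookupTimeout(Exception):
    pass


def check_caa_with_deadline(names, query, now, per_query_timeout=2.0, overall_timeout=5.0):
    """Query each name in turn; fail if one query or the whole walk takes too long.

    `query(name, timeout)` returns a list of records or raises TimeoutError.
    `now()` returns the current time in seconds. Both limits are made-up values.
    """
    deadline = now() + overall_timeout
    for name in names:
        remaining = deadline - now()
        if remaining <= 0:
            raise LookupTimeout(f"networking error looking up CAA for {name}")
        try:
            # Each query gets the per-query limit or whatever is left overall.
            records = query(name, min(per_query_timeout, remaining))
        except TimeoutError:
            raise LookupTimeout(f"networking error looking up CAA for {name}") from None
        if records:
            return records
    return []


class FakeResolver:
    """Hypothetical resolver with a simulated clock, so no real DNS is needed."""

    def __init__(self, durations):
        self.clock = 0.0
        self.durations = durations

    def now(self):
        return self.clock

    def query(self, name, timeout):
        took = self.durations[name]
        if took > timeout:
            self.clock += timeout
            raise TimeoutError(name)
        self.clock += took
        return []  # no CAA records at this name


names = ["xnw6x.mongodb.net", "mongodb.net", "net"]
slow = FakeResolver({name: 1.9 for name in names})
try:
    check_caa_with_deadline(names, slow.query, slow.now)
except LookupTimeout as err:
    print(err)  # networking error looking up CAA for net
```

Every individual answer here arrives inside the per-query limit, yet the walk as a whole runs out of time, and the error names whichever level happened to be in progress.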
That's my understanding too. For that specific error, adding a CAA record at just the mongodb.net level might help. It would at least remove the need to walk the tree to the TLD during the CAA check. And it is easy to try.
An intermittent DNS query problem tricky ... surely you jest