Problems with CAA records only with Google and Let's Encrypt

My domain is: onoma.yocto.com

I ran this command: docker exec -it certbot certbot renew --dns-yocdns-propagation-seconds 120

It produced this output:

Waiting 120 seconds for DNS changes to propagate

Certbot failed to authenticate some domains (authenticator: dns-yocdns). The Certificate Authority reported these problems:
  Domain: onoma.yocto.com
  Type:   dns
  Detail: During secondary validation: While processing CAA for onoma.yocto.com: DNS problem: looking up CAA for onoma.yocto.com: DNSSEC: NSEC Missing: validation failure <onoma.yocto.com. CAA IN>: no DNSSEC records from 2a01:7c8:fff7:86::1 for DS onoma.yocto.com. while building chain of trust

  Domain: onoma.yocto.com
  Type:   dns
  Detail: During secondary validation: While processing CAA for *.onoma.yocto.com: DNS problem: looking up CAA for onoma.yocto.com: DNSSEC: NSEC Missing: validation failure <onoma.yocto.com. CAA IN>: no DNSSEC records from 136.144.225.232 for DS onoma.yocto.com. while building chain of trust

Hint: The Certificate Authority failed to verify the DNS TXT records created by --dns-yocdns. Ensure the above domains are hosted by this DNS provider, or try increasing --dns-yocdns-propagation-seconds (currently 120 seconds).

My web server is (include version): NGINX 1.27.1

The operating system my web server runs on is (include version): Alpine inside Docker on Debian

My hosting provider, if applicable, is: VPS

I can login to a root shell on my machine (yes or no, or I don't know): yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): certbot 4.0.0


Hello all,

For some years now I've been running services for myself and my customers. This includes websites, mail, domains and, of course, DNS. I even dived so deep into DNS by reading the RFCs that I eventually developed my own DNS server, with portal included. I have this DNS server running on 4 VPSes at this time.

The DNS server works great, but DNSSEC is only used on my own domain yocto.com for now. Certbot is able to get the certificate for yocto.com, but something strange happens when requesting a certificate for onoma.yocto.com, as seen above.

I used multiple tools to find out what was wrong and tried dig too many times, but I don't seem to be able to reproduce the timeout with dig. Only Google gives me an EDE 12 (NSEC Missing) error: Query: onoma.yocto.com - Google Public DNS. However, the NSEC record for CAA is present, so I don't understand the error.

I tried the following tools to detect the issue:

As someone who is very into DNS, it is very frustrating that I cannot reproduce it using dig. Also, what is causing that timeout? Some months ago I did indeed also get timeout errors in Certbot.

I also checked the domain api.sidn.nl that also doesn't have a CAA record: Query: api.sidn.nl - Google Public DNS, but it gives no problems.

Can somebody help me with this issue? This is one of the reasons I'm not comfortable with rolling out DNSSEC for all my customers.

Ben

1 Like

Well, when I tried DNSViz, it says that 2a01:7c8:d002:313::1 isn't responding via UDP.

It sounds like you're much more of a DNS expert than I am, so I'm not sure how much help I can be, but I'll ask a couple more questions anyway. Do the error messages you're getting always say "During secondary validation"? That implies that different parts of the world are getting different responses. Do you have logs from your DNS servers with the responses they're giving from the many validation requests that Let's Encrypt sends?

3 Likes

Hello @petercooperjr,

Thank you for your response.

It can happen that one DNS server isn't responding that well, but that has never given me issues in the past. The fact that a minimum of 2 nameservers is required on a domain is for redundancy, so that another DNS server can take over. I don't see why that shouldn't be the case here.

Talking about "During secondary validation": I'm a DNS expert, not a Let's Encrypt expert. I have no idea what it actually means, other than that there is some second validation check. I have 4 VPS servers running. I don't use GeoDNS. Actually, I'm even against GeoDNS, because I think Anycast is the way to go for high availability on a single IP. So, every server should in fact serve the same content (or have some issue causing it to be offline).

So, I'm not sure how to apply this information to help you, but I can tell you more about what it actually means. :slight_smile:

Let's Encrypt checks DNS from many places around the world, in order to help make sure that the requestor is actually someone who controls the domain name and not someone who can just redirect packets in one particular corner of the Internet. (See this FAQ on multi-perspective validation for more detail than you probably need.) The "During secondary validation" message means that the primary validation (from Let's Encrypt's main datacenter) succeeded, but at least two (I think, maybe just at least one) of the secondary validation sites (from various "cloud" region datacenters) got an error validating.

You may be seeing issues with CAA more often than others because the CA needs to check CAA at the time of issuing the certificate (unless it's been checked within the past 8 hours for that name already), even if it had been checked and approved earlier. It also is more of a "stress test", since it has to check each segment from the full domain name down to the root, and while having no CAA record is fine it needs to get a correct "no records, no error" response for each of the names.
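To make the "each segment" part concrete, here is a minimal Python sketch of which names end up being queried for CAA. The function name is made up for illustration and this is not Let's Encrypt's actual code; it only assumes the tree-climbing behaviour described above, plus the wildcard handling visible in the error output (the check for *.onoma.yocto.com looks up CAA at onoma.yocto.com).

```python
def caa_lookup_names(fqdn: str) -> list[str]:
    """Return the names whose CAA RRset may be consulted for `fqdn`,
    ordered from the full name up to the top-level domain."""
    labels = fqdn.rstrip(".").split(".")
    # A leading wildcard label is dropped: CAA for *.example.com
    # is looked up starting at example.com.
    if labels and labels[0] == "*":
        labels = labels[1:]
    # Strip one label from the left per step.
    return [".".join(labels[i:]) for i in range(len(labels))]


if __name__ == "__main__":
    for name in caa_lookup_names("onoma.yocto.com"):
        print(name)
```

Every one of those names has to produce a clean answer (including a valid DNSSEC denial where the zone is signed), which is why a problem in NODATA responses shows up here even when ordinary lookups look fine.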

3 Likes

Interesting. Does this mean the primary validation succeeded every time, but only once in the last 8 hours? :thinking: Then the question still is why it is behaving differently between those tests.

Okay, when checking Unbound Test for both onoma.yocto.com and api.sidn.nl, I see a difference in the last response:

May 05 16:36:58 unbound[22699:0] info: scrub for yocto.com. NS IN
May 05 16:36:58 unbound[22699:0] info: response for auth.yocto.com. CAA IN
May 05 16:36:58 unbound[22699:0] info: reply from <yocto.com.> 136.144.225.232#53
May 05 16:36:58 unbound[22699:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr aa ; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 0 
;; QUESTION SECTION:
auth.yocto.com.	IN	CAA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
yocto.com.	0	IN	SOA	ns1.yoctodns.com. hostmaster.yoctodns.com. 2022072800 14400 3600 1209600 3600
yocto.com.	0	IN	RRSIG	SOA 13 2 3600 20251030120010 20250503120010 55647 yocto.com. 5AO1RncbqsskpQSUsiwjJWoXHg0kaqDtXG9XkY3p6/EMb6d3VQeeo+2mReuPu+v0FJ07eQxHDceSwPirDYCfEw== ;{id = 55647}

;; ADDITIONAL SECTION:
;; MSG SIZE  rcvd: 197
May 05 16:46:16 unbound[22701:0] info: scrub for sidn.nl. NS IN
May 05 16:46:16 unbound[22701:0] info: response for api.sidn.nl. CAA IN
May 05 16:46:16 unbound[22701:0] info: reply from <sidn.nl.> 194.0.28.10#53
May 05 16:46:16 unbound[22701:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr aa ; QUERY: 1, ANSWER: 0, AUTHORITY: 4, ADDITIONAL: 0 
;; QUESTION SECTION:
api.sidn.nl.	IN	CAA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
sidn.nl.	0	IN	SOA	ns4.sidn.nl. hostmaster.sidn.nl. 1746109496 14400 3600 3456000 300
sidn.nl.	0	IN	RRSIG	SOA 13 2 3600 20250515000000 20250424000000 30794 sidn.nl. A96DpwXFDRKOlSJT3LABr/AcHKVMSFyhjxlnz+kxJVr+wWRkX+F0gXsqRvwipH44iCAu08erRawgc/c8Hso6WQ== ;{id = 30794}
api.sidn.nl.	0	IN	NSEC	rdap.api.sidn.nl. A AAAA RRSIG NSEC
api.sidn.nl.	0	IN	RRSIG	NSEC 13 3 300 20250515000000 20250424000000 30794 sidn.nl. ci1kfvELUltbPtPuaOukUYq9S1dIRx/vuTrrzQpP2jvaI3VrlDQpAQ6ZMDzcI9AKzyCO6Ia43ml96klgFCHZWA== ;{id = 30794}

;; ADDITIONAL SECTION:
;; MSG SIZE  rcvd: 324

Somehow, the NSEC records aren't there for my domain, but are there for the SIDN API. I don't exactly know why, but maybe Unbound is filtering some RRSIG out of the logs, because I cannot reproduce this with dig.

Also, there is another difference:
May 05 16:36:58 unbound[22699:0] info: no signer, using auth.yocto.com. TYPE0 CLASS0
May 05 16:46:16 unbound[22701:0] info: signer is sidn.nl. TYPE0 CLASS0

I will dive deeper into this.

1 Like

Full disclosure, I am even less skilled at DNS than Peter :slight_smile:

But I thought I'd pass along the results of the tool below in hopes it makes sense to you (I learned of this tool from Peter, in fact). In the past I have seen problems from this testing tool for domains that worked fine. So, I am not sure how each result interacts with Let's Encrypt.

I have never seen the TooBig failure shown with yocto before though. Seems interesting.

Results from sidn.nl were all correct: EDNS Compliance Tester

Yocto had problems: EDNS Compliance Tester

2 Likes

Oh, I'll agree that looks interesting. Let's Encrypt queries with the equivalent of dig's +bufsize=1232 (which was a change last year from 512), and DNSSEC-including responses are more likely to be big enough to trigger that limit and require a switch from UDP to TCP.
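As a rough illustration of that size limit, here is a small Python sketch of the decision an authoritative server has to make before answering over UDP. The function names are hypothetical, and the only assumptions are the classic 512-byte limit for queries without EDNS and the rule that an advertised EDNS buffer size below 512 is treated as 512.

```python
from typing import Optional

CLASSIC_UDP_LIMIT = 512  # limit that applies when the query carries no EDNS OPT record


def udp_payload_limit(edns_bufsize: Optional[int]) -> int:
    """Largest UDP response the client said it can accept."""
    if edns_bufsize is None:
        return CLASSIC_UDP_LIMIT
    # Advertised sizes below 512 are treated as 512.
    return max(edns_bufsize, CLASSIC_UDP_LIMIT)


def must_truncate(response_size: int, edns_bufsize: Optional[int]) -> bool:
    """True if the server should set the TC bit so the client retries over TCP."""
    return response_size > udp_payload_limit(edns_bufsize)


if __name__ == "__main__":
    # 535 bytes is the size of the A response quoted elsewhere in this thread.
    print(must_truncate(535, 512))   # client advertised 512
    print(must_truncate(535, 1232))  # client advertised 1232
```

As far as I understand the tester's label, a "toobig,notc" result means the server sent more than the advertised size without setting TC, which is the case the second function is meant to catch.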

3 Likes

I have never seen the TooBig failure shown with yocto before though. Seems interesting.

I think you confuse my Yocto with a Yocto from somebody else; however I will check if I can fix those things. I didn't spend that much time on EDNS when implementing my DNS server.

I looked at the apex zone, yes. But, note the CAA checks query each level of your domain name. The CAA record at or "closest to" your subdomain prevails. If none found the query proceeds to the apex name.

Actually, I am not certain LE starts at the "bottom" and works toward the apex. Or, whether it chases down from the apex retaining the last CAA it finds. In this example it doesn't matter since no CAA records exist at any level so all are queried.

Update: I just remembered that an LE person has said if you add a CAA record at your subdomain you can reduce the total number of CAA queries made. Which indicates the check starts at the deepest level and works towards the apex. See Aaron's comment below

This may be a debug test to see if adding a CAA record at your subdomain avoids the problem.

3 Likes

When was that? I think they may have changed how they've done this over time; the most recent post I can find (from last year) says that they do all the CAA names in parallel, though only after the ACME challenge passes.

3 Likes

It was some time ago. I saw it mentioned on several threads when we used to see more frequent problems with DNS servers that rate limited inbound queries.

Here is one example that explicitly says LE starts at the deepest level and works up. Although it pre-dates your post from Aaron so maybe was changed: "networking error looking up CAA for de" - #2 by mcpherrinm Update: Also see Aaron's comment below which seems to contradict this

3 Likes

I fixed the edns@512=toobig,notc error. If I check all servers, there is only one server that gives errors because it is unreachable.

However, the Unbound Test still gives me no NSEC records in the response:

May 05 20:53:54 unbound[22821:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr aa ; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 0 
;; QUESTION SECTION:
onoma.yocto.com.	IN	CAA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
yocto.com.	0	IN	SOA	ns1.yoctodns.com. hostmaster.yoctodns.com. 2022072800 14400 3600 1209600 3600
yocto.com.	0	IN	RRSIG	SOA 13 2 3600 20251030120010 20250503120010 55647 yocto.com. 5AO1RncbqsskpQSUsiwjJWoXHg0kaqDtXG9XkY3p6/EMb6d3VQeeo+2mReuPu+v0FJ07eQxHDceSwPirDYCfEw== ;{id = 55647}

;; ADDITIONAL SECTION:
;; MSG SIZE  rcvd: 198

May 05 20:53:54 unbound[22821:0] debug: iter_handle processing q with state QUERY RESPONSE STATE
May 05 20:53:54 unbound[22821:0] info: query response was nodata ANSWER
May 05 20:53:54 unbound[22821:0] debug: TTL 0: dropped msg from cache
May 05 20:53:54 unbound[22821:0] debug: iter_handle processing q with state FINISHED RESPONSE STATE
May 05 20:53:54 unbound[22821:0] info: finishing processing for onoma.yocto.com. CAA IN
May 05 20:53:54 unbound[22821:0] debug: mesh_run: iterator module exit state is module_finished
May 05 20:53:54 unbound[22821:0] debug: validator[module 0] operate: extstate:module_wait_module event:module_event_moddone
May 05 20:53:54 unbound[22821:0] info: validator operate: query onoma.yocto.com. CAA IN
May 05 20:53:54 unbound[22821:0] debug: validator: nextmodule returned
May 05 20:53:54 unbound[22821:0] debug: val handle processing q with state VAL_INIT_STATE
May 05 20:53:54 unbound[22821:0] debug: validator classification nodata
May 05 20:53:54 unbound[22821:0] info: no signer, using onoma.yocto.com. TYPE0 CLASS0
May 05 20:53:54 unbound[22821:0] debug: val handle processing q with state VAL_FINISHED_STATE
May 05 20:53:54 unbound[22821:0] debug: TTL 0: dropped msg from cache
May 05 20:53:54 unbound[22821:0] debug: mesh_run: validator module exit state is module_finished
1 Like

(We do all CAA lookups in parallel, and have done so for 8+ years.)

5 Likes

Thanks for the clarity. I guess I misunderstood what mcpherrinm said.

Do you have any insight into this thread's problem? Isn't a Secondary Validation error of this nature unusual?

4 Likes

I don't have much insight, sorry.

I basically only have one guess, which is that the two authoritative nameservers (mentioned in Comment #3) don't actually agree with each other. One is serving the NSEC and RRSIG records, and the other isn't. That's the best explanation I have for why this failure is nondeterministic, with some perspectives succeeding and some perspectives failing.

5 Likes

Another DNS implementation quirk to be aware of is that many resolvers expect the case of queries to be echoed in the responses. (See the 0x20 draft, and RFC 4343 Section 4.) Unbound tries to work around servers that don't, but that may lead to even more traffic and be related to the inconsistent results you're seeing.

For instance, querying your domain with mixed-case

$ dig +norecurse onOma.yoCto.cOm @136.144.225.232

; <<>> DiG 9.18.33 <<>> +norecurse onOma.yoCto.cOm @136.144.225.232
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40276
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 9

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;onOma.yoCto.cOm.               IN      A

;; ANSWER SECTION:
onoma.yocto.com.        3600    IN      A       136.144.213.19

;; AUTHORITY SECTION:
yocto.com.              3600    IN      NS      ns1.yoctodns.com.
yocto.com.              3600    IN      NS      ns2.yoctodns.com.
yocto.com.              3600    IN      NS      ns4.yoctodns.com.
yocto.com.              3600    IN      NS      ns3.yoctodns.com.

;; ADDITIONAL SECTION:
ns1.yoctodns.com.       3600    IN      A       136.144.213.19
ns1.yoctodns.com.       3600    IN      AAAA    2a01:7c8:d002:313::1
ns2.yoctodns.com.       3600    IN      A       136.144.225.232
ns2.yoctodns.com.       3600    IN      AAAA    2a01:7c8:d004:18d::1
ns4.yoctodns.com.       3600    IN      A       136.144.154.49
ns4.yoctodns.com.       3600    IN      AAAA    2a01:7c8:fff7:86::1
ns3.yoctodns.com.       3600    IN      A       37.97.226.189
ns3.yoctodns.com.       3600    IN      AAAA    2a01:7c8:fffd:214::1

;; Query time: 90 msec
;; SERVER: 136.144.225.232#53(136.144.225.232) (UDP)
;; WHEN: Tue May 06 01:06:58 UTC 2025
;; MSG SIZE  rcvd: 535

has the "Answer Section" in lowercase even though the query had some capital letters.

Whereas most authoritative DNS servers (just to use a random example)

$ dig +norecurse hellOworlD.LetsEncrypT.oRg. @108.162.193.219

; <<>> DiG 9.18.33 <<>> +norecurse hellOworlD.LetsEncrypT.oRg. @108.162.193.219
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55758
;; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;hellOworlD.LetsEncrypT.oRg.    IN      A

;; ANSWER SECTION:
hellOworlD.LetsEncrypT.oRg. 7200 IN     CNAME   origin.LetsEncrypT.oRg.
origin.LetsEncrypT.oRg. 300     IN      A       54.176.55.186

;; Query time: 10 msec
;; SERVER: 108.162.193.219#53(108.162.193.219) (UDP)
;; WHEN: Tue May 06 01:12:44 UTC 2025
;; MSG SIZE  rcvd: 92

echo the requested case back.

Again, not sure if it's really related to the problems you're having, but it's yet another DNS implementation difference that makes things harder to debug.
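For anyone who wants to see what the 0x20 trick amounts to, here is a minimal Python sketch. The helper names are made up and this is not how Unbound implements it; it only shows the idea that the resolver randomizes the case of the query name, expects the question to come back byte-for-byte identical, and still treats names as equal when only ASCII case differs.

```python
import random


def ascii_lower(name: str) -> str:
    """Lowercase only A-Z, which is all that DNS case-insensitivity covers."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)


def randomize_case(name: str, rng: random.Random) -> str:
    """Return `name` with the case of each character chosen at random."""
    return "".join(rng.choice((c.lower(), c.upper())) for c in name)


def question_echoed(sent_qname: str, received_qname: str) -> bool:
    """0x20 check: the echoed question must match exactly, including case."""
    return sent_qname == received_qname


def same_dns_name(a: str, b: str) -> bool:
    """Ordinary DNS name comparison: ASCII case-insensitive."""
    return ascii_lower(a.rstrip(".")) == ascii_lower(b.rstrip("."))


if __name__ == "__main__":
    sent = randomize_case("onoma.yocto.com", random.Random())
    print(sent, same_dns_name(sent, "onoma.yocto.com"))
```

The point for a server author is that both checks must pass at once: the name has to be looked up case-insensitively, while the question is copied into the response exactly as received.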

4 Likes

Again, not sure if it's really related to the problems you're having, but it's yet another DNS implementation difference that makes things harder to debug.

I don't think that is the issue.

I basically only have one guess, which is that the two authoritative nameservers (mentioned in Comment #3) don't actually agree with each other. One is serving the NSEC and RRSIG records, and the other isn't. That's the best explanation I have for why this failure is nondeterministic, with some perspectives succeeding and some perspectives failing.

Okay, but this could be tested by running dig @nsX.yoctodns.com CAA onoma.yocto.com against every nameserver (both IPv4 and IPv6), and they do all give the same answer:

Honestly, I don't think that the issue is caused by different results from different nameservers. As I have mentioned before, unboundtest.com shows some difference in the logs of the last query (api.sidn.nl and onoma.yocto.com):

May 06 07:52:09 unbound[22843:0] info: signer is sidn.nl. TYPE0 CLASS0
May 06 07:52:27 unbound[22844:0] info: no signer, using onoma.yocto.com. TYPE0 CLASS0

In both cases, the SOA and its RRSIG are visible in the scrubbed packet, but the NSEC and its RRSIG are only visible in the case of api.sidn.nl.

Okay, I found the issue and fixed it. It was indeed something with case-sensitivity, @petercooperjr, but not in the way you think.

My DNS server is written in Java. I have a function (checkRecordBetweenNSEC) that adds NSEC records to the response based on the state (NXDOMAIN or NODATA/NOERROR). However, this function used the String::equals() method. I have now replaced that with String::equalsIgnoreCase() and this fixed the issue. I was returning NSEC records only when there was an exact, case-sensitive match.
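To illustrate this class of bug for other readers, here is a small Python sketch. It is not the actual checkRecordBetweenNSEC code (that is Java and not shown here); the zone data and function names are hypothetical. It only shows how an exact string match finds no owner name for a mixed-case query, so no NSEC record would be attached, while an ASCII case-insensitive match does.

```python
def ascii_lower(name: str) -> str:
    """Lowercase only A-Z, which is all that DNS case-insensitivity covers."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)


# Hypothetical zone data: owner name -> record types present at that name.
ZONE = {
    "onoma.yocto.com.": {"A", "RRSIG", "NSEC"},
}


def find_owner_exact(qname: str, zone: dict) -> "str | None":
    """Buggy variant: case-sensitive comparison, like String::equals()."""
    return qname if qname in zone else None


def find_owner_ignore_case(qname: str, zone: dict) -> "str | None":
    """Fixed variant: ASCII case-insensitive comparison."""
    wanted = ascii_lower(qname)
    for owner in zone:
        if ascii_lower(owner) == wanted:
            return owner
    return None


if __name__ == "__main__":
    qname = "oNOma.YoCto.cOm."
    print(find_owner_exact(qname, ZONE))        # no owner found, so no NSEC is added
    print(find_owner_ignore_case(qname, ZONE))  # owner found, NSEC can be added
```

With plain lowercase queries both variants behave the same, which would explain why an ordinary dig never showed the problem.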

How did I find this? The unboundtest.com tool shows udp message logs. I parsed the hex dump of the last query with my transaction decoder and was able to reproduce it with dig, because I saw oNOma.YoCto.cOm..
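For those without their own decoder, pulling the query name out of a raw DNS message is not much code. The Python sketch below is a simplified stand-in, not my transaction decoder: it assumes a standard 12-byte header followed directly by the question, and it builds its own sample message instead of using a real dump from unboundtest.com.

```python
def encode_qname(name: str) -> bytes:
    """Encode a name in DNS wire format: length-prefixed labels, then a zero byte."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out.append(len(raw))
        out += raw
    out.append(0)
    return bytes(out)


def decode_qname(message: bytes, offset: int = 12) -> str:
    """Read the question name that follows the 12-byte DNS header."""
    labels = []
    while True:
        length = message[offset]
        if length == 0:
            break
        if length & 0xC0:
            raise ValueError("compression pointer or extended label in question name")
        offset += 1
        labels.append(message[offset:offset + length].decode("ascii"))
        offset += length
    return ".".join(labels) + "."


if __name__ == "__main__":
    header = bytes(12)  # placeholder header, all zero
    # QTYPE 257 (CAA) = 0x0101, QCLASS 1 (IN) = 0x0001
    query = header + encode_qname("oNOma.YoCto.cOm") + b"\x01\x01\x00\x01"
    print(decode_qname(query))
```

With a hex dump in hand, bytes.fromhex() turns it into the bytes that decode_qname expects, and the mixed-case name it prints is what you then pass to dig to reproduce the failure.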

I think this topic can be closed now, but I think that this issue should be caught by some other tool like DNSViz or Zonemaster. I will mention this issue on their repositories and check with them on how to report errors for it. This isn't a Let's Encrypt or Unbound issue.

5 Likes
Congratulations, all renewals succeeded:
  /etc/letsencrypt/live/onoma.yocto.com/fullchain.pem (success)

For other people coming across this issue: check your domain (both CAA and _acme-challenge TXT) on unboundtest.com and check that all responses (especially the last one) give the right results. If something is not as it should be (as in my case), then that is likely the issue.

1 Like