CAA requests resulting in SERVFAIL since Dec 12th

I work at a small domain registrar/site hosting company. We have been using Let's Encrypt for our domains for about 2 years now, but for about 6 days our certificate renewal processes have been failing with CAA SERVFAIL issues:

Problem {
  type: "urn:ietf:params:acme:error:dns",
  detail: "DNS problem: SERVFAIL looking up CAA for www.itsjustnic.com - the domain's nameservers may be malfunctioning",
  status: 400,
}

However, I can't reproduce this result:

djc-2021 instagram-owner certifier $ dig caa itsjustnic.com 

; <<>> DiG 9.10.6 <<>> caa itsjustnic.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49935
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4095
;; QUESTION SECTION:
;itsjustnic.com.                        IN      CAA

;; ANSWER SECTION:
itsjustnic.com.         3600    IN      CAA     0 issue "letsencrypt.org"
itsjustnic.com.         3600    IN      CAA     0 iodef "mailto:dns@instantdomains.com"

;; Query time: 104 msec
;; SERVER: 100.100.100.100#53(100.100.100.100)
;; WHEN: Mon Dec 18 16:07:31 CET 2023
;; MSG SIZE  rcvd: 125

djc-2021 instagram-owner certifier $ dig caa www.itsjustnic.com

; <<>> DiG 9.10.6 <<>> caa www.itsjustnic.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57555
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4095
;; QUESTION SECTION:
;www.itsjustnic.com.            IN      CAA

;; Query time: 101 msec
;; SERVER: 100.100.100.100#53(100.100.100.100)
;; WHEN: Mon Dec 18 16:07:39 CET 2023
;; MSG SIZE  rcvd: 47

All of the authorization errors that we have seen have mentioned a www.* domain (we always create CSRs with two domains in the SAN: the registrable domain and the www. name for that registrable domain).

I did find the documentation on CAA errors and common causes, however, I don't think these apply here? We do nothing with DNSSEC, and as shown above, we yield NOERROR for domains that don't have a CAA record.

We are using instant-acme (which we wrote) as our ACME client, and are using the dns-01 authorization method. Authoritative DNS records are served using a simple DNS server which we also wrote. However, our cloud logging solution does not reveal any error logs from the DNS server, nor any other logs with a SERVFAIL status (although we log ~every response).

This issue seems to have been occurring since about Dec 12th; we have about 100 sites right now that have failed to renew and 2 new domains that we haven't been able to get a certificate for. As far as I'm aware there have been no material changes to our DNS server or certificate issuance component. I'd appreciate any help on this issue!

Your server is returning NOERROR, but a NOERROR response with no answer records is supposed to include the SOA record as well (I think). Let's Encrypt recently upgraded their version of Unbound to 1.18, which is pickier about DNS servers following the standards.

You can see this with Unboundtest:

1.18 SERVFAIL: https://unboundtest.com/m/CAA/www.itsjustnic.com/URZIL2K7
1.16 working: https://unboundtest.com/m/CAA/www.itsjustnic.com/ZQSLC6TW

And DNSViz reports the same for www.itsjustnic.com:

www.itsjustnic.com/CAA (NODATA): No SOA RR was returned with the NODATA response. (34.152.18.51, 35.234.251.10, UDP_-_EDNS0_4096_D_KN)

So this is a bug in your DNS server implementation, and Let's Encrypt's resolver now requires it to be fixed. What DNS server are you using?

@petercooperjr is right on the money. It seems like your homegrown DNS server isn't fully compatible with the DNS specs and is now failing for some strict resolvers such as Unbound 1.18.

Thanks, I missed that they were trying to write their own DNS server. Yes, that's much more complicated than it seems like it should be. But you're in good company, there are other people who recently have had problems with their old or homebrew DNS servers not quite following the standards:

I've spent a few minutes trying to find the spec that says what should be returned for a NOERROR empty response, and that it should include a SOA record, and I don't know the chapter and verse since there are a lot of DNS specs over the years. But it seems to be a thing that's expected, and that current resolvers require.

Awesome, thanks for the quick response! Going to try this out.

It's unfortunate that the Discourse search, when prompted with "CAA SERVFAIL", seems to mostly yield much older results from 2017 etc.

Yeah, perhaps, but Advanced search allows sorting by most recent which is something I often do.

It might make no difference for this particular test.
But for future testing, all DNS requests should be made against the authoritative DNS tree path.
[ I can't tell where your DNS request went - I don't recognize "SERVER: 100.100.100.100#53(100.100.100.100)" ]
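To make the "query the authoritative servers directly" advice concrete, here is a minimal sketch using only the Python standard library. It builds a CAA query with the recursion-desired flag cleared and sends it over UDP to a single server. The function names are made up for illustration, and the sketch skips EDNS, TCP fallback, and retries, so treat it as a debugging aid rather than a full client.

```python
import secrets
import socket
import struct

CAA_TYPE = 257  # RR type number assigned to CAA
IN_CLASS = 1


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int = CAA_TYPE) -> bytes:
    """Build a one-question query with all flags cleared (RD=0, no recursion)."""
    header = struct.pack("!HHHHHH", secrets.randbits(16), 0x0000, 1, 0, 0, 0)
    return header + encode_name(name) + struct.pack("!HH", qtype, IN_CLASS)


def query_authoritative(server_ip: str, name: str, timeout: float = 3.0) -> bytes:
    """Send the query over UDP straight to one authoritative server."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server_ip, 53))
        data, _ = sock.recvfrom(4096)
        return data
```

For example, `query_authoritative("34.152.18.51", "www.itsjustnic.com")` would target one of the server addresses that appears in the DNSViz output earlier in the thread; the raw bytes it returns can then be inspected for the header counts.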

Finally found the spec: RFC 2308, section 3:

Name servers authoritative for a zone MUST include the SOA record of the zone in the authority section of the response when reporting an NXDOMAIN or indicating that no data of the requested type exists. This is required so that the response may be cached.
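For anyone maintaining their own authoritative server, the requirement quoted above can be sketched roughly as follows. This is only an illustration in Python, not the code of any server discussed in this thread: the function names, the SOA timer values, and the `ns1.example.net` / `hostmaster.example.net` names in the usage note are placeholders. It also leaves out EDNS handling and name compression for brevity.

```python
import struct

SOA_TYPE = 6
IN_CLASS = 1


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def question_section(query: bytes) -> bytes:
    """Return the bytes of the first question (name, QTYPE, QCLASS)."""
    pos = 12  # the question starts right after the 12-byte header
    while query[pos] != 0:
        pos += 1 + query[pos]
    return query[12:pos + 5]  # root label byte plus 2 bytes QTYPE, 2 bytes QCLASS


def build_nodata_response(query: bytes, zone: str, mname: str, rname: str,
                          serial: int, negative_ttl: int = 300) -> bytes:
    """NOERROR, zero answers, and the zone's SOA in the authority section."""
    query_id, query_flags = struct.unpack("!HH", query[:4])
    # QR=1 (response), AA=1 (authoritative), RD copied from the query, RCODE=0.
    flags = 0x8000 | 0x0400 | (query_flags & 0x0100)
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=1 (the SOA), ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 1, 0)
    # SOA RDATA: MNAME, RNAME, SERIAL, REFRESH, RETRY, EXPIRE, MINIMUM.
    rdata = (encode_name(mname) + encode_name(rname)
             + struct.pack("!IIIII", serial, 7200, 3600, 1209600, negative_ttl))
    soa = (encode_name(zone)
           + struct.pack("!HHIH", SOA_TYPE, IN_CLASS, negative_ttl, len(rdata))
           + rdata)
    return header + question_section(query) + soa
```

A call such as `build_nodata_response(query, "itsjustnic.com", "ns1.example.net", "hostmaster.example.net", 2023121801)` yields a response whose NSCOUNT is 1, which is what makes the negative answer cacheable under RFC 2308.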

And found the detail of the change in the Unbound 1.18 changelog:

  • Fix to ignore entirely empty responses, and try at another authority. This turns completely empty responses, a type of noerror/nodata into a servfail, but they do not conform to RFC2308, and the retry can fetch improved content.
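
To check a captured response for the condition the changelog entry describes, a rough classifier based only on the DNS header counts could look like the sketch below. It is modelled on the wording of the changelog, not on Unbound's source code, and it deliberately ignores the additional section on the assumption that an EDNS OPT record there should not count as content.

```python
import struct


def classify_response(message: bytes) -> str:
    """Classify a raw DNS response using only the 12-byte header."""
    if len(message) < 12:
        raise ValueError("DNS message shorter than its 12-byte header")
    _id, flags, _qd, ancount, nscount, _ar = struct.unpack("!HHHHHH", message[:12])
    rcode = flags & 0x000F
    if rcode == 3:
        return "NXDOMAIN"
    if rcode != 0:
        return f"RCODE {rcode}"
    if ancount > 0:
        return "answer"
    if nscount == 0:
        # NOERROR with empty answer and authority sections: no SOA to cache on.
        return "empty NODATA"
    return "NODATA with authority"
```

The dig output for www.itsjustnic.com at the top of the thread shows `ANSWER: 0, AUTHORITY: 0` with `status: NOERROR`, which this sketch would label "empty NODATA".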

Do you know if this (strictness) is configurable in Unbound? Starting to see other users with the same problem and they're going to have to go to their DNS providers for help.

[Edit: maybe a false alarm.. seems similar though]

Slightly hijacking the thread, but it looks like the DNS service provided by hover.com (Tucows) currently returns an acceptable response for the primary domain but forgets to include SOA records in some subdomain responses (e.g. www, but not all). The nameservers affected are ns1.hover.com and ns2.hover.com.

I'll see if the affected user is willing to share one of the domains here as an example.

It may depend on other config like whether they have an existing A record or whether they're using a catch all to direct all subdomains to an IP

I'm not aware of one. From my reading of the Unbound changelog, they consider a completely empty response to be the sign of a "bad server", so Unbound tries to fall back to a different authoritative server that might give it a valid response.

The tally so far that I've found seems to be:

Does look that way. Might make sense to start a separate thread (or include in the other hover.com thread) rather than hijack this one. But I don't know as there's much for people here to do; it looks to be a bug on the DNS provider's side that people have just been managing to get away with for some time. It could be worth trying some other CAs, or other DNS resolving software, to see if they report an error differently, but I think that Let's Encrypt is compelled to follow the DNS standards. (I don't know as I'd go so far as to call the previously-issued certificates validated against bad DNS servers as being misissued, but I think there's an argument to be made for it. And it's probably a hard argument to convince Let's Encrypt to roll back, though I don't know what process they use to determine which DNS server software they need to be using.)

Indeed, changing our DNS server to align with the RFC 2308 behavior fixed the issue for us.

Given the fragmentation of the DNS RFC landscape, it would be good if Let's Encrypt had more precise documentation about the requirements imposed on DNS servers (other than, go read the Unbound source code). I suppose using unboundtest.com works, to some extent, but more normative language/references would be better.

That's not really Let's Encrypt's job though, right? They're using a well-known 3rd party DNS tool that is presumably following the standards. They didn't write it. Why would they know or care about RFC level implementation details? In my opinion, Let's Encrypt already goes above and beyond by having tools like UnboundTest and helping to track down issues rather than just hand waving and saying, "We use Unbound, go ask them."

But you (the org writing a custom authoritative DNS server) certainly have a responsibility to be following the RFCs despite their fragmentation and ever changing nature. That's sort of the job you accepted when you decided to write it instead of using something that already exists. Or at the very least, it's your job to make your implementation function close enough to the majority of the other implementations on the Internet.

The problem is that "the standards" is pretty amorphous in the case of DNS. For CAA processing in particular, RFC 8659 section 3 specifies an algorithm while referring to RFC 1034, but it doesn't talk about this behavior of requiring a NOERROR response without answers to specify the authority -- in fact, the CAA RFC doesn't refer to the negative caching RFC 2308 at all, so I don't think it's all that obvious that Let's Encrypt, when validating CAA records, relies on the behavior specified in that RFC. I don't think it's crazy to argue that there might be value in documenting this clearly, given that from threads on this forum I've definitely not been the only one bitten by this change in Unbound.
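
As a rough illustration of the RFC 8659 section 3 algorithm mentioned above, the search for the relevant CAA record set can be sketched like this. The `lookup` callback is a stand-in for whatever resolver a CA uses (it is not a real API), and the sketch assumes that CNAME handling and error reporting happen inside that callback.

```python
from typing import Callable, List


def relevant_caa_set(domain: str, lookup: Callable[[str], List[str]]) -> List[str]:
    """Climb from `domain` toward the root and return the first non-empty CAA set.

    `lookup` returns the CAA records at exactly one name, an empty list when
    there are none, and is expected to raise if the lookup itself fails.
    """
    labels = domain.rstrip(".").split(".")
    for i in range(len(labels)):
        name = ".".join(labels[i:])
        records = lookup(name)
        if records:
            return records
    return []
```

For www.itsjustnic.com the first lookup is for the www name itself, which has no CAA record. If that lookup fails with SERVFAIL, the climb never reaches the record published at itsjustnic.com, which would explain why the errors in this thread all mention a www.* name.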

I did choose to opt in to that responsibility. Given the lack of a comprehensive modern set of standards (like the 91xx-era HTTP RFCs), unfortunately, it's hard to know which RFCs one is required to implement, as is clear from the fact that Unbound only started to enforce this particular aspect of a 1998 standard in 2023. I don't think it's reasonable to say that every DNS server should implement every DNS RFC, and there was no blog post or topic in the API Announcement category on this forum about this change.

DNS implementations must pick some subset of the plethora of existing DNS RFCs, and I don't think it's crazy to suggest that Let's Encrypt, given its reliance via the dns-01 authorization mechanism, documents in some detail what that mechanism entails. Otherwise it feels like Let's Encrypt is explicitly relying on Unbound behavior, including potential bugs, which doesn't seem like a great situation either.

So yes, I consider it my job to make my implementation function close enough to the majority of the other implementations, but the question is how I find out if something changes. Given the number of threads recently about this topic, it doesn't seem like I'm doing significantly worse than others.

Yeah, I think they've managed to upgrade Unbound in the past without anybody noticing, but I agree that they should have posted an API Announcement (especially in retrospect).

Their goal (as I understand it) is to eventually move to a memory-safe DNS resolver that they have more guidance over. But that may take some time. And probably will swap one set of implementation choices (and/or bugs) for another.

Don't get me wrong. I totally get that implementing a protocol like DNS that has 40 years' worth of RFC changes would be a royal PITA, and there are likely no two implementations that function exactly alike. All I'm saying is that the documentation and protocol-level change notification you want should be provided by Unbound and not Let's Encrypt. And for this particular issue, it actually was in the changelog for Unbound as a bugfix.

What Let's Encrypt could have done better here is better publicize the Unbound version upgrade on their end and perhaps let it stew in Staging a bit longer before pushing it to Prod.

No doubt and I apologize if the tone of my previous post made it sound accusatory. I was not trying to knock your existing efforts. We're all just doing the best we can. The fact that you are here directly instead of a customer of yours trying to be a middle man between Let's Encrypt and your support system is great!

For context, this Unbound upgrade went to staging on November 3, 2023 at 22:24 UTC. It went to production on November 28, 2023 at 17:41 UTC.

Routine updates to Unbound aren't something that we've done announcements about ever before. Admittedly, we had a year-long period where we couldn't upgrade Unbound because newer versions had performance regressions for our (wild, abnormal) workload, but happily all that is remedied, and I'm trying to keep those updates, well, routine.

Should I write one up today for 1.18? Similarly, should I write one for 1.19 announcing it'll get updated sometime early January?

I would vote for Yes (to both). Given that we've heard from a few people using custom DNS software already, I think that some amount of warning is probably helpful. Especially as we've only heard about people who have noticed problems renewing in the past few weeks since it's been deployed, and more people will likely be affected in the coming couple months as their certificate comes up for renewal.

Might also be worth mentioning on the CAA errors documentation (as it'd be nice to be able to point to someplace "official"). The most common case where people are running into this is probably where they have no CAA record, and saying that one cause for SERVFAIL there is that their DNS server isn't responding with the SOA record for a correct empty response may be helpful in addition to the other possible causes listed there.

It's probably also worth understanding more about the changes they've made in 1.19, since it may be that this is only needed temporarily until that upgrade can be in place, if the "bugfix" wasn't actually intended to respond SERVFAIL on these use cases.

Thank you @jcjones for posting the announcement! The only thing I'd nitpick is that it mentions NXDOMAIN responses, but the problem also occurs with empty NOERROR/NODATA responses (which are more common for CAA as the domain exists but just doesn't have a CAA record).
