SERVFAIL errors on CAA checks (primarily when including many SANs)

Hey there! I hope I've included enough information here; more than happy to add more info/detail as needed (apologies if I glossed over anything).

My domain is: nmdp.org

I ran this command: (requesting renewal of several different certificates, each with a dozen or more domain names, using Ansible playbooks with the Ansible community.crypto ACME modules, see below)

It produced this output:
Repeated SERVFAIL responses on CAA lookups. Originally we had no CAA record for our domain, but the failures continue even after creating CAA records. This has been happening for at least a few weeks now; a couple of network engineers and I have been scratching our heads over it, running all sorts of tests from various locations at various levels of intensity, and we can't reliably replicate the SERVFAIL errors. We also don't have any indication from our DNS proxy provider (Imperva) that anything is going wrong here (we have a support ticket open with Imperva but are still trying to get traction on it).

It's seemingly sporadic: it's not always the same SAN/domain that comes back with the SERVFAIL, it's not specific to wildcard or non-wildcard names, and it's not specific to my apex domain (nmdp.org) or the subdomains we list as SANs.

If the error does happen to occur on a smaller cert (say, fewer than 8-10 domains), simply rerunning our automation job succeeds. On these few larger certificates, waiting any number of days before retrying doesn't appear to help.
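Since a rerun often fixes the smaller certs, the rerun itself can be automated. A rough sketch of a backoff wrapper around the renewal job; the `ansible-playbook renew-certs.yml` invocation is a hypothetical stand-in for our real job:

```shell
# Retry a flaky command up to $1 times, starting with a $2-second pause
# and doubling the wait after each failure.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local n
  for ((n = 1; n <= attempts; n++)); do
    "$@" && return 0
    echo "attempt $n failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))    # exponential backoff
  done
  return 1
}

# Example: up to 4 attempts, starting with a 60-second pause:
#   retry 4 60 ansible-playbook renew-certs.yml
```

This obviously papers over the underlying SERVFAILs rather than fixing them, but it keeps the small-cert renewals unattended in the meantime.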

My web server is (include version): (n/a, though we use the latest or N-1 version of the Ansible community.crypto ACME modules)

The operating system my web server runs on is (include version): Running on Linux (Ubuntu, I believe), in Ansible execution runner containers.

My hosting provider, if applicable, is:
DNS: Imperva (proxying in front of Infoblox).

I can login to a root shell on my machine (yes or no, or I don't know): yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):
Ansible community.crypto acme_certificate module (community.crypto.acme_certificate — Create SSL/TLS certificates with the ACME protocol)

Other actions we've run/tested:

  • Nothing particularly obvious when looking at LetsDebug
  • CAA results appear as expected via unboundtest.com
  • I didn't notice any egregious issues on check-your-website.server-daten.de (though I could have missed something there)
  • We haven't been able to reproduce this as consistently in the Let's Encrypt Staging environment; we still get CAA failures on two of our larger certs there, though Staging occasionally issues the cert successfully, whereas Production hasn't seen a successful issuance on those certs in roughly two weeks, failing each time on the CAA SERVFAIL errors
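One more check we can run from the renewal host itself is spot-checking CAA for each SAN, climbing from the name toward the apex the way a CA's resolver does. A hedged sketch; the `dig_caa`/`check_caa_chain` names and the example SAN are mine, not from any tool, and `QUERY` is indirected so the lookup can be stubbed out:

```shell
# Default to a live dig lookup; override QUERY to stub it.
QUERY=${QUERY:-dig_caa}
dig_caa() { dig +short CAA "$1"; }

# Print the CAA answer for a name and each of its parents.
check_caa_chain() {
  local name=$1
  while [ -n "$name" ]; do
    echo "CAA $name: $("$QUERY" "$name")"
    case "$name" in
      *.*) name=${name#*.} ;;   # strip the leftmost label
      *)   name= ;;             # reached the TLD; stop
    esac
  done
}

# Example (live lookup):
#   check_caa_chain wso2mi.nmdp.org
```

Run in a loop around renewal time, this might catch the intermittent SERVFAIL from our side instead of only seeing it in the CA's error.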

I'd love for there to be a giant red flag somewhere, but I haven't been able to find it yet.

Any help on this would be awesome, even if just spitballing some more ideas.

Thanks!

1 Like

The sporadic nature of the problem feels like some sort of load-related DoS protection trigger at either the Imperva or the Infoblox layer: too many queries from the same IPs, too quickly, triggering a temporary block.

6 Likes

I don't know how actionable or related this is, but both DNSViz and the ISC EDNS Tester report issues, including timeouts, and you may have a path MTU issue.

nmdp.org/TXT: No response was received until the UDP payload size was decreased, indicating that the server might be attempting to send a payload that exceeds the path maximum transmission unit (PMTU) size. See RFC 6891, Sec. 6.2.6. (192.230.121.1, 192.230.122.1, UDP_-_EDNS0_4096_D_KN)
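The arithmetic behind that finding is simple: a UDP DNS response only crosses the path unfragmented if the advertised EDNS0 payload plus the UDP and IP headers fits within the path MTU. A minimal sketch of that check, assuming IPv4 (20-byte IP header) and an 8-byte UDP header:

```shell
# Does an EDNS0 payload of $1 bytes fit a path MTU of $2 bytes,
# assuming IPv4 (20-byte IP header) + UDP (8-byte header)?
fits_pmtu() {
  local payload=$1 mtu=$2
  [ $((payload + 20 + 8)) -le "$mtu" ]
}

# A 4096-byte EDNS0 answer cannot cross a 1500-byte path unfragmented,
# which is consistent with only getting answers after shrinking the
# advertised payload size:
#   fits_pmtu 4096 1500  -> non-zero exit (does not fit)
#   fits_pmtu 1220 1500  -> zero exit (fits)
```

So the "no response until the UDP payload size was decreased" wording fits a path somewhere that silently drops large or fragmented UDP datagrams.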

6 Likes

Yeah... this is a tough one. We would normally see logs/events in our console for DoS, but it's also entirely possible that Imperva is just not relaying that information when it comes to CAA records specifically. (Hoping the vendor support ticket sheds some light there, because it really doesn't feel like a problem on the Let's Encrypt side.)

2 Likes

Good call out on the MTU issue. I didn't notice that when I ran reports on DNSViz last week. To your point, that might be one of those totally un-actionable things depending on what levers we have the ability to pull/push in Imperva's configs. I'll check in with the network folks I'm working with and see what is available there. Thank you for that!

4 Likes

Quick update: not much progress on this. After reviewing the MTU config, we're sitting at 1220 bytes on Imperva, which is a reasonable value from my understanding. So it seems that report from DNSViz just indicates that it tried the higher values and eventually stepped down to 1220.

An extra note here, though: the issue recurred on a certificate with a single domain (wso2mi.nmdp.org), but after rerunning that same automation job, the cert renewed just fine. The other cert renewal that ran last night, with a couple of domains on it, completed without issue right before that, though not in parallel, for what it's worth (we run our few renewals in sequence, not concurrently).

It continues to present itself as a problem that just happens a percentage of the time, not necessarily just on many-SAN certs.

1220 is a reasonable value, but you need to make sure that the firewalls/routers/etc. aren't blocking ICMP Destination Unreachable messages, since those are what let the sender know it needs to lower the packet size it's sending.

I don't know if that's specifically the problem you're running into, just that sometimes issues of "sometimes packets aren't getting through" are related to that.

Really I think what you need are packet captures from your DNS provider, to compare a time that it works to a time that it doesn't work. I doubt that there's anything really specific to CAA records involved, it's just that that's the part of issuance that can involve large numbers of requests all arriving at once and so it can be a stress test for some systems.
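If captures from the provider are slow to arrive, the client side can at least be bracketed: capture port-53 traffic around each renewal run, so a failing window can be diffed against a working one. A sketch, assuming a Linux host with `tcpdump` available (which generally needs root); the playbook name is hypothetical:

```shell
# Run a command while capturing DNS traffic to a pcap file,
# preserving the command's exit status.
capture_during() {
  local pcap=$1
  shift
  tcpdump -i any -w "$pcap" 'port 53' &   # background capture
  local cap_pid=$!
  local rc=0
  "$@" || rc=$?                           # e.g. the renewal playbook
  sleep 1                                 # let trailing packets flush
  kill "$cap_pid" 2>/dev/null || true
  wait "$cap_pid" 2>/dev/null || true
  return "$rc"
}

# Example:
#   capture_during "renew-$(date +%s).pcap" ansible-playbook renew-certs.yml
```

Captures from the resolver or authoritative side would still be more telling, since the SERVFAIL is happening between Let's Encrypt's resolvers and Imperva, but client-side timestamps at least pin down when to look.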

3 Likes