Hey there! I hope I've included enough information here; I'm more than happy to add more detail as needed (apologies if I glossed over anything).
My domain is: nmdp.org
I ran this command: (requesting renewal of several different certificates with more than a dozen or so domain names, using Ansible playbooks with the Ansible Crypto ACME modules, see below)
It produced this output:
Repeated SERVFAIL responses on CAA lookups. These first appeared when our domain had no CAA record, and they have continued after we created CAA records. This has been happening for at least a few weeks. A couple of network engineers and I have been scratching our heads over it, running all sorts of tests from various locations at various levels of intensity, and we can't reliably reproduce the SERVFAIL errors. We also have no indication from our DNS proxy provider (Imperva) that anything is going wrong (we have a support ticket open with Imperva but are still trying to get traction on it).
It's seemingly sporadic: it isn't always the same SAN/domain that comes back with SERVFAIL, it isn't specific to wildcard or non-wildcard names, and it isn't specific to the apex domain (nmdp.org) versus the subdomains we list as SANs.
If the error occurs on a smaller cert (say, fewer than 8-10 domains), simply rerunning our automation job succeeds. On the few larger certificates, retrying doesn't help, no matter how many days we wait between attempts.
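One way to read the size effect above, purely as a back-of-the-envelope model and not something we've confirmed: if each name's CAA check failed independently with some probability p, an order with n names would clear every check with probability (1 - p)^n, which drops quickly as n grows. The 10% failure rate below is an assumed number for illustration, not a measurement, and the model ignores any retrying or caching done by the CA or the client.

```python
def order_success_probability(p_fail: float, names: int) -> float:
    """Chance that every name passes, assuming independent per-name failures."""
    return (1.0 - p_fail) ** names


# Assumed 10% per-name SERVFAIL rate; the name counts are illustrative only.
for names in (5, 10, 20, 40):
    print(names, round(order_success_probability(0.10, names), 3))
```

If something like this holds, it would fit small certs succeeding on a rerun while the large ones almost never get through, but it doesn't explain where the SERVFAILs come from.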
My web server is (include version): (n/a, though using latest version or N-1 version of Ansible Crypto ACME modules)
The operating system my web server runs on is (include version): Running on Linux (Ubuntu, I believe), in Ansible execution runner containers.
My hosting provider, if applicable, is:
DNS: Imperva (proxying in front of Infoblox).
I can login to a root shell on my machine (yes or no, or I don't know): yes
I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no
The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):
Ansible Crypto acme_certificate module (community.crypto.acme_certificate, "Create SSL/TLS certificates with the ACME protocol" in the Ansible Community Documentation)
Other actions we've run/tested:
- Nothing particularly obvious when looking at LetsDebug
- CAA results appear as expected via unboundtest.com
- I didn't notice any egregious issues on check-your-website.server-daten.de (though I could have missed something there)
- We haven't been able to reproduce this as consistently in the Let's Encrypt Staging environment: we still get CAA failures on two of our larger certs, but occasionally a cert is issued successfully in Staging (whereas Production hasn't issued those certs successfully in roughly two weeks, failing each time on the CAA SERVFAIL errors)
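In case it helps anyone trying to reproduce this, here is a minimal sketch of a repeated CAA probe using only the Python standard library. It is not our exact tooling: the function names are made up for illustration, and the server address is a documentation placeholder that would need to be swapped for a real nameserver (ideally one of the authoritative servers, since a public recursive resolver may just serve cached answers). It sends plain UDP queries for record type 257 (CAA) and tallies the response codes.

```python
import random
import socket
import struct
from collections import Counter

CAA_TYPE = 257  # DNS record type number for CAA
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, qtype: int = CAA_TYPE) -> bytes:
    """Build a minimal DNS query: one question, class IN, recursion desired."""
    txid = random.randint(0, 0xFFFF)
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def rcode_of(response: bytes) -> str:
    """Read the response code from the low 4 bits of the header flags."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def probe(name: str, server: str, attempts: int = 20, timeout: float = 3.0) -> Counter:
    """Send repeated CAA queries over UDP and count each outcome.

    Simplification: the reply's transaction ID is not checked against the query.
    """
    results = Counter()
    for _ in range(attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(build_query(name), (server, 53))
                data, _addr = sock.recvfrom(4096)
                results[rcode_of(data)] += 1
            except socket.timeout:
                results["TIMEOUT"] += 1
    return results


if __name__ == "__main__":
    # 192.0.2.53 is a placeholder address, not a real nameserver.
    print(probe("nmdp.org", "192.0.2.53", attempts=5))
```

A tally with any SERVFAIL or TIMEOUT entries, taken from several vantage points, would at least give us something concrete to hand to Imperva.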
I'd love for there to be a giant red flag somewhere, but I haven't been able to find it yet.
Any help on this would be awesome, even if just spitballing some more ideas.
Thanks!