SERVFAIL looking up CAA: How to detect AHEAD of time?

For the last week, we’ve been getting SERVFAIL looking up CAA in LE staging. I’ve read a half dozen other threads on this and have learned:

  • LE has always inspected the DNS of domains during the issuance process, but only recently started treating SERVFAIL as a failure in staging.
  • The problem is almost always with the DNS operator and requires some DNS upgrade by the operator.

We have tens of thousands of domains, from thousands of customers, and cannot reach out to DNS operators to request upgrades. I need to find a way around this.

So my first goal, and question of this thread, is: how can I reliably test domains programmatically BEFORE sending them to LE, to avoid this happening in the first place? The reason this is crucial is that every attempt is a SAN containing 100 domains, so if one fails, they all fail.

From reading other threads, there doesn’t appear to be a single command that will always expose the issue. I’m hoping there is one command, or one series of dig commands or something, that someone could share with me. My networking knowledge isn’t exactly expert level :stuck_out_tongue_winking_eye:

You might suggest, “just skip staging and go to production, since production doesn’t have this check.”

I don’t believe that’s feasible. Production will have this SERVFAIL logic soon, if it doesn’t already, and we need our code to be prepared.

Furthermore, we absolutely depend on issuing against staging first, every single time we authorize and issue. We sometimes authorize a thousand domains in a day, and at that rate even rare issues stack up enough to hit production rate limits. We depend on weeding those issues out in staging first. Typically, we send LE a SAN with 100 domains, LE says which domain failed authorization, and we retry with 99, and so on, until it works in staging and we can then attempt production with confidence.
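To illustrate the goal, here is a minimal Python sketch of filtering domains before they are grouped into 100-name SAN requests, so one bad domain does not sink the whole batch. The function names `split_by_precheck` and `san_batches` are hypothetical, and `is_healthy` stands in for whatever DNS pre-check ends up being used; none of this is LE tooling.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def split_by_precheck(
    domains: Sequence[str], is_healthy: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Split domains into (passing, failing) according to a pre-check callback."""
    good: List[str] = []
    bad: List[str] = []
    for domain in domains:
        (good if is_healthy(domain) else bad).append(domain)
    return good, bad


def san_batches(domains: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` domains, one group per SAN request."""
    for start in range(0, len(domains), size):
        yield list(domains[start:start + size])
```

Only the passing list would be batched and sent to staging; the failing list can be logged and retried later.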

BTW this does mean that we’ve been failing to generate PRODUCTION certs for a week since staging has been failing for a week. So we’re not in hot water yet, but will be soon.

Hi @lancedolan,

You might want to try

https://unboundtest.com/

or the source code for that tool, linked from that page. This is a version of the same resolver that the Let’s Encrypt CA itself uses.


I found that tool in a separate thread actually, but it is a user interface and not a service API that I’m confident I can integrate into our production system. I could probably observe the request sent by the form on that page and mimic it in order to use its backend from our servers, but that’s not a good integration for our production environment.

However, I hadn’t considered using the source code to just run the unbound test on our own infrastructure… Lemme look into this!

There’s a slight chance of divergences between the behavior of that source code and the behavior of the CA, but there aren’t intended to be any such divergences; if you find any, you can report them to @jsha, who would likely be able to get them straightened out by changing the behavior of one side or the other.

It sounds like you may have read some old threads and gotten an incorrect impression. Let’s Encrypt now treats CAA SERVFAIL as preventing issuance on both production and staging, and has since September 7 (see “CAA SERVFAIL changes”). We did our best to reach out in advance of that change; did you receive any notification emails from us? Do you have email addresses set on your account?

Using the config from unboundtest.com as a pre-check on your infrastructure is a good idea; there’s also helpful information at https://letsencrypt.org/docs/caa/.
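A rough Python sketch of such a pre-check follows. It assumes you run your own Unbound instance with the unboundtest.com config, reachable at `127.0.0.1` port `5353` (both values are placeholders), and that `dig` is installed. It queries CAA for the name and each parent, because CAA lookup climbs the DNS tree; checking every parent is deliberately conservative and may flag names the CA would accept. The helper names are made up for illustration.

```python
import re
import subprocess
from typing import List

# dig prints a header line such as: ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1"
STATUS_RE = re.compile(r"status:\s*([A-Z]+)")


def parse_dig_status(dig_output: str) -> str:
    """Return the response code found in dig output, or 'UNKNOWN' if none is present."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else "UNKNOWN"


def names_to_check(domain: str) -> List[str]:
    """List the name and each parent up to the TLD, dropping any leading wildcard label."""
    name = domain.rstrip(".")
    if name.startswith("*."):
        name = name[2:]
    labels = name.split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def caa_status(name: str, resolver: str = "127.0.0.1", port: int = 5353,
               timeout: int = 10) -> str:
    """Ask the (assumed) local resolver for the CAA record and return the status."""
    cmd = ["dig", f"@{resolver}", "-p", str(port),
           "+tries=1", f"+time={timeout}", name, "CAA"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout + 5)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return "UNKNOWN"
    return parse_dig_status(result.stdout)


def caa_lookup_ok(domain: str, **resolver_args) -> bool:
    """True only if every name in the chain answers NOERROR or NXDOMAIN."""
    return all(caa_status(n, **resolver_args) in ("NOERROR", "NXDOMAIN")
               for n in names_to_check(domain))
```

Timeouts and missing output are treated as failures here, on the assumption that it is cheaper to hold a domain back than to fail a whole SAN request.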

It’s also worth noting that CAA is checked at validation time, so if you have a recent validation for a given domain, you can be pretty confident that that domain’s CAA works.
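As a sketch of how that could cut down on pre-checks: keep a record of when each domain last validated successfully and only pre-check the ones without a recent entry. The `last_validated` mapping and the 7-day `max_age` window are assumptions for illustration, not values taken from Let’s Encrypt.

```python
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


def needs_caa_precheck(
    domain: str,
    last_validated: Mapping[str, datetime],
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(days=7),  # assumed freshness window
) -> bool:
    """Return True when the domain has no successful validation within max_age."""
    current = now if now is not None else datetime.now(timezone.utc)
    validated_at = last_validated.get(domain)
    return validated_at is None or current - validated_at > max_age
```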

Why do you aggregate the domains at all? Are these subdomains of a given domain or distinct domains?
