Spurious CAA SERVFAIL responses during finalize

@jcjones

The 5 certs that were failing renewals yesterday have now succeeded. Was something changed on the Let's Encrypt side? I'm wondering if this will happen again.

Can you tell us specifically what errors occurred after the Unbound 1.18.0 DNS upgrade? How can we ensure that our side is compliant?

unboundtest.com, for example, seems OK: https://unboundtest.com/m/CAA/cert-intl-9d8pqgr5.ca-central-1.aws.glb.confluent.cloud/U2MOLV5D

2 Likes

The only difference in production from Monday to today is the Unbound 1.18 update yesterday: Potential networking / client changes on DNS Challenges - #28 by jcjones

As gurjit said, their DNS server had an EDNS compliance issue which was surfaced by our Unbound upgrade to 1.18. If the compliance tester is happy with your zone, it should be fine.

4 Likes

Even with EDNS compliance on our side, we're still seeing intermittent failures. Is there any way for you to see more detailed errors/logs on your side? The error message is too vague for us to know what to change.

1 Like

Since these are CAA SERVFAIL problems, my first thought is rate-limiting, because the CAA algorithm sends a lot of DNS requests in a short amount of time.

The next thing on my to-do list is to increase the number of validation hosts we have, which might help us spread requests across more source IP addresses and reduce per-IP rate limit problems. (We've been all-hands-on-deck over in the other thread this morning.)

8 Likes

@jcjones Thank you for looking into this.

Do I understand correctly that the SERVFAIL comes from your Unbound recursive resolver, and that your client application sees it when the resolver either has trouble querying the DNS authority (Route53 or our DNS infra) or thinks the responses it's getting are invalid? Is there a way to tell which of those is the case?

SERVFAIL is coming from Unbound, yes, and I'm seeing some of those returns for your domain in under 2s, so they're not timeouts.

Only if it happens consistently, so that we can dump a detailed log. As you can probably understand, we can't handle a megabyte of logging per DNS lookup, which is roughly what Unbound produces at max verbosity.

But all said, the consistent issue we see with CAA is this: our CAA checks, right before final issuance, trigger rate limiting at the DNS server. That shows up as either a timeout or a SERVFAIL. It's not guaranteed that this is what's happening here, but it's my best guess at present.

CAA is a common culprit for this because it's not one query: it's a bunch, all at once, to implement the algorithm for finding the "Relevant Resource Record Set" from RFC 8659, and the more labels in the DNS name (like yours there), the more lookups have to be made.
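
A minimal sketch of that climb, in Python with the dnspython package (my illustration, not Boulder's actual implementation):

```python
import dns.resolver

def relevant_caa_rrset(fqdn: str):
    """Climb from the full name toward the root, one label at a time."""
    labels = fqdn.rstrip(".").split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        try:
            # a SERVFAIL at any level surfaces as dns.resolver.NoNameservers,
            # which is the failure mode discussed in this thread
            answer = dns.resolver.resolve(candidate, "CAA")
            return candidate, list(answer)  # first CAA RRset found wins
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            continue  # no CAA here; climb one label and try again
    return None, []  # no CAA anywhere: any CA may issue

# A 6-label name like cert-intl-9d8pqgr5.ca-central-1.aws.glb.confluent.cloud
# can take up to 6 CAA queries -- multiplied by the number of SANs on the
# order, and again by each validation perspective.
```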

5 Likes

Thank you. Is this mainly happening for our domains or across your subscriber base? If it's specific to us, I'm wondering if there's anything we can do to reduce the number of queries. We don't/didn't have CAA set for our domains but are exploring whether adding it helps.

1 Like

If that's actually the issue -- the number of CAA queries against your NS triggering some protections -- then a Relevant CAA Set would shortcut the algorithm and return early when it's encountered.

So it might be worth a try? Otherwise talk to your nameserver operator about what query limits they have that might affect CAA validations?
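
For what it's worth, once a record is published, a quick visibility check might look like this (a sketch using dnspython; example.com and the zone line are placeholders):

```python
import dns.resolver

# Sanity check that a newly published CAA record, e.g. the zone line
#     example.com.  IN  CAA  0 issue "letsencrypt.org"
# is visible in DNS, so the climb can stop as soon as it reaches this name:
for rr in dns.resolver.resolve("example.com", "CAA"):
    print(rr.flags, rr.tag.decode(), rr.value.decode())
# expected output: 0 issue letsencrypt.org
```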

We're talking about doing things to space out those CAA checks, but it's not trivial: because they can be legitimately slow, spacing them out leads us to needing asynchronous finalization ... which broke a lot of clients the last time we enabled it.

Sharp rocks everywhere.

6 Likes

It's a common failure, but it's fourth out of the four kinds of DNS lookup failures, behind A, AAAA, and TXT.

On average, we send about ~~74.9~~ 193 CAA requests per second, and less than 0.002 requests/sec of them get SERVFAIL.

(Edit: Forgot the multi-perspective CAA checks)
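
(For scale: 0.002 / 193 ≈ 1 in 100,000, so roughly one CAA query in a hundred thousand gets a SERVFAIL.)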

5 Likes

Our initial experiment after creating CAA records seems to be working. We will monitor to see if this gets rid of the SERVFAIL errors.

1 Like

Hi Everyone,

Thanks for all the awesome work being put into this. Unfortunately, we cannot update DNS for all of our customers, so we hope there will be a solution to these sporadic issues: when creating multi-SAN certificates with up to 99 domain challenges, we are seeing 1-2 certificates fail at finalization per run.

We are unable to run in --dryrun mode, so we have halted all certificate generation since yesterday to avoid hitting rate limits.

1 Like

@ITNiels Are you able to reduce the number of SANs per cert? That would increase the odds of getting at least some issued.

Is there a pattern for the ones that fail? Such as being the same DNS provider?

One work-around that has helped others is to add a CAA record. This eliminates some of the lookups. If there is a pattern to the failures, maybe try this just for those domains.

4 Likes

Hi @MikeMcQ ,

Thank you for your reply.
They are from different providers and different TLDs.

We are unfortunately limited a bit by our infrastructure in how many total certificates we can create and replace, and lowering the number of SANs would increase the number of certificates needed. We are currently running 11 certificates in total, and because domains are not "hard assigned" to a certificate, we cannot even cache some of them, as they might change if the ordering changes between runs. (It's not a great system and we want to rewrite it, but that is a bit out of scope for right now.)

And we are not in control of our customers' DNS, so we cannot add CAA records either.

So hoping this can be solved and we can resume until we can get a new system built. :pray:

1 Like

But if the same limited set of domains were failing, you could instruct those customers to do that.

It seems there would be some commonality among the failures. If you could give us some failing names, maybe we would see a pattern where you do not.

The normal failure rate is extremely low, so the odds of you being affected on each 99-name cert are also low unless there is some pattern, such as a shared DNS provider.

4 Likes

Correct, not every certificate experienced issues, but if even one fails then we currently discard them all, and of the 8+ certificates that have changes, at least 1-2 fail per run.

Some failing names repeated, but others did not.
I can try to re-enable the service and see if something has changed since the last run, but I don't like leaving it up to luck, or risking being rate limited.

If there is a solution in the works I would rather wait for it?

Doesn't sound imminent to me. And it may not even help your situation, depending on the root cause. And, if it requires the ACME client to handle async finalize, your very old client may not support that.

4 Likes

@MikeMcQ Thank you very much, I will try and look into alternatives.

2 Likes

Does your client create a new ACME account every time it runs?

If your client uses a consistent ACME account, then you should be able to get issuance even for these 100-name certificates by running it a few times in a row. If 95 names succeed both domain control validation and CAA checks, then those 95 names won't need to be checked at all the next time -- only the 5 failed domain names will need to have domain control validation and CAA checks redone. If the failures are truly random (e.g. due to CAA checks flooding the DNS servers), then the reduction of traffic volume on the retry should help significantly.

The only requirements for this to work are that 1) the same ACME account be used, and 2) the attempts be less than 7 hours apart -- any more than that and the CAA checks will have to be redone anyway.
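
A toy simulation of why this converges (not a real ACME client; the names and the ~2% random failure rate are made-up stand-ins for the behavior described in this thread):

```python
import random

names = [f"host{i}.example.com" for i in range(99)]  # hypothetical SAN list
validated: set[str] = set()  # stands in for Let's Encrypt's per-account cache

for attempt in range(1, 4):  # bounded retries, well within the 7-hour window
    pending = [n for n in names if n not in validated]
    # pretend ~2% of fresh validations fail at random (the symptom above)
    failed = [n for n in pending if random.random() < 0.02]
    validated.update(n for n in pending if n not in failed)
    print(f"attempt {attempt}: {len(pending)} names checked, {len(failed)} failed")
    if not failed:
        print("all names validated; the order can be finalized")
        break
```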

6 Likes

@aarongable Thank you so much,

We are running the service so that it retries every 3 hours, with a maximum of 3 retries before it halts itself and waits for human intervention (so as not to hit rate limits). We are using the same account every time, but were still seeing the issues.

I will see how far I can stretch our certificates to reduce the number of SANs, but all common names share the same parent domain, sslX.ipaper.io, so more certificates just puts us closer to the rate limit.

I did not know that it only has to retry the failed ones within 7 hours. Does that also apply if it is a new order, but with mostly the same domains?

1 Like

You can always use our rate limit adjustment form (linked from our rate limit documentation) to request adjustments to the "certificates per registered domain" rate limit. This should alleviate your concerns regarding having to renew too many certificates at a time.

Since you're shuffling subdomains between certificates at random, you could also consider getting a single wildcard certificate instead, reducing the number of validations you have to perform (and the number of CAA checks we have to do) from hundreds to just 1.

Let's Encrypt re-uses completed domain control validations (also known as Authorizations in the ACME protocol) for up to 30 days after they are completed -- in other words, successfully completing validation for a domain name means you won't have to re-do validation for that name for 30 days. (The Baseline Requirements allow up to 398 days, but we like to do better than that.)

Similarly, Let's Encrypt re-uses CAA determinations for up to 7 hours after they're retrieved -- in other words, if we see a CAA record that says we're allowed to issue for a given domain name, we will continue to trust that without re-checking for 7 hours. (The Baseline Requirements allow up to 8 hours, but we like to have a buffer.)
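
A toy illustration of the two windows (timestamps are hypothetical; the durations are the ones quoted above):

```python
from datetime import datetime, timedelta

AUTHZ_REUSE = timedelta(days=30)  # validation reuse window quoted above
CAA_REUSE = timedelta(hours=7)    # CAA reuse window quoted above

validated_at = datetime(2023, 10, 3, 9, 0)    # hypothetical first success
retry_at = validated_at + timedelta(hours=5)  # hypothetical retry, 5 h later

age = retry_at - validated_at
print("redo domain control validation?", age > AUTHZ_REUSE)  # False
print("redo CAA check?", age > CAA_REUSE)  # False: still inside the 7 hours
```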

5 Likes