The 5 certs that were failing renewals yesterday have now succeeded. Was something changed on the Let's Encrypt side? I'm wondering if this will happen again.
Can you tell us specifically what errors occurred after the Unbound DNS 1.18.0 upgrade? How can we ensure that our side is compliant?
As gurjit said, their DNS server had an EDNS compliance issue which was surfaced by our Unbound upgrade to 1.18. If the compliance tester is happy with your zone, you should be fine.
Even with EDNS compliance on our side, we're still seeing intermittent failures. Is there any way for you to see more detailed errors/logs on your side? The error message is too vague for us to know what to change.
Since these are CAA SERVFAIL problems, my first thought is rate-limiting, because the CAA algorithm sends a lot of DNS requests in a short amount of time.
The next thing on my to-do list is to increase the number of validation hosts we have, which might help us spread out the requesting IP addresses and reduce per-IP rate limit problems. (We've been all-hands-on-deck over in the other thread this morning.)
Do I understand correctly that SERVFAIL comes from your Unbound DNS recursive resolver and your client application is seeing the SERVFAIL while the resolver is having some issues querying the DNS authority (either Route53 or our DNS infra) or thinks that the responses it's getting are invalid? Is there a way to tell which of those is the case?
SERVFAIL is coming from Unbound, yes, and I'm seeing some of those returns for your domain in under 2s, so they're not timeouts.
Only if it happens consistently, so that we can dump a detailed log. As you can probably understand, we can't handle a megabyte of logging per DNS lookup, which is roughly what Unbound produces at max verbosity.
But all-said, the consistent issue we see with CAA is: Our CAA checks, right before final issuance, trigger rate limiting at the DNS server. That either shows up as a timeout or a SERVFAIL. It's not guaranteed that this is what's happening here, but it's my best guess at present.
CAA is a common culprit for this because it's not one query, it's a bunch all at once to implement the algorithm for finding the "Relevant Resource Record Set" from RFC 8659, and the more labels in the DNS name (like yours there), the more leaves have to be checked.
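That climbing search can be sketched roughly like this. This is a toy Python model, not Let's Encrypt's actual implementation: `caa_lookup` and the `zone` data are stand-ins for real DNS queries, and the domain names are made up. It shows why each extra label costs another query, and why a CAA set near the leaf ends the search early.

```python
# Sketch of RFC 8659's "Relevant RRset" search, with a stubbed
# caa_lookup() in place of real DNS queries. Illustrative only.

def relevant_caa_set(domain, caa_lookup):
    """Walk from the full name toward the root, one label at a time,
    returning the first non-empty CAA record set found."""
    labels = domain.rstrip(".").split(".")
    queries = 0
    while labels:
        name = ".".join(labels)
        queries += 1
        rrset = caa_lookup(name)
        if rrset:                      # first non-empty set wins
            return rrset, queries
        labels = labels[1:]            # climb to the parent
    return [], queries                 # no CAA anywhere: issuance allowed

# Hypothetical zone data: CAA only at the registered domain.
zone = {"example.com": ['0 issue "letsencrypt.org"']}
lookup = lambda name: zone.get(name, [])

# A deep name costs one query per label until a CAA set is found.
rrset, n = relevant_caa_set("a.b.c.example.com", lookup)
print(rrset, n)  # ['0 issue "letsencrypt.org"'] 4
```

With no CAA record anywhere in the tree, the search has to walk every label without ever returning early, which is exactly the many-queries-in-a-burst pattern described above.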
Thank you. Is this mainly happening for our domains or across your subscriber base? If it's specific to us, I'm wondering if there's anything we can do to reduce the number of queries. We don't/didn't have CAA set for our domains but are exploring whether adding it helps.
If that's actually the issue -- the number of CAA queries against your NS triggering some protections -- then a Relevant CAA Set would shortcut the algorithm, which returns early when one is encountered.
So it might be worth a try? Otherwise talk to your nameserver operator about what query limits they have that might affect CAA validations?
We're talking about doing things to space out those CAA checks, but it's not trivial: because the checks can be legitimately slow, spacing them out leads us to needing asynchronous finalization... which broke a lot of clients last time we enabled it.
Thanks for all the awesome work being put into this. Unfortunately we cannot update DNS for all of our customers, so we hope there will be a solution for the sporadic issues we are seeing: when creating multi-SAN certificates with up to 99 domain challenges, 1-2 certificates fail on finalization per run.
We are unable to run in --dryrun mode, so we have halted all certificate generation since yesterday to avoid hitting rate limits.
@ITNiels Are you able to reduce the number of SANs per cert? That would increase the odds of getting at least some issued.
Is there a pattern for the ones that fail? Such as being the same DNS provider?
One work-around that has helped others is to add a CAA record. This eliminates some of the lookups. If there is a pattern of failure maybe try this only for those.
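For reference, the record itself is a one-liner in the zone. A hedged example with a placeholder domain and TTL (`letsencrypt.org` is the issue value that permits Let's Encrypt):

```
; Hypothetical zone snippet: a CAA record at the apex lets the
; validator's climbing search stop here instead of continuing upward.
example.com.  3600  IN  CAA  0 issue "letsencrypt.org"
```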
Thank you for your reply.
They are from different providers and different TLDs.
We are unfortunately limited a bit by our infrastructure in how many total certificates we can create and replace, and lowering the number of SANs would increase the number of certificates needed. We are currently running 11 certificates total, and because domains are not "hard assigned" to a certificate, we cannot even cache some of them, as they might change if the order changes between runs. (It's not a great system and we want to rewrite it, but that is a bit out of scope for right now.)
And we are not in control of our customers DNS, so we cannot add CAA records either.
So hoping this can be solved and we can resume until we can get a new system built.
Correct, not every certificate experienced issues, but if even one fails then we currently discard them all, and of the 8+ certificates that have changes, at least 1-2 fail per run.
There were some repeats, but also others that were not.
I can try re-enabling the service to see if something has changed since the last run, but I don't like leaving it up to luck, or risking being rate limited.
If there is a solution in the works I would rather wait for it?
Doesn't sound imminent to me. And it may not even help your situation, depending on the root cause. And if it requires the ACME client to handle async finalize, your very old client may not support that.
Does your client create a new ACME account every time it runs?
If your client uses a consistent ACME account, then you should be able to get issuance even for these 100-name certificates by running it a few times in a row. If 95 names succeed both domain control validation and CAA checks, then those 95 names won't need to be checked at all the next time -- only the 5 failed domain names will need to have domain control validation and CAA checks redone. If the failures are truly random (e.g. due to CAA checks flooding the DNS servers), then the reduction of traffic volume on the retry should help significantly.
The only requirement for this to work is that 1) the same acme account be used, and 2) the attempts be less than 7 hours apart -- any more than that and the CAA checks will have to be redone anyway.
We are running the service so that we retry every 3 hours, with a maximum of 3 retries before it halts itself and waits for human intervention so as not to hit rate limits. We are using the same account every time, but were still seeing the issues.
I will see how far I can stretch our certificates to reduce the number of SANs, but all common names share the same parent domain, sslX.ipaper.io, so it just puts us closer to rate limits the more certificates we try.
I did not know that only the failed ones have to be retried within 7 hours. Does that also apply if it is a new order but with mostly the same domains?
You can always use our rate limit adjustment form (linked from our rate limit documentation) to request adjustments to the "certificates per registered domain" rate limit. This should alleviate your concerns regarding having to renew too many certificates at a time.
Since you're shuffling subdomains between certificates at random, you could also consider getting a single wildcard certificate instead, reducing the number of validations you have to perform (and the number of CAA checks we have to do) from hundreds to just 1.
Let's Encrypt re-uses completed domain control validations (also known as Authorizations in the ACME protocol) for up to 30 days after they are completed -- in other words, successfully completing validation for a domain name means you won't have to re-do validation for that name for 30 days. (The Baseline Requirements allow up to 398 days, but we like to do better than that.)
Similarly, Let's Encrypt re-uses CAA determinations for up to 7 hours after they're retrieved -- in other words, if we see a CAA record that says we're allowed to issue for a given domain name, we will continue to trust that without re-checking for 7 hours. (The Baseline Requirements allow up to 8 hours, but we like to have a buffer.)
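The two reuse windows together can be sketched as follows. This is a toy model using only the figures from this thread (30 days for authorizations, 7 hours for CAA); the server tracks all of this for you, so it's purely illustrative.

```python
# Toy model of the two reuse windows described above: a cached result
# only needs re-checking once it has aged out of its window.
from datetime import datetime, timedelta

AUTHZ_REUSE = timedelta(days=30)   # completed domain-control validations
CAA_REUSE = timedelta(hours=7)     # cached CAA determinations

def needs_recheck(completed_at, now, window):
    """True if a cached result has aged out of its reuse window."""
    return now - completed_at > window

now = datetime(2024, 1, 10, 12, 0)
validated = datetime(2024, 1, 1, 12, 0)   # validated 9 days ago

print(needs_recheck(validated, now, AUTHZ_REUSE))  # False: authz still fresh
print(needs_recheck(validated, now, CAA_REUSE))    # True: CAA must be redone
```

This is why a retry a few hours after a partial failure is cheap: the successful authorizations are still inside their 30-day window, and only the failed names generate fresh validation and CAA traffic.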