Let's Encrypt is failing to issue / renew existing certificates with large number of SAN records

gkulanga · June 21, 2023, 2:52am


My domain is: Multiple certificate requests are failing. Please see previous posts with LE where we haven't received a long-term fix:

Over the last few months we have noticed an increased number of failures when LE performs DNS lookups or CAA validation for certificates with a large number of SAN hostnames managed on Akamai's certificate provisioning system.

Based on all the previous cases, the issue has been a SERVFAIL / DNS lookup failure that LE systems run into while resolving the associated SAN hostnames. Looking at Akamai's name server logs, we can confirm that there are no SERVFAIL / DNS failures when contacting our NS machines. All the checks provided by LE also pass for the failed hostnames, indicating that the problem is not with individual DNS settings or queries. Further checks indicate that the rate limiting and the resulting errors are due to the increased rate of DNS queries going to the TLD name servers.

Would LE be able to provide details on which part of the DNS lookup is resulting in the error?
How can LE ensure the validation DNS queries are paced out to avoid any rate limiting?
What are the best practices from LE on the number of SAN hostnames per certificate?

Osiris · June 21, 2023, 5:32am

Quick question: at what exact time is your renewal happening?

mcpherrinm · June 22, 2023, 4:44pm

I'm going to investigate this in the next couple days. I'll report back.

gkulanga · June 23, 2023, 1:02am

It depends on when the certificate was last issued. LE certificates are valid for 90 days and renewal kicks off 1 - 2 weeks before the certificate expires.

Due to the CAA validation across all the SAN hostnames, the renewal gets delayed. The only option for our customers using LE certificates right now is to keep revalidating until it eventually succeeds.

The last resort, which is not ideal, is to split the SAN hostnames across multiple LE certificates to reduce the number per certificate. But this needs to be followed by additional changes to ensure the hostnames also start pointing to the new certificates.

rg305 · June 23, 2023, 1:54am

Why not start sooner?
Why not try more times per day?

gkulanga · June 23, 2023, 2:13am

To add clarity - failures are seen during renewal as well as when creating new LE certificates. No matter how early the renewal is kicked off or when a certificate is created, the issue persists and delays both renewal and certificate creation.

The LE cert is valid for 90 days, and we expect customers to use it for close to its full validity period rather than shortening it.

MikeMcQ · June 23, 2023, 2:31am

He was asking about the time the renewal runs. The concern was whether you run at known busy times like the top of an hour.

rg305 · June 23, 2023, 2:35am

You are missing my [unspoken] point.
I doubt that the failures are 100%.
Meaning that some part of it may well be passing - it just fails in total.
Those partial successes may be cached and with each attempt more and more "pieces" may also succeed. [note that some validations may be cached up to 30 days]
The more you try, the closer you may get to 100%.

Granted, that approach won't do much for initial cert issuances, as it would delay them for perhaps a dozen attempts [guessing here].
But a dozen renewal attempts can be done in under a week [when attempting only twice a day].
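
A rough sketch of that reasoning, under two simplifying assumptions that are not confirmed LE behaviour: each hostname's check succeeds independently with probability `p` on any single attempt, and a hostname that has passed once stays passed on later attempts. The function names and the value `p = 0.9` are purely illustrative.

```python
def p_all_validated(p: float, n_hostnames: int, attempts: int) -> float:
    """Probability that every hostname has passed at least once after `attempts` tries.

    Assumes independent checks with per-attempt success probability `p`,
    and that a success is remembered (cached) for later attempts.
    """
    p_one = 1.0 - (1.0 - p) ** attempts  # one hostname passes at least once
    return p_one ** n_hostnames


def days_needed(attempts: int, attempts_per_day: int) -> float:
    """Calendar days required to make `attempts` tries at a fixed daily rate."""
    return attempts / attempts_per_day


if __name__ == "__main__":
    # Illustrative values only: p = 0.9 is a made-up per-hostname success rate.
    for k in (1, 2, 3, 12):
        print(k, round(p_all_validated(0.9, 60, k), 4))
    print(days_needed(12, 2))  # a dozen attempts at twice a day
```

Under the same assumptions, the single-attempt chance is `p ** n_hostnames`, so it rises quickly as the number of hostnames per certificate goes down.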

gkulanga · June 23, 2023, 2:54am

Thank you for the additional clarity.

> He was asking about the time the renewal runs. The concern was whether you run at known busy times like the top of an hour.

We have customers across the globe. The renewal / revalidation requests run at random times - typically during business hours for the customers, which may not be business hours for the DNS servers or the LE systems doing the validation. Retries are also attempted at random times through the day.

> I doubt that the failures are 100%.
> Meaning that some part of it may well be passing - it just fails in total.
> Those partial successes may be cached and with each attempt more and more "pieces" may also succeed. [note that some validations may be cached up to 30 days]
> The more you try, the closer you may get to 100%.

Correct, the failures are not 100%, and they occur on random SAN hostnames. As for caching and attempting again, I believe this is something we need the LE team to help with, as their systems are the ones doing the DNS lookup / CAA validation. Forcing revalidation on the customer side only triggers the LE systems to attempt CAA validation again.

gkulanga · July 4, 2023, 1:54am

Hi, just wanted to follow up and see if you have any updates from your side. This is still an issue for our customers, and so far we have been helping them split their certificates into smaller batches of 10 - 20 SAN hostnames, which have a better chance of completing CAA validation.
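
For illustration, a minimal sketch of that kind of split. The function name, the batch size default, and the hostnames are hypothetical; this is not Akamai's provisioning logic.

```python
from typing import List


def split_sans(hostnames: List[str], max_per_cert: int = 20) -> List[List[str]]:
    """Split a SAN list into consecutive batches of at most `max_per_cert` names."""
    if max_per_cert < 1:
        raise ValueError("max_per_cert must be at least 1")
    return [
        hostnames[i:i + max_per_cert]
        for i in range(0, len(hostnames), max_per_cert)
    ]


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    sans = [f"host{i}.example.com" for i in range(45)]
    for batch in split_sans(sans, max_per_cert=20):
        print(len(batch), batch[0], "...", batch[-1])
```

Each batch would then be requested as its own certificate, and the hostnames repointed to whichever certificate covers them.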

ShahriarNK · July 27, 2023, 4:06am

Team,

Just following up on the above :-)

Unfortunately, we are having exactly the same issue with another cert.

This time the CN is: dev.digital.iag.com.au

Thanks!

rg305 · July 27, 2023, 4:14am

Not sure if this is part of the problem [or not]:

nslookup -q=ns digital.iag.com.au
digital.iag.com.au      canonical name = digital.iag.com.au.edgekey.net
digital.iag.com.au.edgekey.net  canonical name = e108809.x.akamaiedge.net

mcpherrinm · July 27, 2023, 4:51am

Sorry, I haven’t had time yet to collect any additional diagnostic data into why Akamai nameservers are returning errors. Due to our large traffic volume it’s not easy to isolate.

MarkusR · July 27, 2023, 6:57am

Just so you are looking in the right spot, we don't see any errors on Akamai nameservers and believe the rate limiting is between LE validation systems and the .com.au root/TLD nameservers on CAA record lookup.
Thanks

petercooperjr · July 27, 2023, 1:50pm

If that is in fact the case, then you might be able to work around the issue by having a CAA record on each of the domain names used in the certificate, so that Let's Encrypt wouldn't need to check the TLD & root nameservers for a CAA record at all.

(I find it kind of weird that CAA always requires checking all the levels up to the root; the .com TLD nameserver must need to respond a bazillion times a day that there isn't a CAA record at that TLD level. Or at least there should be a longer cache time allowed for no-records-found at the TLD level.)
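
To make the lookup order concrete, here is a minimal sketch of the tree-climbing search as RFC 8659 describes it: start at the full hostname, move up one label at a time, and stop at the first name that has a CAA record set. The `lookup` callable and the zone data are stand-ins (hypothetical) for real DNS queries, and the sketch ignores CNAME handling, which matters for hostnames that are aliases.

```python
from typing import Callable, List, Tuple


def caa_search_path(fqdn: str) -> List[str]:
    """Names checked for CAA, from the full hostname up to the TLD."""
    labels = fqdn.rstrip(".").lower().split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def relevant_caa(
    fqdn: str, lookup: Callable[[str], List[str]]
) -> Tuple[List[str], List[str]]:
    """Climb the tree until a non-empty CAA set is found.

    `lookup` is a stand-in for a real DNS query (hypothetical here).
    Returns (records, names_queried).
    """
    queried: List[str] = []
    for name in caa_search_path(fqdn):
        queried.append(name)
        records = lookup(name)
        if records:
            return records, queried
    return [], queried


if __name__ == "__main__":
    # Fake zone data for illustration; real lookups would go to DNS.
    no_caa = {}
    with_caa = {"www.example.com.au": ['0 issue "letsencrypt.org"']}
    for zone in (no_caa, with_caa):
        records, queried = relevant_caa(
            "www.example.com.au", lambda n: zone.get(n, [])
        )
        print(queried, records)
```

With no CAA record anywhere, every level up to the TLD gets queried; with a record on the hostname itself, the search stops after the first query.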

MikeMcQ · July 27, 2023, 2:21pm

This may not be helpful to this situation as I am not a DNS wizard. But, using DNSViz I see a13-66.akam.net is not responsive to UDP requests over its IPv4 address.

Probably something you want fixed anyway.

https://dnsviz.net/d/dev.digital.iag.com.au/dnssec/

gkulanga · July 31, 2023, 1:39am

Thanks Mike.

Akamai's external authoritative name service, Edge DNS, includes a number of unique features to help customers fully realize the benefits of Anycast routing. From a reliability perspective, Akamai has over 300 points of presence (PoPs) across the globe, and standard customer traffic typically consumes less than 1% of total nameserver capacity. In addition, each customer is assigned a unique combination of six "clouds," or Anycast IPs, to properly load balance client queries. So at any given point in time, even if one of the six anycast NS isn't reachable for whatever reason, the DNS system is designed to fail over to the remaining NS to ensure the DNS lookup succeeds.

gkulanga · July 31, 2023, 1:40am

Thanks Peter.

We are trying this suggestion to see if it helps avoid the multi-level CAA record lookup by the LE systems.

webprofusion · July 31, 2023, 2:53am

While that's true for general internet use, Let's Encrypt DNS validation looks at a subset of your nameservers and if any of them disagree or SERVFAIL then your validation will fail, so for domain validation you need healthy nameservers and cannot rely on redundancy. If a nameserver is not behaving you need to remove it or risk validation failures.
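
A minimal sketch of the strict policy described above, useful as a self-check before requesting a certificate. This is not Let's Encrypt's actual code; the function name, the nameserver names, and the answer strings are hypothetical. In practice the per-nameserver answers would come from querying each authoritative server directly (for example with `dig @nameserver`).

```python
from typing import Dict, Optional, Tuple


def strict_check(answers: Dict[str, Optional[str]]) -> Tuple[bool, str]:
    """Pass only if every nameserver answered and all answers are identical.

    `answers` maps nameserver name -> response text, or None for a
    SERVFAIL or timeout.
    """
    if not answers:
        return False, "no nameservers checked"
    failed = sorted(ns for ns, ans in answers.items() if ans is None)
    if failed:
        return False, "no answer from: " + ", ".join(failed)
    if len(set(answers.values())) > 1:
        return False, "nameservers disagree"
    return True, "ok"


if __name__ == "__main__":
    # Hypothetical nameservers and answers, for illustration only.
    print(strict_check({"ns1.example.net": "A 192.0.2.1", "ns2.example.net": "A 192.0.2.1"}))
    print(strict_check({"ns1.example.net": "A 192.0.2.1", "ns2.example.net": None}))
```

Unlike an ordinary resolver, which moves on to the next server, this check treats a single unresponsive or inconsistent nameserver as a failure.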

system · August 30, 2023, 2:53am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
