Let's Encrypt is failing to issue / renew existing certificates with a large number of SAN records

My domain is: Multiple certificate requests are failing. Please see previous posts with LE where we haven't received a long-term fix:

Over the last few months, we have noticed an increased number of failures when LE performs DNS lookups or CAA validation for certificates with a large number of SAN hostnames managed on Akamai's certificate provisioning system.

Based on all the previous cases, the issue has been SERVFAIL / DNS lookup failures that LE systems run into while resolving all the associated SAN hostnames. Looking at Akamai's name server logs, we can confirm that there are no SERVFAIL / DNS failures when contacting our NS machines. All the checks provided by LE also pass for the failed hostnames, indicating the problem isn't related to individual DNS settings or queries. Further checks indicate that the rate limiting and the resulting errors are due to the increased rate of DNS queries going to the TLD name servers.

  1. Would LE be able to provide details on which part of the DNS lookup is resulting in the error?
  2. How can LE ensure the validation DNS queries are paced out to avoid any rate limiting?
  3. What are the best practices from LE on the number of SAN hostnames per certificate?

Quick question: at what exact time is your renewal happening?

I'm going to investigate this in the next couple days. I'll report back.

It depends on when the certificate was last issued. LE certificates are valid for 90 days and renewal kicks off 1 - 2 weeks before the certificate expires.

Because of the CAA validation across all the SAN hostnames, the renewal gets delayed. The only option for our customers using LE certificates now is to keep revalidating until it eventually succeeds.

The last resort, which is not ideal, is to have multiple LE certificates so the SAN hostnames are split up and each certificate carries fewer of them. But this needs to be followed by additional changes to ensure the hostnames also start pointing to the new certificates.

Why not start sooner?
Why not try more times per day?

To add clarity: failures are seen during renewal as well as when creating new LE certificates. No matter how early the renewal is kicked off or when a certificate is created, the issue persists and delays both renewal and certificate creation due to these failures.

The LE cert is valid for 90 days, and we expect customers to use it for close to its full validity period rather than shortening it.

He was asking about the time the renewal runs. The concern was whether you run at known busy times like the top of an hour.

You are missing my [unspoken] point.
I doubt that the failures are 100%.
Meaning that some part of it may well be passing - it just fails in total.
Those partial successes may be cached and with each attempt more and more "pieces" may also succeed. [note that some validations may be cached up to 30 days]
The more you try, the closer you may get to 100%.

Granted, that approach won't do much for initial cert issuances, as it would delay them for perhaps a dozen attempts [guessing here].
But a dozen renewal attempts can be done in under a week [when attempting only twice a day].
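
To put rough numbers on the "more tries, closer to 100%" idea, here is a minimal Python sketch. It is a toy model, not a description of how Let's Encrypt actually behaves: it assumes every hostname passes validation independently with the same per-attempt probability, and that a pass is remembered for later attempts. The function name `prob_all_validated` and the figures (100 hostnames, 70% pass rate) are made up for illustration.

```python
def prob_all_validated(n_hosts: int, p_success: float, attempts: int) -> float:
    """Chance that every hostname has passed at least once after `attempts` tries.

    Assumes each hostname passes independently with probability p_success
    on every attempt, and that a pass is remembered for later attempts.
    """
    p_host_passed = 1.0 - (1.0 - p_success) ** attempts
    return p_host_passed ** n_hosts


# Hypothetical numbers: 100 SAN hostnames, each with a 70% per-attempt pass rate.
for attempts in (1, 3, 6, 12):
    days = attempts / 2  # two attempts per day, as suggested above
    print(f"{attempts:2d} attempts (about {days:g} days): "
          f"{prob_all_validated(100, 0.7, attempts):.4f}")
```

Under these assumptions a single attempt almost never gets all hostnames through, while repeated attempts push the odds up quickly. Real failures are unlikely to be independent, so treat this as intuition only.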

Thank you for the additional clarity.

He was asking about the time the renewal runs. The concern was whether you run at known busy times like the top of an hour.

We have customers across the globe. The renewal / revalidation requests are run at random times, typically during business hours for the customers, which may not be business hours for the DNS servers or the LE systems doing the validation. Retries are also attempted at random times throughout the day.

I doubt that the failures are 100%.
Meaning that some part of it may well be passing - it just fails in total.
Those partial successes may be cached and with each attempt more and more "pieces" may also succeed. [note that some validations may be cached up to 30 days]
The more you try, the closer you may get to 100%.

Correct, the failures are not 100%. They also occur on random SAN hostnames. As for caching and attempting again, I believe this is something we need the LE team to help with, as their systems are the ones doing the DNS lookup / CAA validation. Forcing revalidation on the customer side only triggers LE systems to attempt CAA validation again.

Hi, just wanted to follow up and see if you have any updates from your side? This is still an issue for our customers, and so far we have been helping them split their certificates into smaller batches of 10 - 20 SAN hostnames, which have a better chance of completing the CAA validation.
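
For reference, the splitting can be sketched in a few lines of Python. This is only an illustration: `split_sans` is a hypothetical helper, the `example.com` hostnames are placeholders, and the default batch size of 20 is simply the upper end of the range mentioned above, not a Let's Encrypt recommendation.

```python
def split_sans(hostnames: list[str], batch_size: int = 20) -> list[list[str]]:
    """Split SAN hostnames into batches of at most batch_size, one per certificate."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Normalise case and trailing dots, then drop duplicates while keeping order.
    unique = list(dict.fromkeys(h.lower().rstrip(".") for h in hostnames))
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]


# Placeholder hostnames, for illustration only.
sans = [f"host{i}.example.com" for i in range(45)]
for n, batch in enumerate(split_sans(sans), start=1):
    print(f"certificate {n}: {len(batch)} hostnames")
```

Each batch still needs its own certificate, and the hostnames then have to be pointed at the right one, as noted earlier in the thread.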

Team,

Just following up on the above :-)

Unfortunately, we are having exactly the same issue with another cert.

This time the CN is: dev.digital.iag.com.au

Thanks!

Not sure if this is part of the problem [or not]:

nslookup -q=ns digital.iag.com.au
digital.iag.com.au      canonical name = digital.iag.com.au.edgekey.net
digital.iag.com.au.edgekey.net  canonical name = e108809.x.akamaiedge.net

Sorry, I haven’t had time yet to collect any additional diagnostic data into why Akamai nameservers are returning errors. Due to our large traffic volume it’s not easy to isolate.

Just so you are looking in the right spot, we don't see any errors on Akamai nameservers and believe the rate limiting is between LE validation systems and the .com.au root/TLD nameservers on CAA record lookup.
Thanks

If that is in fact the case, then you might be able to work around the issue by having a CAA record on each of the domain names used in the certificate, so that Let's Encrypt wouldn't need to check the TLD & root nameservers for a CAA record at all.

(I find it kind of weird that CAA always requires checking all the levels up to the root; the .com TLD nameserver must need to respond a bazillion times a day that there isn't a CAA record at that TLD level. Or at least there should be a longer cache time allowed for no-records-found at the TLD level.)
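
To make the climb concrete, here is a small Python sketch of the lookup order, based on a reading of the tree-climbing rule in RFC 8659 (which stops at the TLD, just short of the root itself). It does no real DNS: `caa_search_names`, `names_queried`, and the `zone_caa` dictionary are hypothetical names for illustration, and CNAME handling and DNSSEC are ignored.

```python
def caa_search_names(fqdn: str) -> list[str]:
    """Names checked for CAA, most specific first, ending at the TLD."""
    labels = fqdn.rstrip(".").lower().split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def names_queried(fqdn: str, zone_caa: dict[str, list[str]]) -> list[str]:
    """Simulate the climb: stop at the first name with a non-empty CAA record set."""
    queried = []
    for name in caa_search_names(fqdn):
        queried.append(name)
        if zone_caa.get(name):
            break
    return queried


host = "dev.digital.iag.com.au"

# No CAA records anywhere: the climb goes all the way up to "au".
print(names_queried(host, {}))

# Hypothetical CAA record on the hostname itself: the climb stops immediately.
print(names_queried(host, {host: ['0 issue "letsencrypt.org"']}))
```

In this model, a CAA record set on the exact hostname means the parent zones and the TLD are never consulted for that name, which is the point of the workaround.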

This may not be helpful to this situation as I am not a DNS wizard. But, using DNSViz I see a13-66.akam.net is not responsive to UDP requests over its IPv4 address.

Probably something you want fixed anyway.

https://dnsviz.net/d/dev.digital.iag.com.au/dnssec/

Thanks Mike.

Akamai's external authoritative name service, Edge DNS, includes a number of unique features to help customers fully realize the benefits of Anycast routing. From a reliability perspective, Akamai has over 300 points of presence (PoPs) across the globe, and standard customer traffic typically consumes less than 1% of total nameserver capacity. In addition, each customer is assigned a unique combination of six "clouds," or Anycast IPs, to properly load balance client queries. So at any given point in time, even if one of the six Anycast NS isn't reachable for whatever reason, the DNS system is designed to fail over to the remaining NS to ensure the DNS lookup succeeds.

Thanks Peter.

We are trying this suggestion to see if it helps avoid the multi-level CAA record lookups by the LE system.

While that's true for general internet use, Let's Encrypt DNS validation looks at a subset of your nameservers, and if any of them disagree or SERVFAIL, your validation will fail. So for domain validation you need healthy nameservers and cannot rely on redundancy. If a nameserver is not behaving, you need to remove it or risk validation failures.
