Problem generating cert with large number of SANs (60+)

My domain is:

Can’t enter this because of 120 link limitation per post (see certbot command below).

I ran this command:

certbot-auto certonly --webroot -w [docroot] -d sites-e.larc.nasa.gov -d aab.larc.nasa.gov -d adt.larc.nasa.gov -d aero.larc.nasa.gov -d aeroelasticity.larc.nasa.gov -d ampb.larc.nasa.gov -d artcontest.larc.nasa.gov -d basketball.larc.nasa.gov -d colloqsigma.larc.nasa.gov -d commonresearchmodel.larc.nasa.gov -d ddtrb.larc.nasa.gov -d eds.larc.nasa.gov -d engineering.larc.nasa.gov -d environmental.larc.nasa.gov -d exhibits.larc.nasa.gov -d fileplottingtools.larc.nasa.gov -d flightsimulation.larc.nasa.gov -d gameon.nasa.gov -d gcd.larc.nasa.gov -d larc-exchange.larc.nasa.gov -d lasersdbw.larc.nasa.gov -d latinawomen.larc.nasa.gov -d lbpw.larc.nasa.gov -d leag.larc.nasa.gov -d locrwg.larc.nasa.gov -d matb.larc.nasa.gov -d microspecklestamps.larc.nasa.gov -d nga.larc.nasa.gov -d occ.larc.nasa.gov -d odeo.larc.nasa.gov -d overflow.larc.nasa.gov -d paw.larc.nasa.gov -d post2.larc.nasa.gov -d pto.larc.nasa.gov -d researchdirectorate.larc.nasa.gov -d researchtech.larc.nasa.gov -d sacd.larc.nasa.gov -d stab.larc.nasa.gov -d sw-eng.larc.nasa.gov -d tetruss.larc.nasa.gov -d uqtools.larc.nasa.gov -d scifli.larc.nasa.gov -d csaob.larc.nasa.gov -d eve.larc.nasa.gov -d education.larc.nasa.gov -d sepg.larc.nasa.gov -d essp.nasa.gov -d hpcincubator.larc.nasa.gov -d transitionmodeling.larc.nasa.gov -d odm.larc.nasa.gov -d skywatchers.larc.nasa.gov -d aeronautics.larc.nasa.gov -d activate.larc.nasa.gov -d pwix.larc.nasa.gov -d larcsos.larc.nasa.gov -d blueskyradiation.larc.nasa.gov -d science-people.larc.nasa.gov -d vspu.larc.nasa.gov -d winds-lidar-group.larc.nasa.gov -d stabserv.larc.nasa.gov -d arcstone.larc.nasa.gov -d act-america.larc.nasa.gov -d capable.larc.nasa.gov -d clarreo-pathfinder.larc.nasa.gov -d gewex-srb.larc.nasa.gov -d discover-aq.larc.nasa.gov -d science-edu.larc.nasa.gov

It produced this output:

The top level errors look like this (I’ve included them from a few attempts to do the same thing, since the DNS name that shows up first and the total failed validations are always different):

An unexpected error occurred:

Certification Authority Authorization (CAA) records forbid the CA from issuing a certificate :: Error finalizing order :: Rechecking CAA for “sw-eng.larc.nasa.gov” and 24 more identifiers failed. Refer to sub-problems for more information

An unexpected error occurred:

Certification Authority Authorization (CAA) records forbid the CA from issuing a certificate :: Error finalizing order :: Rechecking CAA for “eds.larc.nasa.gov” and 31 more identifiers failed. Refer to sub-problems for more information

An unexpected error occurred:

Certification Authority Authorization (CAA) records forbid the CA from issuing a certificate :: Error finalizing order :: Rechecking CAA for “locrwg.larc.nasa.gov” and 33 more identifiers failed. Refer to sub-problems for more information

The lower level errors look like this:

"detail": “Error finalizing order :: While processing CAA for sw-eng.larc.nasa.gov: DNS problem: SERVFAIL looking up CAA for larc.nasa.gov - the domain’s nameservers may be malfunctioning”,

I’ve already talked to our DNS admins, and they see nothing wrong on our DNS servers. The only thing that might be an issue that I can see is rate limiting: looks like for 60+ CNAMES the letsencrypt validation server(s) hit the DNS servers hard enough to trigger it.

Any way to slow down the letsencrypt validation checks? I wouldn’t mind waiting a second per CNAME if it meant it worked. :slight_smile:

Note that I can successfully generate a cert with a subset of the names, so that’s more confirmation to me it’s the scale of the request and not any malformed DNS entries.

Version of certbot:

certbot 1.7.0

The three workarounds I can think of:

  1. You can use a pre/post hook in certbot to call a shell script or Python script that sleeps for 1 second (or more). That would slow down the validations.

  2. Is there any chance you could use a wildcard cert for larc.nasa.gov + *.larc.nasa.gov?

  3. Generate 10 different certificates, each covering a subset of the names, then generate a full certificate for all 60+ names. Let's Encrypt caches successful validations for a short period of time, so they should not be retried, and you would ideally get the full certificate automatically without any challenges.

Certbot’s documentation on hooks:

They only work with the manual plugin, but you’re using certonly so that is not an issue.

I won’t ever believe Let’s Encrypt is DOSsing three nasa.gov (1) authoritative nameservers with ~200 queries.

Any chance your sysadmins apply some rate limiting to those?

(1):

% dig ns larc.nasa.gov +short
ns1.nasa.gov.
ns3.nasa.gov.
ns2.nasa.gov.

Hi @nelsonph

checking one of your subdomains - https://check-your-website.server-daten.de/?q=sw-eng.larc.nasa.gov

You have a CNAME:

| Host | Type | IP-Address | is auth. | ∑ Queries | ∑ Timeout |
|---|---|---|---|---|---|
| sw-eng.larc.nasa.gov | CNAME | sites-e.larc.nasa.gov | yes | 1 | 0 |
| | A | 198.119.166.167 Forest Park/Georgia/United States (US) - National Aeronautics and Space Administration Hostname: sites-e.larc.nasa.gov | yes | | |
| | AAAA | 2001:4d0:2340:4001::20a7 Ashburn/Virginia/United States (US) - National Aeronautics and Space Administration | yes | | |

So sites-e.larc.nasa.gov is checked if there is a CAA, then the parent of sw-eng.larc.nasa.gov, larc.nasa.gov.
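To make the lookup order concrete, here is a minimal Python sketch of the tree-climbing search described in RFC 8659: the name itself is queried first, then each parent in turn, and the search stops at the first name that has a CAA record set. The function name `caa_search_order` is made up for illustration, and the sketch deliberately leaves out CNAME handling, which the resolver does on the first query.

```python
from typing import List


def caa_search_order(hostname: str) -> List[str]:
    """Return the names that may be queried for CAA, closest first.

    Illustrative only: a real lookup stops at the first name that returns
    a CAA record set, and the resolver follows CNAMEs along the way.
    """
    labels = hostname.rstrip(".").lower().split(".")
    # Drop one leading label at a time; the DNS root itself is never queried.
    return [".".join(labels[i:]) for i in range(len(labels))]


if __name__ == "__main__":
    for name in caa_search_order("sw-eng.larc.nasa.gov"):
        print(name)
```

If the leaf names carry no CAA records of their own, every one of these searches moves on to larc.nasa.gov, which would fit the SERVFAIL message naming that zone.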

Checking one domain, there are no critical name server problems. But you have more than 60 domain names.

You can try:

  • Create a CAA record with sites-e.larc.nasa.gov as the domain name; then the parents aren't checked. That may stop the problem with larc.nasa.gov.
  • Reduce the number of domain names.
  • It may be a temporary problem because of the current service status (Partial Service Disruption), so wait one day; it may work again then.

That doesn't help, because CAA entries are not cached; they must always be checked (that was the bug from around 2020-03-04, I think).

I don't think so.


  --pre-hook PRE_HOOK   Command to be run in a shell before obtaining any
                        certificates. Intended primarily for renewal, where it
                        can be used to temporarily shut down a webserver that
                        might conflict with the standalone plugin. This will
                        only be called if a certificate is actually to be
                        obtained/renewed. When renewing several certificates
                        that have identical pre-hooks, only the first will be
                        executed. (default: None)
  --post-hook POST_HOOK
                        Command to be run in a shell after attempting to
                        obtain/renew certificates. Can be used to deploy
                        renewed certificates, or to restart any servers that
                        were stopped by --pre-hook. This is only run if an
                        attempt was made to obtain/renew a certificate. If
                        multiple renewed certificates have identical post-
                        hooks, only one will be run. (default: None)

You're looking at the wrong hooks.

I hotlinked to the "Pre and Post Validation Hooks" section:

Certbot allows for the specification of pre and post validation hooks when run in manual mode. The flags to specify these scripts are --manual-auth-hook and --manual-cleanup-hook respectively and can be used as follows:

certbot certonly --manual --manual-auth-hook /path/to/http/authenticator.sh --manual-cleanup-hook /path/to/http/cleanup.sh -d secure.example.com

This will run the authenticator.sh script, attempt the validation, and then run the cleanup.sh script. Additionally certbot will pass relevant environment variables to these scripts:

  • CERTBOT_DOMAIN : The domain being authenticated
  • CERTBOT_VALIDATION : The validation string
  • CERTBOT_TOKEN : Resource name part of the HTTP-01 challenge (HTTP-01 only)
  • CERTBOT_REMAINING_CHALLENGES : Number of challenges remaining after the current challenge
  • CERTBOT_ALL_DOMAINS : A comma-separated list of all domains challenged for the current certificate

Additionally for cleanup:

  • CERTBOT_AUTH_OUTPUT : Whatever the auth script wrote to stdout
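
As a rough illustration of what such a hook could look like for HTTP-01, here is a minimal Python sketch of an authenticator script that writes the challenge file under a webroot. The `WEBROOT` path and the helper name `write_challenge` are assumptions for the example; the environment variable names are the ones listed above, and `/.well-known/acme-challenge/` is the standard HTTP-01 location.

```python
#!/usr/bin/env python3
"""Hypothetical --manual-auth-hook script for HTTP-01 (illustrative sketch)."""
import os
import sys
from pathlib import Path

WEBROOT = Path("/var/www/html")  # assumption: replace with your docroot


def write_challenge(webroot: Path, token: str, validation: str) -> Path:
    """Place the validation string where the CA will fetch it over HTTP."""
    challenge_dir = webroot / ".well-known" / "acme-challenge"
    challenge_dir.mkdir(parents=True, exist_ok=True)
    target = challenge_dir / token
    target.write_text(validation)
    return target


def main() -> None:
    # Certbot sets these variables before it invokes the hook.
    token = os.environ.get("CERTBOT_TOKEN")
    validation = os.environ.get("CERTBOT_VALIDATION")
    if not token or not validation:
        # Not invoked by certbot: nothing to do.
        print("CERTBOT_TOKEN/CERTBOT_VALIDATION not set", file=sys.stderr)
        return
    # Anything printed to stdout reaches the cleanup hook as
    # CERTBOT_AUTH_OUTPUT, so the file path is a convenient thing to emit.
    print(write_challenge(WEBROOT, token, validation))


if __name__ == "__main__":
    main()
```

A matching cleanup script could read `CERTBOT_AUTH_OUTPUT` and delete the file it names.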

Well… manual mode only. That’s… unexpected, you’d have to reimplement the webroot plugin.

  1. I tried this, but it looks like the authenticator.sh and cleanup.sh hooks are called serially both before and after all validations, so a sleep doesn't actually slow down the validations.
  2. A wildcard is a good question; I'm checking into that.
  3. Hmm: I haven't tried that yet, but will look into it.

Also looking into this.

Thanks for the suggestions!

Just hold on a second. I think @jsha is about to respond.

Hi @nelsonph! Welcome to the forum. I think your analysis makes sense. We have, in the past, occasionally seen DNS setups that start rate limiting us when we send dozens of queries at once. It's definitely an uncommon usage pattern.

In general, larger certificates will give you a harder time across the board, since it's more likely for a small problem with one hostname to block issuance for a lot of hostnames. So I would recommend to split up your cert a bit, for instance offering one VirtualHost for each hostname and issuing each certificate individually. This will also slightly speed things up for your site visitors, since each certificate is smaller, so the TLS handshake takes fewer round trips.
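
A minimal Python sketch of that split, assuming the same `certonly --webroot` invocation used in the original command: it only builds and prints the command lines, so nothing is requested until you run them yourself. The helper names `batches` and `certbot_command` and the `/path/to/docroot` placeholder are made up for illustration.

```python
import shlex
from typing import Iterable, List


def batches(names: List[str], size: int) -> List[List[str]]:
    """Split names into consecutive groups of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [names[i:i + size] for i in range(0, len(names), size)]


def certbot_command(webroot: str, names: Iterable[str]) -> List[str]:
    """Build a certonly/webroot command line for one certificate."""
    cmd = ["certbot", "certonly", "--webroot", "-w", webroot]
    for name in names:
        cmd += ["-d", name]
    return cmd


if __name__ == "__main__":
    hostnames = [
        "sites-e.larc.nasa.gov",
        "aab.larc.nasa.gov",
        "adt.larc.nasa.gov",
        "aero.larc.nasa.gov",
        "aeroelasticity.larc.nasa.gov",
    ]
    # size=1 yields one certificate per hostname; a larger size
    # groups several hostnames onto each certificate instead.
    for group in batches(hostnames, size=1):
        print(shlex.join(certbot_command("/path/to/docroot", group)))
```

Either way, each finalization then triggers far fewer simultaneous CAA lookups than a single 60-name order.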

That said, I'll assume you have a good reason for putting lots of hostnames together. One possible fix would be to ask your DNS admins if they can relax the rate limits a bit.

One subtlety here: You're getting problems rechecking CAA during certificate finalization, rather than checking CAA during validation. Here's how the process works in a nutshell: When you ask for a certificate for 60 names, Let's Encrypt creates 60 "authorization" and related "challenge" objects. Your ACME client makes arrangements to prove that you control those domain names (e.g. by putting a token on a webserver), then tells Let's Encrypt to validate the challenges.

At that point, Let's Encrypt also checks CAA; if CAA fails, the validation fails and your authorization object for that hostname is "invalid." However, if validation succeeds and CAA succeeds, your authorization becomes "valid" and is good for the next 30 days. Any subsequent requests from your account for that domain name will just succeed, no extra validation needed.

However, a CAA check is only good for 8 hours. So, in a bit of a hack, when you ask to issue a certificate on the basis of some authorizations that became valid 15 days ago, we have to recheck CAA for those hostnames, all at the time of issuance. We want to keep the issuance request short for various reasons, so we do all those CAA checks in parallel. That's the step where you're running into trouble.

Long story short: You might be able to slow down CAA checking by deactivating your already-valid authorizations, so you are forced to reauthorize, which pushes the CAA checks through a different code path that's a bit more spread out. I don't think Certbot supports deactivating authzs out of the box, but I seem to recall some scripts out there that parse your logs and deactivate for you. You might also be able to try creating a new account.

All that said, if you can split up your hostnames a bit you will probably be happier in the long run.

Will Boulder re-check the CAA within those 8 hours? I thought not, which is why I suggested the OP obtain multiple certs on a subset of domains, then obtain a cert with all 60+ domains - but then this was posted:

I thought that bug was that no CAA checking was done when a valid authorization was found within the 30-day period of authz validity. So CAA checks were effectively valid for 30 days too, which isn’t allowed. I’m sure the 8-hour caching mentioned just now is allowed.

I missed this earlier. This is not quite accurate. CAA entries are indeed cached for up to 7 hours (used to be 8 but we decided to make sure we are not close to the limit). The CAA rechecking bug was this: If a certificate request had 2 or more hostnames, and some of those hostnames had already-valid authorizations, Boulder would only recheck CAA for one of those hostnames, rather than for all hostnames that needed it.

You're correct, Boulder won't re-check CAA if it's reusing an authorization that was validated in the last 7 hours. But since @nelsonph already has a bunch of authorizations that were validated > 7 hours ago, your recommendation only works if they first deactivate those authorizations, or create a new account.
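
To illustrate the rule, here is a small Python sketch that works out which hostnames would need a CAA recheck at issuance time. It is not Boulder's code: the function name, the dictionary of validation times, and the example timestamps are all made up, and the 7-hour window is simply the figure quoted in this thread.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# 7 hours is the figure quoted in this thread; treat it as an assumption.
CAA_REUSE_WINDOW = timedelta(hours=7)


def names_needing_caa_recheck(
    validated_at: Dict[str, datetime],
    now: datetime,
    window: timedelta = CAA_REUSE_WINDOW,
) -> List[str]:
    """Return the hostnames whose authorization is older than the window."""
    return sorted(
        name for name, when in validated_at.items() if now - when > window
    )


if __name__ == "__main__":
    now = datetime(2020, 9, 1, 12, 0)  # arbitrary example time
    authzs = {
        "sw-eng.larc.nasa.gov": now - timedelta(days=15),
        "eds.larc.nasa.gov": now - timedelta(hours=2),
    }
    print(names_needing_caa_recheck(authzs, now))
```

With 60+ authorizations that are all older than the window, every name lands in that list, and all of them are rechecked together when the order is finalized.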

Thanks for the reply and putting up with my crazy idea to get around this limit... I think it would work though.

If the OP were to process the 60 domains (for simplicity) in 6 orders of 10 domains... wouldn't Boulder just re-check and re-cache the CAA lookup of the domains that passed more than 7 hours ago when obtaining those? Or are you saying the CAA lookups are only cached for 7 hours on the initial authorizations? If it's the latter, then it wouldn't work - but the former scenario should.

My expectation is that chunking the domains into 10 buckets would be the main throttle for the DNS server's rate limits, regardless of cache status, as there would likely be a 1-2 minute delay between each invocation of certbot. When each domain is processed, if a new authorization is needed it would be cached along with the CAA lookup, and if the authorization is not needed, only the CAA lookup would happen. That should get around most rate limits, unless the dev/ops team really locked things down. A few minutes later, a single cert is requested and all the authorizations and CAA lookups would be primed into the Boulder cache, so the certificate would issue without a lookup.

This is far from the ideal solution, it's an idea for a wonky workaround by (ab)using some implementation details in Boulder.

Ah, I understand what you're saying now. Unfortunately, no. CAA checks aren't cached independently of authorizations. Each valid authorization expresses two concepts: "This passed ACME challenge validation" and "This passed CAA checking."

When we do a CAA recheck at issuance time, the results of that CAA recheck are only used for that issuance. They are not cached for future issuance.

@nelsonph as an aside, NASA has an enterprise license for the https://certifytheweb.com certificate management tool [for Let’s Encrypt/ACME] (which can run on Windows or Linux). Message me if you’d like more details, or email support at certifytheweb.com. There are some features under development that could assist you in managing validation for large numbers of subdomains and in controlling distribution of certs to the servers/services that need them.

Unfortunately, I now understand you too :frowning:

It might be more helpful if you state that as "certificate management tool for LetsEncrypt"

This fixed it, thanks! I'm still planning to look into other solutions for the long-term, but this got us going in the short-term. :slight_smile:
