Secondary validation fails on all domains for dns-01 challenge

I'm troubleshooting an issue and I have a hard time wrapping my head around what might be the problem.
I am trying to renew a certificate with 23 subject alternative names, so 24 names in total: 12 domains, each with and without a wildcard.
Normally this runs in a GitLab CI pipeline, but for troubleshooting I have isolated it to a single certbot command.
I am using the Let's Encrypt staging infrastructure.
All 12 domains I'm using have a CNAME entry for "_acme-challenge.$domain.$tld" pointing to "_acme-challenge.acme.puzzle.ch"
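
To make the setup concrete, here is a minimal sketch of how the requested names map to challenge names. For dns-01 a wildcard name and its base name share the same challenge name, so 24 requested names become 12 challenge names, and because every CNAME points at the same target, all 24 TXT values end up under that one name. The function name `challenge_name` and the shortened list are only for illustration, and the dig commands are printed rather than run.

```shell
#!/bin/sh
# Map a certificate name to its dns-01 challenge name:
# strip a leading "*." and prefix "_acme-challenge.".
challenge_name() {
    base=${1#"*."}
    printf '_acme-challenge.%s\n' "$base"
}

# Shortened list for illustration; the real request has 24 names.
names="*.pitc.ch pitc.ch *.linuxfriends.ch linuxfriends.ch"

# Disable globbing so "*.pitc.ch" is never expanded against local files.
set -f
unique=$(for n in $names; do challenge_name "$n"; done | sort -u)
set +f

# Print (not run) one dig command per challenge name to inspect its CNAME.
for c in $unique; do
    echo "dig +short CNAME $c"
done
```

Each printed command can be run by hand; every one of them should return _acme-challenge.acme.puzzle.ch.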

During the certbot run I get the same error for all domains:

  Domain: pitc.ch
  Type:   unauthorized
  Detail: During secondary validation: No TXT record found at _acme-challenge.pitc.ch

So presumably none of the primary validations fail and all of the secondary validations fail.

After the certbot prompt I added all requested TXT records using the dnsimple web UI and waited at least 10 minutes. I can also see all the records using unboundtest:

unbound 1.19: https://unboundtest.com/m/TXT/_acme-challenge.acme.puzzle.ch/TTKIQSCL
unbound 1.18: https://unboundtest.com/m/TXT/_acme-challenge.acme.puzzle.ch/AJ63665E
unbound 1.16: https://unboundtest.com/m/TXT/_acme-challenge.acme.puzzle.ch/EDNUCL34

The entries are still in place for debugging, even though the challenge failed.
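
For anyone who wants to repeat the check from their own machine, here is a small sketch that counts the TXT values a resolver returns. The helper name `count_txt` and the choice of public resolvers are my own assumptions; the live query is left commented out because it needs network access, and the last line only demonstrates the helper with made-up values.

```shell
#!/bin/sh
# Count the TXT values in `dig +short` output read from stdin
# (one quoted value per line; other lines are ignored).
count_txt() {
    grep -c '^"'
}

# Intended use (requires network access, so it is left commented out):
#   for r in 1.1.1.1 8.8.8.8 9.9.9.9; do
#       printf '%s: ' "$r"
#       dig +short TXT _acme-challenge.acme.puzzle.ch "@$r" | count_txt
#   done

# Offline demonstration with two made-up values standing in for dig output.
printf '"value-one"\n"value-two"\n' | count_txt
```

With all challenges deployed, each resolver should report 24 values.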

Any and all hints are greatly appreciated as I'm a bit at a loss here.
Thanks!

My domain is:

*.pitc.ch,pitc.ch,*.linuxfriends.ch,linuxfriends.ch,*.linux-migration.ch,linux-migration.ch,*.linuxmigration.ch,linuxmigration.ch,*.puzzle-itc.ch,puzzle-itc.ch,*.puzzleitc.ch,puzzleitc.ch,*.puzzle-itc.com,puzzle-itc.com,*.puzzleitc.com,puzzleitc.com,*.puzzleversum.ch,puzzleversum.ch,*.puzzleversum.com,puzzleversum.com,*.puzzzle.ch,puzzzle.ch,*.puzzle-security.ch,puzzle-security.ch

I ran this command:

sudo certbot certonly --manual --dry-run --agree-tos --debug-challenges --preferred-challenge dns-01 -d *.pitc.ch,pitc.ch,*.linuxfriends.ch,linuxfriends.ch,*.linux-migration.ch,linux-migration.ch,*.linuxmigration.ch,linuxmigration.ch,*.puzzle-itc.ch,puzzle-itc.ch,*.puzzleitc.ch,puzzleitc.ch,*.puzzle-itc.com,puzzle-itc.com,*.puzzleitc.com,puzzleitc.com,*.puzzleversum.ch,puzzleversum.ch,*.puzzleversum.com,puzzleversum.com,*.puzzzle.ch,puzzzle.ch,*.puzzle-security.ch,puzzle-security.ch

It produced this output: (this output 24 times, each with a different domain and TXT value)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name:

_acme-challenge.pitc.ch.

with the following value:

yTYUDfIMF4vfeWsWAeflADveyGEfg8D0blWfWF3fuhI

(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):
certbot 2.8.0 however I also have this issue in a CI pipeline where we use lego with the dnsimple integration. Certbot is just used locally for easier testing.

1 Like

Hi @nerrehmit, and welcome to the LE community forum :slight_smile:

In general, that implies that the primary validation passed the test.
It's hard to say why some DNS systems' tests would pass and some would fail.
I'd start by checking your entire DNS system for any anomalies.

2 Likes

There have been challenges lately with this sort of setup, as the DNS response gets so large that their DNS resolver doesn't like it. If you can rework your system to use a different name for each domain name, it might be more reliable.
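
As a rough illustration of the sizes involved, the arithmetic below estimates the answer section for 24 TXT records that share one name. It assumes each owner name is compressed to a 2-byte pointer and each value is a 43-character token like the one in the Certbot output; the question section, the DNS header, and the CNAME record are left out, so the real response is somewhat larger.

```shell
#!/bin/sh
# Rough wire size of one TXT record in the answer section, assuming the
# owner name is compressed to a 2-byte pointer:
#   name(2) + type(2) + class(2) + ttl(4) + rdlength(2) + rdata
# rdata is one length byte plus the 43-character token.
token_len=43
record_size=$((2 + 2 + 2 + 4 + 2 + 1 + token_len))

records=24   # 12 domains, each with and without a wildcard
answer_size=$((records * record_size))

echo "per record: $record_size bytes"
echo "$records records: $answer_size bytes"
```

That comes to 56 bytes per record and 1344 bytes for the answers alone, which is already above the 1232-byte UDP payload size many resolvers advertise. Whether that particular limit is the one involved here is not confirmed by anything in this thread.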

You can see some details in this thread, where LE staff @jcjones just deployed a change a few days ago that might help with it.

@jcjones, can you confirm that the Unbound change is in on both the primary and secondary validation servers?

4 Likes

Thank you so much for your swift response and the link to the other thread. Unfortunately I didn't find that one during my search attempts.

Reading through that it looks like we are hitting the same or a very similar problem. The distinction between primary and secondary validation seems to indicate that maybe a certain setting or config option was not yet implemented for all of the secondary validators.
Let's see if anyone from LE has an input.

2 Likes

No, didn't do it for the secondary validation hosts - it wasn't clear we'd hit the same issue. I'll work that change up right away.

6 Likes

The fix is in Staging secondary validation hosts. I should be able to update the Production secondary validation hosts before end of day, assuming no problems arise.

5 Likes

There's nothing quite like a Friday afternoon production deployment.

:crossed_fingers:

6 Likes

Dot all your teas and cross all your eyes!
[what could go wrong?!]
LOL

3 Likes

Yeaaah.

Anyway, this is almost done deploying. May the odds be in my favor!

@nerrehmit: Hopefully this fixes the issue.

8 Likes

Two things that may also work, assuming this has to do with the number of domains impacting your systems:

1- If you can't use a separate certificate for each domain name, what about smaller batches? It looks like you could have 3+ logical groupings of certificates there.

2- A common workaround for issues like this is to run Certbot to get a certificate for each registered domain (*.example.com & example.com), then immediately run Certbot to get a certificate for the full list of domains. The last run should use the cached validations on Let's Encrypt's servers, so you won't have to complete any challenges and the certificate should issue immediately.
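
A minimal sketch of the second option, assuming a shortened, hypothetical list of registered domains. The helper names `per_domain_cmd` and `combined_cmd` are made up for illustration, and the script only prints the Certbot commands so you can review them before running anything.

```shell
#!/bin/sh
# Hypothetical shortened list of registered domains.
domains="pitc.ch linuxfriends.ch puzzle-itc.ch"

# Print the certbot command for one registered domain (wildcard + base name).
per_domain_cmd() {
    echo "certbot certonly --manual --preferred-challenges dns -d '*.$1' -d '$1'"
}

# Print the final command covering every name in a single certificate.
combined_cmd() {
    args=""
    for d in $domains; do
        args="$args -d '*.$d' -d '$d'"
    done
    echo "certbot certonly --manual --preferred-challenges dns$args"
}

# One small run per registered domain, then the combined run.
for d in $domains; do
    per_domain_cmd "$d"
done
combined_cmd
```

Keep in mind that staging and production cache validations separately, so the per-domain runs and the combined run need to target the same environment.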

3 Likes

Fantastic news, thanks a lot for tackling this issue, on a Friday nonetheless.

I'm happy to report that it all looks ok using the staging infrastructure.

This is what I got at the end of the staging run:

2024-01-26 23:37:46,340:DEBUG:certbot._internal.client:Dry run: Skipping creating new lineage for pitc.ch
2024-01-26 23:37:46,340:DEBUG:certbot._internal.display.obj:Notifying user: The dry run was successful.

As you can see from the timestamps, it's getting quite late here, so any production LE tests will have to wait until tomorrow.
Thanks again and fingers crossed for the rest of the prod deployment!

3 Likes

Thank you for the input, we might very well rework the way we handle certificates with a large number of SANs in them.

I think in the past this exact scenario happened organically: because the cached validations timed out at different intervals, only a few of the domains were up for validation on each run.
But then LE stopped processing the responses because they grew bigger and bigger, which led to even more domains expiring from the cache and only compounded the size of the subsequent validation TXT responses.

What stumped me for far too long during debugging is that the primary validation always passed and the secondary did not. So I was looking at the anycast setup of our DNS provider, thinking they were serving a different set of records, or maybe errors, when queried from a particular location.

3 Likes
