I get sporadic failures when renewing my certificate via the DNS challenge. The cert normally has around 14 SANs, many of them wildcards, and some of them fail almost every time I try to renew. For demonstration I picked one domain (sites.karotte.org) and created a new cert with only sites.karotte.org and *.sites.karotte.org as SANs.
The setup is unusual in that the _acme-challenge record is redirected via CNAME to a separate subdomain that can be updated dynamically. In this case:
_acme-challenge.sites.karotte.org. IN CNAME sites.karotte.org._acme_challenges.challenges.karotte.org.
I use a manual auth hook to set the TXT records (needed because certbot's DNS update would not work with CNAME redirects).
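To make the redirection concrete, here is a minimal sketch of the name mapping such a hook has to perform. It is illustrative only and not the actual dns-challenge.py: the function names are hypothetical, and the zone suffix is simply read off the CNAME shown above. The one certbot-specific detail it relies on is that certbot passes the domain to manual hooks in the CERTBOT_DOMAIN environment variable.

```python
import os
from typing import Tuple

# Assumption: the delegated zone suffix, taken from the CNAME shown above.
CHALLENGE_ZONE = "_acme_challenges.challenges.karotte.org."


def challenge_names(domain: str) -> Tuple[str, str]:
    """Return (name the CA looks up, CNAME target that holds the TXT record)."""
    base = domain.rstrip(".")
    # A wildcard such as *.sites.karotte.org is validated at the same
    # _acme-challenge name as its base domain, so drop the "*." label.
    if base.startswith("*."):
        base = base[2:]
    queried = f"_acme-challenge.{base}."
    target = f"{base}.{CHALLENGE_ZONE}"
    return queried, target


def names_from_environment() -> Tuple[str, str]:
    # certbot exports CERTBOT_DOMAIN to manual auth and cleanup hooks.
    return challenge_names(os.environ["CERTBOT_DOMAIN"])
```

Because the base domain and its wildcard map to the same name, both validation tokens have to be present as TXT records at the CNAME target at the same time, which matches the "(2 records)" line in the hook output further down.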
I ran this command (this comes out of a Makefile):
/opt/letsencrypt/bin/certbot.sh --non-interactive --agree-tos --email me@example.com \
--dry-run \
--config-dir /opt/letsencrypt/certbot/conf \
--logs-dir /opt/letsencrypt/certbot/log \
--work-dir /opt/letsencrypt/certbot/work \
certonly \
--csr testdomain/testdomain-1606834179.csr \
--cert-path testdomain/cert-1606834179.crt \
--chain-path testdomain/intermediate-1606834179.pem \
--fullchain-path testdomain/chained-1606834179.pem \
--server https://acme-staging-v02.api.letsencrypt.org/directory \
--manual \
--manual-public-ip-logging-ok \
--preferred-challenges dns \
--manual-auth-hook "/opt/letsencrypt/bin/dns-challenge.py auth" \
--manual-cleanup-hook "/opt/letsencrypt/bin/dns-challenge.py cleanup"
It produced this output:
Saving debug log to /opt/letsencrypt/certbot/log/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Performing the following challenges:
dns-01 challenge for sites.karotte.org
dns-01 challenge for sites.karotte.org
Running manual-auth-hook command: /opt/letsencrypt/bin/dns-challenge.py auth
Output from manual-auth-hook command dns-challenge.py:
Verified sites.karotte.org._acme_challenges.challenges.karotte.org TXT D3TB-k4Ew9vBx2VwtltCyoVFS0d5h7xDQiHsKI8FMyk (1 records)
Running manual-auth-hook command: /opt/letsencrypt/bin/dns-challenge.py auth
Output from manual-auth-hook command dns-challenge.py:
Verified sites.karotte.org._acme_challenges.challenges.karotte.org TXT 9b9n0URSsQibpCJwVVdw81A6N0uB28W0ft36kqRU-MQ (2 records)
Waiting for verification...
Challenge failed for domain sites.karotte.org
Challenge failed for domain sites.karotte.org
dns-01 challenge for sites.karotte.org
dns-01 challenge for sites.karotte.org
Cleaning up challenges
Running manual-cleanup-hook command: /opt/letsencrypt/bin/dns-challenge.py cleanup
Running manual-cleanup-hook command: /opt/letsencrypt/bin/dns-challenge.py cleanup
Some challenges have failed.
IMPORTANT NOTES:
 - The following errors were reported by the server:
   Domain: sites.karotte.org
   Type: dns
   Detail: During secondary validation: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.sites.karotte.org - check that a DNS record exists for this domain
   Domain: sites.karotte.org
   Type: dns
   Detail: During secondary validation: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.sites.karotte.org - check that a DNS record exists for this domain
The version of my client is: 1.9.0
The error varies. In this case it is "During secondary validation"; sometimes it is:
IMPORTANT NOTES:
 - The following errors were reported by the server:
   Domain: sites.karotte.org
   Type: dns
   Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.sites.karotte.org - check that a DNS record exists for this domain
   Domain: sites.karotte.org
   Type: dns
   Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.sites.karotte.org - check that a DNS record exists for this domain
Sometimes it just works. So whatever the problem is, it is transient, but I can reproduce it almost every time the script runs.
Sniffing DNS traffic, I see a lot of different DNS servers requesting the challenges (or records that are connected to the challenge). I see no NXDOMAIN replies from my server, so I assume the problem is somewhere farther away. Maybe some of the DNS servers LE uses have problems with the CNAME? Unfortunately, the IP of the DNS server doing the validation is not logged in the error. That would narrow the problem down.
Any idea what to do?
The auth hook updates the zone so that the change is live immediately. The zone has a negative TTL of 1 second, so caching of negative responses should not be a problem either.
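One way to test the "live immediately" assumption is to ask every authoritative nameserver directly, without recursion, right after the hook has run. The sketch below does that by shelling out to dig; it assumes dig is installed and that the zones involved are served by more than one authoritative server. The helper names are hypothetical, and the nameserver hostnames have to be filled in by hand.

```python
import re
import subprocess
from typing import Dict, List, Optional


def dig_command(nameserver: str, name: str) -> List[str]:
    # +norecurse asks the server for its own answer instead of a recursive lookup.
    return ["dig", "+norecurse", f"@{nameserver}", name, "TXT"]


def response_status(dig_output: str) -> Optional[str]:
    # dig prints the response code in its header line, e.g.
    # ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4711"
    match = re.search(r"status: ([A-Z]+)", dig_output)
    return match.group(1) if match else None


def check_nameservers(nameservers: List[str], name: str) -> Dict[str, Optional[str]]:
    """Query each nameserver for the TXT record and collect the status codes."""
    results = {}
    for ns in nameservers:
        proc = subprocess.run(
            dig_command(ns, name), capture_output=True, text=True, check=False
        )
        results[ns] = response_status(proc.stdout)
    return results
```

Calling check_nameservers with the list of authoritative servers and sites.karotte.org._acme_challenges.challenges.karotte.org. returns one status per server. A server that answers NXDOMAIN while the others answer NOERROR would point at a replication delay rather than at the CA's resolvers. The same check can be repeated for _acme-challenge.sites.karotte.org. against the servers of the parent zone.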