Hi everyone. I couldn't find anything similar.
I need help issuing certificates via DNS-01 with this setup: k8s cert-manager <-> Let's Encrypt prod/staging <-> AWS Route53.
I'm trying to issue a certificate with 9 SANs: 6 subdomain wildcards and 3 domain wildcards.
For example, part of the manifest:
apiVersion: cert-manager.io/v1
kind: Certificate
...
dnsNames:
- '*.domain1'
- '*.domain2'
- '*.domain3'
- '*.apps.xxxx.yyyy.domain1'
- '*.ing.xxxx.yyyy.domain1'
- '*.dev.zzzz.yyyy.domain1'
- '*.apps.xxxx.yyyy.domain1'
- '*.ing.xxxx.yyyy.domain2'
- '*.dev.zzzz.yyyy.domain2'
8 of the 9 challenges complete successfully, but the challenge for 1 random subdomain gets stuck with this error:
"msg"="error waiting for authorization" "error"="acme: authorization error for apps.xxxx.yyyy.domain1: 400 urn:ietf:params:acme:error:dns: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.apps.xxxx.yyyy.domain1 - check that a DNS record exists for this domain"
The TXT record for that subdomain definitely exists on all the AWS name servers that serve domain1 ...
ns-616.awsdns-13.net.
ns-464.awsdns-58.com.
ns-1865.awsdns-41.co.uk.
ns-1338.awsdns-39.org.
... at the moment cert-manager triggers Let's Encrypt to check the challenge. I verified this with dig in a loop during the challenge.
In the AWS CloudTrail logs I see the UPSERT and DELETE of the TXT record, so I don't think there is a problem with creating the TXT record, especially since the other 8 challenges work fine. All names are valid.
Now cert-manager retries the challenge for that subdomain at intervals of 1h - 4h - 8h - 16h - 32h, and every time it gets code 400.
I tried setting a 20-60-120s wait for the TXT record to propagate before the check, with no success. The TXT records are also present on all the Amazon name servers within 5 seconds.
Tested with both the prod and staging issuers:
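For anyone who wants to repeat that check, here is a minimal sketch of the idea in Python. It assumes `dig` is installed and asks each name server listed above directly. The function names are just illustrative, and this is not what cert-manager or Let's Encrypt run internally.

```python
import subprocess

# Authoritative name servers listed above for domain1.
NAMESERVERS = [
    "ns-616.awsdns-13.net.",
    "ns-464.awsdns-58.com.",
    "ns-1865.awsdns-41.co.uk.",
    "ns-1338.awsdns-39.org.",
]


def acme_challenge_name(dns_name: str) -> str:
    """Return the TXT record name checked for DNS-01 validation of dns_name.

    A wildcard such as '*.apps.example.com' is validated through
    '_acme-challenge.apps.example.com' (the '*.' label is dropped).
    """
    base = dns_name[2:] if dns_name.startswith("*.") else dns_name
    return "_acme-challenge." + base.rstrip(".")


def dig_command(record: str, nameserver: str) -> list[str]:
    """Build a dig invocation that queries one name server directly."""
    return ["dig", "+short", "+norecurse", "TXT", record, "@" + nameserver]


def query_all(dns_name: str, nameservers=NAMESERVERS) -> dict[str, list[str]]:
    """Ask every name server and return the TXT values each one serves."""
    record = acme_challenge_name(dns_name)
    results = {}
    for ns in nameservers:
        proc = subprocess.run(
            dig_command(record, ns), capture_output=True, text=True, timeout=15
        )
        # dig +short prints one quoted TXT value per line; empty means no record.
        results[ns] = [
            line.strip().strip('"') for line in proc.stdout.splitlines() if line.strip()
        ]
    return results

# Example (needs network access and dig):
#   for ns, values in query_all("*.apps.xxxx.yyyy.domain1").items():
#       print(ns, values or "NO TXT RECORD")
```

Calling `query_all` in a loop while the challenge is pending shows whether any single name server is missing the record at the moment of validation.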
https://acme-v02.api.letsencrypt.org/directory
https://acme-staging-v02.api.letsencrypt.org/directory
There is no problem with rate limits; I checked on https://tools.letsdebug.net
FYI: a problem with subdomain1 on prod (acme-v02.api.letsencrypt.org) does not mean the same problem with subdomain1 on staging (acme-staging-v02.api.letsencrypt.org).
I.e. subdomain1 may validate successfully on staging but fail many times on prod.
And conversely, the challenge for another subdomain2 may succeed on prod and fail validation on staging.
The problem happens more often the more SANs the certificate has.
If the certificate is small, i.e. 3-4 SANs, there is no problem, and issuance takes 10-20 seconds.
So I found a workaround: put the "invalid" subdomain name in a "small" certificate and run the challenge on prod LE, so the subdomain gets "valid" status in the Let's Encrypt cache. After that, I run the challenge for the full certificate with all SANs, and all names get valid status.
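To illustrate the workaround, here is a rough Python sketch that generates such a "small" Certificate manifest as JSON (which `kubectl apply -f` accepts). The resource name, secret name, issuer name `letsencrypt-prod`, and the `ClusterIssuer` kind are all placeholders I made up for the example, not values from my cluster.

```python
import json


def small_certificate(name: str, dns_names: list[str], issuer_name: str) -> dict:
    """Build a minimal cert-manager Certificate for a handful of names.

    name and issuer_name are hypothetical; ClusterIssuer is an assumption,
    use kind "Issuer" if the issuer is namespaced.
    """
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": name},
        "spec": {
            "secretName": name + "-tls",
            "dnsNames": list(dns_names),
            "issuerRef": {"name": issuer_name, "kind": "ClusterIssuer"},
        },
    }


# Pre-validate only the name that keeps failing in the 9-SAN certificate.
manifest = small_certificate(
    "prevalidate-apps", ["*.apps.xxxx.yyyy.domain1"], "letsencrypt-prod"
)
print(json.dumps(manifest, indent=2))
```

Once the small certificate is issued, the full 9-SAN certificate can be requested again.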
Please help)