DNS problem: NXDOMAIN looking up TXT, 3 zones, 9 SANs, problem with 1 SAN

Hi everyone. I couldn't find a similar case anywhere.
I need help issuing certificates via DNS-01 with this setup: k8s cert_manager <-> letsencrypt prod/stage <-> aws route53.
I'm trying to issue a certificate with 9 SANs: 6 wildcard subdomains and 3 wildcard domains.

For example, part of the Certificate manifest:
apiVersion: cert-manager.io/v1
kind: Certificate
...
dnsNames:
- '*.domain1'
- '*.domain2'
- '*.domain3'
- '*.apps.xxxx.yyyy.domain1'
- '*.ing.xxxx.yyyy.domain1'
- '*.dev.zzzz.yyyy.domain1'
- '*.apps.xxxx.yyyy.domain1'
- '*.ing.xxxx.yyyy.domain2'
- '*.dev.zzzz.yyyy.domain2'
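
For readers mapping these names to the TXT records involved: in DNS-01 the leading "*." of a wildcard name is dropped and "_acme-challenge." is prepended, which is why the error below mentions _acme-challenge.apps.xxxx.yyyy.domain1 rather than a wildcard. A minimal sketch of that mapping follows; the function names and the example.com names in the usage note are made up for illustration and are not part of cert-manager.

```python
def challenge_record_name(dns_name: str) -> str:
    """Return the TXT record name a DNS-01 challenge for dns_name is looked up at.

    For a wildcard name the leading "*." label is dropped first, so
    "*.apps.example.com" and "apps.example.com" share one record name.
    """
    base = dns_name.strip().rstrip(".").lower()
    if base.startswith("*."):
        base = base[2:]
    return f"_acme-challenge.{base}."


def group_by_record_name(dns_names):
    """Map each challenge record name to the certificate names that use it."""
    groups = {}
    for name in dns_names:
        groups.setdefault(challenge_record_name(name), []).append(name)
    return groups
```

Calling group_by_record_name(["*.example.com", "example.com"]) puts both names under _acme-challenge.example.com., so two different tokens have to be served from the same TXT record name at once.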

8 of 9 challenges complete successfully, but the challenge for 1 random subdomain gets stuck with the error:

"msg"="error waiting for authorization" "error"="acme: authorization error for apps.xxxx.yyyy.domain1: 400 urn:ietf:params:acme:error:dns: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.apps.xxxx.yyyy.domain1 - check that a DNS record exists for this domain"

The TXT record for that subdomain definitely exists on all AWS NS servers that serve domain1 ...
ns-616.awsdns-13.net.
ns-464.awsdns-58.com.
ns-1865.awsdns-41.co.uk.
ns-1338.awsdns-39.org.
... at the moment cert-manager triggers Let's Encrypt to check the challenge; I checked it with dig in a loop during the challenge.
In the AWS CloudTrail logs I also see the UPSERT and DELETE of the TXT record, so I don't think there is a problem with creating the TXT record, especially since the other 8 challenges have no problem and all names are valid.
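
The per-name-server check described here can be scripted roughly as follows. This is only a sketch, not the script actually used in this thread: it assumes dig is installed, and the function names are made up. It asks each authoritative server directly (no recursion) and reports which ones do not yet serve the expected token.

```python
import subprocess


def dig_txt(name, server):
    """Query one name server directly for the TXT records of name using dig.

    Returns the TXT strings without surrounding quotes; an empty list means
    the server returned no TXT data (or could not be reached).
    """
    result = subprocess.run(
        ["dig", "+short", "+norecurse", "+time=2", "+tries=1",
         "TXT", name, f"@{server}"],
        capture_output=True, text=True, check=False,
    )
    lines = (line.strip() for line in result.stdout.splitlines())
    # dig prints diagnostics such as timeouts as lines starting with ";".
    return [line.strip('"') for line in lines if line and not line.startswith(";")]


def servers_missing_token(name, servers, token, lookup=dig_txt):
    """Return the servers that do not (yet) serve the expected token."""
    return [server for server in servers if token not in lookup(name, server)]
```

With the four awsdns servers listed above as servers and the challenge token as token, an empty result means every authoritative Route53 server already returns the record. The lookup parameter only exists so the logic can be exercised without network access.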

Now cert-manager retries the challenge for that subdomain at intervals of 1h - 4h - 8h - 16h - 32h, and gets code 400 every time.

I tried setting a 20-60-120s wait for TXT record propagation before the check, with no success. The TXT records are also ready on all Amazon NS servers after 5 seconds.
Tested with both prod and stage issuers:
https://acme-v02.api.letsencrypt.org/directory
https://acme-staging-v02.api.letsencrypt.org/directory

There is no problem with rate limits; I checked at https://tools.letsdebug.net

FYI: a problem with subdomain1 via acme-v02.api.letsencrypt.org does not mean the same problem with subdomain1 on stage acme-staging-v02.api.letsencrypt.org.
I.e. subdomain1 may validate successfully on stage but fail many times on prod.
And conversely, the challenge for another subdomain2 may succeed on prod and fail validation on stage.

The problem happens more frequently the more SANs the certificate has.
If the cert is small, i.e. 3-4 SANs, there is no problem and issuing takes 10-20 seconds.

So I found a workaround: put the "invalid" subdomain name in a "small" certificate and run the challenge on prod LE, so that the subdomain gets "valid" status in the letsencrypt cache. After that, run the challenge for the certificate with the full set of SANs, and all names get valid status.

Please help)

I know nothing about k8s cert_manager. However, you may want to check whether it verifies the DNS TXT records before triggering the verification, or just waits for a while. If it just waits, you may want to increase the waiting time.

How many TXT records exist in the zone?

[and why are you obscuring a name that (once it gets a public cert) will be posted publicly]

I already tried increasing the propagation time before triggering the check: the default is 10s, and I tried 20s-60s-120s.

During that interval I watched the TXT record propagate across the AWS NS servers; the record exists the whole time.
At the same time I checked the TXT record over Google's public DNS servers. It looks as if the record had not propagated to all of them, because they work over anycast, and dig randomly returns either the TXT with the token or an empty answer.

So I don't know how letsencrypt checks the TXT record. Which NS servers does it use?
Does it get the Amazon NS servers and check against them?
Or does LE check over public DNS, e.g. 8.8.8.8, 9.9.9.9, 8.8.4.4, 1.1.1.1?

There are 3 domains in 1 AWS account, and 5-10 static TXT records overall, no more.

[Names obscured to make my case easier to read]

The authoritative nameservers for the domain are checked. TTL propagation to resolvers is not a relevant factor.

cert-manager controller logs in the error case:

I0229 08:05:35.088299       1 dns.go:130] cert-manager/challenges/Check "msg"="ACME DNS01 validation record propagated" "dnsName"="TEST.SUBDOMAIN.DOMAIN" "domain"="TEST.SUBDOMAIN.DOMAIN" "fqdn"="_acme-challenge.TEST.SUBDOMAIN.DOMAIN." "resource_kind"="Challenge" "resource_name"="default-cert-ingress-11-1386384748-1319753650" "resource_namespace"="openshift-ingress" "resource_version"="v1" "type"="DNS-01"
I0229 08:05:35.088353       1 sync.go:359] cert-manager/challenges/acceptChallenge "msg"="accepting challenge with ACME server" "dnsName"="TEST.SUBDOMAIN.DOMAIN" "resource_kind"="Challenge" "resource_name"="default-cert-ingress-11-1386384748-1319753650" "resource_namespace"="openshift-ingress" "resource_version"="v1" "type"="DNS-01"
I0229 08:05:35.710450       1 sync.go:376] cert-manager/challenges/acceptChallenge "msg"="waiting for authorization for domain" "dnsName"="TEST.SUBDOMAIN.DOMAIN" "resource_kind"="Challenge" "resource_name"="default-cert-ingress-11-1386384748-1319753650" "resource_namespace"="openshift-ingress" "resource_version"="v1" "type"="DNS-01"
E0229 08:05:37.124463       1 sync.go:379] cert-manager/challenges/acceptChallenge "msg"="error waiting for authorization" "error"="acme: authorization error for TEST.SUBDOMAIN.DOMAIN: 400 urn:ietf:params:acme:error:dns: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.TEST.SUBDOMAIN.DOMAIN - check that a DNS record exists for this domain" "dnsName"="TEST.SUBDOMAIN.DOMAIN" "resource_kind"="Challenge" "resource_name"="default-cert-ingress-11-1386384748-1319753650" "resource_namespace"="openshift-ingress" "resource_version"="v1" "type"="DNS-01"
I0229 08:05:37.124741       1 logs.go:177] cert-manager/controller "msg"="Event(v1.ObjectReference{Kind:\"Challenge\", Namespace:\"openshift-ingress\", Name:\"default-cert-ingress-11-1386384748-1319753650\", UID:\"c6865216-fea2-485e-be8c-e5aae66f78a0\", APIVersion:\"acme.cert-manager.io/v1\", ResourceVersion:\"16853038\", FieldPath:\"\"}): type: 'Warning' reason: 'Failed' Accepting challenge authorization failed: acme: authorization error for TEST.SUBDOMAIN.DOMAIN: 400 urn:ietf:params:acme:error:dns: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.TEST.SUBDOMAIN.DOMAIN - check that a DNS record exists for this domain"
I0229 08:05:37.184226       1 dns.go:294] cert-manager/challenges/CleanUp/solverForChallenge "msg"="preparing to create Route53 provider" "dnsName"="TEST.SUBDOMAIN.DOMAIN" "domain"="TEST.SUBDOMAIN.DOMAIN" "resource_kind"="Challenge" "resource_name"="default-cert-ingress-11-1386384748-1319753650" "resource_namespace"="openshift-ingress" "resource_version"="v1" "type"="DNS-01"

and in the success case:

I0229 07:58:42.993785       1 dns.go:130] cert-manager/challenges/Check "msg"="ACME DNS01 validation record propagated" "dnsName"="TEST.SUBDOMAIN.DOMAIN" "domain"="TEST.SUBDOMAIN.DOMAIN" "fqdn"="_acme-challenge.TEST.SUBDOMAIN.DOMAIN." "resource_kind"="Challenge" "resource_name"="default-cert-ingress-10-1116789538-1844644491" "resource_namespace"="openshift-ingress" "resource_version"="v1" "type"="DNS-01"
I0229 07:58:42.993866       1 sync.go:359] cert-manager/challenges/acceptChallenge "msg"="accepting challenge with ACME server" "dnsName"="TEST.SUBDOMAIN.DOMAIN" "resource_kind"="Challenge" "resource_name"="default-cert-ingress-10-1116789538-1844644491" "resource_namespace"="openshift-ingress" "resource_version"="v1" "type"="DNS-01"
I0229 07:58:43.623029       1 sync.go:376] cert-manager/challenges/acceptChallenge "msg"="waiting for authorization for domain" "dnsName"="TEST.SUBDOMAIN.DOMAIN" "resource_kind"="Challenge" "resource_name"="default-cert-ingress-10-1116789538-1844644491" "resource_namespace"="openshift-ingress" "resource_version"="v1" "type"="DNS-01"
I0229 07:58:54.721368       1 logs.go:177] cert-manager/controller "msg"="Event(v1.ObjectReference{Kind:\"Challenge\", Namespace:\"openshift-ingress\", Name:\"default-cert-ingress-10-1116789538-1844644491\", UID:\"559cd374-cf09-4ffb-aaf9-d32015c13da0\", APIVersion:\"acme.cert-manager.io/v1\", ResourceVersion:\"16842417\", FieldPath:\"\"}): type: 'Normal' reason: 'DomainVerified' Domain \"TEST.SUBDOMAIN.DOMAIN\" verified with \"DNS-01\" validation"
I0229 07:58:54.785726       1 dns.go:294] cert-manager/challenges/CleanUp/solverForChallenge "msg"="preparing to create Route53 provider" "dnsName"="TEST.SUBDOMAIN.DOMAIN" "domain"="TEST.SUBDOMAIN.DOMAIN" "resource_kind"="Challenge" "resource_name"="default-cert-ingress-10-1116789538-1844644491" "resource_namespace"="openshift-ingress" "resource_version"="v1" "type"="DNS-01"
I0229 07:58:54.888510       1 wait.go:329] Returning cached zone record "DOMAIN." for fqdn "_acme-challenge.TEST.SUBDOMAIN.DOMAIN."
I0229 07:58:55.003610       1 sync.go:682] cert-manager/orders "msg"="Retrieved ACME order from server" "raw_data"={"URI":"","Status":"ready","Expires":"2024-03-07T07:56:55Z","Identifiers":[{"Type":"dns","Value":...
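
When comparing the two runs it can help to measure how long the ACME server took to answer. Going by the timestamps above, the failing run gets its error about 1.4 s after "waiting for authorization for domain", while the successful run is verified about 11 s after it. A minimal sketch for extracting that from the log lines follows. The klog header carries no year, so the year parameter is an assumption, and the helper names are made up.

```python
import re
from datetime import datetime

# klog header: severity letter, MMDD, time with microseconds,
# thread/process id, then "file:line] message".
KLOG_HEADER = re.compile(
    r"^(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ (?P<src>[^\]]+)\] (?P<msg>.*)$"
)


def parse_klog(line, year=2024):
    """Split one klog line into (timestamp, severity, source, message)."""
    m = KLOG_HEADER.match(line)
    if m is None:
        raise ValueError(f"not a klog line: {line[:40]!r}")
    stamp = datetime.strptime(
        f"{year}-{m['month']}-{m['day']} {m['time']}", "%Y-%m-%d %H:%M:%S.%f"
    )
    return stamp, m["sev"], m["src"], m["msg"]


def seconds_between(first_line, second_line, year=2024):
    """Elapsed seconds from first_line to second_line."""
    first = parse_klog(first_line, year)[0]
    second = parse_klog(second_line, year)[0]
    return (second - first).total_seconds()
```

Passing the sync.go:376 line and the following error or DomainVerified line of one challenge to seconds_between gives the time the validation took from cert-manager's point of view.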

And one more important thing: a domain with failed validation may (not guaranteed) be validated on one of the next retries; it may happen after 2/4/8/16/32h, or more.

Try using https://unboundtest.com to query TXT while setting a longer wait like 120s. It uses a method similar to Let's Encrypt servers

On my US East Coast EC2 system it takes about 30-45s for Route53 to report that all DNS servers are "InSync" after an UPSERT.

I tested with 9 SANs in the certificate. 8 alt names were already validated, so to run a new challenge I changed the 9th alt name to a not-yet-validated one: test17-18-19...
I made 3 checks (1 positive and 2 negative) with unbound and with my script running dig from localhost.

In the negative case unbound can't get the TXT record for two minutes after dig gets it the first time.
So I looked at the unbound logs and checked the names and IPs of the parent NS servers with dig: they are all the same.

What can affect unbound/letsencrypt resolving the TXT record?

Test logs are below.
The local script checks these NS servers:
ns-1338.awsdns-39.org. 205.251.197.58
ns-1865.awsdns-41.co.uk. 205.251.199.73
ns-464.awsdns-58.com. 205.251.193.208
ns-616.awsdns-13.net. 205.251.194.104

1
positive test
Start at 4:25:18 with new test16 name
First time the script gets the TXT:

======================================================================
Mon 04 Mar 2024 11:25:23 +07
test16.caas31t16.epaas.s7corp.ru
======================================================================
205.251.197.58
"HPRGDJMAAviqwDAgaIbrY5JxYCOjZT6EAOaQ7q1Ojd0"
205.251.199.73
"HPRGDJMAAviqwDAgaIbrY5JxYCOjZT6EAOaQ7q1Ojd0"
205.251.193.208
"HPRGDJMAAviqwDAgaIbrY5JxYCOjZT6EAOaQ7q1Ojd0"
205.251.194.104
"HPRGDJMAAviqwDAgaIbrY5JxYCOjZT6EAOaQ7q1Ojd0"
8.8.8.8
8.8.4.4
9.9.9.9

Run unbound at 04:25:50 - NXDOMAIN
https://unboundtest.com/m/TXT/_acme-challenge.test16.caas31t16.epaas.s7corp.ru/GCKHLZZR
Run unbound at 04:27:12 - NOERROR and TXT - name validated, cert issued
https://unboundtest.com/m/TXT/_acme-challenge.test16.caas31t16.epaas.s7corp.ru/RL6BTP4F

2
negative test
Start at 4:28:35 with new test17 name
First time the script gets the TXT:

======================================================================
Mon 04 Mar 2024 11:28:44 +07
test17.caas31t16.epaas.s7corp.ru
======================================================================
205.251.197.58
"GU9dH7W78tBNB30mYVKNDVNLFBIunbgLVMel6m6N3g4"
205.251.199.73
"GU9dH7W78tBNB30mYVKNDVNLFBIunbgLVMel6m6N3g4"
205.251.193.208
"GU9dH7W78tBNB30mYVKNDVNLFBIunbgLVMel6m6N3g4"
205.251.194.104
"GU9dH7W78tBNB30mYVKNDVNLFBIunbgLVMel6m6N3g4"
8.8.8.8
"GU9dH7W78tBNB30mYVKNDVNLFBIunbgLVMel6m6N3g4"
8.8.4.4
9.9.9.9

Run unbound at 04:28:46 - NXDOMAIN
https://unboundtest.com/m/TXT/_acme-challenge.test17.caas31t16.epaas.s7corp.ru/5XSEYH5N
Run unbound at 04:29:33 - NXDOMAIN
https://unboundtest.com/m/TXT/_acme-challenge.test17.caas31t16.epaas.s7corp.ru/CXCAGU47
Run unbound at 04:30:18 - NXDOMAIN

3
negative test
Start at 5:00:21 with new test18 name
First time the script gets the TXT:

======================================================================
Mon 04 Mar 2024 12:00:29 +07
test18.caas31t16.epaas.s7corp.ru
======================================================================
205.251.193.208
"Kn5nC-ON6kEl5XjYiHu0QgOGeenkf1R4XO8U8Q0ulqM"
205.251.199.73
205.251.194.104
205.251.197.58
8.8.8.8
8.8.4.4
9.9.9.9

propagated after 10 sec:
======================================================================
Mon 04 Mar 2024 12:00:39 +07
test18.caas31t16.epaas.s7corp.ru
======================================================================
205.251.193.208
"Kn5nC-ON6kEl5XjYiHu0QgOGeenkf1R4XO8U8Q0ulqM"
205.251.199.73
"Kn5nC-ON6kEl5XjYiHu0QgOGeenkf1R4XO8U8Q0ulqM"
205.251.194.104
"Kn5nC-ON6kEl5XjYiHu0QgOGeenkf1R4XO8U8Q0ulqM"
205.251.197.58
"Kn5nC-ON6kEl5XjYiHu0QgOGeenkf1R4XO8U8Q0ulqM"
8.8.8.8
8.8.4.4
9.9.9.9

Run unbound at 05:01:28 - NXDOMAIN
https://unboundtest.com/m/TXT/_acme-challenge.test18.caas31t16.epaas.s7corp.ru/XMZHSQQL
Run unbound at 05:01:42 - NXDOMAIN
https://unboundtest.com/m/TXT/_acme-challenge.test18.caas31t16.epaas.s7corp.ru/QRLLHZPJ
Run unbound at 05:02:12 - NXDOMAIN
https://unboundtest.com/m/TXT/_acme-challenge.test18.caas31t16.epaas.s7corp.ru/BI5KDR7K

I would change the name server list at your ccTLD registrar and in the zone file, removing the name servers 91.236.235.254 and 91.236.234.254; they are not answering authoritatively for your domain. Leave only the route53 servers in place.

Normally a route53 update takes 10 seconds. However, I have seen propagation take 1 minute. If you have 9 domains, the worst case can be 9 minutes. The key is the behaviour of the k8s cert_manager. Does it update the domains one by one, or does it accumulate all domain updates and apply them to the zone in a batch? Does it poll route53 for the end of the propagation, or does it simply expect the predefined wait time to be enough before proceeding with the challenge verification?

As I am not an expert in k8s cert_manager, I cannot answer these questions.
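
One way to see which of the listed name servers answer authoritatively is to send each a non-recursive SOA query and look for the "aa" flag in the response header. The sketch below assumes dig is installed and parses its default text output; the function names are made up, and a server that times out is simply counted as not authoritative.

```python
import re
import subprocess


def run_dig_soa(zone, server):
    """Ask server for the zone's SOA without recursion; return dig's full output."""
    return subprocess.run(
        ["dig", "+norecurse", "+time=2", "+tries=1", "SOA", zone, f"@{server}"],
        capture_output=True, text=True, check=False,
    ).stdout


def answers_authoritatively(dig_output):
    """True if the response status is NOERROR and the 'aa' flag is set."""
    status = re.search(r"status: ([A-Z]+)", dig_output)
    flags = re.search(r"^;; flags: ([a-z ]*);", dig_output, re.MULTILINE)
    if status is None or flags is None:
        return False  # timeout or output without a response header
    return status.group(1) == "NOERROR" and "aa" in flags.group(1).split()


def non_authoritative_servers(zone, servers, query=run_dig_soa):
    """Return the servers that fail the authoritative-answer check for zone."""
    return [s for s in servers if not answers_authoritatively(query(zone, s))]
```

Running non_authoritative_servers with the zone name and every NS address delegated at the registrar (the four Route53 servers plus the two 91.236.x.254 addresses) should list exactly the servers that need to be removed from the delegation.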

Thank you for your answer.
It is not easy to remove the name servers 91.236.235.254 and 91.236.234.254; I will talk about it with colleagues.
Propagation across the 4 Amazon NS servers takes 5-10 seconds, as we can see in the logs.
Cert-manager runs the challenges for all new alt names at the same time (if they were not validated previously); there is a separate challenge for each name, with a different token in the TXT record.
Cert-manager self-checks TXT record propagation before triggering the check through LE.
Logs:

I0305 04:23:48.032230       1 dns.go:116] "checking DNS propagation" logger="cert-manager.challenges.Check" resource_name="default-cert-ingress-23-2571711902-3707694521" resource_namespace="openshift-ingress" resource_kind="Challenge" resource_version="v1" dnsName="test24.caas31t16.epaas.s7corp.ru" type="DNS-01" resource_name="default-cert-ingress-23-2571711902-3707694521" resource_namespace="openshift-ingress" resource_kind="Challenge" resource_version="v1" domain="test24.caas31t16.epaas.s7corp.ru" nameservers=["coredns.cert-manager.svc:53"]
I0305 04:23:48.033071       1 logs.go:206] "Event(v1.ObjectReference{Kind:\"Challenge\", Namespace:\"openshift-ingress\", Name:\"default-cert-ingress-23-2571711902-3707694521\", UID:\"7694b4cb-8e2b-44c4-bf09-e0c71092177c\", APIVersion:\"acme.cert-manager.io/v1\", ResourceVersion:\"27036549\", FieldPath:\"\"}): type: 'Normal' reason: 'Presented' Presented challenge using DNS-01 challenge mechanism" logger="cert-manager.controller"
I0305 04:23:48.166834       1 wait.go:145] Looking up TXT records for "_acme-challenge.test24.caas31t16.epaas.s7corp.ru."
I0305 04:23:48.166858       1 wait.go:160] Selfchecking using the DNS Lookup method was successful
.....
E0305 04:24:50.258486       1 sync.go:379] "error waiting for authorization" err="acme: authorization error for test24.caas31t16.epaas.s7corp.ru: 400 urn:ietf:params:acme:error:dns: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.test24.caas31t16.epaas.s7corp.ru - check that a DNS record exists for this domain" logger="cert-manager.challenges.acceptChallenge" resource_name="default-cert-ingress-23-2571711902-3707694521" resource_namespace="openshift-ingress" resource_kind="Challenge" resource_version="v1" dnsName="test24.caas31t16.epaas.s7corp.ru" type="DNS-01"

Thanks. It is important that your colleagues correct the DNS configuration error. This is likely the reason for the problem you are encountering. My guess is that the DNS validation check of the Let's Encrypt ACME server sometimes hits the non-functioning DNS servers.

You can do that at the domain registrar.
[nothing needs to change between all the DNS servers]
