Cert-manager challenge stuck "Waiting for DNS-01 challenge propagation: read tcp i/o timeout"

Please fill out the fields below so we can help you better. Note: you must provide your domain name to get help. Domain names for issued certificates are all made public in Certificate Transparency logs (e.g. crt.sh | example.com), so withholding your domain name here does not increase secrecy, but only makes it harder for us to provide help.

My domain is:
alertmanager.sre.rd.elliemae.io

I ran these commands and they produced this output:

❯ kubectl describe cert alertmanager-tls -n prometheus-stack | tail -20

Conditions:
Last Transition Time: 2022-01-06T19:09:43Z
Message: Certificate is up to date and has not expired
Observed Generation: 1
Reason: Ready
Status: True
Type: Ready
Last Transition Time: 2024-01-07T15:43:47Z
Message: Renewing certificate as renewal was scheduled at 2023-12-17 14:42:48 +0000 UTC
Observed Generation: 1
Reason: Renewing
Status: True
Type: Issuing
Last Failure Time: 2024-01-07T14:43:47Z
Next Private Key Secret Name: alertmanager-tls-c7qsz
Not After: 2024-01-16T14:42:48Z
Not Before: 2023-10-18T14:42:49Z
Renewal Time: 2023-12-17T14:42:48Z
Revision: 11
Events:

❯ kubectl get challenges -n prometheus-stack
NAME STATE DOMAIN AGE
alertmanager-tls-wjfvd-3810699091-2192805427 pending alertmanager.sre.rd.elliemae.io 4d17h

❯ kubectl describe challenge alertmanager-tls-wjfvd-3810699091-2192805427 -n prometheus-stack
Name: alertmanager-tls-wjfvd-3810699091-2192805427
Namespace: prometheus-stack
Labels:
Annotations:
API Version: acme.cert-manager.io/v1
Kind: Challenge
Metadata:
Creation Timestamp: 2024-01-07T13:38:13Z
Finalizers:
finalizer.acme.cert-manager.io
Generation: 1
Managed Fields:
API Version: acme.cert-manager.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"finalizer.acme.cert-manager.io":
f:ownerReferences:
.:
k:{"uid":"8a31bd1f-9717-49dc-aa8e-aa87b8134f31"}:
f:spec:
.:
f:authorizationURL:
f:dnsName:
f:issuerRef:
.:
f:group:
f:kind:
f:name:
f:key:
f:solver:
.:
f:dns01:
.:
f:route53:
.:
f:hostedZoneID:
f:region:
f:secretAccessKeySecretRef:
.:
f:name:
f:selector:
.:
f:dnsZones:
f:token:
f:type:
f:url:
f:wildcard:
Manager: controller
Operation: Update
Time: 2024-01-07T13:38:13Z
API Version: acme.cert-manager.io/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:presented:
f:processing:
f:reason:
f:state:
Manager: controller
Operation: Update
Subresource: status
Time: 2024-01-07T13:40:36Z
Owner References:
API Version: acme.cert-manager.io/v1
Block Owner Deletion: true
Controller: true
Kind: Order
Name: alertmanager-tls-wjfvd-3810699091
UID: 8a31bd1f-9717-49dc-aa8e-aa87b8134f31
Resource Version: 224677696
UID: 25247b15-25c4-4854-878e-aa127e769754
Spec:
Authorization URL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/301702713276
Dns Name: alertmanager.sre.rd.elliemae.io
Issuer Ref:
Group: cert-manager.io
Kind: ClusterIssuer
Name: letsencrypt-prod
Key: MIN_PXvVHcXtK8CQdGR8DRG66gtxBdMECNHsiJlUiFA
Solver:
dns01:
route53:
Hosted Zone ID: Z0025955R6YE7X8EBU2U
Region: us-east-1
Secret Access Key Secret Ref:
Name:
Selector:
Dns Zones:
sre.rd.elliemae.io
Token: Px-dPicOu0JKtVYGEIaD-HweV_jiUlAh1lrsDXS8hLo
Type: DNS-01
URL: https://acme-v02.api.letsencrypt.org/acme/chall-v3/301702713276/WAZaIg
Wildcard: false
Status:
Presented: true
Processing: true
Reason: Waiting for DNS-01 challenge propagation: read tcp 10.216.39.125:49176->205.251.193.107:53: i/o timeout
State: pending
Events:

❯ kubectl exec -it prometheus-stack-prometheus-node-exporter-bpbl9 -n prometheus-stack -- nslookup -type=txt _acme-challenge.alertmanager.sre.rd.elliemae.io
Server: 10.130.39.46
Address: 10.130.39.46:53

Non-authoritative answer:
_acme-challenge.alertmanager.sre.rd.elliemae.io text = "MIN_PXvVHcXtK8CQdGR8DRG66gtxBdMECNHsiJlUiFA"

My web server is (include version):
Nginx on AWS EKS

The operating system my web server runs on is (include version):
Amazon Linux

My hosting provider, if applicable, is:
Amazon AWS

I can login to a root shell on my machine (yes or no, or I don't know):
Yes, I can run kubectl commands to query AWS EKS cluster

I'm using a control panel to manage my site (no, or provide the name and version of the control panel):
No

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):

Haven't tried certbot

Certbot is just an example. The question is asking for the version of whichever ACME client you are actually using; it could be any ACME client.

Unfortunately, I don't have anything to add about the issue itself, as I have no experience with Kubernetes.

When you quote log output, it's best to wrap it in three backticks (```) at the start and end of the block so that the formatting is preserved. As posted, this is very hard to read.

Googling the same error suggests that cert-manager is having trouble doing its own validation check (querying the nameservers) before submitting the challenge to the CA for review.
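
One detail worth pulling out of the Reason line: the failing operation is a `read` on `tcp`, not a `dial`. In Go network errors that usually means the connection to port 53 was opened but no reply arrived before the timeout, although that reading is an inference from the error format and not something confirmed in this thread. A small Python sketch that splits the message into its parts (the `parse_reason` helper and its pattern are illustrative only, not part of cert-manager):

```python
import re
from typing import Optional

# Matches Go-style network errors such as
# "read tcp 10.0.0.1:1234->192.0.2.1:53: i/o timeout".
NET_ERROR = re.compile(
    r"(?P<op>read|write|dial) (?P<proto>tcp|udp) "
    r"(?:(?P<src>[\d.]+:\d+)->)?(?P<dst>[\d.]+:\d+): (?P<cause>.+)$"
)


def parse_reason(reason: str) -> Optional[dict]:
    """Return op, protocol, endpoints and cause found in a Reason string."""
    match = NET_ERROR.search(reason)
    return match.groupdict() if match else None


# The Reason string reported on the pending challenge.
reason = (
    "Waiting for DNS-01 challenge propagation: "
    "read tcp 10.216.39.125:49176->205.251.193.107:53: i/o timeout"
)
info = parse_reason(reason)
print(info)
```

Here `src` is the pod-side address and `dst` is the nameserver being queried, so the same helper can be pointed at later Reason strings to see whether the destination changes after a configuration change.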

Is that DNS server [205.251.193.107] usable?
Try another DNS server [the Internet is full of them].

I added more DNS servers, but unfortunately I'm still getting i/o timeouts. I suspect the lookups are being blocked.

  set {
    name  = "extraArgs"
    value = "{--dns01-recursive-nameservers-only,--dns01-recursive-nameservers=8.8.8.8:53\\,1.1.1.1:53}"
  }

Status:
  Presented:   true
  Processing:  true
  Reason:      Waiting for DNS-01 challenge propagation: read tcp 10.216.39.168:57150->1.1.1.1:53: i/o timeout
  State:       pending
Events:        <none>
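
Note that the earlier `nslookup` went through the cluster resolver at 10.130.39.46, while both timeouts are on TCP connections from the pod network straight to an outside server on port 53 (205.251.193.107, then 1.1.1.1). To exercise that same path by hand, here is a rough sketch that sends a TXT query over TCP using only the Python standard library. It assumes Python 3 is available in some pod on the same nodes, which may not be the case here, and the function names are made up for illustration.

```python
import socket
import struct


def build_txt_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query message asking for the TXT record of `name`."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    # QTYPE 16 = TXT, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 16, 1)


def recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes, raising if the peer closes early."""
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("connection closed before the full reply arrived")
        data += chunk
    return data


def query_txt_over_tcp(server: str, name: str, port: int = 53,
                       timeout: float = 5.0) -> bytes:
    """Send a TXT query over TCP and return the raw reply message."""
    query = build_txt_query(name)
    with socket.create_connection((server, port), timeout=timeout) as sock:
        # DNS over TCP prefixes every message with a 2-byte length.
        sock.sendall(struct.pack("!H", len(query)) + query)
        (length,) = struct.unpack("!H", recv_exact(sock, 2))
        return recv_exact(sock, length)


def rcode(reply: bytes) -> int:
    """Extract the response code (0 means NOERROR) from a raw DNS reply."""
    return reply[3] & 0x0F
```

Calling `query_txt_over_tcp("1.1.1.1", "_acme-challenge.alertmanager.sre.rd.elliemae.io")` from inside the cluster and getting a timeout there too would point at something between the pods and the outside world (security groups, network ACLs, a NetworkPolicy, or a firewall), not at cert-manager itself.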

You need to focus your testing on DNS [until that is corrected], ideally from inside a pod in the cluster.
Try:
dig google.com
nslookup ibm.com

If neither works, then you have DNS problem(s).
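
If `dig` and `nslookup` are not installed in the container you can exec into, roughly the same check can be done with the system resolver from Python. This is only a sketch under that assumption, and `resolve` is a hypothetical helper name. Keep in mind that it goes through the pod's configured resolver, the same path as the `nslookup` in the first post, so it does not test direct connections to an outside nameserver.

```python
import socket
from typing import List


def resolve(name: str) -> List[str]:
    """Return the addresses the system resolver gives for `name`, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address as a string.
    return sorted({info[4][0] for info in infos})
```

For example, `resolve("google.com")` and `resolve("ibm.com")` returning empty lists would suggest that general name resolution from the pod is broken, while non-empty lists would narrow the problem down to the direct queries on port 53.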
