The LE challenge validation has been failing for the last few days with the error below, even though the challenge record is present. I verified this with https://dnschecker.org/ as well.
I'd appreciate any help or insights into debugging this further.
During secondary validation: DNS problem: query timed out looking up TXT for _acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work.
Please fill out the fields below so we can help you better. Note: you must provide your domain name to get help. Domain names for issued certificates are all made public in Certificate Transparency logs (e.g. crt.sh | example.com), so withholding your domain name here does not increase secrecy, but only makes it harder for us to provide help.
My domain is:
I ran this command:
kwaikar@kwaiker-MBP16 bin % dig -t txt _acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work +short
It produced this output:
I'm using a control panel to manage my site (no, or provide the name and version of the control panel): NO
The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):
I'm seeing NXDOMAIN being returned for _acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work (as well as just xcu2-8y8x.dev.cldr.work). Can you leave the challenge record in your DNS for some time so that others can take a look at querying in various ways to see if they can figure out what's going on?
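For anyone who wants to try "querying in various ways", here is a minimal sketch that runs the same `dig` lookup against several resolvers with a short timeout, so a timeout on one path stands out. The resolver list and the helper names `build_dig_command` and `query_all` are assumptions made for illustration. Note that these are public recursive resolvers, and Let's Encrypt's own validation may see something different.

```python
import subprocess

# Hypothetical resolver list: public recursive resolvers picked for illustration.
RESOLVERS = ["8.8.8.8", "1.1.1.1", "9.9.9.9"]


def build_dig_command(name, rdtype, resolver, timeout=5):
    """Return the dig argv for one lookup against one resolver."""
    return [
        "dig",
        f"@{resolver}",
        "-t", rdtype,
        name,
        "+short",
        f"+time={timeout}",  # per-try timeout in seconds
        "+tries=1",          # a single try, so a timeout is not masked by retries
    ]


def query_all(name, rdtype="TXT", resolvers=RESOLVERS):
    """Run dig once per resolver and collect the output or the failure reason."""
    results = {}
    for resolver in resolvers:
        cmd = build_dig_command(name, rdtype, resolver)
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
            results[resolver] = proc.stdout.strip() or f"(empty, exit code {proc.returncode})"
        except subprocess.TimeoutExpired:
            results[resolver] = "(dig did not finish in time)"
        except FileNotFoundError:
            results[resolver] = "(dig is not installed)"
    return results
```

Calling `query_all("_acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work")` returns one entry per resolver, which makes it easy to see whether only some paths fail.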
We've been seeing similar issues with the staging environment since earlier this morning. We seem to fail around the challenge step, with the final error being:
Error accepting challenge: 400 urn:ietf:params:acme:error:malformed: Unable to update challenge :: authorization must be pending
Interestingly, one thing we've noticed is that the authz URL seems different between the prod and staging environments.
For the prod environment the authz URL is accessible via a GET and returns some meta information: e.g.
_acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work/TXT: A query for _acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work results in a NOERROR response, while a query for its ancestor, xcu2-8y8x.dev.cldr.work, returns a name error (NXDOMAIN), which indicates that subdomains of xcu2-8y8x.dev.cldr.work, including _acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work, don't exist.
That is, the NXDOMAIN response is supposed to mean that there aren't any responses available at all for any subdomains of it either, but your system is responding to xcu2-8y8x.dev.cldr.work with NXDOMAIN even though subdomains are giving other responses.
But again, I don't really think that's causing the problem you're seeing.
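The inconsistency described above (the behaviour RFC 8020 covers: NXDOMAIN means nothing exists underneath) can be checked mechanically once the response codes are known. This is only a sketch under assumptions: the helper names `ancestors` and `nxdomain_conflicts` are made up here, and the response codes are supplied by hand from what was reported in this thread rather than looked up live.

```python
def ancestors(name):
    """Yield each proper ancestor of a DNS name, nearest first, excluding the root."""
    labels = name.rstrip(".").split(".")
    for i in range(1, len(labels)):
        yield ".".join(labels[i:])


def nxdomain_conflicts(name, rcodes):
    """Return ancestors that answered NXDOMAIN although `name` itself exists.

    `rcodes` maps a name to the response code observed for it, e.g. "NOERROR".
    Names missing from `rcodes` are treated as not checked.
    """
    if rcodes.get(name) != "NOERROR":
        return []
    return [a for a in ancestors(name) if rcodes.get(a) == "NXDOMAIN"]


# Response codes as reported in this thread, entered by hand.
observed = {
    "_acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work": "NOERROR",
    "xcu2-8y8x.dev.cldr.work": "NXDOMAIN",
}
conflicts = nxdomain_conflicts(
    "_acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work", observed
)
print(conflicts)  # ['xcu2-8y8x.dev.cldr.work']
```

A resolver that trusts the NXDOMAIN on the ancestor may stop there, which is why the tool flags it even if it turns out not to be the cause of the timeouts.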
Aren't those supposed to be POST-as-GET requests anyway? I think the latest update still had them enforcing that in staging even though production doesn't.
We're actively using Let's Encrypt staging in our pre-production environment, and we've also noticed that DNS lookup timeout errors have been returned frequently since April 13, around 3:30-3:45 UTC. Example:
one or more domains had a problem:
[*.<domain>] acme: error: 400 :: urn:ietf:params:acme:error:dns :: During secondary validation: DNS problem: query timed out looking up CAA for <domain>
[<domain>] acme: error: 400 :: urn:ietf:params:acme:error:dns :: During secondary validation: DNS problem: query timed out looking up TXT for _acme-challenge.<domain>
Our DNS provider is Akamai, and certificate generation was working fine until these issues appeared. As far as I can see, multiple people are reporting this problem with different DNS providers, so I think the problem is more likely to be with Let's Encrypt staging.
We're using the Let's Encrypt staging environment from cert-manager inside a Kubernetes cluster, and starting from 4:00 AM UTC we have been experiencing an issue similar to the one @Evesy describes above. The errors we get are:
1 sync.go:386] cert-manager/controller/challenges/acceptChallenge "msg"="error waiting for authorization" "error"="context deadline exceeded"
1 sync.go:378] cert-manager/controller/challenges/acceptChallenge "msg"="error accepting challenge" "error"="400 urn:ietf:params:acme:error:malformed: Unable to update challenge :: authorization must be pending"
When we switch to the Let's Encrypt prod server, the ACME flow works as expected.
Yes, I think that may not be the cause, for the following reasons: the DNS Checker propagation tool shows everything as fine, and NS records do exist for "dev.cldr.work" and its subdomains. Moreover, this has been working for years, and the same domain structure is working fine with the prod environment right now.
Is there anything else you suspect?
Yes. Staging (and Pebble) have both required POST-as-GET on most endpoints for quite some time. I think directory and nonce are the only ones that allow unauthenticated GET, but I could be wrong on that.
@Evesy If you're using an internal tool to make those requests, you should have it updated. If you are using a third-party tool or library, a newer version that supports post-as-get should be available. If you or your team have problems updating your toolset, feel free to start a dedicated thread in this forum.
Thanks @jvanasco. We're using cert-manager, which I'd expect is using POST-as-GET. It was only while debugging the issue today that I tried performing a GET on those URLs (as I recalled it had proved useful in troubleshooting many moons ago) and happened to spot the difference between the two environments.
I assume the same information can be retrieved from the endpoint if the request is made correctly with POST-as-GET.
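For reference, here is a rough sketch of what a POST-as-GET request body looks like per RFC 8555 (section 6.3): a normal account-signed JWS whose payload is the empty string. The function name `post_as_get_body` and the `sign` callable are assumptions for illustration, and real signing with the account key is left out because it needs a crypto library. A proper ACME client would handle all of this for you.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWS requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def post_as_get_body(url, kid, nonce, sign, alg="ES256"):
    """Build the JSON body for an ACME POST-as-GET request.

    `sign` is a hypothetical callable that takes the signing input bytes and
    returns the raw signature bytes produced with the account key.
    """
    protected = b64url(json.dumps(
        {"alg": alg, "kid": kid, "nonce": nonce, "url": url}
    ).encode("utf-8"))
    payload = ""  # POST-as-GET: the payload is the empty string, not "{}"
    signing_input = f"{protected}.{payload}".encode("ascii")
    return json.dumps({
        "protected": protected,
        "payload": payload,
        "signature": b64url(sign(signing_input)),
    })
```

The resulting body would be sent as an HTTP POST to the authz URL with `Content-Type: application/jose+json`, using a fresh nonce from the server's newNonce endpoint.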