[Staging] Record exists but query timing out looking up TXT record

Hi Team,

The LE challenge validation has been failing for the last few days with the error below, even though the challenge record is present. I verified this with https://dnschecker.org/ as well.
I'd appreciate any help or insight into debugging this further.

 During secondary validation: DNS problem: query timed out looking up TXT for _acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work.

Please fill out the fields below so we can help you better. Note: you must provide your domain name to get help. Domain names for issued certificates are all made public in Certificate Transparency logs (e.g. crt.sh | example.com), so withholding your domain name here does not increase secrecy, but only makes it harder for us to provide help.

My domain is:
eptest-1.test-001.xcu2-8y8x.dev.cldr.work

I ran this command:
kwaikar@kwaiker-MBP16 bin % dig -t txt _acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work +short

It produced this output:
"OWLc0PIwjWBgt9PakqoyKtLYNTmRWqS4KAlS61CsfVw"

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): NO

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):
Acme4j v2.10

Same here: in the staging environment the DNS challenge fails although the DNS TXT record is created correctly, as verified with public tools like https://dnschecker.org/

I opened a similar thread. This morning I tried a test renewal of my certs on many servers, and even some http-01 challenge based certs are failing with this error. My DNS is hosted at Gandi.

I'm seeing NXDOMAIN being returned for _acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work (as well as just xcu2-8y8x.dev.cldr.work). Can you leave the challenge record in your DNS for some time so that others can take a look at querying in various ways to see if they can figure out what's going on?

Sorry, I didn't realize the script had cleaned up. _acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work is available now.

We're seeing similar issues with the staging environment since earlier this morning -- We seem to fail around the challenge point with the ultimate error being:

Error accepting challenge: 400 urn:ietf:params:acme:error:malformed: Unable
    to update challenge :: authorization must be pending'

Interestingly one thing we've noticed is the authz URL seems different between the prod & staging environments.
For the prod environment the authz URL is accessible via a GET and returns some meta information: e.g.

GET https://acme-v02.api.letsencrypt.org/acme/authz-v3/<something>

{
  "identifier": {
    "type": "dns",
    "value": "le-prod-test5.testing.k8.atcloud.io"
  },
  "status": "valid",
  "expires": "2022-05-13T13:15:21Z",
  "challenges": [
    {
      "type": "dns-01",
      "status": "valid",
      "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/<something>/Tj3W-w",
      "token": "<token>",
      "validationRecord": [
        {
          "hostname": "le-prod-test5.testing.k8.atcloud.io"
        }
      ],
      "validated": "2022-04-13T13:15:18Z"
    }
  ]
}

Whereas the authz URLs for certificates issued against the staging environment do not accept GET requests:

{
  "type": "urn:ietf:params:acme:error:malformed",
  "detail": "Method not allowed",
  "status": 405
}

(We were hoping the authz URL would give us the true cause of the ACME failure, as I believe the error we're getting is not the root cause.)

We've successfully issued one or two certificates through staging today, but the large majority have failed in the same way.

Well, I don't think it's the root cause of your problem, but you do have an odd issue with your DNS. See the DNSViz report:

https://dnsviz.net/d/_acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work/dnssec/

_acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work/TXT: A query for _acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work results in a NOERROR response, while a query for its ancestor, xcu2-8y8x.dev.cldr.work, returns a name error (NXDOMAIN), which indicates that subdomains of xcu2-8y8x.dev.cldr.work, including _acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work, don't exist.

That is, the NXDOMAIN response is supposed to mean that there aren't any responses available at all for any subdomains of it either, but your system is responding to xcu2-8y8x.dev.cldr.work with NXDOMAIN even though subdomains are giving other responses.
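The inconsistency DNSViz describes can be checked mechanically once you have collected response codes for a name and its ancestors. The following is a minimal sketch, assuming the codes were gathered separately (for example by hand with dig) and passed in as a plain dictionary; the function name `find_nxdomain_conflicts` is made up for illustration and is not part of any DNS library.

```python
def find_nxdomain_conflicts(rcodes):
    """Return sorted (ancestor, descendant) pairs where the ancestor answered
    NXDOMAIN although a name beneath it answered NOERROR.

    `rcodes` maps a domain name to the response code observed for it.
    """
    def norm(name):
        # Compare names case-insensitively and without the trailing dot.
        return name.rstrip(".").lower()

    normalized = {norm(name): rcode.upper() for name, rcode in rcodes.items()}
    conflicts = []
    for ancestor, rcode in normalized.items():
        if rcode != "NXDOMAIN":
            continue
        suffix = "." + ancestor
        for name, other in normalized.items():
            if name.endswith(suffix) and other == "NOERROR":
                conflicts.append((ancestor, name))
    return sorted(conflicts)


# Response codes as described in the DNSViz report quoted above.
observed = {
    "xcu2-8y8x.dev.cldr.work.": "NXDOMAIN",
    "_acme-challenge.eptest-1.test-001.xcu2-8y8x.dev.cldr.work.": "NOERROR",
}
print(find_nxdomain_conflicts(observed))
```

An empty result means no ancestor in the supplied data contradicts its descendants; it says nothing about names you did not query.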

But again, I don't really think that's causing the problem you're seeing.

Aren't those supposed to be POST-as-GET requests anyway? I think the latest update still had them enforcing that in staging even though production doesn't.

Hi everyone,

We're actively using Let's Encrypt staging in our pre-production environment, and we've also noticed that DNS lookup timeout errors have been returned frequently since April 13, around 3:30-3:45 UTC. Example:

one or more domains had a problem:
[*.<domain>] acme: error: 400 :: urn:ietf:params:acme:error:dns :: During secondary validation: DNS problem: query timed out looking up CAA for <domain>
[<domain>] acme: error: 400 :: urn:ietf:params:acme:error:dns :: During secondary validation: DNS problem: query timed out looking up TXT for _acme-challenge.<domain>

Our DNS provider is Akamai, and certificate generation was working fine until these issues appeared. As far as I can see, multiple people are reporting this problem with different DNS providers, so I think the problem is more likely to be with Let's Encrypt staging.
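When many renewals fail at once, it can help to tally which failures are secondary-validation timeouts and which record types they involve. Below is a rough sketch that parses error lines shaped like the ones quoted above. The regular expressions are inferred only from those sample lines, so other clients or other error types may need different patterns, and `parse_acme_error` is a hypothetical helper name.

```python
import re

# Patterns inferred from the sample error lines in this thread (an assumption).
ERROR_RE = re.compile(
    r"acme: error: (?P<status>\d{3}) :: "
    r"(?P<type>urn:ietf:params:acme:error:\w+) :: (?P<detail>.*)"
)
LOOKUP_RE = re.compile(r"looking up (?P<rrtype>[A-Z]+) for (?P<name>\S+)")


def parse_acme_error(line):
    """Split one client error line into its parts, or return None if it does not match."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    detail = match.group("detail")
    info = {
        "status": int(match.group("status")),
        # Keep only the last URN component, e.g. "dns".
        "type": match.group("type").rsplit(":", 1)[-1],
        "secondary": detail.startswith("During secondary validation"),
        "timeout": "query timed out" in detail,
        "rrtype": None,
        "name": None,
    }
    lookup = LOOKUP_RE.search(detail)
    if lookup:
        info["rrtype"] = lookup.group("rrtype")
        info["name"] = lookup.group("name")
    return info


# example.org is a placeholder for the redacted <domain> in the log above.
sample = (
    "[*.example.org] acme: error: 400 :: urn:ietf:params:acme:error:dns :: "
    "During secondary validation: DNS problem: query timed out "
    "looking up CAA for example.org"
)
print(parse_acme_error(sample))
```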

We're using the Let's Encrypt staging environment from cert-manager inside a Kubernetes cluster, and since 4:00 AM UTC we have been experiencing an issue similar to the one @Evesy describes above. The errors we get are:
1 sync.go:386] cert-manager/controller/challenges/acceptChallenge "msg"="error waiting for authorization" "error"="context deadline exceeded"
1 sync.go:378] cert-manager/controller/challenges/acceptChallenge "msg"="error accepting challenge" "error"="400 urn:ietf:params:acme:error:malformed: Unable to update challenge :: authorization must be pending"

When we switch to the Let's Encrypt prod server, the ACME flow works as expected.

Yes, I think that may not be the cause, for the following reasons: https://dnschecker.org/ shows no problems, and NS records do exist for "dev.cldr.work" and its subdomains. Moreover, this has been working for years, and the same domain structure is working fine with the prod environment right now.
Is there anything else you suspect?

Let's Encrypt SRE is addressing an issue with the DNS resolvers in our staging environment.

It should be resolved momentarily!

Are you sure this isn't impacting production as well?

I'm not sure but it doesn't look like all impacted users were using the staging environment.

All this mess and it was just staging?

There are acme clients in the wild that check staging before actually renewing, by default? O.o

I also have the problem on production environments. And I've switched to staging to debug.

Not just that, but I believe there are acme clients that check Let's Encrypt's staging before using some other CA for production. :)

Yes. Staging (and Pebble) have both required POST-as-GET on most endpoints for quite some time. I think directory and nonce are the only ones that allow unauthenticated GET, but I could be wrong on that.

@Evesy If you're using an internal tool to make those requests, you should have it updated. If you are using a third-party tool or library, a newer version that supports post-as-get should be available. If you or your team have problems updating your toolset, feel free to start a dedicated thread in this forum.

Thanks @jvanasco -- We're using cert-manager, which I'd expect is using POST-as-GET. It was only whilst trying to debug the issue today that I tried performing a GET on those URLs (as I recalled it had proved useful in troubleshooting many moons ago), and happened to spot the difference between the two environments.
I assume the same information can be retrieved from the endpoint if the request is made correctly with POST-as-GET.

You assume correctly. Staging and Pebble (and eventually Boulder) simply now require a POST-as-GET for the account related endpoints.

The manual GET requests are incredibly useful. I just use a small python script to upgrade manual debugging to POST-as-GET.

Doesn't curl -X POST -d "" "${URL}" work?

Just change the /acme/ in the URL to /get/ and you won't need POST-as-GET.
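For manual debugging, that path swap is easy to script. Here is a small sketch that rewrites the path prefix and leaves the rest of the URL untouched. It only encodes the tip above; which resource types are actually served under /get/ is not something this code knows, and both `to_get_url` and the authorization ID in the example are made up.

```python
from urllib.parse import urlsplit, urlunsplit


def to_get_url(url):
    """Rewrite an ACME resource URL from the /acme/ prefix to the /get/ prefix."""
    parts = urlsplit(url)
    prefix = "/acme/"
    if not parts.path.startswith(prefix):
        raise ValueError(f"path does not start with {prefix}: {parts.path}")
    new_path = "/get/" + parts.path[len(prefix):]
    return urlunsplit(parts._replace(path=new_path))


# 12345 is a placeholder ID, not a real authorization.
print(to_get_url("https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/12345"))
```

The result can then be fetched with a plain GET, for example with curl.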

That command sends an empty POST, which works for those endpoints that do not require authentication.

Within the scope of the ACME spec and Let's Encrypt (see RFC 8555, Section 6.3), "POST-as-GET" means the client is sending an authenticated (signed) request to the server.

So what you need to do is issue the following command ...

`curl -X POST -d "${JWS}" "${URL}"`

... wherein $JWS is a properly formatted JSON Web Signature object (see RFC 8555, Section 6.2), which requires both the account key and a server nonce to generate, AND that object has an empty inner payload. For example:

            {
              "protected": base64url({
                "alg": "ES256",
                "jwk": {...},
                "nonce": "6S8IqOGY7eL2lsGoTZYifg",
                "url": "https://example.com/acme/new-account"
              }),
              "payload": "",
              "signature": "RZPOnYoPs1PhjszF...-nh6X1qtOFPB519I"
            }

It is possible to grab a nonce and create a properly signed JWS object with some shell scripting, but IMHO, it is much easier to accomplish all that with Python code.
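As a rough illustration of the Python route, the sketch below assembles the JWS body for a POST-as-GET request using only the standard library. It is not a complete client: the actual signing is left to a caller-supplied `sign` callable (hypothetical here; in practice it would wrap the account key, and the standard library has no ECDSA implementation), and fetching a fresh nonce is also left out. Note that it puts the account URL in "kid" rather than embedding a "jwk"; per RFC 8555 Section 6.2, "jwk" is reserved for new-account requests and for revocation requests signed with the certificate's own key.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWS requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_post_as_get(url, nonce, kid, sign, alg="ES256"):
    """Return the JSON body for a POST-as-GET request to `url`.

    `sign` is an assumed callable: it takes the signing input as bytes and
    returns the raw signature bytes for the account key.
    """
    protected = {"alg": alg, "kid": kid, "nonce": nonce, "url": url}
    protected_b64 = b64url(json.dumps(protected).encode("utf-8"))
    payload_b64 = ""  # POST-as-GET: the payload is the empty string
    # JWS signing input is "<protected>.<payload>", here with an empty payload.
    signing_input = f"{protected_b64}.{payload_b64}".encode("ascii")
    return json.dumps({
        "protected": protected_b64,
        "payload": payload_b64,
        "signature": b64url(sign(signing_input)),
    })
```

The resulting string is what would go in place of ${JWS} in the curl command above. RFC 8555 also requires the request to carry the header Content-Type: application/jose+json, so curl would need -H for that.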

This isn't necessary YET in production, but it will eventually be required due to security concerns.
