Unable to update challenge :: authorization must be pending

Ahh, I think you might have found the cause of the issue. Nice one.

Previously, Boulder would (for the purposes of CAA) consider everything except for SERVFAIL to be a success, and look for CAA RRs in the response. However, the recently introduced change treats NXDOMAIN as an error response, which causes the CAA check to fail.
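To make the difference concrete, here is a minimal sketch of the two behaviors as described in this post. It is not Boulder's actual code: the function names are made up for illustration, and it only models the three response codes mentioned in this thread.

```python
# Hypothetical illustration only; Boulder is written in Go and its real
# logic is not reproduced here. Function names are invented for this post.
NOERROR, NXDOMAIN, SERVFAIL = "NOERROR", "NXDOMAIN", "SERVFAIL"


def old_policy_is_error(rcode: str) -> bool:
    """Earlier behavior as described above: only SERVFAIL fails the CAA check."""
    return rcode == SERVFAIL


def new_policy_is_error(rcode: str) -> bool:
    """Recent staging behavior as described above: NXDOMAIN fails as well."""
    return rcode in (SERVFAIL, NXDOMAIN)


for rcode in (NOERROR, NXDOMAIN, SERVFAIL):
    print(rcode, old_policy_is_error(rcode), new_policy_is_error(rcode))
```

The only row that differs between the two policies is NXDOMAIN, which is the case people are hitting in staging.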

Maybe there is something happening with cached authorizations where certificates are being re-issued after the original TXT/A/AAAA RRs have been pulled (changing the query result from NOERROR to NXDOMAIN), but the CAA recheck happens anyway, and leads to this error outcome.


The domain should be resolvable to a private address; "e2e-dh-bbqy-master0.e2e-env.s6x8-odui.a0.stg.cldr.work" is an example. For the reported domain, the DNS record may have been deleted by now.

I tried adding a CAA record for the base domain to see if that helps, but that also failed with the same error. Not sure if I got that all right though.

From: e2e-dh-bbqy-master0.e2e-env.s6x8-odui.a0.stg.cldr.work | DNSViz


Any update? Are you working on a fix?

nslookup e2e-env.s6x8-odui.a0.stg.cldr.work ns-431.awsdns-53.com

Server:  ns-431.awsdns-53.com
Address: 205.251.193.175
*** UnKnown can't find e2e-env.s6x8-odui.a0.stg.cldr.work: Non-existent domain

Hmm… It may even be normal with a DNS-01 challenge for the client to delete the TXT record after the challenge succeeds but before requesting issuance, at which point many DNS servers might switch to NXDOMAIN by the time CAA is being checked. I'm not sure what exactly CAA requires, but either Let's Encrypt should change to allow for NXDOMAIN when checking CAA, or clients need to ensure that their DNS server still returns NOERROR for CAA until after issuance actually happens (in which case maybe some existing certs are misissued?).

Nobody who actually works for Let's Encrypt has responded yet; we're just random people on the Internet trying to help. It's not clear to me yet whether Let's Encrypt needs to change or your ACME client and/or DNS server software would need to change, in order to properly handle CAA checking.


Has this changed since yesterday? I ask because there has been no change in the ACME client or the DNS server.

Yes. In another thread, other people have listed some CAA-checking changes that Let's Encrypt is working on and that were recently deployed to staging.

Testing these sorts of things out is really the purpose of the staging environment.

I'm just personally not clear on whether (1) there's a bug in the change, preventing issuance when it should be allowing it, or (2) there's a bug in some (possibly many) ACME clients and/or DNS servers that have CAA records returning NXDOMAIN when it should be NOERROR, even though that used to "work" and Let's Encrypt would issue anyway (and still does work in production).


I have tried adding a CAA record to see if that helps, but it fails with the exact same error. Can you see what is wrong with this record? Or is there any other workaround? I have been blocked for a long time now :frowning:

dig +short dev.cldr.work CAA
0 issue "letsencrypt.org"

The CAA record is checked at each level of the domain name. And, using https://unboundtest.com, it still shows NXDOMAIN for CAA lookups for the above name. The CAA record for the apex name is fine.
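For anyone wondering what "each level" means in practice, here is a small sketch that lists the names a CA would query while walking up the tree. The helper name `caa_lookup_names` is made up for this post; a real CA stops at the first name that actually has a CAA RRset.

```python
from __future__ import annotations


def caa_lookup_names(fqdn: str) -> list[str]:
    """List the names queried for CAA, from the full name up to the TLD."""
    labels = fqdn.rstrip(".").split(".")
    # Drop one leading label at a time: a.b.c -> a.b.c, b.c, c
    return [".".join(labels[i:]) for i in range(len(labels))]


for name in caa_lookup_names("e2e-env.s6x8-odui.a0.stg.cldr.work"):
    print(name)
```

Every one of those names gets its own CAA query, so an error response at any level below the apex matters even when the apex record itself is fine.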


Yeah, you'd need a CAA record for the full name, or at least convince your DNS server to respond with NOERROR for it (maybe by adding a name on a higher level or something).

Well, this may be the same situation as if Let's Encrypt staging were down entirely: There's no guarantee that Staging will have any particular uptime. (For that matter, there's no guarantee that production would have any particular uptime.) Is there something that you specifically need to test in staging, rather than just getting the cert you need in production? Or can you use some other CA (there are quite a few supporting ACME nowadays, though I don't know which of them have publicly-accessible testing environments for if you really don't want a publicly-trusted cert)?


I believe it is perfectly possible to have an NXDOMAIN response to a CAA query.

Consider this DNS setup. The zone apex is example.com, and subdomain.example.com is not delegated.

As you mentioned, there may be only an _acme-challenge.subdomain.example.com TXT record for DNS-01 on the public DNS server, with no records at subdomain.example.com itself. After the DNS-01 challenge finishes, the _acme-challenge.subdomain.example.com record would be deleted.

In this case, a CAA query for subdomain.example.com would return NXDOMAIN, because the _acme-challenge.subdomain.example.com record has already been deleted and there are no subdomain.example.com records.

@_az As such I believe Boulder should treat NXDOMAIN as a normal response too.
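The setup above can be simulated with a toy model of an authoritative server. This is only a sketch under simplifying assumptions: `rcode_for` is a made-up helper, wildcards and delegations are ignored, and the server is assumed to answer NOERROR for empty non-terminals (names that own no records but have descendants that do).

```python
from __future__ import annotations


def rcode_for(qname: str, owner_names: set[str]) -> str:
    """Return the rcode a typical authoritative server would give for qname.

    A name exists if it owns records or is an ancestor of a name that does
    (an empty non-terminal); otherwise the answer is NXDOMAIN.
    """
    qname = qname.rstrip(".").lower()
    for owner in owner_names:
        owner = owner.rstrip(".").lower()
        if owner == qname or owner.endswith("." + qname):
            return "NOERROR"
    return "NXDOMAIN"


zone = {"example.com", "_acme-challenge.subdomain.example.com"}
print(rcode_for("subdomain.example.com", zone))  # NOERROR (empty non-terminal)

# The client cleans up the challenge record after validation.
zone.discard("_acme-challenge.subdomain.example.com")
print(rcode_for("subdomain.example.com", zone))  # NXDOMAIN
```

So the same CAA query for subdomain.example.com flips from NOERROR to NXDOMAIN purely because the TXT record was cleaned up.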


Yes, that's the scenario I was trying to describe earlier: a client deletes the TXT record after the challenge is complete but before the order is completed and CAA gets checked.

It's entirely possible that yes, NXDOMAIN should be treated in this case the same as there simply being no CAA record. I can't find a clear statement in the CAA Specification about it, though. It just talks about what to do if the DNS response is "empty", and I don't know if NXDOMAIN is an "error" instead of being just "empty". (For instance, SERVFAIL is definitely an error that would prohibit issuance even though it also isn't returning any CAA records.) I'm far from an expert on RFC/BR interpretation, though.


The RFC text says

If such an RRset exists, a CA MUST NOT issue a certificate unless...

And it says nothing about the case where there is no CAA record. So non-existence would not block issuance, regardless of whether the response code is NOERROR or NXDOMAIN (both are possible for non-existence).
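Under that reading, a CAA search could be sketched roughly as below. This is my interpretation, not any CA's actual implementation: `find_relevant_rrset`, `CAALookupError`, and the `lookup` callback (which returns an rcode and a list of CAA records) are all hypothetical names, and the records are plain strings for simplicity.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical resolver callback: name -> (rcode, CAA records as strings)
Lookup = Callable[[str], Tuple[str, List[str]]]


class CAALookupError(Exception):
    """Raised when a lookup fails in a way that must block issuance."""


def find_relevant_rrset(fqdn: str, lookup: Lookup) -> Tuple[Optional[str], List[str]]:
    """Climb from fqdn toward the root and return the first CAA RRset found.

    NOERROR with no records and NXDOMAIN are both treated as "no RRset here",
    so the search continues. Any other rcode (e.g. SERVFAIL) is an error.
    """
    labels = fqdn.rstrip(".").split(".")
    for i in range(len(labels)):
        name = ".".join(labels[i:])
        rcode, records = lookup(name)
        if rcode not in ("NOERROR", "NXDOMAIN"):
            raise CAALookupError(f"{name}: {rcode}")
        if records:
            return name, records
    return None, []  # no RRset anywhere: CAA does not restrict issuance


def fake_lookup(name: str) -> Tuple[str, List[str]]:
    # Made-up answers for illustration only.
    answers = {
        "subdomain.example.com": ("NXDOMAIN", []),
        "example.com": ("NOERROR", ['0 issue "letsencrypt.org"']),
    }
    return answers.get(name, ("NOERROR", []))


print(find_relevant_rrset("subdomain.example.com", fake_lookup))
```

With the made-up answers above, the NXDOMAIN at subdomain.example.com is skipped and the RRset at example.com is the one that governs issuance.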


Thank you for the updates, we are investigating.


Looks like this issue is being addressed here:


LE Team, I see the changes were merged a while ago. An ETA on when staging will be updated with the new build would be of great help.

You mean "one hour"?

It would need to be released into production first. Not sure what the current release cadence is, but it used to be once every week or maybe even once every two weeks.


I think the issue is only in staging; this caught the problem before it got deployed to production. (Which is, after all, the main point of the staging environment.)


It's typically around 1 week, but it doesn't appear to be a fully fixed schedule (for example, there has been a two-week gap between this staging build and the previous one). Production is usually one build behind staging.

(I source my data from pulling the deployed build hourly: Let's Encrypt's Boulder version history)
