My domain is: dfx-e645y8-master0.dfx-e645.js7g-3szl.dev.cldr.work
LetsEncrypt staging server
Acme client: acme4j
Team,
For the past ~9 hours, all requests on the LetsEncrypt staging server have been failing during pending challenge validation. Can someone please look into this?
Caused by: org.shredzone.acme4j.exception.AcmeServerException: Unable to update challenge :: authorization must be pending Challenge failed with status INVALID because of: DNS problem: NXDOMAIN looking up CAA for dfx-e645y8-master0.dfx-e645.js7g-3szl.dev.cldr.work - check that a DNS record exists for this domain
Please note the same setup has been working fine on the LetsEncrypt prod server. Also on the LE staging server, this was running fine until a day ago.
Well, while a CAA record is optional, if it doesn't exist then I expect that the DNS server needs to return NOERROR (indicating that the name exists, with zero CAA records) rather than NXDOMAIN (which indicates there really is nothing there).
Can you give more details on your DNS setup? I see NXDOMAIN for dfx-e645y8-master0.dfx-e645.js7g-3szl.dev.cldr.work when querying for A, AAAA, or CAA. Is this name supposed to be resolvable?
Ahh, I think you might have found the cause of the issue. Nice one.
Previously, Boulder would (for the purposes of CAA) consider everything except for SERVFAIL to be a success, and look for CAA RRs in the response. However, the change that was introduced recently would treat NXDOMAIN as an error response, and cause the CAA check to fail.
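To make the difference concrete, here is a minimal sketch of the two policies as I understand them from the description above. This is not Boulder's actual code; the function names `caa_lookup_ok_old` and `caa_lookup_ok_new` are made up for illustration, and the assumption that the new behavior rejects exactly SERVFAIL and NXDOMAIN is mine.

```python
from enum import Enum


class Rcode(Enum):
    """A few DNS response codes, with their standard numeric values."""
    NOERROR = 0
    SERVFAIL = 2
    NXDOMAIN = 3


def caa_lookup_ok_old(rcode: Rcode) -> bool:
    # Hypothetical model of the older behavior: only SERVFAIL fails the lookup.
    return rcode is not Rcode.SERVFAIL


def caa_lookup_ok_new(rcode: Rcode) -> bool:
    # Hypothetical model of the recent change: NXDOMAIN is also an error.
    return rcode not in (Rcode.SERVFAIL, Rcode.NXDOMAIN)


if __name__ == "__main__":
    for rc in Rcode:
        print(rc.name, caa_lookup_ok_old(rc), caa_lookup_ok_new(rc))
```

Under this model the only response code whose outcome changes is NXDOMAIN, which matches the error reported in this thread.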
Maybe there is something happening with cached authorizations where certificates are being re-issued after the original TXT/A/AAAA RRs have been pulled (changing the query result from NOERROR to NXDOMAIN), but the CAA recheck happens anyway, and leads to this error outcome.
The domain should be resolvable to a private address; "e2e-dh-bbqy-master0.e2e-env.s6x8-odui.a0.stg.cldr.work" is an example. The DNS record for the reported domain may have been deleted by now.
I tried adding a CAA record for the base domain to see if that helps, but that also failed with the same error. Not sure if I got that all right though.
Hmm… It may even be normal with a DNS-01 challenge for the client to delete the TXT record after the challenge succeeds but before requesting issuance, at which point many DNS servers might switch to NXDOMAIN by the time CAA is being checked. I'm not sure what exactly CAA requires, but either Let's Encrypt should change to allow for NXDOMAIN when checking CAA, or clients need to ensure that their DNS server still returns NOERROR for CAA until after issuance actually happens (in which case maybe some existing certs are misissued?).
Nobody who actually works for Let's Encrypt has responded yet; we're just random people on the Internet trying to help. It's not clear to me yet whether Let's Encrypt needs to change or your ACME client and/or DNS server software would need to change, in order to properly handle CAA checking.
Yes, in another thread some other people have listed changes relating to CAA checking that Let's Encrypt is working on and that were recently deployed to staging.
I'm just personally not clear on whether (1) there's a bug in the change, preventing issuance when it should be allowing it, or (2) there's a bug in some (possibly many) ACME clients and/or DNS servers that have CAA records returning NXDOMAIN when it should be NOERROR, even though that used to "work" and Let's Encrypt would issue anyway (and still does work in production).
I have tried adding a CAA record to see if that helps, but it fails with the exact same error. Can you see what is wrong with this record? Or is there any other workaround? Blocked for a long time now.
The CAA record is checked at each level of the domain name. And, using https://unboundtest.com, it still shows NXDOMAIN for CAA lookups for the above name. The CAA record for the apex name is fine.
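As a rough illustration of "each level", here is a small sketch that lists the names a CAA check would walk through, starting at the full name and climbing toward the TLD. The helper name `caa_candidate_names` is hypothetical. As I understand RFC 8659, the search stops at the first name that actually has a CAA RRset, so the full name is always queried first.

```python
def caa_candidate_names(fqdn: str) -> list[str]:
    """Return the names to query for CAA, from the full name up to the TLD."""
    labels = fqdn.rstrip(".").lower().split(".")
    # Drop one leading label at a time: a.b.c -> a.b.c, b.c, c
    return [".".join(labels[i:]) for i in range(len(labels))]


if __name__ == "__main__":
    name = "dfx-e645y8-master0.dfx-e645.js7g-3szl.dev.cldr.work"
    for candidate in caa_candidate_names(name):
        print(candidate)
```

Since the first query is for the full name, an error on that query can stop the check before a CAA record higher up is ever reached.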
Yeah, you'd need a CAA record for the full name, or at least convince your DNS server to respond with NOERROR for it (maybe by adding a name on a higher level or something).
Well, this may be the same situation as if Let's Encrypt staging were down entirely: There's no guarantee that Staging will have any particular uptime. (For that matter, there's no guarantee that production would have any particular uptime.) Is there something that you specifically need to test in staging, rather than just getting the cert you need in production? Or can you use some other CA (there are quite a few supporting ACME nowadays, though I don't know which of them have publicly-accessible testing environments for when you really don't want a publicly-trusted cert)?
As you mentioned, the public DNS server may hold only the _acme-challenge.subdomain.example.com TXT record for DNS-01, with no records at subdomain.example.com itself. Once the DNS-01 challenge finishes, _acme-challenge.subdomain.example.com is deleted.
In that case, a CAA query for subdomain.example.com returns NXDOMAIN, because the _acme-challenge.subdomain.example.com record is already gone and subdomain.example.com has no records of its own.
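The scenario can be sketched with a toy model of an authoritative zone. This assumes a server that answers NOERROR for a name that has records itself or has any name beneath it (an "empty non-terminal"), and NXDOMAIN otherwise; not every DNS server behaves this way. The function `rcode_for` and the zone contents are made up for illustration.

```python
def rcode_for(name: str, record_owners: set[str]) -> str:
    """Toy model: NOERROR if the name or any descendant owns records."""
    name = name.rstrip(".").lower()
    for owner in record_owners:
        owner = owner.rstrip(".").lower()
        if owner == name or owner.endswith("." + name):
            return "NOERROR"
    return "NXDOMAIN"


if __name__ == "__main__":
    # Names that own at least one record in the (hypothetical) zone.
    zone = {"example.com", "_acme-challenge.subdomain.example.com"}

    # While the TXT record exists, subdomain.example.com is an empty non-terminal.
    print(rcode_for("subdomain.example.com", zone))  # NOERROR

    # The client cleans up the challenge record before CAA is checked.
    zone.discard("_acme-challenge.subdomain.example.com")
    print(rcode_for("subdomain.example.com", zone))  # NXDOMAIN
```

So the same CAA query can flip from NOERROR to NXDOMAIN purely because of the challenge cleanup, without any change to CAA policy.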
@_az As such, I believe Boulder should treat NXDOMAIN as a normal response too.
Yes, that's the scenario I was trying to describe earlier: a client deletes the TXT record after the challenge is complete but before the order is completed and CAA gets checked.
It's entirely possible that yes, NXDOMAIN should be treated in this case the same as there simply being no CAA record. I can't find a clear statement in the CAA Specification about it, though. It just talks about what to do if the DNS response is "empty", and I don't know if NXDOMAIN is an "error" instead of being just "empty". (For instance, SERVFAIL is definitely an error that would prohibit issuance even though it also isn't returning any CAA records.) I'm far from an expert on RFC/BR interpretation, though.
If such an RRset exists, a CA MUST NOT issue a certificate unless...
And it says nothing about the case where there is no CAA record. So non-existence would not block issuance, whether the response code is NOERROR or NXDOMAIN (both are possible for non-existence).
It would need to be released into production first. Not sure what the current release cadence is, but it used to be once every week or maybe even once every two weeks.