From: e2e-dh-bbqy-master0.e2e-env.s6x8-odui.a0.stg.cldr.work | DNSViz
Any update? Are you working on a fix?
nslookup e2e-env.s6x8-odui.a0.stg.cldr.work ns-431.awsdns-53.com
Server: ns-431.awsdns-53.com
Address: 205.251.193.175
*** UnKnown can't find e2e-env.s6x8-odui.a0.stg.cldr.work: Non-existent domain
Hmm… It may even be normal with a DNS-01 challenge for the client to delete the TXT record after the challenge succeeds but before requesting issuance, so many DNS servers might have switched to NXDOMAIN by the time CAA is checked. I'm not sure what exactly CAA requires, but either Let's Encrypt should change to allow for NXDOMAIN when checking CAA, or clients need to ensure that their DNS server still returns NOERROR for CAA until after issuance actually happens (in which case maybe some existing certs are misissued?).
Nobody who actually works for Let's Encrypt has responded yet; we're just random people on the Internet trying to help. It's not clear to me yet whether Let's Encrypt needs to change or your ACME client and/or DNS server software would need to change, in order to properly handle CAA checking.
Has this changed since yesterday? I ask because there has been no change in the ACME client or the DNS server.
Some other people have listed some changes in another thread that Let's Encrypt is working on relating to CAA checking that were recently deployed to staging, yes.
Testing these sorts of things out is really the purpose of the staging environment.
I'm just personally not clear on whether (1) there's a bug in the change, preventing issuance when it should be allowing it, or (2) there's a bug in some (possibly many) ACME clients and/or DNS servers that have CAA records returning NXDOMAIN when it should be NOERROR, even though that used to "work" and Let's Encrypt would issue anyway (and still does work in production).
I have tried adding a CAA record to see if that helps, but it fails with the exact same error. Can you see what is wrong with this record? Or is there any other workaround? We have been blocked for a long time now.
dig +short dev.cldr.work CAA
0 issue "letsencrypt.org"
The CAA record is checked at each level of the domain name. And https://unboundtest.com still shows NXDOMAIN for CAA lookups for the above name. The CAA record for the apex name is fine.
Yeah, you'd need a CAA record of the full name, or at least convince your DNS server to respond with NOERROR for it (maybe by adding a name on a higher level or something).
Well, this may be the same situation as if Let's Encrypt staging were down entirely: There's no guarantee that Staging will have any particular uptime. (For that matter, there's no guarantee that production would have any particular uptime.) Is there something that you specifically need to test in staging, rather than just getting the cert you need in production? Or can you use some other CA (there are quite a few supporting ACME nowadays, though I don't know which of them have publicly-accessible testing environments for if you really don't want a publicly-trusted cert)?
I believe it is perfectly possible to get an NXDOMAIN response to a CAA query.
Consider this DNS setup. Zone apex is example.com, and subdomain.example.com is not delegated.
As you mentioned, the public DNS server may hold only the _acme-challenge.subdomain.example.com TXT record for DNS-01, with no records at subdomain.example.com itself. After the DNS-01 challenge finishes, _acme-challenge.subdomain.example.com would be deleted.
In this case, a CAA query for subdomain.example.com would return NXDOMAIN, because the _acme-challenge.subdomain.example.com record has already been deleted and there are no records at or below subdomain.example.com.
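To make the scenario above concrete, here is a minimal Python sketch of a toy authoritative zone. It assumes the server follows the usual rule that a name with no records of its own but with records below it (an empty non-terminal) answers NOERROR with an empty answer, while a name with nothing at or below it answers NXDOMAIN. The `lookup` function, the `zone` layout, and the record values are all hypothetical and only for illustration; real DNS servers may behave differently.

```python
NOERROR, NXDOMAIN = "NOERROR", "NXDOMAIN"


def lookup(zone, name, rtype):
    """Return (rcode, records) from a toy zone.

    zone maps owner name -> {record type: [values]} (hypothetical layout).
    """
    name = name.lower().rstrip(".")
    if name in zone:
        # The name exists; the answer may still be empty for this type.
        return NOERROR, list(zone[name].get(rtype, []))
    # Empty non-terminal: nothing at the name itself, but something below it.
    suffix = "." + name
    if any(owner.endswith(suffix) for owner in zone):
        return NOERROR, []
    # Nothing at or below the name.
    return NXDOMAIN, []


# Illustrative records only; the values are made up.
zone = {
    "example.com": {"NS": ["ns1.example.net"]},
    "_acme-challenge.subdomain.example.com": {"TXT": ["challenge-token"]},
}

# While the challenge TXT record exists, the parent is an empty non-terminal.
during_challenge = lookup(zone, "subdomain.example.com", "CAA")

# The client cleans up the TXT record before the order is finalized.
del zone["_acme-challenge.subdomain.example.com"]
after_cleanup = lookup(zone, "subdomain.example.com", "CAA")

print(during_challenge)  # ('NOERROR', [])
print(after_cleanup)     # ('NXDOMAIN', [])
```

In both cases the CAA answer is empty; only the response code changes once the TXT record is gone.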
@_az As such, I believe Boulder should treat NXDOMAIN as a normal response too.
Yes, that's the scenario I was trying to say earlier, if a client deletes the TXT record after the challenge is complete but before the order is completed and CAA gets checked.
It's entirely possible that yes, NXDOMAIN should be treated in this case the same as there simply being no CAA record. I can't find a clear statement in the CAA Specification about it, though. It just talks about what to do if the DNS response is "empty", and I don't know if NXDOMAIN is an "error" instead of being just "empty". (For instance, SERVFAIL is definitely an error that would prohibit issuance even though it also isn't returning any CAA records.) I'm far from an expert on RFC/BR interpretation, though.
The RFC text says
If such an RRset exists, a CA MUST NOT issue a certificate unless...
And it says nothing about the case where there is no CAA record. So non-existence would not block issuance, regardless of whether the response code is NOERROR or NXDOMAIN (both are possible for non-existence).
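As a rough illustration of that reading, here is a simplified Python sketch of a CAA check that climbs from the full name toward the apex and treats NXDOMAIN the same as an empty NOERROR answer, while any other response code (such as SERVFAIL) blocks issuance. This is not Boulder's code. `caa_allows` and the `query` callback are hypothetical names, and the sketch ignores `issuewild`, critical flags, issuer parameters, and CNAME handling.

```python
def caa_allows(fqdn, issuer, query):
    """Return True if issuance by `issuer` looks permitted for `fqdn`.

    `query(name)` is a hypothetical resolver hook that returns
    (rcode, [(tag, value), ...]) for the CAA records at `name`.
    """
    labels = fqdn.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        name = ".".join(labels[i:])
        rcode, records = query(name)
        if rcode not in ("NOERROR", "NXDOMAIN"):
            # A failed lookup (e.g. SERVFAIL) is an error, not "no records".
            return False
        if records:
            # The first non-empty CAA set found while climbing decides.
            issue_values = [value for tag, value in records if tag == "issue"]
            if not issue_values:
                # No "issue" property present: this tag does not restrict issuance.
                return True
            return issuer in issue_values
    # No CAA records at any level: nothing restricts issuance.
    return True


def fake_query(name):
    # Made-up answers for illustration: CAA exists only at the apex.
    answers = {"example.com": ("NOERROR", [("issue", "letsencrypt.org")])}
    return answers.get(name, ("NXDOMAIN", []))


def failing_query(name):
    # Simulates a resolver failure at every level.
    return ("SERVFAIL", [])


print(caa_allows("subdomain.example.com", "letsencrypt.org", fake_query))     # True
print(caa_allows("subdomain.example.com", "other-ca.example", fake_query))    # False
print(caa_allows("subdomain.example.com", "letsencrypt.org", failing_query))  # False
```

Under this interpretation, the NXDOMAIN at subdomain.example.com simply moves the check up to the apex record.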
Thank you for the updates, we are investigating.
Looks like this issue is being addressed here:
LE Team, I see the changes were merged a while ago. An ETA on when staging will be updated with that build would be a great help.
You mean "one hour"?
It would need to be released into production first. Not sure what the current release cadence is, but it used to be once every week or maybe even once every two weeks.
I think the issue is only in staging; this caught the problem before it got deployed to production. (Which is, after all, the main point of the staging environment.)
It's typically around 1 week, but it doesn't appear to be a fully fixed schedule (for example, there has been a two week gap between this staging build and the previous one). Production is usually one build behind staging.
(I source my data from pulling the deployed build hourly: Let's Encrypts Boulder version history)
We have merged the fix that Osiris linked above, and tagged a hotfix release which includes that fix. It should go to Staging soonish, and the current version which is exhibiting this broken behavior in Staging will not go to Prod.
In general, we release once weekly -- to Staging on Tuesdays, and that same version to Prod on Thursdays -- but we do not make any external commitments to that release cadence, and will regularly release more or less frequently than that depending on various circumstances.
The hotfix to staging went out about an hour ago. Seems like the errors have died down in our logs.