We have been using LE certs for a while and they're great. We use Azure DNS and the KeyVault Acmebot to issue certs. We have one main zone (company.com) and several subdomains (sub1.company.com, sub2.company.com etc.) that are delegated to other name servers, which are also in Azure DNS. Whenever a cert is issued or renewed, the DNS challenge record has been created in the main zone, even if it is for a subdomain that has been delegated. This worked fine for over a year, but started failing this January. I believe that LE has always honored DNS NS delegations when looking for the challenge resource record, and that is how it should be. But how can this ever have worked, and what can possibly account for the change in behavior this January?
Not looking to make it work "the old way", but would like to understand what has happened. Thanks!
I'm a bit confused about which scenario you're saying has changed. Are you talking about which Azure DNS zone your Key Vault Acmebot is writing the records to? Or are you saying that something about how Let's Encrypt's validation reads your DNS has changed in some way?
I'm pretty sure nothing would have changed on the Let's Encrypt side, but possibly something in Azure or with your client might have. I'm not familiar with Key Vault Acmebot at all, but looking at the GitHub repo you linked, it looks like they release new versions somewhat regularly. Do you know what version you are on, and whether you were using a different version back when it worked "differently"?
Nothing has changed in our setup. We are still writing our challenge records to the same zone in Azure DNS. Nor do I think that Let's Encrypt has changed their way of looking for the challenge record. Based on my reading of this forum it seems to honor regular DNS setups. Given those two things I just mentioned I really can't say what has changed, and that is why I posted this question; to see if anyone had an idea.
We update our KV Acmebot regularly and are running on the latest version. But I think that is not affecting this. It's LE that decides how to look for the DNS challenges, not our bot.
My only theory as to what might have changed is that, since our delegated zones are in Azure, Azure might have decided to make other NSs authoritative for our delegated zones. Since they control the hierarchy (both the parent zone and the delegated zone), this would be a "safe" change as it would not require any external NS updates. But I can't find any mention of such a change being made on the Azure side. So the mystery remains...
Then I'm still very unclear on what exactly it is that has "changed". Maybe you want to add some screenshots or other clarification on what you're looking at, and what it used to be. Or maybe your description is clear to people who have used Azure DNS and KeyVault Acmebot and I'm just unfamiliar so I should stop talking.
The de facto standard test for DNS challenges is to try resolving the record using https://unboundtest.com/ which takes a somewhat similar approach to queries as Let's Encrypt validation does. So the first test is to check that it works using that. Other tests include https://letsdebug.net and https://dnsviz.net/. Another thing to look out for is DNSSEC config gone awry.
If you could share real domains we could possibly help you figure it out, at least why it's failing (not necessarily why it changed).
I doubt anything has changed with regard to the way DNS is resolved.
It starts at the root "." and works its way to the FQDN, asking each step along the way.
(as in the case you describe) When a step prior to the FQDN returns a valid authoritative answer, that is what is used [and the search ceases].
What is the TXT for .com.example.sub1.serverA?
Ask: "." What is the TXT for .com.example.sub1.serverA?
Answer: "see .com nameservers"
Ask: ".com" What is the TXT for .com.example.sub1.serverA?
Answer: "see .com.example nameservers"
Ask: ".com.example" What is the TXT for .com.example.sub1.serverA?
[search stops - even though .com.example may have delegated .com.example.sub1 to other nameservers]
This happens because:
That "main zone" is authoritative for such a response.
If this has started to fail, you need to review your specific authoritative DNS tree path.
And the placement/existence of those TXT records.
@rg305 & @webprofusion Thanks for this very useful information. We are seeing a different behavior in our setup.
Again, the TXT record is created in the parent zone (company.com) and the subdomain the record is in is delegated (sub.company.com). If I test with https://unboundtest.com/ it does not find the record in the parent zone, or it considers it to be non-authoritative. It was my understanding that if you did DNS delegation then a DNS lookup would ignore your records for a delegated subdomain. Another set of NSs is responsible for that part of the namespace now, not you. So until you remove the delegation, your records in those delegated subdomains will be ignored.
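To make the "another set of NSs is responsible for that part of the namespace" point concrete, here is a minimal Python sketch that picks the zone a challenge record would need to live in, by choosing the most specific zone whose name is a suffix of the record name. The function names and the zone list are made up for illustration, and the sketch assumes every listed zone is actually delegated from its parent; it is not how Key Vault Acmebot or Let's Encrypt is implemented.

```python
def challenge_name(domain):
    """Build the DNS-01 record name for a certificate name (wildcard prefix stripped)."""
    if domain.startswith("*."):
        domain = domain[2:]
    return "_acme-challenge." + domain.rstrip(".").lower()


def authoritative_zone(record_name, zones):
    """Return the most specific zone that contains record_name, or None."""
    labels = record_name.rstrip(".").lower().split(".")
    best = None
    best_len = 0
    for zone in zones:
        zone_labels = zone.rstrip(".").lower().split(".")
        # The zone matches when its labels are the trailing labels of the name.
        if labels[-len(zone_labels):] == zone_labels and len(zone_labels) > best_len:
            best, best_len = zone, len(zone_labels)
    return best


# Hypothetical zone list mirroring the setup described in this thread.
zones = ["company.com", "sub1.company.com", "sub2.company.com"]

for cert_name in ["www.company.com", "www.sub1.company.com", "*.sub2.company.com"]:
    record = challenge_name(cert_name)
    print(record, "->", authoritative_zone(record, zones))
```

Under those assumptions, a challenge for anything under sub1.company.com belongs in the sub1.company.com zone, not in company.com.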
Is it possible any of these sub-zones were previously private/internal zones rather than public zones? Or perhaps the zones were created without the delegations in place for a while?
I've just tried to replicate this setup in my own Azure environment and the only way it works is if the NS record delegation to the sub doesn't exist. There's an option during sub-zone creation to auto-link it to a parent zone that can be skipped. So the child zone would still exist and you can query it directly if you know the nameservers, but the wider Internet doesn't know it exists.
These are the relevant records in the "root" (for our purposes) zone, paaz.xyto.cc.
There are two sub-zones that exist, sublink.paaz.xyto.cc and subnolink.paaz.xyto.cc. As you can see, only sublink has an NS record in the root and I've created a test TXT record that would theoretically live in each sub. Here are the records in each sub.
Each sub also has a test TXT record with a unique answer to make it easier to see which zone the query responses are coming from. Here are standard recursive queries from my local resolver that show you only get the answer from the root for the non-linked sub.
>dig +noall +answer test.sublink.paaz.xyto.cc txt
test.sublink.paaz.xyto.cc. 3027 IN TXT "\"answer from sublink\""
>dig +noall +answer test.subnolink.paaz.xyto.cc txt
test.subnolink.paaz.xyto.cc. 3018 IN TXT "\"answer from root\""
But you can still get the root answer for the linked sub if you query the root's nameservers directly as well as the unlinked sub answer if you query its nameservers. It's just that normal recursive resolvers would never do that.
>dig +noall +answer test.sublink.paaz.xyto.cc txt @ns1-07.azure-dns.com
test.sublink.paaz.xyto.cc. 3600 IN TXT "\"answer from root\""
>dig +noall +answer test.subnolink.paaz.xyto.cc txt @ns1-06.azure-dns.com
test.subnolink.paaz.xyto.cc. 3600 IN TXT "\"answer from subnolink\""
In any case, how it is working now is how it is supposed to work. If you've got sub-zones properly delegated to different nameservers, the records you want to query need to live in the sub-zone, not as dotted records in the parent zone.
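The test above can be mirrored with a toy model. This is only a sketch: the zone contents copy the records described in this post, the dictionary layout and function names are invented for illustration, and the "direct query" behavior (the parent still serving its dotted record) reflects what the dig output above showed for Azure DNS rather than something every DNS server is guaranteed to do.

```python
# Toy zone data copied from the test setup described above.
ZONES = {
    "paaz.xyto.cc": {
        # Delegations held by this zone: delegated name -> child zone.
        "ns": {"sublink.paaz.xyto.cc": "sublink.paaz.xyto.cc"},
        "txt": {
            "test.sublink.paaz.xyto.cc": "answer from root",
            "test.subnolink.paaz.xyto.cc": "answer from root",
        },
    },
    "sublink.paaz.xyto.cc": {
        "ns": {},
        "txt": {"test.sublink.paaz.xyto.cc": "answer from sublink"},
    },
    "subnolink.paaz.xyto.cc": {
        "ns": {},
        "txt": {"test.subnolink.paaz.xyto.cc": "answer from subnolink"},
    },
}


def is_at_or_below(name, owner):
    """True when name equals owner or is a descendant of it."""
    return name == owner or name.endswith("." + owner)


def query_direct(zone, name):
    """Ask one zone's servers directly, ignoring delegations (like dig @server)."""
    return ZONES[zone]["txt"].get(name)


def resolve(name, start_zone="paaz.xyto.cc"):
    """Follow NS delegations downward the way a recursive resolver would."""
    zone = start_zone
    while True:
        cuts = [cut for cut in ZONES[zone]["ns"] if is_at_or_below(name, cut)]
        if not cuts:
            return query_direct(zone, name)
        # Hand the query to the child zone named by the most specific delegation.
        zone = ZONES[zone]["ns"][max(cuts, key=len)]


print(resolve("test.sublink.paaz.xyto.cc"))
print(resolve("test.subnolink.paaz.xyto.cc"))
print(query_direct("paaz.xyto.cc", "test.sublink.paaz.xyto.cc"))
print(query_direct("subnolink.paaz.xyto.cc", "test.subnolink.paaz.xyto.cc"))
```

The unlinked sub-zone never gets consulted by `resolve` because nothing in the parent points at it, which is the same reason the wider Internet can't see it.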
No, it has been this way for quite a while and we have successfully issued multiple certificates for delegated zones in this setup. It was only this January that this started to break down. In other words; none of our zones are "unlinked".
I agree; I am just trying to understand why it has suddenly broken.
It was my impression that you could either use GLUE or NS records to delegate, not both, and that GLUEing and DELEGATING are two different things. I might be wrong, so I appreciate the education.
My test with unboundtest.com seems to verify that a delegation is in fact giving the responsibility for a delegated zone to a different set of NSs, but again I might be reading the output wrong. It does not sound right to me if a parent zone can delegate away a subzone but still be able to create RRs for that subzone. I pity the admin of the subzone who will have no idea that the parent might have started creating records that might override his own...
The testing done by @rmbolger also seems to confirm that this is how delegation works.
Ok, granted in the case of GLUE, but can you then help me understand why @rmbolger's testing and my own test with unboundtest.com always honor the delegation path and ignore RRs in the parent zone when they belong to a delegated subzone?
Having both a delegation for a name AND any other type of record at or below that name within the same zone creates a conflict.
How that conflict is handled, I guess, depends on the DNS server software in use that allowed it.
Yes, it should not even be allowed.
How should a DNS server handle something that isn't supposed to be allowed???
That is anyone's guess [OR tests].
Here is the first of such testing...
In Windows DNS: [The conflict is not even allowed to happen]
When trying to create the TXT record after the sub domain has been delegated: Not Allowed.
When trying to create the delegated sub domain after the TXT record: Not Allowed.
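A server that refuses the conflict in both directions could do so with a check along these lines. This is a rough Python sketch of the idea only, not Windows DNS's actual logic; the class and exception names are made up, and exempting A/AAAA records (so glue can still exist below a delegation) is an assumption of the sketch.

```python
GLUE_TYPES = {"A", "AAAA"}  # assumption: address records may sit below a cut as glue


class ZoneConflict(Exception):
    """Raised when a record and a delegation would overlap."""


def at_or_below(name, owner):
    """True when name equals owner or is a descendant of it."""
    return name == owner or name.endswith("." + owner)


class Zone:
    def __init__(self, origin):
        self.origin = origin
        self.delegations = set()   # names delegated away with NS records
        self.records = {}          # (name, type) -> value

    def add_record(self, name, rtype, value):
        """Reject non-glue records that fall under an existing delegation."""
        if rtype not in GLUE_TYPES:
            for cut in self.delegations:
                if at_or_below(name, cut):
                    raise ZoneConflict(f"{name} is below delegated name {cut}")
        self.records[(name, rtype)] = value

    def add_delegation(self, name):
        """Reject a delegation that would cover existing non-glue records."""
        for rname, rtype in self.records:
            if rtype not in GLUE_TYPES and at_or_below(rname, name):
                raise ZoneConflict(f"{rname} ({rtype}) already exists under {name}")
        self.delegations.add(name)


zone = Zone("company.com")
zone.add_delegation("sub1.company.com")
try:
    zone.add_record("_acme-challenge.sub1.company.com", "TXT", "token")
except ZoneConflict as err:
    print("Not allowed:", err)
```

Azure DNS evidently performs no such check, which is why the dotted TXT record could be created in the parent in the first place.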
My understanding is that glue records are just normal A (or AAAA) records that live in the parent zone for records in a delegated sub-zone. Though how they are specifically implemented may vary between DNS server implementations. They get sent in the "Additional" field of a DNS response (as opposed to "Answer" or "Authority") when necessary and exist to solve the catch-22 problem of delegations referring to nameservers whose records live within the zone they're delegating to.
More detailed explanation in this serverfault question.
In any case, delegations are always just NS records. But depending on what the NS records point to, you might also need glue records in the parent zone. Usually there are no glue records in cloud hosted DNS stuff because you're pointing to the cloud host's DNS servers, which don't live in your domains. Though some providers do offer DNS nameserver white labeling that aliases their own nameservers with names from your domain.
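Here is a small Python sketch of when a parent would attach glue to a referral. The response layout (a dict with answer/authority/additional lists), the function name, and the example names and address are simplifications made up for illustration; real servers build wire-format messages and have their own rules about which addresses to include.

```python
def build_referral(child_zone, nameservers, addresses):
    """Build a simplified referral for child_zone.

    nameservers: NS target names for the delegation.
    addresses:   known A records (name -> IPv4 string) held by the parent.
    """
    authority = [(child_zone, "NS", ns) for ns in nameservers]
    additional = []
    for ns in nameservers:
        # Glue is only needed when the nameserver's own name lives inside
        # the zone being delegated (the catch-22 case).
        in_child = ns == child_zone or ns.endswith("." + child_zone)
        if in_child and ns in addresses:
            additional.append((ns, "A", addresses[ns]))
    # A referral carries no answer records, only authority (and maybe glue).
    return {"answer": [], "authority": authority, "additional": additional}


# Nameserver inside the delegated zone: glue is required.
self_hosted = build_referral(
    "sub1.company.com",
    ["ns1.sub1.company.com"],
    {"ns1.sub1.company.com": "192.0.2.53"},  # documentation-range address
)
print(self_hosted["additional"])

# Nameserver in the provider's own domain: no glue needed.
cloud_hosted = build_referral("sub1.company.com", ["ns1-07.azure-dns.com"], {})
print(cloud_hosted["additional"])
```

The second case is the usual cloud-hosted situation described above: the delegation is nothing more than NS records in the parent.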
As an aside, what failure do you get from LE? Could it be the client testing DNS itself that fails or is it really LE that's failing?
You could try the same certificate request using something like Certify The Web (which I develop) to see if you get the same failure. That app also has the benefit of allowing multiple challenge response configs, so you could for instance build a cert made up of several different DNS zones, each with their own API credentials, if you wanted to. It will also deploy to Key Vault.