I’m currently implementing a Let’s Encrypt setup, and so far everything is going smoothly. We currently support HTTPS for around 200 custom domains. Everything validates perfectly using the http-01 challenge, thanks to a CNAME record on our customers’ domains.
However, of all the domains there’s one single domain that I simply cannot validate: brochure.kia.com.
It keeps returning with a “SERVFAIL looking up A for brochure.kia.com” error and I cannot understand why.
LetsDebug.net consistently gives me the error, but if I look up the A record for this domain using any other tool out there, even unboundtest.com, I simply do not get any errors, and everything seems to resolve as normal.
Any help is greatly appreciated - at this point I don’t know what to tell the customer about what the problem with their domain is.
Seems like Let’s Encrypt’s resolvers have a dislike for ns.hyundai-motor.com. and ns1.hyundai-motor.com..
They can resolve the CNAME target to viewer.ipaper.io. just fine, but pointing any domain at those nameservers (I tried kia.plugindev.ga.) leads to a query timeout or SERVFAIL in the production and staging environments, respectively.
Since it affects both environments and they are on completely different networks, I would guess that it is more likely to be a DNS protocol thing than it is a networking thing, but I can’t actually identify what’s triggering it.
My theory is that Let’s Encrypt’s resolvers may be picking up those last two NS records and bailing on the entire lookup. It also goes some way to explaining the flapping between query timeout and SERVFAIL.
As for why unboundtest and Let’s Debug don’t produce the exact same error - they are both configured as “one-shot” resolvers. The resolver only exists for the few seconds that the lookup takes. Let’s Encrypt’s actual environment is probably using persistent resolvers that, while configured with identical parameters, exhibit different lookup behavior.
Based on this, would you consider this a scenario that the Let’s Encrypt resolvers handle incorrectly? Or could there be some misconfiguration on the customer’s end? Based on my limited DNS knowledge, having the private IPs there should be okay. Besides CAA records (which are not present), could the customer alternatively be actively blocking these lookups from Let’s Encrypt?
Having NS records pointing at private IPs is wrong in multiple ways, but I would have assumed that resolvers would typically ignore them without issue, yeah.
Unbound will sometimes try to resolve them as names, which wastes bandwidth and time, but shouldn't usually result in a failure...
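For anyone who wants to check a delegation for this kind of record, here is a minimal sketch using only Python's standard library. It assumes the problem records look like IP literals in the NS target field (which is why a resolver might try to treat them as names). The function name and the sample targets are hypothetical placeholders, not the actual records from the affected zone.

```python
import ipaddress


def classify_ns_target(target):
    """Classify an NS record target as 'hostname', 'public-ip' or 'private-ip'.

    An NS target is supposed to be a hostname. If it parses as an IP
    address instead, the record is malformed; if that address is also in
    a private range, no public resolver could reach it anyway.
    """
    # NS targets are usually written as FQDNs with a trailing dot.
    candidate = target.rstrip(".")
    try:
        addr = ipaddress.ip_address(candidate)
    except ValueError:
        # Not an IP literal, so it is (at least syntactically) a name.
        return "hostname"
    return "private-ip" if addr.is_private else "public-ip"


# Hypothetical sample targets, for illustration only.
sample_targets = ["ns.example.com.", "10.0.0.1.", "192.168.1.53.", "8.8.8.8."]
for t in sample_targets:
    print(t, "->", classify_ns_target(t))
```

Anything that comes back as "private-ip" or "public-ip" is worth mentioning to whoever runs the zone, since a resolver has to either skip it or waste time trying to look it up as a name.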
It's certainly possible. There could be an Internet routing issue, or they could've actively blocked Let's Encrypt's IPs for some reason.
Not sure, the lack of external reproducibility makes it a bit foggy. We'd have to ask the staff/ops people to look into it. But doing an elimination test by masking/removing those two records could potentially be a much quicker solution.
Edit: for what it's worth, I tried setting up a couple of domains in a similar scenario (valid glue records but phony authoritative NS records) and couldn't reproduce the failure. So it might be a network timeout after all ...
@mnordhoff Thanks for your input. I’m afraid I probably cannot ask the customer to change this setup, but it would definitely be worth mentioning to them when I reach out.
Thanks a lot for testing that out! So it would actually seem that Let's Encrypt's resolvers handle the general scenario correctly and that the issue most likely lies with the customer's actual nameserver, or at the network level causing some routing issue as @mnordhoff mentioned.
I will contact the customer, forwarding all this information, as it would seem I don't have any other options at this point. Thanks for helping out.