DNS problem: SERVFAIL looking up A for brochure.kia.com

Carsten-iPaper · February 25, 2019, 12:27pm

I’m currently in the process of implementing a Let’s Encrypt setup, and so far everything is going smoothly. We currently support HTTPS for around 200 custom domains. Everything is validating perfectly using the http-01 challenge, thanks to a CNAME record on each of our customers’ domains.

However, of all the domains there’s one single domain that I simply cannot validate: brochure.kia.com.

It keeps returning with a “SERVFAIL looking up A for brochure.kia.com” error and I cannot understand why.

LetsDebug.net consistently gives me the error, but if I look up the A record for this domain using any other tool out there, even unboundtest.com, I do not get any errors, and everything seems to resolve as normal.

Any help is greatly appreciated. At this point I don’t know what to tell the customer about what is wrong with their domain.

https://letsdebug.net/brochure.kia.com/24874?debug=y
https://unboundtest.com/m/A/brochure.kia.com/NZZEE7JZ

Thank you in advance

_az · February 25, 2019, 12:57pm

Seems like Let’s Encrypt’s resolvers have a dislike for ns.hyundai-motor.com. and ns1.hyundai-motor.com..

They can resolve the CNAME target to viewer.ipaper.io. just fine, but pointing any domain at those nameservers (I tried kia.plugindev.ga.) leads to a query timeout and a SERVFAIL in the production and staging environments, respectively.

Since it affects both environments and they are on completely different networks, I would guess that it is more likely to be a DNS protocol thing than it is a networking thing, but I can’t actually identify what’s triggering it.

_az · February 25, 2019, 1:29pm

OK, I think this might be it:

$ dig +noall +answer @ns.hyundai-motor.com hyundai-motor.com ns
hyundai-motor.com.      600     IN      NS      ns.hyundai-motor.com.
hyundai-motor.com.      600     IN      NS      ns1.hyundai-motor.com.
hyundai-motor.com.      600     IN      NS      10.10.111.50.
hyundai-motor.com.      600     IN      NS      10.10.111.1.

Discovered with the help of https://ednscomp.isc.org/ednscomp/5f481c5361 which was randomly listing those 10/8 addresses.

My theory is that Let’s Encrypt’s resolvers may be picking up those last two NS records and bailing out of the entire lookup. It also goes some way towards explaining the flapping between query timeout and SERVFAIL.

As for why unboundtest and Let’s Debug don’t produce the exact same error: they are both configured as “one-shot” resolvers, where the resolver only exists for the few seconds that the lookup takes. Let’s Encrypt’s actual environment is probably using persistent resolvers that, while configured with identical parameters, behave differently during lookups.
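
For anyone who wants to check their own zone for the same thing, here is a minimal Python sketch. It only parses the text of the dig answer shown above and flags NS targets that are really IP addresses rather than hostnames. The function names ns_targets and ip_like_targets are made up for this illustration, and the parser assumes the five-column layout that dig prints with +noall +answer. It does not perform any DNS queries itself.

```python
import ipaddress

# NS answer lines copied from the dig output earlier in this post.
DIG_ANSWER = """\
hyundai-motor.com.      600     IN      NS      ns.hyundai-motor.com.
hyundai-motor.com.      600     IN      NS      ns1.hyundai-motor.com.
hyundai-motor.com.      600     IN      NS      10.10.111.50.
hyundai-motor.com.      600     IN      NS      10.10.111.1.
"""


def ns_targets(answer_text):
    """Return the target (last field) of every NS line in dig answer text."""
    targets = []
    for line in answer_text.splitlines():
        fields = line.split()
        # Assumed layout: owner, TTL, class, type, target
        if len(fields) == 5 and fields[3].upper() == "NS":
            targets.append(fields[4])
    return targets


def ip_like_targets(targets):
    """Return (target, is_private) for NS targets that parse as IP addresses."""
    flagged = []
    for target in targets:
        try:
            # Strip the trailing root dot before trying to parse as an IP.
            addr = ipaddress.ip_address(target.rstrip("."))
        except ValueError:
            continue  # an ordinary hostname, nothing to flag
        flagged.append((target, addr.is_private))
    return flagged


if __name__ == "__main__":
    for target, is_private in ip_like_targets(ns_targets(DIG_ANSWER)):
        kind = "private" if is_private else "public"
        print(f"NS target {target} looks like a {kind} IP address, not a hostname")
```

Run against the answer above, this flags only the last two records. It says nothing about how any particular resolver reacts to them, which is still the open question here.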

Carsten-iPaper · February 25, 2019, 3:36pm

Hi @_az

Thank you so much for your debugging efforts.

Based on this, would you consider this a scenario that the Let’s Encrypt resolvers handle incorrectly? Or could there be some misconfiguration on the customer’s end? Based on my limited DNS knowledge, having the private IPs there should be okay. Besides CAA records (which are not present), could the customer alternatively be actively blocking these lookups from Let’s Encrypt?

Your advice is greatly appreciated!

mnordhoff · February 25, 2019, 3:48pm

Having NS records pointing at private IPs is wrong in multiple ways, but I would have assumed that resolvers would typically ignore them without issue, yeah.

Unbound will sometimes try to resolve them as names, which wastes bandwidth and time, but shouldn't usually result in a failure...

It's certainly possible that they are blocking the lookups. There could be an Internet routing issue, or they could've actively blocked Let's Encrypt's IPs for some reason.

_az · February 25, 2019, 8:43pm

Not sure, the lack of external reproducibility makes it a bit foggy. We'd have to ask the staff/ops people to look into it. But doing an elimination test by masking/removing those two records could potentially be a much quicker solution.

Edit: for what it's worth, I tried setting up a couple of domains in a similar scenario (valid glue records but phony authoritative NS records) and couldn't reproduce the failure. So it might be a network timeout after all ...

Carsten-iPaper · February 27, 2019, 11:58am

@mnordhoff Thanks for your input. I’m afraid I probably cannot ask the customer to change this setup, but it would definitely be worth mentioning to them when I reach out.

Carsten-iPaper · February 27, 2019, 12:02pm

Thanks a lot for testing that out! So it would actually seem that Let's Encrypt's resolvers handle the general scenario correctly and that the issue most likely lies with the customer's actual nameservers, or at the network level, causing some routing issue as @mnordhoff mentioned.

I will contact the customer and forward all this information, as it would seem I don't have any other options at this point. Thanks for helping out.
