DNS problem: query timed out looking up TXT

petercooperjr · September 14, 2023, 1:46pm

If you look at the debug log from the production resolver that J.C. posted (which I know is hard to understand), I think it shows Unbound connecting to all of the name servers. There are good-sized gaps in the timestamps too, which I think show it waiting a while for responses before eventually timing out. But it might just be that it didn't see any responses with the same case as the query it sent, so it started its fallback process. The line at the end, "Capsforid fallback: getting different replies, failed", makes me think that it's giving up because it's getting different responses from different servers, and since none of them are echoing the case of the query, it thinks an attacker might be trying to inject a different response.
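To make the case-echo idea concrete, here is a rough sketch in Python. It is a simplified illustration, not Unbound's actual code: the function names and the example domain are made up, and the rule that all fallback answers must be identical is an assumption inferred from the log message.

```python
import random
from typing import FrozenSet, Sequence


def randomize_case(qname: str, rng: random.Random) -> str:
    """Randomly upper- or lower-case each ASCII letter of the query name."""
    return "".join(
        (ch.upper() if rng.random() < 0.5 else ch.lower())
        if ch.isascii() and ch.isalpha()
        else ch
        for ch in qname
    )


def reply_echoes_case(sent_qname: str, echoed_qname: str) -> bool:
    """A reply passes only if the question name comes back with identical case."""
    return sent_qname == echoed_qname


def fallback_agrees(answers: Sequence[FrozenSet[str]]) -> bool:
    """Assumed fallback rule: every server queried must return the same answer set."""
    return len(answers) > 0 and all(a == answers[0] for a in answers)


# Hypothetical example name, not one of the domains from this thread.
sent = randomize_case("_acme-challenge.example.spb.ru", random.Random())
print(sent, reply_echoes_case(sent, sent.lower()))
```

If none of the servers echo the case and their answers also differ from each other, both checks fail, which would match the "getting different replies, failed" message.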

Also, the fact that the whole thing takes over a minute makes me think that the validation server may be giving up (giving the timeout error) before the whole process is done running (even if it would complete with an error). That is, it's not one particular server taking a long time, it's the entire process that's going over a limit.

letsuser · September 14, 2023, 2:16pm

I parsed the unboundtest.com output (https://unboundtest.com/m/TXT/_acme-challenge.abisoft.spb.ru/PUXEEOBD) and, from an AWS instance in the Virginia region, queried every IP address (IPv4 and IPv6) mentioned in it. All of them replied without any delay.

The whole process (from unboundtest) took 30 seconds, which should not be that much.

The last line in the unboundtest report is not that clear to me:
Error running query: read udp 127.0.0.1:59674->127.0.0.1:1053: i/o timeout

It does not look like there's an issue with resolving. It looks like something else.

jcjones · September 14, 2023, 3:45pm

That's indicating that the DNS client didn't get a response from Unbound in time; Unbound there is running on port 1053.
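A minimal sketch of what that kind of timeout looks like from the client side, in Python rather than the Go that produced the log line. The helper name is hypothetical; only the 127.0.0.1:1053 address comes from the error message above.

```python
import socket
from typing import Tuple


def query_local_resolver(payload: bytes, server: Tuple[str, int], timeout_s: float) -> bytes:
    """Send one UDP datagram and wait up to timeout_s seconds for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        sock.sendto(payload, server)
        # Raises socket.timeout if the resolver has not answered in time,
        # even if the resolver is still busy working on the query.
        data, _addr = sock.recvfrom(4096)
        return data


# Usage (commented out because it needs a resolver listening locally):
# reply = query_local_resolver(dns_query_bytes, ("127.0.0.1", 1053), 30.0)
```

The point is that the error is about the hop between the DNS client and the local Unbound, not about any single authoritative server being unreachable.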

letsuser · September 14, 2023, 4:10pm

Right. It took 30 seconds for unbound to perform all the checks. Do you think it's not enough for the client?

jsha · September 14, 2023, 6:03pm

@petercooperjr I confirmed that unboundtest is still running the same version of Unbound as prod, and all config changes are reflected in unboundtest's config.

@letsuser The DNS client part of unboundtest sets a timeout of 30 seconds. That's actually more generous than the DNS client -> Unbound timeout in prod, which is 10 seconds. Still, in this case it seems like it's not actually a matter of servers being slow, but rather servers giving bad capsforid answers and triggering repeated fallback.

letsuser · September 14, 2023, 6:48pm

Gotcha! Thanks, @jsha. Don't you think the fallback currently works kinda weird if it can't fall back within a reasonable time? How should that be handled if the remote nameservers don't support capsforid?

NikiN2 · September 15, 2023, 10:56am

Exactly the same problem with a domain in the spb.ru zone. According to statistics, there are more than 29,000 domains in spb.ru, so there will probably be a lot of reports about this soon.

LexBart · September 15, 2023, 11:17am

Yes, I also have the same problem with spb.ru domains. I would like to understand the reason. As a last resort, you can switch to Cloudflare, but I would rather not.

petercooperjr · September 15, 2023, 12:23pm

I want to make sure I'm understanding this: The Let's Encrypt validation system gives up (and returns a timeout error to the ACME client) after 10 seconds, but the log J.C. posted earlier showed the production Unbound in fact taking 77 seconds. But even if the timeout was longer, presumably the failure at the end of the Unbound log would mean that it would return an error (SERVFAIL?) anyway?

I think that the issue is actually with the .ru servers, not anything specific to spb.ru. (Though again, I'm doing a lot of guessing.) Does anyone know if other names under .ru are working? Has anyone managed to try another CA to see if it works (or at least gives a different error message)?

Serg1900 · September 15, 2023, 12:34pm

'.ru' - ok.
'.spb.ru' - same problem

ddzhan · September 15, 2023, 1:00pm

'.msk.ru' - same problem

DNS problem: query timed out looking up TXT for _acme-challenge.harvia.msk.ru

ddzhan · September 15, 2023, 1:04pm

DNS problem: query timed out looking up A for school153.spb.ru; DNS problem: query timed out looking up AAAA for school153.spb.ru

DNS problem: query timed out looking up CAA for school153.spb.ru

During secondary validation: DNS problem: query timed out looking up A for school153.spb.ru; DNS problem: query timed out looking up AAAA for school153.spb.ru

DNS problem: looking up CAA for spb.ru: DNSSEC: Bogus

All of these errors are from the same site on different days.

NikiN2 · September 15, 2023, 1:57pm

Tested:
.ru - works
.spb.ru - doesn't work

petercooperjr · September 15, 2023, 2:02pm

Can you give a more specific example of a domain which is under .ru (and not under .spb.ru), which got a certificate from Let's Encrypt in the last week or so? I'd love to try to dig into what the difference is in the response from the .ru name servers for them. (Though be aware that I might not get a chance until the weekend. I'm just a random guy on the Internet posting on here when I get a chance, sometimes to put off the things that I'm actually supposed to be doing. It may be that other people would have better insights than I would, too.)

NikiN2 · September 15, 2023, 2:10pm

successful example: devpegass.ru (validated through a DNS TXT record)

unsuccessful example: dus.adc.spb.ru

Serg1900 · September 15, 2023, 3:15pm

novag.ru - certbot renew --dry-run - ok
kronus.spb.ru - not ok

jsha · September 15, 2023, 6:42pm

Yep, you've got it right. One more bit of information: boulder-va times out its requests to Unbound after 10 seconds, and will try again with a different randomly-selected Unbound, up to a total of 3 queries.
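The retry behaviour described here can be sketched roughly as follows. This is an illustration in Python, not Boulder's code: the function and exception names are hypothetical, and picking three distinct resolvers up front is an assumption about what "a different randomly-selected Unbound" means.

```python
import random
from typing import Callable, Optional, Sequence


class QueryTimeout(Exception):
    """Raised by the (hypothetical) query function when a resolver is too slow."""


def resolve_with_retries(
    resolvers: Sequence[str],
    query: Callable[[str, float], str],
    timeout_s: float = 10.0,
    max_attempts: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Try up to max_attempts randomly chosen resolvers, each with its own timeout."""
    rng = rng or random.Random()
    # Sample without replacement so a retry never reuses a resolver that timed out.
    attempts = rng.sample(list(resolvers), k=min(max_attempts, len(resolvers)))
    for resolver in attempts:
        try:
            return query(resolver, timeout_s)
        except QueryTimeout:
            continue  # give up on this resolver; it may keep working in the background
    raise QueryTimeout(f"query timed out after {len(attempts)} attempts")
```

With three attempts of 10 seconds each, the client side gives up after roughly 30 seconds at most, which is well short of the 77 seconds the production Unbound log showed for this lookup.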

I'm guessing this was done manually, since he would have had to take a resolver out of rotation and change its debug level in order to get a log like that. But also worth noting that Unbound will in many cases continue working on a resolution even after Boulder has given up on its query.

Yep! Some DNS problems show up either as "query timed out" (from the Boulder side) or "SERVFAIL" (from the Unbound side) depending on details of Unbound's resolution process, infra cache, and what other queries have run recently. It would be nice if we could be consistent here but Unbound's model of "keep trying really hard to answer the query so we can eventually cache it" isn't an exact match for Boulder's timeouts.

Osiris · September 15, 2023, 6:44pm

And a next Boulder attempt at resolution would hit the cache? I believe Let's Encrypt does cache for a few seconds/minutes.

jcjones · September 16, 2023, 3:32pm

Unlikely. The cache is 60s per resolver, and we have upwards of 100 to choose from randomly. You'd have to be fast and very lucky.
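As a back-of-the-envelope check on "very lucky", here is a small Python sketch. It assumes each follow-up query picks one of the resolvers uniformly and independently, that exactly one resolver holds the cached answer, and that all follow-ups arrive inside the 60-second window; the function name is made up for illustration.

```python
def cache_hit_probability(num_resolvers: int, followup_queries: int = 1) -> float:
    """Chance that at least one follow-up query lands on the one resolver with the cached answer."""
    if num_resolvers < 1 or followup_queries < 0:
        raise ValueError("need at least one resolver and a non-negative query count")
    miss_once = 1.0 - 1.0 / num_resolvers
    return 1.0 - miss_once ** followup_queries


print(f"{cache_hit_probability(100):.2%}")
```

With 100 resolvers, a single follow-up query has about a 1% chance of a cache hit under these assumptions, and the chance shrinks as the pool grows.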

Osiris · September 16, 2023, 3:34pm

Ah, I see..

Penny for your thoughts though: what's the use of a cache if it has less than a 1% chance of being hit?
