DNS problem: query timed out looking up TXT

If you look at the debug log from the production resolver that J.C. posted (which I know is hard to read), I think it shows Unbound connecting to all of the name servers. There are also good-sized gaps in the timestamps, which I read as Unbound waiting a while for responses before eventually timing out. But it might just be that it didn't see any responses with the same case as the query it sent, so it started its fallback process. The "Capsforid fallback: getting different replies, failed" line at the end makes me think it's giving up because it's getting different responses from different servers, and since none of them echo the case of the query, it suspects an attacker may be trying to inject a forged response.
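For anyone unfamiliar with capsforid (DNS 0x20 encoding), here is a minimal Python sketch of the idea as I understand it: the resolver randomizes the letter case of the query name, expects the reply to echo it exactly, and otherwise falls back to comparing answers from several servers. The function names are made up for illustration, and this is not Unbound's actual code; the real fallback logic is more involved.

```python
import random


def randomize_case(qname, rng):
    """Give each letter a random case (the 0x20 trick); non-letters are unchanged."""
    return "".join(
        ch.upper() if rng.random() < 0.5 else ch.lower()
        for ch in qname
    )


def reply_matches(sent_qname, reply_qname):
    """A reply passes only if it echoes the query name with identical case."""
    return sent_qname == reply_qname


def fallback_ok(answers):
    """Simplified fallback check (an assumption about the general idea):
    accept only if every queried server returned the same answer."""
    return len(answers) > 0 and len(set(answers)) == 1
```

Under this simplified model, servers that lowercase the name fail `reply_matches`, and if their answers then differ from each other, `fallback_ok` fails as well, which would correspond to the "getting different replies, failed" line.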

Also, the fact that the whole thing takes over a minute makes me think that the validation server may be giving up (returning the timeout error) before the whole process has finished running (even if it would have finished with an error). That is, it's not one particular server taking a long time; it's the entire process going over a limit.


I parsed the unboundtest.com output (https://unboundtest.com/m/TXT/_acme-challenge.abisoft.spb.ru/PUXEEOBD) and, from an AWS instance in the Virginia region, queried every IP address (IPv4 and IPv6) mentioned in it. All of them replied without any delay.
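A check like this can be sketched in Python with only the standard library: build a TXT query by hand, send it over UDP to one server, and time the reply. The helper names and the 5-second timeout are assumptions for illustration, not what was actually run here.

```python
import socket
import struct
import time


def build_txt_query(qname, query_id=0x1234):
    """Build a minimal DNS query packet for a TXT record."""
    # Header: ID, flags (0 = standard query, recursion not requested), QDCOUNT=1
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # Each label is encoded as a length byte followed by the label bytes
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=16 (TXT) and QCLASS=1 (IN)
    question = labels + b"\x00" + struct.pack("!HH", 16, 1)
    return header + question


def time_query(server_ip, qname, timeout=5.0):
    """Return seconds until the first UDP reply, or None on timeout/error."""
    family = socket.AF_INET6 if ":" in server_ip else socket.AF_INET
    packet = build_txt_query(qname)
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(packet, (server_ip, 53))
            sock.recvfrom(4096)
        except OSError:  # socket.timeout is a subclass of OSError
            return None
        return time.monotonic() - start
```

Calling `time_query(ip, "_acme-challenge.abisoft.spb.ru")` for each address gives a rough latency figure. A quick reply alone does not show whether the server echoes the query's letter case, which is what capsforid checks.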

The whole process (from unboundtest) took 30 seconds, which should not be that much.

The last line of the unboundtest report is not clear to me:
Error running query: read udp 127.0.0.1:59674->127.0.0.1:1053: i/o timeout

It doesn't look like an issue with resolving. It seems to be something else.


That's indicating that the DNS client didn't get a response from Unbound in time; Unbound there is running on port 1053.


Right. It took 30 seconds for Unbound to perform all the checks. Do you think that's not enough for the client?

@petercooperjr I confirmed that unboundtest is still running the same version of Unbound as prod, and all config changes are reflected in unboundtest's config.

@letsuser The DNS client part of unboundtest sets a timeout of 30 seconds. That's actually more generous than the DNS client -> Unbound timeout in prod, which is 10 seconds. Still, in this case it seems like it's not actually a matter of servers being slow, but rather servers giving bad capsforid answers and triggering repeated fallback.


Gotcha! Thanks, @jsha. Don't you think the fallback currently works kind of weirdly, if it can't fall back within a reasonable time? How should it be handled when the remote nameservers don't support capsforid?


I have exactly the same problem with a domain in the spb.ru zone. According to statistics, there are more than 29,000 domains in spb.ru, so there will probably be a lot of reports about this soon.


Yes, I also have the same problem with spb.ru domains, and I would like to understand the cause. As a last resort I could switch to Cloudflare, but I'd rather not.

I want to make sure I'm understanding this: the Let's Encrypt validation system gives up (and returns a timeout error to the ACME client) after 10 seconds, but the log J.C. posted earlier showed the production Unbound in fact taking 77 seconds. But even if the timeout were longer, presumably the failure at the end of the Unbound log means it would return an error (SERVFAIL?) anyway?

I think that the issue is actually with the .ru servers, not anything specific to spb.ru. (Though again, I'm doing a lot of guessing.) Does anyone know if other names under .ru are working? Has anyone managed to try another CA to see if it works (or at least gives a different error message)?


'.ru' - ok.
'.spb.ru' - same problem


'.msk.ru' - same problem

DNS problem: query timed out looking up TXT for _acme-challenge.harvia.msk.ru


DNS problem: query timed out looking up A for school153.spb.ru; DNS problem: query timed out looking up AAAA for school153.spb.ru

DNS problem: query timed out looking up CAA for school153.spb.ru

During secondary validation: DNS problem: query timed out looking up A for school153.spb.ru; DNS problem: query timed out looking up AAAA for school153.spb.ru

DNS problem: looking up CAA for spb.ru: DNSSEC: Bogus

All of these errors are for the same site, on different days.


Tested:
.ru - works
.spb.ru - doesn't work


Can you give a more specific example of a domain that is under .ru (and not under .spb.ru) and that got a certificate from Let's Encrypt in the last week or so? I'd love to dig into how the responses from the .ru name servers differ for them. (Though be aware that I might not get a chance until the weekend. I'm just a random guy on the Internet posting on here when I get a chance, sometimes to put off the things that I'm actually supposed to be doing. Other people may well have better insights than I would, too.)


successful example: devpegass.ru (validated through a DNS TXT record)

unsuccessful example: dus.adc.spb.ru


novag.ru - certbot renew --dry-run - ok
kronus.spb.ru - not ok


Yep, you've got it right. One more bit of information: boulder-va times out its requests to Unbound after 10 seconds, and will try again with a different randomly-selected Unbound, up to a total of 3 queries.
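Put together, the Boulder side behaves roughly like the Python sketch below. This is only an illustration of the numbers above (10 seconds per query, up to 3 queries): `query_fn`, `QueryTimeout` and the resolver list are hypothetical stand-ins, not Boulder's real API (Boulder is written in Go), and I'm assuming each attempt simply picks a resolver at random.

```python
import random

PER_QUERY_TIMEOUT = 10.0  # seconds, per the description above
MAX_ATTEMPTS = 3          # total queries before giving up


class QueryTimeout(Exception):
    """Raised when a resolver does not answer within the timeout."""


def resolve_with_retries(qname, resolvers, query_fn, rng=random):
    """Try up to MAX_ATTEMPTS randomly chosen resolvers.

    query_fn(resolver, qname, timeout) is a hypothetical callable that
    returns an answer or raises QueryTimeout.
    """
    for _ in range(MAX_ATTEMPTS):
        resolver = rng.choice(resolvers)
        try:
            return query_fn(resolver, qname, PER_QUERY_TIMEOUT)
        except QueryTimeout:
            continue  # give up on this resolver and pick another
    raise QueryTimeout(f"query timed out looking up {qname}")
```

In this model the worst case is about 3 × 10 = 30 seconds before the client sees "query timed out", which is well short of the 77 seconds Unbound took in the posted log.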

I'm guessing this was done manually, since he would have had to take a resolver out of rotation and change its debug level in order to get a log like that. But also worth noting that Unbound will in many cases continue working on a resolution even after Boulder has given up on its query.

Yep! Some DNS problems show up either as "query timed out" (from the Boulder side) or "SERVFAIL" (from the Unbound side) depending on details of Unbound's resolution process, infra cache, and what other queries have run recently. It would be nice if we could be consistent here but Unbound's model of "keep trying really hard to answer the query so we can eventually cache it" isn't an exact match for Boulder's timeouts.


And would the next Boulder attempt at resolution hit the cache? I believe Let's Encrypt does cache for a few seconds or minutes.


Unlikely. The cache is 60s per resolver, and we have upwards of 100 to choose from randomly. You'd have to be fast and very lucky.
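To put a rough number on "very lucky", here is a back-of-the-envelope sketch in Python. It assumes exactly 100 resolvers (the real count is "upwards of 100"), that only one of them has the answer cached, and that each attempt picks a resolver uniformly and independently; all of those are simplifications.

```python
def cache_hit_probability(num_resolvers, warm_resolvers=1, attempts=1):
    """Chance that at least one of `attempts` random picks lands on a
    resolver that already has the answer cached."""
    miss = 1 - warm_resolvers / num_resolvers
    return 1 - miss ** attempts


single = cache_hit_probability(100)              # about 0.01
three = cache_hit_probability(100, attempts=3)   # about 0.03
```

So even with all three queries, and even if they all arrive inside the 60-second window, the chance of landing on the warm resolver is only around 3% under these assumptions.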


Ah, I see..

Penny for your thoughts, though: what's the use of a cache if it has less than a 1% chance of being hit?