If you look at the debug log from the production resolver that J.C. posted (which I know is hard to read), I think it shows it connecting to all of the name servers. There are also good-sized gaps in the timestamps, which I think show it waiting a while for responses before eventually timing out. But it might just be that it didn't see any responses with the same case as the query it sent, so it started its fallback process. The "Capsforid fallback: getting different replies, failed" line at the end makes me think it's giving up because it's getting different responses from different servers, and since none of them echo the case of the query, it thinks an attacker might be trying to inject a different response.
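For anyone who hasn't run into capsforid (DNS 0x20) before, here is a rough Python sketch of the idea as I understand it. It is a simplified model, not Unbound's actual code: the function names are made up for illustration, and the "all servers must agree" rule is my reading of the log message rather than something I've confirmed in the source.

```python
from __future__ import annotations

import random


def randomize_case(name: str, rng: random.Random) -> str:
    """Give each letter of the query name a random case (the 0x20 trick)."""
    return "".join(
        ch.upper() if rng.random() < 0.5 else ch.lower() for ch in name
    )


def case_echoed(sent_qname: str, reply_qname: str) -> bool:
    """A reply only counts if it echoes the query name byte for byte."""
    return sent_qname == reply_qname


def fallback_consistent(answers: list[frozenset[str]]) -> bool:
    """Assumed fallback rule: accept only if every server gave the same answer set."""
    return len(answers) > 0 and len(set(answers)) == 1


# Hypothetical usage: a server that lowercases the name fails the echo check.
rng = random.Random()
sent = randomize_case("school153.spb.ru", rng)
print(sent, case_echoed(sent, sent.lower()))
```

An off-path attacker has to guess the random case pattern on top of the query ID and port, which is why a resolver treats a mismatched or inconsistent reply as suspicious rather than just sloppy.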
Also, the fact that the whole thing takes over a minute makes me think that the validation server may be giving up (giving the timeout error) before the whole process is done running (even if it would complete with an error). That is, it's not one particular server taking a long time, it's the entire process that's going over a limit.
@petercooperjr I confirmed that unboundtest is still running the same version of Unbound as prod, and all config changes are reflected in unboundtest's config.
@letsuser The DNS client part of unboundtest sets a timeout of 30 seconds. That's actually more generous than the DNS client -> Unbound timeout in prod, which is 10 seconds. Still, in this case it seems like it's not actually a matter of servers being slow, but rather servers giving bad capsforid answers and triggering repeated fallback.
Gotcha! Thanks, @jsha. Don't you think the fallback currently works kinda weirdly, if it can't fall back within a reasonable time? How should that be handled if the remote nameservers don't support capsforid?
I have exactly the same problem with a domain in the spb.ru zone. According to statistics, there are more than 29,000 domains in spb.ru, so there will probably be a lot of reports about this soon.
Yes, I also have the same problem with spb.ru domains. I would like to understand the reason. As a last resort I could switch to Cloudflare, but I would rather not.
I want to make sure I'm understanding this: The Let's Encrypt validation system gives up (and returns a timeout error to the ACME client) after 10 seconds, but the log J.C. posted earlier showed the production Unbound in fact taking 77 seconds. But even if the timeout was longer, presumably the failure at the end of the Unbound log would mean that it would return an error (SERVFAIL?) anyway?
I think that the issue is actually with the .ru servers, not anything specific to spb.ru. (Though again, I'm doing a lot of guessing.) Does anyone know if other names under .ru are working? Has anyone managed to try another CA to see if it works (or at least gives a different error message)?
DNS problem: query timed out looking up A for school153.spb.ru; DNS problem: query timed out looking up AAAA for school153.spb.ru
DNS problem: query timed out looking up CAA for school153.spb.ru
During secondary validation: DNS problem: query timed out looking up A for school153.spb.ru; DNS problem: query timed out looking up AAAA for school153.spb.ru
DNS problem: looking up CAA for spb.ru: DNSSEC: Bogus
Can you give a more specific example of a domain which is under .ru (and not under .spb.ru), which got a certificate from Let's Encrypt in the last week or so? I'd love to try to dig into what the difference is in the response from the .ru name servers for them. (Though be aware that I might not get a chance until the weekend. I'm just a random guy on the Internet posting on here when I get a chance, sometimes to put off the things that I'm actually supposed to be doing. It may be that other people would have better insights than I would, too.)
Yep, you've got it right. One more bit of information: boulder-va times out its requests to Unbound after 10 seconds, and will try again with a different randomly-selected Unbound, up to a total of 3 queries.
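To make that retry behavior concrete, here is a minimal Python sketch. Boulder itself is written in Go, so this is only an illustration: `lookup_with_retries`, `QueryTimeout`, and the `query` callback are hypothetical names, and I'm assuming each pick is independent (the same Unbound could come up twice), which the description above doesn't settle.

```python
import random

TIMEOUT_SECONDS = 10  # per-query timeout towards Unbound, as described above
MAX_TRIES = 3         # total number of queries before giving up


class QueryTimeout(Exception):
    """Raised by the query callback when a resolver does not answer in time."""


def lookup_with_retries(resolvers, query, rng=random):
    """Ask up to MAX_TRIES randomly chosen resolvers; re-raise the last timeout."""
    last_error = None
    for _ in range(MAX_TRIES):
        resolver = rng.choice(resolvers)  # assumption: picks may repeat
        try:
            return query(resolver, TIMEOUT_SECONDS)
        except QueryTimeout as err:
            last_error = err
    raise last_error
```

Under these numbers the client side gives up after at most 3 × 10 = 30 seconds, well short of the 77 seconds Unbound spent in the log, so the ACME client would see "query timed out" before Unbound reaches its own failure.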
I'm guessing this was done manually, since he would have had to take a resolver out of rotation and change its debug level in order to get a log like that. But also worth noting that Unbound will in many cases continue working on a resolution even after Boulder has given up on its query.
Yep! Some DNS problems show up either as "query timed out" (from the Boulder side) or "SERVFAIL" (from the Unbound side) depending on details of Unbound's resolution process, infra cache, and what other queries have run recently. It would be nice if we could be consistent here but Unbound's model of "keep trying really hard to answer the query so we can eventually cache it" isn't an exact match for Boulder's timeouts.