Well, there's no harm in trying, I guess. I doubt it's their equipment causing this - it would have to be two independent and different devices deciding to do the same thing at the same time - but they might at least be able to reach out to their upstream, which we can't.
There is definitely other DNS traffic being sent and received, approx. 500 packets per second according to the packet dump. That seems to match the average for night traffic.
Well, we're talking Central Europe to the United States, so we can't really affect much. And the route between all our nameservers and LE does indeed converge after a few hops. One just has to wonder whether the problem really only affects us, or if everyone else skipped any attempt to fix it and just decided to work around the issue.
It doesn't for us. To put it bluntly - if you can't reach our nameservers, there's little chance of you being able to reach the service/name you're querying the nameservers for.
FYI, this won't force DNS via TCP. Each name is a separate DNS query/response, so a certificate with a lot of names does not increase the size of each packet. The way to force DNS via TCP would be to include some extra records in responses so the size becomes larger than the edns-buffer-size, leading PowerDNS to set the TC (truncated) bit.
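To illustrate the truncation mechanism described above, here is a minimal sketch of the general rule (this is not PowerDNS's actual code, just the behavior from RFC 1035 and RFC 6891 that the post refers to):

```python
# Sketch of the DNS truncation decision: an authoritative server sets
# the TC bit when a UDP response would exceed the client's advertised
# EDNS buffer size, and the client then retries the query over TCP.
# 1232 is a commonly recommended default buffer size, used here as an
# illustrative assumption.

def should_truncate(response_size: int, edns_buffer_size: int = 1232) -> bool:
    """Return True when a UDP response must be truncated (TC bit set),
    prompting the client to retry over TCP."""
    return response_size > edns_buffer_size

# A TXT response for one ACME validation name is small, and each name
# is a separate query/response, so adding names to a certificate does
# not grow any individual packet:
assert not should_truncate(response_size=120)

# Only padding a single response past the buffer size would force TCP:
assert should_truncate(response_size=1400)
```

This is why stuffing extra records into responses is the only way to push resolution onto TCP: the per-response size is what matters, not the number of names being validated.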
I'll check with our SREs.
I'll second what @petercooperjr said. We've had a lot of experience with mysterious DNS problems being triggered by anti-DDoS measures. I recommend contacting the provider. You're right that we wouldn't generally expect the anti-DDoS device to stop traffic on the way out. But there may be additional queries blocked on the way in that you never see, leading Unbound to conclude that a server is down. It still doesn't fully explain the packet logs, but it's worth a look.
HTTPS is a little off-topic. The working hypothesis here is that there's a problem between @tlghacc's nameservers and Let's Encrypt specifically. We don't have reason to believe there is a more general connectivity problem between their servers and the rest of the world.
Presumably it would be easier to ask the server housing provider to temporarily turn off the DDoS service with respect to certain hosts. Spinning up a copy of a service with a whole different provider is usually quite a lot of work.
And I agree with @petercooperjr and @tlghacc here - I don't think having DNS providers in multiple countries is a requirement, or likely to fix the issue.
Several of us have looked into this and tried to interpret some traceroutes and other data, and we're still pretty stumped. It does seem like DDoS mitigation or rate limiting could be the cause.
But there's one more thing I noticed, which this thread hasn't covered yet:
The domain has three nameservers, but only two of them have "glue" records at nic.cz. As a resolver works downward from the root, it's going to receive the glue records for thinline.cz's nameservers right away, and I'll bet it will immediately try to query them before (or while) it tries to find ns3.cesky-hosting.eu's IP address in order to query it. Depending on the resolver's implementation, it might not even try that third nameserver. If that's the case, then thinline's DDoS protection could indeed kill the whole query, and having cesky-hosting doesn't help.
I'm sorry I don't have time to dig deeply into this at the moment, but maybe this is worth exploring and fact-checking.
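To make the glue hypothesis concrete, here is a rough simulation of the referral situation (the nameserver names match the thread's example; the addresses are placeholders, and real resolvers like Unbound use far more elaborate server selection than this):

```python
# Rough sketch of the referral a resolver gets from the .cz servers.
# Glue (A records in the additional section) only exists for the
# in-bailiwick nameservers; addresses below are illustrative only.

referral = {
    "zone": "thinline.cz",
    "nameservers": [
        "ns1.thinline.cz",
        "ns2.thinline.cz",
        "ns3.cesky-hosting.eu",
    ],
    "glue": {
        "ns1.thinline.cz": "192.0.2.1",  # placeholder address
        "ns2.thinline.cz": "192.0.2.2",  # placeholder address
    },
}

def immediately_usable(referral: dict) -> list:
    """Nameservers the resolver can query right away, without first
    recursing further to learn the nameserver's own address."""
    return [ns for ns in referral["nameservers"] if ns in referral["glue"]]

# With an empty cache, ns3.cesky-hosting.eu needs an extra resolution
# step, so the glued servers are likely to be tried first:
print(immediately_usable(referral))  # ['ns1.thinline.cz', 'ns2.thinline.cz']
```

If the resolver gives up (or caches a "down" verdict) before finishing that extra step for the out-of-bailiwick server, the third nameserver never helps.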
A good observation, though to be clear, it would be incorrect for [abcd].ns.nic.cz's response for NS thinline.cz to contain glue for ns3.cesky-hosting.eu, since it's out of bailiwick. But I think what you're saying is: the ns.thinline.cz servers will probably be hit a little more often by Unbound because it can find them right away when the cache is empty rather than having to recurse further.
I don't think we have seen a distinction between the nameservers in terms of likelihood to error, right?
I would have expected that if Unbound was getting no response back from one authoritative DNS server, that it would retry using a different one. Isn't that kind of the point of having multiple?
I had thought that while @rg305's suggestion of spinning up a DNS server somewhere closer to Let's Encrypt's servers might be overkill and have some downsides, that it would in fact solve the problem because Unbound would eventually try it and get responses.
I still somewhat doubt it but I will ask our housing provider about this next week.
I compared the dumps of NS1 and NS3 and counted packets from/to 18.104.22.168 - they pretty much don't differ (approx. 3% difference, with NS3 actually getting more traffic that night).
Just to add, NS3 is sitting pretty much next to NS1; its purpose is not to be geographically/provider redundant but top-level-domain redundant in case .cz fails. (Or "fails" - not that we have much fear of all .cz nameservers going down, but some years ago we had a case where someone wanted a domain's content removed and went to court over it. The judge issued a preliminary order directing NIC.CZ to shut down the offending domain by removing thinline.cz from the .cz zone. Making a domain's nameservers unavailable certainly shuts it down, but the side effects this solution would have had if it had actually gone through are obvious.)
True, but if we are talking about working around the problem instead of solving it, then so would asking LE to start the validation process over from the beginning. (Which - to reiterate - we are completely willing to do if the issue proves unsolvable and no one from LE says "don't do that".)
As for the 4th nameserver suggestion itself, I am not going to debate how easy it is or isn't to create a new nameserver, because it is actually irrelevant. Even if I could create that nameserver with a snap of my fingers, there would still be the issue of changing nameserver records for some tens of thousands of domains, which would need to be done at the registrar level. And as far as I know, of the major top-level domains only .cz allows a mass change; everything else needs to be changed on a per-domain basis, and we don't even have administrative privileges to do that for every domain using our nameservers - in those cases we'd need to ask the customers.
On the other hand, making the renewal process start over once or twice on DNS error is a programming task for one or two afternoons.
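The retry approach described above could be sketched roughly like this (the `renew` and `is_dns_error` callables are placeholders for whatever the real client provides; this is not an actual ACME client):

```python
import time

# Sketch of "restart the renewal once or twice on DNS error".
# renew() stands in for the real validation/renewal routine;
# is_dns_error() stands in for whatever check classifies the failure.

def renew_with_retries(renew, is_dns_error, attempts=3, delay_seconds=600):
    """Run renew(); if it fails with what looks like a transient DNS
    error, start the whole process over a limited number of times
    before escalating to a human."""
    for attempt in range(attempts):
        try:
            return renew()
        except Exception as exc:
            if not is_dns_error(exc) or attempt == attempts - 1:
                raise  # permanent error, or out of retries: escalate
            time.sleep(delay_seconds)  # wait before starting over
```

The important part is bounding the retries and re-raising anything that doesn't look transient, so genuinely broken configurations still surface.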
As I'd said in my very first reply, yes you should be retrying on failures, up to a few times a day, and really only get a human involved if renewal doesn't succeed for like a whole week or two. Even if all the packets did work one day, it may be that on the next day a different Internet route is down or Let's Encrypt is down or whatever. I would have expected that with most people's setups a 5% failure rate really isn't something they'd notice (unless they dug into logs as you are doing), as most common clients (like certbot) are configured to just try each certificate renewal twice a day until it succeeds.
I surmise there's probably some research out there showing a non-linear correlation between sporadic failure rate and the distance between the querier and the queried. Roughly, in short: the more variable the routes, the more ways to fail.
I think I'm not reading that right, the logic has left me.
Let me put it this way (you know I like analogies):
PROBLEM: If I'm trying to locate you (like I would try to locate your website).
GIVEN: All the participants always know where you are (like DNS server do).
Choice #1: Call your cell phone and ask you where you are.
[until question has been answered]
Weakness: reliant on you having cell service, etc.
Choice #2: Call your cell phone and your significant other's cell phone and ask where you are.
[until question has been answered]
Weakness: reliant on one of you having cell service, etc.
[and since you travel together and have the same cell company, when one is down so is the other]
Choice #3: Call you, "other" you, your siblings, your parents, and a few of your closest friends.
[until question has been answered]
~~Weakness~~ Strength: only one of a dozen people needs to have cell service, etc.
The first two choices put all cellular eggs into one provider's basket.
The third choice is spread across many cellular providers, so that dependency is no longer a weakness.
But who am I to point out the obvious?
Master Of The Totally Obvious
[for those that happen to miss the totally obvious: That's my "MOTTO"]
In conclusion, I would say:
the more variable the routes, the more ways to ~~failure~~ succeed!
When querying a DNS server on the other side of the world, there's almost guaranteed to be a lot more infrastructure, technical problems, and geopolitical interference than if the DNS server were down the street in the same country. Thus, if most DNS servers for a particular domain name are further rather than nearer to the querier (LE), failure seems more likely.
Individualized failures: YES.
As a whole: NO. More is always better than less.
Presuming that having DNS servers located within a country means all the required DNS servers are located within that country misses how DNS is commonly used.
With the advent of things like "22.214.171.124" and "126.96.36.199" (and now even "188.8.131.52"), it stands to reason that you can't always get all the DNS servers within every single country.
So the guy looking for your website will likely be asking CF or Google (outside their country) for your IP (in their country). [especially more so as the size of the country's Internet presence is reduced]
That is a different scenario, though. If the route to LE is down or LE itself is down (or returning HTTP 503 errors, like a few days ago), that's a transient error. Such an error says to me "whatever you tried might be OK or it might not - as of now, we don't know; you can try the same thing again later". And in response I (meaning the software, automatically) of course try again.
However, this is not the case I am trying to deal with here. Here LE clearly says to me "there is an error on your side, you did something wrong, you should not try again until you fix it", underlined by the 400 "status" code, which for HTTP is a permanent "do not try the same thing again, you will get the same result" error (although admittedly I did not check the docs to confirm that the code indeed carries the HTTP semantics here). Unlike the previous case, here I shouldn't just wave my hand with "yeah, they say that sometimes" and simply retry, because whatever error LE sees might be a canary for something other users encounter later on a larger scale. That's why I came here.
From what I concluded from this discussion, most likely there is a recurring transient problem which breaks the connection between LE and our nameservers. That transient problem should probably be handled as such by LE (keeping the validation in a "pending" state and retrying later), but it isn't - it's reported as a permanent error instead (and the validation goes to an "invalid" state).
To be clear, I am not blaming LE for doing it wrong - this feels more like a result of shortcomings in DNS itself, where the recursive resolver returns the same kind of error for transient problems (nameservers unreachable) and permanent ones (incorrectly set up DNSSEC, for example). Fixing that would mean asking LE to develop their own recursive resolver, which would - at least in my opinion - be unreasonable.
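On the client side, one pragmatic option is to treat the DNS-flavored ACME error types as retryable even though they arrive with a 400 status. The error URNs below are real types from RFC 8555 section 6.7; treating them as transient is this thread's working assumption, not official Let's Encrypt guidance:

```python
# ACME problem types (RFC 8555, section 6.7) that, per this thread's
# experience, often reflect transient network/DNS trouble rather than
# a genuinely broken configuration. The classification is a judgment
# call by the client, not something the protocol guarantees.
RETRYABLE_ACME_ERRORS = {
    "urn:ietf:params:acme:error:dns",         # problem during a DNS query
    "urn:ietf:params:acme:error:connection",  # could not reach the target
}

def should_retry(problem_type: str) -> bool:
    """Decide whether a failed validation is worth restarting later."""
    return problem_type in RETRYABLE_ACME_ERRORS

assert should_retry("urn:ietf:params:acme:error:dns")
assert not should_retry("urn:ietf:params:acme:error:caa")  # CAA forbids issuance: retrying won't help
```

Anything outside that small set (bad CAA, wrong TXT record, rejected identifier) really is "fix it before trying again" territory.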
Nevertheless, what I can now be fairly certain about is that this kind of permanent error reported by LE might be (and in this case most likely is) caused by a transient error en route, and it can be dealt with as such. (I am still going to try reaching out to our provider, though, in case they have any insight.)
If I find any more information, I will post it here. In any case, thanks for your time.
I think you've got everything right in your analysis there: Let's Encrypt (or rather, the DNS resolver they're using) can't always distinguish between a permanent or transient failure, and to some extent they really can't given how UDP and DNS work.
While it's always annoying when not everything matches the specs perfectly, I don't think there are any clients that really try to distinguish between types of errors (even when they should). Like I said, the popular client certbot just tries to renew a couple times a day, regardless of why a failure happens, until it succeeds. This is actually pretty terrible in some ways: if a site is moved elsewhere (DNS pointed to a different server) but the original server is still running and nobody updates certbot on it, certbot will dutifully keep attempting renewal (and failing each time) forever. I think things like this are what lead to about 80% of HTTP-01 validations failing, which means more clients should actually be smarter about not retrying once it has failed for long enough.
But if what you're looking for is the "official advice", the closest thing to that which I know of is in the last section of the Integration Guide, which doesn't really make any attempt at distinguishing between temporary or permanent errors, and seems to me to just say that any "Renewal failure should not be treated as a fatal error."