Wow, thanks for including all that detail. I'm not sure if it'll be me, but I suspect somebody here will be able to help you.
First, to try to answer your questions:
Well, I've got some questions below to help dig into this, but it does seem like something weird is going on, that you'd see a specific response but Let's Encrypt would see a SERVFAIL, especially if the domain isn't using DNSSEC.
Let's Encrypt does check multiple nameservers, from multiple places on the Internet. They need to ensure that you actually own the name as seen from everywhere on the Internet, and the Internet doesn't always show the same things at different places, so they need to be thorough.
If the error message says "secondary validation" in it, then it worked from one location (their primary) but failed at a couple of their other locations. I don't know whether that's what's happening for you here, though.
Well, the general advice for when to renew is to start 30 days before certificate expiration, and then if you have a problem you can continue retrying once or twice a day. Intermittent problems do happen, and Let's Encrypt does occasionally go down or suspend issuance, and you should probably have monitoring that alerts you if several attempts for a name haven't worked and the expiration is getting nearer. But if you're consistently having 5%-ish of your attempts fail, then that does seem high, and yes, it's probably good to dig into it like you are here.
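To make that renewal-and-alerting advice a bit more concrete, here's a minimal sketch in Python. Only the 30-day window comes from the general advice; the function name, the 14-day alert threshold, and the "three failures" count are assumptions I made up for illustration, so adjust them to whatever suits your setup.

```python
from datetime import datetime, timedelta, timezone

RENEW_BEFORE = timedelta(days=30)  # general advice: start 30 days before expiry
ALERT_BEFORE = timedelta(days=14)  # assumed threshold, not an official number
ALERT_AFTER_FAILURES = 3           # assumed meaning of "several" failed attempts


def renewal_action(not_after, consecutive_failures, now=None):
    """Return "wait", "renew", or "alert" for a single certificate.

    not_after and now must both be timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    remaining = not_after - now
    if remaining > RENEW_BEFORE:
        return "wait"
    if consecutive_failures >= ALERT_AFTER_FAILURES and remaining <= ALERT_BEFORE:
        # Keep retrying in the background, but get a human involved.
        return "alert"
    return "renew"
```

With something like this, an occasional failed attempt just leads to another retry later, and only repeated failures close to expiration page somebody.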
(And just to be clear, since you're saying you want "confirmation from LE side", I'm just a random person on the Internet and not any sort of official spokesperson for Let's Encrypt.)
Secondly, I've got a few questions for you, too:
- Are you always both the web hosting provider and the DNS provider for the domains that (sometimes) have problems? That is, is it the same DNS servers for all these names?
- Are you always using the DNS-01 challenge?
- Can you leave some test value in a `_acme-challenge` TXT record so that we can try hammering it with various DNS clients and seeing if we can see something odd?
- It seems weird to me that your packet captures are for IPv4 addresses, when your DNS servers (or at least the servers for `_acme-challenge.lidovapisen.cz`) support IPv6 as well. Let's Encrypt would use IPv6 if possible. Do you have any packet captures for IPv6?
- Are your packet captures for UDP, TCP, or both?
- When you try your hundreds of certificates and 5–10 fail, how often do those failures then work on a retry attempt? Like, does it almost always work the second time?
- What time of day does your renewal process happen? Is it spread out throughout the day for those hundreds of certificates, or do they all happen "at once"? Does it often happen at zero minutes past an hour?
- Do any of the names use DNSSEC?
- What DNS server software and version are you running?
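
In case it helps with the "hammering" idea, here's a rough sketch of what I have in mind, using only the Python standard library. The function names are mine, it only speaks UDP (TCP would need a two-byte length prefix on each message), and it sends a plain non-recursive query with no EDNS or DNSSEC bits, so it is not a reproduction of what Let's Encrypt's resolvers do. It just asks one authoritative server for a TXT record and reports the response code.

```python
import socket
import struct


def build_txt_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the TXT record of name."""
    # Header: ID, flags (all zero: standard query, no recursion desired),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 16 = TXT, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 16, 1)


def response_rcode(packet):
    """Extract the RCODE from a DNS response (0 = NOERROR, 2 = SERVFAIL)."""
    flags = struct.unpack("!H", packet[2:4])[0]
    return flags & 0x000F


def query_udp(server_ip, name, timeout=3.0):
    """Send one TXT query over UDP to an IPv4 or IPv6 address; return the RCODE."""
    family = socket.AF_INET6 if ":" in server_ip else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_txt_query(name), (server_ip, 53))
        data, _ = sock.recvfrom(4096)
    return response_rcode(data)
```

You'd call `query_udp()` in a loop against each of your nameservers' IPv4 and IPv6 addresses for a name like `_acme-challenge.lidovapisen.cz`, and count how often you get a timeout or a nonzero code.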
Hopefully that will help people here be able to dig into this more.