Thank you for the inputs, we might very well rework the way we handle certificates with a large number of SAN in them.
I think in the past this exact scenario happened organically as the cached validations timed out in different intervals only a few of the domains were up for validation each run.
But then LE stopped processing the responses because they grew bigger and bigger leading to even more domains expiring from the cache only compounding on the size of the subsequent validation TXT records.
What stumped me for far to long during debugging is that the primary validation always passed and the secondary did not. So I was looking at the anycast setup of our dns provider thinking they were serving a different set or maybe errors when queried in a particular location..