We are also running servers within New Zealand and are all experiencing the same thing this week. They are connected via the 2degrees/Vocus network from Auckland. We have not been made aware of any upstream network configuration changes to our IP ranges.
The difference from the issue above is that we are running Plesk on these servers; the symptoms, however, are the same:
- Renewal fails if HTTPS redirects are enabled.
- Renewal succeeds if the HTTPS redirect is disabled.
- We get the status 400 "Timeout after connect (your server may be slow or overloaded)" message on all sites, across all of our servers, when renewing.
Our Ethernet MTU is set to the default 1500 across all devices in our network.
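In case it helps others compare, here is roughly how we verified that. The interface name is an assumption (adjust to your NIC), and the path-MTU ping is left commented out since it needs network access:

```shell
# Read the configured MTU for an interface (eth0 is an assumption; adjust to
# the actual NIC name). Falls back to loopback if eth0 is absent.
IFACE=eth0
[ -e "/sys/class/net/$IFACE/mtu" ] || IFACE=lo
MTU=$(cat "/sys/class/net/$IFACE/mtu")
echo "$IFACE mtu: $MTU"

# Probe the path MTU with a non-fragmentable ping: 1472 bytes of ICMP payload
# plus 28 bytes of IP+ICMP headers is exactly 1500. If this fails while
# smaller payloads succeed, something on the path is clamping the MTU.
# ping -c 3 -M do -s 1472 acme-v02.api.letsencrypt.org
```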
Our firewall is a WatchGuard; its firmware was last updated just over a month ago. We can confirm we had successful renewals after that update, though.
As we're running this via Plesk and the SSL It! extension, which speaks ACME, I can't provide certbot versions. I can help with any debugging necessary, although pcaps could be tricky on these hosts.
A tcptraceroute on port 443 from the Let's Encrypt production service back to each of the affected servers we've identified would help us enormously, even if only to bring to Cloudflare support to ask them what's going on, since this seems to be specific to New Zealand providers so far.
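For reference, the shape of that command, shown here from the reverse direction since only Let's Encrypt staff can run it from the validation side. The target is a documentation-range placeholder (substitute an affected server's public IP, or an address in the validation range), and hop count and wait are capped so it finishes quickly:

```shell
# Placeholder target (TEST-NET-3); substitute a real affected address.
TARGET=203.0.113.10

# Prefer tcptraceroute if present; otherwise Linux traceroute can send TCP
# SYN probes itself (-T). Both are limited to 5 hops with a 1s wait.
if command -v tcptraceroute >/dev/null 2>&1; then
    OUT=$(tcptraceroute -n -m 5 -w 1 "$TARGET" 443 2>&1)
elif command -v traceroute >/dev/null 2>&1; then
    OUT=$(traceroute -T -p 443 -n -m 5 -w 1 "$TARGET" 2>&1)
else
    OUT="no TCP traceroute tool installed"
fi
echo "$OUT"
```

Note that TCP probes usually need root (raw sockets), so run this with sufficient privileges.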
I can do that soon. Sorry, I haven't read all the posts, but for my own information: is this happening only from our primary perspective, and not from the secondaries (in AWS)?
Tomorrow will mark a full week of this issue. Please do read the information provided; we have all taken time to share what we can (while respecting customer privacy, etc.).
All of the validation requests for production certificates I've seen so far have come from IPs in 23.178.112.0/24. If I add --staging to my requests, what I assume is both primary and secondary validation (multiple requests from very different networks) has worked every time I've tried it so far.
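For anyone wanting to reproduce the staging-vs-production comparison: since SSL It! speaks ACME, certbot from a host on the same network is a reasonable stand-in. The domain and webroot below are placeholders; `--dry-run` targets the staging environment and discards any issued certificate:

```shell
# Sketch only: domain and webroot path are placeholders.
if command -v certbot >/dev/null 2>&1; then
    # --dry-run uses the staging environment and throws away the result
    certbot certonly --dry-run --webroot -w /var/www/example -d example.com \
        && STATUS="staging validation succeeded" \
        || STATUS="staging validation failed"
else
    STATUS="certbot not installed; command shown for illustration only"
fi
echo "$STATUS"
```

If the dry run succeeds while a real (production) renewal times out, that matches the primary-perspective pattern described above.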
This thread's issue came right for us sometime before 11:45am today; production cert validation now seems to work on all of the previously affected hosts, with a normal 1500 MTU interface. I mention this for @mcpherrinm in case he does get TCP traceroutes back to us: they may be misleading now, since as far as I can tell we're no longer suffering the problem. I'll update here if it breaks again.
I'm not on the networking side of the room here, but will ask if there have been any relevant changes to how those servers are connected.
It is interesting that @tomryder-inspirenet has seen some change (see Tom's edit and next post). The issue is not resolved for us. This issue is external to any of the hosting providers in this thread.
Scratch that, sorry: I misunderstood how the caching works and should have been forcing renewal with a fresh account. The problem in fact persists for us, too.