HTTPS connection failures (timeouts) from validation servers

Hey team,

We are also running servers within New Zealand and are all experiencing the same thing this week. They are connected via the 2degrees/Vocus network from Auckland. We have not been made aware of any upstream network configuration changes to our IP ranges.

The difference from the issue above is that we're running Plesk on these servers; the symptoms, however, are the same:

  • Fails to renew if HTTPS redirects are enabled.
  • Able to renew if HTTPS redirect is disabled.
  • Get the status 400: "Timeout after connect (your server may be slow or overloaded)" message on all sites on all different servers when renewing.
  • Our ethernet MTU is set to default 1500 across all devices in our network.
  • Firewall is Watchguard, firmware was last updated just over a month ago. Can confirm we have had successful renewals after the last update though.
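Since packet size keeps coming up, a quick way to check whether full-size packets survive the path is a Don't-Fragment ping. This is a generic sketch (Linux iputils syntax; the target host is just an example), not something we've confirmed against our own setup:

```shell
# A 1500-byte MTU leaves 1472 bytes of ICMP payload:
# 1500 - 20 (IPv4 header) - 8 (ICMP header) = 1472.
MTU=1500
PAYLOAD=$((MTU - 20 - 8))
echo "max ICMP payload for MTU ${MTU}: ${PAYLOAD} bytes"

# On an affected server, send DF-flagged pings at that size;
# if this fails while a smaller -s value succeeds, something on
# the path is dropping large packets:
#   ping -M do -s "$PAYLOAD" -c 3 acme-v02.api.letsencrypt.org
```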

A domain name for checking in logs is: lambo.enlightenhosting.com

As we're running this via Plesk and the SSL It! extension, which speaks ACME, I can't provide certbot versions. We can try to help with any debugging necessary, although pcaps could be tricky on these hosts.


Can you see if lowering your MTU fixes it, and if so, at what value?

It seems there is likely some ISP or router, probably in New Zealand, which is dropping packets over a certain size.
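For anyone trying that, a temporary MTU change on Linux doesn't need a reboot. A sketch, assuming an `eth0` interface and a trial value of 1400 (both are example values; adjust for your own setup):

```shell
# Temporarily lower the interface MTU (not persistent across reboots).
ip link set dev eth0 mtu 1400

# Verify the change took effect.
ip link show dev eth0

# After testing a renewal, restore the default.
ip link set dev eth0 mtu 1500
```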


A tcptraceroute on port 443 from the Let's Encrypt production service back to each of the affected servers we've identified would help us enormously, even if only for us to bring to Cloudflare support and ask what's going on, since this seems to be specific to New Zealand providers so far.
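For anyone gathering data in the other direction (from an affected server toward the validation endpoint), a hedged sketch, assuming either the `tcptraceroute` package or a traceroute build with TCP support is installed:

```shell
# TCP traceroute to port 443 shows where SYNs stop being answered.
tcptraceroute acme-v02.api.letsencrypt.org 443

# Equivalent using Linux traceroute's TCP mode.
traceroute -T -p 443 acme-v02.api.letsencrypt.org
```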


I can do that soon. Sorry, I haven't read all the posts, but for my own information: is this only happening from our primary perspective, and not secondaries (in AWS)?


Permanent link to this check report shows a few places around the world
with Result of "Connection timed out" for https://compairtech.co.nz.


So it seems, but no such problem here for https://webhost19.inspire.net.nz/ (Check host: online website monitoring).


Tomorrow will mark a full week of this issue. Please do read the information provided; we have all taken time to provide what we can (while respecting customer privacy, etc.).

Good information exists here, with an endpoint Let's Encrypt staff can test against: "your server may be slow or overloaded" acme-v02.api - #16 by jc-rimu

I will re-assert: The MTU workaround is NOT a fix.

For the Let's Encrypt team: we do appreciate your steps towards identification/fix.

All of the validation requests for production certificates I've seen so far have come from IPs in 23.178.112.0/24. If I add --staging to my requests, what I assume is both primary and secondary validation (multiple requests from very different networks) has worked every time I've tried it so far.
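For those who can run certbot directly (we can't, through Plesk/SSL It!), the production-vs-staging comparison above can be reproduced with standard certbot flags. A sketch, assuming an existing certbot setup; example.com and the webroot path are placeholders:

```shell
# Rehearse renewal against the staging environment only
# (no production rate limits, certificates are not trusted).
certbot renew --dry-run

# Or request a throwaway staging certificate for a single domain.
certbot certonly --staging --webroot -w /var/www/html -d example.com
```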


Then the fix would be on whoever is dropping packets over a certain size.
Do you know yet who they are? And if so, have you contacted them as well?

Hi Bruce,

That's what we're trying to determine. I'll take it from here, thank you.


This thread's issue came right for us sometime before 11:45am today; production cert validation now seems to work on all of the previously affected hosts, with a normal 1500 MTU interface. I thought I should mention that to @mcpherrinm in case he does get TCP traceroutes back to us; they may be misleading now that we're no longer suffering the problem, as far as I can tell. I'll update here if it breaks again.

I'm not on the networking side of the room here, but will ask if there have been any relevant changes to how those servers are connected.

EDIT: I misunderstood how caching works; it's in fact still broken for me: HTTPS connection failures (timeouts) from validation servers - #35 by tomryder-inspirenet


It is interesting that @tomryder-inspirenet has seen some change (see Tom's edit and the next post). The issue is not resolved for us. This issue is external to any of the hosting providers in the thread.

Scratch that, sorry; I misunderstood how the caching works, and should have been forcing renewal with a fresh account. The problem in fact persists for us, too.


I've seen successful renewals on an affected server at MTU 1500; what are other people seeing?


Yes, it works for me now too: "your server may be slow or overloaded" acme-v02.api - #20 by tomryder-inspirenet


Yes, ours are all renewing at MTU 1500 now too as of Friday.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.