HTTPS connection failures (timeouts) from validation servers

Per: "your server may be slow or overloaded" acme-v02.api - #4 by MikeMcQ

My domain is: webhost19.inspire.net.nz

I ran this command: sudo certbot renew --cert-name=webhost19.inspire.net.nz

It produced this output:

Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/webhost19.inspire.net.nz.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Renewing an existing certificate for webhost19.inspire.net.nz

Certbot failed to authenticate some domains (authenticator: webroot). The Certificate Authority reported these problems:
  Domain: webhost19.inspire.net.nz
  Type:   connection
  Detail: 203.114.129.15: Fetching https://webhost19.inspire.net.nz/.well-known/acme-challenge/JyK8tXaWUdqsB8E9aCdTL7kDNho_F-ck8zKteDj_lNU: Timeout after connect (your server may be slow or overloaded)

Hint: The Certificate Authority failed to download the temporary challenge files created by Certbot. Ensure that the listed domains serve their content from the provided --webroot-path/-w and that files created there can be downloaded from the internet.

Failed to renew certificate webhost19.inspire.net.nz with error: Some challenges have failed.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
All renewals failed. The following certificates could not be renewed:
  /etc/letsencrypt/live/webhost19.inspire.net.nz/fullchain.pem (failure)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 renew failure(s), 0 parse failure(s)

My web server is (include version): Apache HTTPD 2.4.65-1~deb12u1

The operating system my web server runs on is (include version): Debian GNU/Linux v12.12

My hosting provider, if applicable, is: Inspire Net Ltd

I can login to a root shell on my machine (yes or no, or I don't know): yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): v2.1.0

Quoting my initial reply to @jc-rimu with further technical details:

The certificates that have begun to fail validation have HTTP to HTTPS redirects. The HTTP request from the validation server completes, getting a 301 response, but we never see the redirected HTTPS request come in. Interestingly, dry runs with the --staging environment flag do work, both HTTP and HTTPS.

We have had upstream networking changes in the last few weeks, which is what we're exploring now, but it stands out to me that HTTP works while HTTPS does not.

All of the missing HTTPS requests were expected from validation servers in 23.178.112.0/24.

We have noticed that in at least one case an HTTPS request is in fact hitting our Fortigate firewall, but ending in "client-rst" and "close" with no trace in the Apache HTTPD log files. We're pondering whether there was some failure in SSL/TLS negotiation.

This has started happening on five different webservers this morning, all various versions of Debian GNU/Linux. All affected certificates have been renewing with no problems until this morning, in some cases for years.

Happy to provide further information!

1 Like

Are they all behind the same Fortigate firewall?

Is Fortigate terminating the TLS connection? I don't know the name for it with them, but some firewalls do deep packet inspection, whereby they terminate TLS to inspect packets and then send their own HTTPS requests to your server(s).

2 Likes

Are they all behind the same Fortigate firewall?

They are.

Is Fortigate terminating the TLS connection?

It does not. TLS is terminated on the webservers themselves. There is no "deep packet inspection" for HTTPS configured on that firewall.

The firewall being an issue had occurred to us, but:

  1. There have been no configuration changes to the firewall.
  2. Only Let's Encrypt validation seems to be failing.
  3. Only the Let's Encrypt production validation servers' requests seem to be failing. Adding --dry-run and --staging makes everything work. The staging validation servers come from a set of very different IP ranges, and all of those requests complete.
1 Like

Any firmware changes? A few years ago a different vendor of firewalls changed a key setting during an update. Created lots of problems :slight_smile:

Based on the error message it looks like the connection from the LE primary center is failing. It uses a Cloudflare networking product for outbound traffic (IIRC).

This does look like it might be on the LE side. At the same time, plenty of people redirect HTTP->HTTPS, and if the LE primary center failed all of those, the number of failures would be huge. And if this had started a few days ago, the LE alarms would have been at "max" :slight_smile: Not to mention the number of complaints here would have skyrocketed. Based on that, it could still be on the LE side (sort of), but limited in scope for some reason.

As a volunteer I don't have access to internal LE logs. We may need to wait for LE staff to check in (more likely tomorrow US time) or perhaps someone else with suggestions.

Thanks for the great info by the way.

1 Like

None.

Based on the error message it looks like the connection from the LE primary center is failing. It uses a Cloudflare networking product for outbound traffic (IIRC).

Yes, this is the way we had been leaning. We did have a change in our list of upstream network providers not long before this, but had largely ruled that out as a cause, as the routing seemed not to be at issue, and HTTP requests were still coming through from the same IPs.

Thanks for the great info by the way.

You're most welcome; thanks to you in turn for the helpful replies.

1 Like

Keep in mind this thread is a continuation of the existing "heads up" thread here: "your server may be slow or overloaded" acme-v02.api .

What we have in common is that we are both hosting providers based in New Zealand. 103.248.176.0/24 and 202.37.129.0/24 contain the servers being affected (that I've observed thus far).

1 Like

I have a pcap file showing what looks to my lesser-trained eye as a failed TLSv1.3 negotiation, as taken from our webserver. It's attached.

lets-encrypt-failed-tls.pcap (13.2 KB)

The four "TCP Retransmission" packets toward the end stand out to me.

1 Like

That is helpful information. Thanks

1 Like

Based on the selective acknowledgement in the FIN packet and the duplicate ACK packet, it would appear that the maximum segment size or MTU might be set incorrectly and that larger packets are being dropped. Try decreasing the MTU on your server to e.g. 1280 bytes and see if that fixes your issue.

3 Likes

It does!

webhost19:~$ sudo ip link set dev enX0 mtu 1280
…
webhost19:~$ sudo certbot renew --cert-name=webhost19.inspire.net.nz
[sudo] password for tom:
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/webhost19.inspire.net.nz.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Renewing an existing certificate for webhost19.inspire.net.nz

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Congratulations, all renewals succeeded:
  /etc/letsencrypt/live/webhost19.inspire.net.nz/fullchain.pem (success)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

That does not (yet) explain why other inbound HTTPS TLSv1.3 connections are working just fine, but I'll bring this to the attention of the networking-heavy side of the room now.
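(As an aside, ip link set won't survive a reboot; for a persistent version of the workaround on Debian 12 with ifupdown, we'd put an mtu line in the interface stanza. A sketch, using the enX0 name from above and a placeholder addressing method — merge into your real stanza:)

```
# /etc/network/interfaces fragment (sketch -- adapt to the actual enX0 stanza)
iface enX0 inet dhcp
    mtu 1280
```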

2 Likes

try decreasing the MTU on your server to e.g. 1280 bytes and see if that fixes your issue.

This isn't a fix for the issue. This is a workaround for the external problem.

1 Like

Is it possible your networks or systems aren't handling Path MTU Discovery (PMTUD)? Maybe one of your new upstream providers is blocking the necessary ICMP packets; or maybe you always have been, and now a new provider is sending traffic to Cloudflare over a tunnel with an MTU < 1,436 bytes?
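A quick way to probe for a PMTUD black hole by hand is a DF-bit ping sweep (a sketch, using Linux iputils ping and the validation hostname from this thread; the payload value here is just an example near the interesting boundary):

```shell
# On-wire IPv4 packet size = ICMP payload + 8 (ICMP header) + 20 (IP header).
payload=1448
echo "on-wire: $((payload + 28)) bytes"
# Then, against the validation endpoint:
#   ping -c 3 -M do -s "${payload}" outbound1a.letsencrypt.org
# Step the payload up in small increments; the size at which replies stop
# arriving, with no "Frag needed" error coming back, marks a PMTUD black hole.
```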

What does a tcptraceroute to the validation endpoint look like with different packet sizes, and at what packet size does it change? E.g. tcptraceroute outbound1a.letsencrypt.org 1280, then step up from there? It may be informative to see which hop starts to fail at what packet size.

Let's Encrypt uses Cloudflare Magic Transit for inbound traffic, including reply traffic, to the primary validation endpoints. Because of this, MSS is clamped to 1,436 bytes on all outbound primary validation packets, as per Cloudflare's documentation. I know traffic was successfully tested with 1,436 byte packets.
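For reference, the clamp works out like this (a sketch; the header sizes assume IPv4 with a plain 20-byte TCP header and no options):

```shell
# MSS counts TCP payload only, so the on-wire IPv4 packet is
# MSS + TCP header (20 bytes) + IP header (20 bytes).
mss=1436
wire=$((mss + 20 + 20))
echo "largest on-wire packet: ${wire} bytes"
# If any hop between Cloudflare and the origin has an MTU below this and the
# "fragmentation needed" ICMP is lost, these full-size packets are silently
# dropped while small ones (like a bare HTTP GET) still get through.
```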

4 Likes

I am unsure which TCP port I would target, but with a guess of 443, there seems to be little difference in the output of a tcptraceroute with:

  1. No packet size set;
  2. 1280 packet size explicitly set; and
  3. 1500 packet size set.
webhost19:~$ tcptraceroute outbound1a.letsencrypt.org 443
Selected device enX0, address 203.114.129.15, port 60569 for outgoing packets
Tracing the path to outbound1a.letsencrypt.org (23.178.112.100) on TCP port 443 (https), 30 hops max
 1  203-114-129-254.ve1024.pmr-br.inspire.net.nz (203.114.129.254)  0.631 ms  0.325 ms  0.350 ms
 2  121-79-193-65.sta.inspire.net.nz (121.79.193.65)  0.721 ms  0.463 ms  1.335 ms
 3  203-114-165-15.dsl.sta.inspire.net.nz (203.114.165.15)  1.257 ms  1.530 ms  1.458 ms
 4  203-114-134-208.lo.sta.inspire.net.nz (203.114.134.208)  8.793 ms  9.082 ms  8.709 ms
 5  as13335.akl.ix.nz (43.243.21.2)  8.698 ms  8.791 ms  18.185 ms
 6  198.41.236.29  8.989 ms  9.209 ms  9.034 ms
 7  172.69.0.53  8.523 ms  8.061 ms  8.196 ms
 8  172.69.0.53  10.405 ms  8.499 ms  8.568 ms
 9  172.69.0.53  8.719 ms  8.616 ms  8.747 ms

webhost19:~$ tcptraceroute outbound1a.letsencrypt.org 443 1280
Selected device enX0, address 203.114.129.15, port 40795 for outgoing packets
Tracing the path to outbound1a.letsencrypt.org (23.178.112.100) on TCP port 443 (https), 30 hops max, 1280 byte packets
 1  203-114-129-254.ve1024.pmr-br.inspire.net.nz (203.114.129.254)  0.746 ms  0.462 ms  0.294 ms
 2  121-79-193-65.sta.inspire.net.nz (121.79.193.65)  0.840 ms  2.107 ms  0.667 ms
 3  203-114-150-178.ge-0-1-5-1143.wlg-br.inspire.net.nz (203.114.150.178)  1.400 ms  1.582 ms  1.248 ms
 4  203-114-134-208.lo.sta.inspire.net.nz (203.114.134.208)  8.817 ms  8.677 ms  8.900 ms
 5  as13335.akl.ix.nz (43.243.21.2)  8.755 ms  8.923 ms  9.300 ms
 6  198.41.236.29  8.576 ms  8.775 ms  8.476 ms
 7  172.69.0.15  8.229 ms  8.123 ms  8.140 ms
 8  172.69.0.15  8.554 ms  8.439 ms  8.296 ms
 9  172.69.0.15  8.488 ms  8.686 ms  8.742 ms

webhost19:~$ tcptraceroute outbound1a.letsencrypt.org 443 1500
Selected device enX0, address 203.114.129.15, port 34199 for outgoing packets
Tracing the path to outbound1a.letsencrypt.org (23.178.112.100) on TCP port 443 (https), 30 hops max, 1500 byte packets
 1  203-114-129-254.ve1024.pmr-br.inspire.net.nz (203.114.129.254)  0.876 ms  0.420 ms  0.456 ms
 2  121-79-193-65.sta.inspire.net.nz (121.79.193.65)  0.886 ms  0.804 ms  6.491 ms
 3  203-114-150-178.ge-0-1-5-1143.wlg-br.inspire.net.nz (203.114.150.178)  1.517 ms  1.781 ms  1.323 ms
 4  203-114-134-208.lo.sta.inspire.net.nz (203.114.134.208)  8.864 ms  8.740 ms  9.071 ms
 5  as13335.akl.ix.nz (43.243.21.2)  9.020 ms  16.867 ms  8.954 ms
 6  198.41.236.29  8.674 ms  8.818 ms  8.431 ms
 7  172.69.0.22  8.122 ms  7.925 ms  8.100 ms
 8  172.69.0.22  8.538 ms  8.778 ms  8.495 ms
 9  172.69.0.22  8.511 ms  8.801 ms  8.215 ms

I can't comment on the possibility of our upstream providers' PMTUD implementation being incorrect. We are not blocking such packets ourselves. Only IPv4 is relevant here (for these systems).

Perhaps we could try this the other way around: are you able to send us the output of tcptraceroute (or analogous) calls connecting to webhost18.inspire.net.nz on port 443—with different packet sizes—from the Cloudflare service's point of view?

1 Like

We usually don't have the capacity to run this kind of troubleshooting on request, but I'll ask my colleagues to take a look at this pattern of problems and see if they're able to do this.

3 Likes

Also echoing this issue (again from New Zealand). Even a straight ACME HTTP request is failing in this manner:

23.178.112.107 - - [08/Dec/2025:15:29:51 +1300] "GET /.well-known/acme-challenge/Yn8ARHxO6XWEpY63naaHrIpRl2sJleZzJdt9G6W2scQ HTTP/1.1" 200 1488 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)"

The web server is responding with a straight 200 (no redirect to HTTPS, so there's no TLS involved), but the Let's Encrypt server seems to never receive the response.

The same renewal with --dry-run succeeds just fine.

2 Likes

Could more affected folks please share packet captures, if possible? That'll help identify and confirm this pattern of problems.

2 Likes

le.pcap (40.3 KB)

1 Like

tomryder-inspirenet / nvnoc
Time permitting, can you please test a renewal (--force-renew) from your affected server(s) with the MTU at 1500? It will be good to keep the Let's Encrypt team aware that the issue persists; the trouble with a workaround is that it masks an ongoing problem.

I've just attempted a renewal for one of our certs and yes, the issue persists; same symptoms. We're not too worried for the moment, as there are three possible workarounds (dropping the MTU, switching to DNS-01, removing the HTTPS redirect), so we shouldn't lose any certs, but yes, it would be good to have it fixed. Happy to provide further information if the Let's Encrypt team needs it.
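(For the third workaround, the usual Apache pattern is to exempt the challenge path from the HTTP-to-HTTPS redirect — a sketch assuming a mod_rewrite-based redirect, rather than whatever our real vhosts use:)

```apache
# Sketch: let ACME challenges through over plain HTTP (mod_rewrite)
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/\.well-known/acme-challenge/
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```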

2 Likes

Here are some status reports for New Zealand Internet providers, possibly related.