3 seconds between Hellos on cellular

curl --trace-time shows 3 seconds lag between Client Hello and Server Hello when accessing from cellular. Firefox, Chrome, Android, desktop linux (wired tethering same 5G) everything I tested gives the same 2.5-3.5 lag on the initial connection, after that ssl session is kept and all works as a charm.

At the moment I can reproduce it with 100% on Hotmobile 5G (Israel), I think it has been like that for a while. I can't reproduce it on my other connections and it seems my friends in Israel and worldwide who I asked are not experiencing it. BUT:

It does reproduce sometimes on pagespeed.web.dev on mobile tab, which uses "Network throttling: 150 ms TCP RTT, 1,638.4 kb/s throughput (Simulated)", never on the desktop tab. Yesterday it was so stable, that I didn't even think of taking screenshoot lol.

I can't blame entirely ISP, because the same curl gives no lag for google.com, stackoverflow.com, unsplash.com, etc, and web surfing, in general, feels fast. However... I do see this issue on some sites with LE, including letsencrypt.org itself (!!!). That's why I am finally here. Also does pagespeed

My domain is screentranslate.app with t3 micro behind and virtually no load.
I am not a guy to blame others first, so I tried a lot of stuff I lost track of. Apache configs (e.g. minifying config, tls v.1.3, http2), changing mtu on EC2 from 9000 to 1500 and 500, nothing made a visible difference. Ditched Apache, set up HAProxy (you can see it with the same cert on port 444), configured OCSP stapling didn't change a thing. (I assumed success by running openssl s_client -connect screentranslate.app:444 -status | grep -E 'OCSP|', for some reason, I can't see the same for letsencrypt.org, but I do for gogl. Pls confirm whether I succeeded or not)

Up-to-date Ubuntu 22.04.4 LTS, Apache 2.4.52-1ubuntu4.9, HAproxy 2.4.24-0ubuntu0.22.04.1. In case you need any diag data, I have full ssh access.

time (echo | openssl s_client -debug -connect $IP:$PORT -servername screentranslate.app)

  • gives <20ms from localhost to both 443 (apache) and 444 (haproxy)
  • gives 3.1-3.3s on named 5G to both

I was unable to see any collateral requests with tcpdump (idk, i expected some queries to CAs about domain validity, that could take extra time), dump looks similar to google.com.

time curl -s --trace-time -vI https://screentranslate.app:443 -o /dev/null
20:52:59.566709 *   Trying 51.16.158.30:443...
20:52:59.602526 * Connected to screentranslate.app (51.16.158.30) port 443 (#0)
20:52:59.609275 * ALPN, offering h2
20:52:59.609391 * ALPN, offering http/1.1
20:52:59.686368 *  CAfile: /etc/ssl/certs/ca-certificates.crt
20:52:59.686393 *  CApath: /etc/ssl/certs
20:52:59.686584 * TLSv1.0 (OUT), TLS header, Certificate Status (22):
20:52:59.686611 } [5 bytes data]
20:52:59.686648 * TLSv1.3 (OUT), TLS handshake, Client hello (1):
20:52:59.686668 } [512 bytes data]
20:53:02.957691 * TLSv1.2 (IN), TLS header, Certificate Status (22):
20:53:02.957812 { [5 bytes data]
20:53:02.958002 * TLSv1.3 (IN), TLS handshake, Server hello (2):
20:53:02.958112 { [122 bytes data]
20:53:02.959047 * TLSv1.2 (IN), TLS header, Finished (20):
20:53:02.959172 { [5 bytes data]
20:53:02.959389 * TLSv1.2 (IN), TLS header, Supplemental data (23):
20:53:02.959501 { [5 bytes data]
20:53:02.959713 * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
20:53:02.959820 { [19 bytes data]
20:53:02.960013 * TLSv1.2 (IN), TLS header, Supplemental data (23):
20:53:02.960125 { [5 bytes data]
20:53:02.960361 * TLSv1.3 (IN), TLS handshake, Certificate (11):
20:53:02.960466 { [2393 bytes data]
20:53:02.965182 * TLSv1.2 (IN), TLS header, Supplemental data (23):
20:53:02.965307 { [5 bytes data]
20:53:02.965509 * TLSv1.3 (IN), TLS handshake, CERT verify (15):
20:53:02.965595 { [79 bytes data]
20:53:02.966197 * TLSv1.2 (IN), TLS header, Supplemental data (23):
20:53:02.966311 { [5 bytes data]
20:53:02.966543 * TLSv1.3 (IN), TLS handshake, Finished (20):
20:53:02.966660 { [52 bytes data]
20:53:02.967016 * TLSv1.2 (OUT), TLS header, Finished (20):
20:53:02.967132 } [5 bytes data]
20:53:02.967307 * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
20:53:02.967410 } [1 bytes data]
20:53:02.967682 * TLSv1.2 (OUT), TLS header, Supplemental data (23):
20:53:02.967794 } [5 bytes data]
20:53:02.967987 * TLSv1.3 (OUT), TLS handshake, Finished (20):
20:53:02.968092 } [52 bytes data]
20:53:02.968432 * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
20:53:02.968552 * ALPN, server accepted to use h2
20:53:02.968662 * Server certificate:
20:53:02.968775 *  subject: CN=screentranslate.app
20:53:02.968884 *  start date: May  9 05:18:16 2024 GMT
20:53:02.968993 *  expire date: Aug  7 05:18:15 2024 GMT
20:53:02.969110 *  subjectAltName: host "screentranslate.app" matched cert's "screentranslate.app"
20:53:02.969223 *  issuer: C=US; O=Let's Encrypt; CN=R3
20:53:02.969321 *  SSL certificate verify ok.
20:53:02.969478 * Using HTTP2, server supports multiplexing
20:53:02.969590 * Connection state changed (HTTP/2 confirmed)
20:53:02.969702 * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
20:53:02.969836 * TLSv1.2 (OUT), TLS header, Supplemental data (23):
20:53:02.969940 } [5 bytes data]
20:53:02.970177 * TLSv1.2 (OUT), TLS header, Supplemental data (23):
20:53:02.970289 } [5 bytes data]
20:53:02.970478 * TLSv1.2 (OUT), TLS header, Supplemental data (23):
20:53:02.970621 } [5 bytes data]
20:53:02.970861 * Using Stream ID: 1 (easy handle 0x56055daf1e90)
20:53:02.971030 * TLSv1.2 (OUT), TLS header, Supplemental data (23):
20:53:02.971164 } [5 bytes data]
20:53:02.971369 > HEAD / HTTP/2
20:53:02.971369 > Host: screentranslate.app
20:53:02.971369 > user-agent: curl/7.81.0
20:53:02.971369 > accept: */*
20:53:02.971369 > 
20:53:03.007687 * TLSv1.2 (IN), TLS header, Supplemental data (23):
20:53:03.007806 { [5 bytes data]
20:53:03.008097 * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
20:53:03.008179 { [57 bytes data]
20:53:03.008472 * TLSv1.2 (IN), TLS header, Supplemental data (23):
20:53:03.008579 { [5 bytes data]
20:53:03.008781 * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
20:53:03.008890 { [57 bytes data]
20:53:03.009105 * old SSL session ID is stale, removing
20:53:03.009234 * TLSv1.2 (IN), TLS header, Supplemental data (23):
20:53:03.009336 { [5 bytes data]
20:53:03.009567 * TLSv1.2 (OUT), TLS header, Supplemental data (23):
20:53:03.009678 } [5 bytes data]
20:53:03.025030 * TLSv1.2 (IN), TLS header, Supplemental data (23):
20:53:03.025150 { [5 bytes data]
20:53:03.025382 * TLSv1.2 (IN), TLS header, Supplemental data (23):
20:53:03.025462 { [5 bytes data]
20:53:03.025689 < HTTP/2 200 
20:53:03.025859 < last-modified: Sun, 21 Apr 2024 22:06:51 GMT
20:53:03.025978 < etag: "106c1-616a289ba5f8d"
20:53:03.026082 < accept-ranges: bytes
20:53:03.026188 < content-length: 67265
20:53:03.026283 < vary: Accept-Encoding
20:53:03.026381 < content-type: text/html
20:53:03.026482 < date: Mon, 01 Jul 2024 17:53:02 GMT
20:53:03.026581 < server: Apache/2.4.52 (Ubuntu)
20:53:03.026683 < 
20:53:03.026799 * Connection #0 to host screentranslate.app left intact

real	0m3.488s

Is it time for paid SSL? Kiddin, I am in love with LE, level of automation and configuration, I don't believe I can find anything like this for money I have, or maybe at all. Thank you guys for your work.

1 Like

I can't reproduce your issue from my end, not using my WiFi/ISP nor using USB tethering using my 5G connection with my phone:

real 0m0.311s
user 0m0.038s
sys 0m0.010s

Maybe you can try to check what kind of traffic is going on while curl is waiting using Wireshark?

Your stapling works by the way.

1 Like

# tcpdump -i usb0 -U -n -w hot-444.pcap

18:39:37.771229 IP6 2a01:73c0:66f:4bf1:d040:daf0:4f14:b19.54634 > 2a01:73c0:66f:4bf1::82.domain: 23004+ AAAA? screentranslate.app. (37)
18:39:37.773350 IP6 2a01:73c0:66f:4bf1::82.domain > 2a01:73c0:66f:4bf1:d040:daf0:4f14:b19.54634: 23004 0/0/0 (37)
18:39:37.773703 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [S], seq 3430561495, win 64240, options [mss 1460,sackOK,TS val 3432482031 ecr 0,nop,wscale 7], length 0
18:39:37.816577 IP ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp > 192.168.94.72.54478: Flags [S.], seq 1628496274, ack 3430561496, win 62643, options [mss 1360,sackOK,TS val 4040595485 ecr 3432482031,nop,wscale 7], length 0
18:39:37.816662 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [.], ack 1, win 502, options [nop,nop,TS val 3432482074 ecr 4040595485], length 0
18:39:37.902137 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [P.], seq 1:518, ack 1, win 502, options [nop,nop,TS val 3432482159 ecr 4040595485], length 517
18:39:37.944543 IP ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp > 192.168.94.72.54478: Flags [.], ack 518, win 8840, options [nop,nop,TS val 4040595485 ecr 3432482159], length 0
18:39:37.955667 IP 82.221.107.34.bc.googleusercontent.com.http > 192.168.94.72.33822: Flags [S.], seq 1973057878, ack 637888708, win 65535, options [mss 1360,sackOK,TS val 3546579270 ecr 3893544196,nop,wscale 8], length 0
18:39:37.955686 IP 192.168.94.72.33822 > 82.221.107.34.bc.googleusercontent.com.http: Flags [R], seq 637888708, win 0, length 0
18:39:38.158229 IP 82.221.107.34.bc.googleusercontent.com.http > 192.168.94.72.33826: Flags [S.], seq 180548429, ack 1624385926, win 65535, options [mss 1360,sackOK,TS val 3065398711 ecr 3893544196,nop,wscale 8], length 0
18:39:38.158274 IP 192.168.94.72.33826 > 82.221.107.34.bc.googleusercontent.com.http: Flags [R], seq 1624385926, win 0, length 0
18:39:38.222842 IP 82.221.107.34.bc.googleusercontent.com.http > 192.168.94.72.33838: Flags [R.], seq 486458896, ack 3442060896, win 66, options [nop,nop,TS val 2571453062 ecr 1815098018], length 0
18:39:38.360491 IP 82.221.107.34.bc.googleusercontent.com.http > 192.168.94.72.33824: Flags [S.], seq 3947899919, ack 2979810991, win 65535, options [mss 1360,sackOK,TS val 246887587 ecr 3893544200,nop,wscale 8], length 0
18:39:38.360548 IP 192.168.94.72.33824 > 82.221.107.34.bc.googleusercontent.com.http: Flags [R], seq 2979810991, win 0, length 0
18:39:38.437440 IP 82.221.107.34.bc.googleusercontent.com.http > 192.168.94.72.33844: Flags [S.], seq 929014436, ack 510598516, win 65535, options [mss 1360,sackOK,TS val 3401999722 ecr 3893569271,nop,wscale 8], length 0
18:39:38.437496 IP 192.168.94.72.33844 > 82.221.107.34.bc.googleusercontent.com.http: Flags [.], ack 1223314452, win 502, options [nop,nop,TS val 1815104264 ecr 2571443130], length 0
18:39:38.446923 IP 192.168.94.72.45218 > 209.250.254.15.vultrusercontent.com.21116: UDP, length 15
18:39:38.549441 IP 209.250.254.15.vultrusercontent.com.21116 > 192.168.94.72.45218: UDP, length 2
18:39:38.633738 IP 82.221.107.34.bc.googleusercontent.com.http > 192.168.94.72.33844: Flags [S.], seq 929014436, ack 510598516, win 65535, options [mss 1360,sackOK,TS val 3401999921 ecr 3893569271,nop,wscale 8], length 0
18:39:40.833066 IP 149.154.167.41.https > 192.168.94.72.45810: Flags [P.], seq 4093461239:4093462163, ack 1045647705, win 2316, options [nop,nop,TS val 3167180915 ecr 3045059500], length 924
18:39:40.833117 IP 192.168.94.72.45810 > 149.154.167.41.https: Flags [.], ack 924, win 501, options [nop,nop,TS val 3045062767 ecr 3167180915], length 0
18:39:40.950887 IP ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp > 192.168.94.72.54478: Flags [.], seq 1:1349, ack 518, win 486, options [nop,nop,TS val 4040598624 ecr 3432482159], length 1348
18:39:40.950954 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [.], ack 1349, win 501, options [nop,nop,TS val 3432485208 ecr 4040598624], length 0
18:39:40.953469 IP ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp > 192.168.94.72.54478: Flags [P.], seq 1349:2697, ack 518, win 486, options [nop,nop,TS val 4040598624 ecr 3432482159], length 1348
18:39:40.953512 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [.], ack 2697, win 501, options [nop,nop,TS val 3432485211 ecr 4040598624], length 0
18:39:40.953536 IP ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp > 192.168.94.72.54478: Flags [P.], seq 2697:2740, ack 518, win 486, options [nop,nop,TS val 4040598624 ecr 3432482159], length 43
18:39:40.953552 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [.], ack 2740, win 501, options [nop,nop,TS val 3432485211 ecr 4040598624], length 0
18:39:40.962018 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [P.], seq 518:582, ack 2740, win 501, options [nop,nop,TS val 3432485219 ecr 4040598624], length 64
18:39:40.963691 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [P.], seq 582:692, ack 2740, win 501, options [nop,nop,TS val 3432485221 ecr 4040598624], length 110
18:39:41.007959 IP ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp > 192.168.94.72.54478: Flags [P.], seq 2740:2819, ack 692, win 485, options [nop,nop,TS val 4040598684 ecr 3432485219], length 79
18:39:41.008036 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [.], ack 2819, win 501, options [nop,nop,TS val 3432485265 ecr 4040598684], length 0
18:39:41.007960 IP ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp > 192.168.94.72.54478: Flags [P.], seq 2819:2898, ack 692, win 485, options [nop,nop,TS val 4040598684 ecr 3432485219], length 79
18:39:41.008066 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [.], ack 2898, win 501, options [nop,nop,TS val 3432485265 ecr 4040598684], length 0
18:39:41.008079 IP ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp > 192.168.94.72.54478: Flags [P.], seq 2898:3060, ack 692, win 485, options [nop,nop,TS val 4040598684 ecr 3432485219], length 162
18:39:41.008096 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [.], ack 3060, win 501, options [nop,nop,TS val 3432485265 ecr 4040598684], length 0
18:39:41.011666 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [P.], seq 692:716, ack 3060, win 501, options [nop,nop,TS val 3432485269 ecr 4040598684], length 24
18:39:41.013288 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [F.], seq 716, ack 3060, win 501, options [nop,nop,TS val 3432485270 ecr 4040598684], length 0
18:39:41.018662 IP 192.168.94.72.33848 > 82.221.107.34.bc.googleusercontent.com.http: Flags [.], ack 4132197860, win 501, options [nop,nop,TS val 1815106845 ecr 880526773], length 0
18:39:41.018696 IP 192.168.94.72.33846 > 82.221.107.34.bc.googleusercontent.com.http: Flags [.], ack 2614187691, win 501, options [nop,nop,TS val 1815106845 ecr 3690502866], length 0
18:39:41.049091 IP ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp > 192.168.94.72.54478: Flags [P.], seq 3060:3084, ack 717, win 485, options [nop,nop,TS val 4040598725 ecr 3432485269], length 24
18:39:41.049176 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [R], seq 3430562212, win 0, length 0
18:39:41.049380 IP ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp > 192.168.94.72.54478: Flags [F.], seq 3084, ack 717, win 485, options [nop,nop,TS val 4040598725 ecr 3432485269], length 0
18:39:41.049400 IP 192.168.94.72.54478 > ec2-51-16-158-30.il-central-1.compute.amazonaws.com.snpp: Flags [R], seq 3430562212, win 0, length 0
18:39:41.049459 IP 82.221.107.34.bc.googleusercontent.com.http > 192.168.94.72.33846: Flags [.], ack 1, win 261, options [nop,nop,TS val 3690512994 ecr 1815096749], length 0
18:39:41.053459 IP 82.221.107.34.bc.googleusercontent.com.http > 192.168.94.72.33848: Flags [.], ack 1, win 261, options [nop,nop,TS val 880536873 ecr 1815096780], length 0
18:39:41.728635 IP 82.221.107.34.bc.googleusercontent.com.http > 192.168.94.72.33842: Flags [S.], seq 1524249715, ack 763725869, win 65535, options [mss 1360,sackOK,TS val 3765347558 ecr 3893564276,nop,wscale 8], length 0
18:39:41.728691 IP 192.168.94.72.33842 > 82.221.107.34.bc.googleusercontent.com.http: Flags [.], ack 3157224585, win 502, options [nop,nop,TS val 1815107555 ecr 2571438145], length 0

I did not use any filters intentionally to see related traffic to other destinations, but it seems near TLS handshake it's only firefox

full dump: https://screentranslate.app/s/u9vSA9zm/hot-444.pcap
(yes I, inspected it in wireshark already, didn't find anything intresting, but it may be a skill issue)

I love LE too, but if you want to try some other CAs to help diagnose what's going on, there are several other free ones that use the same automated ACME protocol. Posh-ACME keeps a list that regularly gets linked around here. ZeroSSL and BuyPass Go are probably the easiest for just testing around with.

5 Likes

I tried zerossl in the past, it was not that easy... Thanks for idea, I'll give it a go tomorrow. I hope I won't break existing LE setup :sweat_smile:
Is there any objectives (measurements?) for test besides seeing if the same issue exists?
Is it ok in terms of test cleanliness to issue it for 3rd level domain under screentranslate.app? I'd prefer not changing LE at the moment

1 Like

The pcap trace file you provided is interesting and does give us some hints. In particular, both machines had the TCP timestamp option enabled (the default on Linux) which is very helpful for this kind of diagnosis.

I've displayed the sender's timestamp value in the third column. Note that these values aren't unix timestamps, they're essentially arbitrary values decided by each sender. (Modern) Linux (we're assuming that both machines are Linux here) uses a randomly-offset timer that runs either with milliseconds or microseconds precision (with milliseconds being the usual default).

Analyzing the TCP timestamp values send by 51.16.158.30, it appears that the sender uses timestamps with milliseconds precision. If that's the case, the sender's time difference between packets 7 and 22 (highlighted in the screenshot; the ACK of the Client Hello by the server and the server's ServerHello) would be 3139 ms, or just over 3 seconds. That roughly fits with the the receiver's timestamps - the difference of about 100ms could be explained by network jitter.

However, that would indicate that the issue is indeed not network-performance related, but rather caused by the server somehow needing 3 seconds to generate a ServerHello. That's very strange indeed. Tracing the server would be interesting in that case, to figure out what's taking so long to send a (relatively simple) TCP message.

PS: It would also be helpful to rule out the possibility of a TCP middlebox being present on the network. Can you provide traces from both client and server perspective?

10 Likes

Possibly unrelated but sounds similar to https://forums.raspberrypi.com/viewtopic.php?t=362979 and that issue could be CPU bound and may be affected by key type.

See also Slow TLS handshake - #2 by _az

4 Likes

Why does it say port 444?

Because that's where OPs haproxy runs.

3 Likes

I've been there.
About rpi forum -- neither mdbelts, nor lwip afaik have nothing to do with my setup.
From LE thread I stole openssl s_client trick and idea to play with MTU.
As expected MaxRequestWorkers (formerly MaxClients) didn't affect issue.
I don't think it can be CPU bound if issue is not present from any other connection?
It can be key type dependant if it affect packets count and sizes, but tcpdump of same curl shows exactly same packet sizes.

I think the low hanging fruit here is to verify that it's not your network provider doing some funky stuff. We can verify this by checking whether the TCP sessions (SEQ numbers + timestamps) are in agreement on both sides, i.e. capture the same connection on both ends and compare. If they don't match, there's a TCP middlebox in between that's causing the delay.

If they do match, there's no TCP middlebox in between and the server is causing the delay. That would then suggest that we need to look at what your server does (e.g. some IP reputation script that only triggers on some addresses?) when it receives the "slow" connections.

6 Likes

By traces from both client and server perspective you mean traceroute?

$ mtr -c 32 --report-wide --show-ips screentranslate.app
Start: 2024-07-02T18:15:16+0300
HOST: kitchen                                                            Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- _gateway (192.168.94.2)                                             0.0%    32    1.6   2.3   1.6   3.7   0.7
  2.|-- ???                                                                100.0    32    0.0   0.0   0.0   0.0   0.0
  3.|-- 10.202.77.68                                                        0.0%    32   28.2  36.2  24.6  91.0  13.2
  4.|-- 10.202.77.66                                                       62.5%    32   38.8  32.0  24.6  41.3   5.3
  5.|-- 10.202.79.202                                                       0.0%    32   48.1  80.3  28.1 753.7 129.3
  6.|-- 10.202.72.101                                                       0.0%    32   24.9  33.0  24.3  47.2   6.0
  7.|-- 10.202.72.209                                                      28.1%    32   36.0  77.8  27.9 706.8 142.0
  8.|-- hotnet-213-57-0-73.hotnet.net.il (213.57.0.73)                     96.9%    32   40.5  40.5  40.5  40.5   0.0
  9.|-- ???                                                                100.0    32    0.0   0.0   0.0   0.0   0.0
 10.|-- ???                                                                100.0    32    0.0   0.0   0.0   0.0   0.0
 11.|-- ???                                                                100.0    32    0.0   0.0   0.0   0.0   0.0
 12.|-- ???                                                                100.0    32    0.0   0.0   0.0   0.0   0.0
 13.|-- ec2-51-16-158-30.il-central-1.compute.amazonaws.com (51.16.158.30)  0.0%    32   46.0  38.1  27.9  62.0   8.3

Didn't think previously about reverse, not sure what to trace from server perspective -- my public ip or the first public from direct trace? Anyway there are both:

$ mtr -c 32 --report-wide --show-ips 37.142.165.41 # my public ip according to ipify.org
Start: 2024-07-02T18:43:03+0300
HOST: aaws                                                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 244.5.0.133                                          0.0%    32    1.7  11.4   0.8  67.9  16.1
  2.|-- 240.3.172.6                                          0.0%    32    0.3   0.2   0.2   0.4   0.0
  3.|-- 240.3.160.15                                         0.0%    32    1.1   1.0   1.0   1.1   0.0
  4.|-- 151.148.10.132                                       0.0%    32    1.2   1.3   1.2   3.7   0.5
  5.|-- 151.148.10.133                                      96.9%    32    1.4   1.4   1.4   1.4   0.0
  6.|-- hotnet-213-57-3-246.hotnet.net.il (213.57.3.246)    93.8%    32    2.5   2.5   2.5   2.6   0.1
  7.|-- dynamic-213-57-120-55.hotnet.net.il (213.57.120.55)  0.0%    32   13.9  18.1   5.7  74.5  14.8
  8.|-- ???                                                 100.0    32    0.0   0.0   0.0   0.0   0.0
$ mtr -c 32 --report-wide --show-ips 213.57.0.73 # taken from trace to server
Start: 2024-07-02T18:42:07+0300
HOST: aaws                                           Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 244.5.0.129                                     0.0%    32    4.4   9.0   0.8  60.4  13.9
  2.|-- 240.3.172.6                                     0.0%    32    0.2   0.2   0.2   0.3   0.0
  3.|-- 242.10.66.135                                   0.0%    32   12.6   2.3   0.8  12.6   3.0
  4.|-- 240.3.156.13                                    0.0%    32    2.0   2.1   2.0   2.1   0.0
  5.|-- 242.10.37.129                                   0.0%    32    2.6   3.3   1.9  11.5   2.4
  6.|-- 99.82.179.254                                   0.0%    32    1.9   2.0   1.9   2.3   0.1
  7.|-- ???                                            100.0    32    0.0   0.0   0.0   0.0   0.0
  8.|-- hotnet-213-57-0-73.hotnet.net.il (213.57.0.73)  0.0%    32    2.6   2.6   2.4   3.2   0.1

traceroute in the meanwhile seem to be lost in the forest in all 3 cases:

traceroute to screentranslate.app (51.16.158.30), 64 hops max
  1   192.168.94.2  2.138ms  1.287ms  1.244ms 
  2   *  *  * 
  3   10.202.77.68  60.388ms  21.606ms  29.936ms 
  4   *  10.202.77.66  28.928ms  * 
  5   10.202.79.205  30.123ms  64.383ms  46.285ms 
  6   10.202.72.101  27.436ms  79.904ms  39.564ms 
  7   10.202.72.209  40.815ms  40.162ms  40.097ms 
  8   *  *  * 
# ...same up to 64
# from server to IP from direct mtr:
traceroute to 213.57.0.73 (213.57.0.73), 64 hops max
  1   244.5.0.133  8.287ms  9.715ms  9.588ms 
  2   240.3.172.7  0.310ms  0.306ms  0.365ms 
  3   242.10.66.129  1.201ms  1.069ms  1.057ms 
  4   240.3.156.15  2.019ms  2.023ms  1.967ms 
  5   242.10.36.5  1.956ms  10.969ms  2.504ms 
  6   99.82.179.254  1.940ms  1.940ms  1.913ms 
  7   *  *  * 
# ...same up to 64
# from server to my public IP:
traceroute to 37.142.165.41 (37.142.165.41), 64 hops max
  1   244.5.0.133  5.047ms  8.717ms  8.216ms 
  2   240.3.172.5  0.206ms  0.254ms  0.225ms 
  3   240.3.160.15  0.973ms  0.961ms  0.934ms 
  4   151.148.10.132  1.157ms  1.123ms  1.141ms 
  5   *  *  * 
  6   *  *  * 
  7   213.57.120.55  37.365ms  35.050ms  23.576ms 
  8   *  *  * 
# ...same up to 64

I am not sure what conclusions can be made from this, given that connections are not failing... If there is a middlebox, can it be seem as middlebox here? Can I somehow see routes for exact TCP packets? I mean one route for plaintext, another for TLS handshake. I've been literally dreaming about DPI this night. Israel providers do some "cyber protection service", also filtering is in demand by religious communities. In case of Hotmobile I did not succeed discarding cyber protection service which is extremely annoying by blocking duckdns.org and nip.io all together and charging me for it. I can imagine them sending every LE handshake to DPI pit stop because anything free seem to be not trustworthy enough for them. Maybe I'll see difference with google cert. But why then pagespeed sees exactly same timings with their 4G throttling simulation?

If you want to see strace of apache/haproxy, please assist me with exact command or link to a suitable guide, I failed to get anything by myself :frowning:

Thank you for detailed analysis of pcap. Yes, both are Linux, I can do something from Windows or Android/Termux if you are intrested.

P.S. it is possible that these traces were taken when issue was already gone.

1 Like

And...

real	0m0.393s
user	0m0.102s
sys	0m0.028s

...the issue is gone. Oh boy, did I just waste 4 days for nothing. I hate it when issues are disappearing on their own without fix or at least knowing the cause :face_with_symbols_over_mouth:
The only thing I did on server side is 4 traceroutes, not a single service restarted or file changed.

Did I just get whitelisted by mossad?? By port. Because :444 behaves the same +3s way. I just briefly switched haproxy to :443 instead of apache -- no issue. So at least my software configuration on server is not at fault. I guess..

Thank you all for your time. For now I think I'll adopt the DPI theory. If anybody has other ideas I'll be happy to listen.

I'd prefer to leave this topic open without solution till issue comes back or 30 days timeout.

1 Like

No, a traceroute can't tell us anything about a layer 7 transparent middlebox. I meant packet captures (pcap) for the same connection from both sides (client & server).*

It's unfortunate if this seemingly goes away without any visible changes. If it occurs again we can still look at it.


*To expand on this: The timestamp values confirmed that the delay happens on the "server" side. However, if there's a TCP middlebox in between, the "server side" is actually the middlebox, since it's the clients TCP peer. This is revealed by looking at the server's TCP session, which won't match the client. If they do match, there's either no middlebox or the middlebox is entirely transparent. But in the latter case (a fully transparent middlebox) you wouldn't see the server timestamps being that slow, unless the server itself is slow (hence also ruling out the middlebox as the culprit).

4 Likes

Missed this reply while baking traceroutes.

We can verify this by checking whether the TCP sessions (SEQ numbers + timestamps) are in agreement on both sides, i.e. capture the same connection on both ends and compare. If they don't match, there's a TCP middlebox in between that's causing the delay.

Can you please guide me how can I do this (port 444 is probably still blacklisted)?

:thinking:
Do you mean two tcpdumps run simultaneously on server and client machines?

1 Like

Exactly. You can filter the dumps to only contain the TCP session we're interested in if you want.

5 Likes

I see 14 packets on server side and 13 on client... (upd: its my ssh session)
Please take a look with me

tcpdump -i usb0 -U -n -w client-pov-444.pcap dst host 51.16.158.30 and dst port 444
https://screentranslate.app/s/u9vSA9zm/client-pov-444.pcap

tcpdump -i ens5 -U -n -w server-pov-444.pcap src host 37.142.165.41
https://screentranslate.app/s/u9vSA9zm/server-pov-444.pcap

Due to the filters we can't see the entire connection, we only see what the client sends. However, this is sufficient to confirm that a TCP middlebox is not present - TCP sequence numbers and timestamps are in agreement on both sides.

2 Likes

That would be an argument against a middlebox, right?

4 Likes

Why? Isn’t that exactly what a middle box wanting to remain hidden would do?

1 Like