Pageload Times Increased after DST Root CA X3 / Chain Shortening Change

Testing on Ubuntu 22.

I just updated my earlier post after realizing the granularity of their testing. I'm not sure testing across the internet is helpful.

4 Likes

Yep, 50ms here or there is pretty much unnoticeable if the request travels across the ocean. Also it doesn't seem to affect real users on Chrome, so it's not really an issue for me, just slightly annoying that it broke my graph and kind of interesting what causes it :slight_smile:

When I first saw the http load time graphs suddenly jumping ~5x on all 7 machines it seemed way more dramatic than it actually turned out to be :smiley:

5 Likes

So at this point I'm really out of ideas, but just as a quick recap on what I (believe to) know about the TLS logic:

  1. On Debian, wget uses GnuTLS for TLS, while curl uses OpenSSL. Differences in certificate validation between GnuTLS and OpenSSL could easily explain timing differences (see the snippet after this list for a quick way to confirm which backend each client uses).
  2. However, @Mad182 reports that wget is still slow even with verification disabled. I glanced over the source code, and with verification disabled wget+GnuTLS doesn't do much related to certificates: there should be no trust store scanning, no path building, no signature verification [except the TLS handshake signature]. This seems to rule out 1. as a plausible explanation.
  3. It is possible that other TLS handshake I/O is causing the slowdown (not directly related to path building), but we haven't seen much evidence for this. In fact, so far we have hardly any evidence to support that the TLS stack is the culprit - how do the times look without any https, e.g. just plain http?
  4. It is also possible that the TTFB slowdown is server-side, potentially due to GnuTLS/OpenSSL triggering slightly different code paths in nginx, which somehow is more troublesome with the newer cert. The only path I could think of that is relevant here is OCSP, and apparently we have also ruled out OCSP as a possible culprit.
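
In case it helps with point 1: one quick way to double-check which TLS backend each client actually uses (exact output varies by distro and build):

    ldd "$(command -v wget)" | grep -iE 'gnutls|libssl'
    ldd "$(command -v curl)" | grep -iE 'gnutls|libssl'
    curl --version | head -n 1               # first line names the TLS library, e.g. OpenSSL/3.x
    wget --version | grep -io 'ssl/[a-z]*'   # prints ssl/gnutls or ssl/openssl
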
4 Likes

how do the times look without any https, e.g. just plain http

When testing from the machine itself locally, with no dns request (using hosts file):
Plain http: ~0.003 s
https with the old cert: ~0.015 s
https with the new cert: ~0.060 s
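
(Measured with something along these lines - the URL is just a placeholder for the real vhost:)

    # example.com stands in for the actual host entry from the hosts file
    curl -so /dev/null -w 'tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' https://example.com/robots.txt
    time wget -q -O /dev/null https://example.com/robots.txt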

It doesn't seem to affect curl and Chrome, so I don't think it's an nginx issue.

3 Likes

What about an IPS type device inline?
Anything doing certificate validation inline?

2 Likes

Default "certificate inspection" profile for Fortinet:
[two screenshots of the Fortinet certificate-inspection profile settings]

1 Like

Nope. There is a hardware firewall but I don't think it's doing anything with https certs. And it's not involved if I run a request to 127.0.0.1 on the machine itself.

3 Likes

So I just spent some time looking at some flamegraphs and syscalls, and also tried out the suggestion from @kenh1.

While I can't quite reproduce your performance numbers (and also see no real difference for long vs short chain), I do see that my curl is faster (by up to ~50 ms) than wget. Thus I compared some benchmarks between them to check where wget is losing time. Unfortunately the results are about what I expected - the difference is solely due to slightly different polling behaviour in curl vs wget (+ different init times) and thus fully unrelated to anything about TLS/certificates.

However, I did see a potential race condition in wget: if your server is really (and I mean really) fast with its reply to an HTTP GET request - on the order of < 50 microseconds - wget accidentally clears its file descriptor readiness flag without having drained its I/O buffers. It realizes this mistake after an internal timeout elapses - which happens to be set at 61 milliseconds on my machine - and then continues reading the response.

If you still want this debugged, I guess the next step really would be raw performance benchmarks from wget in both scenarios. The strace logs suggested by @kenh1 may be helpful. Do note that these contain minor amounts of sensitive data (some output of your system's CSPRNG will be leaked, potentially some system configuration).

6 Likes

Well, if you want to look into it I can spin up a VM with a couple of clicks. I don't have any experience with these tools myself.

We just need the output of strace -ttt wget https://[s1/s2].ezgif.com/robots.txt for both the slow and fast config. You can get strace via apt. strace shows the system calls performed by wget; we're primarily interested in the time between these calls (printed with microsecond precision thanks to the -ttt option).

4 Likes

HUH. Maybe the shorter chain triggers suboptimal polling behavior in wget since there are significantly fewer bytes sent over the network?

Also, we are talking about changes that are in the tens of milliseconds.
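
If someone wants to put a rough number on "fewer bytes", something like this gives the PEM-encoded size of the chain each server sends (not exact on-the-wire bytes; hostnames taken from this thread):

    # count the PEM bytes of all certificates presented by each server
    openssl s_client -connect s1.ezgif.com:443 -servername s1.ezgif.com -showcerts </dev/null 2>/dev/null | sed -n '/BEGIN CERTIFICATE/,/END CERTIFICATE/p' | wc -c
    openssl s_client -connect s2.ezgif.com:443 -servername s2.ezgif.com -showcerts </dev/null 2>/dev/null | sed -n '/BEGIN CERTIFICATE/,/END CERTIFICATE/p' | wc -c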

6 Likes

Yeah I'm wondering whether this may relate to different TCP segmenting causing some slowdown on a poll system call. I can't reproduce locally though, so this is all just speculation. wget is a bit weird: It uses synchronous I/O but still uses polling (as if it were doing nonblocking I/O)* to peek for file descriptor readiness before doing reads. curl by default just does "normal" I/O.

*I guess this is done to support fetching multiple resources in parallel?
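
A quick (unscientific) way to see that difference in read patterns side by side, if anyone is curious - hostname from this thread, and the syscall list is just my guess at the interesting ones:

    # compare how wget and curl interleave poll/read calls on the same URL
    strace -e trace=poll,ppoll,select,read,recvfrom,recvmsg wget -q -O /dev/null https://s1.ezgif.com/robots.txt
    strace -e trace=poll,ppoll,select,read,recvfrom,recvmsg curl -so /dev/null https://s1.ezgif.com/robots.txt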

5 Likes

Note that you almost certainly want to add -o trace-output-file to those invocations, as there will be a LOT of output and the default of outputting to the terminal can mess up system call tracing times. Substitute an appropriate filename for trace-output-file.
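
So the full invocations would be something like this (filenames are just suggestions):

    # one trace per config, output written to a file instead of the terminal
    strace -ttt -o s1-trace.txt wget -q -O /dev/null https://s1.ezgif.com/robots.txt
    strace -ttt -o s2-trace.txt wget -q -O /dev/null https://s2.ezgif.com/robots.txt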

4 Likes

I looked at the outputs and it seems the main difference in time is around these lines:

1708034561.400895 poll([{fd=4, events=POLLIN}], 1, 61) = 1 ([{fd=4, revents=POLLIN}])
1708034561.401204 recvfrom(4, "\27\3\3\1\"", 5, 0, NULL, NULL) = 5
1708034561.401226 poll([{fd=4, events=POLLIN}], 1, 61) = 1 ([{fd=4, revents=POLLIN}])

1708034598.469602 poll([{fd=4, events=POLLIN}], 1, 60) = 1 ([{fd=4, revents=POLLIN}])
1708034598.511632 recvfrom(4, "\27\3\3\1\"", 5, 0, NULL, NULL) = 5
1708034598.511689 poll([{fd=4, events=POLLIN}], 1, 60) = 1 ([{fd=4, revents=POLLIN}])

s2-trace.txt (31.4 KB)
s1-trace.txt (31.8 KB)

1 Like

This seems like a very plausible hypothesis to me. I wonder if it behaves the same with a page that has a longer body than that robots.txt, or with smaller TCP or frame windows or whatnot.

5 Likes

Interesting - my system uses synchronous I/O with ppoll + timeouts, while your system uses async I/O with poll. I also used a Debian 12 machine for the tests.

The s2 trace loses around 42 milliseconds on a poll syscall, while the s1 trace returns moderately fast (310 microseconds) from it. I've looked at the encrypted messages in fair detail and I believe that the poll syscall in question is executed by wget after the TLS handshake is complete. The sendmsg() in line 310 is almost certainly the final Handshake Finished message from the client (the flow & length match).

Afterwards, the client could start by sending an HTTP GET request. However, that's not what happens: instead wget blocks on a poll syscall, waiting for data from the server. This wait is what takes so long. The data that eventually arrives from the server is most likely TLS 1.3 session tickets. Only after the session tickets have been received does the client send off its HTTP GET request (sendmsg in line 317) and read the corresponding response, which seems to have a delay of < 1 ms on your system (quick!).

I'm not entirely sure why GnuTLS is waiting for a session ticket to begin with (it's an optional post-handshake message, though in TLS 1.3 it is common to send one right after the handshake is complete) or why it's taking so long to get one.

I suspect that the different size of the certificate chain somehow affects the TCP segmenting, which causes these session tickets to be received in a different segment. This may affect things like when ACKs are sent, which might delay the session tickets. Because the implementation clears its buffer readiness just before the poll (see line 311), it is possible that the application blocks at the wrong time - the session tickets are already in the network buffers, but the I/O notification has been inadvertently cleared. This then creates an unnecessary delay before the data is actually processed by the application. At least that's what sounds most plausible to me, given this trace.

Correction: the call in line 311 just sets the FD to read-only. I thought that this also cleared I/O readiness flags, but that's not the case.
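
If anyone wants to watch the post-handshake messages directly, gnutls-cli's debug output might show when the session tickets actually arrive (the debug level is a guess, and I'm not certain it waits for the tickets before exiting):

    # -d 5: debug level (guess); hostname from this thread
    gnutls-cli -d 5 -p 443 s1.ezgif.com </dev/null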

5 Likes

File size seems to have no effect. I also tried a 20 kB CSS file and a 200 kB image; there was no difference.

But what is interesting - if I run the requests from my home computer (also Debian 12), which is 1000+ km away and adds a pretty stable ~200 ms of latency, I don't see any difference in timings, even after running a bunch of times.

Running the request from the machine itself, or from other machines in the datacenter, the ~50 ms difference is there pretty much always.

So I guess it only manifests if the response is very fast.
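
One way to test that theory locally might be to add artificial latency with netem and rerun the wget test (needs root; eth0 is a placeholder for the actual interface):

    sudo tc qdisc add dev eth0 root netem delay 50ms   # add ~50 ms of delay, rerun the test
    sudo tc qdisc del dev eth0 root netem              # remove the delay again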

4 Likes

Hm. Maybe on your local network you have a different MTU and you are being bitten by the Nagle algorithm with the shorter chain? Would be interesting to compare tcpdump output with a system call trace, IF you really want to dig down that far. Also, it occurs to me that next time you should add the -T option to record the time spent in each system call to make sure that the actual delay is happening in the poll system call (I suspect that it is, but it would be nice to confirm). I should have suggested that before, dang it.
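
Something along these lines, if you do go that far (tcpdump usually needs root; interface and filenames are just examples):

    # terminal 1: capture the traffic for one of the hosts
    tcpdump -n -i any -w s1.pcap host s1.ezgif.com and port 443
    # terminal 2: run the test with per-syscall timing (-T)
    strace -ttt -T -o s1-trace.txt wget -q -O /dev/null https://s1.ezgif.com/robots.txt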

It looks to me like the difference might be from GnuTLS rather than wget.

5 Likes

I think I have already spent way too much time chasing 50ms that turned out to only manifest on wget and only when the response is very fast :smiley: I initially thought this was a more serious issue. Thank you everyone for looking into it!

But if somebody cares to debug this, it seems to be reproducible on other Debian 12 systems with fast responses. Besides the ezgif servers I tried it on another fast VM (different domain, different hardware) and it manifested there as well, but on a slower machine with an older Debian and Apache it doesn't seem to happen.

5 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.