We just had a problem with servers under load for an online store using a Let's Encrypt certificate:
We have a distributed environment with many online stores and different SSL certificates. During a flash sale, the number of concurrent users rose from 500 to about 10,000 within a few minutes, and the servers became inaccessible. The servers themselves had no load at all and could be reached very quickly over HTTP at the same time. Only establishing connections over HTTPS took so long that 95% of the requests timed out. The logs showed aborted TLS handshakes.
We ran the stores in parallel on servers in our own data center and at AWS (Debian 9 and 10, Apache, PHP). Both environments used different subdomains of the same domain, and all certificates came from Let's Encrypt (some wildcard, some individually generated, Certbot 0.28.0). The servers were not accessible via HTTPS for more than 1.5 hours. On the same servers we also host shops (different vHosts) with certificates from other issuers; apart from that, the setup is identical. Those shops regularly see load peaks three times as high without any problems. OCSP stapling is disabled for all stores (LE and other CAs).
As far as I have read, LE should have no problems with load, but we can only narrow the problem down to the HTTPS connection setup and the certificates. As mentioned, the problem occurred simultaneously in two environments: a) our internal data center, with high-performance servers running Debian 9, and b) AWS, with several m5.16xlarge servers (64 vCPU, 256 GiB, Debian 10), each of which had only 600 concurrent users and still did not accept connections. For vHosts with other certificates we have no problems with 5k-10k concurrent users per server.
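One way to separate plain TCP reachability from TLS handshake time is to measure both independently. The following is only a minimal sketch of such a check, not something we ran during the incident; the function name `time_tls_handshake` and the default timeout are assumptions made for illustration.

```python
import socket
import ssl
import time


def time_tls_handshake(host, port=443, timeout=5.0):
    """Return (tcp_connect_seconds, tls_handshake_seconds).

    tls_handshake_seconds is None if the handshake fails or times out.
    Hypothetical helper for illustration only.
    """
    context = ssl.create_default_context()

    # Step 1: time the plain TCP connection on its own.
    start = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=timeout)
    tcp_seconds = time.perf_counter() - start

    # Step 2: time only the TLS handshake on the already open socket.
    try:
        start = time.perf_counter()
        # wrap_socket performs the handshake immediately by default.
        tls_sock = context.wrap_socket(sock, server_hostname=host)
        tls_seconds = time.perf_counter() - start
        tls_sock.close()
        return tcp_seconds, tls_seconds
    except OSError:
        # Covers ssl.SSLError and socket timeouts (both derive from OSError).
        sock.close()
        return tcp_seconds, None


# Example call (the hostname is a placeholder, not one of our shops):
# tcp, tls = time_tls_handshake("shop.example.com")
```

A fast TCP connect combined with a slow or failed handshake would match what we observed: the port is reachable, but the TLS setup does not complete in time.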
Could this be a fundamental problem with LE certificates under a sudden load increase? Would enabling OCSP stapling solve the problem? My forum research so far has not given a clear answer...
Thanks for your help!
Here are some logfile/server status excerpts:
[Thu Nov 19 19:25:30.805404 2020] [ssl:info] [pid 34646] [client ...:61980] AH02008: SSL library error 1 in handshake (server www..de:443)
[Thu Nov 19 19:25:30.805516 2020] [ssl:info] [pid 34646] SSL Library Error: error:140760FC:SSL routines:SSL23_GET_CLIENT_HELLO:unknown protocol -- speaking not SSL to HTTPS port!?
[Thu Nov 19 19:25:30.805531 2020] [ssl:info] [pid 34646] [client ...:61980] AH01998: Connection closed to child 2324 with abortive shutdown (server www..de:443)
[Thu Nov 19 19:25:30.889934 2020] [ssl:debug] [pid 34647] ssl_engine_io.c(1044): [client ...:50689] AH02001: Connection closed to child 2502 with standard shutdown (server www.****.de:443)
Current Time: Thursday, 19-Nov-2020 19:20:20 CET
Restart Time: Thursday, 19-Nov-2020 18:16:13 CET
Parent Server Config. Generation: 5
Parent Server MPM Generation: 4
Server uptime: 1 hour 4 minutes 6 seconds
Server load: 8.75 10.22 16.62
Total accesses: 844729 - Total Traffic: 16.5 GB
CPU Usage: u578.36 s236.06 cu134.75 cs2.56 - 24.7% CPU load
220 requests/sec - 4.4 MB/second - 20.5 kB/request
1960 requests currently being processed, 0 idle workers
SSL/TLS Session Cache Status:
cache type: SHMCB, shared memory: 512000 bytes, current entries: 457
subcaches: 32, indexes per subcache: 88
time left on oldest entries' objects: avg: 289 seconds, (range: 288...289)
index usage: 16%, cache usage: 20%
total entries stored since starting: 457
total entries replaced since starting: 0
total entries expired since starting: 0
total (pre-expiry) entries scrolled out of the cache: 0
total retrieves since starting: 27 hit, 940 miss
total removes since starting: 0 hit, 0 miss
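To put the session cache figures above into perspective, the hit rate can be computed from the "total retrieves" line. This is a small sketch; the function name `session_cache_hit_rate` is hypothetical, and the regular expression assumes the line format shown in the status excerpt above.

```python
import re


def session_cache_hit_rate(status_text):
    """Return hits / (hits + misses) from mod_status text, or None if absent."""
    match = re.search(
        r"total retrieves since starting:\s*(\d+) hit,\s*(\d+) miss",
        status_text,
    )
    if match is None:
        return None
    hits, misses = int(match.group(1)), int(match.group(2))
    total = hits + misses
    return hits / total if total else 0.0


line = "total retrieves since starting: 27 hit, 940 miss"
print(f"{session_cache_hit_rate(line):.1%}")  # prints 2.8%
```

With 27 hits out of 967 recorded lookups, almost none of the session lookups found a cached session.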