Random max retries errors caused by SSLV3_ALERT_BAD_RECORD_MAC

Hi all

My domains are: bausznern.org carvaka.de charvaka.org new.gwup.org and 24 more.

When I checked my Let's Encrypt setup with the command "sudo certbot renew --dry-run", I received the following output for 2-4 out of 28 certificates:

Failed to renew certificate FAILEDDOMAIN.TLD with error:
HTTPSConnectionPool(host='acme-staging-v02.api.letsencrypt.org', port=443):
Max retries exceeded with url: /acme/cert/2b02404155dc830f9010c89f84281ebe8ab7
(Caused by SSLError(SSLError(1,
'[SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2635)')))

Web server: Apache/2.4.52 (Ubuntu)
Operating system web server: VPS, Ubuntu 22.04.4 LTS
Server provider: Strato
Login to a root shell: Yes
No control panel
certbot 2.11.0

Note: The certificates with error messages vary randomly from run to run! It therefore has nothing to do with the renew settings.

If I enter

sudo certbot renew --cert-name FAILEDDOMAIN.TLD --dry-run

immediately afterwards, I always get a success message. All 28 certificates are currently valid. But the behaviour of the simulation seems rather strange to me.

Here are the debug messages showing the lines of code where the error occurs:
https://beginnersmind.de/doc/random_max_retry_errors.html

Kind regards

hmm...

What does this show?
curl -Ii https://acme-staging-v02.api.letsencrypt.org/

Result:

HTTP/2 200
server: nginx
date: Wed, 04 Sep 2024 16:58:38 GMT
content-type: text/html
content-length: 1556
last-modified: Wed, 19 Jun 2024 20:20:54 GMT
etag: "66733da6-614"
x-frame-options: DENY
strict-transport-security: max-age=604800

ping acme-staging-v02.api.letsencrypt.org

PING acme-staging-v02.api.letsencrypt.org(2606:4700:60:0:f41b:d4fe:4325:6026 (2606:4700:60:0:f41b:d4fe:4325:6026)) 56 data bytes
64 bytes from 2606:4700:60:0:f41b:d4fe:4325:6026 (2606:4700:60:0:f41b:d4fe:4325:6026): icmp_seq=1 ttl=59 time=1.52 ms
64 bytes from 2606:4700:60:0:f41b:d4fe:4325:6026 (2606:4700:60:0:f41b:d4fe:4325:6026): icmp_seq=2 ttl=59 time=1.19 ms
64 bytes from 2606:4700:60:0:f41b:d4fe:4325:6026 (2606:4700:60:0:f41b:d4fe:4325:6026): icmp_seq=3 ttl=59 time=1.19 ms
64 bytes from 2606:4700:60:0:f41b:d4fe:4325:6026 (2606:4700:60:0:f41b:d4fe:4325:6026): icmp_seq=4 ttl=59 time=1.17 ms
64 bytes from 2606:4700:60:0:f41b:d4fe:4325:6026 (2606:4700:60:0:f41b:d4fe:4325:6026): icmp_seq=5 ttl=59 time=1.17 ms
64 bytes from 2606:4700:60:0:f41b:d4fe:4325:6026 (2606:4700:60:0:f41b:d4fe:4325:6026): icmp_seq=6 ttl=59 time=1.24 ms
64 bytes from 2606:4700:60:0:f41b:d4fe:4325:6026 (2606:4700:60:0:f41b:d4fe:4325:6026): icmp_seq=7 ttl=59 time=1.24 ms
64 bytes from 2606:4700:60:0:f41b:d4fe:4325:6026 (2606:4700:60:0:f41b:d4fe:4325:6026): icmp_seq=8 ttl=59 time=1.13 ms
64 bytes from 2606:4700:60:0:f41b:d4fe:4325:6026 (2606:4700:60:0:f41b:d4fe:4325:6026): icmp_seq=9 ttl=59 time=1.11 ms
^C
--- acme-staging-v02.api.letsencrypt.org ping statistics ---
9 packets transmitted, 9 received, 0% packet loss, time 8013ms
rtt min/avg/max/mdev = 1.107/1.216/1.520/0.114 ms

That's an IPv6 address....
Good to see that works, but let's check the IPv4 path as well:
curl -Ii4 https://acme-staging-v02.api.letsencrypt.org/

Result:

HTTP/2 200
server: nginx
date: Thu, 05 Sep 2024 09:31:07 GMT
content-type: text/html
content-length: 1556
last-modified: Wed, 19 Jun 2024 20:20:54 GMT
etag: "66733da6-614"
x-frame-options: DENY
strict-transport-security: max-age=604800

ping -4 acme-staging-v02.api.letsencrypt.org
PING (172.65.46.172) 56(84) bytes of data.
64 bytes from 172.65.46.172 (172.65.46.172): icmp_seq=1 ttl=59 time=1.22 ms
64 bytes from 172.65.46.172 (172.65.46.172): icmp_seq=2 ttl=59 time=1.19 ms
64 bytes from 172.65.46.172 (172.65.46.172): icmp_seq=3 ttl=59 time=1.09 ms
64 bytes from 172.65.46.172 (172.65.46.172): icmp_seq=4 ttl=59 time=1.08 ms
64 bytes from 172.65.46.172 (172.65.46.172): icmp_seq=5 ttl=59 time=1.40 ms
64 bytes from 172.65.46.172 (172.65.46.172): icmp_seq=6 ttl=59 time=1.17 ms
64 bytes from 172.65.46.172 (172.65.46.172): icmp_seq=7 ttl=59 time=1.14 ms
64 bytes from 172.65.46.172 (172.65.46.172): icmp_seq=8 ttl=59 time=1.25 ms
^C
--- ping statistics ---
8 packets transmitted, 8 received, 0% packet loss, time 7010ms
rtt min/avg/max/mdev = 1.075/1.192/1.399/0.095 ms

I also performed a longer run:
--- ping statistics ---
207 packets transmitted, 207 received, 0% packet loss, time 206249ms
rtt min/avg/max/mdev = 1.027/1.165/1.751/0.086 ms
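
A single successful curl request does not rule out an intermittent fault. Here is a rough sketch of a repeated probe, assuming Python 3 is available on the server: it opens a fresh HTTPS connection to the staging directory URL many times and tallies the outcomes. The function names and the attempt count are made up for illustration; only the host name and the /directory path come from the error messages in this thread.

```python
import urllib.error
import urllib.request
from collections import Counter

# Staging endpoint named in the certbot error messages.
URL = "https://acme-staging-v02.api.letsencrypt.org/directory"


def fetch_once(url=URL, timeout=10):
    """Open one new HTTPS connection and read the whole response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def describe(exc):
    """Label an exception; unwrap URLError so the underlying cause is shown."""
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, Exception):
        exc = exc.reason
    return f"{type(exc).__name__}: {exc}"


def probe(attempts, fetch=fetch_once):
    """Call fetch() `attempts` times and count successes and error labels."""
    outcomes = Counter()
    for _ in range(attempts):
        try:
            fetch()
            outcomes["ok"] += 1
        except Exception as exc:  # count every failure, whatever its type
            outcomes[describe(exc)] += 1
    return outcomes
```

Calling `print(probe(200))` would show whether TLS errors also appear outside certbot. Note that this uses the system Python, which may link a different OpenSSL than the one bundled with the certbot snap.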

#!/bin/sh
sudo certbot renew --cert-name bausznern.org --dry-run
sleep 10
sudo certbot renew --cert-name blog.gwup.net --dry-run
sleep 10
sudo certbot renew --cert-name carvaka.de --dry-run

... 25 further certificates

Result: 5 failures out of 28 certificates

Earlier I tried "certbot renew --dry-run": 2 failures out of 28 cases
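
To put a number on how random the failures are, the per-certificate script above could be generalised along these lines. This is only a sketch: it assumes it is run as root, that certbot is on the PATH, and that certbot exits with a non-zero status when a dry run fails. The function names and the `rounds` parameter are made up for illustration.

```python
import subprocess


def dry_run(cert_name):
    """Return True if the dry run for one certificate exits with status 0."""
    result = subprocess.run(
        ["certbot", "renew", "--cert-name", cert_name, "--dry-run"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def tally_failures(cert_names, rounds=1, run=dry_run):
    """Run every certificate `rounds` times and count failures per name.

    Only names that failed at least once appear in the returned dict.
    """
    failures = {}
    for _ in range(rounds):
        for name in cert_names:
            if not run(name):
                failures[name] = failures.get(name, 0) + 1
    return failures
```

For example, `tally_failures(["bausznern.org", "carvaka.de"], rounds=5)` would return a dict of failure counts. If the counts are spread evenly across names over several rounds, that supports the observation that the failures are unrelated to any particular certificate.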

What is shown in the logs for those failures?

Always the same:

"Failed to renew certificate software-theband.com with error: HTTPSConnectionPool(host='acme-staging-v02.api.letsencrypt.org', port=443): Max retries exceeded with url: /directory (Caused by SSLError(SSLError(1, '[SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2635)')))"

Note that the certificates actually affected by the error vary randomly. They are different for each run!

Here are the detailed error messages:
https://beginnersmind.de/doc/random_max_retry_errors.html
I have already provided this link in my first post.

I also tested the webserver with openssl. Here is the output for one of the virtual hosts. I cannot see any problems:

openssl s_client -connect www.ecso.org:443 -servername www.ecso.org

CONNECTED(00000003)
depth=2 C = US, O = Internet Security Research Group, CN = ISRG Root X1
verify return:1
depth=1 C = US, O = Let's Encrypt, CN = E5
verify return:1
depth=0 CN = ecso.org
verify return:1
---
Certificate chain
 0 s:CN = ecso.org
   i:C = US, O = Let's Encrypt, CN = E5
   a:PKEY: id-ecPublicKey, 256 (bit); sigalg: ecdsa-with-SHA384
   v:NotBefore: Aug 31 07:25:05 2024 GMT; NotAfter: Nov 29 07:25:04 2024 GMT
 1 s:C = US, O = Let's Encrypt, CN = E5
   i:C = US, O = Internet Security Research Group, CN = ISRG Root X1
   a:PKEY: id-ecPublicKey, 384 (bit); sigalg: RSA-SHA256
   v:NotBefore: Mar 13 00:00:00 2024 GMT; NotAfter: Mar 12 23:59:59 2027 GMT
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIDfTCCAwSgAwIBAgISBEHcnu3cQpFXA/Y+vg3KSKsHMAoGCCqGSM49BAMDMDIx
...more lines
-----END CERTIFICATE-----
subject=CN = ecso.org
issuer=C = US, O = Let's Encrypt, CN = E5
---
No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: ECDSA
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 2391 bytes and written 394 bytes
Verification: OK
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 256 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
    Protocol  : TLSv1.3
    Cipher    : TLS_AES_256_GCM_SHA384
    Session-ID: 41EBA0D3D641082AC8B691489FFEDEF7DAA7F61240728F48D6B7625DE4D5B715
    Session-ID-ctx:
    Resumption PSK: 2B76D815581B999003CBE071E394ABBBFB08BAC6AE680646922090D9F13D04E867C17891AD626E30263B05CA9CF8FC03
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 300 (seconds)
    TLS session ticket:
    0000 - db 53 fb d5 3a 44 f8 58-37 45 11 80 63 e6 03 98   .S..:D.X7E..c...
    0010 - 08 67 98 ed 81 1e 0a ea-29 ec ba 7f 2c c9 06 e9   .g......)...,...

    Start Time: 1725540275
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
    Extended master secret: no
    Max Early Data: 0
---
read R BLOCK
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
    Protocol  : TLSv1.3
    Cipher    : TLS_AES_256_GCM_SHA384
    Session-ID: C14EAEBA6B28930280734C390CE7D908081C64443926F60048430417F034240E
    Session-ID-ctx:
    Resumption PSK: 2BE0CB9DBF973C333EF7A4594EAACD26CECA62DAD50E1A5DBC469D75A504EFC5CCC78293E2C5A9C493323F6389892135
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 300 (seconds)
    TLS session ticket:
    0000 - 81 d5 88 7b 80 84 21 11-4d 1d 8d 53 07 03 66 04   ...{..!.M..S..f.
    0010 - 0c 10 46 c9 16 04 55 b0-3a 4b bc 7a fc bf 90 27   ..F...U.:K.z...'

    Start Time: 1725540275
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
    Extended master secret: no
    Max Early Data: 0
---
read R BLOCK
closed

Have you rebooted?

It seems that the error message ‘SSLV3_ALERT_BAD_RECORD_MAC’ indicates a problem with the integrity of the data transferred between the client and the server. This error occurs if the Message Authentication Code (MAC) of an SSL data record is invalid.
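
As an illustration of what an invalid MAC means, here is a minimal sketch using Python's standard hmac module. It is not the actual TLS record algorithm (modern cipher suites use AEAD ciphers, where an authentication tag plays the same role); the key and the record contents are invented for the example.

```python
import hashlib
import hmac


def make_tag(key, record):
    """Compute an HMAC-SHA256 tag over the record."""
    return hmac.new(key, record, hashlib.sha256).digest()


def verify(key, record, tag):
    """Return True only if the tag matches the record (constant-time compare)."""
    return hmac.compare_digest(make_tag(key, record), tag)


key = b"shared-session-key"          # hypothetical key known to both sides
record = b"GET /directory HTTP/1.1"  # hypothetical record payload
tag = make_tag(key, record)

# Flip a single bit of the first byte, as a faulty link or memory cell might.
corrupted = bytes([record[0] ^ 0x01]) + record[1:]

print(verify(key, record, tag))      # True: record arrived intact
print(verify(key, corrupted, tag))   # False: the receiver would reject it
```

A single flipped bit anywhere in the record is enough to make verification fail, which is why this alert points at corruption somewhere between the two TLS endpoints rather than at a certificate problem.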

So I did some more SSL testing:

  • We store tar.gz backup archives on a different node every night.
  • I downloaded an archive of size 1.3 GB. There were no errors. By the way, we use SFTP extensively without any problems.

Yes, of course. I already tried rebooting 3 days ago, right after I noticed this effect.

I'm at a loss as to why this error occurs at all.

Maybe you could uninstall certbot, check that no other versions remain, then reinstall certbot.
OR
Try using another ACME client [like: acme.sh] to see if the problem persists.

This issue could be caused by filesystem corruption or hardware problems, difficult to tell. Perhaps you could do a packet capture (pcap) with tcpdump of a failed renew so that we can look if there's an obvious problem?

Yeah, if it's intermittent and inconsistent it may be some networking hardware or firewall or the like corrupting or dropping packets or something along those lines.

Googling suggests this is often related to the TLS 1.3 Zero Round Trip Time Resumption (0-RTT) feature (see "Introducing Zero Round Trip Time Resumption (0-RTT)"), which would be consistent with Cloudflare being the endpoint for the API traffic. (Related-ish article: "SSL_ERROR_BAD_MAC_ALERT | Firefox Support Forum | Mozilla Support")

As a guess I'd suggest that this could be improved by updating the version of Python certbot is using (or the system openssl version).

Let's Encrypt does not terminate TLS at Cloudflare (they use E2EE via Cloudflare Spectrum) and does not have 0-RTT enabled. The issue you linked is related to a (partially incompatible) TLS middlebox, and so far we haven't seen evidence that a middlebox is actually involved. Given that this is a VPS running at Strato, it's unlikely as they don't employ middleboxes.

The following program versions are installed:

  • snap version of certbot: 2.11.0 which seems to be the latest. I have read that the Snap release comes with its own Python libraries.
  • openssl is: "openssl/jammy-updates,jammy-security,now 3.0.2-0ubuntu1.18 amd64"
    Index of /changelogs/pool/main/o/openssl
    This seems to be the latest for Ubuntu 22.04.4 LTS
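
Because the snap ships its own Python libraries, the system openssl package is not necessarily the library certbot actually uses. Here is a small sketch for checking which OpenSSL a given Python interpreter is linked against; it has to be run with the interpreter in question, and how to reach the one inside the certbot snap is not covered here. The function name is made up for illustration.

```python
import ssl
import sys


def tls_runtime_info():
    """Report the interpreter version and the OpenSSL build it is linked to."""
    return {
        "python": sys.version.split()[0],
        "openssl": ssl.OPENSSL_VERSION,   # version string of the linked library
        "tls1_3": ssl.HAS_TLSv1_3,        # whether TLS 1.3 support is compiled in
    }


print(tls_runtime_info())
```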

A random memory error seems unlikely to me, as it should then cause errors elsewhere too, not only with ‘certbot renew --dry-run’. That command currently returns the specified error on every run, which would imply a very frequently occurring memory error.

The randomness of the error event does not really seem to me to indicate that the SSD memory is defective.

Now I did a tcpdump during ‘certbot renew --dry-run’ and stopped the recording after the first error occurred:

tcpdump host acme-staging-v02.api.letsencrypt.org -w dump.pcap
tshark -r dump.pcap -V > dump.txt

dump.pcap 321K
dump.txt 2.8M

But I don't know what I could look for.
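
One possible starting point, offered as a rough sketch rather than a proper analysis: scan the verbose text dump for lines that hint at transport trouble. The keyword patterns below are assumptions about how tshark words these events in its -V output and may need adjusting; the function name is made up. With TLS 1.3 the alert itself may be encrypted and therefore not show up as an alert in the dump.

```python
import re
from collections import Counter

# Assumed wording of tshark's verbose output; adjust if your version differs.
PATTERNS = {
    "retransmission": re.compile(r"retransmission", re.I),
    "out_of_order": re.compile(r"out-of-order", re.I),
    "duplicate_ack": re.compile(r"dup(licate)? ack", re.I),
    "tls_alert": re.compile(r"\balert\b", re.I),
    "tcp_reset": re.compile(r"reset: set", re.I),
}


def count_indicators(lines):
    """Count how many lines match each pattern in PATTERNS."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Usage with the text dump from this thread:
#   with open("dump.txt", errors="replace") as fh:
#       print(count_indicators(fh))
```

Many retransmissions or resets shortly before the failure would point toward the network path; a clean TCP stream would make corruption on one of the two hosts more plausible.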