Intermittent time-outs on /acme endpoints from specific IP

Since this morning we have been getting intermittent "Connection reset by peer" errors on https://acme-v02.api.letsencrypt.org/acme/new-nonce. Validation seems to succeed, but when the certificate is downloaded there is an unusually long wait, after which the request fails. So far this only seems to happen from one particular client IP address. The error occurred on multiple retries. We stopped retrying because we do not want to be rate-limited.

What we have tried:

  • Retrying a few times after a few minutes: Connection reset by peer
  • Trying with another domain on the same server, roywilssl.dev.slik.eu: Connection reset by peer
  • Trying with yet another domain on the same server, le-test.dev.slik.eu: this worked!?
  • Trying on another server, this worked
  • Trying to reach the /acme/new-nonce endpoint with curl, sometimes it works, sometimes it fails after a long waiting period
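
To put numbers on the "sometimes it works, sometimes it fails" behaviour from the last bullet, a small script could repeat the request and record the outcome and duration of each attempt. This is only a sketch: `fetch_nonce` and `probe` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and the number of attempts should stay low so the test itself does not add load or trip a rate limit.

```python
import time
import urllib.request

NONCE_URL = "https://acme-v02.api.letsencrypt.org/acme/new-nonce"


def fetch_nonce(url=NONCE_URL, timeout=10):
    """Send a HEAD request to the new-nonce endpoint and return the Replay-Nonce header."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers["Replay-Nonce"]


def probe(fetch, attempts=5, pause=0.0):
    """Call fetch() repeatedly and return a list of (ok, seconds, detail) tuples."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            detail = fetch()
            ok = True
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            detail = repr(exc)
            ok = False
        results.append((ok, time.monotonic() - start, detail))
        time.sleep(pause)
    return results
```

Running `probe(fetch_nonce, attempts=5, pause=30)` on the affected server and on one that works would show whether the failures cluster in time and how long each failing attempt hangs.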

(Note that these sites are configured to give an authentication prompt, but the ACME challenge directory is not affected by this and, as you can see, the validation is successful.)

I have searched the forums and read about rate limiting by Cloudflare. That could be the case here because the server is hosted at a VPS provider. The very strange thing is that it DID work for le-test.dev.slik.eu. Perhaps we hit a particular node that rate-limited us by accident?

Is there something we can do? Or should we just wait until tomorrow and try again?
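
One option between "keep retrying" and "wait until tomorrow" is to retry with exponentially growing pauses, so that a transient network problem gets a few chances without hammering the API. A minimal sketch, assuming the failing step can be wrapped in a callable; `retry_with_backoff` and its parameters are hypothetical and not part of acme-tiny:

```python
import random
import time


def retry_with_backoff(action, attempts=4, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Run action(); on OSError wait base_delay * 2**n (plus jitter) and retry."""
    for n in range(attempts):
        try:
            return action()
        except OSError:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** n)
            # Up to 10% random jitter so several hosts do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

With the defaults this makes at most four attempts, pausing roughly 1, 2 and 4 seconds in between. For the ACME endpoints a much larger `base_delay` (minutes rather than seconds) is probably the safer choice.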

My domain is: slik.dev.slik.eu

I ran this command: /usr/local/bin/acme_tiny.py --account-key /etc/letsencrypt.key --csr /tmp/letsencrypt-csrYWu6iF --acme-dir /etc/apache2/acme-challenge

It produced this output:

Parsing account key...
Parsing CSR...
Found domains: slik.dev.slik.eu, www.slik.dev.slik.eu
Getting directory...
Directory found!
Registering account...
Already registered!
Creating new order...
Order created!
Verifying slik.dev.slik.eu...
slik.dev.slik.eu verified!
Verifying www.slik.dev.slik.eu...
www.slik.dev.slik.eu verified!
Signing certificate...
Traceback (most recent call last):
  File "/usr/local/bin/acme_tiny.py", line 214, in <module>
    main(sys.argv[1:])
  File "/usr/local/bin/acme_tiny.py", line 210, in main
    signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca, disable_check=args.disable_check, directory_url=args.directory_url, contact=args.contact, alternate_chain=args.alternate_chain)
  File "/usr/local/bin/acme_tiny.py", line 178, in get_crt
    certificate_pem, _, _ = _send_signed_request(order['certificate'], None, "Certificate download failed", download_cert=True)
  File "/usr/local/bin/acme_tiny.py", line 61, in _send_signed_request
    data = _prepare_data(url, payload)
  File "/usr/local/bin/acme_tiny.py", line 51, in _prepare_data
    new_nonce = _do_request(directory['newNonce'])[2]['Replay-Nonce']
  File "/usr/local/bin/acme_tiny.py", line 46, in _do_request
    raise ValueError("{0}:\nUrl: {1}\nData: {2}\nResponse Code: {3}\nResponse: {4}".format(err_msg, url, data, code, resp_data))
ValueError: Error:
Url: https://acme-v02.api.letsencrypt.org/acme/new-nonce
Data: None
Response Code: None
Response: <urlopen error [Errno 104] Connection reset by peer>
Acme client failed, aborting

Edit: the final retry failed earlier than before, with urlopen error [Errno 104] Connection reset by peer already at https://acme-v02.api.letsencrypt.org/acme/new-acct:

Parsing account key...
Parsing CSR...
Found domains: slik.dev.slik.eu, www.slik.dev.slik.eu
Getting directory...
Directory found!
Registering account...
Traceback (most recent call last):
  File "/usr/local/bin/acme_tiny.py", line 214, in <module>
    main(sys.argv[1:])
  File "/usr/local/bin/acme_tiny.py", line 210, in main
    signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca, disable_check=args.disable_check, directory_url=args.directory_url, contact=args.contact, alternate_chain=args.alternate_chain)
  File "/usr/local/bin/acme_tiny.py", line 126, in get_crt
    account, code, acct_headers = _send_signed_request(directory['newAccount'], reg_payload, "Error registering")
  File "/usr/local/bin/acme_tiny.py", line 74, in _send_signed_request
    return _do_request(url, data=data.encode('utf8'), err_msg=err_msg, depth=depth)
  File "/usr/local/bin/acme_tiny.py", line 46, in _do_request
    raise ValueError("{0}:\nUrl: {1}\nData: {2}\nResponse Code: {3}\nResponse: {4}".format(err_msg, url, data, code, resp_data))
ValueError: Error registering:
Url: https://acme-v02.api.letsencrypt.org/acme/new-acct
Data: {"protected": "-REMOVEDFROMPOST-", "signature": "-REMOVEDFROMPOST-"}
Response Code: None
Response: <urlopen error [Errno 104] Connection reset by peer>
Acme client failed, aborting

My web server is (include version): Apache 2.4.54

The operating system my web server runs on is (include version): Ubuntu 18.04 LTS

My hosting provider, if applicable, is: Tilaa

I can login to a root shell on my machine (yes or no, or I don't know): yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): acme-tiny fork with support for --alternate-chain (SlikNL/acme-tiny on GitHub: a tiny Let's Encrypt client that supports alternate chains)

I have tried curl -vvv https://acme-v02.api.letsencrypt.org/acme/new-nonce; it sometimes works and sometimes fails. Here is a successful attempt:

*   Trying 172.65.32.248...
* TCP_NODELAY set
* Connected to acme-v02.api.letsencrypt.org (172.65.32.248) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Unknown (8):
* TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Client hello (1):
* TLSv1.3 (OUT), TLS Unknown, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=acme-v02.api.letsencrypt.org
*  start date: Jan  6 22:31:53 2023 GMT
*  expire date: Apr  6 22:31:52 2023 GMT
*  subjectAltName: host "acme-v02.api.letsencrypt.org" matched cert's "acme-v02.api.letsencrypt.org"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.3 (OUT), TLS Unknown, Unknown (23):
* TLSv1.3 (OUT), TLS Unknown, Unknown (23):
* TLSv1.3 (OUT), TLS Unknown, Unknown (23):
* Using Stream ID: 1 (easy handle 0x55e41872c540)
* TLSv1.3 (OUT), TLS Unknown, Unknown (23):
> GET /acme/new-nonce HTTP/2
> Host: acme-v02.api.letsencrypt.org
> User-Agent: curl/7.58.0
> Accept: */*
> 
* TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS Unknown, Unknown (23):
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
* TLSv1.3 (OUT), TLS Unknown, Unknown (23):
* TLSv1.3 (IN), TLS Unknown, Unknown (23):
* TLSv1.3 (IN), TLS Unknown, Unknown (23):
< HTTP/2 204 
< server: nginx
< date: Wed, 08 Feb 2023 11:29:35 GMT
< cache-control: public, max-age=0, no-cache
< link: <https://acme-v02.api.letsencrypt.org/directory>;rel="index"
< replay-nonce: F9776ePia0lib1EnWx5L7hNNFMFervovWZJE8Kb6GahsNEY
< x-frame-options: DENY
< strict-transport-security: max-age=604800
< 
* Connection #0 to host acme-v02.api.letsencrypt.org left intact

Here is an unsuccessful one from the same IP, which took a very long time and then failed:

curl -vvv https://acme-v02.api.letsencrypt.org/acme/new-nonce
*   Trying 172.65.32.248...
* TCP_NODELAY set
* Connected to acme-v02.api.letsencrypt.org (172.65.32.248) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to acme-v02.api.letsencrypt.org:443 
* stopped the pause stream!
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to acme-v02.api.letsencrypt.org:443 

Would you be willing to share that specific IP Address(es)?

Sure, it is in the post, it can be found from the domain: 84.22.98.25

Okay, after a few hours I had some more time to look at the problem. I started to wonder whether the error message "Connection reset by peer" was misleading and the endpoint had simply been unreachable, resulting in a timeout instead (especially since it took a VERY long time before the error appeared, and curl did not seem to say that the connection was reset).
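
To tell a timeout apart from a real reset on the Python side, the exception that urllib wraps can be inspected directly. The sketch below uses a hypothetical helper, `classify`, with deliberately coarse labels. For what it is worth, on Linux errno 104 is ECONNRESET, so the Python traceback above does point at a reset as reported by the kernel, whereas curl's SSL_ERROR_SYSCALL is less specific.

```python
import socket
import urllib.error


def classify(exc):
    """Map an exception raised by urlopen() to a coarse failure label."""
    # urllib wraps low-level socket errors in URLError; unwrap when present.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, socket.timeout):
        return "timeout"   # nothing arrived before the deadline
    if isinstance(reason, ConnectionResetError):
        return "reset"     # the connection was reset (ECONNRESET)
    if isinstance(reason, ConnectionRefusedError):
        return "refused"   # nothing is listening on the port
    return "other"
```

Note that a reset can still arrive after a long wait, so a slow failure does not by itself rule out "reset".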

So I ran a few tcptraceroutes, which look fine now:

traceroute to 172.65.32.248 (172.65.32.248), 30 hops max, 60 byte packets
 1  hlm1-pod13-vc13-v1-1.tilaa.net (84.22.98.1)  0.810 ms  1.321 ms  1.348 ms
 2  hlm1-cr1-v1066.tilaa.net (164.138.24.66)  1.129 ms  1.070 ms  1.133 ms
 3  hlm1-bfr1-v1032.tilaa.net (164.138.24.32)  0.242 ms  0.244 ms  0.319 ms
 4  * * *
 5  172.71.96.2 (172.71.96.2)  1.337 ms 172.70.44.2 (172.70.44.2)  7.277 ms 172.71.180.2 (172.71.180.2)  2.418 ms
 6  172.65.32.248 (172.65.32.248) <syn,ack>  1.071 ms  1.091 ms  1.086 ms

Since all the tcptraceroutes succeeded, I retried the original requests that had failed, and they are now succeeding!

I now suspect a temporary connectivity problem in which some flows reached the 172.65.32.248 proxy correctly and some flows, for some reason, didn't.

It would be almost impossible to work out afterwards where the problem was (routing problems between our provider and the proxy? maybe the proxy never saw some of the requests), so I guess this topic comes to a slightly unsatisfying but positive end. I should have thought of trying tcptraceroute earlier. I will do so if we run into the problem again.

On second thought, no, I don't think it was an unreachable Cloudflare server. After all, the failing curls did say:

* Connected to acme-v02.api.letsencrypt.org (172.65.32.248) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1

The "Connected" line means the TCP handshake with 172.65.32.248 completed (the ALPN lines only list what curl itself offers). So, Cloudflare WAS reachable from our IP. However, after the connection was established there seemed to be radio silence, leading to a timeout.
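
The same distinction can be checked from Python by treating the TCP connect and the TLS handshake as separate stages and reporting which one fails. This is a sketch under stated assumptions: `staged_probe` is a hypothetical helper, the 10-second timeout is arbitrary, and it only tells you where the client stopped getting answers, not why.

```python
import socket
import ssl


def staged_probe(host, port=443, timeout=10.0, context=None):
    """Return ("tcp" | "tls" | "ok", detail), naming the first stage that failed."""
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return ("tcp", repr(exc))  # no TCP handshake: unreachable or refused
    context = context or ssl.create_default_context()
    try:
        # wrap_socket performs the TLS handshake on the connected socket
        # and inherits the socket's timeout.
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return ("ok", tls.version())
    except OSError as exc:  # covers ssl.SSLError, socket.timeout and resets
        return ("tls", repr(exc))
    finally:
        raw.close()  # harmless if wrap_socket already took over the descriptor
```

A result of `("tls", ...)` for `staged_probe("acme-v02.api.letsencrypt.org")` would match the failing curl run: the connection is established, the Client Hello goes out, and nothing comes back.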

So I'm now leaning towards one of two explanations: either Cloudflare abandoned the connection without resetting it, or there was a long wait between Cloudflare and the backend. That said, I haven't seen many other posts about the problem, which I would have expected if there had been an outage. Maybe it was specific to a certain Cloudflare PoP?

If it were a Cloudflare policy issue, I would also have expected them to drop the connection immediately instead of keeping it open until the client timed out.

So I'm still puzzled, and will update if we have further trouble.
