SSL handshake failure to /directory endpoint

Hello,

We have ongoing problems accessing the /directory endpoint via the dehydrated client (latest release). The error is:

LE[8850]: ERROR: Problem connecting to server (get for https://acme-v02.api.letsencrypt.org/directory; curl returned with 35)

The fix in October partially resolved the problem: connection issues are less frequent, but they still occur randomly across all servers (higher dozens per day). More information about our setup is in the locked thread here.

In the meantime we lowered the frequency of certificate renewal checks, with no luck so far. According to captured data (Wireshark), an RST packet arrives from IP 2606:4700:60:0:f53d:5624:85c7:3a2c (Cloudflare) approx. 3 seconds after the Client Hello. I suspect this is because the remote peer expects an ACK for the Server Hello, but the Server Hello itself never arrived from the remote side or the network.

Details, with an example of a successful and an unsuccessful connection, are in the .pcap file. Times are in UTC.
le.pcap (3.6 KB)

Is there anything we can check on our side? We would be glad for any assistance. Thank you.

Hi @403,

We’ll take a look and see what we can do.

Thanks @Phil_LE. Any updates on this issue? As I see it, the problem is likely to be on the route between our network and the CF edge, or directly at the CF edge.

In the meantime, connection issues persist. The LE log with curl errors from one host (karen.onebit.cz, 178.238.37.225, 2a01:430:13::225) is saved in le-log.txt (4.5 KB). Timestamps are in UTC+1. I do not know if the traceroute output will be useful here (it ends at cloudflare.com or peering.cz), but it is attached too.

traceroute to acme-v02.api.letsencrypt.org (2606:4700:60:0:f53d:5624:85c7:3a2c), 30 hops max, 80 byte packets
 1  ip6gw.onebit.cz (2a01:430:13::1)  0.838 ms  0.867 ms  0.942 ms
 2  vl1399.cr4.r2-1-3.dc2.cejl.brq.masterinter.net (2a01:430:ff:1399:1::2)  0.356 ms  0.370 ms  0.412 ms
 3  vl1384.cr3.r1-8.dc1.4d.prg.masterinter.net (2a01:430:ff:1384:1::2)  5.203 ms  5.345 ms  6.871 ms
 4  nix4-ipv6.cloudflare.com (2001:7f8:14::81:1)  4.371 ms  4.388 ms  5.235 ms
 5  * * *
 6  * * *

traceroute to acme-v02.api.letsencrypt.org (172.65.32.248), 30 hops max, 60 byte packets
 1  b1hsrp1.onebit.cz (89.185.231.2)  0.839 ms  0.879 ms  0.981 ms
 2  vl1400.cr3.r3-1-4.dc3.cejl.brq.masterinter.net (83.167.254.129)  0.247 ms  0.463 ms  0.434 ms
 3  vl1385.cr2.c16.127.cecolo.prg.masterinter.net (83.167.254.154)  3.555 ms  3.781 ms  3.696 ms
 4  cloudflare.peering.cz (91.213.211.102)  5.206 ms  5.220 ms  5.299 ms
 5  * * *
 6  * * *

Thank you for your help.

Hi @403,

Apologies for the super long delay here.

  • Do the curl 35 errors from le-log.txt happen with any other client or just dehydrated/0.6.5?
    2019-12-25 04:00:16 sid-225 LE[345025]: ERROR: Problem connecting to server (get for https://acme-v02.api.letsencrypt.org/directory; curl returned with 35)
    
  • Does the problem persist with the dehydrated master branch? It includes a logic fix per https://github.com/lukas2511/dehydrated/issues/684#issuecomment-539754841
  • Could you run dehydrated in debug mode and provide failure logs from that please?
  • Can you paste the exact dehydrated commands you run when you receive the error(s), and the case you stated at cURL error to /directory endpoint?

According to https://curl.haxx.se/libcurl/c/libcurl-errors.html, error 35 is

CURLE_SSL_CONNECT_ERROR (35)
A problem occurred somewhere in the SSL/TLS handshake. You really want the error buffer and read the message there as it pinpoints the problem slightly more. Could be certificates (file formats, paths, permissions), passwords, and others. 
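
Since exit code 35 only says that the handshake failed, one way to get at the underlying message is to run curl with -sS (silent, but still printing the error text) and keep stderr. Below is a minimal sketch, assuming bash. The helper names describe_curl_exit and probe_directory and the log path /tmp/curl-errors.log are hypothetical and not part of dehydrated.

```shell
#!/bin/bash
# Sketch only: helper names and the log path are illustrative, not part of dehydrated.

# Map a few well-known curl exit codes to short descriptions.
describe_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "could not resolve host" ;;
    7)  echo "failed to connect" ;;
    28) echo "operation timed out" ;;
    35) echo "SSL/TLS handshake error" ;;
    *)  echo "other curl error" ;;
  esac
}

# Fetch a URL and append curl's own error message (stderr) to a log file.
# -s silences the progress meter, -S still prints the error message on failure.
probe_directory() {
  local url="$1" errlog="${2:-/tmp/curl-errors.log}" rc
  curl -sS -o /dev/null "$url" 2>>"$errlog"
  rc=$?
  echo "$(date -u '+%F %T') rc=$rc ($(describe_curl_exit "$rc"))" >>"$errlog"
  return "$rc"
}
```

It could be called as probe_directory https://acme-v02.api.letsencrypt.org/directory. The stderr line curl writes on failure usually carries the TLS library's own wording, which narrows things down more than the bare exit code; adding -v to the curl call would also log the individual handshake steps.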

We’re still digging in and we know this is a pain point for the community.

As an update, we just finished reviewing all of our physical firewall configurations and found several ways to eke out extra performance, e.g. using less CPU and RAM while still serving the same number of active firewall sessions.

Thanks again @Phil_LE for looking at this problem.

We use only the dehydrated ACME client across our servers. Changing that would require a bigger adjustment to our system around LE certificates.

I have applied the small fix from the linked issue on GitHub. I will let you know about the result later.

I am not aware that dehydrated has a debug option. The curl 35 error occurs right after the GET request to the /directory endpoint, while checking whether a certificate for the domain and its SANs needs to be renewed.

2020-01-09 08:00:20 sid-225 LE[290722]: # INFO: Using main config file config
2020-01-09 08:00:20 sid-225 LE[290722]: # INFO: Using additional config file conf.d/10-onebit-conf.sh
2020-01-09 08:00:20 sid-225 LE[290723]: ERROR: Problem connecting to server (get for https://acme-v02.api.letsencrypt.org/directory; curl returned with 35) 

2020-01-09 08:00:22 sid-225 LE[290823]: # INFO: Using main config file /etc/le/config
2020-01-09 08:00:22 sid-225 LE[290823]: # INFO: Using additional config file /etc/le/conf.d/10-onebit-conf.sh
2020-01-09 08:00:23 sid-225 LE[290823]: Processing domain.tld with alternative names: www.domain.tld
2020-01-09 08:00:23 sid-225 LE[290823]: + Checking domain name(s) of existing cert... unchanged.
2020-01-09 08:00:23 sid-225 LE[290823]: + Checking expire date of existing cert...
2020-01-09 08:00:23 sid-225 LE[290823]: + Valid till Mar 16 19:01:15 2020 GMT (Longer than 14 days). Skipping renew!

The relevant code is in the init_system function, line 275.

# Get CA URLs
CA_DIRECTORY="$(http_request get "${CA}")"

Dehydrated is run periodically via cron with these options (output goes to syslog).

dehydrated -c -d '$domain $sans' -t $challenge

Testing the connection outside the ACME client was done this way. I am running it again; so far no problem (TLS connection OK, no errors).

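#!/bin/bash
# Probe the /directory endpoint in random bursts and log the result of each request.

while :; do
  # Send a burst of 1-5 requests
  for c in $(seq 1 "$(shuf -i 1-5 -n1)"); do
    # Logged fields: timestamp, ssl_verify_result, http_code, curl exit code
    { curl -A "dehydrated/0.6.5 curl/7.29.0" -L -s -w "$(date '+%F %T') %{ssl_verify_result} %{http_code} " -o /tmp/curlout -D /tmp/curlhead https://acme-v02.api.letsencrypt.org/directory; echo $?; } >> /tmp/curlstatus
  done
  # Pause 10-30 seconds between bursts
  sleep "$(shuf -i 10-30 -n1)"
done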

The connections to the server acme-v02.api.letsencrypt.org may be heavily clustered in time. I see from the log that you start the connection immediately after the wall-clock hour changes, something like: 0 * * * * <job>. Would it be possible to change this to a different starting minute on each server, and to sleep a random number of seconds in the range 0-59 at the beginning?
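
The random-sleep part of this suggestion could be sketched as follows, assuming bash. The helper names jitter_seconds and run_with_jitter are hypothetical; the per-server starting minute would still be set in each server's crontab.

```shell
#!/bin/bash
# Sketch only: jitter_seconds and run_with_jitter are illustrative helper names.

# Print a random whole number of seconds in the range 0..(max-1); default max is 60.
jitter_seconds() {
  local max="${1:-60}"
  echo $(( RANDOM % max ))
}

# Sleep for a random 0-59 seconds, then run the given command with its arguments.
run_with_jitter() {
  sleep "$(jitter_seconds 60)"
  "$@"
}
```

A cron wrapper could then call something like run_with_jitter dehydrated -c, so that servers sharing the same cron minute still spread their first request over up to a minute.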

The LE cron job runs every 30 minutes with a variable number of domains to process. Each run of dehydrated is separated by a random delay. Before the change from Akamai to Cloudflare, the job ran every 10 minutes (more often, but with a smaller number of domains) without any delay, and it was absolutely fine.

If we were hitting some rate limit, I would expect an informative message from the edge, not just a closed TCP connection at the TLS handshake stage (waiting for an ACK for a Server Hello that is not sent, or that ends up in some black hole of the Internet?). Can you check whether we are hitting some limit (firewall, …)? If so, is it per IP or for the whole block?

I also have the first results from testing the connection outside cron: four connection errors in the last 24 hours. The times mostly differ from the cron schedule, and at those times (xx:20, xx:55) there should not have been any other connection to the /directory endpoint.

IP 178.238.37.179 / 2a01:430:13::179
Time[GMT+1] / SSL check / HTTP status / curl return code

2020-01-09 19:10:04 0 000 35
2020-01-10 09:00:28 0 000 35
2020-01-10 13:20:08 0 000 35
2020-01-10 14:55:11 0 000 35
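
A log in this format can be tallied with a few lines of awk. This is a minimal sketch, assuming the five-column layout shown above (date, time, SSL check, HTTP status, curl return code); the function name summarize_curlstatus is hypothetical.

```shell
#!/bin/bash
# Sketch only: summarize_curlstatus is an illustrative helper name.
# Expected line format: DATE TIME SSL_VERIFY HTTP_CODE CURL_EXIT

# Print the total number of logged requests and a count per non-zero curl exit code.
summarize_curlstatus() {
  awk '
    NF >= 5 { total++; if ($5 != 0) fails[$5]++ }
    END {
      printf "total=%d\n", total
      for (code in fails) printf "curl_exit_%s=%d\n", code, fails[code]
    }
  ' "$1" | sort
}
```

Running summarize_curlstatus /tmp/curlstatus prints one line per failing exit code plus the total, which makes it easier to compare the failure rate between hosts or between days.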
