Certbot is frequently timing out since a few weeks ago

OK, I learned how to do the packet dump.

tcpdump -s0 port 80 > /root/tcpdump.txt

It produced hundreds of lines that contain letsencrypt.org. I removed all the other lines. I’m certain this dump contains all the activity for just 1 validation event. I verified that my script does NOT repeat this action when it fails. Any repetition should be internal to certbot behavior in this dump.

This forum doesn’t allow me to upload files as a new user. I have posted it in plain text on my web site:

https://www.farbeyondcode.com/z/-vf.0.0.0.88.3C8E6DBD179731A7CE4859D08992502E2F7DE816D81DBAB23F43EDB455710FF1

Right now renewals succeed about 80% of the time, which was annoying since I wanted one to fail for this capture.

I do have tweaks in my Linux /etc/sysctl.conf file to protect the server and improve performance under load. I tried removing/changing some of those values today too, but it didn’t make the problem stop, and my configuration hasn’t been a problem for other normal internet robots/users. I’m just mentioning it since you are talking about SYN, and Linux can automatically protect against some bad packet behavior like SYN floods or slow packets. I wouldn’t know if you have a system that skips packets or sends them in the wrong sequence.

Just to be clear, I don’t know enough to be able to interpret what I’m sending you in this dump. I’m not saying that the dump indicates a problem. I’m just providing what was requested to be helpful.

While you were writing this post, I noticed in my own investigation (the one jsha referred to) that the SYNs that are being rejected have a TCP Timestamp value in a vastly different range from the SYNs that succeed.

I am in the process of getting permission to tweak the tcp_tw_recycle and timestamp (if necessary) sysctls to see if it affects the reproduction of the issue.

BTW, it would be way better if you could store a non-text version of the packet capture (-w out.pcap) and then use Wireshark UI with a display filter to cut out the non-Let's Encrypt traffic, and then re-save the pcap file. I think your text file is missing some stuff and a pcap file allows automatic analysis of TCP flows.
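If you'd rather not do the filtering in the Wireshark UI, here is a rough sketch of the same workflow with tcpdump alone. It assumes tcpdump is installed and run as root. The helper names and file names are placeholders I made up, and the filter expression is whatever matches the validation servers' addresses, which I'm not filling in here:

```shell
# Record raw packets on port 80 into a binary pcap file instead of text.
# -s0 keeps whole packets, -w writes pcap; stop the capture with Ctrl-C.
capture_http() {
    tcpdump -s0 -w "$1" port 80
}

# Re-read a saved capture and write only the packets matching a
# pcap filter expression (for example "host <address>") to a new file.
filter_capture() {
    in_file=$1
    out_file=$2
    shift 2
    tcpdump -r "$in_file" -w "$out_file" "$@"
}

# Usage (hypothetical file names):
#   capture_http /root/out.pcap
#   filter_capture /root/out.pcap /root/le-only.pcap host <address>
```

The resulting file can still be opened in Wireshark for flow analysis.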

I’m using certbot 0.28.0 on ubuntu 14.04 with recent patches. We’ll be switching to ubuntu 18.04 soon.

The command is structured like this:
/usr/bin/certbot certonly -a webroot --email … --webroot-path … --csr … --renew-by-default --agree-tos 2>&1

So, the problem on my side is fixed after a few days of work.

Here are the details:

The client applied some settings following an attack. These settings were:

fs.suid_dumpable=1
vm.swappiness=0
kernel.memcg_oom_disable=1
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_synack_retries=1
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=1

Upon reverting them to their defaults, the dropped SYNs disappeared, and the timeouts are now gone and impossible to reproduce.

:partying_face:

This is what I use on some of those settings right now. I don’t know if it would help to change it to 0 on my end, but I don’t need permission lol.

net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes=5
net.ipv4.tcp_keepalive_time = 600

I’ll try to upload a pcap filtered with Wireshark soon. I might be a network pro by the end of this.

Oh nice. So was that made live for all users, including me, and I should see if it fails?

Did you mean I should turn off tw_reuse and tw_recycle, perhaps?

Yes, exactly that. That’s the solution that worked for my customer.

I updated /etc/sysctl.conf with

net.ipv4.tcp_tw_reuse=0
net.ipv4.tcp_tw_recycle=0

and ran sysctl --system to load it.
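To double-check that the running kernel actually picked up the new values, something like the following could be used. This is only a sketch: `live_sysctl` is a helper name I made up, and it reads the values from /proc/sys, printing "absent" when the kernel does not expose a setting.

```shell
# Print the live value of a net.ipv4 sysctl, or "absent" when the
# kernel does not expose it. The optional second argument is the
# proc root; it is only there so the function can be pointed at a
# test directory.
live_sysctl() {
    path="${2:-/proc/sys}/net/ipv4/$1"
    if [ -r "$path" ]; then
        cat "$path"
    else
        echo "absent"
    fi
}

# Show the settings discussed in this thread.
for name in tcp_tw_reuse tcp_tw_recycle tcp_timestamps; do
    echo "net.ipv4.$name = $(live_sysctl "$name")"
done
```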

I did 10 certificates without a problem. I’ll assume there is no further issue for now.

Hopefully this helps reliability of other connections too.

I don’t think I had this problem when I originally used the service, so maybe something in the stack has changed since 2-3 months ago to make it not work well with that feature. Hitting a server with a lot of connections in a load test is one of the reasons I have the configuration I do. I wanted to have it make the maximum number of ports open as quickly as possible. Not sure if these settings in particular are necessary, but that’s what I had done with my online research.

Thank you for providing what seems to be an answer!

I agree. My customer used our ACME client for like 2 years before reporting this issue in the last few weeks.

Maybe Let's Encrypt slightly changed their networking (NAT or mitigation against attacks) and began producing different TCP timestamps, idk.

Great to hear that it worked for you!

Thank you for the excellent and thorough debugging. I’m glad you’ve both solved it. Thinking back to what might have changed somewhat recently: In mid-November we changed our EDNS buffer size to 512, triggering TCP fallback for a much larger fraction of our DNS queries. I believe our DNS and our HTTP connections come from the same IP via NAT. It’s possible that the much higher rate of TCP connection creation caused some change in the generated TCP timestamps. Does that match up with approximately when you started seeing the problem, @skyflare?

Edit2: I don’t think this explains it in my case, because the nameservers were totally different hosts from the webserver. Unless the TS generation is actually invalid, but I’m not an expert there (are they basically just opaque values? or did the NAT cause some kind of wild wraparound due to the amount of traffic?).

Yes, the original problem was happening in December for me, when my renewals happened. I don’t have renewals every day, so I can’t be more specific.

I have a new theory about what changed. These are the sysctls you had in common:

net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_fin_timeout = 15 (/10)

I did some searching to better understand tw_recycle, and found this article:

I haven't read the whole thing yet, but right up at the top it says:

the net.ipv4.tcp_tw_recycle option is quite problematic for public-facing servers as it won’t handle connections from two different computers behind the same NAT device, which is a problem hard to detect and waiting to bite you

On December 10, we went from a single VA instance per datacenter to two VA instances per datacenter. These instances are behind NAT. The timing seems to match up with the start of the problems.

Would you be able to give a hint as to whether Let’s Encrypt plans to counteract this on their end or whether it’s going to be left as-is?

I was thinking of updating our ACME client to include it in telemetry and notify the cPanel administrator if it detects the presence of those sysctls on the system. I’m not sure how common the flag is but I remember seeing it a lot when researching how to handle large amounts of traffic on Linux.

Perhaps include the output for -debug requests…
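A client-side check along those lines might look roughly like this. It is a sketch, not how any existing ACME client does it: `sets_tw_recycle` is a hypothetical helper, it only scans the usual sysctl config files for an uncommented `net.ipv4.tcp_tw_recycle = 1` line, and it ignores the fact that a later file can override an earlier one.

```shell
# Return 0 if the given sysctl-style file sets net.ipv4.tcp_tw_recycle
# to 1 on an uncommented line; spaces around "=" are allowed.
sets_tw_recycle() {
    grep -Eq '^[[:space:]]*net\.ipv4\.tcp_tw_recycle[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$1"
}

# Scan the common config locations and warn about each file that
# enables the option. Unmatched globs and missing files are skipped.
for f in /etc/sysctl.conf /etc/sysctl.d/*.conf; do
    [ -f "$f" ] || continue
    if sets_tw_recycle "$f"; then
        echo "warning: $f enables net.ipv4.tcp_tw_recycle"
    fi
done
```

Checking the live value under /proc/sys as well would catch settings applied by other means.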

We could probably fix this by changing our network so the VA is not NATed. Right now I think we probably shouldn't put in the work, since what I've been reading suggests that tcp_tw_recycle is an anti-pattern, and we shouldn't make special efforts to accommodate it. However, I'm open to changing my mind, especially if we find that a lot of people have this problem.

BTW, I read more of that article, and it explains why you found that the TCP Timestamps correlated with errors:

Linux will drop any segment from the remote host whose timestamp is not strictly bigger than the latest recorded timestamp, unless the TIME-WAIT state would have expired
When the remote host is in fact a NAT device, the condition on timestamps will forbid all the hosts except one behind the NAT device to connect during one minute because they do not share the same timestamp clock.

Starting from Linux 4.10 (commit 95a22caee396), Linux will randomize timestamp offsets for each connection, making this option completely broken, with or without NAT. It has been completely removed from Linux 4.12.
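To make the quoted rule concrete, here is a toy model in shell. It is not kernel code, only an illustration of the "strictly bigger timestamp" check as the article describes it: `last_ts` stands in for the timestamp remembered for one remote IP, the timestamp numbers are made up, and the TIME-WAIT expiry exception is ignored.

```shell
# Toy model: the last timestamp recorded for one remote (NAT) address.
last_ts=0

# Accept a SYN only if its TCP timestamp is strictly greater than the
# last one recorded for that address; otherwise drop it.
# The result is stored in the variable "verdict".
check_syn() {
    if [ "$1" -gt "$last_ts" ]; then
        last_ts=$1
        verdict=accept
    else
        verdict=drop
    fi
}

# Two hosts behind the same NAT address, each with its own timestamp
# clock: one around 900000, the other around 1200.
for ts in 900000 1200 900050 1300; do
    check_syn "$ts"
    echo "SYN ts=$ts -> $verdict"
done
```

In this model the host with the lower clock is dropped every time, which would fit the two-VA-instances-behind-NAT theory above.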
