Certbot is frequently timing out since a few weeks ago

DDoS protection for HTTP traffic has to be applied to my server via a support ticket, so it is not always on. I use ufw as my firewall, which I configured myself to limit connections from the same IP to 20 per second. It shouldn't be blocking anything else on port 80.
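(For context: ufw's built-in "limit" rule is fixed at 6 connections per 30 seconds, so a 20-per-second per-IP cap ends up as a custom rule in /etc/ufw/before.rules. A sketch of that kind of rule follows; it is simplified and not my literal config.)

# drop new HTTP connections from any single source IP above 20 per second
-A ufw-before-input -p tcp --dport 80 -m conntrack --ctstate NEW -m hashlimit --hashlimit-above 20/sec --hashlimit-mode srcip --hashlimit-name http-per-ip -j DROP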

I'm also using nginx as the web server, and I have a dedicated rewrite rule and location block for the /.well-known/ URLs so they bypass my application.
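The location block looks something like this (simplified; the webroot path here is illustrative, not my exact config):

# serve ACME challenge files straight from disk, bypassing the application
location ^~ /.well-known/acme-challenge/ {
    root /var/www/letsencrypt;
    default_type "text/plain";
    try_files $uri =404;
}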

We’ve also seen a few reports of service providers blocking Let’s Encrypt’s validation servers due to presence on abuse lists. Does your service provider use any such list?

If we were blocking anyone, it wouldn't work again just seconds later. My host provides the hardware and network, and nobody besides me is involved, so I know how it is set up. They don't touch our machine or the traffic to it.

I'm surprised that it wasn't found to be a capacity problem on your end. If you want me to loop certbot every few seconds until it works, I can do that. I just wanted to report that the service became unreliable recently, which makes me nervous since we now rely on it for hundreds of domains.

I thought there might be something you could do to adjust the capacity, concurrency, or queuing of the system, or have certbot internally retry some number of times on timeout failures. It does time out fairly quickly. I'd guess there are temporary high-speed bursts of activity on your end that exceed connection limits or simply take too long to finish.

Thanks for checking and following up with us! Might it be possible for you to take packet captures (with tcpdump or similar) during validation attempts, and inspect them when an attempt fails? That could be interesting to look at, since I haven't been able to correlate the timestamps of your errors with any known periods of capacity or network problems.


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

I wanted to let you know that the problem described in my original post is still ongoing and generates a constant stream of error logs because validation is so unreliable.

https://community.letsencrypt.org/t/certbot-is-frequently-timing-out-since-a-few-weeks-ago/81411/13

About 70% of all my certbot validations fail.

We ran MTR and traceroute tests from our server to the public internet and didn't find any packet loss or problems with our network or hardware. I don't know how to test connections to your systems, since your tool initiates the connection. Is there a hostname I can use to look up an IP and test it? I can assure you we're on premium hosting and bandwidth with low latency. Our host does not filter any of our traffic.

I have to repeat the command against your system 2 to 5 times to get each certificate. I repeat it once an hour until it works. It ALWAYS manages to renew all the certificates eventually; the issue is purely the reliability of the service.

If this can't be fixed, I may have to redesign the script so it doesn't alert me until a renewal has failed something like 5 times, since I'm seeing a lot of error alerts whenever renewals come up.
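Something like this sketch, where the certbot arguments, domain, and alert address are placeholders:

#!/bin/bash
# retry one renewal up to 5 times, an hour apart, and only alert
# after the final failure instead of on every timeout
MAX_ATTEMPTS=5
CERTBOT_CMD="/usr/bin/certbot certonly -a webroot --webroot-path /var/www/letsencrypt -d example.com"

for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
    if $CERTBOT_CMD; then
        exit 0  # success, no alert
    fi
    [ "$attempt" -lt "$MAX_ATTEMPTS" ] && sleep 3600
done
echo "renewal still failing after $MAX_ATTEMPTS attempts" | mail -s "cert renewal alert" admin@example.com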

I don't know how this tool works internally, but it seems like it shouldn't fail this often, especially when all it has to do is fetch a small static file over HTTP.

I still use the HTTP webroot plugin as before. Would it be possible for me to implement the validation response in my application instead of relying on this tool? Maybe it has a timing or configuration problem and gives up too quickly.

Moving your post to the existing thread and reopening, in the hope that it gets the attention of the appropriate people :slight_smile:

It might be that something is blocking too many concurrent requests (like DDoS protection).
Each time you run it, some requests are allowed through, and those validations are cached and not tried again.
Eventually all of them have been allowed and the process can proceed.

So, is there an IPS in place that is affecting the (multiple simultaneous) inbound connections?

Or perhaps there are multiple (load-balanced) systems involved, and sometimes the validation request hits the server that holds the challenge file and sometimes it doesn't.
Again, the validations that passed the test recently are cached, so it will eventually pass all the tests.

Hard to say exactly without more specific details…

For what it's worth, I'm currently working with a customer who is experiencing around 40% of ACMEv2 orders failing due to challenge connect timeouts.

If I ever get to the bottom of it I'll be happy to share my learnings here.


We don't do any network filtering on port 80 for this request; it is set up to bypass my application with an nginx rewrite to the static file. There is no reason for this activity to look like an attack unless certbot is doing something wrong.

My script is automated and repeats, but I space the letsencrypt commands out to be very slow, 5 seconds apart, and it never retries the same domain until an hour later. It's not going to hit our usage limits unless each individual command internally makes a bulk of requests.

Is there a way to do what certbot does without using it?

There are lots of alternative applications and libraries that do what certbot does, and variations thereof. However, it doesn't seem like Certbot is the problem per se: it's using the HTTP-01 challenge, which causes Let's Encrypt (the certificate authority, not Certbot) to make an HTTP request back to your web server, and that request seems to be timing out.
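You can approximate the validator's request yourself from an outside machine; the domain and token here are placeholders:

# Let's Encrypt fetches http://<your domain>/.well-known/acme-challenge/<token>
# over plain HTTP on port 80; this simulates one such request
curl -v --max-time 10 http://example.com/.well-known/acme-challenge/test-token

Keep in mind this only tests one network path, while Let's Encrypt validates from its own infrastructure.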


Hi @skyflare

We've gotten a report from another user with similar symptoms. When they collected packet captures, they found that their machine was receiving SYN packets from Boulder's validation attempts but was not replying to them. That doesn't fully resolve the issue, of course, and they are still investigating, but it helped narrow down the range of possible causes. Would you be willing to take some packet captures during one of your failed validation attempts and tell us whether you see the same symptom?
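For example, filtering a capture down to handshake packets makes that symptom easy to spot: inbound SYNs with no SYN-ACK going back out. The interface name here is just an example:

# show only packets with the SYN flag set (SYNs and SYN-ACKs) on port 80
tcpdump -n -i eth0 'port 80 and tcp[tcpflags] & tcp-syn != 0'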

Thanks,
Jacob

OK, I learned how to do the packet dump.

tcpdump -s0 port 80 > /root/tcpdump.txt

It produced hundreds of lines containing letsencrypt.org; I removed all the other lines. I'm certain this dump contains all the activity for just one validation event. I verified that my script does NOT repeat the command when it fails, so any repetition in this dump is internal certbot behavior.

This forum doesn’t allow me to upload files as a new user. I have posted it in plain text on my web site:

https://www.farbeyondcode.com/z/-vf.0.0.0.88.3C8E6DBD179731A7CE4859D08992502E2F7DE816D81DBAB23F43EDB455710FF1

Renewals were succeeding about 80% of the time just now, which was annoying because for once I wanted them to fail.

I do have tweaks in my Linux /etc/sysctl.conf file to protect the machine and improve performance under load. I tried removing or changing some of those values today, but it didn't make the problem stop, and this configuration has never caused trouble for normal internet robots or users. I only mention it because you brought up SYN packets, and Linux can automatically protect against some bad packet behavior, such as SYN floods or slow packets. I wouldn't know whether your system skips packets or does the packet sequence in an unusual order.

Just to be clear, I don't know enough to interpret what I'm sending you in this dump. I'm not saying the dump indicates a problem; I'm just providing what was requested in the hope that it helps.

While you were writing this post, I noticed in my own investigation (the one jsha referred to) that the SYNs being rejected have a TCP Timestamp value in a vastly different range from the SYNs that succeed.

I am in the process of getting permission to tweak the tcp_tw_recycle and (if necessary) timestamp sysctls to see whether that affects the reproduction of the issue.

By the way, it would be much better if you could save the capture in binary form (-w out.pcap) and then use the Wireshark UI with a display filter to cut out the non-Let's Encrypt traffic, then re-save the pcap file. I think your text file is missing some detail, and a pcap file allows automatic analysis of the TCP flows.
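Concretely, something like this (the file name and display filter are examples):

# capture full packets in binary form instead of text
tcpdump -n -s0 -w /root/validation.pcap port 80
# then open validation.pcap in Wireshark, apply a display filter such as
# ip.addr == <validator IP from the failed attempt>, and use
# File > Export Specified Packets to save just the filtered flows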


I'm using certbot 0.28.0 on Ubuntu 14.04 with recent patches. We'll be switching to Ubuntu 18.04 soon.

The command is structured like this:
/usr/bin/certbot certonly -a webroot --email … --webroot-path … --csr … --renew-by-default --agree-tos 2>&1

So, the problem on my side is fixed after a few days of work.

Here are the details:

The client applied some settings following an attack. These settings were:

fs.suid_dumpable=1
vm.swappiness=0
kernel.memcg_oom_disable=1
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_synack_retries=1
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=1

Upon reverting them to their defaults, the dropped SYNs disappeared; the timeouts are now gone and impossible to reproduce.
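For anyone wanting to replicate the revert, it amounts to something like this. The values shown are the stock defaults on most Linux kernels of that era, so double-check against your distribution (and note that tcp_tw_recycle was removed entirely in Linux 4.12):

# revert the attack-mitigation settings to typical kernel defaults
sysctl -w net.ipv4.tcp_fin_timeout=60
sysctl -w net.ipv4.tcp_synack_retries=5
sysctl -w net.ipv4.tcp_congestion_control=cubic
sysctl -w net.ipv4.tcp_tw_reuse=0
sysctl -w net.ipv4.tcp_tw_recycle=0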

:partying_face:


This is what I currently use for some of those settings. I don't know if changing them to 0 would help on my end, but I don't need permission lol.

net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_time = 600

I'll try to upload a pcap filtered with Wireshark soon. I might be a network pro by the end of this.

Oh nice. So was that made live for all users, including me, and should I check whether it still fails?

Did you mean I should turn off tcp_tw_reuse and tcp_tw_recycle, perhaps?