Certbot has been frequently timing out for the past few weeks

I think there is nothing I can configure to adjust the timeout. It appears that the capacity of the certificate validation system has been hit for the last several weeks, or that there are different usage limits per IP than described in the documentation.

I want to let you guys know that the system used to be reliable and now it is failing repeatedly, at almost any time of day.

I’m only issuing a few certificates per day typically, and I put a 5 second delay between renewing each one and it still fails. Am I supposed to wait even longer? It doesn’t seem like a usage limit is being reached. It seems like the system is at capacity and is unavailable frequently.

It seems like the correct way to use certbot is to do an infinite loop retrying the same command until it works?

I also rebuilt the way I do validation to use HTTP since HTTPS had a security issue or something, so I’m only validating with HTTP.

I don’t think the details of the command being used are relevant to the issue. It’s certainly not a firewall or DNS problem either, since the command always works after repeated attempts.

Could you share the actual error Certbot’s presenting you with? Are you timing out connecting to Let’s Encrypt, or is the Let’s Encrypt validation authority receiving a timeout when attempting to connect to your domain? Or perhaps to your authoritative DNS servers?

Hi @skyflare

What's your domain name? There are a lot of problems with nameservers that don't answer, time out, don't accept TCP queries, etc.

If nameservers are hanging (top-level domain, next domain ...), then Let's Encrypt can't validate your domain.

Hi @skyflare,

We appreciate your reporting this problem, but it’s not a common one, so diagnosing it will require more information about your specific configuration, like the information requested by @jared.m and @JuergenAuer.

Thank you for following up with me. I just did 4 certificates successfully, but the 5th one failed with the same error message again. There were several seconds between attempts, and I tried a different domain each time to avoid hitting the usage limit. The domain was not the cause though; it can happen on any of them. I also checked the DNS just now, and 104.156.48.89 is the correct IP and the root and www domains work correctly.

Because I have a custom-made application, I am using the webroot plugin on purpose to automate this. My tool generates these commands, so I can’t type them wrong between attempts. Our web server is not even close to being at capacity on our end. It’s a high-end dedicated server that is mostly idle. The DNS for this domain is hosted by Amazon Route 53.

Here are the unmodified commands and the error response:

/usr/bin/openssl ecparam -genkey -name prime256v1 > '/etc/letsencrypt/live-ecdsa/155/ecdsa.key'

result: (nothing)

/usr/bin/openssl req -new -sha256 -key '/etc/letsencrypt/live-ecdsa/155/ecdsa.key' -subj '/CN=orlandoluxuryproperty.net' -reqexts SAN -config '/etc/letsencrypt/live-ecdsa/155/ecdsa.config' -outform der -out '/etc/letsencrypt/live-ecdsa/155/ecdsa.csr' 2>&1

result: (nothing)

/usr/bin/certbot certonly -a webroot --email 'zgraphservers@zgraph.com' --webroot-path '/var/jetendo-server/jetendo/sites/orlandoluxuryproperty_net/' --csr '/etc/letsencrypt/live-ecdsa/155/ecdsa.csr' --renew-by-default --agree-tos 2>&1

result:
Plugins selected: Authenticator webroot, Installer None
Performing the following challenges:
http-01 challenge for orlandoluxuryproperty.net
http-01 challenge for www.orlandoluxuryproperty.net
Using the webroot path /var/jetendo-server/jetendo/sites/orlandoluxuryproperty_net for all unmatched domains.
Waiting for verification…
Cleaning up challenges
Failed authorization procedure. orlandoluxuryproperty.net (http-01): urn:ietf:params:acme:error:connection :: The server could not connect to the client to verify the domain :: Fetching http://www.orlandoluxuryproperty.net/.well-known/acme-challenge/5jp4fide-v2z6Mt1CBOcDYQWvys4500_jS3lvS9FMs0: Timeout during connect (likely firewall problem)

IMPORTANT NOTES:
- The following errors were reported by the server:
Domain: orlandoluxuryproperty.net
Type: connection
Detail: Fetching http://www.orlandoluxuryproperty.net/.well-known/acme-challenge/5jp4fide-v2z6Mt1CBOcDYQWvys4500_jS3lvS9FMs0: Timeout during connect (likely firewall problem)
To fix these errors, please make sure that your domain name was entered correctly and the DNS A/AAAA record(s) for that domain contain(s) the right IP address. Additionally, please check that your computer has a publicly routable IP address and that no firewalls are preventing the server from communicating with the client. If you’re using the webroot plugin, you should also verify that you are serving files from the webroot path you provided.

@lestaff, could someone take a look at connectivity to this host? It looks fine to me but it’s getting validation timeouts.

Connectivity looks fine to us as well. I see the same error message in our logs but nothing involving our rate limits, or any unusual rate of connectivity problems to other hosts.

Is it possible there’s a DDoS protection or Web application firewall service that might be rate limiting the Let’s Encrypt validation servers here? I believe this IP’s hosting provider does offer that service.

DDoS protection for HTTP traffic is only applied to my server if I open a ticket; it is not always on. I use ufw as the firewall, which I configured myself to limit connections from the same IP to 20 per second. It shouldn’t be blocking anything else on port 80.

I’m also using nginx for the web server, and I have a dedicated rewrite rule and location for the /.well-known/ urls to let it bypass my application.

We’ve also seen a few reports of service providers blocking Let’s Encrypt’s validation servers due to presence on abuse lists. Does your service provider use any such list?

If we were blocking anyone, it wouldn’t work again just seconds later. My host provides hardware and network and there aren’t others involved besides me, so I know how it is set up. They don’t get involved with our machine or the traffic to it.

I’m surprised that it wasn’t found to be a capacity problem on your end. If you want me to loop certbot every few seconds until it works, I can do that. I just wanted to share that it became unreliable recently, which made me nervous since we are now using the service for hundreds of domains.

I thought there might be something you could do to adjust the capacity / concurrency / queuing of the system, or to make the certbot internals retry X times on timeout failures. It does time out fairly quickly. I’d think there are temporary high-speed bursts of activity on your end, which may exceed the connection limits or just take too long to finish.
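
If retrying on the client side turns out to be the workaround, a bounded retry with a growing pause is probably safer than an endless loop. The following is only a sketch: `retry_with_backoff` is a hypothetical helper name, not a Certbot feature, and the attempt count and delay are placeholders. Every failed run also counts toward Let's Encrypt's failed-validation rate limit, so the number of attempts should stay small.

```shell
# Hypothetical helper (not part of Certbot): run a command up to
# <max-attempts> times, doubling the pause after each failure.
# Usage: retry_with_backoff <max-attempts> <initial-delay-seconds> <command> [args...]
retry_with_backoff() (
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Stop as soon as the wrapped command succeeds.
        if "$@"; then
            exit 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            exit 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done
)

# Example (not run here): three attempts with pauses of 60 and 120 seconds.
# retry_with_backoff 3 60 your-certbot-command-here
```

The function body runs in a subshell, so its variables do not leak into the calling script, and the function's exit status reports whether any attempt succeeded.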

Thanks for checking and following up with us! Might it be possible for you to take packet captures (with tcpdump or similar) during validation attempts, and inspect them if the attempt fails? I think that could be interesting to look at, since I haven’t been able to correlate the timestamps of your errors with any known periods of capacity or network problems.
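
One way to take such a capture is to wrap each validation attempt so that tcpdump records port 80 for the duration of the run. A rough sketch, assuming Linux (for the `any` pseudo-interface) and root privileges; `capture_during` is a hypothetical helper name, and the `TCPDUMP_CMD` variable is likewise made up here so the capture tool can be swapped out:

```shell
# Hypothetical helper: capture TCP traffic on one port while a command runs.
# Usage: capture_during <pcap-file> <port> <command> [args...]
capture_during() (
    pcap_file=$1
    port=$2
    shift 2

    # Start the capture in the background; -nn disables name lookups,
    # -w writes raw packets to the file for later inspection.
    "${TCPDUMP_CMD:-tcpdump}" -i any -nn -w "$pcap_file" "tcp port $port" &
    capture_pid=$!
    sleep 1    # give the capture a moment to start

    # Run the wrapped command and remember its exit status.
    status=0
    "$@" || status=$?

    sleep 1    # let any trailing packets arrive
    kill "$capture_pid" 2>/dev/null || true
    wait "$capture_pid" 2>/dev/null || true
    exit "$status"
)

# Example (not run here, needs root):
# capture_during /tmp/validation.pcap 80 your-certbot-command-here
```

The wrapper returns the wrapped command's exit status, so a script can keep the capture file only when validation failed.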

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

I wanted to let you know that the problem described in my original post is still on-going and generates a constant stream of error logs because the validation is so unreliable.

https://community.letsencrypt.org/t/certbot-is-frequently-timing-out-since-a-few-weeks-ago/81411/13

About 70% of all my certbot validations fail.

We ran MTR and traceroute tests from our server to the public internet and didn’t find any packet loss or problems with our network/hardware. I don’t know how to test connections to your systems since the tool makes the connection. Is there a hostname I can resolve to an IP and test against? I can assure you we’re on premium hosting/bandwidth with low latency. Our host does not filter any of our traffic.

I have to repeat the commands against your system 2 to 5 times in order to get each certificate. I repeat the command once an hour until it works. It ALWAYS is able to renew all the certificates eventually. The issue is only with reliability of the service.

I may have to redesign the script to stop sending me errors until it fails like 5 times if this can’t be fixed, since I’m seeing a lot of error alerts from this when renewals come up.
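
A sketch of that kind of alert suppression, assuming the renewal script can keep a small state directory on disk. `should_alert`, the `.failures` file suffix, and the threshold of 5 are all hypothetical choices for illustration:

```shell
# Hypothetical helper: count consecutive failures per domain and report
# (exit status 0) only once the threshold has been reached.
# Usage: should_alert <state-dir> <domain> <exit-status> <threshold>
should_alert() (
    state_dir=$1
    domain=$2
    exit_status=$3
    threshold=$4
    counter_file="$state_dir/$domain.failures"

    # A successful renewal resets the counter and never alerts.
    if [ "$exit_status" -eq 0 ]; then
        rm -f "$counter_file"
        exit 1
    fi

    # Otherwise increment the stored count of consecutive failures.
    count=0
    if [ -f "$counter_file" ]; then
        count=$(cat "$counter_file")
    fi
    count=$((count + 1))
    echo "$count" > "$counter_file"

    [ "$count" -ge "$threshold" ]
)

# Example (not run here; send_alert stands in for your own notification):
# your-certbot-command-here; status=$?
# if should_alert /var/tmp/renew-state orlandoluxuryproperty.net "$status" 5; then
#     send_alert "renewal keeps failing"
# fi
```

With hourly retries, a threshold of 5 means an alert only fires after about five hours of uninterrupted failure for the same domain.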

I don’t know how this tool works, but it seems like it should not fail this often especially if all it has to do is pull a small static file via HTTP.

I still use the HTTP webroot plugin as before. Is it possible for me to implement the function of the certbot validation in my application instead of relying on this tool? Maybe there is a timing or configuration problem with it that is too fast.

Moving your post to the existing thread and reopening, in the hope that it gets the attention of the appropriate people :slight_smile:

It might be that something (like DDoS protection) is blocking too many concurrent requests.
Each time you run it, some are allowed and thus cached, and those are not tried again.
So eventually all have been allowed and the process can proceed.

So, is there an IPS in place that is affecting the (multiple simultaneous) inbound connections?

Or perhaps there are multiple (load-balanced) systems involved, and sometimes the validation request hits the system that requested the certificate and sometimes it doesn’t.
Again, the ones that passed the test recently are cached, so that eventually all the tests pass.

Hard to say exactly without more specific details…

For what it's worth, I'm currently working with a customer who is experiencing around 40% of ACMEv2 orders failing due to challenge connect timeouts.

If I ever get to the bottom of it I'll be happy to share my learnings here.

We don’t do any network filtering on port 80 for this request; it is set up to bypass my application with an nginx rewrite to the static file. There is no reason for this activity to look like an attack unless certbot is written wrong.

My script is automated and repeats, but I’m spacing out the letsencrypt command to be very slow, 5 seconds apart and it never retries the same domain until an hour later. It’s not going to hit our usage limits unless it is internally doing a bulk amount of requests per individual command.

Is there a way to do what certbot does without using it?

There are lots of alternative applications and libraries that do what certbot does, and variations thereof. However, it doesn’t seem like Certbot is the problem per se - it’s using the HTTP-01 challenge, which causes Let’s Encrypt (the certificate authority, not Certbot) to make an HTTP request back to your web server, and that request seems to be timing out.
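
To see roughly what that validation request looks like from the outside, one can place a test file under the challenge directory and fetch it with a connect timeout. This is a sketch only: `challenge_url` and `probe_challenge` are hypothetical helper names, the timeout values are arbitrary, and a probe from your own network does not travel the same path as Let's Encrypt's validation servers, so a success here proves little.

```shell
# Hypothetical helper: build the URL that an HTTP-01 validation fetches.
# Usage: challenge_url <domain> <token-or-test-filename>
challenge_url() {
    printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

# Hypothetical helper: fetch that URL and print the HTTP status code plus
# the connect and total times, giving up if the connection takes too long.
probe_challenge() {
    curl -s -o /dev/null --connect-timeout 10 --max-time 20 \
        -w '%{http_code} connect=%{time_connect}s total=%{time_total}s\n' \
        "$(challenge_url "$1" "$2")"
}

# Example (not run here): create a file named probe.txt in the
# .well-known/acme-challenge/ directory of the webroot first, then:
# probe_challenge orlandoluxuryproperty.net probe.txt
```

Running the probe repeatedly from a host outside your own network would show whether connect times to port 80 vary or occasionally hang.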

Hi @skyflare

We’ve gotten a report from another user with similar symptoms. When they collected packet captures, they found that their machine was receiving SYN packets from Boulder’s validation attempts, but was not replying to those packets. This does not fully resolve the issue of course, and they are still investigating, but it helped narrow the zone of possible problems. Would you be willing to take some packet captures during one of your failed validation attempts and tell us whether you see the same symptoms?
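
For anyone checking their own capture for the same symptom, a quick way is to compare the number of SYN packets with the number of SYN-ACK replies. The sketch below assumes tcpdump's usual text output, where a bare SYN prints as `Flags [S],` and a SYN-ACK as `Flags [S.],`; `summarize_handshakes` is a hypothetical helper name.

```shell
# Hypothetical helper: read tcpdump text output on stdin and count
# connection attempts (SYN) versus replies (SYN-ACK).
summarize_handshakes() {
    awk '
        /Flags \[S\],/   { syn++ }
        /Flags \[S\.\],/ { synack++ }
        END { printf "SYN=%d SYN-ACK=%d\n", syn, synack }
    '
}

# Example (not run here): read a saved capture, keep only packets with the
# SYN bit set, and summarize them.
# tcpdump -nn -r /tmp/validation.pcap \
#     'tcp port 80 and tcp[tcpflags] & tcp-syn != 0' | summarize_handshakes
```

If the SYN count is clearly higher than the SYN-ACK count during a failed validation, the packets reached the machine but were not answered, which matches the symptom described above.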

Thanks,
Jacob