Certbot has been timing out frequently for the past few weeks


#1

I don’t think there is anything I can configure to adjust the timeout. It appears that the certificate validation system has been hitting its capacity for the last several weeks, or that the per-IP usage limits are different than described in the documentation.

I want to let you guys know that the system used to be reliable and now it is failing repeatedly, at almost any time of day.

I typically issue only a few certificates per day, and I put a 5-second delay between renewals, and it still fails. Am I supposed to wait even longer? It doesn’t seem like a usage limit is being reached. It seems like the system is at capacity and is frequently unavailable.

It seems like the correct way to use certbot is to do an infinite loop retrying the same command until it works?

I also rebuilt the way I do validation to use http since https had a security issue or something, so I’m only validating with http.

I don’t think the details of the command being used are relevant to the issue. It’s certainly not a firewall or DNS problem either, since the command always works after repeated attempts.


#2

Could you share the actual error Certbot’s presenting you with? Are you timing out connecting to Let’s Encrypt, or is the Let’s Encrypt validation authority receiving a timeout when attempting to connect to your domain? Or perhaps to your authoritative DNS servers?


#3

Hi @skyflare

What’s your domain name? There are a lot of problems with nameservers not answering, timing out, not accepting TCP queries, etc.

If nameservers are hanging (top-level domain, next domain …), then Let’s Encrypt can’t validate your domain.
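
As a rough way to check whether a given nameserver answers over both UDP and TCP, here is a minimal Python sketch using only the standard library. The function names are hypothetical, the nameserver address has to be supplied by you (an IPv4 address is assumed), and this is not how Let’s Encrypt performs its lookups. It only tells you whether a query gets a reply within the timeout.

```python
import socket
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query (QTYPE 1 = A record, class IN)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly `count` bytes from a stream socket."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf


def ask_udp(server: str, packet: bytes, timeout: float = 3.0) -> bytes:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        data, _ = sock.recvfrom(4096)
        return data


def ask_tcp(server: str, packet: bytes, timeout: float = 3.0) -> bytes:
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        # DNS over TCP prefixes every message with a 2-byte length.
        sock.sendall(struct.pack("!H", len(packet)) + packet)
        (length,) = struct.unpack("!H", recv_exact(sock, 2))
        return recv_exact(sock, length)


def check_nameserver(server: str, name: str) -> dict:
    """Return {'udp': ..., 'tcp': ...} describing whether each transport replied."""
    packet = build_query(name)
    results = {}
    for transport, ask in (("udp", ask_udp), ("tcp", ask_tcp)):
        try:
            reply = ask(server, packet)
            # A usable reply echoes our ID and has the response (QR) bit set.
            ok = len(reply) >= 12 and reply[:2] == packet[:2] and bool(reply[2] & 0x80)
            results[transport] = "ok" if ok else "unexpected reply"
        except OSError as exc:  # includes timeouts and refused connections
            results[transport] = f"failed: {exc}"
    return results


# Usage (fill in one of the domain's authoritative nameserver addresses):
# print(check_nameserver("<nameserver IPv4 address>", "orlandoluxuryproperty.net"))
```

Running it several times against each authoritative nameserver would show whether one of them answers only intermittently.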


#4

Hi @skyflare,

We appreciate your reporting this problem, but it’s not a common one, so diagnosing it will require more information about your specific configuration, like the information requested by @jared.m and @JuergenAuer.


#5

Thank you for following up with me. I just issued 4 certificates successfully, but the 5th one failed with the same error message again. There were several seconds between attempts, and I tried a different domain each time to avoid hitting the usage limit. The domain was not the cause, though; it can happen on any of them. I also checked the DNS just now: 104.156.48.89 is the correct IP, and the root and www domains work correctly.

Because I have a custom-made application, I am using the webroot plugin on purpose to automate this. My tool generates these commands, so I can’t type them wrong between attempts. Our web server is not even close to being at capacity on our end. It’s a high-end dedicated server that mostly sits idle. The DNS for this domain is hosted by Amazon Route 53.

Here are the unmodified commands and error responses:

/usr/bin/openssl ecparam -genkey -name prime256v1 > '/etc/letsencrypt/live-ecdsa/155/ecdsa.key'

result: (nothing)

/usr/bin/openssl req -new -sha256 -key '/etc/letsencrypt/live-ecdsa/155/ecdsa.key' -subj '/CN=orlandoluxuryproperty.net' -reqexts SAN -config '/etc/letsencrypt/live-ecdsa/155/ecdsa.config' -outform der -out '/etc/letsencrypt/live-ecdsa/155/ecdsa.csr' 2>&1

result: (nothing)

/usr/bin/certbot certonly -a webroot --email 'zgraphservers@zgraph.com' --webroot-path '/var/jetendo-server/jetendo/sites/orlandoluxuryproperty_net/' --csr '/etc/letsencrypt/live-ecdsa/155/ecdsa.csr' --renew-by-default --agree-tos 2>&1

result:

Plugins selected: Authenticator webroot, Installer None
Performing the following challenges:
http-01 challenge for orlandoluxuryproperty.net
http-01 challenge for www.orlandoluxuryproperty.net
Using the webroot path /var/jetendo-server/jetendo/sites/orlandoluxuryproperty_net for all unmatched domains.
Waiting for verification…
Cleaning up challenges
Failed authorization procedure. orlandoluxuryproperty.net (http-01): urn:ietf:params:acme:error:connection :: The server could not connect to the client to verify the domain :: Fetching http://www.orlandoluxuryproperty.net/.well-known/acme-challenge/5jp4fide-v2z6Mt1CBOcDYQWvys4500_jS3lvS9FMs0: Timeout during connect (likely firewall problem)

IMPORTANT NOTES:
- The following errors were reported by the server:

  Domain: orlandoluxuryproperty.net
  Type: connection
  Detail: Fetching http://www.orlandoluxuryproperty.net/.well-known/acme-challenge/5jp4fide-v2z6Mt1CBOcDYQWvys4500_jS3lvS9FMs0: Timeout during connect (likely firewall problem)

  To fix these errors, please make sure that your domain name was entered correctly and the DNS A/AAAA record(s) for that domain contain(s) the right IP address. Additionally, please check that your computer has a publicly routable IP address and that no firewalls are preventing the server from communicating with the client. If you’re using the webroot plugin, you should also verify that you are serving files from the webroot path you provided.


#6

@lestaff, could someone take a look at connectivity to this host? It looks fine to me but it’s getting validation timeouts.


#7

Connectivity looks fine to us as well. I see the same error message in our logs but nothing involving our rate limits, or any unusual rate of connectivity problems to other hosts.

Is it possible there’s a DDoS protection or Web application firewall service that might be rate limiting the Let’s Encrypt validation servers here? I believe this IP’s hosting provider does offer that service.


#8

DDoS protection for HTTP traffic has to be requested through a ticket before it is applied to my server; it is not always on. I use ufw as the firewall, which I configured myself to limit connections from the same IP to 20 per second. It shouldn’t be blocking anything else on port 80.


#9

I’m also using nginx for the web server, and I have a dedicated rewrite rule and location for the /.well-known/ urls to let it bypass my application.


#10

We’ve also seen a few reports of service providers blocking Let’s Encrypt’s validation servers due to presence on abuse lists. Does your service provider use any such list?


#11

If we were blocking anyone, it wouldn’t work again just seconds later. My host provides the hardware and network, and nobody besides me is involved, so I know how it is set up. They don’t get involved with our machine or the traffic to it.

I’m surprised that it wasn’t found to be a capacity problem on your end. If you want me to loop certbot every few seconds until it works, I can do that. I just wanted to share that it became unreliable recently, which made me nervous since we are now using the service for hundreds of domains.

I thought there might be something you could do to adjust the capacity / concurrency / queuing of the system, or to make certbot retry X times internally on timeout failures. It does time out fairly quickly. I’d think there are temporary high-speed bursts of activity on your end, which may exceed the connection limits or just take too long to finish.
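
A bounded retry with increasing delays, rather than an infinite loop, could be sketched roughly as follows. This is only an illustration and not something certbot does internally. `run_with_retries` and `certbot_once` are hypothetical names, the delay values are arbitrary, and the sketch assumes certbot exits with a non-zero status when authorization fails.

```python
import random
import subprocess
import time


def run_with_retries(run_once, max_attempts=5, base_delay=10.0,
                     max_delay=300.0, sleep=time.sleep):
    """Call run_once() until it returns True or the attempts run out.

    Returns the number of attempts used. Raises RuntimeError if every
    attempt fails. The delay doubles after each failure, up to max_delay,
    with a little random jitter added.
    """
    for attempt in range(1, max_attempts + 1):
        if run_once():
            return attempt
        if attempt < max_attempts:
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError(f"gave up after {max_attempts} attempts")


def certbot_once(argv):
    """Run one certbot command; True means it exited with status 0 (assumed success)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode == 0


# Usage, where certbot_argv is the argument list the tool already generates:
# attempts = run_with_retries(lambda: certbot_once(certbot_argv))
```

Capping the number of attempts means a persistent problem surfaces as an error instead of repeating forever.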


#12

Thanks for checking and following up with us! Might it be possible for you to take packet captures (with tcpdump or similar) during validation attempts, and inspect them if the attempt fails? I think that could be interesting to look at, since I haven’t been able to correlate the timestamps of your errors with any known periods of capacity or network problems.
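
One possible way to collect such captures automatically is to wrap each validation attempt in a small script like the sketch below. The helper names and the capture path are hypothetical. It assumes tcpdump is installed, that the script runs with enough privileges to capture packets, and that the Linux "any" pseudo-interface is available.

```python
import subprocess
import time


def build_tcpdump_argv(pcap_path, port=80, interface="any"):
    """Build a tcpdump command that records TCP traffic on one port to a file."""
    # -n: skip name resolution, -i: interface, -w: write raw packets to a file
    return ["tcpdump", "-n", "-i", interface, "-w", pcap_path,
            "tcp", "port", str(port)]


def capture_during(command, pcap_path, settle=1.0):
    """Run `command` while tcpdump records port 80; return the command's exit status."""
    capture = subprocess.Popen(build_tcpdump_argv(pcap_path))
    try:
        time.sleep(settle)  # give tcpdump a moment to start listening
        return subprocess.run(command).returncode
    finally:
        time.sleep(settle)   # catch any trailing packets
        capture.terminate()  # tcpdump closes the capture file on SIGTERM
        capture.wait()


# Usage, where certbot_argv is the command for one validation attempt:
# status = capture_during(certbot_argv, "/tmp/validation.pcap")
# Keep the capture file only when status is non-zero.
```

A capture from a failed attempt should show whether the validation servers’ connection attempts reached the machine at all, and whether they were answered.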