Getting new timeout errors during renewal as of a month ago (setup previously was working)

I've had my setup working properly for over a year now, and I'm not aware of any material changes, but I've been getting the error "Timeout during connect (likely firewall problem)" since 8 September. Just noticed it now because the old cert expired and I got an email. (This is on my home internet, so I don't bother with any kind of alerting system.)

My first thought was maybe my ISP started blocking port 80, but a) I have "business class" internet service from my ISP, and I'm told they're not blocking anything, and b) while certbot is running and waiting for a response, I can successfully access the file under .well-known/acme-challenge/ over HTTP from outside my network from three places: a VPS in a datacenter in Fremont, CA; an EC2 instance in AWS's us-east-1 region; and my phone while it's connected only to its cellular network.

My next thought was some other kind of selective filtering, but the only thing filtering port 80/443 on my side is fail2ban running on the web host. There are currently no bans in place, and nothing is filtering on the router (which forwards to the internal host).

Finally I checked the nginx log while manually running the renew command, and it does seem like something is getting through, despite the failure:

52.39.4.59 - - [03/Oct/2021:13:30:29 -0700] "GET /.well-known/acme-challenge/k2IxPNp0yl9Axvtvu3dpXKAO86wZ2o6TittMgbWE9Gs HTTP/1.1" 200 87 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)"
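
To see at a glance how many validation requests actually arrive per attempt, the access log can be tallied by challenge token and client IP. The following is a minimal Python sketch, not part of certbot or nginx. It assumes nginx's default "combined" log format and the user-agent string shown in the log line above; the function name validation_hits is made up for illustration.

```python
import re
from collections import Counter

# Matches an nginx "combined" log line: client IP, timestamp, request line,
# status, body size, referer, and user agent.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

CHALLENGE_PREFIX = "/.well-known/acme-challenge/"


def validation_hits(lines):
    """Count challenge requests per (token, client IP) pair."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m or not m["path"].startswith(CHALLENGE_PREFIX):
            continue
        # Assumption: the validator identifies itself with this user agent,
        # as in the log line quoted in this post.
        if "Let's Encrypt validation server" not in m["agent"]:
            continue
        token = m["path"][len(CHALLENGE_PREFIX):]
        hits[(token, m["ip"])] += 1
    return hits
```

Feeding it the lines of the access log after each certbot run (on Debian-based systems the log is typically /var/log/nginx/access.log, though that path is an assumption here) shows which validator addresses got through for each token.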

Update: I also tried passing --dry-run and --test-cert to certbot to use LE's staging environment. This time, LE successfully made three requests to my webserver (2 from US IPs, one from a DE IP), but I guess it wanted to make more requests, as it still failed.

--

My domain is: kelnos.spurint.org

I ran this command: certbot renew

It produced this output:

Attempting to renew cert (kelnos.spurint.org) from /etc/letsencrypt/renewal/kelnos.spurint.org.conf produced an unexpected error: Failed authorization procedure. kelnos.spurint.org (http-01): urn:ietf:params:acme:error:connection :: The server could not connect to the client to verify the domain :: Fetching http://kelnos.spurint.org/.well-known/acme-challenge/MClz0ibf9GmGoX2En2t-BZOtIdS7bMipjbtstqVp220: Timeout during connect (likely firewall problem). Skipping.

My web server is (include version): nginx 1.14.2-2+deb10u4

The operating system my web server runs on is (include version): Raspberry Pi OS (buster)

My hosting provider, if applicable, is: n/a (self-hosted at home)

I can login to a root shell on my machine (yes or no, or I don't know): yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): certbot 0.31.0

@kelnos A great post. Sadly, nothing jumps out at me except that you saw only one attempt in your nginx access log. Let's Encrypt makes several attempts from different locations around the world. Might you have some sort of geo-blocking in fail2ban or elsewhere? I see that all the external locations you tried were in the US, and the IP in the access log was also in the US.

Update: I just looked at a recent renewal and there were 4 attempts from Let's Encrypt: 3 from the US and 1 from Germany. So you probably cannot count on particular locations.

@MikeMcQ there shouldn't be any geo blocking. I also just completely disabled fail2ban and tried again, with the same result.

Regarding your update, I see what you mean about probably not being able to count on any particular locations. I tried a few more times, and sometimes I see no requests coming in, and other times I see just one (the same US-based IP again).

I dug through my router settings again, and I can't find anything that might be causing this block there, either.

Update: tried one more time and this time a request from 3.120.130.29 (Germany) successfully made it through. Not sure what to make of this...

Try shutting down nginx, then running certbot, then firing up nginx again. Assuming that works, something has gone wrong in the jiggery-pokery between your nginx and certbot's sharing of port 80 and you can light up the certbot developers about it.

Certbot does not share port 80. For the http challenge it places files in the location you tell it which will be served by nginx (in this case).
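
As a rough illustration of the webroot flow described above, here is a minimal Python sketch of what ends up on disk. It is not certbot's code, and the function names and arguments are hypothetical. The path layout and the file body (the key authorization) follow the http-01 challenge as defined in RFC 8555.

```python
import os


def place_challenge(webroot, token, key_authorization):
    """Write a challenge response where the web server will serve it.

    The validator fetches
    http://<domain>/.well-known/acme-challenge/<token>
    and expects the response body to be the key authorization.
    """
    challenge_dir = os.path.join(webroot, ".well-known", "acme-challenge")
    os.makedirs(challenge_dir, exist_ok=True)
    path = os.path.join(challenge_dir, token)
    with open(path, "w", encoding="ascii") as fh:
        fh.write(key_authorization)
    return path


def remove_challenge(path):
    """Delete the challenge file once validation has finished."""
    os.remove(path)
```

The web server never needs to know a challenge is in progress; it only has to serve that directory like any other static content, which is why nothing listens on port 80 except nginx in this mode.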

A fair suggestion to try restarting nginx. I would try a hard restart - not just a reload. It's a long shot but ...

@kelnos Always just the one request - hmm. Have you looked in the nginx error log for anything odd? Maybe set the level low enough to see lots.

@Ted I'm not using the certbot nginx plugin, just webroot mode. As I understand it, in this case, certbot should just be interacting with the filesystem and not trying to listen for requests, right?

Restarting nginx, though, is a generally good idea that I foolishly hadn't thought of. Unfortunately LE is rate-limiting me right now, so I'll have to wait a bit to test that.

Try --dry-run or --test-cert

I don't remember which works with which commands, and I have to step away.

When I first ran certbot on my setup I tried that interacting with the webserver horsesheet and it never worked. I always shut off the webserver, run certbot in the challenge mode that lets it be the web server, then restart the real webserver.

If I was running Amazon which got 200 queries a second 24x7 then I would mess around with the webroot mode. But then I wouldn't be using LE in the first place.

Otherwise it's just fancy programming from the certbot developers who figured "ooo this is a cool idea let's see if we can write a mode for certbot that will allow the webserver to respond to queries for the 0.5 seconds that it takes for certbot to do a renewal!!!"

As was pointed out in ST3, the more they overthink the plumbing the easier it is to stop up the drain.

I see. I'm not sure we're talking about the same thing, though. I do not use the mode where it interacts with the webserver. All it does is make a request to LE, drop a challenge response on the filesystem, and let LE's verification servers do their thing, with nginx already serving the files just as it would any other static file.

At any rate, the setup I've been using has been working perfectly for me for years on other hosts, and for about a year on this one, with no problems until now.

@MikeMcQ Tried --dry-run --test-cert (and also just --dry-run). Now I see three requests from LE (2 US IPs, 1 DE IP), but it still fails... I guess it wants four successful requests...

@MikeMcQ @Ted I also just tried shutting down nginx and using --standalone so certbot could bind to port 80 and serve the challenge response files directly. Still fails, unfortunately, with the same timeout message. (Still using staging environment as I've been rate limited.)

That's good. That means nothing is wrong with nginx so you can quit throwing time and energy into attempting various things with nginx.

So we are left with:

  • Possible block on your server from something you don't know about (maybe fail2ban, even though it's shut off, is still messing with the IP stack). If you run iptables, does it show any sort of interception going on in your tables? Maybe shut off iptables or ufw?
  • Possible marginal connection on the ISP side of things: maybe your ISP turned on some super-duper firewall thingie on your cable/DSL/whatever modem, maybe they are having issues with BGP table route failures, etc.
  • Possible screwup on your gateway router: maybe it's dropping packets because the wall wart AC adapter is starting to fail, etc. (Don't laugh, I have a whole collection of "broken" routers, some quite expensive, that people have given me for free where their only problem is that the wall wart is putting out enough voltage to light the power LED but not enough to actually run the CPU inside.)
  • Possible screwup on any other network device (hub, etc.) in your network causing partial drops of traffic.

If you have a second Windows box, just for grins throw an ACME client such as certbot on it, point your router's port forward at it instead of your webserver, and see if you can pull a cert from LE on it. Use a throwaway name like foo.spurint.org. That will rule out your webserver system (or convict it).
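
One way to look for the partial drops mentioned above is to open many TCP connections to port 80 from a machine outside the network and count how each attempt ends. This is a minimal Python sketch under that assumption; the function names are made up, and a clean result from one outside host does not rule out filtering that only affects other source addresses.

```python
import socket


def probe(host, port=80, timeout=5.0):
    """Attempt one TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are being dropped somewhere on the path.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError:
        # DNS failure, unreachable network, and similar errors.
        return "error"


def probe_many(host, port=80, attempts=20, timeout=5.0):
    """Run several probes and return a dict of outcome -> count."""
    counts = {}
    for _ in range(attempts):
        result = probe(host, port, timeout)
        counts[result] = counts.get(result, 0) + 1
    return counts
```

Something like probe_many("kelnos.spurint.org") run from a VPS would be the intended use. A mix of "open" and "timeout" results would point toward a flaky device or link, while consistent "open" results would point toward filtering that depends on who is connecting.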

I did check iptables after shutting down fail2ban, and there weren't any entries in there except the ones I expected. Also tried flushing all the iptables tables entirely and resetting their default policies to ACCEPT, just to be certain. No dice.

If the ISP has done something, I would guess it's intended to be permanent, since the failures have been happening for three weeks. Hopefully that's not it, as there's probably no solution there.

Issue with the router or another network device is certainly possible. I guess next thing I'll test will be to just eliminate all that and connect my webserver directly to the cable modem (I guess the modem could also be at fault, ugh). I'll try that later in the evening when I can take the network down without anyone getting upset. If that doesn't work, I'll also take your suggestion to use another machine as the webserver (I guess I can use my laptop) and connect that directly to the cable modem.

Thank you both so much for all the troubleshooting suggestions so far. Will report back once I've been able to try this.

If none of this works, I'll probably switch to using dns-01 challenges; my DNS provider has an API I can use to automate poking TXT records into my zone, which hopefully will work if all else fails.
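
For reference, the dns-01 challenge does not involve port 80 at all: the validator looks up a TXT record at _acme-challenge.<domain> whose value is the base64url-encoded SHA-256 digest of the key authorization (RFC 8555, section 8.4). Below is a minimal Python sketch of that computation, with a made-up function name and illustrative inputs. In practice the ACME client or its DNS plugin computes this value and pushes the record through the provider's API.

```python
import base64
import hashlib


def dns01_record(domain, key_authorization):
    """Return (record name, TXT value) for an ACME dns-01 challenge."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url without the trailing "=" padding, as ACME requires.
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"_acme-challenge.{domain}", value
```

Because validation happens entirely through DNS lookups, whatever is dropping inbound connections to the webserver would no longer matter for renewal.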

Have you tried this combination?:

  • shutdown nginx
  • certbot renew --standalone --dry-run

If that fails, please upload the LE log file.
/var/log/letsencrypt/letsencrypt.log

Yes, I have tried --standalone --dry-run. Attached a log of the attempt.

letsencrypt.log.txt (28 KB)

As far as I know, this is correct. Are all the successful attempts from AWS? I believe LE uses AWS for their secondary validation vantage points.

You should keep using the staging environment until successful there, then switch to production.

Yes, they are all from AWS.

Makes sense, will do.

Then you're missing the primary validation vantage point. Which makes sense, otherwise the error message would have included "secondary".

My guess is there's still a firewall blocking stuff. Please double and triple check everything you can think of, also at the ISP level.
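
The primary/secondary distinction can be read off the failure text mechanically. This is a minimal Python sketch that assumes the message layout seen in the certbot output quoted at the top of this thread and relies on the point made above that a secondary-validation failure would mention "secondary" in its detail. The function name and the returned keys are made up for illustration.

```python
import re

# Layout assumed from the certbot output quoted in this thread:
#   <domain> (<challenge>): <ACME error URN> :: <summary> :: <detail>
FAILURE_RE = re.compile(
    r"(?P<domain>\S+) \((?P<challenge>[a-z0-9-]+)\): "
    r"(?P<error>urn:ietf:params:acme:error:\w+) :: "
    r"(?P<summary>.*?) :: (?P<detail>.*)"
)


def describe_failure(message):
    """Split a failed-authorization message into its parts, or return None."""
    m = FAILURE_RE.search(message)
    if m is None:
        return None
    info = m.groupdict()
    # Assumption: only secondary-validation failures mention "secondary".
    is_secondary = "secondary" in info["detail"].lower()
    info["vantage"] = "secondary" if is_secondary else "primary"
    return info
```

Applied to the error in the first post, this reports the "primary" vantage point, which matches the reasoning above: the requests that do show up in the log come from the secondary validators, and the one that times out is the primary.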
