Misleading error in output from LE services: Timeout during connect (likely firewall problem)

luilegeant · January 29, 2022, 2:27pm

Hello,

My domain is: nextcloud.torquetum.eu

My reverse proxy+LE stack ran this command:
acme.sh --issue with the following parameters : --log /var/myacmedebuglogfolder/debug.log --debug 2 --server https://acme-staging-v02.api.letsencrypt.org/directory --config-home /etc/acme.sh/staging --webroot /usr/share/nginx/html --keylength 4096 --cert-file /etc/nginx/certs/_test_nextcloud.torquetum.eu/cert.pem --key-file /etc/nginx/certs/_test_nextcloud.torquetum.eu/key.pem --ca-file /etc/nginx/certs/_test_nextcloud.torquetum.eu/chain.pem --fullchain-file /etc/nginx/certs/_test_nextcloud.torquetum.eu/fullchain.pem --always-force-new-domain-key --domain nextcloud.torquetum.eu

It produced this output:
Fetching http://nextcloud.torquetum.eu/.well-known/acme-challenge/aAChznYO1-2prH6ZiE3coz5Oz8zvCNiFmMPe6X0cerk: Timeout during connect (likely firewall problem)

My web server is (include version): nginx 1.20.2
The LE certificate request and renewal process is handled by my reverse proxy, a stack of 3 containers:

nginx:1.20.2-alpine
helder/docker-gen:latest
nginxproxy/acme-companion:2.1.2

The operating system my web server runs on is (include version): ubuntu 20.04.3 lts, running docker in swarm mode (single node)

My hosting provider, if applicable, is: myself, on my home ISP connection, which already has another service running and certificate issued by LE on 31/12/2021 (using the setup described in this ticket).

I can login to a root shell on my machine (yes or no, or I don't know): yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): see container listed above

NOTE: This ticket is similar to this ticket and to this other ticket, but I'm reporting another issue which seems to be on the LE side and not on the certificate requester side (my server).

All of the following queries to http://nextcloud.torquetum.eu/.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 work (this was one of my many tests today; in debug mode the test file is left in place):

curl (to the above URL) executed from the Ubuntu server hosting the containers: works with DNS resolving to the public IP (my ISP modem)
curl (to the above URL) executed from the Ubuntu server hosting the containers: works with DNS resolving to the Docker host (192.168.1.66), to which my modem forwards ports 80 and 443 from the internet
curl (to the above URL) executed from inside the acme-companion container: works
curl (to the above URL) executed from my laptop on my neighbor's wifi (you need to have good relationships): works
curl (to the above URL) executed from a VPS hosted at OVH: works

The extract of logs from my reverse proxy:

nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:57:38 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" "-"
nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:57:39 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" "-"
nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:57:51 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "acme.sh/2.9.0 (GitHub - acmesh-official/acme.sh: A pure Unix shell script implementing ACME client protocol)" "-"
nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:58:52 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "curl/7.68.0" "-"
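To see how many validation requests actually reached the proxy, the log lines above can be counted per client. This is only a minimal sketch: it assumes the access log format shown in the extract, and the function name `count_challenge_requests` and the client labels are illustrative, not part of nginx or acme.sh.

```python
import re
from collections import Counter

# Matches challenge requests in lines shaped like the extract above.
# Assumption: other nginx log formats would need a different pattern.
LINE_RE = re.compile(
    r'"GET /\.well-known/acme-challenge/(?P<token>[\w-]+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) \d+ "[^"]*" "(?P<agent>[^"]*)"'
)

def count_challenge_requests(lines):
    """Count challenge requests per (token, client kind, HTTP status)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a challenge request, or an unexpected format
        agent = m.group("agent")
        if "Let's Encrypt validation server" in agent:
            kind = "letsencrypt"
        elif agent.startswith("acme.sh"):
            kind = "acme.sh"
        else:
            kind = "other"
        counts[(m.group("token"), kind, int(m.group("status")))] += 1
    return counts

# The four lines from the extract above (client IP already masked).
SAMPLE = '''
nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:57:38 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" "-"
nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:57:39 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" "-"
nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:57:51 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "acme.sh/2.9.0 (GitHub - acmesh-official/acme.sh: A pure Unix shell script implementing ACME client protocol)" "-"
nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:58:52 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "curl/7.68.0" "-"
'''.splitlines()

for key, n in sorted(count_challenge_requests(SAMPLE).items()):
    print(key, n)
```

For this extract it reports two requests from the Let's Encrypt validation servers, one from acme.sh and one from curl, all with status 200. A count like this only shows the requests that arrived; it cannot show the ones that timed out before reaching the proxy.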

Osiris · January 29, 2022, 3:41pm

I'm not sure if "misleading" is the correct term. The validation server wouldn't return this error if it could connect. It just can't for some reason. Even if that reason is unknown or weird, that doesn't make the error message misleading IMO.

I can connect to your server too by the way. But maybe some regional blocking is active? Fail2ban? Something like that?

luilegeant · January 29, 2022, 3:49pm

The validation server is the one making the first two queries in the log extract above from my reverse proxy. Note the success code 200. The third query is made by the acme-companion container, which also gets a 200. The last one is one of my manual curl tests.

No fail2ban limitation on port 80 and 443 configured atm.

MikeMcQ · January 29, 2022, 3:56pm

Let's Encrypt makes requests from various points around the globe. Do you have any Geo based firewall blocking?

Lately there are 4 requests, so it looks like two are failing to get through.

luilegeant · January 29, 2022, 4:04pm

My firewall doesn't have geo based blocking, but my ISP could and I don't know of any way to validate this part.

I'll give it another go later and check how many get through.

Osiris · January 29, 2022, 4:23pm

Indeed, there should be 4 requests from different LE servers.

rg305 · January 29, 2022, 10:30pm

What does the nginx access log show?

luilegeant · January 30, 2022, 8:59am

Hello,
Thank you very much for all the information and tips to analyze the situation (including in DMs).

To sum up:

yesterday, during my migration attempt, only some of the required queries from LE were reaching my reverse proxy. This happened first on the production server, until I hit the hard limit, and then on the staging servers in debug mode, which I used to file this report
a second automated attempt from my reverse proxy around 29/Jan/2022:13:57:58 +0000 succeeded on the staging environment (I only saw it later): 3 queries received
later I re-triggered it on the production environment at 29/Jan/2022:20:30:33 +0000 and it succeeded

It took me a while to go through everything and eventually come up with a strange idea: when I triggered the deploy, the server had just migrated from one IP to another. The A and CNAME records were up to date according to dnschecker.org.
But I realized that this IP had changed within the last 24 hours (48 now). The A record pointing to my modem has a TTL of 60 (yes I know, not my choice) and the other record has a TTL of 600 (the domain used here is a CNAME).
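If stale DNS after the IP change really was the cause, one quick check before re-triggering a request is to compare what the name resolves to with the expected public IP. A minimal sketch using only the Python standard library; the helper names and the IP `203.0.113.10` are placeholders, and it only asks the local resolver, so a match here does not prove that the resolvers used by the LE validation servers see the same address.

```python
import socket

def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    infos = resolver(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET the sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})

def points_to(hostname, expected_ip, resolver=socket.getaddrinfo):
    """True if hostname resolves to exactly the expected IPv4 address."""
    return resolve_ipv4(hostname, resolver) == [expected_ip]

# Example call (placeholder IP, replace with the real public address):
# points_to("nextcloud.torquetum.eu", "203.0.113.10")
```

The `resolver` parameter is only there so the check can be exercised without network access; in normal use the default is enough.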

We can close the support process as the situation was resolved.

Regarding the error message: learning that n queries are made to validate ownership of the server IP was a good lead. But given that at least x (fewer than n) reached my server and succeeded, here is my take:

you scared me about regional blocking, sadly a reality today (I don't think it was the case though)
not knowing at first how many to expect didn't help me (thanks for the later detail on this one)
seeing a success on the server misled me into believing that all of them had reached my server

Could we adapt the error Timeout during connect (likely firewall problem) to say that some queries succeeded but not all (when validation partially succeeds)? It could help analyze the situation differently and be more precise (if that's something allowed in the evaluation model).

Thank you very much for your help!

system · March 1, 2022, 8:59am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
