Misleading error in output from LE services: Timeout during connect (likely firewall problem)

Hello,

My domain is: nextcloud.torquetum.eu

My reverse proxy+LE stack ran this command:
acme.sh --issue with the following parameters : --log /var/myacmedebuglogfolder/debug.log --debug 2 --server https://acme-staging-v02.api.letsencrypt.org/directory --config-home /etc/acme.sh/staging --webroot /usr/share/nginx/html --keylength 4096 --cert-file /etc/nginx/certs/_test_nextcloud.torquetum.eu/cert.pem --key-file /etc/nginx/certs/_test_nextcloud.torquetum.eu/key.pem --ca-file /etc/nginx/certs/_test_nextcloud.torquetum.eu/chain.pem --fullchain-file /etc/nginx/certs/_test_nextcloud.torquetum.eu/fullchain.pem --always-force-new-domain-key --domain nextcloud.torquetum.eu

It produced this output:
Fetching http://nextcloud.torquetum.eu/.well-known/acme-challenge/aAChznYO1-2prH6ZiE3coz5Oz8zvCNiFmMPe6X0cerk: Timeout during connect (likely firewall problem)

My web server is (include version): nginx 1.20.2
The LE certificate request & renewal process is handled by my reverse proxy, a stack of 3 containers composed of:

  • nginx:1.20.2-alpine
  • helder/docker-gen:latest
  • nginxproxy/acme-companion:2.1.2

The operating system my web server runs on is (include version): ubuntu 20.04.3 lts, running docker in swarm mode (single node)

My hosting provider, if applicable, is: myself, on my home ISP connection, which already has another service running and certificate issued by LE on 31/12/2021 (using the setup described in this ticket).

I can login to a root shell on my machine (yes or no, or I don't know): yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): see container listed above

NOTE: This ticket is similar to this ticket and to this other ticket, but I'm reporting another issue which seems to be on the LE side and not on the certificate requester side (my server).

The following queries to http://nextcloud.torquetum.eu/.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 all work (this is from one of my many tests today; in debug mode it left the test file in place):

  • curl (to above url) executed from the ubuntu server hosting the containers: works with dns resolving to the public ip (my ISP modem)
  • curl (to above url) executed from the ubuntu server hosting the containers: works with dns resolving to the docker host (192.168.1.66 :wink: ) (the host to which my modem forwards ports 80 and 443 from the www)
  • curl (to above url) executed from inside the acme-companion containers: works
  • curl (to above url) executed using my laptop on my neighbor's wifi (you need to have good relationships :stuck_out_tongue: )
  • curl (to above url) executed using a VPS hosted at OVH

The extract of logs from my reverse proxy:

nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:57:38 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" "-"
nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:57:39 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" "-"
nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:57:51 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "acme.sh/2.9.0 (GitHub - acmesh-official/acme.sh: A pure Unix shell script implementing ACME client protocol)" "-"
nextcloud.torquetum.eu X.X.X.X - - [29/Jan/2022:12:58:52 +0000] "GET /.well-known/acme-challenge/mU1dTSyJQnSh0oqySzGlTRFVtfEFZSGdaMOcxlVwUu0 HTTP/1.1" 200 87 "-" "curl/7.68.0" "-"

I'm not sure if "misleading" is the correct term. The validation server wouldn't return this error if it could connect. It just can't for some reason. Even if that reason is unknown or weird, that doesn't make the error message misleading IMO.

I can connect to your server too by the way. But maybe some regional blocking is active? Fail2ban? Something like that?

The validation server made the first two requests in the extract above from my reverse proxy; note the 200 success code. The third request was made by the acme-companion container, which also got a 200. The last one is one of my manual curl tests.

No fail2ban limitation on port 80 and 443 configured atm.

Let's Encrypt makes requests from various points around the globe. Do you have any Geo based firewall blocking?

Lately there are 4 requests, so it looks like two are failing to get through.

My firewall doesn't have geo based blocking, but my ISP could and I don't know of any way to validate this part.

I'll give it another go later and check how many get through.
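
One way to check how many get through is to count the validation requests per challenge token in the reverse proxy's access log. Below is a minimal Python sketch, assuming the log lines look like the extract earlier in this thread; the function name `count_validation_hits` and the regular expression are illustrative assumptions, not part of acme.sh, acme-companion, or any Let's Encrypt tooling.

```python
import re
from collections import Counter

# User-agent marker copied from the log extract in this thread.
LE_USER_AGENT = "Let's Encrypt validation server"

# Captures the challenge token and the HTTP status code of a request line.
# Assumes the default quoted request format seen in the extract above.
CHALLENGE_RE = re.compile(
    r'"GET /\.well-known/acme-challenge/([A-Za-z0-9_-]+) HTTP/[0-9.]+" (\d{3})'
)


def count_validation_hits(lines):
    """Count LE validation requests per (token, status) pair.

    `lines` is any iterable of access-log lines (a list, an open file, ...).
    Requests from other clients (acme.sh self-check, manual curl) are skipped.
    """
    hits = Counter()
    for line in lines:
        if LE_USER_AGENT not in line:
            continue
        match = CHALLENGE_RE.search(line)
        if match:
            token, status = match.group(1), int(match.group(2))
            hits[(token, status)] += 1
    return hits
```

Feeding it the access log (for example an open file object) gives one count per token and status code. If a token shows fewer requests than the number of validation attempts expected (4, according to the reply above), some of them never reached the proxy.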

Indeed, there should be 4 requests from different LE servers. :slight_smile:

What does the nginx access log show?

Hello,
Thank you very much for all the information and tips to analyze the situation (including in DMs).

To sum up:

  • yesterday, during my migration attempt, only some of the required queries from LE were reaching my reverse proxy: first on the production server, until it reached the hard limit, then on the staging servers in debug mode to file this report
  • a second automated attempt from my reverse proxy around 29/Jan/2022:13:57:58 +0000 succeeded on the staging environment (I only saw it later): 3 queries received
  • then later I re-triggered it on the production environment at 29/Jan/2022:20:30:33 +0000 and it succeeded

It took me a while to go through everything and eventually come up with a strange idea: when I triggered the deploy, the server had just migrated from one IP to another. The A and CNAME records were up to date according to dnschecker.org.
But I realized that this IP had changed within the last 24 hours (48 now). The A record pointing to my modem has a TTL of 60 (yes I know, not my choice); the other has a TTL of 600 (the domain used here is a CNAME).

We can close the support process as the situation was resolved.

Regarding the error message: learning that n queries are made to validate the server IP ownership was a good lead. But since at least x of them (fewer than n) reached my server and succeeded, here is my take:

  • you scared me about regional blocking, sadly a reality today (I don't think it was the case though)
  • not knowing at first how many to expect didn't help me (thanks for the later detail on this one)
  • seeing a success on the server misled me into believing that all of them had reached my server

Could we adapt the error Timeout during connect (likely firewall problem) to say that some queries succeeded but not all (when validation partially succeeds)? It could help people analyze the situation differently and would be more precise (if that's something allowed in the evaluation model).

Thank you very much for your help!
