Challenge times out in standalone mode when reverse-proxied to a different port

Thanks for the suggestion, but the HAProxy config is correct. The server directive does not accept paths. The URI path is passed directly to the backend (I'm not doing any rewriting).

The http://127.0.0.1:10443/ location is Certbot's standalone web server. If Certbot isn't running, requests to that location return a 503 error. There's nowhere for me to place a test file unless I can access Certbot's web files or something.

The Let's Debug Verbose output shows interesting results. See here

Its initial test, which looks for a test file, fails with a 503 (after a 301). But the test that uses the Let's Encrypt staging system fails with a timeout.

When some source IPs can connect and others time out, the cause is most often a firewall blocking the IPs of the Let's Encrypt servers.

Hmmm... following @rg305's line of thought, I modified my HAProxy config a little bit:

        use_backend letsencrypt if letsencrypt !{ hdr(host) -i darkvirtue.com }

This adds an exception for my base domain. I've put a test file there: https://darkvirtue.com/.well-known/acme-challenge/letsdebug-test

Viewing that file should return "Hello world"

curl https://darkvirtue.com/.well-known/acme-challenge/testfile
Hello world

curl -I https://darkvirtue.com/.well-known/acme-challenge/letsdebug-test
HTTP/2 200
date: Tue, 06 Dec 2022 21:50:16 GMT
<snip>
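To make the modified routing rule concrete, here is a small Python model of the decision the `use_backend` line above makes. It assumes the `letsencrypt` ACL (defined in the original config, not shown in this thread) is equivalent to a `path_beg /.well-known/acme-challenge/` match, and that everything else falls through to another backend; the name `default` is hypothetical. It models only the ACL logic, not HAProxy itself.

```python
ACME_PREFIX = "/.well-known/acme-challenge/"


def choose_backend(host: str, path: str) -> str:
    """Model of: use_backend letsencrypt if letsencrypt !{ hdr(host) -i darkvirtue.com }

    Assumption: the `letsencrypt` ACL behaves like path_beg /.well-known/acme-challenge/.
    "default" is a hypothetical name for whatever backend handles all other traffic.
    """
    is_acme_path = path.startswith(ACME_PREFIX)
    # "-i" makes the Host comparison case-insensitive.
    is_apex = host.lower() == "darkvirtue.com"
    if is_acme_path and not is_apex:
        return "letsencrypt"
    return "default"


print(choose_backend("books.darkvirtue.com", ACME_PREFIX + "TestBooks"))  # letsencrypt
print(choose_backend("darkvirtue.com", ACME_PREFIX + "letsdebug-test"))   # default
```

The apex exception is why a static test file can be served for darkvirtue.com while challenge paths for books still go to Certbot.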

However, the debug tool still shows a connection timeout:

My firewall is set for ports 80 and 443 to be completely open. I'm running fail2ban, but no jails related to web traffic and no currently banned IPs.

The server doesn't have any other firewall in front of it, as far as I'm aware - it's in a colo datacenter that provides unmanaged hosting.

Could there be a firewall with some kind of DDoS protection, sometimes called a "smart firewall" or similar? Individual requests are working from various source IPs, but the Let's Encrypt servers make several challenge requests from different IPs at the same time. We sometimes see such firewalls cause problems when they are too strict.

Also, your darkvirtue.com domain responds a little differently than the books subdomain. That's not a problem, but this thread has tests for each, which was a little confusing. Should we be focusing on books?

curl -IL http://books.darkvirtue.com/.well-known/acme-challenge/TestBooks
HTTP/1.1 301 Moved Permanently
content-length: 0
location: https://books.darkvirtue.com/.well-known/acme-challenge/TestBooks

HTTP/2 503
cache-control: no-cache
content-type: text/html

And:
curl -IL http://darkvirtue.com/.well-known/acme-challenge/TestApex
HTTP/1.1 301 Moved Permanently
content-length: 0
location: https://darkvirtue.com/.well-known/acme-challenge/TestApex

HTTP/2 404
date: Thu, 08 Dec 2022 17:46:08 GMT
content-type: text/html; charset=iso-8859-1
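When comparing redirect chains like the two above, it can help to reduce `curl -IL` output to just the sequence of status codes. A small Python sketch, assuming the output has been captured as text in the format shown (each response starts with an `HTTP/...` status line):

```python
import re

# Matches status lines such as "HTTP/1.1 301 Moved Permanently" or "HTTP/2 503".
STATUS_LINE = re.compile(r"^HTTP/[\d.]+\s+(\d{3})")


def status_chain(curl_output: str) -> list[int]:
    """Return the status code of every response in `curl -IL` output, in order."""
    codes = []
    for line in curl_output.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            codes.append(int(match.group(1)))
    return codes


books = """HTTP/1.1 301 Moved Permanently
content-length: 0
location: https://books.darkvirtue.com/.well-known/acme-challenge/TestBooks

HTTP/2 503
cache-control: no-cache
content-type: text/html
"""

print(status_chain(books))  # [301, 503]
```

Both chains start with the same HTTP-to-HTTPS 301; only the final response differs (503 for books, 404 for the apex).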

No, this provider doesn't have any DDoS protection or firewalls at all (they're very inexpensive).

Sorry for the confusion with books vs. the base domain. I was trying to test different backends. The base domain goes to Apache and books goes to Calibre-Web. Both domains are having the same issue, but the base domain isn't eligible for renewal yet. I've reverted my testing changes so both domains will behave the same with regard to ACME. We can focus on books.

Since I'm using certbot in standalone mode, the 503 error for /.well-known/acme-challenge is expected while certbot isn't actively running.
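For reference, while Certbot is running, what its standalone server answers at `/.well-known/acme-challenge/<token>` is the key authorization defined in RFC 8555 (sections 8.1 and 8.3): the token, a period, and the base64url-encoded SHA-256 JWK thumbprint of the ACME account key. A minimal sketch, assuming the thumbprint has already been computed; the thumbprint below is a hypothetical placeholder, and the token is the one that appears in the HAProxy log further down this thread.

```python
def challenge_path(token: str) -> str:
    """Path the ACME server requests (initially over plain HTTP on port 80)."""
    return "/.well-known/acme-challenge/" + token


def key_authorization(token: str, account_thumbprint: str) -> str:
    """Body expected at challenge_path(token), per RFC 8555 section 8.1.

    account_thumbprint is assumed to be the base64url SHA-256 JWK thumbprint
    of the ACME account key (RFC 7638); computing it is out of scope here.
    """
    return f"{token}.{account_thumbprint}"


token = "V8rwOCofrAu--n_BqYYTf1XZKUqEyhn1V3HyMvKe7q0"
thumbprint = "HYPOTHETICAL_THUMBPRINT"  # placeholder, not a real account value
print(challenge_path(token))
print(key_authorization(token, thumbprint))
```

Because the body depends on the account key, a static test file can only show that the path is reachable, not that Certbot would answer correctly.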

Ah. The 501 you showed in post 1 is not what you would see from Certbot standalone. It is a 501, but yours is:

When I do the same I get this (for my own domain):

curl -I test.sample.com/.well-known/acme-challenge/Test123

HTTP/1.0 501 Unsupported method ('HEAD')
Server: BaseHTTP/0.6 Python/3.8.10
Date: Sat, 10 Dec 2022 16:50:08 GMT
Connection: close
Content-Type: text/html;charset=utf-8
Content-Length: 497

Notice that Certbot standalone identifies itself in the Server response header. You could be stripping that header somewhere, I suppose, but the rest of the response looks very different too. So I think your 501 comes from something else.
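That check can be scripted. This sketch parses a single response's header block in the `curl -I` format shown and flags a `Server` value that looks like Python's `BaseHTTPRequestHandler` (the source of `BaseHTTP/0.6 Python/3.8.10` above). As a fallback for setups that strip the header, it also looks for the "ACME client standalone challenge solver" banner returned on `/`, as seen in the `curl -i` output in this thread. The function names and the fallback logic are illustrative assumptions, not part of Certbot.

```python
def parse_headers(raw: str) -> tuple[str, dict[str, str]]:
    """Split one `curl -I` style response into (status line, lower-cased header dict)."""
    lines = raw.strip().splitlines()
    status_line, headers = lines[0], {}
    for line in lines[1:]:
        if ":" in line:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
    return status_line, headers


def looks_like_certbot_standalone(raw_headers: str, body: str = "") -> bool:
    """Heuristic: does this response appear to come from Certbot's standalone server?"""
    _, headers = parse_headers(raw_headers)
    server = headers.get("server", "")
    # Python's BaseHTTPRequestHandler announces itself as "BaseHTTP/x.y Python/x.y.z".
    if server.startswith("BaseHTTP/") and "Python/" in server:
        return True
    # Fallback for proxies that strip Server: look for the banner served on "/".
    return "ACME client standalone challenge solver" in body


sample = """HTTP/1.0 501 Unsupported method ('HEAD')
Server: BaseHTTP/0.6 Python/3.8.10
Connection: close
"""
print(looks_like_certbot_standalone(sample))  # True
```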

You can also try this curl while standalone is running, and you should see a response similar to this. You might have to change your routing so that a "home" page request gets sent to Certbot standalone rather than your Vue app or whatever.

curl -i test.sample.com
HTTP/1.0 200 OK
Server: BaseHTTP/0.6 Python/3.8.10
Date: Sat, 10 Dec 2022 16:53:30 GMT
Content-Type: text/html

ACME client standalone challenge solver

Yes, I am stripping out the server headers. The strict-transport-security, x-frame-options, x-xss-protection, referrer-policy, and x-content-type-options headers are injected by HAProxy on all HTTPS connections as security measures.

For testing, I modified HAProxy to point the base path for books to certbot and got similar results to you while it was running:

curl -i https://books.darkvirtue.com/
HTTP/2 200 
date: Sat, 10 Dec 2022 19:26:51 GMT
content-type: text/html
strict-transport-security: max-age=31536000; preload; includeSubDomains
x-frame-options: SAMEORIGIN
x-xss-protection: 1;mode=block
referrer-policy: no-referrer,same-origin,strict-origin,strict-origin-when-cross-origin
x-content-type-options: nosniff

ACME client standalone challenge solver

That confirms the comms path can reach certbot standalone.

I go back to an earlier comment that if only the Let's Encrypt servers cannot get through then it's almost always a firewall. In extremely rare cases there have been comms routing issues at ISPs or inside hosting services. In less rare cases it has been some "smart" or "adaptive" firewall setting blocking the requests as DDoS protection.

Do you see the Let's Encrypt challenges arrive in HAProxy? Can you check those logs after trying a --dry-run certbot request? How many do you see, and what are the source IPs? For each LE server you should see one request over HTTP and another over HTTPS, since you redirect them.

I don't use HAProxy, but I thought it could obtain certs directly. Then it would handle all the HTTPS and use HTTP to any backend server. Is that an option?

And, a recap for any other volunteers ...

  • An HTTP challenge request is redirected to HTTPS (as all HTTP requests are)
  • HAProxy sends HTTPS challenge as HTTP to Certbot --standalone mode on port 10443
  • Individual challenge test requests can reach certbot --standalone but cert requests to LE staging and production fail with "Timeout"

Thanks, Mike. I've turned up HAProxy logging while attempting a --dry-run renewal.

I ran: certbot certonly --dry-run -d books.darkvirtue.com

Dec 10 18:03:23 darkvirtue haproxy[2708497]: 13.58.205.202:33732 [10/Dec/2022:18:03:23.291] http-virtual http-virtual/<NOSRV> 0/-1/-1/-1/0 301 176 - - LR-- 2/2/0/0/0 0/0 "GET /.well-known/acme-challenge/V8rwOCofrAu--n_BqYYTf1XZKUqEyhn1V3HyMvKe7q0 HTTP/1.1"
Dec 10 18:03:23 darkvirtue haproxy[2708497]: 54.191.153.32:55236 [10/Dec/2022:18:03:23.373] http-virtual http-virtual/<NOSRV> 0/-1/-1/-1/0 301 176 - - LR-- 3/3/0/0/0 0/0 "GET /.well-known/acme-challenge/V8rwOCofrAu--n_BqYYTf1XZKUqEyhn1V3HyMvKe7q0 HTTP/1.1"
Dec 10 18:03:23 darkvirtue haproxy[2708497]: 13.58.205.202:18954 [10/Dec/2022:18:03:23.382] http-virtual~ letsencrypt/certbot 0/0/1/2/4 200 189 - - ---- 2/2/0/0/0 0/0 "GET /.well-known/acme-challenge/V8rwOCofrAu--n_BqYYTf1XZKUqEyhn1V3HyMvKe7q0 HTTP/1.1"
Dec 10 18:03:23 darkvirtue haproxy[2708497]: 54.191.153.32:43572 [10/Dec/2022:18:03:23.600] http-virtual~ letsencrypt/certbot 0/0/1/3/4 200 189 - - ---- 2/2/0/0/0 0/0 "GET /.well-known/acme-challenge/V8rwOCofrAu--n_BqYYTf1XZKUqEyhn1V3HyMvKe7q0 HTTP/1.1"

All of the requests came from two IP addresses: 54.191.153.32 and 13.58.205.202. Every logged request returned either a 301 redirect or a 200 OK.
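Tallying log lines like these by source IP makes the gap easier to see. A rough Python sketch, assuming HAProxy's HTTP log format exactly as shown above (IPv4 client `ip:port` after `haproxy[pid]:`, then the frontend, backend/server, timers and status code); the regex would need adjusting for a different log format or IPv6 clients.

```python
import re
from collections import defaultdict

# Assumes the HAProxy HTTP log layout seen in this thread, IPv4 clients only.
LOG_RE = re.compile(
    r"haproxy\[\d+\]: (?P<ip>[\d.]+):\d+ \[[^\]]+\] "
    r"(?P<frontend>\S+) (?P<backend>\S+) \S+ (?P<status>\d{3}) "
    r'[^"]*"(?P<method>[A-Z]+) (?P<path>\S+)'
)


def tally_challenges(lines, prefix="/.well-known/acme-challenge/"):
    """Map each source IP to a list of (scheme, status) for ACME challenge requests."""
    seen = defaultdict(list)
    for line in lines:
        m = LOG_RE.search(line)
        if not m or not m.group("path").startswith(prefix):
            continue
        # HAProxy appends "~" to the frontend name when the connection used SSL/TLS.
        scheme = "https" if m.group("frontend").endswith("~") else "http"
        seen[m.group("ip")].append((scheme, int(m.group("status"))))
    return dict(seen)


sample_log = [
    'Dec 10 18:03:23 darkvirtue haproxy[2708497]: 13.58.205.202:33732 [10/Dec/2022:18:03:23.291] http-virtual http-virtual/<NOSRV> 0/-1/-1/-1/0 301 176 - - LR-- 2/2/0/0/0 0/0 "GET /.well-known/acme-challenge/V8rwOCofrAu--n_BqYYTf1XZKUqEyhn1V3HyMvKe7q0 HTTP/1.1"',
    'Dec 10 18:03:23 darkvirtue haproxy[2708497]: 54.191.153.32:55236 [10/Dec/2022:18:03:23.373] http-virtual http-virtual/<NOSRV> 0/-1/-1/-1/0 301 176 - - LR-- 3/3/0/0/0 0/0 "GET /.well-known/acme-challenge/V8rwOCofrAu--n_BqYYTf1XZKUqEyhn1V3HyMvKe7q0 HTTP/1.1"',
    'Dec 10 18:03:23 darkvirtue haproxy[2708497]: 13.58.205.202:18954 [10/Dec/2022:18:03:23.382] http-virtual~ letsencrypt/certbot 0/0/1/2/4 200 189 - - ---- 2/2/0/0/0 0/0 "GET /.well-known/acme-challenge/V8rwOCofrAu--n_BqYYTf1XZKUqEyhn1V3HyMvKe7q0 HTTP/1.1"',
    'Dec 10 18:03:23 darkvirtue haproxy[2708497]: 54.191.153.32:43572 [10/Dec/2022:18:03:23.600] http-virtual~ letsencrypt/certbot 0/0/1/3/4 200 189 - - ---- 2/2/0/0/0 0/0 "GET /.well-known/acme-challenge/V8rwOCofrAu--n_BqYYTf1XZKUqEyhn1V3HyMvKe7q0 HTTP/1.1"',
]

for ip, hits in tally_challenges(sample_log).items():
    print(ip, hits)
# 13.58.205.202 [('http', 301), ('https', 200)]
# 54.191.153.32 [('http', 301), ('https', 200)]
```

With three validation vantage points each making one HTTP request plus one redirected HTTPS request, a complete attempt would show three IPs with two entries each (six requests); here only two IPs and four requests appear.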

Hmm. Well, there should be 3 IP addresses. The one missing is the primary data center. I believe that one uses Cloudflare Magic Transit.

I don't have any more time today to help, but at least we are getting closer. If there is a routing problem like that, though, it is getting beyond my skills.

Did I miss something???
Where did HTTP-port 10443 go?
[and does 10443 imply HTTPS?]

TL;DR: it's a busy traffic flow, but, in the end it just looks like the primary LE server farm is not reaching HAProxy. 2 AWS secondaries are seen just fine.

See post #1 for the HAProxy setup; it fronts for certbot --standalone:
  • HAProxy redirects all HTTP to HTTPS (even the HTTP challenge).
  • HAProxy sees the challenge arrive from LE on HTTPS (after the redirect).
  • HAProxy sends that challenge to Certbot --standalone using HTTP, but on port 10443.

Post #29 shows the HAProxy log records where 2 of 3 LE server farms are seen by HAProxy (the original HTTP request and the HTTPS request from the redirect). The missing LE farm is the primary. It is IPv4 here, so the LE origin is via Cloudflare Magic Transit.

The test in post #27 proves HAProxy can talk to certbot standalone.

The most recent cert for books.darkvirtue.com was issued on Oct 4, 2022. This setup has been working for "years", per the first post.

@Vicerious Can you ask your provider to look at their inbound logs when you run another --dry-run test? They should see 6 requests for the same URL, of which you only see 4 in the HAProxy log. The exact URL changes with each test, but they know your target IP, so they should be able to isolate the traffic by that.

One LE request does not reach your HAProxy. It either gets "lost" reaching the edge of your provider's network. Or, if it reaches their network it is lost in their routing before getting to you.

Using --dry-run ensures no prior cached validation for your account interferes with this traffic assessment.

Then this is missing the part where it tells certbot to use port 10443:

in cli.ini post #1?

but yes, I assume they used standalone because otherwise we should not have seen any HTTP 200 status code to the challenge

We should have seen 2 more requests in the HAProxy logs in any case.

but you raise a good point in that supposedly other domains on this server are getting successful certs so why wouldn't challenges arrive just for this domain?

To recap, the reason for the Timeout is related to the missing entries in the HAProxy log. Earlier you said other domains were working. Are they still? What shows in the HAProxy log when using a --dry-run test for some other domain on that same server, the one with the IP ending in 60.101?

I've reached out to my provider for more info and will be linking them to this thread. I'll update when I know more.

Thanks to everyone for all the help so far!

I heard back from my provider. Apparently they're having a strange routing/BGP issue preventing some locations from connecting to their network, including a lot of Cloudflare (I can't ping 1.1.1.1, for example).

Does retrieving the issued certs rely on the Cloudflare endpoint as well? If not, I might try switching to DNS challenges to sidestep this routing weirdness.

Thanks to everyone for your help in troubleshooting this!

The symptom we kept seeing was that the HTTP requests from the Let's Encrypt servers were not reaching your server. So switching to the DNS challenge might help, though automating a DNS challenge is often much harder.

The API requests from the ACME Client (Certbot) to the ACME Server (LE) won't change but those weren't showing problems.
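For a sense of what DNS-01 automation has to publish: per RFC 8555 (section 8.4), the client creates a TXT record at `_acme-challenge.<domain>` whose value is the unpadded base64url encoding of the SHA-256 digest of the same key authorization used for HTTP-01. A minimal sketch; the token and thumbprint are hypothetical placeholders, and pushing the record through a DNS provider's API (the part that makes automation harder) is not shown.

```python
import base64
import hashlib


def b64url(data: bytes) -> str:
    """base64url without '=' padding, as ACME uses throughout."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def dns01_record(domain: str, token: str, account_thumbprint: str) -> tuple[str, str]:
    """Return (record name, TXT value) for a DNS-01 challenge (RFC 8555 section 8.4)."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return f"_acme-challenge.{domain}", b64url(digest)


# Hypothetical placeholder values, not real challenge data.
name, value = dns01_record("books.darkvirtue.com", "example-token", "example-thumbprint")
print(name)   # _acme-challenge.books.darkvirtue.com
print(value)  # a 43-character base64url string
```

Since validation then happens through DNS lookups rather than inbound HTTP connections to the server, the routing problem between Let's Encrypt and the host would no longer be in the validation path.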

That said, I see the active cert for books was issued today so is good for 90 days. Might not be worth the trouble to switch to DNS.

Your darkvirtue.com cert doesn't expire until Feb 6, 2023, which gives your provider a fair amount of time to sort out their problem.

It looks like my provider resolved their routing issue just before I posted my last reply and my cronjob happily renewed everything, so it's all good now.

Thanks again for the help and answers!
