Thanks for the suggestion, but the HAProxy config is correct. The server directive does not accept paths. The URI path is passed directly to the backend (I'm not doing any rewriting).
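For reference, the relevant piece looks something like this (a simplified sketch, not my exact config; names are illustrative):

```haproxy
backend bk_certbot
    # The server directive takes only address:port - no path component.
    # The full request URI (e.g. /.well-known/acme-challenge/<token>)
    # is forwarded to the backend unchanged.
    server certbot 127.0.0.1:10443
```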
The http://127.0.0.1:10443/ location is Certbot's standalone web server. If Certbot isn't running, requests to that location result in a 503 error. There's nowhere for me to place a test file unless I can somehow access Certbot's web files.
Could there be a firewall with some kind of DDoS protection? Sometimes called a "smart firewall" or similar? Individual requests are working from various IP sources, but the Let's Encrypt servers will make several challenge requests from different IPs at the same time. We sometimes see these cause problems when such protections are too strict.
Also, your darkvirtue.com domain responds a little differently than the books subdomain. It's not a problem, but looking through this thread there are tests for each, which was a little confusing. Should we be focusing on books?
No, this provider doesn't have any DDoS protections or firewalls at all (they're very inexpensive).
Sorry for the confusion with books vs. the base domain. I was trying to test different backends: the base domain goes to Apache and books goes to Calibre-Web. Both domains are having the same issue, but the base domain isn't eligible for renewal yet. I've reverted my testing changes, so both domains will behave the same with regard to ACME. We can focus on books.
Since I'm using certbot in standalone mode, the 503 error for /.well-known/acme-challenge is expected while certbot isn't actively running.
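For context, the renewal runs along these lines (a sketch, not my exact cron command; it assumes passing the standalone authenticator and port flags directly to renew):

```bash
# Certbot spins up its own temporary web server on port 10443 to answer
# the http-01 challenge; HAProxy forwards challenge traffic to it.
certbot renew --standalone --http-01-port 10443
```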
Note that certbot standalone will identify itself with the Server response header. You could be stripping those off somewhere, I suppose, but the rest of the response looks very different too. So, I think your 501 comes from something else.
Also, you can try this curl when standalone is running, and you should see a similar response. You might have to change your routing so a "home" page request gets sent to certbot standalone rather than your Vue app or whatever.
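Something along these lines (the token in the path is a made-up placeholder; adjust to whatever path your routing sends to certbot):

```bash
# With certbot --standalone running, this should return a response served
# by certbot (note its Server header), not by your app. HTTPS is used
# directly since all HTTP is redirected anyway.
curl -i https://books.darkvirtue.com/.well-known/acme-challenge/test-token
```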
Yes, I am stripping out the Server header. The strict-transport-security, x-frame-options, x-xss-protection, referrer-policy, and x-content-type-options headers are injected by HAProxy on all HTTPS connections as security measures.
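The header handling looks roughly like this (a trimmed sketch, not my full config; the header values are illustrative):

```haproxy
frontend fe_https
    # bind/routing lines omitted
    # Strip the backend's Server header on the way out.
    http-response del-header Server
    # Inject security headers on all HTTPS responses.
    http-response set-header Strict-Transport-Security "max-age=31536000"
    http-response set-header X-Frame-Options "SAMEORIGIN"
    http-response set-header X-XSS-Protection "1; mode=block"
    http-response set-header Referrer-Policy "no-referrer-when-downgrade"
    http-response set-header X-Content-Type-Options "nosniff"
```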
For testing, I modified HAProxy to point the base path for books at certbot and got results similar to yours while it was running.
That confirms the comms path can reach certbot standalone.
I'll go back to an earlier comment: if only the Let's Encrypt servers cannot get through, it's almost always a firewall. In extremely rare cases there have been comms routing issues at ISPs or inside hosting services. In less rare cases it has been some "smart" or "adaptive" firewall setting blocking the requests as DDoS protection.
Do you see the Let's Encrypt challenges arrive in HAProxy? Can you check those logs after trying a certbot --dry-run request? How many do you see, and what are the source IPs? And, for each LE server you should see one request as HTTP and another as HTTPS, since you redirect them.
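Something like this, assuming a typical syslog setup where HAProxy logs land in /var/log/haproxy.log (adjust the path and cert name for your system):

```bash
# Run a test renewal against the LE staging environment ...
certbot renew --cert-name books.darkvirtue.com --dry-run

# ... then look for the challenge requests HAProxy actually saw.
grep 'acme-challenge' /var/log/haproxy.log
```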
I don't use HAProxy, but I thought it could obtain certs directly, handle all the HTTPS itself, and then use HTTP to any backend server. Is that an option?
And, a recap for any other volunteers (a config sketch of this flow follows the list) ...
- An HTTP challenge request is redirected to HTTPS (all HTTP requests are)
- HAProxy sends the HTTPS challenge as HTTP to Certbot in --standalone mode on port 10443
- Individual challenge test requests can reach certbot --standalone, but cert requests to LE staging and production fail with "Timeout"
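If I understand the setup, the HAProxy side looks roughly like this (my sketch, with assumed names and certificate path, not the actual config):

```haproxy
frontend fe_http
    bind :80
    # All HTTP, including the ACME challenge, is redirected to HTTPS.
    redirect scheme https code 301

frontend fe_https
    bind :443 ssl crt /etc/haproxy/certs/
    # Challenge requests go to certbot --standalone over plain HTTP.
    acl is_acme path_beg /.well-known/acme-challenge/
    use_backend bk_certbot if is_acme
    default_backend bk_apps   # Apache / Calibre-Web backends omitted

backend bk_certbot
    server certbot 127.0.0.1:10443
```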
Hmm. Well, there should be 3 IP addresses. The one missing is the primary data center. I believe that one uses Cloudflare Magic Transit.
I don't have time anymore today to help, but at least we are getting closer. It's getting beyond my skills if there is a routing problem like that, though.
TL;DR: it's a busy traffic flow, but in the end it just looks like the primary LE server farm is not reaching HAProxy. The 2 AWS secondaries are seen just fine.
See post #1 for the HAProxy setup, but in short it fronts for certbot --standalone:
- HAProxy redirects all HTTP to HTTPS (even the HTTP challenge).
- HAProxy sees the challenge arrive from LE on HTTPS (after the redirect).
- HAProxy sends that challenge to Certbot --standalone using HTTP, but on port 10443.
Post #29 shows the HAProxy log records where 2 of the 3 LE server farms are seen by HAProxy (the original HTTP and the HTTPS from the redirect). The missing LE farm is the primary. It is IPv4 here, so the LE origin is via Cloudflare Magic Transit.
The test in post #27 proves HAProxy can talk to certbot standalone.
The most recent cert for books.darkvirtue.com was issued on Oct 4, 2022. This setup has been working for "years" per the first post.
@Vicerious Can you ask your provider to look at their inbound logs when you run another --dry-run test? They should see 6 requests for the same URL, of which you only see 4 in the HAProxy log. The exact URL changes with each request, but they know your target IP, so they should be able to isolate by that.
One LE request does not reach your HAProxy. Either it gets "lost" reaching the edge of your provider's network, or, if it does reach their network, it is lost in their routing before getting to you.
Using --dry-run ensures no prior cached validation for your account interferes with this traffic assessment.
But yes, I assume they used standalone, because otherwise we should not have seen an HTTP 200 status code for the challenge.
We should have seen 2 more requests in the HAProxy logs in any case.
But you raise a good point: supposedly other domains on this server are getting certs successfully, so why would the challenges fail to arrive only for this domain?
To recap, the reason for the Timeout is related to the missing entries in the HAProxy log. Earlier you said other domains were working. Are they still? What shows in the HAProxy log when using a --dry-run test for some other domain on that same server, the one with the IP ending in 60.101?
I heard back from my provider. Apparently they're having a strange routing/BGP issue preventing some locations from connecting to their network, including a lot of Cloudflare (I can't ping 1.1.1.1, for example).
Does retrieving the issued certs rely on the Cloudflare endpoint as well? If not, I might try switching to DNS challenges to sidestep this routing weirdness.
Thanks to everyone for your help in troubleshooting this!
The symptom we kept seeing was that the HTTP requests from the Let's Encrypt servers were not reaching your domain server, so switching to the DNS challenge might help. Automating a DNS challenge is often much harder, though.
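If you try it, a manual run looks something like this (a sketch; --manual means pasting a TXT record by hand each time, which is why unattended renewal usually needs a DNS provider plugin or hook scripts):

```bash
# Validation happens via a TXT record at _acme-challenge.<domain>,
# so no inbound HTTP from Let's Encrypt is needed at all.
certbot certonly --manual --preferred-challenges dns -d books.darkvirtue.com
```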
The API requests from the ACME Client (Certbot) to the ACME Server (LE) won't change but those weren't showing problems.
That said, I see the active cert for books was issued today, so it is good for 90 days. It might not be worth the trouble to switch to DNS.
Your darkvirtue.com domain's cert doesn't expire until Feb 6, 2023, so that's a fair amount of time for your provider to sort out their problem.
It looks like my provider resolved their routing issue just before I posted my last reply, and my cron job happily renewed everything, so it's all good now.