It did not pause and stay there let me try again.
Did it unpause?
Okay so I copied the next set out of the folder and back in. Try these.
That way then it can continue and they will remain now.
Do you use Fail2Ban or some other adaptive firewall scheme?
We have a regular firewall in place but nothing like fail2ban on this server.
I do fear something is going on though because I can see when we check for those files in the apache logs but I do not see letsencrypt checking for them. So it is like its not reaching the apache server. That is why I was hoping there was specific IP addresses. We do have a list of banned IP addresses on the firewall but I turned that rule completely off to insure that some how these IP addresses were not in that list. We mostly put denial of service attempt IP addresses in there. If someone comes in using /.well-known/acme-challenge/ then it auto reroutes to the proper apache server that serves up these requests. So in an effort to make sure nothing was getting passed by I checked the apache logs of the servers that would answer that request if it did not have /.well-known/acme-challenge in it. There are 0 of those that have been passed to the other servers.
Check the recent certbot logs/output if you would. I want to make sure you're not being rate-limited for failed authorizations. No point chasing our tail. What's the last error?
The following errors were reported by the server:
Timeout after connect (your server may be slow or overloaded)
Not seeing anything like that. It would be really nice if they included the IP they were coming from in that response.
A reasonable suggestion to be sure.
Thank you for your wonderful cooperation with all this. I'm going to call for heavy reinforcements now.
They're usually pretty expedient about responding. Give them some time though.
We could really use your input and guidance here.
- Manually confirmed http-01 challenge file creation and contents with 100% responsiveness over multiple acquisition attempts (i.e. load balancing not an issue)
- Attempts from Boulder do not appear in apache logs
- Testing against staging environment
- Fails due to timeout after connection
- Firewall suspected but no evidence seen
Thank you really. It is greatly appreciated. You have been awesome and have been extremely helpful.
You're quite welcome. I wish all of our visitors were as knowledgeable and responsive as you.
This an error stems from the connection not responding to the validation attempt. On the Let's Encrypt side, the connection was successful and waited for a reply but never got one so it stops waiting and returns the error.
You looked at your Apache logs, but this is possibly a problem at your HA Proxy layer and if possible you should review the logs there. It sounds like your HA Proxy isn't transmitting the request to Apache so you don't see any attempts in the Apache logs.
I've sent a DM with the IP we attempted to validate for your domain that you can also review.
I would agree with you but we can test it every other way. If we test it with a web browser it works. If we test it on remote machines using curl it works. So I guess I am confused why it does not work with the letsencrypt servers.
I was able to access both challenge files myself using my Samsung phone and verify their contents. StealthMicro was also able to perform external validation. Any idea why the Let's Encrypt Boulder requests would be treated differently here?
Just a note: I did confirm that there's no IPv6 in play here.
The Let's Debug response was very curious as well...
@griffin that warning just means that Let's Debug gave up waiting for the ACME CA to update the status of the challenge (30 seconds IIRC).
It's not anything to do with Let's Encrypt. I should probably increase the duration.
Edit: it's now 60s but doesn't seem to have helped. I'm kind of surprised. It's not like the challenge is being queued up on the Let's Encrypt side, and the 10 second deadline was always pretty strict in the past? Or I am misremembering and the 10s timeout only affects dialing?
It responds quickly to normal browser requests.
There must be some sort of user agent check in place.
That was my suspicion.
I also find this a bit mysterious. But I think Jillian's on the right track with checking your HAProxy logs and configs. In particular I would be curious to see if HAProxy has a log entry for the validation requests starting.
Some things we know:
- We're getting this error from our primary datacenters, not from AWS, so this is not related to blocking AWS addresses. If it were, we'd see "during secondary validation" in the error.
acadia.k12.la.usconsistently fails, but
www.acadia.k12.la.usconsistently succeeds (I checked the logs, but also your Certbot output indicates that only the non-
wwwhostname failed). [Edit: the parenthetical referred to the wrong hostname. Fixed]
- This is a "timeout after connect," not "timeout during connect." That means it's probably not a firewall problem. If the firewall were blocking us, we'd get a timeout during connect.
One thing I wonder about: Could your certbot or other tooling be temporarily taking down Apache while it runs, then starting it back up afterwards? For instance, it's common to configure that in
standalone mode. But your output above indicates
webroot mode, so that's probably not what's going on.
One long-shot test I would recommend: On some host outside your network, set up this loop:
while curl -vv -m 80 https://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test ; do sleep 1; done
That fetches the URL repeatedly, with a timeout of 80 seconds (Let's Encrypt's overall timeout is 90 seconds).
Then, on your instance that runs Certbot, do a renewal. Does the
curl loop time out while the renewal is going on?
Also: is there anything different in your load balancer config for
www vs non-
A lot of food for thought. Thanks Jacob!