Please fill out the fields below so we can help you better. Note: you must provide your domain name to get help. Domain names for issued certificates are all made public in Certificate Transparency logs (e.g. crt.sh | example.com), so withholding your domain name here does not increase secrecy, but only makes it harder for us to provide help.
I ran this command: Unknown; the command is run from inside a Docker container.
It produced this output: The full log is available upon request, but this is the error:
Detail: During secondary validation: 69.131.126.51: Fetching http://www.xiocrypt.com/.well-known/acme-challenge/6JVRLYqoryE69x4IUKTWXaRbHaoBIBracwysCimF5Xg: Timeout during connect (likely firewall problem)
My web server is (include version): nginx 1.20.2
The operating system my web server runs on is (include version): alpine 3.14?
My hosting provider, if applicable, is: N/A
I can login to a root shell on my machine (yes or no, or I don't know): yes
I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no
The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): certbot 1.28.0
Additional information:
Unable to get certbot to issue a cert in the live environment; it times out during the http-01 challenge. Staging works as expected. Traffic is routing to the container properly and returns a 200 when asked for the challenge. This is an existing environment; the only change was removing one subdomain and adding another.
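One rough way to sanity-check that the challenge path is reachable from outside (the webroot path and test filename below are placeholders, not my actual paths):

```bash
# On the web node: drop a throwaway file where the ACME challenge is served from.
# The webroot path is an assumption -- match it to your certbot/nginx setup.
echo "reachability-test" > /var/www/html/.well-known/acme-challenge/test-file

# From a host OUTSIDE your own network: confirm port 80 answers and serves the file.
curl -v --max-time 10 http://www.xiocrypt.com/.well-known/acme-challenge/test-file
```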
Have you set up a firewall based on IP addresses since your last good cert on Jun 29?
I ask because I cannot reach your server from my test server (I time out). And the Let's Debug test site times out trying to reach you for its test challenge (so from its IP), although its test using the LE staging system does get through.
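If you want to reproduce what I'm seeing from an outside host, something along these lines will do (the URL path is arbitrary; any request to port 80 shows the connect timeout):

```bash
# From an external host: a short connect timeout makes the symptom easy to see.
curl -sv --connect-timeout 10 -o /dev/null http://www.xiocrypt.com/.well-known/acme-challenge/anything
echo "curl exit code: $?"   # 28 means curl timed out

# A bare TCP check to port 80 separates "connection filtered" from an HTTP-level problem.
nc -zv -w 10 www.xiocrypt.com 80
```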
There have been no firewall changes. I reran it to grab a pcap just to make sure, and it worked... I'll keep a watch on it and let you know if I see any further issues. I wonder if there was an interruption or change made upstream from me.
HTTP is allowed consistently. I may need to look into why it is not appearing so, but the firewall has it in its NAT rules and nginx is listening on port 80 to redirect to 443. When it was having the issue, I was able to get a pcap file that showed 2 IPs requesting the challenge and getting a 200 response. If there should have been more, they were being blocked before they reached my tap. I can investigate it further if it happens again.
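If it does happen again, a capture along these lines would record how many distinct validation IPs actually reach the box (the interface name is an assumption; substitute the right one):

```bash
# Capture inbound port-80 traffic on the web node while certbot runs, so the
# validation requests are on record. Interface name eth0 is an assumption.
tcpdump -i eth0 -nn -w acme-challenge.pcap 'tcp port 80'

# Afterwards, list the distinct client IPs that actually reached the box:
tcpdump -nn -r acme-challenge.pcap 'tcp[tcpflags] & tcp-syn != 0 and dst port 80' \
  | awk '{print $3}' | cut -d. -f1-4 | sort -u
```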
Edit: It is possible that one of the sources for some of my firewall rules updated to include some of LE's IPs, although I did not see any blocks for this web node in the routing logs. Another thing I will look into at a later date.
A successful http production challenge will currently come from 4 locations around the globe. Successful results are cached, and I guess that caching must be per-location if you only saw 2 and still got a cert. I thought the entire http challenge had to succeed to be cached, but your results suggest otherwise.
If you intend that website to be readily available to the public, I think you need to assess your communications reliability. Many tests fail:
From my test server on AWS EC2 (East Coast) (http or https)
Oddly, I see your site fine from SSL Labs, so that's something.
The pattern of what works and what fails is unusual. Even plain HTTP is not consistent.
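A crude way to quantify that inconsistency from any outside vantage point (URL and interval here are just a sketch):

```bash
# Probe plain HTTP once a minute and log the result, to see how often it drops.
while true; do
  code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' http://www.xiocrypt.com/)
  echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ')  http_code=$code"
  sleep 60
done
```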
EDIT:
You could try sudo certbot renew --dry-run to test connectivity, at least from Let's Encrypt's servers. This uses the staging system but also forces each validation location to challenge again rather than rely on any cached result. (Omit sudo if not needed.)
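For example (the --cert-name value is an assumption; certbot certificates lists the real one):

```bash
# Dry-run renewal against the staging environment; production's cached
# validations don't apply here, so each location has to reach you again.
sudo certbot renew --dry-run

# Or limit it to one certificate (check `certbot certificates` for the exact name):
sudo certbot renew --dry-run --cert-name www.xiocrypt.com
```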
After a few hours of digging, I'm pretty sure I found the issue: a misconfigured firewall rule that was blocking a lot of IPs it shouldn't have been. I'm surprised issues were not seen sooner, as this rule had been in place for almost six months.
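For anyone hitting something similar: if the firewall is an iptables/ipset setup (yours may differ; this is only a sketch), checks like these can show whether a blocklist rule is catching validation traffic:

```bash
# Dump the rules with packet counters and look for DROP/REJECT entries that are
# fed by an external blocklist. (Assumes an iptables-based firewall.)
sudo iptables -L -n -v --line-numbers | grep -E 'DROP|REJECT'

# If the blocklist lives in an ipset, test an IP seen in the pcap or in the
# certbot error against it. The set name "blocklist" is a placeholder.
sudo ipset test blocklist 69.131.126.51
```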
Edit: Also, when I said I saw 2 IPs, that was when it did not work. Later I retested it, but was not able to get a pcap, so I don't know how many came through. Based on what you said, I would assume 4.
Four is the "worst case" scenario.
In the "best case" scenario, some of those systems were able to validate recently and are now using one of those recently cached validations.