The Let's Debug response was very curious as well...
@griffin that warning just means that Let's Debug gave up waiting for the ACME CA to update the status of the challenge (30 seconds IIRC).
It's not anything to do with Let's Encrypt. I should probably increase the duration.
Edit: it's now 60s but doesn't seem to have helped. I'm kind of surprised. It's not like the challenge is being queued up on the Let's Encrypt side, and the 10 second deadline was always pretty strict in the past? Or am I misremembering, and the 10s timeout only affects dialing?
It responds quickly to normal browser requests.
There must be some sort of user agent check in place.
That was my suspicion.
I also find this a bit mysterious. But I think Jillian's on the right track with checking your HAProxy logs and configs. In particular I would be curious to see if HAProxy has a log entry for the validation requests starting.
Some things we know:
- We're getting this error from our primary datacenters, not from AWS, so this is not related to blocking AWS addresses. If it were, we'd see "during secondary validation" in the error.
- acadia.k12.la.us consistently fails, but www.acadia.k12.la.us consistently succeeds (I checked the logs, but also your Certbot output indicates that only the non-www hostname failed). [Edit: the parenthetical referred to the wrong hostname. Fixed]
- This is a "timeout after connect," not "timeout during connect." That means it's probably not a firewall problem. If the firewall were blocking us, we'd get a timeout during connect.
One thing I wonder about: Could your certbot or other tooling be temporarily taking down Apache while it runs, then starting it back up afterwards? For instance, it's common to configure that in standalone mode. But your output above indicates webroot mode, so that's probably not what's going on.
One long-shot test I would recommend: On some host outside your network, set up this loop:
while curl -vv -m 80 https://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test ; do sleep 1; done
That fetches the URL repeatedly, with a timeout of 80 seconds (Let's Encrypt's overall timeout is 90 seconds).
Then, on your instance that runs Certbot, do a renewal. Does the curl loop time out while the renewal is going on?
Also: is there anything different in your load balancer config for www vs non-www?
A lot of food for thought. Thanks Jacob!
I got to rereading through the log and found myself wondering about something...
Did you mean this the other way around, maybe?
As in: only the non-www hostname failed
{
  "identifier": {
    "type": "dns",
    "value": "acadia.k12.la.us"
  },
  "status": "invalid",
  "expires": "2020-12-02T16:24:09Z",
  "challenges": [
    {
      "type": "http-01",
      "status": "invalid",
      "error": {
        "type": "urn:ietf:params:acme:error:connection",
        "detail": "Fetching http://acadia.k12.la.us/.well-known/acme-challenge/kogsqUE84tlx30Z5kwlOZIEENUvf9Ru6JlnqRh_Qz8o: Timeout after connect (your server may be slow or overloaded)",
        "status": 400
      },
      "url": "https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/161888351/HzeKxg",
      "token": "kogsqUE84tlx30Z5kwlOZIEENUvf9Ru6JlnqRh_Qz8o",
      "validationRecord": [
        {
          "url": "http://acadia.k12.la.us/.well-known/acme-challenge/kogsqUE84tlx30Z5kwlOZIEENUvf9Ru6JlnqRh_Qz8o",
          "hostname": "acadia.k12.la.us",
          "port": "80",
          "addressesResolved": [
            "104.232.38.112"
          ],
          "addressUsed": "104.232.38.112"
        }
      ]
    }
  ]
}
There's even more fun stuff happening that curl isn't reproducing:
$ curl -X GET -IL http://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test
HTTP/1.1 302 Found
content-length: 0
location: https://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test
cache-control: no-cache
HTTP/2 404
date: Thu, 26 Nov 2020 06:09:09 GMT
server: Apache
content-length: 196
content-type: text/html; charset=iso-8859-1
load-balancer: ies-lb-v2-ssl
versus a minimal replica of Let's Encrypt's VA HTTP client:
package main

import (
    "crypto/tls"
    "errors"
    "log"
    "net"
    "net/http"
    "time"
)

func main() {
    doRequest("http://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test")
}

func doRequest(u string) {
    req, err := http.NewRequest(http.MethodGet, u, nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("user-agent", "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)")

    resp, err := makeThrowawayHTTPClient().Do(req)
    if err != nil {
        handleError(err)
        return
    }
    defer resp.Body.Close()

    log.Printf("Successful response: %d", resp.StatusCode)
}

func handleError(err error) {
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        log.Printf("Timeout after connect (%v)", err)
        return
    }
    log.Printf("Generic error: %v", err)
}

func makeThrowawayHTTPClient() *http.Client {
    return &http.Client{
        Transport: &http.Transport{
            DisableKeepAlives:   true,
            IdleConnTimeout:     time.Second,
            TLSHandshakeTimeout: 10 * time.Second,
            MaxIdleConns:        1,
            TLSClientConfig: &tls.Config{
                InsecureSkipVerify: true,
            },
        },
    }
}
$ go run cmd/foo/main.go
2020/11/26 17:09:50 Timeout after connect (Get "https://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test": net/http: TLS handshake timeout)
So .... is the webserver intolerant of Go's TLS implementation perhaps? Setting TLSClientConfig.MaxVersion to TLS1.2 ...
$ go run cmd/foo/main.go
2020/11/26 17:13:38 Successful response: 404
Huh? Double huh, because the exact same reproduction does not work against the www subdomain, as jsha observed earlier.
This doesn't however explain https://acme-staging-v02.api.letsencrypt.org/get/authz-v3/161448597, where port 443 is not involved at all.
All this messing around ended up with me finding and fixing a connection leak in Let's Debug though, so it wasn't a total loss.
I'm not understanding this part. If you look at my last post, you'll see where I think that jsha mentioned www instead of non-www like the log reflects (there are multitudes of polls for the apex domain in the log before failure, but www succeeded immediately at the beginning). Did you get a failure against www?
I think everybody is in agreement that www does not suffer from any issues?
The same goes for the Go TLS issue I posted about - it only affects the domain apex (for some reason).
I misunderstood the implication of the "does not work" part, which is why I wanted some clarification. Thanks for setting me straight.
I might have actually been on the right path?
Not exactly a check, but a difference in client operation. The apex versus www difference is baffling to me too. Looping rewrite in Apache perhaps? That would explain the successful connection combined with the lack of response.
There's definitely intentional UA sniffing going on. e.g. the webserver responds with a 403 if your UA starts with curl.
But the TLS thing is 99% likely to be an unintended interoperability issue.
Looping rewrite in apache perhaps
Probably not: it fails at the very start of the TLS handshake, and no HTTP request has been sent at that point.
The webserver responds with a 403 if your UA starts with curl.
Is it a 403 or a 404?
I thought you got 302->404. Is this a different result?
It's an actual 403. But I think it's more of a curiosity, because it doesn't seem to affect the ACME challenge resources.
$ curl -H 'User-Agent: curlfoox' -X GET -IL http://acadia.k12.la.us/
HTTP/1.1 302 Found
content-length: 0
location: https://acadia.k12.la.us/
cache-control: no-cache
HTTP/2 403
vs
$ curl -H 'User-Agent: foox' -X GET -IL http://acadia.k12.la.us/
HTTP/1.1 302 Found
content-length: 0
location: https://acadia.k12.la.us/
cache-control: no-cache
HTTP/2 200
it doesn't seem to affect the ACME challenge resources.
That's where I was lost. Makes sense now.
$ curl -X GET -IL http://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test
HTTP/1.1 302 Found
content-length: 0
location: https://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test
cache-control: no-cache
HTTP/2 404
date: Thu, 26 Nov 2020 06:09:09 GMT
server: Apache
content-length: 196
content-type: text/html; charset=iso-8859-1
load-balancer: ies-lb-v2-ssl
What UA does curl report when not specified? None?
That's a very interesting reproduction case, @_az! It sounds like your hypothesis is that Boulder's VA is hitting the redirect to HTTPS, and then timing out on the HTTPS version of the URL. There's one problem, though. Here's what the validationRecord for the failed challenge looks like (from the pastebin log shared earlier):
"validationRecord": [
  {
    "url": "http://acadia.k12.la.us/.well-known/acme-challenge/kogsqUE84tlx30Z5kwlOZIEENUvf9Ru6JlnqRh_Qz8o",
    "hostname": "acadia.k12.la.us",
    "port": "80",
    "addressesResolved": [
      "104.232.38.112"
    ],
    "addressUsed": "104.232.38.112"
  }
]
If Boulder had followed the redirect, there should be two entries in that validationRecord, one for each URL requested. The fact that there's only one suggests the redirect never happened. Also, if you search out the validationRecord for the successful challenge (the one with the www), you'll find that it is the same way - just the one HTTP URL, no redirect indicated.
What UA does curl report when not specified? None?
You can test this yourself with curl httpbin.org/user-agent. Or add -v to your curl command to see the request headers as well as the response ones.
So I have done a lot of testing over the last few days.
Let me start by saying we have over 500 certificates from you folks at Let's Encrypt. Here is where it gets a bit interesting. I run a cronjob at 8:00 PM on Fridays. It ran a few days ago and some certificates were renewed while others were not, so something really strange is going on. One thing I have done is remove the forced SSL redirect, so things now work over plain HTTP, but we still get the same results.
Another test I did was to remove HAProxy from the equation by setting the firewall to handle everything from the two outgoing Let's Encrypt servers differently. Jillian gave me the way to extract the IP addresses and warned me that they can change without notice. So for testing purposes I send all traffic from those two IP addresses directly to the Apache server. This Apache server, BTW, is full time: it is always up and does nothing but handle the Let's Encrypt stuff. Interestingly enough, once Certbot fails while bypassing the proxy server, I get these in the Apache log file:
"letsencrypt.iescentral.com","10.5.1.172","10.5.1.254","-","[26/Nov/2020:11:29:26 -0800]","-","408","-","-","-"
I need to update for X-Forwarded-For to get your side's IP address. I will work on that. But this should provide a new clue: error 408 Request Timeout, and it happens the instant Certbot fails on my side.
So since I do not know how your side works, let me explain my side. I changed the account # to all x's; that appears to be the only critical info.
renew_before_expiry = 30 days
version = 0.40.0
archive_dir = /etc/letsencrypt/archive/acadia.k12.la.us
cert = /etc/letsencrypt/live/acadia.k12.la.us/cert.pem
privkey = /etc/letsencrypt/live/acadia.k12.la.us/privkey.pem
chain = /etc/letsencrypt/live/acadia.k12.la.us/chain.pem
fullchain = /etc/letsencrypt/live/acadia.k12.la.us/fullchain.pem
# Options used in the renewal process
[renewalparams]
authenticator = webroot
account = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
server = https://acme-v02.api.letsencrypt.org/directory
[[webroot_map]]
acadia.k12.la.us = /home/common/etc/haproxy/letsencrypt/web
www.acadia.k12.la.us = /home/common/etc/haproxy/letsencrypt/web
Apache has that web folder as its document root. We know the setup works because it has worked that way for over a year, and some certificates succeed while others fail.
To me it seems like a particular server on your side may be having trouble communicating with our side? That could explain why some certificates renewed okay during the cronjob. Do you guys use a round-robin type method?
I have put a challenge file back in the correct folder, so this link:
http://acadia.k12.la.us/.well-known/acme-challenge/BmOxbbaZLxCE7dJPY50X6e_FYR86250y0Y9ApybCMEQ
That link will respond 24/7 until I remove the file.
The thing is, ANY URL we support will work as well. So for instance:
http://iescentral.com/.well-known/acme-challenge/BmOxbbaZLxCE7dJPY50X6e_FYR86250y0Y9ApybCMEQ
That would work as well. Basically, on HAProxy I look for .well-known/acme-challenge and forward any request with that in it to the same folder. Then webroot on Certbot is set to hit that folder as well.
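For reference, that kind of routing rule typically looks something like this in an HAProxy config (a sketch only; the ACL name, backend names, and server address are my own placeholders, not the actual config):

```haproxy
frontend http_in
    bind *:80
    # Send anything under the ACME challenge path to the dedicated Apache box
    acl is_acme path_beg /.well-known/acme-challenge/
    use_backend acme_backend if is_acme
    default_backend main_backend

backend acme_backend
    # Placeholder address for the always-up Apache serving the shared webroot
    server acme-apache 192.0.2.10:80
```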
I can see the challenge files being created and deleted properly, and when they are created I can hit them with any web browser on any network I have access to, and they return properly every time.
So by taking HTTPS out of the equation, should that not solve the Go TLS issue? If it is using HTTP instead, TLS should no longer be part of this. Unless of course there is something on your side I am unaware of.
So I look forward to hearing thoughts and I can try to implement any tests on my side you wish.
So just FYI, at this moment I have firewall rules bypassing HAProxy. I can change it back in a few seconds upon request.
As you noticed, 408 is a timeout.
That can easily be tested.
Can it be reached from the Internet or not?