Trouble renewing certificate - certbot 0.40.0

@jillian

The Let's Debug response was very curious as well...

2 Likes

@griffin that warning just means that Let's Debug gave up waiting for the ACME CA to update the status of the challenge (30 seconds IIRC).

It's not anything to do with Let's Encrypt. I should probably increase the duration.

Edit: it's now 60s, but that doesn't seem to have helped. I'm kind of surprised. It's not like the challenge is being queued up on the Let's Encrypt side, and the 10 second deadline was always pretty strict in the past, wasn't it? Or am I misremembering, and the 10s timeout only affects dialing?

3 Likes

It responds quickly to normal browser requests.
There must be some sort of user agent check in place.

3 Likes

That was my suspicion.

3 Likes

I also find this a bit mysterious. But I think Jillian's on the right track with checking your HAProxy logs and configs. In particular I would be curious to see if HAProxy has a log entry for the validation requests starting.

Some things we know:

  • We're getting this error from our primary datacenters, not from AWS, so this is not related to blocking AWS addresses. If it were, we'd see "during secondary validation" in the error.
  • acadia.k12.la.us consistently fails, but www.acadia.k12.la.us consistently succeeds (I checked the logs, but also your Certbot output indicates that only the non-www hostname failed). [Edit: the parenthetical referred to the wrong hostname. Fixed]
  • This is a "timeout after connect," not "timeout during connect." That means it's probably not a firewall problem. If the firewall were blocking us, we'd get a timeout during connect.

One thing I wonder about: Could your certbot or other tooling be temporarily taking down Apache while it runs, then starting it back up afterwards? For instance, it's common to configure that in standalone mode. But your output above indicates webroot mode, so that's probably not what's going on.
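
For reference, the standalone pattern mentioned above usually looks something like this (the domain and service name are just placeholders):

# Illustrative only: standalone mode binds port 80 itself, so it is often
# paired with hooks that stop and restart the real web server around it.
certbot certonly --standalone -d example.com \
  --pre-hook "systemctl stop apache2" \
  --post-hook "systemctl start apache2"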

One long-shot test I would recommend: On some host outside your network, set up this loop:

while curl -vv -m 80 https://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test ; do sleep 1; done

That fetches the URL repeatedly, with a timeout of 80 seconds (Let's Encrypt's overall timeout is 90 seconds).

Then, on your instance that runs Certbot, do a renewal. Does the curl loop time out while the renewal is going on?
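
For example, something along these lines would exercise just this certificate against the staging environment (assuming the lineage is named acadia.k12.la.us):

# Renew only this lineage against staging, so it doesn't count toward rate limits.
certbot renew --dry-run --cert-name acadia.k12.la.us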

Also: is there anything different in your load balancer config for www vs non-www?

5 Likes

A lot of food for thought. :astonished: Thanks Jacob! :grinning:

2 Likes

@jsha

I got to rereading through the log and found myself wondering about something...

Did you mean this the other way around, maybe?

As in: only the non-www hostname failed

{
  "identifier": {
    "type": "dns",
    "value": "acadia.k12.la.us"
  },
  "status": "invalid",
  "expires": "2020-12-02T16:24:09Z",
  "challenges": [
    {
      "type": "http-01",
      "status": "invalid",
      "error": {
        "type": "urn:ietf:params:acme:error:connection",
        "detail": "Fetching http://acadia.k12.la.us/.well-known/acme-challenge/kogsqUE84tlx30Z5kwlOZIEENUvf9Ru6JlnqRh_Qz8o: Timeout after connect (your server may be slow or overloaded)",
        "status": 400
      },
      "url": "https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/161888351/HzeKxg",
      "token": "kogsqUE84tlx30Z5kwlOZIEENUvf9Ru6JlnqRh_Qz8o",
      "validationRecord": [
        {
          "url": "http://acadia.k12.la.us/.well-known/acme-challenge/kogsqUE84tlx30Z5kwlOZIEENUvf9Ru6JlnqRh_Qz8o",
          "hostname": "acadia.k12.la.us",
          "port": "80",
          "addressesResolved": [
            "104.232.38.112"
          ],
          "addressUsed": "104.232.38.112"
        }
      ]
    }
  ]
}
3 Likes

@_az

I think you are probably right about the 10s timeout only affecting dialing...

2 Likes

There's even more fun stuff happening that curl isn't reproducing:

$ curl -X GET -IL http://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test
HTTP/1.1 302 Found
content-length: 0
location: https://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test
cache-control: no-cache

HTTP/2 404
date: Thu, 26 Nov 2020 06:09:09 GMT
server: Apache
content-length: 196
content-type: text/html; charset=iso-8859-1
load-balancer: ies-lb-v2-ssl

versus a minimal replica of Let's Encrypt's VA HTTP client:

package main

import (
	"crypto/tls"
	"errors"
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	doRequest("http://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test")
}

func doRequest(u string) {
	req, err := http.NewRequest(http.MethodGet, u, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("user-agent", "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)")

	resp, err := makeThrowawayHTTPClient().Do(req)
	if err != nil {
		handleError(err)
		return
	}
	defer resp.Body.Close()

	log.Printf("Successful response: %d", resp.StatusCode)
}

func handleError(err error) {
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		log.Printf("Timeout after connect (%v)", err)
		return
	}

	log.Printf("Generic error: %v", err)
}

// makeThrowawayHTTPClient builds a single-use client whose transport settings
// approximate the VA's: keep-alives disabled, a 10s TLS handshake timeout,
// and certificate verification skipped.
func makeThrowawayHTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DisableKeepAlives:   true,
			IdleConnTimeout:     time.Second,
			TLSHandshakeTimeout: 10 * time.Second,
			MaxIdleConns:        1,
			TLSClientConfig: &tls.Config{
				InsecureSkipVerify: true,
			},
		},
	}
}
$ go run cmd/foo/main.go
2020/11/26 17:09:50 Timeout after connect (Get "https://acadia.k12.la.us/.well-known/acme-challenge/letsdebug-test": net/http: TLS handshake timeout)

So .... is the webserver intolerant of Go's TLS implementation perhaps? Setting TLSClientConfig.MaxVersion to TLS1.2 ...
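
Concretely, that's one extra field in the tls.Config of the transport above:

TLSClientConfig: &tls.Config{
	InsecureSkipVerify: true,
	// Cap the offered protocol version so the handshake never advertises TLS 1.3.
	MaxVersion: tls.VersionTLS12,
},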

$ go run cmd/foo/main.go
2020/11/26 17:13:38 Successful response: 404

Huh? Double huh, because the exact same reproduction does not work against the www subdomain, as jsha observed earlier.

This doesn't however explain https://acme-staging-v02.api.letsencrypt.org/get/authz-v3/161448597, where port 443 is not involved at all.

All this messing around ended up with me finding and fixing a connection leak in Let's Debug though, so :partying_face:.

5 Likes

I'm not understanding this part. If you look at my last post, you'll see where I think jsha mentioned www when the log reflects non-www (there are many polls for the apex domain in the log before the failure, while www succeeded immediately at the beginning). Did you get a failure against www?

3 Likes

I think everybody is in agreement that www does not suffer from any issues?

The same goes for the Go TLS issue I posted about - it only affects the domain apex (for some reason).

3 Likes

I misunderstood the implication of the "does not work" part, which is why I wanted some clarification. :slightly_smiling_face: Thanks for setting me straight.

2 Likes

I might have actually been on the right path? :smiley:

Not exactly a check, but a difference in client behavior. The apex versus www difference is baffling to me too. A looping rewrite in Apache, perhaps? That would explain the successful connection combined with the lack of response.

2 Likes

There's definitely intentional UA sniffing going on: e.g. the webserver responds with a 403 if your UA starts with "curl".

But the TLS thing is 99% likely to be an unintended interoperability issue.

Probably not a looping rewrite, though: the failure happens at the very start of the TLS handshake, before any HTTP request has been sent.

5 Likes

Is it a 403 or a 404?

I thought you got 302->404. Is this a different result?

3 Likes

It's an actual 403. But I think it's more of a curiosity, because it doesn't seem to affect the ACME challenge resources.

$ curl -H 'User-Agent: curlfoox' -X GET -IL http://acadia.k12.la.us/
HTTP/1.1 302 Found
content-length: 0
location: https://acadia.k12.la.us/
cache-control: no-cache

HTTP/2 403

vs

$ curl -H 'User-Agent: foox' -X GET -IL http://acadia.k12.la.us/
HTTP/1.1 302 Found
content-length: 0
location: https://acadia.k12.la.us/
cache-control: no-cache

HTTP/2 200
4 Likes

That's where I was lost. Makes sense now.

What UA does curl report when not specified? None?

3 Likes

That's a very interesting reproduction case, @_az! It sounds like your hypothesis is that Boulder's VA is hitting the redirect to HTTPS and then timing out on the HTTPS version of the URL. There's one problem, though. Here's what the validationRecord for the failed challenge looks like (from the pastebin log shared earlier):

      "validationRecord": [
        {
          "url": "http://acadia.k12.la.us/.well-known/acme-challenge/kogsqUE84tlx30Z5kwlOZIEENUvf9Ru6JlnqRh_Qz8o",
          "hostname": "acadia.k12.la.us",
          "port": "80",
          "addressesResolved": [
            "104.232.38.112"
          ],
          "addressUsed": "104.232.38.112"
        }
      ]

If Boulder had followed the redirect, there should be two entries in that validationRecord, one for each URL requested. The fact that there's only one suggests the redirect never happened. Also, if you search out the validationRecord for the successful challenge (the one with the www), you'll find that it is the same way - just the one HTTP URL, no redirect indicated.

As for curl's default User-Agent: you can test that yourself with curl httpbin.org/user-agent, or add -v to your curl command to see the request headers as well as the response headers.
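
For instance (the exact version string will be whatever curl you have installed):

$ curl -sS httpbin.org/user-agent
{
  "user-agent": "curl/7.68.0"
}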

5 Likes

So I have done a lot of testing over the last few days.

Let me start by saying we have over 500 SSL certificates from you folks at Let's Encrypt. Here is where it gets a bit interesting: I run a cronjob at 8:00 PM on Fridays. It ran a few days ago, and some certificates were renewed while others were not, so something really strange is going on. One thing I have changed is that I removed the forced redirect to SSL, so things now work over plain HTTP, but I still see the same results.

Another test I did was to remove HAProxy from the equation by setting the firewall to send everything from the two outgoing Let's Encrypt servers directly to the Apache server. Jillian gave me a way to extract those IP addresses and warned me that they can change without notice, so for testing purposes I am sending all traffic from those two IP addresses straight to Apache. This Apache server, by the way, runs full time: it is always up and does nothing but handle the Let's Encrypt stuff. Interestingly enough, once Certbot fails while bypassing the proxy server, I get entries like this in the Apache log file:

"letsencrypt.iescentral.com","10.5.1.172","10.5.1.254","-","[26/Nov/2020:11:29:26 -0800]","-","408","-","-","-"

I need to update things to log the X-Forwarded-For header so I can see your side's IP address; I will work on that. But this should provide a new clue: a 408 Request Timeout error, and it happens the instant Certbot fails on my side.

Since I do not know how your side works, let me explain my side. I changed the account # to all x's; that appears to be the only critical info.

renew_before_expiry = 30 days

version = 0.40.0
archive_dir = /etc/letsencrypt/archive/acadia.k12.la.us
cert = /etc/letsencrypt/live/acadia.k12.la.us/cert.pem
privkey = /etc/letsencrypt/live/acadia.k12.la.us/privkey.pem
chain = /etc/letsencrypt/live/acadia.k12.la.us/chain.pem
fullchain = /etc/letsencrypt/live/acadia.k12.la.us/fullchain.pem

# Options used in the renewal process

[renewalparams]
authenticator = webroot
account = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
server = https://acme-v02.api.letsencrypt.org/directory
[[webroot_map]]
acadia.k12.la.us = /home/common/etc/haproxy/letsencrypt/web
www.acadia.k12.la.us = /home/common/etc/haproxy/letsencrypt/web

Apache has that web folder as its document root. We know it works, because it has worked that way for over a year, and some certificates still succeed while others fail.

To me it seems like maybe a particular server on your side is having trouble communicating with our side? That could explain why some certificates renewed okay during the cronjob. Do you guys use a round-robin type method?

I have put a challenge file back in the correct folder, so this link:

http://acadia.k12.la.us/.well-known/acme-challenge/BmOxbbaZLxCE7dJPY50X6e_FYR86250y0Y9ApybCMEQ

That link will respond 24/7 until I remove the file.

The thing is, ANY domain we host will work as well. So, for instance:

http://iescentral.com/.well-known/acme-challenge/BmOxbbaZLxCE7dJPY50X6e_FYR86250y0Y9ApybCMEQ

That would work as well. Basically, in HAProxy I look for .well-known/acme-challenge and forward any request containing that path to the same backend folder. The Certbot webroot is set to point at that folder as well.
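
Roughly, that kind of rule looks something like this sketch (the frontend/backend names and server address here are placeholders, not the actual config):

frontend http-in
    bind *:80
    # Anything under the ACME challenge path goes to the dedicated Apache box
    acl is_acme_challenge path_beg /.well-known/acme-challenge/
    use_backend letsencrypt_backend if is_acme_challenge
    default_backend main_site

backend letsencrypt_backend
    # Apache instance whose DocumentRoot is the Certbot webroot
    server le_apache 192.0.2.10:80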

I can see the challenge files being created and deleted properly, and while they exist I can hit them with any web browser on any network I have access to, and they return properly every time.

So, by removing HTTPS from the equation, shouldn't that solve the Go TLS issue? If it is using HTTP instead, TLS should no longer be part of this. Unless, of course, there is something on your side I am unaware of.

I look forward to hearing your thoughts, and I can try to implement any tests you wish on my side.

Just FYI, at this moment I have firewall rules bypassing HAProxy. I can change it back in a few seconds upon request.

3 Likes

As you noticed, 408 is a timeout.
That can easily be tested.
Can it be reached from the Internet or not?

2 Likes