"Timeout during connect", but we confirmed 3 successful validations

We're attempting to create a new certificate for a customer domain hosted on the Discourse infrastructure and receiving the dreaded error:

urn:ietf:params:acme:error:connection
Fetching https://api.discourse.org/api/ssl_challenges?hostname=(redacted)&filename=/.well-known/acme-challenge/LpIZOn6UjHjtir408rTfu2VLyxIOZXuxsk0ycndwV0Y: Timeout during connect (likely firewall problem)

However, we can confirm that three successful retrievals of the challenge secret were made by Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org) from the system hosting the challenge response:

Time | source.ip | url.query | http.response.status_code
Apr 5, 2021 @ 17:31:50.094 | 2a05:d014:3ad:701:d969:e08f:1bb9:62bd | hostname=(redacted)&filename=/.well-known/acme-challenge/LpIZOn6UjHjtir408rTfu2VLyxIOZXuxsk0ycndwV0Y | 200
Apr 5, 2021 @ 17:31:49.897 | 2600:1f16:269:da01:367:cea2:153a:d5c8 | hostname=(redacted)&filename=/.well-known/acme-challenge/LpIZOn6UjHjtir408rTfu2VLyxIOZXuxsk0ycndwV0Y | 200
Apr 5, 2021 @ 17:31:49.588 | 2600:1f14:804:fd02:1be3:bfea:ffcc:a21f | hostname=(redacted)&filename=/.well-known/acme-challenge/LpIZOn6UjHjtir408rTfu2VLyxIOZXuxsk0ycndwV0Y | 200

Since Let's Encrypt performs multi-perspective validation, I infer that one of the four validation requests failed.

The logs for the web server that actually hosts the domain (but not the LE challenge response) suggest that the failing request was the latest one, from 66.133.109.36:

Time | source.ip | url.path | http.response.status_code
Apr 5, 2021 @ 17:32:02.456 | 66.133.109.36 | /.well-known/acme-challenge/LpIZOn6UjHjtir408rTfu2VLyxIOZXuxsk0ycndwV0Y | 301
Apr 5, 2021 @ 17:31:49.585 | 2a05:d014:3ad:701:d969:e08f:1bb9:62bd | /.well-known/acme-challenge/LpIZOn6UjHjtir408rTfu2VLyxIOZXuxsk0ycndwV0Y | 301
Apr 5, 2021 @ 17:31:49.430 | 2600:1f16:269:da01:367:cea2:153a:d5c8 | /.well-known/acme-challenge/LpIZOn6UjHjtir408rTfu2VLyxIOZXuxsk0ycndwV0Y | 301
Apr 5, 2021 @ 17:31:49.354 | 2600:1f14:804:fd02:1be3:bfea:ffcc:a21f | /.well-known/acme-challenge/LpIZOn6UjHjtir408rTfu2VLyxIOZXuxsk0ycndwV0Y | 301

(These logs are for the actual hostname, which redirects to the system that produced the first set of logs.)

It seems that the validation from 66.133.109.36 (outbound1.letsencrypt.org.) is not following the redirect. We see zero traffic from that IP to the server hosting the challenge response, even though HTTP and HTTPS traffic is explicitly allowed. Alternatively, it may be trying over IPv6 and not getting through, possibly due to the HE/Cogent problem, and then not attempting IPv4.
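
For anyone who wants to repeat the comparison on their own logs, here is a minimal Python sketch of what I did by eye. The source IPs are copied from the two log excerpts above; the function name `missing_after_redirect` is just something I made up for illustration.

```python
import ipaddress

# Source IPs that received the 301 from the web server hosting the domain.
redirect_host_hits = [
    "66.133.109.36",
    "2a05:d014:3ad:701:d969:e08f:1bb9:62bd",
    "2600:1f16:269:da01:367:cea2:153a:d5c8",
    "2600:1f14:804:fd02:1be3:bfea:ffcc:a21f",
]

# Source IPs that received the 200 from the system hosting the challenge response.
challenge_host_hits = [
    "2a05:d014:3ad:701:d969:e08f:1bb9:62bd",
    "2600:1f16:269:da01:367:cea2:153a:d5c8",
    "2600:1f14:804:fd02:1be3:bfea:ffcc:a21f",
]


def missing_after_redirect(redirected, arrived):
    """Return requesters that were redirected but never reached the challenge host."""
    # Parse addresses so that differently formatted IPv6 strings still compare equal.
    arrived_set = {ipaddress.ip_address(ip) for ip in arrived}
    return [ip for ip in redirected if ipaddress.ip_address(ip) not in arrived_set]


missing = missing_after_redirect(redirect_host_hits, challenge_host_hits)
print(missing)
```

With the addresses above, the only requester left over is 66.133.109.36. Note that this only matches on source address, so a validator that switched address family between the two requests would also show up as missing.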

IPv4 connectivity is up between the server and 66.133.109.36 (as I can ping it), so I know that's not the problem.
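
A ping only covers ICMP over IPv4, so a rough way to check TCP reachability separately per address family is sketched below. This is an assumption-laden sketch rather than a reproduction of what the validation server does: `tcp_reachable` is a hypothetical helper of mine, and it only tests the path from wherever you run it, which may differ from the path Let's Encrypt's servers take.

```python
import socket


def tcp_reachable(host, port, family, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds over the given family.

    family is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    """
    try:
        # Restrict name resolution to the requested address family.
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # No address of this family could be resolved for the host.
        return False
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            # Timeout, refusal, or no route: try the next address, if any.
            continue
    return False
```

Running `tcp_reachable("api.discourse.org", 443, socket.AF_INET6)` and the same call with `socket.AF_INET` from a host on another network would show whether only one of the two families is affected.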

Is there any visibility from LE's side into what may be going wrong?

5 Likes

It's nice to see a help post with so much detail! Can you tell us how often you've tried? That is, is this the report from just one attempt, or have you tried multiple times and seen the same pattern and sources of requests?

Also, did the error message you get from your client say "Secondary validation" in it? If not, maybe that means that the secondary validation succeeded, but the primary validation (from a different network) failed?

I also find it weird that you see some IPv4 requests and some IPv6 requests. It is worth noting that per the IPv6 documentation a fallback from IPv6 to IPv4 happens on an initial request, but not on a redirect. So it sounds to me like there is some sort of IPv6 connectivity issue between at least some of Let's Encrypt's locations and your IPv6 network.
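
To make that fallback rule concrete, here is a small toy model in Python. It is only a sketch of the behaviour as I described it above, not Let's Encrypt's actual code: the `Host` class, the function names, and the reachability values in the scenario are all assumptions chosen to match the symptoms in this thread.

```python
from dataclasses import dataclass


@dataclass
class Host:
    has_aaaa: bool      # publishes an IPv6 (AAAA) record
    has_a: bool         # publishes an IPv4 (A) record
    v6_reachable: bool  # validator can connect over IPv6
    v4_reachable: bool  # validator can connect over IPv4


def fetch_ok(host, allow_v4_fallback):
    """Model one fetch: prefer IPv6 when an AAAA record exists."""
    if host.has_aaaa:
        if host.v6_reachable:
            return True
        # The IPv6 attempt timed out; retry over IPv4 only if permitted.
        return allow_v4_fallback and host.has_a and host.v4_reachable
    return host.has_a and host.v4_reachable


def validate(initial, redirect_target):
    """Model a validation that follows one redirect."""
    # Fallback to IPv4 applies to the initial request only.
    if not fetch_ok(initial, allow_v4_fallback=True):
        return "initial request failed"
    if not fetch_ok(redirect_target, allow_v4_fallback=False):
        return "Timeout during connect"
    return "valid"


# Assumed scenario: both hosts are dual-stack, and the validator's IPv6 path is broken.
customer_host = Host(has_aaaa=True, has_a=True, v6_reachable=False, v4_reachable=True)
challenge_host = Host(has_aaaa=True, has_a=True, v6_reachable=False, v4_reachable=True)

result = validate(customer_host, challenge_host)
print(result)
```

In this model the initial request succeeds over IPv4 after the fallback, which would explain the IPv4 source address in the first hop, while the redirected request fails because no IPv4 retry is attempted.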

5 Likes

@lestaff can someone take a look at this?

4 Likes

We are looking at it and examining the impact. If we suspect this is a problem for more than one user, we will open a status incident while we find the root cause and remediate.

7 Likes

At least 5 attempts, same behaviour each time.

No - the full error was at the top:

Timeout during connect (likely firewall problem)

Given that the failing request seems to be the one from 66.133.109.36, looks like it's the primary that's failing.

We're hosting on HE, so if the primary validation server is on a Cogent-only network that'll probably cause the behaviour we see.

4 Likes

Looks like Let's Encrypt staff is working on it.

https://letsencrypt.status.io/pages/incident/55957a99e800baa4470002da/606b5c00a8b4db052d1ba2e1

April 5, 2021 18:50 UTC [Identified] We're aware of a problem that is intermittently affecting validation of sites that use IPv6. We've identified the cause and are working to fix it ASAP.

6 Likes

[Monitoring] We've put a workaround in place that should mitigate the intermittent IPv6 validation issue. We're continuing work to fix the underlying cause.

I can confirm that I am able to issue a certificate for the hostname in question after the workaround was put in place.

:heart: to the LE staff :smiley:

9 Likes
