Timeout during connect (likely firewall problem) on 8.43.85.0/24

My domain is:

All attempts to renew certificates for services hosted on the 8.43.85.0/24 subnet fail. We can clearly see the connection in the logs, but the validation doesn’t happen:

Jul 30 09:34:45 wiki apache: 3.14.255.131 - - [30/Jul/2019:09:34:45 +0000] "GET /.well-known/acme-challenge/C7VRofqb87xsTZ6moD6Fhh6ePcKrt6mpopY8zMe3kMo HTTP/1.1" 302 382 "http://wiki.gnome.org/.well-known/acme-challenge/C7VRofqb87xsTZ6moD6Fhh6ePcKrt6mpopY8zMe3kMo" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)"

The excerpt is taken from wiki.gnome.org (8.43.85.12). I believe there’s an issue with this specific subnet; it’d be lovely if a Let’s Encrypt engineer could look into it.

It produced this output:

response {
  "type": "http-01",
  "status": "invalid",
  "error": {
    "type": "urn:acme:error:connection",
    "detail": "Fetching http://wiki.gnome.org/.well-known/acme-challenge/C7VRofqb87xsTZ6moD6Fhh6ePcKrt6mpopY8zMe3kMo: Timeout during connect (likely firewall problem)",
    "status": 400
  },

My web server is (include version):
httpd-2.4.6-89.el7_6.1.x86_64. There have been no recent changes to the httpd configuration or to the Let’s Encrypt tools. The httpd configuration matches that of hosts on another subnet, which receive validated certificates just fine.

The operating system my web server runs on is (include version):
RHEL 7

I can login to a root shell on my machine (yes or no, or I don’t know):
yes

Hi, I am also hosting servers in the same DC, and I can reproduce this on an unrelated server (8.43.85.171, just 1h ago). Both averi and I have checked the network (with limited access), and from what I could see, the HTTP request arrived and the response was sent, with no errors at the TCP level.

Hi @averi

Checking the wiki.gnome.org URL, there is a redirect to a specific subdomain (https://check-your-website.server-daten.de/?q=wiki.gnome.org):

| Domainname | Http-Status | redirect | Sec. | G |
|---|---|---|---|---|
| http://wiki.gnome.org/ (8.43.85.12) | 302 | https://wiki.gnome.org/ | 0.234 | A |
| https://wiki.gnome.org/ (8.43.85.12) | 200 | | 4.033 | A |
| http://wiki.gnome.org/.well-known/acme-challenge/check-your-website-dot-server-daten-dot-de (8.43.85.12) | 302 | https://wiki.gnome.org/.well-known/acme-challenge/check-your-website-dot-server-daten-dot-de | 0.233 | A |
| https://wiki.gnome.org/.well-known/acme-challenge/check-your-website-dot-server-daten-dot-de | 302 | https://letsencrypt.gnome.org/.well-known/acme-challenge/check-your-website-dot-server-daten-dot-de | 3.674 | A |
| https://letsencrypt.gnome.org/.well-known/acme-challenge/check-your-website-dot-server-daten-dot-de | 404 | | 4.250 | A |

Visible Content (third row): Found. The document has moved here. Apache/2.4.6 (Red Hat Enterprise Linux) Server at wiki.gnome.org Port 80
Visible Content (fourth row): Found. The document has moved here. Apache/2.4.6 (Red Hat Enterprise Linux) Server at wiki.gnome.org Port 443
Visible Content (fifth row): Not Found. The requested URL /.well-known/acme-challenge/check-your-website-dot-server-daten-dot-de was not found on this server. Apache/2.2.15 (Red Hat) Server at letsencrypt.gnome.org Port 80

/.well-known/acme-challenge/random-filename is redirected to https://letsencrypt.gnome.org/.well-known/acme-challenge/random-filename.

It looks more like wiki.gnome.org has changed something, so that the request from your domain isn’t redirected correctly. Why do you redirect to wiki.gnome.org?

The reason behind the redirects is not relevant here, as Let’s Encrypt officially supports up to 10 redirects. Nothing changed on the wiki.gnome.org side, nor on any other host on that subnet. @misc manages a set of systems that sit outside of GNOME, and he’s affected by the problem as well.
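For anyone who wants to inspect a redirect chain like the one above hop by hop, here is a minimal Python sketch. It is not what the Let’s Encrypt validation server runs; `fetch_once` and `trace_redirects` are hypothetical helper names, and the default of 10 simply mirrors the redirect limit mentioned in this thread. The fetcher is injectable so the logic can be checked without touching the network.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_once(url, timeout=10.0):
    """Send one GET without following redirects; return (status, Location or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, fetch=fetch_once, max_redirects=10):
    """Follow redirects manually and return a list of (url, status) hops.

    Raises RuntimeError if more than max_redirects redirects are needed.
    """
    hops = []
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            return hops
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError(f"more than {max_redirects} redirects")
```

Calling `trace_redirects("http://wiki.gnome.org/.well-known/acme-challenge/random-filename")` from a machine with network access would list each hop and its status code. Note that a successful trace from your own machine says nothing about reachability from the validation servers, which is the actual problem in this thread.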

I’m confident the problem lies on Let’s Encrypt’s side :slight_smile:

Hi @averi and @misc,

We’re taking a look at this right now and will update this thread when we have more information.

We’ve escalated to the affected datacenter’s upstream network engineers and have updated https://letsencrypt.status.io/.

I don’t see the status page showing any issue right now though and the problem is still there.

Thanks for the prompt action :wink:

Hmm :thinking: Did you try a hard refresh? I see this partial service disruption notice as active:
https://letsencrypt.status.io/pages/incident/55957a99e800baa4470002da/5d40614b3df4b70c69ed16dd

I see it now, thanks!

We’re still waiting on our upstream ISP to resolve this. They are aware of the problem, but the only answer we get about when it will be resolved is “soon”.

Is there any chance this can be escalated? A set of certificates will need renewal soon, and we don’t have enough time to rewrite our automation tools to switch to a verification method other than acme.

Thanks!

Any update @Phil?

I am also having related issues.

Thanks!

We’ll keep pushing on it, thanks for checking in. What’s your first certificate expiry date? That can help us communicate the urgency to our upstream.

We are migrating a lot of sites, and part of the migration process is to request a certificate. We are only sporadically successful. Is there an option to send certificate requests somewhere else?

I'm afraid not, sorry.

Sorry to hear that. My requests are coming from AWS us-east-1 region, so I assume many others are impacted by this issue.

Hi, @averi & @misc,

Our upstream ISP is still investigating this. At this point, with the troubleshooting information we’ve gathered, it looks like the problem’s limited to the small slice of IP space you’re in (8.43.84.0/22). It’s reachable from most of our validation endpoints, but a traceroute from one of them appears to stop within Peak 10’s Raleigh POP, which is the last hop before your /22.

It’s hard to tell from the outside exactly how your network connectivity is set up, but I’ve spot-checked a few sites with what I think is very similar connectivity, and they are working for us.

We’re going to keep following up. It might speed things up for you to check with your network engineers, and with your immediate upstream ISP, as well. Is it possible there’s some kind of firewall or DDoS protection appliance that’s blocking our validation endpoints? That’s an issue we’ve seen before, with similar symptoms.

Hi, @readetaylor,

Your issue may be unrelated: we’re not aware of any problems affecting validations for sites on AWS.

I assume you’re also seeing a “Timeout during connect (likely firewall problem).” A frequent cause of this is publishing IPv6 AAAA records in DNS, without allowing IPv6 connections through your firewall (or AWS security group).
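A quick way to check for that AAAA mismatch is to try a TCP connection to every address the hostname resolves to. The following is a minimal Python sketch; `probe` is a hypothetical helper name, and it only gives a meaningful IPv6 answer when run from a host outside your firewall that itself has working IPv6 connectivity.

```python
import socket


def probe(host, port=80, timeout=5.0):
    """Try a TCP connect to each A/AAAA address of host.

    Returns a list of (label, address, reachable) tuples, where label is
    "IPv4" or "IPv6".
    """
    results = []
    seen = set()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, _socktype, _proto, _canonname, sockaddr in infos:
        if family not in (socket.AF_INET, socket.AF_INET6):
            continue
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        key = (label, sockaddr[0])
        if key in seen:
            continue
        seen.add(key)
        try:
            with socket.socket(family, socket.SOCK_STREAM) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            reachable = True
        except OSError:
            # Covers timeouts, refused connections and unreachable networks.
            reachable = False
        results.append((label, sockaddr[0], reachable))
    return results
```

If `probe("example.com")` reports the IPv4 address as reachable and the IPv6 address as unreachable, the firewall or security group is a likely suspect; `example.com` here is just a placeholder for your own domain.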

If that’s not it, could you please start a new thread with full troubleshooting info, including some sample domains?

Thanks!

The problem appears to be fixed. We had our network engineering team double check and it appears asymmetric routing got in the way. An RCA is still being worked out internally. Thanks a lot for your support!
