HTTP validation failures due to timeout (IPv6 issues maybe?)

I’m developing my own http-01 based client solution using the Python acme library. I’ve got it almost working, but it fails on the .poll() call with a timeout response on the Let’s Encrypt staging server. I can access the challenge URL just fine in the browser, which (based on research here) makes me think there’s an IPv6 problem somehow. Could it be related to https://github.com/letsencrypt/boulder/pull/2852?

urn:acme:error:connection :: The server could not connect to the client to verify the domain :: Fetching http://images-test.jamesaddison.ca/.well-known/acme-challenge/fZJwuGZMpCdqVZHuS-tRm8HFnT1ySEc-G9F4aFjkWB4: Timeout
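
For reference, the failing flow in my client looks roughly like this (a minimal sketch assuming the 2017-era acme v1 Client API; the variable names are mine, and the client/account-key setup is omitted):

from acme import messages

# `acme_client` is an already-registered acme.client.Client pointed at the
# staging directory; `account_key` is the account's josepy JWK.
authzr = acme_client.request_domain_challenges('images-test.jamesaddison.ca')
challb = next(c for c in authzr.body.challenges
              if c.chall.typ == 'http-01')

# key authorization that gets served at /.well-known/acme-challenge/<token>
response, validation = challb.response_and_validation(account_key)

acme_client.answer_challenge(challb, response)

# this is where it fails: instead of going valid, the challenge comes back
# with the urn:acme:error:connection Timeout shown above
updated_authzr, _ = acme_client.poll(authzr)
assert updated_authzr.body.status == messages.STATUS_VALID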
james@wombat:~ $ dig images-test.jamesaddison.ca

; <<>> DiG 9.8.3-P1 <<>> images-test.jamesaddison.ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42647
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;images-test.jamesaddison.ca.	IN	A

;; ANSWER SECTION:
images-test.jamesaddison.ca. 59	IN	CNAME	y5fi6c7itdivphpyc3gc000000000011.ca01.eemcdn.net.
y5fi6c7itdivphpyc3gc000000000011.ca01.eemcdn.net. 119 IN CNAME ca01.eemcdn.net.
ca01.eemcdn.net.	119	IN	CNAME	ca01.eemcdn.net.i.belugacdn.com.
ca01.eemcdn.net.i.belugacdn.com. 599 IN	A	104.37.178.1

;; Query time: 70 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sat Jul 29 09:12:15 2017
;; MSG SIZE  rcvd: 182

and IPv6:

james@wombat:~ $ dig aaaa images-test.jamesaddison.ca

; <<>> DiG 9.8.3-P1 <<>> aaaa images-test.jamesaddison.ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 60795
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;images-test.jamesaddison.ca.	IN	AAAA

;; ANSWER SECTION:
images-test.jamesaddison.ca. 54	IN	CNAME	y5fi6c7itdivphpyc3gc000000000011.ca01.eemcdn.net.
y5fi6c7itdivphpyc3gc000000000011.ca01.eemcdn.net. 114 IN CNAME ca01.eemcdn.net.
ca01.eemcdn.net.	114	IN	CNAME	ca01.eemcdn.net.i.belugacdn.com.
ca01.eemcdn.net.i.belugacdn.com. 599 IN	AAAA	2610:1c8:c::1

;; Query time: 50 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sat Jul 29 09:12:20 2017
;; MSG SIZE  rcvd: 194

From http://ipv6-test.com/validate.php:

[screenshot of the ipv6-test.com validation results]

The fact that your domain has no IPv6 DNS server shouldn’t make a difference.

There’s one easy way to test: Can you create a subdomain that points to the same IPv4 IP and has only an A record? If that succeeds while this subdomain fails, then we are more likely looking at an IPv6 problem.

Note that #2852 is now fixed and in production.

It does indeed work with a plain IPv4 A record instead of a series of CNAMEs. I get past the polling step with a status of valid for the challenge and am able to fetch the chain cert, etc.
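
For completeness, the happy path past polling looks roughly like this in my client (again a sketch against the v1 API; acme_client and authzr are as in my earlier snippet, and csr is my CSR wrapped in josepy's ComparableX509):

from acme import messages

updated_authzr, _ = acme_client.poll(authzr)
assert updated_authzr.body.status == messages.STATUS_VALID

# request issuance against the now-valid authorization, then pull the chain
certr = acme_client.request_issuance(csr, [updated_authzr])
chain = acme_client.fetch_chain(certr)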

Where might the issue lie in this situation? As you might be able to tell, I’m CNAMEing subdomains under my control to designated subdomains of a partner CDN (BelugaCDN). Their servers are IPv6-ready; however, my origin servers that they connect to are not (which I don’t believe should matter…).

Please let me know if I can do any further testing for you! Note that the failing subdomain was images-test.jamesaddison.ca while the successful one was images-test2.jamesaddison.ca; see below for my dig results:

james@wombat:~ $ dig images-test2.jamesaddison.ca

; <<>> DiG 9.8.3-P1 <<>> images-test2.jamesaddison.ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7722
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;images-test2.jamesaddison.ca.	IN	A

;; ANSWER SECTION:
images-test2.jamesaddison.ca. 59 IN	A	104.37.178.1

;; Query time: 34 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sat Jul 29 14:52:37 2017
;; MSG SIZE  rcvd: 62

and IPv6:

james@wombat:~ $ dig aaaa images-test2.jamesaddison.ca

; <<>> DiG 9.8.3-P1 <<>> aaaa images-test2.jamesaddison.ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 196
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;images-test2.jamesaddison.ca.	IN	AAAA

;; AUTHORITY SECTION:
jamesaddison.ca.	1799	IN	SOA	dns1.registrar-servers.com. hostmaster.registrar-servers.com. 2017072900 43200 3600 604800 3601

;; Query time: 69 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sat Jul 29 14:52:45 2017
;; MSG SIZE  rcvd: 119

Hi @jsha,

I’m the CTO and one of the founders of BelugaCDN, which hosts images-test.jamesaddison.ca, the domain @jaddison is trying to test against.

I’d like to offer my assistance in debugging the issue if we can be of any help.

Do you have a specific set of IPv6 (and IPv4) IPs that you issue challenges from? I would like to begin by testing connectivity (v4+v6) from each of our POPs. I can also use this to determine which POP your challenges should be hitting, and attempt to gather a tcpdump capture of your connection attempts to see if/where the connection failure is occurring. Would it be possible for us to coordinate so you can do the same on your side? Traceroutes from your challenge servers to images-test.jamesaddison.ca may also prove useful.
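
For concreteness, by a connectivity test I mean something along these lines from each POP (a quick Python sketch; the two addresses are the A/AAAA results from the dig output above, and the token path is a stand-in rather than a real challenge token):

import http.client

for addr in ('104.37.178.1', '2610:1c8:c::1'):
    try:
        # dial the literal address but send the real Host header, so the
        # edge routes the request to the right site
        conn = http.client.HTTPConnection(addr, 80, timeout=10)
        conn.request('GET', '/.well-known/acme-challenge/test',
                     headers={'Host': 'images-test.jamesaddison.ca'})
        print(addr, '->', conn.getresponse().status)
    except OSError as err:
        print(addr, '-> FAILED:', err)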

Thanks and let me know,

-Adam

Thanks, @AdamJacobMuller. @cpu might also be able to help with your debugging request.

It does appear that we're connecting over IPv6 to the BelugaCDN edge. I agree that it shouldn't matter what your origin servers are doing in this case, since it seems like the VA error is a timeout getting an HTTP response from the edge over IPv6.

We don’t publish a list of IP addresses we use to validate, because they may change at any time. In the future we may validate from multiple IP addresses at once.

I'll ask our operations team to try and collect some of this today. I was able to verify IPv6 connectivity from one of my own test servers to the address that the VA tried to use when the timeout was observed. There wasn't anything fishy-looking, so it may be related to the VA datacentre or the POP that it reached. I'll see what we can find out!

Thank you for bringing this to our attention. I’ve run the following commands from several vantage points across the internet on different network providers.

  • traceroute 2610:1c8:c::1
  • traceroute -6 2610:1c8:c::1
  • curl -g -H "Host: images-test.jamesaddison.ca" [2610:1c8:c::1]:80/.well-known/acme-challenge/
  • curl -6 -g -H "Host: images-test.jamesaddison.ca" [2610:1c8:c::1]:80/.well-known/acme-challenge/

I am seeing mixed results from the various vantage points while attempting to establish a connection to 2610:1c8:c::1.

Working:

  • Amazon: AS14618
  • Various AT&T ASNs

Not Working:

  • Linode: AS63949
  • Merit Networks: AS237
  • Comcast: AS7922
  • The current issuing datacenter’s ASN

So, it looks like a network error/misconfiguration perhaps?

Is it safe to assume then that either IPv4 is not working in these cases as well, or the fallback fix in https://github.com/letsencrypt/boulder/pull/2852 is not working as desired?

Digging through our logs, I see this:

net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

This is the more specific version of the "Timeout" error you got. What this indicates to me is that the connection was successful and Boulder sent the request headers, but the response timed out. However, curl -6 images-test.jamesaddison.ca from a random VPS works for me. Maybe there's a firewall on certain paths that is dropping the request headers? Sounds similar to "IPv6First with a hanging ipv6 connection never tries the ipv4 address" (https://github.com/letsencrypt/boulder/issues/2897).
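
To illustrate what correct fallback would need to do, here's a rough sketch (hand-written Python for illustration, not Boulder's actual Go code): prefer the AAAA result, but let a hung v6 connect time out and move on to the A record.

import socket

def connect_with_fallback(host, port=80, timeout=5):
    # sort resolver results so IPv6 addresses are tried first, mirroring
    # Boulder's IPv6-first preference
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    infos.sort(key=lambda info: info[0] != socket.AF_INET6)
    last_err = None
    for family, _, _, _, sockaddr in infos:
        try:
            # a hang here must surface as a timeout so the loop can fall
            # back to IPv4; the bug in issue 2897 is that it doesn't
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError as err:
            last_err = err
    raise last_err

With that ordering, a single broken v6 path degrades to v4 instead of surfacing as the Timeout above.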

Hi @devnullisahappyplace,

For the non-working networks, could you provide the traceroute output?

I just tested from Linode in Newark and have no issues connecting.

@AdamJacobMuller

For me, it works with Linode Atlanta, but with Dallas, IPv6 connections time out.

Atlanta:

$ mtr -brwz images-test.jamesaddison.ca
Start: Tue Aug  1 04:53:57 2017
HOST: jane                                                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS63949 2600:3c02::8678:acff:fe5a:1941                   0.0%    10    0.9   1.0   0.9   1.3   0.0
  2. AS63949 2600:3c02:4444:3::1                             50.0%    10    1.0   1.2   1.0   1.6   0.0
  3. AS63949 2600:3c02:4444:5::1                             80.0%    10    1.0   1.0   0.9   1.0   0.0
  4. AS???   2001:478:132::75                                 0.0%    10    0.7   3.3   0.6  26.2   8.0
  5. AS6939  100ge4-1.core1.mia1.he.net (2001:470:0:18d::2)   0.0%    10   15.0  15.0  14.9  15.4   0.0
  6. AS6461  2001:504:40:108::1:32                            0.0%    10   13.8  14.9  13.7  24.6   3.3
  7. AS23393 2610:1c8:c::1                                    0.0%    10   13.8  13.8  13.6  14.1   0.0

Dallas:

$ mtr -brwz images-test.jamesaddison.ca
Start: Tue Aug  1 04:54:02 2017
HOST: clover                                                         Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS63949 2600:3c00:ffff:0:8678:acff:fe0d:97c1                     0.0%    10    1.3   1.6   1.2   3.2   0.5
  2. AS63949 2600:3c00:2222:6::1                                      0.0%    10    1.3   1.3   1.2   1.4   0.0
  3. AS63949 2600:3c00:2222:5::2                                     40.0%    10    2.2   1.6   1.2   2.3   0.0
  4. AS???   eqix-ix-da1-ipv6.isprime.com (2001:504:0:5:0:2:3393:1)   0.0%    10   13.2  13.7   1.1  68.9  20.1
  5. AS???   ???                                                     100.0    10    0.0   0.0   0.0   0.0   0.0

Online mtr thingies: https://mtr-atlanta.mnrd.us/?c=8da3b52e https://mtr-dallas.mnrd.us/?c=74a62a93 (Those links will expire, and running without -n may time out.)


@AdamJacobMuller I’ve sent you a PM with some traceroute information.

So is this issue considered solved by both sides then? I think @AdamJacobMuller indicated as much to me out of band.

@jaddison Have you been able to make progress on your client since this thread started? If so, I’d say this can be closed.

I cannot reproduce the issue any more - thanks @devnullisahappyplace and @AdamJacobMuller!

Hi @devnullisahappyplace,

We were able to locate an issue in a single city where connections over IPv6 were failing.

Very much appreciate the traceroutes and other debug information. I will also mention that this definitely indicates you have an issue with falling back from IPv6 to IPv4 when a connection can’t be established over v6.

Thanks all,

-Adam


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.