I’m developing my own http-01 based client solution using the python acme library. I’ve got it almost working, but it fails on the .poll() call with a timeout response on the staging letsencrypt server. I can access the challenge URL just fine in the browser, which (based on research here) makes me think there’s an IPv6 problem somehow. Could it be related to https://github.com/letsencrypt/boulder/pull/2852?
urn:acme:error:connection :: The server could not connect to the client to verify the domain :: Fetching http://images-test.jamesaddison.ca/.well-known/acme-challenge/fZJwuGZMpCdqVZHuS-tRm8HFnT1ySEc-G9F4aFjkWB4: Timeout
james@wombat:~ $ dig images-test.jamesaddison.ca
; <<>> DiG 9.8.3-P1 <<>> images-test.jamesaddison.ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42647
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;images-test.jamesaddison.ca. IN A
;; ANSWER SECTION:
images-test.jamesaddison.ca. 59 IN CNAME y5fi6c7itdivphpyc3gc000000000011.ca01.eemcdn.net.
y5fi6c7itdivphpyc3gc000000000011.ca01.eemcdn.net. 119 IN CNAME ca01.eemcdn.net.
ca01.eemcdn.net. 119 IN CNAME ca01.eemcdn.net.i.belugacdn.com.
ca01.eemcdn.net.i.belugacdn.com. 599 IN A 104.37.178.1
;; Query time: 70 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sat Jul 29 09:12:15 2017
;; MSG SIZE rcvd: 182
The fact that your domain has no IPv6 DNS server shouldn’t make a difference.
There’s one easy way to test: Can you create a subdomain that points to the same IPv4 IP and has only an A record? If that succeeds while this subdomain fails, then we are more likely looking at an IPv6 problem.
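To compare the two hostnames from a script, it can help to list which address families each one resolves to. The following is a minimal sketch using Python's standard `socket.getaddrinfo`; the helper name `address_families` and its `resolver` parameter are illustrative assumptions and not part of the acme library.

```python
import socket
import sys


def address_families(hostname, resolver=socket.getaddrinfo):
    """Return a dict mapping 'IPv4'/'IPv6' to the sorted addresses found.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so the
    function can be exercised without network access (an assumption made for
    this sketch).
    """
    labels = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    found = {"IPv4": set(), "IPv6": set()}
    try:
        results = resolver(hostname, 80, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Name did not resolve at all: report no addresses for either family.
        return {name: [] for name in found}
    for family, _type, _proto, _canon, sockaddr in results:
        label = labels.get(family)
        if label:
            found[label].add(sockaddr[0])
    return {name: sorted(addrs) for name, addrs in found.items()}


if __name__ == "__main__":
    # Usage: python families.py failing-host.example working-host.example
    for host in sys.argv[1:]:
        print(host, address_families(host))
```

If the failing name lists IPv6 addresses and the A-only name does not, that supports the IPv6 theory. Note that the result depends on the local resolver and on whether the machine running it has IPv6 configured.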
It does indeed work with an IPv4 A address instead of a series of CNAMEs. I get past the polling step with a status of valid for the challenge and am able to fetch chain cert, etc.
Where might the issues be in this situation? As you might be able to tell, I’m CNAMEing subdomains under my control to subdomains of a partner CDN’s (BelugaCDN) designated subdomains. Their servers are IPv6 ready, however, my origin servers that they connect to are not (which I don’t believe should matter…).
Please let me know if I can do any further testing for you! Please note that the failing subdomain was images-test.jamesaddison.ca while the successful one was images-test2.jamesaddison.ca - see below for my dig results:
I’m the CTO and one of the founders of BelugaCDN, which is what is hosting images-test.jamesaddison.ca which @jaddison is trying to test/connect to.
I’d like to offer my assistance debugging the issue if we can be of any.
Do you have a specific set of IPv6 (and IPv4) IPs that you issue challenges from? I would like to begin by testing connectivity (v4 and v6) from each of our POPs. I can also use this to determine which POP your challenges should be hitting, and attempt to gather a tcpdump capture of your connection attempts to see if and where the connection failure is occurring. Would it be possible for us to coordinate so you can do the same on your side? Traceroutes from your challenge servers to images-test.jamesaddison.ca may also prove useful.
It does appear that we're connecting over IPv6 to the Beluga CDN edge. I agree that it shouldn't matter what your origin servers are doing in this case since it seems like the VA error is a timeout getting an HTTP response from the edge over IPv6.
We don’t publish a list of IP addresses we use to validate, because they may change at any time. In the future we may validate from multiple IP addresses at once.
I'll ask our operations team to try and collect some of this today. I was able to verify IPv6 connectivity from one of my own test servers to the address that the VA tried to use when the timeout was observed. There wasn't anything fishy looking so it may be related to the VA datacentre or the POP that it reached. I'll see what we can find out!
Thank you for bringing this to our attention. I’ve run the following commands from several vantage points across the internet on different network providers.
net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
This is the more specific version of the "Timeout" error you got. What this indicates to me is that the connection was successful and Boulder sent the request headers, but the response timed out. However, curl -6 images-test.jamesaddison.ca from a random VPS works for me. Maybe there's a firewall on certain paths that is dropping the request headers? Sounds similar to Boulder issue #2897, "IPv6First with a hanging ipv6 connection never tries the ipv4 address" (letsencrypt/boulder on GitHub).
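For anyone who wants to repeat the curl -6 check from a script and see which stage stalls, here is a rough sketch in Python. It is not how Boulder's Go HTTP client works; `probe_http` and its stage names ("resolve", "connect", "response", "ok") are made up for this illustration. It pins the address family, then reports whether name resolution, the TCP connect, or the wait for the response failed.

```python
import socket


def probe_http(host, port=80, path="/", family=socket.AF_INET6, timeout=5.0):
    """Return (stage, detail) for a plain HTTP GET over one address family.

    stage is "resolve", "connect" or "response" if that step failed,
    or "ok" with the HTTP status line as detail.
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return "resolve", str(exc)
    fam, socktype, proto, _canon, sockaddr = infos[0]
    sock = socket.socket(fam, socktype, proto)
    sock.settimeout(timeout)  # applies to connect, send and recv alike
    try:
        try:
            sock.connect(sockaddr)
        except OSError as exc:
            return "connect", str(exc)
        request = "GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % (path, host)
        try:
            sock.sendall(request.encode("ascii"))
            first = sock.recv(4096)
        except OSError as exc:
            # Connected, but sending the request or reading the reply failed.
            return "response", str(exc)
        status_line = first.split(b"\r\n", 1)[0].decode("latin-1")
        return "ok", status_line
    finally:
        sock.close()
```

A "connect" result points at routing or filtering before the TCP handshake completes, while a "response" result means the handshake succeeded and the request or reply was lost afterwards. Running it once with `family=socket.AF_INET6` and once with `family=socket.AF_INET` gives a side-by-side comparison from the same vantage point.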
We were able to locate an issue in a single city where connections over IPv6 were failing.
Very much appreciate the traceroutes and other debug information. I will mention as well that this definitely indicates that you have an issue with fallback from IPv6 to IPv4 if you can’t establish a connection over v6.
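To make the fallback idea concrete, here is a minimal sketch in Python of an IPv6-first connect that retries over IPv4 when the IPv6 attempt fails or times out. This is only an illustration of the general technique, not Boulder's actual code (Boulder is written in Go); `connect_with_fallback`, `_tcp_connect`, and the injectable `resolver` and `connector` parameters are assumptions made for this example. It tries addresses one after another, which is simpler than the parallel approach described in Happy Eyeballs (RFC 8305).

```python
import socket


def _tcp_connect(family, socktype, proto, sockaddr, timeout):
    """Open a TCP connection with a timeout, closing the socket on failure."""
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
    except OSError:
        sock.close()
        raise
    return sock


def connect_with_fallback(host, port, timeout=5.0,
                          resolver=socket.getaddrinfo, connector=_tcp_connect):
    """Try IPv6 addresses first, then fall back to IPv4.

    Returns (family, connection). Raises the last error if every address
    fails. `resolver` and `connector` are injectable only for testing.
    """
    results = resolver(host, port, type=socket.SOCK_STREAM)
    # Stable sort: IPv6 candidates first, everything else afterwards.
    ordered = sorted(results, key=lambda r: 0 if r[0] == socket.AF_INET6 else 1)
    last_error = None
    for family, socktype, proto, _canon, sockaddr in ordered:
        try:
            return family, connector(family, socktype, proto, sockaddr, timeout)
        except OSError as exc:  # socket.timeout is a subclass of OSError
            last_error = exc
    raise last_error or OSError("no addresses found for %s" % host)
```

This only covers failures while establishing the connection. A case like the one in this thread, where the connection opens but the response never arrives, would also need the HTTP request itself retried over IPv4.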