Inconsistent LE responses http-01 invalid/valid - wireshark captures


#1

LE (outbound1.letsencrypt.org: 66.133.109.36) responding with “invalid” and “valid” responses for http-01 challenge - same system/same client within one hour.

Details:
Server 2012 R2
ACME POSH client - successfully requesting the same cert on the same server a couple of months ago; the issue seems to be related to a recent change on the LE side.
CN mail.5demo.com (response invalid 2/23/17 15:57 MST)
CN autodiscover.5demo.com (response valid 2/23/17 16:52 MST)
IP: 98.245.115.5
ISP: Comcast Cable
City: Boulder
State: Colorado
Country: United States

  1. CM mail.5demo.com - http-01 challenge successfully created, requested, and provided to the LE outbound1.letsencrypt.org: 66.133.109.36 server:

The LE server response is “invalid”:

2.CN autodiscover.5demo.com - 54 minutes later - same running system, no change whatsoever:

The LE server response is “valid”:

In both captures we can see clearly how the LE (outbound1.letsencrypt.org: 66.133.109.36) server connects, sends a GET request for the challenge file located at the challenge folder (matching the token in the response “invalid/valid” snapshots), and receives the file with the correct content.

The wireshark captures are available here:
http://www.netometer.com/downloads/LE_Wireshark_Captures.zip

Regards,

Dean


SSL cert for Exchange 2013
#2

I notice there’s a 13s delay in the first validation and 1s in the latter from the submission to the request from LE’s server. Is it possible that the first instance is too eager to continue and should just check again? I’m not familiar with the PowerShell client but that’s the only difference I noticed.

If there is a crucial timing issue, it could explain why the first one fails, because the LE server is busy looking up DNS before making a connection, while in the second case it already has most of the DNS tree cached and the request is handled much quicker.

It’s just a shot in the dark.


#3

Re: "because the LE server is busy looking up DNS before making a connection"
In the capture we see the successful TCP three-way handshake in both cases and the conversation where LE checks and confirms the challenge - this happens after the successful DNS name resolution so DNS is out of the picture here.

In this particular case, I wasn’t running the script as I suspected a timing (as you say “too eager”) issue. I was running the commands manually one by one. The weird think is that the response “invalid” comes right away; there isn’t even a pending status. Updating the request to check the status simply confirms the “invalid” status - you don’t get a status “valid” once you get “invalid”.

I don’t see a way to get more details about the rejection - LE retrieves successfully the challenge file (as it is clear from the capture) and then issues status “invalid”. Most importantly, that same CN was confirmed successfully on the same machine, same script a couple of moths ago. And the weird thing is that under the same conditions, the next FQDN - autodiscover - is approved.

The problem is sporadic - sometimes “mail.5demo.com” is approved and “autodiscover.5demo.com” fails - there is no logic behind this. Again, I think the issue is on the LE side.

Dean


#4

The only theory I can come up with based on the wireshark screenshot and the validation records is some kind of race condition where the timeout occurs somewhere between when the (CA) validation server sends the ACK for the challenge response it has received and when the response is handed back to the code that verifies the response (again on the CA side). I’m not sure how plausible that is, and it would presumably be a very small window, so it seems unlikely that this could be triggered multiple times like in this case.

(I assume the ACK would not be sent if the net.Dialer runs into a timeout. If it does for some reason, that would explain this too - but that seems like weird behaviour to me. My understanding of TCP is rather limited, so I might be mistaken.)

The zip file with the wireshark captures appears to be empty, by the way.

Direct link for the challenge/authz (which includes the full validation record):
https://acme-v01.api.letsencrypt.org/acme/challenge/SjDgnMoJoHtZYt80XSAXWIwL4kkiDUZQY778NUVECjQ/696446878

@cpu @jsha do your logs show any additional details for this challenge?


#5

The logs show a timeout waiting for an HTTP response. @NetoMeter, are you getting consistent failures on that one host, or was it just the one time?


#6

Thank you, Jacob.

That’s exactly what happened - I added the “TCP Delta” column (should have done this from the very beginning) and it tells the whole story:

I was mislead initially because I saw the LE GET request and the web server response. It’s true that the server responded correctly but the key here is when:

  1. The RTT (Round Trip Time) is decent, 42 ms - packets #52-54
  2. LE sends instantly the GET request - #55
  3. The web server acknowledges quickly the request, 14ms - #56
  4. LE waits 5 sec (4.95) and sends FIN, ACK (connection close) - #57
  5. The Web servers ACKs the connection close LE request - #58
  6. And 8 sec (7.84) after acknowledging the FIN,ACK, the web server sends the challenge response - #69 (it’s not uncommon to see one of the parties continue sending data after acknowledging a FIN, ACK).
  7. LE ACKs the challenge response - #71 that’s perfectly fine (and the reason for my confusion)

The conversation above is stream #2 in http01-invalid.pcapng from the zip file above. The bottom line is the issue is due to the delayed server response - exactly what Jacob saw in the logs. It’s clear that LE has a timeout of 5 seconds, which is perfectly reasonable.

I really appreciate your quick response guys. You rock!!!

I’ll replace some network equipment and try again.

Regards,

Dean


#7

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.