HTTP-01 timeout issues

So this is a client on an embedded device...

It has multiple alt names...

foo.com
www.foo.com

When doing an http-01 validation, the server requests the first URL three times... and marks it as validated.
It requests the second URL twice, then gives me an invalid response with a timeout...

I never saw the third request?
This is an IPv6-only box...
Mostly because I'm out of "real" routable IPv4 addresses to test development on.

(My code works 100% in blocking mode; I'm now trying to convert it to an async
state machine so I don't have to burn a task to run ACME, which happens once every 60 days.)

Thoughts?

Let's Encrypt supports IPv6 only so that should work. Do both domains resolve to the same v6 address?


I immediately thought of:
Two out of three ain't bad.
-Meat Loaf

Then my second thought was:
Is there anything that would rate limit connections/second?


Nothing that should rate limit things.
Run in a non-async manner, it works with IPv6 only.

This may be a task priority issue, i.e. this is running on a bare-metal RTOS with strict priority scheduling, and it's possible that the HTTP server is getting blocked for a few hundred msec.

What are the TCP connection and HTTP request timeout limits for Let's Encrypt?

Does your client request validation of the challenges in series, or does it ask LE to validate them all first, then steadily serve the responses in order? It sounds like your HTTP server is not able to respond to 3 names at once.


I go one at a time... not all at once.

Create the order...
Then for each authorization in the order I get the authorization,
then select the http-01 challenge....
Provision the challenge...
Start the challenge by sending the {}

Every time the provisioned challenge URL is accessed I query the challenge status...
(Also on a timer: every 10 seconds, backing off to 60 seconds)
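The timer-driven part of that polling could be sketched as below. This is only an illustration: the post gives the 10 second start and the 60 second ceiling, but not how the delay grows in between, so the doubling `factor` and the function names `poll_delays` and `next_poll_times` are assumptions.

```python
from typing import Iterator, List


def poll_delays(first: float = 10.0, cap: float = 60.0,
                factor: float = 2.0) -> Iterator[float]:
    """Yield the wait before each status poll: `first`, then growing up to `cap`.

    The growth factor is an assumption; the post only states 10 s and 60 s.
    """
    delay = first
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def next_poll_times(start: float, count: int) -> List[float]:
    """Absolute times of the next `count` timer-driven polls after `start`."""
    times = []
    now = start
    delays = poll_delays()
    for _ in range(count):
        now += next(delays)
        times.append(now)
    return times
```

With these defaults the first timer-driven poll fires 10 seconds after the challenge is started, so a validation that finishes in under 5 seconds is only ever polled from the HTTP-access path.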

On the third HTTP query it usually turns valid.
Alas, on the second authorization I only see two queries, not three,
and after the timeout I get a challenge status of invalid with

"error": {
    "type": "urn:ietf:params:acme:error:connection",
    "detail": "2606::...... Fetching http://www.xxx.com/.well-known/acme-challenge/cPmiEPIvuBJk5Mt1HxrChuUk03AEabeiar8sPyk1rRQ: Timeout during connect (likely firewall problem)",
    "status": 400
}
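As a rough sketch of how a client might tell this failure apart from other invalid results, the snippet below checks the `type` field of the error object. The `urn:ietf:params:acme:error:connection` type comes from the response above; the function name `classify_challenge` and the returned strings are made up for illustration.

```python
import json

# Error type reported by the CA when it could not connect to the validation target.
CONNECTION_ERROR = "urn:ietf:params:acme:error:connection"


def classify_challenge(challenge: dict) -> str:
    """Summarise a challenge object; the labels are illustrative only."""
    status = challenge.get("status")
    if status != "invalid":
        return status or "unknown"
    error = challenge.get("error") or {}
    if error.get("type") == CONNECTION_ERROR:
        # The validator never got a TCP connection: look at the network path,
        # not at the token or key authorization that was served.
        return "invalid: validator could not connect"
    return "invalid: " + error.get("type", "unspecified error")


sample = json.loads("""
{
  "status": "invalid",
  "error": {
    "type": "urn:ietf:params:acme:error:connection",
    "detail": "Timeout during connect (likely firewall problem)",
    "status": 400
  }
}
""")
print(classify_challenge(sample))
```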

Get the


Is your handler single-threaded? Could you sometimes be in a 60s wait to check challenge status and block while the 3rd HTTP challenge request is inbound? I am not sure of the timeout for Let's Encrypt servers, but 60s sounds too short to cause a timeout.

Can you add some logging to your handler to ensure it is freely waiting for any inbound request?
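A minimal sketch of that kind of logging, written in Python purely for illustration (the device in question runs its own HTTP server on an RTOS). `RequestLog` and `make_handler` are hypothetical names, and the handler answers 404 as a placeholder rather than serving a real key authorization.

```python
import time
from http.server import BaseHTTPRequestHandler
from typing import List, Optional, Tuple


class RequestLog:
    """Records (timestamp, client IP, path) for every inbound request."""

    def __init__(self) -> None:
        self.entries: List[Tuple[float, str, str]] = []

    def record(self, client_ip: str, path: str,
               when: Optional[float] = None) -> None:
        stamp = time.monotonic() if when is None else when
        self.entries.append((stamp, client_ip, path))

    def gaps(self) -> List[float]:
        """Seconds between consecutive requests, in arrival order."""
        times = [t for t, _, _ in self.entries]
        return [b - a for a, b in zip(times, times[1:])]


def make_handler(log: RequestLog):
    """Build a handler class that logs each GET before answering."""

    class LoggingHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            log.record(self.client_address[0], self.path)
            self.send_response(404)  # placeholder response for the sketch
            self.end_headers()

    return LoggingHandler
```

Passing `make_handler(log)` to `http.server.HTTPServer` would log every request that reaches the handler. A validator connection that never shows up in `log.entries` was lost before the HTTP layer.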

The timeout from LE is pretty specific. It's more likely on your side. And, given you get a couple of successes, I can't help but wonder if your progressive wait is a factor.


My 60 second wait is a backoff... I don't get there; it's all over in less than 5 seconds.

Also the http server and the waiting task are separate tasks.
So the http server could possibly be blocked for ~200msec. This is a hard real time system... I know my latency...

The TCP task will accept the incoming TCP connection, negotiate it, and put it aside for the HTTP task to handle.

The only possible delay (outside of external routing weirdness) is a delay of 200 msec between TCP negotiation and the HTTP task actually servicing the request.
That is why I'm really confused here... people use this system and its TCP subsystem to drive real-time things like motor drives and the like over industrial Ethernet...


When you get the challenge failure are you able to get the failed URL from outside your own network?


The first two of three worked, and nothing changed...
I have not tried at the exact instant it fails, but 30 sec before or 30 sec after, yes, it's accessible...


Alas, I don't think we can debug this without the code.

Does it work with other CAs? For instance, if you switched to Buypass Go.

Have you tried it with Pebble?


I'll work on it some more in the morning....
Long day... it's probably me, but I don't see how...


If you can confirm that any of the names do pass HTTP-01 authentication, you can rotate them to the end of the next request...
Rinse and repeat until all have passed.
[mileage may vary - definitely test that out plenty in staging first]
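The rotation idea could look roughly like this. The helper name `rotate_passed_to_end` is made up, and whether reordering the names changes the outcome at all is exactly what would need testing in staging.

```python
from typing import Iterable, List, Set


def rotate_passed_to_end(names: Iterable[str], passed: Set[str]) -> List[str]:
    """Put names that have not validated yet first, already-passed names last.

    Relative order within each group is preserved.
    """
    names = list(names)
    pending = [n for n in names if n not in passed]
    done = [n for n in names if n in passed]
    return pending + done


# Names from the original post; "foo.com" is the one that always passes.
print(rotate_passed_to_end(["foo.com", "www.foo.com"], {"foo.com"}))
```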


Only two names, and the first always passes...
Changing to an older non-async version of the code also works...
So something in my async processing is wrong... what, I have no clue...

You keep saying the first two of three work, but could it be that the first is lost and the 2nd and 3rd are getting through? That would mean the first one after your initial domain name succeeded, right?


hmm...
Comparing the last two IPs with the first three IPs: can you tell which of the three IPs fails to repeat?
[Assuming the pattern repeats... is it the first, second, or third IP that fails?]

Does it go:
1,2,3 then 1,2
OR
1,2,3 then 2,3
OR
1,2,3 then 1,3
OR
...
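One way to answer that from an access log, sketched under the assumption (as above) that the same validator source IPs show up for both names. The function names are hypothetical and the addresses are documentation-range placeholders, not real validator IPs.

```python
from typing import List


def position_pattern(first_round: List[str], second_round: List[str]) -> List[int]:
    """Map each source IP of the second round to its 1-based position in the first.

    An IP that did not appear in the first round is reported as 0.
    """
    index = {ip: i + 1 for i, ip in enumerate(first_round)}
    return [index.get(ip, 0) for ip in second_round]


def missing_sources(first_round: List[str], second_round: List[str]) -> List[str]:
    """Source IPs seen for the first name but never for the second."""
    seen = set(second_round)
    return [ip for ip in first_round if ip not in seen]


# Placeholder addresses, in order of arrival, for each name.
first = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
second = ["192.0.2.2", "192.0.2.3"]
print(position_pattern(first, second))  # the "1,2,3 then 2,3" case
print(missing_sources(first, second))
```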


So I Wiresharked the connection; the third query just is not on the wire at my end.
With two names to challenge/verify, I get 3 connections on port 80 for the first...
and 2 connections on port 80 for the second, and the staging server says it timed out...
This is likely a router issue... too much traffic piling up, and the server set for a very short timeout, so the lost TCP SYN packet does not get retransmitted before the server gives up...

I have very fast (gigabit+) fiber direct to the premises, and the device I'm testing is on a 10/100 connection, so when too many incoming connections happen at once the router has to toss something...
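The SYN retransmission argument can be made concrete with a small calculation. This sketch assumes a 1 second initial retransmission timeout that doubles on each retry (the initial value recommended by RFC 6298); the validator's actual TCP settings and connect timeout are not stated anywhere in this thread, so every number here is an assumption.

```python
from typing import List


def syn_send_times(initial_rto: float = 1.0, retries: int = 6) -> List[float]:
    """Times (seconds after the first SYN) at which each SYN is sent.

    Assumes the retransmission timeout doubles after every unanswered SYN.
    """
    times = [0.0]
    rto = initial_rto
    now = 0.0
    for _ in range(retries):
        now += rto
        times.append(now)
        rto *= 2
    return times


def syn_attempts_within(connect_timeout: float, initial_rto: float = 1.0,
                        retries: int = 6) -> int:
    """How many SYNs go out before the connecting side gives up."""
    return sum(1 for t in syn_send_times(initial_rto, retries)
               if t < connect_timeout)
```

Under these assumptions a connect timeout shorter than 1 second allows a single SYN, so one dropped packet is enough to produce a timeout with nothing more appearing in the capture. A longer timeout would show retransmitted SYNs on the wire.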

Sounds like the router may be unable to handle the number of connections your gigabit line is trying to use.
OR
It isn't recycling/closing unused connections fast enough.

But it's even more strange if there are no retries seen in your capture.
So that points in another direction.
Meaning: The packets might NOT have been dropped - they may have received some negative response. Like: reject


One way to troubleshoot that is to insert a "tap" between the ISP demarc and the router and mirror one of the ports to a PC running Wireshark.

My bet is the router isn't doing what one would expect.
To that end... what is the make/model and code rev of that router?


I would also audit the logs to rule out firewall issues:

  1. Which IPs are validating OK?
  2. In what order are they coming in?

This can help determine if the issue with the "third IP" is due to ordering (the third will always fail, no matter what the first 2 IPs are) or potentially due to a firewall or network connectivity problem.

There have been similar issues due to a problem in routing tables between LE's datacenter and the Client.
