Http-01 timeout issues

pbreed · August 23, 2023, 12:53am

So this is a client on an embedded device...

It has multiple alt names...

When doing an http-01 validation, it requests the first URL three times... and marks it as validated.
It requests the 2nd URL 2 times, then gives me an invalid response with a timeout...

I never saw the third request?
This is an IPv6-only box...
Mostly because I'm out of "real" routable IPv4 addresses to test development on.

(My code is 100% working in blocking mode, now trying to convert it to an async
state machine so I don't have to burn a task to run acme that happens once every 60 days)

Thoughts?

mcpherrinm · August 23, 2023, 1:07am

Let's Encrypt supports IPv6 only so that should work. Do both domains resolve to the same v6 address?

rg305 · August 23, 2023, 1:55am

I immediately thought of:
Two out of three ain't bad.
-Meat Loaf

Then my second thought was:
Is there anything that would rate limit connections/second?

pbreed · August 23, 2023, 2:30am

Nothing that should rate limit things.
Run in a non-async manner it works with V6 only.

This may be a task priority issue, i.e. this is running on a bare metal RTOS with strict priority scheduling, and it's possible that the http server is getting blocked for a few hundred msec.

What is the TCP connection and HTTP request timeout limit for Let's Encrypt?

webprofusion · August 23, 2023, 2:40am

Does your client request validation of the challenges in series, or does it ask LE to validate them all first, then steadily serve the responses in order? It sounds like your http server is not able to respond to 3 names at once.

pbreed · August 23, 2023, 3:03am

I go one at a time... not all at once.

Create the order...
Then for each authorization in the order I get the authorization,
then select the http-01 challenge....
Provision the challenge...
Start the challenge by sending the {}

Every time the provisioned challenge URL is accessed I query the challenge status...
(Also every 10 seconds, backing off to 60 seconds)
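
The flow described above (provision, trigger with `{}`, then poll on a timer or whenever the challenge URL is hit) can be sketched as a non-blocking state machine. This is only an illustration, not the poster's code: `AuthzMachine`, `fetch_status`, and the doubling backoff (10, 20, 40, capped at 60 seconds) are assumptions, and the provisioning and trigger steps are placeholders for the real ACME requests.

```python
from enum import Enum, auto


class State(Enum):
    PROVISION = auto()  # make the key authorization available to the HTTP server
    TRIGGER = auto()    # POST {} to the challenge URL
    POLL = auto()       # wait for the challenge to leave pending/processing
    DONE = auto()
    FAILED = auto()


class AuthzMachine:
    """One authorization, advanced by tick(now); it never sleeps or blocks."""

    def __init__(self, name, fetch_status, first_delay=10.0, max_delay=60.0):
        self.name = name
        self.fetch_status = fetch_status  # hypothetical callable: name -> status string
        self.delay = first_delay
        self.max_delay = max_delay
        self.next_poll = 0.0
        self.state = State.PROVISION

    def on_challenge_hit(self, now):
        # The validator just fetched the token, so poll on the next tick.
        if self.state is State.POLL:
            self.next_poll = now

    def tick(self, now):
        if self.state is State.PROVISION:
            self.state = State.TRIGGER      # placeholder for provisioning work
        elif self.state is State.TRIGGER:
            self.next_poll = now + self.delay  # placeholder for the POST {}
            self.state = State.POLL
        elif self.state is State.POLL and now >= self.next_poll:
            status = self.fetch_status(self.name)
            if status == "valid":
                self.state = State.DONE
            elif status == "invalid":
                self.state = State.FAILED
            else:
                # Still pending/processing: back off, capped at max_delay.
                self.delay = min(self.delay * 2, self.max_delay)
                self.next_poll = now + self.delay
        return self.state
```

Because `tick` returns immediately, one low-priority task (or a timer callback) can drive every authorization in turn without holding up the HTTP server task.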

On the third http query it usually turns valid.
Alas, on the second authorization I only see two queries, not three,
and after the timeout I get a challenge status of invalid with

"error": {
    "type": "urn:ietf:params:acme:error:connection",
    "detail": "2606::...... Fetching http://www.xxx.com/.well-known/acme-challenge/cPmiEPIvuBJk5Mt1HxrChuUk03AEabeiar8sPyk1rRQ: Timeout during connect (likely firewall problem)",
    "status": 400
}
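
A client can tell this kind of failure apart from a wrong-token failure by looking at the problem document's `type`. The sketch below assumes the challenge object has already been decoded from JSON; `classify_challenge` is a hypothetical helper, and the sample `detail` string is shortened from the one above.

```python
CONNECTION_ERROR = "urn:ietf:params:acme:error:connection"


def classify_challenge(challenge):
    """Return (status, hint) for a decoded ACME challenge object."""
    status = challenge.get("status")
    if status != "invalid":
        return status, None
    error = challenge.get("error") or {}
    detail = error.get("detail", "")
    if error.get("type") == CONNECTION_ERROR:
        # The validator never reached the HTTP server at all.
        return status, "connect failure: " + detail
    return status, detail


sample = {
    "status": "invalid",
    "error": {
        "type": "urn:ietf:params:acme:error:connection",
        "detail": "Timeout during connect (likely firewall problem)",
        "status": 400,
    },
}
print(classify_challenge(sample))
```

A connection-type error means the problem sits at or below TCP, so logging on the HTTP handler alone will not show it.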

Get the

MikeMcQ · August 23, 2023, 3:13am

Is your handler single threaded? Could you sometimes be on a 60s wait to check challenge status and block while the 3rd http challenge is inbound? I am not sure of the timeout for Let's Encrypt servers, but 60s sounds too short to cause a timeout.

Can you add some logging to your handler to ensure it is freely waiting for any inbound request?

The timeout from LE is pretty specific. It's more likely on your side. And, given you get a couple of successes, I can't help but wonder if your progressive wait is a factor.

pbreed · August 23, 2023, 3:47am

My 60 second wait is a backoff.... I don't get there; it's all over in less than 5 seconds.

Also the http server and the waiting task are separate tasks.
So the http server could possibly be blocked for ~200msec. This is a hard real time system... I know my latency...

The TCP task will accept the incoming TCP connection negotiate and put it aside for the http task to handle.

The only possible delay (outside of external routing weirdness) is a delay of 200msec between tcp negotiation and the http task actually servicing the request.
That is why I'm really confused here....people use this system and its TCP subsystem to drive real time things like motor drives and the like over industrial ethernet...

MikeMcQ · August 23, 2023, 3:55am

When you get the challenge failure are you able to get the failed URL from outside your own network?

pbreed · August 23, 2023, 4:06am

The first two of three worked, and nothing changed.....
I have not tried at the exact instant it fails, but 30 sec before or 30 sec after, yes, it's accessible...

webprofusion · August 23, 2023, 4:08am

Alas, I don't think we can debug this without the code.

Does it work with other CAs? For instance, if you switched to Buypass Go.

Have you tried it with Pebble?

pbreed · August 23, 2023, 4:19am

I'll work on it some more in the morning....
Long day... it's probably me, but I don't see how...

rg305 · August 23, 2023, 4:21am

If you can confirm that any of the names do pass HTTP-01 authentication, you can rotate them to the end of the next request...
Rinse and repeat until all have passed.
[mileage may vary - definitely test that out plenty in staging first]
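
A minimal sketch of that rotation, assuming the client keeps its own list of names and a set of the ones that have already validated (`rotate_passed_to_end` is a hypothetical helper, and whether the CA honors the order is not guaranteed):

```python
def rotate_passed_to_end(names, passed):
    """Keep relative order, but move already-validated names to the back."""
    pending = [n for n in names if n not in passed]
    done = [n for n in names if n in passed]
    return pending + done


print(rotate_passed_to_end(["a.example", "b.example", "c.example"], {"a.example"}))
```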

pbreed · August 23, 2023, 4:31am

Only two names, and first always passes....
Changing to an older non-async version of the code also works...
So something in my async processing is wrong... what, I have no clue...

MikeMcQ · August 23, 2023, 4:36am

You keep saying the first two of three work but could it be that the first is lost and the 2nd and 3rd are getting through? That would mean the first one after your initial domain name succeeded right?

rg305 · August 23, 2023, 4:43am

hmm...
Checking the last two IPs with the first three IPs: Can you tell which of the three IPs fails to repeat?
[Assuming the pattern repeats... Is it the first, second, or third IP that fails]

Does it go:
1,2,3 then 1,2
OR
1,2,3 then 2,3
OR
1,2,3 then 1,3
OR
...

pbreed · August 23, 2023, 3:28pm

So I Wiresharked the connection; the third query just is not on the wire at my end.
With two names to challenge/verify I get 3 connections on port 80 for the first...
and 2 connections on port 80 for the 2nd, and the staging server says it timed out...
This is likely a router issue... too much traffic piling up, and the server is set for a very short timeout, so the lost TCP SYN packet does not get retransmitted before the server gives up...

I have very fast (Gigabit +) fiber direct to the premises and the device I'm testing is on a 10/100 connection so when too many incoming connections happen at once the router has to toss something...
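
The SYN-retransmission argument can be made concrete with a little arithmetic. The sketch below assumes the connecting side uses a 1-second initial retransmission timeout that doubles after each loss, which is a common default but not something confirmed for the validation servers; the connect timeouts passed in are purely hypothetical.

```python
def syn_send_times(initial_rto=1.0, retries=6):
    """Seconds after the first SYN at which each SYN (re)transmission goes out,
    assuming the timeout doubles after every loss."""
    times, t, rto = [0.0], 0.0, initial_rto
    for _ in range(retries):
        t += rto
        times.append(t)
        rto *= 2
    return times


def attempts_within(connect_timeout, **kwargs):
    """SYN transmissions that happen before the caller gives up connecting."""
    return [t for t in syn_send_times(**kwargs) if t < connect_timeout]


print(syn_send_times())        # full retransmission schedule
print(attempts_within(0.5))    # hypothetical very short connect timeout
print(attempts_within(10.0))   # hypothetical 10-second connect timeout
```

If the connect timeout is shorter than the first retransmission interval, a single dropped SYN is enough to fail the attempt, and no retry would ever appear in a capture on the receiving side.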

rg305 · August 23, 2023, 3:55pm

Sounds like the router may be unable to handle the number of connections your gigabit line is trying to use.
OR
It isn't recycling/closing unused connections fast enough.

But it's even more strange if there are no retries seen in your capture.
So that points in another direction.
Meaning: The packets might NOT have been dropped - they may have received some negative response. Like: reject

rg305 · August 23, 2023, 4:00pm

One way to troubleshoot that is to insert a "tap" between the ISP/demarc and the router and mirror one of the ports to a PC running Wireshark.

My bet is the router isn't doing what one would expect.
To that end... what is the make/model and code rev of that router?

jvanasco · August 23, 2023, 4:46pm

I would also audit the logs to rule out firewall issues:

Which IPs are validating OK?
What order are they coming in?

This can help determine if the issue with the "third ip" is due to ordering (the third will always fail, no matter what the first 2 ips are) or potentially due to a firewall or network connectivity problem.
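
A rough sketch of that audit, assuming the access log has already been reduced to `(timestamp, source_ip, host)` tuples. The tuple layout, function names, and the `2001:db8::` documentation addresses are all made up for illustration, and the same validator addresses are not guaranteed to show up for every name.

```python
def sources_by_host(entries):
    """Group (timestamp, source_ip, host) tuples by host, in arrival order."""
    grouped = {}
    for ts, ip, host in sorted(entries):
        grouped.setdefault(host, []).append(ip)
    return grouped


def missing_sources(grouped, reference_host):
    """For every other host, list reference-host sources that never arrived."""
    expected = grouped[reference_host]
    return {
        host: [ip for ip in expected if ip not in ips]
        for host, ips in grouped.items()
        if host != reference_host
    }


entries = [
    (1.0, "2001:db8::1", "a.example"),
    (1.2, "2001:db8::2", "a.example"),
    (1.4, "2001:db8::3", "a.example"),
    (9.0, "2001:db8::1", "b.example"),
    (9.3, "2001:db8::3", "b.example"),
]
grouped = sources_by_host(entries)
print(missing_sources(grouped, "a.example"))
```

Running this over several attempts shows whether the missing request always comes from the same source or simply from whichever one arrives last.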

There have been similar issues due to a problem in routing tables between LE's datacenter and the Client.
