More debugging information when verification fails

Cohote · June 17, 2021, 2:43pm

As per my post in Help - Yet another “Timeout” while verifying via HTTP -
Would like to see a little more detail in errors when reported by clients (such as le64 and letsdebug) during HTTP-01 verification. In my case, I kept getting "Timeout, check firewall", when, as per the logs, there was something else going on. (Primary servers not even trying to hit the site on the first few attempts - but after waiting an hour or so after the first (failed) attempt, it then succeeds, with no changes required on my part.)

Perhaps a timeout occurred in the LE server itself, and NOT trying to contact my site - but in which case then, the 'firewall' message is a complete red herring. (And took me days of verifying everything before posting here.)

Something even as innocuous as 'Primary Server timeout, Secondary fine' or even '3/4 servers successful' would have helped - both me in trying to debug, and then when I posted here for help.

Osiris · June 17, 2021, 5:01pm

You can be assured that the primary servers were trying to hit your servers: you just didn't see anything in the logs, because the TCP connection wasn't established from the first packet.

Also, if a secondary server fails, but the primary doesn't, the error message will add "During secondary validation, …". Just when the primary fails (with or without failing secondaries), it doesn't add "primary".
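
To make the rule above concrete, here is a minimal sketch of how a client could read that hint out of an error detail string. The function name and the returned wording are hypothetical, and the only thing taken from this thread is that the phrase "During secondary validation" appears when a secondary fails while the primary succeeds; the exact format of real error details is not assumed.

```python
# Marker phrase described in the post above; everything else here is illustrative.
SECONDARY_MARKER = "During secondary validation"


def which_perspective_failed(detail: str) -> str:
    """Interpret an error detail string using the rule described in this thread.

    If the marker is present, the primary succeeded and a secondary failed.
    If it is absent, the primary failed, and the message says nothing about
    how the secondaries fared.
    """
    if SECONDARY_MARKER in detail:
        return "primary succeeded; at least one secondary failed"
    return "primary failed; secondary results not reported"


if __name__ == "__main__":
    # The second string is an assumed example format, not a captured log line.
    print(which_perspective_failed("Timeout during connect (likely firewall problem)"))
    print(which_perspective_failed(
        "During secondary validation: Timeout during connect (likely firewall problem)"
    ))
```

Note the asymmetry: the absence of the marker cannot tell you whether any secondary succeeded, which is the gap discussed in the rest of this thread.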

petercooperjr · June 17, 2021, 5:14pm

Yeah, but I think from the end-user perspective, saying "some servers couldn't connect but some could" is pretty useful, and if the primary fails but the secondaries succeed it doesn't tell the user anything like that. Whether a server is "primary" or "secondary" is just how Let's Encrypt organizes them, but really from an end-user perspective the difference should be immaterial. Better messaging to make clear whether "no server could connect" or "some servers could connect and some couldn't" would probably be helpful, rather than relying on the "secret code" of whether the word "secondary" is in the message, which only helps in one direction.

I've seen several cases lately where it looks like primary failed but at least some secondary succeeded (or at least they see some connections working in their logs without the word "secondary" being in the error message; there's the OP's here, one from last weekend, and this one from yesterday). I don't know if the connection at Let's Encrypt's main datacenter has gotten less reliable or if it's just a coincidence that I happened to notice these, but if they're going to be trying all the connections at once anyway, it seems that reporting when a partial success happens might be useful for those debugging even when one of the failures is the primary server.
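
As a rough illustration of the kind of message being asked for, here is a minimal sketch that turns per-server results into a single summary line, in the spirit of the "3/4" wording suggested in the first post. The server labels and the `summarize` function are hypothetical; nothing here reflects how Boulder actually tracks or reports its validation perspectives.

```python
def summarize(results: dict[str, bool]) -> str:
    """Summarize validation outcomes per server (True means it connected)."""
    total = len(results)
    failed = [name for name, ok in results.items() if not ok]
    succeeded = total - len(failed)

    if not failed:
        return f"{total}/{total} servers successful"
    if succeeded == 0:
        return f"0/{total} servers successful: no server could connect"
    # Partial success: name the failing servers, whether primary or secondary.
    return f"{succeeded}/{total} servers successful (failed: {', '.join(failed)})"


if __name__ == "__main__":
    # Hypothetical outcome: the primary times out, the secondaries connect.
    print(summarize({
        "primary": False,
        "secondary-1": True,
        "secondary-2": True,
        "secondary-3": True,
    }))
```

The point of the sketch is that a partial failure reads the same way regardless of which server failed, so the user does not need to know the primary/secondary distinction to interpret it.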

Cohote · June 17, 2021, 5:16pm

I am still going to have to disagree here - or if they were, the block was outside of my data-center - in either case, the 'firewall' issue is a bad, and incorrect error to give.

It also does not explain the fact that after trying the first time and failing, trying again works.

Yup, exactly - thanks for phrasing it better'n I did.

petercooperjr · June 17, 2021, 5:41pm

Well, it's trying to use "firewall" as a generic term for "packets we sent your way didn't get a response". Usually this is due to the firewall somewhere blocking traffic, and it's actually surprisingly typical for people to be getting the message due to blocks outside of their "datacenter" when they're trying to host a server on a residential connection and their ISP blocks port 80 for them, or sometimes even a "commercial" ISP has some adaptive regional-based firewall in place ahead of the traffic getting to the server's network. If you can think of a better user-friendly message for "we can't connect to your server" they might be happy to change it, though. It's really hard to describe all the things that might be wrong with traffic getting from Let's Encrypt's datacenters to yours.

Well, something changed, but whether it was on a network closer to Let's Encrypt's end of the Internet or a network closer to your end of the Internet, we might never know.

Osiris · June 17, 2021, 5:59pm

That's allowed of course

I have no reason to believe that the validation server did not try to establish a connection. You don't raise a timeout error if there isn't anything to time out. I.e., for something to time out, it had to try at least something.

Please read the error message more carefully:

Timeout during connect (likely firewall problem)

The mentioning of the firewall is a suggestion. Likely a firewall problem, which it is most of the time. Of course other issues could lead to a timeout, but I believe it isn't useful to sum up an entire list of possible reasons why a connection times out. That would make the error message rather lengthy. And as most of the issues are due to firewalls (as you can check on this Community), I don't think the suggestion is unwarranted.
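
For anyone wanting to see what "timeout during connect" looks like from the connecting side, here is a minimal sketch using Python's standard `socket` module. The `probe` function and its return labels are made up for illustration, and this is not how the validation servers are implemented. It only shows that a silently dropped connection attempt surfaces as a timeout, while an actively rejected one surfaces as a refusal.

```python
import socket


def probe(host: str, port: int = 80, timeout: float = 10.0) -> str:
    """Attempt a TCP connection and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No reply at all within the time limit: packets were dropped somewhere.
        return "timeout"
    except ConnectionRefusedError:
        # The remote end (or something in front of it) actively rejected us.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or unreachable network.
        return f"error: {exc}"


if __name__ == "__main__":
    # Checks a port on the local machine; swap in your own host to test from outside.
    print(probe("127.0.0.1", 80, timeout=3.0))
```

In the "timeout" case the caller learns nothing about where the packets were lost, which is why the server never logs anything: the request never reached it.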

griffin · June 17, 2021, 11:43pm

Welcome to the Let's Encrypt Community, Mike

FWIW, this particular area has been a huge point of contention for a while now, so you're certainly not alone in your frustrations here. Some of the most vehement and incendiary critics ever seen in this community (of which you're nowhere close, so thank you for both your candor and civility) have focused on this very topic. I'm sure that @Osiris and @petercooperjr will both agree that there's an awkward balancing act here as the CA's (Let's Encrypt's) feedback is filtered through a third party's (certbot's) ACME client software, which has carte blanche to report troubles however it sees fit. As the author of an ACME client myself (CertSage), I can say that it's a bit of a "telephone" game trying to relay what's happening as reported/inferred by the CA to the user.

@Osiris, @petercooperjr

Maybe it's time we approach Let's Encrypt about a simple, "full transparency" outline of the processes that reflects the errors thrown by Boulder.

I feel like this is not "step-oriented"/granular enough for debugging purposes:

but the full specification is overly complex to digest for debugging purposes:

https://tools.ietf.org/html/rfc8555

_az · June 17, 2021, 11:48pm

It's worth saying that network problems are opaque by nature. It's not like there's a treasure trove of information sitting in their systems which they're choosing not to share with you.

Even Boulder does not know what the cause of the problem is. The operating system doesn't tell it anything, the routers don't tell it anything useful, there's no ACK packets coming back from the remote peer. It's just a black hole.

Boulder already tries to interpret the different network error states and translate them to helpful errors. The timeout (firewall) message is very helpful in my opinion. As shown on our forums, it's actually accurate, most of the time.

petercooperjr · June 17, 2021, 11:49pm

Yeah, I think there's been an understanding all along that the documentation needs further improving. I'm looking forward to the effort to rework the FAQ, and I think your suggestion is along the same lines of trying to help people find answers to the problems that most regularly happen. I suspect the limiting factor in the critical path, though, is just people willing (and taking the time) to write/review/edit/etc. such better documentation.

griffin · June 17, 2021, 11:50pm

I fully agree, @_az. I'm just wondering if a step-by-step (white box) guide with:

what's currently happening
what's expected
how things could go wrong

might lead to much clearer debugging.

I foresee this as basically a mind-map of what we helpers mentally go through every time we diagnose a case.

Osiris · June 18, 2021, 5:49am

I'm wondering... Currently, the error messages often contain a "hint" to the user, which might not always be the correct thing. Understandable, you can't write prose in a short error message.

Wouldn't it be better to name or number the error messages and refer to a specific Boulder website which explains the error message in more detail?
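
A minimal sketch of what that could look like, purely as an illustration of the idea: a small catalog that attaches a short code and a documentation link to a known error message. The code `E001`, the catalog contents, and the `https://example.invalid/errors/` base URL are all invented here; no such numbering or site exists as far as this thread says.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEntry:
    code: str      # hypothetical short identifier
    summary: str   # one-line explanation for the docs page


# Hypothetical catalog keyed by a phrase that appears in the error detail.
CATALOG = {
    "Timeout during connect": ErrorEntry(
        code="E001",
        summary="No response to the connection attempt within the time limit.",
    ),
}


def annotate(detail: str, base_url: str = "https://example.invalid/errors/") -> str:
    """Prefix a known error with its code and append a link to a longer explanation."""
    for phrase, entry in CATALOG.items():
        if phrase in detail:
            return f"[{entry.code}] {detail} (details: {base_url}{entry.code})"
    # Unknown errors pass through unchanged.
    return detail


if __name__ == "__main__":
    print(annotate("Timeout during connect (likely firewall problem)"))
```

The short message stays short, and the longer list of possible causes lives on the linked page instead of in the error text.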

_az · June 18, 2021, 5:50am

Hey kid, you wanna see a dead body?

By the way I still totally want to do this if anybody else wants to join me.

Osiris · June 18, 2021, 5:52am

What's the cause of death?

_az · June 18, 2021, 5:59am

In all seriousness I wasn't sure how to interpret the silence, but if @jsha doesn't hate it ^{maybe it could be a thing}.

Osiris · June 18, 2021, 6:40am

I've opened the thread again, open for discussion!

Nummer378 · June 18, 2021, 10:41am

This so reminds me of the wonderful world of Microsoft Error Codes

"The numbers mason, what do they mean!"

petercooperjr · June 18, 2021, 3:22pm

Oh, lists of error codes go back to way before Microsoft. In the olden days, computers didn't have the bytes to give you back much more than a number that you could look up in the (physically-printed) manual.

(Insert obligatory XKCD link here)

Cohote · June 18, 2021, 3:56pm

Do NOT get me f*cking started on IIS's error codes. Holy crap, talk about non-standard at times.

Sorry for my silence, my work's been a pain this week.

Honestly, better documentation would have helped as well. I didn't even know that the system was trying to ping my server from different servers until I was pointed to a blog post. And then even in there, it gave very little info - I had to click a link in IT to another blog to find some more info.

If I had known ahead of time that, "okay, 4 different servers are trying to hit me", I could then help narrow it down some.

To me, I still find it odd that whatever is happening seems to fix itself over the course of an hour or so. I could create a sub-domain now, something random, and very likely cause it to occur again.

Edit: I will add that I'm a DB/Backend developer. I know enough IIS and network to run a site and protect it, but deep network knowledge I do NOT have - so I know HOW to make sure my firewall isn't blocking the request, and I can take my Data Center's word that they're not blocking IPs - but deep diving, like DNS issues, routing issues is beyond me. My initial thought was that the 'timeout' may have been coming from another, secondary process - like trying to find the IP address for my domain to hit - and not the actual request to HIT my server. Which would explain then why, after a certain amount of time, it would suddenly work.
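
The two stages mentioned above (finding the IP address, then actually connecting) can be checked separately. Here is a minimal sketch with Python's standard `socket` module; the `staged_check` function and its output labels are hypothetical, and it only tests from the machine it runs on, using that machine's resolver, so it cannot reproduce what the validation servers themselves see.

```python
import socket


def staged_check(host: str, port: int = 80, timeout: float = 10.0) -> str:
    """Report whether a failure happens at name lookup or at TCP connect."""
    # Stage 1: resolve the name to one or more addresses.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return f"dns: lookup failed ({exc})"

    # Stage 2: try to connect to each resolved address in turn.
    last_error = "no addresses returned"
    for family, socktype, proto, _canonname, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return f"ok: connected to {addr[0]}"
        except OSError as exc:
            last_error = f"{addr[0]}: {exc!r}"
    return f"connect: failed ({last_error})"


if __name__ == "__main__":
    # Uses an IP literal so no DNS query is needed; replace with your own domain.
    print(staged_check("127.0.0.1", 80, timeout=3.0))
```

A result starting with `dns:` points at name resolution, while `connect:` means the name resolved and the problem is on the path to the server.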

Osiris · June 18, 2021, 4:53pm

If it indeed is a BGP kind of problem, I guess this could indeed happen any time, if the issue isn't thoroughly fixed but keeps popping in and out of existence. I'm not that familiar with BGP and the internet protocols which regulate the greater internet structure, but I think there are some dynamic things possible. For example, traceroute sends three packets per hop it tries. But sometimes you get responses from different routers back in the same hop. So even if those three packets are sent at the same time, they traverse different routes somehow.

Cohote · June 18, 2021, 7:38pm

Ah, thanks for reminding me - I DID open a support ticket with my Data Center, informing them of the BGP error/issue. I haven't heard anything back yet (likely won't until next week.)
