Improved timeout errors from Boulder

jsha · May 7, 2018, 8:40pm

Hi all! A few weeks ago we landed a change to Boulder to improve the validation errors it presents in timeout cases. Now it will say either “Timeout during connect (likely firewall problem)” or “Timeout after connect (your server may be slow or overloaded)”. The former is by far the most common. In the process, we found an interesting race condition that was causing us to mis-handle timeouts during HTTP-01 validations that timed out on an IPv6 address: those validations wouldn’t proceed to fall back to IPv4 as we intended, and would also report the wrong error. That’s now fixed.
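To illustrate the distinction between the two messages, here is a minimal sketch in Python. It is not Boulder's code (Boulder is written in Go); the names `timeout_message` and `probe` are hypothetical, and only the two message strings come from the post above. The idea is to remember whether the TCP connection was established before the timeout fired.

```python
import socket

# The two message strings are quoted from the post; everything else is illustrative.
CONNECT_TIMEOUT_MSG = "Timeout during connect (likely firewall problem)"
POST_CONNECT_TIMEOUT_MSG = "Timeout after connect (your server may be slow or overloaded)"


def timeout_message(connected: bool) -> str:
    """Pick the error text based on whether the connection was established."""
    return POST_CONNECT_TIMEOUT_MSG if connected else CONNECT_TIMEOUT_MSG


def probe(host: str, port: int = 80, timeout: float = 10.0) -> str:
    """Hypothetical probe: connect, send a bare HTTP request, wait for a reply.

    Non-timeout errors (for example, connection refused) are not handled
    here and propagate to the caller as OSError.
    """
    connected = False
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            connected = True  # the TCP handshake completed
            request = b"GET / HTTP/1.0\r\nHost: " + host.encode("ascii") + b"\r\n\r\n"
            sock.sendall(request)
            sock.recv(1024)  # raises socket.timeout if the server never answers
    except socket.timeout:
        return timeout_message(connected)
    return "ok"
```

The `connected` flag is set only after `socket.create_connection` returns, so a timeout raised by the connect step itself maps to the "during connect" message.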

Hopefully these changes make it easier for people to figure out their exact problems, and also easier to help them. I’ll be curious to know if they usually make sense. For instance, the “likely firewall problem” could also be “you have the wrong IP address, and it’s unrouteable,” but that seems less common. Please let me know if you spot a lot of cases in the forum contradicting this! One example I just saw: Someone’s ISP was blocking port 80. I think this is probably close enough to a firewall problem that the message reasonably covers it.

alexzorin · May 7, 2018, 8:50pm

I read somewhere previously that not falling back from IPv6 to IPv4 was an intentional policy. Is that not the case now?

Does the fallback behaviour differ between connect timeout and response timeout?

jsha · May 7, 2018, 9:19pm

My ideal would be to never fall back. Just as we have relatively high standards for functioning DNS in order to get a certificate, I'd like to be able to say "If your IPv6 address has reachability problems, either remove the AAAA record (which will have benefits for your visitors as well), or fix the reachability problems." However, since IPv6 validation was introduced after our initial launch, there are some people with unreachable AAAA addresses who have been renewing happily without a hitch. We didn't want to break those people when we deployed IPv6, so when we launched we provided fallback on connection timeouts for both TLS and HTTP challenges.

Shortly after launch we got a bunch of reports of failures during HTTP validation due to timeouts. When we checked our logs, we found that these were timeouts after connect. Our interpretation at the time was that the IPv6 address was perfectly routeable, but for some reason the HTTP server listening on that address was not responding. Our decision at the time was not to try and work around that case. However, it turns out that our logs were incorrect due to the race condition I linked: We were getting a connection timeout, but it was being reported as a post-connect timeout. We had to fix that race condition for a number of reasons.

There is an open question of whether we could now adopt a more strict stance on IPv6 fallbacks, since it seems like we may have never actually been doing those fallbacks correctly for the HTTP challenge. What do you think?

Yes, as of now, Boulder should correctly implement the logic we originally intended: Connection timeouts can fall back to IPv4, but HTTP request/response timeouts cannot.
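The intended rule can be sketched as follows. This is an illustrative Python model under stated assumptions, not Boulder's actual implementation: `Outcome`, `validate_with_fallback`, and the `attempt` callback are hypothetical names, and the sketch assumes at most one IPv6 and one IPv4 address per name.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Outcome(Enum):
    """Hypothetical result of a single validation attempt."""
    OK = "ok"
    CONNECT_TIMEOUT = "timeout during connect"
    POST_CONNECT_TIMEOUT = "timeout after connect"


def validate_with_fallback(
    ipv6_addr: Optional[str],
    ipv4_addr: Optional[str],
    attempt: Callable[[str], Outcome],
) -> Tuple[str, Outcome]:
    """Prefer IPv6; retry over IPv4 only after an IPv6 connect timeout."""
    if ipv6_addr is None:
        if ipv4_addr is None:
            raise ValueError("no addresses to validate against")
        return ipv4_addr, attempt(ipv4_addr)

    outcome = attempt(ipv6_addr)
    # Only a timeout during connect triggers the fallback. A timeout after
    # connect means the address was reachable, so the result stands as-is.
    if outcome is Outcome.CONNECT_TIMEOUT and ipv4_addr is not None:
        return ipv4_addr, attempt(ipv4_addr)
    return ipv6_addr, outcome
```

The function returns the address that produced the final result along with the outcome, so a caller could log which validations succeeded only because of the fallback.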

tdelmas · May 8, 2018, 1:06pm

Could you be strict for a first issuance and lax for renewals? That way it breaks nothing, but still improves the health of the global ecosystem.

Do you monitor when a renewal is successful thanks to a fallback? A 3-month period could give the complete list of domains relying on the fallback, and if the list is short enough, maybe they could be contacted, and the fallback removed completely. And if the list is long, it may be better to keep the fallback for renewals.

jsha · May 8, 2018, 5:28pm

This is a decent idea, though it increases complexity. We'd still have to maintain both paths, and plumb through the "is this a revalidation?" bit. We'd like to do the latter anyhow, for better statistics, so maybe it's practical.

We don't currently monitor that, but this is also a good idea.

tdelmas · May 8, 2018, 9:58pm

For that point I hoped you could reuse the part that checks for rate limits. But yes, if the statistics show it's possible to remove the fallback, it's better to do so.

_az · May 9, 2018, 7:56am

Whatever the decision, it would be great if the logic was clearly documented so it could be understood by integrators. Not sure if that’s a problem with commitment to a stance or showing the CA’s security hand too much, but it helps to have an unambiguous picture when assisting end-users.

My personal view is that the fallback should not be re-introduced at all:

Encourage non-broken IPv6 setups, as already mentioned
Avoid suddenly reversing an understanding that has already been “soaking” for quite some time

Sorry for posting from two usernames.

jsha · June 8, 2018, 7:56am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.