I did as you suggested. It is definitely a TLSv1.2 issue. I was able to tcpdump a working and a non-working letsdebug run and captured them as pcap files so I can compare them in Wireshark. So a few more questions.
I have a pcap of a working domain and a pcap of a non-working domain. Is anyone good at reading TCP at this low level who could review it with Wireshark?
I was under that impression as well. It does retrieve the challenge file via http but apparently part of the staging process is done with https? Unless there is a way to change that?
So with that it is looking like a haproxy issue. haproxy handles all the ssl certificates. It loads around 560+ certificates. Maybe we are hitting some weird limit with something. The thing is all the websites work and answer SSL correctly so it may be something to do with the TLSv1.2 implementation on the load balancer.
I guess I needed to clarify. And pretty sure I can answer my own question as having the "outbound" portion redirected to a different domain kind of defeats the whole authentication part of the process.
If you want (and there's nothing sensitive in there), zip up and email the pcap files to az@letsdebug.net and I can take a look as well. Hard to understand anything from screenshots.
My thinking is an update to haproxy has caused this. We are running 2.3.1 stable, which has been a really nice update BTW. Much faster reloads and a much better admin API. It used to take us 3.5 minutes to reload haproxy with the 560+ SSL certificates on multiple haproxy frontends. We have 3 platforms we support as well as some development servers, so we load up a lot of actual certificate PEMs. Now haproxy reloads in less than 15 seconds. But regardless, if we find this to be a haproxy issue we will revert back to what works until they can find and fix the issue. That is, if it turns out to be in haproxy, which at this point I am pretty sure it is. We have ruled out pretty much all else at this point.
We have a barracuda F400 firewall. Pretty robust. I did contact support and we went through it looking for the possibility of an IPS issue as someone mentioned above and did not see where it was blocking any traffic concerning letsencrypt. So I think you are right. I will revert back to an older version later tonight if no one has any other suggestions by then.
It might also be in the HAProxy config.
But, in my view, it is either the version change or a config change (or both).
You could still try redirecting everything to HTTPS - it seems that HAProxy can do that much better now.
[at the expense of doing HTTP much less better - lol]
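As an added illustration of the redirect approach suggested above (this is example configuration written for this page, not a config posted by anyone in the thread or taken from the HAProxy project), the fragment below keeps the ACME HTTP-01 challenge path on plain HTTP and redirects everything else to HTTPS on the same host. Let's Encrypt's validator follows redirects, so the challenge could also be answered over HTTPS, but serving it on port 80 means validation does not depend on the TLSv1.2 handshake that is failing here. The backend names, certificate directory, and the 127.0.0.1:8402 and 10.0.0.10:8080 addresses are hypothetical placeholders.

```haproxy
global
    log /dev/log local0
    # Refuse anything below TLSv1.2 on all TLS binds (placeholder policy)
    ssl-default-bind-options ssl-min-ver TLSv1.2

defaults
    mode http
    log global
    option httplog
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend fe_http
    bind :80
    # Path prefix used by ACME HTTP-01 challenges
    acl is_acme path_beg /.well-known/acme-challenge/
    # Answer challenges over plain HTTP from the ACME client (hypothetical backend)
    use_backend be_acme if is_acme
    # Everything else: 301 to https:// on the same host and path
    http-request redirect scheme https code 301 unless is_acme

frontend fe_https
    # Hypothetical certificate directory; HAProxy loads every PEM found there
    bind :443 ssl crt /etc/haproxy/certs/
    http-request set-header X-Forwarded-Proto https
    default_backend be_app

backend be_acme
    # Hypothetical: ACME client listening locally for challenge requests
    server acme 127.0.0.1:8402

backend be_app
    # Hypothetical application server
    server app1 10.0.0.10:8080 check
```

After loading this, a quick check is `curl -sI http://example.com/.well-known/acme-challenge/test` (expect the backend's answer, not a 301) and `curl -sI http://example.com/` (expect a 301 with a `Location: https://example.com/` header); if a domain that fails validation still fails here, the problem is upstream of the redirect logic, which fits the HAProxy version theory in the thread.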
I am leaning more to a limit we have reached in haproxy regarding the certificates. Otherwise I think it would happen all the time. Why it works on some domains and others might actually have a pattern. It would be important to figure that out to help the folks at haproxy to fix it if it ends up being haproxy.
Certificates are HTTPS, the problem is in HTTP.
Unless the system is running low on resources (first come first served, quantities limited).
Which you could try changing the loading order (I'm not familiar with HAProxy).