I was able to find these requests and piece together a timeline from what the validation authority (VA) on our side logged. As you noticed, the staging environment performs domain validation from multiple network perspectives. There are three "remote" VAs and one primary VA. To make the timeline easier to follow, let's call them R_VA1, R_VA2, R_VA3 and P_VA1.
- 26/11/17 11:26:03.000 R_VA1 and R_VA2 start a TLS-SNI-01 validation attempt to 1.163.189.31:443 with SNI value e2e2cf3a09<snipped>.acme.invalid
- 26/11/17 11:26:03.899674 P_VA1 starts a validation attempt to the same address/SNI value.
- 26/11/17 11:26:05.000 R_VA3 starts a validation attempt to the same address/SNI value.
- 26/11/17 11:26:08.000 R_VA1 reports a timeout. 11:26:08 minus 11:26:03 gives us the expected 5s timeout.
- 26/11/17 11:26:08.000 R_VA2 reports an EOF error (maybe the heap was blown on your end at this point?)
- 26/11/17 11:26:08.899836 P_VA1 reports a timeout
- 26/11/17 11:26:09.000 R_VA3 reports an EOF
At face value, this seems to match my expectation that the timeouts are enforced at 5s.
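If you want to double-check the arithmetic yourself, here is a rough Python sketch that computes how long each VA waited before reporting. The timestamps are copied from the timeline above; the `events` table and `elapsed_seconds` helper are just names I made up for illustration, and I'm assuming the dates are DD/MM/YY.

```python
from datetime import datetime

# Assumption: the log dates are DD/MM/YY with fractional seconds.
FMT = "%d/%m/%y %H:%M:%S.%f"

# (start, report, outcome) per VA, copied from the timeline above.
events = {
    "R_VA1": ("26/11/17 11:26:03.000", "26/11/17 11:26:08.000", "timeout"),
    "R_VA2": ("26/11/17 11:26:03.000", "26/11/17 11:26:08.000", "EOF"),
    "P_VA1": ("26/11/17 11:26:03.899674", "26/11/17 11:26:08.899836", "timeout"),
    "R_VA3": ("26/11/17 11:26:05.000", "26/11/17 11:26:09.000", "EOF"),
}


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


for va, (start, end, outcome) in events.items():
    print(f"{va}: {outcome} after {elapsed_seconds(start, end):.6f}s")
```

Both timeouts land at 5s (give or take a fraction of a millisecond for P_VA1), while R_VA3's EOF shows up 4s after its attempt started, so it was not a full 5s timeout on that connection.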
Does the above timeline help at all with explaining the failure conditions on your end? It's possible that the EOFs read on the alternating connection failures are a consequence of the outstanding validation requests being cancelled on our end after the first timeout, but I'd have to take a closer look at the code to know for sure.
I can do a little bit more digging tomorrow if there's still a mystery at hand.
(I'm impressed that you hack around on chips with so little heap space! Sounds like an interesting challenge. I'll keep you in my thoughts the next time some of my sloppier code mallocs a larger-than-needed buffer without a care in the world.)