That should be impossible, because the primary center (Flexential) is a required validation. One of the two secondaries may fail and the cert validation can still succeed. But I agree, those 3 IPv4 addresses all look like AWS to me.
That is the pattern we saw earlier (that is, no Flexential on IPv6 using Staging)
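To make the quorum rule concrete, here is a rough sketch (hypothetical illustration only, not Let's Encrypt's actual code; the function and parameter names are made up):

```python
def validation_succeeds(primary_ok, secondary_results):
    """Multi-perspective validation as described above: the primary
    perspective (Flexential) must succeed, and at most one of the
    secondary perspectives may fail."""
    if not primary_ok:
        return False
    failures = sum(1 for ok in secondary_results if not ok)
    return failures <= 1
```

So a timeout seen only at one secondary would not by itself fail the order, but a primary timeout always would.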
Huh. I see why you edited your post
I need some time to assess this, and I plan to raise it offline with other experts to decide how to proceed.
Just background ... Let's Encrypt has several centers around the world that it uses for validation. Each center has multiple servers and multiple IPs, so it's possible that one IP at one center has a routing issue. Still, I think we would be seeing far more failure reports on this forum, and in Let's Encrypt's own monitoring systems, if that were the case. Intermittent comms routing problems are challenging to diagnose.
I really appreciate the efforts you have made. Hopefully we can fully resolve this.
@calmexpress Just FYI, we are discussing this offline.
I need to correct an earlier statement I made. I said that after the HTTP->HTTPS redirect the process repeats, implying the same IPv6->IPv4 timeout-fallback sequence occurs after a redirect, but that was wrong. I was pointed to the docs (aha!), which say the fallback is only attempted, if needed, on the first HTTP request and not on any redirect.
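To make that corrected behavior concrete, here is a rough sketch (hypothetical code, not Boulder's actual implementation; `connect_v6` and `connect_v4` are stand-ins for the real fetch logic):

```python
def follow_challenge(urls, connect_v6, connect_v4):
    """urls[0] is the initial HTTP request; any further entries are
    redirect targets. IPv6 is preferred, and the IPv4 fallback is
    attempted only on the initial request, never after a redirect."""
    for i, url in enumerate(urls):
        if connect_v6(url):
            continue
        if i == 0 and connect_v4(url):  # fallback allowed here only
            continue
        return False
    return True
```

So a host whose IPv6 goes unreachable mid-chain fails the challenge even though the initial request would have fallen back to IPv4.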
This doesn't resolve the problem, but I wanted to update you on the status and correct that. Hopefully better news later.
Hi, we're seeing this same problem from our own network (Brightbox).
IPv6 HTTP challenges for staging certificates regularly fail with "Timeout during connect (likely firewall problem)" despite our seeing a number of successful IPv6 validation requests being served. The exact same certificate request succeeds seconds later using the production API. I would say staging challenges fail like this about 20% of the time, and I've yet to see a production certificate challenge fail.
IPv4 HTTP challenges always succeed on both staging and production.
Our challenge happens to involve a redirect, but I've reproduced it using certbot with and without a redirect.
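For anyone wanting to reproduce the comparison, certbot's `--dry-run` flag targets the staging API, so the same request can be issued against both environments (the domain and webroot path below are placeholders):

```shell
# Staging (the one that intermittently times out over IPv6):
certbot certonly --webroot -w /var/www/html -d example.com --dry-run

# Production (same request, succeeds seconds later):
certbot certonly --webroot -w /var/www/html -d example.com
```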
We have a test suite that runs regularly against the staging API, and it started reporting this error in August.
I can replicate the issue with Let's Debug, which seems to test the challenge from the staging environment.
ipv6: Let's Debug
ipv4: Let's Debug
Confirming the issue exists exactly as JohnL described on DigitalOcean cloud services as well as Linode/Akamai cloud services. Our failure rate is more like 80%, skewing toward more failures when repeated attempts are made, as if there were some sort of firewall rate limiting on the Let's Encrypt side (separate from their published application service rate limits).
Let’s Debug is run independently and not part of Let’s Encrypt at all. If you’re seeing problems on both it and Let’s Encrypt, the problem is likely in your network.
The Let's Debug test creates a test challenge in staging, and that is what fails. Let's Debug also connects to the test subject on its own; for the linked test, Let's Debug's own connection succeeded while the staging challenge failed.
2 posts were split to a new topic: Staging fails with IPv6
Sorry it took us so long to notice and re-review this issue. We just investigated and fixed a problem that was affecting IPv6-only validation, from the staging environment only. Please try again and let us know if you're still having trouble.