Our nightly regressions test our ACME client, one test creates then revokes a certificate using LE Staging. The tests have run without errors for many weeks, our client hasn't changed during that time.
Starting 16 April we are seeing intermittent failures during certificate create, while waiting for finalize, specifically this:
Most of the certificate ops executed by our nightly regressions using LE Stage pass, however there have been 7 of the finalizing-order failures since 2023-04-16 19:40:27 UTC. The client polls for finalize status to change from status "processing" to a completed status - in failure cases it generally has to wait about 90 seconds before the 500 error is returned.
Any ideas why these intermittent failures have shown in the past two days?
The configuration for the "Sapling 2023h2" CT log was incorrect, which resulted in all our SCTs going to other test logs. Two of those other test logs became overloaded, causing slow staging finalizes and intermittent failures.
I've correctly configured Sapling 2023h2, and everything should be better now. This would have started at 2023-07-15T00:00Z when the Sapling 2023h1 log shard ended.
i manually ran the regression test that failed per above, it still fails but much quicker now. once domain validation succeeds and finalize order starts, the client receives
Yeah, it's still (basically) the same problem, just we're failing certificate linting after successfully submitting now. I missed a spot in my config change and posted here a little too early. Will be a few more minutes.