Getting intermittent 404 from Let's Encrypt when finalizing an order

Hello,

We are using Let's Encrypt with Kubernetes cert-manager to issue certificates for the Postgres database domains that we offer to clients. The issue I am encountering is that, from time to time, the Let's Encrypt server returns a 404 when cert-manager attempts to finalize an order:

failed to finalize Order resource due to bad request, marking Order as failed
err="404 urn:ietf:params:acme:error:malformed: Certificate not found"
resource_kind="Order"

The issue happens when requesting a fair number of certificates (about 100 per hour). Still, the issue is not related to rate limits. Because the status is a 4xx, cert-manager does not retry immediately and takes up to 1h to try again (this behavior is expected). I would like to know if there are known causes of, or explanations for, the above issue. Thanks.

There are some longstanding known issues in Let's Encrypt's servers where their replicas don't synchronize as quickly as one would hope. If your request happens to be directed to a different replica than the one that processed your previous request, you can see a 404 like that. I've seen some engineers there just call it "the 404 bug".

I don't think there's much you can do beyond retrying later, as your client should continue to do automatically. As Let's Encrypt continues to scale (millions of certificates issued per day!), it may be happening a little more often now than it used to.
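
To illustrate what "retrying later" could look like on the client side, here is a minimal Python sketch. It is not cert-manager's code or any ACME library's API: `AcmeError` and `finalize_with_retry` are hypothetical names, and treating a 404 on finalize as transient is an assumption based on the behavior described above.

```python
import time


class AcmeError(Exception):
    """Hypothetical error type carrying the HTTP status of an ACME response."""

    def __init__(self, status, detail):
        super().__init__(f"{status} {detail}")
        self.status = status


def finalize_with_retry(finalize, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call finalize(), retrying on a 404 with exponential backoff.

    `finalize` is any zero-argument callable (hypothetical) that performs
    the finalize request and raises AcmeError on failure.
    """
    for attempt in range(attempts):
        try:
            return finalize()
        except AcmeError as err:
            # Assumption: only a 404 is treated as possibly transient.
            # Any other error, or a 404 on the last attempt, is re-raised.
            if err.status != 404 or attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

With the defaults, the waits between attempts would be 2, 4, 8, and 16 seconds, so a short-lived inconsistency could be ridden out without waiting for the next hourly retry. Whether delays that short are enough in practice is something I can't confirm.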

Thanks! At least I now have an idea about the source of the issue. As we are using cert-manager, we don't have direct control over the client's behavior, but I will try to see if we can contribute there :)