We noticed that Let's Encrypt keeps retrying old ACME challenges for customer domains that no longer exist in our database. We currently respond with a 400 error, but we get a similar request about every 10 seconds.
What's the recommended way to stop the retry mechanism for cert-manager? Is there another HTTP status code we should try? The cert-manager version is v1.3.1.
Let's Encrypt only makes challenge requests when an ACME client requests certs. Are you sure it's the LE servers making the request? Can you show some example log records, with at least the requesting IP, URI, and User Agent? Does the DNS for those old names still point to your servers?
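If it helps to pull those fields out of the access logs, here is a minimal sketch. It assumes the server writes the common "combined" log format (an assumption, adjust the pattern to whatever your server actually logs); the function name and the sample data used to try it are hypothetical.

```python
import re
from collections import Counter

# Assumed "combined" access-log layout; adjust to your server's real format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# HTTP-01 challenges are always requested under this path.
CHALLENGE_PREFIX = "/.well-known/acme-challenge/"


def summarize_challenge_requests(lines):
    """Count (requesting IP, user agent) pairs for ACME challenge requests."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("uri").startswith(CHALLENGE_PREFIX):
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts
```

Calling summarize_challenge_requests(open("access.log")) and printing counts.most_common() would show which IP and user agent are behind most of the challenge traffic.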
The requesting IP is 34.82.13.157 and the user-agent is cert-manager/v1.3.1 (clean). One of the hosts is hello-auth.z2h.lcl.dev but all "*.lcl.dev" resolve to our servers.
Let's Encrypt wouldn't retry old ACME challenges - an authorization challenge can't be retried once it fails. Are you sure these aren't new challenges for old customers?
If so, the most likely causes of this situation are that your (former) clients haven't updated their DNS to point to their new hosts, or that the registration lapsed but the nameservers are still active. Let's Encrypt would not be able to hit the endpoints in your system unless one of those happened.
Those requests could be from these customers trying to set up their systems elsewhere, however... it's also possible that you did not unenroll the domains in your cert-manager system correctly OR you found a bug in cert-manager. I wouldn't be surprised if you did.
If these are actually old challenges, are you sure they're not coming from within your network(s)? Perhaps there is something in your system that checks for external visibility of challenges, and that system is responsible for all this.
That IP is allocated to Google, and marked as part of Google Cloud. I haven't seen ISRG/Let's Encrypt run anything on Google's network yet, has anyone else?
I don't know anything about cert-manager, but I do know there are multiple ACME clients out there in which part of the sequence of getting a certificate is checking if the challenge token is accessible by the client first before triggering the validation at the ACME server.
OP should double-check whether there are any "rogue" ACME clients out there trying to perform challenges unnecessarily.
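To illustrate the kind of client-side pre-check described above, here is a rough sketch. It is not cert-manager's actual code; the function names are hypothetical. It only relies on the HTTP-01 convention that the key authorization is served at /.well-known/acme-challenge/<token>.

```python
import urllib.request


def challenge_url(domain, token):
    """Build the HTTP-01 URL under which the token must be reachable."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def fetch_body(url, timeout=5.0):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace")


def self_check(domain, token, expected_key_authorization, fetch=fetch_body):
    """Return True only if the challenge file is reachable and correct.

    A client doing this would keep polling until the check passes, and
    only then ask the ACME server to validate.
    """
    try:
        body = fetch(challenge_url(domain, token))
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses
        # such as 400 (urllib raises HTTPError, a subclass of OSError).
        return False
    return body.strip() == expected_key_authorization
```

A client that loops on a check like this would produce exactly the pattern of repeated requests for the same URL carrying the client's own user agent, and an error status alone would not make it stop.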
My first guess (if the complete requested URL doesn't change) would be that something is scanning previously posted links.
And if the URL does change, there is something misconfigured that (incorrectly) thinks it should be able to obtain a cert for that name (and tries... and tries...).
If that is the case, it is really not much to worry about, as it will never be able to do so.
But I do understand your need to find and fix that problem - it just might not be within anything you control, or it might.
Since the EHLO/HELO is from your domain, it is likely something you (or someone in your company) can control.
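To tell the two cases above apart (same URL every time versus a URL that keeps changing), something like the following could be run over the logged requests. This is only a sketch: the function names are made up, and it assumes you can already extract (host, path) pairs from your logs.

```python
from collections import defaultdict

CHALLENGE_PREFIX = "/.well-known/acme-challenge/"


def tokens_by_host(requests):
    """Map each host to the set of challenge tokens requested for it.

    `requests` is an iterable of (host, path) pairs taken from the logs.
    """
    seen = defaultdict(set)
    for host, path in requests:
        if path.startswith(CHALLENGE_PREFIX):
            seen[host].add(path[len(CHALLENGE_PREFIX):])
    return dict(seen)


def classify(tokens):
    """One token means the same URL is being retried; more means new challenges."""
    return "same URL repeated" if len(tokens) == 1 else "URL keeps changing"
```

A host whose token never changes points at something re-requesting one old challenge; a host with many tokens points at a client that keeps creating new orders for that name.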