I am triggering a pipeline that requests certificates for two different environments in parallel (at almost the same time). When the first environment's certificate request is submitted, we can see a TXT record being added in DNS, but before the ACME challenge for that record completes, the record is replaced by the one from the other environment's certificate request (in my case both challenges are for the same DNS name). I then see the following error in the first environment's DNS activity logs:
"cert-manager/challenges/acceptChallenge: error waiting for authorization" err="acme: authorization error for dnsrecord 403 urn:ietf:params:acme:error:unauthorized: Incorrect TXT record "3FTytUY2onOhi3SsTBRFV9AnASjhtMuwEi74WI" found at _acme-challenge.dnsrecord" resource_name="certificate-xxxxxx" resource_namespace="product" resource_kind="Challenge" resource_version="v1" dnsName="dnsrecord" type="DNS-01"
Is there any option to keep the TXT record from being replaced?
(I've moved this to the Help category, I think that makes the most sense for what you're trying to do.)
The ACME client on each server should be able to add and remove its own TXT records without impacting the other client's records. It wouldn't shock me if that was a scenario that most ACME clients haven't had much testing around (particularly considering the wide variety of DNS APIs out there).
This is the suggestion I would recommend. I don't know the specifics of cert-manager or your infrastructure, but if you have one centralized place that requests the certificates and stores them (and the private key) somewhere securely, and then each server loads the key and cert from that secure store, then I think that's the easiest approach for scaling out to multiple systems that each need to do their own TLS termination.
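To illustrate the earlier point about each client adding and removing only its own TXT records, here is a minimal Python sketch. The `dns01_txt_value` function follows the DNS-01 definition in RFC 8555 (unpadded base64url of the SHA-256 digest of the key authorization). `TxtRecordSet` is a hypothetical in-memory stand-in for a DNS provider's record set, not cert-manager's code or any real DNS API; the key authorization strings are made up.

```python
import base64
import hashlib


def dns01_txt_value(key_authorization: str) -> str:
    """Return the DNS-01 TXT value: unpadded base64url(SHA-256(key authorization))."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


class TxtRecordSet:
    """Hypothetical in-memory model of the TXT values at _acme-challenge.<name>."""

    def __init__(self):
        self.values = set()

    def present(self, value: str) -> None:
        # Add to the existing values instead of overwriting the whole record.
        self.values.add(value)

    def cleanup(self, value: str) -> None:
        # Remove only the value this client added; leave the others alone.
        self.values.discard(value)


# Two environments validating the same DNS name at the same time.
rrset = TxtRecordSet()
value_a = dns01_txt_value("token-a.thumbprint-a")  # made-up key authorization
value_b = dns01_txt_value("token-b.thumbprint-b")  # made-up key authorization
rrset.present(value_a)
rrset.present(value_b)
both_present = set(rrset.values)  # both values coexist during validation
rrset.cleanup(value_a)            # environment A finishes first
```

With add/remove semantics like this, both challenge values are visible while the two validations overlap, which is the behaviour the "Incorrect TXT record" error suggests is missing when one request replaces the other's record.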
Sorry for the late response. In parallel, I have been looking through the cert-manager GitHub page for any open issues or PRs that fix this. I found the issue linked below, which matches my problem exactly. Please consider the scenario described there.
IMHO the right solution is what @petercooperjr and @rg305 suggested above: to obtain certificates on one centralized network/machine and then distribute them to the other networks/machines as a post-success hook.
The usage you described - two parallel requests - is often an anti-pattern that leads to downtime issues due to the effects of rate-limiting when it is not properly implemented on automated systems.
In case you are not using this as an anti-pattern, which is possible but not likely given the experiences of people posting similar problems here before...
This is a defect in cert-manager with two possible solutions:
cert-manager fixes it.
you switch to another client.
Considering how long this has been happening against the open ticket you shared, I would suggest moving to another ACME client.
If that is not an option, why not force the tasks to run in sequence? You can set the renewals to only work on even days in one DNS region, and odd days in the other DNS region. If you have multiple regions, just divide a day into multiple buckets and only allow renewals to be triggered at that point.
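The bucketing idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `active_bucket` and `may_renew` are hypothetical helper names, each region is assumed to have a fixed index, and the day is split into equal slots by time of day. Your pipeline would call something like `may_renew` before triggering a certificate request.

```python
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60


def active_bucket(now: datetime, num_buckets: int) -> int:
    """Return which of `num_buckets` equal time-of-day slots `now` falls into."""
    seconds_into_day = now.hour * 3600 + now.minute * 60 + now.second
    return seconds_into_day * num_buckets // SECONDS_PER_DAY


def may_renew(region_index: int, num_regions: int, now: datetime) -> bool:
    """Allow a region to start a renewal only during its own slot."""
    return active_bucket(now, num_regions) == region_index


# Three regions: slots are 00:00-08:00, 08:00-16:00 and 16:00-24:00.
morning = datetime(2024, 1, 1, 9, 30, 0)
allowed = [may_renew(i, 3, morning) for i in range(3)]
```

A request started right at the end of a slot could still overlap with the next region's slot, so in practice you would probably want to leave a margin before each boundary.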
If you are encountering a scenario where that is not possible because of how you start/stop services, you are probably leveraging one or more anti-patterns in your deployment and should be using the centralized method.
Like others have suggested, a strategy that makes particular sense for containers that are constantly redeployed is to renew certificates periodically and store them in a secrets vault (such as Azure KeyVault), then deploy the certificate regularly (and on container startup) by pulling from the secrets vault. This way the renewal of the actual cert is managed by one process and its reporting can be centralized, while actual deployment to the container is kept as an independent process (just use the latest secret and fetch it regularly).
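As a rough sketch of that split between renewal and deployment, assuming nothing about your actual vault: `SecretStore` below is a hypothetical in-memory stand-in for a secrets vault (it is not the Azure SDK), and `issue` is a hypothetical callable standing in for whatever performs the ACME issuance.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CertBundle:
    cert_pem: str
    key_pem: str


class SecretStore:
    """Hypothetical stand-in for a secrets vault that keeps every version."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[CertBundle]] = {}

    def put(self, name: str, bundle: CertBundle) -> int:
        # Store a new version and return its 1-based version number.
        self._versions.setdefault(name, []).append(bundle)
        return len(self._versions[name])

    def latest(self, name: str) -> CertBundle:
        return self._versions[name][-1]


def renew_and_store(store: SecretStore, name: str,
                    issue: Callable[[], CertBundle]) -> int:
    """Run by the single renewal process; only this process talks to the CA."""
    return store.put(name, issue())


def load_for_container(store: SecretStore, name: str) -> CertBundle:
    """Run on container startup and on a schedule; never requests a certificate."""
    return store.latest(name)


store = SecretStore()
v1 = renew_and_store(store, "tls", lambda: CertBundle("cert-1", "key-1"))
v2 = renew_and_store(store, "tls", lambda: CertBundle("cert-2", "key-2"))
current = load_for_container(store, "tls")
```

Because only one process ever creates challenges, the two environments never compete over the same `_acme-challenge` record; they just read the latest stored version.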
Out of interest, are you using Azure Container Apps for your deployment, or something else?