Parallel certificate request with same dnsName for two different server is failing

iamkishore432 · November 20, 2023, 12:45pm

I am triggering a pipeline that requests certificates for two different environments in parallel (at almost the same time). When the first environment's certificate request is submitted, we can see a TXT record being added in DNS, but before the acme_challenge for that TXT record completes, the record is replaced by the new certificate request from the other environment (in my case both challenges are for the same DNS name). I can see the following error in the first environment's DNS activity logs.

"cert-manager/challenges/acceptChallenge: error waiting for authorization" err="acme: authorization error for dnsrecord 403 urn:ietf:params:acme:error:unauthorized: Incorrect TXT record "3FTytUY2onOhi3SsTBRFV9AnASjhtMuwEi74WI" found at _acme-challenge.dnsrecord" resource_name="certificate-xxxxxx" resource_namespace="product" resource_kind="Challenge" resource_version="v1" dnsName="dnsrecord" type="DNS-01"

Is there any option so that the TXT record won't be replaced?

Osiris · November 20, 2023, 1:00pm

That's not up to Let's Encrypt, but up to the ACME client and/or DNS system.

iamkishore432 · November 20, 2023, 1:30pm

I am quite new to this and need help with the suggestion. I am using an Azure DNS zone for my environment; in my scenario, what would the ACME client be?

rg305 · November 20, 2023, 1:48pm

Your ACME client is cert-manager.

No.
The TXT record is new with every certificate request [even if for the same name].
[it would be impossible to finalize an order twice]

My advice is to:

Treat each system independently of the other.
Work on only one system [until you get it to work].
[use the testing environment for all testing]
Apply the solution to the first system onto the second system.
[testing would be minimal - as you should already have a working solution]

OR
Only work on one system and then copy that certificate onto the second system.

iamkishore432 · November 20, 2023, 1:54pm

Thank you for the response @rg305 , I will check this and test individually for two systems.

Thanks
Krishna

Osiris · November 20, 2023, 3:53pm

OP is talking about the TXT RR being replaced in the DNS zone. It should be perfectly possible to have multiple TXT RR in the DNS zone present.
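
To illustrate why several TXT values can coexist: under RFC 8555 the DNS-01 TXT value is the base64url-encoded SHA-256 digest of the key authorization (token, a dot, and the account key thumbprint), and validation passes when any one TXT value at the name matches. The following is only a minimal Python sketch of that rule; the tokens, the thumbprint, and the function names are made up for illustration and are not taken from cert-manager.

```python
import base64
import hashlib


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """Compute the DNS-01 TXT value as described in RFC 8555, section 8.4."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url encoding with the trailing '=' padding stripped
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def challenge_passes(txt_values: list[str], expected: str) -> bool:
    """Validation succeeds if any TXT value at the name matches."""
    return expected in txt_values


# Hypothetical tokens and thumbprint, for illustration only.
eu_value = dns01_txt_value("token-eu", "thumbprint-123")
us_value = dns01_txt_value("token-us", "thumbprint-123")

# Both values published side by side at the same _acme-challenge name.
record_set = [eu_value, us_value]
```

With both values in `record_set`, `challenge_passes` is true for either challenge; if one value overwrites the other, only the last writer can pass.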

rg305 · November 20, 2023, 5:12pm

OK, I see that now - I hadn't read the entire post.
[TL;DR]

petercooperjr · November 20, 2023, 5:22pm

(I've moved this to the Help category, I think that makes the most sense for what you're trying to do.)

The ACME client on each server should be able to add and remove its own TXT records without impacting the other client's records. It wouldn't shock me if that was a scenario that most ACME clients didn't have a lot of testing around (particularly considering the wide variety of DNS APIs out there).

Requesting the certificate on one system and copying it to the others is the suggestion I would recommend. I don't know the specifics of cert-manager or your infrastructure, but if you have one centralized place that requests the certificates and stores them (and the private key) somewhere securely, and then each server loads the key and cert from that secure store, then I think that's the easiest approach for scaling out to multiple systems that each need to do their own TLS termination.

rg305 · November 20, 2023, 5:24pm

Things that may help clarify the situation/problem:

Are the two systems using the same LE account?
Is there a setting within cert-manager for it NOT to delete TXT records prior to creating a new one?
[only delete after use]
Can you modify your process to centralize the requests from one single system [sequentially]?

iamkishore432 · November 21, 2023, 11:52am

Sorry for the late response. In parallel, I have been looking through the cert-manager GitHub page for any open issues or PRs that fix this. I found the issue linked below, which matches my problem 100%. Please consider the scenario it describes.

github.com/cert-manager/cert-manager

Support parallel DNS validation for same host

opened 08:55AM - 26 Jul 21 UTC

closed 08:56AM - 05 Oct 22 UTC

ramondeklein

priority/important-soon lifecycle/rotten area/acme/dns01

# The setup

Let's assume the following two server setup:

1. `test-eu.example.com` is a web-server in the EU.
1. `test-us.example.com` is a web-server in the US.
1. `test.example.com` is a traffic manager that either directs via CNAME to the EU or US based on latency (this requires FollowCNAME to be enabled).

To allow proper certifications, I use cert-manager to create the certificates and I am using DNS-based challenges to get two certificates:

1. EU certificate that is valid for `test-eu.example.com` and `test.example.com`.
1. US certificate that is valid for `test-us.example.com` and `test.example.com`.

Let's suppose that the ACME check is running in the US, so if I would use HTTP-based authentication then the check for `test.example.com` would always end up on the US server and I couldn't create a certificate for my EU server. With DNS-based challenges this isn't a problem and seems like a good solution.

## The problem

The problem arises when both the EU and US server start their challenges **at the same time**. No issue for `test-eu` and `test-us`, but the `test.example.com` has two challenges at the same time. It seems the following happens:

1. *EU*: Writes the `_acme-challenge.test.example.com.` TXT record with challenge `AAA`.
2. *US*: Writes the `_acme-challenge.test.example.com.` TXT record with challenge `BBB`.
3. *EU*: ACME validates the `_acme-challenge.test.example.com.` record and finds the incorrect value `BBB`. It fails the challenge and deletes the TXT record.
4. *US*: ACME validates the `_acme-challenge.test.example.com.` record and doesn't find the record and fails.

Both challenges fail and after 1 hour the certification manager tries again, but the same race-condition occurs and it keeps failing.

## Possible solutions

There are multiple solutions to this problem:

1. Make sure the ACME requests are synchronized and won't run in parallel. This is hard to accomplish when you don't have full control on deployment times (in our case Gitlab pushes the Helm scripts to both environments simultaneously). Because the ACME challenge/validation easily takes 1 minute, this could also make deployments much slower.
1. Use a custom prefix for `_acme-challenge` to differentiate between different requests, but I don't think this is supported by LetsEncrypt.
1. Don't remove the TXT record when the challenge failed. This would allow the second challenge to succeed and one hour later, the first challenge will probably succeed. Not ideal, but might be good enough for some situations.
1. Use multiple TXT values for the `_acme-challenge` DNS record.

The last option should be the preferred option, because it should be fully supported by LetsEncrypt. The documentation states:

> You can have multiple TXT records in place for the same name. For instance, this might happen if you are validating a challenge for a wildcard and a non-wildcard certificate at the same time. However, you should make sure to clean up old TXT records, because if the response size gets too big Let’s Encrypt will start rejecting it. [source](https://letsencrypt.org/docs/challenge-types/#dns-01-challenge)

The current DNS providers in cert-manager (only checked AzureDNS and Route53) seem to always replace the entire value and always delete the record (even when it doesn't match). I think this is wrong and should be replaced using the following functionality:

* When creating the challenge, it should check the `_acme-challenge.test.example.com.` record if it already exists. If not, then create a new TXT record with the given challenge. If it already exists, then the new TXT value should be added to the existing record.
* When removing the challenge, it should remove only the valid challenge from the TXT record. If there are no records left, then it should remove the entire record.

There is still the possibility of a race-condition, because there are always two requests, but this is a much-much smaller window than the current solution. I could start working on a pull-request to implement this for Azure DNS, but would this solution be acceptable to be merged into cert-manager?
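
The add-and-remove behaviour proposed in the issue above can be sketched roughly as follows. This is an in-memory illustration only, not cert-manager's code: `TxtRecordSet`, `present`, and `cleanup` are hypothetical names, and a real DNS provider would have to read and write the record set through its DNS API, which is where the smaller race window mentioned in the issue remains.

```python
class TxtRecordSet:
    """In-memory stand-in for one TXT record set in a DNS zone."""

    def __init__(self) -> None:
        self.values: set[str] = set()

    def present(self, value: str) -> None:
        # Append instead of overwriting values written by other clients.
        self.values.add(value)

    def cleanup(self, value: str) -> bool:
        # Remove only our own value and report whether the set is now
        # empty, so the caller knows it can delete the whole record.
        self.values.discard(value)
        return not self.values


record = TxtRecordSet()
record.present("AAA")  # EU challenge
record.present("BBB")  # US challenge
eu_done = record.cleanup("AAA")  # EU finishes; "BBB" must survive
```

After the EU cleanup the set still holds `BBB`, so the US validation can still find its value.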

Thanks
Krishna

Osiris · November 21, 2023, 12:31pm

I've considered the scenario and it certainly looks like an issue with cert-manager.

Is there anything else to consider?

iamkishore432 · November 21, 2023, 12:40pm

No, Osiris. I do understand that the fix has to be implemented in cert-manager; I just shared the link to explain the use case.

Thanks
Krishna

rg305 · November 21, 2023, 2:27pm

iamkishore432 · November 21, 2023, 3:04pm

No, the DNS host names are tied to environments, similar to the setup explained in the GitHub link, so we cannot have a centralized system.

Thanks
Krishna

jvanasco · November 21, 2023, 3:50pm

IMHO the right solution is what @petercooperjr and @rg305 suggested above: to obtain certificates on one centralized network/machine and then distribute them to the other networks/machines as a post-success hook.

The usage you described - two parallel requests - is often an anti-pattern that leads to downtime issues due to the effects of rate-limiting when it is not properly implemented on automated systems.

In case you are not using this as an anti-pattern, which is possible but not likely given the experiences of people posting similar problems here before...

This is a defect in cert-manager with two possible solutions:

cert-manager fixes it.
you switch to another client.

Considering how long this has been happening against the open ticket you shared, I would suggest moving to another ACME client.

If that is not an option, why not force the tasks to run in sequence? You can set the renewals to only work on even days in one DNS region, and odd days in the other DNS region. If you have multiple regions, just divide a day into multiple buckets and only allow renewals to be triggered at that point.
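
As a rough sketch of the bucket idea above, assuming each region is given an index and all regions share the same clock (UTC here): the function name and parameters are hypothetical, and this is just gating logic you would have to wire into whatever triggers the renewal, not a cert-manager setting.

```python
from datetime import datetime, timezone


def renewal_allowed(region_index: int, region_count: int, now: datetime) -> bool:
    """Return True when `now` falls in this region's slice of the day."""
    bucket_seconds = 24 * 60 * 60 // region_count
    seconds_into_day = now.hour * 3600 + now.minute * 60 + now.second
    # min() guards the last few seconds when the day does not divide evenly
    current_bucket = min(seconds_into_day // bucket_seconds, region_count - 1)
    return current_bucket == region_index


# Example timestamp; with two regions, 15:50 UTC is in the second bucket.
now = datetime(2023, 11, 21, 15, 50, tzinfo=timezone.utc)
```

Because exactly one region index matches at any moment, two regions never start a renewal in the same bucket.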

If you are encountering a scenario where that is not possible because of how you start/stop services, you are probably leveraging one or more anti-patterns in your deployment and should be using the centralized method.

webprofusion · November 22, 2023, 2:58am

Like others have suggested, a strategy that makes particular sense for containers that are constantly re-deployed is to renew certificates periodically and store them in a secrets vault (such as Azure KeyVault), then deploy the certificate regularly (and on container startup) by pulling from the secrets vault. This way the renewal of the actual cert is managed by one process and its reporting can be centralized, while actual deployment to the container is kept as an independent process (just use the latest secret and fetch it regularly).
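
A minimal sketch of that split between one renewing process and many fetching servers, using an in-memory dictionary as a stand-in for the vault. `SecretStore`, `CertBundle`, and the method names are hypothetical and do not correspond to any real vault SDK; a real store would also need access control and encryption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CertBundle:
    certificate_pem: str
    private_key_pem: str
    version: int


class SecretStore:
    """In-memory stand-in for a secrets vault."""

    def __init__(self) -> None:
        self._latest: dict[str, CertBundle] = {}

    def publish(self, name: str, cert_pem: str, key_pem: str) -> CertBundle:
        # Called only by the single renewal process.
        previous = self._latest.get(name)
        version = previous.version + 1 if previous else 1
        bundle = CertBundle(cert_pem, key_pem, version)
        self._latest[name] = bundle
        return bundle

    def fetch_latest(self, name: str) -> CertBundle:
        # Called by every server on startup and on a timer.
        return self._latest[name]


store = SecretStore()
store.publish("test.example.com", "CERT-1", "KEY-1")
store.publish("test.example.com", "CERT-2", "KEY-2")  # a later renewal
eu_bundle = store.fetch_latest("test.example.com")
us_bundle = store.fetch_latest("test.example.com")
```

Both servers end up with the same latest bundle, and only one process ever talks to the CA, so there is no competing DNS-01 challenge for the shared name.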

Out of interest, are you using Azure Container Apps for your deployment, or something else?

system · December 22, 2023, 2:59am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
