Parallel certificate requests with the same dnsName for two different servers are failing

I am triggering a pipeline that requests certificates for two different environments in parallel (at almost the same time). When the first environment's certificate request is submitted, we can see a TXT record being added in DNS. But before the ACME challenge for that TXT record completes, the record is replaced by the one for the new certificate request from the other environment (in my case both challenges are for the same DNS name), and I see the following error in the first environment's DNS activity logs.

"cert-manager/challenges/acceptChallenge: error waiting for authorization" err="acme: authorization error for dnsrecord 403 urn:ietf:params:acme:error:unauthorized: Incorrect TXT record "3FTytUY2onOhi3SsTBRFV9AnASjhtMuwEi74WI" found at _acme-challenge.dnsrecord" resource_name="certificate-xxxxxx" resource_namespace="product" resource_kind="Challenge" resource_version="v1" dnsName="dnsrecord" type="DNS-01"

Is there any option to keep the TXT record from being replaced?

That's not up to Let's Encrypt, but up to the ACME client and/or DNS system.

I am quite new to this and need help with the suggestion. I am using an Azure DNS zone for my environment; in my scenario, what would the ACME client be?

Your ACME client is cert-manager (it appears in the log line you posted).

No.
The TXT record is new with every certificate request [even if for the same name].
[it would be impossible to finalize an order twice]
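
To illustrate why the value changes each time: per RFC 8555 (section 8.4), the DNS-01 TXT value is the base64url-encoded SHA-256 digest of the key authorization, which is the challenge token joined to the account key thumbprint. A minimal sketch, where the token and thumbprint strings are made-up placeholders and not real values:

```python
import base64
import hashlib


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """Compute a DNS-01 TXT value from a challenge token and account thumbprint."""
    # Key authorization = token + "." + base64url(JWK thumbprint of the account key)
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url encoding with the trailing '=' padding removed
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Placeholder inputs: each new order gets a fresh token from the CA,
# so two requests for the same name end up with different TXT values.
first = dns01_txt_value("token-for-order-1", "example-thumbprint")
second = dns01_txt_value("token-for-order-2", "example-thumbprint")
print(first != second)
```

Because the CA issues a new token for every authorization, two parallel requests for the same name will each expect their own value at the same `_acme-challenge` name.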

My advice is to:

  • Treat each system independently of the other.
  • Work on only one system [until you get it to work].
    [use the testing environment for all testing]
  • Apply the solution from the first system to the second system.
    [testing would be minimal - as you should already have a working solution]

OR
Only work on one system and then copy that certificate onto the second system.

Thank you for the response, @rg305. I will check this and test the two systems individually.

Thanks
Krishna

OP is talking about the TXT RR being replaced in the DNS zone. It should be perfectly possible to have multiple TXT RRs present in the DNS zone at the same time.

OK, I see that now - I hadn't read the entire post.
[TL;DR]

(I've moved this to the Help category, I think that makes the most sense for what you're trying to do.)

The ACME client on each server should be able to add and remove its own TXT records without impacting the other client's records. It wouldn't shock me if that were a scenario that most ACME clients hadn't had much testing around (particularly considering the wide variety of DNS APIs out there).
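
As a rough sketch of that behavior, here is a toy in-memory model of a TXT record set where each client appends its own value and later removes only that value. `TxtZone` and the value strings are hypothetical stand-ins for illustration; this is not the Azure DNS API or cert-manager's code:

```python
from collections import defaultdict


class TxtZone:
    """Hypothetical in-memory stand-in for a DNS zone's TXT record sets."""

    def __init__(self) -> None:
        self._records: dict[str, set[str]] = defaultdict(set)

    def add_value(self, name: str, value: str) -> None:
        # Append to the record set instead of overwriting it,
        # so values written by other clients stay in place.
        self._records[name].add(value)

    def remove_value(self, name: str, value: str) -> None:
        # Remove only the value this client created.
        self._records[name].discard(value)
        if not self._records[name]:
            # Drop the record set once nobody has a value left in it.
            del self._records[name]

    def values(self, name: str) -> set[str]:
        return set(self._records.get(name, set()))


zone = TxtZone()
name = "_acme-challenge.example.com"
zone.add_value(name, "value-from-env-a")
zone.add_value(name, "value-from-env-b")  # does not replace env A's value
zone.remove_value(name, "value-from-env-a")  # env A cleans up after itself
print(zone.values(name))
```

With this approach both challenge values are visible at the same name while both validations are pending, and cleanup by one environment leaves the other's value untouched.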

This is the suggestion I would recommend. I don't know the specifics of cert-manager or your infrastructure, but if you have one centralized place that requests the certificates and stores them (and the private key) somewhere securely, and then each server loads the key and cert from that secure store, then I think that's the easiest approach for scaling out to multiple systems that each need to do their own TLS termination.

Things that may help clarify the situation/problem:

  • Are the two systems using the same LE account?
  • Is there a setting within cert-manager for it NOT to delete TXT records prior to creating a new one?
    [only delete after use]
  • Can you modify your process to centralize the requests from one single system [sequentially]?

Sorry for the late response. In parallel, I have been looking through the cert-manager GitHub page for any open issues or PRs that fix this. I found the issue linked below, which matches my problem 100%. Please consider the scenario from that link.

Thanks
Krishna

I've considered the scenario and it certainly looks like an issue with cert-manager.

Is there anything else to consider?

No, Osiris. I do understand that the fix has to be implemented in cert-manager; I just shared the link to explain the use case.

Thanks
Krishna

No, the DNS hosts (DNS names) are based on environments similar to the one explained in the GitHub link, so we cannot have a centralized system.

Thanks
Krishna

IMHO the right solution is what @petercooperjr and @rg305 suggested above: to obtain certificates on one centralized network/machine and then distribute them to the other networks/machines as a post-success hook.

The usage you described - two parallel requests - is often an anti-pattern that leads to downtime issues due to the effects of rate-limiting when it is not properly implemented on automated systems.

In case you are not using this as an anti-pattern, which is possible but not likely given the experiences of people posting similar problems here before...

This is a defect in cert-manager with two possible solutions:

  • cert-manager fixes it.
  • you switch to another client.

Considering how long the ticket you shared has been open, I would suggest moving to another ACME client.

If that is not an option, why not force the tasks to run in sequence? You can set the renewals to only work on even days in one DNS region, and odd days in the other DNS region. If you have multiple regions, just divide a day into multiple buckets and only allow renewals to be triggered at that point.
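
A minimal sketch of the bucket idea, assuming each region is given a fixed index and the UTC day is split into equal time slots. The function name and parameters are made up for illustration and are not a cert-manager feature:

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def may_renew(region_index: int, region_count: int, now: datetime) -> bool:
    """Return True if `region_index` owns the time bucket that contains `now`."""
    utc_now = now.astimezone(timezone.utc)
    seconds_into_day = utc_now.hour * 3600 + utc_now.minute * 60 + utc_now.second
    # Split the day into `region_count` equal buckets and find the current one.
    current_bucket = seconds_into_day * region_count // SECONDS_PER_DAY
    return current_bucket == region_index


# Example with three regions: each one owns an eight-hour window per day.
moment = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)
for region in range(3):
    print(region, may_renew(region, 3, moment))
```

The pipeline for each region would check this before requesting a certificate and skip the run otherwise. A request that starts right at the end of a bucket can still overlap with the next region's window, so leaving a margin at the bucket boundaries would be sensible.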

If you are encountering a scenario where that is not possible because of how you start/stop services, you are probably leveraging one or more anti-patterns in your deployment and should be using the centralized method.

Like others have suggested, a strategy that makes particular sense for containers that are constantly re-deployed is to renew certificates periodically and store them in a secrets vault (such as Azure KeyVault), then deploy the certificate regularly (and on container startup) by pulling from the secrets vault. This way the renewal of the actual cert is managed by one process and its reporting can be centralized, while actual deployment to the container is kept as an independent process (just use the latest secret and fetch it regularly).
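
To make the split between renewal and deployment concrete, here is a small sketch using an in-memory stand-in for the vault. `InMemoryVault`, `CertBundle` and `fetch_if_newer` are hypothetical names for illustration, not the Azure Key Vault SDK; a real setup would replace the class with calls to whatever secret store you use:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CertBundle:
    version: int
    certificate_pem: str
    private_key_pem: str


class InMemoryVault:
    """Hypothetical stand-in for a secrets vault that keeps the latest bundle."""

    def __init__(self) -> None:
        self._latest: dict[str, CertBundle] = {}

    def publish(self, name: str, certificate_pem: str, private_key_pem: str) -> CertBundle:
        # Called only by the single renewal process after a successful issuance.
        previous = self._latest.get(name)
        version = 1 if previous is None else previous.version + 1
        bundle = CertBundle(version, certificate_pem, private_key_pem)
        self._latest[name] = bundle
        return bundle

    def latest(self, name: str) -> Optional[CertBundle]:
        return self._latest.get(name)


def fetch_if_newer(vault: InMemoryVault, name: str, deployed_version: int) -> Optional[CertBundle]:
    """Return the stored bundle only if it is newer than what is deployed."""
    bundle = vault.latest(name)
    if bundle is None or bundle.version <= deployed_version:
        return None
    return bundle


vault = InMemoryVault()
vault.publish("example.com", "CERT-PEM-PLACEHOLDER", "KEY-PEM-PLACEHOLDER")
# Each container runs this on startup and on a timer; 0 means nothing deployed yet.
update = fetch_if_newer(vault, "example.com", deployed_version=0)
print(update.version if update else "up to date")
```

Only the renewal process ever talks to the CA, so the environments never race each other on the TXT record; they just poll the store and reload when the version changes.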

Out of interest, are you using Azure Container Apps for your deployment, or something else?
