DNS-01 challenge: transition from 'pending' to 'failed' happens too quickly

I am working on implementing an ACME client that uses the dns-01 challenge in combination with automatically setting the DNS TXT record.

After setting the DNS TXT record, the ACME client initiates the dns-01 challenge and starts polling the challenge URL. I noticed that the Let’s Encrypt server transitions from the ‘pending’ state to ‘failed’ very quickly. I would have assumed that the Let’s Encrypt server would keep checking the DNS record for a much longer time.

I can overcome the problem by adding a delay between setting the TXT record and initiating the ACME dns-01 challenge. However, this seems less than optimal, since I want the client to be as fast as possible.

Hi @SoCalDude

that's not the complete story.

The client triggers the Let's Encrypt check by doing a POST to the challenge URL. Only then does Let's Encrypt start checking the values.

Read "Responding to Challenges".

So if you write your own client:

  • Check the order
  • Create the challenges (dns entries)
  • Validate the challenges with your own code
  • Do the POST to the challenge URL, then Let's Encrypt starts.

It looks like your client POSTs to the challenge URL right after creating the challenge, without validating the DNS entries first. That can't work.

Sorry for the insufficient explanation, but I already do what you have in your list:

  • Check the order
  • Create the challenges (dns entries)
  • Validate the challenges with your own code
  • — DELAY —
  • Do the POST to the challenge URL, then Let's Encrypt starts.

The code fails without the delay. There seems to be some inertia in the DNS, and it would be good if the Let’s Encrypt server compensated for this.

Retrying challenges on the server side is only a SHOULD in RFC 8555 (Automatic Certificate Management Environment (ACME)), and Let's Encrypt doesn't implement it.

So you have to be completely sure that every single nameserver of your domain is advertising the new records.
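A minimal sketch of that check, assuming you already have a way to ask one specific nameserver for the TXT values of `_acme-challenge.<your domain>`. The `query_txt` callable is hypothetical and stands in for whatever DNS library or API you use; the function only shows the "every single nameserver must agree" logic.

```python
from typing import Callable, Iterable, Set


def record_visible_everywhere(
    nameservers: Iterable[str],
    query_txt: Callable[[str], Set[str]],
    expected: str,
) -> bool:
    """Return True only if every nameserver serves the expected TXT value.

    query_txt is a hypothetical helper: given a nameserver, it returns the
    set of TXT values that server currently answers with.
    """
    servers = list(nameservers)
    if not servers:
        # No known nameservers means nothing was confirmed.
        return False
    for server in servers:
        try:
            values = query_txt(server)
        except Exception:
            # Treat any lookup error as "not visible yet" in this sketch.
            return False
        if expected not in values:
            return False
    return True


# Usage with canned answers instead of real DNS queries:
fake_answers = {
    "ns1.example.net": {"token-abc"},
    "ns2.example.net": set(),  # this server has not picked up the record yet
}
print(record_visible_everywhere(fake_answers, lambda s: fake_answers[s], "token-abc"))  # False
```

Only respond to the challenge once this returns True for the full list of authoritative nameservers.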

Let's Encrypt will honor the TTL of the record you create, up to a maximum of 60 seconds. You can work around this by setting the TTL to 0 or 1 second.

Otherwise, the cause of the failure is most likely that not all of your nameservers are serving the same zone file at the time that you respond to the challenge.

How are you doing this?


Just wanted to comment on this. While it may seem like a feature to have your client get through the cert ordering process as quickly as possible (particularly during development when you're testing things), it's not something that actually matters in production in the vast majority of cases. In a typical environment, a cert is only going to be renewed once every 60-ish days (assuming you start trying to renew 30 days out, which is what most clients tend to do). Whether that renewal takes seconds, minutes, or hours is irrelevant as long as it eventually succeeds prior to the previous cert's expiration. Even if the initial attempt fails for some reason (ACME server issues, temporary DNS issues, etc.), a well-functioning client will retry until it succeeds.

So I guess the point is, don't worry too much about optimizing the speed of your client. Worry more about making it robust and able to gracefully deal with failures and retry. If that means adding some extra delays, so be it.

Most clients that automate DNS challenges expect that everyone's DNS propagation delay is going to be different and make it a configurable option. Some try to automate the checking of authoritative records, but there are an increasing number of environments where that's not possible from the server running the ACME client, either because corporate policies block external DNS resolution to prevent data exfiltration via DNS, or because the server simply doesn't have outbound Internet access.


This is a very good point. Another way this can manifest itself: if your nameservers are anycast, it becomes literally impossible to "know" whether the record is ready, unless the specific DNS API supports reporting the propagation status of a changeset (like Google Cloud DNS does).

If I can suggest a strategy that has been successful for my ACME client, in a diverse nameserver environment:

  1. Sleep for N (tunable duration), then
  2. Make up to 3 attempts to verify that the record is visible at all via an iterative lookup, sleeping N after each attempt.

If you get through all of that and still can't confirm visibility, just submit anyway and hope for the best.
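The strategy above could be sketched roughly like this. It is only an illustration, not my client's actual code: `wait_for_record`, `is_visible` and the `sleep` parameter are made-up names, and `is_visible` stands in for whatever iterative lookup you implement. This version skips the sleep once the record has been seen.

```python
import time
from typing import Callable


def wait_for_record(
    is_visible: Callable[[], bool],
    delay_seconds: float,
    attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Sleep once, then check visibility up to `attempts` times.

    Returns True if the record was seen, False if every attempt failed.
    The caller submits the challenge either way.
    """
    sleep(delay_seconds)  # step 1: the initial tunable sleep
    for _ in range(attempts):
        if is_visible():  # step 2: hypothetical iterative lookup
            return True
        sleep(delay_seconds)  # sleep N again after a failed attempt
    return False


# Usage with a fake check that succeeds on the second attempt and a
# fake sleep that records the waits instead of blocking:
results = iter([False, True])
waits = []
confirmed = wait_for_record(lambda: next(results), 30.0, sleep=waits.append)
print(confirmed, waits)  # True [30.0, 30.0]
```

Passing `sleep` as a parameter is just there so the logic can be tested without waiting; in a real client you would leave the default.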


Thanks rmbolger, I’ll implement the tunable sleep option.

And with retries, always build in exponential backoff up to a limit so you don’t hammer the API :)
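For example, a capped exponential backoff schedule could look something like this. The function name and the default values are arbitrary choices for illustration, so tune them to your own client.

```python
from typing import Iterator


def backoff_delays(
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 60.0,
    retries: int = 6,
) -> Iterator[float]:
    """Yield `retries` delays that grow by `factor` but never exceed `cap`."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor


# Delays in seconds for six retries, capped at 10 seconds:
print(list(backoff_delays(base=1.0, factor=2.0, cap=10.0, retries=6)))
# [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

Sleep for each yielded delay between attempts and give up (or alert) once the generator is exhausted.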
