Dns-01 challenge: state transition 'pending' to state 'failed' too short

SoCalDude · June 14, 2019, 7:36pm

I am working on implementing an ACME client that uses the dns-01 challenge in combination with automatically setting the DNS TXT record.

After setting the DNS TXT record, the ACME client initiates the dns-01 challenge and starts polling the challenge URL. I noticed that the Let’s Encrypt server transitions from the ‘pending’ state to ‘failed’ very quickly. I would have assumed that the Let’s Encrypt server would keep checking the DNS record for a much longer time.

I can overcome the problem by adding a delay between setting the TXT record and initiating the ACME dns-01 challenge. However, this seems less than optimal since I want the client to be as fast as possible.

JuergenAuer · June 14, 2019, 7:46pm

Hi @SoCalDude

that's not the complete story.

The client initiates the Let's Encrypt check by doing a POST to the challenge URL. Only then does Let's Encrypt start checking the values.

Read: Responding to Challenges

So if you write your own client:

Check the order
Create the challenges (dns entries)
validate the challenges with your own code
do the POST to the challenge URL, then Let's Encrypt starts.

Looks like you use a client that uses the challenge url after creating the challenge. That can't work.
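The order of steps in the list above can be sketched roughly as follows. This is only an illustration: `create_txt_record`, `record_is_visible`, and `post_challenge` are hypothetical callables standing in for whatever your client does, not part of any real ACME library, and the interval and check count are arbitrary defaults.

```python
import time
from typing import Callable


def complete_dns01(
    create_txt_record: Callable[[], None],
    record_is_visible: Callable[[], bool],
    post_challenge: Callable[[], None],
    check_interval: float = 5.0,
    max_checks: int = 12,
) -> bool:
    """Publish the TXT record, confirm it with our own check, then notify the CA.

    All three callables are hypothetical placeholders for client-specific code.
    Returns True if the challenge POST was sent, False if the record was never
    seen and the POST was therefore skipped.
    """
    create_txt_record()
    for _ in range(max_checks):
        if record_is_visible():
            # The CA only starts validating after this POST.
            post_challenge()
            return True
        time.sleep(check_interval)
    # Record never became visible: do not POST, so the challenge is not burned.
    return False
```

The point of the structure is that the POST is the very last step, and it is skipped entirely if your own check never sees the record.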

SoCalDude · June 14, 2019, 8:21pm

Sorry for the insufficient explanation, but I already do the steps as you have them in your list.

Check the order
Create the challenges (dns entries)
validate the challenges with your own code
— DELAY —
do the POST to the challenge URL, then Let's Encrypt starts.

The code fails without the delay. There seems to be some inertia in the DNS, and it would be good if the Let’s Encrypt server compensated for this.

_az · June 14, 2019, 8:39pm

Retrying challenges (RFC 8555 - Automatic Certificate Management Environment (ACME)) is a SHOULD. Let's Encrypt doesn't implement it.

So you have to be completely sure that every single nameserver of your domain is advertising the new records.

Let's Encrypt will obey the TTL of the record you create, up to a maximum of 60 seconds. You can work around this by setting the TTL to 0 or 1 second.

Otherwise, the cause of the failure is most likely that not all of your nameservers are serving the same zonefile at the time that you respond to the challenge.

How are you doing this?
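For the "every single nameserver" requirement above, a per-nameserver check could look something like this minimal sketch. `lookup_txt` is a hypothetical callable you would supply yourself: it takes one nameserver and returns the TXT values that server currently gives for your `_acme-challenge` name, queried directly rather than through a caching resolver. The nameserver names in the usage example are made up.

```python
from typing import Callable, Iterable, List


def nameservers_missing_record(
    nameservers: Iterable[str],
    lookup_txt: Callable[[str], List[str]],
    expected_value: str,
) -> List[str]:
    """Return the nameservers that do not (yet) serve expected_value.

    lookup_txt is a hypothetical, caller-supplied function that queries one
    authoritative nameserver directly and returns the TXT strings it serves.
    """
    missing = []
    for ns in nameservers:
        try:
            values = lookup_txt(ns)
        except Exception:
            # A timeout or lookup error is treated the same as "not ready".
            values = []
        if expected_value not in values:
            missing.append(ns)
    return missing


# Usage with canned answers instead of real DNS queries:
answers = {"ns1.example.net": ["token-digest"], "ns2.example.net": []}
not_ready = nameservers_missing_record(
    ["ns1.example.net", "ns2.example.net"],
    lambda ns: answers[ns],
    "token-digest",
)
print(not_ready)
```

Only respond to the challenge once the returned list is empty.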

rmbolger · June 14, 2019, 8:59pm

Just wanted to comment on this. While it may seem like a feature to have your client able to get through the cert ordering process as quickly as possible (particularly during development when you're testing things), it's not really something that matters in production in the vast majority of cases. In a typical environment, a cert is only going to be renewed once every 60-ish days (assuming you start trying to renew 30 days out, which is what most clients tend to do). Whether that renewal takes seconds, minutes, or hours is irrelevant as long as it eventually succeeds prior to the previous cert's expiration. Even if the initial attempt fails for some reason (ACME server issues, temporary DNS issues, etc.), a well-functioning client will retry until it succeeds.

So I guess the point is, don't worry too much about optimizing the speed of your client. Worry more about making it robust and able to gracefully deal with failures and retry. If that means adding some extra delays, so be it.

Most clients that automate DNS challenges expect that everyone's DNS propagation delay is going to be different and make it a configurable option. Some try to automate the checking of authoritative records, but there are an increasing number of environments where that's not possible from the server running the ACME client, either because corporate policies block external DNS resolution to prevent data exfiltration via DNS, or because the server simply has no outbound Internet access.

_az · June 14, 2019, 9:05pm

This is a very good point. The other way this can manifest itself is if your nameservers are anycasted, it becomes literally impossible to "know" whether it's ready, unless the specific DNS API supports reporting the propagation status of a changeset (like Google Cloud DNS does).

If I can suggest a strategy that has been successful for my ACME client, in a diverse nameserver environment:

Sleep for N (tunable duration), then
Up to 3 attempts to verify that the record is visible at all via an iterative lookup, sleeping N after each attempt.

If you get through all of that and still can't confirm visibility, just submit anyway and hope for the best.
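A minimal sketch of that strategy, for illustration only. `is_visible` and `submit` are hypothetical callables (your own iterative lookup and your own challenge POST), and the sketch assumes the sleep after an attempt is only needed when that attempt did not see the record.

```python
import time
from typing import Callable


def wait_then_submit(
    is_visible: Callable[[], bool],
    submit: Callable[[], None],
    delay: float,
    attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Sleep, try a few times to confirm the record, then submit regardless.

    is_visible and submit are hypothetical, caller-supplied functions.
    Returns True if visibility was confirmed before submitting, False if the
    challenge was submitted unconfirmed.
    """
    sleep(delay)  # initial tunable propagation delay
    confirmed = False
    for _ in range(attempts):
        if is_visible():
            confirmed = True
            break
        sleep(delay)  # assumption: only sleep again after a failed attempt
    submit()  # submitted even when unconfirmed ("hope for the best")
    return confirmed
```

The return value is only useful for logging: the challenge is submitted either way.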

SoCalDude · June 14, 2019, 9:11pm

Thanks rmbolger, I’ll implement the tunable sleep option.

ezekiel · June 14, 2019, 9:13pm

And with retries, always build in exponential backoff up to a limit so you don’t hammer the API.
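Capped exponential backoff could be sketched like this. The base delay, factor, cap, and retry count are arbitrary illustrative values, and `operation` is a hypothetical callable that returns True on success.

```python
import time
from typing import Callable, Iterable, Iterator


def backoff_delays(
    base: float = 1.0, factor: float = 2.0, cap: float = 60.0, retries: int = 6
) -> Iterator[float]:
    """Yield `retries` delays that grow by `factor` and never exceed `cap`."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor


def retry_with_backoff(
    operation: Callable[[], bool],
    delays: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try once, then retry after each delay until success or delays run out."""
    if operation():
        return True
    for delay in delays:
        sleep(delay)
        if operation():
            return True
    return False


print(list(backoff_delays(base=1.0, factor=2.0, cap=10.0, retries=6)))
# prints [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

Adding some random jitter to each delay is a common refinement, but it is left out here to keep the sketch short.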

system · July 14, 2019, 9:13pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
