Error when requesting LE to validate DNS challenge

Hi,
Lately we have been getting a lot of errors when we ask LE to validate the DNS challenge. I will describe the flow:

We provide a service for ordering certificates from LE using a DNS challenge. When a user asks to order a certificate, we perform the following actions:

  1. We perform a dry-run challenge using a fake TXT challenge value.
  2. We validate the dry-run challenge to make sure the user really owns the domain.
  3. We start the order process against LE and get the real DNS challenge.
  4. We send the domain validation request to the user's DNS challenge handler API.
  5. Once the user responds with 200 OK, we check that the challenge is satisfied by querying for the TXT record. Only if the record is found do we send LE a request to validate the challenge.

So lately we have been getting a lot of errors when asking LE to validate the challenge. These are the logs:

Oct 29 09:48:26 orders-manager-96c7888cf-d7557 orders-manager DEBUG DNS TXT query for domain '29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com' found record with value 2_X2dAOlRER35PxiwduncpXq0yQgPV8_llrpkPeY2Co was found, certificate 'crn:v1:staging:public:cloudcerts:us-south:a/56f52a905e4c4d8614b507ef330225e0:914705eb-2e1a-46e3-ab70-a7c01bef46d0:certificate:53aeb30b3cfb60dd1a40642cdf52e2ef'. The challenge was satisfied
Oct 29 09:48:26 orders-manager-96c7888cf-d7557 orders-manager DEBUG Domain validation for'29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com' finished successfully.
Oct 29 09:48:27 orders-manager-96c7888cf-d7557 orders-manager DEBUG Going to call the CA to validate the challenge for domain '29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com after a delay of '40000' ms
Oct 29 09:49:07 orders-manager-96c7888cf-d7557 orders-manager DEBUG The CA accepted the request to validate the challenge for domain '29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com'. Response headers are {"server":"nginx","date":"Thu, 29 Oct 2020 07:49:07 GMT","content-type":"application/json","content-length":"184","connection":"close","boulder-requester":"51223980","cache-control":"public, max-age=0, no-cache","link":"<https://acme-v02.api.letsencrypt.org/directory>;rel=\"index\", <https://acme-v02.api.letsencrypt.org/acme/authz-v3/8215809301>;rel=\"up\"","location":"https://acme-v02.api.letsencrypt.org/acme/chall-v3/8215809301/_F6PXA","replay-nonce":"0103GruBFvMzhOE8PsJhUTFH_75z-1Z-B9W8qZpxdIeDd6A","x-frame-options":"DENY","strict-transport-security":"max-age=604800"}. Response body is {"type":"dns-01","status":"pending","url":"https://acme-v02.api.letsencrypt.org/acme/chall-v3/8215809301/_F6PXA","token":"q4kZhTvRFHOFvgm98m2EKG7SinF4kemRraIK56AjCRQ"}
Oct 29 09:49:08 orders-manager-96c7888cf-d7557 orders-manager ERROR Couldn't order certificate for domains '["29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com"]'. Reason is: Certificate Manager was not able to process your request. Domain validation failed, check your DNS configuration.
Oct 29 09:49:08 orders-manager-96c7888cf-d7557 orders-manager DEBUG Polling domain '29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com' challenge validation status. Attempt number 1. Total polling delay 1 seconds
Oct 29 09:49:08 orders-manager-96c7888cf-d7557 orders-manager DEBUG Polled domain '29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com' challenge validation status from 'https://acme-v02.api.letsencrypt.org/acme/chall-v3/8215809301/_F6PXA'. Status is: 200. Response body is '{"type":"dns-01","status":"invalid","error":{"type":"urn:ietf:params:acme:error:dns","detail":"DNS problem: NXDOMAIN looking up TXT for _acme-challenge.29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com - check that a DNS record exists for this domain","status":400},"url":"https://acme-v02.api.letsencrypt.org/acme/chall-v3/8215809301/_F6PXA","token":"q4kZhTvRFHOFvgm98m2EKG7SinF4kemRraIK56AjCRQ"}'
Oct 29 09:49:08 orders-manager-96c7888cf-d7557 orders-manager ERROR Polled domain '29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com' challenge validation - status is 'invalid'. response body: '{"type":"dns-01","status":"invalid","error":{"type":"urn:ietf:params:acme:error:dns","detail":"DNS problem: NXDOMAIN looking up TXT for _acme-challenge.29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com - check that a DNS record exists for this domain","status":400},"url":"https://acme-v02.api.letsencrypt.org/acme/chall-v3/8215809301/_F6PXA","token":"q4kZhTvRFHOFvgm98m2EKG7SinF4kemRraIK56AjCRQ"}'
Oct 29 09:49:08 orders-manager-96c7888cf-d7557 orders-manager ERROR Domain '29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com' challenge validation polling failed. Reason is: 'Certificate Manager was not able to process your request. Domain validation failed, check your DNS configuration.'

In the logs, we can see that our server finds the TXT record. After finding it, we wait 40 seconds and then post the request to LE to validate the challenge.
Even though we find the TXT record, LE responds with the error "DNS problem: NXDOMAIN looking up TXT for _acme-challenge.29102012.preprod.e2e.certificate-manager.test.cloud.ibm.com - check that a DNS record exists for this domain".

So I wonder what we are doing wrong, given that we find the TXT record but LE doesn't. What should be done to prevent these errors?

Thanks for the help!

IBM uses anycast for their SoftLayer DNS servers: https://cloud.ibm.com/docs/dns?topic=dns-dns-faq

It might be as simple as the complete set of DNS servers behind those two IP addresses needing some more time to propagate your TXT record to all the servers. It could be a regional thing too: perhaps the DNS server behind the IP address you connect to has the correct TXT record, while at a different location a different DNS server behind the same IP address doesn't have it yet.

Also, just to satisfy my curiosity: why do you require actual publicly valid certificates for a test environment?

Please go ahead and feel free to contribute. Thanks!

I see some basic DNS issues in several places (along the way):

And now a question:
Why even use such a name (from test.cloud.IBM.COM) ?
[unless you work for Big Blue]

Hi, yes, I'm from IBM, and as part of the certificate manager service we enable users to order certificates from LE.
This domain is used in our end-to-end testing to make sure everything works as expected, but recently we started getting the error I mentioned. It worked perfectly for quite a long time.
After the test we delete the domain, so the DNS issues you see are probably because of that.
So what is the correct way to validate the TXT record? If our service finds it, how can we make sure that the LE servers will also be able to find the DNS challenge TXT record?
And in case of such an error, is there a way to re-post the challenge, or must we create a new order? Is there a way for an invalid authorization to become valid?

Thanks!

If the DNS API you use does not have a way to track whether all the anycast edges have the new zonefile, then you can simply wait for a longer time to allow the propagation to occur. It's a pretty reliable hack.
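A slightly less blind variant of "wait longer" is to keep checking and only proceed once the record has been seen several times in a row. Below is a minimal sketch of that idea, not a definitive implementation. The names `waitForStableRecord`, `nextStreak`, and the `check` callback are hypothetical (they are not part of any library), and the default numbers are arbitrary assumptions. A streak of successes raises confidence but still cannot prove that every anycast edge has the record.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Pure helper: the streak grows on success and resets to zero on any miss.
function nextStreak(streak, ok) {
  return ok ? streak + 1 : 0;
}

// Poll `check` (a hypothetical async function returning true or false) until
// it succeeds `requiredStreak` times in a row, or give up after `maxAttempts`.
async function waitForStableRecord(
  check,
  { requiredStreak = 3, intervalMs = 10000, maxAttempts = 30 } = {}
) {
  let streak = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    streak = nextStreak(streak, await check());
    if (streak >= requiredStreak) return true;
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  return false;
}
```

Only when `waitForStableRecord` returns `true` would the challenge be posted to the CA; on `false` the order can be abandoned before it is ever marked invalid.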

There isn't. There's no transition out of the invalid state of an authorization.

Yes, because Let's Encrypt have not implemented pre-authorization.
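In code, that means an `invalid` status has to be treated as terminal for that order. Here is a minimal sketch of the decision, assuming the challenge object has the shape shown in your logs (a `status` field with the values defined by ACME). The function name `nextAction` and the returned labels are hypothetical and only for illustration.

```javascript
// Decide what to do after polling a challenge URL.
// `challenge` is the parsed JSON body returned by the CA.
function nextAction(challenge) {
  switch (challenge.status) {
    case 'pending':
    case 'processing':
      return 'poll-again'; // validation has not finished yet
    case 'valid':
      return 'continue'; // carry on with the order
    case 'invalid':
      return 'new-order'; // terminal state: a fresh order is required
    default:
      throw new Error(`Unexpected challenge status: ${challenge.status}`);
  }
}
```

Note that in your logs the first response to the validation request still says `pending`, so that response on its own says nothing about success or failure.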

Via global DNS.

There are many free online tools to validate the complete functionality of your DNS system.

It is safe to discard a pending order and start a new order (within normal limits - see rate limit page).

If you are in a testing mode, you should be using the testing environment (not production).

Thanks for the help.
The certificate ordering process runs on our server (a Node server), so we can't use online tools here.
I'm looking for the right way to do it from code.
We use Node's dns library to resolve the challenge's TXT records and validate them.
We create the DNS client using the IPs of the NS configured for the domain, like this:

            const ips = await getNsIpsForDomain(options, zone);
            client.setServers(ips);

So what is the correct way to create the DNS client so that it is aligned with the DNS servers LE uses?
We did add more delay before posting the challenge to LE, and we now wait 40 seconds. We are looking for a proper solution rather than blindly adding more delay.

Thanks!

I have no doubt you are putting the correct record in the right location.
Where I do have doubt, is within the DNS system you are using (not in the validation processes in use).

What do you mean? The DNS we use for managing the domain?

The DNS the entire Internet uses to get to your TXT records.

Did you see my first post?

Yes I did, and I answered that those errors are probably because we delete the domain right after the test execution.
But on second thought, analysing a new domain that is not deleted and is also managed in SoftLayer shows those errors too. So you think the errors we get are related to those DNS errors?

I think you are using the same DNS servers.
Which already showed problems.
And which probably have not been corrected.
Deleting a (sub)domain wouldn't fix those kinds of problems.

This would be sufficient if Softlayer DNS wasn't anycast. However, it is anycast. Your check is not foolproof because it is only checking a single pair of nameservers (and ignoring the dozens of others).

Since Let's Encrypt is checking that record from 4 different locations, it is likely to be hitting different anycast nameservers at the same time.
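One thing you can tighten on your side: as far as I know from the Node documentation, when `setServers` is given several servers the resolver does not fall through to the next one on a not-found answer, only on timeouts or other errors, so a single client with all the IPs does not really check all of them. Below is a minimal sketch that queries each nameserver IP separately and requires all of them to agree. The function names are hypothetical, and `ips` stands in for whatever your `getNsIpsForDomain` returns. It still only reaches the anycast nodes nearest to your server, so it narrows the gap rather than closing it.

```javascript
const { Resolver } = require('dns').promises;

// Join the chunks of each TXT record and look for the expected challenge value.
// `records` has the shape returned by resolveTxt(): an array of string arrays.
function txtContains(records, expected) {
  return records.some((chunks) => chunks.join('') === expected);
}

// Ask one nameserver IP for the TXT record; any DNS error counts as "not found".
async function checkOneServer(ip, name, expected) {
  const resolver = new Resolver();
  resolver.setServers([ip]);
  try {
    return txtContains(await resolver.resolveTxt(name), expected);
  } catch (err) {
    return false;
  }
}

// True only if every nameserver IP returned the expected value.
async function allServersAgree(ips, name, expected) {
  if (ips.length === 0) return false;
  const results = await Promise.all(
    ips.map((ip) => checkOneServer(ip, name, expected))
  );
  return results.every(Boolean);
}
```

Running the same check from VMs in a few different regions would get you closer to what a multi-location validator sees.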

There is one experiment you can do to see how long it actually takes to propagate your new record. Create a zone/record/whatever and then immediately:

If you find that your record is reliably available from every location within 40 seconds, great. But I think it's worth actually verifying that. Especially if your provisioning process involves creating a whole new zone on Softlayer.
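If you would rather run that experiment from code, here is a minimal sketch that polls a list of server IPs and records how long each one took to first return the expected value. The names `measurePropagation`, `seenBy`, and `recordRound` are hypothetical, and the interval and time limit are arbitrary assumptions. It measures from a single vantage point only, so it is worth running from several locations.

```javascript
const { Resolver } = require('dns').promises;

// Pure helper: note the elapsed time at which each server first returned the
// expected value. `firstSeen` maps server IP to ms, or null while still missing.
function recordRound(firstSeen, results, elapsedMs) {
  const updated = { ...firstSeen };
  for (const [ip, found] of Object.entries(results)) {
    if (found && updated[ip] == null) updated[ip] = elapsedMs;
  }
  return updated;
}

// Query one server; any DNS error (including NXDOMAIN) counts as "not seen".
async function seenBy(ip, name, expected) {
  const resolver = new Resolver();
  resolver.setServers([ip]);
  try {
    const records = await resolver.resolveTxt(name);
    return records.some((chunks) => chunks.join('') === expected);
  } catch (err) {
    return false;
  }
}

// Poll until every server has seen the record or `maxMs` has passed.
async function measurePropagation(
  ips, name, expected, { intervalMs = 2000, maxMs = 300000 } = {}
) {
  const start = Date.now();
  let firstSeen = Object.fromEntries(ips.map((ip) => [ip, null]));
  while (Date.now() - start < maxMs && Object.values(firstSeen).includes(null)) {
    const pairs = await Promise.all(
      ips.map(async (ip) => [ip, await seenBy(ip, name, expected)])
    );
    firstSeen = recordRound(firstSeen, Object.fromEntries(pairs), Date.now() - start);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return firstSeen; // servers still mapped to null never returned the record
}
```

Start it right after creating the record; the largest value in the result is a rough lower bound on the delay you need from that location.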

I'm genuinely not sure how much those dnsviz errors matter. Unbound seems to be able to resolve despite the first instance of that error and I don't see why that wouldn't be the case for the further delegations as well.

OK, so if LE checks that record from 4 different locations, how can we align with that check? Does LE expose the locations from which it validates the TXT records? We want to make sure that our dry run is as close as possible to the check that LE performs.
How can we improve it without adding more delay before posting the request to LE?

Do you run your own DNS?
[aside from those being operated by SoftLayer]

If not, can you run your own DNS?

Do you have an IP that can be used to process Internet DNS requests from?

We will do whatever is needed to improve that process.

No, we are not running our own DNS.

We can create VMs for that in various locations, which will be used to resolve the TXT records...

That is a good start.
I would suggest testing several different DNS scenarios to see which works best for your case.
Here are two:

  1. CNAME all _acme-challenge requests (from test zone) to a much shorter DNS zone in your VM.
  2. Create a new DNS "branch" that shortens the FQDN DNS search path.
    Have your VM(s) operate as fully authoritative for that test zone.