How can we ensure that LE's DNS lookups are in sync with the values seen in our DNS lookup?

Hello,

For at least one customer, the domains they create are validated successfully by our application, but validation by Let's Encrypt (LE) sometimes fails.

Here is the timeline for the problem for domain nyebuickgmc.com:

  1. at 2021-02-18T16:30:29,971 domain is still not validated on our side.
  2. at 2021-02-18T16:53:41,797 domain was validated on our side using 2 name servers: dns102.register.com., dns101.register.com.
  3. we initiate validation call to LE
time: 2021-02-18T16:53:41,842
uri: https://acme-v02.api.letsencrypt.org/acme/chall-v3/10899687020/rheA5A
inputData: '"{\"type\":\"dns-01\",\"keyAuthorization\":\"U2gSjnYjpFB5ToAVDhjNCJ8_14GhDXGib4AKc7pU9nw.NLDLvhNs-PphLmua-tmwqzgbPpUlW-GEBkyoUKiF_Yw\",\"resource\":\"challenge\"}"'
payload: eyJub25jZSI6IjAxMDRnSC1jLWczZmZNTk1iaVFiOExHMFVYVGhWdlk5SWtYRmdjWU9RaWdkWFhvIiwiYWxnIjoiUlMyNTYiLCJ1cmwiOiJodHRwczovL2FjbWUtdjAyLmFwaS5sZXRzZW5jcnlwdC5vcmcvYWNtZS9jaGFsbC12My8xMDg5OTY4NzAyMC9yaGVBNUEiLCJraWQiOiJodHRwczovL2FjbWUtdjAyLmFwaS5sZXRz
ZW5jcnlwdC5vcmcvYWNtZS9hY2N0LzEzMSJ9.eyJ0eXBlIjoiZG5zLTAxIiwia2V5QXV0aG9yaXphdGlvbiI6IlUyZ1NqbllqcEZCNVRvQVZEaGpOQ0o4XzE0R2hEWEdpYjRBS2M3cFU5bncuTkxETHZoTnMtUHBoTG11YS10bXdxemdiUHBVbFctR0VCa3lvVUtpRl9ZdyIsInJlc291cmNlIjoiY2hhbGxlbmdlIn0.BK3vV5q2SfV9s6Jm
cKaW9_jfv5-cBCNPZPDMpka3GbPrzZOLM4oQRCaGQkGdoUzN3nPmzSFVR51J_E4ovgtmjJlw3cQr_wUUFQ63zWkmh4xb8lRycW7i2_1CIkLIReibWFIedj9t2lQOo-_kHZZstVdF_1eS1JExHdrIe9wPHwaQ7DBydkgJVAdnuB13LlgTzkHiYcOPHQrFaAmDqf4BAbyIHbj_NXbOabC8Gcobsrt5GyXbQJigBE-b3Gf6yzbBHWIvjqLHOyExZ
Iv5XgrZ25L8J5y7zFFuUgDKWGQKkFBaBHO73208Hjl2b4I-8_mEh9O1J8su67IwRvNgaSzxvg
body: '{"payload":"eyJ0eXBlIjoiZG5zLTAxIiwia2V5QXV0aG9yaXphdGlvbiI6IlUyZ1NqbllqcEZCNVRvQVZEaGpOQ0o4XzE0R2hEWEdpYjRBS2M3cFU5bncuTkxETHZoTnMtUHBoTG11YS10bXdxemdiUHBVbFctR0VCa3lvVUtpRl9ZdyIsInJlc291cmNlIjoiY2hhbGxlbmdlIn0","protected":"eyJub25jZSI6IjAxMDRn
SC1jLWczZmZNTk1iaVFiOExHMFVYVGhWdlk5SWtYRmdjWU9RaWdkWFhvIiwiYWxnIjoiUlMyNTYiLCJ1cmwiOiJodHRwczovL2FjbWUtdjAyLmFwaS5sZXRzZW5jcnlwdC5vcmcvYWNtZS9jaGFsbC12My8xMDg5OTY4NzAyMC9yaGVBNUEiLCJraWQiOiJodHRwczovL2FjbWUtdjAyLmFwaS5sZXRzZW5jcnlwdC5vcmcvYWNtZS9hY2N0L
zEzMSJ9","signature":"BK3vV5q2SfV9s6JmcKaW9_jfv5-cBCNPZPDMpka3GbPrzZOLM4oQRCaGQkGdoUzN3nPmzSFVR51J_E4ovgtmjJlw3cQr_wUUFQ63zWkmh4xb8lRycW7i2_1CIkLIReibWFIedj9t2lQOo-_kHZZstVdF_1eS1JExHdrIe9wPHwaQ7DBydkgJVAdnuB13LlgTzkHiYcOPHQrFaAmDqf4BAbyIHbj_NXbOabC8Gco
bsrt5GyXbQJigBE-b3Gf6yzbBHWIvjqLHOyExZIv5XgrZ25L8J5y7zFFuUgDKWGQKkFBaBHO73208Hjl2b4I-8_mEh9O1J8su67IwRvNgaSzxvg"}'
responseCode: '200'
headers: HttpHeaders({date=[Thu, 18 Feb 2021 16:53:41 GMT], server=[nginx], content-length=[185],
  x-frame-options=[DENY], link=[<https://acme-v02.api.letsencrypt.org/directory>;rel="index",
  <https://acme-v02.api.letsencrypt.org/acme/authz-v3/10899687020>;rel="up"], content-type=[application/json],
  connection=[keep-alive], location=[https://acme-v02.api.letsencrypt.org/acme/chall-v3/10899687020/rheA5A],
  boulder-requester=[131], cache-control=[public, max-age=0, no-cache], strict-transport-security=[max-age=604800],
  replay-nonce=[0104neLSz28v3S90SkPFHmGwtMnZy2rOz_JOtPsGX-2kVRM]})
response: '{"type":"dns-01","status":"pending","url":"https://acme-v02.api.letsencrypt.org/acme/chall-v3/10899687020/rheA5A","token":"U2gSjnYjpFB5ToAVDhjNCJ8_14GhDXGib4AKc7pU9nw"}'
data: None
  4. LE returns that domain validation is in the pending state; we wait 5 seconds and request the updated status
time: 2021-02-18T16:53:46,968
uri: https://acme-v02.api.letsencrypt.org/acme/authz-v3/10899687020
responseCode: '200'
headers: HttpHeaders({date=[Thu, 18 Feb 2021 16:53:46 GMT], server=[nginx], content-length=[551],
  x-frame-options=[DENY], link=[<https://acme-v02.api.letsencrypt.org/directory>;rel="index"],
  content-type=[application/json], connection=[keep-alive], cache-control=[public,
  max-age=0, no-cache], strict-transport-security=[max-age=604800]})
response: '{"identifier":{"type":"dns","value":"nyebuickgmc.com"},"status":"invalid","expires":"2021-02-22T21:19:24Z","challenges":[{"type":"dns-01","status":"invalid","error":{"type":"urn:ietf:params:acme:error:unauthorized","detail":"No
  TXT record found at _acme-challenge.nyebuickgmc.com","status":403},"url":"https://acme-v02.api.letsencrypt.org/acme/chall-v3/10899687020/rheA5A","token":"U2gSjnYjpFB5ToAVDhjNCJ8_14GhDXGib4AKc7pU9nw"}]}'
data: None
  5. This time we get status invalid.

My question: how can it happen that, after successful validation on our side, LE cannot confirm this domain? Which servers did LE query during validation, and what did they respond?
We are also looking for suggestions on how to avoid similar situations in the future.

Thanks,
Michal

1 Like

You need to use DNS as an "outsider" would use it.
Start at the root "." and work your way left until you get to your domain.
For "nyebuickgmc.com":

  1. Ask root "." servers: What are the authoritative DNS servers for "com."?
  2. Then ask any/all of the systems returned by 1: What are the authoritative DNS servers for "nyebuickgmc.com."?
    [if the name was longer, you would continue left until you reach the end]

So then you go to the name servers provided by 2, and ask them for the validation information.
[which is cross-checked from multiple locations on the Internet to ensure response validity]
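
As a rough illustration of the "start at the root and work your way left" order, here is a minimal Python sketch that only lists the names you would ask about, in sequence. The function name `delegation_chain` is made up for this example, and it performs no DNS queries. It also assumes every label could be a zone cut, which is not always true in practice.

```python
def delegation_chain(name: str) -> list[str]:
    """Return the names to ask about, from the root down to `name`.

    Illustrative only: not every label is a real zone cut, and no
    DNS traffic is sent here.
    """
    labels = [label for label in name.rstrip(".").split(".") if label]
    chain = ["."]
    # Walk from the rightmost label (the TLD) toward the leftmost one.
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


if __name__ == "__main__":
    for zone in delegation_chain("nyebuickgmc.com"):
        print(zone)
```

At each step you would ask the servers found in the previous step for the NS records of the next name, and finally ask the last set of servers for the TXT record.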

3 Likes

It can sometimes be tricky to understand the voluminous output, but a simulation of the DNS resolution process used by Let's Encrypt and described by @rg305 is

https://unboundtest.com/

2 Likes

Our validation algorithm works exactly this way. We go from right to left and we ask all returned name servers to confirm the TXT entry.
The problem is that our validation completes, but validation on the LE side fails. We are looking for the reason for this behavior, which is why I asked how the validation was executed on the LE side, which servers were queried, and what they returned.
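
For readers following along, a check of this kind can be sketched roughly as below. This is not our production code. The TXT value for dns-01 is the unpadded base64url encoding of the SHA-256 digest of the key authorization (RFC 8555, section 8.4). The `lookup_txt` callable is a hypothetical placeholder for whatever sends a TXT query directly to one name server.

```python
import base64
import hashlib
from typing import Callable, Iterable


def dns01_txt_value(key_authorization: str) -> str:
    """TXT value expected at _acme-challenge.<domain> for dns-01."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url without the trailing "=" padding
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def servers_missing_record(
    record_name: str,
    expected: str,
    name_servers: Iterable[str],
    lookup_txt: Callable[[str, str], Iterable[str]],
) -> list[str]:
    """Return the name servers that do not (yet) serve the expected value.

    `lookup_txt(server, record_name)` is an assumed helper that returns the
    TXT strings one specific server gives for `record_name`.
    """
    return [
        server
        for server in name_servers
        if expected not in set(lookup_txt(server, record_name))
    ]
```

An empty result only says that the instances which answered us had the record. It cannot tell us what another instance behind the same address would answer.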

1 Like

Usually the safest thing to do is wait at least a minute after you think your nameservers are up to date before querying them, but you should check with register.com and ask them. Some organisations have custom DNS services which may still serve cached results from the primary and secondary servers.

I personally would recommend using an acme-dns server instead of updating your DNS every time, that way you remove the need to update your actual DNS zone records.

2 Likes

Yes, and to clarify this a little bit, although I don’t know whether or not Register.com does this:

Many large nameservers’ IP addresses are served from multiple instances and locations, using “anycast.” Even if you’re receiving the response you expect from a nameserver, we might end up querying a different instance of that same nameserver that isn’t yet up to date.

Some providers offer an API that will reliably tell you whether all instances have received a given update. If that’s available from your provider, then it’s the best option for deciding whether you’re ready to request a certificate.

3 Likes

Thank you all for the feedback so far, but I would like to return to my original question - what was the IP address of the name server that Let's Encrypt queried in this particular case? I would like to compare that with the data we used.

1 Like

While waiting for the data from the LE logs, I have a question regarding the use of "unboundtest.com". Is it possible to use it as a domain pre-validation tool in production? Are there any restrictions on how to use it?

1 Like

It's open source, so you could run your own instance:

The configuration file used on unboundtest.com is also openly available:

https://unboundtest.com/conf

It is configured to mimic the settings used by the Let's Encrypt validation server as closely as possible.

It's written in Go however, so YMMV with that :stuck_out_tongue: (I'm not a fan...)

1 Like

Welcome to the forum @mgw!

We don't have the info you're looking for readily available, I'm afraid. And as James pointed out - it wouldn't necessarily be informative. We could say "we contacted 192.88.99.101" and you could say "we checked 192.88.99.101, and it returned the correct answer!" But in the presence of anycast, the physical server that responded to our query to 192.88.99.101 could be a completely different machine than the one that answered your query.

The only really reliable way to know your DNS updates are fully propagated to all of a DNS provider's authoritative servers is for the DNS provider to give you an API that tells you when the updates are fully propagated. Most providers don't offer this, because it's surprisingly hard to do reliably. Given that, the second- and third-best options, in no particular order, are:

  • Use acme-dns, as @webprofusion mentions (thanks @webprofusion!)
  • Wait some fixed amount of time that is almost always sufficient for propagation to all servers. Depending on your DNS host, 30 minutes is very probably enough. You might be able to go down to 10 minutes.
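
The fixed-wait option could look roughly like the following Python sketch. It assumes you already have some `check()` function that returns True when your own lookups see the record. That function, the parameter names, and the default poll and timeout values are all hypothetical. The idea is to require that the check keeps passing for a whole settle period before asking the CA to validate.

```python
import time
from typing import Callable, Optional


def wait_until_settled(
    check: Callable[[], bool],
    settle_seconds: float,
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `check()` has passed continuously for `settle_seconds`.

    Returns False if that does not happen before `timeout_seconds` elapse.
    `clock` and `sleep` are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout_seconds
    ok_since: Optional[float] = None
    while clock() < deadline:
        if check():
            now = clock()
            if ok_since is None:
                ok_since = now
            if now - ok_since >= settle_seconds:
                return True
        else:
            # Any failed check restarts the settle period.
            ok_since = None
        sleep(poll_seconds)
    return False
```

With the numbers above, `settle_seconds` would be somewhere between 600 and 1800, depending on the DNS host.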

As for using unboundtest.com for production pre-validation: it wouldn't work well for that. It's currently coded very naively, in a way that makes it effectively single-threaded. I also maintain it personally rather than officially, and it may go down at random with no guarantee of uptime.

If you wanted, you could take the unbound.conf from the linked repo, use that to run your own Unbound instance, and do test queries against that. That would be a little closer to what we do in prod, but I think it would still not fully solve your problem because of the issues with geographic propagation and anycast.

4 Likes

Of course, if you want to get even closer to what Let's Encrypt does in Production, you could set up several such systems in different AWS regions around the world, and check that all of them resolve to what you're expecting.
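
A rough Python sketch of that idea follows. It assumes each vantage point is exposed to you as a callable returning the TXT values it currently resolves, for example a small wrapper around your own Unbound instance in that region. The wrappers and the region labels are hypothetical and not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Mapping


def check_from_vantage_points(
    expected: str,
    resolvers: Mapping[str, Callable[[], Iterable[str]]],
) -> dict[str, bool]:
    """Ask every vantage point in parallel whether it sees `expected`.

    A vantage point that raises (timeout, network error, ...) counts as
    not having seen the record.
    """

    def seen(resolve: Callable[[], Iterable[str]]) -> bool:
        try:
            return expected in set(resolve())
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max(1, len(resolvers))) as pool:
        futures = {
            region: pool.submit(seen, resolve)
            for region, resolve in resolvers.items()
        }
        return {region: future.result() for region, future in futures.items()}
```

You would only go ahead when every entry in the result is True. Even then this is a best-effort signal, not a guarantee about what the CA will see.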

2 Likes

Thank you all for the valuable feedback. We need to return to the drawing board...

1 Like

I also verify that all my DNS servers authoritatively answer the given challenge TXT records with the appropriate value before triggering the ACME server-side authorization check. I do not use any 'anycast' DNS servers; however, the root and TLD DNS servers that refer to my DNS servers might use it.

I just got the following error an hour ago:

During secondary validation: DNS problem: query timed out looking up CAA for mydomain.org (urn:ietf:params:acme:error:dns)

The DNS protocol is inherently unreliable (or maybe just its actual implementations, I do not know), like the UDP protocol. You must retry; I see no other way out.

2 Likes

As an FYI, Boulder will retry each lookup (A, AAAA, TXT, CAA) up to 3 times, rotating to different Unbound instances. And Unbound has a somewhat complicated internal retry regime described at NLnet Labs Documentation - Unbound - Unbound Timeout Information. Still, as you say, sometimes the Internet just has problems and you get errors despite all the retries.
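
To make the "up to 3 times, rotating" part concrete, here is a minimal Python sketch of the general pattern. It is not Boulder's code. The resolver callables, the `LookupFailed` exception and the function name are invented for illustration, and only timeouts are retried here.

```python
from typing import Callable, Optional, Sequence


class LookupFailed(Exception):
    """Raised when every attempt timed out."""


def lookup_with_rotation(
    resolvers: Sequence[Callable[[str, str], list[str]]],
    name: str,
    rrtype: str,
    attempts: int = 3,
) -> list[str]:
    """Try a lookup up to `attempts` times, moving to the next resolver each time."""
    if not resolvers:
        raise ValueError("at least one resolver is required")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        # Rotate through the resolvers, wrapping around if needed.
        resolve = resolvers[attempt % len(resolvers)]
        try:
            return resolve(name, rrtype)
        except TimeoutError as exc:
            last_error = exc
    raise LookupFailed(
        f"{rrtype} lookup for {name} failed after {attempts} attempts"
    ) from last_error
```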

1 Like

@jsha, thanks. I see, both from the document you linked and from the query log of my name servers, that Unbound does not use TCP to handle a time-out condition.

2 Likes

That's correct. I think that's common to all recursive resolvers: They fall back to TCP only when they get an affirmatively truncated response (i.e. one with the TC bit set).
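
For anyone who wants to look at this in a packet capture: the TC flag lives in the 16-bit flags field of the 12-byte DNS header (RFC 1035, section 4.1.1), at mask 0x0200. A small Python sketch that reads it from a raw response, with a function name chosen just for this example:

```python
import struct

TC_BIT = 0x0200  # "truncated" flag in the DNS header flags field


def is_truncated(message: bytes) -> bool:
    """Return True if a raw DNS message has the TC bit set."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes; message too short")
    # Bytes 0-1 are the message ID, bytes 2-3 are the flags.
    (flags,) = struct.unpack("!H", message[2:4])
    return bool(flags & TC_BIT)
```

A resolver that receives such a response over UDP is expected to repeat the query over TCP. A query that simply times out carries no such signal.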
