Random NXDOMAIN failures


#1

My domain is: composedb.com

We’re using the xenolf/lego project as a library to issue certificates via the DNS challenge. We run a modified version of its DNS preCheck logic: once all nameservers have been verified we sleep for an arbitrary 30s, and we also log the nameservers that were checked. Most of the time issuance works on the first attempt, but every now and then we get an NXDOMAIN error back from Let’s Encrypt. An excerpt from a recent failure is below:

2018/07/16 14:51:07 [INFO][930596040.composedb.com] acme: Obtaining bundled SAN certificate
2018/07/16 14:51:08 [INFO][930596040.composedb.com] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz/REDACTED
2018/07/16 14:52:26 [INFO][930596040.composedb.com] acme: Trying to solve DNS-01
2018/07/16 14:52:34 [INFO][930596040.composedb.com] Checking DNS record propagation using [google-public-dns-a.google.com:53 google-public-dns-b.google.com:53]
time="2018-07-16T14:52:34Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d1.googledomains.com.
time="2018-07-16T14:52:39Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d1.googledomains.com.
time="2018-07-16T14:52:44Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d1.googledomains.com.
time="2018-07-16T14:52:49Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d1.googledomains.com.
time="2018-07-16T14:52:54Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d1.googledomains.com.
time="2018-07-16T14:52:59Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d1.googledomains.com.
time="2018-07-16T14:53:04Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d1.googledomains.com.
time="2018-07-16T14:53:10Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d1.googledomains.com.
time="2018-07-16T14:53:10Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d2.googledomains.com.
time="2018-07-16T14:53:15Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d1.googledomains.com.
time="2018-07-16T14:53:15Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d2.googledomains.com.
time="2018-07-16T14:53:15Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d3.googledomains.com.
time="2018-07-16T14:53:20Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d1.googledomains.com.
time="2018-07-16T14:53:20Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d2.googledomains.com.
time="2018-07-16T14:53:20Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d3.googledomains.com.
time="2018-07-16T14:53:25Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d1.googledomains.com.
time="2018-07-16T14:53:25Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d2.googledomains.com.
time="2018-07-16T14:53:25Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d3.googledomains.com.
time="2018-07-16T14:53:25Z" level=info msg="querying for TXT record" fqdn=_acme-challenge.930596040.composedb.com. nameserver=ns-cloud-d4.googledomains.com.
time="2018-07-16T14:54:01Z" level=error msg="acme: Error -> One or more domains had a problem:\n[930596040.composedb.com] acme: Error 400 - urn:ietf:params:acme:error:dns - DNS problem: NXDOMAIN looking up TXT for _acme-challenge.930596040.composedb.com\n"

The 30s sleep has reduced how often we see the error but hasn’t eliminated it. I went this route based on the logic in certbot, hoping that checking the nameservers and then sleeping for a bit would be enough. In the case above, when the request was retried within ~1 minute, the certificate was obtained without issue.

Is there anything else that can be done to verify that the TXT record is ready to be checked by Boulder? A previous discussion on a similar topic didn’t give me much to work with, since the resolvers used by Boulder are intentionally kept private and could change. My hope is that there is something additional we can check, which would also improve the lego library.

/cc @cpu


#2

Google Cloud DNS has an API to check whether a change has been deployed to all of their nameservers. You don’t have to wait an arbitrary amount of time.

https://cloud.google.com/dns/api/v1/changes/get
https://cloud.google.com/dns/api/v1/changes

Querying all of the nameserver IPs from your location isn’t enough, since they use anycast and Let’s Encrypt will probably query different instances. (And I don’t know what Google’s consistency model is anyway. All servers in one PoP may not be updated simultaneously.)


#3

Thanks for the info @mnordhoff! The gcloud integration in the lego library already polls the changes API to confirm that the change has completed.


#4

If it’s still able to fail, that’s concerning… :confused:


#5

Have you considered talking to Google DNS Support? There isn’t much that we can do from our side to help here. Boulder contacts your authoritative DNS servers, but in this case what appears as one authoritative nameserver is likely many individual servers with traffic distributed by anycast.

Your code/Lego can check all of the authoritative nameservers to ensure the required records are present, but if Boulder checks from a different network perspective it may be routed to a different server behind the same nameserver name, one that doesn’t have the record yet. Short of checking from every possible perspective, you’re stuck relying on Google’s own API to tell you when the record has fully propagated. If their API for this isn’t reliable, it sounds like a bug they should investigate!


#6

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.