DNS timeout from Let's Encrypt servers

Please fill out the fields below so we can help you better. Note: you must provide your domain name to get help. Domain names for issued certificates are all made public in Certificate Transparency logs (e.g. crt.sh | example.com), so withholding your domain name here does not increase secrecy, but only makes it harder for us to provide help.

My domain is: www.atd.net

I ran this command: getssl -f www.atd.net

It produced this output:

getssl: ACME server returned error: 403: "detail": "Error finalizing order :: While processing CAA for www.atd.net: DNS problem: query timed out looking up CAA for www.atd.net",

My web server is (include version): Apache 2.4.6

The operating system my web server runs on is (include version): CentOS 7

My hosting provider, if applicable, is: N/A

I can login to a root shell on my machine (yes or no, or I don't know): Yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): No

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): getssl V2.48

(Note: we are using dns-01 for verification)

I am aware that the usual answer to this is "your DNS servers are blocked by a firewall", but I cannot reproduce that. Specifically:

  • I can resolve that domain from my home connection across the Internet just fine
  • I tested it using https://unboundtest.com/ and it worked fine
  • I ran a test across 100 RIPE Atlas probes and the vast majority were able to resolve the domain fine

I DID test using https://letsdebug.net/ and that failed with the error: DNS problem: query timed out looking up TXT for _acme-challenge.www.atd.net (getssl itself fails at the CAA lookup rather than the TXT lookup because I got a certificate recently for that domain, so the previous domain authorizations are still valid).
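
For reference, the two errors involve different query names. The following is a minimal sketch of which names a CA's resolver would need to look up, assuming the dns-01 naming from RFC 8555 and the CAA tree-climbing search from RFC 8659. The function names are illustrative only and are not part of getssl or Boulder.

```python
def challenge_name(domain):
    """Name of the TXT record checked for a dns-01 challenge."""
    # For a wildcard, the challenge record lives at the base domain.
    base = domain[2:] if domain.startswith("*.") else domain
    return "_acme-challenge." + base.rstrip(".")


def caa_search_names(domain):
    """Names a CAA search may visit, in order, from the domain up to the TLD.

    The search stops at the first name that has a CAA record set, so the
    later names are only queried when the earlier ones return nothing.
    """
    base = domain[2:] if domain.startswith("*.") else domain
    labels = base.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


print(challenge_name("www.atd.net"))
print(caa_search_names("www.atd.net"))
```

A timeout at any of these names can be enough to fail the check, so it is worth testing each of them and not only the full hostname.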

So, obviously there is something blocking one of the queries for the domain atd.net, but I cannot figure out what it is. I have a suspicion that it is something not for atd.net, but something wrong resolving the nameservers for that domain. It is possible for me to file a ticket upstream regarding that, but the FIRST thing they are going to ask me is, "What IP address range are the queries coming from?" I know that Let's Encrypt doesn't guarantee that the IP address range for validation will stay the same, but without knowing what it is now I can't pursue things further with our networking people.
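
One way to narrow that suspicion down is to send the same question to each authoritative nameserver directly, instead of going through a recursive resolver, and see which one fails to answer. Below is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the sketch does not send an EDNS OPT record or retry over TCP, so it may not reproduce a problem that only shows up with larger DNSSEC responses.

```python
import socket
import struct

# Record type numbers from the DNS specifications.
QTYPES = {"A": 1, "NS": 2, "TXT": 16, "CAA": 257}


def build_query(name, qtype, query_id=0x1234):
    """Build a raw DNS query packet (no recursion desired, class IN)."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", QTYPES[qtype], 1)


def query_udp(server_ip, name, qtype, timeout=5.0):
    """Send one UDP query to server_ip; return None on timeout.

    On success, returns (rcode, truncated, reply_length).
    """
    packet = build_query(name, qtype)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server_ip, 53))
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return None
    if len(reply) < 12:
        return None  # too short to be a DNS header
    rcode = reply[3] & 0x0F            # 0 means NOERROR
    truncated = bool(reply[2] & 0x02)  # TC bit: answer did not fit in UDP
    return rcode, truncated, len(reply)
```

Calling, for example, `query_udp(ip, "www.atd.net", "CAA")` once per nameserver address (the addresses have to be filled in by hand) shows whether a particular server is the one that times out.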

I am perfectly willing to believe we are doing something wrong but I cannot figure out what it would be (this has worked fine for over a year so any issues are recent). In case it comes up I am aware the zone atd.net is DNSSEC signed but there are no DS records in .net for that domain; my understanding is that should not matter for this issue.

Thank you for the detailed description; I similarly don't see any problems with a couple other test tools I tried.

Does the problem happen on the Let's Encrypt staging environment, production environment, or both? (They've been working on upgrading their DNS resolver, so I think they're currently running different versions, and it'd be good to know if your DNS server is broken for one and not the other.)

We are indeed using a newer version of Unbound in staging than in production.

I tried it against staging (well, I tried to get a certificate for www3.atd.net), and I got:

getssl: www3.atd.net:Verify error: "detail": "DNS problem: query timed out looking up TXT for _acme-challenge.www3.atd.net"

(I believe letsdebug.net also uses staging, so this doesn't surprise me.) So I think it is unrelated to the version of Unbound. Any other ideas?

And your original getssl call was working against the live system? I haven't used it before, but taking a quick look through its documentation it seems to use the staging system by default (?) so I just wanted to make sure that you see the same problem in both environments.

Absolutely, it was working against the production system (for over a year). Yes, the default configuration for getssl runs against staging, but our configuration is set to use production; I changed the configuration just for www3.atd.net.

I see your very consistent history of production certs. Right now is earlier than normal for you to request one. Did something change on your end that you were trying to test? I also don't see any reason this shouldn't work - just curious.

And, you seem skilled, but just in case ... when you change the getssl CA= value to staging you will get a test cert, but it is not valid. So it was smart to use a different name than your real one so that you did not clobber your production system. I also use getssl and wish it had a --dry-run type feature; without one, care is needed with staging.

That said, staging is also more tolerant of -f retries, so be careful about too many of those with production (5 failures per hour per account per domain). The error message is clear when that limit is hit, and the limit is not the cause of this odd DNS failure.
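
To stay under that limit while debugging, retries can be counted on the client side. This is a rough sketch only: it treats the limit as a simple sliding one-hour window, which is an assumption about how the server counts, and the class is hypothetical rather than anything getssl provides.

```python
from collections import deque


class FailureBudget:
    """Track failed validations for one account and domain."""

    def __init__(self, limit=5, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        self.failures = deque()  # timestamps in seconds, oldest first

    def record_failure(self, now):
        self.failures.append(now)

    def can_retry(self, now):
        # Drop failures that have aged out of the window.
        while self.failures and now - self.failures[0] >= self.window:
            self.failures.popleft()
        return len(self.failures) < self.limit
```

Calling `record_failure(time.time())` after each failed run and checking `can_retry(time.time())` before the next one avoids burning through the production budget by accident.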

I am renewing that one now because I tried getting a certificate for a NEW server and it failed. So after a bunch of thrashing I decided to take a step back and try renewing a certificate that recently worked to make sure it wasn't me (well, okay, it might STILL be me). I did run into the problem you mentioned with too many verification failures on the new server, which is another reason I wanted to try an already-working configuration.

As an aside, while I was debugging this problem over the past few days I did see a few "Internal errors" reported from Boulder. This was on both production and staging (the staging errors were via letsdebug.net). I did not see any of those today; I do not know if that is relevant.
