How to debug intermittent challenge failures?

Over the past few days, I have run several test configurations with certbot (using --break-my-certs). Every time, roughly 2 out of 10 subdomains fail the challenge. Running certbot again then succeeds with the remaining subdomains. What's odd is that the subdomains that fail are different every time.

I've checked the domains with letsdebug as well, and there too I get variable results without having made any changes to my DNS records.

Example error:

IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: sub1.xxxxxxx.com
   Type:   dns
   Detail: During secondary validation: DNS problem: SERVFAIL looking
   up CAA for sub1.xxxxxxx.com - the domain's nameservers
   may be malfunctioning

   Domain: sub2.xxxxx.com
   Type:   dns
   Detail: During secondary validation: DNS problem: SERVFAIL looking
   up CAA for sub2.xxxxxxx.com - the domain's
   nameservers may be malfunctioning
 - Your account credentials have been saved in your Certbot
   configuration directory at /etc/letsencrypt. You should make a
   secure backup of this folder now. This configuration directory will
   also contain certificates and private keys obtained by Certbot so
   making regular backups of this folder is ideal.

Hello :slightly_smiling_face:

It sounds like you might have non-authoritative/malfunctioning DNS servers. I recommend using dig and curl to determine which are causing you issues.

@rg305

If you're around, I believe you'll have more to add here.
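As a rough sketch of the dig suggestion above: the failing lookup in the error is the CAA record, so one option is to ask each of the zone's nameservers for it directly and see which one misbehaves. The function names and the `DIG_CMD` variable are made up for illustration, and the timeout values are arbitrary. Note that a SERVFAIL is usually produced by a recursive resolver, so when you query the authoritative servers directly a bad one is more likely to show up as REFUSED or as no response at all.

```shell
# Hypothetical helpers; assumes `dig` is installed. DIG_CMD is only a hook
# that makes the command easy to swap out.
DIG_CMD="${DIG_CMD:-dig}"

# Pull the response code (NOERROR, SERVFAIL, REFUSED, ...) out of dig's header line.
extract_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Ask every nameserver listed for ZONE about the CAA record of NAME
# and print one "nameserver status" line per server.
caa_status_per_ns() {
    zone="$1"
    name="$2"
    for ns in $($DIG_CMD +short NS "$zone"); do
        status=$($DIG_CMD CAA "$name" "@$ns" +norecurse +time=3 +tries=1 | extract_status)
        # An empty status means dig printed no header, typically a timeout.
        printf '%s %s\n' "$ns" "${status:-NO_RESPONSE}"
    done
}

# Example, using the redacted names from the error output:
# caa_status_per_ns xxxxxxx.com sub1.xxxxxxx.com
```

Any server that does not report NOERROR (or NXDOMAIN, if the name really has no records) is a candidate for closer inspection.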


OK, this is one of the domains that failed earlier:

id xxxxxx
opcode QUERY
rcode NOERROR
flags QR RD RA
;QUESTION
sub.xxxxxx.com. IN A
;ANSWER
sub.xxxxxxx.com. 3599 IN CNAME xxxxxxx.com.
xxxxxxx.com. 3599 IN A 167.xxx.xxx.xxx
;AUTHORITY
;ADDITIONAL

Edit: I just tried again and yet another subdomain failed (but the ones that previously failed were fine). I checked it with the dig tool you linked to and got the same result as above (except for the subdomain and IP, of course). When I re-ran the certbot command, the subdomain that had just failed worked fine.

Since the problem seems to be intermittent, is there a way to get more information about what went wrong when the failure occurred?
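One way to catch an intermittent failure in the act is to repeat the same CAA lookup in a loop and log only the attempts that go wrong. The sketch below does that. The function name, the `DIG_CMD` hook, and the default count and pause are all assumptions for illustration. Keep in mind that "secondary validation" in the error refers to checks made from Let's Encrypt's additional vantage points, so a loop run from your own machine may not see exactly what they saw.

```shell
# Hypothetical helper; assumes `dig` is installed.
DIG_CMD="${DIG_CMD:-dig}"

# Query RESOLVER for the CAA record of NAME, COUNT times, PAUSE seconds apart.
# Prints a UTC-timestamped line for every attempt that is not NOERROR.
log_caa_failures() {
    name="$1"
    resolver="$2"
    count="${3:-20}"
    pause="${4:-5}"
    i=0
    while [ "$i" -lt "$count" ]; do
        status=$($DIG_CMD CAA "$name" "@$resolver" +time=3 +tries=1 \
            | sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
        # No header at all (for example a timeout) is recorded as NO_RESPONSE.
        status="${status:-NO_RESPONSE}"
        if [ "$status" != "NOERROR" ]; then
            printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$name" "$status"
        fi
        i=$((i + 1))
        # Skip the pause after the final attempt.
        if [ "$i" -lt "$count" ]; then
            sleep "$pause"
        fi
    done
    return 0
}

# Example: 50 attempts, 10 seconds apart, against a public resolver.
# log_caa_failures sub1.xxxxxxx.com 8.8.8.8 50 10 >> caa-failures.log
```

Running it against several resolvers, or against each authoritative nameserver in turn, can help narrow down where the bad answers come from.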


The certbot error message lets you know which record lookup failed, but not which nameserver.

Try the tools here:


You could try DNSViz as well.

It's going to be hard for people here to help if you don't want to reveal your domain names, though.


Thanks for the invite @griffin ... But without a real domain name there isn't much anyone can do.

And I'm not about to attempt a career's worth of knowledge transfer, so they can figure this thing out for themselves.

Then your DNS already holds the problem.

Who does your DNS?
[or is that super top-secret info as well?]

Not from this end.
[at least that we know of - but we don't work for LE, we're just mainly volunteers here]

You've been pointed to some good DNS tools, use them or any others you can find.
But I'm pretty sure your problem is in your DNS servers.


Ok yeah lol I guess it isn't really that top secret.

Base domain name is mylittlestashbox.com. I am using Google Domains and DigitalOcean.

I will check out those tools and get back.


If that is one of the names that sometimes shows an intermittent problem.
And that problem is ALWAYS the same:

It would be a very strange and unique situation.

So let's be perfectly clear about those base assumptions.
Please respond to each of the following:

1. All the domains that are having this unique problem use the exact same set of DNS servers:

mylittlestashbox.com    nameserver = ns-cloud-c1.googledomains.com
mylittlestashbox.com    nameserver = ns-cloud-c2.googledomains.com
mylittlestashbox.com    nameserver = ns-cloud-c3.googledomains.com
mylittlestashbox.com    nameserver = ns-cloud-c4.googledomains.com

"YES" or "NO"
[if "NO", added details would be useful]

2. The "renewal"/"test" process requires DNS changes to be made:
"YES" or "NO"

3. I use this ACME client, and version.



I wasn't paying close attention the first few times, but SERVFAIL has appeared the last several tries.

1. All the domains that are having this unique problem use the exact same set of DNS servers:

  1. Yes, they all use the exact same name servers. Although the Google servers you listed are my "main" name servers, I also had to create A records on the DigitalOcean side to get things to work. That is expected, right? In any case, it's been that way for months and this has never been an issue in the hundreds of tests I did before this past week.

2. The "renewal"/"test" process requires DNS changes to be made:

  1. I am not sure what you mean exactly, but I don't think so. At least I don't touch any of my name server settings when I am running certbot.

3. I use this ACME client, and version.

  1. I am using certbot-nginx from the Ubuntu 20.04 repositories. In /etc/letsencrypt/renewal/mylittlestashbox.com.conf I have server = https://acme-staging-v02.api.letsencrypt.org/directory. Is that what you are asking for?

Please explain this:

For #2, please show the entire file:
[you can remove the account numbers]
cat /etc/letsencrypt/renewal/mylittlestashbox.com.conf

For #3, please show the output of:
certbot --version


On the most recent run, I was able to create all the certs without error, but this is the file:

Regarding the A records on DigitalOcean, all I know is that I could not go to mylittlestashbox.com or any subdomains until I had created the record on the DigitalOcean side. I have deleted them now and it still seems to be working.

Perhaps either the change I just made has not propagated yet, or perhaps I did not wait long enough during the initial setup and falsely assumed that my creation of those records is what caused it to start working.

conf file:

~$ cat /etc/letsencrypt/renewal/mylittlestashbox.com.conf
# renew_before_expiry = 30 days
version = 0.40.0
archive_dir = /etc/letsencrypt/archive/mylittlestashbox.com
cert = /etc/letsencrypt/live/mylittlestashbox.com/cert.pem
privkey = /etc/letsencrypt/live/mylittlestashbox.com/privkey.pem
chain = /etc/letsencrypt/live/mylittlestashbox.com/chain.pem
fullchain = /etc/letsencrypt/live/mylittlestashbox.com/fullchain.pem

# Options used in the renewal process
[renewalparams]
account = xxxxxxxxxxxxxxxxxxxxxxxx
rsa_key_size = 4096
server = https://acme-staging-v02.api.letsencrypt.org/directory
authenticator = nginx
nginx_server_root = /etc/nginx

certbot version:

~$ certbot --version
certbot 0.40.0

Please show what you mean by this:

I simply mean that I can visit the site without a DNS error after deleting the A records from DigitalOcean.

Perhaps this is related to my issue in the OP though:


PLEASE show how you delete records from Digital Ocean.

That error seems unrelated.


Like this:


Can you add a test name with any random IP?
[to see if it does indeed propagate into your real DNS zone]


I added one pointing to a different droplet. Do I also need to create the record on Google Domains for this test?


NO don't touch Google.
I'm trying to prove if this DO entry does anything or not.
Right now it seems to NOT do anything at all.

Reason #1: The NS entry "ns-cloud-D1.googledomains.com" does not match your zone records:
"ns-cloud-C1.googledomains.com".

Reason #2: When the D1 server is queried, it knows nothing of your domain:
nslookup mylittlestashbox.com ns-cloud-d1.googledomains.com
*** UnKnown can't find mylittlestashbox.com: Query refused
meanwhile C1 works just fine:

nslookup mylittlestashbox.com ns-cloud-c1.googledomains.com
Name:    mylittlestashbox.com
Address: 167.172.250.131

Reason #3: Would be if you add a record in DO and it never shows up in the real zone.
[that remains to be seen - let's give it a few hours]
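The check behind Reason #1 is a comparison of two nameserver lists, and it can be scripted. The sketch below is one possible way to do it, with made-up function names: it normalises each list (case and trailing dot) and prints only the names that appear on one side.

```shell
# Hypothetical helpers for comparing two lists of nameserver names.

# Lower-case each name, strip the trailing dot, drop empty lines, sort.
normalise_ns() {
    tr 'A-Z' 'a-z' | sed -e 's/\.$//' -e '/^$/d' | sort -u
}

# Compare the names in file A and file B (one name per line).
# Prints "< name" for names only in A and "> name" for names only in B.
# No output means both sets match.
diff_ns_sets() {
    a=$(mktemp)
    b=$(mktemp)
    normalise_ns < "$1" > "$a"
    normalise_ns < "$2" > "$b"
    # diff exits non-zero when the files differ; that is expected here.
    diff "$a" "$b" | grep '^[<>]' || true
    rm -f "$a" "$b"
}

# Example: delegation at the .com registry versus what the zone itself serves.
# dig +noall +authority +norecurse NS mylittlestashbox.com @a.gtld-servers.net \
#     | awk '{print $5}' > parent.txt
# dig +short NS mylittlestashbox.com @ns-cloud-c1.googledomains.com > zone.txt
# diff_ns_sets parent.txt zone.txt
```

The same function works for any two sources, for example the NS names shown in a provider's control panel against the ones the registry actually delegates to.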


Ok that makes sense, just wanted to make sure. I have not touched anything on the Google side yet.

BTW I ran certbot again and there was again a failure for just one of my subdomains, whose A record has been deleted from DO but is still untouched on Google.

Challenge failed for domain notes.mylittlestashbox.com
http-01 challenge for notes.mylittlestashbox.com
Cleaning up challenges
Some challenges have failed.

IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: notes.mylittlestashbox.com
   Type:   dns
   Detail: During secondary validation: DNS problem: SERVFAIL looking
   up CAA for notes.mylittlestashbox.com - the domain's nameservers
   may be malfunctioning
 - Your account credentials have been saved in your Certbot
   configuration directory at /etc/letsencrypt. You should make a
   secure backup of this folder now. This configuration directory will
   also contain certificates and private keys obtained by Certbot so
   making regular backups of this folder is ideal.