Would it be possible to get a detailed explanation of how Let's Encrypt resolves hostnames via DNS?
We have to pre-verify certificate authorizations before asking Let's Encrypt to verify them. Otherwise we would hit rate limits very quickly (and we did, at first). So essentially we have to "emulate" how Let's Encrypt resolves hostnames.
Right now we do the following: we query both the SOA and NS records for a hostname in order to resolve it. This works in most cases, but sometimes it fails and yet Let's Encrypt is still able to verify the authorization if I manually push it through. This led me to believe LE is doing something different.
For example, one authorization failed our pre-verification check this morning because the SOA query timed out.
Do you use the SOA to resolve the hostname? If so, do you have a timeout configured? If the SOA query fails, do you rely on the domain's NS only?
This would help us relieve some pain for our users.
It is worth stating the obvious: all of the nameservers (NS) for the domain have to give a valid response, not just one. So if you are writing changes to DNS before validation, you need to ensure that all NS return the same response, and that they can all reply to a CAA record query (no NXDOMAIN and no SERVFAIL will be tolerated during validation).
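As a rough illustration of the "every nameserver must agree" point, here is a minimal sketch that compares per-nameserver answers which have already been collected. It is only an assumption about how a pre-check could be structured, not how Let's Encrypt does it: the `NsAnswer` type, the `nameservers_consistent` function and the nameserver names in the usage note are hypothetical, and the actual DNS queries are left out.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class NsAnswer:
    """Result of asking one authoritative nameserver (hypothetical type)."""
    rcode: str                              # e.g. "NOERROR", "SERVFAIL", "NXDOMAIN"
    records: FrozenSet[str] = frozenset()   # record values returned, if any


def nameservers_consistent(answers: Dict[str, NsAnswer]) -> Tuple[bool, List[str]]:
    """Return (ok, problems) for a set of per-nameserver answers.

    ok is True only if every nameserver answered NOERROR and all of them
    returned the same record set.
    """
    if not answers:
        return False, ["no nameservers were queried"]
    problems = []
    for ns, answer in sorted(answers.items()):
        if answer.rcode != "NOERROR":
            problems.append(f"{ns} returned {answer.rcode}")
    # Compare record sets only among the servers that answered cleanly.
    distinct = {a.records for a in answers.values() if a.rcode == "NOERROR"}
    if len(distinct) > 1:
        problems.append("nameservers returned different record sets")
    return not problems, problems
```

Called with, say, `{"ns1.example.net": NsAnswer("NOERROR", frozenset({"192.0.2.1"})), "ns2.example.net": NsAnswer("SERVFAIL")}`, the function reports the failing nameserver instead of treating one good answer as enough.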
Yesterday one of my users had a domain on Google Cloud DNS that returned a SERVFAIL response to the CAA record check. This was presumably a transient failure behind the scenes at Google, so it seems anyone can get this wrong.
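Since failures like that SERVFAIL can be transient, a pre-check might retry a few times before giving up. The following is a minimal sketch under that assumption; `retry_transient` and its parameters are hypothetical names, and the attempt count and delay are arbitrary placeholders rather than recommended values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_transient(check: Callable[[], T],
                    attempts: int = 3,
                    delay_seconds: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call check() up to `attempts` times, sleeping between failures.

    Any exception counts as a failed attempt; the last one is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return check()
        except Exception:
            if attempt == attempts:
                raise                       # out of attempts: surface the error
            sleep(delay_seconds * attempt)  # wait a little longer each time
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the wait can be replaced in tests; in normal use the default `time.sleep` applies.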
There will always be transient problems when comms are involved. Taking a cue from the Let's Debug test site, you could check the following:
- Resolve the hostname to its A and/or AAAA records.
- Connect to http://(domain)/.well-known/acme-challenge/YourSpecialToken
  You should expect HTTP 404. You could warn about any other HTTP code but still try the cert request. If the request times out, then maybe do not even try the cert request. I assume you are doing HTTP challenges. Note that the LE servers will connect over IPv6 if the hostname has an AAAA record in DNS, and over IPv4 if it only has an A record. This check is the one most likely to hit transient errors.
- If you find a CAA record, check that it has a valid value.
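The first two checks above could be sketched roughly as follows, using only the Python standard library. This is an illustration, not how Let's Debug or Let's Encrypt implement it: the function names and the "proceed"/"warn"/"skip" labels are hypothetical, `socket.getaddrinfo` goes through the local system resolver rather than the domain's authoritative nameservers, and `urlopen` follows redirects. The CAA check is omitted because the standard library has no way to query CAA records.

```python
import socket
import urllib.error
import urllib.request
from typing import List, Optional


def resolve_addresses(hostname: str) -> List[str]:
    """Return the A/AAAA addresses for hostname via the system resolver."""
    try:
        infos = socket.getaddrinfo(hostname, 80, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def fetch_status(hostname: str, token: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status of the challenge URL, or None if nothing answered."""
    url = f"http://{hostname}/.well-known/acme-challenge/{token}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code          # a 4xx/5xx reply still means the server answered
    except (urllib.error.URLError, OSError):
        return None              # timeout, refused connection, DNS failure, ...


def decide(addresses: List[str], status: Optional[int]) -> str:
    """Map the check results to "proceed", "warn" or "skip"."""
    if not addresses or status is None:
        return "skip"            # no address or no answer: maybe do not request
    if status == 404:
        return "proceed"         # expected for a token that does not exist
    return "warn"                # unexpected code: warn, but still try
```

A caller would combine them as `decide(resolve_addresses(host), fetch_status(host, "YourSpecialToken"))`. Keeping `decide` free of network calls makes the policy easy to adjust and test separately from the lookups.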
Let's Debug does more than that, but its purpose is different. You might also consider checking the operational status of the Let's Encrypt servers, as it does.
Personally, I would only check the items most likely to cause problems. You are more familiar with your clients than I am, so you know best which those are.