Hi Let's Encrypt team,
Thank you for all your brilliant effort keeping Let's Encrypt running.
I'd like to contribute by suggesting that a DNS-01 challenge checklist be published somewhere on the site. I hope it will help those who, like me, struggle to satisfy the challenge.
The checklist should mention the following points:
1). Validate your DNS server with Unbound DNS Checker (https://unboundtest.com). This service is designed to mimic the behaviour of Let's Encrypt's CA servers, and its logs are more detailed than Certbot's.
Reason: I kept running into Certbot error messages such as "Timeout" and "SERVFAIL" that came without much detail, and the advice of "making sure the TXT records have been put in place" is not much help, especially when a local "dig" or Google's online Dig tool returns the expected result. I only became aware of Unbound by reading community posts. I believe it would save Let's Encrypt users and engineers many hours of researching, posting and replying if "Validate your DNS server with Unbound" were mentioned in the official guidance.
2). Make sure your DNS server handles CAA queries (DNS query QTYPE 257) correctly. If your DNS server implements CAA records, great. If not, it should reply with NOERROR (RCODE 0) and an empty answer section (no answer RRs). Any other RCODE will cause validation by Let's Encrypt's CA servers to fail.
Reason: Many community posts relate to CAA queries, and the Let's Encrypt website has an article about this (Certificate Authority Authorization (CAA) - Let's Encrypt). It should be mentioned in the checklist.
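For a home-grown server, point 2 might look roughly like the sketch below. It is only an illustration under some assumptions: the query carries exactly one question, its QNAME is not compressed, and the function name `empty_noerror_reply` is made up for this example. A more complete server would normally also put an SOA record in the authority section, which is left out here.

```python
import struct

QTYPE_CAA = 257  # record type number for CAA

def empty_noerror_reply(query: bytes) -> bytes:
    """Build a NOERROR reply with zero answers that echoes the question."""
    # The first two 16-bit header fields are the query ID and the flags.
    qid, flags = struct.unpack("!2H", query[:4])
    # Walk the QNAME: length-prefixed labels terminated by a zero byte.
    pos = 12
    while query[pos] != 0:
        pos += 1 + query[pos]
    question_end = pos + 1 + 4  # zero byte, then QTYPE and QCLASS
    # Set QR (response) and AA, keep the opcode and RD bits, RCODE stays 0.
    reply_flags = 0x8400 | (flags & 0x7900)
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!6H", qid, reply_flags, 1, 0, 0, 0)
    # Echo the question bytes untouched, so the QNAME keeps its letter case.
    return header + query[12:question_end]
```

The same reply shape can be used for any query type the server does not implement, not only `QTYPE_CAA`.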
3). Make sure your DNS server honours DNS 0x20 case randomization. Let's Encrypt's servers appear to randomize the letter case of the QNAME when verifying your domain ownership, and your DNS server should preserve the case of the QNAME labels when replying to these queries.
Reason: Most DNS servers, especially the popular ones, probably do preserve QNAME case. But there are also people like me who implement their own DNS server with a minimal feature set, just enough to serve their own websites. You read through RFC 1035, write a program that decodes DNS queries and encodes replies, validate its behaviour with dig, spin it up on a VPS and hooray, you have your own programmable DNS service! But if you are unaware of case randomization and believe you can safely convert the incoming QNAME to lower case to match it against your domains, then reply with the lower-cased NAME in your answer RR, you are in trouble: the reply won't satisfy Let's Encrypt. This issue is extremely difficult to locate, because hardly anyone validates their DNS server with case-randomized queries, and Certbot's error message "SERVFAIL while looking up the TXT record" doesn't point a traumatized user in this direction. A brief mention of this would be a great help.
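To illustrate point 3, here is a minimal sketch of one way to match names case-insensitively without changing what goes back on the wire. The `ZONE` table, the token value and the function names are hypothetical, and the sketch assumes a single-question query with an uncompressed QNAME. The lower-cased name is used only as a lookup key, while the reply echoes the original question and points the answer's NAME at it.

```python
import struct
from typing import Optional, Tuple

# Hypothetical zone data, keyed by lower-cased wire-format name.
ZONE = {b"\x0f_acme-challenge\x07example\x03com\x00": b"hypothetical-token"}

def read_question(query: bytes) -> Tuple[bytes, int, int]:
    """Return (raw QNAME bytes, QTYPE, offset just past the question)."""
    pos = 12
    while query[pos] != 0:
        pos += 1 + query[pos]
    qtype, _qclass = struct.unpack("!2H", query[pos + 1:pos + 5])
    return query[12:pos + 1], qtype, pos + 5

def txt_reply(query: bytes) -> Optional[bytes]:
    raw_qname, qtype, question_end = read_question(query)
    # Lower-case only the lookup key. Label length bytes are at most 63,
    # so bytes.lower() (which only touches A-Z) leaves them unchanged.
    value = ZONE.get(raw_qname.lower())
    if qtype != 16 or value is None:
        return None  # other cases are out of scope for this sketch
    qid, flags = struct.unpack("!2H", query[:4])
    header = struct.pack("!6H", qid, 0x8400 | (flags & 0x7900), 1, 1, 0, 0)
    rdata = bytes([len(value)]) + value
    # 0xC00C is a compression pointer to offset 12, i.e. the echoed QNAME,
    # so the answer's owner name has exactly the letter case of the query.
    answer = b"\xc0\x0c" + struct.pack("!2HIH", 16, 1, 60, len(rdata)) + rdata
    return header + query[12:question_end] + answer
```

Using the pointer avoids re-encoding the name at all, which removes the temptation to write the lower-cased copy into the answer.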
4). Make sure your DNS server handles AAAA queries (QTYPE 28) correctly, even if you don't use IPv6 addresses. Let's Encrypt's CA will send a lot of AAAA queries to your DNS server, and if you don't implement AAAA at all, they need to be answered with RCODE 0 and an empty answer section.
Reason: I'm not sure whether a DNS server that returns RCODE 4 (Not Implemented) on AAAA queries will break the verification, but I saw those requests in my logs and decided to handle them in the same way as the CAA ones.
5). Stay calm when you see a flood of DNS queries while validating your domain with Certbot. Let's Encrypt's CA servers appear to send a lot of requests to your DNS server when you request a domain validation. This is normal, as Let's Encrypt validates your domain from multiple servers located across the internet.
Reason: While dealing with issues that were actually caused by mishandling CAA queries and by mistakenly converting the QNAME to lower case, I saw many incoming queries whenever I performed a DNS-01 challenge. This made me wonder whether something was wrong with my Nginx stream forwarding or my UDP packet handling, because it looked as if some queries had not been handled properly and were being "replayed" by either Nginx or Let's Encrypt's CA servers. Knowing that such behaviour is normal would have been helpful.
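Points 2 to 4 can also be checked by hand before involving Certbot. The sketch below builds TXT, AAAA and CAA queries with randomly mixed letter case and checks that the reply has RCODE 0 and echoes the question byte for byte. It only approximates what a validating resolver does, and the function names, the 3-second timeout and the server address in the usage note are assumptions made for this example.

```python
import random
import socket
import struct

QTYPES = {"TXT": 16, "AAAA": 28, "CAA": 257}

def randomize_case(name: str, rng: random.Random) -> str:
    """Flip each letter to upper or lower case at random."""
    return "".join(rng.choice((c.lower(), c.upper())) for c in name)

def build_query(name: str, qtype: int, rng: random.Random) -> bytes:
    """Encode a single-question DNS query with a mixed-case QNAME."""
    # Random ID, all flags clear, QDCOUNT=1, other counts zero.
    header = struct.pack("!6H", rng.randrange(0x10000), 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in randomize_case(name, rng).rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!2H", qtype, 1)  # QCLASS 1 = IN

def check(server_ip: str, name: str, qtype: int, rng: random.Random) -> bool:
    """Send one query over UDP and report whether the reply looks acceptable."""
    query = build_query(name, qtype, rng)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(3)
        sock.sendto(query, (server_ip, 53))
        reply, _ = sock.recvfrom(4096)
    rcode = reply[3] & 0x0F
    # Acceptable here means: same ID, RCODE 0, question echoed unchanged.
    return (reply[:2] == query[:2] and rcode == 0
            and reply[12:len(query)] == query[12:])
```

For example, `check("203.0.113.10", "_acme-challenge.example.com", QTYPES["CAA"], random.Random())` would query a server at a placeholder address. A timeout raises an exception instead of returning `False`, which keeps "no reply" distinguishable from "wrong reply".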
These are the points I've run into over the past few days. I think other people are experiencing similar issues, and it takes a bit of luck to come across the right community post and find a way out. Please correct me if any of the above is wrong, or amend the list if I missed anything. In any case, I highly recommend that such a checklist be published on the official website.
Thanks again for your effort.