In my situation, the problem isn't Manual DNS domain validation in terms of getting the acme challenge token into the zone files. Automation is not an option in my situation.
The last certificate to validate was one that had two hostnames, the domain itself and www. So more certificates (I already have 8) with fewer names (the worst offender in this case only had 2) is not the issue in my case.
Your Response Rate Limiting (RRL) directives may be part of the problem. Let's Encrypt's validation service will issue simultaneous DNS queries. This includes "climbing the tree" for CAA checks, so potentially more than 5 queries - especially if you're trying to validate multiple hostnames that are served by the same authoritative DNS server. It might be helpful to check your BIND logs for "rate limit drop"s at the times you request validation.
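If it helps, here is a minimal Python sketch for tallying those drops per minute so you can line them up with your validation attempts. It assumes each log line starts with BIND's default timestamp and contains the phrase "rate limit drop"; the sample lines are made up for illustration, and your logging channel and format may differ.

```python
from collections import Counter


def count_rrl_drops(lines):
    """Count 'rate limit drop' log lines per minute.

    Assumes each line starts with a timestamp such as
    '05-Jun-2025 21:14:03.123' (an assumption about the log format).
    """
    per_minute = Counter()
    for line in lines:
        if "rate limit drop" not in line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # Keep the date and HH:MM, discard seconds and milliseconds.
        minute = f"{fields[0]} {fields[1][:5]}"
        per_minute[minute] += 1
    return per_minute


# Made-up sample lines for illustration; not taken from a real log.
sample = [
    "05-Jun-2025 21:14:03.123 rate-limit: info: client 192.0.2.1#4242 "
    "(example.com): rate limit drop response to 192.0.2.0/24",
    "05-Jun-2025 21:14:03.456 rate-limit: info: client 192.0.2.7#4243 "
    "(example.com): rate limit drop response to 192.0.2.0/24",
    "05-Jun-2025 21:14:09.000 queries: info: client 198.51.100.9#5353 "
    "(example.com): query: example.com IN A",
    "05-Jun-2025 21:15:41.789 rate-limit: info: client 192.0.2.1#4250 "
    "(example.com): rate limit drop response to 192.0.2.0/24",
]

for minute, drops in sorted(count_rrl_drops(sample).items()):
    print(minute, drops)
```

To run it against a real log, pass an open file object instead of `sample`, since the function only needs an iterable of lines.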
I will adjust my certificate renewal SOP to temporarily comment out the rate-limit portion of my config.
However, I am still trying to discover why it took more than a dozen attempts (with new tokens that were externally verified before renewing each time) to get some certificates validated over the course of 5 hours. Temporarily removing the rate-limits may help, but I don't think that's the underlying issue here.
Thank you for your great work (I love the Let's Encrypt project) and your assistance.
You can't share a public domain name?
No errors or warnings from dnsviz.net. All ok from EDNS Compliance Tester.
Use your imagination.
I dunno. It seems more and more likely this is the case.
You started with a bunch of certs to renew. A few worked and the rest failed because your DNS server essentially started denying requests. Then every time you retried, a few more succeeded, right? That seems to imply that your DNS servers will only tolerate a portion of the query traffic the validation servers are sending before they get blocked for the rest of that run.
The fact that you had new orders with new tokens to validate is irrelevant. The validation servers still need to make the same number of queries to check both the TXT values and CAA status from multiple vantage points around the Internet.
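As a rough illustration of why fresh tokens don't reduce the load, here is a small Python sketch that lists the names a CAA check may query while climbing the tree and estimates a worst-case query count per validation run. The number of vantage points is an assumed input, not a published figure, and real traffic will differ: resolvers send extra lookups (NS, DNSSEC), and parent names such as `com` are answered by other servers, not yours.

```python
def caa_lookup_names(hostname):
    """Names a CAA check may query while climbing the tree (RFC 8659).

    The climb stops at the first name that has a CAA record set,
    so this list is the worst case.
    """
    labels = hostname.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def estimate_queries(hostnames, vantage_points):
    """Rough upper bound on lookups for one validation run.

    Counts one TXT lookup plus the full CAA climb per hostname,
    repeated from each vantage point (an assumed number).
    """
    per_vantage = sum(1 + len(caa_lookup_names(h)) for h in hostnames)
    return per_vantage * vantage_points


print(caa_lookup_names("www.example.com"))
# ['www.example.com', 'example.com', 'com']

# Two names, five assumed vantage points.
print(estimate_queries(["example.com", "www.example.com"], vantage_points=5))
# 35
```

Even with only two names on the certificate, the count per run is well above a handful of queries, and it is the same for every retry regardless of the token value.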
Bottom line, your DNS servers are sending SERVFAIL responses for some portion of the traffic going to them. Fix that and you fix your problem.
This is a stock example of a flawed help request. (sorry) It is passionate, but not productive.
Here's my take on what's wrong and why it's a problem, both for you and the community trying to assist. We are here to help but you have to give us real, valid information, data...
What's Wrong with the Request...
- No Technical Details Provided
Despite being prompted to fill in essential diagnostic info (domain name, command run, error messages, software versions, etc.), you, the OP, provided none of it. This violates our community's support protocol and makes troubleshooting nearly impossible.
Let's Encrypt problems are often very specific: domain configuration, DNS propagation, ACME client behavior, etc. Without concrete details, nobody can diagnose the issue.
- Username Appears Auto-Generated
The name dLubxiWGGVB8vXyLzUM3 looks like something generated by pwgen or a similar tool. This is a red flag in two ways:
Trust: It makes the post look like spam or AI-generated. Although it may be legit... Dunno
Community Engagement: A generic, throwaway-looking account suggests that you aren't really invested in the community or the quality of the conversation. Let's be real here.
It undermines credibility before anyone even reads the message.
- Tone: Ranting, Not Asking
The post is emotional and reads like a rant, not a request for help. Repeating the "renew, edit, reload, check, fail" loop over and over adds heat but no light. It dramatizes the issue without moving toward a solution.
Frustration is valid, but venting without actionable questions turns helpers away.
- Assumption of Blame
"Why do I get punished for your servers having problems?"
This is presumptive. Without evidence that the problem lies with Let's Encrypt infrastructure (and not local DNS misconfiguration, caching, or scripting errors), it alienates those who might help. It makes helpers defensive or dismissive.
- Unclear What Help Is Wanted or really needed.
You finish with "There has to be a better way." That's vague. Are you asking for automation tools? A change to the ACME protocol? Validation process improvements?
Helpers don't know whether to troubleshoot a specific error or give general advice.
What a Good Help Request Would Look Like
Using the official Let's Encrypt support format, you should've said something like:
My domain is: example.com
I ran this command: certbot certonly --manual --preferred-challenges dns
It produced this output: SomeDomainName: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.example.com
My web server is: N/A (DNS-only challenge)
My OS is: Ubuntu 20.04
Hosting provider: Self-hosted
Root shell access: Yes
Control panel: None
Certbot version: 1.32.0
I manually add TXT records using my own DNS server, and validation works for 49 out of 50 domains. The last one repeatedly fails even after propagation checks with dig. Any idea why?
Bottom Line
The original post wasted 6 minutes of reading time and 5 hours of frustration without giving anyone the tools to help. You were venting, not problem-solving.
If you want help, ask for help. Don't just yell at the wall.
Think about it.
Rip
Thank you for taking the time to write up your feedback. I appreciate it.
I have some feedback, and I am hopeful that the Let's Encrypt team will consider it on its merits.
I appreciate the fact that once a token has been validated, it does not have to be re-tokenized and renewed, even if there are other names in the certificate that have not yet had validated tokens. That is really smart and saves a lot of trouble. It would be silly to have to replace validated tokens because some of the names in the certificate were not validated for some reason. That would be extra work. It was smart for the Let's Encrypt team to see that and to gracefully take that situation into account.
The Trouble
However, what is troublesome is that tokens that are not instantly validated for whatever reason (issues with your servers, issues with the networks between your servers and mine, issues with my server, or some combination) are automatically invalidated, and a new token is issued which then needs to be validated. And that (manual) process repeats until whatever the issue was is resolved and all tokens are validated.
Obviously, it can be very frustrating when one follows all of the instructions and verifies the results before attempting renewing/validating.
- I have a valid token, and that valid token is where it is supposed to be.
- I have verified that the correct token is in the correct place and is accessible via lookup from an external third-party server after (re)issuing the certificate and before renewing/validating it.
- I have other names that have validated.
Most times a token validates the first time. Sometimes, if it doesn't validate the first time, it validates the second time (with a new token).
Other times, like the other night, it took 15 attempts for one name's token to be validated. It was one of two names in a certificate, and the other name validated the first time. Each of the 15 new tokens issued for that name was verified to be correct and accessible from a third-party server, yet only the 15th was finally validated. It took 5 hours to get a token that was proven to be valid and accessible validated. That was a very frustrating experience.
I've replaced all of the drives, operating systems, and software in all of the nodes of a high-availability cluster, without the cluster going down, in less time than that.
This isn't the first time I've (re)issued certificates with Let's Encrypt. This isn't the first time I've had these issues.
The Suggestion
My humble suggestion would be for the system to not immediately invalidate and reissue a new token when that token can't be accessed, but for there to be a way for the system to gracefully fail and try again after a reasonable waiting period. For example:
"There was a problem accessing the token for _acme-challenge.www.example.com. The server will make a second attempt in 5 minutes. Don't renew the certificate during that time."
"There was a problem accessing the token for _acme-challenge.www.example.com. The server will make a third attempt in 5 minutes. Don't renew the certificate during that time."
"There was a problem accessing the token for _acme-challenge.www.example.com. The server made three attempts and the token is now invalid. Please renew the certificate and try again."
I really think that would be a big improvement. Kind of the way that an email server gracefully fails when it can't reach the recipient's server, so it tries again later.
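To make the idea concrete, here is a minimal Python sketch of the retry behaviour I have in mind. It only illustrates the suggestion and is not how Let's Encrypt or any ACME server works today; `check` is a hypothetical callable that returns True once the challenge record can be read, and the attempt count and waiting period are just the values from my example messages.

```python
import time


def validate_with_retries(check, attempts=3, wait_seconds=300, sleep=time.sleep):
    """Call check() up to `attempts` times, pausing between failures.

    `check` is a hypothetical callable returning True when the token
    could be read. Returns (validated, list of status messages).
    """
    messages = []
    for attempt in range(1, attempts + 1):
        if check():
            messages.append(f"Attempt {attempt}: token validated.")
            return True, messages
        if attempt < attempts:
            messages.append(
                f"Attempt {attempt} failed; retrying in {wait_seconds} seconds."
            )
            sleep(wait_seconds)
    messages.append(f"{attempts} attempts failed; the token is now invalid.")
    return False, messages


# Simulated lookups: two failures followed by a success.
results = iter([False, False, True])
ok, log = validate_with_retries(lambda: next(results), sleep=lambda s: None)
for line in log:
    print(line)
```

The `sleep` parameter is injectable only so the example runs instantly; a real implementation would actually wait between attempts.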
Thank you for your consideration in this matter. Thank you to the people in this thread who have offered advice and possible solutions. Thank you to the Let's Encrypt team for a great product.
Firstly, thank you for your patience and the generous praise you have expressed despite the frustrations you have experienced.
Speaking only for myself (and thus not for the Let's Encrypt staff) as someone who reads (or at least strongly skims) virtually every public post made in this community, I can certainly understand your frustrations and therefore suggestions though certain raw statistics make me scratch my head a bit here.
- As an ACME client author with multitudes of users and a longtime Let's Encrypt user myself from shared hosting up to Kubernetes clusters, I have personally experienced very minimal issues regarding token verification except during maintenance and technical issue windows, which have been exceedingly rare during my 5+ years as a Let's Encrypt subscriber
- Let's Encrypt issues certificates for nearly 60% of the domain names on the public internet, which accounts for many millions of certificates on a regular basis, yet we don't seem to be overrun with frequent, systemic issues being reported in this community
- Having monitored nearly every support request made in this community for the past five years, I've found that marked concerns about the architecture, systems, or approach of Let's Encrypt are almost always expressed by those whose own systems or approaches exhibit various antipatterns and a seemingly curious disregard for substantial quantities of empirical evidence and for the keen observations made by extremely experienced volunteers, whose combined analytical and debugging time in this specialized area runs to tens of thousands of hours
Of course there will always be edge cases and one size never fits all even with the greatest of intentions and virtually unlimited resources. That stated, I think there's something to be said for systems and practices that don't conform to what the vast majority of systems and practices seem to find functional >99% of the time. While all systems can be improved, I personally question the wisdom/efficiency in making improvements to a system that works very well for the overwhelming majority of cases to be more effective with edge cases rather than the edge cases likely being improved themselves by working better with the well-proven system. Just my two cents.
Retrying challenges is something that's allowed and specified in the ACME protocol (RFC 8555, section 8.2). However, this is not something that Let's Encrypt has implemented. See the Boulder divergences from ACME document for that.
And while I also cannot speak for the Let's Encrypt team, I think it's highly unlikely that Let's Encrypt will implement this feature, at least not in the near future, if ever. As @griffin already stated, Let's Encrypt successfully issues millions of certificates on a daily basis. And while you might not be the only person to benefit from this currently lacking retry feature, I doubt there's a compelling reason to implement it, seeing the sheer number of clients that can successfully work without it.
It's probably a better idea to focus on why you're having this issue instead of relying on Let's Encrypt to come with a workaround. Because even if there were some challenge retry feature, it would still only be a workaround for some malfunctioning DNS.
I am certainly open to the possibility that I am the only person in this world with this issue. And I'm always looking to implement best practices.
And believe you me, I've spent many hours over the years working with Let's Encrypt certificates, having the same issue again and again, and trying to solve it on my own without asking for help (until this thread, when I clearly hit my limit).
However, I have to believe that I'm not the only person who manages 50+ names. I believe that my suggestion would be helpful for others who manage many names.
No, I don't try to renew all 50 names at once. I have 8 certificates, which I renew one at a time. So really, the limiting of traffic shouldn't be the issue.
But as I said before, I've added a new entry to my standard operating procedure for renewing Let's Encrypt certificates: temporarily comment out all DNS traffic limitations. It puts my server temporarily at higher risk for DDoS attacks, but I guess that's what I have to do to renew my certificates?
I don't know what else to do, as my DNS servers pass all of the tests I can find on the internet, and work flawlessly in every other regard 24/7/365.25.
My previous suggestion of automatically retrying to validate tokens that can't be reached would seem to be a more elegant solution for those of us with a bunch of names. It would probably use fewer resources for Let's Encrypt too.
Some additional info and thoughts that may or may not be useful to you.
I warmly agree with the commenters who recommended that you automate the challenge-responses. You say that "Automation is not an option in my situation". That's up to you of course, but I wouldn't expect anyone else to take this position for granted. Please note that certificate lifetimes are being shortened. After 2029-03-15, no WebPKI certs can be longer than 47 days, so you would have to do this dance every month. To each their own, but I would loathe doing manual responses to ACME challenges at that frequency.
Please keep in mind that both ACME and DNS are very flexible systems. Whatever the root cause is for the failed lookups, there are probably ways to work around it, if you can't find and solve the root cause.
One fact that might be useful: The DNS server that responds to ACME challenges does not have to be the same server that you use for your main operations. You could even outsource this part to Cloudflare, Google Cloud or whatever you want, at minimal cost. Then the worry of being DDoSed is gone. (Again, without changing your main DNS.)
I recently wrote an article about some tricks that can be very useful when utilizing the DNS-01 challenge. I'll link it here in case it's useful for anyone: Using DNS for responding to ACME challenges. There are some interesting comments on the Lobsters forum.
The reason the article could be interesting in your case is basically this: By placing ACME challenge-responses in a separate zone (or even a separate domain), you can keep the rate-limiting on your primary DNS server.
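As a minimal sketch of what that delegation could look like, the Python snippet below prints the CNAME records you would add once to your primary zone so that challenge lookups are answered by a separate zone. The zone name `acme.example.net` and the target naming scheme are hypothetical; any zone you control on another DNS service would do.

```python
def challenge_cname(hostname, challenge_zone):
    """Return a zone-file line delegating the ACME challenge name.

    `challenge_zone` is a hypothetical zone hosted elsewhere; the
    target naming used here is just one possible convention.
    """
    name = hostname.rstrip(".")
    if name.startswith("*."):
        # A wildcard name is validated at its base name.
        name = name[2:]
    zone = challenge_zone.rstrip(".")
    return f"_acme-challenge.{name}. IN CNAME {name}.{zone}."


for host in ["example.com", "www.example.com"]:
    print(challenge_cname(host, "acme.example.net"))
# _acme-challenge.example.com. IN CNAME example.com.acme.example.net.
# _acme-challenge.www.example.com. IN CNAME www.example.com.acme.example.net.
```

With records like these in place, the TXT value for each renewal goes into the other zone at the CNAME target, and the primary zone does not need to be edited again.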
Lastly, another thing to consider: Do you utilize any kind of geo-blocking? I think one can assume that the lookups from Let's Encrypt can originate from basically anywhere in the world. So if you have geo-blocking in place, that could be a source of the errors you see.
Note that that's not a promise, just an implementation detail. I'm not sure if there's a decision on this already, but that authorisation cache is getting shorter and shorter (it used to be 30 days, maybe still is, and could go to a few days or hours).
You should not rely on that mechanism.
Even though that rate limit should be at "kernel is overloaded/distributed malicious requests" levels, not "one user is talking too much" levels. It's DNS, it's supposed to survive stress tests without dropping requests.
Good point.
Even so, two other strong arguments for placing the challenge-responses in a zone handled by Google Cloud (or a similar service):
- Their HTTP API is very useful for automation.
- Updates are fast. With Google Cloud I would expect propagation to take less than 60-90 seconds or so.
The biggest advantage of doing so has always been privilege minimisation from my point of view.
Propagation... is almost instant almost everywhere.
How long domain validation stays valid differs between profiles and is documented: Profiles - Let's Encrypt. It's the value of the 'Authorization Reuse Period' property, which is 30 days right now for classic.
The maximum for this is also going down and will be 10 days on 2029-03-15. (The max a CA can currently do is 398 days)
This sounds really familiar...
I'm not sure if I have fully understood the unique selling points of acme-dns. Anyway, I guess it's a good thing that there are many solutions to choose from.