We use LE to generate/renew certs for thousands of customers, and within the last 2 days, all certificate generation has failed with this same error:
Error creating new cert :: Rechecking CAA: Internal error getting validation method for mydomain.com,
Of course, the message names whichever domain we’re generating for, not mydomain.com.
We’re really in hot water if we can’t sort this one out. The error message is cryptic to me, and Google returns exactly zero results for “Internal error getting validation method for”.
We’re running on Ubuntu 16.04.2, which means we’re using an old version of certbot known as ‘letsencrypt’, version 0.4.1. The command we run is something like:
So our validation method is clearly an HTTP resource rather than DNS, as the CLI options --webroot and -w show, and it’s always worked great. This sudden failure to “get validation method” is a complete shock. My only hunch right now is that a newer version of this old letsencrypt command-line client has been patched or updated, and our prod servers installed that package update without my realizing it.
This might be an internal error that you can’t do anything about, or there could be a real CAA problem with one of your domains that a bug in the new CAA rechecking routine is masking.
Please note CAA records are always checked, and have nothing to do with whether or not you use HTTP or DNS verification.
Can you share one of the domains that is failing? It would make it much easier for a Let’s Encrypt engineer to look into this from their side.
Are all of your domains failing in this way? Do they all use the same DNS servers?
Hmm… Alright, you’ve got me thinking. I’ve got to double check some privacy concerns with business folks before I share any domains on here, but now I’ve got something to go on. It’s only failing on a list of ~30 domains (although that list will be growing daily). I’m going to strip out the domains one by one until I find a specific one causing it to fail.
As a tip, use https://unboundtest.com/ to check the CAA records for the affected domains. The site is created by @jsha (Boulder engineer) and uses the same procedure used by Let’s Encrypt to check the domains.
It looks like you've uncovered a bug in a piece of code that we released to the staging environment yesterday. We will be addressing this bug shortly and you should be able to issue for these domains in the staging environment again. In the meantime the production environment should work for you without error.
Apologies for the inconvenience. Thanks for communicating the error back to us, I appreciate it!
We’ve reverted the staging release while we evaluate fixes for the bug. You should be able to issue using the staging environment again without any errors. Please let me know if this is not the case.
Thanks again for your patience & reporting the bug!
Thanks for letting us know about this bug! We’ll work on it. A couple followups based on your original message:
It sounds like you issue against both staging and prod for each domain. Is that right? Do your logs indicate clearly when a failure happens against staging versus when it happens against prod? That’s useful information for us in future bug reports, and hopefully can help you evaluate how severe a given failure is and how urgently you should worry.
Are you currently renewing your certificates when they have 30 days left on them? We strongly recommend this in our integration guide so that if a bug like this does make it to production, there is time to fix it before your certificates start expiring.
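For anyone wiring the 30-day recommendation into their own tooling, here is a minimal sketch of the check. The function name and the `window` parameter are made up for illustration, and it assumes you already have the certificate's expiry (notAfter) as a timezone-aware datetime:

```python
from datetime import datetime, timedelta, timezone

# 30 days, per the recommendation above; adjust to taste.
RENEWAL_WINDOW = timedelta(days=30)


def due_for_renewal(not_after, now=None, window=RENEWAL_WINDOW):
    """Return True once the certificate has `window` or less validity left.

    `not_after` is the certificate's expiry as a timezone-aware datetime.
    `now` can be injected for testing; it defaults to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return not_after - now <= window
```

Running this daily over all certificates gives roughly a month of slack to ride out a server-side bug before anything actually expires.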
Yes. If a domain is going to fail certification, we want that to use up staging-environment rate limiting. In the past, domains were immediately attempted against prod and it would only take a few coincidental bad domains in a row to hit a rate limit, and then we couldn't produce certs for ~a week. With this process, we increase the integrity of our attempts against prod.
Yes we do, and sounds good. In this case it was failing in staging, but the error was too generic for our error-handling logic to pick out the one bad domain and retry without it, so the process stopped and never attempted prod. This "Rechecking CAA" error message would say that all domains in the request failed (we do as many as 100 domains per SAN cert).
Yeah, our custom software is in charge of trying against staging and then production. The week-long rate limit we experience when failing against production is brutal, so our software simply won't go to production with domains unless it's succeeding in staging. This is the first time a staging-specific bug has ever been encountered. We might write a new setting into our software so that we can skip staging when we suspect something like this is going on, but I don't want the software ever deciding on its own to go straight to production.
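Roughly, the staging gate with a manual override could look like the sketch below. This is not our actual code: `issue_staging` and `issue_prod` are hypothetical callables standing in for whatever runs the client against each environment, and `skip_staging` is the proposed operator-only setting.

```python
def issue_batch(domains, issue_staging, issue_prod, skip_staging=False):
    """Issue for `domains`, using staging as a gate in front of production.

    `issue_staging` and `issue_prod` are hypothetical callables that take a
    list of domains and return True on success.
    `skip_staging` is a manual override set by an operator; nothing in this
    function ever turns it on by itself.
    """
    if not skip_staging and not issue_staging(domains):
        # Staging failed: stop here so production rate limits are untouched.
        return "failed-staging"
    return "issued" if issue_prod(domains) else "failed-prod"
```

Keeping the override as an explicit argument means production is only hit directly when a human has decided the staging failure is a staging-side problem.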
RESOLUTION: Working after the revert! We successfully acquired SANs for several hundred domains just now and we’re back in action.
Thank you all for your time and attention on this. I think we’ll need to research the CAA stuff and put some CAA checking in place before we send anything to LE staging.
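As a starting point for that pre-check, here is a rough sketch of evaluating an already-fetched CAA record set, following my reading of the basic issue/issuewild rules in RFC 8659. It makes several assumptions: the DNS lookup (including climbing to parent domains to find the relevant record set) happens elsewhere, records arrive as (flags, tag, value) tuples, and any parameters after the ";" in a value are ignored. The function name is made up, and this is not how Boulder implements its check.

```python
KNOWN_TAGS = {"issue", "issuewild", "iodef"}
CRITICAL_FLAG = 128  # issuer-critical bit in the CAA flags byte


def caa_permits(records, ca_domain="letsencrypt.org", wildcard=False):
    """Return True if the given CAA record set appears to allow `ca_domain`.

    `records` is an iterable of (flags, tag, value) tuples taken from the
    relevant CAA record set; an empty set means no restriction.
    Simplification: parameters after ';' in a value are ignored.
    """
    # Tags are matched case-insensitively.
    records = [(flags, tag.lower(), value) for flags, tag, value in records]

    # A critical record with a tag we do not understand blocks issuance.
    for flags, tag, _ in records:
        if flags & CRITICAL_FLAG and tag not in KNOWN_TAGS:
            return False

    issue = [value for _, tag, value in records if tag == "issue"]
    issuewild = [value for _, tag, value in records if tag == "issuewild"]

    # For wildcard names, issuewild records take precedence when present.
    relevant = issuewild if (wildcard and issuewild) else issue
    if not relevant:
        return True

    # The issuer domain is whatever precedes the first ';' in each value.
    allowed = {value.split(";", 1)[0].strip().lower() for value in relevant}
    return ca_domain.lower() in allowed
```

Domains that fail this check could be dropped from the SAN request before it ever reaches staging. It is still worth cross-checking surprising results against https://unboundtest.com/, since this sketch skips the DNS side entirely.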