Internal error getting validation

lancedolan · October 4, 2017, 6:33am

We use LE to generate/renew certs for thousands of customers, and within the last 2 days, all certificate generation has failed with this same error:

Error creating new cert :: Rechecking CAA: Internal error getting validation method for mydomain.com,

Of course, it names a domain other than mydomain.com, depending on what we’re generating.

We’re really in hot water if we can’t sort this one out. The error message is cryptic to me, and Google returns exactly zero results for “Internal error getting validation method for”.

We’re running on Ubuntu 16.04.2, which means we’re using an old version of certbot known as ‘letsencrypt’, version 0.4.1. The command we run is something like:

letsencrypt certonly --webroot [--staging] --csr /path/to/mycert.csr -w /var/www/html -d $MYDOMAIN

So, obviously our validation method is an HTTP resource, rather than DNS. This seems obvious from the CLI options --webroot and -w, and it’s always worked great. This sudden failure to “get validation method” is a complete shock. The only hunch I’ve got right now is that some newer version of this old letsencrypt command-line client has been patched/updated, and perhaps our prod servers have installed that package update without my realizing it.

Patches · October 4, 2017, 7:01am

This might be an internal error that there’s nothing you could do about, or there could be a real CAA problem with one of your domains, but a bug in the new CAA rechecking routine is masking the real error.

Please note CAA records are always checked, and have nothing to do with whether or not you use HTTP or DNS verification.
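To illustrate why CAA is independent of the challenge type, here is a minimal Python sketch of the decision a CA makes from CAA records. It is a simplification and not Boulder's code: the `lookup` callable is a hypothetical stand-in for a DNS query returning `(tag, value)` pairs, and the sketch ignores CNAME handling, `issuewild`, and critical flags.

```python
def relevant_caa_set(domain, lookup):
    """Climb from the full name toward the TLD and return the first
    non-empty CAA record set. `lookup` is a hypothetical callable that
    maps a name to a list of (tag, value) tuples."""
    labels = domain.rstrip(".").split(".")
    for i in range(len(labels)):
        name = ".".join(labels[i:])
        records = lookup(name)
        if records:
            return records
    return []


def caa_allows(records, ca_domain="letsencrypt.org"):
    """Return True if the record set permits `ca_domain` to issue."""
    issue_values = [value for tag, value in records if tag.lower() == "issue"]
    if not issue_values:
        # No "issue" property at all: issuance is not restricted.
        return True
    for value in issue_values:
        # Parameters after ';' are ignored; a bare ";" yields an empty issuer.
        issuer = value.split(";", 1)[0].strip().lower()
        if issuer == ca_domain:
            return True
    return False
```

Nothing in this decision depends on whether the certificate request is later validated over HTTP or DNS.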

Can you share one of the domains that is failing? It would make it much easier for a Let’s Encrypt engineer to look into this from their side.

Are all of your domains failing in this way? Do they all use the same DNS servers?

lancedolan · October 4, 2017, 7:06am

Hmm… Alright, you’ve got me thinking. I’ve got to double-check some privacy concerns with business folks before I share any domains on here, but now I’ve got something to go on. It’s only a list of ~30 domains it’s failing on (although that list will be growing daily). I’m going to strip out the domains one by one until I find a specific one causing it to fail.

sahsanu · October 4, 2017, 7:14am

Hi @lancedolan,

As a tip, use https://unboundtest.com/ to check the CAA records for the affected domains. The site is created by @jsha (Boulder engineer) and uses the same procedure used by Let’s Encrypt to check the domains.

Cheers,
sahsanu

cpu · October 4, 2017, 12:54pm

Hi @lancedolan,

It looks like you've uncovered a bug in a piece of code that we released to the staging environment yesterday. We will be addressing this bug shortly and you should be able to issue for these domains in the staging environment again. In the meantime the production environment should work for you without error.

Apologies for the inconvenience. Thanks for communicating the error back to us, I appreciate it!

cpu · October 4, 2017, 1:39pm

Hi again @lancedolan,

We’ve reverted the staging release while we evaluate fixes for the bug. You should be able to issue using the staging environment again without any errors. Please let me know if this is not the case.

Thanks again for your patience & reporting the bug!

jsha · October 4, 2017, 3:48pm

Hi @lancedolan,

Thanks for letting us know about this bug! We’ll work on it. A couple followups based on your original message:

It sounds like you issue against both staging and prod for each domain. Is that right? Do your logs indicate clearly when a failure happens against staging versus when it happens against prod? That’s useful information for us in future bug reports, and hopefully can help you evaluate how severe a given failure is and how urgently you should worry.

Are you currently renewing your certificates when they have 30 days left on them? We strongly recommend this in our integration guide so that if a bug like this does make it to production, there is time to fix it before your certificates start expiring.
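The 30-day rule above can be expressed as a small check. This is only a sketch: the function name and the `not_after` parameter (the certificate's expiry as a timezone-aware datetime) are assumptions for illustration, and reading the expiry out of the certificate is left to the caller.

```python
from datetime import datetime, timedelta, timezone

# Renew once 30 days or fewer remain, per the recommendation above.
RENEW_BEFORE = timedelta(days=30)


def due_for_renewal(not_after, now=None):
    """Return True if the certificate expiring at `not_after` should be renewed.

    Both datetimes are expected to be timezone-aware (UTC)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return not_after - now <= RENEW_BEFORE
```

Renewing this early leaves roughly a month of slack if issuance is broken for a few days.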

lancedolan · October 4, 2017, 6:17pm

Yes. If a domain is going to fail certification, we want that to use up staging-environment rate limiting. In the past, domains were immediately attempted against prod and it would only take a few coincidental bad domains in a row to hit a rate limit, and then we couldn't produce certs for ~a week. With this process, we increase the integrity of our attempts against prod.

Yes we do, and sounds good. In this case, it was failing in staging, but the error was too generic for our error-handling logic to identify the one bad domain and retry without it... So the process stopped and never attempted against prod. This "Rechecking CAA" error message would say that all domains in the request failed (we do as many as 100 domains per SAN cert).
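A rough Python sketch of the retry logic described above, for illustration only. It assumes the error text names each failing domain using the phrase from the message quoted at the top of this thread; the real multi-domain format may differ, and the function names are hypothetical.

```python
import re

# Pattern based on the single error line quoted in this thread (an assumption
# about the general format, not a documented contract).
PATTERN = re.compile(
    r"Internal error getting validation method for ([A-Za-z0-9.-]+)"
)


def failed_domains(error_text):
    """Extract the domains the error message blames, lowercased and sorted."""
    return sorted({m.rstrip(".").lower() for m in PATTERN.findall(error_text)})


def next_attempt(requested, error_text):
    """Return the domains to retry with, or None if no bad domain can be isolated.

    None covers both cases where retrying is pointless: the error names no
    domain, or it names every domain in the request."""
    bad = set(failed_domains(error_text))
    remaining = [d for d in requested if d.lower() not in bad]
    if not bad or not remaining:
        return None
    return remaining
```

When every domain in the request is blamed, as happened here, `next_attempt` returns `None` and the process stops rather than retrying.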

Yes, we're doing that.

Patches · October 4, 2017, 6:28pm

+1. If I had known you were encountering this only in staging, I would probably have tried it myself and noticed it was broken last night.

But the brackets you put around [--staging] in your first post suggested to me that you had tried both, so I didn't bother.

lancedolan · October 4, 2017, 6:33pm

Yea, our custom software is in charge of trying against staging and then production. The week-long rate limit we experience when failing against production is brutal, so our software simply won't go to production with domains unless it's succeeding in staging. This is the first time a staging-specific bug has ever been encountered. We might write a new setting into our software so that we can skip staging when we suspect something like this is going on, but I don't want the software ever deciding on its own to go straight to production.
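As a sketch of that gate with an operator-only override, in Python. The `issue` callable is a hypothetical wrapper around the ACME client that returns True on success, and `skip_staging` is the proposed new setting; neither is our actual code.

```python
def issue_with_staging_gate(domains, issue, skip_staging=False):
    """Try staging first and go to production only if staging succeeds.

    `issue(domains, staging=...)` is a hypothetical callable returning True
    on success. `skip_staging` must be set explicitly by an operator; the
    function never decides on its own to bypass staging."""
    if not skip_staging:
        if not issue(domains, staging=True):
            # Protect the production rate limit: stop here.
            return "failed-in-staging"
    if issue(domains, staging=False):
        return "issued"
    return "failed-in-production"
```

Keeping the override as an explicit argument means a staging-only outage can be worked around by hand without weakening the default behavior.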

lancedolan · October 4, 2017, 6:34pm

RESOLUTION: Working after the revert! We successfully acquired SANs for several hundred domains just now and we’re back in action.

Thank you all for your time and attention on this. We’ll need to research the CAA stuff and put some CAA checking into place before we send to LE staging, I think.

jsha · October 4, 2017, 6:52pm

Excellent, glad to hear things are working again for you!

system · November 3, 2017, 6:52pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
