On all hosts I've started with letsencrypt so at first I only fetched a cert for the main DNS name and it all worked fine.
Then I tried adding some cnames to the certs; those cnames have been in DNS for years. Two of the hosts run webservers, so I used --webroot and the cnames were verified without problems.
The three other hosts do not run webservers but other services like tomcat etc. For those I used --standalone and called e.g.
It produced this output:
Renewing an existing certificate for bioserver4.bio.ifi.lmu.de and 2 more domains
Certbot failed to authenticate some domains (authenticator: standalone). The Certificate Authority reported these problems:
Domain: rstudio4.bio.ifi.lmu.de
Type: dns
Detail: During secondary validation: DNS problem: server failure at resolver looking up A for rstudio4.bio.ifi.lmu.de; DNS problem: server failure at resolver looking up AAAA for rstudio4.bio.ifi.lmu.de
Note: it only complained about the second cname "rstudio4" while "jhub4" was fine obviously. For another server it complained about the first cname but not about the second.
For another server I requested a cert for the first time and only for a cname, so the command was only with one -d and without --renew-with-new-domains.
But the result was all the same: it always complained with the above error about the A and the AAAA record.
DNS lookup looks fine, however:
bioserver4 /root# host rstudio4.bio.ifi.lmu.de
rstudio4.bio.ifi.lmu.de is an alias for bioserver4.bio.ifi.lmu.de.
bioserver4.bio.ifi.lmu.de has address 141.84.2.19
bioserver4.bio.ifi.lmu.de has IPv6 address 2001:4ca0:4000:1011:141:84:2:19
bioserver4.bio.ifi.lmu.de mail is handled by 120 acheron.ifi.lmu.de.
bioserver4.bio.ifi.lmu.de mail is handled by 100 mailin1.ifi.lmu.de.
However: just calling the exact same certbot command a second time worked without error for all servers and fetched the cert for all domains including any cnames.
The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):
certbot 1.22.0
This sounds like there is some kind of rate limiter or DDoS protection happening on your DNS Servers. Do you run those yourself?
The Let's Encrypt authorization process checks from (currently 5) places around the world. There are many DNS queries made from each auth center. Note in your first error it said "Secondary validation failure". This means the primary LE center was successful (it is in the USA) but one of the secondary centers failed.
The reason it may have worked later is due to LE caching of successful authorizations. Once a specific account / domain name satisfies a challenge it is cached for some period of time. Your original cert request failed as not all the challenges succeeded. But, later only the ones that failed prior needed to be redone by LE.
Because of these fewer challenges later on there would also be many fewer DNS queries.
There are perhaps other causes for this kind of failure but a DNS Server that has rate limiting of some kind is a strong candidate.
Do you have access to the DNS server logs? Do you control their configurations?
Hi,
the server is running at the university. I've asked the admins and there is no rate limiting at all. I sent queries for 250 hostnames from my home IP (thus, an external non-university source) within 30 seconds and they all got answered. I also queried all our ~50 cnames within 10 seconds and got correct replies. So it's not a rate-limiting problem...
Also please note that it also happened for the one server that had only a cname, i.e. the real hostname was not contained in the domains. Thus it failed with the one single request for the cname. While the other hosts where I first started fetching the cert only for the real hostname and no cnames all worked fine.
So it seems to be related to cnames somehow. Can I help debugging this further in any way?
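One way to look at this from the outside: a `host` lookup goes through the local recursive resolver, so it says little about how each authoritative server answers a resolver elsewhere. Below is a minimal sketch, assuming `dig` is installed, that asks every authoritative nameserver of a zone directly for the three record types named in the errors (A, AAAA, CAA) and prints the response status. The function names `parse_status` and `check_name` are made up for illustration, and the zone given as the second argument is a guess that should be replaced with whichever zone actually holds the records.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical helpers, not part of dig.

# Read dig output on stdin and print the response code from the header line,
# e.g. ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4242" -> NOERROR
parse_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# check_name NAME ZONE
# Query each nameserver listed for ZONE, without recursion, for the
# A, AAAA and CAA records of NAME and print one status line per query.
check_name() {
    name=$1
    zone=$2
    for ns in $(dig +short NS "$zone"); do
        for type in A AAAA CAA; do
            status=$(dig +norecurse +time=3 +tries=1 +noall +comments \
                "@$ns" "$name" "$type" | parse_status)
            # An empty status means no reply arrived within the timeout.
            echo "$ns $type ${status:-TIMEOUT}"
        done
    done
}
```

Calling `check_name rstudio4.bio.ifi.lmu.de ifi.lmu.de` prints one line per nameserver and record type. For CAA, NOERROR with an empty answer is fine; SERVFAIL, REFUSED, or a timeout on any single nameserver is the kind of answer that could make a remote resolver give up intermittently.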
Sure, the DNS queries required to resolve CNAMEs are different. But your problem is intermittent. You even said they all worked when tried a second time.
CNAMEs are very common and Let's Encrypt issues over 6 million certs every day. We would be seeing very many failures if a problem affected every CNAME and we are not.
I don't see a problem with individual DNS queries either. But, the speed and burst of queries during a cert request is high.
Do all the failure messages indicate "Secondary validation" ?
What was the actual command for this single domain name? I'd like to suggest some tests and a simple case of one name is easiest.
Just did a little test: added another cname bla4.bio.ifi.lmu.de for bioserver4 in our DNS server and called certbot certonly --standalone -d bioserver4.bio.ifi.lmu.de -d jhub4.bio.ifi.lmu.de -d rstudio4.bio.ifi.lmu.de -d bla4.bio.ifi.lmu.de --renew-with-new-domains -n --agree-tos -m <...> --cert-name bioserver4
The three first domains were already known and part of the certificate I fetched yesterday. The call failed with the known error:
Renewing an existing certificate for bioserver4.bio.ifi.lmu.de and 3 more domains
Certbot failed to authenticate some domains (authenticator: standalone). The Certificate Authority reported these problems:
Domain: bla4.bio.ifi.lmu.de
Type: dns
Detail: During secondary validation: While processing CAA for bla4.bio.ifi.lmu.de: DNS problem: server failure at resolver looking up CAA for bla4.bio.ifi.lmu.de
Called it again and it worked. Then I added another cname blu4.bio.ifi.lmu.de and called certbot certonly --standalone -d bioserver4.bio.ifi.lmu.de -d jhub4.bio.ifi.lmu.de -d rstudio4.bio.ifi.lmu.de -d bla4.bio.ifi.lmu.de -d blu4.bio.ifi.lmu.de --renew-with-new-domains -n --agree-tos -m <...> --cert-name bioserver4
This worked at the first try. So I added another cname ble4.bio.ifi.lmu.de and called certbot certonly --standalone -d bioserver4.bio.ifi.lmu.de -d jhub4.bio.ifi.lmu.de -d rstudio4.bio.ifi.lmu.de -d bla4.bio.ifi.lmu.de -d blu4.bio.ifi.lmu.de -d ble4.bio.ifi.lmu.de --renew-with-new-domains -n --agree-tos -m <...> --cert-name bioserver4
The result was
An unexpected error occurred:
Certification Authority Authorization (CAA) records forbid the CA from issuing a certificate :: Error finalizing order :: rechecking caa: During secondary validation: While processing CAA for jhub4.bio.ifi.lmu.de: DNS problem: server failure at resolver looking up CAA for jhub4.bio.ifi.lmu.de
Thus, suddenly jhub4 failed again, although it had worked in the calls before. I called it again => exact same error. I called it again => worked and created the certificate.
So, this error seems to be highly non-deterministic
Yes, very true. It all centers around DNS query issues. But, whether they are at your facility or Let's Encrypt's is the question.
The CAA query is different than the others. Before Let's Encrypt can issue a certificate it must look at CAA records to see if it is allowed. This CAA recheck happens even if the authorization was already cached. There is a CAA cache too but it is very short in comparison.
Notice the message said "error finalizing order" which happens only after satisfying all the challenges (either new or by reusing previous).
The CAA rechecks can cause a very large number of queries to arrive at your DNS servers. I believe that number is even larger when CNAMEs are used, which is possibly why this seems related.
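To give a feel for why CAA checks multiply queries: per RFC 8659, a CA looks for a CAA record set at the requested name and, if none is found, climbs the name tree one label at a time up to (but not including) the root. The sketch below only lists the names that would be queried in the worst case, when no CAA record exists anywhere; it sends no DNS queries. The function name `caa_lookup_names` is invented for this example.

```shell
#!/bin/sh
# caa_lookup_names NAME
# Print the names checked for CAA records, most specific first,
# assuming no CAA record set is found along the way (worst case).
caa_lookup_names() {
    name=${1%.}                        # drop a trailing dot, if any
    while [ -n "$name" ]; do
        echo "$name"
        case $name in
            *.*) name=${name#*.} ;;    # strip the leftmost label
            *)   name= ;;              # top-level label reached, stop
        esac
    done
}
```

For `caa_lookup_names jhub4.bio.ifi.lmu.de` this lists five names, from jhub4.bio.ifi.lmu.de down to de. That happens for every name on the certificate and, presumably, from every validation location, and a CNAME at any of those names adds further lookups on the resolver side.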
I did more tests with only a single domain as you proposed. First I added (one after another) new cnames bli4,blu4,blo4 and called certbot certonly --standalone -d <cname>.bio.ifi.lmu.de --renew-with-new-domains -n --agree-tos -m steiner-cert@bio.ifi.lmu.de --cert-name bioserver4
with the respective single cname. They all worked fine.
Then I did the same call with only the "old" cname jhub4.bio.ifi.lmu.de => failed with the above "looking up CAA for jhub4..." error. Then I did the same call with the old cname rstudio4.bio.ifi.lmu.de => same kind of error as with jhub4 ("looking up CAA for rstudio4"). Called a second time for rstudio4 => worked.
Called again for bli4 => worked.
Called again for rstudio4 => failed.
Called again for rstudio4 => failed.
Called again for rstudio4 => worked.
I cannot detect any pattern in when it works and when it fails...
The --dry-run option will deactivate prior challenges on the Staging LE system so each run is fresh. It also does not modify your existing production certs.
More importantly, the Staging system allows many more failures before blocking you with Rate Limits like production will. These Rate Limits are not causing these errors but might become an issue with repeated tests. A rate limit error message is very clear: Rate Limits - Let's Encrypt
I need some more time to make further suggestions. Intermittent DNS query problems are very difficult to diagnose. I will say these are almost always some kind of rate limit in your system. Perhaps some kind of "smart DDoS" feature which looks at where the requests originate or something like that. It is difficult to reproduce the high volume of queries LE makes using your own methods.
If possible, try reproducing one of these errors somewhat consistently with the --dry-run. Then, run these and have your DNS network staff monitor their logs at that time looking for rejected or failed queries.
It is possible that this is an LE problem, perhaps affecting just one of the "secondary" centers. But such problems are exceedingly rare.
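The --dry-run repetition suggested above could be scripted roughly like this. It is a sketch under assumptions: `count_failures` is a made-up helper that runs any command a given number of times, counts the non-zero exits, and writes a UTC timestamp to stderr for each failed run so the DNS staff can line them up with their logs. It does not look at why a run failed, so the certbot output still needs checking separately.

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Run CMD N times, discard its output, and print how many runs failed.
# A timestamp for every failed run goes to stderr.
count_failures() {
    runs=$1
    shift
    failed=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        if ! "$@" >/dev/null 2>&1; then
            failed=$((failed + 1))
            echo "run $((i + 1)) failed at $(date -u '+%Y-%m-%dT%H:%M:%SZ')" >&2
        fi
        i=$((i + 1))
    done
    echo "$failed"
}
```

For example, `count_failures 10 certbot certonly --standalone --dry-run -n -d rstudio4.bio.ifi.lmu.de` would print the number of failed attempts out of ten. The run count of ten is arbitrary, and the Staging system has its own limits on failed validations, so it is probably wise to keep the number modest.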
There were two other people posting after you with similar problems. The Let's Encrypt team investigated and found a problem. I don't know that it is fully fixed yet but it may already be working better than it was. I will let you know when I hear more complete info.
I wanted to be sure you heard this news as you start your day tomorrow.
Thank you very much for persisting. Your info was very helpful.
Thanks for keeping me updated! I'm on the road today, so I can't do further testing right now, but let me know if you want me to repeat some tests after the bug is considered fixed!
There was an unintended component upgrade during a recent migration that caused the problem. Let's Encrypt has automated monitoring of their validation servers, but the failure rate from this specific issue wasn't high enough to be detected. They have increased the sensitivity of those monitors to become aware of such failures sooner, and improved the upgrade mechanism to avoid a repeat.
Sorry for the disruption. Please let us know if you see any further trouble.