Query timed out looking up CAA for com?

This post was flagged by the community and is temporarily hidden.

P.S. All the above verified OK and my cert was issued when I re-ran the same command an hour later.

This is 100% a system error at your end.

This is what I see from the recent past.

https://community.letsencrypt.org/search?q=secondary%20validation

2 Likes

P.P.S.

To help you fix your bug - here's the order of requests that were in here, with the ones that failed marked at the start of the line with a single "*"

Performing the following challenges:
dns-01 challenge for ***.com
dns-01 challenge for **********.com
dns-01 challenge for *********-*****.com
dns-01 challenge for *********-****.com
dns-01 challenge for ****************.com
dns-01 challenge for *****-*******.com
dns-01 challenge for *****-********.com
dns-01 challenge for *****-*****.com
dns-01 challenge for *****-*********.com
dns-01 challenge for ***********.com
dns-01 challenge for *****.com
dns-01 challenge for *******.com
dns-01 challenge for ***********.com
* dns-01 challenge for *******.com
dns-01 challenge for ************.com
* dns-01 challenge for ******.com
* dns-01 challenge for *****.com
dns-01 challenge for *****.com
dns-01 challenge for ****-******.com
dns-01 challenge for ***********.com
dns-01 challenge for **********.com
* dns-01 challenge for **********.com
dns-01 challenge for **********.com
dns-01 challenge for **********.biz
* dns-01 challenge for **********.co
dns-01 challenge for **********.com
dns-01 challenge for **********.net
dns-01 challenge for **********.org
* dns-01 challenge for **********.us
dns-01 challenge for ***********.com
* dns-01 challenge for ********-*****.com
dns-01 challenge for ********-****.com
dns-01 challenge for *************.com
dns-01 challenge for **********-*****.net
dns-01 challenge for ***********-****.com
* dns-01 challenge for ***************.co
dns-01 challenge for ******-********.com
dns-01 challenge for **************.com
dns-01 challenge for ****-***********-*****.com
dns-01 challenge for ****-***********.com
dns-01 challenge for ****-****************.com
dns-01 challenge for ***************.com
dns-01 challenge for ********************.com
dns-01 challenge for **********************.com
dns-01 challenge for *****-*****.com
dns-01 challenge for *******.email
dns-01 challenge for ***********.biz
dns-01 challenge for *********-*****.com
dns-01 challenge for *****.com
dns-01 challenge for *******.com
dns-01 challenge for *****.com
dns-01 challenge for *****.com
dns-01 challenge for ****-******.com
dns-01 challenge for ***********.com
dns-01 challenge for **********.com
dns-01 challenge for **********.com
dns-01 challenge for **********.com
dns-01 challenge for **********.biz
dns-01 challenge for **********.co
dns-01 challenge for **********.com
dns-01 challenge for **********.net
dns-01 challenge for **********.org
dns-01 challenge for **********.us
dns-01 challenge for ***********.com
dns-01 challenge for ****-***********-*****.com
dns-01 challenge for ****-***********.com
dns-01 challenge for ****-****************.com
dns-01 challenge for ***************.com
dns-01 challenge for ********************.com
dns-01 challenge for **********************.com

This post was flagged by the community and is temporarily hidden.

Hi @gitcnd,

I agree that you've encountered a Let's Encrypt network or infrastructure problem which Let's Encrypt staff should look into in order to reduce the likelihood that it will recur.

I'm not very happy with your use of "need to" here. Let's Encrypt doesn't offer an SLA and has maintained very high availability and reliability overall. This kind of error can certainly be a huge nuisance if it occurs as part of an initial issuance, but Let's Encrypt's efforts to fix it are on a best-effort basis. If you need a specific SLA, you can purchase a commercial CA service that offers one. Let's Encrypt staff are not at all indifferent to the reliability of the service and are always interested in working with the community to debug problems, a process which has worked well on this forum many times before and led to good resolutions (sometimes identifying specific routing or firewall problems, or bugs in infrastructure tools).

Could you share the exact time and date that these errors occurred? That would probably be helpful for Let's Encrypt team members investigating problems like this.

It wasn't invented by Cisco and isn't run by Cisco. It was invented by Mozilla, the University of Michigan, and EFF, and is run by ISRG.

This completely-free-to-every-end-user service has very good reliability overall for the years it's been operating, but it's not perfect and has never been perfect. You can consistently get better "service" by working cooperatively with the people behind it instead of insulting them.

10 Likes

@schoen - Cisco is a founder, and was involved in standing LE up from day 1 - Let's Encrypt - Wikipedia

I STRONGLY disagree with all your sentiments. All Certificate Authorities, no matter what they charge, or how much you might appreciate their services or cost structure, must always hold themselves to the highest of standards.

The above holds DOUBLY true for ones that reach significant scale.

CAs are literally the root of all trust online. It is way beyond wrong to cut them ANY slack for internal infrastructure failures or lack of "perfection". People's money, reputation, safety, and sometimes lives rely on these things working properly. Imagine if someone wasn't aware that this renewal system was subject to frequent screwups, and medical equipment became unreachable as a result.

I will not share the time and date. This is not a "time and date problem" - this is a basic failure of the LE implementation to detect critical errors in their system REGARDLESS of what time they occurred, and automate a means to fix it.

And, let's look at this with some common sense here - it's only a 5-minute fix for someone to detect a failure to resolve the ROOT name servers, and send a system page. Once they get a storm of pages flooding in, they'll be able to realize "Hmm - we really should implement some kind of fallback for when X keeps going down..." (and, for all we know, X might be them getting blocked for DDoS'ing the root name servers instead of honouring the TTLs they get).

Hi, @gitcnd,

I've taken a first look at this problem. We'll keep looking into this, but it would be most helpful to know your request date/time and a sample domain name. That's because this was not the kind of system-wide failure that one might infer from that error message. So, we might need to pick apart your specific API request in order to be more confident about what happened.

It's good to know that your request was for 100 hostnames. That is supported, but it demands the most DNS queries out of any API request we process, and so it's the most likely to run into DNS performance problems: with us, your authoritative nameservers, and/or Internet connectivity in general.

Our issuance process has a timeout on DNS queries. If our Validation Authority (VA) component has a long chain of DNS queries pending, and the timeout clock runs out, it will report the timeout in the context of the current query it's working on. In this case, my guess is that the clock happened to run out during the (extremely short) time the resolver was pulling the CAA record for com. from its cache.

(I know the broad outline of this is correct, as a first try at explaining what probably happened, but I could be wrong about the exact way that code is structured.)
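To illustrate the idea for other readers, here is a minimal sketch of one shared deadline covering a chain of lookups. It is not Boulder's code: the `validate` function, the `QueryTimeout` exception, the query names, and the durations are all hypothetical, and no real DNS is performed. It only shows how a timeout that belongs to the whole request can end up being reported against whichever query happened to be in progress.

```python
class QueryTimeout(Exception):
    """Raised when the shared deadline expires; names the query in progress."""


def validate(queries, budget):
    """Run (name, cost) pairs in order against one shared time budget.

    `cost` is a simulated duration in seconds (an assumption for this
    sketch); no real DNS lookups are made. Returns the names that completed.
    """
    elapsed = 0.0
    done = []
    for name, cost in queries:
        elapsed += cost
        if elapsed > budget:
            # The deadline covers the whole request, but the error can only
            # name the query that was running when the deadline expired.
            raise QueryTimeout(f"query timed out looking up {name}")
        done.append(name)
    return done


# Hypothetical chain: two slow authoritative lookups, then a fast cached one.
chain = [
    ("TXT for _acme-challenge.example.com", 4.0),
    ("CAA for example.com", 5.9),
    ("CAA for com", 0.2),  # cached, so normally near-instant
]

try:
    validate(chain, budget=10.0)
except QueryTimeout as err:
    print(err)  # query timed out looking up CAA for com
```

In this sketch the lookup for com is the cheapest step, yet it is the one named in the error, because the earlier lookups had already used up almost all of the budget.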

We do see this exact error message happen fairly often, and I believe it's for that reason. Different "During secondary validation..." errors can have very different reasons, some but not all of which are related to one another.

So, preventing this error may require some more detailed performance troubleshooting of your request - which is challenging, but of course we'll try to figure out what we can. Knowing your authoritative nameservers would be a helpful place to start.

And we do ask that you avoid bunching automated renewals at very popular times, if you can help it, whether that means picking a random time or doing a sleep(rand()) in your client integration. We work hard to provide capacity for demand spikes, but since no system can ever be perfect, this change can ease both our job and your job.
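As a rough sketch of the sleep(rand()) idea, assuming your integration is driven from Python: the `renew` callable and the one-hour default window below are placeholders, not anything Let's Encrypt prescribes. Wrap whatever command or library call your own client uses.

```python
import random
import time


def jitter_delay(max_seconds, rng=random):
    """Pick a uniformly random delay between 0 and max_seconds."""
    return rng.uniform(0, max_seconds)


def renew_with_jitter(renew, max_seconds=3600, sleep=time.sleep, rng=random):
    """Sleep for a random delay, then call the supplied renew callable.

    `renew` is a hypothetical zero-argument callable that runs your client;
    `sleep` and `rng` are injectable so the delay can be tested without waiting.
    """
    delay = jitter_delay(max_seconds, rng)
    sleep(delay)
    return renew()
```

Calling `renew_with_jitter(my_renew_function)` from a scheduled job spreads the actual request over the following hour instead of firing exactly on the hour along with everyone else's.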

Section 1.4.2 of Internet Security Research Group's Certification Practice Statement, Version 3.0 (current) prohibits the use of Let's Encrypt certificates in any system in which failure could lead to injury, death, or environmental damage. Just as you expect, we take our responsibilities extremely seriously, and that includes knowing the limits of how much responsibility we can safely undertake.

I hope this helps.

7 Likes

@gitcnd You might have some valid points (I'm not going to argue about that now), but the way you convey those points is dubious at best. Remember that most of those points are your opinions and are not facts. Please post accordingly.

5 Likes

This post was flagged by the community and is temporarily hidden.

This post was flagged by the community and is temporarily hidden.

6.4% of hospitals use Let's Encrypt certificates (and that's not counting the equipment they rely on internally - so the real number will be much higher). Do you actually have any process in place that enforces your terms of use? A death is a death, whether or not you prohibited it in your TOS...

Ah, this is the kind of thing I was talking about earlier in my other thread regarding communication barriers. This entire thread began with a negative attitude towards the receiver (in this case, the LE staff) and is continuing along this line now with this recent negative attitude towards you. It doesn't look like there's an experiential mismatch as the poster later claims that they have 37 years (37?! In a row?) of experience in this field.

His assertion that feedback can help inform management and staffing decisions is correct. However, I'm not sure if he's aware that there have been studies on how to provide feedback in more constructive and effective ways. By beginning this thread with a negative attitude and continuing to do so, he has conditioned anyone reading it to also experience a negative attitude towards him. If that was his intention, it's actually creating the opposite effect of what he initially intended, which was to catch the attention of the LE staff and encourage them to implement his suggestions.

10 Likes

I understand why you assumed that "query timed out looking up CAA for com" was a black-and-white issue of Let's Encrypt being unable to communicate with the root nameservers. But, in fact, it's not a black-and-white issue.

This error message was because your request as a whole timed out. (See my reply above.) It's coincidence that the timeout expired during a query to com., causing that specific error message. Counterintuitively, I know, it does not signify that our ability to query com. was broken. We do cache DNS.

It is, in general, deliberate and appropriate for us to enforce a timeout. It's not yet known whether this instance of a timeout was the fault of our infrastructure, your infrastructure, or connectivity in between. In order to know that, i.e. whether our system was actually "broken" in some way, we'll require more information about your specific request.

11 Likes

I've opened an issue for Boulder to improve the error message here overall: Make VA DNS deadline expiration error messages better · Issue #5346 · letsencrypt/boulder · GitHub. That issue has similar information to what @JamesLE has already supplied above but, as is usual for code changes, it will accumulate more technical detail as it goes along.

One of the actions I called out in that issue is to examine whether we can emit detailed performance information for the queries of the order. Particularly for secondary validation, it can sometimes simply be a brief unlucky period of high latency that drives the order to exceed the deadline.

10 Likes

When I read this thread, as a long-time user of and believer in Let's Encrypt, it left me with an immediate subconscious negative opinion of the issue, which is a legitimate one about timeouts on large orders with an unclear error message. I was going to write a post somewhat along these lines, but yours nailed it.

I used to have a bad habit of raising issues with vendors using a negative / hostile tone when I was sure that the issue was on their end, but I generally found those discussions to be much less productive. Not to mention the few times that the issue was actually related to my configuration, which just made me look bad.

I think this issue could have been brought up, troubleshot and explained a lot better with different language. Especially since LE is a nonprofit organization accomplishing great things for a huge number of people on a tight budget.

10 Likes

Thanks for the affirmation as a long-time user. It's good to know that I wasn't off-base in my assessment of this situation.

1 Like