Consistent "During secondary validation: DNS problem" between 01:00 UTC and 02:00 UTC?

Are the issues with the secondary DNS validation servers still occurring? I see a lot of these errors across a number of domains. The error message varies between:

  • During secondary validation: DNS problem: SERVFAIL looking up A
  • During secondary validation: DNS problem: query timed out looking up A

The previous issues I've seen on this topic suggest that it was load-related and predisposed to occur when people schedule their job at *:00. But as far as I can tell, those issues are believed to be resolved. Yet this still happens.

And it is time-related: not so much which part of the hour, but which hour of the day. Each site runs renewal attempts (when needed) once per day, and each site has a random hour of the day when that happens. Today, for example, sites that tried to renew during the 01:00 UTC hour failed (at times ranging from 01:01 UTC to 01:13 UTC). But all the attempts during the 00:00 - 01:00 hour and after 02:00 UTC (so far) have succeeded.
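To illustrate the kind of scheduling I mean (a minimal sketch, not our actual code; the hash-based hour assignment is just one way to get a stable "random" hour per site):

```python
import hashlib
import random

def renewal_slot(site_name: str) -> int:
    """Derive a stable pseudo-random hour of the day (0-23) for a site,
    so each site always attempts renewal during the same hour."""
    digest = hashlib.sha256(site_name.encode()).digest()
    return digest[0] % 24

def jittered_minute(rng: random.Random) -> int:
    """Pick a fresh random minute within the hour on each run,
    so attempts don't all land at :00."""
    return rng.randrange(60)

# The hour is deterministic per site; the minute varies per run.
hour = renewal_slot("shop.example.com")
minute = jittered_minute(random.Random())
```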

This has been going on for quite a while, but this is the first time I've noticed that it's always at the same time of day.

The DNS servers and configurations have been rigorously checked and everything is OK. There are no issues with RPKI (valid) or DNSSEC (not used). Also, FWIW, the "Let's Debug" test passes every time for affected domains.

It's pretty tempting to just not schedule these during that hour, but I would really like to get to the bottom of this.

Thanks for any insight!

2 Likes

Generally speaking, the 01:00 UTC hour is the highest volume traffic of every day, and is the peak for which we provision.

I remember in 2015/2016 it was 00:00 UTC, I don't know when 01:00 UTC became the heaviest slot.

7 Likes

@jcjones Sorry for the hijack, although it's related ... Is there an LE site page that shows demand per hour so server admins can schedule off-peak cert requests?

I recently set up LE and arbitrarily chose 02:00 UTC (w/45min random variance) for my cron jobs, assuming it would be a quiet hour. I would happily choose something else if I only knew the low-volume time slots.

Thanks

6 Likes

@MikeMcQ I don't think that's much of a hijack, because it gets right at my next question: what's the best way to address this? There are a couple of different angles: getting people to pick different times, and helping Let's Encrypt better handle the load.

Your suggestion is a great one. I might take it one step further and ask for hourly coefficients based on measured demand that could be used to weight the probability that a given hour will be chosen. Or just some kind of API endpoint for "what time should I schedule renewal checks?" that looks at that data, does the math for the user, and hands back a randomly generated update time weighted by demand. Easier is better!
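To sketch what I mean (the coefficients here are made up for illustration; in the proposal they would come from Let's Encrypt's measured demand):

```python
import random

# Hypothetical per-hour demand coefficients (higher = busier).
demand = [1.0] * 24
demand[0] = 3.0   # "Cron o'clock"
demand[1] = 5.0   # the 01:00 UTC peak discussed in this thread

def pick_renewal_hour(rng: random.Random) -> int:
    """Pick an hour with probability inversely proportional to demand,
    so quiet hours are chosen more often than busy ones."""
    weights = [1.0 / d for d in demand]
    return rng.choices(range(24), weights=weights, k=1)[0]
```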

Another way to approach that side of the problem might be to detect those overloads and instead of kicking back a misleading "your DNS is broken!" message, display instead "hey, we're too busy right now, could you please reschedule your renewal for a different time of day?"

The other side of the problem is the secondary validation failures. I looked at this, and over a sustained period of time, 80% of my requests between 01:00 UTC and 02:00 UTC fail with "During secondary validation" messages. At all other times, it's <1%. That's pretty dramatic.

From the Let's Encrypt side, it's a double whammy. Load causes failures. But failures cause load, because failed renewals come back the next day, on top of the new renewals for that day. And the next thing you know, a small bump in demand during one hour of the day turns into an ever-growing stampede as previous days' failures aggregate. That's the kind of vicious circle that can drive a web service to its knees. So it's well worth addressing.
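A toy model makes that feedback loop concrete (the numbers are invented; with a constant failure rate f the peak load settles at base / (1 - f), but if f itself rises with load, it runs away):

```python
def peak_load(days: int, base: float, fail_rate: float) -> float:
    """Toy model of the retry-storm feedback loop: each day's peak-hour
    load is that day's new renewals plus yesterday's failures retrying.
    With a constant failure rate, load converges to base / (1 - fail_rate);
    if fail_rate grows with load, it grows without bound."""
    load = base
    for _ in range(days):
        failures = load * fail_rate
        load = base + failures
    return load
```

With base = 100 and an 80% peak-hour failure rate, the daily peak climbs from 100 toward a steady state of 500, i.e. five times the underlying demand.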

I think I read somewhere on here that secondary validation uses AWS. So is it just a question of getting more donations in to pay for more AWS? Would it help for some of us to provide Let's Encrypt with servers on our networks to use for secondary validation? Or are there some technical bottlenecks in the software?

6 Likes

FWIW from my experience here, I can certainly say from the standpoint of those of us providing aid that addressing secondary validation failures has quite frequently been the bane of our existence.

Is it a global filter?
Is it a responsive filter?
Is LE overloaded?
Gremlins?

2 Likes

@jdavid1 Very good points. I especially liked your explanation of the cascading demand after failures. LE would also see spiked loads if it ever needed to do a mass cert revocation again. Both argue for ways to spread demand.

I would prefer an API returning a ranking of hours. I could then choose a low-volume hour which also suits my own time preference. I may not choose the lowest-volume hour, but at least I won't be piling on busy times. I would continue to vary the time within the hour randomly. Doing the rankings in shorter slices would be fine too.
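For instance, with a hypothetical published ranking, the client-side selection could be as simple as:

```python
def choose_hour(ranking: list, preferred: set) -> int:
    """Given hours ranked from least to most busy (a hypothetical
    published ranking), pick the least-busy hour that also suits the
    operator's own preferences; fall back to the overall quietest hour
    if none of the preferred hours appear in the ranking."""
    for hour in ranking:
        if hour in preferred:
            return hour
    return ranking[0]
```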

I can also see value for an API to return a single 'best time'. I would just prefer the ranking and choice.

The rankings would not need to be updated very often. Thus, the cost of serving a static object would be very low.

This part of the discussion probably belongs in the Feature Requests section.

3 Likes

That's interesting. My renewals run randomly somewhere between 23:00 - 03:00 UTC (used to be more statically fixed in between 01 - 02 UTC depending on daylight saving time, but I reworked that sometime last year). In about 3 years, I've never had a single case of "during secondary validation" errors (or any other DNS servfail error message from LE). Maybe I have been just lucky this entire time, but it seems odd. What nameservers are you working with? Are you maybe using nameservers that get hammered (by LE) during this time of the day? It would be interesting to see if only certain setups are affected by this, while others are not. Might be some networking issue between AWS and your nameservers that is load related?

4 Likes

Our renewals are randomly scheduled throughout the day. They are, within some small variation, roughly evenly distributed. So the same renewal load that is no problem whatsoever between 0:00 UTC and 1:00 UTC is suddenly a catastrophic problem between 1:00 UTC and 2:00 UTC, then no problem again for the entire rest of the day.

1:00 UTC - 2:00 UTC is not a peak time for us.

The amount of DNS traffic that Let's Encrypt represents is vanishingly small as a percentage.

If there were any problems, however small, with our DNS servers or with connecting to AWS from our network at any time of day, our customers and monitoring infrastructure ensure that we would hear about it every few seconds until it was fixed. :slight_smile:

So I have pretty good reasons for thinking the problem is not on our end. And since Let's Encrypt doesn't publish the IP addresses they use for validation, I really can't investigate this further. What would be really helpful would be some data from Let's Encrypt about what the overall failure rates for secondary validation are by time of day, as well as any insight into causes of variance or possible fixes.

4 Likes

We've been adding more observation tools to try to get a better handle on the 01:00-02:00 UTC situation. As the service's overall reliability has gotten higher and higher, this window has become a clearer and clearer outlier, having overtaken what we've affectionately called "Cron o'clock" for many years (roughly 00:00-00:30 UTC). It's also definitely a peak-load-makes-for-retry-storm vicious cycle.

We've a code change for Boulder in the pipe that should help with one aspect, but it's still a bit of a mystery. We know that recursive DNS resolution from multiple places across the planet tends to increase in latency during that window. Boulder doesn't resolve that way (it resolves against authoritative servers directly), so that's a curious puzzle piece. At the same time that happens, our validation rates for DNS lookups drop, and they do so from both secondary and primary vantage points.

It's sort of as-if some major DNS provider does nightly maintenance during that window, but... anyway, this is definitely a mystery that we'll just keep adding more methods of observation until we have it nailed down.

As to a time-of-day scheduling map... that seems like a reasonable feature request, but I'd still like to serve 429s with a retry-after header as-needed and just have that work. But that's the Mozillian in me. :slight_smile:
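On the client side, honoring such a 429 could look like this minimal sketch (illustrative only; it assumes the Retry-After header is in its seconds form, not the HTTP-date form, and the backoff parameters are invented):

```python
import random

def retry_delay(status: int, retry_after, attempt: int):
    """Decide how long to wait before retrying an ACME request.
    Honors a server-supplied Retry-After value (seconds form) on a 429;
    otherwise falls back to jittered exponential backoff for transient
    errors. Returns None when no retry is needed."""
    if status < 400:
        return None  # success, nothing to retry
    if status == 429 and retry_after is not None:
        return float(retry_after)  # the server told us when to come back
    # Jittered exponential backoff for other transient failures,
    # capped at one hour.
    return min(3600.0, (2 ** attempt) + random.uniform(0, 1))
```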

10 Likes

Thanks for the insight, @jcjones !

It sounds like the obvious first action is for us to (at least temporarily) reschedule checks out of that window. I'll do that now.

Above and beyond that, what's the next step we can take that would be the most helpful to you?

3 Likes

That is certainly good for when you are under stress. But, spreading the load can help avoid that happening. What's the old saying - an ounce of prevention ... :slight_smile:

For example, if 'everyone' started making requests at 00:00 UTC, waiting for the 429 to inform them of the 'best next time', it would become a very spiky event.

2 Likes

Here's an interesting one from 01:12 UTC today:

"During secondary validation: DNS problem: query timed out looking up CAA for com"

Yes, just com. The name being renewed was a bare domain (e.g., example.com).

I found one for net and one for org as well.

This, in conjunction with:

made me take a look at the distribution of TLDs this occurs with. (Like, maybe it's Verisign?) But, alas, it's a pretty accurate sample of the TLDs we work with. Over half the sample is com. Net and org are well represented, with a scattering of other gTLDs and 9 ccTLDs from three continents. Registries touched include at least Afilias, Radix, Verisign, and Zodiac. So, no common-element smoking gun there, I'm afraid.

But if the secondary validation servers are having trouble confirming that there's no CAA for .com, it does imply the problem is pretty close to them.
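For context, the CAA tree climbing defined in RFC 8659 is why a bare "com" lookup happens at all: the CA checks for CAA records starting at the requested name and walking toward the root. A sketch:

```python
def caa_query_chain(fqdn: str) -> list:
    """Per RFC 8659, a CA looks for CAA records starting at the
    requested name and climbing toward the root, so issuing for a name
    under .com really does involve a CAA lookup for plain "com"."""
    labels = fqdn.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]
```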

I also pulled the full stats for what errors occur. Here's what that distribution looks like:

All of those are of type urn:ietf:params:acme:error:dns except the last one, which is urn:ietf:params:acme:error:connection. (That one may be unrelated.)

The other thought I had was in response to the mention of recursive DNS vs. authoritative. A significant majority of the DNS failures we see involve CNAMEs to a name in another domain.

For example, renewing the cert for www.example.com where:

www.example.com. IN CNAME load-balancer-123.example.net.

That's not recursive, exactly, but it is a bunch of extra round-trips. Perhaps that helps explain why this hits us so hard.
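To make the extra round-trips concrete, here's a small illustration (the CNAME table is hypothetical; each hop in the chain is an extra set of lookups that eats into the validation time budget):

```python
def lookup_sequence(name: str, cname_map: dict) -> list:
    """List the names a resolver has to chase for one A lookup, given a
    hypothetical table of CNAME records. Each hop is another set of
    round-trips before the final answer arrives."""
    chain = [name]
    seen = {name}
    while name in cname_map:
        name = cname_map[name]
        if name in seen:
            raise ValueError("CNAME loop")
        seen.add(name)
        chain.append(name)
    return chain
```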

6 Likes

The relevant Boulder issue for this is https://github.com/letsencrypt/boulder/issues/5346 , which has somewhat changed into an overall evaluation of our DNS resolution capacity, but the gist is: that message doesn't mean anything about actually resolving the TLD, and instead is the result of lots of timeouts happening at the same time.

We've been reducing the occurrence of that message since I filed the ticket in the spring by adding more secondary validation capacity. However, this particular 01:00 UTC spike currently appears to have a different cause than simply our capacity, as the rest of our metrics don't show the same kind of overload, and our external probes start resolving DNS slowly, too.

I actually added more metrics just this morning that hopefully will give us more information here in 5 hours for today's iteration.

Thanks a lot for making that plot of the errors you've encountered, and the tidbit about use of CNAMEs. As a general bit of knowledge, that probably does affect the severity of impact you're seeing at those times: overall, even during the 01:00 hour, >90% of new-order requests still succeed.

That's terribly below our SLO, but it could easily be that the requests which do fail are those that have indirection.

6 Likes

So, if I understand you (and that issue, which contains quite a few insider acronyms :slight_smile: ) correctly, it sounds like you're saying that the process of resolving DNS for validation involves a long series of queries. Something like:

com NS
com CAA
example.com NS
example.com CAA
www.example.com A

Plus a whole extra set if you get back www.example.com CNAME. And possibly more for DNSSEC.

And in addition to per-query timeouts, you have a "global" time budget for the whole sequence of queries to complete.

If the global time budget is exceeded, then the returned message is "query timed out for" whatever query was being attempted when time ran out, even if the real problem is that some earlier query succeeded but took 95% of the budgeted time to do it.
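In code, my mental model of that budget looks something like this (purely illustrative; not Boulder's actual implementation):

```python
import time

def run_queries(queries, do_query, budget: float):
    """Run a sequence of DNS queries under one overall time budget.
    If the budget runs out, the *current* query is reported as timed
    out -- even when an earlier, slow-but-successful query consumed
    most of the time."""
    deadline = time.monotonic() + budget
    results = {}
    for q in queries:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"query timed out looking up {q}")
        results[q] = do_query(q, timeout=remaining)
    return results
```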

Do I have that part right?

If so, is that overall timeout shorter for secondary validation than for primary? Or is there any reason why secondary validation might take longer?

I confirmed today with a more extensive sample (n>10000) of logged Let's Encrypt responses that all "query timed out" responses we get for domains using our name servers occur during secondary validation—literally, all of them. If there's a general DNS problem, it only affects the secondary validation servers, not the primary ones.

And, FWIW, 98.7% of timeouts in the sample occur 01:00 - 01:15 UTC, which fits your graph fairly well:

(We don't run renewals between 00:50 and 01:00.)

Unfortunately, I also found that only 60% of names that experienced query timeouts during secondary validation use CNAMEs. That's not really a strong enough correlation to conclude, "Aha! It's the CNAMEs pushing it over the limit!" :frowning_face:

We did move all of our renewals out of the 01:00-02:00 window before 01:00 today, though I'm quite sure we're not doing nearly enough certificates to move the needle on Let's Encrypt's overall error rate.

5 Likes

Yes, precisely.

The main difference is that secondary happens in datacenters where we aren't privy to the entire network design; the timeouts are the same.

Nice to hear, but we get DNS problems everywhere. Problem? It's always DNS. :slight_smile:

My new metrics last night imply strongly that we're getting rate-limited at secondary validation, so we've deployed a mitigation to half our secondary validation sites to compare. We'll see in about 2 hours if that's a winning strategy or not!

7 Likes

This might have fixed it.

The validation sites with the mitigation were rock-solid, and those without had timeouts. Overall, the API had smooth sailing through peak load tonight, for the first time in a while.

Let me know if you saw otherwise. Tomorrow I'll make this change permanent and roll it to the other half of the secondary validation sites.

11 Likes

Nice! That sounds like a great discovery.

Since we moved all of our validations out of that hour, we have no "during secondary validation" errors related to our DNS or web servers in the last 48 hours. None. Zero. But since they weren't during that hour, that's not terribly informative.

If it would be helpful, I can pile up a bunch of them to run during that time tomorrow and report back what happens.

7 Likes

I don't want to make any extra work for you, but I suspect you can use any time of day again. It's still true that 01:00 carries the heaviest load, though. So... :: shrug ::

Anyway, this is a fun example of Let's Encrypt being bitten by rate limits as opposed to being the one doing the biting. Rate limits get us all. :joy:

7 Likes

For the time being I've unblacklisted that hour, so our hits will occur approximately evenly throughout the day.

In the longer term, our usage of LE is fairly decentralized and looking into this has exposed some design flaws with that approach that I'm not thrilled about. I'm looking into what we can do about that, which should make our usage of LE a little more considerate.

If you're able to make any sort of demand map available as @MikeMcQ originally suggested, that'd definitely be something we could implement as part of that effort.

6 Likes

Gotcha. Our demand map is probably going to come in the form of what we call ARI, the ACME Renewal Information extension, which Aaron is about to post a first draft of to the ACME IETF mailing list. See this initial post for context.

8 Likes