Consistent "During secondary validation: DNS problem" between 01:00 UTC and 02:00 UTC?

The relevant Boulder issue for this is "Make VA DNS deadline expiration error messages better" (Issue #5346, letsencrypt/boulder on GitHub), which has somewhat changed into an overall evaluation of our DNS resolution capacity. The gist is: that message doesn't say anything about actually resolving the TLD, and instead is the result of lots of timeouts happening at the same time.

We've been reducing the occurrence of that message since I filed the ticket in the spring by adding more secondary validation capacity. However, this particular 01:00 UTC spike currently appears to have a different cause than simply our capacity: the rest of our metrics don't show the same kind of overload, and our external probes start resolving DNS slowly at that time, too.

I added more metrics just this morning that will hopefully give us more information in about 5 hours, during today's iteration of the spike.

Thanks a lot for making that plot of the errors you've encountered, and the tidbit about use of CNAMEs. As a general bit of knowledge, that probably does affect the severity of impact you're seeing at those times: overall, even during the 01:00 hour, >90% of new-order requests still succeed.

That's terribly below our SLO, but it could easily be that the requests which do fail are those that have indirection.
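
To illustrate why indirection could plausibly make the impact worse, here is a minimal sketch. It assumes, purely for illustration, that each DNS lookup in a chain succeeds independently with the same probability; real resolver behavior (caching, retries, correlated timeouts) is more complicated, and the function name and the 0.95 figure are hypothetical, not measurements from our systems.

```python
def chain_success_probability(per_lookup_success: float, lookups: int) -> float:
    """Probability that every lookup in a chain succeeds.

    Assumption (not from our metrics): lookups fail independently and
    each one has the same success probability.
    """
    if not 0.0 <= per_lookup_success <= 1.0:
        raise ValueError("per_lookup_success must be between 0 and 1")
    if lookups < 1:
        raise ValueError("lookups must be at least 1")
    return per_lookup_success ** lookups


if __name__ == "__main__":
    # Hypothetical per-lookup success rate during a slow period.
    rate = 0.95
    # 1 lookup = direct record; 2 or 3 = one or two CNAME hops to follow.
    for lookups in (1, 2, 3):
        p = chain_success_probability(rate, lookups)
        print(f"{lookups} lookup(s): {p:.3f} chance that all succeed")
```

Under that assumption, every extra hop multiplies in another chance of a timeout, so names behind CNAMEs would fail more often than directly resolved names during the same window.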
