Load balancing to avoid busiest time(s) of day for renewal issuance requests

petecooper · October 27, 2019, 8:46pm

Question: what’s the best time of day to run a renewal check? Is there (public) data on the requests over a typical day or week, which could be graphed to show peaks and troughs?

I see a lot of tutorials from well-meaning authors, and they invariably schedule the crontab task to run on the hour, on the half hour, or sometimes on the quarter hour. It’s not a stretch to think there may be busier times of day on the LE servers.

For my fleet, I set a random hour and random minute in my crontab, so there’s a better chance of avoiding the busy times. I never knowingly encounter timeouts, so there’s no problem as such, but I would prefer to be a responsible user and not be part of an on-the-hour DoS army, if that’s feasible.
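
A minimal sketch of that approach, assuming Bash and a user crontab (no user field). The function name `random_cron_line`, the daily cadence, and the `certbot -q renew` command are illustrative assumptions, not taken from any particular tutorial. The idea is to pick the hour and minute once, at install time, and paste the printed line into the crontab:

```bash
#!/usr/bin/env bash
# Sketch: print a crontab line with a random hour and minute.
# The command and the once-a-day cadence are assumptions for illustration.

random_cron_line() {
    local minute=$(( RANDOM % 60 ))   # 0-59
    local hour=$(( RANDOM % 24 ))     # 0-23
    printf '%d %d * * * certbot -q renew\n' "$minute" "$hour"
}

# Print one candidate line; run again for a different machine.
random_cron_line
```

Each server in a fleet would get its own line, so the renewal checks land at unrelated times rather than all at the top of the hour.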

Thank you in advance. I’m new, and I trust this is the right place to ask this question.

rg305 · October 28, 2019, 1:27am

…maybe someone could add an “are you busy - if so I can call later (+some random time)” api.

But wait; What if the busy checker itself should also get too busy?
[since the busy check would be happening in place of the actual check, we only switch one for the other]
We will need “redundant redundancy” (patent pending).
Where we spread the load across thousands of bitcoin mining rigs through the planet to ensure 9999% uptime.

[I’m just messing]

I think it is a good idea to at least run a test daily while incorporating some sort of random offset (+/- 1000 seconds), which gives a better chance of actually spreading the load than the worst-case scenario, where every new user follows the exact same tutorial with a hard-set specific time. AAAAH!!!

But your idea could lead to some automated pre-check periodic adjustments.
Where it calls into “this service” and asks what the least used time of the week was last week, and skews towards that answer (not directly at it; that would put all callers at the same place at the same time).

Better yet, include “this is my current setting, the number of checks I do, etc.” … “which way should I move?” and allow the service to better understand the impact of the answer it will provide.

We should call it: APPOINTMENT SCHEDULER!
[what do you mean that is already a thing and we can’t copyright it? - dam]

petecooper · October 28, 2019, 6:45am

Thanks, Rudy.

That's pretty much it, yes. It's pretty safe to assume that the initial issuance happens at one time of day, and the renew run happens at one (or more) of:

the time(s) set in crontab;
the time set by the hosting organisation;
shortly after an email reminder arrives;
…not so shortly after an email reminder arrives and it becomes rather more urgent.

There will doubtless be people who schedule these things in bulk and don't give a thought to the impact. I know techs who deliberately avoid the Monday to Friday awake hours so there's minimal risk of downtime if a service doesn't restart as it should…and conversely, there are other techs who deliberately schedule during the awake hours just in case there is a problem.

As I say, I've not knowingly hit a production rate limit or other hurdle, so this is not in reaction to anything bad, it would just be helpful to know roughly what time of day or week is best avoided or aimed-for.

Is there a perceived value in sharing or publicising these stats, do you think, or is it just me who's interested?

rg305 · October 28, 2019, 6:49am

I'm sure you will hear a more definitive response once L.A. wakes up.
But the short answer may be something like:
It all runs through Cloudflare (or some other CDN) and they haven't told us that we are stressing their systems in any way at any time. So (at least at the moment) this is not a real concern.

petecooper · October 28, 2019, 7:01am

Good point, well made. Thanks, Rudy.

rg305 · October 28, 2019, 7:03am

Nonetheless, I do like how you think and plan

mnordhoff · October 28, 2019, 7:25am

Speaking for myself, I'd be curious to see graphs, but I don't know if it would be valuable to publish them.

For example, it's possible that the load from well-behaved clients is so low that making them avoid peak times wouldn't make a significant difference.

The recommended best practice is to renew at completely random times of the day. I think you're the first person to additionally suggest avoiding peak times.

As an example, Certbot packages use a cron job similar to this:

0 */12 * * * root perl -e 'sleep int(rand(43200))' && certbot -q renew

Or this timer:

[Unit]
Description=Run certbot twice daily

[Timer]
OnCalendar=*-*-* 00,12:00:00
RandomizedDelaySec=43200
Persistent=true

[Install]
WantedBy=timers.target

Additionally, certbot renew itself will sleep for a few minutes when run non-interactively, to help mitigate the effects of poorly timed cron jobs.

_az · October 28, 2019, 7:57am

We have some idea that "midnight" is a peak time (or was around a year ago):

But whether that means midnight in a particular TZ, or midnight in a number of different TZs, I'm not sure. Which city in the world has the most dense population of web servers?

rg305 · October 28, 2019, 8:01am

So what you’re saying is:
Avoid all 24 midnights! - LOL
[just to be safe]
We can start a campaign: Just say no to zero.
Top of the hour = bottom of the _____?
Pile, list, [work in progress]

Life without humor is like humor without life… but more sad.

_az · October 28, 2019, 8:03am

Unironically this, haha.

From what I gleaned from posts at the time, Let's Encrypt's nginx servers were having a meltdown for a few days after the switch to Cloudflare. One user reported that switching their cron time "fixed" the issue, and I think Let's Encrypt reported that they ended up increasing their capacity as well.

rg305 · October 28, 2019, 8:06am

Maybe getting the number of requests or very specific detail is overkill.
Maybe just monitoring the physical port usage [of the nginx servers] with SNMP would show a graph we can all understand quite easily.

Some may argue that such information may lead to intentional DDoS attacks.
But anything open will always face that risk, and yet it will also have “the power of the many” to resolve whatever may come, and will do so very quickly.

So I’m all for publicizing any and all network utilization “stats” and such.

petecooper · October 28, 2019, 8:54am

That was my understanding, thanks for the clarification. I'm bodging something similar in my crontab with a random hour and random minute. I didn't think to use Perl to insert a random sleep, that's neat. I'd factored in the renew run at a set time and attempted pure Bash to put a random sleep period in, but the Perl route is much cleaner. Thanks for the tip!

My vote would be the quaint town of UTC in the state of Zulutime.

jsha · October 28, 2019, 2:59pm

Thanks for the question! Indeed, as you and others have guessed, we do see peaks of load at popular cron times. Surprisingly the current peak isn’t 0000 UTC, it’s 0100 UTC. That does move around somewhat as different countries move into and out of DST. The first of the month is usually a particularly big spike, because many people have their crontabs set up to fire just once a month (or even every 60 days). This is a bad idea, as I’m sure you can guess. We mention randomized renewal times in our integration guide.

I appreciate your randomization of your cron times! That’s good not just from a load-spreading standpoint; it also increases your own systems’ reliability. We do see increased error rates during spikes. We try to keep them under control but it seems like there will always be a slightly higher chance of errors around 2200, 2300, 0000, and 0100 UTC.

The way Certbot solves this is with an every-twelve-hours cron (systemd) job that also includes a randomized sleep for up to 12 hours. Since we found that many Certbot users set up their own cron jobs at the top of the hour, we also built in a shorter (~3 minute) randomized delay in certbot renew itself, when run non-interactively.

Thanks for your thought on this aspect. If you see such tutorials, please share them here or send them to me. I am happy to reach out to their authors about correcting them.

petecooper · October 28, 2019, 3:41pm

Thanks, Jacob – very helpful, and much appreciated. That pretty much answers my question, especially with the Certbot load spreading and mitigation.

JamesLE · October 28, 2019, 4:29pm

Many thanks for the question! I’d like to second the point you noticed: that it’s helpful to use that (Certbot style) short random sleep to avoid always hitting us at 0-1 seconds after the start of a minute. Even better, many sleep implementations (like Perl’s Time::HiRes::sleep) can handle fractional (floating point) seconds.
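
For readers who would rather stay in the shell, here is a rough sketch of a sub-second delay, assuming Bash and a `sleep` that accepts fractional seconds (GNU coreutils does; POSIX does not guarantee it). The function name `fractional_delay` and the 180-second cap are assumptions chosen to resemble the few-minute delay discussed in this thread:

```bash
#!/usr/bin/env bash
# Sketch: sub-second jitter. Function name and the 180-second cap are assumptions.

# Print a random delay below max_seconds with millisecond resolution, e.g. "17.042".
# max_seconds should not exceed 32768, the range of a single $RANDOM draw.
fractional_delay() {
    local max_seconds=$1
    printf '%d.%03d\n' "$(( RANDOM % max_seconds ))" "$(( RANDOM % 1000 ))"
}

# Example, commented out so the sketch has no side effects when sourced:
# sleep "$(fractional_delay 180)" && certbot -q renew
```

The millisecond part is what moves the request away from the first second of the minute, even when the whole-second part happens to be zero.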

petecooper · October 28, 2019, 4:43pm

Man, I'm only just getting my head around Bash, don't give me more things to learn over winter! Saying that, a Bash sleep delay in crontab would achieve something along the same lines with a bit less finesse than Perl.
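
One way the Bash-only delay might look, as a sketch rather than a recommended setup. The function names and the idea of a separate wrapper script are assumptions. A wrapper script is used here partly because cron usually runs commands with `/bin/sh`, where `$RANDOM` may not exist, and because an unescaped `%` inside a crontab line is treated as a newline:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper script; the names and the 43200-second window are assumptions.

# Print a random whole number of seconds in [0, max).
jitter_seconds() {
    local max=$1
    # Combine two 15-bit $RANDOM draws so that max may exceed 32767.
    echo $(( (RANDOM * 32768 + RANDOM) % max ))
}

# Sleep for a random delay below max seconds, then run the given command.
run_with_jitter() {
    local max=$1
    shift
    sleep "$(jitter_seconds "$max")"
    "$@"
}

# Example, commented out so that sourcing this sketch has no side effects:
# run_with_jitter 43200 certbot -q renew
```

Called from a twice-daily cron entry, `run_with_jitter 43200 ...` would mimic the up-to-twelve-hour random sleep described earlier in the thread.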

RealAndy · November 4, 2019, 2:45pm

0100 UTC isn’t so surprising when you see that this is midnight in the European Union (or 0200 UTC with DST).

system · December 4, 2019, 2:45pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
