What would make Certbot scale better

I thought I should record my observations about this somewhere since I brought up the topic...

(1) Each OCSP query has some time overhead, so if you have a huge number of certs, certbot renew takes a long time even when no renewals are due. This is especially annoying when trying to run it interactively (when working on a server's certificate setup) but could also be a problem if it's set to run frequently and can't finish one invocation before a later invocation happens.

(2) On a large Apache install, apachectl graceful can actually take a long time to finish (I'm guessing it's at least linear in the number of VirtualHosts but may also be affected somehow by the number of active connections and other factors). If you're using certbot --apache you have up to three apachectl graceful invocations per cert request (one after configuring Apache to satisfy the challenge, another after reconfiguring it to remove the challenge, and a third for deployment). When using another authenticator, you'll still usually have one apachectl graceful per deployment, so if you have a certbot renew that results in 15 renewals on some occasion, you'll either have 45 Apache reloads or 15 Apache reloads, neither of which is great if a single reload takes a long time (or a lot of CPU?).

I was able to work around the second issue with a deploy-hook that created a file indicating that a deployment reload of the web server was pending, which would then happen only once (if needed) after certbot renew finished.
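
For anyone who wants to try the same workaround, here is a minimal sketch of the idea in Python. It is not the actual hook described above: the flag path and the function names are made up for illustration, and it assumes `apachectl graceful` is the reload command you want.

```python
import subprocess
from pathlib import Path

# Hypothetical flag location; any path the hook can write to would do.
FLAG = Path("/var/run/certbot-reload-pending")


def mark_reload_pending(flag: Path = FLAG) -> None:
    """Deploy-hook side: note that a reload is needed, but do not reload yet."""
    flag.touch()


def reload_if_pending(flag: Path = FLAG,
                      reload_cmd=("apachectl", "graceful"),
                      run=subprocess.run) -> bool:
    """Run once after `certbot renew` exits; reload only if a hook asked for it."""
    if not flag.exists():
        return False
    # check=True raises if the reload fails, leaving the flag in place for a retry.
    run(list(reload_cmd), check=True)
    flag.unlink()
    return True
```

The deploy hook would only call `mark_reload_pending()`, and whatever wraps `certbot renew` (a cron script, for example) would call `reload_if_pending()` afterwards, so 15 renewals still cost a single reload.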

Some architectural changes that might help Certbot's scaling (maybe especially with Apache integration):

(1) Parallelize OCSP queries for revocation status, or perform them in a more stochastic or opportunistic way somehow. Maybe turn them off by default when certbot renew is run interactively?

(2) Parallelize challenge satisfaction and challenge cleanup whenever multiple names are or will be requested, even across multiple certificate renewals. That is, all of the challenges could be obtained from the CA, and then a single action would attempt to satisfy all of them (with a single reload), and then the client would perform a single challenge cleanup, and then deal with the CA's report about which challenges were or were not verified.

(3) Parallelize deployment, whether with a new installer interface that attempts to install an arbitrary number of certificates, or with a new deploy-hook interface that gives a deploy-hook an arbitrary number of certificates to deploy at once.

(4) Experiment with keeping a copy of the notAfter date in the renewal configuration file after a successful renewal and using that instead of parsing it out of the PEM file. (This one might be bad for reliability and not produce that much speedup.)
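
To make suggestion (1) a bit more concrete, here is a rough sketch of running revocation checks concurrently with a thread pool. It does not use Certbot's internals: `is_revoked` stands in for whatever function performs a single OCSP query, and `check_all` and `max_workers=16` are names and values I picked for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def check_all(cert_paths: Iterable[str],
              is_revoked: Callable[[str], bool],
              max_workers: int = 16) -> Dict[str, Optional[bool]]:
    """Query revocation status for many certs at once.

    Returns True/False per cert, or None when the query itself failed,
    so one slow or broken responder does not abort the whole run.
    """
    def safe_check(path: str) -> Optional[bool]:
        try:
            return is_revoked(path)
        except Exception:
            return None  # status unknown; the caller decides what to do

    paths = list(cert_paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip lines results up with paths.
        return dict(zip(paths, pool.map(safe_check, paths)))
```

Since each query mostly waits on the network, threads should be enough here, and the wall-clock time would drop roughly in proportion to the worker count until the responder or the local machine becomes the bottleneck.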

Maybe try ARI first, then check OCSP as a fallback? I'm not sure how fast those are, though.

ARI brings a new "issue": the ACME client will probably need to run (or at least check the ARI info) far more often than "just" twice a day.

In the current version of Certify The Web we check OCSP and ARI stuff every 5 minutes for a subset of the managed certificates being maintained, as otherwise there's not enough time in a month to check thousands of certs without doing large batches. Currently our largest known production instance maintains about 22,000+ certs on one machine but we'd like to cater for much more than that. Currently 50k certs on one instance is quite hard work, especially if you inject a lot of failures. CTW has the advantage of running as a service (it may be one of the few that does), so it can choose its own frequency for maintenance tasks.
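
As a back-of-the-envelope illustration of why the batches stay small when the checks are spread out, here is the arithmetic as a short Python sketch. The 5-minute interval and the 22,000 figure come from the numbers above; how often each cert should be revisited (once per 30 days, once per day) is purely an assumption for the example, and `batch_size` is a made-up helper, not anything from CTW.

```python
import math


def batch_size(total_certs: int, interval_minutes: int, cycle_days: float) -> int:
    """Certs to check per run so that every cert is visited once per cycle."""
    runs_per_cycle = cycle_days * 24 * 60 / interval_minutes
    return math.ceil(total_certs / runs_per_cycle)


print(batch_size(22_000, 5, 30))  # each cert once per 30 days -> 3
print(batch_size(22_000, 5, 1))   # each cert once per day     -> 77
print(batch_size(50_000, 5, 1))   # 50k certs, once per day    -> 174
```

A client that only runs twice a day would have to cover the same ground in two runs, which is where the large batches come from.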

@webprofusion Does CertifyTheWeb run as a daemon? Or using the Windows equivalent of cron? Using cron/systemd is probably less efficient than running as a daemon with multiple threads doing multiple things at the same time, with timers continuously running and triggering things. Probably not that easy to code in Python, but I'm not familiar with that... (Python and threads... :roll_eyes:)

threading - Thread-based parallelism - Python 3.12.2 documentation Oeehh, threading :grinning:

PEP 3143 – Standard daemon process library | peps.python.org / python-daemon · PyPI Oeehhh, daemonizing a Python application :grinning:

Yes, it runs as a standard Windows service, and the UI talks to that. Other apps (certbot, win-acme etc.) use a Windows scheduled task (same as cron). I'm sure pretty much anything can be made to run as a daemon or equivalent; timers etc. are very lightweight. The disadvantage is that memory allocated on startup generally stays in use rather than being freed when the current job has finished. In the case of certbot that would be the Python runtime overhead. We're working on minimizing that in CTW, but it's definitely not insignificant, especially for low-memory environments.

I'm going to argue against turning off OCSP checks (and ARI once implemented) when running certbot renew interactively: If an administrator gets a notification from the CA that their certificates are revoked (or about to be), then ideally them just logging in and running certbot renew would be sufficient (if they didn't want to wait for the next time the cron job would otherwise run).

That's reasonable. In that case, maybe it should be a non-default option to skip the check.

But wouldn't that only be useful if run interactively? Is there strong benefit for that?

The background renew should check OCSP or ARI so that it can renew unattended.

Generally I agree that modestly large installs could be improved. The --deploy-hook signal is a clever idea. That said, is there enough emphasis on Certbot going forward for such architectural changes, with all that entails: even more docs, announcements, education, more tests, beta cycles, ...?

Today I think modestly large installs would just be better off not using the --apache or --nginx plugin, in favor of webroot and doing a single server reload afterwards. Or, of course, migrating to Caddy, Apache mod_md or similar.

I have to wonder about going beyond modestly large installs to even bigger ones. It seems the challenges there are not the speed of a single cert request but managing rate limits, and managing backup CAs and similar. How would people be coached as to the suitability of Certbot for these? I mean, we can't very well say "look, this Certbot is great for big installs" only to have it fall down due to these other issues.

A few thoughts:

IMHO, if someone has "a large number of certificates" they should not be using Certbot and the recommendations should be to use another client.

I am not saying this because Certbot is historically terrible for large integrations (which especially holds true for Nginx over Apache), but purely from a product-market opportunity fit. There are many clients now that specialize in large-scale integrations, while Certbot has been positioned as the "entry level" and "Swiss Army knife" of ACME clients. I think the Certbot team's resources would be better spent addressing the smaller use cases and expanding compatibility in general.

If someone were to prefer using Certbot for large-scale systems, while I agree with the overall ideas @schoen shared - I am concerned for their potential impact and over-complication of management on smaller installations. As another Python developer, I definitely have anxiety over bringing in async code and the guaranteed sh*tshow of user issues because that happens in every project I've contributed to.

My suggestion is that larger deployments that wish to use Certbot should use a dedicated Certbot process invoked by a new subcommand, perhaps certbot coordinator, somewhat akin to running Certbot in standalone mode on a higher --http-01-port with ProxyPass (not sure how to support tls-alpn-01). The new subcommand would use a secondary lockfile, so it can run in the background and still allow Certbot to run as normal. If normal Certbot detects the daemonized version, it hands the commands to that process; otherwise it runs as normal.

The daemonized version would then be configured - via command line - with the deploy hook options regarding what to run and when. This might be for groups of domains/server blocks, or in 15 minute intervals.

That would largely remove the need for async code; the daemon process should be able to handle most things itself, and can use the standard threads/multiprocessing support. If users run into issues, disabling the daemonized version would just cause certbot renew to fall back on normal operations.
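
To show the fallback behaviour I have in mind, here is a very rough sketch. None of this exists in Certbot today: the lockfile name, `dispatch`, and the two callables are all hypothetical, and checking that a lockfile exists is only a stand-in for a real "is the coordinator alive" test.

```python
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical secondary lockfile that the proposed coordinator would hold.
COORDINATOR_LOCK = Path("/var/lib/letsencrypt/.coordinator.lock")


def dispatch(args: Sequence[str],
             send_to_coordinator: Callable[[Sequence[str]], None],
             run_locally: Callable[[Sequence[str]], None],
             lock: Path = COORDINATOR_LOCK) -> str:
    """Hand the command to the coordinator if one seems to be running,
    otherwise (or if it cannot be reached) run it the normal way."""
    if lock.exists():
        try:
            send_to_coordinator(args)
            return "coordinator"
        except OSError:
            # Stale lockfile or unreachable daemon: fall through to normal mode.
            pass
    run_locally(args)
    return "local"
```

The point is only that the normal code path stays intact, so small installations never touch the coordinator logic at all.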
