I'm trying to renew multiple SSL certificates in parallel using Certbot, but running into a locking issue. Looking for the best approach to handle this.
Problem:
When I try to renew 5 certificates simultaneously, Certbot throws:
"Another instance of Certbot is already running"
This is because Certbot uses a global lock file at: /var/lib/letsencrypt/.certbot.lock
Only one Certbot process can run at a time by default, so all parallel renewal attempts fail.
What I tried:
While debugging, I manually deleted the lock file and killed the stuck Certbot process — after that the 2nd certificate renewed successfully. But this is obviously not a proper solution for production.
What I'm looking for:
A clean approach to run multiple Certbot renewals in parallel without them blocking each other.
Any advice from people who have done this at scale would be really appreciated!
The lock file exists for a good reason - Certbot needs exclusive access to its configuration directory to avoid race conditions when writing certificate files and renewal configs. Bypassing it risks corrupted state.
That said, there are clean ways to speed up bulk renewals:
Option 1 - Sequential but fast: certbot renew already handles all certificates in one run. It processes them sequentially but skips any that are not yet due for renewal. For 5 certs this typically finishes in under a minute total, since most of the time is spent waiting on ACME challenges, not CPU work.
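For reference, the standard single serialized run (the deploy hook is an illustrative example, not required):

```shell
# One run walks every renewal config under /etc/letsencrypt/renewal
# and skips any certificate that is not yet due.
certbot renew --quiet --deploy-hook "systemctl reload nginx"
```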
Option 2 - Separate config directories: If you genuinely need parallel execution, run each Certbot instance with its own config/work/logs directories:
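A sketch of what that could look like (the directory paths and domain names are placeholders, adjust to your layout):

```shell
# Each instance uses its own config/work/logs directories, so each
# gets its own lock file and they can run concurrently.
certbot renew --cert-name example.com \
  --config-dir /etc/letsencrypt-1 \
  --work-dir /var/lib/letsencrypt-1 \
  --logs-dir /var/log/letsencrypt-1 &

certbot renew --cert-name example.org \
  --config-dir /etc/letsencrypt-2 \
  --work-dir /var/lib/letsencrypt-2 \
  --logs-dir /var/log/letsencrypt-2 &

wait
```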
Each instance gets its own lock file so they won't conflict. You would need to have originally obtained each cert with the matching --config-dir.
Option 3 - Use a different ACME client: If you manage many certificates at scale, tools like acme.sh or lego do not use global lock files and can run multiple instances natively. They also support parallel DNS challenge propagation which is often the real bottleneck.
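For illustration, issuing a certificate with lego via a DNS provider might look like this (the email, provider, domain, and state path are all placeholders):

```shell
# lego keeps its state under --path; giving each run its own path
# lets multiple instances operate side by side.
lego --email admin@example.com --dns cloudflare \
     -d example.com --path /var/lib/lego-1 run
```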
Option 4 - Let's Encrypt rate limits matter more: Before parallelizing, note that Let's Encrypt has rate limits (50 certificates per registered domain per week, 300 new orders per account per 3 hours). Parallel renewals hit these limits faster. Sequential certbot renew naturally stays within limits.
Personally, I think that if you are running a system large enough to need parallel operation of Certbot, you would be better off using a different ACME client. The lego client is commonly used and often gets new features sooner than Certbot. See: Lego :: Let's Encrypt client and ACME library written in Go. For example, per Let's Encrypt's notice, the new dns-persist-01 challenge will be available promptly in lego. And lego had support for ARI long before Certbot did.
But to answer your latest question: mostly yes. See this section of the Certbot docs about lock files: User Guide — Certbot 5.5.0.dev0 documentation. Beware that the --apache or --nginx plugins will not work well in parallel even with these overrides.
Note that once you start using non-standard locations you need to specify the config file for every command. For example, certbot certificates normally shows all of the certs on your system. But, once you stop using the default locations you now must do certbot certificates --config-dir X and will see just those certificates. The same is true for the renew command and so on.
I assume you want to run in parallel because you have noticed the performance problems with Certbot running at scale. In this community we generally discourage Certbot for large installations for that reason, although some people are satisfied with it.
If you provide more details about the underlying problem we may be able to provide more info.
The lock file issue points to a deeper design assumption: certbot is built for one server managing its own certificates. Running multiple instances is fighting that model rather than working with it.
If the goal is renewing many certificates quickly, the architecture that actually scales is a central system that handles all ACME interactions, then pushes certificates to servers that need them. The servers themselves don't need ACME clients at all.
Whether that's a hosted platform or something you build, it's a different approach than trying to parallelize certbot. https://www.certkit.io/blog/servers-shouldnt-need-acme walks through why the distributed model stops working as cert counts and renewal frequency grow.
In my case, I’m already using a central service that serializes certbot execution (single global lock), handles DNS challenges via API hooks, and stores/distributes certificates. So I’m avoiding parallel certbot runs and per-server ACME clients.
That said, I’m still invoking certbot per certificate rather than managing ACME flows directly.
Do you see this model holding up reasonably well for mid-scale usage, or would you expect it to hit architectural limits fairly quickly compared to a true centralized ACME implementation?
That’s fair — in my case, it’s actually a centralized setup rather than multiple independent servers.
Initially, I explored parallel execution because I observed that certbot runs could sometimes hang or take longer than expected, and I wanted to avoid blocking other renewals. However, that approach introduced conflicts, so I moved to a serialized model with a global lock and watchdog handling.
Right now, everything runs from a single service that processes certificates one by one, handles DNS challenges via API, and stores the results centrally. So it’s not distributed across servers, but it’s also not the typical single-server certbot usage either.
I was mainly trying to make the renewal process more reliable rather than scale it aggressively.
Do you know why? Usually that would be comms or perhaps the DNS API. If the DNS API is a contributor, the upcoming dns-persist-01 challenge would be helpful. In short, it allows a one-time TXT record that persists for multiple challenges (even indefinitely). The EFF has not announced when they plan to support it in Certbot, and Let's Encrypt itself is not yet in production with it but plans to be this quarter (here). If your DNS takes a while to sync worldwide, that would shrink the time needed for a renewal.
You sound like you would have checked, but just in case ... for parallel DNS challenges you need to ensure the DNS updates don't collide when using the dns-01 challenge.
Just for my curiosity does that mean you do something like certbot renew --cert-name X for individual certs and have watchdog kill it if it runs too long? And, then cleans up Certbot lock files?
I’ve seen a bunch of implementations similar to what you’re describing, and there’s no reason it won’t work, but you need to recognize that your timed cron jobs and deployment scripts are mission-critical infrastructure.
I like to appropriate Greenspun’s tenth rule for certificate automation:
Any sufficiently complicated Certbot implementation contains an ad hoc, informally specified implementation of a certificate lifecycle management system.
Certbot itself isn’t really the limiting factor here. It handles renewal, and it does it really well. It’s all the other stuff that’s going to become hard to manage:
how do you deploy the certificates to endpoints?
how do you handle adding/removing certificates from the system?
how do you handle unscheduled renewals from issues or ARI?
how do you audit the process for compliance or incident review?
how do you monitor that everything worked?
All of those are beyond the scope of certbot, and you end up building them yourself with all the scripts/systems/processes you build around it. That’s what’s going to break down over time when things change, or when you forget, or when you move on to a different position.
Initially, my plan was to run certbot in parallel to speed things up, especially considering DNS delays and propagation time. However, I ran into issues with DNS challenges colliding.
So now I’m moving towards a serialized/queue-based approach. The idea is that when one certificate is being processed by certbot, any new request will check if a lock file exists and wait until the current execution completes (i.e., until the lock is removed) before proceeding.
Additionally, if certbot takes too long or gets stuck, I’m using a watchdog to terminate the certbot process and clean up the lock before allowing the next request to proceed.
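A minimal sketch of that serialize-and-watchdog pattern using flock(1) and timeout(1); the lock path and timeout values are placeholders, and the echo stands in for the real certbot call:

```shell
#!/bin/sh
# Serialize runs on one lock file; queued callers block until it frees.
LOCKFILE=/tmp/cert-renew.lock
# flock -w 60: wait up to 60s for the current run to finish.
# timeout 300: the watchdog; kills the child if it runs longer than 5 min.
# Replace the echo with e.g.: certbot renew --cert-name "$1"
flock -w 60 "$LOCKFILE" timeout 300 echo "renewal step ran"
```

One advantage of flock over a hand-rolled lock file: the kernel releases the lock when the process exits, even if the watchdog kills it, so there is no stale lock to clean up by hand.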
Would you consider this approach reliable, or would you suggest a better way to handle this?
I don't know all your requirements but renewals and new requests are usually separate.
That is, a scheduled certbot renew command already serially processes each of the renewal config files in /etc/letsencrypt/renewal. Often not much work is done for each cert, although occasional checks of the ARI endpoint are needed (assuming Certbot 4.1 or later).
You can control the start of the renew, of course, by adjusting the cronjob or systemd timer for it. The default is typically twice a day, with an initial random sleep to avoid runs clustering at congested times (like at the top of each hour).
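For example, a cron entry in the spirit of the packaged defaults (the schedule and sleep bound are illustrative):

```shell
# Twice daily, with a random delay of up to an hour so renewals
# don't all hit the CA at the top of the hour.
0 0,12 * * * root perl -e 'sleep int(rand(3600))' && certbot renew -q
```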
A request for a new cert during a renew run will get rejected because of Certbot's own locks. The total duration of the renew command then dictates the total "lockout" for new requests. Presumably new requests are relatively infrequent so maybe not a problem. And, if your total renew duration is extremely long I still think a different ACME Client (like lego) would be better. There are a number of things that slow Certbot for larger installs.
Sorry if you already understood all of this. Just wanted to take a step back.
It wasn't clear to me why you'd need your own serialization unless you have a particularly dynamic cert pool like a hosting service or similar. Which, if so, suggests looking at a different ACME Client or strategy.