The operating system my web server runs on is (include version): CentOS 7
My hosting provider, if applicable, is: DigitalOcean
I can login to a root shell on my machine (yes or no, or I don't know): yes
I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no
The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): certbot 1.32.0
I have a webserver running nginx with 1000+ vhosts across several second-level domains. It had been running well until now, but renewals and new-domain registrations have started failing intermittently, throwing 404s or 500s even though the DNS records are correct. Another issue is that the certbot run is getting much slower; I noticed that once the vhost count reached 800+, it could take 5 minutes per subdomain. I think this happens because certbot is parsing more vhost configuration, since running certbot certonly is quite fast. Also, on my other webserver, which runs fewer vhosts from the same domains, renewal and certificate generation still work fine.
So my questions are:
Is there a known practical limit on the number of vhosts for certbot? I know there are weekly/daily rate limits, but is there a limit or recommendation on how many vhosts a webserver using certbot should have?
If there is such a limit, how do the larger webhosting companies manage high volumes of domains with certbot?
For very large nginx configs you may need to add this option when using the --nginx plug-in as authenticator:
--nginx-sleep-seconds NGINX_SLEEP_SECONDS
Number of seconds to wait for nginx configuration
changes to apply when reloading. (default: 1)
Increasing this may help with the 404 errors. The 500 errors would need more info to diagnose.
The --nginx plugin requires nginx to be reloaded by the plugin. The --webroot method uses the existing running nginx as-is, so it avoids that reload. But it works differently and may be a significant change at this stage of your system. Try the longer sleep seconds first.
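A sketch of what that looks like in practice (the value 30 is just an illustration; tune it for the size of your config):

```shell
# Give nginx extra time to finish applying config changes after each
# reload triggered by the plugin (default is 1 second)
certbot renew --nginx --nginx-sleep-seconds 30
```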
Thanks for suggesting that. I've tried --nginx-sleep-seconds=30 previously, but the result is still inconsistent (sometimes it works, sometimes it doesn't). I think I need to figure out the right number.
There are a handful of performance issues that arise from large amounts of vhosts.
Large hosting companies/installations typically do not use Certbot webserver plugins or even use Certbot at all, and instead use either an in-house solution or an application/library designed for large numbers of users.
There are a handful of ways you can workaround performance issues for use cases like yours. The first two things I can think of:
Use as many separate certificates as possible. Let's Encrypt will issue a certificate with up to 100 names, but you should bundle as few domains as possible into each certificate. This avoids issues with pending authorizations that are not cleaned up during a failed order: if you request a certificate with 100 domains and the challenge fails on the first one, there are 99 pending authorizations that must be cleaned up (verified, failed, or deleted), and there is an account-wide limit of 300 pending authorizations. This is a common cause of accounts getting stuck. I think Certbot versions after a certain point avoid this, but I believe older versions do not, and many other clients do not handle it at all.
As you guessed, Certbot needs to parse the nginx config files, which can take a lot of time and memory as the vhost count grows. If possible, don't use the nginx plugin; instead use standalone (Certbot can listen on a higher port, with nginx proxying port-80 challenge traffic to it) or webroot. Structure your nginx vhosts so the SSL certificates live in a predictable place, and use shared macros/includes for the nginx configuration.
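A sketch of the standalone-behind-nginx approach (the port 8888 and the use of a shared include are assumptions, not a prescribed setup):

```shell
# In a shared include pulled into every vhost's port-80 server block,
# forward ACME challenge requests to Certbot's standalone listener:
#
#   location /.well-known/acme-challenge/ {
#       proxy_pass http://127.0.0.1:8888;
#   }
#
# Then issue certificates without Certbot ever touching the nginx config.
# nginx is never reloaded by Certbot, and no vhost parsing happens:
certbot certonly --standalone --http-01-port 8888 -d example.com
```

With this layout, deploying the certificate is your own responsibility (a --deploy-hook that reloads nginx is a common choice), but issuance time no longer depends on how many vhosts you have.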
Certbot's apache/nginx plugins are designed for very basic use cases. If you have hundreds of domains, you really aren't the target audience for them.