Certbot auto renew kills apache


#1

Hello,

when my server runs certbot renew automtaticaly at night, apache is killed sometimes. The server has about 30+ Domains and certificates and sometimes its running smooth, sometimes not. Server resources (CPU, RAM) are far away from limits.

It produced this output:
In the syslog I can see that certbot is started and my server monitoring shows apache is down short time later. Nothing helpful in apache logs, letsencrypt.log looks like this:

2019-01-17 03:52:39,470:DEBUG:certbot.main:certbot version: 0.28.0
2019-01-17 03:52:39,471:DEBUG:certbot.main:Arguments: [’-q’]
2019-01-17 03:52:39,471:DEBUG:certbot.main:Discovered plugins: PluginsRegistry(PluginEntryPoint#apache,PluginEntryPoint#manual,PluginEntryPoint#null,PluginEntryPoint#standalone,PluginEntryPoint#webroot)
2019-01-17 03:52:39,494:DEBUG:certbot.log:Root logging level set at 30
2019-01-17 03:52:39,495:INFO:certbot.log:Saving debug log to /var/log/letsencrypt/letsencrypt.log
2019-01-17 03:52:39,506:DEBUG:certbot.plugins.selection:Requested authenticator <certbot.cli._Default object at 0x7fef0c1d2278> and installer <certbot.cli._Default object at 0x7fef0c1d2278>
2019-01-17 03:52:39,521:INFO:certbot.renewal:Cert not yet due for renewal
2019-01-17 03:52:39,521:DEBUG:certbot.plugins.selection:Requested authenticator apache and installer apache
2019-01-17 03:52:39,522:DEBUG:certbot.plugins.selection:Selecting plugin: * apache
Description: Apache Web Server plugin - Beta
Interfaces: IAuthenticator, IInstaller, IPlugin
Entry point: apache = certbot_apache.entrypoint:ENTRYPOINT
Initialized: <certbot_apache.override_debian.DebianConfigurator object at 0x7fef0c219780>
2019-01-17 03:52:39,523:DEBUG:certbot.plugins.storage:Plugin storage file /etc/letsencrypt/.pluginstorage.json was empty, no values loaded
2019-01-17 03:52:39,532:INFO:certbot.renewal:Cert not yet due for renewal
2019-01-17 03:52:39,533:DEBUG:certbot.plugins.selection:Requested authenticator apache and installer apache
2019-01-17 03:52:39,533:DEBUG:certbot.plugins.selection:Selecting plugin: * apache
Description: Apache Web Server plugin - Beta
Interfaces: IAuthenticator, IInstaller, IPlugin
Entry point: apache = certbot_apache.entrypoint:ENTRYPOINT

(I cut off about 30+ renew statements which are all the same beside the “Initialized: <certbot_apache.override_debian.DebianConfigurator object at 0xalphanumeric-string”)

Initialized: <certbot_apache.override_debian.DebianConfigurator object at 0x7fef0c1c9f28>
2019-01-17 03:52:39,715:DEBUG:certbot.plugins.storage:Plugin storage file /etc/letsencrypt/.pluginstorage.json was empty, no values loaded
2019-01-17 03:52:39,716:DEBUG:certbot.renewal:no renewal failures

My web server is (include version):Apache 2.4.29

The operating system my web server runs on is (include version): Ubuntu 18.04

I can login to a root shell on my machine: Yes

I’m using a control panel to manage my site: No

Any idea what I could improve in order to keep my server up and running without being killed by certbot auto renew?


#2

Something should be noting the stoppage of your Apache service.

if it’s not in the Certbot renewal log, then /var/log/apache2/error.log should have something marking the times at which graceful reloads were invoked by Certbot (and any errors following that).

Maybe even in dmesg - Certbot might explode in memory usage for some reason and cause the oomkiller to run.


#3

Well, here is an extract of the apache error.log of that day and time, no entries at 3:52am, only 1h before. The log entry from 4:39 was created by my apache2 restart command.

[Thu Jan 17 02:42:05.067119 2019] [mpm_event:alert] [pid 12009:tid 139624674277120] (11)Resource temporarily unavailable: AH03104: apr_thread_create: unable to create worker thread
[Thu Jan 17 04:39:42.849265 2019] [core:warn] [pid 22242:tid 139625090800576] AH00045: child process 12009 still did not exit, sending a SIGTERM
[Thu Jan 17 04:39:42.849317 2019] [core:warn] [pid 22242:tid 139625090800576] AH00045: child process 12010 still did not exit, sending a SIGTERM
[Thu Jan 17 04:39:44.851291 2019] [core:warn] [pid 22242:tid 139625090800576] AH00045: child process 12009 still did not exit, sending a SIGTERM
[Thu Jan 17 04:39:44.851334 2019] [core:warn] [pid 22242:tid 139625090800576] AH00045: child process 12010 still did not exit, sending a SIGTERM
[Thu Jan 17 04:39:46.853281 2019] [core:warn] [pid 22242:tid 139625090800576] AH00045: child process 12009 still did not exit, sending a SIGTERM
[Thu Jan 17 04:39:46.853322 2019] [core:warn] [pid 22242:tid 139625090800576] AH00045: child process 12010 still did not exit, sending a SIGTERM
[Thu Jan 17 04:39:48.855280 2019] [core:error] [pid 22242:tid 139625090800576] AH00046: child process 12009 still did not exit, sending a SIGKILL
[Thu Jan 17 04:39:48.857644 2019] [core:error] [pid 22242:tid 139625090800576] AH00046: child process 12010 still did not exit, sending a SIGKILL
[Thu Jan 17 04:39:49.861655 2019] [mpm_event:notice] [pid 22242:tid 139625090800576] AH00491: caught SIGTERM, shutting down
[Thu Jan 17 04:39:49.977135 2019] [mpm_event:notice] [pid 2579:tid 140347054758848] AH00489: Apache/2.4.29 (Ubuntu) OpenSSL/1.1.0g configured – resuming normal operations

I installed certbot the way described on the website for ubuntu18.04 and apache, I also have the same setup running on other servers without any issues.


#4

Regarding the crashes, has it only crashed if a domain is actually up for renewal?

For example, with the partial log in your original post, were all of the domains Cert not yet due for renewal or did some of them actually report as doing some work?

It might help to upload the full log from that incident to pastebin or something.

If you are getting crashes even when Certbot is not actually doing any renewal work, I would maybe investigate the possibility of a different kind of resource exhaustion - ulimits (or a very high number of workers/threads configured for Apache). This is a little suspicious, but I admit it could easily be a red herring:


#5

Can you show the cron job that is run (to include any scripts that may be called).
Thanks.


#6

Thank you all very much for your input!

First: Its a brand new Ubuntu 18.04 server (VServer, 16 Cores, 32 GB RAM), not much tweaking in configs done (yet), just moved about 40 domains and websites to the server, set up LE certs for all domains so @_az there is no cert due for renewal yet as they are all pretty new.

@rg305 There is also no cron job entry for renew as its all brand new, so I am wondering why certbot is running every night. Maybe it has been placed automatically during installation in some cron or timing settings?!

Also thank you @_az for the note about the worker thread message, I raised the values. If I run sudo certbot renew --dry-run I get the congratulations message about all renewals were successful, but during the renewal process, apache is killed and not restarted. The log states apache graceful restart successful mixed with mpm_event notices about unable to create worker thread. Here is an excerpt of the apache error.log. – php7.2-fpm.log is empty by the way:

[Thu Jan 17 22:12:05.320183 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00493: SIGUSR1 received.  Doing graceful restart
[Thu Jan 17 22:12:05.348560 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00489: Apache/2.4.29 (Ubuntu) OpenSSL/1.1.0g configured -- resuming normal operations
[Thu Jan 17 22:12:05.348587 2019] [core:notice] [pid 28302:tid 140719720258496] AH00094: Command line: '/usr/sbin/apache2'
[Thu Jan 17 22:12:06.353058 2019] [mpm_event:alert] [pid 30797:tid 140719303735040] (11)Resource temporarily unavailable: AH00474: apr_thread_create: unable to create listener thread
[Thu Jan 17 22:12:06.353084 2019] [mpm_event:alert] [pid 30798:tid 140719303735040] (11)Resource temporarily unavailable: AH03104: apr_thread_create: unable to create worker thread
[Thu Jan 17 22:12:12.211597 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00493: SIGUSR1 received.  Doing graceful restart
[Thu Jan 17 22:12:12.240922 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00489: Apache/2.4.29 (Ubuntu) OpenSSL/1.1.0g configured -- resuming normal operations
[Thu Jan 17 22:12:12.240947 2019] [core:notice] [pid 28302:tid 140719720258496] AH00094: Command line: '/usr/sbin/apache2'
[Thu Jan 17 22:12:15.053285 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00493: SIGUSR1 received.  Doing graceful restart
[Thu Jan 17 22:12:15.082096 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00489: Apache/2.4.29 (Ubuntu) OpenSSL/1.1.0g configured -- resuming normal operations
[Thu Jan 17 22:12:15.082118 2019] [core:notice] [pid 28302:tid 140719720258496] AH00094: Command line: '/usr/sbin/apache2'
[Thu Jan 17 22:12:16.085585 2019] [mpm_event:alert] [pid 31074:tid 140719720258496] (11)Resource temporarily unavailable: AH00480: apr_thread_create: unable to create worker thread
[Thu Jan 17 22:12:17.085958 2019] [core:error] [pid 28302:tid 140719720258496] AH00546: no record of generation 5 of exiting child 31074
[Thu Jan 17 22:12:18.279876 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00493: SIGUSR1 received.  Doing graceful restart
[Thu Jan 17 22:12:18.312527 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00489: Apache/2.4.29 (Ubuntu) OpenSSL/1.1.0g configured -- resuming normal operations
[Thu Jan 17 22:12:18.312557 2019] [core:notice] [pid 28302:tid 140719720258496] AH00094: Command line: '/usr/sbin/apache2'
[Thu Jan 17 22:12:19.315997 2019] [mpm_event:alert] [pid 31250:tid 140719303735040] (11)Resource temporarily unavailable: AH03104: apr_thread_create: unable to create worker thread
[Thu Jan 17 22:12:25.289197 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00493: SIGUSR1 received.  Doing graceful restart
[Thu Jan 17 22:12:25.317794 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00489: Apache/2.4.29 (Ubuntu) OpenSSL/1.1.0g configured -- resuming normal operations
[Thu Jan 17 22:12:25.317817 2019] [core:notice] [pid 28302:tid 140719720258496] AH00094: Command line: '/usr/sbin/apache2'
[Thu Jan 17 22:12:26.322057 2019] [mpm_event:alert] [pid 31369:tid 140719303735040] (11)Resource temporarily unavailable: AH03104: apr_thread_create: unable to create worker thread
[Thu Jan 17 22:12:26.322071 2019] [mpm_event:alert] [pid 31368:tid 140719303735040] (11)Resource temporarily unavailable: AH03104: apr_thread_create: unable to create worker thread
[Thu Jan 17 22:12:28.168231 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00493: SIGUSR1 received.  Doing graceful restart
[Thu Jan 17 22:12:28.196603 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00489: Apache/2.4.29 (Ubuntu) OpenSSL/1.1.0g configured -- resuming normal operations
[Thu Jan 17 22:12:28.196623 2019] [core:notice] [pid 28302:tid 140719720258496] AH00094: Command line: '/usr/sbin/apache2'
[Thu Jan 17 22:12:29.200391 2019] [mpm_event:alert] [pid 31506:tid 140719303735040] (11)Resource temporarily unavailable: AH03104: apr_thread_create: unable to create worker thread
[Thu Jan 17 22:12:31.420813 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00493: SIGUSR1 received.  Doing graceful restart
[Thu Jan 17 22:12:31.450223 2019] [mpm_event:notice] [pid 28302:tid 140719720258496] AH00489: Apache/2.4.29 (Ubuntu) OpenSSL/1.1.0g configured -- resuming normal operations
[Thu Jan 17 22:12:31.450251 2019] [core:notice] [pid 28302:tid 140719720258496] AH00094: Command line: '/usr/sbin/apache2'
[Thu Jan 17 22:12:32.453702 2019] [mpm_event:alert] [pid 31619:tid 140719720258496] (11)Resource temporarily unavailable: AH00480: apr_thread_create: unable to create worker thread
[Thu Jan 17 22:12:32.455029 2019] [mpm_event:alert] [pid 31621:tid 140719547123456] (11)Resource temporarily unavailable: AH03104: apr_thread_create: unable to create worker thread

What might be the best way to solve this issue?
Thank you all!


#7

Nice log, I think that paints a bit clearer picture.

What are the ulimits for the Apache master process currently?

cat /proc/28302/limits

I have to admit I am unfamiliar with the use of the Apache event worker in production (only ever see prefork around), so my interpretations might be a bit suspect.

But it would be good to see what tunings you have configured in your Apache config for it - https://httpd.apache.org/docs/2.4/mod/event.html


#8

Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 268435456 268435456 bytes
Max resident set unlimited unlimited bytes
Max processes 1546440 1546440 processes
Max open files 8192 8192 files
Max locked memory 16777216 16777216 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 1546440 1546440 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us


#9

Apache event worker is not graved in stone, I just wanted to have http/2 running on Ubuntu18.04 but I think I can also live with other modes if its more stable. Now I realize my other server are on different modes … maybe the reason why its running fine on these servers :smiley:


#10

Yeah … this stuff can be tricky to chase, because limits get enforced at a variety of points.

One last quick check I can suggest is:

systemctl status apache2

you should see a line like:

Tasks: 55 (limit: 4915)

Now, if you spam that status command as Certbot is running, you might be able to check whether the number of tasks grows and/or approaches the limit.


#11

Ubuntu may not use cron, try:
systemctl list-timers


#12

Tasks: 129 (limit: 195)

That sounds logic, thank you :wink:


#13

Please also show:
systemctl status apache2

Also, take a look within
/lib/systemd/system/apache2.service
You might want to include/increase to:

[Service]
TasksMax=infinity

Within Apache config, review:
ThreadsPerChild


#14

Is that when idle or when Certbot crashes the server? If it’s the idle count, I would give a solid 90% chance that this is your issue.

As an experiment, maybe try this (as root):

systemctl set-property apache2.service TasksMax=5000

Then leave it for a week or something.

Regarding rg305’s suggestion, I am not sure whether modifying the systemd file directly is a good idea, might get wiped out by package updates.


#15

129 is the value in normal operation mode of the server, no certbot running! I just watched it while running certbot renew and after the first 15 certs or so tasks count hits the limit of 195 and thats it.

I raised TaskLimit to 5000 now, I am a bit afraid to set infinity here, sorry @rg305 :wink:

Now I run certbot renew while watching apache error log and now it looks clean, just the apache graceful restart notices, nothing else. Certbot renew finished with full simulation of all renews and apache is still up and running. Yeah! Tasks count goes up to more than 300 at the same time :open_mouth:

I am pretty wondering to hit all these default limits already so early with only 40 websites/certificates, there is still some more waiting to be moved to this server, feels good to see it running smooth for the moment. Lets wait and watch the next days and weeks. Thank you once again @_az and @rg305!


#16

My understanding is that the maximum number of tasks should be bounded by your mpm_event configuration. So adding more websites should not cause this issue to occur again.

This is what really puzzles me. There’s no way that is the actual DefaultTasksMax for Ubuntu. So either you have that set somewhere on your system, or sytsemd is doing some weird (and unhelpful) things to your system:

sudo grep -R DefaultTasksMax /etc/systemd

Very happy for you though, you can keep http/2 now :slight_smile: !


#17

Just to give you feedback: Its up and running for 10 days now. Really looks like TaskLimit was the key here.