Unable to Renew or issue new certs - Killed


#1

I have about 100 domains that I’m serving through Apache. I’ve noticed that whenever I request a new cert or try to renew one, it is quite slow and certbot’s CPU usage goes up to 100%. After some time, I get ‘Killed’ in my output. Any pointers on where to start? The logs at /var/log/letsencrypt/letsencrypt.log don’t show much.

My domain is:

I ran this command:
sudo certbot renew

It produced this output:

Cert is due for renewal, auto-renewing…
Killed

My web server is (include version):
apache/2.4.18 (Ubuntu)

The operating system my web server runs on is (include version):
ubuntu 16.04.4 LTS

My hosting provider, if applicable, is:

I can login to a root shell on my machine (yes or no, or I don’t know):
yes

I’m using a control panel to manage my site (no, or provide the name and version of the control panel):
no


#2

Hi,

What’s your server config? (I suspect there’s not enough CPU to run… but that’s unlikely.)

@cpu

Thank you


#3

Do you suppose that @cpu can provide everyone with more CPU? :slight_smile:


#4

@stevenzhu which server config file are you asking for or suggesting I look at?


#5

For reference, the server itself is a c5.xlarge from Amazon Web Services: https://aws.amazon.com/ec2/instance-types/c5/


#6

I actually want to have some free RAM and large ssd :wink:

And then it’s beyond my knowledge how certbot can reach 100% CPU usage.

@cpu can you please take a look at this…

Thank you


#7

I’m afraid I don’t have any ideas about what could cause Certbot’s CPU usage to spike (despite my chosen nick perhaps indicating otherwise!). @schoen or @bmw are probably best positioned to debug.


#8

It would be good to see some of the logs from /var/log/letsencrypt; another thought is to increase Certbot’s verbosity with -v options (although I’m concerned that that may show more about network communications rather than about Certbot’s own actions). If we don’t learn anything from that, we can try to think of other debugging options.

One option might be Python’s trace module, since Certbot could be run through it:

https://docs.python.org/2/library/trace.html

Another is the profile module:

https://docs.python.org/2/library/profile.html

Using one of these, we could get more detailed low-level information about what Certbot was doing.
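
To give an idea of what the profiler reports, here is a minimal sketch. It is not Certbot-specific: slow_step is a hypothetical stand-in for whatever the program spends its time in, and profile_call is just a helper name I made up. It runs one call under cProfile and prints the most expensive entries by cumulative time.

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Hypothetical stand-in for an expensive step inside the program."""
    return sum(i * i for i in range(n))


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time and keep only the five most expensive entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buf.getvalue()


_, report = profile_call(slow_step, 200000)
print(report)
```

For a whole program, the command-line form python -m cProfile -o OUTPUT_FILE SCRIPT ARGS collects the same kind of statistics into a file that pstats can load afterwards.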


#9

@schoen Thanks for your suggestions

The log looks like this:

2018-06-01 15:19:10,898:DEBUG:certbot.storage:Should renew, less than 30 days before certificate expiry 2018-06-29 23:35:45 UTC.
2018-06-01 15:19:10,899:INFO:certbot.renewal:Cert is due for renewal, auto-renewing...
2018-06-01 15:19:10,899:DEBUG:certbot.plugins.selection:Requested authenticator webroot and installer apache
2018-06-01 15:19:11,149:DEBUG:certbot_apache.configurator:Apache version is 2.4.18 

That’s it. I think that certbot gets killed before it outputs anything helpful.

I’m somewhat of a Python novice, and I’m really not sure how to run certbot with the trace or profile libraries. I built a script that looks like this:

from subprocess import call
call(["certbot", "renew"])

which I ran like this:

python -m trace --count -C . renew.py

which generated a lot of files

pickle.cover  re.cover  renew.cover  sre_compile.cover  sre_parse.cover  struct.cover  subprocess.cover  trace.cover

Where can I go from here?


#10

That’s a good thought, but unfortunately the -m trace isn’t going to survive the subprocess.call operation: subprocess.call makes the operating system start a fresh copy of Python, which doesn’t know about the -m option. Therefore, your existing trace files describe only the wrapper script that launches certbot renew, rather than the actions that Certbot itself took.

In order to get the trace for Certbot itself, you would have to run Certbot itself under a Python interpreter that has -m trace. I suspect you could accomplish this with something like

python -m trace --count -C . $(which certbot) renew

I think the profiler output might be more relevant than the trace output, but getting the profiler output might also be a little more work, so maybe we should start with the trace output.
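
To illustrate why tracing does not carry over into a subprocess, here is a small sketch (the function names are made up for the example). The parent installs a trace function with sys.settrace, then starts a fresh interpreter and asks it whether it has one.

```python
import subprocess
import sys


def child_has_tracer():
    """Start a fresh interpreter and ask whether a trace function is installed there."""
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.gettrace() is not None)"]
    )
    return out.decode().strip() == "True"


def tracer(frame, event, arg):
    # A do-nothing trace function; returning None disables per-line tracing.
    return None


sys.settrace(tracer)
parent_traced = sys.gettrace() is not None
child_traced = child_has_tracer()
sys.settrace(None)

print("parent traced:", parent_traced)
print("child traced: ", child_traced)
```

The parent reports a tracer and the child does not, which is the same situation as a traced wrapper script calling an untraced certbot.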


#11

This gave me:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/trace.py", line 819, in <module>
    main()
  File "/usr/lib/python2.7/trace.py", line 807, in main
    t.runctx(code, globs, globs)
  File "/usr/lib/python2.7/trace.py", line 513, in runctx
    exec cmd in globals, locals
  File "/usr/bin/certbot", line 11, in <module>
    load_entry_point('certbot==0.22.2', 'console_scripts', 'certbot')()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 561, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 553, in get_distribution
    dist = get_provider(dist)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 427, in get_provider
    return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 715, in find
    raise VersionConflict(dist, req)
pkg_resources.VersionConflict: (certbot 0.19.0 (/usr/lib/python2.7/dist-packages), Requirement.parse('certbot==0.22.2'))

I’m not sure why there is a reference to certbot 0.19.0. I believe I attempted to upgrade it to the latest version while I was running into this issue.

certbot --version returns

certbot 0.22.2


#12

I forgot how this is supposed to work, but maybe try with python3?


#13

This worked

python3 -m trace --count -C . $(which certbot) renew

But the output was the same, ‘Killed’, and no files were created.


#14

Hmmm, I wonder if the trace module only creates the files upon a successful exit?

I just tried this by writing a program that sends itself a SIGKILL (via import os; os.kill(os.getpid(), 9)) and it indeed didn’t give any trace output.

I’ll check whether cProfile has a similar or a different behavior.
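
A self-contained sketch of that kind of experiment, under the assumption that a tool writing its results at the end of a run behaves like any other exit-time work: the child registers an atexit handler that creates a marker file, then either exits normally or sends itself SIGKILL. The names CHILD and run_child are just for this example, and it is POSIX-only.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Child program: registers exit-time work (creating a marker file), then either
# returns normally or sends itself SIGKILL, depending on its second argument.
CHILD = """
import atexit, os, signal, sys
marker, mode = sys.argv[1], sys.argv[2]
atexit.register(lambda: open(marker, "w").close())
if mode == "kill":
    os.kill(os.getpid(), signal.SIGKILL)
"""


def run_child(mode):
    """Return (returncode, marker_was_written) for one run of the child."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "exit-marker")
        proc = subprocess.run([sys.executable, "-c", CHILD, marker, mode])
        return proc.returncode, os.path.exists(marker)


print("clean exit:", run_child("clean"))
print("SIGKILL:   ", run_child("kill"))
```

On a clean exit the marker exists and the return code is 0; after SIGKILL the return code is the negated signal number and the marker is missing, because the interpreter never gets a chance to run its exit handlers.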


#15

By the way, could you try running ulimit -a to see if you have a per-process CPU-time limit? You might be able to temporarily remove that limit if it’s the reason that the process is getting killed.

For example (run in a subshell, so that the limit doesn’t stick to your login shell),

(ulimit -t 1; echo 'scale=100000; 4*a(1)' | bc -l)

results in Killed (the bc process will receive SIGKILL when it takes more than 1 second of total CPU time).
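
The same effect can be reproduced from Python, as a rough sketch: the resource module sets RLIMIT_CPU in a child process just before it starts, and the parent decodes the return code into a signal name. The helper names run_with_cpu_limit and describe are made up here. On Linux I would expect SIGKILL when the soft and hard limits are equal, but other systems may deliver SIGXCPU instead, so treat the exact signal as an assumption.

```python
import resource
import signal
import subprocess
import sys


def run_with_cpu_limit(code, seconds):
    """Run a Python snippet in a child whose CPU time is capped at `seconds`."""
    def set_limit():
        # Runs in the child just before exec; similar in spirit to `ulimit -t`.
        resource.setrlimit(resource.RLIMIT_CPU, (seconds, seconds))

    proc = subprocess.run([sys.executable, "-c", code], preexec_fn=set_limit)
    return proc.returncode


def describe(returncode):
    """Translate a negative return code into the name of the terminating signal."""
    if returncode < 0:
        return signal.Signals(-returncode).name
    return "exit status %d" % returncode


# A busy loop burns CPU time until the kernel enforces the one-second limit.
print(describe(run_with_cpu_limit("while True: pass", 1)))
```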


#16

I’ve confirmed that the cProfile module has the same behavior (if the Certbot process is killed while under profiling, no profiling statistics are reported). So, the question about the ulimit might really be relevant because we might need to stop Certbot from getting killed in order to get trace or profile data out.
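
One possible way around this, offered only as a sketch: Python 3’s faulthandler module can dump the current stack to a file at a fixed interval, and it writes straight to the file descriptor, so whatever was dumped before a SIGKILL is still on disk afterwards. The example below demonstrates this on a stand-in child program whose busy function just spins. How to enable the same call inside the Certbot process (for instance from a wrapper script) is an assumption I haven’t tested and is not shown.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

# Stand-in child: asks faulthandler to dump all thread stacks to a log file
# every 0.2 seconds, then spins forever in busy().
CHILD = """
import faulthandler, sys
log = open(sys.argv[1], "w")
faulthandler.dump_traceback_later(0.2, repeat=True, file=log)

def busy():
    while True:
        pass

busy()
"""


def read_text(path):
    """Return the file's contents, or an empty string if it does not exist yet."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return ""


def stacks_before_kill(max_wait=10.0):
    """Start the child, wait for a stack dump, SIGKILL it, return (returncode, log)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stacks.log")
        proc = subprocess.Popen([sys.executable, "-c", CHILD, path])
        deadline = time.time() + max_wait
        while time.time() < deadline and "busy" not in read_text(path):
            time.sleep(0.1)
        proc.send_signal(signal.SIGKILL)
        proc.wait()
        return proc.returncode, read_text(path)


returncode, log_text = stacks_before_kill()
print(returncode)
print(log_text)
```

The log names the function the child was stuck in even though the child itself never exited cleanly.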


#17

ulimit -t returns unlimited, so it doesn’t look like the OS is killing the process over CPU time. I believe that one of the domains was hanging the process, although I’m not sure why. I went through my list of expiring domains and renewed each one individually using the certonly option, and was able to work through my backlog. I’m also using certonly for new domains (rather than the Apache installer).
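
A minimal sketch of the one-domain-at-a-time approach, assuming the webroot authenticator that appears in the log earlier in the thread. The domain names, the webroot path, the timeout value and the helper names are all hypothetical, and by default the function only returns the commands it would run instead of executing them.

```python
import subprocess


def certonly_command(domain, webroot):
    """Build the argv for requesting one certificate with the webroot authenticator."""
    return ["certbot", "certonly", "--webroot", "-w", webroot, "-d", domain]


def renew_one_by_one(domains, webroot, timeout=300, execute=False):
    """Handle each domain separately so one hang or failure does not block the rest."""
    results = {}
    for domain in domains:
        cmd = certonly_command(domain, webroot)
        if not execute:
            # Preview mode: record the command line instead of running it.
            results[domain] = " ".join(cmd)
            continue
        try:
            results[domain] = subprocess.run(cmd, timeout=timeout).returncode
        except subprocess.TimeoutExpired:
            results[domain] = "timed out"
    return results


# Placeholder domains and webroot path, for illustration only.
for domain, outcome in renew_one_by_one(
    ["a.example", "b.example"], "/var/www/html"
).items():
    print(domain, "->", outcome)
```

Running per domain also makes it obvious which name is the slow one, since each result is recorded separately.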


#18

I wonder if it was a different resource—maybe I should have suggested ulimit -a.

It’s a pity that the Unix architecture doesn’t provide us a way to get a more specific error when the OS kills a process based on a resource limit.
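
For checking the other per-process limits from inside Python, a small sketch using the standard resource module (current_limits and show are names made up for this example). It lists the soft and hard value of every RLIMIT_* constant the platform defines. Note that these are only per-process limits; anything enforced system-wide would not show up here.

```python
import resource


def current_limits():
    """Return {name: (soft, hard)} for every RLIMIT_* constant this platform defines."""
    limits = {}
    for name in dir(resource):
        if not name.startswith("RLIMIT_"):
            continue
        try:
            limits[name] = resource.getrlimit(getattr(resource, name))
        except (ValueError, OSError):
            # Constant is defined but not supported by the running kernel.
            continue
    return limits


def show(value):
    """Render a limit value the way ulimit does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


for name, (soft, hard) in sorted(current_limits().items()):
    print("%-20s soft=%-12s hard=%s" % (name, show(soft), show(hard)))
```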


#19

Any hunches as to what might be the next best place to look?

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30464
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 30464
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


#20

Huh, none of those look particularly bad or unusual to me!