Bug: installing with certbot impedes further nginx conf changes without reboot

This is promising; I had an unverified hypothesis that 20.04 and that version of certbot did not mix well.
Will attempt new install on those lines with a clear head in short order. Gracias.

Oh gosh, yes... it's likely due to the Certbot version.

The Linux/OS-specific packages are all EXTREMELY out of date and not recommended for use, as the distributions themselves don't keep their packages up to date. While most releases dating back many years are compatible with the Let's Encrypt API, the majority of work in Certbot's upgrade cycles has been on webserver/OS integrations, addressing edge cases and situations that are often like the one you are experiencing (and likely include it). The Certbot project moved to snapd because it lets end users run the most current version.

If you're worried about running snapd just for certbot, I believe you can use a hook in the crontab certbot installs to enable/disable the snapd daemon during renewals. This will give you precise control over when to run that daemon.

Reporting back with certbot 1.20.0 installed.

Fresh VPS. Installed certbot via snap.
Created a conf file for nginx in addition to the default file
[here I may have erred; maybe I should have tested on just the default conf],
then obtained a certificate. All appeared well. Checked the conf file for certbot's changes.
I proceeded to remove a blank line... then... same error:

$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xe" for details.
$ systemctl status nginx.service
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sat 2021-10-30 04:49:22 UTC; 40s ago
       Docs: man:nginx(8)
    Process: 19260 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Process: 19261 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=1/FAILURE)

$ journalctl -xe
    
Oct 30 04:18: 5 fidely.club systemd[871]: run-user-0.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit UNIT has successfully entered the 'dead' state.
Oct 30 04:18:35 fidely.club systemd[871]: run-user-0.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit UNIT has successfully entered the 'dead' state.

Note the journalctl timestamps: the latest entries (04:18:35) predate the failed restart at 04:49:22, and all previous entries had succeeded.

I then attempted to issue a certificate for a modified default nginx.conf. Certbot failed, but this time it provided a new set of data:

Encountered exception during recovery: certbot.errors.MisconfigurationError: nginx restart failed:
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] still could not bind()
nginx restart failed:
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] still could not bind()

Rebooting server. Yes nginx can be soft restarted after reboot.
Ask for a cert. It generates, conf file seems OK. https pages serve up.
Modify a conf file by removing a blank line and the bug re-emerges:

$ sudo reboot
$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
$ sudo certbot --nginx -d demo.saltalafila.online
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Requesting a certificate for demo.saltalafila.online
Successfully received certificate.
$ sudo vim /etc/nginx/sites-enabled/default
$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xe" for details.
$ systemctl status nginx.service
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sat 2021-10-30 05:11:08 UTC; 52s ago
       Docs: man:nginx(8)
    Process: 1316 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Process: 1317 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=1/FAILURE)

$ journalctl -xe  # had no info of use as the last entry had a timestamp prior to making above change
-- Startup of the manager took 104817 microseconds.
Oct 30 05:07:50 fidely.club sshd[1047]: Received disconnect from 5.171.96.112 port 15399:11: disconnected by user
Oct 30 05:07:50 fidely.club sshd[1047]: Disconnected from user jerdvo 5.171.96.112 port 15399

re-doing the process without requiring certbot intervention

$ sudo reboot
$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
$ sudo vim /etc/nginx/sites-enabled/default
# remove blank line & save
$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
$

Out of thoroughness, I then attempted sudo certbot renew --dry-run.
One cert passed; the second had some test failures (huh?). Log enclosed:
letsencrypt.txt (73.5 KB)
Right, do that again...
sudo certbot renew --dry-run
and the above stream of "certbot.errors.MisconfigurationError: nginx restart failed:" was presented anew upon attempting the first certificate renewal.

Minimised test. New VPS, with only the default.conf configuration for nginx.

$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
$ sudo certbot --nginx -d testtwo.fidely.club
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Requesting a certificate for testtwo.fidely.club

Successfully received certificate. [...]
$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xe" for details.

I.e., without even editing the .conf file anew, certbot has touched something improperly.
A reboot allows editing of conf files and soft restarting of nginx.

@dvo Looks like you continue to have this problem in Certbot:

(this is example msg and pid from earlier but same symptoms persist)

Can you show the line in your nginx.conf for pid? It will be something like:
pid /var/run/nginx.pid;
I ask because most of your examples of the conf show the server sections only.

Also, show the contents of the nginx system service file. Location depends on your system. Maybe:
/etc/systemd/system/nginx.service
or
/usr/lib/systemd/system/nginx.service

And show the pid file with: ls -l (pid file name) (hopefully both of the above use the same name)

With your new fresh VPS, have you tried just two consecutive sudo service nginx restart without any certbot in between? Does that work?
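
A minimal sketch of the comparison being asked for here, assuming the pid path appears as a plain `pid ...;` directive in nginx.conf and as a `PIDFile=` line in the unit file. The function names are made up for illustration; they are not part of nginx, systemd, or Certbot.

```shell
# Print the path given by the first "pid" directive in an nginx config file.
nginx_conf_pid() {
    sed -n 's/^[[:space:]]*pid[[:space:]]\{1,\}\([^;]*\);.*/\1/p' "$1" | head -n 1
}

# Print the first PIDFile= value from a systemd unit file.
unit_pidfile() {
    sed -n 's/^PIDFile=//p' "$1" | head -n 1
}

# Compare the two paths. $1 = nginx.conf, $2 = unit file.
# Returns non-zero when the two strings differ.
compare_pid_paths() {
    local conf_pid
    local unit_pid
    conf_pid=$(nginx_conf_pid "$1")
    unit_pid=$(unit_pidfile "$2")
    if [ "$conf_pid" = "$unit_pid" ]; then
        echo "match: $conf_pid"
    else
        echo "mismatch: nginx.conf=$conf_pid unit=$unit_pid"
        return 1
    fi
}
```

Example call: compare_pid_paths /etc/nginx/nginx.conf /lib/systemd/system/nginx.service. Both files are passed as arguments because the unit file location varies by system. The check is purely textual, so a difference such as /var/run versus /run may still be the same file where /var/run is a symlink.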

Note, this is a fresh install and nginx.conf was not modified; it is in default form.
pid /run/nginx.pid;

/usr/lib/systemd/system/nginx.service

[Unit]
Description=A high performance web server and a reverse proxy server
Documentation=man:nginx(8)
After=network.target

[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
ExecReload=/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
ExecStop=-/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid
TimeoutStopSec=5
KillMode=mixed

[Install]
WantedBy=multi-user.target

$ ls -l /run/nginx.pid
-rw-r--r-- 1 root root 5 Oct 30 08:44 /run/nginx.pid

Yes, sudo service nginx restart does run twice consecutively without certbot invoked. As illustrated above, this issue arises once certbot is invoked.

Right, sorry, forgot you had that in just prior post #23.

Is your nginx still the one built with passenger? It may well not have anything to do with it. But, this kind of failure is not at all common so there is something unusual with your setup / config.

Would you re-run your test series in post #24 but show the running pids for nginx before every sudo command. Something like this but check in case your command format is different: sudo ps -eF | grep -E "nginx|PID"

Something goes wrong with the reload of nginx that Certbot does after the config updates. It looks like it cannot send a signal to the running nginx. And, the running nginx remains in an odd state. Seeing the pids might help unlock this mystery.

Update: Also, important, save the /var/log/letsencrypt/letsencrypt.log for the Certbot command. Show the line in the log that shows the failed reload command and the pid it shows. Probably worth looking at /run/nginx.pid before and after certbot as well. I think we will find something interesting comparing them all especially before and after the Certbot command.
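
A rough sketch of how the relevant log lines could be pulled out, assuming the failures are logged in the forms that appear in this thread (`kill(PID, 1) failed`, `open() "/run/nginx.pid" failed`, `nginx restart failed`). The function names are hypothetical, and other nginx or Certbot versions may word these messages differently.

```shell
# Print, with line numbers, the log lines that look like a failed reload/restart.
# $1 = path to a log file such as /var/log/letsencrypt/letsencrypt.log
reload_failures() {
    grep -n -E 'kill\([0-9]+, [0-9]+\) failed|nginx\.pid" failed|nginx restart failed' "$1"
}

# Print only the PIDs that nginx tried to signal and could not.
failed_signal_pids() {
    sed -n 's/.*kill(\([0-9][0-9]*\), [0-9][0-9]*) failed.*/\1/p' "$1"
}
```

The PID printed by failed_signal_pids can then be compared with the contents of /run/nginx.pid and with the ps output taken before and after the Certbot run.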

One hint that may help shed some light: Certbot doesn't use systemd to reload or start nginx. (There's some history behind this behavior).

It tries to reload nginx using:

nginx -c /etc/nginx/nginx.conf -s reload

and if that fails, it tries to start nginx with:

nginx -c /etc/nginx/nginx.conf

The one behavior (that I'm aware of) which doesn't work so well is if nginx is already stopped when you run Certbot:

  1. Stop nginx
  2. Run certbot --nginx
  3. Try to restart nginx with service nginx restart

At (2), Certbot fails to send a reload signal using nginx -s reload, so it assumes nginx is not running, and it starts it using nginx -c /etc/nginx/nginx.conf. systemd is not aware of this.

At (3), systemd does not realize that nginx is actually running already (even though the pidfile is present) and tries to start it. nginx fails to start because the ports are occupied by the other instance of nginx. systemd also wipes out the /run/nginx.pid file at (3) which makes the situation worse, since now using nginx to control nginx won't work either, meaning you have to killall nginx to get back to a working state.
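
A sketch of the recovery path described above, for getting out of the stuck state without a reboot. The killall step comes from the explanation here; following it with `systemctl start nginx` is an assumption about how to hand control back to systemd. The function name and the `--run` switch are made up for illustration, and by default nothing is executed: the steps are only printed.

```shell
# Print the recovery steps, or execute them when called with --run.
recover_nginx() {
    local runner="echo"
    if [ "${1:-}" = "--run" ]; then
        runner=""
    fi
    $runner sudo killall nginx          # stop every nginx, however it was started
    $runner sudo systemctl start nginx  # start it again under systemd's control
}
```

Calling recover_nginx with no arguments shows the two commands for review; recover_nginx --run executes them.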

It's all a bit ugly and I'm not too sure what is stopping Certbot from controlling nginx with systemd instead, but I will try to look into it later in the week.

For now, you can choose from:

  1. (Recommended) Make sure nginx is running when you run Certbot, or
  2. Use nginx -s reload after you make your modifications (this will work whether or not nginx was started with systemd), or
  3. Use --webroot and configure nginx manually.

It's also possible that there is some reason for your trouble other than "nginx was stopped when you ran Certbot" (such as if nginx was crashing during the initial reload attempt), but I think the explanation about systemd would probably still underlie it.
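
Option 1 could be enforced with a small guard along these lines. This is only a sketch: `with_nginx_active` and the `NGINX_ACTIVE_CHECK` variable are hypothetical names, and the default check relies on `systemctl is-active --quiet nginx`. It would not catch the case where nginx crashes during the reload itself.

```shell
# Command used to decide whether nginx is running under systemd.
# Overridable so the check can be swapped out (for example when testing).
NGINX_ACTIVE_CHECK="${NGINX_ACTIVE_CHECK:-systemctl is-active --quiet nginx}"

# Run the given command only if the check succeeds.
with_nginx_active() {
    if $NGINX_ACTIVE_CHECK; then
        "$@"
    else
        echo "nginx.service is not active; start it with systemd before running: $*" >&2
        return 1
    fi
}
```

Example call, with a placeholder domain: with_nginx_active sudo certbot --nginx -d example.com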

Thanks @rg305 and @MikeMcQ for the big help so far.

Thanks @_az. You saved me the trouble of reporting the problem with nginx not running before using --nginx. I just ran into that while trying to work on this problem.

In this case the certbot reload fails but due to the nginx process not running. From the letsencrypt log:
nginx: [alert] kill(1547, 1) failed (3: No such process)
Any ideas why this would fail like this? I looked at the nginx plug-in code and it looks to start nginx after a failed reload. That start without using systemd likely explains why this poster ends up with an unstable nginx after running Certbot - as you note. But, any clues why their reload would fail this way? Or, would you recommend this person abandon the --nginx plug-in in favor of --webroot?

Update: Ack, I was not clear when I said "but due to the nginx process not running". We believe that nginx was running but the certbot reload says it was not - failing with "no such process". If OP follows the steps I outlined in post #27 we will find out more.

I'd guess because the pidfile has become out of sync with reality, through a combination of previous commands (i.e. systemctl and nginx fighting over it). As a result, the reload HUP signal gets sent to some non-existent process.
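
One way to test that guess is to check whether the PID recorded in the pidfile belongs to a live process. A minimal sketch, with a made-up function name; it uses `kill -0`, which sends no signal and only tests whether the process can be signalled, so it needs to run as root to give a meaningful answer for a root-owned nginx master.

```shell
# Report the state of a pid file: "missing", "alive PID" or "stale PID".
# $1 = path to the pid file, e.g. /run/nginx.pid
pidfile_state() {
    local pidfile
    local pid
    pidfile="$1"
    if [ ! -e "$pidfile" ]; then
        echo "missing"
        return 0
    fi
    # Strip the trailing newline and any other whitespace.
    pid=$(tr -d '[:space:]' < "$pidfile")
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "alive $pid"
    else
        echo "stale $pid"
    fi
}
```

A "stale" or "missing" result right before running Certbot would be consistent with the reload signal going to a non-existent process.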

The choice to use --nginx or not is up to whatever OP is comfortable with. My interpretation is that it would be easiest to make sure nginx is running when you run Certbot, and there shouldn't be any problems. But certonly --webroot is a solid choice too, if you are willing to configure the certificate in nginx yourself, and remember to include a --deploy-hook to reload nginx.
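
For reference, a sketch of what the certonly --webroot route with a deploy hook might look like. The domain and web root are placeholders supplied by the caller, and using `systemctl reload nginx` as the hook is an assumption (chosen so that systemd stays in charge of nginx). The function name is hypothetical, and it only prints the command for review rather than running it.

```shell
# Print a certbot command line for the webroot method.
# $1 = domain, $2 = web root directory served by nginx for that domain
webroot_command() {
    printf 'certbot certonly --webroot -w %s -d %s --deploy-hook "systemctl reload nginx"\n' "$2" "$1"
}
```

Example call: webroot_command example.com /var/www/html. The printed command would then be run with sudo once the paths have been checked.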

Is there a particular format recommended? Is nginx -s reload always safe?

Any systemctl/nginx interaction problems with hooks?

The explanation by _az in post #28 here was very helpful. The history in the link provided was also helpful. I found this post and this one by bmw also instructive (the total history is long).

@dvo If you are still interested I am willing to try to help debug the reason behind certbot's reload failing with 'no such process'. You were right in thinking certbot "touches something improperly" (the way it starts nginx) but this should only happen after a failed reload. Let us know if you want any help proceeding. Cheers

Yes, I am interested in getting through this issue. However, being outside my usual zone, I need a clear head and a linear test plan based on the above suggestions. Will address shortly.

Here goes a play-by-play of the test session:

$ sudo reboot
[...]
$ ls -l /run/nginx.pid
-rw-r--r-- 1 root root 4 Nov  1 09:14 /run/nginx.pid
$ sudo service nginx status
   returns satisfactorily
$ sudo ps -eF | grep -E "nginx|PID"
UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root         757       1  0 27379  5576   0 09:14 ?        00:00:00 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
www-data     760     757  0 27519 13352   0 09:14 ?        00:00:00 nginx: worker process
jerdvo      1120    1035  0  2039  2428   0 09:17 pts/0    00:00:00 grep --color=auto -E nginx|PID
$ sudo service nginx restart
~$
$ sudo vim /etc/nginx/sites-enabled/default
# added new server name
$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
$ ls -l /run/nginx.pid
-rw-r--r-- 1 root root 5 Nov  1 09:20 /run/nginx.pid
$ sudo ps -eF | grep -E "nginx|PID"
UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root        1339       1  0 27379  5572   0 09:20 ?        00:00:00 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
www-data    1342    1339  0 27519 11220   0 09:20 ?        00:00:00 nginx: worker process
jerdvo      1348    1035  0  2039  2448   0 09:21 pts/0    00:00:00 grep --color=auto -E nginx|PID
$ sudo service nginx status
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-11-01 09:20:17 UTC; 4min 2s ago
       Docs: man:nginx(8)
    Process: 1313 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Process: 1325 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
   Main PID: 1339 (nginx)
      Tasks: 16 (limit: 1136)
     Memory: 14.1M
    CGroup: /system.slice/nginx.service
             ├─1326 Passenger watchdog
             ├─1329 Passenger core
             ├─1339 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
             └─1342 nginx: worker process

Nov 01 09:20:17 [...] systemd[1]: nginx.service: Succeeded.
Nov 01 09:20:17 [...] systemd[1]: Stopped A high performance web server and a reverse proxy server.
Nov 01 09:20:17 [...] systemd[1]: Starting A high performance web server and a reverse proxy server...
Nov 01 09:20:17 [...] systemd[1]: Started A high performance web server and a reverse proxy server.

now invoking certbot

$ sudo certbot --nginx -d testthree.fidely.club
#  [...]  Successfully received certificate. [...]  Deploying certificate
$ ls -l /run/nginx.pid
-rw-r--r-- 1 root root 5 Nov  1 09:25 /run/nginx.pid
$ sudo service nginx status
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
     Active: failed (Result: core-dump) since Mon 2021-11-01 09:25:34 UTC; 1min 3s ago
       Docs: man:nginx(8)
    Process: 1313 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Process: 1325 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
   Main PID: 1339 (code=dumped, signal=SEGV)
      Tasks: 0 (limit: 1136)
     Memory: 1.6M
     CGroup: /system.slice/nginx.service

Nov 01 09:20:17 [...] systemd[1]: nginx.service: Succeeded.
Nov 01 09:20:17 [...] systemd[1]: Stopped A high performance web server and a reverse proxy server.
Nov 01 09:20:17 [...] systemd[1]: Starting A high performance web server and a reverse proxy server...
Nov 01 09:20:17 [...] systemd[1]: Started A high performance web server and a reverse proxy server.
Nov 01 09:25:34 [...] systemd[1]: nginx.service: Main process exited, code=dumped, status=11/SEGV
Nov 01 09:25:34 [...] systemd[1]: nginx.service: Killing process 1443 (nginx) with signal SIGKILL.
Nov 01 09:25:34 [...] systemd[1]: nginx.service: Killing process 1443 (nginx) with signal SIGKILL.
Nov 01 09:25:34 [...] systemd[1]: nginx.service: Failed with result 'core-dump'.
$ sudo ps -eF | grep -E "nginx|PID"
UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root        1478       1  0 27646 18180   0 09:25 ?        00:00:00 nginx: master process nginx -c /etc/nginx/nginx.conf
www-data    1501    1478  0 27750 14996   0 09:25 ?        00:00:00 nginx: worker process
jerdvo      1519    1035  0  2039  2584   0 09:27 pts/0    00:00:00 grep --color=auto -E nginx|PID
$ ls -l /run/nginx.pid
-rw-r--r-- 1 root root 5 Nov  1 09:25 /run/nginx.pid

The nginx master process now shows a different command line, nginx -c /etc/nginx/nginx.conf, compared with the original state and with restarting without certbot's intervention: /usr/sbin/nginx -g daemon on; master_process on;
At this point we are in the failing state:

$ sudo service nginx restart
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xe" for details.
$ systemctl status nginx.service
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2021-11-01 09:33:44 UTC; 34s ago
       Docs: man:nginx(8)
    Process: 1874 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Process: 1875 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=1/FAILURE)

$ journalctl -xe
Nov 01 09:15:56 [...] sshd[1032]: Received disconnect from 5.171.89.128 port 17016:11: disconnected by user
Nov 01 09:15:56 [...] sshd[1032]: Disconnected from user jerdvo 5.171.89.128 port 17016
Nov 01 09:29:18 [...] sshd[1600]: Received disconnect from 5.171.89.128 port 17032:11: disconnected by user
Nov 01 09:29:18 [...] sshd[1600]: Disconnected from user jerdvo 5.171.89.128 port 17032
Nov 01 09:29:28 [...] sshd[1681]: Received disconnect from 5.171.89.128 port 16499:11: disconnected by user
Nov 01 09:29:28 [...] sshd[1681]: Disconnected from user jerdvo 5.171.89.128 port 16499
Nov 01 09:31:29 [...] sshd[1778]: Received disconnect from 5.171.89.128 port 17101:11: disconnected by user
Nov 01 09:31:29 [...] sshd[1778]: Disconnected from user jerdvo 5.171.89.128 port 17101
Nov 01 09:31:58 [...] sshd[1858]: Received disconnect from 5.171.89.128 port 16861:11: disconnected by user
Nov 01 09:31:58 [...] sshd[1858]: Disconnected from user jerdvo 5.171.89.128 port 16861

@_az's suggestion of nginx -s reload was tried next. Alas...

$ ls -l /run/nginx.pid
ls: cannot access '/run/nginx.pid': No such file or directory
$ sudo nginx -s reload
nginx: [error] open() "/run/nginx.pid" failed (2: No such file or directory)

I enclose the letsencrypt log, letsencrypt_log.txt (54.5 KB),
as well as process.txt (2.7 KB),
a file that documents the steps taken in creating the VPS before invoking certbot.
I believe this allows a fully replicable instance.

@dvo Excellent. I will study this and respond later this morning. Thanks

Please show:
grep -R nginx.pid /etc

@rg305 Rudy, can you hold off for a minute. nginx is failing with segv and the mix of certbot starting nginx directly and also dvo using systemd and other nginx packages is messy. I at least am close to reporting a better description of what is happening. Maybe even a resolution - not sure.

NP
I just thought that might show something that might be currently overlooked.
Almost as if... nginx is being started with a specific pid file path somewhere and differently elsewhere.

OK. That info was very helpful, @dvo. The key to the problem is that nginx fails with a segmentation violation (SEGV) at 09:25:34. Sadly, I do not know the cause of that, but I continue to believe it has something to do with your nginx packaging. I did not fully research the process.txt you provided, but maybe someone else will notice something.

Here is a timeline of key events.

09:20:17 service nginx status (systemd) shows:
         Main PID: 1339
         Others: worker:1342, passenger core:1329, passenger watchdog:1326
09:25:29 certbot started (per LE log)
09:25:34 systemd nginx.service status=11/SEGV main process exited (per service nginx status)
         kills pid 1443 - but why that one? (that pid not seen in ps display or service status just before certbot)
09:25:37 certbot reload fails due to missing pidfile (per LE log)
         was: nginx -c /etc/nginx/nginx.conf -s reload (per certbot code)
         pidfile missing as systemd deleted nginx.pid as result of SEGV cleanup 
         After a failed reload certbot tries to start nginx directly
         certbot issues: nginx -c /etc/nginx/nginx.conf for that
         (this is known by looking at configurator.py certbot code - start not shown in LE log)
09:25:xx commands after certbot complete
         /run/nginx.pid timestamp matches 09:25:37 direct start of nginx by certbot (not with systemd)
         ps -eF display also shows 09:25 start time
         Main PID 1478, worker:1501
         confirms certbot direct start now in effect
         ps -eF grep not setup to show passenger so their status not known
09:33:44 service nginx restart fails
         this is expected since last nginx start was direct, not with systemd
09:34??  next commands after restart fails
         /run/nginx.pid not found 
         as expected since systemd removed it after failed restart at 09:33:44
         nginx -s reload fails due to missing /run/nginx.pid

As _az noted, certbot starts nginx directly using a command like nginx -c /etc/nginx/nginx.conf. Mixing a direct start with systemd causes problems as described earlier.
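
To spot that situation quickly, the ps output can be labelled by how each nginx master was started. A sketch only: it assumes the two command lines seen in this thread (`/usr/sbin/nginx -g ...` from the systemd unit, `nginx -c ...` from Certbot's direct start) and that the PID is the second column, as in ps -eF. The function name is made up.

```shell
# Read ps -eF (or ps -ef) output on stdin and label each nginx master process.
classify_nginx_masters() {
    awk '
        /nginx: master process/ {
            # $2 is the PID column in ps -eF / ps -ef output
            if ($0 ~ /master process nginx -c /)
                print $2 " direct"
            else if ($0 ~ /master process \/usr\/sbin\/nginx -g /)
                print $2 " systemd"
            else
                print $2 " other"
        }'
}
```

Example call: sudo ps -eF | classify_nginx_masters. A "direct" master means systemd is not tracking the running nginx, so service nginx restart can be expected to fail on the port binds.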

A very puzzling item is at 09:25:34 systemd killed pid 1443 while certbot was running. I do not understand why that pid was killed. It was not shown in any prior displays even the ones right before certbot started. It was 3 seconds before certbot started nginx directly which got a pid of 1478.

I saw no evidence that mixed pidfile locations were a problem nor problems with paths to certbot itself or its config. Although, I wouldn't mind seeing results of this:

echo $PATH
which -a nginx

I don't expect any surprises but ...

Unless someone sees a problem in your process.txt packaging, I think your better way forward is to use certbot --webroot and avoid the nginx plug-in. This means having to set up the SSL definitions yourself, but it seems you could do this in your default template once. You would be able to reload / restart nginx using systemd and avoid some problems. Also, certbot would not be modifying your nginx.conf on the fly, so it is less likely to cause integration problems. That's what I have so far. Hope this helps.
