Bug: installing with certbot impedes further nginx conf changes without reboot

@dvo Just guessing ... Could it have anything to do with the passenger config? I do not know much about it but reading its docs I see this:

passenger_root path;

Refers to the location to the Passenger root directory, or to a location configuration file. This configuration option is essential to Passenger, and allows Passenger to locate its own data files.

But, I do not see this "essential" setting in your config. The nginx problem you describe is unusual so I am just looking for something unusual in your conf.

Could you try adding that setting or even removing passenger from the nginx conf just as a test to see if it is related to that?

https://www.phusionpassenger.com/library/config/nginx/reference/#application-loading

I will attempt to remove the reference to passenger, but I am quite certain that is not the source.

Why then would

sudo reboot
[...]
sudo nginx -t
sudo service nginx restart

then allow nginx to restart?

The only change that generated the error (and I tested this in isolation) is the invocation of sudo certbot --nginx -d [...]

As noted, I am just looking for something unusual. nginx does not usually behave this way after certbot --nginx.

As background, certbot will issue a nginx -s reload after updating the config. This sends a signal to nginx to reload the conf. This will not be shown in the nginx.service status except for the new PID for the worker process. Note if you do sudo systemctl reload nginx those do appear in the nginx service status (and the new pid of course).

It seems to me that the nginx state is getting "off", which only becomes apparent the second time a reload / restart is done. Another test is to try several sudo nginx -s reload runs without using certbot, modifying the conf slightly each time as you did.

My guess is something about the passenger integration / install is causing it. Especially when I saw a key config item missing from it.

It's not really possible for nginx to require a reboot for changes to take effect. The issue is most likely due to a bug in the process controller script or process controller itself.

Try issuing a kill -HUP {nginx "master" process id}, which is how nginx does a graceful reload: the main process rereads the configuration files and starts new worker processes, and each old worker exits after finishing its active requests. That should work, and if it does, it would indicate to me that the issue is with systemctl.
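
As a minimal sketch of that suggestion, the helper below reads the PID from the pid file and checks that the process still exists before anything is signalled. The function name nginx_master_pid and the default path /run/nginx.pid are assumptions for illustration; use whatever the pid directive in your nginx.conf says.

```shell
#!/usr/bin/env bash
# Print the PID stored in an nginx pid file, but only if that process exists.
# The default path /run/nginx.pid is an assumption; check the `pid` directive.
nginx_master_pid() {
    local pidfile="${1:-/run/nginx.pid}"
    local pid
    if [ ! -r "$pidfile" ]; then
        echo "pid file $pidfile is not readable" >&2
        return 1
    fi
    pid="$(tr -d '[:space:]' < "$pidfile")"
    case "$pid" in
        ''|*[!0-9]*)
            echo "pid file $pidfile does not hold a numeric PID" >&2
            return 1
            ;;
    esac
    # kill -0 sends no signal, it only tests for existence. It is refused for
    # processes owned by another user, so fall back to looking at /proc.
    if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
        echo "$pid"
    else
        echo "no process with PID $pid (stale pid file?)" >&2
        return 1
    fi
}
```

With the function loaded, something like pid="$(nginx_master_pid)" && sudo kill -HUP "$pid" sends the signal only when the pid file points at a live process. A "stale pid file" message would match the "No such process" alert seen later in this thread.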

[I apologize to any offended by "master". I personally avoid that term in place of "main" or "primary", but nginx has not yet updated its terminology to more inclusive words.]

You seem to be onto something. I attempted a second certificate on the same end point & was going to try the suggestion.

While the certificate got generated, a new error did arise. I find it preferable to communicate this before attempting the suggestion as it might provide better insight.

Rolling back to previous server configuration...
nginx: [alert] kill(1547, 1) failed (3: No such process)
Encountered exception during recovery:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/certbot/client.py", line 529, in deploy_certificate
    self.installer.restart()
  File "/usr/lib/python3/dist-packages/certbot_nginx/configurator.py", line 919, in restart
    nginx_restart(self.conf('ctl'), self.nginx_conf)
  File "/usr/lib/python3/dist-packages/certbot_nginx/configurator.py", line 1202, in nginx_restart
    raise errors.MisconfigurationError(
certbot.errors.MisconfigurationError: nginx restart failed:
b''
b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/certbot/error_handler.py", line 124, in _call_registered
    self.funcs[-1]()
  File "/usr/lib/python3/dist-packages/certbot/client.py", line 634, in _rollback_and_restart
    self.installer.restart()
  File "/usr/lib/python3/dist-packages/certbot_nginx/configurator.py", line 919, in restart
    nginx_restart(self.conf('ctl'), self.nginx_conf)
  File "/usr/lib/python3/dist-packages/certbot_nginx/configurator.py", line 1202, in nginx_restart
    raise errors.MisconfigurationError(
certbot.errors.MisconfigurationError: nginx restart failed:
b''
b''
nginx restart failed:
b''
b''
IMPORTANT NOTES:
 - An error occurred and we failed to restore your config and restart
   your server. Please post to
   https://community.letsencrypt.org/c/help with details about your
   configuration and this error you received.
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/testtwo.fidely.club/fullchain.pem
[...]

Note:

sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

That's interesting. Need to keep gathering facts :slight_smile: I would do a sudo systemctl status nginx.service in between each command to watch for messages and keep track of the pids that are active (in the CGroup). Or, a briefer display of pids using something like: ps -eF | grep -E "nginx|PID"

A quick search through this forum found a similar "no such process" error although no remedy was found. Maybe further clues though? Could your pid folder be damaged or odd in some way?

Update: @dvo Oh, forgot to include link to that other thread I mentioned.

The journal log seems like it no longer contains the nginx problem.
You'd have to run that soon after the problem returns.

As for the test.txt file - that is the HTTPS config.
I was looking for the HTTP config.

Also, I can't seem to find it, what version of certbot are you running?

'folder damaged or odd in some way'. I somehow doubt it as my work flow is always the same from one VPS to another. But under Ubuntu 20.04, they all exhibit this same behaviour.

I will try spinning a new one up tomorrow and document the status between each step.

I suspect you are using certbot 1.9.0 (or lower) and this can be fixed by upgrading certbot
Show:
certbot --version

certbot 0.40.0
I only have that config, as HTTP is then forwarded to HTTPS.

Lower than or equal to 1.9.0?
But why would sudo apt install certbot python3-certbot-nginx install by default (yesterday!) what appears to be a much lower version?

That config contains:

    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/test.fidely.club/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/test.fidely.club/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot

So I would dare say it is NOT the HTTP config file requested.

As for:

Ubuntu supports snap since... forever.
You should follow the recommended installation guide:
Certbot - Ubuntufocal Nginx (eff.org)

This is promising; I had an unverified hypothesis that 20.04 and that version of certbot did not mix well.
Will attempt new install on those lines with a clear head in short order. Gracias.

Oh gosh, yes... it's likely due to the Certbot version.

The Linux/OS-specific packages are all EXTREMELY out of date and not recommended for use, as the distributions themselves don't keep their packages up to date. While most releases dating back many years are compatible with the Let's Encrypt API, the majority of the work in Certbot's upgrade cycles has been on webserver/OS integrations, addressing edge cases and situations that are often like the one you are experiencing (and likely include it). The Certbot project moved to snapd because it lets end users run the most current version.

If you're worried about running snapd just for certbot, I believe you can use a hook in the crontab certbot installs to enable/disable the snapd daemon during renewals. This will give you precise control over when to run that daemon.

Reporting back with certbot 1.20.0 installed.

Fresh VPS. Installed certbot via snap.
Created a conf file for nginx in addition to the default file
[here I may have erred; I should perhaps have tested on just the default conf],
then obtained a certificate. All appeared well. Checked the conf file for certbot changes.
I proceeded to remove a blank line... then... same error:

$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xe" for details.
$ systemctl status nginx.service
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sat 2021-10-30 04:49:22 UTC; 40s ago
       Docs: man:nginx(8)
    Process: 19260 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Process: 19261 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=1/FAILURE)

$ journalctl -xe
    
Oct 30 04:18: 5 fidely.club systemd[871]: run-user-0.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit UNIT has successfully entered the 'dead' state.
Oct 30 04:18:35 fidely.club systemd[871]: run-user-0.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit UNIT has successfully entered the 'dead' state.

Note the journalctl timestamps: they predate the failed restart, and all previous entries had succeeded.

I then attempted to issue a certificate for a modified default nginx.conf. Certbot failed, but provided a new set of data:

Encountered exception during recovery: certbot.errors.MisconfigurationError: nginx restart failed:
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] still could not bind()
nginx restart failed:
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)
nginx: [emerg] bind() to [::]:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] still could not bind()
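
"Address already in use" means some process is already listening on ports 80 and 443 when the new nginx tries to start. A small sketch for seeing which one: the filter below reads the output of ss -ltnp on stdin and keeps only the listeners on the ports of interest. The function name listeners_on_ports is made up for illustration, and the column positions assume the usual ss layout (state in column 1, local address in column 4, process info last).

```shell
#!/usr/bin/env bash
# Filter `ss -ltnp` output (read from stdin) down to listeners on given ports.
# Prints "<local address> <process info>" for every matching socket.
listeners_on_ports() {
    # Alternation of port numbers used inside a regex, e.g. "80|443".
    local ports="${1:-80|443}"
    awk -v ports="$ports" '
        # Keep listening sockets whose local address ends in :<port>.
        $1 == "LISTEN" && $4 ~ (":(" ports ")$") { print $4, $NF }
    '
}
```

Run it as sudo ss -ltnp | listeners_on_ports (root is needed for ss to show process names and PIDs). If the PIDs listed belong to an nginx that systemd does not list in its CGroup, the restart is colliding with a second nginx instance.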

Rebooting server. Yes nginx can be soft restarted after reboot.
Ask for a cert. It generates, conf file seems OK. https pages serve up.
Modify a conf file by removing a blank line and the bug re-emerges:

$ sudo reboot
$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
$ sudo certbot --nginx -d demo.saltalafila.online
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Requesting a certificate for demo.saltalafila.online
Successfully received certificate.
$ sudo vim /etc/nginx/sites-enabled/default
$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xe" for details.
$ systemctl status nginx.service
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sat 2021-10-30 05:11:08 UTC; 52s ago
       Docs: man:nginx(8)
    Process: 1316 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Process: 1317 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=1/FAILURE)

$ journalctl -xe  # had no info of use as the last entry had a timestamp prior to making above change
-- Startup of the manager took 104817 microseconds.
Oct 30 05:07:50 fidely.club sshd[1047]: Received disconnect from 5.171.96.112 port 15399:11: disconnected by user
Oct 30 05:07:50 fidely.club sshd[1047]: Disconnected from user jerdvo 5.171.96.112 port 15399

Redoing the process without certbot involved:

$ sudo reboot
$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
$ sudo vim /etc/nginx/sites-enabled/default
# remove blank line & save
$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
$

Out of thoroughness, I then attempted sudo certbot renew --dry-run
One cert passed; the second had some test failures (huh?). Log enclosed:
letsencrypt.txt (73.5 KB)
Right, do that again...
sudo certbot renew --dry-run
and the above stream of "certbot.errors.MisconfigurationError: nginx restart failed:" was presented anew when the first certificate renewal was attempted.

Minimised test. New VPS, with only the default.conf configuration for nginx:

$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
$ sudo certbot --nginx -d testtwo.fidely.club
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Requesting a certificate for testtwo.fidely.club

Successfully received certificate. [...]
$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo service nginx restart
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xe" for details.

i.e. even without editing the .conf file again, certbot has touched something improperly.
A reboot allows editing of conf files and soft restarting of nginx again.

@dvo Looks like you continue to have this problem in Certbot:

(this is example msg and pid from earlier but same symptoms persist)

Can you show the pid line in your nginx.conf? It will be something like:
pid /var/run/nginx.pid;
I ask because most of your examples of the conf show the server sections only.

Also, show the contents of the nginx system service file. Location depends on your system. Maybe:
/etc/systemd/system/nginx.service
or
/usr/lib/systemd/system/nginx.service

And show the pid file with: ls -l (pid file name) (hopefully both of the above use the same name)

With your new fresh VPS, have you tried just two consecutive sudo service nginx restart without any certbot in between? Does that work?

Note, this is a fresh install and nginx.conf was not modified; it is in default form.
pid /run/nginx.pid;

/usr/lib/systemd/system/nginx.service

[Unit]
Description=A high performance web server and a reverse proxy server
Documentation=man:nginx(8)
After=network.target

[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
ExecReload=/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
ExecStop=-/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid
TimeoutStopSec=5
KillMode=mixed

[Install]
WantedBy=multi-user.target

$ ls -l /run/nginx.pid
-rw-r--r-- 1 root root 5 Oct 30 08:44 /run/nginx.pid

Yes, sudo service nginx restart does run twice consecutively without certbot invoked. As illustrated above, this issue arises once certbot is invoked.

Right, sorry, forgot you had that in just prior post #23.

Is your nginx still the one built with passenger? It may well not have anything to do with it. But, this kind of failure is not at all common so there is something unusual with your setup / config.

Would you re-run your test series in post #24 but show the running pids for nginx before every sudo command. Something like this but check in case your command format is different: sudo ps -eF | grep -E "nginx|PID"

Something goes wrong with the reload of nginx that Certbot does after the config updates. It looks like it cannot send a signal to the running nginx. And, the running nginx remains in an odd state. Seeing the pids might help unlock this mystery.

Update: Also, important, save the /var/log/letsencrypt/letsencrypt.log for the Certbot command. Show the line in the log that shows the failed reload command and the pid it shows. Probably worth looking at /run/nginx.pid before and after certbot as well. I think we will find something interesting comparing them all especially before and after the Certbot command.
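
To make that before/after comparison less error-prone, here is a rough sketch of a snapshot helper. It writes the pid file contents and the nginx process list to a file, so two snapshots can be compared with diff. The function name snapshot_nginx and the default pid file path /run/nginx.pid are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Write a timestamped snapshot of nginx process state to a file.
# Arguments: output file, optional pid file (default /run/nginx.pid, assumed).
snapshot_nginx() {
    local out="$1"
    local pidfile="${2:-/run/nginx.pid}"
    {
        echo "== $(date -u '+%Y-%m-%d %H:%M:%S') UTC"
        if [ -r "$pidfile" ]; then
            echo "pidfile: $(tr -d '[:space:]' < "$pidfile")"
        else
            echo "pidfile: missing"
        fi
        # The [n] bracket keeps grep from matching its own command line.
        ps -eo pid,ppid,user,args 2>/dev/null | grep '[n]ginx' \
            || echo "no nginx processes"
    } > "$out"
}
```

For example: run snapshot_nginx before.txt, then the certbot command, then snapshot_nginx after.txt, and finish with diff before.txt after.txt. A master PID that changes while the pidfile line stays the same (or goes missing) would point at the mismatch suspected above.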

One hint that may help shed some light: Certbot doesn't use systemd to reload or start nginx. (There's some history behind this behavior).

It tries to reload nginx using:

nginx -c /etc/nginx/nginx.conf -s reload

and if that fails, it tries to start nginx with:

nginx -c /etc/nginx/nginx.conf

The one behavior (that I'm aware of) which doesn't work so well is if nginx is already stopped when you run Certbot:

  1. Stop nginx
  2. Run certbot --nginx
  3. Try to restart nginx with service nginx restart

At (2), Certbot fails to send a reload signal using nginx -s reload, so it assumes nginx is not running and starts it using nginx -c /etc/nginx/nginx.conf. systemd is not aware of this.

At (3), systemd does not realize that nginx is actually running already (even though the pidfile is present) and tries to start it. nginx fails to start because the ports are occupied by the other instance of nginx. systemd also wipes out the /run/nginx.pid file at (3) which makes the situation worse, since now using nginx to control nginx won't work either, meaning you have to killall nginx to get back to a working state.
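
The situation described above can be told apart from a normal stop with two facts: what systemd thinks of the unit, and whether any nginx processes exist. The sketch below is only an illustration; the function names and the three labels (managed, orphaned, stopped) are made up here, not Certbot or systemd terminology.

```shell
#!/usr/bin/env bash
# Classify the situation from two facts:
#   $1 = output of `systemctl is-active nginx` (active, inactive, failed, ...)
#   $2 = number of running nginx processes
classify_nginx_state() {
    local unit_state="$1" nprocs="$2"
    if [ "$unit_state" = "active" ]; then
        echo "managed"      # systemd started nginx and is tracking it
    elif [ "$nprocs" -gt 0 ]; then
        echo "orphaned"     # nginx is running, but not under systemd
    else
        echo "stopped"      # nothing running; a normal start should work
    fi
}

# Gather the two facts on a live system and classify them.
check_nginx_state() {
    local unit_state nprocs
    # Both commands exit non-zero in normal cases (unit inactive, no match),
    # so their exit status is deliberately ignored.
    unit_state="$(systemctl is-active nginx 2>/dev/null || true)"
    nprocs="$(pgrep -c -x nginx || true)"
    classify_nginx_state "${unit_state:-unknown}" "${nprocs:-0}"
}
```

If check_nginx_state prints orphaned, the way back to a working state is the one mentioned above: sudo killall nginx, followed by sudo systemctl start nginx so that systemd owns the process again.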

It's all a bit ugly and I'm not too sure what is stopping Certbot from controlling nginx with systemd instead, but I will try to look into it later in the week.

For now, you can choose from:

  1. (Recommended) Make sure nginx is running when you run Certbot, or
  2. Use nginx -s reload after you make your modifications (this will work whether or not nginx was started with systemd), or
  3. Use --webroot and configure nginx manually.
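
Option 1 can be turned into a small guard, sketched below under a few assumptions: it must run as root, the function name certbot_with_nginx_check is hypothetical, and the SYSTEMCTL and CERTBOT variables exist only so the logic can be dry-run with stand-in commands.

```shell
#!/usr/bin/env bash
# Run `certbot --nginx` for one domain, but only if systemd reports nginx active.
# SYSTEMCTL and CERTBOT are illustration-only overrides for dry runs.
certbot_with_nginx_check() {
    local domain="$1"
    local systemctl_cmd="${SYSTEMCTL:-systemctl}"
    local certbot_cmd="${CERTBOT:-certbot}"
    if [ -z "$domain" ]; then
        echo "usage: certbot_with_nginx_check <domain>" >&2
        return 2
    fi
    # Refuse to continue if systemd does not consider nginx to be running,
    # since that is the case where Certbot would start its own instance.
    if ! "$systemctl_cmd" is-active --quiet nginx; then
        echo "nginx is not active under systemd; start it first" >&2
        return 1
    fi
    "$certbot_cmd" --nginx -d "$domain"
}
```

A dry run such as SYSTEMCTL=true CERTBOT=echo certbot_with_nginx_check demo.saltalafila.online only echoes the certbot arguments, which is a cheap way to check the guard before using it for real.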

It's also possible that there is some reason for your trouble other than "nginx was stopped when you ran Certbot" (such as nginx crashing during the initial reload attempt), but I think the explanation about systemd would probably still underlie it.

Thanks @rg305 and @MikeMcQ for the big help so far.
