Renewal process crashed my production server

This may not be an issue or bug with LetsEncrypt or Certbot, but I wanted to write up the problem I ran into here in case it could happen to someone else, or to find out whether there is a more correct way I could have implemented my process.

  • I’m using certbot 0.21.1.

  • My Ubuntu server is running Nginx in front of several Node.js applications.

  • I installed several LetsEncrypt SSL certificates and configured them in Nginx.

  • I created a daily certbot renewal job.

All of this has worked great for quite a while now, including the renewal process. (I LOVE LetsEncrypt and certbot!)

But the other day, all my production server’s web sites went down during the daily certificate renewal process.

The problem was that I had a domain name I no longer used; it was for a test version from before this new platform went live. When the renewal for that domain failed, the certbot process left Python holding onto port 443, blocking access to all my web sites.

Is this expected behavior?

Here are some log snips for reference:

Final lines of the certbot renewal log:
All renewal attempts failed. The following certs could not be renewed:
/etc/letsencrypt/live/v2.mydomain.com/fullchain.pem (failure)

Nginx errors:
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)

When I tried to restart Nginx, I got this error:

Job for nginx.service failed because the control process exited with error code. See "systemctl status nginx.service" and "journalctl -xe" for details.

When I checked the process that was using port 443, I saw this:
tcp6 0 0 :::443 :::* LISTEN 3080/python2.7
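For anyone who wants to check the same thing on their own server, commands along these lines will show which process is holding port 443 (the exact output format varies by system, and lsof may need to be installed separately):

# Show listening TCP sockets with the owning PID/program name
netstat -tlnp | grep ':443 '

# Alternative: list processes that have port 443 open
lsof -i :443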

Anyway, I think I solved the issue going forward by using certbot’s fancy “delete” command to remove the certs I no longer use (a great feature!). But it does seem like this problem could strike other folks as well, and the impact is complete failure of all secure web sites on the server. I just wanted to put this scenario here in case anyone finds value in it or has other suggestions for me.
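For reference, the delete command is along these lines (shown with the anonymized domain name from above; substitute your own cert name):

# Remove the unused certificate lineage so the renewal job stops trying to renew it
/root/certbot/certbot-auto delete --cert-name v2.mydomain.com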

Thanks!
-Sean

The best way to avoid this situation in the future would be to switch to using the --webroot or --nginx plugins instead of --standalone mode. These validation plugins don’t require temporarily stopping your nginx server, so this failure mode would not be possible.
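As a rough sketch of what that switch could look like (the domain and webroot path below are placeholders, and "certbot" stands for however you normally invoke it, e.g. certbot-auto):

# Re-issue using the nginx plugin; validation is answered through the running nginx
certbot --nginx -d example.com

# Or use webroot mode, writing challenge files into a directory nginx already serves
certbot certonly --webroot -w /var/www/example -d example.com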

That being said, you’ve definitely uncovered a bug in standalone mode! To figure out what is going on, we would need to see the entire certbot debug log and not just the end. If the forum software won’t let you post it, you can use a service like gist.github.com or pastebin.com and link to it here.

Thanks so much,

I will post the full log file here, followed by my renewal cron job for reference (domain name changed). Let me know if there is a different log you would like to see; this is the output from certbot when the job ran it.

Here is the full auto-renew log for the day that failed (this morning; I had the same failure, for the same reason, on my other server with a different domain two days ago). I have two domains on this server: the one in use now was not yet due for renewal today, but the unused one (prefixed with v2.) that failed was up for renewal today:

Thu Feb 1 03:13:01 EST 2018 *** START ***


Processing /etc/letsencrypt/renewal/v2.mydomain.com.conf


Processing /etc/letsencrypt/renewal/mydomain.com.conf


The following certs are not due for renewal yet:
/etc/letsencrypt/live/mydomain.com/fullchain.pem (skipped)
All renewal attempts failed. The following certs could not be renewed:
/etc/letsencrypt/live/v2.mydomain.com/fullchain.pem (failure)

And here is my daily renewal job:

#!/bin/sh
LOGFILE=/root/jobs/certbot.log
echo "$(date) *** START ***" >> $LOGFILE
/root/certbot/certbot-auto renew --pre-hook "systemctl stop nginx" --post-hook "systemctl start nginx" >> $LOGFILE
echo "$(date) *** COMPLETED ***" >> $LOGFILE

certbot logs detailed information to /var/log/letsencrypt/, but the most recent letsencrypt.log file probably refers to your subsequent successful actions, so you may have to look at the other numbered files to find the log from the run that failed.
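Listing that directory by modification time should make it easy to spot the run from the failure (assuming the default log location):

# Newest first; letsencrypt.log.1, letsencrypt.log.2, ... are the older runs
ls -lt /var/log/letsencrypt/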

I have created a gist of the detailed log file, but is there anything in it that should not be made public? I see “protected”, “payload”, and “signature” properties that look like they might be keys or something.

Let me know what you think - thanks.

These are just cryptographic signatures, not keys, so it is safe to post them, but they are also of little debugging value, so it is fine if you prefer to elide them.
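If you do want to elide them, something along these lines should blank those fields out before publishing (a rough sketch; adjust it if your log formats those fields differently):

# Replace the base64 blobs in those JSON fields with a placeholder
sed -E 's/"(protected|payload|signature)": "[^"]*"/"\1": "<elided>"/g' letsencrypt.log > letsencrypt-public.log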

Ok, thanks! Here is the log:

Hi @schoen & @bmw, we have a bug (maybe plural) for you! :confounded:

First, certbot runs into trouble performing authorization:

2018-02-01 08:13:09,849:DEBUG:acme.client:Received response:
HTTP 400
Server: nginx
Content-Type: application/problem+json
Content-Length: 142
Boulder-Requester: 21609655
Replay-Nonce: tXscKOWilkFpABBem0gPqlwgnzKCmNCojzl9p9Qj70w
Expires: Thu, 01 Feb 2018 08:13:09 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Thu, 01 Feb 2018 08:13:09 GMT
Connection: close

{
  "type": "urn:acme:error:malformed",
  "detail": "Unable to update challenge :: cannot update a finalized authorization",
  "status": 400
}
2018-02-01 08:13:09,849:DEBUG:acme.client:Storing nonce: tXscKOWilkFpABBem0gPqlwgnzKCmNCojzl9p9Qj70w
2018-02-01 08:13:09,850:WARNING:certbot.renewal:Attempting to renew cert (v2.members.dartconnect.com) from /etc/letsencrypt/renewal/v2.members.dartconnect.com.conf produced an unexpected error: urn:acme:error:malformed :: The request message was malformed :: Unable to update challenge :: cannot update a finalized authorization. Skipping.
2018-02-01 08:13:09,854:DEBUG:certbot.renewal:Traceback was:
Traceback (most recent call last):
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/certbot/renewal.py", line 425, in handle_renewal_request
    main.renew_cert(lineage_config, plugins, renewal_candidate)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/certbot/main.py", line 1065, in renew_cert
    _get_and_save_cert(le_client, config, lineage=lineage)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/certbot/main.py", line 113, in _get_and_save_cert
    renewal.renew_cert(config, domains, le_client, lineage)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/certbot/renewal.py", line 297, in renew_cert
    new_certr, new_chain, new_key, _ = le_client.obtain_certificate(domains)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/certbot/client.py", line 318, in obtain_certificate
    self.config.allow_subset_of_names)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/certbot/auth_handler.py", line 81, in get_authorizations
    self._respond(resp, best_effort)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/certbot/auth_handler.py", line 134, in _respond
    resp, chall_update)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/certbot/auth_handler.py", line 158, in _send_responses
    self.acme.answer_challenge(achall.challb, resp)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/acme/client.py", line 230, in answer_challenge
    response = self.net.post(challb.uri, response)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/acme/client.py", line 709, in post
    return self._post_once(*args, **kwargs)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/acme/client.py", line 722, in _post_once
    return self._check_response(response, content_type=content_type)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/acme/client.py", line 583, in _check_response
    raise messages.Error.from_json(jobj)
Error: urn:acme:error:malformed :: The request message was malformed :: Unable to update challenge :: cannot update a finalized authorization

This seems like a weird error condition for the simple case of a domain that is no longer in use?

Possibly its unexpectedness leads to the next error, which is that the post-hook certbot runs fails:

2018-02-01 08:13:12,422:ERROR:certbot.hooks:Hook command "systemctl start nginx" returned error code 1
2018-02-01 08:13:12,423:ERROR:certbot.hooks:Error output from systemctl:
Job for nginx.service failed because the control process exited with error code. See "systemctl status nginx.service" and "journalctl -xe" for details.

Presumably, nginx cannot bind to the port because certbot has not stopped listening in standalone mode due to the above error.

Finally, certbot claims to be exiting abnormally but apparently doesn’t actually exit, because the rest of the log is filled with random incoming requests to the server until certbot was manually killed:

2018-02-01 08:13:12,423:DEBUG:certbot.log:Exiting abnormally:
Traceback (most recent call last):
  File "/opt/eff.org/certbot/venv/bin/letsencrypt", line 11, in <module>
    sys.exit(main())
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/certbot/main.py", line 1240, in main
    return config.func(config, plugins)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/certbot/main.py", line 1142, in renew
    renewal.handle_renewal_request(config)
  File "/opt/eff.org/certbot/venv/local/lib/python2.7/site-packages/certbot/renewal.py", line 443, in handle_renewal_request
    len(renew_failures), len(parse_failures)))
Error: 1 renew failure(s), 0 parse failure(s)
2018-02-01 08:19:06,385:DEBUG:acme.crypto_util:Performing handshake with ('::ffff:66.249.84.8', 37995, 0, 0)
2018-02-01 08:19:06,386:DEBUG:acme.crypto_util:Server name (members.dartconnect.com) not recognized, dropping SSL
[...and so on...]

Maybe because the standalone server is still running on another thread?

So maybe three bugs?

  1. The weird authorization error. (Probably hard to track down at this point.)

  2. Certbot needs to stop standalone mode on a traceback.

  3. Certbot really needs to exit at all costs when things go south. :stuck_out_tongue_winking_eye:


Hi @svbaker, someone in another thread (linked right above) is also experiencing this issue, and they think it may have been introduced in a recent release.

Can you look in your cron logs and find the date of the last certificate you issued successfully before this problem occurred? From your log I can tell there was one in early November, while certbot 0.19.0 was the current version, but if your other logs show a certificate issued more recently as part of the automatic renewal process, that may narrow it down a bit further.
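If it helps, grepping the cron log for the run markers and any renewal summary lines should surface those dates quickly (a sketch using the log path from the script you posted):

# Show each run's timestamp plus any lines mentioning renewal outcomes
grep -E 'START|COMPLETED|renew|Congratulations' /root/jobs/certbot.log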

Unfortunately, this server only went live in December, so there has not been a renewal of anything other than the v2.members.dartconnect.com cert that failed. (I did manually issue a new cert for members.dartconnect.com, but that was not an automated renewal.)

I do have other Nginx-fronted web applications that had successful certbot renewals last year, if that might help.

@svbaker, thanks for reporting this, and @Patches, thanks for helping us dig into the problem. I’ll look into Certbot not stopping the standalone server.

I’m not sure what’s going on with the malformed error from the CA, though. According to the Certbot log linked above, Certbot gets back a new authz with both the authz and its challenges in a pending status, but when Certbot tries to POST to the challenge, it gets:

{
  "type": "urn:acme:error:malformed",
  "detail": "Unable to update challenge :: cannot update a finalized authorization",
  "status": 400
}

@jsha, any ideas?

