Best practices to monitor Certbot


#1

What are the best practices to monitor Certbot (as foreground and background process)?

E.g. how to collect monitoring information about

  • general activities
  • connectivity to ACME server
  • issuance of certificate
  • installation of certificate

#2

@lestaff

Take a look plz.


#3

@stevenzhu I don’t think this is a question that merits a staff tag. It seems like something the community can provide guidance on. Please try to reserve the @lestaff mentions for issues that require privileged access or immediate attention. Thanks!

@toc-rox What context are you using Certbot in? For a small number of websites? As part of some kind of larger automation? I think the answer to that question will help inform the responses you get about best practices.


#4

The plan is to use Certbot in a production environment with a lot of servers. It’s very important
that a certificate never becomes invalid. To assure this, it’s necessary to monitor each Certbot instance.
Currently it’s not obvious which of the following is the best practice to achieve this:

  • evaluate the return code
  • evaluate the (newest) logfile
  • evaluate the terminal output
  • use a combination of the above
  • something else

#5

I’m of the view that it’s usually a mistake to micromanage ACME clients. They are not usually built to produce structured/parseable output or logs (Certbot certainly isn’t) and … well, who cares if one renewal attempt finishes with exit code 1 instead of 0? It could have been an intermittent issue (e.g. network or CA degraded) and may succeed the next time.

The outcome is what matters (certificates not lapsing or getting so close to lapsing that you’re likely to have rate limiting problems).

Additionally, Certbot’s report of success can be a false positive. There is no guarantee that e.g. your webserver actually loaded and is actively serving the new certificate after it was renewed. Certbot doesn’t check that.

To that end, you have some tools at your disposal:

  • Rely on the CA-based email notifications (least accurate)
  • Rely on a service like Let’s Monitor or Uptime Robot to warn you when a particular live endpoint is observed to be serving a certificate that is close to lapsing
  • Leverage existing monitoring infrastructure (Nagios or a Prometheus exporter) to do the same thing

#6

If you monitor the expiration date of the certificate actually served by your web server, then you cover all of these issues:
If Certbot works, the certificate should never expire in less than X days (usually X=30).
If either of these errors occurred:

  • Failed to get a new certificate
  • Failed to install the new certificate

Then your monitor will see it.
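
A minimal Python sketch of that idea, using only the standard library. The function names, the 10-second timeout, and the default threshold of 30 days (the X above) are illustrative assumptions, not something taken from Certbot or from an existing monitoring tool.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_remaining(not_after: str, now: datetime) -> float:
    """Days between `now` and a notAfter string as returned by getpeercert()."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def served_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Fetch the notAfter field of the certificate the server actually presents.

    The default context verifies the chain and hostname, so an already expired
    or mismatched certificate raises ssl.SSLCertVerificationError instead,
    which a monitor should also treat as an alarm.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host: str, threshold_days: float = 30.0):
    """Return (ok, days_left); ok is False when the threshold is undercut."""
    left = days_remaining(served_not_after(host), datetime.now(timezone.utc))
    return left >= threshold_days, left
```

Calling `check("www.example.com")` from a cron job or a monitoring plugin would then report whether the live endpoint is still above the threshold. A threshold somewhat below the renewal window may be preferable, so that a renewal that is merely pending does not raise an alarm.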


#7

Hi @toc-rox

then you may create your own client instead of installing certbot. There are a lot of libraries. And ACME isn’t too complicated:

https://tools.ietf.org/html/draft-ietf-acme-acme-14

With your own client, you can

  • redirect all GET requests domainname/.well-known/acme-challenge/1234 to a special server
  • send notification mails if something doesn’t work
  • split the certificate creation / certificate management and the installation / use of certificates

And you can split the creation of a new certificate into small steps with return codes. So if there is an error, you don’t need to restart with a new order; instead, you repeat the last step.


#8

I agree that bottom-up monitoring isn’t sufficient. I have written a top-down data collector (prototype) which grabs the certificate offered by a service. This makes it possible to evaluate the certificate actually used by the certificate consumer (the service). An alarm could be generated if the remaining lifetime is under a defined threshold. This indicates that something in the renewal chain hasn’t worked.

$ ./moncert www.google.com:443

Connecting to "www.google.com:443" ...

SerialNumber : 8030173536167869905
Subject      : CN=www.google.com,O=Google LLC,L=Mountain View,ST=California,C=US
Issuer       : CN=Google Internet Authority G3,O=Google Trust Services,C=US
NotBefore    : 2018-08-21 08:05:00 +0000 UTC
NotAfter     : 2018-11-13 08:05:00 +0000 UTC
IsCA         : false
DNSNames     : www.google.com

SerialNumber : 149685795415515161014990164765
Subject      : CN=Google Internet Authority G3,O=Google Trust Services,C=US
Issuer       : CN=GlobalSign,OU=GlobalSign Root CA - R2,O=GlobalSign
NotBefore    : 2017-06-15 00:00:42 +0000 UTC
NotAfter     : 2021-12-15 00:00:42 +0000 UTC
IsCA         : true

#9

Why write a new client? Certbot is an excellent one. Isn’t it better to file a feature request against Certbot? Something like “implement reliable and solid monitoring messages”.


#10

Certbot isn’t an API and doesn’t want to be one, so a new process is always required. You said:

  • evaluate the return code
  • evaluate the (newest) logfile

These are additional steps. A library with some functions has direct return and error codes, so it’s not necessary to parse logfiles. If Certbot changes something, such a validation may no longer work.

And there are other limitations. My own client uses dns-01 validation with *.example.com and http-01 validation with example.com, so I need only one _acme-challenge.example.com entry, not two. Such “mixed validations” aren’t supported. And I can save the http-01 validation files in a special directory as

domainname.token.txt

If there is a GET http://domainname/.well-known/acme-challenge/token, the webserver code checks whether a file domainname.token.txt exists. If it does, the file is sent; if not, a 404 is sent.
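
The lookup described above could be sketched as follows. This is not the poster’s actual code: `resolve_challenge`, the directory layout, and the host handling are hypothetical, and the only outside fact relied on is that ACME tokens are restricted to the base64url alphabet, which is what makes the strict token check safe.

```python
import re
from pathlib import Path

CHALLENGE_PREFIX = "/.well-known/acme-challenge/"
TOKEN_RE = re.compile(r"[A-Za-z0-9_-]+")   # base64url alphabet only
HOST_RE = re.compile(r"[a-z0-9.-]+")       # plain DNS names, no slashes


def resolve_challenge(host: str, path: str, directory: Path):
    """Return (status, body) for a GET request to `path` on `host`."""
    if not path.startswith(CHALLENGE_PREFIX):
        return 404, b""
    token = path[len(CHALLENGE_PREFIX):]
    # Drop an optional ":port" suffix and normalise case (assumes no IPv6 literal).
    hostname = host.split(":", 1)[0].lower()
    # Rejecting anything outside the expected alphabets also blocks "../" tricks.
    if not TOKEN_RE.fullmatch(token) or not HOST_RE.fullmatch(hostname):
        return 404, b""
    candidate = directory / f"{hostname}.{token}.txt"
    if candidate.is_file():
        return 200, candidate.read_bytes()
    return 404, b""
```

A web framework handler would call `resolve_challenge(request_host, request_path, Path("/some/challenge/dir"))` and translate the tuple into a response; the directory path here is a placeholder.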

Much of the work is organizing local information: account key, order URL, validation files, certificate keys, certificate requests and certificates. These can be saved in a database or as files; with your own functions you can do what you want. The communication with Let’s Encrypt is only a small part of the job.

So it’s easy to create additional mail notifications if something doesn’t work.


#11

It may be feasible to run “certbot -q renew” and, if it outputs anything, generate an alert. You will waste some time slogging through emails about transient errors, but maybe not too much time.
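
A rough Python sketch of that approach: run the command and flag an alert whenever it prints anything or exits non-zero. The helper name `run_quiet`, the 600-second timeout, and the extra exit-code check are assumptions added for illustration; how the alert is delivered (mail, pager, monitoring system) is left open.

```python
import subprocess


def run_quiet(command, timeout=600):
    """Run `command`; return (needs_alert, report).

    An alert is flagged when the process writes anything to stdout or
    stderr, or when it exits with a non-zero status.
    """
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    output = (result.stdout + result.stderr).strip()
    needs_alert = result.returncode != 0 or bool(output)
    report = f"exit code {result.returncode}\n{output}"
    return needs_alert, report
```

For the case above this would be called as `run_quiet(["certbot", "-q", "renew"])`, with the report forwarded to whatever alerting channel is already in place.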

You still have to monitor your web servers to make sure everything’s really working, though.

Also, there’s the issue of monitoring revocation status of your certificates. (Which could be tied into an OCSP stapling implementation.)