Certbot etc dashboarding and failure reporting?

I'm exploring the idea of extending the https://certifytheweb.com dashboard and renewal failure notifications to support Certbot (and possibly acme.sh). It's clear that some users don't have visibility into which renewals they have and which (if any) are failing.

So a few general questions for people who have a need for this sort of thing:

  • Does failure reporting/notifications sound useful to you?
  • What post-request reporting/notifications/webhooks etc. do you currently use (if any) and why?
  • Is there an existing open source or commercial solution you are already using and what features does it have that are important to you?
  • Do you require an API to query this information?

Thanks for any input!

13 Likes

Would endpoint monitoring be in scope for this?

For a while I had an acme.sh installation where my post-hook to reload my ZNC server's certificate wasn't doing the job, and I only detected it because I was monitoring the endpoint directly with Uptime Robot.

Detecting those kinds of incongruities would be useful for me, because it gives reliable assurances that the certificate is really OK.
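
To make the idea concrete, here is a minimal sketch of that kind of consistency check: compare the certificate on disk with the one the endpoint is actually serving. The function names are made up for illustration, and it assumes the first certificate in the PEM file is the leaf. It uses only the Python standard library.

```python
import hashlib
import socket
import ssl

BEGIN = "-----BEGIN CERTIFICATE-----"
END = "-----END CERTIFICATE-----"


def fingerprint_der(der_bytes: bytes) -> str:
    """Return the SHA-256 fingerprint (hex) of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()


def fingerprint_pem_file(path: str) -> str:
    """Fingerprint the first certificate block found in a PEM file.

    Assumption: the first CERTIFICATE block is the leaf. Anything before
    it (for example a private key in a combined file) is skipped.
    """
    with open(path, "r", encoding="ascii") as f:
        pem = f.read()
    start = pem.index(BEGIN)
    end = pem.index(END, start) + len(END)
    return fingerprint_der(ssl.PEM_cert_to_DER_cert(pem[start:end]))


def fingerprint_served(host: str, port: int, timeout: float = 10.0) -> str:
    """Fingerprint the certificate a TLS endpoint presents right now."""
    # Verification is disabled on purpose: we want the presented bytes
    # even when the certificate is expired or otherwise invalid.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return fingerprint_der(tls.getpeercert(binary_form=True))


def is_deployed(disk_fingerprint: str, served_fingerprint: str) -> bool:
    """True when the service is serving the certificate that is on disk."""
    return disk_fingerprint == served_fingerprint
```

Something like `is_deployed(fingerprint_pem_file(cert_path), fingerprint_served(host, port))` run after each renewal would flag a reload hook that silently did nothing. It only works when the checker can reach the endpoint and the service speaks TLS directly on that port (no STARTTLS).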

10 Likes

Yep, could be. Depends a little on whether the endpoint is internal or public and what type of service it is. General purpose TLS service monitoring/consistency reporting could be interesting.

8 Likes

We played with failure reporting, but haven't released it yet. We still log the errors but don't generate the reports. We set things up so failures are centrally contained within the logs, which are SQL-based, so they are queryable. I think there is a boolean flag on the final failure records.
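
As a rough illustration of what "queryable" buys you, here is a sketch using SQLite. The table and column names (`acme_log`, `is_final_failure`, and so on) are invented for the example and are not the real schema.

```python
import sqlite3

# Hypothetical schema: one row per logged ACME event, with a boolean
# (0/1) flag marking the record where a renewal finally gave up.
SCHEMA = """
CREATE TABLE IF NOT EXISTS acme_log (
    id               INTEGER PRIMARY KEY,
    logged_at        TEXT NOT NULL,      -- ISO 8601 timestamp
    domain           TEXT NOT NULL,
    error_type       TEXT,
    is_final_failure INTEGER NOT NULL DEFAULT 0
)
"""


def final_failures(conn: sqlite3.Connection, since: str):
    """Return (domain, failure_count, last_seen) for final failures since a timestamp."""
    rows = conn.execute(
        """
        SELECT domain, COUNT(*), MAX(logged_at)
        FROM acme_log
        WHERE is_final_failure = 1 AND logged_at >= ?
        GROUP BY domain
        ORDER BY domain
        """,
        (since,),
    )
    return rows.fetchall()
```

A failure report is then just this query on a schedule, with the result sent wherever notifications go.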

What was interesting was to track rate limit errors. The idea was to catch certain errors early - like duplicate certs, too many failures, too many pending authz - and either warn or pause operations. If you hit some of these, something is broken and needs to be fixed and/or the account will be wedged for a bit.

Most of the other errors would chain back to a connectivity issue or acme server outage. I thought the rate limiting stuff had the most potential.
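
A minimal sketch of that triage, assuming errors arrive as ACME problem documents with a `type` field using the `urn:ietf:params:acme:error:` namespace from RFC 8555. The three buckets and the choice of which types count as transient are my own assumptions, not what we actually shipped.

```python
ACME_ERROR_PREFIX = "urn:ietf:params:acme:error:"

PAUSE = "pause"              # rate limited: stop and fix the cause
RETRY_LATER = "retry_later"  # likely connectivity or server outage
INVESTIGATE = "investigate"  # everything else

# Assumption: these RFC 8555 error types usually trace back to
# connectivity problems or an ACME server outage.
TRANSIENT_TYPES = {"serverInternal", "connection", "dns"}


def classify(problem: dict) -> str:
    """Map an ACME problem document to one of the three buckets."""
    error_type = problem.get("type", "")
    if error_type.startswith(ACME_ERROR_PREFIX):
        error_type = error_type[len(ACME_ERROR_PREFIX):]
    if error_type == "rateLimited":
        return PAUSE
    if error_type in TRANSIENT_TYPES:
        return RETRY_LATER
    return INVESTIGATE


def should_pause(problems: list, threshold: int = 1) -> bool:
    """True once at least `threshold` of the given problems are rate limit errors."""
    hits = sum(1 for p in problems if classify(p) == PAUSE)
    return hits >= threshold
```

The specific limit that was hit (duplicate certs, too many failures, too many pending authz) is normally only in the human-readable `detail` text, so this sketch deliberately doesn't try to parse it.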

12 Likes

Thanks, yes, that's a good point. I guess if you know the error (and the CA) you could pretty much determine when rate limits are likely to be hit.

12 Likes

Yeah. My goal was to stay within 80% of rate limits by default, so there is always a bit of extra room to ensure specific operations can successfully complete. So I tried to log every ACME request and error, so we can generate real-time stats and keep each account/IP in a healthy place.
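
Roughly, the 80% idea looks like the sketch below: a sliding-window counter that refuses new requests once usage reaches a safety fraction of the CA's limit. The class name is made up and the limit and window values are placeholders, not any CA's real numbers.

```python
from collections import deque


class RateBudget:
    """Sliding-window request budget that stops short of the real limit."""

    def __init__(self, limit: int, window_seconds: float, safety: float = 0.8):
        # With safety=0.8 we only ever use 80% of the CA's limit,
        # leaving headroom for operations that must complete.
        self.threshold = int(limit * safety)
        self.window = window_seconds
        self._events = deque()  # timestamps of requests in the window

    def _expire(self, now: float) -> None:
        # Drop requests that have aged out of the window.
        while self._events and self._events[0] <= now - self.window:
            self._events.popleft()

    def usage(self, now: float) -> int:
        """Number of requests currently counted against the window."""
        self._expire(now)
        return len(self._events)

    def try_acquire(self, now: float) -> bool:
        """Record a request if the budget allows it; return whether it may proceed."""
        self._expire(now)
        if len(self._events) >= self.threshold:
            return False
        self._events.append(now)
        return True
```

One instance per account (or per IP) and per limit type, fed from the same request log, gives both the real-time stats and the pause signal.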

8 Likes
