I'm exploring the idea of extending the https://certifytheweb.com dashboard and its renewal failure notifications/reporting to support Certbot (and possibly acme.sh). It's clear that some users don't always have visibility into which renewals they have and which (if any) are failing.
So a few general questions for people who have a need for this sort of thing:
Does failure reporting/notifications sound useful to you?
What post-request reporting/notifications/webhooks etc. do you currently use (if any), and why?
Is there an existing open source or commercial solution you are already using, and what features does it have that are important to you?
For a while I had an acme.sh installation where my post-hook to reload my ZNC server's certificate wasn't doing the job and I only detected it because I was monitoring the endpoint directly with Uptime Robot.
Detecting those kinds of incongruities would be useful for me, because it gives reliable assurances that the certificate is really OK.
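To make that kind of check concrete, here is a minimal sketch in Python that compares the certificate on disk with the one the endpoint actually serves. It only uses the standard library; the function names are made up for illustration, and it assumes the service speaks TLS directly on its port (no STARTTLS) and that the first certificate in the file is the leaf.

```python
import hashlib
import ssl

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"


def leaf_fingerprint(pem_text: str) -> str:
    """Return the SHA-256 fingerprint of the first certificate in a PEM string.

    A fullchain file holds several certificates; only the first one
    (assumed to be the leaf) is hashed.
    """
    start = pem_text.index(PEM_HEADER)
    end = pem_text.index(PEM_FOOTER, start) + len(PEM_FOOTER)
    der = ssl.PEM_cert_to_DER_cert(pem_text[start:end])
    return hashlib.sha256(der).hexdigest()


def fetch_served_pem(host: str, port: int) -> str:
    """Fetch the certificate the endpoint presents (does not validate it)."""
    return ssl.get_server_certificate((host, port))


def is_deployed(disk_pem: str, served_pem: str) -> bool:
    """True when the endpoint serves the same leaf certificate that is on disk."""
    return leaf_fingerprint(disk_pem) == leaf_fingerprint(served_pem)
```

A renewal hook or a scheduled job could read the renewed certificate file, call fetch_served_pem for the host and port, and raise an alert when is_deployed returns False, which is the situation where the post-hook ran but the service never reloaded.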
Yep, could be. Depends a little on whether the endpoint is internal or public and what type of service it is. General purpose TLS service monitoring/consistency reporting could be interesting.
We played with failure reporting, but haven’t released it yet. We still log the errors but don’t generate the reports. We set things up so failures are centrally contained within the logs, which are SQL-based, so they are queryable. I think there is a Boolean flag on the final failure records.
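As a rough illustration of why a SQL-backed log makes reporting easy, here is a sketch using Python's sqlite3. The table and column names (renewal_log, is_final_failure and so on) are invented for the example and are not the schema described above.

```python
import sqlite3


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a log database with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS renewal_log (
               id INTEGER PRIMARY KEY,
               logged_at TEXT NOT NULL,          -- ISO 8601 timestamp
               domain TEXT NOT NULL,
               error TEXT NOT NULL,
               is_final_failure INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def log_error(conn, logged_at, domain, error, final=False):
    """Record one error; final=True marks the record that ended the attempt."""
    conn.execute(
        "INSERT INTO renewal_log (logged_at, domain, error, is_final_failure) "
        "VALUES (?, ?, ?, ?)",
        (logged_at, domain, error, int(final)),
    )


def final_failures(conn, since):
    """Return (domain, error, logged_at) for final failures at or after `since`."""
    rows = conn.execute(
        "SELECT domain, error, logged_at FROM renewal_log "
        "WHERE is_final_failure = 1 AND logged_at >= ? ORDER BY logged_at",
        (since,),
    )
    return rows.fetchall()
```

With the flag in place, a failure report is a single query over the log rather than a separate reporting pipeline.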
What was interesting was to track rate limit errors. The idea was to catch certain errors early - like duplicate certs, too many failures, too many pending authz - and either warn or pause operations. If you hit some of these, something is broken and needs to be fixed and/or the account will be wedged for a bit.
Most of the other errors would chain back to a connectivity issue or an ACME server outage. I thought the rate limiting stuff had the most potential.
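Here is one way the early-warning idea could look in Python. The rateLimited error type comes from the ACME spec (RFC 8555), but the spec does not distinguish between individual limits, so this sketch guesses from the human-readable detail text. The substrings below are assumptions about how a CA might word things and would need checking against real responses.

```python
# Error type defined by RFC 8555 for any rate limit violation.
RATE_LIMITED = "urn:ietf:params:acme:error:rateLimited"

# Hypothetical substrings of the "detail" field; CA wording may differ or change.
DETAIL_HINTS = {
    "exact set": "duplicate_certificate",
    "failed authorizations": "too_many_failures",
    "pending authorizations": "too_many_pending_authz",
}


def classify(problem: dict) -> str:
    """Map an ACME problem document to a coarse category."""
    if problem.get("type") != RATE_LIMITED:
        return "other"
    detail = problem.get("detail", "").lower()
    for hint, kind in DETAIL_HINTS.items():
        if hint in detail:
            return kind
    return "rate_limited_unknown"


def should_pause(problem: dict) -> bool:
    """Pause on any rate limit error; other errors only warrant a warning."""
    return classify(problem) != "other"
```

Pausing on every rate limit error is a deliberately conservative choice in this sketch: retrying blindly tends to dig the account deeper into the limit.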
Yeah. My goal was to stay within 80% of rate limits by default, so there is always a bit of extra room to ensure specific operations can successfully complete. So I tried to log every ACME request and error, so we can generate real-time stats and keep each account/IP in a healthy place.
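For anyone wanting to try the 80% idea, a sliding-window counter is probably the simplest version. This is a sketch with made-up names (RateBudget, try_acquire); the limit and window are whatever the CA documents for the limit in question, and one instance would be kept per account or IP.

```python
from collections import deque


class RateBudget:
    """Sliding-window counter that refuses requests past a share of a limit."""

    def __init__(self, limit: int, window_seconds: int, percent: int = 80):
        # Integer arithmetic avoids float rounding, e.g. 10 * 80 // 100 == 8.
        self.ceiling = limit * percent // 100
        self.window = window_seconds
        self.events = deque()  # timestamps of requests inside the window

    def _expire(self, now: float) -> None:
        """Drop timestamps that have aged out of the window."""
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()

    def usage(self, now: float) -> int:
        """Number of requests still counted against the window."""
        self._expire(now)
        return len(self.events)

    def try_acquire(self, now: float) -> bool:
        """Record a request if it fits under the ceiling; otherwise refuse."""
        self._expire(now)
        if len(self.events) >= self.ceiling:
            return False
        self.events.append(now)
        return True
```

Routine renewals would go through try_acquire, while the remaining 20% stays available for operations that must complete, such as finishing an order that is already in progress.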