Monitoring of Let's Encrypt CA

Disclosure - we built this.

We are developing some ACME tools and, as we are curious about Let's Encrypt performance, we built online monitoring as a validation tool.

If you find it useful - good. If you think it’s lacking something - even better, especially if you let us know.

We issue and measure 8 certs/minute across 4 locations (prod) and 20 certs/minute across 5 locations (staging). Each API call is timed separately. The library includes an OCSP client, but we do not collect OCSP data yet.
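To illustrate what timing each API call separately can look like, here is a minimal sketch. It is not our production code: `CallTimer`, `issue_certificate`, and the step names are hypothetical, and the steps are simulated with `time.sleep` rather than real ACME requests.

```python
import time
from contextlib import contextmanager


class CallTimer:
    """Collects the wall-clock duration of each named step separately."""

    def __init__(self):
        self.timings_ms = {}

    @contextmanager
    def measure(self, step):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the call raised an exception.
            self.timings_ms[step] = (time.perf_counter() - start) * 1000.0


def issue_certificate(timer, steps):
    """Run each (name, callable) step in order, timing every call on its own."""
    for name, call in steps:
        with timer.measure(name):
            call()
    return timer.timings_ms


# Placeholder steps standing in for real ACME requests (hypothetical names).
demo_steps = [
    ("new_order", lambda: time.sleep(0.01)),
    ("finalize", lambda: time.sleep(0.02)),
]
timings = issue_certificate(CallTimer(), demo_steps)
```

Keeping one entry per step, rather than a single end-to-end number, makes it possible to see which part of the issuance slowed down.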

You can also get weekly email reports.

https://keychest.net/letsencrypt

Update 1: June 27, 16:15 UTC - upgraded monitoring servers, improved caching, optimized access to speed up data collection, and corrected the visualization of downtimes (we initially assumed that downtime = no API response … how silly of us).

Update 2: We detected the first downtime on the 25th, 11 minutes before the official detection.

That’s extremely cool, nice job!

Is hammering the ACME server like that allowed according to the user agreement?

As far as I can see, the user agreement doesn’t cover technical details. Those are enforced through rate limits, with which we comply.

Interestingly, we did look at testing the utilisation of the LE API, but it sounded too harsh even if it were just a few bursts a day, and you can only do it against a couple of API endpoints. It may also be that this information is better kept “less public”.

We assumed that variations in the load would show up in the overall latency. The first weekly report suggests this may be possible, but only time will tell. The latency went up by 1000 ms on all monitoring stations on Thursday morning. That may simply coincide with maintenance, although the increased latency lasted much longer.

It looks like the downtime table needs improvement. The DB shows downtime:
| 2020-06-25 07:17:07 | |
| 2020-06-25 07:17:01 | |
| 2020-06-25 06:53:10 | |
| 2020-06-25 06:52:40 | |
| 2020-06-25 06:52:11 | |

but the rendered table is empty. The official LE record is:
Identified 7:04am, resolved 7:18am
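
As a rough check of those numbers, the DB rows can be grouped into incidents and compared with the official "identified" time. This is only a sketch under assumptions: the 5-minute grouping gap is an arbitrary choice, `group_incidents` is a hypothetical helper, and the official record is assumed to be on the same date and in the same timezone as the DB rows.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def group_incidents(timestamps, max_gap=timedelta(minutes=5)):
    """Merge failure timestamps into (start, end) incidents when gaps are small."""
    incidents = []
    for ts in sorted(timestamps):
        if incidents and ts - incidents[-1][1] <= max_gap:
            incidents[-1][1] = ts  # extend the current incident
        else:
            incidents.append([ts, ts])  # start a new incident
    return [(start, end) for start, end in incidents]


# Downtime rows as quoted from the DB above.
rows = [
    "2020-06-25 07:17:07",
    "2020-06-25 07:17:01",
    "2020-06-25 06:53:10",
    "2020-06-25 06:52:40",
    "2020-06-25 06:52:11",
]
failures = [datetime.strptime(r, FMT) for r in rows]
incidents = group_incidents(failures)

# Assumption: official "Identified 7:04am" is on the same day and timezone.
official_identified = datetime(2020, 6, 25, 7, 4)
lead = official_identified - incidents[0][0]
print(f"{len(incidents)} incidents, first seen {lead} before official record")
```

With these inputs the first failure precedes the official record by a little under 12 minutes, which matches the figure in Update 2.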

Note: yesterday we thought we had a bug in our client code, as LE started returning “malformed” / “badPublicKey” errors.

Cool tool! Just to be clear, the timestamp on letsencrypt.status.io doesn’t reflect the time our alerts go off; it’s the time that we posted the update. You can see in some of our public post-mortems that the incident timeline includes the internal alert notification time, which is different from the time on the public status page.

Thanks @jillian, I was not aware of that! I have included a bit more detail from our logs in a blog post at https://blog.keychest.net/keep-an-eye-on-lets-encrypt-performance

We have also started producing weekly reports - here’s an extracted chart of the latency over the last week. Each data point represents 100-120 transactions. This particular chart doesn’t show downtimes; it interpolates over missing data.
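
For readers wondering what interpolating missing data means for the chart, here is a small sketch. It assumes plain linear interpolation between the nearest known points, which may differ from what the report actually uses; `interpolate_gaps` is a hypothetical helper and the latency values are made up for illustration.

```python
def interpolate_gaps(values):
    """Fill None entries by linear interpolation between known neighbours.

    Gaps at the very start or end have no neighbour on one side,
    so they are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled


# Illustrative per-bucket latencies in ms; None marks a bucket with no data.
series = [400.0, None, 600.0, 650.0, None]
filled = interpolate_gaps(series)
```

A chart drawn from `filled` shows a smooth line through the gap, which is why downtimes are not visible on it.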
