I analyzed LE uptime at the end of 2017 - looking at its first 16 or so months. As it’s been a while, I was somewhat curious what changed and did another quick analysis of the https://letsencrypt.status.io data to find out.
The exec summary: using only “full disruptions” (including planned restarts of LE services), the uptime is somewhere around 99.92% - compared to 99.86% in 2017.
If we include partial “incidents”, the uptime goes down to 96.4% - while you probably wouldn’t notice some of the partial service disruptions, it’s hard to further categorize those with the available data.
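To put those percentages into more tangible terms, here is a minimal sketch that converts an uptime percentage into equivalent hours of downtime per year. The function name and the 365-day year are assumptions made for illustration; the percentages are the ones quoted above, and the conversion says nothing about how the downtime was actually distributed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Convert an uptime percentage into equivalent downtime hours per year."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


# Percentages quoted in the post above; labels are illustrative only.
figures = [
    ("full disruptions, now", 99.92),
    ("full disruptions, 2017", 99.86),
    ("including partial incidents", 96.4),
]

for label, pct in figures:
    hours = downtime_hours_per_year(pct)
    print(f"{label}: {pct}% uptime is about {hours:.1f} h of downtime per year")
```

By this rough conversion, 99.92% is about 7 hours a year, 99.86% is about 12 hours, and 96.4% is on the order of 13 days.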
You can see the full text with charts at: https://keychest.net/stories/lets-encrypt-uptime-2-years-on
(please do chip in if you spot any issue with the results!)
My main question - when I sipped a single malt looking at the charts - was what aspirations there are in terms of reliability and whether LE can ever achieve the uptime of “commercial” CAs.
Could you explain to me why you’re mixing linear and logarithmic Y-axis between the different graphs?
You’re going to have to quantify that in real numbers before anyone even understands there is a difference.
I mean, is there?
Taking into account the frequency with which other CAs renew certs (annually or every two years), that means they get 6 to 12 times less use.
And given that they probably service that many fewer certs… are you really comparing apples to apples?
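The "6 to 12 times" figure can be reproduced with a quick back-of-the-envelope calculation. This sketch assumes LE certs are renewed roughly every 60 days (the usual recommendation for 90-day certs) and compares that with one-year and two-year renewal cycles; the constant and function names are made up for illustration.

```python
DAYS_PER_YEAR = 365
LE_RENEWAL_INTERVAL_DAYS = 60  # assumption: 90-day certs renewed at ~60 days


def issuances_per_year(renewal_interval_days: float) -> float:
    """How many times a single certificate is (re)issued per year."""
    return DAYS_PER_YEAR / renewal_interval_days


le_rate = issuances_per_year(LE_RENEWAL_INTERVAL_DAYS)
annual_rate = issuances_per_year(365)      # one-year certs
two_year_rate = issuances_per_year(730)    # two-year certs

print(f"LE vs one-year certs: {le_rate / annual_rate:.1f}x more issuance calls")
print(f"LE vs two-year certs: {le_rate / two_year_rate:.1f}x more issuance calls")
```

This only counts issuance calls per certificate; it ignores OCSP traffic and the number of certs each CA actually serves.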
You're right in terms of their sales to end-users. In my experience though, the bulk of their revenues comes from enterprise users. This means API integration, custom root CAs, running OCSPs, etc.
But I take your point.
Purely practical reasons - I started with linear, but subsequent charts were losing interesting information because of large outliers. I mention the log axis in the captions. I suppose I could have come up with a better solution had I had more time.
What’s the uptime if you exclude this week-long “incident”?
Would be nice to have a dedicated graph for OCSP as well, since as you remark, it’s the most important thing to keep online.
97.9% ... there was another partial disruption (timeouts on the API) ... excluding that one as well would give 99.7%.
OCSP - I could see only 2 incidents, both short, 5 and 7 minutes (> 99.99%), but it's not clear here whether those lengths reflect the real downtime, or whether they leave out the time between the first reports and LE starting to look into it.
Ah yeah, this one - 11 days.
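As a rough sanity check, the figures quoted in this thread (96.4% with the incident, 97.9% without it, and an 11-day duration) imply an observation window. The sketch below assumes that excluding the incident only removes its duration from the downtime and leaves the window unchanged; the function name is hypothetical and the percentages are rounded, so the result is only approximate.

```python
def implied_window_days(incident_days: float,
                        uptime_with: float,
                        uptime_without: float) -> float:
    """Estimate the observation window from one incident's effect on uptime.

    Assumes the uptime difference (in percentage points) is entirely
    explained by the incident's duration over a fixed window.
    """
    delta_fraction = (uptime_without - uptime_with) / 100.0
    return incident_days / delta_fraction


window = implied_window_days(incident_days=11, uptime_with=96.4, uptime_without=97.9)
print(f"implied window: about {window:.0f} days ({window / 365:.1f} years)")
```

The result lands at roughly two years, which fits the period the analysis covers, so the quoted numbers at least hang together.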
If I remember right, it started with the move from Akamai to Cloudflare (New CDN for the Production API), where they went from relying on the CDN to terminate SSL, to doing it themselves.
I was originally going to say that your conclusion was a little overblown, but now that I think about it, that migration could have been a bit smoother.
Edit: another interesting thing is that OCSP is still on Akamai. If it had been migrated as well, things could have gone real bad!
Sometimes it's better to avoid touching things that work. Although I find this rule somewhat ... inflexible.