Let's Encrypt stats, breakdown by signature algorithm

Just curious if Let's Encrypt has data that can show any shifts in adoption between RSA and ECDSA as well as key length. If so, could you add it to the stats page, or post it on your blog?

I found radar.cloudflare.com's Data Explorer but it only allows access to the last 12 months of data.


I'd be rather interested in that sort of data as well. But I suspect that the Let's Encrypt staff have higher-urgency things to do with their time. All the data should in theory be retrievable via the Certificate Transparency logs, though I do understand that it's not a trivial undertaking to do so.

Yikes. Just saw the "It’s been a while since we’ve seen jason_s — their last post was 10 years ago" banner above your post. Welcome back to the community forum!


Where are the Certificate Transparency logs? I'm pretty good at parsing files and data processing. Not sure if I want to parse hundreds of millions of certificates, but a statistical sample shouldn't be hard.

nm, found info: Certificate Transparency (CT) Logs - Let's Encrypt


The primary purpose of the CT Logs is to ensure that CAs are following the rules that they're supposed to, so they're structured mostly around their immutable properties and less around making them easy to query. The Cloudflare statistics that you found are based on them. There are some tools to search through the CT logs that I've collected in this thread:

I'm guessing the easiest way to run a traditional data processing search is to use the crt.sh public Postgres database and run some queries, but it's usually pretty overloaded. Maybe there's a better way out there, too.


You may have better luck using something like censys.io

However, the number of queries is very limited on free plans. And I find their API takes some practice to get right (but maybe that is just me).

See: Credits for Free and Starter Users

Are you sure you want the signing algo rather than the pubKey algo? Not that long ago Let's Encrypt signed EC leafs with RSA intermediates.


Oh, right, I'd better be careful with the data I care about. Thanks.


OK, so I worked on a Python script last night to query crt.sh for certificates based on a randomly generated ID, as a statistical sample, and stick them into a local sqlite database so I can analyze the sample. It's been chugging away for almost 24 hours and I have about 6400 certificates so far (plus another 400 or so of another category, which I will mention below).
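For anyone who wants to try something similar, here is a minimal sketch of the fetch-and-cache step. It is not the script described above. It assumes that `https://crt.sh/?d=<ID>` returns the certificate for that crt.sh ID (which, as far as I know, is the site's download link), and the table layout and function names are hypothetical.

```python
import sqlite3
import urllib.request

# Assumption: crt.sh serves the certificate with this ID at this URL.
CRT_SH_DOWNLOAD = "https://crt.sh/?d={cert_id}"


def fetch_cert(cert_id, timeout=30):
    """Download one certificate body from crt.sh and return it as bytes."""
    url = CRT_SH_DOWNLOAD.format(cert_id=cert_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def open_cache(path="certs.sqlite"):
    """Open (or create) the local cache; the schema is a hypothetical one."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS certs ("
        "id INTEGER PRIMARY KEY, body BLOB NOT NULL)"
    )
    return db


def get_cert(db, cert_id, fetch=fetch_cert):
    """Return the cached certificate, downloading it only on a cache miss."""
    row = db.execute(
        "SELECT body FROM certs WHERE id = ?", (cert_id,)
    ).fetchone()
    if row is not None:
        return row[0]
    body = fetch(cert_id)
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO certs (id, body) VALUES (?, ?)", (cert_id, body)
        )
    return body
```

Since crt.sh is often overloaded, a real run would probably want a delay between requests and some retry handling around `fetch_cert`.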

Some interesting tidbits:

Rate and date of certificates in the log

Here is a plot of the Not Valid Before date (linear y-axis labeled by year) vs. crt.sh ID (log x-axis). I knew that the number of certificates keeps accelerating, but I wanted to see more precisely how this changed, so that I could sample the certificates more evenly over time.

You can see a couple of things here. (And apologies if most of this stuff is well-known already, it's interesting to me.)

The logs seem to have two types of certificates:

  • contemporary certificates, placed in the log about the same time as their issuance
  • older certificates that were issued much earlier and, for some reason, were collected into the crt.sh logs after the fact.

The way to distinguish these on the graph is that there is a "wavefront" or "vanguard" above which there are no certificates, except for a few anomalous ones logged in mid-2023 which appear to have a Not Valid Before date 10-12 months later than the date they were added to the log. On the right side of the graph (ID ≥ 10^6 or so) this "vanguard" is a more-or-less solid curve. On the left side of the graph, the "vanguard" is basically a horizontal line in early 2013. (February?)

Certificate Transparency efforts appear to be in production starting in early 2013, and the first 1-2 million log entries on crt.sh appear to be collecting copies of mainly already existing certificates.

The orange curve was my attempt to approximate the vanguard curve with a function ID = f(u) so that I could generate random IDs that would be relatively equally-distributed in Not Valid Before time. The way this works is fairly simple:

  • generate a uniformly-distributed random number u between 0 and 1 (or some sub-interval), where u=0 represents Jan 1 2008, and u=1 represents mid-2025 (approximately the present), so essentially u = (t - Jan 1 2008) / 17.5 years
  • compute ID = f(u), rounded to the nearest integer.
  • query the crt.sh database for this ID, and cache the resulting certificate, indexed by ID
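The steps above can be sketched roughly as follows. The actual fitted curve f(u) is not given in this post, so this sketch swaps in a piecewise log-linear interpolation through a few knots. The knot values in `KNOTS` are made-up placeholders chosen only so the example runs; they are not read off the plot.

```python
import bisect
import math
import random

# HYPOTHETICAL knots (u, crt.sh ID). Placeholder values, not the real fit.
# u = 0 stands for Jan 1 2008 and u = 1 for mid-2025, as described above.
KNOTS = [
    (0.0, 1.0e6),
    (0.5, 1.0e7),
    (0.8, 1.0e9),
    (1.0, 2.0e10),
]


def id_from_u(u, knots=KNOTS):
    """Map u in [0, 1] to an integer ID by interpolating log(ID) linearly."""
    us = [k[0] for k in knots]
    if not us[0] <= u <= us[-1]:
        raise ValueError("u is outside the range covered by the knots")
    # Index of the segment [u_i, u_{i+1}] that contains u.
    i = min(max(bisect.bisect_right(us, u) - 1, 0), len(knots) - 2)
    (u0, id0), (u1, id1) = knots[i], knots[i + 1]
    frac = (u - u0) / (u1 - u0)
    log_id = math.log(id0) + frac * (math.log(id1) - math.log(id0))
    return round(math.exp(log_id))


def sample_ids(n, u_lo=0.0, u_hi=1.0, seed=None):
    """Draw n IDs whose u values are uniform on [u_lo, u_hi]."""
    rng = random.Random(seed)
    return [id_from_u(rng.uniform(u_lo, u_hi)) for _ in range(n)]
```

Interpolating in log(ID) rather than ID matches the log x-axis of the plot, where the vanguard looks like a smooth curve. Each sampled ID would then be looked up on crt.sh and cached by ID.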

I'm not sure what to do about the non-contemporary certificates collected after the fact, so I filtered them out: starting at 2013 and ID = 10^6, I run through my statistical sample of certificates, tracing the "vanguard" forward in time with a slew-rate limit of +3 days per sample (this rejects sudden glitches where there's suddenly an old cert from 2009, or one that jumps forward), and keep only the certificates that are within ±90 days of the vanguard.
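One way to read that filter, as a sketch rather than the exact code used: walk the samples in ID order, let the vanguard date move toward each sample's Not Valid Before date by at most 3 days per sample, and keep only samples within 90 days of it. That the vanguard never moves backward is an assumption of this sketch, and the function name and the (ID, date) tuple layout are hypothetical.

```python
from datetime import datetime, timedelta


def filter_by_vanguard(
    samples,
    start=datetime(2013, 1, 1),
    min_id=10**6,
    max_step=timedelta(days=3),
    window=timedelta(days=90),
):
    """Split (cert_id, not_before) pairs into (kept, rejected) lists."""
    vanguard = start
    kept, rejected = [], []
    for cert_id, not_before in sorted(samples):
        if cert_id < min_id:
            # Low-ID batch: mostly certificates collected after the fact.
            rejected.append((cert_id, not_before))
            continue
        # Slew-rate limit: advance toward this sample's date by at most
        # max_step, and never move backward (assumption of this sketch).
        step = not_before - vanguard
        vanguard += max(timedelta(0), min(step, max_step))
        if abs(not_before - vanguard) <= window:
            kept.append((cert_id, not_before))
        else:
            rejected.append((cert_id, not_before))
    return kept, rejected
```

Because one outlier can move the vanguard by at most 3 days, an isolated old or far-future certificate barely shifts the curve and falls outside the 90-day window.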

That gives us the data plotted below; 1374 certs out of my sample of 6403 were rejected either because the ID is less than 10^6 or the Not Valid Before date is more than 90 days from the vanguard curve.

Here I've plotted the leaf certificates (blue dots) as well as the precertificates (red x's), which appear to start in the spring of 2018, and make up about 40% of my sample set. (All the root certificates I downloaded are in the low-number batch; I only ran across one intermediate certificate for some reason.)

For the remaining 3000 or so data points, which are leaf certificates, I can do a histogram of public key algorithm by calendar quarter of Not Valid Before date:

(upper subplot = fraction of certificates with each public key type and size; lower subplot = number of samples obtained for each quarter)
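A sketch of how such a histogram could be computed with the standard library only. It assumes the key type of each certificate has already been extracted into a label; the label strings used here (such as "RSA-2048") and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def quarter_label(dt):
    """Label a date by calendar quarter, e.g. 2013Q1."""
    return f"{dt.year}Q{(dt.month - 1) // 3 + 1}"


def key_type_fractions(certs, bin_label=quarter_label):
    """Bin (not_before, key_type) pairs and compute per-bin fractions.

    Returns {bin: (sample_count, {key_type: fraction})}, with bins in
    sorted order.
    """
    counts = defaultdict(Counter)
    for not_before, key_type in certs:
        counts[bin_label(not_before)][key_type] += 1
    result = {}
    for label in sorted(counts):
        total = sum(counts[label].values())
        fractions = {k: n / total for k, n in counts[label].items()}
        result[label] = (total, fractions)
    return result
```

Passing a different `bin_label`, for example `lambda dt: str(dt.year)`, gives coarser bins with more samples in each.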

This data set is a bit sparse (need more samples!) but you can see that:

  • the fraction of leaf certificates with RSA 2048-bit keys was nearly 100% in 2013, but it has decreased somewhat over time
  • there was one certificate in my sample with a 1024-bit RSA key in 2013 (a few more in the batch of IDs below one million)
  • ECDSA made an early entrance in late 2014 and 2015, then stayed at a low level for a couple of years, and started a resurgence around 2023, now making up almost half of the certificates in early 2025
  • 4096-bit RSA keys have been fluctuating over time. There are a few 3072-bit RSA keys. (And I found one strange certificate with a 2432-bit RSA key.)

If I aggregate the data by year rather than by quarter, it's a little less noisy (but less time resolution) since there are more samples per histogram bin:

I'd be interested in querying the crt.sh database for statistics, rather than grabbing samples one by one, but I don't know how I would do the kind of signal processing that I have done on my statistical sample. SQL queries wouldn't be able to reject these outliers... should I have included all the leaf certificates, even the ones collected in the logs far after the fact?

Also, I ran into a problem where about 400 certificates from 2013 and earlier were technically malformed and the Python cryptography library cannot read certain fields. 99% of these were issued by GoDaddy or its spinoff Starfield Technologies. (The other few were issued by companies in Spain.) I downloaded the certs but excluded them from my dataset.
