Does LE provide any API access for academic research purposes? For example, data on when certificates are issued, how often they are renewed, how many first-time certificates are requested, etc.
Can't you use certificate logs for that kind of information? No LE API required for the things you're mentioning now.
Yes, CT logs would work for this, but that requires parsing many logs and a large volume of data. I know services such as censys.io provide database lookups for this kind of information, but they are fairly limited in the number of queries per month, even with researcher access. I was therefore wondering whether LE has a similar database (limited to LE certificates, of course) to which it provides access.
They provide a few high-level statistics:
But I think for more in-depth studies, your best bet is to do the CT Log scanning and aggregation yourself. Yes, it's many logs and a large bulk of data to trudge through, but then all the data is there for you to analyze however you want.
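To give an idea of what "scanning and aggregating yourself" can look like, here is a minimal sketch, assuming a log that speaks the RFC 6962 HTTP API (`/ct/v1/get-entries`). The `log_url` argument and the helper names (`fetch_entries`, `parse_leaf`, `count_per_day`) are made up for illustration. Note that the timestamp in a leaf is the time the log accepted the entry, not the certificate's notBefore, and that extracting domain names would additionally need an X.509 parser, which is not shown here.

```python
import base64
import json
import struct
import urllib.request
from collections import Counter
from datetime import datetime, timezone

X509_ENTRY = 0     # leaf carries a final certificate
PRECERT_ENTRY = 1  # leaf carries a precertificate TBSCertificate


def fetch_entries(log_url, start, end):
    """Fetch raw entries from an RFC 6962 log. log_url is a placeholder you
    must supply; logs may return fewer entries than requested, so a real
    scanner has to loop until it reaches `end`."""
    url = f"{log_url}/ct/v1/get-entries?start={start}&end={end}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["entries"]


def parse_leaf(leaf_input_b64):
    """Decode a MerkleTreeLeaf into (timestamp, entry_type, DER bytes)."""
    raw = base64.b64decode(leaf_input_b64)
    # version (1 byte), leaf_type (1), timestamp in ms (8), entry_type (2)
    version, leaf_type, ts_ms, entry_type = struct.unpack(">BBQH", raw[:12])
    if version != 0 or leaf_type != 0:
        raise ValueError("unexpected leaf version or leaf type")
    offset = 12
    if entry_type == PRECERT_ENTRY:
        offset += 32  # skip the 32-byte issuer_key_hash
    elif entry_type != X509_ENTRY:
        raise ValueError("unknown entry type")
    # certificate / TBSCertificate is prefixed with a 3-byte length
    length = int.from_bytes(raw[offset:offset + 3], "big")
    body = raw[offset + 3:offset + 3 + length]
    when = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return when, entry_type, body


def count_per_day(entries):
    """Count log entries per UTC day, keyed by ISO date string."""
    counts = Counter()
    for entry in entries:
        when, _, _ = parse_leaf(entry["leaf_input"])
        counts[when.date().isoformat()] += 1
    return counts
```

Something like `count_per_day(fetch_entries(log_url, 0, 255))` would then give a per-day histogram for the first batch of entries. Please keep the request rate low if you try this against a production log.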
I'm sure Let's Encrypt has monitoring of some other statistics, and of course they have the raw data too (at least for active certificates), but I'm not aware of any existing "researcher" program to get access to it, and I'm guessing their database is hard enough to manage themselves without giving other people access too. And this kind of analysis is (in part) exactly why Certificate Transparency exists.
We don't have a lot of "data infrastructure" ourselves to make sharing this kind of data easy, and as such we only really have Certificate Transparency available.
Not speaking on behalf of LE or ISRG, just my own opinion: I do wish this sort of better-indexed data set existed, not just for Let's Encrypt but for the entire WebPKI. I've been thinking about how to make this available. The amount of data in all of Certificate Transparency is about 50 to 100 terabytes (by my own estimates, which may not be quite right). That's well within reason to index in more digestible formats and even store on a single server, but big enough that it would be expensive to host publicly (expensive enough that I wouldn't do it personally). The operational constraints on CT are quite tight, and having people scrape CT servers can be annoying if it leads to too much load.
Doesn't an external CT monitor practically have to download the entire log to verify the log's soundness, since it needs to calculate the hash tree over all the certificates in it?
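For anyone wondering what "calculating the hash tree" involves, here is a rough sketch of the Merkle Tree Hash as I understand it from RFC 6962 section 2.1: SHA-256 throughout, a 0x00 prefix for leaves, a 0x01 prefix for interior nodes, and a split at the largest power of two smaller than the number of leaves. The function names are my own, and `leaves` stands for the raw leaf bytes (the decoded `leaf_input` values) in log order.

```python
import hashlib


def leaf_hash(data):
    """Hash of a single leaf: SHA-256(0x00 || leaf bytes)."""
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left, right):
    """Hash of an interior node: SHA-256(0x01 || left hash || right hash)."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_tree_hash(leaves):
    """Root hash over an ordered list of leaf byte strings."""
    n = len(leaves)
    if n == 0:
        return hashlib.sha256(b"").digest()  # hash of the empty tree
    if n == 1:
        return leaf_hash(leaves[0])
    # k is the largest power of two strictly smaller than n
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_tree_hash(leaves[:k]), merkle_tree_hash(leaves[k:]))
```

The result would be compared against the `sha256_root_hash` of a signed tree head with the same tree size. The recursive form is only meant to show the structure; for a log with billions of entries you would compute this incrementally rather than hold every leaf in memory.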
I think Oak (LE's CT log) will have all the certificates LE signed in 2022 in it.
I wish this was available too. In one of our commercial products, we track the history of domains: IP addresses, whois changes, etc. We were trying to track certificate history as well, to track a given domain's relation to other domains via shared certificates, but that required forking/patching a lot of fragile bits in Python's urllib3... and it's just a pain to keep up.