Rethinking Certbot's preservation of certificate history

schoen · April 15, 2022, 2:13am

This was inspired by a recent discussion in an issue on Certbot's GitHub page.

suggestion - version the certificates with the directory, not the filename

opened 09:25PM - 26 Apr 16 UTC

feature request area: cert management

I'm in the process of a server migration, and only need to move over the latest …certs. This is a bit more work than it could be, because the versioning is:
- archive/example.com/cert1.pem
- archive/example.com/cert2.pem
- archive/example.com/cert3.pem
- archive/example.com/cert4.pem
- archive/example.com/cert5.pem
and the links are:
cert.pem -> ../../archive/example.com/cert5.pem
That is for each domain x 4 (cert, chain, fullchain, privkey). My suggestion is to use versioning in buckets:
- archive/example.com/0005/cert.pem
- cert.pem -> ../../archive/example.com/0005/cert.pem
This would allow people to just grab a single bucket when moving servers (or cleaning up).

I was there when the Certbot team decided to make Certbot keep extensive history related to prior certificate versions and issuance history (which it does in several ways), and I remember a lot of the motivation behind it. I helped design the mechanism that Certbot uses to track these versions.

My recollection of the motivation for keeping old certificates, keys, etc.

When we first started working on Certbot, we thought sysadmins would often not want to use what we now call an installer plugin, and would often want to configure their sites partially manually. We also thought they would commonly want to manually inspect newly-issued certificates before starting to use them.

All of these intuitions derived from our understanding of prior practice with previous certificate authorities, and also some feedback about preferences of sysadmins who preferred to take a more hands-on approach. (Indeed, a small minority of users have continued to vocally complain about how much Certbot attempts to automate for them, up to the present day.)

For some reason, we thought it was possible that sysadmins inspecting their new certificates would decide that the new certificates were not correct or not what was intended, and would then want to delay "deployment" of the new certificates, or even roll back a deployment to a previous version.

(In fact, we even originally expected a two-phase "obtain certificate" and "deploy certificate" process, where what we now think of as authenticators and installers might be used separately in separate invocations of certbot! And, with automated renewal flows, their timing might be separated by a significant amount of time—measured in multiple days. The new certificate would then be present on the user's disk for the entire period between when it was obtained and when it was deployed, remaining deliberately unused during that interval.)

In light of how few elements of the certificate Let's Encrypt actually allows users to control, and how reliably the system as a whole has worked, this now seems like a vanishingly rare situation, and the only case in which it seems to occur in practice is when people accidentally remove domain name coverage that they didn't mean to. But that has been mitigated a bit in other ways and may still be mitigated in additional ways in the future.

We literally thought at the outset that there might be a common use case for people to say "I don't actually like version 7 of my certificate; let's roll back to version 5". But, roughly speaking, nobody ever asks how to do this.

Some problems caused by the current system

The current system (which, again, I helped design and bear quite a bit of responsibility for) uses a fair amount of disk space. It also keeps old private keys around indefinitely, which is a rapidly decreasing security threat because of the huge rise in PFS ciphersuites, but which does make individual Certbot installations a target for someone who wants to compromise historical TLS traffic that was encrypted with a non-PFS ciphersuite. I don't know what percentage of sessions today end up negotiating such a ciphersuite.

The biggest challenges with the current versioning mechanism, though, are:

- It's kind of brittle with regard to referential integrity. Many users don't understand that they shouldn't rename any of the files under /etc/letsencrypt (even though there is a README file warning them not to), and the result of renaming these files is often that Certbot refuses to run at all, plus a family of bugs (much less common nowadays) where Certbot attempts a renewal every time it's run because it doesn't save renewed certificates in the place it's expecting to find them afterward.
- Users often don't understand what it's for.
- People seem to have a hard time making working backups using symlinks, because they often use backup methods that don't preserve them.
- The symlinks have also been a problem to some extent for the Windows port, where people are even less familiar with symlinks.
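
To make the referential integrity problem concrete, here is a minimal sketch of how dangling symlinks could be detected after someone renames a file. It is not Certbot's own check; the function name `find_broken_symlinks` is hypothetical, and it assumes a POSIX-style filesystem where symlinks are available.

```python
import os

def find_broken_symlinks(root):
    """Return paths under root that are symlinks whose targets do not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists() follows symlinks, so it is False for a dangling link.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

Running something like this over a live directory after renaming a file in the corresponding archive directory would list the links that no longer resolve.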

A possible alternative mechanism

Maybe there could be a new directory called /etc/letsencrypt/old or /etc/letsencrypt/backups which contains (only) the three most recent versions of each privkey.pem, chain.pem, fullchain.pem, and cert.pem for each certificate lineage, not as symlinks but as regular files, kind of on the model of logrotate keeping backups of recent old log files in /var/log. For example, there might be

/etc/letsencrypt/old/example.com/privkey.pem.1
/etc/letsencrypt/old/example.com/chain.pem.1
/etc/letsencrypt/old/example.com/fullchain.pem.1
/etc/letsencrypt/old/example.com/cert.pem.1

and also .2 and .3, but no more. The corresponding /etc/letsencrypt/live/example.com/privkey.pem and so on would still exist at their existing names and locations (especially to make existing web server configurations and documentation continue to be correct), but would now be regular files instead of symlinks into ../../archive/. The /etc/letsencrypt/archive directory would be deprecated and would contain a README file stating that it is no longer used, and that older versions of certificates could be found in /etc/letsencrypt/old (or /etc/letsencrypt/backups).
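
As a rough illustration of the logrotate-style scheme described above, the following sketch shifts older copies up by one suffix and copies the current live files to `.1`. The function name `rotate_backups`, its parameters, and the default of three kept versions are assumptions for illustration only, not existing Certbot code.

```python
import os
import shutil

FILES = ("cert.pem", "chain.pem", "fullchain.pem", "privkey.pem")

def rotate_backups(live_dir, old_dir, keep=3):
    """Shift NAME.1 -> NAME.2 -> ... and copy the current live files to NAME.1."""
    os.makedirs(old_dir, exist_ok=True)
    for name in FILES:
        # Drop the oldest copy, then shift the rest up by one, oldest first.
        oldest = os.path.join(old_dir, f"{name}.{keep}")
        if os.path.exists(oldest):
            os.remove(oldest)
        for n in range(keep - 1, 0, -1):
            src = os.path.join(old_dir, f"{name}.{n}")
            if os.path.exists(src):
                os.replace(src, os.path.join(old_dir, f"{name}.{n + 1}"))
        current = os.path.join(live_dir, name)
        if os.path.exists(current):
            # copy2 preserves the permission bits, which matters for privkey.pem.
            shutil.copy2(current, os.path.join(old_dir, f"{name}.1"))
```

The idea would be to call this just before a renewal writes new files into the live directory, so that `.1` always holds the version that was most recently replaced.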

There would still be a referential integrity issue about what happens if someone edited or renamed the renewal configuration file for a lineage without also changing the corresponding live directory name, but there would no longer be any issues at all about broken symlinks or symlinks pointing to the wrong archive directory.

The storage.py logic would become significantly simpler overall, although there is still a question about atomicity and consistency of updates during a renewal.
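
On the atomicity question, one common technique for regular files is to write to a temporary file in the same directory and then rename it over the destination. The sketch below assumes that approach; `atomic_write` is a hypothetical helper, not something taken from storage.py. It only makes each individual file update atomic, so keeping all four files of a lineage consistent with each other would still need separate handling.

```python
import os
import tempfile

def atomic_write(path, data, mode=0o600):
    """Write bytes to path so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_path, mode)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```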

Cc @certbot-devs. (I'm not trying to saddle you with work that's not part of your roadmap or anything; I might also make an experimental PR to demonstrate this approach if anyone is interested.)

MikeMcQ · April 15, 2022, 2:32am

I like your alternative mechanism. As a refinement, I'd suggest keeping older cert sets only until they expire. Perhaps even using a date/time stamp as the final extension instead of a serial number. This is only to avoid worry of ever-increasing sequence numbers (not for use in purge selection).

In a well-run stable system there would only be one set in the backup. For others maybe quite a few more with various combinations of domain names. In either case, the number is limited and only contains possibly useful cert sets.

schoen · April 15, 2022, 6:18am

@MikeMcQ That suggestion makes a lot of sense to me, but I would worry a little bit about what might happen if a server had its clock set incorrectly. Perhaps there ought to be some other backstop as well.

9peppe · April 15, 2022, 8:43am

More than "set incorrectly" we should think of servers that do not have an RTC and save (or not) the last poweroff time, then update the clock via NTP at powerup: think Raspberry Pi, which is very popular with people hosting at home.

MikeMcQ · April 15, 2022, 1:22pm

The system clock is not required. The purge cycle could run only when a new cert is issued, and that new cert's notBefore date could be the reference for checking whether "old" certs have expired.

And the stamp used instead of an integer could be the notBefore or notAfter date as well. I'd prefer notBefore, but it's not a strong preference.

If concerned about notBefore dates set in the future there are also time stamps in the ACME flows that could be used. Even using the Date from the http response headers is possible.

A serial number with a fixed number of kept versions is fine too, as long as that number is suitably large. We have seen many people with a mess of certs issued in a short time, and it would be nice to have a simpler way of using a prior one. When testing, I have gotten a lot of certs from staging mixed with the production certs in archive, and it has been handy to retrieve a valid cert from that history. I have done similar for posters in this forum (not often though).

It seems the purpose (now) of the old sets is to help people with odd problems. Well-run stable systems don't need them and probably have good backups anyway.

9peppe · April 15, 2022, 1:26pm

It kinda is, tho. If you want to check if a certificate is valid, if you want to validate an OCSP stapled response...

It doesn't have to be ultra-precise, but at least one day accuracy is -- I'd say -- required.

MikeMcQ · April 15, 2022, 1:55pm

You don't have to check an OCSP response for backup rotations. You just look at the X.509 dates to see whether a cert has expired. If its notAfter date precedes a just-issued cert's notBefore date, then it has expired.

As I noted, there are other ways of determining "current time" than system clock if using notBefore is a concern.

jvanasco · April 15, 2022, 2:20pm

@schoen I think that generally looks fine EXCEPT for the versioning being based on the suffix. I still strongly believe that versioning should occur on the directory name.

While I've never been a big fan of suffix-based versioning on log files/etc - which then requires rotation - I understand its utility and history in Linux-based systems.

When it comes to Let's Encrypt certificates though, that's a different story. With these certificates, users often need to back up or migrate the files across servers. Even if you limit the versions to 3, you still have 12 different filenames that must be manually sorted, reassembled, and often renamed. Users in these forums are often confused and intimidated by this.

If the versioning happens on the directory level, the filenames will always be the same. A directory-based versioning model would make it far easier to handle backups and server migrations. I'm not just guessing that this may happen - I actually use some Fabric (https://www.fabfile.org/) routines to do this re-sorting for cloud archiving, because rebuilding/migrating has proven to be much easier when using directory-based versions.

While I'm at it… another feature request I had on Certbot was to put a txt file with metadata about the certificates in it, such as a simple listing of the domain names and the notBefore/notAfter dates. That would take up little space, greatly simplify searching for certificates for advanced users, and make the archives much more usable for novices who do not understand basic OpenSSL commands (which is the majority of Certbot users). While this information isn't particularly useful for single-domain certs, multi-domain certs cause issues because the domain name is not necessarily reflected in the file path. Putting this in plaintext would allow simple operating system commands and searches to surface the correct certificate, without a need to invoke OpenSSL. This also solves issues where domains no longer appear in a renewal config because the domains in a particular lineage changed.
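
A metadata file along these lines could be as simple as the sketch below. The `key: value` layout, the field names, and both function names are assumptions made up for illustration; nothing here is an existing Certbot format.

```python
from datetime import datetime, timezone

def render_meta(domains, not_before, not_after):
    """Render a plain-text summary of a certificate (hypothetical format)."""
    lines = [f"domain: {d}" for d in domains]
    lines.append(f"not_before: {not_before.isoformat()}")
    lines.append(f"not_after: {not_after.isoformat()}")
    return "\n".join(lines) + "\n"

def find_lineages(meta_by_dir, domain):
    """Return the directories whose metadata text lists the given domain."""
    needle = f"domain: {domain}"
    return sorted(d for d, text in meta_by_dir.items()
                  if needle in text.splitlines())
```

With a file like this in each directory, an ordinary text search for `domain: www.example.com` would find the right certificate without invoking OpenSSL.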

Osiris · April 15, 2022, 5:13pm

What's the difference between versioning directories and versioning the files themselves? The entire path would change anyway. I don't really care whether it's the filename or the directory. Pretty much the same?

bmw · April 15, 2022, 8:08pm

Thanks for thinking about this everyone.

I'm personally going to stay out of the conversation for now, but if you all largely reach a consensus about a better design here, please let me know and I'll take a look. This is especially true if someone is interested in helping us implement it.

griffin · April 16, 2022, 3:55pm

Just a note that you would need to add an hour to the notBefore date when comparing since Let's Encrypt backdates by an hour.

petercooperjr · April 16, 2022, 11:19pm

Well, not all CAs that one can use Certbot with necessarily backdate notBefore the same way.
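
The expiry-based purge discussed in this exchange could be sketched roughly as follows, with the backdating offset left as a parameter because it differs between CAs. The function name, the dictionary input, and the `backdate` parameter are hypothetical; reading the dates out of the certificates themselves is assumed to happen elsewhere.

```python
from datetime import datetime, timedelta, timezone

def expired_backups(backups, new_not_before, backdate=timedelta(0)):
    """Return names of backups that had expired when the new cert was issued.

    backups maps a backup name to that certificate's notAfter (aware datetime).
    new_not_before is the notBefore of the certificate that was just issued.
    backdate is how far the CA is assumed to backdate notBefore; with the
    default of zero the check errs on the side of keeping a backup.
    """
    reference = new_not_before + backdate
    return sorted(name for name, not_after in backups.items()
                  if not_after < reference)
```

Because the reference time comes from the newly issued certificate, the result does not depend on the local system clock.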

jvanasco · April 18, 2022, 3:43pm

Versioning based on the filename is absolutely not the same as versioning on the filepath. With filename versioning, much work needs to be done to re-standardize the filename (via symlinks, etc) to the expected name, each step creating an opportunity for accident. With directory versioning, the actual filenames always stay the same.

Certbot automates this for its own usage, by symlinking a versioned suffix file to the standardized, expected filename. For example, while the archived file is cert47.pem, the standard expectation for usage is simply cert.pem.

Some, but not all, reasons include:

When archiving/migrating each set:

- Perhaps you are lucky and using a program command that respects *, as in cp /path/to/old/*14.* /path/to/new. This isn't supported by all executables, nor is it apparent to novice users. You still end up with a versioned filename, which must eventually be renamed (or symlinked to a standardized name) for usage.
- Most users will end up addressing each of 4 files via 4 single commands: cp /path/to/old/cert{version}.pem /path/to/new; cp /path/to/old/chain{version}.pem /path/to/new; etc
- Many users use GUIs, which involves lots of scrolling and selecting multiple files.
- Most operating systems and computer languages implement machine sorting - not human sorting, so a sorted list/window will be 1,10,11,12,13,14,15,16,17,18,19,2,20,21,... not 1,2,3,4,5,6,7,8,9,10,11,12,...
- If the certbot archive format changes - new format or new files - it is hard for humans to visually detect that, and third-party documents are often out-of-date. A single directory archive more easily supports that, and versioning information can be put in there, as the versioned fileset could be:
  - cert.pem
  - chain.pem
  - fullchain.pem
  - privkey.pem
  - meta.txt
  - v2.txt

In the above example, one can instantly know, by looking at the filenames alone, that the certbot payload has changed to v2. With filename versioning, (1) this information would appear below the scroll window, and (2) many archiving/backup utilities would need to be updated to reference these files. By putting everything within a versioned directory, the payload can change without requiring any updates to code, as programs would be targeting the parent directory.

As you should know from your experience on this forum, a large number of users frequently lose certificates and break installations directly due to these nuances and intricacies of naming and sorting. Users often try to delete old certificates, but mistakenly delete still-relevant files due to the above complexities, which are compounded by a machine sort listing all the certs, then all the chains, fullchains and private keys.

Grouping the files by directory will help avoid many (IMHO, most) of these mistakes.
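
The machine-sort versus human-sort difference mentioned above is easy to reproduce. This small sketch compares Python's default string sort with a "natural" sort key; `natural_key` is a helper written for this illustration.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

names = [f"cert{n}.pem" for n in (1, 2, 10, 11, 21)]

# Plain string sort compares character by character: cert10 lands before cert2.
print(sorted(names))
# Natural sort compares the numeric chunk as an integer.
print(sorted(names, key=natural_key))
```

Zero-padded directory names such as 0005 sidestep the problem, since a plain string sort then matches numeric order.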

Then, when enabling an archived/migrated set:

A file based versioning requires multiple commands, as they must be renamed:
ln -s /path/to/archive/{lineage}/cert{version}.pem cert.pem
ln -s /path/to/archive/{lineage}/chain{version}.pem chain.pem
ln -s /path/to/archive/{lineage}/privkey{version}.pem privkey.pem
ln -s /path/to/archive/{lineage}/fullchain{version}.pem fullchain.pem
A directory based versioning can be done in a single command:
ln -s /path/to/archive/{lineage}/{version} .

A single command works, because the actual filename is unversioned and exists as the exact name as expected by programs that consume the files.
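
For comparison, here is a minimal sketch of enabling a directory-versioned set programmatically. It assumes a hypothetical layout in which the live location is itself a single symlink to a versioned archive directory; `activate_version` and its parameters are made up for illustration, and it assumes a POSIX filesystem where renaming one symlink over another is permitted.

```python
import os

def activate_version(archive_dir, version, live_link):
    """Point live_link at archive_dir/version with one symlink swap."""
    target = os.path.join(archive_dir, version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    tmp_link = live_link + ".new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # Renaming the new link over the old one replaces it in a single step.
    os.replace(tmp_link, live_link)
```

After the swap, consumers still read cert.pem, chain.pem, and so on under the same live path, and no file inside the versioned directory has to be renamed.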

There are a handful of other reasons as well, but I believe these are the most apparent and simplest.

Adding: My perspective on this is largely influenced by a professional shift into Product from Engineering. For seasoned developers, such as Certbot engineers, there is little difference between the two formats. For novice developers and barely-technical users, the difference between the two formats is substantial.
