Reading old private key and new certificate mid-renewal

Hi, Am I right in thinking there's a risk that a program reading the symlinks in live/example.com might come along mid-renewal and get half old and half new data, given there are multiple files which need reading? I think so, because the files are updated individually.

An alternative would be for live/example.com to be a symlink so it can be switched atomically between directories. Readers would first open the directory and then access relative files within it using openat(2) or similar. Then all the files accessed would be the same vintage.
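
For illustration, the writer's side of that could be as simple as the following. This is only a sketch of the proposed layout (live/example.com as a symlink), not Certbot's actual code, and the versions/ directory name is made up:

    # Publish a new snapshot directory, then retarget the symlink atomically.
    # Assumes the proposed layout where live/example.com is itself a symlink.
    ln -s ../versions/2024-05-01 /etc/letsencrypt/live/example.com.new
    mv -T /etc/letsencrypt/live/example.com.new /etc/letsencrypt/live/example.com

A reader which opens the directory once and then reads privkey.pem and fullchain.pem relative to that handle sees one snapshot or the other, never a mixture.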

  • If this is an issue, is the current workaround to use pre/post-hooks to block readers during renewal (see the sketch after this list)? That could mean a long blockage for a program which otherwise runs often but only briefly.
  • Is this documented anywhere? If you'd like an issue opened, let me know.
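
To make that workaround concrete, I mean something along these lines; cert-reader is a made-up name for whatever systemd-managed program reads the files:

    # Block the reader for the whole renewal, then let it run again.
    certbot renew \
        --pre-hook  "systemctl stop cert-reader.socket cert-reader.service" \
        --post-hook "systemctl start cert-reader.socket"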

Certbot code of the relevant part: but post/renewal hooks are run after this, so anything called by those hooks won't have this problem

2 Likes

That code seems to suggest a separate problem: the new symlink isn't moved over the old one atomically; instead there is a moment when the symlink doesn't exist at all, at line 604½.

Your observation is theoretically pertinent. However, the problem has a low chance of happening in practice.

3 Likes

Thanks, bruncsak. I assume you're talking about the original mixed-read issue. I think it does happen in practice and more likely with high load given observations here. With the scale of Let's Encrypt deployment, low chance times a big number equals happenings. Perhaps they just result in failures which are re-tried, aren't logged, or aren't investigated. Whichever, it's a bug. Is it a won't-fix one?

Elsewhere, there are comments saying not to do just-in-time certificates because validation isn't guaranteed to complete within a second or so. That means a pre-hook which halts serving to avoid the mixed read could cause quite a bit of downtime. To mitigate that, the pre-hook would have to arrange for a stall where the work is queued for processing by the post-hook. Tedious. An atomic install on renewal avoids all this and looks like the 'proper' fix.

1 Like

I think it'll be self-corrected by the renewal/post hook reloading the webserver, as that happens after everything else Certbot does

4 Likes

I agree the mixed read doesn't persist but it does cause glitches which can be puzzling.

I've thought of a better workaround. Have no server use the /live/example.com files directly, as they're unreliable mid-renewal. Instead, a deploy hook can atomically update copies elsewhere, and it is these copies which the various servers use. If anyone sees a flaw in this, please say.

Anyway, it's a client-side bug: the server side's involvement ends when it hands the DER certificate back to the client (the server doesn't know the private key). Let's make an issue at Issues · certbot/certbot · GitHub

3 Likes

I am just curious ... what kind of programs are doing this?

Because, at least with the Certbot client, the server referencing the /live symlinks would be reloaded after the certs and symlinks are placed. The --nginx and --apache plugins do this, or it can be done with --webroot and a --deploy-hook (like systemctl reload nginx).

Are you describing a specific concern or a general one of syncing "paired" files?

"just in time"? No, you should not wait until seconds before cert expiry to request a fresh cert as your CA might be down, comms problems, or other issues preventing issuance at that moment.

But plenty of systems set up certs as they are added to servers, like Caddy or Apache mod_md. But, yes, there may be delays getting the first cert. This isn't much different from other delays when starting a new system.

4 Likes

So, for instance, Postfix will create and destroy service processes regularly, so it's possible that it will launch a new process that reads an updated file even without being sent a reload command. And in theory, if at that exact moment the key file has been updated and the cert file hasn't been yet (or vice versa), then it would have the wrong key for the cert it's trying to use. So Postfix recommends giving it the key and cert all in one file, so that the pair can be swapped out atomically and it always has the right key for the right cert.

So my deploy hook (I'm not using certbot actually, but doing a similar concept) includes something like

cat /mnt/data/certs/privkey.pem <(echo) /mnt/data/certs/fullchain.pem > /mnt/data/certs/privandchain.pem.new && mv /mnt/data/certs/privandchain.pem.new /mnt/data/certs/privandchain.pem

Where my postfix config includes

smtpd_tls_chain_files =
    /mnt/data/certs/privandchain.pem

I haven't personally seen this sort of expectation in anything other than postfix, though.

5 Likes

The child processes of Postfix are usually run under the postfix user, while the master process, which runs as root, doesn't get destroyed and recreated as far as I know. Child processes won't have access to the private key (only the master does), at least not on my server.

Same goes for Apache, which also has a root-run master process and non-root children.

5 Likes

You are of course correct. So now I'm a bit confused as to why Postfix recommends having the key and cert in the same file.

5 Likes

I've asked myself that same question many times. Needless to say I don't do that :stuck_out_tongue:

Strictly speaking, of course, it could happen that both things occur at the same time, but usually one has their setup configured in such a way that the rollover only happens after a renewal. So I'm not very afraid of some weird setups where it MIGHT happen. Not applicable to my situation anyway :slight_smile:

5 Likes

Hi, Am I right in thinking there's a risk...

There is a theoretical risk with Certbot's design, but it is infinitesimally small. There is no real risk in production.

An alternative would be for live/example.com to be a symlink so it can be switched atomically between directories.

Many people, myself included, would prefer storage and versioning to happen via directories, for a variety of reasons. For its own set of reasons, the Certbot team is not interested in doing this.

I assume you're talking about the original mixed-read issue. I think it does happen in practice and more likely with high load given observations here. With the scale of Let's Encrypt deployment, low chance times a big number equals happenings. Perhaps they just result in failures which are re-tried, aren't logged, or aren't investigated. Whichever, it's a bug. Is it a won't-fix one?

Changing the storage design is a wontfix. The argument you surface has been brought up before; like the others, it was not persuasive to the devs, and it has been deemed wontfix.

I think you are overstating the potential for this issue and overcomplicating the possible fix.

Certbot renewals happen on-demand and on a schedule. The file reads you mention typically happen on-demand and at reboot. The race condition you are talking about is essentially a concern over a service like SMTP restarting in the middle of a Certbot renewal, and perfectly timing an overlap of microseconds. This is extremely unlikely to start with, and the result would be the service failing to (re)start due to an incompatible certificate/key pairing. The failed restart is important, because it notifies the server administrator - a useful notification, but also an avenue which would have created endless issues filed against Certbot if it actually happened.

Certbot uses a lock file to only run one copy at a time. I will note a possible issue here is the location of the lock file, as one might use a single Certbot with multiple archive/log/work paths, and I'm not sure how that situation would be impacted by the lock file. If someone were concerned about race conditions, they could simply check whether Certbot is currently running (via the lockfile) to guard against this particular race condition. I think that would be a lot easier than trying to use the deploy hooks to copy stuff around. Personally, I use the deploy hooks to copy stuff around, but I do it for particular reasons, not to guard against this potential race condition.
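
As a rough illustration, that check could be as simple as waiting for the lock file to disappear before touching the service. The path below assumes the default /etc/letsencrypt config dir, and the postfix restart is just an example:

    # Wait for a running Certbot to finish before (re)starting a reader of live/.
    # Certbot normally deletes .certbot.lock on exit; a crashed run can leave it
    # behind, so a real script would add a timeout.
    LOCK=/etc/letsencrypt/.certbot.lock
    while [ -e "$LOCK" ]; do
        sleep 1
    done
    systemctl restart postfix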

So to summarize:

  • It's just not common for ancillary services to restart during renewal
  • The window for this potential race condition is microseconds
  • The condition could be avoided by not restarting services when Certbot is running.

I'd love to see Certbot implement versioning on directories instead of files. I think I'm the biggest fan of that storage design, but their team is simply not interested in it.

4 Likes

I thought Postfix will not reread the configuration file without a reload command, and all daemons inherit from that on fork. IIRC, changes within a mapping file do not need a reload but declaring or moving the mapping file does.

3 Likes

Hi jvanasco,

The race condition you are talking about is essentially a concern over a service like SMTP restarting in the middle of a Certbot renewal, and perfectly timing an overlap of microseconds.

Not all programs are long-lived. Some start on socket activation, quickly do their thing, and quit, so the paired files are read often. There may be many domains, hence many paired files, and the socket determines which pair is read.

On a heavily loaded server, the time between the Python statements that make the system calls needed to create all the symlinks can be milliseconds. Even if the race condition is to remain, the Python code could be better structured to minimise the window's size.

Having chewed it over since yesterday, it seems clear to me that /live/example.com shouldn't be read by servers. Instead, a deploy hook should atomically update what they do read and the issue is then moot. What's lacking is Certbot documenting the issue and suggesting a workaround. Following orangepizza's advice, I've opened live/example.com is not updated atomically · Issue #9900 · certbot/certbot · GitHub. It may not result in any change but it is at least a marker should someone else search in the future.

Thanks for the discussion, everyone.

1 Like

Hi MickMcQ,

When I said there are comments about just-in-time certificates, I was thinking of ones like this by jsha:

We don't guarantee that issuance will happen in seconds. We may very well add validation processes in the future that take longer, which would break the assumptions of this style of clients.

in Possible race condition observed with autocert - #20 by jsha

Yeah, he was warning that you can't assume a cert will be issued instantly. He gave a couple of possibilities, including the LE servers being down or slow. I noted there could also be delays or failures due to comms, which any system using comms is vulnerable to.

There are just-in-time ACME clients today that LE even recommends. They should do cert acquisition in a separate thread so they don't block other activities, and they should properly educate the users of their system that the first issuance may be delayed.

The race condition you describe and the one in that thread are very different.

2 Likes

You could build in "cert/key" validation before production deployment (see the openssl sketch after these steps).

  • read files from symlink [copy them elsewhere (call this location "staging")]
  • validate them [to each other]
  • if valid then copy them from "staging" to "production" location
  • if not valid retry from step #1 [count the retries]
    [if too many retries, something has gone wrong - halt and complain to the humans that wrote the "loop"]
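
A minimal sketch of those steps with openssl; all paths here are illustrative:

    # 1. copy the pair to a "staging" area (dereference the live/ symlinks)
    cp -L /etc/letsencrypt/live/example.com/privkey.pem   /srv/tls/staging/
    cp -L /etc/letsencrypt/live/example.com/fullchain.pem /srv/tls/staging/
    # 2. validate them to each other by comparing public keys
    cert_pub=$(openssl x509 -in /srv/tls/staging/fullchain.pem -noout -pubkey)
    key_pub=$(openssl pkey -in /srv/tls/staging/privkey.pem -pubout)
    if [ "$cert_pub" = "$key_pub" ]; then
        # 3. promote to the "production" location the servers actually read
        cp /srv/tls/staging/privkey.pem /srv/tls/staging/fullchain.pem /srv/tls/production/
    else
        # 4. mismatch - retry from step 1 (and count the retries)
        echo "staged key and certificate do not match" >&2
    fi
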
1 Like

The bit of Postfix documentation referred to is a little distant from the anchor so I'll quote it here.

You can also store the keys separately from their certificates, again provided each is listed before the corresponding certificate chain. Storing a key and its associated certificate chain in separate files is not recommended, because this is prone to race conditions during key rollover, as there is no way to update multiple files atomically.

1 Like