Hi, am I right in thinking there's a risk...
There is a theoretical risk with Certbot's design, but it is infinitesimally small. There is no real risk in production.
An alternative would be for live/example.com to be a symlink so it can be switched atomically between directories.
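For illustration, the switch itself is easy to make atomic: build the new symlink under a temporary name and rename it over the old one. A minimal sketch in Python, assuming a hypothetical versioned-directory layout (not anything Certbot actually does today):

```python
import os

def switch_live_link(live_link, new_target):
    """Atomically repoint a 'live' symlink at a new versioned directory.

    Sketch only: assumes live_link is a symlink such as
    live/example.com -> ../versions/example.com-0002 (a made-up layout).
    """
    tmp_link = live_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    # Create the replacement symlink under a temporary name first...
    os.symlink(new_target, tmp_link)
    # ...then rename it into place. rename(2) is atomic, so readers see
    # either the old target or the new one, never a half-updated state.
    os.replace(tmp_link, live_link)

# Hypothetical usage, after a renewal has written a complete new directory:
# switch_live_link("/etc/letsencrypt/live/example.com",
#                  "../versions/example.com-0002")
```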
Many people (myself included) would prefer storage and versioning to happen via a directory, for a variety of reasons. The Certbot team is not interested in doing this, also for a variety of reasons.
I assume you're talking about the original mixed-read issue. I think it does happen in practice, and is more likely under high load, given the observations here. At the scale of Let's Encrypt's deployment, a low chance multiplied by a big number still means it happens. Perhaps the incidents just result in failures that are retried, aren't logged, or aren't investigated. Either way, it's a bug. Is it a won't-fix one?
Changing the storage design is a won't-fix. The argument you raise has been brought up before; like the others, it was not persuasive to the devs, and has been deemed won't-fix.
I think you are overstating the potential for this issue and overcomplicating the possible fix.
Certbot renewals happen on demand and on a schedule. The file reads you mention typically happen on demand and at reboot. The race condition you're describing is essentially a concern that a service like SMTP restarts in the middle of a Certbot renewal and perfectly times an overlap of microseconds. That is extremely unlikely to start with, and the result would be the service failing to (re)start due to an incompatible certificate/key pairing. The failed restart matters, because it notifies the server administrator - a useful notification, and also an avenue that would have produced endless issues filed against Certbot if this were actually happening.
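That said, if you wanted the restart path to be defensive about it, a (re)start wrapper could verify that the certificate and private key it is about to hand to the service actually pair up, and bail out if they don't. A rough sketch using the `cryptography` package; the paths are just examples:

```python
from cryptography import x509
from cryptography.hazmat.primitives import serialization

def cert_matches_key(cert_path, key_path):
    """Return True if the certificate's public key matches the private key.

    Sketch only: the paths are examples, and the key is assumed unencrypted.
    """
    with open(cert_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    with open(key_path, "rb") as f:
        key = serialization.load_pem_private_key(f.read(), password=None)

    def spki(public_key):
        # Serialize the public half in a key-type-agnostic way for comparison.
        return public_key.public_bytes(
            serialization.Encoding.PEM,
            serialization.PublicFormat.SubjectPublicKeyInfo,
        )

    # If a renewal swapped one file but not the other between the two reads,
    # these will not line up.
    return spki(cert.public_key()) == spki(key.public_key())

# Hypothetical usage before (re)starting the service:
# if not cert_matches_key("/etc/letsencrypt/live/example.com/fullchain.pem",
#                         "/etc/letsencrypt/live/example.com/privkey.pem"):
#     raise SystemExit("cert/key mismatch - skip this restart and retry shortly")
```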
Certbot uses a lock file so only one copy runs at a time. One possible wrinkle here is the location of the lock file, as someone might run a single Certbot with multiple archive/log/work paths, and I'm not sure how that situation interacts with the lock file. If someone were concerned about race conditions, they could simply check whether Certbot is currently running (via the lock file) to guard against this particular race condition. I think that would be a lot easier than trying to use the deploy hooks to copy things around. Personally, I use the deploy hooks to copy things around, but I do it for particular reasons, not to guard against this potential race condition.
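For the curious, such a check can be done by trying to take the lock yourself in non-blocking mode and treating "already locked" as "Certbot is running". A rough sketch; the lock file path is an assumption and may differ depending on your directories and distribution:

```python
import fcntl

# NOTE: this path is an assumption; Certbot keeps lock files in its
# config/work/logs directories, and the location can vary per setup.
CERTBOT_LOCK = "/var/lib/letsencrypt/.certbot.lock"

def certbot_is_running(lock_path=CERTBOT_LOCK):
    """Best-effort check: try to take the lock without blocking.

    Needs enough privileges to open the lock file read-write.
    """
    try:
        f = open(lock_path, "r+")
    except FileNotFoundError:
        # No lock file: Certbot has presumably not run (or has cleaned up) here.
        return False
    try:
        # If another process holds the lock, this raises instead of blocking.
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        return True
    else:
        fcntl.lockf(f, fcntl.LOCK_UN)
        return False
    finally:
        f.close()

# Hypothetical usage in a service-restart wrapper:
# if certbot_is_running():
#     raise SystemExit("Certbot appears to be mid-renewal; retry the restart later")
```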
So to summarize:
- It's just not common for ancillary services to restart during renewal
- The window for this potential race condition is microseconds
- The condition could be avoided by not restarting services while Certbot is running
I'd love to see Certbot implement versioning on directories instead of files. I think I'm the biggest fan of that storage design, but their team is simply not interested in it.