TL;DR: since about 25.8.2023 we're seeing some of our renewals fail in an interesting new way - the issued certificate does not match the key that was used to sign the CSR.
Hi there! I was wondering whether anyone else has seen this as well: since 25.8.2023, generally between 23:00 and 03:00 UTC, once per day there'll be a period of about 30 minutes, during which a percentage of issued certificates will not match their private key. We'll try to renew ~100 certificates within an hour, and about 15-30 of those attempts will result in a "mismatched" certificate. Outside of this period of time, the renewals don't run into any such issues.
We only detected this yesterday, so I went back and scanned through our database and found 150 such certificates issued within the last week - but none older than 25.8.2023 - and renewing them again worked just fine (i.e. resulted in a working certificate).
I've tried mitigating this to a degree by checking the issued certificate before writing it to our database, so it's not a pressing issue (other than the wasted API requests), but I'd be interested to know whether I'm the only one seeing this.
Full disclosure though - we are rolling our own LE "client" that's tightly coupled to our domain management framework, and I'm not discounting the possibility that it is to blame for this. Granted, there haven't been any changes to it for several months now, but I'll continue to investigate it in any case.
I haven't heard of anything like that, so my guess is that it's something on your end. I suppose it could be your custom client hitting things differently from most other clients in some weird way, though. When you get a certificate that doesn't match the key you're expecting, can you determine which key it does match? Like, is it for another one of your domains, or for the previously-used key for the same domain if it's a renewal, or something like that? Certainly around midnight UTC can be some of Let's Encrypt's busiest times; maybe the response is just taking enough longer then to trigger some sort of race condition in your client?
Can you maybe post some more specifics, of whatever logs you have of the ACME transaction, the CSRs being sent, and the returned certificates that don't match?
Does the issued certificate match another key in your system? I would audit your database/results/logs for that. I would not be surprised if your custom LE client has some bug that has caused confusion between two ACME orders. A lot of clients have surfaced race conditions, fatal ratelimits, or just really weird bugs when they run at peak volumes.
For RSA keys I like to track the SPKI sha256 hash in my system - both by logging and also storing in the database. It's not many bytes, and you can cross-reference all the CSRs, Certificates and PrivateKeys through them very quickly. There is a similar method for ECDSA keys.
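A minimal Python sketch of that cross-referencing idea. It assumes the DER-encoded SubjectPublicKeyInfo bytes have already been extracted from the CSR, certificate, or private key by whatever X.509 library the client uses; the function names and the sample byte strings are hypothetical placeholders, not real keys.

```python
import hashlib
from typing import Dict, Optional


def spki_sha256(spki_der: bytes) -> str:
    """Hex SHA-256 of a DER-encoded SubjectPublicKeyInfo structure."""
    return hashlib.sha256(spki_der).hexdigest()


def cert_matches_csr(csr_spki_der: bytes, cert_spki_der: bytes) -> bool:
    """True when the CSR and the issued certificate carry the same public key."""
    return spki_sha256(csr_spki_der) == spki_sha256(cert_spki_der)


def find_owner(cert_spki_der: bytes, known_keys: Dict[str, str]) -> Optional[str]:
    """Look up which stored key (hash -> label) a certificate's key belongs to."""
    return known_keys.get(spki_sha256(cert_spki_der))


# Placeholder bytes standing in for two different public keys.
key_a = b"placeholder-spki-for-domain-a"
key_b = b"placeholder-spki-for-domain-b"

index = {spki_sha256(key_a): "domain-a", spki_sha256(key_b): "domain-b"}

# A certificate requested with key_a but issued for key_b is flagged,
# and the index shows which other order it actually belongs to.
if not cert_matches_csr(key_a, key_b):
    print("mismatch; certificate key belongs to:", find_owner(key_b, index))
```

Storing the 64-character hash next to each CSR, certificate, and key row turns "which key does this certificate match?" into a single indexed lookup.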
Great idea regarding storing/logging the SPKI hash - unfortunately we didn't, not until now at least. It would definitely have made the cross-checks I did earlier faster and easier.
We do use several threads for batched renewals, and if there's a culprit, it's likely this. While it hasn't given us any grief for the past 12 months it has been running multi-threaded, that doesn't necessarily mean a bug wasn't waiting for an edge case to hit. I'll work on the assumption there's an issue on my end, extend logging to include the hashes, and see what shakes out.
Make sure your database connections and file descriptors are opened after forking/threading and not in the master process. I do a lot of work with SQLAlchemy (Python) and new users often experience issues similar to yours because they established a db connection for application configuration, then failed to "dispose" of it prior to threading/forking -- causing the connections to be (re)used by multiple processes simultaneously. Sometimes this defect sits in code for years before the system load causes people to notice it.
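As a rough illustration of the "open it after threading" advice, here is a sketch using the standard-library sqlite3 module as a stand-in for whatever database the client really uses. The helper name `get_connection` is hypothetical; the point is only that each worker thread lazily creates and keeps its own connection instead of inheriting one from the parent.

```python
import sqlite3
import threading

# Each thread sees its own attributes on this object.
_local = threading.local()


def get_connection(db_path: str) -> sqlite3.Connection:
    """Return a connection owned by the calling thread, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(db_path)
        _local.conn = conn
    return conn
```

A worker would call `get_connection(path)` inside its own thread body, never receive a connection object created by the main thread. With a pooled library the equivalent step is disposing of the parent's pool before the workers start, as described above.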
Turns out it was something much more mundane than that.
Some background: we run two types of renewal tasks:
The "primary" of the two runs once per hour, trying to renew a set number of certificates before they expire. It only considers certificates that haven't previously failed renewal for whichever reason (DNS failures being the most common one, for example).
The other one also retries previously failed certificates, but in a far shorter interval before expiration (up to about 2 days before expiry). It also runs only a couple of times during the night (this honestly should have tipped me off, but I... forgot about it).
And so it came to pass that, because the first task was failing to keep up with the renewals due to an increasing number of extant certificates in the database, a couple of weeks ago these two tasks began to overlap in the database rows they tried to touch; and the LE API sensibly responded to both of them with the result of whichever one came first.
A simple SQL query tweak fixed that and I've seen no mismatches in the logs since yesterday.
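The actual query tweak isn't shown here, so the following is not that fix, just one possible way to keep two tasks from touching the same rows. It is a sketch using sqlite3 with a hypothetical `certificates` table and a hypothetical `claimed_by` column: a task only works on rows it managed to claim, and the guarded UPDATE lets exactly one task win each row.

```python
import sqlite3


def claim_batch(conn: sqlite3.Connection, task: str, limit: int) -> list:
    """Claim up to `limit` unclaimed rows for `task`; return the ids actually claimed."""
    claimed = []
    with conn:  # commits on success, rolls back if an exception is raised
        rows = conn.execute(
            "SELECT id FROM certificates WHERE claimed_by IS NULL "
            "ORDER BY expires_at LIMIT ?",
            (limit,),
        ).fetchall()
        for (cert_id,) in rows:
            # The extra "claimed_by IS NULL" guard means a row already taken
            # by another task is skipped instead of being renewed twice.
            cur = conn.execute(
                "UPDATE certificates SET claimed_by = ? "
                "WHERE id = ? AND claimed_by IS NULL",
                (task, cert_id),
            )
            if cur.rowcount == 1:
                claimed.append(cert_id)
    return claimed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE certificates "
    "(id INTEGER PRIMARY KEY, expires_at TEXT, claimed_by TEXT)"
)
conn.executemany(
    "INSERT INTO certificates (expires_at) VALUES (?)",
    [("2023-09-%02d" % day,) for day in range(1, 6)],
)

primary = claim_batch(conn, "primary", 3)
retry = claim_batch(conn, "retry", 3)
print(primary, retry)  # the two lists never share an id
```

A real setup would also need to release or expire claims once a renewal finishes or crashes; that part is left out.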
This would have been a nightmare to debug without your help, and I would once again like to thank you for your helpful tips and your time.
Well, at the moment our database contains just shy of a quarter-million certificates with an expiration date set in the future, spread over 217447 domains. This means we have to renew about 2800 certificates each day, 120 per hour, assuming none of them fail (which is where the second task in my previous post comes in - we don't want to waste resources trying to renew a certificate whose domain no longer exists).
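The arithmetic behind those figures, as a quick check. It assumes a round 250,000 certificates (the post says "just shy of" that) and that each one is renewed once per 90-day lifetime; renewing earlier than the expiry date shortens the cycle and pushes the rate up a little, which is consistent with the rounded numbers above.

```python
CERT_COUNT = 250_000   # assumption: "just shy of a quarter-million", rounded up
LIFETIME_DAYS = 90     # assumption: one renewal per 90-day certificate lifetime

per_day = CERT_COUNT / LIFETIME_DAYS
per_hour = per_day / 24

print(f"{per_day:.0f} renewals/day, {per_hour:.0f} renewals/hour")
# prints "2778 renewals/day, 116 renewals/hour"
```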
We try to process them up to 24 days before expiration, but it's typically much later than that - the long interval we're working with is there more to "level out" any irregularities in the number of certificates that expire on a given day, as the task only processes a fixed number of certificates each hour.
Aside from LetsEncrypt's order rate-limit and the load-leveling aspect, one other reason for the fixed number is that the set up/propagation/tear down of TXT records needed for DNS-01 verification for wildcard certificates can take the better part of a minute in our case (thus allowing for about 60 certificates per hour, if all of them happen to be wildcards).
To keep this up, we make use of two process threads that each have their own LE account.
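A back-of-the-envelope check of the worst case, assuming every certificate is a wildcard that takes a full 60 seconds, that each worker handles its orders strictly one after another, and that the two workers don't slow each other down (all simplifications, not measured values):

```python
SECONDS_PER_WILDCARD = 60   # assumption: "the better part of a minute", rounded up
WORKERS = 2                 # the two threads, each with its own LE account

per_worker_per_hour = 3600 // SECONDS_PER_WILDCARD
total_per_hour = per_worker_per_hour * WORKERS

print(per_worker_per_hour, total_per_hour)  # prints "60 120"
```

Under those assumptions two workers just about cover the roughly 120 renewals per hour mentioned earlier, with no headroom for retries.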
I'd also be interested to hear from others that handle a similar (or perhaps even much greater) amount of certificates.
Hence the two process threads, that are mentioned a little later
But aside from bombarding LE API with requests, there's also the load on our own DNS API/infrastructure to consider here - after all, I'm not the only one using it. In the future, we're likely to look at having a separate DNS server instance that only handles verification TXT records linked by a CNAME.
Excellent, yes I'm working on very similar problems, with the addition of allowing fallback to compatible CAs if problems arise. Our current target is 1m certs over 90 days from a single instance, which is achievable but can be quite hard to test! I've found available self-hosted CAs like smallstep and vault can produce spurious results if you pound them for long enough (which leads to large scale failures in a short period). At that level things like logging, notifications, deployment etc can become a resource problem and it's a race to keep renewals going if something falls over. HTTP validation is definitely the cheapest challenge response method, but we do have a cloud-based acme-dns (like) implementation as well.
We have the same problem of prioritising failures with back-off while still getting to the other certs that still need to renew (or new ones), and also looking at the concept of automated renewal scattering so that clusters of planned renewals are identified and don't all become due at the same time.
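One simple way to sketch the scattering idea is to derive a stable per-domain offset from a hash, so certificates that share an expiry date don't all come due on the same day. This is only an illustration: the function name and the 24-day/10-day window are hypothetical defaults, not anyone's production values.

```python
import hashlib
from datetime import date, timedelta


def scattered_renewal_date(domain: str, expires: date,
                           earliest_days: int = 24,
                           latest_days: int = 10) -> date:
    """Pick a stable renewal date between earliest_days and latest_days before expiry."""
    window = earliest_days - latest_days + 1
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    # The same domain always maps to the same offset in 0..window-1.
    offset = int.from_bytes(digest[:8], "big") % window
    return expires - timedelta(days=latest_days + offset)


expiry = date(2023, 12, 1)
for name in ("a.example", "b.example", "c.example"):
    print(name, scattered_renewal_date(name, expiry))
```

Because the offset depends only on the domain name, the schedule is reproducible across restarts without storing anything extra.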
To tackle failing domains I'm also looking at integrating RDAP queries as an early warning for domains that customers haven't renewed, as permanent failures are a drain on renewal resources.
Have you thought of doing the first round of renewals via one authentication method [like HTTP-01] and doing the second round of renewals [those that failed round one] via another authentication method [like DNS-01]?
I don't think we've ever considered it, to be honest.
My thinking is thus: if the http-01 challenge has failed, it essentially means the domain's A/AAAA records don't point to our web servers (anymore). Silently issuing a certificate using dns-01 (assuming the domain is still registered and uses our DNS servers) and subsequently serving it from our web servers would then be counterproductive, as the website loads from elsewhere. This could then potentially catch the customer unawares (and result in a helpdesk ticket in case the "live" certificate expires).
So we stick with http-01 where possible and either fail early (whenever the customer tries to issue a new certificate) or notify them via e-mail (on failed renewals). If the customer remedies the issue prior to expiry, we'll renew the certificate without requiring any further input from them.
However I'm sure there are contexts where it would be a useful fallback.
An increasingly common cause of these failures is network routing issues. They've probably been around forever, but LetsEncrypt adoption is surfacing them in this context. Sometimes, oftentimes even, ISPs and Data Centers f-up for a few hours and validation servers can't connect.
If you're working at scale, I like to route all traffic under /.well-known/acme-challenge to a dedicated ACME system. This can be wonky with Hosting Providers (seems like you are one) as you may want to fail over back to the customer's domain.
The CNAME system along with an acme-dns like system is preferable when possible. This is difficult when your customer has many domains and you don't control their DNS, as they'll need to point two records to you (domain and challenge).
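To make the "second record" concrete, here is a small sketch that works out the one-time CNAME a customer would need to create so that DNS-01 challenges land in a dedicated validation zone. The zone name `acme-validation.example.net` and the target naming scheme are hypothetical (acme-dns itself hands out its own subdomain names); the only fixed part is that the challenge lives at `_acme-challenge.<name>`, with wildcards validated at the base name.

```python
def challenge_record_name(identifier: str) -> str:
    """DNS-01 TXT owner name for an identifier; wildcards validate at the base name."""
    base = identifier[2:] if identifier.startswith("*.") else identifier
    return "_acme-challenge." + base.rstrip(".").lower()


def delegation_cname(identifier: str, validation_zone: str) -> tuple:
    """Return (owner, target) of the CNAME pointing the challenge at the validation zone."""
    owner = challenge_record_name(identifier)
    base = owner[len("_acme-challenge."):]
    target = f"{base}.{validation_zone.rstrip('.')}."
    return owner + ".", target


owner, target = delegation_cname("*.Example.com", "acme-validation.example.net")
print(f"{owner} CNAME {target}")
# prints "_acme-challenge.example.com. CNAME example.com.acme-validation.example.net."
```

Once that CNAME exists, the TXT records only ever need to be written in the validation zone, which keeps the churn off the main DNS infrastructure.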
You can download Boulder - LetsEncrypt's system - from GitHub and run that as your test server.
Cloudflare does an interesting thing - they get a secondary backup certificate from an alternate CA so they can immediately roll over clients if there is a mass revocation. At your volume, you should be looking into that as well.
Wikipedia actually takes this a step further, and runs different CAs in different data centers, so that they know that both certificates are really actually "good" and work with live traffic. Then, if one of the CAs has an OCSP failure (or other problem like mass revocation) they just run a script to switch all data centers to using the other CA's certificate (all the data centers have the certificates "ready" and collect valid OCSP responses to staple for all the certs, not just their "active" one). Not many organizations work at Wikipedia's scale, though.