Not sure if it's possible, but it seems like it would be useful to know ballpark numbers for how often more than one cert is generated within the 30-day cached window. In the past, one of the strategies recommended for people with small-ish server farms was to explicitly take advantage of cached validations: each server could more easily get its own cert instead of synchronizing a shared one, with renewal times staggered over days as a safety net against renewal issues or revocation. I'd be curious how many folks are doing that in practice (I'm not). It might get conflated with folks who are just getting started and issue multiple certs by accident that they lose and never use, though in those cases they'd probably run up against rate limits within the first 7 hours anyway. In any case, the reduced load from not having the CAA re-check infra might be somewhat offset by the additional validation load.
As for the actual reduced cache window: I think the number of DNS providers that require serially processing validations because they don't support multiple TXT records for the same FQDN is vanishingly small. Of the ~70 DNS providers that Posh-ACME supports, only one actually requires this, and only because of an artificially imposed limitation on a free dynamic DNS provider (as opposed to some technical limitation). I'd guess the number with propagation times on the order of hours is also vanishingly small. Remember, we're only talking about propagation to the authoritative nameservers of the zone, not to all the recursive resolver caches on the Internet.
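For context, supporting parallel DNS-01 validation of several pending challenges on one name just means allowing multiple TXT records at the same FQDN. An illustrative zone snippet (placeholder domain and token values, not real ACME key authorizations):

```text
; Two DNS-01 challenge tokens published simultaneously for the same name.
_acme-challenge.example.com. 60 IN TXT "token-for-challenge-A"
_acme-challenge.example.com. 60 IN TXT "token-for-challenge-B"
```

A provider that only permits one TXT value per name forces the client to publish, validate, and delete each token in turn.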
It can, but the majority of commercial providers I've seen replicate in under 5 minutes (e.g. Cloudflare appears to happily replicate in under 10 seconds; AWS and Google Cloud DNS are under 60 seconds). It would perhaps be easier to document the DNS providers that are reliably slow so people can consider avoiding them.
While this is true, I felt like this proposal wasn't so much about reducing load (shortening cache times seems counterintuitive to that), but rather about removing complexity. That means less room for certificate misissuance (CAA rechecking was broken in the past), less code to maintain, and fewer things that can break. This may be worth much more to LE than minor changes in validation load.
(Of course LE also cannot afford to ignore performance, both sides have to be considered)
Yes, but there's also nothing preventing someone from setting their TTL to 1 minute. Therefore we ignore the TTL and just use the BR-mandated cache time of at most 8 hours. We can't rely on the goodness of subscribers' hearts to set TTLs that make life easy for us.
There's a difference between not caching validations (i.e. not allowing a future new order to reuse an already-validated authorization from a previous order) and what I'm proposing here. Here I'm saying that even within the context of a single order, finalization will have to occur within 7 hours of validation. I'm not sure if ZeroSSL has any limitations like that. (They may! I'm truly not sure.)
Yeah, I'm working on getting related numbers now. Sometime in early January I should know how old orders are when they're finalized (to know if people regularly take more than 7 hours to complete issuance) and I'll know how old authorizations are when they're attached to a new order (to know how often people rely on authorization reuse outside that 7-hour window). These aren't perfect numbers, but they should give us a broad picture.
Not long ago, our secondaries were provided by routing.net, and I regularly experienced propagation times of several hours, and more than once several days, which already caused me headaches because in the beginning, I serially handled certificate renewals – so yes, there are problematic installations out there besides big or free DNS providers.
Concerning the problem of multiple SANs: Some ACME clients (I'm using dehydrated) handle SANs sequentially IIRC, i.e. get a challenge, propagate TXT RRs, validate this challenge, then get the next. This could be improved if they were changed to get all the challenges, update all the TXTs and then validate all of them at once – but I don't know how many ACME clients would have to be changed and how much work this would be …
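The sequential vs. batched flow could be sketched roughly like this (a toy model; every helper name here is a hypothetical stand-in for an ACME client's internals, not dehydrated's actual hook API):

```python
# Sketch: batch all DNS-01 challenges for an order instead of handling
# one SAN at a time. One propagation wait then covers every name.
# All names/functions here are illustrative, not a real client's API.

def fetch_challenge(domain):
    """Pretend to ask the CA for a DNS-01 challenge token for one name."""
    return {"domain": domain, "txt": f"token-for-{domain}"}

def publish_txt(challenge, zone):
    """Pretend to create the _acme-challenge TXT record at the DNS provider."""
    name = f"_acme-challenge.{challenge['domain']}"
    zone.setdefault(name, []).append(challenge["txt"])

def validate(challenge, zone):
    """Pretend the CA queries the authoritative zone for the token."""
    name = f"_acme-challenge.{challenge['domain']}"
    return challenge["txt"] in zone.get(name, [])

def issue_batched(domains):
    """Get all challenges, publish all TXT records, then validate together."""
    zone = {}
    challenges = [fetch_challenge(d) for d in domains]
    for c in challenges:
        publish_txt(c, zone)
    # ...a single propagation wait would go here, instead of one per name...
    return all(validate(c, zone) for c in challenges)
```

The sequential variant interleaves a propagation wait between every publish/validate pair, which is where a per-name delay multiplies by the SAN count.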
I've never encountered that, but I have encountered many people struggling with weird caching issues between API/control-panel updates and DNS server responses that often take more than 7 hours because of common workarounds with DNS-01 for multi-domain certificates.
1. User uses Certbot to request a certificate for 10 domains.
2. Certbot processes a DNS challenge for Domain1.
3. The user's update-hook script sleeps for 60 minutes + 1 second between updating the DNS and returning to Certbot. Why? After updating the DNS record in their vendor's internal system, old records may linger in DNS-server or application caches. Most users wait 301 or 601 seconds, but many wait 3601.
As an example, many years ago Namecheap appeared to have this behavior:
Their DNS servers respected the TTL, but loaded records through a read-through application-layer cache. That application cache did not respect the TTL and seemed to have a 5-minute expiry.
Updating DNS records via their control panel or API would update their database-backed internal systems, but would not clear the application-layer cache (it was only read-through, and lacked a write-through hook on the admin side).
Old records would often get cached into the application-layer cache, whether by failed validation attempts, by clients/users checking that the record was set successfully, or perhaps by some internal cache-loading logic the vendor implemented. The only way around this was to sleep for 501 seconds.
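Instead of a fixed worst-case sleep, a hook can poll until the new record is actually visible and return as soon as it is. A minimal sketch, assuming `lookup_txt` is some callable the user supplies that queries the zone's authoritative nameservers (e.g. via dnspython in a real hook; none of these names come from any actual client):

```python
import time

def wait_for_txt(lookup_txt, fqdn, expected, timeout=600, interval=15):
    """Poll until `expected` appears among the TXT strings for `fqdn`.

    `lookup_txt(fqdn)` should return the TXT strings currently served by
    the zone's authoritative nameservers, so recursive-resolver caches
    (and broken application-layer caches in front of them) don't matter.
    Returns True once the record is visible, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if expected in lookup_txt(fqdn):
            return True
        time.sleep(interval)
    return False
```

This turns a guaranteed 3601-second wait into "however long propagation actually takes," at the cost of needing a direct query against the authoritative servers.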
I don't know if people still deal with this regularly or if this was an artifact of people wanting to stuff the SAN with as many domains as possible in the early days of LetsEncrypt. There are probably a few dozen posts on this forum about issues like this.
I'm 100% for the shortened time, but I think 24 hours would be preferable for this use-case.
I think a shortened time would actually solve another common problem: people get confused when a cached validation is used for an order while a new challenge fails. The logging/printing of ACME clients tends to conceal what is going on, so people start digging themselves into the wrong troubleshooting hole.
Yeah, to be clear, suggestions along the lines of "what about 24 hours" are interesting to read but unlikely to be considered in this case -- if we don't shorten validation document reuse to the same time as CAA to allow ourselves to simplify services, it is unlikely that we will shorten validation document reuse at all, as doing so has the potential to be disruptive with little concrete benefit.
If Let's Encrypt ever experiences challenge validation delays above 90-100 seconds, shortening the authorization lifetime could have an impact.
While looking at this issue I realized that 3 popular clients will all bail out of an order if challenges sit in pending/processing for more than 100 seconds.
With a long authorization lifetime (over 12-24 hours so that the authzs are still valid at the next scheduled client run), the certificate issuance will probably succeed due to already-valid authorizations sitting in the account.
If the lifetime is reduced to 7 hours, any clients relying on that behavior won't be able to complete their orders.
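Roughly, the give-up behavior looks like this (a sketch only; `poll_challenge` and `get_status` are hypothetical names, and real clients differ in their exact limits and backoff):

```python
import time

def poll_challenge(get_status, give_up_after=100, interval=5):
    """Poll a challenge until it leaves pending/processing, or give up.

    `get_status` stands in for fetching the challenge object from the CA.
    A client that gives up here abandons the order, but any authorization
    that later flips to valid stays cached on the account -- which only
    rescues the next scheduled run if the authorization lifetime outlives
    the gap between runs.
    """
    deadline = time.monotonic() + give_up_after
    while time.monotonic() < deadline:
        status = get_status()
        if status not in ("pending", "processing"):
            return status
        time.sleep(interval)
    raise TimeoutError("challenge still pending; client gives up on the order")
```

With a 7-hour lifetime, an order abandoned this way by a client on a daily schedule would find those authorizations already expired by the next run.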
1: What if my DNS provider has really slow propagation, and also only lets me set one TXT record at a time? Then how can I issue a cert with multiple names?
For which my answer is: yes, this is unfortunate, and probably a real issue for a tiny fraction of people, but there are other solutions here, such as using a CNAME to delegate your ACME DNS entries to a DNS provider with better behavior.
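The CNAME delegation workaround looks like this in the slow provider's zone (example names only; the target is typically an acme-dns-style service or any zone hosted at a faster provider):

```text
; In the zone at the slow/limited provider (set once, never changed again):
_acme-challenge.example.com.  IN CNAME  example-com.acme.fast-dns.example.

; The ACME client then only creates/updates TXT records in the delegated
; zone at the faster provider; the CA follows the CNAME during validation.
```

After the one-time CNAME is in place, the slow provider's propagation and single-TXT limits no longer matter for issuance.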
2: What if my validations are really slow, so my issuance relies on timing out while trying to issue one day, and then re-using those eventually-successful validations the next day?
For which my answer is twofold: first, this should show up in the new "age of authz when attached to a new order" histogram that will be deployed next week; and second, it seems unlikely to me that we'll have 100-second delays validating the same name over and over again. I'd expect that delays (but not failures!) that take that long would be more likely to be transient.
Finally, there's another concern, which is simply: what if a large subscriber relies on validation reuse to be able to quickly reissue certificates? (For example, one could imagine that a hosting provider just constantly maintains authorizations for all names they host, and dynamically issues 100-name certs each time a customer joins or leaves their platform.)
For which my answer is: well, we're going to start gathering data, and then talk to these subscribers, and see what can be done.