Yeah, I think this is fine. For comparison, ZeroSSL doesn't cache validated authorizations at all (it may have done so in the past, but it doesn't reliably do so now; mind you, it also takes a serious amount of time to complete HTTP validation etc.), so clients already have to cope with that sort of thing.
Not sure if it's possible, but it seems like it would be useful to know ballpark numbers for how often more than one cert is issued within the 30-day cache window. I know in the past, one of the strategies recommended for people with smallish server farms was to explicitly take advantage of cached validations so that multiple servers could more easily get their own certs instead of synchronizing a shared one, staggering renewal times over days as a safety net against renewal issues or revocation. I'd be curious how many folks are doing that in practice (I'm not). But I suppose it might get conflated with folks just getting started who issue multiple certs by accident that they lose and never use. Though in those cases, they probably run up against rate limits within the first 7 hours. In any case, the reduced load from not having the CAA re-check infra might be somewhat offset by the additional validation load.
As for the actual reduced cache window: I think the number of DNS providers that require serially processing validations because they don't support adding multiple TXT records for the same FQDN is vanishingly small. Of the ~70 DNS providers that Posh-ACME supports, only one actually requires it, and only because of an artificially imposed limitation on a free dynamic DNS provider (as opposed to some technical limitation). I'd guess the number that have propagation times on the order of hours is also vanishingly small. Remember, we're only talking about propagation to the authoritative nameservers of the zone, not to all of the recursive resolver caches on the Internet.
With anycast, even propagation amongst (an unknown number of) authoritative nameservers can be a thing.
It can, but the majority of commercial providers I've seen replicate in under 5 minutes (e.g. Cloudflare appears to happily replicate in under 10 seconds; AWS and Google Cloud DNS are under 60 seconds). It would perhaps be easier to document the DNS providers that are reliably slow so people can consider avoiding them.
I agree it's not common, but I just wanted to raise awareness.
Yandex is reliably slow, needing a 1-2 hour wait for DNS challenges. I've seen several examples of this in recent months (one here).
While this is true, I felt like this proposal wasn't so much about reducing load (shortening cache times seems counterintuitive to that goal), but rather about removing complexity. That means less room for certificate misissuance (CAA rechecking was broken in the past), less code to maintain, and fewer things that can break. This may be worth much more to LE than minor changes in validation load.
(Of course LE also cannot afford to ignore performance, both sides have to be considered)
Yes, but there's also nothing preventing someone from setting their TTL to 1 minute. Therefore we ignore the TTL and just use the BR-mandated cache time of at most 8 hours. We can't rely on the goodness of subscribers' hearts to set TTLs that make life easy for us.
There's a difference between not caching validations (i.e. not allowing a future new order to reuse an already-validated authorization from a previous order) and what I'm proposing here. Here I'm saying that even within the context of a single order, finalization will have to occur within 7 hours of validation. I'm not sure if ZeroSSL has any limitations like that. (They may! I'm truly not sure.)
Yeah, I'm working on getting related numbers now. Sometime in early January I should know how old orders are when they're finalized (to know if people regularly take more than 7 hours to complete issuance) and I'll know how old authorizations are when they're attached to a new order (to know how often people rely on authorization reuse outside that 7-hour window). These aren't perfect numbers, but they should give us a broad picture.
Not long ago, our secondaries were provided by routing.net, and I regularly experienced propagation times of several hours (and more than once, several days), which caused me headaches because in the beginning I handled certificate renewals serially. So yes, there are problematic installations out there besides the big or free DNS providers.
Concerning the problem of multiple SANs: some ACME clients (I'm using dehydrated) handle SANs sequentially, IIRC: get a challenge, publish the TXT RRs, validate that challenge, then get the next. This could be improved if they were changed to get all the challenges, update all the TXT records, and then validate all of them at once. But I don't know how many ACME clients would have to be changed, or how much work that would be.
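To make the batching idea concrete, here is a minimal sketch of that two-phase flow. The helpers `add_txt_record` and `validate_challenge` are hypothetical placeholders standing in for the DNS provider API and the ACME validation call, and the event list just records the ordering:

```python
# Sketch of batched DNS-01 handling: publish all TXT records first,
# wait once for propagation, then validate every challenge.
# add_txt_record / validate_challenge are stand-ins, not a real client API.

events = []

def add_txt_record(fqdn, value):
    # placeholder for the DNS provider's update API
    events.append(f"add {fqdn}")

def validate_challenge(name):
    # placeholder for telling the CA to attempt validation
    events.append(f"validate {name}")

def batched_dns01(names):
    # Phase 1: publish every TXT record up front
    for name in names:
        add_txt_record(f"_acme-challenge.{name}", "token")
    # Phase 2: one shared propagation wait instead of one wait per name
    events.append("wait once for propagation")
    # Phase 3: validate all challenges
    for name in names:
        validate_challenge(name)

batched_dns01(["a.example", "b.example"])
print(events)
```

The point is that the propagation wait, usually the dominant cost, is paid once per order instead of once per SAN.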
I've never encountered that, but I have encountered many people struggling with weird caching issues between API/control-panel updates and DNS server responses. These often take more than 7 hours because of common workarounds with DNS-01 for multi-domain certificates.
Example:
- User uses Certbot to request a Certificate for 10 Domains
- Certbot processes a DNS challenge for Domain1
- The user's update-hook script `sleep`s for 60 minutes + 1 second between updating the DNS and returning to Certbot. Why? After updating their vendor's internal DNS record, they may have issues with old records cached in DNS servers or application caches. Most users wait 301 or 601 seconds, but many wait 3601.
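A hypothetical hook of that shape might look like the following. `publish_txt` is a placeholder for the vendor's DNS API; `CERTBOT_DOMAIN` and `CERTBOT_VALIDATION` are the environment variables Certbot sets for manual auth hooks:

```python
#!/usr/bin/env python3
# Hypothetical certbot --manual-auth-hook sketching the pattern above:
# publish the TXT record through the vendor's API, then sleep long enough
# to outlast the vendor's stale cache before returning to certbot.
import os
import time

CACHE_WAIT_SECONDS = 3601  # 60 minutes + 1 second, as in the example above

def publish_txt(fqdn: str, value: str) -> None:
    # stand-in for the vendor's DNS update API call
    print(f"TXT {fqdn} -> {value}")

def main() -> None:
    domain = os.environ["CERTBOT_DOMAIN"]          # provided by certbot
    validation = os.environ["CERTBOT_VALIDATION"]  # provided by certbot
    publish_txt(f"_acme-challenge.{domain}", validation)
    time.sleep(CACHE_WAIT_SECONDS)  # wait out the cached old record

# certbot executes this script once per domain on the order, so the
# sleep above is paid for every SAN, not once per certificate.
```

Because the hook runs once per domain, these sleeps stack up quickly on multi-SAN orders.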
As an example, many years ago Namecheap appeared to have this behavior:
- Their DNS servers respected the TTL, but used a read-through application-layer cache to load records. That application cache did not respect the TTL and seemed to have a 5-minute expiry.
- Updating DNS records via their control panel or API would update their database-backed internal systems, but would not clear the application-layer cache (it was only read-through, and lacked a write-through hook on the admin side).
Old records would often get cached into the application-layer cache, whether by failed attempts, by clients/users checking to see whether the record was set successfully, or perhaps by some internal cache-loading logic the vendor implemented. The only way around this was to `sleep` for 501 seconds.
I don't know if people still deal with this regularly or if this was an artifact of people wanting to stuff the SAN with as many domains as possible in the early days of LetsEncrypt. There are probably a few dozen posts on this forum about issues like this.
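Some back-of-the-envelope arithmetic on the earlier example (numbers are illustrative, taken from the scenario above) shows why these hooks clash with a 7-hour window:

```python
# Rough arithmetic for the 10-domain Certbot scenario above.
HOOK_SLEEP_SECONDS = 3601   # per-domain auth-hook sleep from the example
DOMAIN_COUNT = 10           # SANs on the order

total_hook_delay = HOOK_SLEEP_SECONDS * DOMAIN_COUNT  # 36010 s, ~10 hours
seven_hours = 7 * 3600                                # 25200 s

print(total_hook_delay)                # 36010
print(total_hook_delay > seven_hours)  # True: hooks alone outlive 7 hours
```

Ten domains at 3601 seconds each is roughly 10 hours spent in hooks alone, so the earliest validations on such an order would already be past a 7-hour cutoff before finalization.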
I'm 100% for the shortened time, but I think 24 hours would be preferable for this use-case.
I think a shortened time would actually solve another common problem: people get confused when a cached validation is used for an order while a new challenge fails. The logging/printing of ACME clients tends to conceal what is going on, so people start digging themselves into the wrong troubleshooting hole.
But that fails to meet the desired requirement of
Seems like if it is exceeding the TTL, it would be time to find better DNS nameservers and support providers.
Note the use of the past tense in my words.
True, but maybe there is a way to drop or significantly reduce those services in a shortened time.
I don’t know how many users rely on the pattern I mentioned, but it was once very common for DNS-01 users.
Yeah, to be clear, suggestions along the lines of "what about 24 hours" are interesting to read but unlikely to be considered in this case -- if we don't shorten validation reuse to the same time as CAA to allow ourselves to simplify services, it is unlikely that we will shorten validation reuse at all, as doing so has the potential to be disruptive with little concrete benefit.
Is it possible to track how many authorisations are currently reused between 7 hours and 30 days?
Yes, we will collect that data soon: Add histogram to track authz reuse ages by aarongable · Pull Request #6554 · letsencrypt/boulder · GitHub
It would be interesting if that could track usage by challenge type, but that would require a considerable amount of new code.
If Let's Encrypt ever experiences challenge validation delays above 90-100 seconds, shortening the authorization lifetime could have an impact.
While looking at this issue I realized that 3 popular clients will all bail out of an order if challenges sit in `pending`/`processing` for more than 100 seconds.
With a long authorization lifetime (over 12-24 hours so that the authzs are still valid at the next scheduled client run), the certificate issuance will probably succeed due to already-valid authorizations sitting in the account.
If the lifetime is reduced to 7 hours, any clients relying on that behavior won't be able to complete their orders.
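The polling pattern in question can be sketched as follows. This is a simplified stand-in, not a real ACME library: the statuses are simulated, and real clients would sleep between polls and query the ACME API instead:

```python
def poll_challenge(get_status, max_polls=100):
    """Poll the challenge until it leaves pending/processing, or give up.

    Stand-in for the fixed ~100-second client-side deadline described
    above; get_status simulates fetching the challenge's current state.
    """
    for _ in range(max_polls):
        status = get_status()
        if status not in ("pending", "processing"):
            return status
    return "timeout"  # many clients abandon the whole order at this point

# Simulated validations: one that completes quickly, one that never does.
fast = iter(["pending", "processing", "valid"])
slow = iter(["pending"] * 1000)

print(poll_challenge(lambda: next(fast)))  # valid
print(poll_challenge(lambda: next(slow)))  # timeout
```

With a long authorization lifetime, the "timeout" branch is harmless: the authz eventually turns valid on the server side and the next scheduled run reuses it. With a 7-hour lifetime, that safety net is much narrower.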