Soliciting feedback on shortening authorization lifetimes to 7 hours

rmbolger · December 17, 2022, 6:09am

Not sure if it's possible, but it seems like it would be useful information to know ballpark numbers around how many times more than 1 cert is generated in the 30 day cached window. I know in the past, one of the strategies recommended for people with small'ish server farms was to explicitly take advantage of cached validations to allow multiple servers to more easily get their own certs instead of synchronizing a shared one and stagger renewal times over days as a safety net against renewal issues or revocation. I'd be curious how many folks are doing that in practice (I'm not). But I suppose it might get conflated with the folks just getting started and issuing multiple certs by accident that they lose and never use. Though in those cases, they probably run up against rate limits within the first 7 hours. In any case, the reduced load from not having the CAA re-check infra might be somewhat offset by the additional validation load.

As for the actual reduced cache window. I think the number of DNS providers that require serially processing validations because they don't support adding multiple TXT records for the same FQDN is vanishingly small. Of the ~70 DNS providers that Posh-ACME supports, only 1 actually requires this but only because it's an artificially imposed limitation on a free dynamic DNS provider (as opposed to some technical limitation). I'd guess the number that have propagation times on the order of hours is also vanishingly small. Remember, we're only talking about propagation to the authoritative nameservers of the zone, not all of the recursive resolver caches on the Internet.

Osiris · December 17, 2022, 10:00am

With anycast, even propagation amongst (an unknown amount of) authorative nameservers can be a thing.

webprofusion · December 17, 2022, 1:54pm

It can, but the the majority of commercial providers I've seen replicate in under 5 minutes (e.g. Cloudflare appears to happily replicate in under 10 seconds. AWS and Google Cloud DNS are under 60 seconds). It would perhaps be easier to document the DNS providers that are reliably slow so people can consider avoiding them.

Osiris · December 17, 2022, 1:55pm

I agree it's not common, but I just wanted to raise awareness

MikeMcQ · December 17, 2022, 2:26pm

Yandex is reliably slow needing 1-2H wait for DNS challenge. I've seen several examples of this in recent months (one here).

Nummer378 · December 17, 2022, 2:45pm

While this is true, I felt like this proposal wasn't so much about reducing load (as shortening cache times seems counterintuitive to this), but rather about removing complexity. That means less room for certificate misissuance (CAA rechecking was broken in the past), less code to maintain, less things that can break. This may be worth much more to LE than minor changes in validation load.

(Of course LE also cannot afford to ignore performance, both sides have to be considered)

aarongable · December 17, 2022, 3:53pm

Yes, but there's also nothing preventing someone from setting their TTL to 1 minute. Therefore we ignore the TTL and just use the BRs-man dated cache time of at most 8 hours. We can't rely on the goodness of subscriber's hearts to set TTLs that make life easy for us

There's a difference between not catching validations (i.e. not allowing a future new order to reuse an already-validated authorization from a previous order) and what I'm proposing here. Here I'm saying that even within the context of a single order, finalization will have to occur within 7 hours of validation. I'm not sure if ZeroSSL has any limitations like that. (They may! I'm truly not sure.)

Yeah, I'm working on getting related numbers now. Sometime in early January I should know how old orders are when they're finalized (to know if people regularly take more than 7 hours to complete issuance) and I'll know how old authorizations are when they're attached to a new order (to know how often people rely on authorization reuse outside that 7-hour window). These aren't perfect numbers, but they should give us a broad picture.

RealAndy · December 28, 2022, 3:23pm

Not long ago, our secondaries were provided by routing.net, and I regularly experienced propagation times of several hours, and more than once several days, which already caused me headaches because in the beginning, I serially handled certificate renewals – so yes, there are problematic installations out there besides big or free DNS providers.

Concerning the problem of multiple SANs: Some ACME clients (I'm using dehydrated) handle SANs sequentially IIRC, i.e. get a challenge, propagate TXT RRs, validate this challenge, then get the next. This could be improved if they were changed to get all the challenges, update all the TXTs and then validate all of them at once – but I don't know how many ACME clients would have to be changed and how much work this would be …

jvanasco · December 28, 2022, 4:54pm

I've never encountered that, but I have encountered many people struggling with weird caching issues between API/ControlPanel updates and DNS server responses that often take more than 7hours because of common workarounds with DNS-01 for multi-domain certificates.

Example:

User uses Certbot to request a Certificate for 10 Domains
Certbot processes a DNS challenge for Domain1
User has their update-hook script sleep for 60m+1second between updating the DNS and returning to Certbot. Why? After updating their vendor's internal system DNS record, they may have issues with old records cached in DNS servers or application caches. Most users wait 301 or 601 seconds, but many wait 3601.

As an example, many years ago Namecheap appeared to have this behavior:

Their DNS servers respected the TTL, but used a read-through application layer cache to load records. That application cache did not respect the TTL and seemed to have a 5minute expiry.
Updating DNS records via their Control Panel or API would update their database backed internal systems, but would not clear the Application Layer Cache (it was only read-through, and lacked a write-through hook on the admin side).

Old records would often get cached into the application-layer cache - either by failed attempts, clients/users checking to see the record was set successfully, or perhaps some internal cache loading logic the vendor implemented. The only way around this was to sleep for 501 seconds.

I don't know if people still deal with this regularly or if this was an artifact of people wanting to stuff the SAN with as many domains as possible in the early days of LetsEncrypt. There are probably a few dozen posts on this forum about issues like this.

I'm 100% for the shortened time, but I think 24 hours would be preferable for this use-case.

I think a shortened time would actually solve another common problem – when people get confused if a cached validation is used for an order, while a new challenge fails. The logging/printing of ACME clients tend to conceal what is going on, so people start digging themselves into the wrong troubleshooting hole.

barf7709 · December 28, 2022, 5:14pm

But that fails to meet the desired requirement of

barf7709 · December 28, 2022, 5:20pm

Seems like if is exceeding the TTL it would be time to find better DNS name severs and support providers.

RealAndy · December 28, 2022, 6:32pm

Note the use of the past tense in my words

barf7709 · December 28, 2022, 6:58pm

Sorry @RealAndy, I missed that subtly.

jvanasco · December 29, 2022, 4:37pm

True, but maybe there is a way to drop or significantly reduce those services in a shortened time.

I don’t know how many users rely on the pattern I mentioned, but it was once very common for DNS-01 users.

aarongable · December 31, 2022, 5:27pm

Yeah, to be clear, suggestions along the lines of "what about 24 hours" are interesting to read but unlikely to be considered in this case -- if we don't shorten validation document reuse to the same time as CAA to allow ourselves to simplify services, it is unlikely that we will shorted validation document reuse at all, as doing so has the potential to be disruptive with little concrete benefit.

maxh · January 1, 2023, 12:55am

Is it possible to track how many authorisations are currently reused between 7 hours and 30 days?

mcpherrinm · January 1, 2023, 3:12am

Yes, we will collect that data soon: Add histogram to track authz reuse ages by aarongable · Pull Request #6554 · letsencrypt/boulder · GitHub

jvanasco · January 1, 2023, 5:32pm

It would be interesting if that could track usage by challenge type, but that would require a considerable amount of new code.

_az · January 1, 2023, 9:38pm

If Let's Encrypt ever experience challenge validation delays above 90-100 seconds, shortening the authorization lifetime could have an impact.

While looking at this issue I realized that 3 popular clients will all bail out of an order if challenges sit in pending/processing for more than 100 seconds.

With a long authorization lifetime (over 12-24 hours so that the authzs are still valid at the next scheduled client run), the certificate issuance will probably succeed due to already-valid authorizations sitting in the account.

If the lifetime is reduced to 7 hours, any clients relying on that behavior won't be able to complete their orders.

aarongable · January 3, 2023, 9:24pm

Ooh, that's a really interesting point, @_az.

So far, the two main concerns I see are:

1: What if my DNS provider has really slow propagation, and also only lets me set one TXT record at a time. Then how can I issue a cert with multiple names?

For which my answer is: yes, this is unfortunate, and probably a real issue for a tiny fraction of people, but there are other solutions here, such as using a CNAME to delegate your ACME DNS entries to a DNS provider with better behavior.

2: What if my validations are really slow, so my issuance relies on timing out while trying to issue one day, and then re-using those eventually-successful validations the next day?

For which my answer is twofold: first, this should show up in the new "age of authz when attached to a new order" histogram that will be deployed next week; and second, it seems unlikely to me that we'll have 100-second delays validating the same name over and over again. I'd expect that delays (but not failures!) that take that long would be more likely to be transient.

Finally, there's another concern, which is simply: what if a large subscriber relies on validation reuse to be able to quickly reissue certificates? (For example, one could imagine that a hosting provider just constantly maintains authorizations for all names they host, and dynamically issue 100-name certs each time a customer joins or leaves their platform.)

For which my answer is: well, we're going to start gathering data, and then talk to these subscribers, and see what can be done

Topic		Replies	Views
Rechecking CAA at issuance time for some authorizations API Announcements	1	2942	August 31, 2017
2020.02.29 CAA Rechecking Bug Incidents	3	138351	March 7, 2020
Expiry of valid authorizations reduced from 60 days to 30 days Issuance Policy	1	3135	May 27, 2017
Automatic recycling of pending authorizations API Announcements	0	4215	August 31, 2017
Pending Order expiration Issuance Tech	4	2002	June 29, 2019

Soliciting feedback on shortening authorization lifetimes to 7 hours

Related topics