Soliciting feedback on shortening authorization lifetimes to 7 hours

Essentially working around the lack of Pre-Authorization support in Boulder?

8 Likes

So at 0301 PST my client (acme.sh) goes out to update a cert for one of my clients. I got a message from the system that the cert was now overdue for renewal. So I looked at the logs, and basically they show an "Incorrect txt record" error using DNS-01 validation. I saved the log file.

I was talking with my client and stated that tonight the cert would be successfully obtained... After his/her challenge to that comment, I ran the update manually in his/her presence. It succeeded, and I again saved that log file... So I ran a "diff" on the two files, and the only difference was that the second attempt succeeded where the first one failed.

So, if it matters, my txt record is set for

_acme-challenge.yachats.photos 	TXT 	@ 	1m

But I seriously doubt that the record propagates in one minute.
There is no sleep time in the update script... (maybe there should be). Or maybe I need to get a grip on the TTL for the txt record.

I have experimented with different TTLs to no avail, but a 2 minute sleep time sometimes seems to help.
No matter what I do, the second validation always succeeds; it is just not always the first attempt that works. So until this thread was posted I took it with a grain of salt, so to speak.

So I will share the logs if anyone is interested. Or not.
Is there such a thing as "Best Practice" for TTL on _acme-challenge.yournamehere.com?
Is there such a thing as "Best Practice" for sleep time to allow for propagation of the record?

I think this is pertinent to this discussion.
Please advise.

5 Likes

As LE's VA ignores TTLs higher than a minute and queries the authoritative server directly, I think it's safe to update as soon as the change is applied to the main DNS server.

6 Likes

Yes, this is a problem I could think of too.

If the dns provider only supports one txt record, the client (acme.sh) would have to run twice (with the same command line parameters) to request a wildcard cert (including the root domain as well).
The first time, one of the txt records is validated and then cached; the second time, the other txt record is validated.

It's not a big problem when the user issues the cert manually for the first time.
He just needs to run the command twice.

However, it will be a problem when the cert is due to be renewed automatically by the cronjob.
On the 60th day, it tries to renew the cert; of course it will fail, because it can only validate one txt record this time.
So, the next day (the 61st day), it will try again. The first txt record was validated and cached, so it will succeed in validating the second txt record.

Usually the renewal cronjob is set to run once a day.

So, I would suggest the lifetime is no shorter than 48 hours.
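
The renewal scenario above can be sketched as a small simulation. This is only an illustration under the stated assumptions: the DNS provider holds a single TXT record, so each run validates at most one pending name; the cron job runs once every 24 hours; and a passed authorization stays reusable for a fixed lifetime. The function and parameter names are hypothetical and not part of acme.sh.

```python
def runs_until_issued(names, lifetime_hours, interval_hours=24, max_runs=10):
    """Return the run number on which every name holds a valid
    authorization at the same time, or None if that never happens
    within max_runs.

    Assumption: each run can validate only ONE name that lacks a
    valid authorization (single-TXT-record DNS provider).
    """
    valid_until = {}  # name -> hour at which its authorization expires
    for run in range(1, max_runs + 1):
        now = (run - 1) * interval_hours
        # Names whose cached authorization is missing or expired.
        pending = [n for n in names if valid_until.get(n, -1) <= now]
        if pending:
            # Validate one pending name during this run.
            valid_until[pending[0]] = now + lifetime_hours
            pending = pending[1:]
        if not pending:
            return run
    return None


names = ["example.com", "*.example.com"]
with_48h = runs_until_issued(names, lifetime_hours=48)  # succeeds on the second daily run
with_7h = runs_until_issued(names, lifetime_hours=7)    # never succeeds in this model
```

In this simplified model, a 7 hour lifetime means the first authorization has already expired by the time the next daily run validates the second name, so the order never completes.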

That's my two cents.

Thanks
-Neil

7 Likes

Thanks @Neilpang, this is exactly what I am experiencing with acme.sh.
I'll check the certbot logs from another client's site to compare the results.

5 Likes

I guess it's because of the dns propagation time. The client (acme.sh) uses Cloudflare (1.1.1.1) to check whether the txt record has already propagated. If it has, it will request validation of the domain.
However, this does not always work as expected. It only means the txt record has propagated to Cloudflare. Sometimes the Let's Encrypt validation will still fail.

If your dns provider has a maximum propagation time, let's say 10 mins (600 seconds), you can use the --dnssleep 600 parameter to force acme.sh to wait 600 seconds before requesting validation.

Like:

acme.sh  --issue  -d xxxxxxx  --dns xxxxx   --dnssleep 600

6 Likes

Not really. As @orangepizza wrote, LE queries the authoritative DNS server directly.

The common practice for TTL in general over the past few years has become 60 seconds, unless you are a very large company with a dedicated dev/ops or SRE team to deal with the mess longer TTLs cause.

Yes, but it depends on the DNS setup.

For most users, this isn't really needed.

Things can get dicey in a few setups. Two that come to mind are:

  1. The DNS provider utilizes an application level cache. I described this in a comment above. You need to sleep to get around the proprietary cache, which is independent from the TTL. TTL really just tells their DNS servers to reload on expiry, but they may be reloading from a cache that has an expiry longer than the TTL. (Yes, that sounds like a bug to me too... but vendors do that!)

  2. Multiple authoritative DNS servers. You need to sleep until every one of them is updated. This may happen on write, on read, or be affected by the TTL. Every setup/provider is different. There are supposed to be primary and secondary roles, but some people like to play dangerously and configure everything as a primary. LetsEncrypt will attempt multiple validations, and may use any of these nameservers for each attempt, so you need to ensure each one of them serves the correct record. There are a handful of posts on the forum covering this situation.
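
For the multiple-nameserver case, one possible approach is to poll every authoritative nameserver and only request validation once all of them serve the new record. The sketch below is a minimal illustration, not how acme.sh or any other client does it. The `query_txt` callback is a hypothetical, caller-supplied function that returns the TXT strings a given nameserver serves; the nameserver names, token values, and the fake backend are made up for the example.

```python
import time


def wait_for_all_nameservers(nameservers, expected, query_txt,
                             timeout=600, interval=15,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll until every authoritative nameserver returns `expected`
    among its TXT values, or until `timeout` seconds have passed.

    `query_txt(nameserver)` is a hypothetical caller-supplied function
    returning the list of TXT strings served by that nameserver.
    Returns True if all servers agree in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        stale = [ns for ns in nameservers if expected not in query_txt(ns)]
        if not stale:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Fake backend for illustration: ns2 only serves the new record
# from its third poll onward.
polls = {"ns1.example.net": 0, "ns2.example.net": 0}


def fake_query(ns):
    polls[ns] += 1
    if ns == "ns2.example.net" and polls[ns] < 3:
        return ["old-token"]
    return ["new-token"]


ok = wait_for_all_nameservers(list(polls), "new-token", fake_query,
                              sleep=lambda seconds: None)
```

A fixed sleep is simpler, but a check like this also covers the case where one nameserver lags behind the others. It cannot see through a provider-side application cache, so the first setup above may still need a plain sleep.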

8 Likes

This is true. I notice that my DNS provider is slow in propagation. But @Neilpang's response is accurate for my use case (and that of many acme.sh users). So far, it is always the second update attempt that succeeds.
EDIT: I still need to evaluate the logs from certbot. (which I like very much)

5 Likes

That's way too close to the top of the hour for me.

5 Likes

Limiting authorizations to 7 hours also makes manual error recovery for renewals harder.
If a routine renewal fails in the night, the support staff is quite likely to look at it 7 hours later, because routine renewals are likely not going to on-call staff (properly done, they have a huge margin before becoming critical).
So if one of many authorizations has failed, just fixing that one authorization is no longer enough, but the renewal has to be restarted completely.
Furthermore, it might make analyzing the cause harder, because things are not actually the same as in the initial renewal processing.

A shorter lifetime is actually more likely to make analysis easier, and more likely to replicate the initial renewal processing. I mentioned this above in passing, as it is the reason why I support shorter authorization lifetimes:

Under the current system of 30-day authorization lifetimes, if a renewal fails overnight, the support team will only be able to troubleshoot the failed authorizations in an order the next day, because the successful domains are cached and any new errors on them will not surface.

Historically, this has created at least these two problems during troubleshooting:

  • Users think a particular "fix" was partially successful, while it had no effect, because they did not understand some domains were relying on cached authorizations. This often causes people to waste many hours, then ultimately come to this forum very confused that a fix worked on one domain but not another, while LetsEncrypt only attempted an authorization against the "fix" on the failed domain.
  • Any changes staff makes to the platform/host while troubleshooting may actually break the previously valid domains or their configurations, but those errors will never be surfaced due to the use of previous authorizations.

If the authorizations were no longer valid in the morning due to 7 hours, on the first attempt, an authorization challenge would be required for every single domain in the order. This is more likely to surface errors and aid in analysis; at the very least there will be fresh logs against the active configuration.

Unfortunately, any successful authorizations at that time would then be cached for another 7 hours – so users can only enjoy the benefit of this behavior on the initial retry.
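
The difference between the two lifetimes on a morning retry can be sketched as follows. This is an illustration only: the function name, the domain names, and the timestamps are hypothetical, and the model simply treats an authorization as reusable until its lifetime has elapsed.

```python
from datetime import datetime, timedelta


def names_needing_challenge(last_validated, retry_at, lifetime):
    """Return the names in an order that must be challenged again at retry_at.

    last_validated maps each name to the time it last passed validation,
    or None if it has no valid authorization (for example, it failed).
    """
    fresh = []
    for name, validated_at in last_validated.items():
        if validated_at is None or validated_at + lifetime <= retry_at:
            fresh.append(name)
    return fresh


# Hypothetical overnight renewal: two names passed, one failed.
overnight = datetime(2024, 1, 1, 3, 0)
order = {
    "a.example.com": overnight,
    "b.example.com": overnight,
    "c.example.com": None,
}
morning = overnight + timedelta(hours=8)

# 30-day lifetime: only the failed name is challenged again.
with_30_days = names_needing_challenge(order, morning, timedelta(days=30))
# 7-hour lifetime: every name in the order is challenged again.
with_7_hours = names_needing_challenge(order, morning, timedelta(hours=7))
```

With the 30-day lifetime only `c.example.com` is re-validated, so a breakage introduced on the other two names during troubleshooting would go unnoticed; with the 7-hour lifetime all three are checked against the current configuration.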

8 Likes

Problems should be assessed on the staging environment while disabling cached authz.

4 Likes

A lot of the feedback here is for workarounds due to inadequate challenge response infrastructure (slow DNS propagation, manual processes, etc.). I'd argue this is as much a problem as users who are still on TLS 1.1 [or have hard-coded intermediates] etc., and that they need to maintain their own systems or just not participate. Pandering to people who "can't fix" their own stuff ultimately does them a disservice in the long run, and while that can seem a bit mean, it's important that people take responsibility for their own systems if they choose to operate them.

While Let's Encrypt caches validations, some other ACME providers do not - if you can't use those providers because they don't cache validations then there is something wrong with your system that you need to fix.

I'd advocate for gradually moving to short auth lifetimes over a fairly long period of time (e.g. several renewals); for instance, move to 48 hours first, measure the impact, then possibly reduce further.

9 Likes
