Possible new feature: paused ACME accounts

That's not always possible on (semi-)embedded systems :frowning_face:

3 Likes

It's weird to think of this, but since the CA usually knows which client is being used from the User-Agent, it would be possible to have @jsha's idea apply only to accounts whose most recent requests came from a client that has been proactively confirmed to display the ACME error to the user (!). I bet confirming this for just the top 10 clients or so would catch a huge portion of the volume in question.

It is unfair in a certain regard, but also constructive: it always aims to get users to improve their configuration, in a way it appears they'll actually be able to hear about.
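
For illustration, here's a minimal Go sketch of what that server-side check might look like, assuming the CA keeps a hand-maintained list of vetted client User-Agent prefixes (the list contents and function names are invented here, not anything Boulder actually does):

```go
package main

import (
	"fmt"
	"strings"
)

// vettedClients lists User-Agent prefixes of clients manually confirmed to
// surface ACME errors to a human. Illustrative only, not an actual list.
var vettedClients = []string{"certbot/", "lego-cli/", "acme.sh/"}

// pauseEligible reports whether the pause-on-failure policy may be applied
// to an account, based on the User-Agent of its most recent request.
func pauseEligible(lastUserAgent string) bool {
	for _, prefix := range vettedClients {
		if strings.HasPrefix(lastUserAgent, prefix) {
			return true
		}
	}
	return false // unknown client: fall back to the lenient behavior
}

func main() {
	fmt.Println(pauseEligible("certbot/2.9.0")) // true
	fmt.Println(pauseEligible("curl/7.88.1"))   // false
}
```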

5 Likes

This, of course, assumes that the client has actually implemented said header.

Note to self: implement said header.

3 Likes

RFC 8555, section 6.1 says "ACME clients MUST send a User-Agent header field".

Of course, the spec saying so doesn't mean that clients actually do so correctly.

It seems really weird to have different behavior based on the user-agent: there are still a lot of issues out there where web sites treat browsers differently, and those of us using less-popular web browsers get weird messages even though things would just work if the sites served the same pages to everyone. I might instead suggest adding some kind of header where a client could say "I have a way to give error messages to an end-user", and basing the behavior on that. (And for old clients that don't send the header and are causing problems, just disable the account or the like.)
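
As a rough sketch of what that could look like from the client side, here's a Go snippet sending both the required User-Agent and a hypothetical capability header; the ACME-Error-Display name and value are invented here, nothing like them exists in RFC 8555:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// A real new-order request carries a signed JWS body; it's omitted here
	// because this sketch is only about the headers.
	req, err := http.NewRequest(http.MethodPost,
		"https://acme-staging-v02.api.letsencrypt.org/acme/new-order", nil)
	if err != nil {
		log.Fatal(err)
	}
	// RFC 8555 §6.1: clients MUST send a User-Agent, which SHOULD name the
	// ACME software as well as the underlying HTTP client.
	req.Header.Set("User-Agent", "example-acme-client/1.0 Go-http-client/1.1")
	// Hypothetical capability header: the name is invented for illustration.
	req.Header.Set("ACME-Error-Display", "interactive")
	fmt.Println(req.Header)
}
```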

I have some crazier thoughts that I'm working on typing up too. :slight_smile:

5 Likes

Offtopic: that could very well be the header set by the underlying lower-level HTTPS client. The spec only says the header SHOULD include the name and version of the ACME software, so a bare default cURL User-Agent would satisfy the MUST while being quite useless.

4 Likes

Just to think outside the box a bit: I think the underlying problem here is that the current rate limits (and perhaps the whole concept of rate limits) seem to (1) cause a lot of support "costs" from people not understanding what they're doing wrong, and yet (2) not actually do enough to stop "bad" use of Let's Encrypt's systems (as 80% of HTTP-01 challenges fail). I mean, it's not like I'd claim every challenge failure is "abuse", but a failure rate that big implies there aren't enough safeguards to ensure that challenges are likely to work before they're actually attempted.

My first thought is some kind of Hashcash-type system, where a client performs some computation that is difficult to do but easy to verify, and passes the result along as a header (or as part of the challenge token or something?). Requests that include such a proof could have less strict rate limits. (So, to make it backward-compatible, make rate limits slightly stricter for requests without such a header but more generous for requests with one, and maybe even allow some limits, like the 5-duplicate-cert limit, to be overridden entirely (sometimes) if the request carried a more-difficult proof indicating a few minutes of computation or the like. I think something like that should be enough to keep it from being used in ephemeral/disposable environments, though I'm probably overestimating people's understanding of the systems they're setting up.)
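
To make the idea concrete, here's a minimal Go sketch of the client-side proof of work, assuming a simple invented scheme where the client must find a nonce such that SHA-256(challenge || nonce) has a given number of leading zero bits. Expensive to compute, trivial for the server to verify:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts the leading zero bits of a 32-byte digest.
func leadingZeroBits(sum [32]byte) int {
	n := 0
	for _, b := range sum {
		if b == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(b)
		break
	}
	return n
}

// solve finds a nonce such that SHA-256(challenge || nonce) has at least
// `difficulty` leading zero bits.
func solve(challenge []byte, difficulty int) uint64 {
	buf := make([]byte, len(challenge)+8)
	copy(buf, challenge)
	for nonce := uint64(0); ; nonce++ {
		binary.BigEndian.PutUint64(buf[len(challenge):], nonce)
		if leadingZeroBits(sha256.Sum256(buf)) >= difficulty {
			return nonce
		}
	}
}

func main() {
	challenge := []byte("example-order-token")
	nonce := solve(challenge, 20) // ~1M hashes on average
	fmt.Printf("nonce=%d\n", nonce)
}
```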

But beyond that, is there some way to ask clients to "check" that a challenge is likely to work, and submit some kind of "proof" of that somehow? It's tricky since of course the check needs to happen from outside their own network, but maybe some kind of signature over the DNS record at least? Like, to do an HTTP-01 challenge, the client needs to retrieve the A/AAAA records, thereby confirming that the name exists and isn't in private IP space, and then sign the values it got and send them to the server. Then, if they match what Let's Encrypt's own DNS lookup says, different rate limits could apply? Or, to really check from the outside, maybe some kind of partnership with Google/Akamai/Cloudflare/whomever, where they could run the check, attest that a challenge looks good, and cryptographically sign that (with the signature sent by the client in the challenge request); then Let's Encrypt would know the challenge is much more likely to work and could apply looser rate limits. And requests without these kinds of checks could be much more limited in some way?
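
As a sketch of just the lookup-and-check step of that idea in Go (the signing and submission to the CA are omitted, and the function name is invented):

```go
package main

import (
	"fmt"
	"log"
	"net"
)

// preflight does the client-side sanity check described above: resolve the
// name and refuse to proceed if it resolves to private, loopback, or
// unspecified address space.
func preflight(name string) ([]net.IP, error) {
	ips, err := net.LookupIP(name)
	if err != nil {
		return nil, fmt.Errorf("lookup %s: %w", name, err)
	}
	for _, ip := range ips {
		if ip.IsPrivate() || ip.IsLoopback() || ip.IsUnspecified() {
			return nil, fmt.Errorf("%s resolves to non-public address %s", name, ip)
		}
	}
	return ips, nil
}

func main() {
	ips, err := preflight("example.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("public addresses:", ips)
}
```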

I'm just brainstorming here, hoping that a better idea comes out of this. I kind of feel like the original request (while I don't really object to it in principle) is just "trying to make rate limits better", so maybe brainstorming other ways to prevent Let's Encrypt's resources from being used up unnecessarily might be helpful? If there's some way to offload more work to the client, to ensure that only "reasonable" requests are likely to get checked, it seems worth exploring, though it's obviously tricky to apply in a backward-compatible way.

5 Likes

Probably a good idea to follow the style of the ACME v1 deprecation and do progressive rolling blackouts: see whether you can catch any good actors and get them to fix their clients/installations, and gauge the impact on support channels.

2 Likes

Hm, good point. This is how the failed validations limit is implemented. A good argument in favor of implementing a two-level failed-validations limit (short-term and long-term) before implementing a pause feature.
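
For concreteness, a minimal Go sketch of such a two-window limiter, keyed per (account, hostname). Only the 5-failures-per-hostname-per-hour figure matches Let's Encrypt's documented limit; the weekly figure is invented:

```go
package main

import (
	"fmt"
	"time"
)

// failLimiter tracks validation failures for one (account, hostname) pair.
type failLimiter struct {
	failures []time.Time
}

// allow reports whether another validation attempt is permitted: fewer than
// `max` failures within the trailing `window`.
func (l *failLimiter) allow(now time.Time, window time.Duration, max int) bool {
	n := 0
	for _, t := range l.failures {
		if now.Sub(t) < window {
			n++
		}
	}
	return n < max
}

// recordFailure logs a failed validation attempt.
func (l *failLimiter) recordFailure(now time.Time) { l.failures = append(l.failures, now) }

func main() {
	l := &failLimiter{}
	now := time.Now()
	for i := 0; i < 5; i++ {
		l.recordFailure(now.Add(time.Duration(-i) * time.Minute))
	}
	fmt.Println(l.allow(now, time.Hour, 5))       // false: 5 failures this hour
	fmt.Println(l.allow(now, 7*24*time.Hour, 50)) // true: well under the weekly cap
}
```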

Good catch, yes, the account should be at least 180 days old.

Actually this is another possible use case for account pausing that I didn't mention in the original message. We're sending out biweekly emails to ACMEv1 users to get them to move off. The numbers are slowly (very slowly) dwindling, but there are some folks who didn't set an email address and we can't reach. Those people will have a bad time when we finally turn off ACMEv1 entirely. We could give them a slightly softer landing if we used the turnoff date to "pause" all ACMEv1 accounts for 90 days, so folks whose certificate expired could go unpause, issue another certificate, and then get to work on upgrading.

5 Likes

Two birds. I like it. Always great to get parallel usage out of your dev budget. :slightly_smiling_face:

4 Likes

Actually, that ties in to some other crazy thoughts I've had (and this probably isn't the way to go, but maybe it sparks more brainstorming from others). One of the fundamental problems here is people starting clients that set up their own automated renewal, but then not maintaining or following up on them (so they continually fail validation and nobody seems to care). Once ACME v1 is gone, you'll no longer need to spend resources validating challenges for systems that point at the v1 API but have since been abandoned. Can we just… do it again? Like, plan a migration to a "v3" URL over some number of years (even though this endpoint would actually speak the same ACME protocol); any systems that are being kept up to date will get updated to it over that time, and when you turn off the v2 endpoint, all those abandoned clients will be much easier to handle. And then maybe make it an expectation in clients that the endpoint will change every few years? (And for an extra-crazy idea, tie it to the intermediates, so each new intermediate generation every few years would come with a new endpoint?)

Just to be clear, I don't think the above idea is actually good, but it might be a starting point for some other better ideas from others.


Another thought is that this whole problem is really about working around the client side not noticing continual failures. While it wouldn't help with the existing zombies, might it make sense to include in the Integration Guide (and otherwise encourage in the most popular clients) that clients should, if renewals keep failing for some extended period (say, a good month past certificate expiration), disable their scheduled task for that name?
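
A minimal Go sketch of that give-up rule, assuming the one-month grace window suggested above (the marker-file mechanism in the comment is just one possible approach):

```go
package main

import (
	"fmt"
	"time"
)

// shouldDisable reports whether an automated renewal job should turn itself
// off: the certificate expired more than `grace` ago, meaning renewals have
// been failing for at least that long.
func shouldDisable(notAfter time.Time, grace time.Duration) bool {
	return time.Since(notAfter) > grace
}

func main() {
	notAfter := time.Now().Add(-45 * 24 * time.Hour) // cert expired 45 days ago
	if shouldDisable(notAfter, 30*24*time.Hour) {
		fmt.Println("renewals have failed for a month past expiry; disabling the timer for this name")
		// e.g. write a marker file that the scheduler checks before each run
	}
}
```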

5 Likes

Hi,

I like the idea of a paused account or domain with manual re-activation in general. My biggest concern with using Let's Encrypt is the risk that a bug in my setup might cause a one-week outage because of a rate limit. I wouldn't mind much tighter rate limits if I could trust that there would be a button available for me to say "I fixed my bug, now please try again".

5 Likes

I ran some numbers to estimate the size of the impact. So far I'm just looking at validation logs. From 03-21 to 03-28, we had:

  • 216M validation attempts
  • 179M validation failures (83%)
  • 97M of those failures came from accounts that had 0 validation successes over the course of that week. Removing those failures would bring the failure rate down to 38%.

Here are some numbers bucketed by how many validation attempts a given account had during the week. A "total failure" is an account that had 0 validation successes; these are the candidates for pausing (if they also had no issuances for X days). I summed up the error counts from them.

| bucket | accounts | validation attempts | errors | errors from total failures | error rate |
|---|---|---|---|---|---|
| 1 | 786,802 | 786,802 | 185,146 | 185,146 | 0.23531 |
| 2-5 | 1,129,112 | 3,126,791 | 636,033 | 473,726 | 0.20341 |
| 6-25 | 547,993 | 6,865,347 | 4,637,927 | 3,622,346 | 0.67556 |
| 26-625 | 521,949 | 74,910,336 | 69,896,477 | 49,091,267 | 0.93307 |
| 626-3125 | 45,053 | 55,769,849 | 52,059,549 | 25,932,861 | 0.93347 |
| 3126-15625 | 6,053 | 35,594,192 | 32,814,899 | 13,122,730 | 0.92192 |
| 15625+ | 696 | 39,353,654 | 19,274,601 | 4,582,055 | 0.48978 |

Interesting that the error rate starts out very low in the buckets with few attempts. In buckets with larger numbers of attempts, the error rates get much higher. Presumably this is a matter of clients that retry faster than they should.

Since these are actual validation attempts, they don't include requests that were stopped by the rate limits. So, for instance, a client that was retrying a failed validation as fast as the rate limits allow (5 failed validations / hostname / hour) would have 840 attempts for a single hostname over the week (5 × 24 × 7 = 840).

6 Likes

I would suggest not pausing the whole account. Just pause the failed domain for the account.

An account may have several certs issued, and if some of the domains fail at renewal, the other domains may still be valid.

So, please just disable the failed domains for the account. Don't disable the whole account.

When the account requests a challenge for a disabled domain, just return an error.
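
As a sketch of that behavior in Go, assuming an in-memory pause list keyed by account and hostname (a real CA would keep this in its database, and the error wording here is invented):

```go
package main

import (
	"errors"
	"fmt"
)

// pausedDomains maps an account ID to the set of hostnames paused for it.
var pausedDomains = map[int64]map[string]bool{
	12345: {"broken.example.com": true},
}

// errDomainPaused is what a challenge request for a paused domain gets back.
var errDomainPaused = errors.New("validation for this hostname is paused for this account; unpause it before retrying")

// checkChallenge rejects challenges only for paused (account, domain) pairs,
// leaving the rest of the account untouched.
func checkChallenge(accountID int64, domain string) error {
	if pausedDomains[accountID][domain] {
		return errDomainPaused
	}
	return nil
}

func main() {
	fmt.Println(checkChallenge(12345, "broken.example.com")) // paused: error
	fmt.Println(checkChallenge(12345, "ok.example.com"))     // nil
}
```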

4 Likes

I thought the idea was to look 180 days back to see if the account had any successful activity. That's twice the lifetime of an LE cert. I'm not sure why not to disable the entire account in that case?

3 Likes

Think about this case:
what if I register an account first and only try to issue a cert 181 days later?

4 Likes

Will that account have had failed validations in the meantime?

3 Likes

No, nothing. Just sleep 181 days.

4 Likes

In that case LE could easily check whether there is a reason to block the account: no failed validations, no block necessary, for example.

3 Likes

I feel like this all comes down to pinning down the definition of a "zombie" (unmaintained) account, which, as a deductive process, will likely involve tradeoffs. Analyzing the ROC curve will hopefully minimize false positives while maintaining efficacy.

4 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.