Possible new feature: paused ACME accounts

jsha · March 26, 2021, 5:50pm

Hi all,

I wanted to get community feedback on a new Boulder feature we're considering: paused ACME accounts that return an error until the user takes some action to un-pause them.

Since ACME clients are so automated, over time "zombie clients" accumulate. These keep trying to renew for domains that no longer exist, and they fail every time. We could block such clients outright, but some fraction of those people will eventually fix their domains. We'd prefer to not have to manually unblock each one.

To pause an account we would have some automated system that sets a bit in our account storage. It would probably trigger on a criterion close to this one: More than 5 failed validations in the last week, and zero successful issuance for 180 days.
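As a rough illustration only, the trigger described above could be expressed like this. The `AccountActivity` type and the `should_pause` function are hypothetical names for this sketch, not anything in Boulder, and treating a never-issued account as "zero successful issuance" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Thresholds taken from the proposal text; the names are illustrative.
FAILED_VALIDATION_LIMIT = 5
FAILURE_WINDOW = timedelta(days=7)
NO_ISSUANCE_WINDOW = timedelta(days=180)


@dataclass
class AccountActivity:
    """Hypothetical per-account summary; not a real Boulder type."""
    failed_validations: List[datetime]   # timestamps of failed validations
    last_issuance: Optional[datetime]    # None if nothing was ever issued


def should_pause(activity: AccountActivity, now: datetime) -> bool:
    """Return True when the proposed pause criterion matches."""
    recent_failures = sum(
        1 for t in activity.failed_validations if now - t <= FAILURE_WINDOW
    )
    # "More than 5 failed validations in the last week"
    if recent_failures <= FAILED_VALIDATION_LIMIT:
        return False
    # Assumption: an account that has never issued counts as having
    # zero successful issuance in the last 180 days.
    if activity.last_issuance is None:
        return True
    # "zero successful issuance for 180 days"
    return now - activity.last_issuance >= NO_ISSUANCE_WINDOW
```

Under this reading, an account with any successful issuance in the last 180 days never matches, no matter how many validations fail.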

When that bit is set, all ACME requests would return an error of type userActionRequired, with the text "Your ACME account has been paused and cannot request certificates. To un-pause it so you can request certificates, click here: ". The link would be to a subdomain of letsencrypt.org, and would contain an HMAC binding the account ID to the URL and a timestamp.
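One way such a link could work, as a minimal sketch: the server signs the account ID and a timestamp with a secret key, and the un-pause page recomputes the HMAC before acting. The host name `unpause.example`, the query parameter names, the message layout, and the one-week validity period are all assumptions made for this example; the proposal does not specify them.

```python
import hashlib
import hmac
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

UNPAUSE_BASE = "https://unpause.example/"   # hypothetical host
MAX_AGE_SECONDS = 7 * 24 * 3600             # assumed link lifetime


def sign(key: bytes, account_id: int, timestamp: int) -> str:
    """HMAC-SHA256 over the account ID and timestamp, hex encoded."""
    msg = f"{account_id}|{timestamp}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def build_unpause_url(key: bytes, account_id: int, timestamp: int) -> str:
    """Build the link that would be placed in the error text."""
    query = urlencode({
        "account": account_id,
        "ts": timestamp,
        "sig": sign(key, account_id, timestamp),
    })
    return f"{UNPAUSE_BASE}?{query}"


def verify_unpause_url(key: bytes, url: str, now: int) -> Optional[int]:
    """Return the account ID if the link is authentic and fresh, else None."""
    params = parse_qs(urlparse(url).query)
    try:
        account_id = int(params["account"][0])
        timestamp = int(params["ts"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return None
    expected = sign(key, account_id, timestamp)
    # Constant-time comparison; bytes avoid errors on non-ASCII input.
    if not hmac.compare_digest(expected.encode(), sig.encode()):
        return None
    if not 0 <= now - timestamp <= MAX_AGE_SECONDS:
        return None
    return account_id
```

Verifying the link only identifies the account. The un-pause itself would still wait for the button click on the page.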

To avoid systems automatically un-pausing (intentionally or not), the target page would require clicking a button; just loading the page would not be sufficient to un-pause the account.

Since this has an impact on how people receive help, I want to make sure to run it by all of you. What do you think? See any improvements? Obstacles?

Thanks,
Jacob

griffin · March 26, 2021, 6:05pm

How will this "freeze" scale with large integrations? I'm concerned that a fixed number might not scale well.

If I've successfully issued 200 certificates in the last 180 days and am failing for 3000 certificates, then what?

jsha · March 26, 2021, 6:17pm

This would not match because of the "zero successful issuance for 180 days" criterion.

griffin · March 26, 2021, 6:25pm

That's what I was considering. Is there a way (or desire) to stop "partial" zombies (which I strongly believe outnumber "full" zombies by many, many fold)?

jsha · March 26, 2021, 6:33pm

I'm not terribly worried about these for now. We do occasionally see big integrators that don't keep track of when a domain stops pointing at them, but we usually reach out via email and ask them to implement some sort of automated offboarding.

petercooperjr · March 26, 2021, 6:57pm

I don't have a sense of the scale of the problem you're trying to solve here. (Which may be in part that it's hard to wrap one's head around the huge scale of Let's Encrypt in general.) Do you have any rough estimates of how many accounts you're talking about, how much load this would save off of Let's Encrypt's servers, and what portion of the time one of these "zombie clients" ends up starting to have valid orders again?

It sounds like you're kind of talking about another rate limit similar to the existing "failed validations" limit, where enough failures mean that additional requests won't even get tested. But I'm not understanding why you'd be okay in general with 5 failed validations per hour, but now want to restrict on 5 failed validations in a week. How many failed validations per week are typical of these zombies?

One major issue I see with the proposed trigger (if it's literally as you wrote it: "More than 5 failed validations in the last week, and zero successful issuance for 180 days") is that it'd hit anybody having issues trying to set up their first domain. Whereas now they only have to wait an hour, in the future they'd need to visit this URL, assuming that their client shows it to them.

Do we have any understanding of how many clients (especially those integrated-with-hosting-solutions type where when they ask questions here they're confused by the "What client are you using" question) surface these kinds of errors to the end user (assuming that there's an end user to surface them to at all)?

You mention that for some problems you reach out via email. I'd think that sending an automated email to the contact on file at the same time as (or instead of) pausing the ACME account, to let them know that they've got some misbehaving system somewhere (and tell them the IP, user agent, or maybe other headers), would work even better to alleviate problematic traffic.

jsha · March 26, 2021, 7:15pm

Good point, I didn't elaborate on the problem statement! The main issue I'd like to solve is this: about 80% of HTTP-01 validations currently fail, which is approximately 110 rps of errors. That means we are spending a lot of resources (storage, bandwidth, CPU) on unneeded work. But it also means it's hard to evaluate whether slowness in various validation systems (particularly DNS) is due to a problem on our end or to a particularly large influx of traffic from someone with failing validations.

I haven't done this calculation yet, but this is a good idea.

This is a good point, that having multiple levels of rate limit would solve this in another way. For instance, one can imagine adding to the "5 failed validations per hour" a "25 failed validations per week" and "50 failed validations per month".
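For illustration, the multi-level limit idea could be sketched as follows. The tier values come from the paragraph above; treating a month as 30 days, and the function name `exceeded_tier`, are assumptions for this example.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# (window, limit) pairs from the text; "month" is assumed to be 30 days.
TIERS = [
    (timedelta(hours=1), 5),
    (timedelta(weeks=1), 25),
    (timedelta(days=30), 50),
]


def exceeded_tier(
    failures: Iterable[datetime], now: datetime
) -> Optional[Tuple[timedelta, int]]:
    """Return the first (window, limit) tier whose budget is used up.

    A tier is used up once the number of failed validations inside its
    window reaches the limit; None means another attempt is allowed.
    """
    failures = list(failures)
    for window, limit in TIERS:
        count = sum(1 for t in failures if timedelta(0) <= now - t < window)
        if count >= limit:
            return (window, limit)
    return None
```

A client failing once every six hours never trips the hourly tier but runs into the weekly one after 25 failures.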

A couple of reasons to prefer setting a bit on the account:

In a lot of cases, the client will really never succeed again and is completely forgotten. Spending resources on even 50 failed validations per month adds up, and doesn't benefit anyone.
If someone does notice that their account has been paused, they can unpause it right away rather than wait for the rate limit to expire.
Right now, calculating rate limits is somewhat expensive for us, but we do hope to improve on that.

Also keep in mind this is not quite the same as the failed validations limit: It's failed validations, combined with a long period of no successful issuance. It might also make sense to express the threshold over a long period, for instance 100 failed validations over the course of 90 days and no successful issuance in 180 days.

There are a couple of common cases. There are some that attempt renewal once every day. These are not a problem on their own, but when there are many of them, particularly if they are all using a stock VM image with a cron job set to a particular time of day, they can be noticeable.

Then there are some with buggy software that goes off the rails and hits us many times per second. Right now we block these when it gets bad enough, and we usually try to notify the maintainer of the software. But this is a very manual process. And some buggy clients make it hard to set an email (or discourage it by not showing how in the examples), so for some of the offenders we have no way to get in touch.

I don't. Though in those cases the hosting provider in theory is in charge of handling errors. There's definitely some nuance here in whether the hosting provider creates an account per user or one account for all their users.

This is a good idea! We should do this too.

jvanasco · March 26, 2021, 7:24pm

I think 120 days would be long enough. That's 60 days past the last successful certificate's expiry. Even 115 would be good. I would err on launching with the shorter timeframe, and making it longer if needed. IMHO, that will make less noise than shortening it in the future.

In terms of notifications:

Many clients will not surface this to the end user at all. I don't know of many built to handle userActionRequired OR ANY OTHER Error. I think an email is necessary.

I also think the error payload should contain a combination of human and machine readable text, since clients that DO not correctly support errors but do surface this sort of information to users will often bury it in json payloads and logging.

For example:

bad:

{"error": "Your ACME account has been paused and cannot request certificates. To un-pause it so you can request certificates, click here: "}

better:

{"error": "acme-pause-error-code", "error:human": "

IMPORTANT !!
IMPORTANT !!
IMPORTANT !!

Your ACME account has been paused and cannot request certificates. To un-pause it so you can request certificates, click here:

IMPORTANT !!
IMPORTANT !!
IMPORTANT !!"}
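For comparison, ACME errors are already JSON problem documents, so a paused-account response could carry both a machine-readable part and a human-readable part without new fields. The sketch below follows the shape RFC 8555 uses for userActionRequired (`type`, `detail`, `instance`); the function names and the way the log line is laid out are assumptions for illustration, and the URL passed in is a placeholder.

```python
import json


def paused_account_problem(unpause_url: str) -> str:
    """Build a problem document for a paused account (illustrative only)."""
    problem = {
        # Machine-readable error type defined by RFC 8555.
        "type": "urn:ietf:params:acme:error:userActionRequired",
        # Human-readable text from the proposal, with the link appended.
        "detail": (
            "Your ACME account has been paused and cannot request "
            "certificates. To un-pause it so you can request "
            "certificates, click here: " + unpause_url
        ),
        # RFC 8555: the client should direct the user to this URL.
        "instance": unpause_url,
    }
    return json.dumps(problem, indent=2)


def format_for_log(problem_json: str) -> str:
    """What a client might print so the message is hard to miss."""
    problem = json.loads(problem_json)
    return "IMPORTANT: {}\nVisit: {}".format(
        problem.get("detail", ""), problem.get("instance", "")
    )
```

A client that only dumps the raw JSON into its logs would still show the full sentence and the link in the `detail` field.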

petercooperjr · March 26, 2021, 7:46pm

Honestly, to take this a bit off-topic, I wouldn't mind being able to opt-in to get more frequent emails from Let's Encrypt (certificate renewed, validation failed, rate limit close to getting reached, account key was changed, account email was changed, account was deactivated, and probably others). Obviously being able to specify what alerts one wanted would need to happen somehow, and probably isn't readily shoehorned into ACME. But this general problem of it's hard for people to know what problems their accounts are causing, and other stuff like the renewal emails still being more simplified than is probably ideal, may be better resolved by working toward a better alerting system in general. And while I love email, other people might prefer other contact methods. (Though I understand that all this is much easier for me to suggest than for you to implement.)

And just another random thought: If the issue is domain names no longer pointing to a server that's running a client, might it make more sense for this "pausing" to be name-specific rather than account-specific? I think that if @griffin's thought that "partial zombies" are more prevalent is true, that you'd get more bang for your buck by doing it that way. Maybe the unlock button unpauses all names for an account, to handle the case where one account has a lot of names.

griffin · March 26, 2021, 8:52pm

Just had another thought. It's probably obvious, but I do feel like it's a rather important elephant in the room...

If I'm a new subscriber (or just have a new ACME account) and thus have never had any certificates issued (and therefore none in the last 180 days), is my account going to get paused when I screw up my initial attempts at acquiring certificates (even though I should be using the staging environment)?

Perhaps require the account to be at least 180 days old?

Speaking of the staging environment, should there be a message somewhere in the mix after the pause pointing a hard, bony index finger at the staging environment documentation for those managing the brokenness who want to correct the situation?

Osiris · March 26, 2021, 9:10pm

I agree with my peers above that this new feature shouldn't cause "false positives" for people trying to get a certificate for the first time. Although, they should have used the staging server for experimenting.. Also, the e-mail is a good idea too.

That said, I think it's a great idea. If something like this isn't implemented, it'll only get worse! The world wide web is big, so lots of room for dysfunctional ACME clients.. And 80 % of http-01 auth attempts failing is... mindblowing.. I support this feature 100 %!

schoen · March 26, 2021, 9:26pm

I think this is all a very good idea, but I'd like to suggest having the target page directly include at least some amount of translations into numerous languages, because some people will have clicked it without substantively understanding the English error that led them there, and some who did understand that message may still not understand a more detailed explanation that might be present (or linked) on the "re-activate account" page.

I would include this under a broader category of "errors and failures that prompt people to get important information about their use of Let's Encrypt services", including the intentional ACMEv1 outages, and @griffin's idea, which I've recently restated, that there should be additional stricter issuance rate limits with a much shorter timeout, so that people who are repeating the same failed issuance strategy learn about the problem sooner, before the consequences are as significant.

I think it would be worth brainstorming other possible things in this category.

griffin · March 26, 2021, 9:40pm

@jsha

What @schoen is referencing here is connected to the Add an Hourly Duplicate Certificate Rate Limit #5210 issue in Boulder on GitHub that you and I were just discussing.

The original thread in the community is:

_az · March 26, 2021, 9:41pm

I don't know how prevalent this is in the ecosystem, but some clients just drop ACME error messages on the ground and show a generic message in their place. My concerns there are:

It becomes another opaque confounding factor when doing troubleshooting/forum support.
The lack of escape hatch for the user means that they will have to go digging on how to delete the ACME account from their client, if they figure out what is happening.

The situation is certainly the client's fault, but exacerbating it should be avoided, if possible.

Big agree with @schoen about messaging and translation. I think the most recent thing that comes to mind is how poorly understood and received the "certbot-auto is no longer supported by your system." message was. Having text that is accessible and really hard to misinterpret would be great (this isn't a commentary on the current text).

Osiris · March 26, 2021, 9:52pm

I'm pretty sure the impact for the Community is minimal. Just think about it: how many users will notice the change from an ACME client continuously failing already with authentication errors to an ACME client failing because of a paused account? Note that the account doesn't have any certificate issued in the past $to-be-determined days. We'll probably learn which clients are bad, such as your example client, and will probably advise the user to reset the whole client.

griffin · March 26, 2021, 9:54pm

or replace the whole client...

Osiris · March 26, 2021, 10:01pm

That's not always possible on (semi-)embedded systems

schoen · March 26, 2021, 10:09pm

It's weird to think of this, but since the CA usually knows which client is being used from the User-Agent, it would be possible to have @jsha's idea only apply to accounts that have been verified to most recently use a client that is proactively confirmed to display the ACME error to the user (!). I bet just confirming this for the top 10 clients or something would catch a huge portion of the volume in question.

It is unfair in a certain regard, but also constructive—always aimed at getting users to improve their configuration in a way that it appears they'll be able to hear about.
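A minimal sketch of that gating, assuming the CA keeps the most recent User-Agent string per account. The client names in `VERIFIED_CLIENTS` are made up for this example, and taking the first product token as the client name is a simplification.

```python
# Hypothetical allowlist of clients confirmed to show ACME errors to users.
VERIFIED_CLIENTS = {"exampleclient", "otherclient"}


def client_name(user_agent: str) -> str:
    """Return the lowercased name of the first product token, or ""."""
    tokens = user_agent.split()
    if not tokens:
        return ""
    # A product token looks like "name/version"; keep only the name.
    return tokens[0].split("/", 1)[0].lower()


def pause_applies(last_user_agent: str) -> bool:
    """Only pause accounts whose last-seen client is on the allowlist."""
    return client_name(last_user_agent) in VERIFIED_CLIENTS
```

Accounts whose client sends no User-Agent, or an unrecognized one, would simply be left out of pausing under this scheme.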

griffin · March 26, 2021, 10:12pm

This, of course, assumes that the client has actually implemented said header.

Note to self: implement said header.

petercooperjr · March 26, 2021, 10:16pm

RFC 8555, section 6.1 says "ACME clients MUST send a User-Agent header field".

Of course, the spec saying so doesn't mean that they do so correctly.

Seems really weird to have different behavior based on the user-agent, since there's still a lot of issues out there where web sites treat browsers differently and those of us using less-popular web browsers get weird messages even though things would just work if they tried serving the same pages as everything else. I might suggest adding some kind of header or something where a client could say "I have a way to give error messages to an end-user" and base it on that. (And for old clients that don't send the header, and are causing problems, just disable the account or the like.)

I have some crazier thoughts that I'm working on typing up too.
