Feedback needed for our new account "pausing" feature and self-service "unpause" portal

beautifulentropy · July 31, 2024, 8:44pm

Background

Today, a significant percentage of orders are generated by under 1M accounts that never successfully complete validation. The majority of those come from clients that never succeed, or at least have not succeeded in a long time (and likely never will again). Common failure scenarios include the domain name expiring, or the hostname now pointing to a different host. A significant portion of our resources (compute, database utilization, and network) is currently consumed by such "zombie clients" and "zombie hostnames". While we can identify the accounts belonging to these clients, we cannot deactivate them; the clients would register new accounts and continue making the same requests. Instead, we should "pause" issuance for specific (account, domain) pairs and offer our Subscribers the option to "unpause" themselves once they've gotten the issue sorted out.

This may sound familiar to some of you. That's because back in April 2021 we proposed a very similar idea and thanks to your input we were able to refine that proposal into the new feature that we're sharing with all of you today.

Overview

The development team has implemented a mechanism to “pause” issuance for individual (account, identifier) pairs. Accompanying this feature is a new Self-Service Portal that allows subscribers to unpause and resume issuance for all paused identifiers associated with their account. The Self-Service Portal is accessible through a URL provided in the error message returned by the CA when a new order is placed for a paused identifier.

What We Need From You

We invite our valued community members to provide their insights and suggestions to improve the self-service unpause experience. We are looking for feedback on the following areas:

Line edits for content and accuracy: clarity, typos, grammar, punctuation, etc.
HTML/CSS styling suggestions: color, layout, etc.
Identification of additional error cases

What follows is a collection of scenarios and possible outcomes. For each of these I have provided a short description and a screenshot.

Scenario 1

This is the scenario that we expect most Subscribers will encounter.

Step 1. Subscriber receives their initial notification

While investigating why issuance has failed for one of their domains the Subscriber finds the following log line:

Step 2. Subscriber visits the link

Step 3. Subscriber clicks the "Please Unpause My Account" button

Scenario 2

Some Subscribers will encounter this error if they miss any of the characters while copying and pasting the link from their logs. When attempting to access the unpause form they'll get the following message:

Scenario 3

Some Subscribers will encounter this error if they visit the first link in their logs, unpause, then they see another link and attempt to unpause again.

Scenario 4

Our unpause links have a lifetime of 2 weeks. Some Subscribers may find that they've been paused and click an unpause link that is older than that lifetime.

Scenario 5

Some Subscribers are large integrators who may accidentally ship a software bug that breaks validation for an extended period for >50,000 domains. They'll access the Unpause form, click the "Please Unpause My Account" button, and see the following message:

Scenario 6

The following message will only be shown if the unpause operation results in an error. Instances where Subscribers see this message should be exceedingly rare.

jvanasco · July 31, 2024, 9:16pm

I strongly suggest swapping out the long kwarg based url for a significantly shorter one, like through a link shortener. That will eliminate most of the copy/paste errors (which are assured to be popular due to the line wrapping).

If scoped to the context of this pause/unpause system, there shouldn't be much of a security concern. The concerns would greatly go down if the form does not allow a "pause", and only unsets that setting.

Nummer378 · July 31, 2024, 9:57pm

Is this for production only, or also for staging? I'm asking because I have accounts in staging (for example Let's Debug) that are supposed to have a 100% failure rate and will never succeed. I imagine that the failure rate in staging is generally even higher than in production?

As for form feedback, it seems like the "Step 2" will always show only a subset of (one? a few?) identifiers that are going to be unpaused. While limiting the number of identifiers is good to avoid an excessively large list, it might cause confusion if subscribers don't find the identifier they're looking for in that list (e.g., in your example the failing domain is 15d2ef.com, while step 2 shows 15db88.com). How is the list of identifiers shown to the subscriber selected?

mcpherrinm · July 31, 2024, 10:33pm

Pausing is going to be manually done at first, but eventually automated, including in staging. We can make sure some dedicated test accounts are never paused in staging.

aarongable · July 31, 2024, 10:46pm

Understandable, but the long kwarg is a JSON Web Token used both to carry data about the account to be unpaused, and to ensure that the unpause request is genuine. It can't be meaningfully compressed.

For the sake of simplicity, it is simply the first fifteen identifiers returned by the database.

webprofusion · August 1, 2024, 3:21am

Great idea. I know my client has at least 10K zombies who are very frequently attempting renewal (software bug) and ignoring failure notifications and ignoring prompts (and emails) to upgrade. Often they set up something, it doesn't work and they forget about it forever while the app doggedly keeps trying to get that cert. [As an aside, we plan to introduce paused renewals for items with high failures, but that requires users to update the app]

What would the specific trigger be, a certain number of failures or failing for a certain period?

mcpherrinm · August 1, 2024, 4:27am

We haven’t automated this yet, but we are thinking roughly: N failures and 0 successes in the last X days.

The initial batch is 50 or more failed validations per day for 90 days with 0 successes.
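As a rough illustration of that initial-batch criterion, here is a minimal Python sketch. It assumes one reading of the rule (every one of the last 90 days has at least 50 failures, and none has a success); the `DailyStats` type and `is_pause_candidate` function are hypothetical names and are not taken from Boulder.

```python
from dataclasses import dataclass
from typing import Sequence

# Thresholds from the initial batch described above. The values of N and X
# for the eventual automated rule have not been decided.
MIN_FAILURES_PER_DAY = 50
WINDOW_DAYS = 90


@dataclass(frozen=True)
class DailyStats:
    """Hypothetical per-day validation counts for one (account, identifier) pair."""
    failures: int
    successes: int


def is_pause_candidate(history: Sequence[DailyStats]) -> bool:
    """Return True if each of the last WINDOW_DAYS days meets the failure
    threshold and the window contains no successful validation."""
    if len(history) < WINDOW_DAYS:
        # Not enough history to cover the whole window.
        return False
    window = history[-WINDOW_DAYS:]
    return all(
        day.failures >= MIN_FAILURES_PER_DAY and day.successes == 0
        for day in window
    )
```

Under this reading, a single successful validation anywhere in the window, or a single day below the failure threshold, keeps the pair out of the batch.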

Bachsau · August 1, 2024, 9:30pm

I would still consider that abuse. If you know it's not going to work, you don't need to try it.

Why is that important? I also don't think you constantly have to expire and generate new URLs if the old one hasn't been used. More system load for no benefit.

Bruce5051 · August 1, 2024, 9:31pm

@Nummer378 is the owner of Let's Debug, which to me seems to be a good reason.

Bruce5051 · August 1, 2024, 9:35pm

Testing against the real production and staging environments seems necessary.

Nummer378 · August 1, 2024, 9:38pm

This is incorrect. For one, I'm running negative CAA tests to verify that the CAA records are properly validated. If you don't do this test, you have no data if CAA checking worked, since there's no feedback in the positive validation case.

Next, Let's Debug does staging validations to get data from Let's Encrypt that's impossible to retrieve otherwise. This includes, but is not limited to, checking whether a domain is on Let's Encrypt's blocked domain list, whether Let's Encrypt's validation sites agree with what Let's Debug sees (as a cross-verification, which helps to detect geo-blocking) and if staging is operational.

I never run these negative tests against production, as staging should generally be sufficient to collect this data.

petercooperjr · August 1, 2024, 10:04pm

The Unpause URL

I assume boulder.service.consul is some sort of placeholder/testing text, and the real domain would be something that lives under letsencrypt.org? (.consul doesn't seem to be a TLD)

In addition to the concerns about it being a long URL and the variety of ways that clients expose (or don't) the errors they get, I'm a little concerned that the URL is the only "password" used. It may not be that big a deal in practice, but any system that gets access to the log may automatically hit the address (such as when it gets pasted into this forum and the forum tries to figure out if there's a fancy boxed description to make for it). That is, URLs tend to become "public" (and crawled by search engines and whatnot), and while I don't know if there are any real security implications, I wonder if having a separate simple password that goes with the URL might be better, to ensure that the user actually has the whole log entry and is intending to use the page? (Along the lines of Please visit: https://letsencrypt.org.example/unpause/Abc…xyz and enter password 123456) Might just be overkill and add more confusion than it would solve; just brainstorming.

If the URL needs to be that long, I don't know if having some sort of delimiter around it (quotes or angle brackets or whatnot) would be better or worse.

Emailing the account holder?

Before an account goes onto the pause list, would an email get sent to the contact? I'm sure in most cases they're not checking their email any more than they're checking their zombie client, but some other kind of contact might be a good plan in addition to shutting off access for the account to request authorizations. (Though as I said in the thread a few years ago, I'd like more emails from Let's Encrypt in general, but I'm probably odd in that way.)

Referring people to the community

While I understand that this is the only place the "Get Help" link on letsencrypt.org points to, I'm not really sure what we're supposed to be able to do to help people with some of these messages? If there's a "Scenario 2" with somebody who can't figure out how to copy/paste the URL (maybe their client is truncating the error message, or maybe they just are still learning how to use a terminal program), it may be tough for us to give much help in some cases. It's also not clear just from the error message just how "private" the URL is supposed to be, or what access it might give over their account. (And maybe that's another reason to separate out a "password", so that maybe it's clearer that the URL can't do anything to the account without the password.)

I don't really have a better plan, though.

Thank you for going over all of this and soliciting feedback! I know I may come across negatively sometimes, but I do very much appreciate all the work you all do, and I hope this feature rolls out smoothly and helps take a lot of load off your poor burdened servers.

griffin · August 1, 2024, 10:12pm

Captcha anyone?

aarongable · August 1, 2024, 10:31pm

Correct -- that URL is from our CI test environment; in production it will be something like (don't quote me on this!) sfe.letsencrypt.org/v1/unpause.

This is why unpausing is a multi-step process, and why the JWT expires. Simply visiting the page doesn't result in any action; the user must visit the page and from there click the button which submits a POST to a different URL. Is it still possible for a system to automate this? Yes. But simple systems which just curl the URL from the logs won't work.
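The two-step flow can be sketched roughly as follows. This is a simplified Python illustration, not the SFE's actual handler: it collapses the form page and the POST target into one hypothetical `handle_unpause` function, and the status codes and messages are assumptions.

```python
from typing import Dict, Tuple


def handle_unpause(method: str, token_valid: bool,
                   account: Dict[str, bool]) -> Tuple[int, str]:
    """Return (status, body). Only a POST with a valid token changes state."""
    if not token_valid:
        # Covers truncated, tampered, or expired links.
        return 400, "invalid or expired link"
    if method == "GET":
        # Viewing the page (a browser, a crawler, or curl) only renders the
        # confirmation form; nothing is unpaused.
        return 200, "confirmation form"
    if method == "POST":
        # Submitting the form is the only path that unpauses.
        account["paused"] = False
        return 200, "unpaused"
    return 405, "method not allowed"
```

For example, calling `handle_unpause("GET", True, account)` any number of times leaves `account["paused"]` unchanged, which is the property that protects against a log scraper or link previewer fetching the URL.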

First, a clarification: accounts don't get paused, individual account+identifier pairs get paused. So an account responsible for thousands of domains will continue to be able to issue for domains for which validation is succeeding, even while their perma-failing domain names get paused. Unpausing works on the whole-account level, so that people with many paused domain names don't have to click thousands of individual unpause links.
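To make the difference in granularity concrete, here is a small in-memory Python sketch. The `PauseRegistry` class and its method names are hypothetical, and the real system stores this state in a database rather than a set.

```python
from typing import Set, Tuple


class PauseRegistry:
    """Tracks paused (account_id, identifier) pairs."""

    def __init__(self) -> None:
        self._paused: Set[Tuple[int, str]] = set()

    def pause(self, account_id: int, identifier: str) -> None:
        # Pausing is per pair: other identifiers on the account are untouched.
        self._paused.add((account_id, identifier))

    def is_paused(self, account_id: int, identifier: str) -> bool:
        return (account_id, identifier) in self._paused

    def unpause_account(self, account_id: int) -> int:
        """Unpause every identifier for the account; return how many were cleared."""
        cleared = {pair for pair in self._paused if pair[0] == account_id}
        self._paused -= cleared
        return len(cleared)
```

An account with one paused name can still issue for its other names, while a single `unpause_account` call clears all of that account's paused names at once.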

So while we may decide to email folks who get paused, we're not yet convinced that we have to. Partly because many many accounts simply don't have an email address associated with them, and partly because we don't want to be sending many near-duplicate emails as each domain name gets paused. Are there engineering solutions to this? Of course, but they also require engineering effort that may be better spent elsewhere.

We've considered this, and believe that it won't be necessary. But if we detect significant automation and abuse of unpausing, then this is certainly a direction we can go.

griffin · August 1, 2024, 10:34pm

Firstly, I'm loving the discussion and suggestions from my fellow community members!

I both second that notion and would like to say that I, for one, rarely think you come across as negative, @petercooperjr.

As for the password/confirmation scenario, to me, from a security standpoint, this is a classic "key distribution" problem of sorts. My captcha suggestion wasn't entirely meant to be tongue-in-cheek. Ensuring human will here seems fundamental (unless we're trying to take into account AI account managers).

Edit: I cross-posted with you about captcha, @aarongable. This post was written before I saw yours above.

griffin · August 1, 2024, 10:38pm

It's not so much deliberate abuse/automation that is my concern, but oopsy/side-effect like what @petercooperjr was considering.

What does this link someone posted/shared do?

Click.

Edit: Dah. I'm also just recalling that it goes to a page with a confirmation. Not an instant problem then. I initially saw it as more akin to the problem of renewing someone else's certificate that I initially encountered when designing the public interface for CertSage, which prompted me to add a password that is generated and exposed in private.

See my own copy of CertSage as an example:

https://griffin.software/certsage.php

The challenge in that case is that without a password someone could force generation of a degenerate certificate (without the full list of SANs) then force installation of said degenerate certificate. Unpausing someone else's account isn't nearly as risky.

Edit 2:

By the by, the shared seed value between the SFE and WFE directly parallels my password solution in CertSage:

// *** GENERATE RANDOM PASSWORD ***

$this->password = $this->encodeBase64(openssl_random_pseudo_bytes(15));

vs

The SFE and WFE should share a 32 byte seed value e.g. the output of openssl rand -hex 16

github.com/letsencrypt/boulder

sfe: Implement self-service frontend for account pausing/unpausing (#7500)

committed 02:52PM - 10 Jul 24 UTC

pgporada

+2140 -39

Adds a new boulder component named `sfe` aka the Self-service FrontEnd which is… dedicated to non-ACME related Subscriber functions. This change implements one such function which is a web interface and handlers for account unpausing.

When paused, an ACME client receives a log line URL with a JWT parameter from the WFE. For the observant Subscriber, manually clicking the link opens their web browser and displays a page with a pre-filled HTML form. Upon clicking the form button, the SFE sends an HTTP POST back to itself and either validates the JWT and issues an RA gRPC request to unpause the account, or returns an HTML error page.

The SFE and WFE should share a 32 byte seed value e.g. the output of `openssl rand -hex 16` which will be used as a go-jose symmetric signer using the HS256 algorithm. The SFE will check various [RFC 7519](https://datatracker.ietf.org/doc/html/rfc7519) claims on the JWT such as the `iss`, `aud`, `nbf`, `exp`, `iat`, and a custom `apiVersion` claim.

The SFE should not yet be relied upon or deployed to staging/production environments. It is very much a work in progress, but this change is big enough as-is.

Related to https://github.com/letsencrypt/boulder/issues/7406 Part of https://github.com/letsencrypt/boulder/issues/7499
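For readers unfamiliar with HS256 tokens, the following is a minimal standard-library Python sketch of the kind of signing and claim checking the commit message describes. It is not Boulder's code (which is Go and uses go-jose). The function names are hypothetical, using the seed string directly as the HMAC key is an assumption, and the two-week lifetime is the one mentioned in Scenario 4. `secrets.token_hex(16)` stands in for `openssl rand -hex 16`: both produce 32 hexadecimal characters.

```python
import base64
import hashlib
import hmac
import json
import secrets

TWO_WEEKS = 14 * 24 * 60 * 60  # link lifetime in seconds (from Scenario 4)


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(claims: dict, key: bytes) -> str:
    """Build a compact HS256 JWT: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (b64url(json.dumps(header).encode("utf-8")) + "." +
                     b64url(json.dumps(claims).encode("utf-8")))
    sig = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


def verify(token: str, key: bytes, issuer: str, audience: str,
           now: float) -> dict:
    """Return the claims if the token is authentic and currently valid.

    Raises ValueError for malformed, tampered, mis-addressed, or expired tokens.
    """
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(key, (header_b64 + "." + payload_b64).encode("ascii"),
                        hashlib.sha256).digest()
    # Constant-time comparison; check the signature before trusting any field.
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(b64url_decode(payload_b64))
    for name in ("iss", "aud", "nbf", "exp", "iat", "apiVersion"):
        if name not in claims:
            raise ValueError("missing claim: " + name)
    if claims["iss"] != issuer or claims["aud"] != audience:
        raise ValueError("wrong issuer or audience")
    if not (claims["nbf"] <= now < claims["exp"]):
        raise ValueError("token not yet valid or expired")
    return claims
```

In this sketch a link that lost a few characters during copy and paste fails the signature check, and a link older than two weeks fails the `exp` check, which correspond to Scenarios 2 and 4 above.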

petercooperjr · August 1, 2024, 10:48pm

Yeah, and the more I think about it really isn't so much about "Let's Encrypt servers might need to serve a page to a bot", but more "I was told to go to this URL, and I don't know what it does, what would happen if I click it or if my anti-virus scans it, or whether it's 'safe' to post on a forum." Really the only call to action is "Please visit", and the user doesn't know why they need to visit it, or what their credentials will be to log in. We already get people here looking for help with finding the non-existent portal for people to manage their Let's Encrypt certificates; now that there actually will be a web interface for a tiny piece of things, people will get even more confused about what they're supposed to log into.

Again, there probably isn't a better way. Just something to be aware of. I wonder if there's better text than "Please visit", but anything more descriptive and wordy may not actually be an improvement.

petercooperjr · August 1, 2024, 10:51pm

Oh! And on the thought of improving text, will this be translated to multiple languages? Will the user get to pick a language, like on the documentation pages?

griffin · August 1, 2024, 10:52pm

You are a true technical horror-writer, @petercooperjr.
