Automatic Pausing of Zombie Clients

When will this change be deployed?

Let’s Encrypt’s Staging Environment will enable automatic zombie client pausing on 2024-12-04, followed by the Production Environment on 2024-12-05.

Background

Today, a significant percentage of orders are generated by accounts that never successfully complete validation. The majority of those come from clients that have not succeeded in a long time (and likely never will). Common failure scenarios include the domain name either expiring or now pointing to a different host. A significant portion of our resources (compute, database utilization, and network) is consumed by such "zombie clients" and "zombie domain names". While we can identify the accounts belonging to these clients, we cannot deactivate them; the clients would register new accounts and continue making the same requests. Until now, we have used the pausing mechanism described below to manually pause the worst offenders. From now on, Boulder will pause them automatically.

Overview

Let's Encrypt has implemented a mechanism to “pause” issuance for (account ID, ACME identifier) pairs such as (123456, "example.com"). Accompanying this mechanism is a new Self-Service Portal that allows subscribers to unpause and resume issuance for all paused identifiers associated with their account. The Self-Service Portal is accessible through a URL provided in an error message.

When does pausing occur?

The rate of authorization failures determines how quickly a given identifier will be paused. If an authorization for the identifier is ever successfully validated, the count of failed validations resets to zero. Once the number of authorization failures reaches our threshold for pausing, the subscriber will receive an error message for all new orders containing that identifier.

The error message contains a unique link to the Self-Service Portal that expires after two weeks. This should give the subscriber ample time to check their logs and address any issues; if the link expires first, the subscriber will be instructed to re-attempt issuance to receive a fresh URL. If the subscriber clicks the link and completes the manual action, the identifier will be unpaused and enter a two-week grace period during which it cannot be re-paused. After the grace period ends, the identifier becomes eligible for pausing again if authorization failures continue.
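
The lifecycle described above can be sketched as a small state model. This is only an illustration, not Boulder's implementation: the class name, method names, and the tiny threshold are hypothetical, and resetting the count on unpause is an assumption. It keeps a plain counter and ignores any time-based forgiveness of old failures.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

PAUSE_THRESHOLD = 3                # hypothetical; kept tiny so the example is easy to follow
GRACE_PERIOD = timedelta(weeks=2)  # from the text: two-week grace period after unpausing


@dataclass
class PairState:
    """State for one (account ID, identifier) pair, e.g. (123456, "example.com")."""

    failures: int = 0
    paused: bool = False
    grace_until: Optional[datetime] = None

    def record_success(self) -> None:
        # A successful validation resets the failure count to zero.
        self.failures = 0

    def record_failure(self, now: datetime) -> None:
        self.failures += 1
        in_grace = self.grace_until is not None and now < self.grace_until
        # Pausing happens once the threshold is reached, but never during the grace period.
        if self.failures >= PAUSE_THRESHOLD and not in_grace:
            self.paused = True

    def unpause(self, now: datetime) -> None:
        # The subscriber completed the manual action in the Self-Service Portal.
        self.paused = False
        self.failures = 0  # assumption: the count starts over after unpausing
        self.grace_until = now + GRACE_PERIOD
```

In this sketch, failures recorded during the grace period still accumulate, so a client that keeps failing is paused again on its first failure after the grace period ends.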

Failures per account per domain   Time elapsed before pausing
1/day                             Never paused
2/day                             3600.00 days (~118.27 months, ~9.86 years)
5/day                             900.00 days (~29.57 months, ~2.46 years)
10/day                            400.00 days (~13.14 months, ~1.10 years)
40/day                            92.31 days (~3.03 months, ~0.25 years)
120/day                           30.25 days (~0.99 months, ~0.08 years)
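
The figures in the table are consistent with a simple model, inferred here from the numbers rather than stated in this announcement: a pair is paused once 3,600 net failures have accumulated, and one failure is forgiven per day. Under that assumption, a steady rate of r failures per day leads to pausing after 3600 / (r - 1) days. The constants and function name below are illustrative, not official values.

```python
import math

THRESHOLD = 3600      # assumption inferred from the table, not an official constant
FORGIVEN_PER_DAY = 1  # assumption inferred from the "1/day: never paused" row


def days_until_paused(failures_per_day: float) -> float:
    """Days of steady failures before pausing, under the assumed model."""
    net_per_day = failures_per_day - FORGIVEN_PER_DAY
    if net_per_day <= 0:
        return math.inf  # the count never grows, so the pair is never paused
    return THRESHOLD / net_per_day


for rate in (1, 2, 5, 10, 40, 120):
    days = days_until_paused(rate)
    if math.isinf(days):
        print(f"{rate}/day: never paused")
    else:
        # 30.44 and 365.25 are average days per month and per year.
        print(f"{rate}/day: {days:.2f} days "
              f"(~{days / 30.44:.2f} months, ~{days / 365.25:.2f} years)")
```

The same function can be used to estimate how long a misconfigured client with a known retry rate would run before it is paused.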

What should I do?

If you are a website/domain operator:

  • Ensure that your ACME client is producing and storing logs.
  • Subscribe to a certificate expiration monitoring service, some of which can be found here.
  • If you have received a pause URL, open it and review the page for more information.

If you are a large integrator of websites/domains:

  • We currently unpause up to 50,000 identifiers at a time. To unpause more than that, repeat the unpause procedure using a link containing a different JWT. If your link has expired, use your ACME client to attempt a new order; this generates a new link.

If you are an ACME client author:

  • Ensure that your client reports the errors it receives from the API on every attempted certificate issuance. We intend for a human to interact with this system rather than automation. If automated unpausing traffic from ACME clients increases, we will implement a CAPTCHA system.
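
As a rough illustration of the point above, a client can pass the server's error through to the operator unmodified, so that any URL in it stays usable. ACME errors are JSON problem documents with fields such as "type" and "detail" (RFC 8555, section 6.7). The function name, the error type, and the URL in the example are hypothetical; this announcement does not specify the exact error returned for a paused identifier.

```python
import json


def format_acme_problem(body: str) -> str:
    """Turn an ACME error response body into a message for the operator."""
    try:
        problem = json.loads(body)
    except json.JSONDecodeError:
        # Not a problem document; show the raw body rather than hiding it.
        return f"ACME request failed; unparseable error body: {body!r}"
    if not isinstance(problem, dict):
        return f"ACME request failed; unexpected error body: {body!r}"
    kind = problem.get("type", "unknown error type")
    detail = problem.get("detail", "(no detail provided)")
    # Keep the detail verbatim: truncating or rewording it could drop a link.
    return f"ACME request failed [{kind}]: {detail}"


# Hypothetical problem document, for illustration only.
example_body = json.dumps({
    "type": "urn:example:hypothetical-error",
    "detail": "Issuance is paused. Please visit: https://portal.example/unpause",
    "status": 429,
})
print(format_acme_problem(example_body))
```

Logging this message on every failed issuance, rather than only the first, means the operator sees the current link whenever they next check the logs.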

The AutomaticallyPauseZombieClients feature flag has been deployed to the staging environment.

The AutomaticallyPauseZombieClients feature flag has been deployed to the production environment.
