Offtopic: that could very well be the header set by the underlying lower-level HTTPS client. The specs don't say anything about the contents of the header, so the default cURL header would suffice, even though it is quite useless.
Just to think outside the box a bit, I think the underlying problem here is that the current rate limits (and perhaps the whole concept of rate limits) seem to (1) cause a lot of support "costs" from people not understanding what they're doing wrong, and yet (2) not do enough to stop "bad" use of Let's Encrypt's systems (as 80% of HTTP-01 challenges fail). I mean, it's not like I'd claim every challenge failure is "abuse", but a failure rate that big implies that there aren't enough safeguards to ensure that challenges are likely to work before they're actually attempted.
My first thought is some kind of Hashcash-type system, where a client needs to do some computation that is difficult to do but easy to verify, and passes the result along as a header (or as part of the challenge token or something?). Requests that include such a header could have less strict rate limits. (So, in order to make it backward-compatible, make rate limits slightly stricter for requests without such a header but more generous for requests with one. Maybe even some stuff like the 5-duplicate-cert limit could be overridden entirely (sometimes) if one submitted the request with a more-difficult header that indicated it took a few minutes of computation or the like. I think something like that should be enough to keep it from being used in ephemeral/disposable environments, though I'm probably overestimating people's understanding of the systems they're setting up.)
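To make the "difficult to do but easy to verify" idea concrete, here is a minimal sketch of a Hashcash-style stamp in Python. It is not the real Hashcash format (which uses SHA-1 and a dated stamp), and nothing like it exists in ACME today; the function names, the `resource:nonce` layout, and the difficulty values are all assumptions for illustration. A real scheme would also need a server-issued value or timestamp in the stamp so that old stamps can't be reused.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # 8 - bit_length() is the number of leading zero bits in this byte.
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource: str, difficulty: int) -> str:
    """Brute-force a nonce so SHA-256("resource:nonce") has `difficulty` leading zero bits.

    Expected work doubles with each extra bit of difficulty.
    """
    for nonce in count():
        stamp = f"{resource}:{nonce}"
        if leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= difficulty:
            return stamp


def verify(stamp: str, resource: str, difficulty: int) -> bool:
    """Checking a stamp costs one hash, no matter how long minting took."""
    if not stamp.startswith(resource + ":"):
        return False
    return leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= difficulty


if __name__ == "__main__":
    # Hypothetical usage: the client mints a stamp for the name it wants validated.
    stamp = mint("example.com", 16)
    print(stamp, verify(stamp, "example.com", 16))
```

The server side would only ever call something like `verify`, so the cost stays almost entirely on the client.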
But beyond that, is there some way to ask clients to "check" that a challenge is likely to work, and submit some kind of "proof" of it somehow? It's kind of tricky since of course it needs to be tried from outside their own network, but maybe some kind of signature over the DNS record at least? Like, to do an HTTP-01 challenge, the client needs to retrieve the A/AAAA records (thereby confirming that the name exists and isn't in private IP space), sign the values it gets, and send them to the server. Then if they match what Let's Encrypt's DNS lookup says, different rate limits could apply? Or to really check from the outside, maybe some kind of partnership with Google/Akamai/Cloudflare/whomever, where they could do a check, say that a challenge looks good, and cryptographically sign it (with the signature sent by the client in the challenge request), so then Let's Encrypt knows that it's much more likely to work and can apply looser rate limits? And then requests without these kinds of checks could be much more limited in some way?
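As a rough sketch of just the local part of that check (resolve the A/AAAA records and confirm none of them is in private IP space), something like the following could run in a client before it asks for validation. The function names are made up for illustration, it uses the system resolver rather than an outside vantage point, and it does not attempt the signing step at all.

```python
import ipaddress
import socket


def is_publicly_routable(address: str) -> bool:
    """True for global addresses; False for private, loopback, link-local, etc."""
    return ipaddress.ip_address(address).is_global


def resolve_addresses(hostname: str) -> list[str]:
    """Look up A/AAAA records via the system resolver; [] if the name doesn't resolve."""
    try:
        infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry's sockaddr starts with the IP address string.
    return sorted({info[4][0] for info in infos})


def precheck(hostname: str) -> bool:
    """Hypothetical pre-flight test: the name exists and every address is public."""
    addresses = resolve_addresses(hostname)
    return bool(addresses) and all(is_publicly_routable(a) for a in addresses)
```

This only catches the obvious cases (nonexistent names, RFC 1918 addresses); it says nothing about whether port 80 is actually reachable from the outside.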
I'm just trying to brainstorm, and I'm hoping that a better idea comes out of this. I kind of feel like the original request here (while I don't really object to it in principle) is just "trying to make rate limits better", so maybe some brainstorming of other ways to prevent Let's Encrypt's resources from being used up unnecessarily might be helpful? If there's some way to offload more work to the client, to ensure that only "reasonable" requests are likely to get checked, it seems like it's worth exploring, though obviously tricky to apply in a backward-compatible way.
Probably a good idea to follow the style of the ACME v1 deprecation and do progressive rolling blackouts, to see whether that prompts any good actors to fix their clients/installations, and to gauge any impact on support channels.
Hm, good point. This is how the failed validations limit is implemented. A good argument in favor of implementing a two-level failed validations limit (short term and long term) before implementing a pause feature.
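For illustration only, a two-level limit could look roughly like the sketch below: one short window and one long window over the same list of failure timestamps. The short-term default mirrors the existing 5 failures per hour; the long-term numbers, the class name, and the in-memory storage are all assumptions, not how Boulder does it.

```python
from collections import deque


class TwoLevelFailureLimit:
    """Tracks failed validations for one (account, hostname) pair."""

    def __init__(self, short_limit=5, short_window=3600,
                 long_limit=100, long_window=7 * 24 * 3600):
        # long_limit and long_window are hypothetical values for illustration.
        self.short_limit = short_limit
        self.short_window = short_window
        self.long_limit = long_limit
        self.long_window = long_window
        self.failures = deque()  # failure timestamps in seconds, oldest first

    def record_failure(self, now: float) -> None:
        self.failures.append(now)

    def allowed(self, now: float) -> bool:
        # Forget failures that have aged out of the long window.
        while self.failures and self.failures[0] <= now - self.long_window:
            self.failures.popleft()
        recent = sum(1 for t in self.failures if t > now - self.short_window)
        return recent < self.short_limit and len(self.failures) < self.long_limit
```

A client that backs off sensibly never notices the long-term limit; one that retries every hour forever eventually hits it.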
Good catch, yes, the account should be at least 180 days old.
Actually this is another possible use case for account pausing that I didn't mention in the original message. We're sending out biweekly emails to ACMEv1 users to get them to move off. The numbers are slowly (very slowly) dwindling, but there are some folks who didn't set an email address and we can't reach. Those people will have a bad time when we finally turn off ACMEv1 entirely. We could give them a slightly softer landing if we used the turnoff date to "pause" all ACMEv1 accounts for 90 days, so folks whose certificate expired could go unpause, issue another certificate, and then get to work on upgrading.
Two birds. I like it. Always great to get parallel usage out of your dev budget.
Actually, that ties in to some other crazy thoughts I've had (and this probably isn't the way to go, but maybe it sparks more brainstorming from others): One of the fundamental problems here is people starting clients that set up their own automated renewal, but then not maintaining or following up on them (meaning that they continually fail validation but nobody seems to care). Once ACME v1 is gone, you'll no longer need to spend resources validating challenges for systems that point to the v1 API but have since been abandoned. Can we just… do it again? Like, plan a migration to a "v3" URL over some number of years (even though this endpoint would actually speak the same ACME protocol). Any systems that have been staying up-to-date will get updated to it over that time, and when you turn off the v2 endpoint all those abandoned clients will be much easier to handle. And then, maybe have it be an expectation in clients that the endpoint would change every few years? (And for an extra-crazy idea, have it be tied to the intermediates? So each new intermediate-generation every few years would have a new endpoint?)
Just to be clear, I don't think the above idea is actually good, but it might be a starting point for some other better ideas from others.
Another thought is that this whole problem is really trying to work around the client-side not noticing continual failures. While it wouldn't help with the existing zombies, might it make sense to include in the Integration Guide (and otherwise encourage in the most popular clients) that clients should, if they receive failures renewing for some extended period of time (like, a good month past certificate expiration maybe), disable their scheduled task for that name?
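A client-side rule along those lines could be as small as the sketch below. The function name and the 30-day grace period are assumptions taken from the "good month past certificate expiration" suggestion; how a given client actually disables its cron job or timer is left out.

```python
from datetime import datetime, timedelta, timezone


def should_disable_renewal(cert_not_after: datetime, now: datetime,
                           grace: timedelta = timedelta(days=30)) -> bool:
    """True once renewals have kept failing until `grace` past the certificate's expiry.

    `cert_not_after` is the expiry of the last certificate that was
    successfully issued for this name.
    """
    return now > cert_not_after + grace


if __name__ == "__main__":
    expiry = datetime(2020, 1, 1, tzinfo=timezone.utc)
    # Hypothetical check made by the scheduled task before each attempt.
    print(should_disable_renewal(expiry, datetime(2020, 3, 1, tzinfo=timezone.utc)))
```

A successful renewal resets the clock automatically, since the new certificate has a later expiry.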
I like the idea of a paused account or domain with manual re-activation in general. My biggest concern with using Let's Encrypt is the risk that a bug in my setup might cause a one week outage because of a rate limit. I would not mind much tighter rate limits, if I could trust that there would be a button available for me to say "I fixed my bug, now please try again".
I ran some numbers to estimate the size of impact. So far just looking at validation logs. From 03-21 to 03-28, we had:
- 216M validation attempts
- 179M validation failures (83%)
- 97M of those failures came from accounts that had 0 validation successes over the course of that week. Removing those failures would bring the failure rate down to 38%.
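For anyone who wants to check the arithmetic, here it is spelled out. The 38% figure matches dividing the remaining failures by the original attempt count; if the attempts from those accounts are also dropped from the denominator, the rate comes out higher. Which of the two was intended is an assumption on my part.

```python
attempts = 216_000_000
failures = 179_000_000
total_failure_errors = 97_000_000  # failures from accounts with 0 successes

failure_rate = failures / attempts
remaining_failures = failures - total_failure_errors

# Remaining failures measured against the original number of attempts.
rate_vs_original_attempts = remaining_failures / attempts

# Remaining failures measured against attempts with those accounts removed.
rate_vs_reduced_attempts = remaining_failures / (attempts - total_failure_errors)

print(f"{failure_rate:.0%} {rate_vs_original_attempts:.0%} {rate_vs_reduced_attempts:.0%}")
```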
Here are some numbers bucketed by how many validation attempts a given account had during the week. A "total failure" is an account that had 0 validation successes; these are the candidates for pausing (if they also had no issuances for X days). I summed up the error counts from them.
| bucket | accounts | validation attempts | errors | errors from total failures | error rate |
Interesting that the error rate starts out very low in the buckets with few attempts. In the buckets with larger numbers of attempts, the error rates get much higher. Presumably this is a matter of clients that retry faster than they should.
Since these are actual validation attempts, they don't include requests that were stopped by the rate limits. So for instance, a client that was retrying failed validation as fast as allowed by the rate limits (5 failed validations / hostname / hour) would have 840 attempts for a single hostname.
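The 840 figure is just the hourly limit multiplied out over the week:

```python
failed_validations_per_hour = 5  # per hostname, per the current rate limit
hours_per_week = 24 * 7

max_attempts_per_hostname = failed_validations_per_hour * hours_per_week
print(max_attempts_per_hostname)  # 840
```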
I would suggest not to pause the whole account. Just pause the failed domain for the account.
An account may have issued several certs, and for whatever reason only some of the domains may fail at renewal. The other domains may still be valid.
So, please just disable the failed domains for the account. Don't disable the whole account.
When the account is requesting a challenge for the disabled domains, just return an error to it.
I thought the idea was to look 180 days back to see if the account had any successful activity. That's twice the lifetime of an LE cert. I'm not sure why the entire account shouldn't be disabled in that case?
Think about a case:
what if I register an account first, and try to issue a cert 181 days later?
Will that account have failed validations in the meantime?
No, nothing. Just sleep 181 days.
In that case LE could easily check whether there is a reason to block the account: for example, if there are no failed validations, no block is necessary.
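Putting the conditions from this thread together (account at least 180 days old, no success in the lookback window, and at least one failed validation), the check might look something like this sketch. The field and function names are invented for illustration and say nothing about how Let's Encrypt would actually store or query this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AccountActivity:
    created: datetime
    last_success: Optional[datetime]  # None if the account never succeeded
    failed_validations: int           # failures within the lookback window


def is_pause_candidate(acct: AccountActivity, now: datetime,
                       lookback: timedelta = timedelta(days=180)) -> bool:
    old_enough = now - acct.created >= lookback
    no_recent_success = (acct.last_success is None
                         or now - acct.last_success > lookback)
    # An idle account with no failures costs nothing, so it is left alone.
    return old_enough and no_recent_success and acct.failed_validations > 0
```

The "register, then sleep 181 days" account has zero failed validations, so it would not be paused under this rule.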
I feel like this all comes down to how a "zombie" (unmaintained) account is defined, which is a deductive process that will likely involve tradeoffs. Analyzing the ROC curve will hopefully help minimize false positives while maintaining efficacy.