I can't find the earlier forum thread about this (and I've searched for it), but I remember that someone had proposed that there should be a more aggressive duplicate certificates limit with a 1-hour or 1-day reset, so that people would hit it much sooner when doing a wasteful reissuance (and hopefully thereby find out about the rate limits much sooner and with less overall frustration).
As I continue to notice the pattern of "I created my certificates inside a container/ephemeral VPS and deleted them over and over again, and now I'm rate limited" as a question on the forum lately, I would love to see a way to make people aware of this sooner and with slightly smaller consequences.
For example, perhaps there could be a 3 duplicate certificates per hour limit, 4 per day, and 5 per week, with the most restrictive relevant limit applying at any time. Then the people showing up on the forum asking about rate limits could more often be told "you'll need to save your certificates in persistent storage, and you'll have to wait an hour before trying again" rather than "you'll need to save your certificates in persistent storage, and you'll have to wait a week before trying again". They could then use that unexpected hour of downtime to research how to save their certificates in persistent storage.
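To make the "most restrictive relevant limit applies" idea concrete, here is a minimal sketch of how the three tiers could be evaluated together. The tier values are the ones proposed above. The sliding-window logic and the names `TIERS` and `retry_after` are illustrative assumptions, not how Let's Encrypt actually computes its limits.

```python
from datetime import datetime, timedelta

# Hypothetical tiers from the proposal: (max duplicate issuances, window).
TIERS = [
    (3, timedelta(hours=1)),
    (4, timedelta(days=1)),
    (5, timedelta(weeks=1)),
]

def retry_after(issued_at, now, tiers=TIERS):
    """Return how long a new duplicate request must wait (zero if allowed).

    issued_at: datetimes of earlier issuances for the same set of names.
    """
    wait = timedelta(0)
    for limit, window in tiers:
        recent = sorted(t for t in issued_at if now - t < window)
        if len(recent) >= limit:
            # The request is allowed again once enough old issuances have
            # left the window that fewer than `limit` remain inside it.
            oldest_blocking = recent[len(recent) - limit]
            wait = max(wait, oldest_blocking + window - now)
    return wait

now = datetime(2024, 1, 1, 12, 0)
burst = [now - timedelta(minutes=m) for m in (50, 40, 30)]
print(retry_after(burst, now))  # three issuances in the last hour: short wait
```

With tiers like these, a burst of reissuance inside a container loop hits the hourly tier first, so the wait reported is minutes rather than days.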
I feel like the frequency of people hitting rate limits because of ephemeral instances has been increasing. So, if my perception is right, I feel like this proposal is increasing in relevance with time.
I'm perhaps wandering off topic but I've long felt that ephemeral instances (and many more permanent internal systems) shouldn't really be acquiring certs from the CA directly (especially if DNS validation is involved) and instead should be going via centralised cert management so that fancy validation, stored credentials and issuance controls can be centrally managed - there are of course several systems that do this already but I've no idea what the uptake is.
I'm currently adding a centralised service for Certify The Web (currently on docker/linux or windows) so that authorised app/service instances can pull their latest cert via an API and the cert service takes care of keeping them fresh. It's certainly not a new idea but I think as a strategy it could benefit from increased usability and it removes the issue of individual instances struggling to maintain their certs.
I've mentioned another idea a few times that I've had kicking around, that there should be an HTTP header to warn about "APIs you shouldn't use directly in an ephemeral instance" (I was calling it Was-Expensive, although I haven't written up a spec). In that case if ephemeral instances could set some kind of environment variable to indicate that they are ephemeral, their HTTP libraries could maybe start generating warnings about this... or something?
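A rough sketch of what the client side of that could look like, purely as an illustration. `Was-Expensive` is the header name floated above and has no spec. The `EPHEMERAL_INSTANCE` environment variable and the `warn_if_expensive` helper are made-up names for this example.

```python
import os
import warnings

def warn_if_expensive(headers, environ=None):
    """Warn when an ephemeral instance calls an API flagged as expensive.

    headers: mapping of response header names to values.
    Returns True if a warning was issued.
    """
    environ = os.environ if environ is None else environ
    # Hypothetical convention: the platform sets EPHEMERAL_INSTANCE=1.
    ephemeral = environ.get("EPHEMERAL_INSTANCE") == "1"
    # HTTP header names are case-insensitive.
    flagged = any(name.lower() == "was-expensive" for name in headers)
    if ephemeral and flagged:
        warnings.warn(
            "This API is expensive to call; cache its result in persistent "
            "storage instead of calling it from an ephemeral instance.",
            RuntimeWarning,
            stacklevel=2,
        )
        return True
    return False
```

An HTTP library could call something like this after each response, so the warning shows up in logs long before a rate limit does.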
Cool, that's great!
There are some older pre-ACME protocols that I think are oriented around this kind of use case.
I wonder if any of them would be useful for this today, or if it makes more sense for most of these users to have an ACME proxy, or just a sort of trivial download from a known location.
Thanks! In the first instance I'm going for trivial download from a known API using pre-shared app/service specific credentials/tokens (likely with an option for mutual tls if API requests could happen over the public internet).
I thought about an ACME proxy and actually did build a working prototype about a year ago but I couldn't see a way around who should control the private key (unless it's pre-shared again). ACME doesn't pass private keys around but acquiring the original cert needs it for the CSR etc.
Pulling latest via an API is super simple and fast enough to achieve during app/service startup if the cert has been pre-prepared. Clients can pull from the API or a secrets store/vault that the central service has already published to. Some CTW users already publish their certs to Azure Key Vault or Hashicorp Vault etc via Deployment Tasks, so we'll extend that as well because that's generally very easy to do and a pretty good separation of concerns.
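For illustration, a startup-time pull could be as small as the sketch below. This is not the Certify The Web API: the URL shape, the bearer-token header, and the function names `http_fetcher` and `install_latest_cert` are all assumptions made for the example.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

def http_fetcher(url, token):
    """Build a fetch function for a hypothetical cert endpoint."""
    def fetch():
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read().decode("ascii")
    return fetch

def install_latest_cert(fetch, dest):
    """Fetch a PEM bundle and write it to dest atomically."""
    pem = fetch()
    if "BEGIN CERTIFICATE" not in pem:
        raise ValueError("response does not look like a PEM certificate")
    dest = Path(dest)
    # Write to a temp file in the same directory, then rename, so the
    # server never reads a half-written file.
    fd, tmp = tempfile.mkstemp(dir=dest.parent)
    with os.fdopen(fd, "w") as f:
        f.write(pem)
    os.replace(tmp, dest)
    return dest
```

At startup an instance would call something like `install_latest_cert(http_fetcher(url, token), "/etc/ssl/app/fullchain.pem")` before binding its TLS listener. Because the fetch function is passed in, the same code can read from a secrets store instead of the API.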
I open-sourced our API-driven ACME client/manager a while back -- Peter SSLers. Our own use case is to support an unknown number of domains, running on an unknown number of servers, in an unknown number of locations. To accomplish that, I wrote a tiered caching system for OpenResty (nginx variant) that loads certs during the SSL handshake from worker-mem, shared-mem, redis, and finally an API server.
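The tiered lookup described above can be sketched roughly as follows. This is not the actual OpenResty/Lua code: plain dicts stand in for worker memory, shared memory, and redis, and backfilling the faster tiers on a hit is an assumption about how such a cache would usually behave.

```python
class TieredCertCache:
    """Look up a cert in progressively slower tiers, then an API."""

    def __init__(self, tiers, fetch_from_api):
        self.tiers = tiers                    # dict-like stores, fastest first
        self.fetch_from_api = fetch_from_api  # callable(domain) -> cert or None

    def get(self, domain):
        for i, tier in enumerate(self.tiers):
            cert = tier.get(domain)
            if cert is not None:
                # Copy the hit into every faster tier for the next handshake.
                for faster in self.tiers[:i]:
                    faster[domain] = cert
                return cert
        # Every tier missed: fall back to the API server.
        cert = self.fetch_from_api(domain)
        if cert is not None:
            for tier in self.tiers:
                tier[domain] = cert
        return cert

worker_mem, shared_mem, redis_like = {}, {}, {"a.example": "PEM-A"}
cache = TieredCertCache([worker_mem, shared_mem, redis_like], lambda d: None)
cert = cache.get("a.example")  # found in the slowest tier, copied upward
```

The point of the ordering is that only the first handshake for a domain on a given server pays for the slow lookups.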
My gut reaction since day 1 has been that, while this is the right approach, the people who need these systems require quite a bit of customization - but don't have the resources for it. So they just default to "bad behaviors".
I've gotten a handful of private emails from companies wanting a specific enterprise feature built in, but they're never interested in contributing a PR or funding development of the feature. Based on some exchanges, their rationales generally come down to budget constraints ("we can't spend money on this!") and sprintable hour constraints ("we already have to allocate x hours for integration/management, we can't allocate y hours for development").
Thanks! We are definitely aimed at completely different use cases -- my project is aimed at simplifying "internet scale" deployments like PAAS, SAAS, Whitelabel Tools, etc and programmatic usage. The UX is an afterthought, for bugfixing. It is overkill for 99.9% of use cases, which is why I even use Certbot for our own certs.
Your project, in contrast, has amazing UX and is simple and enjoyable to use.
So, it looks to me like in Boulder you can only have one active rate limit of each kind.
For example, in
you could change the details of certificatesPerFQDNSet, but you couldn't add a second certificatesPerFQDNSet with different details.
So the easy way to implement @griffin's original idea would be to create new kinds of rate limits, like certificatesPerFQDNSetLarge, certificatesPerFQDNSetMedium, and certificatesPerFQDNSetSmall, or something, and update all of the code that refers to rate limit types in boulder/ratelimit and boulder/ra, so that all three of them can be checked. But this might be less elegant than making the rate limits allow some kind of multiplicity of a rate limit policy, which I don't think the current code can handle.
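To show what "multiplicity" might mean at the data-structure level, here is a small sketch. It is not Boulder's code (Boulder is written in Go), and `Policy`, `normalize`, and `exceeded` are hypothetical names. The idea is only that one rate limit name maps to a list of policies instead of exactly one.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class Policy:
    threshold: int
    window: timedelta

def normalize(config):
    """Accept either a single Policy or a list of them per limit name."""
    return {
        name: value if isinstance(value, list) else [value]
        for name, value in config.items()
    }

def exceeded(policies, count_in_window):
    """Return the policies whose threshold has already been reached.

    count_in_window: callable(window) -> issuances seen in that window.
    """
    return [p for p in policies if count_in_window(p.window) >= p.threshold]

config = normalize({
    "certificatesPerFQDNSet": [
        Policy(3, timedelta(hours=1)),
        Policy(5, timedelta(weeks=1)),
    ],
})
```

Normalizing a single policy into a one-element list would keep existing single-policy configurations working while the checking code loops over every policy for a name.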
I have to say I'm kind of curious whether, without changing any code and just changing parameters, completely replacing the 5-per-week limit with a 1-per-hour limit might still have a net effect of reducing duplicates and load on Let's Encrypt's servers.
Might be fun to experiment with, but of course that's easy for me to say when I'm not the one running the servers.
Of course, something like 2-per-4-hours or even 3-per-day might work better. I just don't know if the 5-per-week is based on some actual evidence from early in Let's Encrypt history, or experience with other CAs, or based on known limits in their signing capacity, or if it was just a wild guess based on what they thought would help conserve their resources best. That's why I suggested it might be fun to experiment with, though I would certainly understand hesitation to do so in production. But even if a second level of rate limit were added, it's not clear to me what the right level would be to set it at beyond an intuition of "somewhere around 1 or 2 in the span of an hour or two".
Makes sense. Based on @schoen's analysis above, it looks like there might be some level of effort involved with implementing an hourly limit. I still believe it would probably be worth it though from a load-reduction standpoint by slowing down the less informed and thwarting bad practices with ephemeral instances.
Edit: I mean a dual limit (one to three per hour and five per week).
I figured there might be some, but are there really a lot that request continually, rather than just twice a day when they try to renew or whatever? Yikes. I guess we do need more rate limits then, rather than just tweaking the one we have. (Though even with a "short" limit in place, it might be worth looking at other options for what the reset of the "long" one should be, if 5-per-week was just pulled from a hat. And maybe even make some more limits to help stop the request-a-new-cert-every-chance-they-get clients.)