Why are the limits so low and the bans so long?

I’m confused as to why the limits are so strict. I just did a bunch of testing converting infrastructure over to use kube-lego (https://github.com/jetstack/kube-lego) using the staging environment. When I switched to prod, I believe I actually hit a bug relating to switching environments, and ended up hitting the limit immediately.

My main question is: why am I banned for a week? If it’s about server load, I would think banning me even for a few minutes would be enough to avoid hammering the server. If it’s about something else, can you explain it so I can feel less bad over the coming week?

My second question would be, why 20 per week? If the limits were removed, what would the bottleneck be? Related to the first question, why would the limit not be in terms of requests per second or minute? I’m pretty sure my thinking in terms of reqs/sec is misguided but I don’t know why.

The costly operation here is signing the cert and creating OCSP responses for as long as that cert is valid. These operations need access to the hardware module that holds the CA key, from what I understand.


@TCM’s reply mirrors my understanding of the reasoning behind those rate limits.

There’s been some discussion about switching to a different rate-limiting algorithm (like token bucket) in the future to avoid these long bans. The signing capacity per domain wouldn’t change significantly over a long period of time, but you’d hit the limits sooner and they’d reset sooner, which seems like a good thing for broken automation. However, no concrete plans for that have been made yet (and it might never happen).
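To illustrate why a token bucket would behave differently from the current week-long window: tokens refill continuously, so a burst of broken automation exhausts the bucket quickly but recovers after a short wait rather than a week. This is just a generic sketch of the algorithm, not Let’s Encrypt’s actual implementation; the capacity and refill rate here are arbitrary.

```python
import time

class TokenBucket:
    """Generic token bucket: allows bursts up to `capacity`,
    refilled continuously at `rate` tokens per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.rate = rate                # tokens added per second
        self.clock = clock              # injectable for testing
        self.tokens = float(capacity)
        self.last = self.clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The key property for the scenario in this thread: once the bucket empties, the next request is allowed after `1 / rate` seconds, not after the whole window expires.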

There are a number of resources we need to manage:

  • HSM capacity to sign OCSP responses. This was a big factor in our initial capacity planning but, so far, is not turning out to be the major bottleneck we expected it to be.
  • Database capacity, in terms of overall storage and CPU cycles. This is turning out to be one of our big capacity gates, and we’re actively working to improve efficiency here. But we do expect this will always be a limiting factor.
  • CT log capacity. We log all of our certificates to public CT logs, operated by third parties. Some of these logs have already started to show signs of trouble keeping up with the volume, so we try to keep issuance volume to just what is needed and not too much more.

It sounds like you were doing everything right, testing on staging before switching to prod. I’m sorry you hit the limit so fast. As @pfg says, we’re considering alternate schemes, but they take time and effort that we need to prioritize against other engineering work.

It’s also worth mentioning that the most common type of bug is repeatedly issuing for the exact same set of names (the Duplicate Certificate limit of 5 per week). If you’ve hit that limit, you can continue to issue up to your limit of 20 certificates per registered domain per week, if you add another hostname.
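To make the distinction concrete, a Duplicate Certificate check counts issuances within a sliding week, keyed by the exact *set* of names (so reordering the hostnames doesn’t help, but adding one does). This is a hypothetical sketch of that behavior with the limits mentioned above, not Boulder’s actual code:

```python
import time
from collections import defaultdict, deque

DUPLICATE_LIMIT = 5            # identical name sets per week (per the post above)
WINDOW = 7 * 24 * 3600         # one week, in seconds

class DuplicateCertLimiter:
    """Sliding-window count of issuances, keyed by the exact set of names."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock                  # injectable for testing
        self.history = defaultdict(deque)   # frozenset(names) -> timestamps

    def may_issue(self, names):
        key = frozenset(n.lower() for n in names)   # order-independent
        now = self.clock()
        times = self.history[key]
        while times and now - times[0] > WINDOW:
            times.popleft()                 # drop issuances older than a week
        if len(times) >= DUPLICATE_LIMIT:
            return False
        times.append(now)
        return True
```

Adding a hostname changes the key, which is why issuance can continue (up to the separate per-registered-domain limit) after the duplicate limit is hit.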

