API to help detect improper use *before* triggering rate limit

In my experience it's pretty easy to hit the rate limit if you're doing cloud (re-)deployments.

It is best, of course, to save the certificate generated for a cloud host to persistent storage and restore it when the host is wiped and recreated. But when that is not implemented, or not working correctly, the only way you find out is when you are suddenly locked out.

It would be nice if such a deployment could detect that an unexpected number of requests has already been made for a given domain. Tracking it locally is not an option, because the problem occurs exactly when local state is not being preserved correctly.

So I'm thinking of an API that returns information about progress toward the various rate limits for a given domain. We could then build a warning into deployments that fires when we have made more requests than expected, with a low default that can be raised for deployments where higher volume is expected.

This does not have to be perfect. The important things are:

  • Be stateless from the client's point of view
  • Detect when we are making more requests than expected for a domain
  • Run in the deployment itself. A cron job that polls crt.sh, or an email notification, is unlikely to reach the right person fast enough, and it makes it hard to tune warning limits per deployment, which is necessary to ensure the warnings are not ignored.
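As a sketch of what the deployment-side check could look like: assume a hypothetical endpoint that returns the current count, the limit, and the window for a domain (no such endpoint exists today; the response shape here is invented for illustration).

```python
# Hypothetical check against an imagined rate-limit-status endpoint.
# The response shape (domain, window_seconds, limit, count) is an
# assumption; nothing like this exists in Boulder today.

def check_rate_limit_status(status: dict, expected_max: int) -> list[str]:
    """Return warnings when the observed request count for a domain
    exceeds what this deployment expects, or nears the hard limit."""
    warnings = []
    count = status["count"]
    limit = status["limit"]
    if count > expected_max:
        warnings.append(
            f"{status['domain']}: {count} requests in the current "
            f"{status['window_seconds']}s window, expected at most "
            f"{expected_max} -- is persistent storage working?"
        )
    if count >= 0.8 * limit:
        warnings.append(
            f"{status['domain']}: {count}/{limit} of the rate limit "
            "already consumed in this window"
        )
    return warnings

# Example: a deployment that expects at most 2 issuances per window.
status = {"domain": "example.com", "window_seconds": 604800,
          "limit": 50, "count": 7}
print(check_rate_limit_status(status, expected_max=2))
```

The per-deployment `expected_max` is the tunable knob mentioned above: low by default, raised only where churn is expected.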

You could use crt.sh as an API (e.g. https://crt.sh/?q=example.com&output=json) inside the critical path of your deployment tool, but the problem there is the lag that affects any Certificate Transparency log aggregator. If somebody is constantly re-running `docker-compose up` without a volume mount on their laptop, it's not going to save them.
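For reference, counting recent issuances from a crt.sh JSON response might look like the sketch below. The fetch itself is left out; the inline sample mimics the shape of the `output=json` response, and the `not_before` field name is based on observed responses rather than any documented contract.

```python
import json
from datetime import datetime, timedelta, timezone

# Count certificates in a crt.sh-style JSON response whose notBefore
# falls inside the last `window`. The "not_before" field name matches
# what https://crt.sh/?q=example.com&output=json returns today, but
# that is an observation, not a spec.

def count_recent_issuances(entries, window=timedelta(days=7), now=None):
    now = now or datetime.now(timezone.utc)
    count = 0
    for entry in entries:
        not_before = datetime.fromisoformat(entry["not_before"])
        if not_before.tzinfo is None:
            not_before = not_before.replace(tzinfo=timezone.utc)
        if now - not_before <= window:
            count += 1
    return count

sample = json.loads("""[
  {"not_before": "2024-05-01T00:00:00"},
  {"not_before": "2024-05-06T12:00:00"},
  {"not_before": "2024-03-01T00:00:00"}
]""")
now = datetime(2024, 5, 7, tzinfo=timezone.utc)
print(count_recent_issuances(sample, now=now))
```

The lag caveat still applies, of course: a certificate issued seconds ago may not be visible in the aggregator yet, so this count can only ever be a lower bound.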

Let's Encrypt could add a non-standard extension to Boulder for this, I guess. The queries Boulder already performs seem well suited to the job: all the required information is accessible (the window size, the limit granted to your ACME account, and the actual count in the current window).

I feel like there's a problem, though: ACME clients and servers would somehow have to coordinate on a shared understanding of what the rate limits are. There is no interoperability between ACME implementations here, and not even a guarantee of interoperability between different releases of Boulder, because rate limits might get added, removed, or have their semantics slightly adjusted, as happened when the renewal exemption was added.

There's also the problem that one operation (e.g. a new order) is constrained by multiple rate limits with different windows (new orders: 3 hours; certificates per domain: 1 week), so I'm not sure how you would model that as an API query/response.
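Purely as a thought experiment, one way to model a multi-limit answer: the response lists every limit that applies to the operation, each with its own window, cap, and current count, and the client treats the most constrained one as the effective headroom. The names and shapes below are invented (the caps loosely mirror the limits mentioned above):

```python
# Invented response shape: each applicable limit carries its own
# window, cap, and current count. The effective headroom for an
# operation (e.g. a new order) is the minimum across all of them.

def effective_headroom(limits):
    """Return (remaining, limit_name) for the most constrained limit."""
    return min(
        ((lim["limit"] - lim["count"], lim["name"]) for lim in limits),
        key=lambda pair: pair[0],
    )

new_order_limits = [
    {"name": "newOrdersPerAccount", "window_seconds": 3 * 3600,
     "limit": 300, "count": 298},
    {"name": "certificatesPerDomain", "window_seconds": 7 * 86400,
     "limit": 50, "count": 7},
]
remaining, which = effective_headroom(new_order_limits)
print(f"{remaining} operations left before hitting {which}")
```

This sidesteps the "which window do you mean?" question by returning all of them and letting the client take the minimum, at the cost of the coordination problem described above: the set of limit names would change between Boulder releases.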

I've often thought about this problem (this was one experiment I created in tracking rate limits as a third party), but nothing has ever stood out to me as a particularly good solution.

