Rate limit issues with Kubernetes, Cert-Manager, and lots of domains to auth


#1

We use nginx-ingress and cert-manager in GKE, and are hitting rate limits on DNS authorizations during our first renewal cycle. We have around 900 ingresses, each with around 80 domains (40 root and 40 www via SNI), bringing us to around 72,000 domains that need DNS auth. During the initial certificate requests we worked with Let’s Encrypt to raise some of the rate limits on our account, and we were able to script the deployment of all 900 ingresses and the creation of the corresponding 900 certificates.

Now we are nearing the first renewal (I am seeing 200ish hours until expiry in the logs) and cert-manager appears to be trying to pre-authorize via DNS (we use Google Cloud DNS auth). We are again getting two errors: “429 urn:ietf:params:acme:error:rateLimited: Rate limit for ‘/acme’ reached” and “429 urn:acme:error:rateLimited: Error creating new authz :: too many currently pending authorizations: see https://letsencrypt.org/docs/rate-limits/”

Looking at our logs, cert-manager is trying to prepare 2-3 certificates a second, and all of them, or at least a large portion, are getting one of the two 429 errors.

My domain is:
www.technician.jobs (one of 72,000+ domains managed using this stack)

My web server is (include version):
nginx-ingress-controller:0.19.0
cert-manager-controller:v0.2.3

The operating system my web server runs on is (include version):
Kubernetes: 1.10.6-gke.2

My hosting provider, if applicable, is:
Google Cloud GKE

I can login to a root shell on my machine (yes or no, or I don’t know): yes


#2

Hey, thanks for creating the topic and the detailed explanation of what you’re seeing.

So first of all, it is worth noting that cert-manager has not been tested at quite this scale yet, and I can envision a number of issues that you may encounter when running so many domains.

As you’ve observed, cert-manager is trying to renew these domains at roughly the same time. This is a consequence of all of the certificates having been issued at around the same time. Currently, we trigger a renewal 30 days before the certificate expires, so if many certificates expire at a similar time, you’ll see many being renewed at a similar time.
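
To make the clustering concrete, here is a rough Python sketch (an illustration only, not cert-manager code). It assumes Let’s Encrypt’s 90-day certificate lifetime and the 30-day renewal trigger described above; the issuance start date and the 10-second spacing between certificates are made-up values.

```python
from datetime import datetime, timedelta

CERT_LIFETIME = timedelta(days=90)  # Let's Encrypt certificates are valid for 90 days
RENEW_BEFORE = timedelta(days=30)   # renewal trigger described in this post


def renewal_time(issued_at: datetime) -> datetime:
    """Time at which a renewal would be triggered for a certificate."""
    return issued_at + CERT_LIFETIME - RENEW_BEFORE


# Hypothetical: 900 certificates created by a script, 10 seconds apart.
start = datetime(2018, 9, 1, 12, 0, 0)
issued = [start + timedelta(seconds=10 * i) for i in range(900)]
renewals = [renewal_time(t) for t in issued]

# The renewals are exactly as tightly packed as the original issuance.
spread = max(renewals) - min(renewals)
print(spread)  # 2:29:50
```

With these assumed numbers, all 900 renewals (and their authorizations) would be attempted within about two and a half hours of each other.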

We’ve got a few things in the pipeline that will really help with this, and allow us to cap how many validations we run at once (as well as ‘intelligently’ scheduling renewals). You can see a proposal for this here: https://github.com/jetstack/cert-manager/pull/809, along with an initial implementation: https://github.com/jetstack/cert-manager/pull/788

However, this implementation does not currently ‘rate limit’ in a way that would be useful to you (this will be added before it is merged).

So I think right now, you will need to keep these raised limits in place if you want to be able to reliably renew your certificates, as they will all be attempted at approximately the same time.

The rate limit on the /acme endpoint suggests that cert-manager is simply hitting the LE API too hard with directory lookups etc. Fixing this requires changes to how we handle domains.

The “too many currently pending authorizations” error will also be resolved by the changes I mentioned above. In the meantime, you will need to either manually manage when these renewals happen (potentially by temporarily deleting the Certificate resources for the other domains), OR get your limits increased again, OR wait and hope that the required number of authorizations fits into the 30-day window you have for renewal.
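
If you go the manual route, one way to plan it is to spread the certificates evenly across the renewal window. The Python sketch below only computes a schedule; it is not a cert-manager feature, and the function name, the certificate names and the 5-day safety margin are all hypothetical. Actually deleting and recreating the Certificate resources at those times is left to your own tooling.

```python
from datetime import timedelta


def stagger_schedule(names, window, safety_margin=timedelta(days=5)):
    """Assign each certificate an evenly spaced offset into the renewal window.

    The safety margin (an assumed value) keeps the last renewals away from
    the actual expiry date so that failed attempts can still be retried.
    """
    if not names:
        return {}
    usable = window - safety_margin
    step = usable / len(names)
    return {name: step * i for i, name in enumerate(names)}


# Hypothetical names for the 900 certificates mentioned in the first post.
names = [f"cert-{i:03d}" for i in range(900)]
schedule = stagger_schedule(names, timedelta(days=30))

step = schedule["cert-001"] - schedule["cert-000"]
print(step)  # 0:40:00
```

With 900 certificates and 25 usable days, that works out to one certificate every 40 minutes; at around 80 domains per certificate, that is roughly 120 authorizations per hour rather than thousands at once.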

As an aside, and as an immediately actionable point, you should update your cert-manager version to v0.5.0. v0.2.3 is now quite old, and there have been a number of critical fixes made to cert-manager since. Specifically, from v0.3.0 onwards we now exclusively use the Let’s Encrypt v2 API. This will require you to update the endpoint on your Issuer resource accordingly (please see our upgrading notes in the docs for more details).
From what I recall of the v1->v2 Let’s Encrypt transition, this will invalidate all previously held authorizations (although your existing certificates will obviously continue to be valid).

It’d be great if we can work more closely with you to try and make sure that our future designs and roadmap accommodate your requirements and usage. You are clearly at the extreme end of usage for cert-manager, and I’d really like to ensure we can support you! You can reach me by email on ‘james AT jetstack.io’.


#3

Thanks so much, James! Excited to see the fixes in future cert-manager versions as well.

