I am building a product where we rent easy-to-use cloud VMs to customers. Sovereignty is a core feature, so each VM is secured by its own certificate. The certificates are requested by a Traefik reverse proxy on each VM and issued by Let's Encrypt. Each VM has a unique domain that looks like this: <random-string>.p.getportal.org. So for each new customer that is onboarded, a new certificate is issued.
We are still at an early stage, but at particularly busy times we hit the rate limit for new certificates. The error looks like this: acme: error: 429 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:rateLimited :: Error creating new order :: too many certificates already issued for "getportal.org". Retry after 2023-01-03T17:00:00Z
I requested a rate limit increase using this form but it was denied and I am not sure why. According to the email, reasons may include:
providing incomplete information and leaving out the account ID or requested override type/value, or requesting an override for a rate limit that does not allow overrides, or having a client request pattern that indicates a rate limit override is not the solution.
I did not provide an account ID because I need an override per domain. Might that be the problem? What else is needed to get the adjustment approved?
I don't know why LetsEncrypt denied this and someone else is likely able to give better guidance to get this approved. Without having seen your application, I would not be surprised if "having a client request pattern that indicates a rate limit override is not the solution" is involved. IIRC, this can happen for projects that are still at a volume where pre-registering domains is a better pattern.
Something that pops out to me about this model is the security risk when it comes to several elements of web browsing (see https://wiki.mozilla.org/Public_Suffix_List/Uses). IMHO you should file a request to have the domain added to the Public Suffix List to ensure browser security sandboxing between customers and your organization (see Guidelines · publicsuffix/list Wiki · GitHub), especially for cookies and Safe Browsing configuration. When filing that ticket, make sure to note that this is to ensure browser security sandboxing and not to get around rate limits.
While the PSL would exempt you from the rate-limits, it is not a workaround to your problem. The process can take several months to complete onboarding when you're not a TLD, and then you must wait for LetsEncrypt to sync to those changes.
Your immediate fix is to have a proper/compliant ratelimit request with LetsEncrypt.
I wasn't aware of the security sandboxing aspect, thank you for bringing this to my attention! I will definitely request to have p.getportal.org added to the PSL. When successful, this should solve the problem with rate limiting.
In the meantime: what is this pre-registering domains pattern? I can't find more information on it. Can I use this as an interim solution while I wait for the ratelimit request to go through?
Of course, I would still be grateful for hints on how to file the ratelimit request properly.
Write a script to pre-generate 1000 FQDNs that match <random-string>.p.getportal.org to a database table.
Whenever your rate-limit is clear, have another script request as many certificates as possible to match these domains. This can run once a day or more often. The certificate-per-domain limit is based on a rolling time period and only applies to new certificates -- not renewals.
When you onboard clients, instead of requesting each certificate on-demand, you just assign one of the pre-registered domains to the client, marking it as used (or deleting it), and then add a new unique domain to the database table so it is queued for pre-generation.
This method does not bypass the ratelimit, but lets you maximize the number of unique certificates you can get within the ratelimit.
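The steps above can be sketched roughly as follows. This is only a minimal illustration, assuming a SQLite table as the "database table"; the table layout, the state names, and the `issue` callback (which would wrap whatever ACME client you use and return False once the CA reports a rate limit) are all hypothetical.

```python
import secrets
import sqlite3

BASE_DOMAIN = "p.getportal.org"  # suffix used in this thread


def open_pool(path=":memory:"):
    """Open (or create) the pool table. States: pending -> issued -> assigned."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pool ("
        " fqdn TEXT PRIMARY KEY,"
        " state TEXT NOT NULL DEFAULT 'pending')"
    )
    return db


def add_pending(db, count):
    """Queue `count` fresh random FQDNs for later issuance."""
    for _ in range(count):
        fqdn = f"{secrets.token_hex(8)}.{BASE_DOMAIN}"
        db.execute("INSERT OR IGNORE INTO pool (fqdn) VALUES (?)", (fqdn,))
    db.commit()


def issue_pending(db, issue, budget):
    """Try to issue up to `budget` pending names; stop at the first failure.

    `issue(fqdn)` is a hypothetical callback that returns True on success
    and False when the CA refuses (for example with a rateLimited error).
    """
    rows = db.execute(
        "SELECT fqdn FROM pool WHERE state = 'pending' LIMIT ?", (budget,)
    ).fetchall()
    issued = 0
    for (fqdn,) in rows:
        if not issue(fqdn):
            break  # rate limit hit; try again on the next run
        db.execute("UPDATE pool SET state = 'issued' WHERE fqdn = ?", (fqdn,))
        issued += 1
    db.commit()
    return issued


def claim(db):
    """Assign one pre-issued FQDN to a new customer and queue a replacement."""
    row = db.execute(
        "SELECT fqdn FROM pool WHERE state = 'issued' LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # pool is empty; fall back to on-demand issuance
    db.execute("UPDATE pool SET state = 'assigned' WHERE fqdn = ?", (row[0],))
    db.commit()
    add_pending(db, 1)
    return row[0]
```

A scheduled job would call `issue_pending` once a day or more often, and the onboarding code would call `claim`. Where the private keys and certificates are stored is left out here on purpose.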
The majority of users who are in your situation (unique domains for customers on an early-stage platform) are not hitting the ratelimits because of actual overall usage, but because of timing pileups due to marketing or media coverage. This solution bootstraps them. Usually people start doing this a few months before launch. Some users have bought multiple registered domains to generate larger amounts.
If your design allows you to decrypt SSL on the gateway, you can write a service that uses a *.p.getportal.org wildcard as the fallback for domains that do not have a certificate, or that automatically loads the correct certificate. This is relatively easy to do on OpenResty (an Nginx-based distribution), which supports Lua scripting during the TLS handshake. I open sourced our tool that does this a while back (Nginx/Lua Plugin, Python Certificate Manager).
If your design doesn't allow you to decrypt SSL on the gateway, you'd need to ensure VM users can not access the fallback certificate's private key.
In addition to some of the things mentioned above, you may also want to try using multiple CAs. Now that there are several that use the ACME protocol, you may be able to split your workload among them. This can also give you some breathing room in case one of the CAs is unable to issue certificates for a while, whether due to their rate limits or an incident that requires them to halt issuance. Kind of like the pre-registering, you may want to make sure that bringing new customers onto your system isn't dependent on the uptime of some external entity (like a CA) that you don't control.
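As a rough sketch of what splitting the workload could look like: try the CAs in order and move on to the next one when the current one reports a rate limit. The `CA` record, the `RateLimited` exception, and the `issue` callables are hypothetical placeholders for whatever ACME client you use; only the Let's Encrypt directory URL in the usage note is a real endpoint.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


class RateLimited(Exception):
    """Hypothetical: raised by an issuer when the CA returns a rateLimited error."""


@dataclass
class CA:
    name: str
    directory_url: str           # ACME directory endpoint of the CA
    issue: Callable[[str], str]  # hypothetical: FQDN in, PEM certificate out


def issue_with_fallback(fqdn: str, cas: List[CA]) -> Optional[Tuple[str, str]]:
    """Return (ca_name, certificate) from the first CA that is not rate limited."""
    for ca in cas:
        try:
            return ca.name, ca.issue(fqdn)
        except RateLimited:
            continue  # this CA is saturated for now; try the next one
    return None  # every configured CA refused
```

For example, the list could start with `CA("letsencrypt", "https://acme-v02.api.letsencrypt.org/directory", ...)` followed by the directory URL of a second ACME CA taken from that CA's own documentation. Other failures (outages, validation errors) are deliberately not caught in this sketch.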
That certainly is an interesting and useful method. I'll evaluate it further. One problem I see is that I spin up VMs only during onboarding, so I would have to pre-register domains, create certificates and then load them on the VMs after spin-up. It kind of violates the rule that private keys should not leave the machine on which they were generated, but perhaps that is ok. I'll look into this.
I do not have a gateway in front of all VMs and I am not comfortable with having the wildcard cert's private key on customer-owned VM. So it seems like that idea won't work unfortunately.
Great suggestions in this thread! I’d just like to add a note that although it’s best to keep your PSL submissions as specific as possible, for LE rate limits, we can only handle domains (eTLD+1) and /not/ subdomains. A rate limit exception will also apply to subdomains but cannot be added or limited to them.
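To illustrate how the registered domain (eTLD+1) depends on the PSL, here is a simplified sketch. It only handles plain suffix rules and ignores the wildcard and exception rules of the real Public Suffix List; the function name and the tiny suffix sets are made up for the example.

```python
def registered_domain(fqdn, public_suffixes):
    """Return the eTLD+1 of `fqdn` for a set of plain public suffixes.

    Simplification: real PSL matching also has wildcard (*.) and
    exception (!) rules, which are not handled here.
    """
    labels = fqdn.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in public_suffixes:
            if i == 0:
                return None  # the name itself is a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched


# Today: only "org" is a public suffix, so every customer VM
# counts against the same registered domain.
print(registered_domain("abc.p.getportal.org", {"org"}))

# If p.getportal.org were on the PSL, each customer name would
# become its own registered domain.
print(registered_domain("abc.p.getportal.org", {"org", "p.getportal.org"}))
```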
Hmm. I would have expected there to be a good list somewhere, but I'm not sure where. Wikipedia lists a few, and Google does some with their CA. Let's Encrypt is all I've tried for my personal stuff. Other people here could probably rattle them off from the top of their head.
Ok, so maybe that is why my request was denied? Maybe I will have better luck requesting a ratelimit increase for getportal.org instead of p.getportal.org? That would work just fine for my purpose. I'll try that, thanks for the hint!
That is the ideal model from the resource use perspective. However, you could also request a new Certificate/PrivateKey in this model after deployment. That would be considered a renewal that counts against the weekly Duplicate Certificate limit.
That being said... I do see a potential issue with your design which you may have already thought of – you need to ensure that each VM generates its own AccountKey and that any exemption is based on the Domain, not the Account.
Who wrote that rule? It's a fairly old principle. These days, cloud storage and internal datastores are very common and even recommended for Certificate/Key combos. If you have admin access to those VMs, there is no substantial difference in security.
The only important/required rules regarding the PrivateKey are in the Subscriber TOS and CA/B Baseline Rules. As the platform/network operator, you would generally be considered a privileged custodian of that information.
I am not sure I understand? Right now, no VM is generating any AccountKey, why would they after switching to the pre-registering pattern?
Assuming that with exemption you mean the ratelimit increase, that surely should be based on the domain. At least that was my clear impression. I don't even have a Let's Encrypt account. Never thought I needed one.
As a side note: I thought about the specific steps for pre-registering domains. I am no expert with the protocols involved, but for each subdomain I want to pre-register I would have to 1) create one or more DNS entries, 2) attach some endpoint to the entries, e.g. another VM with the ACME client, 3) let the protocol run its course, 4) get all the material (private key, certificate) and save it somewhere, 5) (optional) remove the DNS entries/VM again.
Is there another method that is more straightforward? If I can prove that I own the domain, maybe proof of ownership for the subdomains is no longer needed?
Yes, that is true. I guess it was stuck in my head and I attached too much significance to it.
ACME Certificate orders are bound to an ACME (LetsEncrypt) account, which is identified by an "AccountKey". The Account and AccountKey are automatically generated by most ACME clients, including Certbot. Most Clients support multiple accounts, though some only support one. This usually all happens behind the scenes. Your domain currently has a LetsEncrypt Certificate, so there exists a LetsEncrypt account and someone authorized it, likely as part of a clickwrap agreement in the initial setup.
LetsEncrypt offers exemptions to RateLimits based on the Account or the Registered Domain.
If your VM users have access to the FileSystem, you must ensure that each VM (or user with multiple VMs) utilizes its own Account/AccountKey with the ACME CA(s), and that the AccountKeys are properly isolated, i.e. no user can access another user's AccountKey, and no user can access your organization's AccountKey(s). Why do the Terms of Service require this? Possession of an AccountKey can be used to revoke any certificate(s) issued to that account.
If your plan is to just have a fresh ACME client installed that configures itself from scratch, there is no issue. Some users think they should install an organizational AccountKey on client devices/VMs to help manage any issues, but that is both a security risk and a TOS violation.
As a platform/operator, the ISRG TOS and CA/B Baseline Rules allow you to be a privileged custodian of, or have access to, your customers' keys, so you are allowed to access the keys on their machines without triggering the requirement to report a Key Compromise. You can not, however, let your customers access your organization's keys.
HTTP-01 Challenge - Route DNS for all "unused" domains to a single IP/VM that handles the challenges.
DNS-01 Challenge - Possibly easier. Prove ownership of all required domains simply with DNS entries. You can use the acme-dns system for that. A few years ago they merged a PR I submitted which allows you to pre-register accounts/domains into their stock system with a simple script. I can share that if wanted.
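For reference, this is roughly what the DNS-01 proof boils down to under RFC 8555: the client publishes a TXT record at `_acme-challenge.<domain>` whose value is the unpadded base64url SHA-256 digest of the key authorization (challenge token, a dot, and the account key thumbprint). An ACME client normally does this for you; the helper name and the sample token/thumbprint values below are placeholders for illustration only.

```python
import base64
import hashlib


def dns01_record(fqdn, token, account_thumbprint):
    """Return (record_name, txt_value) for an ACME DNS-01 challenge.

    `token` comes from the CA's challenge object and `account_thumbprint`
    is the base64url JWK thumbprint of the ACME account key.
    """
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url without trailing "=" padding, as ACME requires
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"_acme-challenge.{fqdn}", value


# Placeholder inputs; real values come from the CA and your account key.
name, value = dns01_record("abc.p.getportal.org", "example-token", "example-thumbprint")
print(name, "TXT", value)
```

Because only a DNS record is needed, no VM has to exist for the name at the time the certificate is issued.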
It might be a good idea to specify that on the rate limit form. Even though Rate Limits - Let's Encrypt specifies the definition of a "Registered Domain" and the form mentions "Override Request for Certificates per Registered Domain", it doesn't literally say subdomains are not allowed. Also, the rate limit documentation page states that the PSL is used "to calculate the registered domain". So one might argue that if foo.example.com is on the PSL, it should be possible to get a rate limit exemption, even if it's a subdomain of example.com.
Also, in the form the sentence "If you specify both, your request will be processed without a response." does not specify the outcome of the "processing" mentioned. I assume it's "denied", but it's not literally stated.