Failed Challenges Rate Limit/Prevention - Hosting Provider


#1

We’re a hosting provider with several thousand sites, and up until this point we’ve created separate Let’s Encrypt accounts for each of our customers. We recently updated to a single account as a hosting provider, but ran into an issue with rate limits on failed challenges early this morning during a migration.

Early this morning I was switching from AWS to GCP. It involved a name server change, which I believe caused a percentage of DNS challenges to fail. Most would retry and go through.

However, we don’t save/reuse challenges, so I believe the number of unsolved challenges has grown in our account to the point where we hit a rate limit wall. I looked at the rate limit increase form for hosting providers, and it mentions allowing up to 300 unsolved challenges if you operate over 250,000 FQDNs.

I believe this is the limit we hit based on what I saw in the logs:

We recently (April 2017) introduced a Failed Validation limit of 5 failures per account, per hostname, per hour. This limit will be higher on staging so you can use staging to debug connectivity problems.

But at some point, all certificate issuance stopped. It sounds like the above rate limit is scoped per hostname, so I’m not sure why no new certificates could be issued at all.

I’m curious about:

  1. Is there a way to coordinate with the Let’s Encrypt team on the specific account and understand specifically what went wrong?

  2. Is saving and reusing challenges the way to prevent the number of pending challenges from ballooning? Maybe a basic question, but it didn’t occur to me when going through the updated integration.

  3. Do those unsolved challenges ever expire? I.e., let’s say we accidentally send over a wrong domain (typo/bug) that never gets resolved. Some of our customers have domains with registrars that aren’t pointed to us, which causes the DNS to fail (the customer may end up never pointing to us, leaving us with an unresolvable challenge). These are rare; I just want to understand the best practices.

  4. How can I best resolve the current situation to get out from under these rate limits quickly? It may depend on #1 and #2, but not sure how to finish this migration that is currently stalled or prevent this from happening in the future.

Thanks!

Cristian


#2

@jsha @cpu, could you please take a look at this question?


#3

Hi Cristian,

Thanks for writing. Can you tell us what software you’re using to issue, and some example failed domains? Ideally your software should surface error details so you can tell us what kind of limits you are hitting.

If you’re hitting the failed validation limit, that is specific to a given hostname. So failing validation 5 times in an hour for example.com won’t prevent you from completing validation for other.example.net.

One common thing we see with hosting providers is that their database of hosts continually falls out of sync with what’s in DNS. For instance, as I’m sure you’re aware, a customer’s domain can expire, or they can point it at a different hosting provider. That means that issuance for that domain will continually fail until you remove or suspend the customer’s domain in your database. This will look like a high error rate from the Let’s Encrypt API. My recommendation, if that’s the case, is to implement backoff for failed validations, use Let’s Encrypt validation failures to trigger any internal re-validation systems you have, and possibly suspend inactive domains.

It’s also possible you’re hitting the “pending authorizations” rate limit, which usually means your client is creating authorizations, then failing to “complete” them by POSTing to the authorization URL. This is usually a client-specific problem, but if we know what client you’re using we might be able to give advice.


#4

Hi Jacob,

We are using the xenolf/lego library. We previously had some domains pointing to us, but before this migration we required all of our clients to transfer those to our registrar so we’re in control of the nameservers. I was asking about the other scenarios to gain a better understanding of those pending challenges.

Lego has a default timeout of 60 seconds, and I’m seeing NXDOMAIN errors. Looking through the logs, I see we had identical authz URLs on subsequent DNS challenge retries, so it doesn’t seem like new challenges are generated each time. I think what may have happened is the following:

  • After queueing up domains in small increments, I queued up 1500 since things looked good. We have throttling in place so at most would have 20 concurrent requests (which I increased for this migration).
  • In enough cases the xenolf/lego timeout of 60 seconds occurred before Let’s Encrypt could verify all of the challenges, so the domain(s) go back on the message queue to be retried in ~5-10 minutes.
  • While those domains are being retried, new domains started the auth process and while some may have made it through I’m assuming most encountered the same cycle until we accumulated 300 pending authorizations.

This is the error I saw after stepping away for a brief nap:
acme: Error 429 - urn:acme:error:rateLimited - Error creating new authz :: too many currently pending authorizations

This is one of the domains that hit the rate limit: thediscounters-place.com

If that sounds plausible to you, I think it makes sense to check the number of domains in the authorization phase and ensure that we don’t start the “obtain certificate” flow unless that number is below, say, 250.
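For what it’s worth, that kind of gate can be sketched in Go roughly like this. The type and the 250 margin are illustrative assumptions on my part, not part of any client library:

```go
package main

import (
	"fmt"
	"sync"
)

// pendingGate tracks how many authorizations we believe are currently
// pending and refuses to start new certificate flows near the limit.
// max is kept below Let's Encrypt's 300 pending-authorization limit to
// leave headroom for in-flight requests.
type pendingGate struct {
	mu      sync.Mutex
	pending int
	max     int
}

// tryAcquire reports whether a new "obtain certificate" flow may start,
// and reserves a slot if so.
func (g *pendingGate) tryAcquire() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.pending >= g.max {
		return false
	}
	g.pending++
	return true
}

// release is called once an authz leaves the pending state
// (valid, invalid, or deactivated).
func (g *pendingGate) release() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.pending--
}

func main() {
	gate := &pendingGate{max: 250} // stay well under the 300 limit
	if gate.tryAcquire() {
		fmt.Println("ok to start obtain-certificate flow")
		gate.release()
	}
}
```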

Lastly, is there a way for you to provide the list of domains with pending authorizations on the account via email? If I can determine the 300 that are pending, I can run those through then proceed with the rest using the strategy above.

Thanks so much for the prompt and detailed replies!


#5

This might be part of the issue. Boulder (the Let’s Encrypt server) expects clients to either complete each authorization or deactivate it. If there are some code paths where a client can crash out without POSTing to one of an authorization’s challenges, that can lead to “leaking” authzs. We have had another hosting provider report problems with xenolf/lego doing just this.

If your client is consistently leaking authorizations, that check won’t help, since you’ll quickly hit 250 and be stuck. You need to have some way of ensuring that each authorization is completed or deactivated. This commit purports to help by logging authorization URLs and providing a method to deactivate failed authorizations. It’s not perfect; xenolf/lego should still be fixed so it doesn’t leak authorizations, even in error cases. But it may help with your specific situation.

Given that you are still in an outage situation, it probably makes sense to add some throttling, and then proceed with issuance using a new account, which won’t currently be rate limited under the pending authz limit. I don’t recommend this in general, especially since there’s a good chance you’ll hit the limit again, but it’s worth a try in this case.


#6

Can you post what version of xenolf/acme you are using?


#7

Also, are you using multi-domain certificates?


#8

Ah, that makes sense. I may need to increase the timeout then and look into deactivating the authorization if it doesn’t succeed. Still digging into your responses, but wanted to get you this info in the meantime:

xenolf/lego version: 5dfe609afb1ebe9da97c9846d97a55415e5a5ccd

We are using SAN certificates, but each certificate is limited to a single registered domain, not multiple domains. So [example.com, www.example.com] will appear on the SAN, but not [example.com, www.example.com, example.net].


#9

FYI, I also posted an issue on xenolf/lego to try and reproduce: https://github.com/xenolf/lego/issues/383.


#10

Could you provide some clarity on what is counted as a pending authorization, as well as some more color on when to revoke them?

I’m doing some testing with a deferred revocation and came across this error when revoking a failed authorization:
Error deactivating authorization :: only valid and pending authorizations can be deactivated

Does pending mean an authorization was requested but never accepted? Or is it any non-completed or failed/invalid authorization? Based on the error message, it seems I should only be deactivating authorizations that I never attempted to complete (pending), or after I have completed them. But in the issue that led to the rate limit, attempts were made; I just kept hitting the timeout during polling.

Also, is there any way to retrieve pending authorizations in order to monitor/resolve?

I may be missing an important detail somewhere just want to ensure I structure the client in a way to avoid accumulating pending authorizations.

Thanks in advance!


#11

Failed authorizations do not count against the limit of 300 pending authorizations per account. Those would not be causing any issues.

The authorization/challenge mechanism in ACME works roughly like this:

  1. Client requests an authorization for a domain
  2. Server provides a set of challenges for the client to complete, e.g. serving a specific response to an HTTP request under the /.well-known/acme-challenge path.
  3. Client makes sure it is able to complete the challenge (for example by writing the file to its web server’s DocumentRoot) and tells the server it is ready for the challenge verification.
  4. Server checks the submitted challenge and changes the authorization status to either “valid” or “invalid”.

A pending authorization is one where step 3 (and 4) never happened - in other words, the client never told the server to go ahead and check the challenge. You’re limited to 300 such authorizations per account.

Unfortunately, there’s no API endpoint for retrieving pending authorizations. The only way to keep track of authorizations is for the client to store them somewhere (e.g. in logs or a database).

As this rate-limit is account-specific, fixing the underlying issue causing the authorizations to “leak” (or ensuring they’re stored/logged, so that you can deactivate them if you run into rate limits) and then migrating to a new ACME account (key) would be a good workaround.


#12

Thanks, Patrick. This is very helpful. One final question:

In step 3, where I make a POST to the challenge URI to accept it (after the DNS records have been created): if that POST completes with no error, can I safely assume the authorization will not need to be deactivated? In other words, has ACME then triggered a validation that is guaranteed to change the authz status to either valid or invalid? I’m currently deactivating all authorizations unless that point in the flow is reached.

I will add a db store as you suggest.

Thanks again, this is very helpful.


#13

What is the timeout on those?


#14

That’s correct. Once the challenge has been submitted, the authz status, short of any server-side issue, is guaranteed to become “valid” or “invalid”, and won’t count against the pending authorizations limit anymore.

The default lifetime for pending authorizations is set to 7 days in the CA software. I haven’t looked at any pending authorizations lately, so it’s possible Let’s Encrypt is using a different value in production nowadays, but I don’t recall any announcements on that matter, so it’s probably still a week.


#15

Wow. That’s long. I expected it to be a few hours.

I understand why a DNS challenge might take several days, but for HTTP auth? wow.


#16

Assuming it hasn’t changed in the last two weeks, and the “expires” field in the authz response doesn’t mean something else, it’s still 7 days in both prod and staging.


#17

Quick update in case anyone in this situation finds this helpful:

  • We switched to golang.org/x/crypto/acme. It was quick to implement from what we had with github.com/xenolf/lego, but allowed for deactivating authorizations as well as writing custom logic to solve challenges that met our needs
  • We keep track of all authorizations and authorization states in the database (pending, valid, invalid, revoked/deactivated). We revoke all authorizations unless we have successfully notified Boulder that we have accepted the challenge
  • We check the number of pending, unexpired authorizations and ensure it’s less than a certain threshold before beginning the process of obtaining a certificate
  • We throttled our workers at a concurrency level that minimized 429s (from new authz requests)

I was able to queue up several thousand domains in our pipeline and get everything issued in just a few hours. Appreciate everyone’s help.

I am just running into one final issue: occasionally the DNS validation will fail, even though we are verifying the record on each of the authoritative name servers and then adding a 5-second delay before notifying Boulder that we are accepting the challenge. While not the majority of the time, in monitoring the logs I saw a noticeable number of cases where one of the two authorizations for a registered domain succeeded while the other failed (e.g. example.com fails while www.example.com succeeds).

@jsha @pfg Are there any specific recommendations or changes in process above I can implement to prevent DNS authorization failures?


#18

Not currently; it sounds like you are doing the right thing. You’re not the only person who has reported problems with DNS authorization failures even after checking with all authoritative nameservers, though. We’ve been meaning to look into ways to make the DNS challenge more reliable.

I assume the authoritative nameservers either are not anycast, or if they are anycast, you checked all the individual instances? Can you give some examples of FQDNs that have failed, and the errors you got?


#19

We are using Google Cloud DNS verifying each of the name servers used in the zone. Here are two recent failures:

www.skinharbor.com
www.green-premiumcoffee.com

The authz error for both is DNS problem: NXDOMAIN looking up TXT for _acme-challenge.www.green-premiumcoffee.com.

For skinharbor.com, the internal DNS validation verified www.skinharbor.com four seconds before skinharbor.com. The challenges were then submitted to Boulder four seconds apart (www.skinharbor.com first, which failed).

With green-premiumcoffee.com however, both were verified internally at almost the same time and submitted to Boulder for verification of the challenges milliseconds apart (Boulder submission times: 2017-04-24T10:24:39.069836175Z for www and 2017-04-24T10:24:39.269422722Z for the apex). The www validation failed but the validation for the apex somehow succeeded.


#20

How are you verifying that the changes have propagated to all of Google’s servers worldwide?

I don’t use Google Cloud DNS, but the changes API will say whether a change is “pending” or “done”, and I think “done” means done propagating.

https://cloud.google.com/dns/monitoring
https://cloud.google.com/dns/api/v1/changes