New Authz rate limit in Staging

We are currently rate limited in Staging, which is a problem for us as we run all of our authorization attempts first in staging to prevent hitting rate limits in production.

The rate limit we hit now is:

There were too many requests of a given type :: Error creating new authz :: too many currently pending authorizations

Looking here, I cannot tell which rate limit I’ve actually hit. How long will we have to wait before we can run authorizations again?

I’ve waited over an hour and there was no improvement. I’m afraid this will be a week-long rate limit, like it is in production. If so, that would be devastating for us.

Looks like this one:

You can have a maximum of 300 Pending Authorizations on your account. Hitting this rate limit is rare, and happens most often when developing ACME clients. It usually means that your client is creating authorizations and not fulfilling them. Please utilize our staging environment if you’re developing an ACME client.

The rate limits on staging are the same as on production, except for the ones listed on the page you linked to.

If your client is leaking authorizations it probably needs to be fixed anyway, but as for possible temporary workarounds... If you have sufficient logs of your authorization attempts you can clear out the pending authorizations directly, for example something like this script which is based on certbot's logs (though as written it points to production). Alternatively, the description seems to imply that this particular limit is account-specific so maybe you could circumvent it with a new account on staging?

(I don't know how much of the above is also true of ACMEv2, so if you're using that, maybe someone else can fill in the gaps.)

Oh noooo this is terrible.

We have hundreds of expired domains right this minute, and hundreds more over the next several days. We need that rate limit cleared!

If your client is leaking authorizations it probably needs to be fixed anyway

That's how we hit the rate-limit in the first place is trying to fix the cause, but doing so while trying to keep production certs afloat. I have a separate forum thread going where we're making progress figuring out the root cause of our "authz leak."

We have to do this while keeping our production certs afloat, which absolutely depend on us using staging... Is there anybody out there who could just clear out all of our pending authz for us?

PS - I do see the script you linked. I don't know Go, and don't know which of the thousands of domains caused this, so I'm skeptical that I'll do anything but spin my wheels trying to clear those out :fearful:

Hi @lancedolan,

I think we've been here before, back in April: "Multiple accounts to temporarily get around new authz limit?" (post #6). What was the cause of the leaked authorizations back then? How did you resolve the problem?

Between your history of posts, the current pending authz problem, and the malformed request problem you're troubleshooting in the other thread I think there is mounting evidence you need to revisit your architecture.

It sounds like your needs as a large integrator are perhaps not a great fit for Certbot. Have you considered using Certbot's Python ACME module directly instead of trying to script invocations of Certbot? There are also other ACME clients that can be used programmatically as a library.

I'm not sure I understand. Why are production certificate renewals affected by the staging rate limit?

Do you have access to your Certbot logs? If you do then you can likely identify pending authorization IDs and affected domains within the log retention period. If you don't have logs you should definitely fix this with a high priority!

Hey CPU!

I'm afraid you've gotten the wrong impression, and I'm afraid it will cause other folks to not take our problem seriously. Please let me clarify quickly.

I think we’ve been here before back in April ...

The link you provided is from April 2017, over a year ago, and it happened during development when we were building this system and working the bugs out.

How did you resolve the problem?

The solution was to finish development. We hit that rate limit while developing and testing, learning how your system works, which was reasonable and expected, and not evidence against our overall architecture. It has run very well for a year now.

Since then we've been fine. Last autumn we hit CAA issues after Let's Encrypt introduced some changes around CAA. Our original solution didn't account for CAA at all, I had never heard of CAA before and it was a learning experience. That was our only issue before this. There have been no other issues. That one issue during development and the CAA enhancement last autumn are what you're generalizing as a "history."

The current pending authz problem, and the malformed request problem you’re troubleshooting in the other thread I think there is mounting evidence....

What you're saying here I think is unfair: this problem is one and the same, not mounting evidence. The "malformed request problem" caused the newauthz rate limit. We're trying to solve the "malformed request problem." It's not fair to point to the 2 symptoms of our 1 current issue and cast them as multiple evidence that we're just unstable. I have the fear, though you haven't said it explicitly and so I could be wrong, that you're thinking we're not worth helping through this problem because you're seeing us as just generally unstable perhaps?

It sounds like your needs as a large integrator are perhaps not a great fit for Certbot...

... you need to revisit your architecture

Our solution has worked fine for a year. Also it's our eventual goal to re-write the solution entirely to use DNS auth and acquire wild-card certs, so any architectural changes or refactoring we'd do now would be sort of a waste.

A general criticism of our perceived stability isn't something we can improve on or take action on, at least not right now. Our problem right this minute is that, for the first time since launching, we have hundreds of expired customers calling in complaining, possibly losing customers, and a thousand more expiring in the next week before the newauthz limit dies. This is a terrible situation for us. What we need is to clear out the new authz limit immediately.

Do you have access to your Certbot logs?

Yes, of course.

If I understand correctly, there's a process I can follow to clear out some of these authz myself, which would look something like this:

  • Read through logs, figure out which logs are indicating an auth failure (not sure what to look for)
  • Gather up specific information from those logs (not sure which information), which I'm finding out from you now is the "pending authorization IDs"
  • Write or use a script to hit some endpoint that I'm not yet familiar with, which will clear the newauthz limit, and my only information about how to hit that endpoint has come in the form of a script provided kindly by jmorahan, in a language I'm unfortunately not familiar with at the moment.

I can do this process, and will begin on it when I'm done with this post, but I estimate it could take a day or more, and was really just hoping that there would either be a more explicit, simpler process or maybe even that since this is just staging, and we're seriously hurting right now, an LE admin could be kind enough to just clear our staging rate limit for us. We need to fix the malformed issue, and if we test our fix in production we risk getting rate limited there as well, which I believe occurs much sooner than staging.

Working on using the script provided by jmorahan…

clear-authz.go

I’ve never written Go, but I believe this is a command-line tool that would be used something like this:
go run main.go /path/to/certbot-key.json

Additionally I think it takes some parameters so it knows which authz to handle? Anyhow…

My Questions

  • What am I looking for in my logs in order to find a “leaked” auth?
  • Once found, what info do I put into the command, after the path to the certbot key?
  • The script ends with a message “Accepted challenge: %+v %+v\n” … What is the result of accepting a challenge - is that just how you clear it out so it’s no longer hanging?

I'm certainly not trying to dissuade anyone from helping you. I apologize if I was overly terse or too quick to suggest that you might benefit from an architectural reevaluation. I still believe that Certbot as you are using it is likely to cause you continued difficulties but I'm willing to table that suggestion if you disagree.

OK, I can understand hesitation to drastically rework the existing system if you already have an ACME v2 & DNS-01 approach in the works.

I'm still unclear why a pending authorization rate limit in the staging environment is affecting your production renewals.

I'm afraid there is no other procedure.

Yup, to be explicit you'll also need to follow the Go setup instructions, and then get the dependencies for that tool:

  1. go get golang.org/x/net/context
  2. go get golang.org/x/crypto/acme
  3. go get gopkg.in/square/go-jose.v2

A leaked authz is a response from the new-authz endpoint that wasn't later used in a POST request to a challenge.
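To make that definition concrete, here is a minimal sketch in Python. It assumes you have already pulled two lists out of your logs: the authz URLs returned by new-authz, and the challenge URLs you later POSTed to. The function name and the two input lists are hypothetical; the URL shapes are taken from the sample clear-authz output elsewhere in this thread, where the challenge URL carries the authz ID.

```python
import re

# Assumed ACMEv1 URL shapes, based on the sample output in this thread:
#   .../acme/authz/<authz-id>
#   .../acme/challenge/<authz-id>/<challenge-id>
AUTHZ_ID = re.compile(r"/acme/authz/([A-Za-z0-9_-]+)")
CHALLENGE_AUTHZ_ID = re.compile(r"/acme/challenge/([A-Za-z0-9_-]+)/")


def find_leaked_authzs(created_authz_urls, posted_challenge_urls):
    """Return the authz URLs that never had one of their challenges POSTed."""
    # Collect the authz IDs for which at least one challenge was attempted.
    answered = set()
    for url in posted_challenge_urls:
        match = CHALLENGE_AUTHZ_ID.search(url)
        if match:
            answered.add(match.group(1))

    # Anything created but never answered is a candidate leak.
    leaked = []
    for url in created_authz_urls:
        match = AUTHZ_ID.search(url)
        if match and match.group(1) not in answered:
            leaked.append(url)
    return leaked
```

This only flags candidates. Whether an authz is still pending is something the server decides, so the result is a list to check, not a definitive answer.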

Nothing goes after the key path. It looks like you run the tool and it reads authorization URLs from STDIN. You can provide those from a file by using normal shell procedure ("< authz_urls.txt").
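If you want to build a file like authz_urls.txt yourself, something along these lines could pull the URLs out of raw log text. This is a rough Python sketch, not part of the tool: the function name is made up, and the regular expression assumes the ACMEv1 authz URL shape seen in this thread (https://<server>/acme/authz/<id>).

```python
import re

# Assumed URL shape: https://<server>/acme/authz/<id>
AUTHZ_URL = re.compile(r"https://[A-Za-z0-9.-]+/acme/authz/[A-Za-z0-9_-]+")


def extract_authz_urls(log_text):
    """Return each distinct authz URL found in log_text, in first-seen order."""
    seen = set()
    urls = []
    for url in AUTHZ_URL.findall(log_text):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Writing the returned URLs one per line to a file gives you something you can redirect into the tool's STDIN.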

This tool was written by @_az - they might be able to provide more concrete guidance.

clear-authz.go eats production authzs by default. You don't have to know Go, but you do have to change "acme-v01" to "acme-staging" in the source code. (Unfortunately.)

If you're using the ACMEv2 staging endpoint... that could be a problem. But you're not, right?

You don't have to figure out which authzs are pending or not, the program can detect that on its own.

Yes. A pending authz is one that you haven't tried to validate yet. The way to make them go away is to try to validate them. It doesn't matter if it succeeds or fails, the key is to make it try.
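As a rough illustration of that selection step (not the tool's actual code), here is a Python sketch. It assumes each authz has already been fetched and parsed into a dict with a "status" field and a "challenges" list whose entries carry a "uri" field, which is my understanding of the ACMEv1 JSON. The function name and the input mapping are hypothetical.

```python
def challenges_to_trigger(authzs):
    """Map each pending authz URL to the URI of its first challenge.

    `authzs` is a hypothetical mapping of {authz_url: parsed_authz_json}.
    Authzs that are already valid or invalid are skipped, because only
    pending ones count against the limit.
    """
    to_trigger = {}
    for url, authz in authzs.items():
        if authz.get("status") != "pending":
            continue
        challenges = authz.get("challenges", [])
        if challenges:
            # Any challenge will do: the goal is only to make the server
            # attempt validation, not to have it succeed.
            to_trigger[url] = challenges[0]["uri"]
    return to_trigger
```

POSTing a signed response to each returned challenge URI is what moves the authz out of the pending state. That part needs the account key and is left out here.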

Sweet baby jesus, that did it.

Thank you @cpu and @mnordhoff and @jmorahan

We were able to feed our log file through the script and release the challenges.

Next step: solve this issue that’s causing the leaks in the first place.

Next step much later after that: total rewrite :stuck_out_tongue:

Update: @_az has added several enhancements to the clear-authz tool (version v.0.0.2). It can now take the endpoint from an environment variable (defaulting to acme-v01), so there is no need to edit the source file and recompile it to use the staging endpoint. There is also no need to specify the path to the ACME account key unless you have several accounts for the same endpoint or keep your accounts somewhere other than /etc/letsencrypt/accounts/*; otherwise clear-authz will use the default.

Just for the record, since building clear-authz (developed by @_az) requires a few steps, here is a mini guide to compiling and using it on GNU/Linux:
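For anyone curious how that kind of defaulting works, here is a small Python sketch of the idea. It is an illustration only and not the tool's Go source. The variable name CLEAR_AUTHZ_SERVER and the key path layout come from the examples and output in this post; the function names, and treating an empty value as unset, are my own assumptions.

```python
import os

DEFAULT_SERVER = "acme-v01.api.letsencrypt.org"


def resolve_server(environ=None):
    """Return the ACME server host, falling back to the acme-v01 default."""
    environ = os.environ if environ is None else environ
    # Assumption: an empty value is treated the same as an unset variable.
    return environ.get("CLEAR_AUTHZ_SERVER") or DEFAULT_SERVER


def default_key_glob(server):
    """Glob pattern for Certbot account keys registered with `server`."""
    return f"/etc/letsencrypt/accounts/{server}/directory/*/private_key.json"
```

If that pattern matches more than one key for the same endpoint, the default is ambiguous, which is the case where you would pass the key path explicitly.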

To use clear-authz you have two options, compile the tool or use the binaries provided by @_az.

Option 1 - Compiling clear-authz

1.- Create work dir:

mkdir -p ~/projects/

2.- Download the latest Go version from https://golang.org/dl/ (at the time of writing this mini guide, it was 1.10.2)

cd ~/projects/
wget https://dl.google.com/go/go1.10.2.linux-amd64.tar.gz
tar -xzf go1.10.2.linux-amd64.tar.gz

3.- Export variables to define PATHS:

export PATH=~/projects/go/bin/:$PATH
export GOPATH=~/projects/

4.- Build clear-authz and copy the binary to /usr/local/bin/ :

go get -u github.com/alexzorin/clear-authz
sudo cp bin/clear-authz /usr/local/bin/

Note: copying the binary to another path is optional; just keep in mind that otherwise you will need to specify the relative or full path to the tool when you want to use it.

Option 2 - Download binary (as root user)

wget https://github.com/alexzorin/clear-authz/releases/download/v.0.0.2/clear-authz -O /usr/local/bin/clear-authz
chmod 750 /usr/local/bin/clear-authz

Now we can use clear-authz. It reads our client's logs, where the authzs are recorded, from standard input.

Examples:

1.- Using acme-v01 (default) as endpoint:

cat /var/log/letsencrypt/letsencrypt.log* | clear-authz

2.- Using a custom acme account key for acme-v01:

cat /var/log/letsencrypt/letsencrypt.log* | clear-authz /path/to/acme-v01/account/key

3.- Using staging as endpoint:

cat /var/log/letsencrypt/letsencrypt.log* | CLEAR_AUTHZ_SERVER=acme-staging.api.letsencrypt.org clear-authz

4.- Using a custom acme account key for staging:

cat /var/log/letsencrypt/letsencrypt.log* | CLEAR_AUTHZ_SERVER=acme-staging.api.letsencrypt.org clear-authz /path/to/staging/account/key

Output examples:

1.- One pending authz found and cleared:

# cat /var/log/letsencrypt/letsencrypt.log* | clear-authz
2018/05/09 11:16:04 Using /etc/letsencrypt/accounts/acme-v01.api.letsencrypt.org/directory/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/private_key.json for private key for acme-v01.api.letsencrypt.org
2018/05/09 11:16:04 Checking 1 authzs to see if they are pending ...
2018/05/09 11:16:05 Found pending authz at https://acme-v01.api.letsencrypt.org/acme/authz/Ad3-7bsCZ6Os-uxT-gjdAt-a989ailhCN_h1LrDiJbs, will accept first challenge
2018/05/09 11:16:05 Accepted challenge: &{Type:dns-01 URI:https://acme-v01.api.letsencrypt.org/acme/challenge/Ad3-7bsCZ6Os-uxT-gjdAt-a989ailhCN_h1LrDiJbs/4565806913 Token:a1_3x5plttT_I-eg3DLblr30qGXMDk4Zcfstq-yR5G4 Status:pending Error:<nil>} <nil>

2.- One authz found but it is not pending:

# cat /var/log/letsencrypt/letsencrypt.log* | clear-authz
2018/05/09 11:16:12 Using /etc/letsencrypt/accounts/acme-v01.api.letsencrypt.org/directory/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/private_key.json for private key for acme-v01.api.letsencrypt.org
2018/05/09 11:16:12 Checking 1 authzs to see if they are pending ...

3.- No authz found:

# cat /var/log/letsencrypt/letsencrypt.log* | clear-authz
2018/05/09 11:16:27 Using /etc/letsencrypt/accounts/acme-v01.api.letsencrypt.org/directory/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/private_key.json for private key for acme-v01.api.letsencrypt.org
2018/05/09 11:16:27 Checking 0 authzs to see if they are pending ...

Warning: this tool works only with ACME API v1, not v2.

Hope this helps.

Cheers,
sahsanu

I just want to swing in again to say thank you to everyone!

Looking for questions I failed to answer while in a panic earlier...

Why are production certificate renewals affected by the staging rate limit?

Our current (and quickly becoming legacy) procedure is to cert 100 domains at once. Despite our best efforts to programmatically vet each domain first, inevitably several still fail their challenge... We then parse the failed domain from the letsencrypt stderr and try again. So, often this means attempting with 100, then 99, then 98... Sometimes down to as few as 70 or 60 certs, after 30 or 40 retries. Out of sheer caution, I decided these retries should happen on staging first. Once we prove that a set of domains can pass challenges in staging, THEN we process them in production. I suspect this has allowed us to bypass rate limit issues in production that we otherwise would have hit, though I could be wrong.
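In case it helps anyone picture it, the retry loop looks roughly like the Python sketch below. This is a simplified illustration, not our real code: `attempt` stands in for running the client against staging and parsing the failed domain out of stderr, and both function names and the retry cap are made up for the example.

```python
def shrink_until_passing(domains, attempt, max_retries=40):
    """Retry a batch, dropping the failing domain after each failed attempt.

    `attempt(batch)` is a hypothetical callable that returns the domain
    that failed its challenge, or None if the whole batch passed.
    Returns (passing_batch, dropped_domains); passing_batch is empty if
    no batch passed within the retry budget.
    """
    batch = list(domains)
    dropped = []
    for _ in range(max_retries + 1):
        if not batch:
            break
        failed = attempt(batch)
        if failed is None:
            return batch, dropped
        batch.remove(failed)  # raises ValueError if the name is not in the batch
        dropped.append(failed)
    return [], dropped


def issue(domains, attempt_staging, attempt_production):
    """Prove the batch on staging first, then run it once in production."""
    passing, dropped = shrink_until_passing(domains, attempt_staging)
    if passing:
        attempt_production(passing)
    return passing, dropped
```

Every failed pass here is another full run against staging, which is why the retries were kept off production in the first place.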

Our next solution is unlikely to have this "feature." I think we need to get along with what we've got for several more months, but with this clear-authz.go and some pausing between letsencrypt commands, I'm sure we'll be fine as we've been in the past. :+1:

By the way, a great resource for this (which I think will get even better over time) is the letsdebug code base from @_az. It can diagnose a number of things that are wrong with a domain configuration that could prevent it from getting a Let's Encrypt certificate.

In addition to the public instance at https://letsdebug.net/, you could incorporate the code into your own projects. I hope that I and others will be able to contribute more tests over time to detect more and more potential problems with domain configurations.

Yikes. I’ve updated the clear-authz repo with some instructions and binary releases, and parameterized the ACME server.

However, it’s of limited use these days since most people will have moved on to ACME v2.
