New Authz rate limit in Staging

Hey CPU!

I'm afraid you've gotten the wrong impression, and I worry it will cause other folks not to take our problem seriously. Please let me clarify quickly.

I think we’ve been here before back in April ...

The link you provided is from April 2017, over a year ago, and it happened during development when we were building this system and working the bugs out.

How did you resolve the problem?

The solution was to finish development. We hit that rate limit while developing and testing and learning how your system works, which was reasonable and expected, and is not evidence against our overall architecture. It has run very well for a year now.

Since then we've been fine. Last autumn we hit CAA issues after Let's Encrypt introduced some changes around CAA. Our original solution didn't account for CAA at all; I had never heard of CAA before, and it was a learning experience. That was our only issue before this one. There have been no others. That one issue during development and the CAA enhancement last autumn are what you're generalizing as a "history."

The current pending authz problem, and the malformed request problem you’re troubleshooting in the other thread I think there is mounting evidence....

I think what you're saying here is unfair: these are one and the same problem, not mounting evidence. The "malformed request problem" caused the newauthz rate limit, and we're trying to solve the "malformed request problem." It's not fair to point to the 2 symptoms of our 1 current issue and cast them as multiple pieces of evidence that we're just unstable. You haven't said it explicitly, so I could be wrong, but my fear is that you think we're not worth helping through this problem because you see us as generally unstable.

It sounds like your needs as a large integrator are perhaps not a great fit for Certbot...

... you need to revisit your architecture

Our solution has worked fine for a year. Also, our eventual goal is to rewrite the solution entirely to use DNS auth and acquire wildcard certs, so any architectural changes or refactoring we did now would be largely wasted effort.

A general criticism of our perceived stability isn't something we can improve on or take action on, at least not right now. Our problem right this minute is that, for the first time since launching, we have hundreds of customers with expired certificates calling in to complain, we are possibly losing customers, and a thousand more will expire in the next week before the newauthz limit lifts. This is a terrible situation for us. What we need is to clear out the newauthz limit immediately.

Do you have access to your Certbot logs?

Yes, of course.

If I understand correctly, there's a process I can follow to clear out some of these authz myself, which would look something like this:

  • Read through logs, figure out which logs are indicating an auth failure (not sure what to look for)
  • Gather up specific information from those logs (not sure which information), which I'm finding out from you now is the "pending authorization IDs"
  • Write or use a script to hit some endpoint that I'm not yet familiar with, which will clear the newauthz limit. My only information about how to hit that endpoint is a script kindly provided by jmorahan, in a language I'm unfortunately not familiar with at the moment.
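As a rough sketch of the first two steps only, here is how the log scan might look in Python. It assumes the pending authorizations show up in the logs as URLs containing "/acme/authz/", which I haven't confirmed; the pattern and the function name are my own placeholders. The third step isn't shown, since deactivating an authz requires a request signed with the account key.

```python
import re

# Assumption: authz URLs appear in the logs in the form
# https://<host>/acme/authz/<id>; adjust the pattern if the logs differ.
AUTHZ_URL = re.compile(r"https://[\w.-]+/acme/authz/[\w-]+")


def extract_authz_urls(log_text):
    """Return the unique authz URLs found in log_text, in first-seen order."""
    seen = {}
    for match in AUTHZ_URL.finditer(log_text):
        # A dict keeps insertion order, so duplicates are dropped
        # without losing the order in which URLs first appeared.
        seen.setdefault(match.group(0), None)
    return list(seen)
```

Feeding it the contents of each log file in turn and collecting the results would give the list of URLs to pass to whatever script does the clearing.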

I can do this process, and will begin on it when I'm done with this post, but I estimate it could take a day or more. I was really just hoping that there would be a more explicit, simpler process, or maybe even that, since this is just staging and we're seriously hurting right now, an LE admin could be kind enough to clear our staging rate limit for us. We need to fix the malformed request issue, and if we test our fix in production we risk getting rate limited there as well, where I believe the limit kicks in much sooner than in staging.