Failed Challenges Rate Limit/Prevention - Hosting Provider

Hi Matt,

We are querying each of the name servers directly for the TXT record and validating the value. Once all of the name servers have returned the expected record, we sleep for 5 seconds and accept the challenge.
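For context, that check looks roughly like the sketch below (written here with dnspython 2.x; the zone name, record value, and sleep lengths are illustrative placeholders rather than our production values):

```python
# Minimal sketch of the pre-validation check described above, using dnspython.
# The zone, TXT value, and sleep lengths are illustrative placeholders.
import time
import dns.resolver

def authoritative_nameservers(zone):
    """Return the hostnames listed in the zone's NS records."""
    return [str(rr.target).rstrip(".") for rr in dns.resolver.resolve(zone, "NS")]

def txt_visible_on_all_nameservers(name, expected, zone):
    """Query each authoritative nameserver directly for the TXT record."""
    for host in authoritative_nameservers(zone):
        ns_ip = str(dns.resolver.resolve(host, "A")[0])
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ns_ip]
        try:
            answers = resolver.resolve(name, "TXT")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return False
        values = {b"".join(rr.strings).decode() for rr in answers}
        if expected not in values:
            return False
    return True

# Poll until every authoritative nameserver returns the record, then wait a
# little longer before asking the ACME server to validate the challenge.
while not txt_visible_on_all_nameservers(
        "_acme-challenge.example.com", "TOKEN_DIGEST_VALUE", "example.com"):
    time.sleep(2)
time.sleep(5)  # extra propagation delay before accepting the challenge
```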

The Google changes API is a great suggestion. As far as I can tell, pending is similar to 202 Accepted, and done means the change has been applied (although not necessarily propagated). I just ran a brief test by polling the changes API until the change's status came back as done, waiting 5 seconds, then checking the authoritative DNS for the record that failed. Boulder wasn't able to verify the DNS record either. The done reply usually comes back on the first check, whereas propagation seems to take much longer.
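For reference, the polling loop in that test looked roughly like this (using the google-cloud-dns Python client; the project, zone, and record values are placeholders):

```python
# Rough sketch of polling the Cloud DNS changes API with the google-cloud-dns
# Python client. "my-project" / "my-zone" and the record values are placeholders.
import time
from google.cloud import dns

client = dns.Client(project="my-project")
zone = client.zone("my-zone")

change = zone.changes()
record_set = zone.resource_record_set(
    "_acme-challenge.example.com.", "TXT", 60, ['"TOKEN_DIGEST_VALUE"'])
change.add_record_set(record_set)
change.create()

# Poll until the API reports the change as applied.
while change.status != "done":
    time.sleep(2)
    change.reload()

# "done" does not mean the record is visible everywhere yet; a further
# propagation delay (and/or querying the nameservers directly) is still needed.
```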

:frowning2:

It may be impossible to use Google Cloud DNS perfectly reliably… The longer you wait, the better the chances are, but…

Does this problem mostly occur with newly registered domains? I think the propagation time for the .com nameservers is probably slower than Google Cloud DNS' propagation time.

In my experience querying against them, Cloudflare, and a few others... the traffic is often distributed geographically and load-balanced across clusters. The 'network' you resolve against might not be the same as the one Boulder resolves against, and even if it is... you may hit different internal servers.

I should have mentioned this before -- but some other hosts have spoken here in the past about using multiple accounts and multiple keys across their clusters. They were doing so for security and reporting, not to bypass rate limits. I'm not sure if that is valid under the current ToS though.

Google Cloud DNS uses anycast. I don't believe there's a way to verify that a change has propagated globally.

I agree with @jsha that it sounds like you're doing the right thing. I think server or protocol changes will be needed to make the DNS challenge more reliable.

Short term, one mitigation might be to wait longer than 5 seconds.

It's not against ToS, it's just not best practice, since hitting the pending authz rate limit usually indicates a bug in the ACME client.

BTW, just to be clear, we're discussing two separate issues in this thread:

  • Hitting the pending authz limit (cause undetermined)
  • Failing DNS challenges (probably propagation-related)

The certbot client doesn't seem to have a means of listing or deactivating pending authorization challenges.

  1. Is there a Boulder API to list pending challenges for an account?
  2. Are there Boulder API docs for other clients to handle authorization revocations? (95% of my work is with a custom client, so that's of interest to me)

There is not. The client is responsible for keeping track of its pending authorizations.

For this you'd want to reference Section 7.5.2 of the latest ACME spec. By and large there is no Boulder API, just the draft ACME specification and the quirks Boulder has collected evolving alongside it.
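To make that concrete, deactivation per that section is just a signed POST to the authorization URL with a small JSON body. A rough sketch follows; the `sign_jws` argument is a stand-in for whatever JWS-signing code your client already has (account key, nonce handling, etc.), not a real library function:

```python
# Sketch of deactivating a pending authorization per the ACME draft,
# section 7.5.2. sign_jws() is a hypothetical stand-in for the client's
# existing JWS-signing routine.
import json
import requests

def deactivate_authorization(authz_url, sign_jws):
    payload = {"status": "deactivated"}
    # Note: older drafts / pre-RFC Boulder endpoints may also expect
    # "resource": "authz" in the payload.
    jws = sign_jws(url=authz_url, payload=json.dumps(payload))
    resp = requests.post(
        authz_url, data=jws, headers={"Content-Type": "application/jose+json"})
    resp.raise_for_status()
    return resp.json()  # the authorization object, now with status "deactivated"
```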

Hope that helps,

Update: Checking the DNS records and then sleeping for 5 seconds was enough most of the time, but still led to a significant number of failures. As best I can tell, it's related to DNS propagation to different servers (likely based on geolocation). I think someone may have mentioned this above, but I can't find the specific note at the moment.

Because each failed authorization attempt requires a ~10 minute delay in our system before requeueing, and also requires additional DNS records to be created and torn down (burning through valuable rate limits with our DNS provider), I opted for a 20 second sleep after verifying the DNS records before accepting the challenge. Since implementing this on 4/27, we have had 0 failed DNS challenges. Clearly this timeout is arbitrary and will vary based on DNS provider and other factors, but I wanted to provide an update to wrap this issue up in case others find it helpful.

With regards to the pending authz limit: storing all authorizations in a database has worked well for us; we have had 0 pending authorizations since implementing it. Our client listens for interrupt/kill signals and stops taking new work immediately, which gives it as much time as possible to revoke any pending authorizations and update the database (both of which block the application from exiting, given their importance). If for some reason we do end up with pending authorizations, we can add a periodic monitor that revokes and updates any pending authorization older than x minutes, since they are all stored. That would get us out of a bind quickly should something go wrong.
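As a rough illustration of that shutdown path (the table schema and the deactivate_authorization() helper here are simplified stand-ins for our internal code):

```python
# Sketch of the shutdown handling described above. The schema and the
# deactivate_authorization() helper are simplified stand-ins for internal code.
import signal
import sqlite3

db = sqlite3.connect("authz.db")

def deactivate_authorization(authz_url):
    """Placeholder for the client's ACME deactivation call (see the
    section 7.5.2 sketch earlier in the thread)."""

def shutdown(signum, frame):
    # Stop taking new work immediately so there is maximum time to clean up.
    pending = db.execute(
        "SELECT url FROM authorizations WHERE status = 'pending'").fetchall()
    for (authz_url,) in pending:
        deactivate_authorization(authz_url)  # blocks until the CA confirms
        db.execute(
            "UPDATE authorizations SET status = 'deactivated' WHERE url = ?",
            (authz_url,))
    db.commit()
    raise SystemExit(0)

signal.signal(signal.SIGINT, shutdown)
signal.signal(signal.SIGTERM, shutdown)
```

The occasional monitor mentioned above would then just be a cron-style job running the same loop, with an extra age condition on the query (e.g. only rows older than x minutes).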

Lastly, storing authorizations in the database has allowed us to throttle our workers (by ensuring we don't attempt any new authorizations if we already have x authorizations pending) and to keep track of statistics, so we can internally monitor success, failure, and revocation rates.
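The throttle check itself is just a count against that same table; something like the following (the limit of 50 is an arbitrary example, not our real number):

```python
# Example throttle check against the same table; the limit and schema
# are illustrative only.
MAX_PENDING_AUTHZ = 50

def may_start_new_authorization(db):
    (pending_count,) = db.execute(
        "SELECT COUNT(*) FROM authorizations WHERE status = 'pending'"
    ).fetchone()
    return pending_count < MAX_PENDING_AUTHZ
```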

Thanks for everyone’s help. These notes are clearly specific to our implementation, but hopefully this helps others who are running into similar issues or trying to plan a large-scale integration.

