We occasionally see a "no such authorization" error in our system. We use the authorization URL returned by the new-order call, and we call the load-authorization API right after the new-order API, so I don't think we should hit this error. Certificate issuance usually works, so I'm wondering whether this is an issue on the Let's Encrypt side.
Thank you for your quick follow up.
Got it, I'll use the Let's Encrypt staging endpoint for our staging environment. We also test our production system, though, and it needs to use the Let's Encrypt production endpoint on the release date of our new system.
We also see the same error for a customer's domain (sorry, I can't share the domain here without their permission).
Usually we get a rate limit error when we hit a limit, so our system holds further API calls until the specified time and then retries. This time it fails, because our code treats a 404 as a non-temporary error and won't retry.
We can change our system to retry on 404, but if this is a rate limit, we would like to see a rate limit error.
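For illustration, the retry policy described here could be sketched roughly as below. This is only a sketch under assumptions: `retry_delay`, `REPLICA_LAG_DELAY`, and `DEFAULT_RATE_LIMIT_DELAY` are hypothetical names, the delay values are placeholders, and it assumes a rate limit arrives as HTTP 429 with a `Retry-After` header. Treating 404 as briefly retryable is the possible change mentioned above, not what our code does today.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Both values are assumptions chosen for illustration, not server guidance.
REPLICA_LAG_DELAY = 1.0          # short wait before re-reading a new resource
DEFAULT_RATE_LIMIT_DELAY = 60.0  # fallback when Retry-After is missing or bad


def retry_delay(status: int,
                retry_after: Optional[str] = None,
                now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait before retrying, or None for a permanent error."""
    if status == 404:
        # Only sensible for resources the caller created moments ago.
        return REPLICA_LAG_DELAY
    if status != 429:
        return None
    if retry_after is None:
        return DEFAULT_RATE_LIMIT_DELAY
    value = retry_after.strip()
    if value.isdigit():
        # Retry-After given as a number of seconds.
        return float(value)
    try:
        # Retry-After given as an HTTP date.
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return DEFAULT_RATE_LIMIT_DELAY
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

A caller would sleep for the returned number of seconds and retry, and give up when the result is `None`.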
I haven't examined your case, but as general advice: Let's Encrypt production is geographically distributed across two redundant locations, each containing multiple database replicas. There is a known problem with this topology: it is possible to create a resource (like an account, authorization, or order) and then have a subsequent read operation hit a read replica that has not caught up. We do not allow our replication lag to rise above 1 second, but that does leave a window of opportunity for incorrect 404s for just-created resources.
If you're writing your own client, I'd recommend adding retries to most API calls. While this should be rare, and we are working on making it rarer or eliminating it entirely, it is probably worth handling. Ensuring your HTTP client reuses connections will also help you stay in the same datacenter.
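As a rough sketch of that kind of retry, assuming your client exposes some function that fetches a URL and returns a status code and body: `fetch_with_retry` and the `fetch` callable are hypothetical names, and the attempt count and delays are arbitrary starting points rather than recommended values.

```python
import time
from typing import Callable, Tuple


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch(url), retrying on 404 with exponential backoff.

    Intended for resources created moments earlier, where a 404 may only
    mean that the read reached a replica that has not caught up yet.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status, body = fetch(url)
    for attempt in range(1, attempts):
        if status != 404:
            break
        # Waits base_delay, then 2x, then 4x, ... between attempts.
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = fetch(url)
    return status, body
```

If every attempt returns 404, the last 404 is handed back to the caller, so a resource that is really missing still surfaces as an error.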
Note that our staging environment is much less distributed, much smaller, and generally less likely to exhibit these behaviors.
Beyond that, I suggest logging all API calls and errors. With our custom client, we decided to log API calls to throttle on our end (if we know something will be rate-limited, just delay until it won't be!), and also log errors in a manner that allowed us to quickly replicate and test issues.
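To give an idea of the throttling side, here is a minimal sketch of a call log that tells you how long to wait before the next request. It is not our actual client: `CallLog`, `limit`, and `window` are hypothetical names, and the real limits you would plug in depend on which rate limit you are tracking.

```python
from collections import deque


class CallLog:
    """Sliding-window log of API call times, used to delay before a limit is hit."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit    # maximum calls allowed per window
        self.window = window  # window length in seconds
        self.calls = deque()  # timestamps of recent calls, oldest first

    def record(self, now: float) -> None:
        """Log that a call was made at time `now`."""
        self.calls.append(now)

    def delay_needed(self, now: float) -> float:
        """Return how many seconds to wait before the next call is allowed."""
        # Drop calls that have aged out of the window.
        while self.calls and self.calls[0] <= now - self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            return 0.0
        # The oldest logged call leaves the window at calls[0] + window.
        return self.calls[0] + self.window - now
```

Before each request you would call `delay_needed(time.monotonic())`, sleep for that long, and then `record` the call once it is sent.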
We typically sleep for at least 1 second on all authorizations to handle DNS-01 updates, as not applying that to HTTP-01 was a chunk of extra work. Until @mcpherrinm's post above, I had a ticket to remove that as a bug, but now I'm leaving it in as a feature to get around ISRG's replication.