We occasionally see a "No such authorization" error in our system. We use the authorization URLs returned by the new-order API, and we call the load-authorization API right after the new-order call, so I believe we shouldn't encounter this issue. Certificate issuance usually works, so I'm wondering whether this is an issue on the Let's Encrypt side.
I ran this command:
We have custom code using the V2 API.
We called the new-order API and then, immediately afterwards, loaded the authorization for each authorization URL it returned.
It produced this output:
We occasionally see "HTTP error: 404 Not Found\n(problem (type "urn:ietf:params:acme:error:malformed") (instance "") (id ) (title ""): (detail "No such authorization"))"
If we retry the same step, we can successfully issue new certificates.
Thank you for your quick follow up.
Got it, I'll use the Let's Encrypt staging endpoint for our staging environment. We also test our production system, though, and it needs to use the production Let's Encrypt endpoint on our new system's release date.
We also see the same error for our customers' domains (sorry, I can't share the domains here without their permission).
Usually we get a rate limit error when we hit a limit, so our system holds off on new API calls until the specified time and then retries. This time it fails because our code treats 404 as a non-temporary error, so it won't retry.
We can change our system to retry on 404, but if this is a rate limit problem, we would like to see a rate limit error.
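For anyone handling this in their own client, here is a minimal sketch of how the two cases could be told apart. The function name, the 5-second "fresh" window, and the default delays are hypothetical choices for illustration, not anything Let's Encrypt specifies; only the `urn:ietf:params:acme:error:rateLimited` problem type comes from the ACME standard.

```python
from typing import Optional, Tuple

# Problem type defined by RFC 8555 for rate limiting.
RATE_LIMITED = "urn:ietf:params:acme:error:rateLimited"


def classify_error(
    status: int,
    problem_type: str,
    retry_after: Optional[float],
    resource_age: Optional[float],
    fresh_window: float = 5.0,  # assumption: how long a resource counts as "just created"
) -> Tuple[str, float]:
    """Return (action, delay_seconds) for a failed ACME request.

    action is "wait" (rate limited), "retry" (likely transient 404),
    or "fail" (treat as permanent).
    """
    if status == 429 or problem_type == RATE_LIMITED:
        # Honor the server-provided delay when present; 60s is an arbitrary fallback.
        return ("wait", retry_after if retry_after is not None else 60.0)
    if status == 404 and resource_age is not None and resource_age <= fresh_window:
        # A 404 on a resource we created moments ago may be transient.
        return ("retry", 1.0)
    return ("fail", 0.0)
```

With this split, a rate limit still waits for the advertised time, while a 404 is only retried when the URL was handed out a few seconds earlier.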
Could you elaborate? We only use Let's Encrypt's HTTPS endpoint, and we use DNS-01 for validation. Also, I believe the authorization URLs in the new-order response use https.
I haven't examined your case, but as general advice: Let's Encrypt production is geographically distributed across two redundant locations, each containing multiple database replicas. There is a known problem with this topology: it is possible to create a resource (like an account, authorization, or order), and then have a subsequent read operation hit a read replica that has not caught up. We do not allow our replication lag to exceed 1 second, but that does leave a window of opportunity for incorrect 404s for just-created resources.
If you're writing your own client, I'd recommend adding retries to most API calls. While this should be rare, and we are working on making it rarer or not present, it is probably worth handling. Ensuring your HTTP client is reusing connections will also help stay in the same datacenter.
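As a rough sketch of the retry advice above: the helper below retries a GET-style call only while it keeps returning 404, with exponential backoff. `fetch` is a hypothetical callable standing in for whatever HTTP layer the client already has (ideally one that keeps a single connection or session open between calls), and the attempt count and delays are assumptions, not recommended values.

```python
import time
from typing import Callable, Tuple


def get_with_retry(
    fetch: Callable[[str], Tuple[int, bytes]],  # hypothetical: returns (status, body)
    url: str,
    attempts: int = 4,          # assumption: total tries, including the first
    base_delay: float = 1.0,    # assumption: first backoff, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Fetch url, retrying with exponential backoff while the status is 404."""
    status, body = fetch(url)
    for attempt in range(1, attempts):
        if status != 404:
            break
        # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = fetch(url)
    return status, body
```

This should only wrap reads of resources the client has just created; a 404 on an old URL is more likely a real error and is better surfaced right away.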
Note that our staging environment is much less distributed, much smaller, and generally less likely to exhibit these behaviors.
But the 404 "malformed" error message doesn't really invite retrying, methinks?
Isn't it possible to reply with the order only once the replica has confirmed the existence of the authz? If it's usually very fast, waiting for it shouldn't really impact much, right?
Beyond that, I suggest logging all API calls and errors. With our custom client, we decided to log API calls to throttle on our end (if we know something will be rate-limited, just delay until it won't be!), and also log errors in a manner that allowed us to quickly replicate and test issues.
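To illustrate the idea of logging calls and throttling on the client side, here is a simplified sliding-window sketch. It is not our actual code; the class name, the limit, and the window are made up for the example and do not correspond to any real Let's Encrypt rate limit.

```python
import logging
from collections import deque

log = logging.getLogger("acme.client")  # hypothetical logger name


class CallThrottle:
    """Track recent API calls and report how long to wait before the next one."""

    def __init__(self, limit: int, window: float):
        self.limit = limit      # max calls allowed per window (illustrative)
        self.window = window    # window length in seconds (illustrative)
        self.calls = deque()    # timestamps of recent calls, oldest first

    def delay_needed(self, now: float) -> float:
        """Seconds to wait so that one more call stays within the limit."""
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            return 0.0
        return self.calls[0] + self.window - now

    def record(self, now: float, method: str, url: str, status: int) -> None:
        """Log the call and remember its timestamp for throttling."""
        self.calls.append(now)
        log.info("acme call method=%s url=%s status=%d", method, url, status)
```

A caller would check `delay_needed(time.monotonic())` before each request, sleep for that long, and then `record` the call along with the response status so that failures can be replayed from the log.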
We typically sleep for at least 1 second on all authorizations to handle DNS-01 updates, as not applying that to HTTP-01 was a chunk of extra work. Until @mcpherrinm's post above, I had a ticket to remove that as a bug, but now I'm leaving it in as a feature to get around ISRG's replication.
This is definitely a bug and shouldn't happen! And we'd like to fix it. But as a practical matter, as a Let's Encrypt client, it's possible to work around.