Hi, we started seeing errors from the cert order API as of Jul 13, 2023, 22:34 UTC, and we're currently facing a bunch of stuck cert issuance.
An example error is:
Error creating new order :: too many certificates (5) already issued for this exact set of domains in the last 168 hours: xxxx.com, retry after 2023-07-15T07:31:28Z: see Duplicate Certificate Limit - Let's Encrypt
This is happening across all domains, and we're not issuing duplicate certificates as the error message says. This all started suddenly.
Could you please check whether Let's Encrypt made any changes around that time, or is having an issue at the moment? I don't see any indication of one on the status page.
I find this unlikely. I suspect that you are issuing many duplicate certificates. Let's Encrypt has talked about revamping their rate limit structure, but they haven't released anything along those lines yet.
I'm also very confused how you say your problems started in April, but you're only noticing them now?
And as @griffin says, it's going to be hard for anyone to help you without the actual domain names. (And even with them, it's likely all we can do is point to certificate transparency logs to show that duplicate certificates are being issued.)
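For anyone who wants to do that check themselves, here is a rough sketch of counting duplicates once you have a list of certificates from a CT log search. The `(serial, issued_at, names)` record shape and the `duplicate_counts` name are my own assumptions for illustration, not the output format of any particular search service, and fetching the records is left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def duplicate_counts(records, now, window=timedelta(hours=168)):
    """Count distinct certificates per exact name set issued in the window before `now`.

    `records` is an iterable of (serial, issued_at, names) tuples. This shape is
    hypothetical; adapt it to whatever your CT search tool returns.
    """
    serials_by_names = defaultdict(set)
    for serial, issued_at, names in records:
        if now - window < issued_at <= now:
            # The limit applies to the exact set of names, so order and case are ignored.
            key = frozenset(name.lower() for name in names)
            # A precertificate and its final certificate share a serial,
            # so collecting serials in a set avoids counting the pair twice.
            serials_by_names[key].add(serial)
    return {key: len(serials) for key, serials in serials_by_names.items()}
```

Any name set whose count reaches the limit quoted in the error (5 here) would explain the rejection.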
Is this a custom client? It's really not going to help much: almost everyone here is a random person who just wants to help people out, and we don't have access to Let's Encrypt's logs. (The actual staff members here have been known to dig in for particularly thorny problems, but I think we'd need a lot more information here for them to be able to do so.)
And it does look like they had a release that the monitoring there first saw at 17:52:01 UTC. You can look at the code changes if you want, but I don't know how helpful that'd be. And it is possible that they changed some configuration around the same time, which isn't code.
I tend to doubt that they turned on Asynchronous Order Finalization in production at the same time, though I guess it's not impossible. Though clients that follow the ACME spec wouldn't have a problem with it whether on or off. Does your client currently work when tested against the staging environment, which has had it on since March?
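To illustrate what "follow the ACME spec" means here, below is a minimal sketch of the finalize step as I read RFC 8555 section 7.4: finalize only when the order is "ready", then poll the order URL while it is "processing" until it becomes "valid". The `post` and `post_as_get` callables are hypothetical stand-ins for a client's signed-request helpers (each assumed to return the order object plus any Retry-After value in seconds); they are not part of any real library.

```python
import time


def finalize_and_wait(order_url, order, post, post_as_get, csr_payload,
                      max_polls=10, sleep=time.sleep):
    """Finalize an ACME order and return its certificate URL.

    post(url, payload) and post_as_get(url) are hypothetical helpers that
    return (order_dict, retry_after_seconds_or_None).
    """
    retry_after = None
    if order["status"] == "ready":
        # Only a "ready" order may be finalized; re-sending this for an order
        # that is already "valid" is rejected by the server.
        order, retry_after = post(order["finalize"], csr_payload)

    polls = 0
    while True:
        status = order["status"]
        if status == "valid":
            return order["certificate"]
        if status != "processing":
            raise RuntimeError(f"order cannot be completed, status is {status!r}")
        if polls >= max_polls:
            raise TimeoutError("order still processing after polling")
        # Honor Retry-After when the server sent one, else wait a default 1 second.
        sleep(retry_after or 1)
        # POST-as-GET: the helper must sign an empty payload.
        order, retry_after = post_as_get(order_url)
        polls += 1
```

With synchronous finalization the first response is already "valid" and the loop returns immediately; with asynchronous finalization the same code simply polls a few times, which is why a conforming client should not care which mode is enabled.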
Because if there are multiples it's hard to imagine how they were created except by a valid request.
If you do notice more frequent issuance, when did it start? An error of "too many" today was caused by prior issuance at any time in the past week. And, based on the suggested "retry after" date, this may have started around Jul 8.
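The arithmetic behind that estimate can be sketched as follows. It assumes the "retry after" time is the moment the oldest counted certificate ages out of the 168-hour window, which is my reading of the error message rather than something confirmed in this thread. The function name is just for illustration.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=168)  # window length quoted in the error message


def oldest_counted_issuance(retry_after: str) -> datetime:
    """Estimate when the oldest certificate still counted was issued.

    Assumes `retry_after` marks the moment that certificate leaves the window.
    """
    ts = datetime.strptime(retry_after, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return ts - WINDOW


print(oldest_counted_issuance("2023-07-15T07:31:28Z").isoformat())
# prints 2023-07-08T07:31:28+00:00
```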
I spot-checked several of these recent rate limit errors that were served to clients with the vercel-acme/1.0 user-agent, and it looks like they were correct. Not all of the matching certificates are visible in crt.sh yet because of their ingestion delay, but they show up in our database and in Censys.
I also see a very large volume of these errors:
400 :: malformed :: POST-as-GET requests must have an empty payload
403 :: orderNotReady :: Order's status ("valid") is not acceptable for finalization
We did upgrade to a new Boulder release earlier today, but I haven't identified anything that should have changed how clients interact with the API.
Yes, a change (unrelated to the Boulder update) had indeed enabled async order finalization in production. Thanks again for bringing this to our attention, and I'm sorry about the trouble! We've now reverted the change, so things should be back to normal. We'll keep you updated on what we learn at our post-incident review.
We've been removing an extra templating layer that generates our Boulder configuration files, in order to make changing (and reviewing changes to) them easier. As part of that process, we inadvertently re-enabled the Async Finalization feature flag in production and did not catch it in review. Ironically, this is a good example of the need for this refactoring.
We've determined that the best fix here is to move forward with the large formatting changes to get the configs into a better state. We'll proceed with extra caution, and add extra tools to our reviews, until this refactoring is complete. This config (for the RA service) was the most complex of those remaining.
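As a rough idea of the kind of review aid this could be (a sketch only, not the tooling actually in use), a small script can compare the feature flags in the rendered config before and after a change. The top-level `features` key and the `AsyncFinalization` flag name below are placeholders for illustration and may not match the real config layout or flag names.

```python
import json


def changed_features(old_config: str, new_config: str, section: str = "features"):
    """Return {flag: (old_value, new_value)} for every flag that differs.

    Both arguments are JSON documents as text. The `section` key is a
    hypothetical location for the feature-flag map.
    """
    old = json.loads(old_config).get(section, {})
    new = json.loads(new_config).get(section, {})
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


before = '{"features": {"SomeOtherFlag": true}}'
after = '{"features": {"SomeOtherFlag": true, "AsyncFinalization": true}}'
print(changed_features(before, after))
# prints {'AsyncFinalization': (None, True)}
```

Surfacing a semantic diff like this alongside a large formatting-only change makes an unintended flag flip stand out from the reformatting noise.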
We've also thought of some ways we can make it easier to get a single, convenient view of recent changes, in order to avoid the mistake I made in concentrating on the Boulder update to the exclusion of other recent changes.