Hi,
as a service provider we are getting a series of 500s with “Error creating new cert” on attempted provisioning.
Started around 2017-07-28 12:33 Pacific Time, likely going from
or
As this is an internal error, we don’t get any useful details from the server, although they seem to be logged on your side.
I suspect you saw these errors transiently during the Boulder upgrade maintenance window from components being restarted, not because of the content of the update itself.
The last burst of issues we saw was around 2017-07-31 22:12 Pacific Time, which doesn’t correspond to any prod rollout (to the best of my knowledge). That is also the time window I sampled domains from.
Adding some details to mbwalas’s report, as the problem is ongoing.
107 of 3087 (about 3.5%) of the new-cert requests we made over the last 8 days for subdomains of sdp.certsbridge.com failed with “500 urn:acme:error:serverInternal: Error creating new cert”. For all other domains, by contrast, only 0.05% of new-cert requests hit this error. The problem has been ongoing for about a month: the errors appear regularly, with peaks of ~5 errors in a row almost every day. We haven’t observed any specific time pattern apart from that.
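For anyone who wants to check the same split against their own client logs, here is a minimal sketch of the arithmetic. The record layout (domain, HTTP status, error detail) and the `failure_rates` helper are assumptions made for illustration, not anything our tooling or Boulder actually exposes, and the records are synthetic: they reproduce only the 107-of-3087 count quoted above.

```python
from collections import Counter

ERROR_TEXT = "Error creating new cert"


def failure_rates(records, suffix):
    """Return (percent failed under `suffix`, percent failed elsewhere).

    `records` is an iterable of (domain, http_status, detail) tuples.
    This layout is a hypothetical one chosen for the example.
    """
    totals, failures = Counter(), Counter()
    for domain, status, detail in records:
        in_suffix = domain == suffix or domain.endswith("." + suffix)
        group = "suffix" if in_suffix else "other"
        totals[group] += 1
        if status == 500 and ERROR_TEXT in detail:
            failures[group] += 1

    def pct(group):
        # Avoid dividing by zero when a group has no requests at all.
        return 100.0 * failures[group] / totals[group] if totals[group] else 0.0

    return pct("suffix"), pct("other")


# Synthetic data: 107 failures out of 3087 requests under the suffix.
detail = "urn:acme:error:serverInternal: Error creating new cert"
records = [("f%d.sdp.certsbridge.com" % i, 500, detail) for i in range(107)]
records += [("ok%d.sdp.certsbridge.com" % i, 201, "") for i in range(3087 - 107)]

sdp_rate, other_rate = failure_rates(records, "sdp.certsbridge.com")
print("sdp: %.2f%%, other: %.2f%%" % (sdp_rate, other_rate))
```

Since the synthetic data contains no requests outside the suffix, the second rate comes out as zero here; with real logs it would be the figure to compare against.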
If you could provide the list of domains that reliably fail, that would be helpful. We’ve dug into the problem a bit and are pretty sure it’s related to a slow database query in our rate limiting code, some of which changed recently. But we haven’t nailed down exactly why the query is slow. The list of domains would help.
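In case it helps with pulling such a list together, here is a rough sketch of one way to pick out names that fail reliably rather than occasionally. The input format, the `reliably_failing` helper, and both thresholds are assumptions for illustration; the domains are placeholders, not names from this report.

```python
from collections import defaultdict


def reliably_failing(attempts, min_attempts=3, min_failure_ratio=0.8):
    """Return sorted domains that failed in most of their new-cert attempts.

    `attempts` is an iterable of (domain, succeeded) pairs.
    Both thresholds are arbitrary defaults picked for this sketch.
    """
    total = defaultdict(int)
    failed = defaultdict(int)
    for domain, succeeded in attempts:
        total[domain] += 1
        if not succeeded:
            failed[domain] += 1
    return sorted(
        domain
        for domain, count in total.items()
        if count >= min_attempts and failed[domain] / count >= min_failure_ratio
    )


# Placeholder data: one name that always fails, one that fails once.
sample = [
    ("bad.example.com", False),
    ("bad.example.com", False),
    ("bad.example.com", False),
    ("flaky.example.com", True),
    ("flaky.example.com", False),
    ("flaky.example.com", True),
]
failing = reliably_failing(sample)
print(failing)
```

Requiring a minimum number of attempts keeps a single unlucky request from landing a name on the list.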
We’re happy to let you know that we haven’t observed this issue since Aug 10, 10:18 PDT. This coincides with last week’s planned Boulder push, so I guess the fix must have gone in with the new release.
Do you have any more context on what could have caused these problems? I couldn’t find any obvious fix in the changelog.
I’m happy to hear the problem hasn’t resurfaced for you.
This was related to a new approach to calculating an existing rate limit that was introduced in master with 71f8ae. We were able to cross-reference the information you provided with when this feature was enabled in production, and identified that it interacted poorly with certain issuance patterns.
Since this code was feature-flag gated, per our usual practice, we disabled the feature flag as a configuration change, which is why you aren’t able to see a fix in the changelog. As you observed, this was done on Aug 10th. See this API announcement post for more.
At this point I believe we intend to abandon the approach in master and revisit it in the future with a more performant solution involving a database migration, when resources are available on both the dev and ops side.
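To illustrate why a flag flip leaves no trace in the changelog, here is a minimal sketch of the general feature-flag pattern. It is not Boulder's actual code (Boulder is written in Go); the flag name `NewRateLimitQuery` and both query functions are hypothetical stand-ins.

```python
def count_recent_certificates(names, flags, new_query, legacy_query):
    """Pick the counting path based on configuration.

    `flags` is a plain dict loaded from config. The flag name is a
    hypothetical one chosen for this sketch.
    """
    if flags.get("NewRateLimitQuery", False):
        return new_query(names)
    return legacy_query(names)


# Stand-ins for the two code paths; real ones would query the database.
def new_query(names):
    return ("new", len(names))


def legacy_query(names):
    return ("legacy", len(names))


names = ["a.example.com", "b.example.com"]
enabled = count_recent_certificates(
    names, {"NewRateLimitQuery": True}, new_query, legacy_query
)
disabled = count_recent_certificates(names, {}, new_query, legacy_query)
print(enabled, disabled)
```

Both paths ship in the same release, so switching between them is a config edit rather than a code change.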