Series of 500s with "Error creating new cert"

Hi,
as a service provider, we are getting a series of 500s with “Error creating new cert” on attempted provisioning.
This started around 2017-07-28 12:33 Pacific Time, likely going from


or

As this is an internal error, we don’t get any useful details from the server, although they seem to be logged on your side.

Could you check the logs for domains:
wkiofhkmtgacpeekmfsp.sdp.certsbridge.com
ntpgyiddtswxucaibbha.sdp.certsbridge.com
miargxtiqoqfoocfeeyw.sdp.certsbridge.com

I suspect this could be related to the Boulder update to +d2af4a0.

Thanks,
Marcin

We have found occurrences from well before the Boulder update to +d2af4a0, so please disregard this.

I suspect this could just be a combination of various transient issues that surface in a similar way over time.

I suspect you saw these errors transiently during the Boulder upgrade maintenance window from components being restarted, not because of the content of the update itself.

Hope that helps,

The last burst of issues we saw was around 2017-07-31 22:12 Pacific Time, which doesn’t correspond to any prod rollout (to the best of my knowledge). That is also the time window I sampled the domains from.

Adding some details to mbwalas’s report, as the problem is ongoing.
107 of 3087 (about 3.5%) of the new-cert requests we made over the last 8 days for subdomains of sdp.certsbridge.com failed with “500 urn:acme:error:serverInternal: Error creating new cert”. For all other domains, by contrast, only 0.05% of new-cert requests failed this way. The problem has been ongoing for about a month. The errors appear regularly, with peaks of ~5 errors in a row almost every day. We haven’t observed any specific time pattern apart from that.
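For anyone who wants to check the arithmetic, here is a minimal sketch using only the figures quoted above. The helper name `failure_rate` is made up for illustration, and the 0.05% baseline is taken as stated rather than recomputed, since the underlying counts for other domains were not posted.

```python
def failure_rate(failed: int, total: int) -> float:
    """Return the failure rate as a percentage of all requests."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


# Figures quoted in the post: 107 failures out of 3087 new-cert requests.
sdp_rate = failure_rate(107, 3087)

# Baseline for all other domains, as stated in the post (percent).
other_rate = 0.05

# How many times more often the affected subdomains fail.
ratio = sdp_rate / other_rate

print(f"sdp.certsbridge.com subdomains: {sdp_rate:.2f}%")
print(f"ratio vs. other domains: about {ratio:.0f}x")
```

So the affected subdomains fail on the order of 70 times more often than everything else, which makes a purely random transient cause look unlikely.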

Thanks for adding more detail.

I’ll raise this internally for more digging.

We too are seeing the error: “500 urn:acme:error:serverInternal: Error creating new cert”.

We can reliably reproduce this error (100% of attempts fail for a specific list of domains) every 5 minutes (our back-off retry period).
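To make the retry behaviour concrete, here is a rough sketch of a fixed-period retry loop of the kind described. It is not our actual client code: `request_new_cert` is a hypothetical callable standing in for whatever issues the new-cert request, and the attempt cap is an assumption added so the loop terminates.

```python
import time
from typing import Callable, Tuple

RETRY_PERIOD_SECONDS = 5 * 60  # the 5-minute retry period mentioned above
SERVER_INTERNAL = "urn:acme:error:serverInternal"


def issue_with_retry(
    request_new_cert: Callable[[], Tuple[int, str]],  # hypothetical client call
    max_attempts: int = 3,                            # assumed cap, not from the post
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str, int]:
    """Retry while the server answers 500 serverInternal.

    Returns (status, error_type, attempts_made) for the last attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, error_type = request_new_cert()
        retryable = status == 500 and error_type == SERVER_INTERNAL
        # Stop on success, on a non-retryable error, or when out of attempts.
        if not retryable or attempt == max_attempts:
            return status, error_type, attempt
        sleep(RETRY_PERIOD_SECONDS)
    raise AssertionError("unreachable")
```

With a domain that fails every time, a loop like this simply returns the 500 after the last attempt, which matches what we see.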

What information can I provide to help debug?

If you could provide the list of domains that reliably fail, that would be helpful. We’ve dug into the problem a bit and are pretty sure it’s related to a slow database query in our rate limiting code, some of which changed recently. But we haven’t nailed down exactly why the query is slow. The list of domains would help.

@jsha Thank you for your response. Here’s the list. I notice it has a lot of TLDs. I’m not sure if LE’s database design is impacted by that.

Please let me know if I can assist in any other way. I’m happy to look at Boulder logs/source or whatever else. Thanks!

5636026810761216-fe1.pantheonsite.io
parsons.mit.edu
washingtongrantmakers.com
www.prospergroupcorp.com
www.masskiting.com
masskiting.com
www.americanresidentproject.net
www.guitar-list.com
www.news.solve.mit.edu
www.heartsine.do
www.mydropninja.com
trex.mit.edu
americanresidentproject.com
test.tribalselfgov.org
gaz.orangesv.com
americanresidentproject.net
somedude.gpsimpact.com
www.washingtongrantmakers.com
www.heartsine.com.sv
www.heartsine.com.tw
mydropninja.com
201.arielgold.win
murphy4nj.com
developer.inmar.com
news.solve.mit.edu
americanresidentproject.info
sustainability.mit.edu
www.mydrupalwizard.com
mydropwizard.com
www.heartsine.my
www.heartsine.pe
www.heartsine.pl
www.heartsine.re
drupalgroup.mit.edu
www.mydropwizard.com
test.episcopaldioceseny.org
www.washingtongrantmakers.org
www.heartsine.cr
www.heartsine.hn
www.kovima.com
prospergroupcorp.com
americanresidentproject.ketchumdigital.com
www.heartsine.ie
www.heartsine.kr
www.heartsine.pt
www.spiria.com
www.heartsine.com.ve
www.heartsine.hk
www.heartsine.ro
gatan.com
iha.gpsimpact.com
medlinks.mit.edu
gschwendlab.mit.edu
kovima.com
www.heartsine.com.tt
www.heartsine.dk
www.heartsine.no
www.heartsine.ph
www.americanresidentproject.info
www.heartsine.cz
www.heartsine.ly
www.heartsine.rs
www.gatan.com
www.americanresidentproject.org
www.heartsine.is
test.dioceseny.org
americanresidentproject.org
spiria.com
washingtongrantmakers.org
sandbox.earthrights.org
hemond-lab.mit.edu
mydrupalwizard.com
www.heartsine.ec
www.heartsine.ht
www.heartsine.it
harvey-lab.mit.edu
beta.murphy4nj.com
guitar-list.com
dev.alexfornuto.com
www.heartsine.lu
www.heartsine.mx
www.lautomobile.ca
www.episcopal.nyc
www.medlinks.mit.edu
www.heartsine.gr
www.heartsine.hu
www.heartsine.jp
www.heartsine.ru
mclaughlin-lab.mit.edu
www.heartsine.hr
www.heartsine.in
www.heartsine.me
stage.achievempls.org
www.americanresidentproject.com
lautomobile.ca
desmarais-lab.mit.edu
www.murphy4nj.com
www.heartsine.ma
flukeprocessinstruments.com
www.flukeprocessinstruments.com
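In case it helps with triage, here is a small sketch for tallying a list like the one above by its final DNS label. The function name `count_by_tld` is made up, and the sample is only a handful of names from the list. Note that the last label is not the same thing as a public suffix (for example `com.sv`), so this is only a rough first look.

```python
from collections import Counter
from typing import Iterable


def count_by_tld(domains: Iterable[str]) -> Counter:
    """Count domains by their final DNS label (a rough stand-in for the TLD)."""
    cleaned = (d.strip().lower().rstrip(".") for d in domains)
    return Counter(d.rsplit(".", 1)[-1] for d in cleaned if d)


# A few names taken from the list above.
sample = [
    "parsons.mit.edu",
    "trex.mit.edu",
    "masskiting.com",
    "www.heartsine.com.sv",
    "www.heartsine.com.tw",
    "www.heartsine.pl",
]

for tld, count in count_by_tld(sample).most_common():
    print(tld, count)
```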

We’re happy to let you know that we haven’t observed this issue since Aug 10, 10:18 PDT. This coincides with last week’s planned Boulder push, so I guess the fix must have gone in with the new release.

Do you have any more context on what could have caused these problems? I couldn’t find any obvious fix in the changelog.

Hi @stanwise,

I’m happy to hear the problem hasn’t resurfaced for you.

This was related to a new approach to calculating an existing rate limit that was introduced in master with 71f8ae. We were able to cross-reference the information you provided with the time this feature was enabled in production, and identified that it interacted poorly with certain issuance patterns.

Since this code was feature-flag gated, per our usual practice, we disabled the feature flag as a configuration change, which is why you aren’t able to see a fix in the changelog. As you observed, this was done on Aug 10th. See this API announcement post for more.
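For readers unfamiliar with the pattern, here is an illustrative sketch of feature-flag gating in general. It is not Boulder's code (Boulder is written in Go), and the flag name `NewRateLimitCalculation` and both counting functions are hypothetical. The point is only that flipping a config value switches code paths without a code change.

```python
from typing import Callable, Mapping


def count_for_rate_limit(
    flags: Mapping[str, bool],
    old_count: Callable[[], int],
    new_count: Callable[[], int],
) -> int:
    """Pick the counting strategy based on a config-supplied flag."""
    # Hypothetical flag name; a missing flag defaults to the old behaviour.
    if flags.get("NewRateLimitCalculation", False):
        return new_count()
    return old_count()


# Both paths stay in the codebase; only the configuration differs.
enabled = {"NewRateLimitCalculation": True}
disabled = {"NewRateLimitCalculation": False}

print(count_for_rate_limit(enabled, lambda: 1, lambda: 2))
print(count_for_rate_limit(disabled, lambda: 1, lambda: 2))
```

Because the old path is still there, turning the flag off restores the previous behaviour immediately, and no commit shows up in the changelog.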

At this point, I believe we intend to abandon the approach in master and revisit it in the future with a more performant solution involving a database migration, once resources are available on both the dev and ops sides.

Hope that helps clarify!

2 posts were split to a new topic: Consistent 500’s for new-cert (failing CAA for one domain)
