Series of 500s with "Error creating new cert"

mbwalas · August 1, 2017, 9:32am

Hi,
As a service provider, we are getting a series of 500s with “Error creating new cert” on attempted provisioning.
They started around 2017-07-28 12:33 Pacific Time, and likely originate from

https://github.com/letsencrypt/boulder/blob/538aeb4a439949de5cd7a91ce2b17ce21c64b24e/wfe2/wfe.go#L775

	// Create new certificate and return
	// TODO IMPORTANT: The RA trusts the WFE to provide the correct key. If the
	// WFE is compromised, *and* the attacker knows the public key of an account
	// authorized for target site, they could cause issuance for that site by
	// lying to the RA. We should probably pass a copy of the whole request to the
	// RA for secondary validation.
	cert, err := wfe.RA.NewCertificate(ctx, certificateRequest, reg.ID)
	if err != nil {
		logEvent.AddError("unable to create new cert: %s", err)
		wfe.sendError(response, logEvent, problemDetailsForError(err, "Error creating new cert"), err)
		return
	}


	// Make a URL for this certificate.
	// We use only the sequential part of the serial number, because it should
	// uniquely identify the certificate, and this makes it easy for anybody to
	// enumerate and mirror our certificates.
	parsedCertificate, err := x509.ParseCertificate([]byte(cert.DER))
	if err != nil {
		logEvent.AddError("unable to parse certificate: %s", err)

or

https://github.com/letsencrypt/boulder/blob/2a84bc2495b79e7687ee5cde55e61d31d6371374/wfe/wfe.go#L960

	// Create new certificate and return
	// TODO IMPORTANT: The RA trusts the WFE to provide the correct key. If the
	// WFE is compromised, *and* the attacker knows the public key of an account
	// authorized for target site, they could cause issuance for that site by
	// lying to the RA. We should probably pass a copy of the whole request to the
	// RA for secondary validation.
	cert, err := wfe.RA.NewCertificate(ctx, certificateRequest, reg.ID)
	if err != nil {
		logEvent.AddError("unable to create new cert: %s", err)
		wfe.sendError(response, logEvent, problemDetailsForError(err, "Error creating new cert"), err)
		return
	}


	// Make a URL for this certificate.
	// We use only the sequential part of the serial number, because it should
	// uniquely identify the certificate, and this makes it easy for anybody to
	// enumerate and mirror our certificates.
	parsedCertificate, err := x509.ParseCertificate([]byte(cert.DER))
	if err != nil {
		logEvent.AddError("unable to parse certificate: %s", err)

As this is an internal error, we don’t get any useful details from the server, although they seem to be logged on your side.

Could you check the logs for these domains:
wkiofhkmtgacpeekmfsp.sdp.certsbridge.com
ntpgyiddtswxucaibbha.sdp.certsbridge.com
miargxtiqoqfoocfeeyw.sdp.certsbridge.com

I suspect this could be related to the Boulder update to +d2af4a0.

Thanks,
Marcin

mbwalas · August 1, 2017, 9:48am

We have found occurrences from well before the Boulder update to +d2af4a0, so please disregard that suspicion.

I suspect this could just be a combination of various transient issues that surface in a similar way over time.

cpu · August 1, 2017, 11:58am

I suspect you saw these errors transiently during the Boulder upgrade maintenance window from components being restarted, not because of the content of the update itself.

Hope that helps,

mbwalas · August 1, 2017, 12:12pm

The last burst of issues we saw was around 2017-07-31 22:12 Pacific Time, which doesn’t correspond to any prod rollout (to the best of my knowledge). That is also the time window I sampled the domains from.

gprusak · August 7, 2017, 10:32am

Adding some details to mbwalas’s report, as the problem is ongoing.
107 of 3087 (about 3.5%) of the new-cert requests we made over the last 8 days for subdomains of sdp.certsbridge.com failed with “500 urn:acme:error:serverInternal: Error creating new cert”. By contrast, only 0.05% of our new-cert requests for all other domains failed this way. The problem has been ongoing for about a month; the errors appear regularly, with peaks of ~5 errors in a row almost every day. We haven’t observed any specific time pattern apart from that.
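
For anyone who wants to check the arithmetic, the figures above can be reproduced with a short Go sketch. The function name `failureRate` is illustrative and not part of any tool mentioned in this thread; the 0.05% baseline is taken directly from the post.

```go
package main

import "fmt"

// failureRate returns failed/total as a fraction between 0 and 1.
// It returns 0 when total is 0 to avoid dividing by zero.
func failureRate(failed, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(failed) / float64(total)
}

func main() {
	// Counts reported for subdomains of sdp.certsbridge.com over 8 days.
	rate := failureRate(107, 3087)

	// Baseline reported for all other domains: 0.05%.
	const baseline = 0.0005

	fmt.Printf("failure rate: %.2f%%\n", rate*100)
	fmt.Printf("relative to baseline: %.0fx\n", rate/baseline)
}
```

The point of the comparison is the ratio rather than the absolute rate: the affected subdomains fail many times more often than the rest of the traffic, which suggests something specific to that issuance pattern rather than general background noise.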

cpu · August 7, 2017, 2:52pm

Thanks for adding more detail.

I’ll raise this internally for more digging.

kf6nux · August 8, 2017, 2:45am

We too are seeing the error: “500 urn:acme:error:serverInternal: Error creating new cert”.

We can reliably reproduce this error (100% of attempts fail for a specific list of domains) every 5 minutes (our back-off retry period).

What information can I provide to help debug?
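
A fixed five-minute retry keeps sending requests at the same rate for as long as the error persists. One common alternative, shown here only as a sketch and not as what any poster in this thread actually runs, is capped exponential backoff. The name `backoffDelay`, the five-minute base, and the one-hour cap are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns how long to wait before retry number attempt
// (0-based). The delay starts at base, doubles on each further attempt,
// and never exceeds maxDelay.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	delay := base
	for i := 0; i < attempt; i++ {
		// Stop before doubling would pass the cap (this also avoids overflow).
		if delay >= maxDelay/2 {
			return maxDelay
		}
		delay *= 2
	}
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}

func main() {
	// Hypothetical settings: start at 5 minutes, cap at 1 hour.
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, 5*time.Minute, time.Hour))
	}
}
```

In practice a client would usually add some random jitter to each delay so that many clients failing at once do not retry in lockstep; that is left out here to keep the sketch deterministic.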

jsha · August 8, 2017, 5:37am

If you could provide the list of domains that reliably fail, that would be helpful. We’ve dug into the problem a bit and are pretty sure it’s related to a slow database query in our rate limiting code, some of which changed recently. But we haven’t nailed down exactly why the query is slow. The list of domains would help.

kf6nux · August 8, 2017, 8:21am

@jsha Thank you for your response. Here's the list. I notice it has a lot of TLDs. I'm not sure if LE's database design is impacted by that.

Please let me know if I can assist in any other way. I'm happy to look at Boulder logs/source or whatever else. Thanks!

5636026810761216-fe1.pantheonsite.io
parsons.mit.edu
washingtongrantmakers.com
www.prospergroupcorp.com
www.masskiting.com
masskiting.com
www.americanresidentproject.net
www.guitar-list.com
www.news.solve.mit.edu
www.heartsine.do
www.mydropninja.com
trex.mit.edu
americanresidentproject.com
test.tribalselfgov.org
gaz.orangesv.com
americanresidentproject.net
somedude.gpsimpact.com
www.washingtongrantmakers.com
www.heartsine.com.sv
www.heartsine.com.tw
mydropninja.com
201.arielgold.win
murphy4nj.com
developer.inmar.com
news.solve.mit.edu
americanresidentproject.info
sustainability.mit.edu
www.mydrupalwizard.com
mydropwizard.com
www.heartsine.my
www.heartsine.pe
www.heartsine.pl
www.heartsine.re
drupalgroup.mit.edu
www.mydropwizard.com
test.episcopaldioceseny.org
www.washingtongrantmakers.org
www.heartsine.cr
www.heartsine.hn
www.kovima.com
prospergroupcorp.com
americanresidentproject.ketchumdigital.com
www.heartsine.ie
www.heartsine.kr
www.heartsine.pt
www.spiria.com
www.heartsine.com.ve
www.heartsine.hk
www.heartsine.ro
gatan.com
iha.gpsimpact.com
medlinks.mit.edu
gschwendlab.mit.edu
kovima.com
www.heartsine.com.tt
www.heartsine.dk
www.heartsine.no
www.heartsine.ph
www.americanresidentproject.info
www.heartsine.cz
www.heartsine.ly
www.heartsine.rs
www.gatan.com
www.americanresidentproject.org
www.heartsine.is
test.dioceseny.org
americanresidentproject.org
spiria.com
washingtongrantmakers.org
sandbox.earthrights.org
hemond-lab.mit.edu
mydrupalwizard.com
www.heartsine.ec
www.heartsine.ht
www.heartsine.it
harvey-lab.mit.edu
beta.murphy4nj.com
guitar-list.com
dev.alexfornuto.com
www.heartsine.lu
www.heartsine.mx
www.lautomobile.ca
www.episcopal.nyc
www.medlinks.mit.edu
www.heartsine.gr
www.heartsine.hu
www.heartsine.jp
www.heartsine.ru
mclaughlin-lab.mit.edu
www.heartsine.hr
www.heartsine.in
www.heartsine.me
stage.achievempls.org
www.americanresidentproject.com
lautomobile.ca
desmarais-lab.mit.edu
www.murphy4nj.com
www.heartsine.ma
flukeprocessinstruments.com
www.flukeprocessinstruments.com

stanwise · August 16, 2017, 10:37am

We’re happy to let you know that we haven’t observed this issue since Aug 10, 10:18 PDT. This coincides with last week’s planned Boulder push, so I guess the fix must have gone in with the new release.

Do you have any more context on what could have caused these problems? I couldn’t find any obvious fix in the changelog.

cpu · August 16, 2017, 1:45pm

Hi @stanwise,

I'm happy to hear the problem hasn't resurfaced for you.

This was related to a new approach to calculating an existing rate limit that was introduced in master with 71f8ae. We were able to cross-reference the information you provided with when this feature was enabled in production, and identified that it interacted poorly with certain issuance patterns.

Since this code was feature-flag gated, per our usual practice, we disabled the feature flag as a configuration change, which is why you aren't able to see a fix in the changelog. As you observed, this was done on Aug 10th. See this API announcement post for more.
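
The gating described above can be pictured with a minimal Go sketch. It is not Boulder's actual code: the `Features` type, the flag name `NewRateLimitCount`, and both query stand-ins are hypothetical, chosen only to show how a configuration change can switch a code path off without any code change appearing in a changelog.

```go
package main

import "fmt"

// Features is a hypothetical set of named flags, as might be loaded from a
// configuration file. It is not Boulder's real feature-flag type.
type Features map[string]bool

// Enabled reports whether the named flag is switched on. Flags that are
// absent from the map are treated as off.
func (f Features) Enabled(name string) bool {
	return f[name]
}

// runCount chooses between an established code path and a newer one.
// Both paths are passed in as stand-ins for real database queries.
func runCount(f Features, legacy, experimental func() string) string {
	if f.Enabled("NewRateLimitCount") { // hypothetical flag name
		return experimental()
	}
	return legacy()
}

func main() {
	legacy := func() string { return "legacy query" }
	experimental := func() string { return "experimental query" }

	// Flag enabled in configuration: the new path runs.
	fmt.Println(runCount(Features{"NewRateLimitCount": true}, legacy, experimental))

	// Flag disabled in configuration: same binary, old path runs.
	fmt.Println(runCount(Features{"NewRateLimitCount": false}, legacy, experimental))
}
```

Because only the configuration differs between the two calls, rolling back amounts to flipping a value and reloading, with no new release required.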

At this point I believe we intend to abandon the approach in master and will revisit with a more performant solution involving a database migration in the future when we have the resources on both the dev and ops side available.

Hope that helps clarify!

jsha · August 31, 2017, 9:31pm

2 posts were split to a new topic: Consistent 500’s for new-cert (failing CAA for one domain)
