How to work around Boulder "500 Internal Server Error" when an SCT timeout occurs?


#1

I ran this command:

Boulder

It produced this output:

**E001342 boulder-wfe2 [AUDIT] Internal error** - **Error** finalizing order :: Unable to meet CA SCT embedding requirements - getting SCTs: CT log group "b": context deadline exceeded

The operating system my web server runs on is (include version):

Ubuntu 16.04 LTS

My hosting provider, if applicable, is:

Amazon AWS EC2

I can login to a root shell on my machine (yes or no, or I don’t know):

Yes

I am running Boulder (outside Docker) on an Amazon EC2 instance and using AWS CloudHSM for signing certificates. Our experience is that CloudHSM is substantially slower than SoftHSM.

This slowdown when signing certificates can lead to a backlog of certificates waiting to be signed. Boulder doesn’t seem to handle this scenario as gracefully as it could.

Under moderately heavy load (~10 certs/sec) we are often seeing this error:

 **E001342 boulder-wfe2 [AUDIT] Internal error** - **Error** finalizing order :: Unable to meet CA SCT embedding requirements - getting SCTs: CT log group "b": context deadline exceeded

As far as I can tell, this has to do with “Certificate Transparency”. The “pre-certificate” gets signed and submitted to a public log for certificate transparency, and these SCTs (Signed Certificate Timestamps) get included in the certificate that eventually gets issued.

There must be a limited time window within which the pre-certificate is submitted to the public transparency log and the final certificate is signed and issued, and Boulder is exceeding that deadline. The problem is that it returns a 500 Internal Server Error, so client software gives up and fails to obtain a certificate.
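To make the timing concrete, here is a toy Python model (not Boulder's actual code; the submit functions are placeholders) of gathering SCTs from several logs under one overall deadline, which fails the same way the error above does:

```python
import concurrent.futures

def get_scts(submit_fns, deadline_seconds):
    """Toy model of SCT gathering: submit the precertificate to each CT
    log concurrently and fail if collecting the results misses the
    overall deadline (the 'context deadline exceeded' in the error
    above). submit_fns are hypothetical per-log submission callables."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn) for fn in submit_fns]
        try:
            # Wait for each submission; any one exceeding the budget
            # sinks the whole certificate issuance.
            return [f.result(timeout=deadline_seconds) for f in futures]
        except concurrent.futures.TimeoutError:
            raise RuntimeError("getting SCTs: context deadline exceeded")
```

This is only an illustration of why one slow CT log group is enough to fail the entire finalize request.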

I have 3 questions:

  1. Is there any way to disable the CT/SCT mechanism? This would make sense when using the LetsEncrypt/Boulder code for an internal CA, where we don’t want certificates from an internal stealth project listed on a public log. It also makes sense in a testing/development environment, to evaluate the effect CT has on throughput and to disable some features when investigating bugs.

  2. If we cannot disable the CT/SCT mechanism, is there any way to allow a longer time window?

  3. When an SCT-related timeout occurs, Boulder returns a 500 Internal Server Error. If this is actually a transient error and the “Finalize Order” request can be retried, then perhaps a 504 Gateway Timeout could be returned instead, so client software gets a clue that it can try again later. This would be a big improvement: the client could use exponential backoff and keep trying until the certificate is issued.
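The retry behaviour described in point 3 can be sketched on the client side, assuming your ACME client exposes the finalize call as a function that raises on a transient 5xx (the names here are hypothetical, not from any particular client):

```python
import time

def retry_with_backoff(request_fn, attempts=5, base_delay=1.0):
    """Retry a transiently-failing request (e.g. a finalize that came
    back 5xx) with exponential backoff: 1s, 2s, 4s, ... between tries.
    request_fn is a placeholder for whatever call your ACME client
    makes; it should raise on failure and return on success."""
    for attempt in range(attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

With a 504 (or any explicit retryable signal) from Boulder, a loop like this could keep polling until the certificate is issued instead of failing outright.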

How does LetsEncrypt.org handle hundreds of certificates per second? There must be occasional brown-outs when CT log submission slows down. Does it really start throwing 500 errors?

Any help would be appreciated.


#2

Sure does. You can search for the error on this forum to see production users running into it at finalization.

But 10/sec seems really low: Amazon’s presentations show that CloudHSM should support 1000+ RSA signing operations per second per HSM.

Two things you might try:

  • If you are going through pkcs11-proxy, try a direct PKCS#11 connection to CloudHSM to rule out proxy overhead.
  • Have you tried commenting out the logs in ra.json?


#3

Thanks for the tips.

Boulder gives roughly the same performance with a direct connection and proxy connection, so that doesn’t seem to be the issue.

/opt/cloudhsm/lib/libcloudhsm_pkcs11.so
/usr/local/lib/libpkcs11-proxy.so

I also tried pkpspeed to check the CloudHSM, and single-threaded RSA_CRT performance looks reasonable at 330 ops/sec, so it looks like Boulder is the bottleneck when I am getting only 10 certs/sec.

I haven’t tried changing the log levels yet, but I wouldn’t expect verbose logs to cause a 10X decrease in performance.
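For reference, a crude single-threaded throughput check in the spirit of pkpspeed can be written in a few lines (op is a placeholder for one signing operation against whatever PKCS#11 wrapper you use):

```python
import time

def ops_per_second(op, n=200):
    """Rough single-threaded throughput estimate: time n calls of op.
    op is a hypothetical stand-in for one RSA signing operation."""
    start = time.monotonic()
    for _ in range(n):
        op()
    elapsed = time.monotonic() - start
    return n / elapsed if elapsed > 0 else float("inf")
```

Comparing the number this gives for a direct HSM call against Boulder's observed certs/sec helps separate HSM latency from Boulder-side overhead.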

I just noticed these files:

test/rate-limit-policies.yml
test/rate-limit-policies-b.yml

I wonder if Boulder is intentionally rate-limiting my “account” as I am trying to measure throughput.

I haven’t seen any log output related to rate limiting, and I haven’t checked whether any throttle messages are being sent back to the client in the headers. How can I tell if Boulder is hitting rate limits?

In my application, there are lots of clients, but they all submit CSRs through a single interface, so to Boulder it looks like a single “account” submitting all these CSRs.

Can anyone tell me how to remove rate-limits so I can do some throughput testing?
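For throughput testing, one approach (a sketch only; I have not verified the exact field names against the current Boulder tree) would be to raise the thresholds in test/rate-limit-policies.yml rather than remove the limits entirely:

```yaml
# Hypothetical fragment of test/rate-limit-policies.yml -- the field
# names are assumptions; check the file shipped with your Boulder
# checkout before relying on them.
certificatesPerName:
  threshold: 100000
  window: 2160h
certificatesPerFQDNSet:
  threshold: 100000
  window: 168h
```

Since all CSRs arrive through a single account in your setup, per-account and per-name limits are the first ones you would hit.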


#4

I meant removing the Certificate Logs entirely from the RA config file:

  • .ra.CTLogGroups2 (CT logs for use with SCT)
  • .ra.InformationalCTLogs (non-SCT CT logs)

If Boulder has nowhere to submit certificates, it should effectively disable CT/SCT embedding.
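If you want to script that removal, a minimal sketch (assuming the key names above and a standard JSON ra.json; back up the file first):

```python
import json

def strip_ct_logs(path):
    """Remove the CT log entries from an RA config file. The key names
    (CTLogGroups2, InformationalCTLogs) are taken from this thread;
    verify them against your Boulder version before relying on this."""
    with open(path) as f:
        cfg = json.load(f)
    ra = cfg.get("ra", {})
    ra.pop("CTLogGroups2", None)
    ra.pop("InformationalCTLogs", None)
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
```

As the later replies note, Boulder may still refuse to issue without SCTs, so this only removes the config, not the requirement.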

Regarding rate limits, I don’t think any queuing mechanism, like you suggest, exists in Boulder. Requests are immediately rejected with an HTTP 429 if they hit any rate limit, which you should be seeing pretty clearly on your clients.


#5

I just tried removing those items from ra.json, but Boulder was not happy. It always returned 500 errors.

Instead, I tried setting "submitFinalCert": false and that seems to speed things up, closer to 20 certs/sec, but the issued certificates still contain the SCT section, so Boulder is still communicating with the external CT logs.

I suspect Boulder might be imposing rate limits on my single client “account”.

Any suggestions how I could check if rate limits are being enforced?


#6

You’re right. I think Boulder currently requires precertificates and SCTs no matter how you configure it.

You can check if rate limits are being enforced by watching for HTTP 429 responses accompanied by the urn:acme:error:rateLimited error.

Most of the limits seem to be logged anyway:

boulder_1    | I031137 boulder-ra Rate limit exceeded, InvalidAuthorizationsByRegID, regID: 1
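A client-side check for the 429 case might look like this (a sketch; the legacy urn:acme:error:rateLimited type is the one mentioned above, and newer ACME endpoints use urn:ietf:params:acme:error:rateLimited, so matching the suffix covers both):

```python
import json

def is_rate_limited(status_code, body):
    """Return True if an ACME response looks like a rate-limit
    rejection: HTTP 429 with a problem document whose "type" ends in
    ':rateLimited'."""
    if status_code != 429:
        return False
    try:
        problem = json.loads(body)
    except ValueError:
        return False  # not a JSON problem document
    return str(problem.get("type", "")).endswith(":rateLimited")
```

Logging whenever this returns True on the client side should tell you quickly whether your throughput test is being throttled.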

#7

There isn’t any way to do this. Boulder is tailored to the web PKI, and in 2018 there’s no point in issuing certificates without SCTs from CT logs - they won’t be trusted by web browsers. We explicitly do not support this internal CA scenario and are reluctant to add configuration complexity to support tasks outside our primary mission of encrypting the public web.

Overall I’d say you’re running into lots of problems because you’re taking a development environment (and all of our test configs) for a project principally intended to be used by one organization (Let’s Encrypt) for the Web PKI and trying to use it for something unsupported (internal PKI) in an environment (AWS) we’ve never run it in. This is going to cause you lots of headaches and you’re mostly on your own for fixes.

There are no doubt many improvements that can be made to make Boulder work better for internal PKIs and outside organizations but we do not have the resources to take on this work. The folks that run into the most problems using Boulder this way (so far) aren’t offering pull requests with fixes. That’s why we continue to repeat the mantra that Boulder is not appropriate for use in internal PKIs. This specific issue (Boulder not working without CT) was flagged by another person trying to use Boulder in an unsupported environment: https://github.com/letsencrypt/boulder/issues/3941 - @dxjones, are you interested in submitting a PR to fix it? The OP doesn’t seem to be.

This issuance pattern is significantly different from Let’s Encrypt’s production/staging traffic - I wouldn’t be surprised to hear you’re running into performance problems as a result of this design choice interacting badly with Boulder’s queries. Yet another reason why Boulder is not going to be a drop-in solution for you!


#8

Hi. I understand what you are saying about Boulder not being a good fit for an internal PKI project.

Nevertheless, we’re learning a lot by trying to make it work, and seeing what issues we encounter.

It seems like the ideal approach is somewhere in between Boulder-- and CloudFlareSSL++.

If we’re still exploring Boulder in January, I might consider submitting a PR to make it possible to disable CT with an environment variable.


#9

:+1: I don’t mean to be overly discouraging, just realistic :slight_smile: I think there are big pieces missing for making Boulder easy to consume external to Let’s Encrypt in a production setting and I don’t want anyone to find out about that the hard way. We don’t even publish a changelog right now! :skull:

That would be helpful! I think having an implementation would make discussion easier. I’m hesitant about making this a configurable option for Boulder because it’s a liability for our operations. We use CT throughout all of our environments, and failing to do so because of an errant setting would be a significant incident. I’m not sure of the best way to balance the goals of an internal PKI against the increase in complexity/potential for error we could incur.


#10

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.