How to connect to AWS CloudHSM successfully? (failing at "login" step)


#1

The operating system my web server runs on is (include version):

Ubuntu Linux 16.04 LTS

My hosting provider, if applicable, is:

Amazon AWS, EC2, CloudHSM

I can login to a root shell on my machine (yes or no, or I don’t know):

Yes

We are running LetsEncrypt Boulder on an AWS EC2 instance, and it is configured to talk to an AWS CloudHSM for signing.

We followed the Amazon instructions to install and configure the CloudHSM and we can talk to it using:

/opt/cloudhsm/bin/cloudhsm_mgmt_util /opt/cloudhsm/etc/cloudhsm_mgmt_util.cfg

When we launch Boulder, we get tantalizingly close, but we run into an error when we try to log in to the HSM to upload our private keys.

We’ve added some debugging statements so it barfs out some details just before it fails.

We’re having trouble diagnosing exactly what is going wrong as we try to login to the HSM, but it is complaining about “CKR_ARGUMENTS_BAD”.

You can see below, there is only 1 slot, and the predefined label is “cavium”, the hardware manufacturer (I think).

In the HSM management tool, we log in using a username like “admin” instead of a user ID like “1”, but I am not sure if that is why it is failing. I presume the “pin” refers to the password we have assigned, which we are able to test directly in the management utility.

pkcs11key: &{Module:/usr/local/lib/libpkcs11-proxy.so TokenLabel:cavium PIN:5678 PrivateKeyLabel:intermediate_key}

DEBUG: New, modulePath: /usr/local/lib/libpkcs11-proxy.so tokenLabel: cavium pin: 5678 privateKeyLabel: intermediate_key
DEBUG: initialize, modulePath: /usr/local/lib/libpkcs11-proxy.so
DEBUG: setup, privateKeyLabel: intermediate_key
DEBUG: openSession
DEBUG: ps.module.GetSlotList: len(slots) = 1
DEBUG: tokenInfo.Label = cavium
DEBUG: slot: 1
DEBUG: session created using ps.module.OpenSession
DEBUG: login failed using ps.module.Login, CKU_USER: 1 ps.pin: 5678
E231648 boulder-ca [AUDIT] Couldn't load private key: pkcs11key: problem making Key: pkcs11: 0x7: CKR_ARGUMENTS_BAD
Couldn't load private key: pkcs11key: problem making Key: pkcs11: 0x7: CKR_ARGUMENTS_BAD

If anyone in the LetsEncrypt community has some real-life hands-on experience with Amazon AWS CloudHSM, and/or any kind of knowledge about how Boulder actually runs in production (on Amazon?), we would be eternally grateful if you could provide a few pointers to help us navigate this particular obstacle.

I feel confident that once we are able to coax Boulder to login to our CloudHSM, the rest of the code should work!

Any help or suggestions appreciated.


#2

After some more research, I can answer my own question. I will share it just in case anybody else is trying to use AWS CloudHSM.

By default, Boulder uses SoftHSM, which seems to expect a 4-digit string for the pin.

With CloudHSM, the trick is that the pin parameter needs to be a string of the form "USER:PASSWORD".

The USER and PASSWORD need to be created manually using this AWS tool: /opt/cloudhsm/bin/cloudhsm_mgmt_util

After that, simply updating the pin in the config file is enough for the Boulder code to “login” to the CloudHSM successfully.
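
To make the pin change concrete, here is a minimal Go sketch. The `hsmConfig` struct and the `cloudHSMPin` helper are hypothetical; the field names only mirror the ones visible in the pkcs11key debug line in the first post, and the user name and password are placeholders, not real credentials.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// hsmConfig is a hypothetical struct. Its fields mirror the ones printed in
// the pkcs11key debug line from the first post, not Boulder's real config type.
type hsmConfig struct {
	Module          string
	TokenLabel      string
	PIN             string
	PrivateKeyLabel string
}

// cloudHSMPin joins a CloudHSM user name and password into the
// "USER:PASSWORD" form described above.
func cloudHSMPin(user, password string) (string, error) {
	if user == "" || password == "" {
		return "", errors.New("user and password must both be non-empty")
	}
	// Assumption: a ':' inside the user name would make the pin ambiguous,
	// so this sketch rejects it.
	if strings.Contains(user, ":") {
		return "", errors.New("user name must not contain ':'")
	}
	return user + ":" + password, nil
}

func main() {
	// Placeholder credentials; create the real ones with cloudhsm_mgmt_util.
	pin, err := cloudHSMPin("example_user", "example_password")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	cfg := hsmConfig{
		Module:          "/usr/local/lib/libpkcs11-proxy.so",
		TokenLabel:      "cavium",
		PIN:             pin,
		PrivateKeyLabel: "intermediate_key",
	}
	fmt.Printf("%+v\n", cfg)
}
```

The only difference from the failing setup in the first post is the PIN value: a bare "5678" there, a "USER:PASSWORD" string here.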


#3

Great! Also, it looks like you might be using pkcs11-proxy, which is not production-appropriate code. I’d recommend swapping it out for the module associated with CloudHSM.

Out of curiosity, what are you using Boulder for?


#4

Can you point to any documentation related to how we can swap out the pkcs11-proxy and instead use the module for CloudHSM? There are lots of moving parts in Boulder, … so even just a few sentences about which config files, and what changes to make would be very helpful.

By the way, we did some throughput measurements and the CloudHSM seems very slow, … less than 15 certs/sec. I have no idea if this is an inherent limitation.

Has anyone else done benchmarking with AWS CloudHSM? I would be interested to compare results. Is there a standalone HSM speed test available in Golang?

The slow certificate signing in CloudHSM causes another problem. When we try to push Boulder, the HSM is the rate-limiting step, so there is a growing backlog of pending CSRs. Boulder seems to use a fairly aggressive timeout deadline, so if the certificate is not signed and returned to the client within a few seconds, it gives up and returns 500 Internal Server Error.

Can you recommend how to adjust the timeout setting to be more forgiving? Is there an environment variable we can adjust? It would be good to accept bursts of CSRs and let the queue drain over time.

Using Pebble, and signing certificates in software, we can easily get sustained throughput of 250 certs/sec. I know there are normally rate-limits per account, but what kind of throughput can Boulder sustain? How many certs/sec in production?


#5

The configuration for the PKCS11 module is in the CA’s config. In the test environment, that’s test/config/ca.json.

We do about 30 certs/sec in prod, plus many times more than that in signatures for OCSP responses. Performance really varies widely between HSM models, but should be documented for CloudHSM.

The pkcs11key package has a benchmark test that can be run manually as a performance test.

The timeout settings are also in the various config JSON files, though I’d recommend matching your HSM capacity to your needs instead of tweaking timeouts. If the HSM can’t keep up, your backlog will grow without bound.
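
The backlog argument can be sketched with a small Go simulation. The rates below are illustrative numbers, not measurements, and `backlogAfter` is a hypothetical helper written for this example.

```go
package main

import "fmt"

// backlogAfter returns how many requests are still unsigned after the given
// number of seconds, for constant arrival and signing rates (per second).
func backlogAfter(arrivalRate, signRate, seconds int) int {
	backlog := 0
	for s := 0; s < seconds; s++ {
		backlog += arrivalRate
		signed := signRate
		if signed > backlog {
			signed = backlog // cannot sign more than is queued
		}
		backlog -= signed
	}
	return backlog
}

func main() {
	// Illustrative load: 30 requests/sec arriving, 15 signatures/sec available.
	const arrival, capacity = 30, 15
	for _, s := range []int{1, 10, 60} {
		b := backlogAfter(arrival, capacity, s)
		// The newest request waits roughly backlog/capacity seconds.
		fmt.Printf("after %2ds: backlog %3d, newest request waits ~%ds\n", s, b, b/capacity)
	}
}
```

Once arrivals exceed signing capacity, the wait for the newest request grows linearly with time, so any fixed timeout is eventually exceeded. A longer timeout only helps with short bursts that stay below capacity on average.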


#6

Thanks for the valuable info, … I was able to replace the pkcs11-proxy library with the cloudhsm library and it worked as expected. Interestingly the throughput was about the same, so the proxy was not what was slowing us down.

I found the benchmark program /opt/cloudhsm/bin/pkpspeed and I was able to do RSA_CRT at a sustained rate of 330 ops/sec, and that was single-threaded, so it’s maybe not the HSM that is slowing us down. (I haven’t tried the pkcs11key package yet, but thanks for the tip.)

It would be interesting to know the typical number of crypto operations per certificate, since there is the pre-certificate, the SCTs, the OCSP responses, … Just to enable a back-of-the-envelope calculation to convert crypto ops/sec to certs/sec. For instance, if there were 6 ops per certificate that would suggest ~50 certs/sec maximum, given our CloudHSM performance.
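
The back-of-the-envelope conversion is a single division. In this Go sketch, the 330 ops/sec figure is the pkpspeed result above, and 6 ops per certificate is only the guess from the previous paragraph, not a known Boulder number.

```go
package main

import "fmt"

// maxCertsPerSec converts raw signing throughput into an upper bound on
// certificate issuance, given the number of HSM operations per certificate.
func maxCertsPerSec(opsPerSec, opsPerCert float64) float64 {
	if opsPerCert <= 0 {
		return 0
	}
	return opsPerSec / opsPerCert
}

func main() {
	// 330 ops/sec measured with pkpspeed; 6 ops per certificate is a guess.
	fmt.Printf("%.0f certs/sec upper bound\n", maxCertsPerSec(330, 6))
}
```

With those inputs the bound is 55 certs/sec, in line with the rough ~50 figure.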

In a separate discussion on this site, I mention that we recently found out that Boulder-SA seems to be our real bottleneck. Under heavy load, I suspect that some unlucky goroutine trying to send data through the SA may end up being delayed beyond the timeout because so many other requests randomly jump ahead of it in line.

