Boulder-SA is overwhelmed by 20 certs/sec. Is this normal?

I ran this command:

Boulder

It produced this output:

E030449 boulder-ra [AUDIT] Could not persist order error: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"

The operating system my web server runs on is (include version):

Ubuntu 16.04 LTS

My hosting provider, if applicable, is:

Amazon AWS EC2

I can login to a root shell on my machine (yes or no, or I don’t know):

Yes

I am running Boulder (without Docker) on Amazon AWS EC2, with certificates being saved to AWS RDS (MariaDB).

We are signing certificates using AWS CloudHSM, which has single-threaded performance of 330 ops/sec.

I have been investigating why Boulder can only sign 20 certs/sec (and also why it returns 500 errors due to timeouts).

The machine running Boulder has 8 cores, and when I run top I can see all the Boulder components sharing the workload fairly evenly, except that boulder-sa seems overwhelmed, even at just 20 certs/sec.

I am skeptical that this is expected behaviour, because the amount of data being read/written to the database is small and the queries are simple and use indexes.

How can I tell whether boulder-sa (or the database) is misconfigured?

Has anyone else who runs Boulder seen this kind of sluggish storage performance?

Why is boulder-sa chewing up so much CPU time for a handful of certs/sec?

Any help is appreciated, including tips on how to investigate further or config settings to adjust.


Screen capture shows boulder-sa at 260% CPU. Timeout errors start occurring as it gets closer to 300% (which I think means it has 3 CPU cores working hard). All the other components sit at around 15%, barely breaking a sweat.

If you’re interested in finding where the CPU time is going:

  1. Expose port 8003 in your docker-compose.yml (assuming you use Docker; see the sketch after this list). This is the debug HTTP service for boulder-sa.

  2. Before you start your load test, run locally (adjusting seconds for the duration of the CPU profile you want):

    go tool pprof -png "http://localhost:8003/debug/pprof/profile?seconds=60" > profile.png
    
  3. Run your load test

  4. profile.png will contain the CPU profile for that run (mine is not at all loaded and doesn't really show anything useful; yours should make the hot spot very obvious).
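For step 1, a minimal sketch of what the port mapping might look like in docker-compose.yml (the service name boulder is an assumption; match it to whatever your compose file calls the Boulder container):

    services:
      boulder:
        ports:
          - "8003:8003"   # boulder-sa's debug HTTP service (pprof lives here)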


Thanks for the tip about using pprof. The output looks helpful.

Is there a list of all the debug HTTP ports for the different Boulder components? (I’m running Boulder outside Docker, so port 8003 was already exposed.)

I ran my load test and pprof shows the hot path of execution goes through go-gorp/rawselect.

It’s not surprising the SA code is doing a select, but what’s puzzling is why it takes up to 20X more CPU time than the other components of Boulder.

boulder/sa.(*SQLStorageAuthority).getAuthorizations is also on the hot path (rather than something related to storing certificates).

I checked our ACMEv2 client and it only fetches one authorization for each certificate, so there are only 20 per second.

So it still remains a bit of a puzzle.

$ ag debug test/config
test/config/wfe.json
17:    "debugAddr": ":8000",

test/config/publisher.json
5:    "debugAddr": ":8009",

test/config/ocsp-responder.json
9:    "debugAddr": ":8005"

test/config/ocsp-updater.json
19:    "debugAddr": ":8006",

test/config/wfe2.json
17:    "debugAddr": ":8013",

test/config/sa.json
7:    "debugAddr": ":8003",

test/config/ra.json
11:    "debugAddr": ":8002",

test/config/ca-a.json
6:    "debugAddr": ":8001",

test/config/ca-b.json
6:    "debugAddr": ":8001",

test/config/va.json
4:    "debugAddr": ":8004",

test/config/expiration-mailer.json
14:    "debugAddr": ":8008",

The only thing I can see relates to something you said earlier: that all of your orders are coming through on a single ACME registration.

The way I am reading SQLStorageAuthority.getAuthorizations, my guess is that having a massive number of valid or pending authorizations under a single registration (and for a single identifier, if that is the case) could mean scanning a lot of rows into Go, which would explain the massive CPU time in boulder-sa.
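To make that guess concrete, here is a rough Go sketch of the kind of work a raw select into structs implies. This is illustrative only (plain database/sql, a made-up authz table shape, not Boulder's actual gorp models or SQL): every matching pending/valid authorization row has to be fetched, decoded, and copied into a Go struct, so CPU time scales with the number of matching rows rather than with the 20 orders/sec arriving at the front end.

    // Illustrative sketch only -- not Boulder's real schema, SQL, or gorp code.
    // The point: the per-row decode/copy work grows linearly with the number of
    // pending/valid authorizations under a single registration.
    package sketch

    import (
        "database/sql"
        "time"
    )

    type authzRow struct {
        ID             string
        RegistrationID int64
        Identifier     string
        Status         string
        Expires        time.Time
    }

    func pendingAuthzs(db *sql.DB, regID int64, ident string, now time.Time) ([]authzRow, error) {
        rows, err := db.Query(
            `SELECT id, registrationID, identifier, status, expires
               FROM authz
              WHERE registrationID = ?
                AND identifier = ?
                AND status IN ('pending', 'valid')
                AND expires > ?`, regID, ident, now)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        var out []authzRow
        for rows.Next() { // one decode + copy per matching row
            var a authzRow
            if err := rows.Scan(&a.ID, &a.RegistrationID, &a.Identifier, &a.Status, &a.Expires); err != nil {
                return nil, err
            }
            out = append(out, a)
        }
        return out, rows.Err()
    }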

It’d be nice to see the actual profile, though.

One way to test this would be to re-run your benchmark using separate registrations, and maybe see whether a fresh database helps, too.

A second thing to do might be to profile wfe2 as well, because it would give a top-down view of which operation is primarily responsible for the slowness.
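For that, the same pprof invocation from earlier should work against wfe2's debug port (assuming the :8013 debugAddr from test/config/wfe2.json in your list above):

    go tool pprof -png "http://localhost:8013/debug/pprof/profile?seconds=60" > wfe2-profile.png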


I haven't read the code, but IIRC that's been a headache in the past.

For that matter, Let's Encrypt itself averages about 10 certificates per second.

Issuing certificates at a comparable rate with only one account could hit edge cases Let's Encrypt doesn't usually encounter.


Across multiple instances of Boulder, at that. :slight_smile: (though not sure if SA runs multiple or not)

I have a thought about why Boulder-SA is causing timeout errors when measuring throughput.

Under heavy load, there must be lots of goroutines competing to save stuff through Boulder-SA. As far as I know, the Go runtime schedules “runnable” goroutines on a random basis, without any priority and without respect for who has been waiting the longest. What must happen is that sometimes a pending request through Boulder-SA gets delayed because everyone else keeps jumping in front of that unlucky request, and it eventually times out. It's hard to predict because it's random, but just push 5,000 CSRs through Boulder as fast as you can and you'll be sure to reproduce the phenomenon.

Does that make sense? Does the Go language have any mechanism to prevent this kind of “starvation”?

The goroutine-to-thread scheduler does use run queues (each processor has its own local run queue, plus there is a global queue), not random scheduling. And I don't think that Boulder has any non-preemptible code in it. If this were a problem with the Go runtime at this workload, I think it would be more widely experienced.

So if any code experiences CPU starvation, it should be because the CPU is actually starved.

Did you find time to see whether the problem still occurs using separate ACME registrations and DNS identifiers, and/or an empty database?

Just some final thoughts to close out this thread.

I used pprof to profile all the Boulder components while benchmarking, and I think we figured out where the actual bottleneck is …

Boulder implements Certificate Transparency (https://www.certificate-transparency.org/what-is-ct), which requires communication with external CT log servers. Since the SCTs (Signed Certificate Timestamps) are embedded in the certificate, Boulder must wait for responses from the CT logs before it can finalize the certificate.

As far as I can tell, this is the real bottleneck.

I also found we could eliminate Boulder returning 500 Internal Server Error simply by limiting the number of pending certificate orders to 64. From our benchmarking, at steady state Boulder can sign 16 certs/sec, so with 64 orders in flight that works out to a latency of about 4 seconds (64 ÷ 16) from submitting a CSR to receiving a certificate. I think at least 50% of that latency is waiting for the SCTs.
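For what it's worth, capping in-flight orders on the client side is straightforward with a buffered channel used as a counting semaphore. Here's a minimal sketch in Go (runLoadTest, submitOrder, and the fake submit function are placeholders for whatever your load generator actually does; only the cap of 64 comes from our testing):

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // runLoadTest caps the number of in-flight orders using a buffered channel
    // as a counting semaphore, so at most maxInFlight orders are pending at once.
    func runLoadTest(csrs []string, submitOrder func(csr string) error) {
        const maxInFlight = 64 // the cap that avoided the 500s in our runs
        sem := make(chan struct{}, maxInFlight)
        var wg sync.WaitGroup

        for _, csr := range csrs {
            sem <- struct{}{} // blocks once maxInFlight orders are pending
            wg.Add(1)
            go func(csr string) {
                defer wg.Done()
                defer func() { <-sem }() // free the slot when this order completes
                _ = submitOrder(csr)     // error handling elided in this sketch
            }(csr)
        }
        wg.Wait()
    }

    func main() {
        // Stand-in for a real ACME order/finalize round trip.
        fakeSubmit := func(csr string) error {
            time.Sleep(50 * time.Millisecond)
            return nil
        }
        csrs := make([]string, 5000)
        start := time.Now()
        runLoadTest(csrs, fakeSubmit)
        fmt.Printf("submitted %d orders in %s\n", len(csrs), time.Since(start))
    }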

It would be hard to modify Boulder to make CT an optional feature (for an internal PKI application), since the CT handling is distributed throughout the code.

If you're using our development environment as-is, then the CT servers you're submitting to are backed by Boulder's own ct-test-srv. These aren't actually logs at all (there's no Merkle tree or any CT APIs implemented beyond add-chain/add-pre-chain); they just give back an SCT after adding artificial latency. The default configuration has fixed latency schedules for each "log" to deliberately introduce slowness of varying levels. At the least you can set these latency schedules to all be 0.0.

How have you deployed the boulder components? Are they all running on the same host? Are you using our Docker images and the test/ config directory tree?

Aha! How should I interpret the list of values in LatencySchedule? Should I just set all those to zero, like this?

      "LatencySchedule": [
        0
      ]

It will be interesting to see how fast Boulder is without any artificial latency added.

I have Boulder deployed on a single EC2 instance (8-cores). All the components are running on the same host. There is an external AWS database.

With the artificially added latencies you mentioned, as long as we only have 64 certificates pending we never run into any timeouts. Most Boulder components are running at about 15% CPU, but Boulder-SA is around 50%, for a load average of just 1.5 (out of 8), so there is room to go faster.

We started out using the Docker images and test/config. More recently, we have been running Boulder outside Docker.

Looking at the source, you can actually just omit the latency schedule config entirely.

If you're constrained by the CT logs and not system resources, that's probably OK, but it isn't representative of how we use Boulder at Let's Encrypt scale.

Excellent :+1: Dropping Docker is a good idea. The base configurations in test/config and test/config-next are also unoptimized.

I tried setting the LatencySchedule to [ 0 ] for ct-test-srv and I was expecting Boulder to go much faster.

Instead, the throughput seems exactly the same.

I notice that removing LatencySchedule from the config file (so it is nil in the code) might make it a bit faster, since it would avoid the is.Lock().

Are there any other hidden latencies built into Boulder for testing purposes?

None that come to mind.

Much of Boulder (including gRPC request/response latencies) is instrumented with Prometheus. You might find digging in with that tooling easier than raw pprof.

I’ve seen the Prometheus stuff throughout the code. Is there anywhere specific I should start looking to connect Boulder to a Prometheus dashboard?

You'll want to configure Prometheus with targets for each of the Boulder components' debug addresses. There are no Grafana/Prometheus dashboards provided with Boulder; you'll have to look at the exported stats and make your own queries/dashboards. (You can curl the /metrics path of the debug addr to get a feel for what's available, or use the Prometheus web UI, or spelunk the Boulder source.)
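A minimal sketch of what that might look like in prometheus.yml, using the debugAddr ports from the test/config listing earlier in the thread (the host, job name, and choice of ports are placeholders; adjust them to your deployment, and note that Prometheus's default /metrics path matches what the debug servers serve):

    scrape_configs:
      - job_name: 'boulder'
        static_configs:
          - targets:
              - 'localhost:8000'   # wfe
              - 'localhost:8001'   # ca
              - 'localhost:8002'   # ra
              - 'localhost:8003'   # sa
              - 'localhost:8004'   # va
              - 'localhost:8013'   # wfe2
              # ...plus any other debugAddr ports from the ag output above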
