More Fiber Benchmarking

I’ve been working with Samuel Williams a bit (and on my own a bit) to do more benchmarking of Fiber speeds in Ruby and compare them to processes and threads. There’s always more to do! Not only have I been running more trials for each configuration (get that variance down!), I’ve also tried out a couple more configurations of the test code. It’s always nice to see what works well and what doesn’t.

New Configurations and Methodology

Samuel pointed out that for threads, I could run one thread per worker in the master process, for a total of 2 * workers threads instead of using IO.select in a single thread in the master. True! That configuration is less like processes but more like fibers, and is arguably a fairer representation of a ‘plain’ thread-based solution to the problem. It’s also likely to be slower in at least some configurations since it requires twice as many threads. I would naively expect it to perform worse for lack of a good centralised place to coordinate which thread is working next. But let’s see, shall we?
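
To make those two shapes concrete, here’s a minimal sketch of the difference - this is invented illustration code with made-up worker and message counts, not the actual benchmark:

```ruby
# Sketch only: each worker thread writes one-byte messages into its own
# pipe, and the master has to collect them all somehow.
NUM_WORKERS = 10
MESSAGES    = 100

pipes = NUM_WORKERS.times.map do
  rd, wr = IO.pipe
  Thread.new { MESSAGES.times { wr.write("x") }; wr.close }
  rd
end

# Old configuration: one master thread multiplexes every pipe with IO.select.
open_pipes = pipes.dup
until open_pipes.empty?
  ready, = IO.select(open_pipes)
  ready.each { |rd| rd.read(1) or open_pipes.delete(rd) }
end

# New configuration (run this instead of the loop above): one reader
# thread per worker, for 2 * NUM_WORKERS threads total and no IO.select
# or other central coordination anywhere.
#   readers = pipes.map { |rd| Thread.new { rd.read(1) until rd.eof? } }
#   readers.each(&:join)
```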

Samuel also put together a differently-optimised benchmark for fibers, one based on read_nonblock. Nonblocking reads are usually worse for throughput but better for latency: a nonblocking implementation can potentially avoid some initial blocking, but it winds up much slower on very old Rubies, where read_nonblock was unusably slow. This benchmark, too, has an interesting performance profile that’s worth a look.

I don’t know if you remember from last time, but I was also doing something fairly dodgy with timing - I measured the entire beginning-to-end process time from outside the Ruby process itself. That means that a lot of process/thread/fiber setup got ‘billed’ to the primitive in question. That’s not an invalid way to benchmark, but it’s not obviously the right thing.

As a quick spoiler on that last one: process setup takes between about 0.3 and 0.4 seconds for everything - running Ruby, setting up the IO pipes, spawning the workers and all. And there’s barely any variation in that time between threads vs processes vs fibers. The main difference between “about 0.3” and “about 0.4” seconds is whether I’m spawning 10 workers or 1000 workers. In other words, it basically didn’t turn out to matter once I actually bothered to measure - which is good, and what I expected, but it’s always better to measure than to expect and assume.
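
For the curious, the ‘inside’ version of the measurement looks roughly like this - run_messages is a hypothetical stand-in, not a function from the real harness:

```ruby
# Outside timing (what I did last time) bills interpreter startup and
# worker setup to the concurrency primitive being tested, e.g.:
#   $ time ruby ./benchmark.rb fiber
#
# Inside timing starts the clock only after setup is done:
t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
run_messages  # hypothetical stand-in for the actual read/write loop
t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
puts "Message-passing took #{t1 - t0} seconds"
```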

I also put together a fairly intense runner script to make sure everything was done in a random order - one problem with long tests is that if something changes significantly partway through (the Amazon hardware, some network connection, a background process updating Ubuntu packages…), then a bunch of highly-correlated tests all share the same problem. Imagine if Ubuntu started updating its packages right as the fiber tests began, then stopped as I switched to thread tests. It would look like fibers were very slow and prone to huge variation in results! I handle this problem for my important results by re-running lots of tests when it matters… But I’m not always 100% scrupulous, and I’ve been bitten by this before. There’s a reason I can tell you the specifics of the problem, right? A nice random-order runner doesn’t keep background delays from happening, but it keeps them from all landing on the same kind of test. Randomly-distributed background noise makes me think, “huh, that’s a lot of variance, maybe this batch of test runs is screwy,” which is way better than thinking, “wow, fibers really suck.”
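
The heart of that runner is tiny. Here’s a sketch with invented configuration names and an invented command line (chruby-exec here - substitute however you switch Rubies); the real script does more bookkeeping than this:

```ruby
# Build every (Ruby version, primitive, trial) combination up front,
# then shuffle, so background noise lands on a random mix of
# configurations instead of one long, unlucky block of them.
RUBIES     = %w[2.0.0-p0 2.3.8 2.6.5]
PRIMITIVES = %w[fork thread fiber]
TRIALS     = 30

runs = RUBIES.product(PRIMITIVES) * TRIALS
runs.shuffle.each do |ruby_version, primitive|
  # Hypothetical command line - adjust for your own Ruby switcher.
  system("chruby-exec #{ruby_version} -- ruby benchmark.rb #{primitive}") or
    warn "Run failed: #{ruby_version} / #{primitive}"
end
```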

So: the combination of 30 test-runs per configuration rather than 10 and running them in a random order is a great way to make sure my results are basically solid.

I’ve also run with the October 18th prerelease version of Ruby 2.7… And the performance is mostly just like the tested 2.6. A little faster, but barely. You’ll see the graphs.

Threaded Results

Since we have two new configurations, let’s start with one of them. The older thread-based benchmark used IO.select in a single master thread; the newer one just uses a lot of threads. In most languages, I’d now comment on how the “lot of threads” version needs extra coordination - but Ruby’s GIL turns out to handle that for us nicely without further work. There are advantages to having a giant, frequently-used lock already in place!

I had a look at the data piecemeal, and yup, on Linux I saw about what I expected for several of the runs. I saw some different things on my Mac, but Macs can be a little weird for Ruby performance, zigging when Linux zags. Overall, we usually treat Linux as the speed-critical deployment platform - because who runs their production servers on Mac OS?

Anyway, I put together the full graph… Wait, what?

Y Axis is the time in seconds to process 100,000 messages with the given number of threads

That massive drop-off at the end… That’s a good thing, no question, but why is thread contention suddenly not a problem in this case when it was for the previous six years of Ruby?

The standard deviation is quite low for all these samples. The result holds for the other numbers of threads I checked (5 and 1000); I just didn’t want to put eight heavily-overlapped lines on the same graph. The numbers are very close for those, too.

I knew these were microbenchmarks, and microbenchmarks are always prone to big swings from small changes. But, uh, this one surprised me a bit. At least it’s in a good direction?

Samuel is looking into it to try to find the reason. If he gets back to me before this gets published, I’ll tell you what it is. If not, I guess watch his Twitter feed if you want updates?

Fibrous Results

Fibers sometimes take a little more code to do what threads or processes manage automatically. That should make sense: they’re a higher-performance, lower-overhead concurrency primitive, and that sometimes means a bit more management and hand-holding. They also let you fully control the fiber-to-fiber yield order (manual control), which means you often need to understand that yield order (no clever, unpredictable automatic control).
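
If you haven’t played with raw fibers, here’s what ‘manual control’ means in a few lines - nothing runs until something resumes it:

```ruby
# The interleaving here is fully deterministic: no scheduler, no
# preemption, just explicit resume and yield.
ping = Fiber.new { 3.times { |i| puts "ping #{i}"; Fiber.yield } }
pong = Fiber.new { 3.times { |i| puts "pong #{i}"; Fiber.yield } }

3.times { ping.resume; pong.resume }
# Always prints ping 0, pong 0, ping 1, pong 1, ping 2, pong 2 -
# nothing else can happen, because nothing else is ever resumed.
```

That predictability is both the power and the chore: you get to choose the order, so you have to choose the order.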

Samuel Williams, who has done a lot of work on Ruby’s fiber improvements and is the author of the Falcon fiber-based application server, saw a few places to potentially change up my benchmark and how it did things with a little more code. Awesome! The changes are pretty interesting - not so much an obvious across-the-board improvement as a somewhat subtle tradeoff. I choose to interpret that as a sign that my initial effort was pretty okay and there wasn’t an immediately obvious way to do better ;-)

He’s using read_nonblock rather than straight-up read. This reduces latency… but isn’t actually amazing for bandwidth, and I’m primarily measuring bandwidth here. His code would likely do even better in a latency-based benchmark. Interestingly, read_nonblock had horrifically bad performance in really old Ruby versions, partly because it used exception handling for flow control - a no-no in nearly any language with exceptions.
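
For reference, here’s the classic read_nonblock idiom - the general pattern from the Ruby docs, not Samuel’s actual benchmark code:

```ruby
# The pre-2.3 idiom: flow control by exception. Every "no data yet"
# raises and rescues IO::WaitReadable, and all that exception machinery
# is a big part of why read_nonblock was so slow on old Rubies.
def read_one_chunk(io)
  io.read_nonblock(4096)
rescue IO::WaitReadable
  IO.select([io])  # block until the descriptor is readable again
  retry
end

# Newer Rubies (2.3+, if I remember right) can skip the exceptions:
#   chunk = io.read_nonblock(4096, exception: false)
#   # returns :wait_readable instead of raising
```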

You can see the code for the original simpler benchmark versus his version with changes here.

It turns out that the resulting side by side graph is really interesting. Here, first look for yourself:

Red and orange are the optimised version, while blue and green are the old simple one.

You already know that read_nonblock is very slow on old Ruby. That’s why the red and orange lines are so high (bad) for Rubies before 2.3, then suddenly faster than the blue and green lines for 2.3 and 2.4.

You may remember from my earlier fiber benchmarks that fiber performance follows a sort of humped curve, with 2.0 being fast, 2.3 being slow, and 2.6 eventually getting faster than 2.0. The blue and green lines are a re-measurement of exactly the same thing, and so they have pretty much exactly the same curve as last week. Good. You can see an echo of the same thing in the way the red and orange lines also get slower for 2.2.10, though it’s obscured by the gigantic speedup to read_nonblock in 2.3.8.

By 2.5, all the samples are basically in a dead heat - close enough that none of them are really outside the range of measurement error of each other. And by 2.6.5, the simple versions have pulled back ahead, but only slightly.

One thing that’s going on here is that read_nonblock is at a slight disadvantage compared to blocking I/O in the kind of test I’m running (bandwidth more than latency). Another is that microbenchmarks can show large swings from small differences in which operations happen to be fast.

But if I were going to tell one overall story here, it’s that recent Ruby is clearly winning over older Ruby. So our normal narrative applies here too: if you care about the speed of these things, upgrade to the latest stable Ruby - or (occasionally, in specific circumstances) something even newer.

Overall Results

The basic conclusions from the previous benchmarks also still hold. In no particular order:

  • Processes get a questionably-fair boost by stepping around the Global Interpreter Lock

  • Threads and Fibers are both pretty quick, but Fibers are faster where you can use them

  • Processes are extremely quick, but in large numbers will eat all your resources; don’t use too many

  • For both threads and fibers, upgrade to a very recent Ruby for best speed

I’ll also point out that I’m doing very little work per message here - in practice, a lot of this will depend on your available memory. Processes can get very memory-hungry very quickly. In that case, you may find that keeping only one copy of your in-memory data by using threads or fibers is a huge win… at least as long as you’re not doing so much calculation that the GIL gets in your way.

See why we have multiple different concurrency primitives? There truly isn’t an easy answer to ‘which is best.’ Except, perhaps, that Matz is “not a threading guy” (still true) - and we don’t prefer threads in CRuby. Processes and Fibers are both better where they work.

(Please note that these numbers, and these attitudes, can be massively different in different Ruby implementations - as they certainly are in JRuby!)