A while back, I set out to look at Fiber performance and how it's improved in recent Ruby versions. After all, concurrency is one of the three pillars of Ruby 3x3! Also, there have been some major speedups in Ruby's Fiber class by Samuel Williams.
It's not hard to write a microbenchmark for something like Fiber.yield. But it's harder, and more interesting, to write a benchmark that's useful and representative.
Wait, Wait, Wait - What?
Okay, first a quick summary: what are fibers?
You know how you can fork a process or create a thread and suddenly there’s this code that’s also running, alongside your code? I mean, sure, it doesn’t necessarily literally run at the same time. But there’s another flow of control and sometimes it’s running. This is all called concurrency by developers who are picky about vocabulary.
A fiber is like that. However, when you have multiple fibers running, they don’t automatically switch from one to the other. Instead, when one fiber calls Fiber.yield, Ruby will switch to another fiber. As long as all the fibers call yield regularly, they all get a chance to run and the result is very efficient.
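As a tiny concrete illustration (this is not the benchmark code, just a sketch), here are two fibers taking turns. Nothing switches automatically; the caller decides the schedule by resuming each fiber, and each fiber hands control back with Fiber.yield:

```ruby
log = []

# Each worker does a bit of "work", then yields control back to the caller.
worker_a = Fiber.new do
  log << "A: step 1"
  Fiber.yield            # hand control back; we get resumed later
  log << "A: step 2"
end

worker_b = Fiber.new do
  log << "B: step 1"
  Fiber.yield
  log << "B: step 2"
end

# The caller is the scheduler: interleave the two fibers manually.
worker_a.resume
worker_b.resume
worker_a.resume
worker_b.resume

log.each { |line| puts line }
# A: step 1
# B: step 1
# A: step 2
# B: step 2
```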
Fibers, like threads, all run inside your process. By comparison, if you call “fork” for a new process then of course it isn’t in the same process. Just as a process can contain multiple threads, a thread can contain multiple fibers. For instance, you could write an application with ten processes, each with eight threads, and each of those threads could have six fibers.
A thread is lighter-weight than a process, and multiple can run inside a process. A fiber is lighter-weight than a thread, and multiple can run inside a thread. And unlike threads or processes, fibers have to manually switch back and forth by calling “yield.” But in return, they get lower memory usage and lower processor overhead than threads in many cases.
We’ll also be talking about the Global Interpreter Lock, or GIL, which these days is more properly called the Global VM Lock or GVL - but nobody does, so I’m calling it the GIL here. Basically, multiple Ruby threads or fibers inside a single process can only have one of them running Ruby at once. That can make a huge difference in performance. We’re not going to go deeply into the GIL here, but you may want to research it further if this topic interests you.
Why Not App Servers?
Some of you are thinking, "but comparing threads and fibers isn’t hard at all." After all, I do lots of HTTP benchmarking here. Why not just benchmark Puma, which uses threads, versus Falcon, which uses fibers, and call it a day?
One: there are a lot of differences between Falcon and Puma. HTTP parsing, handling of multiple processes, how the reactor is written. And in fact, both of them spend a lot of time in non-Ruby code via nio4r, which lets Ruby use some (very cool, very efficient) C libraries to do the heavy lifting. That's great, and I think it's a wonderful choice... But it's not really benchmarking Ruby, is it?
No, we need something much simpler to look at raw fiber performance.
Also, Ruby 3x3 uses Ruby 2.0 as its baseline. Falcon, nio4r and recent Puma all very reasonably require more recent Ruby than that. Whatever benchmark I use, I want to be able to compare all the way back to Ruby 2.0. Puma 2.11 can do that, but no version of Falcon can.
Some Approaches that Didn't Work
Just interested in the punchline? Skip this section. Curious about the methodology? Keep reading.
I tried putting together a really simple HTTP client and server. The client was initially wrk while the server was actually three different servers - one using threads, one using processes, one using fibers. I got it partly working.
But it all failed. Badly.
Specifically, wrk is intentionally picky and finicky. If the server closes the socket on it too soon, it gives an error. Lots of errors. Read errors and write errors both, depending. Just writing an HTTP server with Ruby's TCPSocket is harder than it looks, basically, if I want a picky client to treat it as reasonable. Curl thinks it's fine. Wrk wants clean benchmark results, and says no.
Yeah, okay, fine. I guess I do want clean benchmark results. Maybe.
Okay, so then, maybe just a TCP socket server? Raw, fast C client, three different TCPServer-based servers - one using threads, one using processes, one using fibers? It took some doing, but I did all that.
That also failed.
Specifically, I got it all working with threads - they're often the easiest. And a 10,000-request run took anything from 3 seconds to 30 seconds. That... seems like a lot. I thought, okay, maybe threads are bad at this, and I tried it with fibers. Same problem.
So I tried it with straight-line non-concurrent code for the server. Same problem. What about a simple select-based reactor for the fiber version to see if some concurrency helps? Nope. Same problem.
It turns out that just opening a TCP/IP socket, even on localhost, adds a huge amount of variation to the time for the trial. So much variation that it swamps what I'm trying to measure. I could have just run many, many trials to (mostly) average out the noise. But having more measurement noise than signal to measure is a really bad idea.
So: back to the drawing board.
No HTTP. No TCP. No big complicated app servers; I needed to go simpler, not more complicated.
What was next?
What's more predictable and less variable than TCP/IP sockets? Local process-to-process sockets with no network protocol in the middle. In Ruby, one easy way to do that is IO.pipe.
You can put together a pretty nice simple master/worker pattern by having the master set up a bunch of workers, each with a shell-like pipe. It's very fast to set up and very fast to use. This is the same way that shells like bash set up pipe operators for "cat myfile | sort | uniq" to run output through several programs before it's done.
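Here's a minimal sketch of that idea with a single threaded worker. The names and structure are mine, not the benchmark's, but it shows the two IO.pipe pairs and the round-trip message:

```ruby
# One pipe pair per direction: worker -> master and master -> worker.
read_from_worker, write_to_master = IO.pipe
read_from_master, write_to_worker = IO.pipe

# The worker reads a request from its pipe and writes a reply back.
worker = Thread.new do
  msg = read_from_master.gets.chomp
  write_to_master.puts("echo: #{msg}")
end

# The master sends one request and waits for the reply.
write_to_worker.puts("hello")
reply = read_from_worker.gets.chomp
worker.join

puts reply   # "echo: hello"
```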
So that's what I did. I used threads as workers for the first version. The code for that is pretty simple.
- Set up read and write pipes
- Set up threads as workers, ready to read and write
- Start the master/controller code in Ruby's main process and thread
- Keep running until finished, then clean up
There’s some brief reactor code for master to make sure it only reads and writes to pipes that are currently ready. But it’s very short, certainly under ten lines of “extra.”
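Here's a condensed sketch of those steps. The constants and variable names are illustrative, not from the real benchmark code, and the real harness does more bookkeeping - but the shape is the same: pipes, worker threads, and a master loop with IO.select:

```ruby
NUM_WORKERS = 5      # illustrative sizes, not the benchmark's
NUM_REQUESTS = 100

# Step 1 & 2: set up pipes and a worker thread per pipe pair.
workers = NUM_WORKERS.times.map do
  to_worker_r, to_worker_w = IO.pipe
  from_worker_r, from_worker_w = IO.pipe
  t = Thread.new do
    loop do
      req = to_worker_r.gets
      break if req.nil? || req.chomp == "done"
      from_worker_w.puts("OK")
    end
  end
  { thread: t, write: to_worker_w, read: from_worker_r }
end

# Step 3: master runs in the main thread, sending all the requests.
pending = NUM_WORKERS * NUM_REQUESTS
workers.each { |w| NUM_REQUESTS.times { w[:write].puts("req") } }

# The "brief reactor code": only read from pipes that are ready.
readers = workers.map { |w| w[:read] }
replies = 0
while replies < pending
  ready, = IO.select(readers)
  ready.each { |r| r.gets; replies += 1 }
end

# Step 4: tell the workers to finish, then clean up.
workers.each { |w| w[:write].puts("done"); w[:thread].join }
puts "Processed #{replies} replies"
```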
The multiprocess version is barely different - it's so similar that there are about five lines of difference between them.
And Now, Fibers
The fiber version is a little more involved. Let's talk about that.
Threads and processes both have pre-emptive multitasking. So if you set one of them running and mostly forget about it, roughly the right thing happens. Your master and your workers will trade off pretty nicely between them. Not everything works perfectly all the time, but things basically tend to work out okay.
Fibers are different. A fiber has to manually yield control when it's done. If a fiber just reads or writes at the wrong time, it can block your whole program until it’s done. That's not as severe a problem with IO.pipe as with TCP/IP. But it's still a good idea to use a pattern called a reactor to make sure you're only reading when there's data available and only writing when there's space in the pipe for it.
Samuel Williams has a presentation about Ruby fibers that I used heavily as a source for this post. He includes a simple reactor pattern for fibers there that I'll use to sort my workers out. Like the master in the earlier code, this reactor uses IO.select to figure out when to read and write and how to transfer control between the different fibers. The reactor pattern can be used for threads or processes as well, but Samuel's code is written for fibers.
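To make the pattern concrete, here's a minimal reactor sketch in the spirit of what's described above - it is my own simplified version, not Samuel's code, and the class and method names are invented. Fibers register interest in an IO and yield; the reactor resumes them when IO.select says the descriptor is ready:

```ruby
require "fiber"   # needed for Fiber.current on older Rubies

class MiniReactor
  def initialize
    @readable = {}   # IO => fiber waiting to read it
    @writable = {}   # IO => fiber waiting to write it
  end

  # A fiber calls these to park itself until its IO is ready.
  def wait_readable(io)
    @readable[io] = Fiber.current
    Fiber.yield
  end

  def wait_writable(io)
    @writable[io] = Fiber.current
    Fiber.yield
  end

  # The event loop: select on all parked IOs, resume whoever is ready.
  def run
    until @readable.empty? && @writable.empty?
      r, w = IO.select(@readable.keys, @writable.keys)
      r.each { |io| @readable.delete(io).resume }
      w.each { |io| @writable.delete(io).resume }
    end
  end
end

reactor = MiniReactor.new
rd, wr = IO.pipe
result = nil

# Reader fiber: parks until the pipe has data, then reads it.
Fiber.new do
  reactor.wait_readable(rd)
  result = rd.gets.chomp
end.resume

# Writer fiber: parks until the pipe is writable, then writes.
Fiber.new do
  reactor.wait_writable(wr)
  wr.puts("ping")
end.resume

reactor.run
puts result   # "ping"
```

The key point is that neither fiber ever blocks on I/O directly; the reactor's single IO.select call decides which fiber runs next.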
So initially, I put all the workers into a reactor in one thread, and the master with an IO.select reactor in another thread. That's very similar to how the thread and process code is set up, so it's clearly comparable. But as it turned out, the performance for that version isn't great.
But it seems silly to say it's testing fibers while using threads to switch back and forth... So I wrote a "remastered" version of the code, with the master code using a fiber per worker. Would this be really slow since I was doubling the number of fibers...? Not so much.
In fact, using just fibers and a single reactor doubled the speed for large numbers of messages.
And with that, I had some nice comparable thread, process and fiber code that's nearly all I/O.
How’s It Perform?
I put it through its paces locally on my MacBook Pro with Ruby 2.6.2. Take this as “vaguely suggestive” performance, in other words, not “heavily vetted” performance. But I think it gives a reasonable start. I’ll be validating on larger Linux EC2 instances before you know it - you’ve met me before.
Here are the worker counts and per-worker request counts, along with how long (in seconds) each type of worker takes to process that number of requests:

| | Threads | Processes | Fibers w/ old-style Master | Fibers w/ Fast Master |
|---|---|---|---|---|
| 5 workers w/ 20,000 reqs each | 2.6 | 0.71 | 4.2 | 1.9 |
| 10 workers w/ 10,000 reqs each | 2.5 | 0.67 | 4.0 | 1.7 |
| 100 workers w/ 1,000 reqs each | 2.5 | 0.76 | 3.9 | 1.6 |
| 1000 workers w/ 100 reqs each | 2.8 | 2.5 | 5.0 | 2.4 |
| 10 workers w/ 100,000 reqs each | 25 | 5.8 | 41 | 16 |
Some quick notes: Processes give an amazing showing, partly because they have no GIL. Threads beat out Fibers with a threaded master, so combining threads and fibers too closely seems to be dubious. But with a proper fiber-based master they’re faster than threads, as you’d hope and expect.
You may also notice that processes do not scale gracefully to 1000 workers, while threads and fibers do much better at that. That’s normal and expected, but it’s nice to see the data bear it out.
That final row has 10 times as many total requests as all the other rows. So that’s why its numbers are about ten times higher.
A Strong Baseline for Performance
This article is definitely long enough, so I won't be testing this from Ruby version 2.0 to 2.7... Yet. You can expect it soon, though!
We want to show that fiber performance has improved over time - and we'd like to see if threads or processes have changed much. So we'll test over those Ruby versions.
We also want to compare threads, processes and fibers at different levels of concurrency. This isn't a perfectly fair test. There's no such thing! But it can still teach us something useful.
And we'd also like a baseline to start looking at various "autofiber" proposals - variations on fibers that automatically yield when doing I/O so that you don't need the extra reactor wrapping for reads and writes. That simplifies the code substantially, giving something much more like the thread or process code. There are at least two autofiber proposals, one by Eric Wong and one by Samuel Williams.
Don't expect all of that for the same blog post, of course. But the background work we just did sets the stage for all of it.