How Much Do You Save With Ruby 2.7 Memory Compaction?

If you read this blog recently, you may have seen that Ruby Memory Compaction in 2.7 doesn’t affect the speed of Rails Ruby Bench much. That’s fair - RRB isn’t a great way to see that because of how it works. Indeed, Nate Berkopec and others have criticised RRB for that very thing.

I think RRB is still a pretty useful benchmark, but I see their point. It is not an example of a typical Rails deployment, and you could write a great benchmark centred around the idea of “how many requests can you process at most without compromising latency at all?” RRB is not that benchmark.

But that benchmark wouldn’t be perfect for showing off memory compaction either - and this is why we need a variety of performance benchmarks. They show different things. Thanks for coming to my TED talk.

So how would we see what memory compaction does?

If we wanted a really contrived use case, we’d show the “wedged” memory state that compaction fixes - we’d allocate about one page of objects, free all but one, do it over and over and we’d wind up with Ruby having many pages allocated, sitting nearly empty, unfreeable. That is, we could write a sort of bug reproduction showing an error in current (non-compacting) Ruby memory behaviour. “Here’s this bad thing Ruby can do in this specific weird case.” And with compaction, it doesn’t do that.

Or we could look at total memory usage, which is also improved by compaction.

Wait, What’s Compaction Again?

You may recall that Ruby divides memory allocation into tiny, small and large objects - each object is one of those. Depending on which type it is, the object will have a reference (always), a Slot (except for tiny objects) and a heap allocation (only for large objects.)

The problem is the Slots. They’re slab-allocated in large numbers. That means they’re cheap to allocate, which is good. But then Ruby has to track them. And since Ruby uses C extensions with old C-style memory allocation, it can’t easily move them around once it’s using them. Ruby deals with this by waiting until you’ve freed all the Slots in a page (that’s the slab), then freeing the whole thing.

That would be great, except… What happens if you free all but one (or a few) Slots in a page? Then you can’t free it or re-use it. It’s a big chunk of wasted memory. It’s not quite a leak, since it’s tracked, but Ruby can’t free it while there’s even a single Slot being used.

Enter the Memory Compactor. I say you “can’t easily move them around.” But with significant difficulty, a lot of tracking and burning some CPU cycles, actually you totally can. For more details I’d recommend watching this talk by Aaron Patterson. He wrote the Ruby memory compactor. It’s a really good talk.

In Ruby 2.7, the memory compactor is something you have to run manually by calling “GC.compact”. The plan (as announced in Nov 2019) is that for Ruby 3.0 they’ll have a cheaper memory compactor that can run much more frequently and you won’t have to call it manually. Instead, it would run on certain garbage collection cycles as needed.
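
If you want to try it yourself, it’s a one-liner, and GC.stat will show you the page bookkeeping before and after. Here’s a minimal sketch (the allocation churn is just for illustration):

# Churn through some short-lived objects, then compact and compare page counts.
100_000.times { Object.new }
GC.start                            # collect the garbage first
before = GC.stat(:heap_eden_pages)
GC.compact                          # squeeze live objects into fewer pages
after = GC.stat(:heap_eden_pages)
puts "eden pages before compaction: #{before}, after: #{after}"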

How Would We Check the Memory Usage?

A large, complicated Rails app (cough Discourse cough) tends to have a lot of variability in how much memory it uses. That makes it hard to measure a small-ish change. But a very simple Rails app is much easier.

If you recall, I have a benchmark that uses an extremely simple Rails app. So I added the ability to check the memory usage after it finishes, and a setting to compact memory at the end of its initialisation.

A tiny Rails app will have a lot less to compact - mostly classes and code - but it will also have a lot less variation in total memory size. Compaction or no, Ruby (like other dynamic languages) doesn’t usually free memory back to the operating system, so a lot of what we want to check is whether the total size is smaller after processing a bunch of requests.

A Rails server, if you recall, tends to asymptotically approach a memory ceiling as it runs requests. So there’s still a lot of variation in the total memory usage. But this is a benchmark, so we all know I’m going to be running it many, many times and comparing statistically. So that’s fine.

Methodology

For this post I’m using Ruby 2.7.0-preview3. That’s because memory compaction was added in Ruby 2.7, so I can’t use a released 2.6-series version. And as I write this there’s no final release of 2.7. I don’t have any reason to think compaction’s memory savings will change later, so these memory usage numbers should be accurate for 2.7 and for 3.0 also.

I’m using Rails Simpler Bench (RSB) for this (source link). It’s much simpler than Rails Ruby Bench and far more suitable for this purpose.

For now, I set an after_initialize hook in Rails to run when RSB_COMPACT is set to YES and I don’t do that when it’s set to NO. I’m using 50% YES samples and 50% NO samples, as you’d expect.
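
Concretely, the hook is tiny. Here’s a sketch of the idea - the file path is a hypothetical example and RSB’s actual code may differ a little:

# config/initializers/compaction.rb (hypothetical file name)
if ENV["RSB_COMPACT"] == "YES"
  Rails.application.config.after_initialize do
    GC.compact   # one-time compaction once Rails has finished loading
  end
end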

I run the trials in a random order with a simple runner script. It’s running Puma with a single thread and a single process - I want repeatability far more than I want speed for this. It’s hitting an endpoint that just statically renders a single string and never talks to a database or any external service. This is as simple as a Rails app gets, basically.

Each trial gets the process’s memory usage after processing all requests using Richard Schneeman’s get_process_mem gem. This is running on Linux, so it uses the /proc filesystem to check. Since my question is about how Ruby’s internal memory organisation affects total OS-level memory usage, I’m getting my numbers from Linux’s idea of RSS memory usage. Basically, I’m not trusting Ruby’s numbers because I already know we’re messing with Ruby’s tracking - that’s the whole reason we’re measuring.
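
If you haven’t used it, get_process_mem is pleasantly small. A minimal sketch of the measurement itself, assuming the gem is installed:

require "get_process_mem"

mem = GetProcessMem.new   # defaults to the current process's PID
puts mem.bytes            # RSS in bytes - on Linux this comes from the /proc filesystem
puts mem.mb               # the same figure in megabytes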

And then I go through and analyse the data afterward. Specifically, I use a simple script to read through the data files and compare memory usage in bytes for compaction and non-compaction runs.
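
The comparison doesn’t need anything fancy. Here’s a minimal sketch of that kind of processing script - the key names ("env-RSB_COMPACT", "memory_usage_bytes") are hypothetical stand-ins, not RSB’s actual schema:

require "json"

# Group RSS measurements by whether compaction was on, then print summary stats.
by_compaction = Hash.new { |hash, key| hash[key] = [] }
Dir.glob("data/*.json").each do |path|
  run = JSON.parse(File.read(path))
  setting = run["environment"]["env-RSB_COMPACT"]      # hypothetical key name
  by_compaction[setting] << run["memory_usage_bytes"]  # hypothetical key name
end

by_compaction.each do |setting, samples|
  mean = samples.sum.to_f / samples.size
  variance = samples.map { |s| (s - mean) ** 2 }.sum / samples.size
  puts "Compaction: #{setting}"
  puts "  mean: #{mean}"
  puts "  median: #{samples.sort[samples.size / 2]}"
  puts "  std dev: #{Math.sqrt(variance)}"
end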

The Trials

The first and simplest thing I found was this: this was going to take a lot of trials. One thing about statistics is that detecting a small effect can take a lot of samples. Based on my first fifty-samples-per-config trial, I was looking for a maybe half-megabyte effect in a 71-megabyte memory usage total, and around 350 kilobytes of standard deviation.

Does 350 kilobytes of standard deviation seem high? Remember that I’m measuring total RSS memory usage, which somewhat randomly approaches a memory ceiling, and where a lot of it depends on when garbage collection happened, a bit on memory layout and so on. A standard deviation of 350kb in a 71MB process isn’t bad. Also, that was just initially - the uncertainty in the measured mean shrinks as the number of samples goes up, because math.

Similarly, does roughly 500 kilobytes of memory savings seem small? Keep in mind that we’re not changing big allocations like heap allocations, and we’re also not touching cases where Slots are already working well (e.g. large numbers of objects that are allocated together and then either all kept or all freed.) The only case that makes much of a difference is where Rails (very well-tuned Ruby code) is doing something that doesn’t work well with Ruby’s memory system. This is a very small Rails app, and so we’re only getting some of the best-tuned code in Ruby. Squeezing out another half-megabyte for “free” is actually pretty cool, because other similar-sized Ruby programs probably get a lot more.

So I re-ran with 500 trials each for compaction and no compaction. That is, I ran around 30 seconds of constant HTTP requests against a server about a thousand more times, then checked the memory usage afterward. And then another 500 trials each.

Yeah, But What Were the Results?

After doing all those measurements, it was time to check the results again.

You know those pretty graphs I often put here? This wasn’t really conducive to those. Here’s the output of my processing script in all its glory:

Compaction: YES
  mean: 70595031.11787072
  median: 70451200.0
  std dev: 294238.8245869074
--------------------------------
Compaction: NO
  mean: 71162253.14068441
  median: 70936576.0
  std dev: 288533.47219640197
--------------------------------

It’s not quite as pretty, I’ll admit. But with a small amount of calculation, we see that we save around 554 kilobytes (exact: 567,222 bytes) per run, with a standard deviation of around 285 kilobytes.

Note that this does not involve ActiveRecord or several other weighty parts of Rails. This is, in effect, the absolute minimum you could be saving with a Rails app. Overall, I’ll take it.

Did you just scroll down here hoping for something easier to digest than all the methodology and caveats? That’s totally fair. I’ll just add a little summary line, my own equivalent of “and thus, the evil princess was defeated and the wizard was saved.”

And so you see, with memory compaction, even the very smallest Rails app will save about half a megabyte. And as Aaron says in his talk, the more you use, the more you save!

How Do I Use Rails Ruby Bench?

How do I do these explorations with Rails Ruby Bench? How could you do them? There’s full source code, but source code is only one piece of the story.

So today, let’s look at that. The most common way I do it is with AWS, so I’m going to describe it that way. Watch this space for a local version in later weeks!

An Experiment

Rails Ruby Bench is a benchmark, which means it’s mostly useful for experiments in the tradition of the scientific method. It exists to answer questions about performance, so it’s important that I have a question in mind. Here’s one: does Ruby’s new compacting GC make a difference in performance currently? I’ve chosen that question partly because it’s subtle - the answer isn’t clear, and Rails Ruby Bench isn’t a perfect tool for exploring it. That means there will be problems, and backtracking, and general difficulties. That’s not the best situation for easy great results, but it’s absolutely perfect for documenting how RRB works. For a benchmark you don’t want to hear about the happy path. You want to hear how to use it when things are normal-or-worse.

My hypothesis is that compacting GC will make a difference in speed but not a large one. Rails Ruby Bench tends to show memory savings as if it were extra speed, and so if compacting GC is doing a good job then it should speed up slightly. I may prove it or not - I don’t know yet, as I write this. And that’s important - you want to follow this little journey when I still don’t know because you’ll be in the same situation if you do this.

(Do I expect you to actually benchmark changes with Rails Ruby Bench? Probably a few of you. But many, many of you will want to do a benchmarking experiment at some point in your career, and those are always uncertain when you’re doing them.)

AWS Setup, Building an Image

RRB’s canonical measurements are always done using AWS. For the last two-ish years, I’ve always used m4.2xlarge dedicated instances. That’s a way to keep me honest about hardware while giving you access to the same thing I use. It does, however, cost money. I’ll understand if you don’t literally spin up new instances and follow along.

Packer starts to build your image via “packer build ami.json”

First you’ll need an image. I already have one built where I can just “git pull” a couple of things and be ready to go. But let’s assume you don’t yet, or you don’t want to use one of my public images. I don’t always keep everything up to date - and even when I do, you shouldn’t 100% trust me to. The glory of open source is that if I screw something up, you can find that out and fix it. If that happens, pull requests are appreciated.

To build an image, first check out the Rails Ruby Bench repo, then cd into the packer directory. You’ll need Packer installed. It’s software to build VM images, such as the AWS Amazon Machine Image you’ll want for Rails Ruby Bench. This lets us control what’s installed and how, a bit like Docker, but without the extra runtime overhead that Docker involves (Docker would, truthfully, be a better choice for RRB if I knew enough about setting it up and also had a canonical hardware setup for final numbers. I know just enough places where it does cause problems that I’m not confident I can get rid of all the ones I don’t know.)

Got Packer installed? Now “packer build ami.json”. This will go through a long series of steps. It will create a small, cheap AWS instance based on one of the standard Ubuntu AMIs, and then install a lot of software that Rails Ruby Bench and/or RSB want to have available at runtime. It will not install every Ruby version you need. We’ll talk about that later.

And after around an hour, you have a Packer image. It’ll print the AMI, which you’ll need.

(If you do Packer builds repeatedly you will get transient errors sometimes - a package will fail to download, an old Ubuntu package will be in a broken state, etc. In most cases you can re-run until it works, or wait a day or two. More rarely something is now broken and needs an update.)

If all goes well, you’ll get a finished Packer image. It’ll take in the neighbourhood of an hour but you can re-use the image as often as you like. Mostly you’ll rebuild when the Ubuntu version you’re using gets old enough that it’s hard to install new software, and you find a reason you need to install new software.

An Aside: “Old Enough”

Not every benchmark will have this problem, but Rails Ruby Bench has it in spades: legacy versions. Rails Ruby Bench exists specifically to measure against a baseline of Ruby 2.0.0-p0. Ruby releases a new minor version every Christmas, and so that version of Ruby is about to turn seven years old, or more than five years older than my youngest kid. It is not young software as we measure it, and it’s hard to even get Ruby 2.0 to compile on Mac OS any more.

Similarly, the version of Discourse that I use is quite old and so are all its dependencies. Occasionally I need to do fairly gross code spelunking to get it all working.

If you have ordinary requirements you can avoid this. Today’s article will restrict itself to 2.6- and 2.7-series Ruby versions. But keep in mind that if you want to use RRB for its intended purpose, sometimes you’re going to have an ugly build ahead of you. And if you want to use RRB for modern stuff, you’re going to see a lot of little workarounds everywhere.

If you ask, “why are you using that Ubuntu AMI? It’s pretty old,” the specific answer is “it has an old enough Postgres to be compatible with the ancient Discourse gems, including the Rails version, while it’s new enough that I can install tools I experiment with like Locust.” But the philosophical answer is closer to “I upgrade it occasionally when I have to, but mostly I try to keep it as a simple baseline that nearly never changes.”

In general, Rails Ruby Bench tries not to change because change is a specific negative in a benchmark used as a baseline for performance. But I confess that I’m really looking forward to Christmas of 2020 when Ruby 3x3 gets released and Ruby 2.0 stops being the important baseline to measure against. Then I can drop compatibility with a lot of old gems and libraries.

You’ll also sometimes notice me gratuitously locking things down, such as the version of the Bundler. It’s the same basic idea. I want things to remain as constant as they can. That’s not 100% possible - for instance, Ubuntu will automatically add security fixes to older distributions, so there’s no equivalent of a Gemfile.lock for Ubuntu. They won’t let you install old insecure versions for more compatibility, though you can use an old AMI for a similar result. But where I can, I lock the version of everything to something specific.

Starting an Image

If you built the AMI above then you have an AMI ID. It’ll look something like this: ami-052d56f9c0e718334. In fact, that one’s a public AMI I built that I’m using for this post. If you don’t want to build your own AMI you’re welcome to use mine, though it may be a bit old by the time you need to do this.

If you like the AWS UI more than the AWS command-line tools (they’re both pretty bad), then you can just start an instance in the UI. But in case you prefer the command-line tools, here’s the invocation I use:

aws ec2 run-instances --count 1 --instance-type m4.2xlarge --key-name noah-packer-1 --placement Tenancy=dedicated --image-id ami-052d56f9c0e718334 --tag-specifications 'ResourceType=instance,Tags=[]'

Dismal, isn’t it? I also have a script in the RRB repo to launch instances from my most recent AMI. That’s where this comes from. Also, you’ll need your own keypair since your AWS account doesn’t have a key called noah-packer-1.

You’ll need to look up the IP address for the instance, and eventually you’ll want the instance ID in order to terminate it. I’m going to trust you to do those things - do make sure to terminate the instance. Dedicated m4.2xlarges are expensive!

Exploration

Once you have the AMI and you can in theory start the AMI, it’s time to think about the actual experiment: what does GC compaction do relative to Rails Ruby Bench? And how will we tell?

In this case, we’re going to run a number of Ruby versions with compaction on and off and see how it changes the speed of Rails Ruby Bench, which means running it a lot on different Ruby versions with different compaction settings.

To gather data, you generally need a runner script of some kind. You’re going to be running Rails Ruby Bench many times and it would be silly (and error-prone!) to do it all by hand.

First, here’s a not-amazing runner script of the kind I used for a while:

#!/bin/bash -l

# Show commands, break on error
set -e
set -x

rvm use 2.6.5
bundle

for i in $(seq 1 30); do
  bundle exec ./start.rb -i 10000 -w 1000 -s 0 --no-warm-start -o data/
done

rvm use 2.7.0-preview2
bundle

for i in $(seq 1 30); do
  bundle exec ./start.rb -i 10000 -w 1000 -s 0 --no-warm-start -o data/
done

It’s… fine. But it shows you that a runner script doesn’t have to be all that complicated. It runs bash with -l for login so that rvm is available. It makes sure to break on error - modern Ruby doesn’t get a lot of errors in Discourse, but you do want to know if it happens. And then it runs 30 trials each on Ruby 2.6.5 and Ruby 2.7.0-preview2, each with 10,000 HTTP requests and 1,000 warmup (untimed) HTTP requests, with the default number of processes (10) and threads per process (6).

With this runner script you’re better off using a small number of iterations (30 is large-ish) and running it repeatedly. That way a transient slowdown doesn’t look like it’s all a difficulty with the same Ruby. In general, you’re better off running everything multiple times if you can, and I often do. All the statistics in the world won’t stop you from doing something stupid, and reproducing everything is one way to make sure you didn’t do some kinds of stupid things. At least, that’s something I do to reduce the odds of me doing stupid things.

There’s a better runner to start from now in Rails Ruby Bench. The main difference is that it runs all the trials in a random order, which helps with that “transient slowdown” problem. For GC compaction we’ll want to modify it to run with and without GC compaction for Rubies that have it (2.7-series Rubies) and only with no compaction for 2.6-series Rubies. Here’s what the replacement loop for that looks like:

commands = []
RUBIES.each do |ruby|
  TESTS.each_with_index do |test, test_index|
    invocation_wc = "rvm use #{ruby} && #{WITH_COMPACT} && export RUBY_RUNNER_TEST_INDEX=#{test_index} && #{test}"
    invocation_nc = "rvm use #{ruby} && #{NO_COMPACT} && export RUBY_RUNNER_TEST_INDEX=#{test_index} && #{test}"
    if ruby["2.6."]  # Ruby is 2.6-series?
      commands.concat([invocation_nc] * TIMES)
    else
      commands.concat([invocation_nc, invocation_wc] * TIMES)
    end
  end
end

It’s not simple, but it’s not rocket science. The WITH_COMPACT and NO_COMPACT snippets are already in the runner because it’s not necessarily obvious how to do that - I like to keep that kind of thing around too. But in general you may need some kind of setup code for an experiment, so remember to remove it for the runs that shouldn’t have it. In this case, there’s not a “compaction setting” for Ruby proper, we just run GC.compact manually in an initialiser script. So those snippets create or remove the initialiser script.

The compaction snippets also set an environment variable, RUBY_COMPACT=YES (or NO.) That doesn’t do anything directly. Instead, RRB will remember any environment variable that starts with RUBY for the run so you can tell which is which. I might have done an overnight run and messed that up the first time and had to re-do it because I couldn’t tell which data was which… But in general, if an environment variable contains RUBY or GEM, Rails Ruby Bench will assume it might be an important setting and save a copy with the run data.

For each experiment, you’ll want to either change the runner in-place or create a new one. In either case, it’s just a random script.

I also changed the RUBIES variable to include more Rubies. But first I had to install them.

More Rubies

There are two kinds of Ruby versions you’ll sometimes want to test: prebuilt and custom-built. When I’m testing ordinary Ruby versions like 2.6.0, 2.6.5 or 2.7.0-preview2, I’ll generally just install them with RVM after I launch my AWS instance. A simple “rvm install 2.6.5” and we’re up and running. The new runner script will install the right Bundler version (1.17.3) and the right gems to make sure RRB will run properly. That can be important when you’re testing four or five or eight different Ruby versions - it’s easy to forget to “bundle _1.17.3_ install” for each one.

If you want to custom-build Ruby, there’s slightly more to it. The default Packer build creates one head-of-master custom build, but of course that’s from whenever the Packer image was built. You may want one that’s newer or more specific.

You’ll find a copy of the Ruby source in /home/ubuntu/rails_ruby_bench/work/mri-head. You’ll also find, if you run “rvm list”, that there’s an ext-mri-head the same age as that checkout. But let’s talk about how to make another one.

We’re exploring GC compaction today, so I’m interested in specific changes to Ruby’s gc.c. If you check the list of commits that changed the file, there’s a lot there. For today, I’ve chosen a few specific ones: 8e743f, ffd082 and dddf5a. There’s nothing magical about these. They’re changes to gc.c, a reasonable distance apart, that I think might have some kind of influence on Ruby’s speed. I could easily have chosen twenty others - but don’t choose all twenty because the more you choose, the slower testing goes. Also, with GC compaction I know there are some subtle bugs that got fixed so the commits are all fairly recent. I don’t particularly want crashes here if I can avoid them. They’re not complicated to deal with, but they are annoying. Worse, frequent crashes usually mean no useful data since “fast but crashy” means that version of Ruby is effectively unusable. Not every random commit to head-of-master would make a good release.

For each of these commits I follow a simple process. I’ll use 8e743f to demonstrate.

  1. git checkout 8e743f

  2. mkdir -p /home/ubuntu/ruby_install/8e743f

  3. ./configure --prefix=/home/ubuntu/ruby_install/8e743f (you may need to autoconf first so that ./configure is available)

  4. make clean (in case you’re doing this multiple times)

  5. make && make install

  6. rvm mount -n mri-pre-8e743f /home/ubuntu/ruby_install/8e743f

You could certainly make a script for this, though I don’t currently install one to the Packer image.
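
If you do, a few lines of Ruby wrapping those steps is plenty. Here’s a hypothetical sketch, using the same paths as above - run it from inside the Ruby source checkout:

#!/usr/bin/env ruby
# build_and_mount.rb <commit_sha> - hypothetical helper wrapping the steps above.
sha = ARGV[0] || abort("Usage: build_and_mount.rb <commit_sha>")
prefix = "/home/ubuntu/ruby_install/#{sha}"

def run(cmd)
  puts cmd
  system(cmd) || abort("Command failed: #{cmd}")
end

run "git checkout #{sha}"
run "mkdir -p #{prefix}"
run "autoconf" unless File.exist?("configure")
run "./configure --prefix=#{prefix}"
run "make clean"
run "make && make install"
# rvm is a shell function, so go through a login shell for the mount step.
run "bash -lc 'rvm mount -n mri-pre-#{sha} #{prefix}'"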

And then you’ll need to use these once you’ve built them. Here’s what the top of my runner script looks like:

RUBIES = [
  "2.6.0",
  "2.6.5",
  "ext-mri-head",  # Since I have it sitting around
  "ext-mri-pre-8e743f",
  "ext-mri-pre-ffd082",
  "ext-mri-pre-dddf5a",
]

Nothing complicated in RUBIES, though notice that rvm tacks on an “ext-” on the front of mounted Rubies’ names.

How Does It Run?

If all goes well, the next part is underwhelming. Now we actually run it. I’m assuming you’ve done all the prior setup - you have an instance running with Rubies installed, you have a runner script and so on.

First off, you can just run the runner from the command line, something like “./runner.rb”. In fact I’d highly recommend you do that first, possibly with only an iteration or two of each configuration, just to make sure everything is working fine. If you have a Ruby installation that doesn’t work, or a Rails version that doesn’t work with a gem you added, or a typo in code somewhere, you want to find that out before you leave it alone for eight hours to churn. In RRB’s runner you can change TIMES from 30 down to something reasonable like 2 (why not 1? I sometimes get config bugs after some piece of configuration is done, so 2 iterations is a bit safer.)

If it works, great! Now you can set TIMES back to something higher. If it doesn’t, now you have something to fix.

You can decide whether to keep the data around from that first few iterations - I usually don’t. If you want to get rid of it then delete /home/ubuntu/rails_ruby_bench/data/*.json so that it doesn’t wind up mixed with your other data.

You can just run the runner from the command line, and it will usually work fine. But if you’re worried about network latency or dropouts (my residential DSL isn’t amazing) then there’s a better way.

Instead, you can run “nohup ./runner.rb &”. That tells the shell not to kill your process if your network connection goes away. It also says to run it in the background, which is a good thing. All the output will go into a file called nohup.out.

If you need to check progress occasionally, you can run “tail -f nohup.out” to show the output as it gets printed. And doing a quick “ls /home/ubuntu/rails_ruby_bench/data/*.json | wc -l” will tell you how many data files have completed. Keep in mind that the runner scripts and RRB itself are designed to crash if anything goes wrong - silent failure is not your friend when you collect benchmark data. But an error like that will generally be in the log.

Processing the Result

# A cut-down version of the JSON raw data format
{
  "version": 3,
  "settings": {
    "startup_iters": 0,
    "random_seed": 16541799507913229037,
    "worker_iterations": 10000,
    (More settings...)
  },
  "environment": {
    "RUBY_VERSION": "2.7.0",
    "RUBY_DESCRIPTION": "ruby 2.7.0dev (2019-11-22T20:42:24Z v2_7_0_preview3~5 8e743fad4e) [x86_64-linux]",
    "rvm current": "ext-mri-pre-8e743f",
    "rails_ruby_bench git sha": "1bba9dbeaa1e02684d8c2ca8a8f9100c90506d5c\n",
    "ec2 instance id": "i-0cf628df3200d5ad5",
    "ec2 instance type": "m4.2xlarge",
    "env-GEM_HOME": "/home/ubuntu/.rvm/gems/ext-mri-pre-8e743f",
    "env-MY_RUBY_HOME": "/home/ubuntu/.rvm/rubies/ext-mri-pre-8e743f",
    "env-rvm_ruby_string": "ext-mri-pre-8e743f",
    "env-RUBY_VERSION": "ext-mri-pre-8e743f",
    "env-RUBYOPT": "-rbundler/setup",
    "env-RUBYLIB": "/home/ubuntu/.rvm/gems/ext-mri-pre-8e743f/gems/bundler-1.17.3/lib",
    (More settings...)
  },
  "warmup": {
    "times": [
      [
        0.177898031,
        0.522202063,
        0.706261902,
        0.372002397,

If you’ve done everything so far, now you have a lot of large JSON files full of data. They’re pretty straightforward, but it’s still easier to use a processing script to deal with them. You’d need a lot of quality time with a calculator to do it by hand!

I do this a lot, so there’s a data-processing script in the Rails Ruby Bench repo that can help you.

First, copy your data off the AWS instance to somewhere cheaper. If you’re done with the instance, this is a decent time to terminate it. Then, copy the RRB script called process.rb to somewhere nearby. You can see this same setup repeatedly in my repository of RRB data. I also have a tendency to copy graphing code into the same place. Copying, not linking, means that the version of the data-processing script is preserved, warts and all, so I know later if something was screwed up with it. The code is small and the data is huge so it’s not a storage problem.

Now, figure out how you’re going to divide up the data. For instance, for this experiment we care which version of Ruby and whether we’re compacting. We can’t use the RUBY_VERSION string because all those pre-2.7.0 Rubies say they’re 2.7.0. But we can use ‘rvm current’ since they’re all mounted separately by RVM.

I handle environment variables by prefixing them with “env” - that way there can’t be a conflict between RUBY_VERSION, which is a constant that I save, with an environment variable of the same name.

The processing script takes a lot of data, divides it into “cohorts”, and then shows information for each cohort. In this case, the cohorts will be divided by “rvm current” and “env-RUBY_COMPACT”. To make the process.rb script do that, you’d run “process.rb -c ‘rvm current,env-RUBY_COMPACT’”.

It will then print out a lot of chunks of text to the console while writing roughly the same thing to another JSON file. For instance, here’s what it printed about one of them for me:

Cohort: rvm current: ext-mri-pre-8e743f, env-RUBY_COMPACT: YES, # of data points: 600000 http / 0 startup, full runs: 60
   0%ile: 0.00542679
   1%ile: 0.01045952148
   5%ile: 0.0147234587
  10%ile: 0.0193235859
  50%ile: 0.1217705375
  90%ile: 0.34202113749999996
  95%ile: 0.4023132304000004
  99%ile: 0.53301011523
  100%ile: 1.316529161
--
  Overall thread completion times:
   0%ile: 44.14102196700001
  10%ile: 49.34424536089996
  50%ile: 51.769418454499984
  90%ile: 54.03600075760001
  100%ile: 56.40413652299999
--
  Throughput in reqs/sec for each full run:
  Mean: 187.45566524151448 Median: 188.96162032049574 Variance: 16.072435858651925
  [177.2919614844611, 178.24351344183614, 180.07540051803122, 180.3893011741887, 180.64734390789422, 180.78633357692414, 180.9370756562659, 181.48759316874003, 181.50042200695788, 181.7831931840077, 181.82136366559922, 182.42668523798133, 182.9695378281489, 183.4271937021401, 183.69630166389499, 185.39624590894704, 186.6188358046953, 186.72653137536867, 187.41516559992874, 187.44972315610178, 187.79211195172797, 188.03560095362238, 188.04550491676113, 188.16079648567523, 188.47720218882668, 188.57493052728336, 188.77093032659823, 188.7810661284267, 188.82632914724448, 188.9600070136181, 188.96323362737334, 189.05603777953803, 189.07694018310067, 189.09085709051078, 189.3054218996176, 189.42953673775793, 189.67879103436863, 189.68938987320993, 189.70449808150627, 189.7789255152989, 189.79846786458847, 189.89027249507834, 189.90364836070546, 189.98443889440762, 190.0304216448691, 190.2516551068254, 190.43172176734097, 190.51420115472305, 190.56095325134356, 190.56496123229778, 190.70854487422903, 190.7499088018249, 190.94577669990025, 191.0250241857314, 191.2679317071894, 191.39842651014004, 191.44203815980674, 191.94534584952945, 193.16205400859081, 193.47628839756382]

--
  Startup times for this cohort:
  Mean: nil Median: nil Variance: nil

What you see there is the cohort for Ruby 8e743f with compaction turned on. I ran start.rb sixty times in that configuration (two batches of 30, random order), which gave 600,000 data points (HTTP requests.) It prints what cohort it is in (the values of “rvm current” and “env-RUBY_COMPACT”). If your window is wide enough you can see that it prints the number of full runs (60) and the number of startups (0). If you check the command lines up above we told it zero startup iterations, so that makes sense.

The top batch of percentiles are for individual HTTP requests, ranging from about 0.005 seconds to around half a second for very slow requests, to 1.3 seconds for one specific very slow request (the 100th-percentile request.) The next batch of percentiles are the “thread completion times”: the load tester divides the 10,000 requests into buckets and runs them through in parallel - in this case, each load-tester is running with 30 threads, so that’s about 333 consecutive requests each, normally taking in the neighbourhood of 52 seconds for the whole bunch.

You can also just treat it as one giant 10,000-request batch and time it end-to-end. If you do that you get the “throughput in reqs/sec for each full run” above. Since that happened 60 times, you can take a mean or median for all 60. Data from Rails Ruby Bench generally has a normal-ish distribution, resulting in the mean and median being pretty close together - 187.5 versus 189.0 is pretty close, particularly with a variance of around 16 (which means the standard deviation is close to 4, since standard deviation is the square root of variance.)

If you don’t believe me about it being normal-ish, or you just want to check if a particular run was weird, you’ll also get all the full-run times printed out one after the other. That’s sixty of them in this case, so I expect they run off the right side of your screen.

All this information and more also goes into a big JSON file called process_output.json, which is what I use for graphing. But just for eyeballing quickly, I find process.rb’s console output to be easier to skim. For instance, the process_output.json for all of this (ten cohorts including compaction and no-compaction) runs to about six million lines of JSON text and includes the timing of all 600,000 HTTP requests by cohort, among other things. Great for graphing, lousy for quick skimming.

But What’s the Answer?

I said I didn’t know the answer when I started writing this post - and I didn’t. But I also implied that I’d find it out, and I’ve clearly run 600,000 HTTP requests’ worth of data gathering. So what did I find?

Um… That the real memory compaction is the friends we made along the way?

After running all of this for a couple of days, the short answer is “nothing of statistical significance.” I still see Ruby 2.6.5 being a bit slower than 2.6.0, like before, but close enough that it’s hard to be sure - it’s within about two standard deviations. But the 2.7.0 prereleases are slightly faster than 2.6. And turning compaction on or off makes essentially no difference whatsoever. I’d need to run at least ten times as many samples as this to see statistical significance for differences this small. So if there’s a difference between 2.7 Rubies, or with compaction, at all, it’s quite small.

And that, alas, is the most important lesson in this whole long post. When you don’t get statistical significance, and you’ve checked that you did actually change the settings (I did), the answer is “stop digging.” You can run more samples (notice that I told you to use 30 times and I gave data for 60 times?). You can check the data files (notice that I mentioned throwing away an old run that was wrong?) But in the end, you need to expect “no result” as a frequent answer. I have started many articles like this, gotten “no result” and then either changed direction or thrown them away.

But today I was writing about how to use the tools! And so I get a publishable article anyway. Alas, that trick only works once.

If you say to yourself, “self, this seems like a lot of data to throw away,” you’re not wrong. Keep in mind that there are many tricks that would let you see little or no difference with a small run before doing something large like this. Usually you should look for promising results in small sets and only then reproduce them as a larger study. There are whole fields of study around how to do studies and experiments.

But today I was showing you the tools. And not infrequently, this is what happens. And so today, this is what you see.

Does this mean Ruby memory compaction doesn’t help or doesn’t work? Nope. It means that any memory it saves isn’t enough to show a speed difference in Rails Ruby Bench — but that’s not really what memory compaction is for, even if I wanted to know the result.

Memory compaction solves a weird failure case in Ruby where a single Ruby object can keep a whole page from being freed, resulting in high memory usage for no reason… But Rails Ruby Bench doesn’t hit that problem, so it doesn’t show that case. Basically, memory compaction is still useful in the failure cases it was designed for, even if Rails Ruby Bench is already in pretty good shape for memory density.

Symbol#to_s Returned a Frozen String in Ruby 2.7 previews - and Now It Doesn’t

How a Broken Interface Getting Fixed Showed Us That It's Broken

One of the things I love about Ruby is the way its language design gets attention from many directions and many points of view. A change in the Ruby language will often come from the JRuby side, such as this one proposed by Charles Nutter. Benoit Daloze (a.k.a. eregon), the now-lead of TruffleRuby is another major commenter. And of course, you’ll see CRuby-side folks including Matz, who is still Ruby’s primary language designer.

That bug has some interesting implications… So let’s talk about them a bit, and how an interface not being perfectly thought out at the beginning often means that fixing it later can have difficulties. I’m not trying to pick on the .to_s method, which is a fairly good interface in most ways. But all of Ruby started small and has had to deal with more and more users as the language matures. Every interface has this problem at some point, as its uses change and its user base grows. This is just one of many, many good examples.

So… What’s This Change, Then?

You likely know that in Ruby, when you call .to_s on an object, it’s supposed to return itself “translated” to a string. For instance if you call it on the number 7 it will return the string “7”. Or if you call it on a symbol like :bob it will return the string “bob”. A string will just return itself directly with no modifications.

There are a whole family of similar “typecast” methods in Ruby like to_a, to_hash, to_f and to_i. Making it more complicated, most types have two typecast operators, not one. For strings those are to_s and to_str, while for arrays it’s to_a and to_ary. For the full details of these operators, other ways to change types and how they’re all used, I highly recommend Avdi Grimm’s book Confident Ruby, which can be bought, or ‘traded’ for sending him a postcard! In any case, take my word for it that there are a bunch of “type conversion operators,” and to_s is one of them.
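
If the two-method distinction is new to you, a couple of lines of irb output show the difference:

7.to_s                    # => "7"
7.respond_to?(:to_str)    # => false - an Integer isn't "string-like" enough for to_str
"7".to_str                # => "7" - a real String implements both
:bob.to_s                 # => "bob"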

In Ruby 2.7-preview2, a random Ruby prerelease, Symbol#to_s started returning a frozen string, which can’t be modified. That breaks a few pieces of code. That’s how I stumbled across the change — I do speed-testing on pretty ancient Ruby code regularly, so there are a lot of little potential problems that I hit.
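
You can see the change itself in a couple of lines - this assumes a Ruby where Symbol#to_s returns a frozen string, as those previews briefly did:

name_string = :title=.to_s
name_string.frozen?       # => true on those previews, false on released Rubies
name_string.chomp!("=")   # => raises FrozenError on those previews; returns "title" otherwise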

But Why Is That a Problem?

When would that break something? When somebody calls #to_s and then messes with the result, mostly. Here’s the code that I had trouble with, from an old version of ActiveSupport:

    def method_missing(name, *args)
      name_string = name.to_s
      if name_string.chomp!("=")
        self[name_string] = args.first
      else
        bangs = name_string.chomp!("!")

        if bangs
          self[name_string].presence || raise(KeyError.new(":#{name_string} is blank"))
        else
          self[name_string]
        end
      end
    end

So… Was this a perfectly okay way to do it, broken by a new change? Oooooh… That’s a really good question!

Here are some more good questions that I, at least, didn’t know the answers to offhand:

  • If a string usually just returns itself, is it okay that modifying the string also modifies the original?

  • Is it a problem, optimisation-wise, to keep allocating new strings every time? (Schneems had to work around this)

  • If you freeze the string, which freezes the original, is that okay?

These are hard questions, not least because fixing question #1 in the obvious way probably breaks question #2 and vice-versa. And question #3 is just kind of weird - is it okay to stop this behaviour part way through? Ruby makes it possible, but that’s not what we care about, is it?

I mention this interface, to_s, “not being perfectly thought out” up at the top of this post. And this is what I mean. to_s is a limited interface that does some things really well, but it simply hasn’t been thought through in this context. That’s true of any interface - there will always be new uses, new contexts, new applications where it either hasn’t been thought about or the original design was wrong.

“Wrong?” Isn’t that a strong statement? Not really. Charles Nutter points out that the current design is simply unsafe in the way we’re using it - it doesn’t guarantee what happens if you modify the result, or decide whether it’s legal to do so. And people are, in fact, modifying its result. If they weren’t then we could trivially freeze the result for safety and optimisation reasons and nobody would notice or care (more on that below.)

Also, we’ll know in the future, not just for to_s but for conversion methods in general - it’s not safe to modify their results. I doubt that to_s is the only culprit!

Many Heads and a Practical Answer

In the specific Ruby 2.7 sense, we have an answer. Symbol#to_s returned a frozen string and some code broke. Specifically, the answer to “what broke?” seems to be “about six things, some of them old or obscure.” But this is what trying something out in a preview is for, right? If it turns out that there are problems with it, we’re likely to find them before the final release of 2.7 and we can easily roll this back. Such things have happened before, and will again.

(In fact, it did happen. The release 2.7.0 won’t do this, and they’re rethinking the feature. It may come back, or may change and come back in a different form. The Ruby Core Team really does try to keep backward compatibility where they can.)

In the meantime, if you’re modifying the result of calling to_s, I recommend you stop! Not only might the language break that (or not) later, but you’re already given no guarantees that it will keep working! In general, don’t trust the object you get back from a conversion method to be a modifiable copy. It might be frozen, or worse, modifying it might also modify the original object… And it isn’t guaranteed to do either, or to keep doing whatever it does now.

And so the march of progress digs up another problem for us, and we all learn another little bit of interface design together.

Ruby 2.7.0's Rails Ruby Bench Speed is Unchanged from 2.6.0

As of the 25th of December, 2019 we have a released version of Ruby 2.7.0. As you can read in the title - it’s basically the same as 2.6.0.

The 2.7.0 series is remarkable in how little the speed has changed: performance has been very stable across the whole prerelease series. I’ve seen a tiny bit of drift in Rails Ruby Bench results, sometimes as much as 1%-2%, but no more.

The other significant news is also not news: JIT performance is nearly entirely unchanged for Rails apps from 2.6.0. I don’t recommend using CRuby’s MJIT for Rails, and neither does Takashi Kokubun, MJIT’s primary maintainer.

I have a lot of data files to this effect, but… The short version is that, when I run 150 trials of 10,000 HTTP requests each for 2.6.0 versus 2.7.0, the results are well within the margin of error on the measurement. With JIT the results aren’t quite that close, but it’s the same to within a few percent - which means you still shouldn’t turn on JIT for a large Rails app.

I spent some time trying to see if there was a small speedup anywhere in the 2.7 previews that we might have had and missed - there are speed differences of about that size between the fastest and slowest prerelease 2.7 Rubies, which is still very, very small as a span of speeds. And as far as I can tell, no individual change has made a large speed difference, not even 2%. There’s just a very slow drift over time.

Does that mean that Ruby has gotten as fast as it can? Not at all.

Vladimir Makarov (the original author of CRuby’s MJIT) is still working on Mir, a new style of Ruby JIT. Takashi Kokubun is still tuning the existing JIT. I’ve heard interesting things about work from Koichi Sasada on significant reworks of VM subsystems. There are new features happening, and we now have memory compaction.

But I think that at this point, we can reasonably say that the low-hanging performance fruit has been picked. Most speedups from here are going to be more effort-intensive, or require significant architectural changes.

More Fiber Benchmarking

I’ve been working with Samuel Williams a bit (and on my own a bit) to do more benchmarking of Fiber speeds in Ruby, comparing them to processes and threads. There’s always more to do! Not only have I been running more trials for each configuration (get that variance down!), I also tried out a couple more configurations of the test code. It’s always nice to see what works well and what doesn’t.

New Configurations and Methodology

Samuel pointed out that for threads, I could run one thread per worker in the master process, for a total of 2 * workers threads instead of using IO.select in a single thread in the master. True! That configuration is less like processes but more like fibers, and is arguably a fairer representation of a ‘plain’ thread-based solution to the problem. It’s also likely to be slower in at least some configurations since it requires twice as many threads. I would naively expect it to perform worse for lack of a good centralised place to coordinate which thread is working next. But let’s see, shall we?

Samuel also put together a differently-optimised benchmark for fibers, one based on read_nonblock. This is usually worse for throughput but better for latency. A nonblocking implementation can potentially avoid some initial blocking, but winds up much slower on very old Ruby when read_nonblock was unusably slow. This benchmark, too, has an interesting performance profile that’s worth a look.

I don’t know if you remember from last time, but I was also doing something fairly dodgy with timing - I measured the entire beginning-to-end process time from outside the Ruby process itself. That means that a lot of process/thread/fiber setup got ‘billed’ to the primitive in question. That’s not an invalid way to benchmark, but it’s not obviously the right thing.

As a quick spoiler on that last one: process setup takes between about 0.3 and 0.4 seconds for everything - running Ruby, setting up the IO pipes, spawning the workers and all. And there’s barely any variation in that time between threads vs processes vs fibers. The main difference between “about 0.3” and “about 0.4” seconds is whether I’m spawning 10 workers or 1000 workers. In other words, it basically didn’t turn out to matter once I actually bothered to measure - which is good, and I expected, but it’s always better to measure than to expect and assume.

I also put together a fairly intense runner script to make sure everything was done in a random order - one problem with long tests is that if something changes significantly (the Amazon hardware, some network connection, a background process to update Ubuntu packages…) then a bunch of highly-correlated tests all have the same problem. Imagine if Ubuntu started updating its packages right as the fiber tests began, and then stopped as I switched to thread tests. It would look like fibers were very slow and prone to huge variation in results! I handle this problem for my important results by re-running lots of tests when it’s significant… But I’m not always 100% scrupulous, and I’ve been bitten by this before. There’s a reason I can tell you the specifics of the problem, right? A nice random-order runner doesn’t keep background delays from happening, but they keep them from all being in the same kind of test. Extra randomly-distributed background noise makes me think, “huh, that’s a lot of variance, maybe this batch of test runs is screwy,” which is way better than if I think, “wow, fibers really suck.”

So: the combination of 30 test-runs per configuration rather than 10 and running them in a random order is a great way to make sure my results are basically solid.

I’ve also run with the October 18th prerelease version of Ruby 2.7… And the performance is mostly just like the tested 2.6. A little faster, but barely. You’ll see the graphs.

Threaded Results

Since we have two new configurations, let’s start with one of them. The older thread-based benchmark used IO.select and the newer one uses a lot of threads. In most languages, I’d now comment how the “lot of threads” version needs extra coordination — but Ruby’s GIL turns out to handle that for us nicely without further work. There are advantages to having a giant, frequently-used lock already in place!

I had a look at the data piecemeal, and yup, on Linux I saw about what I expected to for several of the runs. I saw some different things on my Mac, but Mac can be a little weird for Ruby performance, zigging when Linux zags. Overall we usually treat Linux as our speed-critical deployment platform in the English-speaking world - because who runs their production servers on Mac OS?

Anyway, I put together the full graph… Wait, what?

Y Axis is the time in seconds to process 100,000 messages with the given number of threads

That massive drop-off at the end… That’s a good thing, no question, but why is thread contention suddenly not a problem in this case when it was for the previous six years of Ruby?

The standard deviation is quite low for all these samples. The result holds for the other numbers of threads I checked (5 and 1000), I just didn’t want to put eight heavily-overlapped lines on the same graph - but the numbers are very close for those, too.

I knew these were microbenchmarks, and those are always a bit prone to large changes from small changes. But, uh, this one surprised me a bit. At least it’s in a good direction?

Samuel is looking into it to try to find the reason. If he gets back to me before this gets published, I’ll tell you what it is. If not, I guess watch his Twitter feed if you want updates?

Fibrous Results

Fibers sometimes take a little more code to do what threads or processes manage. That should make sense to you. They’re a higher-performance, lower-overhead method of concurrency. That sometimes means a bit more management and hand-holding, and they allow you to fully control the fiber-to-fiber yield order (manual control) which means you often need to understand that yield order (no clever unpredictable automatic control.)
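
If you haven’t used fibers directly, here’s a tiny self-contained illustration of that manual control - nothing runs until you resume it, and you pick the order:

f1 = Fiber.new { puts "f1, part 1"; Fiber.yield; puts "f1, part 2" }
f2 = Fiber.new { puts "f2, part 1"; Fiber.yield; puts "f2, part 2" }

f1.resume   # prints "f1, part 1", then f1 pauses at its Fiber.yield
f2.resume   # prints "f2, part 1"
f2.resume   # prints "f2, part 2" - we choose the order; no scheduler chooses for us
f1.resume   # prints "f1, part 2"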

Samuel Williams, who has done a lot of work on Ruby’s fiber improvements and is the author of the Falcon fiber-based application server, saw a few places to potentially change up my benchmark and how it did things with a little more code. Awesome! The changes are pretty interesting - not so much an obvious across-the-board improvement as a somewhat subtle tradeoff. I choose to interpret that as a sign that my initial effort was pretty okay and there wasn’t an immediately obvious way to do better ;-)

He’s using read_nonblock rather than straight-up read. This reduces latency… but isn’t actually amazing for bandwidth, and I’m primarily measuring bandwidth here. And so his code would likely be even better in a latency-based benchmark. Interestingly, read_nonblock had horrifically bad performance in really old Ruby versions, partly because of using exception handling for its flow control - a no-no in nearly any language with exceptions.
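
To see why that was painful, here’s a small self-contained example of the old exception-based dance versus the exception: false form that Ruby 2.3 added:

r, w = IO.pipe

# Before Ruby 2.3, "no data yet" could only be signalled with an exception:
begin
  r.read_nonblock(1024)
rescue IO::WaitReadable
  # wait and retry - raising and rescuing like this is slow as flow control
end

# Ruby 2.3+ can return a symbol instead of raising:
r.read_nonblock(1024, exception: false)   # => :wait_readable

w.puts "hello"
r.read_nonblock(1024)                     # => "hello\n" once data is available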

You can see the code for the original simpler benchmark versus his version with changes here.

It turns out that the resulting side by side graph is really interesting. Here, first look for yourself:

Red and orange are the optimised version, while blue and green are the old simple one.

You already know that read_nonblock is very slow for old Ruby. That’s why the red and orange lines are so high (bad) for Ruby until 2.3, but then suddenly get faster than the blue and green lines for 2.3 and 2.4.

You may remember in my earlier fiber benchmarks that the fiber performance has a sort of humped curve, with 2.0 being fast, 2.3 being slow and 2.6 eventually getting faster than 2.0. The blue and the green lines are a re-measurement of the exact same thing and so have pretty much exactly the same curve as last week. Good. You can see an echo of the same thing in the way the red and orange lines also get slower for 2.2.10, though it’s obscured by the gigantic speedup to read_nonblock in 2.3.8.

By 2.5, all the samples are basically in a dead heat - close enough that none of them are really outside the range of measurement error of each other. And by 2.6.5, suddenly the simple versions have pulled ahead, but only slightly.

One thing that’s going on here is that read_nonblock has a slight disadvantage compared to blocking I/O in the kind of test I’m doing (bandwidth more than latency.) Another thing that’s going on is that microbenchmarks give large changes with small differences in which operations are fast.

But if I were going to tell one overall story here, it’s that recent Ruby is clearly winning over older Ruby. So our normal narrative applies here too: if you care about the speed of these things, upgrade to the latest stable Ruby or (occasionally, in specific circumstances) later.

Overall Results

The basic conclusions from the previous benchmarks also still hold. In no particular order:

  • Processes get a questionably-fair boost by stepping around the Global Interpreter Lock

  • Threads and Fibers are both pretty quick, but Fibers are faster where you can use them

  • Processes are extremely quick, but in large numbers will eat all your resources; don’t use too many

  • For both threads and fibers, upgrade to a very recent Ruby for best speed

I’ll also point out that I’m doing very little here - in practice, a lot of this will depend on your available memory. Processes can get very memory-hungry very quickly. In that case, you may find that having only one copy of your in-memory data by using threads or fibers is a huge win… At least, if you’re not doing too much calculation and the GIL messes you up.

See why we have multiple different concurrency primitives? There truly isn’t an easy answer to ‘which is best.’ Except, perhaps, that Matz is “not a threading guy” (still true) - and we don’t prefer threads in CRuby. Processes and Fibers are both better where they work.

(Please note that these numbers, and these attitudes, can be massively different in different Ruby implementations - as they certainly are in JRuby!)

JIT and Ruby's MJIT

Arthur Rackham explains Ruby debugging

If you already know lots about JIT in general and Ruby’s MJIT in particular… you may not learn much new in this post. But in case you wonder “what is JIT?” or “what is MJIT?” or “what’s different about Ruby’s JIT?” or perhaps “why in the world did they decide to do THAT?”…

Well then, perhaps I can help explain!

Assisting me in this matter will be Arthur Rackham, famed early-twentieth-century children’s illustrator whose works are now in the public domain. This whole post is adapted from slides to a talk I gave at Southeast Ruby in 2018.

I will frequently refer to TruffleRuby, which is one of the most complex and powerful Ruby implementations. That’s not because you should necessarily use it, but because it’s a great example of Ruby with a powerful and complicated JIT implementation.

What is JIT?

Do you already know about interpreted languages versus compiled languages? In a compiled language, before you run the program you’re writing, you run the compiler on it to turn it into a native application. Then you run that. In an interpreted language, the interpreter reads your source code and runs it more directly without converting it.

A compiled language takes a lot of time to do the conversion… once. But afterward, a native application is usually much faster than an interpreted application. The compiler can perform various optimizations where it recognizes that there is an easier or better way to do some operation than the straightforward one and the native code winds up better than the interpreted code - but it takes time for the compiler to analyze the code and perform the optimization.

A language with JIT (“Just In Time” compilation) is a hybrid of compiled and interpreted languages. It begins by running interpreted, but then notices which pieces of your program are called many times. Then it compiles just those specific parts in order to optimize them.

The idea is that if you have used a particular method many times, you’ll probably use it again many times. So it’s worth the time and trouble to compile that method.

A JITted language avoids the slow compilation step, just like an interpreted language does. But it (eventually) gets the faster performance for the parts of your program that are used the most, like a compiled language.

Does JIT Work?

In general, JIT can be a very effective technique. How effective depends on the language you’re compiling and which of its features you use - in JavaScript, for instance, you’ll see speedups from 6% to 40% or even more.

And in fact, there’s an outdated blog post by Benoit Daloze about how TruffleRuby (with JIT) can run a particular CPU-heavy benchmark at 900% the speed of standard CRuby, largely because of its much better JIT (see graph below.) I say “outdated” because TruffleRuby is likely to be even faster now… though so is the latest CRuby.

These numbers are from Benoit Daloze in 2016, see link above

And in fact, the most recent CRuby with JIT enabled runs this same benchmark about 280% the speed of older interpreted CRuby.

JIT Tradeoffs

Nothing is perfect in all situations. Every interesting decision you make as an engineer is a tradeoff of some kind.

Compared to interpreting your language, JIT’s two big disadvantages are memory usage and warmup time.

Memory usage makes sense - if you use JIT, you have to have the interpreted version of your method and the compiled, native version. Two versions, more memory. For complicated reasons, sometimes it’s more than two versions - TruffleRuby often has a lot more than two, which is part of why it’s so fast, but uses lots of memory.

A JIT Implementation beset by troubles

In addition to keeping multiple versions of each method, JIT has to track information about the method. How many times was it called? How much time was spent there? With what arguments? Not every JIT keeps all of that information, but in general, a more complicated JIT with better performance will track more information and use more memory.

In addition to memory usage, there’s warmup time. With JIT, the interpreter has to recognize that a method is called a lot and then take time to compile it. That means there’s a delay between when the program starts and when it gets to full speed.

Some JITs compile optimistically - they quickly notice that a method is called a lot and compile it right away. A JIT that does that will sometimes compile methods that don’t get called much again, which wastes its time. The Java Virtual Machine (JVM) is (in)famous for this, and tends to run very slowly until JIT compilation has finished.

Other JITs compile pessimistically - they compile methods slowly, and only after they have been called many times. This makes for less waste by compiling the wrong methods, but more warmup time near program start before the program is running quickly. There’s not a “right” answer, but instead various interesting tradeoffs and situations.

JIT is best for programs that run for a long time, like background jobs or network servers. For long-running programs there’s plenty of time to compile the most-used methods and plenty of time to benefit from that speedup. As a result, JIT is often counterproductive for small, short-running programs. Think of “gem list” or small Rake tasks as examples where JIT may not help, and could easily hurt.

Why Didn’t Ruby Get JIT Sooner?

A Ruby core developer tests a JIT implementation for stability

JIT’s two big disadvantages (memory usage, startup/warmup time) are both huge CRuby advantages. That made JIT a tough sell.

Ruby’s current JIT, called MJIT for “Method JIT,” was far from the first attempt. Evan Phoenix built an LLVM-based Ruby JIT long ago that wound up becoming Rubinius. Early prototypes had been around long before MJIT or its at-the-time competitors, and JIT for Ruby (LLVM libraries in Rubinius, OMR) had been tried out and rejected many times. Memory usage has been an especially serious hangup. The Core Team wants CRuby to run well on the smallest Heroku dynos and (historically) in embedded environments.

And while it’s possible to tune a JIT implementation to be okay for warmup time, most JIT is not tuned that way. The Java Virtual Machine (JVM) is an especially serious offender here. Since JRuby (Ruby written in Java) is the most popular alternate Ruby implementation, most Ruby programmers think of “Ruby with JIT” startup time as “Ruby with JVM” startup time, which is dismal.

Also, a JIT implementation can be quite large and complicated. The Ruby core team didn’t really want to adopt something large and complicated that they didn’t have much experience with into the core language.

Shyouhei Urabe, a core team member, created a “deoptimization branch” for Ruby that basically proved you could write a mini-JIT with limited memory use, fast startup time and minimal complexity. This convinced Matz that such a thing was possible and opened the door to JIT in CRuby, which had previously seemed difficult or impossible.

Several JIT implementations were developed… And eventually, Vladimir Makarov created an initial implementation for what would become Ruby’s JIT, one that was reasonably quick, had very good startup time and didn’t use much memory — we’ll talk about how below.

And that was it? No, not quite. MJIT wasn’t clearly the best possibility. Vlad’s MJIT-in-development competed with various other Ruby implementations and with Takashi Kokubun’s LLVM-based Ruby JIT. After Vlad convinced Takashi that MJIT was better, Takashi found a way to take roughly the simplest 80% of MJIT and integrate it nicely into Ruby in a way that was easy to deactivate if necessary and touched very little code outside itself, which he called “YARV-MJIT.”

And after months of integration work, YARV-MJIT was accepted provisionally into prerelease Ruby 2.6 to be worked on by the other Ruby core members, to make sure it could be extended and maintained.

And that was how Ruby 2.6 got MJIT in its current form, though still requiring the Ruby programmer to opt into using it.

Making fun of Ruby for not having JIT yet

MJIT: CRuby’s JIT

The MJIT implementation shows early promise

MJIT is an unusual JIT implementation: it uses a Ruby-to-C language translator and a background thread running a C compiler. It literally writes out C language source files on the disk and compiles them into shared libraries which the Ruby process can load and use. This is not at all how most JIT implementations work.

When a method has been called a certain number of times (10,000 times in current prerelease Ruby 2.7), MJIT will mark it to be compiled into native code and put it on a “to compile” queue. MJIT’s background thread will pull methods from the queue and compile them one at a time into native code.
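
If it helps to see the shape of that, here’s a tiny toy model of the scheme in plain Ruby - a call counter, a queue and one background worker. This is purely illustrative; the real mechanism lives in C inside the VM, and all the names here are made up:

# Toy model of MJIT's compile queue - not actual MJIT code.
CALL_THRESHOLD = 10_000      # the 2.7 prerelease default mentioned above
compile_queue  = Queue.new
call_counts    = Hash.new(0)

# One background worker pulls hot methods off the queue, one at a time.
compiler = Thread.new do
  while (name = compile_queue.pop)
    puts "pretend we generate C for #{name}, compile it and load the .so"
  end
end

record_call = lambda do |name|
  call_counts[name] += 1
  compile_queue << name if call_counts[name] == CALL_THRESHOLD
end

100_000.times { record_call.call(:multiply) }  # :multiply gets queued exactly once
compile_queue << nil   # tell the worker to shut down
compiler.join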

Remember how we talked about the JVM’s slow startup time? That’s partly because it rapidly begins compiling methods to native code, using a lot of memory and processor time. MJIT compiles only one method at once and expects the result to take time to come back. MJIT sacrifices time-to-full-speed to get good performance early on. This is a great match for CRuby’s use in small command-line applications that often don’t run for long.

“Normal” JIT compiles inside the application’s process. That means if it uses a lot of memory for compiling (which it nearly always does) then it’s very hard to free that memory back to the system. Ruby’s MJIT runs the compiler as a separate background process - when the compiling finishes, the memory is automatically and fully freed back to the operating system. This isn’t as efficient — it sets up a whole external process for compiling. But it’s wonderful for avoiding extra memory usage.

How To Use JIT

This has mostly been a conceptual post. But how do you actually use JIT?

In Ruby 2.6 or higher, use the “--jit” argument to Ruby. This will turn JIT on. You can also add “--jit” to your RUBYOPT environment variable, which will automatically pass it to Ruby every time.

Not sure if your version of Ruby is high enough? Run “ruby --version”. Need to install a later Ruby? Use rvm, ruby-build or your version manager of choice. Ruby 2.6 is already released as I write this, with Ruby 2.7 coming at Christmastime of 2019.
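
You can also check from inside a running program whether JIT is actually on. On 2.6 and 2.7 that’s exposed as RubyVM::MJIT.enabled? - a small hedge: this API is MJIT-specific and could move or be renamed in later Ruby versions.

# Prints true when the process was started with --jit.
if defined?(RubyVM::MJIT)
  puts "JIT enabled? #{RubyVM::MJIT.enabled?}"
else
  puts "This Ruby doesn't have MJIT at all."
end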

What About Rails?

Unfortunately, there is one huge problem with Ruby’s current MJIT. At the time I write this in mid-to-late 2019, MJIT will slow Rails down instead of speeding it up.

That’s a pretty significant footnote.

Problems, Worries and Escape Hatches

If you want to turn JIT off for any reason in Ruby 2.6 or higher, you can use the “--disable-jit” command-line argument to do that. So if you know you don’t want JIT and you may run the same command with Ruby 3, you can explicitly turn JIT off.

Why might you want to turn JIT off?

Debugging JIT problems

  • Slowdowns: you may know you’re running a tiny program like “gem list --local” that won’t benefit from JIT at all.

  • No compiler available: you’re running on a production machine without GCC, Clang, etc. MJIT won’t work.

  • You’re benchmarking: you don’t want JIT because you want predictability, not speed.

  • Memory usage: MJIT is unusually good for JIT, but it’s not free. You may need every byte you can get.

  • Read-Only /tmp Dir: If you can’t write the .c files to compile, you can’t compile them.

  • Weird platform: If you’re running Ruby on your old Amiga or Itanium box, there isn’t going to be a supported compiler. You may want to turn JIT off out of general worry and distrust.

  • Known bug: you know of some specific un-fixed bug and you want to avoid it.

What’s My Takeaway?

Telling the playfully frightened children of a once-JITless Ruby

If you’re running a non-Rails Ruby app and you’d like to speed it up, test it out with “--jit”. It’s likely to do you some good - at least if the CPU is slowing you down.

If you’re running a Rails app or you don’t need better CPU performance, don’t do anything. At some point in the future JIT will become default, and then you’ll use it automatically. It’s already pretty safe, but it will be even safer with a longer time to try it out. And by then, it’s likely to help Rails as well.

If you have a specific reason to turn JIT off (see above,) now you know how.

And if you’ve heard of Ruby JIT and you’re wondering how it’s doing, now you know!

RubyConf Nashville

Hey, folks! I’d love to call out a little fun Ruby news from RubyConf in Nashville.

Ruby 3.0 and Ruby Core

We’ve been saying for a while that Ruby 3.0 will ‘probably’ happen next year (2020.) It has now been formally announced that Ruby 3 will definitely happen next year. From the same Matz Q&A, we heard that he’s still not planning to allow emoji operators. 🤷

Additionally, it looks like a lot of “Async” gems for use with Fibers will be pulled into Ruby Core. In general, it looks like there’s a lot of interesting Fiber-related change coming.

Artichoke

I like to follow alternative (non-MRI) Ruby implementations. Artichoke is a Ruby implementation written in Rust on top of mruby (an embedded lightweight Ruby dialect, different from standard Ruby). It compiles to WebAssembly, allowing it to be easily embedded and sandboxed in a web page to run untrusted code, or to run under Node.js on a server.

It’s pretty early days for Artichoke, but it runs a lot of Ruby code already. They consider any difference in behaviour from MRI to be a bug, which is a good sign. You can play with their in-browser version from their demo page.

Rubyfmt

Rubyfmt, pronounced “Ruby format,” is an automatic formatter for Ruby, similar to “go fmt.” If that doesn’t mean anything to you, imagine that you could run any Ruby source file through a program and it would reformat it with absolutely standard spacing - exactly one way to arrange the spaces in your source file. The benefit is that you can stop arguing about it and just use the one standard way of spacing.

Rubyfmt is still in progress. Penelope Phippen very insistently wants it to be faster before there’s anything resembling a general release. But there’s enough now that it’s possible to contribute and to play with it.

Ruby's Roots and Matz's Leadership

I recently had the excellent fortune to be invited to a gathering at CookPad in Bristol, where Matz, Koichi, Yusuke Endoh (a.k.a. Mame) and Aaron Patterson all gave great talks about Ruby.

I was especially interested in Matz’s first talk, which was about where he got inspiration for various Ruby features, and about how he leads the language - how and why new features are added to Ruby.

You can find plenty of online speculation about where Ruby is going and how it’s managed. And I feel awkward adding to that speculation — especially since I have great respect for Matz’s leadership. But it seems reasonable to relay his own words about how he chooses.

And I love hearing about how Ruby got where it is. Languages are neat.

You’ll notice that most of Ruby’s influences are old languages, often obscure ones. That’s partly because Ruby itself dates back to 1995. A lot of current languages didn’t exist to get these features from!

Ruby and Its Features

Much of what Matz had to say was about Ruby itself, and where particular features came from. This is a long list, so settle in :-)

The begin and end keywords, and Ruby’s idea of “comb indentation” - the overall flow of if/elsif/else/end - came from the Eiffel language. He mentions that begin/end versus curly-braces are about operator precedence, which I confess I’d never even considered.

On that note, why “elsif”? Because it was the shortest way to spell “else if” that was still pronounced the same way. “elseif” is longer, and “elif” wouldn’t be pronounced the same way.

“Then” is technically a Ruby keyword, but it’s optional, a sort of “soft” keyword as he called it. For instance, you can actually use “then” as a method name and it’s fine. You might be forgiven for asking, “wait, what does that do?” The effective answer is “nothing.”

Ruby’s loop structure, with continue, next, break and so on came from C and Perl. He liked Perl’s “next” because it’s shorter than the equivalent C structures.

Ruby’s mixins, mostly embodied by modules, came from Lisp’s Flavors.

“Unless” is from Perl. Early on, Ruby was meant as a Perl-style scripting language and many of its early features came from that fact. “Until” is the same. Also, when I talk about Perl here I mean Perl 5 and before, which are very different from Perl 6 - Ruby was very mature by the time Perl 6 happened.

He’s actually forgotten where he stole the for/in loop syntax from. Perhaps Python? It can’t be JavaScript, because Ruby’s use of for/in is older than JavaScript.

Ruby’s three-part true/false/nil with false and nil being the only two falsy values is taken from various Lisp dialects. For some of them there is a “false” constant as the only false value, and some use “t” and “nil” in a similar way. He didn’t say so, but I wonder if it might have had a bit of SQL influence. SQL booleans have a similar true/false/NULL thing going on.
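
If you’ve never run into this directly, a tiny experiment shows how strict Ruby is about it - only false and nil count as falsy, so 0 and the empty string are truthy:

# Only false and nil are falsy in Ruby; 0, "" and [] are all truthy.
[0, "", [], nil, false].each do |value|
  puts "#{value.inspect} is #{value ? "truthy" : "falsy"}"
end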

Ruby’s and/or/not operations come straight from Perl, of course. Matz likes the way they feel descriptive and flow like English. As part of that, they’re often good for avoiding extra parentheses.

Matz feels that blocks are the greatest invention of Ruby (I agree.) He got the idea from a 1970s language called CLU from MIT, which called them “iterators” and only allowed them on certain loop constructs.

He took rescue/ensure/retry from Eiffel, but Eiffel didn’t otherwise have “normal” exception handling like similar languages. Ruby’s method of throwing an exception object isn’t like Eiffel, but is like several other older languages. He didn’t mention a single source for that, I don’t think.

He tried to introduce a different style of error handling, taken from a 1970s language called Icon from the University of Arizona, where each call returns an error object along with its return value. But after early trials of that method, he thought it would be too hard for beginners and generally too weird. From his description, it sounds a lot like Go’s error handling.

Return came from C. No surprise. Though of course, not multivalue return.

He got self and super from SmallTalk, though SmallTalk’s super is different - it’s the parent object, and you can call any parent method you like on it, not just the one that you just received.

He says he regrets alias and undef a little. He got them from Sather (1980s, UC Berkeley, a derivative of Eiffel.) Sather had specific labelling for interface inheritance versus implementation inheritance; Ruby took alias and undef without keeping that distinction, and he feels we often get the two confused. Also, alias and undef tend to be used to break Liskov Substitution - the principle that a child-class instance can always be used as if it were a parent-class instance. As was also pointed out, both alias and undef can be done with method calls in Ruby, so it’s not clear you really need keywords for them. He says the keywords now mostly exist for historical reasons since you can define them as methods… but he doesn’t necessarily think you should always use the methods (alias_method, undef_method) over the keywords.
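
For reference, here’s what the keyword forms and the method-call forms look like next to each other - Greeter is just a made-up example class:

class Greeter
  def hello
    "hi"
  end

  alias howdy hello            # keyword form
  alias_method :yo, :hello     # method-call form, same effect

  undef howdy                  # keyword form
  undef_method :yo             # method-call form, same effect
end

Greeter.new.hello                 # => "hi"
Greeter.new.respond_to?(:howdy)   # => false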

BEGIN and END are from Awk originally, though Perl folks know they exist there too. This was also from Ruby’s roots as a system administrator’s scripting language. Matz doesn’t recommend them any more, especially in non-script applications such as Ruby on Rails apps.

C folks already know that __FILE__ and __LINE__ are from the C preprocessor, a standard tool that’s part of the C language (but occasionally used separately from it.)

On Matz and Ruby Leadership

That was a fun trip down memory lane. Now I’ll talk about what Matz said about his own leadership of Ruby. Again, I’m trying to keep this to what Matz actually said rather than putting words in his mouth. But there may be misunderstandings or similar errors - and if so, they are my errors and I apologize.

Matz points out that Ruby uses the “Benevolent Dictator for Life” model, as Python did until recently. He can’t personally be an expert on everything so he asks other people for opinions. He points out that he has only ever written a single Rails application, for instance, and that was from a tutorial. But in the end, after asking various experts, it is his decision.

An audience member asked him: when you add new features, it necessarily adds entropy to the language (and Matz agreed.) Isn’t he afraid of doing too much of that? No, said Matz, because we’re not adding many different ways of doing the same thing, and that’s what he considers the problem with too many language features causing too much entropy. Otherwise (he implied) a complicated language isn’t a particularly bad thing.

He talked a bit about the new pipeline operator, which is a current controversy - a lot of people don’t like it, and Matz isn’t sure if he’ll keep it. He suggested that he might remove or rework it. He’s thinking about it. (Edit: it has since been removed.)

But he pointed out: he does need to be able to experiment, and putting a new feature into a prerelease Ruby to see how he likes it is a good way to do that. The difficulty is with features that make it into a release, because then people are using them.

The Foreseeable Future

Matz also talked about some specific things he does or doesn’t want to do with Ruby.

Matz doesn’t expect he’ll add any new fully-reserved words to the language, but is considering “it” as a sort of “soft”, or context-dependent keyword. In the case of “it” in particular, it would be a sort of self-type variable for blocks. So when he says “no new keywords,” it doesn’t seem to be an absolute.

He’s trying not to expand into emoji or Unicode characters, such as using the Unicode lambda (λ) to create lambdas - for now, they’re just too hard for a lot of users to type. So Aaron’s patch to turn the pipeline operator into a smiley emoji isn’t going in. Matz said he’d prefer the big heart anyway :-)

And in general, he tries hard to keep backward compatibility. Not everybody does - he cites Rails as an example of having lower emphasis on backward compatibility than the Ruby language. But as Matz has said in several talks, he’s really been trying not to break too much since the Ruby 1.8/1.9 split that was so hard for so many users.

What Does That Mean?

Other than a long list of features and where Matz got them, I think the thing to remember is: it’s up to Matz, and sometimes he’s not perfectly expert, and sometimes he’s experimenting or wrong… But he’d love to hear from you about it, and he’s always trying hard and looking around.

As the list above suggests, he’s open to a wide variety of influences if the results look good.

Ruby 2.7preview2, a Quick Speed Update

As you know, I like to check Ruby’s speed for running big Rails apps. Recently, Ruby 2.7 preview2 was released. Normally Ruby releases a new version every Christmas, so it was about time.

I’ve run Rails Ruby Bench on it to check a few things - first, is the speed significantly different? Second, any change in Ruby’s JIT?

Today’s is a pretty quick update since there haven’t been many changes.

Speed and Background

Mostly, Ruby’s speed jumps are between minor versions - 2.6 is different from 2.5 is different from 2.4, but there’s not much change between 2.4.0 and 2.4.5, for instance. I’ve done some checking of this, and it’s held pretty true over time. It's much less true of prerelease Ruby versions, as you’d expect - they’re often still getting big new optimisations, so 2.5’s prereleases were quite different from each other, and from the released 2.5. That’s appropriate and normal.

But I went ahead and speed-checked 2.6.5 against 2.6.0. While these small changes don’t usually make a significant difference, 2.6.0 was one I checked carefully.

And of course, over time I’ve checked how JIT is doing with Rails. Rails is still too tough for it, but how close to breakeven it gets varies, depending on both what code I’m benchmarking and exactly what revision of JIT I’m testing.

Numbers First

While I’ve run a lot of trials of this, the numbers are fairly simple - what’s the median performance of Ruby, running Discourse flat-out, for this number of samples? This is code I’ve benchmarked many times in roughly this configuration, and it turns out to be well-summarised by the median.

In this case, the raw data is small enough that I can just hand it to you. Here’s my data for 90 runs per configuration with 10,000 HTTP requests per run, with everything else how I generally do it:

Ruby version     Median reqs/sec    Std. Dev.    Variance
2.6.0            174.0              1.47         2.17
2.6.5            170.1              1.69         2.86
2.7.0            175.6              1.63         2.67
2.7.0 w/ JIT     110.4              1.05         1.11

One of the first things you’re likely to notice: except for 2.7 with JIT, which we expect to be slow, these are all pretty close together. The difference between 2.6.5 and 2.7.0 is only 5.5 reqs/second, which is a little over three standard deviations - not a huge difference.

I’ve made a few trials, though, and these seem to hold up. 2.6.5 does seem just a touch slower than 2.6.0. The just-about-2% slower that you’re seeing here seems typical. 2.7.0 seems to be a touch faster than 2.6.0, but as you see here, it would take a lot of samples to show it convincingly. One standard deviation apart like this could easily be measurement error, even with the multiple runs I’ve done separately. This is simply too close to call without extensive measurement.

Conclusions

Sometimes when you do statistics, you get the simple result: overall, Ruby 2.7 preview 2 is the same speed as 2.6.0. There might be a regression in 2.6.5, but if so, it’s a small one and there’s a small optimisation in 2.7 that’s balancing it out. Alternately, all these measurements are so close that they may all, in effect, be the same speed.

How MJIT Generates C From Ruby - A Deep Dive

You probably already know the basics of JIT in Ruby. CRuby’s JIT implementation, called MJIT, is a really interesting beast.

But what does the C code actually look like? How is it generated? What are all the specifics?

If you’re afraid of looking at C code, this may be a good week to skip this blog. I’m just sayin’.

How Ruby Runs Your Code

I’ll give you the short version here: Ruby parses your code. It turns it into an Abstract Syntax Tree, which is just a tree-data-structure version of the operations you asked it to do. Before Ruby 1.9, Ruby would directly interpret the tree structure to run your code. Current Ruby (1.9 through 2.6-ish) translates it into buffers of bytecodes. These buffers are called ISEQs, for “Instruction SEQuences.” There are various tools like yomikomu that will let you dump, load and generally examine ISEQs. BootSnap, the now-standard tool to optimize startup for large Rails apps, works partly by loading dumped ISEQs instead of parsing all your code from .rb files.
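
You can poke at ISEQs yourself from plain Ruby, no extra gems required - RubyVM::InstructionSequence will compile a snippet and show you its bytecode:

# Compile a snippet and dump its bytecode instructions.
iseq = RubyVM::InstructionSequence.compile("a = 3; a * 7")
puts iseq.disasm
# Expect to see instructions like putobject, setlocal, getlocal and opt_mult.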

Also, have I talked up Pat Shaughnessy’s Ruby Under a Microscope lately? He explains all of this in massive detail. If you’re a Ruby-internals geek (guilty!) this is an amazing book. It’s also surprising how little Ruby’s internals have changed since he wrote it.

In the Ruby source code, there’s a file full of definitions for all the instructions that go into ISEQs. You can look up trivial examples like the optimized plus operator and see how they work. Ruby actually doesn’t call these directly - the source file is written in a weird, not-exactly-C syntax that gets taken apart and used in multiple ways. You can think of it as a C DSL if you like. For the “normal” Ruby interpreter, they all wind up in a giant loop which looks up the next operation in the ISEQ, runs the appropriate instructions for it and then loops again to look up the next instruction (and so on.)

A Ruby build script generates the interpreter’s giant loop as a C source file when you build Ruby. It winds up built into your normal ruby binary.

Ruby’s MJIT uses the same file of definitions to generate C code from Ruby. MJIT can take an ISEQ and generate all the lines of C it would run in that loop without actually needing the loop or the instruction lookup. If you’re a compiler geek, yeah, this is a bit like loop unrolling since we already know the instruction sequence that the loop would be operating on. So we can just “spell out” the loop explicitly. That also lets the C compiler see where operations would be useless or cancel each other out and just skip them. That’s hard to do in an interpreter!

So what does all this actually look like when Ruby does it?

MJIT Options and Seeing Inside

It turns out that MJIT has some options that let us see behind the curtain. If you have Ruby 2.6 or higher then you have JIT available. Run “ruby --help” and you can see MJIT’s extra options on the command line. Here’s what I see in 2.6.2 (note that some options are changing for not-yet-released 2.7):

JIT options (experimental):
  --jit-warnings  Enable printing JIT warnings
  --jit-debug     Enable JIT debugging (very slow)
  --jit-wait      Wait until JIT compilation is finished everytime (for testing)
  --jit-save-temps
                  Save JIT temporary files in $TMP or /tmp (for testing)
  --jit-verbose=num
                  Print JIT logs of level num or less to stderr (default: 0)
  --jit-max-cache=num
                  Max number of methods to be JIT-ed in a cache (default: 1000)
  --jit-min-calls=num
                  Number of calls to trigger JIT (for testing, default: 5)

Most of these aren’t a big deal. Debugging and warnings can be useful, but they’re not thrilling. But “--jit-save-temps” there may look intriguing to you… I know it did to me!

That will actually save the C source files that Ruby is using and we can see inside them!

If you do this, you may want to set the environment variables TMP or TMPDIR to a directory where you want them - OS X often puts temp files in weird places. I added an extra print statement to mjit_worker.c in the function “convert_unit_to_func” right after “sprint_uniq_filename” so that I could see when it created a new file… But that means messing around in your Ruby source, so you do you.

Multiplication and Combinatorics

# multiply.rb
def multiply(a, b)
  a * b
end

1_000_000.times do
  multiply(7.0, 10.0)
end

I decided to start with really simple Ruby code. MJIT will only JIT a method, so you need a method. And then you need to call it, preferably a lot of times. So the code above is what I came up with. It is intentionally not complicated.

The “multiply” method multiplies two numbers and does nothing else. It gets JITted because it’s called many, many times. I ran this code with “ruby --jit --jit-save-temps multiply.rb”, which worked fine for me once I figured out where MacOS was putting its temp files.

The resulting .c file generated by Ruby is 236 lines. Whether you find this astoundingly big or pretty darn small depends a lot on your background. Let me show you a few of the highlights from that file.

Here is a (very) cut-down and modified version:

// Generated by MJIT from multiply.rb
ALWAYS_INLINE(static VALUE _mjit_inlined_6(...));
static inline VALUE
_mjit_inlined_6(rb_execution_context_t *ec, rb_control_frame_t *reg_cfp, const VALUE orig_self, const rb_iseq_t *original_iseq)
{
    // ...
}

VALUE
_mjit0(...)
{
    // ...
    label_6: /* opt_send_without_block */
    {
        // ...
        stack[0] = _mjit_inlined_6(ec, reg_cfp, orig_self, original_iseq);
    }
}

What I’m showing here is that there is an inlined _mjit_inlined_6 method (C calls them “functions”) that gets called by a top-level “_mjit0” method, which is the MJIT-ted version of the “multiply” method in Ruby. “Inlined” means the C compiler effectively rewrites the code so that it’s not a called method - instead, the whole method’s code, all of it, gets pasted in where the method would have been called. It’s a bit faster than a normal function call. It also lets the compiler optimize it just for that one case, since the pasted-in code won’t be called by anything else. It’s pasted in at that one single call site.

If you look at the full code, you’ll also see that each method is full of “labels” and comments like the one above (“opt_send_without_block”). Below is basically all of the code to that inlined function. If you ignore the dubious indentation (generated code is generated), you have a chunk of C for each bytecode instruction and some setup, cleanup and stack-handling in between. The large “cancel” block at the end is all the error handling that is done if the method does not succeed.

The chunks of code at each label, by the way, are what the interpreter loop would normally do.

And if you examine these specific opcodes, you’ll discover that this is taking two local variables and multiplying them - this is the actual multiply method from the Ruby code above.

static inline VALUE
_mjit_inlined_6(rb_execution_context_t *ec, rb_control_frame_t *reg_cfp, const VALUE orig_self, const rb_iseq_t *original_iseq)
{
    const VALUE *orig_pc = reg_cfp->pc;
    const VALUE *orig_sp = reg_cfp->sp;
    VALUE stack[2];
    static const VALUE *const original_body_iseq = (VALUE *)0x7ff4cd51a080;

label_0: /* getlocal_WC_0 */
{
    MAYBE_UNUSED(VALUE) val;
    MAYBE_UNUSED(lindex_t) idx;
    MAYBE_UNUSED(rb_num_t) level;
    level = 0;
    idx = (lindex_t)0x4;
    {
        val = *(vm_get_ep(GET_EP(), level) - idx);
        RB_DEBUG_COUNTER_INC(lvar_get);
        (void)RB_DEBUG_COUNTER_INC_IF(lvar_get_dynamic, level > 0);
    }
    stack[0] = val;
}

label_2: /* getlocal_WC_0 */
{
    MAYBE_UNUSED(VALUE) val;
    MAYBE_UNUSED(lindex_t) idx;
    MAYBE_UNUSED(rb_num_t) level;
    level = 0;
    idx = (lindex_t)0x3;
    {
        val = *(vm_get_ep(GET_EP(), level) - idx);
        RB_DEBUG_COUNTER_INC(lvar_get);
        (void)RB_DEBUG_COUNTER_INC_IF(lvar_get_dynamic, level > 0);
    }
    stack[1] = val;
}

label_4: /* opt_mult */
{
    MAYBE_UNUSED(CALL_CACHE) cc;
    MAYBE_UNUSED(CALL_INFO) ci;
    MAYBE_UNUSED(VALUE) obj, recv, val;
    ci = (CALL_INFO)0x7ff4cd52b400;
    cc = (CALL_CACHE)0x7ff4cd5192e0;
    recv = stack[0];
    obj = stack[1];
    {
        val = vm_opt_mult(recv, obj);

        if (val == Qundef) {
            reg_cfp->sp = vm_base_ptr(reg_cfp) + 2;
            reg_cfp->pc = original_body_iseq + 4;
            RB_DEBUG_COUNTER_INC(mjit_cancel_opt_insn);
            goto cancel;
        }
    }
    stack[0] = val;
}

label_7: /* leave */
    return stack[0];

cancel:
    RB_DEBUG_COUNTER_INC(mjit_cancel);
    rb_mjit_iseq_compile_info(original_iseq->body)->disable_inlining = true;
    rb_mjit_recompile_iseq(original_iseq);
    const VALUE current_pc = reg_cfp->pc;
    const VALUE current_sp = reg_cfp->sp;
    reg_cfp->pc = orig_pc;
    reg_cfp->sp = orig_sp;

    struct rb_calling_info calling;
    calling.block_handler = VM_BLOCK_HANDLER_NONE;
    calling.argc = 2;
    calling.recv = reg_cfp->self;
    reg_cfp->self = orig_self;
    vm_call_iseq_setup_normal(ec, reg_cfp, &calling, (const rb_callable_method_entry_t *)0x7ff4cd930958, 0, 2, 2);

    reg_cfp = ec->cfp;
    reg_cfp->pc = current_pc;
    reg_cfp->sp = current_sp;
    *(vm_base_ptr(reg_cfp) + 0) = stack[0];
    *(vm_base_ptr(reg_cfp) + 1) = stack[1];
    return vm_exec(ec, ec->cfp);

} /* end of _mjit_inlined_6 */

The labels mark where a particular bytecode instruction in the ISEQ starts, and the name is the name of that bytecode instruction. This is doing nearly exactly what the Ruby interpreter would, including lots of Ruby bookkeeping for things like call stacks.

What Changes?

Okay. We’ve multiplied two numbers together. This is a single, small operation.

What changes if we do more?

Well… This is already a fairly long blog post. But first, I’ll link a repository of the output I got when multiplying more than two numbers.

And then after you clone that repo, you can start doing interesting things yourself to see what changes over time. For instance:

# See what's different between multiplying 2 Floats and multiplying 3 Floats
diff -c multiply_2_version_0.c multiply_3_version_0.c

And in fact, if we multiply three or more Floats, MJIT will realize it can improve some things over time. When multiplying three (or four!) Floats, it will produce three different chunks of C code, not just one, as it continues to iterate. So:

# See what's different between the first and second way to multiply three Floats
diff -c multiply_3_version_0.c multiply_3_version_1.c

I’ll let you have a look. When looking at diffs, keep in mind that the big hexadecimal numbers in the CALL_INFO and CALL_CACHE lines will change for every run, both in my output and in any output you make for yourself — they’re literally hardcoded memory addresses in Ruby, so they’re different for every run. But the other changes are often interesting and substantive, as MJIT figures out how to optimize things.

What Did We Learn?

I like to give you interesting insights, not just raw code dumps. So what’s interesting here?

Here’s one interesting thing: you don’t see any checks for whether operations like multiply are redefined. But that’s not because of excellent JIT optimization - it’s because that all lives inside the vm_opt_mult function call up above. At best, they might be recognized as a repeat check and the compiler might be able to tell that it doesn’t need to check them again. But that’s actually hard — there’s a lot of code here, and it’s hard to verify that none of it could possibly ever redefine an operation… Especially in Ruby!

So: MJIT is going to have a lot of trouble skipping those checks, given the way it structures this code.

And if it can’t skip those checks, it’s going to have a lot of trouble doing optimisations like constant folding, where it multiplies two numbers at compile time instead of every time through the loop. You and I both know that 7 * 10 will always be 70, every time through the loop because nobody is redefining Integer multiply halfway. But MJIT can’t really know that - what if there was a trace_func that redefined operations constantly? Or a background thread that redefined the operation halfway through? Ruby allows it!
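
In case that sounds theoretical, here’s a perfectly legal (and perfectly terrible) snippet showing why the JIT can’t just assume Integer multiply stays put:

# Please don't do this outside of a demo. Ruby allows it, though, which
# is exactly why the generated code keeps checking for redefinition.
class Integer
  def *(other)
    42
  end
end

puts 7 * 10   # => 42, not 70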

To put it another way, MJIT isn’t doing a lot of interesting language-level optimisation here. Mostly it’s just optimising simple bookkeeping like the call stack and C-level function calls. Most of the Ruby operations, including overhead like checking if you redefined a function, stay basically the same.

That should make sense. Remember how MJIT got written and merged in record time? It’s very hard to make language-level optimizations without a chance of breaking something. MJIT tries not to change the language semantics at all. So it doesn’t make many changes or assumptions. So mostly, MJIT is a simple mechanical transform of what the interpreter was already doing.

If you didn’t already know what the Ruby interpreter was doing under the hood, this is also a fun look into that.

Benchmark Results: Threads, Processes and Fibers

You may recall me writing an I/O-heavy test for threads, processes and fibers to benchmark their performance. I then ran it a few times on my Mac laptop, wrote the numbers down and called it a day.

While that can be useful, it’s not how we usually do benchmarking around here. So let’s do something with a touch more rigor, shall we?

Also some pretty graphs. I like graphs.

Methodology

If you’re the type to care about methodology (I am!) then this is a great time to review the previous blog post and/or the code to the tests.

I’ve written a simple master/worker pattern in (separately) threads, fibers and processes. In each case, the master writes to the worker, which reads, writes a response, and waits for the next write. This is very simple, but heavy on I/O and coordination.

For this post, I’ll be timing the results for not just threads vs fibers vs processes, but also for Rubies 2.0 through 2.6 - specifically, CRuby versions 2.0.0-p0, 2.1.10, 2.2.10, 2.3.8, 2.4.5, 2.5.3 and 2.6.2.

I’ll mention “workers” for all these tests. For thread-based testing, a “worker” is a thread. Same for processes and fibers - one worker is one process or one fiber.

First Off, Which is Faster?

It’s hard to definitively say which of the three methods of concurrency is faster in general. In fact, it’s nearly a meaningless question since they do significantly different things and are often combined with each other.

Graph: threads vs. processes vs. fibers (Ruby 2.6)

Now, with sanity out of the way, let’s pretend we can just answer that with a benchmark. You know you want to.

The result for Ruby 2.6 is shown above.

It looks as if processes are always faster assuming you don’t use too many of them.

And that’s true, sort of. Specifically, it’s true until you start to hit limits on memory or number of processes, and then it’s false. That’s probably why you’re seeing that rapid rise in processing time for 1,000 processes. These are extremely simple processes - if you’re doing more real work you wouldn’t use 1,000 workers because you’d run out of memory long before that.

However, for a simple task like this, fibers beat threads because they’re lighter-weight, using less memory or CPU. And processes beat both, because they get around Ruby’s use of the GIL, and it’s such a tiny task that we don’t hit memory constraints until we use close to 1,000 processes - a far larger number of workers than is useful or productive.

Graph: threads vs. processes vs. fibers (Ruby 2.0)

In fact, you would normally combine these. You can and should use multiple threads or fibers per process, with multiple processes, in CRuby to avoid GIL issues. Yeah, fine, real-world issues. Let’s ignore them and have more fun with graphs. Graphs are awesome.

You might (and should) reasonably ask, “but is this an artifact of Ruby 2.6?” Above are the results for Ruby 2.0, for reference. They do not include 1,000 workers because Ruby 2.0 segfaults when you try that.

Processes Across the Years

Graph: process (fork) performance across Ruby versions

How has our Ruby multiprocess performance changed since Ruby 2.0? That’s the baseline for Ruby 3x3, so it can be our baseline here, too.

If you look at the graph above, the short answer is that if you use a reasonable number of workers the performance is excellent and very stable. If you use a completely over-the-top number of workers, the performance isn’t amazing. I wouldn’t really call that a bug.

Incidentally, that isn’t just noisy data. While I only ran each test 10 times, the variance is very low on the results. Ruby 2.3.8 and 2.6.2 just seem to be (reliably) extra-bad with far too many processes. Of course, that’s a bad idea on any Ruby, not just the extra-bad versions.

In general, though, Ruby processes are living up to their reputation here - CRuby has used processes for concurrency first and foremost. Their performance is excellent and so is their stability.

Though you’ll notice that the “1,000 processes” line doesn’t go all the way back to Ruby 2.0.0-p0, as mentioned above. That’s because it gets a segmentation fault and crashes the Ruby interpreter. That’s a theme - fibers and threads also crash Ruby 2.0.0-p0 when you try to use far too many of them. But Ruby 2.1 has fixed the problem. I hope that doesn’t mean you need to upgrade, since Ruby 2.1 is almost six years old now…

Threads Across the Years

Graph: thread performance across Ruby versions

That was processes. What about threads?

They’re pretty good too. And unlike processes, thread performance has improved pretty substantially between Ruby 2.0 and 2.6. That’s nearly twice as fast for an all-coordination, all-I/O task like this one!

1,000 threads is still far too many for this task, but CRuby handles it gracefully with only a slight performance degradation. It’s a minor tuning error, not a horrible misuse of resources like 1,000 processes would be.

What you’re seeing there, with 5-10 threads being optimal for an I/O-heavy workload, is pretty typical of CRuby. It’s hard to get great performance with a lot of threads because the GIL keeps more than one from running Ruby at once. Normally with 1,000 threads, CRuby’s performance will fall off a cliff - it simply won’t speed up beyond something like 6 threads. But this task is nearly all I/O, and so the GIL does fairly minimal harm here.

Fibers Across the Years

Graph: fiber performance across Ruby versions

Fibers are the really interesting case here. We know they’ve received some rewriting love in recent Ruby versions, and I’ve seen Fiber.yield times significantly improved from very old to very new CRuby. Their graph is above. And it is indeed interesting.

First, 1,000 fibers are clearly too many for this task, as with threads and processes. In fact, threads seem to handle the excess workers better, at least until 2.6.

Also, fiber performance seems to get worse after 2.0, until 2.6 fixes it dramatically. Perhaps that’s Samuel Williams’ work?

It’s also fair to point out that I only test fibers (or threads or processes, for that matter) with a pure-Ruby reactor. All of this assumes that a simple IO.select is adequate, when you can get better performance using something like nio4r to use more interesting system calls, and to do more of the work in optimized C.

Addendum: Ruby 2.7

I did a bit of extra benchmarking of (not yet released) Ruby 2.7, with the September 6th head-of-master commit. The short version is that threads and processes are exactly the same speed as 2.6 (makes sense), while fibers have gained a bit more than 6% speed from 2.6.

So there’s more speed coming for fibers!

Conclusions

Clearly, the conclusion is to only use processes in CRuby, ever, and to max out at 10 processes. Thank you for coming to my TED talk.

No, not really.

Some things you’re seeing here:

  • Fibers got faster in Ruby 2.6 specifically. If you use them, consider upgrading to Ruby 2.6+.

  • Be careful tuning your number of threads and processes. You’ve seen me say that before, and it’s still true.

  • Threads, oddly, have gained a bit of performance in recent CRuby versions. That’s unexpected and welcome.

Thank you and good night.

Benchmarking Fibers, Threads and Processes

A while back, I set out to look at Fiber performance and how it's improved in recent Ruby versions. After all, concurrency is one of the three pillars of Ruby 3x3! Also, there have been some major speedups in Ruby's Fiber class by Samuel Williams.

It's not hard to write a microbenchmark for something like Fiber.yield. But it's harder, and more interesting, to write a benchmark that's useful and representative.

Wait, Wait, Wait - What?

And don’t get me started on parallelism…

Okay, first a quick summary: what are fibers?

You know how you can fork a process or create a thread and suddenly there’s this code that’s also running, alongside your code? I mean, sure, it doesn’t necessarily literally run at the same time. But there’s another flow of control and sometimes it’s running. This is all called concurrency by developers who are picky about vocabulary.

A fiber is like that. However, when you have multiple fibers running, they don’t automatically switch from one to the other. Instead, when one fiber calls Fiber.yield, Ruby will switch to another fiber. As long as all the fibers call yield regularly, they all get a chance to run and the result is very efficient.
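
Here’s about the smallest possible demonstration of that hand-off - with a single fiber, Fiber.yield gives control back to the code that resumed it:

fiber = Fiber.new do
  puts "step 1"
  Fiber.yield
  puts "step 2"
end

fiber.resume       # prints "step 1", then pauses at Fiber.yield
puts "in between"
fiber.resume       # prints "step 2", then the fiber finishes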

Fibers, like threads, all run inside your process. By comparison, if you call “fork” for a new process then of course it isn’t in the same process. Just as a process can contain multiple threads, a thread can contain multiple fibers. For instance, you could write an application with ten processes, each with eight threads, and each of those threads could have six fibers.

A thread is lighter-weight than a process, and multiple can run inside a process. A fiber is lighter-weight than a thread, and multiple can run inside a thread. And unlike threads or processes, fibers have to manually switch back and forth by calling “yield.” But in return, they get lower memory usage and lower processor overhead than threads in many cases.

Make sense?

We’ll also be talking about the Global Interpreter Lock, or GIL, which these days is more properly called the Global VM Lock or GVL - but nobody does, so I’m calling it the GIL here. Basically, multiple Ruby threads or fibers inside a single process can only have one of them running Ruby at once. That can make a huge difference in performance. We’re not going to go deeply into the GIL here, but you may want to research it further if this topic interests you.
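
If you’d like to see the GIL’s effect with your own eyes, here’s a rough, machine-dependent experiment: CPU-bound work in two threads takes about as long as doing the same work twice sequentially, because only one thread runs Ruby code at a time.

require "benchmark"

work = -> { 10_000_000.times { |i| i * i } }

sequential = Benchmark.realtime { 2.times { work.call } }
threaded   = Benchmark.realtime do
  [Thread.new(&work), Thread.new(&work)].each(&:join)
end

# On CRuby, expect these two numbers to be roughly the same.
puts "sequential: #{sequential.round(2)}s, threaded: #{threaded.round(2)}s"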

Why Not App Servers?

It’s a nice logo, isn’t it?

Some of you are thinking, "but comparing threads and fibers isn’t hard at all." After all, I do lots of HTTP benchmarking here. Why not just benchmark Puma, which uses threads, versus Falcon, which uses fibers, and call it a day?

Several reasons.

One: there are a lot of differences between Falcon and Puma. HTTP parsing, handling of multiple processes, how the reactor is written. And in fact, both of them spend a lot of time in non-Ruby code via nio4r, which lets Ruby use some (very cool, very efficient) C libraries to do the heavy lifting. That's great, and I think it's a wonderful choice... But it's not really benchmarking Ruby, is it?

No, we need something much simpler to look at raw fiber performance.

Also, Ruby 3x3 uses Ruby 2.0 as its baseline. Falcon, nio4r and recent Puma all very reasonably require more recent Ruby than that. Whatever benchmark I use, I want to be able to compare all the way back to Ruby 2.0. Puma 2.11 can do that, but no version of Falcon can.

Some Approaches that Didn't Work

Just interested in the punchline? Skip this section. Curious about the methodology? Keep reading.

I tried putting together a really simple HTTP client and server. The client was initially wrk while the server was actually three different servers - one threads, one processes, one fibers. I got it partly working.

But it all failed. Badly.

Specifically, wrk is intentionally picky and finicky. If the server closes the socket on it too soon, it gives an error. Lots of errors. Read errors and write errors both, depending. Just writing an HTTP server with Ruby's TCPSocket is harder than it looks, basically, if I want a picky client to treat it as reasonable. Curl thinks it's fine. Wrk wants clean benchmark results, and says no.

If I avoid strategy and vision, I can narrow the scope of my failures. That’s the right takeaway, I’m sure of it.

Yeah, okay, fine. I guess I do want clean benchmark results. Maybe.

Okay, so then, maybe just a TCP socket server? Raw, fast C client, three different TCPServer-based servers, one threads, one processes, one fibers? It took some doing, but I did all that.

That also failed.

Specifically, I got it all working with threads - they're often the easiest. And a 10,000-request run took anything from 3 seconds to 30 seconds. That... seems like a lot. I thought, okay, maybe threads are bad at this, and I tried it with fibers. Same problem.

So I tried it with straight-line non-concurrent code for the server. Same problem. What about a simple select-based reactor for the fiber version to see if some concurrency helps? Nope. Same problem.

It turns out that just opening a TCP/IP socket, even on localhost, adds a huge amount of variation to the time for the trial. So much variation that it swamps what I'm trying to measure. I could have just run many, many trials to (mostly) average out the noise. But having more measurement noise than signal to measure is a really bad idea.

So: back to the drawing board.

No HTTP. No TCP. No big complicated app servers, so I couldn't go more complicated.

What was next?

Less Complicated

I’m starting to enjoy how tremendously bad the visual explanations of shell pipes are. Maybe that’s a form of Stockholm Syndrome?

What's more predictable and less variable than TCP/IP sockets? Local process-to-process sockets with no network protocol in the middle. In Ruby, one easy way to do that is IO.pipe.
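
If you haven’t used it, IO.pipe just hands you a connected reader and writer inside your own process, with no network stack involved:

reader, writer = IO.pipe

writer.write("ping")
writer.close          # closing signals end-of-data to the reader

puts reader.read      # => "ping"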

You can put together a pretty nice simple master/worker pattern by having the master set up a bunch of workers, each with a shell-like pipe. It's very fast to set up and very fast to use. This is the same way that shells like bash set up pipes for "cat myfile | sort | uniq", running output through several programs before it's done.

So that's what I did. I used threads as workers for the first version. The code for that is pretty simple.

Basically:

  • Set up read and write pipes

  • Set up threads as workers, ready to read and write

  • Start the master/controller code in Ruby’s main process and thread

  • Keep running until finished, then clean up

There’s some brief reactor code for master to make sure it only reads and writes to pipes that are currently ready. But it’s very short, certainly under ten lines of “extra.”

The multiprocess version is barely different - it's so similar that there are about five lines of difference between them.
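
To make that concrete, here’s a compressed sketch of the threaded version - not the actual benchmark code (that lives in the linked repo), just the shape of it, with made-up names and sizes:

NUM_WORKERS     = 10
REQS_PER_WORKER = 1_000

# Each worker gets two pipes: one the master writes requests into, and
# one the worker writes responses back on.
workers = NUM_WORKERS.times.map do
  to_worker_r, to_worker_w     = IO.pipe
  from_worker_r, from_worker_w = IO.pipe

  thread = Thread.new do
    REQS_PER_WORKER.times do
      to_worker_r.read(4)          # wait for the master's "ping"
      from_worker_w.write("pong")  # answer it
    end
  end

  { write: to_worker_w, read: from_worker_r, thread: thread, pending: REQS_PER_WORKER }
end

# The master: send every worker a request, then keep reading responses
# (via IO.select) and sending the next request until everyone is done.
workers.each { |w| w[:write].write("ping") }
until workers.all? { |w| w[:pending].zero? }
  ready, = IO.select(workers.map { |w| w[:read] })
  ready.each do |io|
    w = workers.find { |x| x[:read] == io }
    io.read(4)
    w[:pending] -= 1
    w[:write].write("ping") unless w[:pending].zero?
  end
end
workers.each { |w| w[:thread].join }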

And Now, Fibers

The fiber version is a little more involved. Let's talk about that.

Threads and processes both have pre-emptive multitasking. So if you set one of them running and mostly forget about it, roughly the right thing happens. Your master and your workers will trade off pretty nicely between them. Not everything works perfectly all the time, but things basically tend to work out okay.

In cooperative multitasking, he keeps the goofy grin on his face and switches when he feels like. In preemptive multitasking he can’t spend too long on the cellphone or the hand with the book slaps him.

Fibers are different. A fiber has to manually yield control when it's done. If a fiber just reads or writes at the wrong time, it can block your whole program until it’s done. That's not as severe a problem with IO.pipe as with TCP/IP. But it's still a good idea to use a pattern called a reactor to make sure you're only reading when there's data available and only writing when there's space in the pipe for it.

Samuel Williams has a presentation about Ruby fibers that I used heavily as a source for this post. He includes a simple reactor pattern for fibers there that I'll use to sort my workers out. Like the master in the earlier code, this reactor uses IO.select to figure out when to read and write and how to transfer control between the different fibers. The reactor pattern can be used for threads or processes as well, but Samuel's code is written for fibers.

So initially, I put all the workers into a reactor in one thread, and the master with an IO.select reactor in another thread. That's very similar to how the thread and process code is set up, so it's clearly comparable. But as it turned out, the performance for that version isn't great.

But it seems silly to say it's testing fibers while using threads to switch back and forth... So I wrote a "remastered" version of the code, with the master code using a fiber per worker. Would this be really slow since I was doubling the number of fibers...? Not so much.

In fact, using just fibers and a single reactor doubled the speed for large numbers of messages.

And with that, I had some nice comparable thread, process and fiber code that's nearly all I/O.

How’s It Perform?

I put it through its paces locally on my Macbook Pro with Ruby 2.6.2. Take this as “vaguely suggestive” performance, in other words, not “heavily vetted” performance. But I think it gives a reasonable start. I’ll be validating on larger Linux EC2 instances before you know it - you’ve met me before.

Here are numbers of workers and requests along with the type of worker, and how long it takes to process that number of requests:

                                   Threads    Processes    Fibers w/ old-style Master    Fibers w/ Fast Master
5 workers w/ 20,000 reqs each      2.6        0.71         4.2                           1.9
10 workers w/ 10,000 reqs each     2.5        0.67         4.0                           1.7
100 workers w/ 1,000 reqs each     2.5        0.76         3.9                           1.6
1000 workers w/ 100 reqs each      2.8        2.5          5.0                           2.4

10 workers w/ 100,000 reqs each    25         5.8          41                            16

Some quick notes: Processes give an amazing showing, partly because they have no GIL. Threads beat out Fibers with a threaded master, so combining threads and fibers too closely seems to be dubious. But with a proper fiber-based master they’re faster than threads, as you’d hope and expect.

You may also notice that processes do not scale gracefully to 1000 workers, while threads and fibers do much better at that. That’s normal and expected, but it’s nice to see the data bear it out.

That final row has 10 times as many total requests as all the other rows. So that’s why its numbers are about ten times higher.

A Strong Baseline for Performance

This guy has just gotten Ruby Fiber code to work. You can tell by the posture.

This article is definitely long enough, so I won't be testing this from Ruby version 2.0 to 2.7... Yet. You can expect it soon, though!

We want to show that fiber performance has improved over time - and we'd like to see if threads or processes have changed much. So we'll test over those Ruby versions.

We also want to compare threads, processes and fibers at different levels of concurrency. This isn't a perfectly fair test. There's no such thing! But it can still teach us something useful.

And we'd also like a baseline to start looking at various "autofiber" proposals - variations on fibers that automatically yield when doing I/O so that you don't need the extra reactor wrapping for reads and writes. That simplifies the code substantially, giving something much more like the thread or process code. There are at least two autofiber proposals, one by Eric Wong and one by Samuel Williams.

Don't expect all of that for the same blog post, of course. But the background work we just did sets the stage for all of it.

How Ruby Encodes References - Ruby Tiny Objects Explained

When you’re using Ruby and you care about performance, you’ll hear a specific recommendation: “use small, fast objects.” As a variation on this, people will suggest you use symbols (“they’re faster than strings!”), prefer nil to the empty string and a few similar recommendations.

It’s usually passed around as hearsay and black magic, and often the recommendations are somehow wrong. For instance, some folks used to say “don’t use symbols! They can’t be garbage collected!”. But nope, now they can be. And the strings versus symbols story gets a lot more complicated if you use frozen strings…

I’ve explained how Ruby allocates tiny, small and large objects before, but this will be a deep dive into tiny (reference) objects and how they work. That will help you understand the current situation and what’s likely to change in the future.

We’ll also talk a bit about how C stores objects. CRuby (aka Matz’s Ruby or “plain” Ruby) is written in C, and uses C data structures to store your Ruby objects.

And along the way you’ll pick up a common-in-C trick that can both be used in Ruby (Matz does!) and help you understand the deeper binary underpinnings of a lot of higher-level languages.

How Ruby Stores Objects

You may recall that Ruby has three different object sizes, which I’ll call “tiny,” “small” and “large.” For deeper details on that, the slides from my 2018 RubyKaigi talk are pretty good (or: video link.)

But the short version for Ruby on 64-bit architectures (such as any modern processor) is:

  • A Ruby 8-byte “reference” encodes tiny objects directly inside it, or points to…

  • A 40-byte RVALUE structure, which can fully contain a small object or the starting 40 bytes of…

  • A Large object (anything bigger), which uses an RVALUE and an allocation from the OS.

Make sense? Any Ruby value gets a reference, even the smallest ones. Tiny values are encoded directly into the 8-byte reference. Small or large objects (but not tiny) also get a 40-byte RVALUE. Small objects are encoded directly into the 40-byte RVALUE. And large objects don’t fit in just a reference or just an RVALUE, so they get an extra allocation of whatever size they actually need (plus the RVALUE and the reference.) For the C folks in the audience, that “extra allocation” is the same thing as a call to malloc(), the usual C memory allocation function.

The RVALUE is often called a “Slot” when you’re talking about Ruby memory. Technically Ruby uses the word “slot” for the allocation and “RVALUE” for the data type of the structure that goes in a slot, but you’ll see both words used both ways - treat them as the same thing.
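
You can poke at these size classes from Ruby itself with ObjectSpace. A quick sketch - the exact numbers vary a bit by Ruby version and platform:

require "objspace"

ObjectSpace.memsize_of(7)            # => 0  - tiny; it lives entirely in the reference
ObjectSpace.memsize_of(:foo)         # => 0  - also tiny
ObjectSpace.memsize_of("short str")  # => 40 - small; it fits inside the RVALUE/Slot
ObjectSpace.memsize_of("x" * 500)    # => ~540 - large; a Slot plus a separate allocation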

Why the three-level system? Because the bigger the object, the more it costs to allocate and track. 8-byte references are tiny and very cheap. Slots get allocated in blocks of 408 at a time and aren’t that big, so they’re fairly cheap - but a thousand or more of them start to get expensive. And a large object takes a reference and a slot and a whole allocation of its own that gets separately tracked - not cheap.

So: let’s look at references. Those are the 8-byte tiny values.

Which Values are Tiny?

I say that “some” objects are encoded into the reference. Which ones?

  • Fixnums that fit in 63 bits - roughly between negative 4.6 quintillion and positive 4.6 quintillion

  • Symbols

  • Floating-point numbers (like 3.7 or -421.74)

  • The special values true, false, undef and nil

That’s a pretty specific set. Why?

C: Mindset, Hallucinations and One Weird Trick That Will Shock You

C really treats all data as a chunk of bits with a length. There are all sorts of operations that act on chunks of bits, of course, and some of those operations might be assigned something resembling a “type” by a biased human observer. But C is a big fan of the idea that if you have a chunk of bytes and you want to treat it as a string in one line and an integer the next, that’s fine. Length is the major limitation, and even length is surprisingly flexible if you’re careful and/or you don’t mind the occasional buffer overrun.

What’s a pointer? Pointers are how C tracks memory. If you imagine numbering all the bytes of memory starting at zero, and the next byte is one, the next byte two and so on, you get exactly how old processors addressed memory. Some very simple embedded processors still do it that way. That’s exactly what a C pointer is - an index for a location in memory, if you were to treat all of memory as one giant vector of bytes. Memory addressing is more complicated in newer processors, OSes and languages, but they still present your program with that same old abstraction. In C, you use it very directly.

So when I say that in C a pointer is a memory address, you might ask, “is that a separate type from integer with a bunch of separate operations you can do on it?” and I might answer “it’s C, so I just mean there are a bunch of pointer operations that you can do with any piece of data anywhere inside your process.” The theme here is “C doesn’t track your stuff for you at runtime, who do you think C is, your mother?” The other, related theme is “C assumes when you tell it something you know what you’re doing, whether you actually do or not.” And if not, eh, crashes happen.

One bit related to this mindset: allocating a new “object” (really a chunk of bytes) in C is simple: you call a function and you get back a pointer to a chunk of bytes, guaranteed to hold at least the size you asked for. Ask it for 137 bytes, get back a pointer to a buffer that is at least 137 bytes big. That’s what “malloc” does. When you’re done with the buffer you call “free” to give it back, after which it may become invalid or be handed back to somebody else, or split up and parts of it handed back to somebody else. Data made of bits is weird.

A side effect of all of this “made of bits” and “track it yourself” stuff is that often you’ll do type tagging. You keep one piece of data that says what type another piece of data is, and then you interpret the second one completely differently depending on the first one. Wait, what? Okay, so, an example: if you know you could have an integer or a string, you keep a tag, which is either 0 for integer or 1 for string. When you read the object, first you check the tag for how to interpret the second chunk of bits. When you set a new value (which could be either integer or string) you also set the tag to the correct value. Does this all sound disorganized and error-prone? Good, you’re understanding a bit of what C is like.

One last oddity: because of how processor alignment and memory tracking work, pointers are essentially always even. In fact, the values returned by a memory allocator on a modern processor are always a multiple of 8, because most processors don’t like accessing an 8-byte value at an address that isn’t a multiple of 8, and the memory allocator can’t just tell you not to use any 8-byte values. Processors are weird, yo.

Which means if you looked at the representation of your pointer in binary, the smallest three bits would always be zero. Because, y’know, multiple of 8. Which means you could use those three bits for something. Keep that in mind for the next bit.

Okay, So What Does Ruby Do?

If this sounds like I’m building up to explaining some type-tagging… Yup, well spotted!

It turns out that a reference is normally a C pointer under the hood. Basically every dynamic language does this, with different little variations. So all references to small and large Ruby objects are pointers. The exception is for tiny objects, which live completely in the reference.

Think about the last three bits of Ruby’s 8-byte references. You know that if those last bits are all zeroes, the value is (or could be) a pointer to something returned by the memory allocator - so it’s a small or large object. But if they’re not zero, the value lives in the reference and it’s a tiny object.

And Ruby is going to pass around a lot of values that you’d like to be small and fast… Numbers, say, or symbols. Heck, you’d like nil to be pretty small and fast too.

So: CRuby has a few things that it calls “immediate” values in its source code. And the list of those immediate values look exactly like the list above - values you can store as tiny objects directly in a reference.

Let’s get back to those last three bits of the reference again.

If the final bit is a “1” then the reference contains a Fixnum. If the final two bits are “10” then it’s a Float. And if the last four bits are “1100” then it’s a Symbol. But the last three of “1100” are still illegal for an allocated pointer, so it works out.

The four “special” values (true, false, undef, nil) are all represented by small numbers that will also never be returned by the memory allocator. For completeness, here they are:

Value   Hexadecimal value   Decimal value
true    0x14                20
false   0x00                0
undef   0x34                52
nil     0x08                8

So Every Integer Ends in 1, Then?

You might reasonably ask… but what about even integers?

I mean, “ends in 1” is a reasonable way to distinguish between pointers and not-pointers. But what if you want to store the number 4 at some point? Its binary representation ends in “00,” not “1.” The number 88 is even worse - like a pointer, it’s a multiple of 8!

It turns out that CRuby stores your integer in just the top 63 bits out of 64. The final “1” bit isn’t part of the integer’s value, it’s just a sign saying, “yup, this is an integer.” So if type-tagging is two values with one tagging the other, then the bottom bit is the tag and the top 63 bits are the “other” piece of data. They’re both crunched together, but… Well, this is C. If you want to crunch up “multiple” pieces of data into one chunk… C isn’t your mother, and it won’t stop you. In fact, that’s what C does with all its arrays anyway. And in this case it makes for pretty fast code, so that’s what CRuby does.

If you’re up for it, here’s the C code for immediate Fixnums - all this code makes heavy use of bitwise operations, as you’d expect.

// Check if a reference is an immediate Fixnum
#define RB_FIXNUM_P(f) (((int)(SIGNED_VALUE)(f))&RUBY_FIXNUM_FLAG)

// Convert a C int into a Ruby immediate Fixnum reference
#define RB_INT2FIX(i) (((VALUE)(i))<<1 | RUBY_FIXNUM_FLAG)

// Convert a Ruby immediate Fixnum into a C int - RSHIFT is just >>
#define RB_FIX2LONG(x) ((long)RSHIFT((SIGNED_VALUE)(x),1))
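
Here's that same tagging math written out in plain Ruby, just to make the bit-twiddling concrete. This is an illustration of the encoding, not code CRuby actually runs:

def int_to_tagged_reference(i)
  (i << 1) | 1   # shift the value up one bit, set the low "I'm a Fixnum" bit
end

def tagged_reference_to_int(ref)
  ref >> 1       # shifting back down drops the tag bit
end

ref = int_to_tagged_reference(4)   # => 9, binary 1001 - odd, so clearly not a pointer
tagged_reference_to_int(ref)       # => 4
int_to_tagged_reference(88)        # => 177 - odd again, even though 88 is a multiple of 8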

So It’s All That Simple, Then?

This article can’t cover everything. If you think about symbols for a moment, you’ll realize they have to be a bit more complicated than that - what about a symbol like :thisIsAParticularlyLongName? You can’t fit that in 8 bytes! And yet it’s still an immediate value. Spoiler: Ruby keeps a table that maps the symbol names to fixed-length keys. This is another very old trick, often called String Interning.
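
Here's a toy version of that interning trick, just to show the idea. It's not CRuby's actual implementation, which lives in C and is much fancier about garbage collection:

class ToySymbolTable
  def initialize
    @ids_by_name = {}
    @names_by_id = []
  end

  # Hand back a small, fixed-size ID for a possibly-long name
  def intern(name)
    @ids_by_name[name] ||= begin
      @names_by_id << name
      @names_by_id.size - 1
    end
  end

  def name_for(id)
    @names_by_id[id]
  end
end

table = ToySymbolTable.new
table.intern("thisIsAParticularlyLongName")   # => 0
table.intern("anotherName")                   # => 1
table.intern("thisIsAParticularlyLongName")   # => 0 - the same ID the second time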

And as for what it does to the Float representation… I’ll get into a lot more detail about that, and about what it does to Ruby’s floating-point performance, in a later post.

Wrk: Does It Matter If It Has Native No-Keepalive?

I wrote about a load-tester called Wrk a little while ago. Wrk is unusual among load testers in that it doesn’t have an option for turning off HTTP keepalive. HTTP 1.1 defaults to having KeepAlive, and it helps performance significantly… But shouldn’t you allow testing with both? Some intermediate software might not support KeepAlive, and HTTP 1.0 only supports it in an optional mode. Other load-testers normally allow turning it off. Shouldn’t Wrk allow it too?

Let’s explore that, and run some tests to check how “real” No-KeepAlive performs.

In this post I’m measuring with Rails Simpler Bench, using 110 60-second batches of HTTP requests with a 5-second warmup for each. It’s this experiment configuration file, but with more batches.

Does Wrk Allow Turning Off KeepAlive?

First off, Wrk has a workaround. You can supply the “Connection: Close” header, which asks the server to kill the connection when it’s finished processing the request. To be clear, that will definitely turn off KeepAlive - if the server closes the connection after processing each and every request, there is no KeepAlive. Wrk also claims in the bug report that you can do it with their Lua scripting. I don’t think that’s true, since Wrk’s Lua API doesn’t seem to have any way to directly close a connection. And in any case, supplying the header on the command line is easy while writing correct Lua is harder. You could set the header in Lua, but that’s not any better or easier than doing it on the command line, unless you want to somehow do it conditionally, and only some of the time.

(Wondering how to make no-KeepAlive happen, practically speaking? wrk -H "Connection: Close" will do it.)

Is it the same thing? Is supplying a close header the same as turning off KeepAlive?

Mostly yes, but not quite 100%.

When you supply the “close” header, you’re asking the server to close the connection afterward. Let’s assume the server does that since basically any correct HTTP server will.

But when you turn off KeepAlive on the client, you’re closing it client-side rather than waiting and detecting when the server has closed the socket. So: it’s about who initiates the socket close. Technically wrk will also just keep going with the same connection if the server somehow doesn’t correctly close the socket… But that’s more of a potential bug than an intentional difference.

It’s me writing this, so you may be wondering: does it make a performance difference?

Difference, No Difference, What’s the Difference?

First off, does KeepAlive itself make a difference? Absolutely. And like any protocol-level difference, how much you care depends on what you’re measuring. If you spend 4 seconds per HTTP request, the overhead from opening the connection seems small. If you’re spending a millisecond per request, suddenly the same overhead looks much bigger. Rails, and even Rack, have pretty nontrivial overhead so I’m going to answer in those terms.

Yeah, KeepAlive makes a big difference.

Specifically, here’s RSB with a simple “hello, world” Rack route with and without the header-based KeepAlive hack:

Config                       Throughput   Std Deviation
wrk w/ no extra header       13577        302.8
wrk -H "Connection: Close"   10185        263.4


That’s in the general neighborhood of 30% faster with KeepAlive. Admittedly, this is an example with tiny, fast routes and minimal network overhead. But more network overhead may actually make KeepAlive even faster, relatively, because if you turn off KeepAlive it has to make a new network connection for every request.

So whether or not “hack no-KeepAlive” versus “real no-KeepAlive” makes a difference, “KeepAlive” versus “no KeepAlive” definitely makes a big one.

What About Client-Disconnect?

KeepAlive isn’t a hard feature to add to a client normally. The logic for “no KeepAlive” is really simple (close the connection after each request.) What if we check client-closed versus server-closed KeepAlive?
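
For flavor, here's roughly what "client closes the connection" means, sketched in Ruby rather than wrk's C. This is illustrative only - wrk does the real work in C, and my patch just closes the socket after each response instead of reusing it:

require "socket"

def fetch_once(host, port, path)
  sock = TCPSocket.new(host, port)
  sock.write("GET #{path} HTTP/1.1\r\nHost: #{host}\r\n\r\n")
  response = sock.readpartial(65_536)   # good enough for a tiny "hello, world" body
  sock.close                            # the client closes - no reuse, no KeepAlive
  response
end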

I’ve written a very small patch to wrk to turn off KeepAlive with a command-line switch. There’s also a much older PR to wrk that does this using the same logic, so I didn’t file mine separately — I don’t think this change will get upstreamed.

In fact, just in case I broke something, I wound up testing several different wrk configurations with varying results… These are all using the RSB codebase, with 5 different variants for the wrk command line.

Below, I use “new_wrk” to mean my patched version of wrk, while “old_wrk” is wrk without my --no-keepalive patch.

wrk command                      Throughput (reqs/sec)   Std Deviation
old_wrk                          13577                   302.8
old_wrk -H "Connection: Close"   10185                   263.4
new_wrk                          13532                   310.9
new_wrk --no-keepalive           7087                    108.3
new_wrk -H "Connection: Close"   10193                   261.7

I see a couple of interesting results here. First off, there should be no difference between old_wrk and new_wrk for the normal and header-based KeepAlive modes… And that’s what I see. If I don’t turn on the new command line arg, the differences are well within the margin of measurement error (13577 vs 13532, 10185 vs 10193.)

However, the new client-disconnected no-KeepAlive mode is around 30% slower than the “hacked” server-disconnected no-KeepAlive! That leaves it at only a bit more than half the throughput of the full-KeepAlive configuration. I strongly suspect what’s happening is that a server-disconnected KeepAlive mode winds up sending the “close” request alongside the request data, while a client-disconnect winds up making a whole extra network round trip.

A Very Quick Ruby Note - Puma and JRuby

You might reasonably ask if there’s anything Ruby-specific here. Most of this isn’t - it’s experimenting on a load tester and just using a Ruby server to check against, after all.

However, there’s one very important Ruby-specific note for those of you who have been reading carefully.

Most of my posts here are related to work I’m doing on Ruby. This one is no exception.

Puma has some interesting KeepAlive-related bugs, especially in combination with JRuby. If you find yourself getting unreasonably slow results for no reason, especially with Puma and/or JRuby, try turning KeepAlive on or off.

The Puma and JRuby folks are both looking into it. Indeed, I found this bug while working with the JRuby folks.

Conclusions

There are several interesting takeaways here, depending on your existing background.

  • KeepAlive speeds up a benchmark a lot; if there’s no reason to turn it off, keep it on

  • wrk doesn’t have a ‘real’ way to turn off KeepAlive (most load testers do)

  • you can use a workaround to turn off KeepAlive for wrk… and it works great

  • if you turn off KeepAlive, make sure you’re still getting not-utterly-horrible performance

  • be careful combining Puma and/or JRuby with KeepAlive - test your performance

And that’s what I have for this week.

Where Does Rails Spend Its Time?

You may know that I run Rails Ruby Bench and write a fair bit about it. It’s intended to answer performance questions about a large Rails app running in a fairly real-world configuration.

Here’s a basic question I haven’t addressed much in this space: where does RRB actually spend most of its time?

I’ve used the excellent StackProf for the work below. It was both very effective and shockingly painless to use. These numbers are for Ruby 2.6, which is the current stable release in 2019.

(Disclaimer: this will be a lot of big listings and not much with the pretty graphs. So expect fairly dense info-dumps punctuated with interpretation.)

About Profiling

It’s hard to get high-quality profiling data that is both accurate and complete. Specifically, there are two common types of profiling and they have significant tradeoffs. Other methods of profiling fall roughly into these two categories, or a combination of them:

  • Instrumenting Profilers: insert code to track the start and stop points of whatever it measures; very complete, but distorts the accuracy by adding extra statements to the timing; usually high overhead; don’t run them in production

  • Sampling Profilers: every so many milliseconds, take a sample of where the code currently is; statistically accurate and can be quite low-overhead, but not particularly complete; fast parts of the code often receive no samples at all; don’t use them for coverage data; fast ones can be run in production

StackProf is a sampling profiler. It will give us a reasonably accurate picture of what’s going on, but it could easily miss methods entirely if they’re not much of the total runtime. It’s a statistical average of samples, not a Platonic ideal analysis. I’m cool with that - I’m just trying to figure out what bits of the runtime are large. A statistical average of samples is perfect for that.

I’m also running it for a lot of HTTP requests and adding the results together. Again, it’s a statistical average of samples - just what I want here.

Running with a Single Thread

Measuring just one process and one thread is often the least complicated. You don’t have to worry about them interfering with each other, and it makes a good baseline measurement. So let’s start with that. If I run RRB in that mode and collect 10,000 requests, here are the top (slowest) CPU-time entries, as measured by StackProf.

(I’ve removed the “total” columns from this output in favor of just the “samples” columns because “total” counts all methods called by that method, not just the method itself. You can get my original data if you’re curious about both.)

==================================
  Mode: cpu(1000)
  Samples: 4293 (0.00% miss rate)
  GC: 254 (5.92%)
==================================
SAMPLES    (pct)     FRAME
    206   (4.8%)     ActiveRecord::Attribute#initialize
    189   (4.4%)     ActiveRecord::LazyAttributeHash#[]
    122   (2.8%)     block (4 levels) in class_attribute
     98   (2.3%)     ActiveModel::Serializer::Associations::Config#option
     91   (2.1%)     block (2 levels) in class_attribute
     90   (2.1%)     ActiveSupport::PerThreadRegistry#instance
     85   (2.0%)     ThreadSafe::NonConcurrentCacheBackend#[]
     79   (1.8%)     String#to_json_with_active_support_encoder
     70   (1.6%)     ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#exec_no_cache
     67   (1.6%)     ActiveModel::Serializer#include?
     65   (1.5%)     SiteSettingExtension#provider
     59   (1.4%)     block (2 levels) in <class:Numeric>
     51   (1.2%)     ActiveRecord::ConnectionAdapters::PostgreSQL::Utils#extract_schema_qualified_name
     50   (1.2%)     ThreadSafe::NonConcurrentCacheBackend#get_or_default
     50   (1.2%)     Arel::Nodes::Binary#hash
     49   (1.1%)     ActiveRecord::Associations::JoinDependency::JoinPart#extract_record
     49   (1.1%)     ActiveSupport::JSON::Encoding::JSONGemEncoder::EscapedString#to_json
     48   (1.1%)     ActiveRecord::Attribute#value
     46   (1.1%)     ActiveRecord::LazyAttributeHash#assign_default_value
     45   (1.0%)     ActiveSupport::JSON::Encoding::JSONGemEncoder#jsonify
     45   (1.0%)     block in define_include_method
     43   (1.0%)     ActiveRecord::Result#hash_rows

There are a number of possibly-interesting things here. I’d probably summarize the results as “6% garbage collection, 17%ish ActiveRecord/ActiveModel/ARel/Postgres, around 4-6% JSON and serialization, and some caching and ActiveSupport odds and ends like class_attribute.” That’s not bad - with the understanding that ActiveRecord is kinda slow, and this profiler data definitely reflects that. A fast ORM like Sequel would presumably do better for performance, though it would require rewriting a bunch of code.

Running with Multiple Threads

You may recall that I usually run Rails Ruby Bench with lots of threads. How does that change things? Let’s check.

==================================
  Mode: cpu(1000)
  Samples: 40421 (0.51% miss rate)
  GC: 2706 (6.69%)
==================================
SAMPLES    (pct)     FRAME
   1398   (3.5%)     ActiveRecord::Attribute#initialize
   1169   (2.9%)     ActiveRecord::LazyAttributeHash#[]
    999   (2.5%)     ThreadSafe::NonConcurrentCacheBackend#[]
    923   (2.3%)     block (4 levels) in class_attribute
    712   (1.8%)     ActiveSupport::PerThreadRegistry#instance
    635   (1.6%)     block (2 levels) in class_attribute
    613   (1.5%)     ActiveModel::Serializer::Associations::Config#option
    556   (1.4%)     block (2 levels) in <class:Numeric>
    556   (1.4%)     Arel::Nodes::Binary#hash
    499   (1.2%)     ActiveRecord::Result#hash_rows
    489   (1.2%)     ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#exec_no_cache
    480   (1.2%)     ThreadSafe::NonConcurrentCacheBackend#get_or_default
    465   (1.2%)     ActiveModel::Serializer#include?
    436   (1.1%)     Hashie::Mash#convert_key
    433   (1.1%)     SiteSettingExtension#provider
    407   (1.0%)     ActiveRecord::ConnectionAdapters::PostgreSQL::Utils#extract_schema_qualified_name
    378   (0.9%)     String#to_json_with_active_support_encoder
    360   (0.9%)     Arel::Visitors::Reduce#visit
    348   (0.9%)     ActiveRecord::Associations::JoinDependency::JoinPart#extract_record
    343   (0.8%)     ActiveSupport::TimeWithZone#transfer_time_values_to_utc_constructor
    332   (0.8%)     ActiveSupport::JSON::Encoding::JSONGemEncoder#jsonify
    330   (0.8%)     ActiveSupport::JSON::Encoding::JSONGemEncoder::EscapedString#to_json
    328   (0.8%)     ActiveRecord::Type::TimeValue#new_time

This is pretty similar. ActiveRecord is showing around 20%ish rather than 17%, though that figure doesn’t include any of the smaller components - anything under 1% of the total (plus it’s sampled.) The serialization is still pretty high, around 4-6%.

If I try to interpret these results, the first thing I should point out is that they’re quite similar. While running with 6 threads/process is adding to (for instance) the amount of time spent on cache contention and garbage collection, it’s not changing it that much. Good. A massive change there is either a huge optimization that wouldn’t be available for single-threaded, or (more likely) a serious error of some kind.

If GC is High, Can We Fix That?

It would be reasonable to point out that 7% is a fair bit for garbage collection. It’s not unexpectedly high and Ruby has a pretty good garbage collector. But it’s high enough that it’s worth looking at - a noticeable change there could be significant.

There’s a special GC profile mode that Ruby can use, where it keeps track of information about each garbage collection that it does. So I went ahead and ran StackProf again with GC profiling turned on - first in the same “concurrent” setup as above, and then with jemalloc turned on to see if it had an effect.
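
If you want to try that mode yourself, it looks roughly like this - a sketch of the calls involved, not my exact harness:

GC::Profiler.enable

# ... handle the 10,000 HTTP requests here ...

puts GC::Profiler.total_time   # total seconds of tracked GC time
GC::Profiler.report            # print the human-readable report to STDOUT
raw = GC::Profiler.raw_data    # an array of hashes - handy for poking at in irb later
GC::Profiler.disable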

The short version is: not really. Without jemalloc, the GC profiler collected records of 2.4 seconds of GC time over the 10,000 HTTP requests… And with jemalloc, it collected 2.8 seconds of GC time total. I’m pretty sure what we’re seeing is that jemalloc’s primary speed advantage is during allocation and freeing… And with Ruby using a deferred sweep happening in a background thread, it’s a good bet that neither of these things count as garbage collection time.

This is one of those GC::Profiler reports. You can also get it as a Ruby hash table and then dump that, which makes it a bit easier to analyze in irb later.

I also took more StackProf results with profiling on, but 1) they’re pretty similar to the other results and 2) GC profiling actually takes enough time to distort the results a bit, so they’re likely to be less accurate than the ones above.

What Does All This Suggest?

There are a few interesting leads we could chase from here.

For instance, could JSON be lower? Looking through Discourse’s code, it’s using the oj gem via MultiJSON. OJ is pretty darn fast, so that’s probably going to be hard to trim to less of the time. And MultiJSON might be adding a tiny bit of overhead, but it shouldn’t be more than that. So we’d probably need a structural or algorithmic change of some kind (e.g. different caching) to lower JSON overhead. And for a very CRUD-heavy app, this isn’t an unreasonable amount of serialization time. Overall, I think Discourse is serializing pretty well, and these results reflect that.

ActiveRecord is a constant performance bugbear in Rails, and Discourse is certainly no exception. I use this for benchmarking and I want “typical” not “blazing fast,” so this is pretty reassuring for me personally - yup, that’s what I’m used to seeing slow down a Rails app. If you’re optimizing rather than benchmarking, the answers are 1) the ActiveRecord team keep making improvements and 2) consider using something other than ActiveRecord, such as Sequel. None of them are 100% API-interoperable with ActiveRecord, but if you’re willing to change a bit of code, some Ruby ORMs are surprisingly fast. ActiveRecord is convenient, flexible, powerful… but not terribly fast.
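
To give a sense of the kind of change that implies, here's the same sort of query in both. This is a sketch, not code from Discourse, and Post, topic and DB here are stand-ins:

# ActiveRecord - lots of convenience, lots of per-row object overhead
recent = Post.where(topic_id: topic.id).order(:created_at).limit(20).to_a

# Sequel - a similar query; rows come back as lighter-weight hashes by default
recent = DB[:posts].where(topic_id: topic.id).order(:created_at).limit(20).all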

Since jemalloc’s not making much difference in GC… in a real app, the next step would be optimization and trying to create less garbage. Again, for me personally, I’m benchmarking, so lots of garbage per request means I’m doing it right. Interestingly, jemalloc does seem to speed up Rails Ruby Bench significantly, so these results don’t mean it’s not helping. If anything, this may be a sign that StackProf’s measurement doesn’t do very well at measuring jemalloc’s results - perhaps it isn’t catching differences in free() call time? And garbage collection can be hard to measure well in any case.

Methodology

This is mostly just running for 10,000 requests and seeing what they look like added/averaged together. There are many reasons not to take this as a perfect summary, starting with the fact that the server wasn’t restarted to give multiple “batches” the way I normally do for Rails Ruby Bench work. However, I ran it multiple times to make sure the numbers basically hold up, and they basically seem to.

Don’t think of this as a bulletproof and exact summary of where every Rails app spends all its time - it wouldn’t be anyway. It’s a statistical summary, it’s a specific app and so on. Instead, you can think of it as where a lot of time happened to go one time that some guy measured… And I can think of it as grist for later tests and optimizations.

As for specifically how I got StackProf to measure the requests… First, of course, I added the StackProf gem to the Gemfile. Then in config.ru:

use StackProf::Middleware,
  enabled: true,
  mode: :cpu,
  path: "/tmp/stackprof",  # to save results
  interval: 1000,          # microseconds between samples
  save_every: 50           # write out a .dump file every 50 results

You can see other configuration options in the StackProf::Middleware source.

Conclusions

Here are a few simple takeaways:

  • Even when configured well, a Rails CRUD app will spend a fair bit of time on DB querying, ActiveRecord overhead and serialization,

  • Garbage collection is a lot better than in Ruby 1.9, but it’s still a nontrivial chunk of time; try to produce fewer garbage objects where you can,

  • ActiveRecord adds a fair bit of overhead on top of the DB itself; consider alternatives like Sequel and whether they’ll work for you,

  • StackProf is easy and awesome and it’s worth trying out on your Ruby app

See you in two weeks!

Ruby 2.7 and the Compacting Garbage Collector

Aaron Patterson, aka Tenderlove, has been working on a compacting garbage collector for Ruby for some time. CRuby memory slots have historically been quirky, and may take some tweaking - this makes them a bit simpler since the slot fragmentation problem can (potentially) go away.

Rails Ruby Bench isn’t the very best benchmark for this, but I’m curious what it will show - it tends to show memory savings as speed instead, so it’s not a definitive test for “no performance regressions.” But it can be a good way to check how the performance and memory tradeoffs balance out. (What would be “the best benchmark” for this? Probably something with a single thread of execution, limited memory usage and a nice clear graph of memory usage over time. That is not RRB.)

But RRB is also, not coincidentally, a great torture test to see how stable a new patch is. And with a compacting garbage collector, we care a great deal about that.

How Do I Use It?

Memory compaction doesn’t (yet) happen automatically. You can see debate in the Ruby bug about that, but the short version is that compaction is currently expensive, so it doesn’t (yet) happen without being explicitly invoked. Aaron has some ideas to speed it up - and it’s only just been integrated into a very pre-release Ruby version. So you should expect some changes before the Christmas release of Ruby 2.7.

Instead, if you want compaction to happen, you should call GC.compact. Most of Aaron’s testing is by loading a large Rails application and then calling GC.compact before forking. That way all the class code and the whole set of large, long-term Ruby objects get compacted with only one compaction. The flip side is that newly-allocated objects don’t benefit from the compaction… But in a Rails app, you normally want as many objects preloaded as possible anyway. For Rails, that’s a great way to use it.

How do you make that happen? I just added an initializer in config/initializers containing only the code “GC.compact” that runs after all the others are finished. You could also use a before-fork hook in your application server of choice.
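
Concretely, that initializer is about as small as a file can get - something like this, where the filename is just my choice, picked so it sorts after the others:

# config/initializers/zzz_gc_compact.rb
# Compact the heap once, after the app has finished loading everything long-lived.
GC.compact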

If you aren’t using Rails and expect to allocate slowly over a long time, it’s a harder question. You’ll probably want to periodically call GC.compact but not very often - it’s slower than a full manual GC, for instance, so you wouldn’t do it for every HTTP request. You’re probably better off calling it hourly or daily than multiple times per minute.

Testing Setup

For stability and speed testing, I used Rails Ruby Bench (aka RRB.)

RRB is a big concurrent Rails app processing a lot of requests as fast as it can. You’ve probably read about it here before - I’m not changing that setup significantly. For this test, I used 30 batches of 30,000 HTTP requests/batch for each configuration. The three configurations were “before” (the Ruby commit before GC compaction was added,) “after” (Ruby compiled at the merge commit) and “after with compaction” (Ruby at the merge commit, but I added an initializer to Discourse to actually do compaction.)

For the “before” commit, I used c09e35d7bbb5c18124d7ab54740bef966e145529. For “after”, I used 3ef4db15e95740839a0ed6d0224b2c9562bb2544 - Aaron’s merge of GC compact. That’s SVN commit 67479, from Feature #15626.

Usually I give big pretty graphs for these… But in this case, what I’m measuring is really simple. The question is, do I see any speed difference between these three configurations?

Why would I see a speed difference?

First, GC compaction actually does extra tracking for every memory allocation. I did see a performance regression on an earlier version of the compaction patch, even if I never compacted. And I wanted to make sure that regression didn’t make it into Ruby 2.7.

Second, GC compaction might save enough memory to make RRB faster. So I might see a performance improvement if I call GC.compact during setup.

And, of course, there was a chance that the new changes would cause crashes, either from the memory tracking or only after a compaction had occurred.

Results and Conclusion

The results themselves look pretty underwhelming, in the sense that they don’t have many numbers in them:

“Before” Ruby: median throughput 182.3 reqs/second, variance 43.5, StdDev 6.6

“After” Ruby: median throughput 179.6 reqs/second, variance 0.84, StdDev 0.92

“After” Ruby w/ Compaction: median throughput 180.3 reqs/second, variance 0.97, StdDev 0.98

But what you’re seeing there is very similar performance for all three variants, well within the margin of measurement error. Is it possible that the GC tracking slowed RRB down? It’s possible, yes. You can’t really prove a negative, which in this case means I cannot definitively say “these are exactly equal results.” But I can say that the (large, measurable) earlier regression is gone, but I’m not seeing significant speedups from the (very small) memory savings from GC compaction.

Better yet, I got no crashes in any of the 90 runs. That has become normal and expected for RRB runs… and it says good things about the stability of the new GC compaction patch.

You might ask, “does the much lower variance with GC compaction mean anything?” I don’t think so, no. Variance changes a lot from run to run. It’s imaginable that the lower variance will continue and has some meaning… and it’s just as likely that I happened to get two low-variance runs for the last two “just because.” That happens pretty often. You have to be careful reading too much into “within the margin of error” or you’ll start seeing phantom patterns in everything…

The Future

A lot of compaction’s appeal isn’t about immediate speed. It’s about having a solution for slot fragmentation, and about future improvements to various Ruby features.

So we’ll look forward to automatic periodic compaction happening, likely also in the December 2019 release of Ruby 2.7. And we’ll look forward to certain other garbage collection problems becoming tractable, as Ruby’s memory system becomes more capable and modern.

"Wait, Why is System Returning the Wrong Answer?" - A Debugging Story, and a Deep Dive into Kernel#system

I had a fun bug the other day - it involved a merry chase, many fine wrong answers, a disagreement across platforms… And I thought it was a Ruby bug, but it wasn’t. Instead it’s one of those not-a-bugs you just have to keep in mind as you develop.

And since it’s a non-bug that’s hard to find and hard to catch, perhaps you’d like to hear about it?

So… What Happened?

Old-timers may instantly recognize this problem, but I didn’t. This is one of several ways it can manifest.

I had written some benchmarking code on my Mac, I was running it on Linux, and a particular part of it was misbehaving. Specifically, I was using curl to see if the URL was available - if a server was running and accepting connections yet. Curl will return true if the connection succeeds and gets output, and return false if it can’t connect or gets an error. I also wanted to redirect all output, because I didn’t want a console message. Seems easy enough, right? It worked fine on my Mac.

    def url_available?
      system("curl #{@url} &>/dev/null")  # This doesn't work on Linux
    end

The “&>/dev/null” part redirects both STDOUT and STDERR to /dev/null so you don’t see it on the console.

If you try it out yourself on a Mac it works pretty well. And if you try it on Linux, you’ll find that whether the URL is available or not it returns true (no error), so it’s completely useless.

However, if you remove the output redirect it works great on both platforms. You just get error output to console if it fails.

Wait, What?

I wondered if I had found an error in system() for a while. Like, I added a bunch of print statements into the Ruby source to try and figure out what was going on. It doesn’t help that I tried several variations of the code and checked $? to see if the process had returned error and… basically confused myself a fair bit. I was nearly convinced that system() was returning true but $?.success? was returning false, which would have been basically impossible and would have meant a bug in Ruby.

Yeah, I ran down a pretty deep rabbit hole on this one.

In fact, the two commands wind up passing the same command line on Linux and MacOS. And if you run the command it passes in bash, you’ll get the same return value in bash - you can check by printing out $?, a lot like in Ruby.

A Quick Dive into Kernel#System

Let’s talk about what Kernel#system does, so I can explain what I did wrong.

If you include any special characters in your command (like the output redirection), Ruby will run your command in a subshell. In fact, system will do quite a few different things, depending on exactly what you pass it.

If your command is just a string with no special characters, it will run it fairly directly: “ls” will simply run “ls”, and “ls bob” will run “ls” with the single argument “bob”. No great surprise.

If your command does have special characters, though, such as an ampersand, dollar sign or greater-than, it assumes you’re doing some kind of shell trickery - it runs "/bin/sh" and passes whatever you gave it as an argument ("/bin/sh" with the arguments "-c" and whatever you gave to Kernel#system.)

You can also pass the command name and its arguments as separate strings - system("ls", "bob"), for instance, does the same thing as passing "ls bob" into Kernel#system, but with a bit more control: you’re guaranteed it’s not running a subshell, and you don’t have to escape or quote anything for a shell.

# Examples
system("ls")                 # runs "ls"
system("ls bob")             # runs "ls" w/ arg "bob"
system("ls", "bob")          # also runs "ls" w/ arg "bob", never via a shell
system("ls bob 2>/dev/null") # runs sh -c "ls bob 2>/dev/null"

No Really, What Went Wrong?

My code up above uses special characters. So it uses /bin/sh. I tried it on the Mac, it worked fine. Here’s the important difference that I missed:

On a Mac, /bin/sh is the same as bash. On Linux it isn’t.

Linux includes a much simpler shell it installs as /bin/sh, without a lot of bash-specific features. One of those bash-specific features is the ampersand-greater-than syntax that I used to redirect stdout and stderr at the same time. There’s a way to do it that’s compatible with both, but that version isn’t. And in this specific case, it always winds up returning true for /bin/sh, even if the command fails.

Oops.

So in some sense, I used a bash-specific command and I should fix that. I’ll show how to fix it that way below.

Or in a different sense, I used a big general-purpose hammer (a shell) for something I could have done simply and specifically in Ruby. I’ll fix it that way too, farther down.

How Should I Fix This?

Here’s a way to fix the shell incompatibility, simply and directly:

def url_available?
  system("curl #{@url} 1>/dev/null 2>&1")  # This works on Mac and Linux
end

This will redirect stdout to /dev/null, then redirect stderr to stdout. It works fine, and it’s a syntax that’s compatible with both bash and Linux’s default /bin/sh.

This way is fine. It does what you want. It’s enough. Indeed, as I write this it’s the approach I used to fix it in RSB.

There’s also a cleaner way, though it takes slightly more Ruby code. Let’s talk about Kernel#system a bit more and we can see how. It’s a more complex method, but you get more control over what gets called and how.

System’s Gory Glory

In addition to the command argument above, the one that can be an array or a processed string, there are extra “magic” arguments ahead and behind. There’s also another trick in the first argument - Kernel#system is like one of those “concept” furniture videos where everything unfolds into something else.

You saw above that command can be (documented here):

  • A string with special characters, which will expand into /bin/sh -c “your command”

  • A string with no special characters, which will directly run the command with no wrapping shell

  • Several separate strings, where the first is the command name and the rest are its arguments - this runs the command directly with no shell

  • The same as above, except the first argument is a two-element array of strings, [ commandName, newArgv0Value ] - the process runs commandName, but shows up under the new argv[0] (in ps, for example). If this sounds confusing, you should avoid it.

But you can also pass an optional hash before the command. If you do, that hash will be:

  • A hash of new environment variable values; normally these will be added to the parent process’s environment to get the new child environment. But see “options” below.

And you can also pass an optional hash after the command. If you do, that hash may have different keys to do different things (documented here), including:

  • :unsetenv_others - if true, unset every environment variable you didn’t pass into the first optional hash

  • :close_others - if true, close every file descriptor except stdout, stdin or stderr that isn’t redirected

  • :chdir - a new current directory to start the process in

  • :in, :out, :err, strings, integers, IO objects or arrays - redirect file descriptors, according to a complicated scheme

I won’t go through all the options because there are a lot of them, mostly symbols like the first three above.

But that last one looks promising. How would we do the redirect we want to /dev/null to throw away that output?

In this case, we want to redirect stderr and stdout both to /dev/null. Here’s one way to do that:

def url_available?
  system("curl", @url, 1 => [:child, 2], 2 => "/dev/null") # This works too
end

That means to redirect the child’s stdout (file descriptor 1) to its own stderr, and direct its stderr to (the file, which will be opened) /dev/null. Which is exactly what we want to do, but also a slightly awkward syntax for it. However, it guarantees that we won’t run an extra shell, and we won’t have to turn the arguments into a string and re-parse them, and we won’t have to worry about escaping the strings for a shell.

Once more, to see documentation for all the bits and bobs that system (and related calls like Kernel#spawn) can accept, here it is.

Here are more examples of system’s “fold-out” syntax with various pieces added:

# Examples
system({'RAILS_ENV' => 'profile'}, "rails server") # Set an env var first
system("rails", "server", pgroup: true) # Run server in a new process group
system("ls *", 2 => [:child, 1]) # runs sh -c "ls *" with stderr and stdout merged
system("ls *", 2 => :close) # runs sh -c "ls *" with stderr closed

Conclusion

Okay, so what’s the takeaway? Several come to mind:

  • /bin/sh is different on Mac (where it’s bash) and Linux (where it’s simpler and smaller)

  • It’s easy to use incompatible shell commands, and hard to test cross-platform

  • Ruby has a lot of shell-like functionality built into Kernel#system and similar calls - use it

  • By doing a bit of the shell’s work yourself (command parsing, redirects) you can save confusion and incompatibility

And that’s all I have for today.

Why is Ruby Slower on Mac? An Early Investigation

Sam Saffron has been investigating Discourse test run times on different platforms. While he laments spending so much extra time by running Windows, what strikes me most is the extra time on Mac — which many, many Rubyists use as their daily driver.

So which bits are slow? This is just a preliminary investigation, and I’m sure I’ll do more looking into it. But at first blush, what seems slow?

I’m curious for multiple reasons. One: if the Mac is so slow, is it better to run under Docker, or with an external VM, rather than directly on the Mac? Two: why is it slow? Can we characterize what is so slow and either do less of it or fix it?

First, let’s look at some tools and what they can or can’t tell us.

Ruby-Prof

Ruby-Prof is potentially interesting to show us just the rough outlines of what’s slow. It’s not great for the specifics because it’s an instrumenting profiler rather than a sampling profiler, and that distorts the results a bit. So: only good for the big picture. In general, you should expect an instrumenting profiler to add a bit of time to each method call, so you’d expect it to “flatten” results a bit - fast methods will seem a bit slower, and methods that take a long time won’t seem as much slower as they actually are.
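
In case you haven't used it, a ruby-prof run looks roughly like this. The profiling calls are ruby-prof's real API; the workload in the block is just a stand-in for OptCarrot:

require "ruby-prof"

result = RubyProf.profile do
  # stand-in workload - substitute whatever code you want profiled
  100_000.times { |i| Math.sqrt(i + 1) * 3 }
end

# Print a flat listing, much like the %self tables discussed below
printer = RubyProf::FlatPrinter.new(result)
printer.print(STDOUT, min_percent: 1)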

Also, Ruby-Prof takes a long time to write out larger output, which can be a problem if you run it under an application server like Puma - when it starts writing out a large result set, Puma is likely to kill it because the “request” is taking too long. So it also has limited utility for that reason.

As a result, I don’t really trust my current Rails results with it. There’s too much potential for severe sampling bias. Instead, let’s look at what it says about a non-HTTP CPU benchmark, OptCarrot.

I’m testing on very different machines - a Macbook Pro laptop running a normal MacOS UI versus a dedicated Amazon EC2 instance (m4.2xlarge) running Linux with no UI. It’s fair to call those unequal — they are, in all sorts of ways. However, they’re actually fairly similar for the question we’re curious about, which goes, “how fast is running tests on my Mac laptop/desktop versus running it on a separate Linux server/VM?”

Some Results

The first question is, how stable are those results? This is a fairly key question — if the results aren’t stable, then what they are relative to each other is a very different question.

For instance, here’s what two typical sets of OptCarrot results from the dedicated instance look next to each other:

I’ve cut out some columns you don’t care about. You’ll see the occasional line switched, but notice that only happens when the %self is very similar.

Pretty stable, right? What you’re looking at here is the leftmost column, the percentage of total time, as well as the order of the methods for how much of that time they take. In both cases, these listings are very solidly similar.

In other words, one of the primary Ruby CPU benchmarks used for Ruby 3x3, run on the most common platform for benchmarking, gives pretty solid results. But we were pretty sure of that, right?

How about on Mac, which is not a primary benchmarking platform for Ruby?

This is not Mac vs Linux, it’s Mac vs Mac on the same machine

These percentages vary a little more. Different rows switch places more often. What you’re seeing is a “wobblier” result - one where the “same” run just has more variation. I observed the same thing with RSB on Mac, though this is the first time I’ve tried to quantify it a bit.

Is that because the MacOS UI is running? Maybe. The amount of variation here is larger than the amount that Apple shows running in the Activity Monitor, but that doesn’t guarantee anything. And of course “how much is OS overhead?” is a really hard question to answer.

So… What’s not here?

After the wobble is accounted for, I don’t see any one or few methods that are massively slower on Mac. So this doesn’t look like there’s just a few operations here that are slowing everything way down. That’s a bit disappointing — wouldn’t it be nice if we could just fix a couple of things? But it makes sense.

Several things don’t seem to be in the listing above: extra garbage collection time could be distributed across all these categories, or it could manifest as a large spike in just a few places — I don’t see anything like that spike, not on any of my runs. So Mac does not seem to be slower because of a few spikes in garbage collection time. Given that the Mac memory allocator is supposed to be slower, that’s important to check. It could still be an overall slower allocator across the board - but OptCarrot doesn’t do a lot of memory allocation, and it isn’t showing up as a lot slower here anyway.

And in fact, I don’t think I’m seeing a huge slowdown. Comparing two different hosts this way isn’t in any way fair or representative, but Sam was seeing around a 2X slowdown on Mac in his Discourse results, and that’s not subtle. I don’t think I’m seeing a slowdown of that magnitude for OptCarrot. Sounds like I should be comparing some Rails and/or RSpec projects like Discourse - perhaps something there is the problem.

(Why didn’t I start with Discourse? Basically, because it’s hard to configure and even harder to configure the same. The odds that I’d spend days chasing down something that wasn’t even his problem are surprisingly high. Also, Docker or no Docker? Docker is mostly how people configure Discourse on Mac now, but it has completely different performance for a lot of common things - like files.)

Basics and Fundamentals

OptCarrot and Ruby-Prof aren’t instantly showing anything useful. So let’s step back a bit. What problems can Ruby fix vs not fix? What’s our basic situation?

Well, what if the Mac is somehow magically slower across the board at everything? Seems a bit unlikely, but we haven’t ruled it out. If the Mac was just as slow with random compiled C binaries, then there’s not much Ruby could do about this. It’s not like we’re going to skip GCC and start emitting our own compiled binaries.

If we wanted to check that, we could do more of an apples-to-apples comparison between Mac and Linux. Comparing a laptop to a virtualized server instance is, of course, not even slightly an apples-to-apples comparison.

But it’s worse than that. Hang on.

Sam strongly suggested installing Linux and Mac on the same machine dual-boot for testing — that’s the only way you’ll be sure you have the same exact speed. Even two of the same model fresh off the line aren’t necessarily the exact same speed as each other, for all sorts of good reasons. Slight CPU variation is the norm, not the exception.

And worse yet: you can’t run OS X headless, not really. Dual-boot will still have more processes running in the background in OS X, and slightly different compiler, and memory allocator, and… Yeah. So the exact same machine with dual-boot won’t give a proper apples-to-apples comparison.

It’s a good thing we don’t need one of those, isn’t it?

What We Can Get

Most of what we want to know is, is Ruby somehow slower than it should be on Mac? And if so, is it because of something at the Ruby level? If it’s not at the Ruby level then we can measure it and warn people, but not much more.

So first off, how do the speed of those two hosts compare? You can check a mid-2015-era Macbook Pro against an EC2 m4.2xlarge on GeekBench.com - and for single-core CPU benchmarks, they seem to think the Macbook is pretty poky - about 2.5 GB/sec while the Linux server gets 3.7 GB/sec. The Mac does better for overall rating (4264 single-core vs 2929 single-core), but it’s hard to tell what that means with so few tests run in common.

Okay, so then how do we compare? I downloaded the Phoronix test suite for both Mac and Linux to compare them and ran the CPU suite. That should at least give some similar results. Here are the tests in common I could easily get:

Test                              Macbook                     EC2 Linux instance
x265 3.0 (1080p video encoding)   2.98 fps                    2.64 fps
7-Zip Compression                 19859 MIPS                  18508 MIPS
Stockfish 9                       7906720 Nodes Per Second    7869399 Nodes Per Second


What I’m seeing there is basically that these are not dramatically different processors. And when I run optcarrot on them (also single-core) the Mac runs it at 39-40 fps pretty consistently, while (one core of) the EC2 instance runs it at 30fps. This is not obvious evidence for the Mac being slower at Ruby CPU benchmarks.

So: maybe what’s slow is something about Discourse? Or about Mac memory allocation or garbage collection?

Conclusions and Followups

All of this is initial work, and fairly simple. Expect more from me as I explore further.

What I’ve seen so far is:

  • Mac CPU benchmarks don’t seem especially slow in Ruby as opposed to out of Ruby

  • The relative speed of different operations seems fairly consistent between Linux and Mac Ruby

  • Mac takes a hit on both speed and consistency by running a UI and a fairly “busy” OS

Followups that are likely to be useful:

  • Discourse, most especially its test suite; this is what Sam found to be very slow

  • Other profiling tools like stackprof - ruby-prof’s “flattening” of performance may be hiding a problem

  • Garbage collection and memory performance

  • Filesystem I/O

Look for more from me on this topic in the coming weeks!

JIT Performance with a Simpler Benchmark

There have been some JIT performance improvements in Ruby 2.7, which is still prerelease. And lately I’m using a new, simpler benchmark for researching Ruby performance.

Hey - wasn’t JIT supposed to be easier to make work on simpler code? Let’s see how JIT, including the prerelease code, works with that new benchmark.

(Just wanna see graphs? These are fairly simple graphs, but graphs are always good. Scroll down for the graphs.)

The Setup - Methodology

You may remember that Rails Simpler Bench currently uses “hello, world”-type very simple routes that just return a static string. That’s probably the best possible Rails use case for JIT. I’m starting with no concurrency, just a single request at once. That doesn’t show JIT’s full speedup, but it’s the most accurate and most reproducible to measure… And mostly, we want to know if JIT speeds things up at all rather than showing the largest possible speedup. I’m also measuring in both Rails and plain Rack, with Puma, on a dedicated-tenancy AWS EC2 m4.2xlarge instance. There’s no networking happening outside the instance itself, so this should give us nice low-noise results.

I wound up running one set of tests (everything Ruby 2.6.2) on one instance and the other set (everything with new prerelease Ruby) on another - so don’t treat this as an apples-to-apples comparison of prerelease Ruby’s speedup over 2.6.2. That’s okay, there’s all sorts of reasons that’s not a good idea to do anyway. Instead, we’re just checking the relative performance of JIT to no-JIT for each Ruby.

“New prerelease Ruby 2.7” is going to be accurate for a lot of different commits before the release around Christmastime. For this article, I’m using commit 025206d0dd29266771f166eb4f59609af602213a, which was new on May 9th. It’s what “git pull” got when I was getting ready to write this post.

Each of these runs is done with 10 batches of 4 minutes of HTTP requests, after 2 minutes of warmup for the server. I’m using Puma for the app server and wrk as the HTTP load generator. This should sound a lot like the setup for several of my recent blog posts. You can find the benchmark code here, based on a variation of this config file.

The Results

Let’s start with Rails - it’s what gets asked the most often. How does JIT do?

Takashi has made it clear that JIT isn’t expected to be faster for Rails… and that has been my experience as well. But he says the new JIT does better than in 2.6.

So let’s try. How does new prerelease JIT do compared to the released 2.6? First I’ll show you the graph, then I’ll give a bit of interpretation.

That thick line toward the bottom is the X axis, or “rate == 0.”


Those pink bars are an indication of the 10th, 50th and 90th percentile from lowest to highest. It’s like a box plot that way.

On the left, for Ruby 2.6.2, the JIT and no-JIT plots are pretty far apart. The medians are 1280 (no JIT) versus 1060 (with JIT), for instance. JIT is substantially slower, though not as much slower as for Rails Ruby Bench. That should make sense: JIT has an easier time on simpler code with shorter methods, so Rails Ruby Bench is a terrible case for it. Rails Simpler Bench isn’t as bad.

Better yet, on the right you can see that they’re getting quite close for Ruby 2.7 prerelease - only around 5% slower, give or take.

What About Rack?

What should we expect for Rack? Well, if simpler is better for JITting, Rack should show JIT doing better relative to non-JIT than Rails does. And, as with Rails, the JIT-versus-no-JIT gap should be smaller in 2.7 than in 2.6.

And that’s roughly what we see:

JIT is still slower than non-JIT, but it’s getting closer. These numbers are much higher because a raw Rack “hello, world” route is very fast compared to Rails.


Conclusions

What you’re seeing above is pretty much what Takashi Kokubun said - while JIT is still slower on Rails (and Rack) than no JIT, the newer changes in 2.7 look promising… And JIT is catching up. We have around a year and a half before Ruby 3x3 is tentatively scheduled for release. This definitely looks like JIT could be a plus for Rails instead of a minus by then, but I wouldn’t expect it to be, say, 30% faster. But Takashi may prove me wrong!

Measuring Rails Overhead

We all know that using Ruby on Rails is slower than just plain Rack. After all, Rack is the simplest, most bare-bones web interface in Ruby, unless you’re willing to do without compatibility between app servers (or unless you’re writing your own.)

But how much overhead does Rails add? Is it getting less as Ruby gets faster?

I’m working with a new, simpler Rails benchmark lately. Let’s see what it can tell us on this topic.

Easy Does It

If we want to measure Rails overhead, let’s start simple - no concurrency (one thread, one process) and a simple Rails “hello, world”-style app, meaning a single route that returns a static string.

That’s pretty easy to measure in RSB. I’ll assume Puma is a solid choice of app server - not necessarily the best possible, but more representative than WEBrick. I’ll also use an Amazon EC2 m4.2xlarge dedicated instance. It’s my normal Rails Ruby Bench baseline, and a solid choice that a modestly successful Ruby startup would be likely to use. I’ll use Rails version 4.2 - not the newest or the best. But it’s the last version that’s still compatible with Ruby 2.0.0, which we need.

We’ll look at one of each Ruby minor version from 2.0 through 2.6. I like to start with Ruby 2.0.0p0 since it’s the baseline for Ruby 3x3. Here are throughputs that RSB gets for each of those versions:

[Graph: RSB_StaticRouteSingleBG.png - Rails throughput by Ruby version]

That looks decent - from around 760 iters/second for Ruby 2.0 to around 1000 iters/second for Ruby 2.6. Keep in mind that this is a single-threaded benchmark, so the server is only using one core. You can get much faster numbers with more cores, but then it’s harder to tell exactly what’s going on. We’ll start simple.

Now: how much of that overhead is Ruby on Rails, versus the application server and so on? The easiest way to check that is to run a Rack “hello, world” application with the same configuration and compare it to the Rails app.
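
If you haven’t seen one, a bare Rack “hello, world” app is tiny. Here’s a sketch of the idea - not RSB’s exact code, but the same shape: a single endpoint returning a static string, run under Puma.

    # config.ru - a bare-bones Rack "hello, world" app
    # (a sketch of the idea, not RSB's exact code).
    # Run it with Puma: puma config.ru
    run lambda { |env|
      [200, { "Content-Type" => "text/plain" }, ["Hello, world!"]]
    }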

Here’s the speed for that:

[Graph: RSB_RackStaticRouteSingleBG.png - Rack throughput by Ruby version]

Once again, not bad. You’ll notice that Rails is quite heavy here - the Rack-based app runs far faster. Rails is really not designed for “hello, world”-type applications, just as you’d expect. But we can do a simple mathematical trick to subtract out the Puma and Rack overhead and get just the Rails overhead:

[Formula image: iters_sec_formula.png]

Then we can subtract that Puma-and-Rack overhead from the Rails numbers. Here’s what that looks like when we do it once for each Ruby version.

[Graph: RailsTimePerRequestBG.png - Rails-only time per request by Ruby version]

And now you can see how much time Rails adds to each route in your Rails application! You’ll notice the units are “usec”, or microseconds. So, rounding shamelessly, Rails adds around 1 millisecond (1/1000th of a second) to each request. The Rack requests above ran at more like 12,000/second, or around 83 usec per request - that 83 usec is added on top of the Rails time in the last graph, not subtracted from it.
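
To make the subtraction concrete, here’s the arithmetic with rough numbers read off the graphs above - approximate values, just to show the calculation:

    # Worked example of the overhead calculation, using approximate numbers
    # from the graphs above (Ruby 2.6, single-threaded).
    rails_rate = 1_000.0    # Rails requests/second
    rack_rate  = 12_000.0   # Rack requests/second, same setup

    usec_per_rails_req = 1_000_000.0 / rails_rate   # => 1000.0 usec
    usec_per_rack_req  = 1_000_000.0 / rack_rate    # => ~83.3 usec

    rails_only_overhead = usec_per_rails_req - usec_per_rack_req
    puts rails_only_overhead.round   # => 917 usec, the "around 1 millisecond" above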

Other Observations

When you measure, you usually get roughly what you were looking for - in this case, we answered the question, “how much time does Rails take for each request?” But you often get other interesting information as well.

In this case, we get some interesting data points on what gets faster with newer Ruby versions.

You may recall that Discourse, a big Rails app, running with high concurrency, gets about 72% faster from Ruby 2.0.0p0 to Ruby 2.6. Some of the numbers with OptCarrot show huge speedups, 400% and more in a few specific configurations.

The numbers above are less exciting, more in the neighborhood of 30% speedup. Heck, Rack gets only 16%. Why?
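
For reference, that 30% figure comes straight from the first graph’s endpoints - roughly 760 iterations/second on Ruby 2.0.0p0 versus roughly 1000 on 2.6:

    # Where the ~30% Rails speedup comes from (approximate graph endpoints).
    ruby_2_0_rate = 760.0    # iters/second, Ruby 2.0.0p0
    ruby_2_6_rate = 1000.0   # iters/second, Ruby 2.6

    speedup = (ruby_2_6_rate - ruby_2_0_rate) / ruby_2_0_rate
    percent = (speedup * 100).round(1)
    puts percent   # => 31.6, i.e. "in the neighborhood of 30%"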

I’ll let you in on a secret - when I time RSB with WEBrick instead of Puma, WEBrick gets about 74% faster from Ruby 2.0.0p0 to 2.6. And even after that 74% speedup, it’s still slower than Puma.

Puma uses a reactor and the libev event library to spend most of its time in highly-tuned C code in system libraries. As a result, it’s quite fast. It also doesn’t really get faster when Ruby does — that’s not where it spends its time.

WEBrick can get much faster because it’s spending lots of time in Ruby… But only to approach Puma, not really to surpass it.

OptCarrot can do even better - it’s performance-intensive all-Ruby code, it’s processor-bound, and a lot of optimizations are aimed at exactly what it’s doing. So it can make huge gains - tripling its speed or more. You’ll also notice if you explore OptCarrot a bit that it’s harder to see those huge gains if it’s running in optimized mode. There’s just less fat to cut. That should make sense, intuitively.

And highly-tuned code that’s still basically Ruby, like the per-request Rails code, is in between. In this case, you’re seeing it gain around 30%, which is much better than nothing. In fact, it’s quite respectable as a gain to highly-tuned code written in a mature programming language. That 30% savings will save a lot of processor cycles for a lot of Rails users. It just doesn’t make a stunning headline.

Conclusions

We’ve checked Rails’ overhead: it’s around 900 usec/request for modern Ruby.

We’ve checked how it’s improved: from about 1200 usec to 900 usec since Ruby 2.0.0p0.

And we’ve observed the range of improvement in Ruby code: glue code like Puma only gains around 16% from Ruby 2.0.0p0 to 2.6, because it barely spends any time in Ruby. Your C extensions aren’t going to magically get faster because they’re waiting on C, not Ruby. And it’s quite usual to get around 72%-74% on “all-Ruby” code, from Discourse to WEBrick. But only in rare CPU-heavy cases are you going to see OptCarrot-like gains of 400% or more… And even then, only if you’re running fairly un-optimized code.

Here’s one possible interpretation of that: optimization isn’t really about taking your leanest, meanest, most carefully-tuned code and making it way better. Most optimization lets you write only-okay code and get closer to those lean-and-mean results without as much effort. It’s not about speeding up your already-fastest code - it’s about speeding you up in writing the other 95% of your code.