Ruby Fellow Hired!

At the end of 2015, Matz announced that Ruby 3 will have a focus on performance (a.k.a Ruby 3x3). We welcomed this news and wanted to help.

Today, we are happy to announce that we have hired Noah Gibbs (@codefolio) to work full-time on Ruby performance. Noah has a strong background in C / C++ and low-level programming. He is also no stranger to the Ruby / Rails world, having worked on a number of large-scale Rails applications, presented at conferences and meet-ups, and even written a book on Rails.

We are thrilled to have Noah join "the many performance wonks in the community who work tirelessly to make everything faster."

Ruby 3x3: Ruby 3 will be 3 times faster

At RubyConf 2015 in San Antonio, Yukihiro "Matz" Matsumoto announced Ruby 3x3. The goal of Ruby 3x3 is to make Ruby 3 three times faster than Ruby 2. At AppFolio, we think this is awesome and want to help.

[Update: watch Matz's keynote]

We have been using Ruby for the past nine years. Like most people, we initially came to Ruby via Rails. One of the main ideals of Rails was that developer happiness and productivity are more important than raw system performance. And Ruby supported that quite well. It's easy to write and easy to read. It follows the principle of least surprise. It's flexible and dynamic and perfect for DSLs. It even supports "Seattle style".

But for all of Ruby’s virtues, it lacks speed. This isn't to say that performance hasn't improved over time: it has. But is Ruby fast yet? No. And today, performance expectations are very different from those of nine years ago. Customers now expect more responsive web applications and APIs, developers expect a faster development environment (running rake, running tests, starting Rails, compiling assets), and operations teams expect better utilization of cores and hardware resources. In other words, today we want developer happiness without sacrificing performance.

So, why hasn't Ruby been able to deliver better performance results? Jared Friedman argues that, in part, it's due to a lack of corporate sponsorship. In the case of PHP, it took Facebook to move it forward. In the case of JavaScript, it took Google to kick off the performance arms race. In the case of Ruby, Heroku has been sponsoring Matz, Koichi Sasada, and Nobuyoshi Nakada. That's great, but that's not enough. The development and maintenance of an entire programming language is a massive job. Performance optimization is a challenging problem, and it's only one of many important concerns in that effort. The Heroku three and the rest of the Ruby core team are going to need more support in order to satisfy today's performance expectations.

So, we at AppFolio would like to help. We have reached out to Matz and are proud to be part of the Ruby 3x3 effort.

[Image: RubyConf 2015 keynote]

Our plan is to hire a full-time engineer to work with the Ruby core team on performance. Ideally, we would find someone who is not only a strong C programmer but also has experience with one of the other VMs out there (e.g. JVMs, JavaScript VMs).

So, what are the possible ways to make Ruby faster? Here are a few high level ideas:

  • just-in-time compilation: a Ruby JIT is a very promising idea. After all, Java and JavaScript solved their performance issues with the introduction of JIT compilers. Interestingly, there has already been one experimental attempt at a Ruby JIT: RuJIT, a trace-based JIT. RuJIT showed 2x-5x performance gains but was very memory hungry. Perhaps the RuJIT work can be revisited and improved, or perhaps we need a new approach with a method-based JIT.
  • ahead-of-time compilation: the concept is to reduce the time the Ruby VM has to spend before actually running a program. Aaron Patterson has already played around with this approach a bit.
  • concurrency: despite the common misconception about the GIL, Ruby supports concurrency quite well in cases where IO is involved (making HTTP requests, reading a file, querying the database, etc.). However, Thread is a low-level primitive, which makes it difficult to write concurrent programs correctly. Gems like concurrent-ruby provide higher-level abstractions (see the sketch after this list), but we need these higher-level abstractions in Ruby itself to gain adoption and greater performance.
  • optimize / remove GIL: in order to allow Ruby to be a viable option in the parallel computing realm, we need to remove the GIL.
  • profile-optimize-rinse-repeat: of course, we cannot overlook the bread and butter of profiling the existing code and finding ways to optimize the bottlenecks.
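
Here is the sketch referenced in the concurrency item above: a minimal example, assuming the concurrent-ruby gem is installed, in which Thread never appears directly yet the IO-bound requests still overlap.

require 'concurrent'
require 'net/http'
require 'uri'

urls = %w[https://www.ruby-lang.org https://rubygems.org]

# Each Future runs its block on a managed thread pool; because the GIL is
# released during IO, the HTTP requests proceed concurrently.
futures = urls.map do |url|
  Concurrent::Future.execute { Net::HTTP.get(URI(url)) }
end

# Future#value blocks until the result is ready (nil if the block raised).
bodies = futures.map(&:value)
puts bodies.compact.map(&:bytesize).inspect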

Let us know if you or someone you know is qualified and interested in the above problems (paul dot kmiec at appfolio dot com AND matz at ruby-lang dot org). 

We're just one company among many whose businesses are built on Ruby, so we call out to the others:

Join us and help make Ruby 3x3 a reality.

P.S. If you'd like to contribute to the Ruby / Rails community at large, we suggest also taking a look at RubyTogether.org.

P.P.S. Thanks to John Yoder and Andrew Mutz for helping shape this post.

Discovery of a MySQL bug

Recently, one of our MySQL instances crashed. We responded to the alert, but by the time we logged into the server, the MySQL instance had already restarted and recovered. So, we started to investigate the cause of the crash.

Our MySQL data volume is NFS-mounted on a NetApp filer in order to allow for reliable crash recovery and data replication. Our monitoring system showed a spike of near-Gb-line-rate output from the MySQL server to the NetApp filer, and it quickly became apparent that MySQL crashed because it ran out of disk space. However, the MySQL disk usage did not change. Instead, the space was used by the NetApp snapshots, each on the order of 20GB. Snapshots, much like source control, keep track of changes made to files, allowing you to view past versions. The NetApp is configured to automatically create snapshots and garbage collect old ones. However, as the rate of change grows, the size of the snapshots grows as well. The writing of GBs of data caused enormous snapshots to be created. This meant that whatever wrote those massive files in a short time also deleted them in that same short time.

Additional metrics in our monitoring system pointed us towards MySQL's internal temporary tables. MySQL uses these internal temporary tables to satisfy certain queries. When an internal temporary table becomes too big for memory, MySQL flushes it to disk. Once the query finishes, the internal temporary table is deleted. So, in this case, MySQL was processing a query and needed a 20GB internal temporary table!? Naturally, we needed to find out which query was causing this and on which data set (each customer is stored in a separate database, following the multi-tenancy principle).
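
As a minimal sketch (mysql2 gem, placeholder connection details, not our actual monitoring setup), these are the server variables and counters that reveal when internal temporary tables spill to disk:

require 'mysql2'

# Placeholder connection details, for illustration only.
client = Mysql2::Client.new(host: 'localhost', username: 'app', password: 'secret')

# In-memory temporary tables are capped by tmp_table_size / max_heap_table_size;
# anything larger is converted to an on-disk temporary table.
client.query(
  "SHOW VARIABLES WHERE Variable_name IN ('tmp_table_size', 'max_heap_table_size')"
).each { |row| puts "#{row['Variable_name']} = #{row['Value']}" }

# Created_tmp_disk_tables counts temporary tables that had to be written to disk.
client.query("SHOW GLOBAL STATUS LIKE 'Created_tmp%'")
      .each { |row| puts "#{row['Variable_name']} = #{row['Value']}" }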

Based on mytop and the MySQL slow query log, we were able to single out one suspicious query. It was a query executed by Solr's data import handler during a delta reindex. The application knows which models and which attributes are stored in Solr. Whenever any of those models or attributes change, the application sends a request to Solr asking it to delta reindex. We have an extension in Solr which waits 60 seconds before actually delta reindexing, which effectively coalesces multiple delta reindex requests into a single delta reindex and lightens the load on MySQL.

It turned out that this suspicious query had been changed at the beginning of December but, due to the Christmas / New Year holidays, had only been in production for a couple of weeks. The change was rather unusual: it added a join from table A to table X to table B to a query that was already joining from table A to table Y to table B. This was a deliberate change allowing us to migrate from table X to table Y over a few development cycles, and the fact that we now had a double join (and a cartesian product) from table A to table B was discussed at development time and deemed okay. We assumed fewer than 50 rows would result from the join of table A to table B, so in the worst case we'd end up with 2,500 (50 x 50) rows because of the double join.

We found the internal temporary table that MySQL flushed to disk in one of the NetApp's snapshots. We then matched the data in that file to a particular data set. In this data set, the join from table A to table B had 500 rows, which would produce 250,000 (500 x 500) rows because of the double join. Clearly 250,000 rows is significantly larger than the expected 2,500 rows, but still nothing that MySQL should not be able to handle.

We manually explained and executed this query and we saw no issues whatsoever. Hmmm??? Perhaps this query was not the cause after all?

Nevertheless, we proceeded to copy the data set to our staging servers to see whether we could reproduce the issue there. Based on the suspicious delta reindex query, we knew what to edit in the UI in order to trigger that query. We made the edit and waited. The application asked Solr to delta reindex. Solr waited 60 seconds and started to delta reindex. After a few seconds, Solr started to execute our suspicious query, and after another 30 seconds, MySQL dumped a 20GB internal temporary table to disk. Bingo!

So again we tried to manually explain and execute that query and again we saw no issues whatsoever. Wat!?

We speculated there must be some difference between Solr running the query and us running the query manually. We enabled the MySQL full query log and repeated the edit in the UI. This is what Solr executed,

711723 Query /* mysql-connector-java-5.1.21 ( Revision: ${bzr.revision-id} ) */SELECT @@session.auto_increment_increment
711723 Query SHOW COLLATION
711723 Query SET NAMES utf8mb4
711723 Query SET character_set_results = NULL
711723 Query SET autocommit=1
711723 Query SET sql_mode='STRICT_TRANS_TABLES'
711723 Query SET autocommit=0
711723 Query <delta reindex query>

We quickly focused on SET NAMES utf8mb4. Running the delta reindex query from the MySQL client by itself did not cause the problem, but running the delta reindex query after SET NAMES utf8mb4 caused the 20GB file. In fact, we tried other encodings, including utf8, and all of them seemed to cause the 20GB file. The only one that worked was latin1.

This led us to realize that, by default, our MySQL client uses latin1, whereas our application and Solr use utf8 and utf8mb4, respectively. So, in theory, if Solr used the latin1 encoding, it would avoid the 20GB file. We quickly reconfigured Solr to use latin1 and, sure enough, it worked.
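
For illustration, here is a minimal sketch using the mysql2 gem with placeholder connection details (the real change was made in Solr's configuration, not in Ruby) of how the client-side character set, which is what SET NAMES controls, can be pinned:

require 'mysql2'

# Placeholder connection details; the :encoding option determines the
# character set the client negotiates (the equivalent of SET NAMES).
client = Mysql2::Client.new(
  host:     'localhost',
  username: 'app',
  database: 'app_db',
  encoding: 'latin1'   # with 'utf8' or 'utf8mb4', the same query produced the 20GB file
)

# The delta reindex query would be issued over this connection; with latin1
# the internal temporary table stayed small enough to behave normally.
puts client.query("SHOW VARIABLES LIKE 'character_set_client'").to_a.inspect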

Our attention now turned to understanding why that particular delta reindex query, on that particular data set, with the utf8mb4 encoding, was causing MySQL to write a 20GB file. It turns out MySQL has a bug when a GROUP_CONCAT(DISTINCT ...) is too big to fit in memory and requires MySQL to flush to disk. In our case, the bug is triggered because utf8 / utf8mb4 use multiple bytes per character whereas latin1 uses one byte per character. You can read more details about this bug here.
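
To make the failure mode concrete, here is a rough, hypothetical reconstruction of the query shape; the table and column names are invented for illustration and are not our actual schema:

# Hypothetical reconstruction with invented table / column names. The doubled
# join from table A to table B multiplies the rows, and the GROUP_CONCAT(DISTINCT ...)
# aggregation over a multi-byte character set is what MySQL ends up flushing to disk.
DELTA_REINDEX_SHAPE = <<-SQL
  SELECT a.id,
         GROUP_CONCAT(DISTINCT b_via_x.value) AS values_via_x,
         GROUP_CONCAT(DISTINCT b_via_y.value) AS values_via_y
  FROM   table_a a
  JOIN   table_x x       ON x.a_id = a.id
  JOIN   table_b b_via_x ON b_via_x.id = x.b_id
  JOIN   table_y y       ON y.a_id = a.id
  JOIN   table_b b_via_y ON b_via_y.id = y.b_id
  GROUP  BY a.id
SQL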

Don't make your users wait for GC

One of the weaknesses of Ruby is garbage collection. Although some improvements have been made (tunable GC settings, lazy sweep, bitmap marking), it is still a simple stop-the-world GC. This can be problematic, especially for large applications. We have a rather large application running on Passenger 3 using REE 1.8.7 with a tuned GC. As the application grew, GC performance degraded until GC was averaging 130 ms per request.
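
For reference, here is a minimal sketch of one way to measure GC time per request with a Rack middleware. It assumes GC::Profiler (available since Ruby 1.9.2) rather than the REE 1.8.7 setup described above, and the class name is invented for illustration.

# Rack middleware sketch: logs how much GC time accumulated during each request.
class GCTimePerRequest
  def initialize(app)
    @app = app
    GC::Profiler.enable
  end

  def call(env)
    before   = GC::Profiler.total_time        # seconds spent in GC so far
    response = @app.call(env)
    gc_ms    = (GC::Profiler.total_time - before) * 1000
    Rails.logger.info("GC during request: #{gc_ms.round(1)} ms")
    response
  end
end

# Rails.application.config.middleware.use GCTimePerRequest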

It was time to address our obvious GC problem.

We could have continued to tune the GC and optimize memory usage to reduce the impact of GC on response times. But why? Why have GC impact response times at all? Instead, we decided to trigger GC after one request finishes but before the next request starts. This is not a new idea. It is generically referred to as out-of-band work (OOBW). It has been implemented by Unicorn and discussed here. However, it is not supported by Passenger, so we decided to add OOBW support to Passenger.

Our patch allows the application to respond with an additional header, namely X-Passenger-Request-OOB-Work. When Passenger sees this header, it stops sending new requests to that application process and tells the application process to perform the out-of-band work. The application registers an oob_work callback for the work it wants done using Passenger's event mechanism. When the work is done, the application process resumes handling normal requests.

All the application code can be packaged in an initializer,

PhusionPassenger.on_event(:oob_work) do
  # Passenger fires :oob_work between requests, so this GC run never adds
  # to a user-facing response time.
  t0 = Time.now
  GC.start
  Rails.logger.info "Out-of-Band GC finished in #{Time.now - t0} sec"
end

# Rack middleware that requests out-of-band work from Passenger after every
# @frequency requests by setting the X-Passenger-Request-OOB-Work header.
class RequestOOBWork
  def initialize(app, frequency)
    @app = app
    @frequency = frequency
    @request_count = 0
  end

  def call(env)
    @request_count += 1
    status, headers, body = @app.call(env)
    if @request_count % @frequency == 0
      headers['X-Passenger-Request-OOB-Work'] = 'true'
    end
    [status, headers, body]
  end
end

Rails.application.config.middleware.use RequestOOBWork, 5

After experimentation, we decided to trigger GC after every 5 requests. If you set the parameter too high, GC will continue to occur in the middle of requests. If you set it too low, application processes will be busy performing their out-of-band GC, causing Passenger to spawn more workers.

With these changes in place, our graphs showed a 10x improvement in average GC time per request.

The patch we implemented and deployed to production is for Passenger 3. It is available here.

Passenger 4 is currently under active development. In fact, beta 1 has already been released. We've ported our patch to Passenger 4. It is merged and expected to be included in beta 2.