Handling Big Data in Ruby

Konstantin Ilchenko
3 min read · Dec 27, 2021

Have you ever wondered how to efficiently parse big files in Ruby, like a 500 MB to 1 GB CSV?

Let’s suppose we have a 10_000_000-row file (~540 MB) with the following structure:

"user","revenue"
"f86fe409-42e0-469d-87e6-52b84c7f72dc",0.000492905577346

Our task is to get the total revenue from it. This is a fairly simple operation: we just need to iterate line by line and accumulate the total revenue.

Solution 0

I will skip solutions that load the whole file into memory and then iterate through it, to save us some time :)

Solution 1

The first thing that comes to mind is good old CSV. A pretty simple solution:

require 'csv'

sum = 0
# CSV.foreach streams the file row by row instead of loading it all at once;
# the header row's "revenue".to_f is just 0.0, so it doesn't skew the sum
CSV.foreach(filename) { |row| sum += row.last.to_f }
sum

It took over a minute (62 seconds!) to iterate through the file. This is too much; let’s try to improve.

Solution 2

Let’s remove the overhead that comes from CSV and try the plain File class with manual parsing:

sum = 0
# Split each line on commas manually; good enough while no field contains a comma
File.foreach(filename) { |row| sum += row.split(',').last.to_f }
sum
# File.open(filename).each gives similar performance

Now it is much faster: 14 seconds on my laptop, almost 5 times quicker with only small changes in the code. OK, we could end the post here. Or not?

Solution 3

Meet the ‘red-arrow’ gem and its big brother behind the scenes, Apache Arrow.

Arrow is not only a columnar format for storing data, but also an entire computation ecosystem written in C++, which lets you parse not only the Arrow format but also Parquet, CSV, and ORC. It has bindings in many programming languages, including Ruby. And I would like to say thank you to Sutou Kouhei, who maintains the Ruby bindings.
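For example, with the red-parquet gem installed, the same table API reads Parquet files too (a sketch; the file name is an assumption, and the format is inferred from the extension):

require 'parquet' # red-parquet gem, adds Parquet support to Arrow::Table

table = Arrow::Table.load('data.parquet')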

Let’s try to build a solution with Arrow.

require 'arrow' # from the red-arrow gem

table = Arrow::Table.load(filename)        # read the whole file into a columnar table
sum_function = Arrow::Function.find('sum') # look up the built-in 'sum' compute function
sum_function.execute([table['revenue'].data]).value.value

Boom!💣 less than 1 second!

So what just happened:

  • Arrow::Table.load reads the whole file into memory (!) in columnar format.
  • Arrow::Function#execute applies an Arrow compute function to a column. This is faster than a plain Ruby sum (see the sketch after this list).
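The same pattern works for other aggregations; for instance, a minimal sketch of computing the average revenue with Arrow’s standard ‘mean’ compute function, reusing the table loaded above:

mean_function = Arrow::Function.find('mean')
mean_function.execute([table['revenue'].data]).value.value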

Of course, the speed is not free. Here are some thoughts about this gem:

Pros:

  • Loads the whole file in memory, so you can do any transformations, grouping, and filtering (see the sketch after this list).
  • It automatically casts each column to the best possible data type. You don’t need to call ‘to_f’.
  • Blazing fast 🚀 It outperforms plain Ruby by more than 10 times, especially on big datasets.
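As an example of in-memory filtering, here is a sketch using the gem’s slicer API on the table loaded above (the threshold is arbitrary):

# Keep only the rows where revenue exceeds a threshold
big_rows = table.slice { |slicer| slicer.revenue > 0.0001 }
big_rows.n_rows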

Cons:

  • Loads the whole file in memory. If you have a 10 GB file, you need 10 GB of RAM.
  • It is not a full data-frame solution; you have to do a lot of coding manually, like applying compute functions yourself.
  • Some of the C++ functionality is not available in the gem.
  • Not a lot of projects use this library yet, so you may run into bugs, but I hope it will be adopted by the Ruby community.

In the next post, I will try to cover some examples of how to use Arrow with Ruby.

Below you can find a script that compares the performance of the three solutions above on 10_000-line (data10k.csv) and 10_000_000-line (data10M.csv) files.
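Here is a minimal sketch of such a comparison, assuming the file names above and Ruby’s standard Benchmark module:

require 'benchmark'
require 'csv'
require 'arrow'

filename = 'data10M.csv' # or 'data10k.csv'

Benchmark.bm(8) do |x|
  x.report('csv') do
    sum = 0
    CSV.foreach(filename) { |row| sum += row.last.to_f }
  end

  x.report('file') do
    sum = 0
    File.foreach(filename) { |row| sum += row.split(',').last.to_f }
  end

  x.report('arrow') do
    table = Arrow::Table.load(filename)
    Arrow::Function.find('sum').execute([table['revenue'].data]).value.value
  end
end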

If you know how to improve the performance, leave your ideas in the comments!
