Handling Big Data in Ruby

Have you ever wondered how to efficiently parse big files in Ruby, say a 500 MB to 1 GB CSV?

Let’s suppose we have a 10_000_000-row file (~540 MB) with the following structure:

"user","revenue"
"f86fe409-42e0-469d-87e6-52b84c7f72dc",0.000492905577346
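
If you want to follow along, a file with this shape is easy to generate. Here is a short sketch (the file name `data.csv` and the row count are my placeholders, not from the post; bump ROWS to 10_000_000 to reproduce the ~540 MB file):

```ruby
require 'securerandom'

# Number of data rows to generate; 10_000_000 gives roughly the
# 540 MB file used in the benchmarks below.
ROWS = 1_000

File.open('data.csv', 'w') do |f|
  f.puts '"user","revenue"'
  # Each row: a quoted UUID and an unquoted random revenue value.
  ROWS.times { f.puts %("#{SecureRandom.uuid}",#{rand}) }
end
```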

Our task is to get the total revenue from it. This is a fairly simple operation: we just need to iterate line by line and accumulate the total.

Solution 0

To save us some time, I will skip solutions that read the whole file into a string first and then iterate over it :)

Solution 1

The first thing that comes to mind is the good old CSV library from the standard library. It makes for a pretty simple solution:

require 'csv'

sum = 0
CSV.foreach(filename) { |row| sum += row.last.to_f }
sum

It took over a minute (62 seconds!) to iterate through the file. That is too much; let’s try to improve.
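
For reference, timings like these can be measured with Benchmark.realtime from the standard library. A minimal sketch (the tiny sample file here is my own, so the snippet is self-contained):

```ruby
require 'csv'
require 'benchmark'

# Write a tiny stand-in file so the example runs on its own.
File.write('sample.csv', %("user","revenue"\n"a",1.5\n"b",2.5\n))

sum = 0
elapsed = Benchmark.realtime do
  # Same loop as Solution 1; the header row contributes 0.0.
  CSV.foreach('sample.csv') { |row| sum += row.last.to_f }
end
puts "sum=#{sum} in #{elapsed.round(3)}s"
```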

Solution 2

Let’s remove the overhead that comes with CSV parsing and try the plain File class with manual parsing:

sum = 0
File.foreach(filename) { |row| sum += row.split(',').last.to_f }
sum
# File.open(filename).each gives similar performance

Now it is much faster: 14 seconds on my laptop, almost 5 times quicker with a small change in the code. OK, we could end the post here. Or not?
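
One further micro-optimization worth trying (my own sketch, not from the original benchmark) is to avoid the per-line Array that split allocates and slice the line directly after the last comma:

```ruby
# Self-contained sample file; swap in the real filename to benchmark.
File.write('sample.csv', %("user","revenue"\n"a",1.5\n"b",2.5\n))

sum = 0.0
File.foreach('sample.csv') do |line|
  # rindex finds the last comma; everything after it is the revenue.
  # The header row parses to 0.0, so it doesn't skew the total.
  comma = line.rindex(',')
  sum += line[(comma + 1)..-1].to_f if comma
end
puts sum
```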

Solution 3

Meet the ‘red-arrow’ gem and the big brother behind it, Apache Arrow.

It is not only a columnar format for storing data, but an entire computation ecosystem written in C++, which can parse not only the Arrow format but also Parquet, CSV, and ORC. It has bindings in many programming languages, including Ruby. And I would like to say thank you to Sutou Kouhei, who maintains the Ruby bindings.

Let’s try to build a solution with Arrow:

require 'arrow'

table = Arrow::Table.load(filename)
sum_function = Arrow::Function.find('sum')
sum_function.execute([table['revenue'].data]).value.value

Boom! 💣 Less than 1 second!

So what just happened:

  • Arrow::Table.load — reads the whole file into memory (!) in a columnar format.
  • Arrow::Function#execute — applies an Arrow compute function to the data. This is much faster than summing in plain Ruby.

Of course, the speed does not come for free. Here are some thoughts about this gem:

Pros:

  • Loads the whole file into memory, so you can do arbitrary transformations, grouping, and filtering.
  • It automatically casts values to the most suitable data type, so you don’t need to call ‘to_f’.
  • Blazing fast 🚀 It outperforms plain Ruby by more than 10x, especially on big datasets.

Cons:

  • Loads the whole file into memory: if you have a 10 GB file, you need 10 GB of RAM.
  • It is not a full data-frame solution; you have to do a lot of the work manually, such as applying compute functions yourself.
  • Some of the C++ functionality is not available in the gem.
  • Not many projects use this library yet, so you may run into bugs, but I hope it will be adopted by the Ruby community.

In the next post, I will try to cover some examples of how to use Arrow with Ruby.

Below you can find the full script that compares the performance of the three solutions above on 10_000-line (data10k.csv) and 10_000_000-line (data10M.csv) files.
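
The original script isn’t reproduced here, but a minimal version of the comparison could look roughly like this (the file generation and the rescue around red-arrow are my own additions, so the sketch runs even without the gem):

```ruby
require 'csv'
require 'benchmark'

filename = 'data10k.csv'

# Generate a small stand-in file if it doesn't exist yet.
unless File.exist?(filename)
  require 'securerandom'
  File.open(filename, 'w') do |f|
    f.puts '"user","revenue"'
    10_000.times { f.puts %("#{SecureRandom.uuid}",#{rand}) }
  end
end

Benchmark.bm(12) do |x|
  x.report('CSV') do
    sum = 0
    CSV.foreach(filename) { |row| sum += row.last.to_f }
  end
  x.report('File') do
    sum = 0
    File.foreach(filename) { |row| sum += row.split(',').last.to_f }
  end
  x.report('Arrow') do
    require 'arrow'
    table = Arrow::Table.load(filename)
    Arrow::Function.find('sum').execute([table['revenue'].data]).value.value
  rescue LoadError
    # red-arrow not installed; skip this report
  end
end
```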

If you know how to improve the performance, leave your ideas in the comments!

--

Konstantin Ilchenko, Ruby Backend Developer with 10+ years of experience and ClickHouse DB fan