Reading large files in Ruby

I recently needed to slurp some very large files into a Ruby app and noticed some interesting behaviour in the IO.foreach method.

While it is supposed to read a file line by line without loading the whole file into memory, its memory usage is quite significant compared to reading the file in pieces at an offset (IO.read).
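For reference, here is a minimal sketch of the two access styles being compared. The helper names `count_lines` and `read_slice` are made up for illustration; `File.foreach` and `File.read(path, length, offset)` are the standard library calls.

```ruby
# Line-oriented access: File.foreach yields one String per line.
# (count_lines is a hypothetical helper name.)
def count_lines(path)
  File.foreach(path).count
end

# Offset-oriented access: File.read(path, length, offset) returns
# `length` bytes starting `offset` bytes into the file.
# (read_slice is a hypothetical helper name.)
def read_slice(path, offset, length)
  File.read(path, length, offset)
end
```

Note that the argument order of `File.read` is length first, then offset.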

Investigation #

10 MB test file:

λ ls -alh xaa
-rw-r--r--@ 1 temikus  staff   9.5M Feb 18 11:24 xaa

Test script:

#!/usr/bin/env ruby
# Requires the benchmark-memory gem.
require "benchmark/memory"

BUFFER = 4096 # bytes per read call

Benchmark.memory do |x|

  # Line-by-line iteration
  x.report("foreach") {
    File.foreach(ARGV[0]) do |line|
      line
    end
  }

  # Fixed-size chunked reads
  x.report("stream.read") {
    stream = File.new(ARGV[0])

    until stream.eof?
      stream.read(BUFFER)
    end

    stream.close
  }

  x.compare!
end

IO.foreach allocates roughly 20x more memory than the chunked read:

λ ./cat_compare.rb xaa
Calculating -------------------------------------
             foreach   200.008M memsize (     0.000  retained)
                         5.000M objects (     0.000  retained)
                         1.000  strings (     0.000  retained)
         stream.read    10.111M memsize (     0.000  retained)
                         2.443k objects (     0.000  retained)
                         2.000  strings (     0.000  retained)

Comparison:
         stream.read:   10110978 allocated
             foreach:  200008424 allocated - 19.78x more

Then I remembered that Ruby's garbage collector works on the “mark-and-sweep” principle. The “mark” stage checks objects to see if they are still in use. If an object is held in a variable that can still be used in the current scope, the object (and any object inside that object) is marked for keeping. If the variable is long gone, off in another method, the object isn’t marked. The “sweep” stage then frees the objects which haven’t been marked.
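A rough way to watch this happen is to allocate strings that nothing references and compare the collector's freed-object counter before and after a forced run. This is only a sketch: `freed_after_gc` is a made-up helper, and the exact number depends on the Ruby version and whatever else the process has allocated.

```ruby
# Hypothetical helper: allocate `count` throwaway strings, force a
# collection, and return how many objects the GC freed in the meantime.
def freed_after_gc(count)
  freed_before = GC.stat(:total_freed_objects)

  # Nothing holds on to these strings once the block returns,
  # so the mark stage will not mark them.
  count.times { |i| "line #{i}" }

  GC.start # full mark and sweep
  GC.stat(:total_freed_objects) - freed_before
end

puts freed_after_gc(100_000)
```

In a tight loop, dead strings like these may pile up until the next collection runs.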

That could explain it, so I decided to test the theory by adding a small GC ticker that runs GC.start every 100,000 lines:

    tick = 0

    File.foreach(ARGV[0]) do |line|
      line
      tick += 1
      # Force a full collection every 100,000 lines
      GC.start if tick % 100_000 == 0
    end

Sure enough, the memory footprint dropped sharply:

temikus λ ./cat_compare.rb xaa
Calculating -------------------------------------
             foreach     8.424k memsize (     0.000  retained)
                         1.000  objects (     0.000  retained)
                         0.000  strings (     0.000  retained)
         stream.read    10.111M memsize (     0.000  retained)
                         2.443k objects (     0.000  retained)
                         2.000  strings (     0.000  retained)

Comparison:
             foreach:       8424 allocated
         stream.read:   10110978 allocated - 1200.26x more

However, calling the GC explicitly carries a time penalty (columns are user, system, total and real time in seconds):

foreach  0.716403   0.004597   0.721000 (  0.722513)
stream.read  0.004264   0.003208   0.007472 (  0.007474)
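The timing script itself isn't shown here. The output format matches Ruby's standard Benchmark module, so a comparison along these lines could produce similar numbers. This is a sketch under that assumption: `time_readers`, `BUFFER_SIZE` and `GC_EVERY` are names invented for the example.

```ruby
require "benchmark"

BUFFER_SIZE = 4096   # bytes per read call (assumed, same as the memory test)
GC_EVERY = 100_000   # lines between forced collections (assumed)

# Hypothetical helper: time both readers and return a Hash of
# label => Benchmark::Tms.
def time_readers(path)
  foreach_time = Benchmark.measure do
    tick = 0
    File.foreach(path) do |_line|
      tick += 1
      GC.start if tick % GC_EVERY == 0
    end
  end

  read_time = Benchmark.measure do
    File.open(path) do |stream|
      stream.read(BUFFER_SIZE) until stream.eof?
    end
  end

  { "foreach" => foreach_time, "stream.read" => read_time }
end

# Print the timings when run as a script with a file argument.
if $PROGRAM_NAME == __FILE__ && ARGV[0]
  time_readers(ARGV[0]).each { |label, tms| puts "#{label} #{tms}" }
end
```

Absolute timings will vary with the machine, the file and the Ruby version, so treat any single run as indicative only.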

Conclusion #

If you need to process large files (>200MB) line by line in Ruby, it’s better to read them in chunks via IO#read(CHUNK_SIZE) rather than with IO.foreach. The best CHUNK_SIZE is usually one that matches the memory page size on your system (often 4KB), as that lets the OS read the file into memory efficiently.
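If the data still has to be handled as lines, the chunks need to be stitched back together. Below is a minimal sketch of one way to do that. `each_line_in_chunks` and `page_size` are hypothetical helpers rather than anything from the standard library, and the fallback of 4096 bytes is an assumption for platforms where `Etc.sysconf` is unavailable.

```ruby
require "etc"

# Hypothetical helper: page size reported by the OS, or 4096 as a
# fallback on platforms without sysconf support.
def page_size
  if Etc.respond_to?(:sysconf) && Etc.const_defined?(:SC_PAGESIZE)
    Etc.sysconf(Etc::SC_PAGESIZE)
  else
    4096
  end
end

# Hypothetical helper: read `path` in fixed-size chunks and yield each
# complete line. Lines are yielded as binary strings, since a chunk
# boundary can fall in the middle of a multi-byte character.
def each_line_in_chunks(path, chunk_size = page_size)
  File.open(path, "rb") do |file|
    carry = String.new # binary buffer for a partial trailing line

    while (chunk = file.read(chunk_size))
      carry << chunk

      # Hand out every complete line currently in the buffer.
      while (newline_at = carry.index("\n"))
        yield carry.slice!(0..newline_at)
      end
    end

    # Final line without a trailing newline, if any.
    yield carry unless carry.empty?
  end
end
```

A caller would use it much like File.foreach, for example `each_line_in_chunks("xaa") { |line| ... }`, and can call `force_encoding` on each line if it needs text rather than raw bytes. Note that this allocates a new string per line again, so it is worth re-running the memory benchmark before adopting it.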


Addendum:

 