Reading large files in Ruby
I needed to slurp some very large files into a Ruby app recently and noticed some interesting behaviour in the IO.foreach method. While it is supposed to read a file line by line without loading the whole thing into memory, its memory usage is significant compared to reading the file in fixed-size chunks (IO.read).
Investigation
10 MB test file:
λ ls -alh xaa
-rw-r--r--@ 1 temikus staff 9.5M Feb 18 11:24 xaa
Test script:
#!/usr/bin/env ruby
require "benchmark/memory"

BUFFER = 4096 # bytes per read for the chunked variant

Benchmark.memory do |x|
  # Line-by-line iteration via IO.foreach
  x.report("foreach") {
    File.foreach(ARGV[0]) do |line|
      line
    end
  }

  # Sequential fixed-size reads until EOF
  x.report("stream.read") {
    stream = File.new(ARGV[0])
    until stream.eof?
      stream.read(BUFFER)
    end
  }

  x.compare!
end
IO.foreach uses roughly 20x the memory:
λ ./cat_compare.rb xaa
Calculating -------------------------------------
             foreach   200.008M memsize (     0.000  retained)
                         5.000M objects (     0.000  retained)
                         1.000  strings (     0.000  retained)
         stream.read    10.111M memsize (     0.000  retained)
                         2.443k objects (     0.000  retained)
                         2.000  strings (     0.000  retained)

Comparison:
         stream.read:   10110978 allocated
             foreach:  200008424 allocated - 19.78x more
Then I remembered that the Ruby garbage collector works on a “mark-and-sweep” principle. The “mark” stage checks whether objects are still in use: if an object is held by a variable that is still reachable in the current scope, the object (and every object it references) is marked for keeping; if the variable is long gone, off in another method, the object isn’t marked. The “sweep” stage then frees any objects that weren’t marked.
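As a side note, you can watch the GC working directly via GC.stat. A minimal sketch (my own aside, not part of the original benchmark) that counts minor GC passes during a foreach run:

# Count how many minor GC runs a plain foreach pass triggers.
# GC.stat and its :minor_gc_count key are standard Ruby APIs.
before = GC.stat(:minor_gc_count)

File.foreach(ARGV[0]) do |line|
  line # each iteration allocates a fresh String for the line
end

puts "minor GCs triggered: #{GC.stat(:minor_gc_count) - before}"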
That allocate-then-sweep cycle could explain it, so I decided to test the theory by adding a small GC ticker that calls GC.start every 100,000 lines:
tick = 1
File.foreach(ARGV[0]) do |line|
  line
  tick += 1
  GC.start if tick % 100000 == 0
end
Sure enough, the memory footprint decreased dramatically:
temikus λ ./cat_compare.rb xaa
Calculating -------------------------------------
             foreach     8.424k memsize (     0.000  retained)
                         1.000  objects (     0.000  retained)
                         0.000  strings (     0.000  retained)
         stream.read    10.111M memsize (     0.000  retained)
                         2.443k objects (     0.000  retained)
                         2.000  strings (     0.000  retained)

Comparison:
             foreach:       8424 allocated
         stream.read:   10110978 allocated - 1200.26x more
However, calling the GC explicitly does incur a time penalty:
                  user     system      total        real
foreach       0.716403   0.004597   0.721000 (  0.722513)
stream.read   0.004264   0.003208   0.007472 (  0.007474)
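For reference, a timing harness along these lines produces output in that format (a sketch, not necessarily the exact script I used; BUFFER and the 100,000-line ticker match the earlier snippets):

#!/usr/bin/env ruby
require "benchmark"

BUFFER = 4096

Benchmark.bm do |x|
  # foreach with the explicit GC ticker from above
  x.report("foreach") do
    tick = 1
    File.foreach(ARGV[0]) do |line|
      line
      tick += 1
      GC.start if tick % 100000 == 0
    end
  end

  # plain chunked reads, no GC calls needed
  x.report("stream.read") do
    stream = File.new(ARGV[0])
    stream.read(BUFFER) until stream.eof?
  end
end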
Conclusion
If you need to work with large files (>200MB) and read them line by line in Ruby, it’s better to read them in chunks via IO.read(CHUNK_SIZE). The best CHUNK_SIZE is usually one that matches the memory page size on your system (often 4KB), as that lets the OS read the file into memory efficiently.
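If you still need line-by-line semantics on top of chunked reads, a small buffering wrapper does the trick. A sketch of my own (the each_line_chunked name is made up here, and it assumes "\n" line endings):

CHUNK_SIZE = 4096 # one memory page on most systems

def each_line_chunked(path, chunk_size = CHUNK_SIZE)
  File.open(path) do |io|
    buffer = +"" # mutable accumulator for a partial trailing line
    until io.eof?
      buffer << io.read(chunk_size)
      # Yield every complete line; keep any partial line for the next chunk.
      while (newline = buffer.index("\n"))
        yield buffer.slice!(0..newline)
      end
    end
    yield buffer unless buffer.empty? # final line with no trailing newline
  end
end

each_line_chunked(ARGV[0]) { |line| line }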
Addendum:
- All tests performed on Ruby 2.6.1
- Old (written by _why!) but still useful article about Ruby’s GC details