Cleaner way to read/gunzip a huge file in Python

So I have some fairly gigantic .gz files - we're talking 10 to 20 GB each when decompressed.
I need to loop through each line of them, so I'm using the standard:
import gzip
f = gzip.open(path + myFile, 'r')
for line in f.readlines():
    # (yadda yadda)
f.close()
However, both the open() and close() commands take AGES, using up 98% of the memory+CPU. So much so that the program exits and prints Killed to the terminal. Maybe it is loading the entire extracted file into memory?
I'm now using something like:
from subprocess import call
f = open(path + 'myfile.txt', 'w')
call(['gunzip', '-c', path + myfile], stdout=f)
f.close()
# do some looping through the extracted file
# then delete the extracted file
This works. But is there a cleaner way?

I'm 99% sure that your problem is not in the gzip.open(), but in the readlines().
As the documentation explains:
f.readlines() returns a list containing all the lines of data in the file.
Obviously, that requires reading and decompressing the entire file, and building up an absolutely gigantic list.
Most likely, it's actually the malloc calls to allocate all that memory that are taking forever. And then, at the end of this scope (assuming you're using CPython), it has to GC that whole gigantic list, which will also take forever.
You almost never want to use readlines. Unless you're using a very old Python, just do this:
for line in f:
A file is an iterable full of lines, just like the list returned by readlines, except that it's not actually a list: it generates more lines on the fly by reading out of a buffer. So, at any given time, you'll only have one line and a couple of buffers on the order of 10MB each, instead of a 25GB list. And the reading and decompressing will be spread out over the lifetime of the loop, instead of done all at once.
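Putting it together, a minimal sketch of the recommended pattern (gzip.open works as a context manager on Python 2.7+/3.x; process_line is a hypothetical placeholder for the per-line work):

import gzip

# Only one decompressed line (plus a small read buffer) is in memory at a time.
with gzip.open(path + myFile, 'r') as f:
    for line in f:
        process_line(line)  # hypothetical stand-in for the "yadda yadda"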
From a quick test with a 3.5GB gzip file, gzip.open() is effectively instant, for line in f: pass takes a few seconds, and f.close() is effectively instant. But if I do for line in f.readlines(): pass, it takes… well, I'm not sure how long, because after about a minute my system went into swap-thrashing hell and I had to force-kill the interpreter to get it to respond to anything…
Since this has come up a dozen more times since this answer, I wrote this blog post which explains a bit more.

Have a look at pandas, in particular IO tools. They support gzip compression when reading files and you can read files in chunks. Besides, pandas is very fast and memory efficient.
As I have never tried it, I don't know how well the compression and reading in chunks work together, but it might be worth a try.
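For the record, a hedged sketch of what that might look like (the filename and chunk size are made up; read_csv's compression and chunksize parameters are real):

import pandas as pd

# Read the gzipped CSV 100,000 rows at a time instead of all at once.
for chunk in pd.read_csv('huge_file.csv.gz', compression='gzip', chunksize=100000):
    # chunk is an ordinary DataFrame covering just this slice of rows
    do_something(chunk)  # hypothetical per-chunk processing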

Related

Python: How do I estimate the total memory my algorithm will use?

My algorithm first reads a huge sample of texts. Next, I need to split them into lines:
texts = file_content.split("\n")
However, the file is so big that the process immediately goes into SWAP.
I'd like to predict how much memory I actually need.
Is that possible?
It may be helpful to figure out the size of the file in bytes before you proceed. This will probably give you a rough estimate of the amount of memory you will then need.
To get the size of a file, you could use the getsize(path) function from os.path.
import os
size_in_bytes = os.path.getsize('file.txt')
However, you'll probably need roughly twice as much memory as the file size: you'll store the file contents in memory, as well as the strings that are built from them.
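As a rough back-of-the-envelope sketch of that estimate (the factor of two is only a heuristic; per-object string overhead will push the real figure higher):

import os

size_in_bytes = os.path.getsize('file.txt')
# Raw contents plus the line strings built from them: roughly double the file size.
estimated_peak = 2 * size_in_bytes
print("Expect on the order of %d MB" % (estimated_peak // (1024 * 1024)))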
As Kasra points out, you're better off reading the file line by line through something like an iterator (just looping over the object returned by open()), and processing line by line, which reduces the need for extra memory.
For example:
with open('file.txt') as f:
    for line in f:
        process(line)
One thing is just to optimize your code, but you asked about the memory usage. A good article on the subject is http://fa.bianp.net/blog/2013/different-ways-to-get-memory-consumption-or-lessons-learned-from-memory_profiler/ . With the psutil library you can get the current memory usage with these lines of code:
import os
import psutil
process = psutil.Process(os.getpid())
print process.memory_info().rss
It's not possible to predict the amount of memory your algorithm is going to use. But instead of reading the whole text at once and loading it into memory, the more Pythonic way is to use open(), which returns a file object that acts as an iterator and doesn't waste your memory. You can then access the lines simply by looping over the file object.

Read file line by line or use the read() method?

It's recommended in The Hitchhiker's Guide to Python that it's better to use:
for line in f:
    print line
than:
a = f.read()
print a
where f is a file object.
Although I can see that this is not the main point the comparison in the article is trying to make (it's about context managers), I was wondering what the differences between these two approaches are.
Is it better to use the former method even when I only need the entire file contents, rather than having any kind of processing to do on each line?
This has to do with memory management.
If the file you are working with is large (MB's or even GB's in size), then using the read method is very inefficient because it reads in all of the file's contents at once and stores them as a string object. From the docs:
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached.
Emphasis mine. As you can guess, this is not a good thing. Even if you manage to avoid a MemoryError, you will still greatly impact the performance of your program by consuming a huge portion of your available memory.
The for-loop approach however eliminates this problem by working with only one line at a time. Iterating over a file object yields its lines one-by-one like an iterator. From the docs:
A file object is its own iterator, for example iter(f) returns f (unless f is closed). When a file is used as an iterator, typically in a for loop (for example, for line in f: print line.strip()), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit.
Thus, you do not have to worry about excessive memory consumption because there will only ever be one line in memory at any given time.
Nevertheless, if your file is small, then using the read method is perfectly fine because the memory impact is negligible. In fact, with small files, it is convenient to have all of the data at once so that you can work with it as one piece (call string methods on it such as str.count or str.find, slice it into separate portions, etc.).
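For example, with a small file something like this is perfectly reasonable (the filename and search string are made up):

with open('notes.txt') as f:
    text = f.read()  # the whole file as one string

# Work with the contents as a single piece.
print(text.count('TODO'))  # how many times the marker appears
first = text.find('TODO')  # offset of the first occurrence, or -1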
read() will load the file into memory; if it's not a big file, that's not a problem.
If it is a big file (say, several GB), you may run out of memory while loading it. So for a big file, looping over the file object is better: it won't make you run out of memory or slow your machine to a crawl.

Parsing large files NOT in binary (Python or C++)

As a disclaimer, I'm hardly a computer scientist, but I've been reading everything I can on the subject of efficient file i/o to try and tackle this facet of a project I'm working on.
I have a very large (10 - 100 GB) log file of comma-separated values that I need to parse through. The first value labels it as "A" or "B"; for every "A" line, I need to examine the line before it and the line after it, and if either line before or after it meets a criterion, I want to store it in memory or write it to a file. The lines are not uniform in size.
That is my specific problem: I can't seem to locate an efficient way to do this in a non-binary file. With a binary file, I'd simply iterate over the file once and rewind to and fro with a logical check. I've investigated memory mapping, but it seems structured for binary files; my current code is Pythonic and takes weeks to run [see disclaimer].
My other question would be-- how easily could parallelism be invoked to help here? I have a notion of how -- map the file out three lines at a time and send each chunk to each node [lines 1,2,3 go to one node; lines 3,4,5 go to another ...], but I have no idea how to go about implementing this.
Any help would be very much appreciated.
Just read the lines in a loop. Keep track of the previous line in memory and examine it when needed.
Pseudocode:
for each line:
    previousLine := currentLine
    read currentLine from file
    do processing...
This is efficient assuming you're already reading every line into memory anyway, and if you use a proper buffering scheme for reading the file (read large chunks at a time into memory).
I don't think parallelism will help in this situation. If properly written, the bottleneck of the program should be disk I/O, and multiple threads/processes can't read from disk any faster than a single thread. Parallelism only improves CPU-bound problems.
For what it's worth you can "seek" in ASCII files the same way you can with binary files. You would just keep track of the file offset each time you begin to read a line, and store that offset so you know where to seek back to later. Depending on how this is implemented this will never perform better than the above, though, and sometimes worse (you would want the file data to be buffered in memory so that the "seeking" is a memory operation and not a disk operation; you definitely want to read the file contents sequentially to maximize cache-ahead benefits).
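A hedged sketch of that bookkeeping (it deliberately uses readline() rather than for line in f, because mixing tell() with the file's read-ahead iteration is unreliable):

offsets = []  # byte offset at which each line starts
with open('big.log') as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:  # empty string means EOF
            break
        offsets.append(pos)
        # ... normal forward processing of line ...

    # Later, while the file is still open, jump straight back to, say, the 10th line:
    f.seek(offsets[9])
    tenth_line = f.readline()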
Here's a first pass. Assumes properly formatted lines of text.
from itertools import chain

with open('your-file') as f:
    prev_line = None
    cur_line = f.readline()
    for next_line in chain(f, [None]):
        pieces = cur_line.split(',')
        if pieces[0] == 'A':
            check_against_criterion_if_not_none(prev_line)
            check_against_criterion_if_not_none(next_line)
        prev_line, cur_line = cur_line, next_line
A nifty trick is tacking that extra None onto the end of the file with itertools.chain, so that the code properly checks the last line of the file against the second-to-last line.
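To see the trick in isolation: chain simply yields the sentinel after the real items are exhausted, so the loop body runs one extra time with next_line set to None:

from itertools import chain

for next_line in chain(['a,1\n', 'b,2\n'], [None]):
    print(repr(next_line))
# prints 'a,1\n', then 'b,2\n', then None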

"for line in file object" method to read files

I'm trying to find out the best way to read/process lines of a super large file.
Here I just try
for line in f:
Part of my script is as below:
o = gzip.open(file2, 'w')
LIST = []
f = gzip.open(file1, 'r')
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        if ave1 < 84:
            del LIST[-4:]
output1 = o.writelines(LIST)
My file1 is around 10GB, and when I run the script, the memory usage just keeps increasing to something like 15GB without any output. That means the computer is still trying to read the whole file into memory first, right? That really makes it no different from using readlines().
However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
But obviously I still need to worry about large files... I'm really confused.
thx
edit:
Every 4 lines is a kind of group in my data.
The purpose is to do some calculations on every 4th line and, based on that calculation, decide whether we need to append those 4 lines. So writing lines is my purpose.
The reason the memory keeps increasing even after you use enumerate is that you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously it's all sitting in memory. You need to find a way to not accumulate lines like this. Read, process, and move on to the next.
Another way you could do it is to read your file in chunks (in fact, reading one line at a time qualifies here: 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, and so on. I still maintain that this is the best way to read files in Python, large or small.
with open(...) as f:
    for line in f:
        <do something with line>
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
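Applied to the 4-line groups described in the question, a hedged sketch (it assumes Python 3's text mode for gzip; the threshold and the quality formula are copied from the question's code):

import gzip
from itertools import islice

with gzip.open(file1, 'rt') as f, gzip.open(file2, 'wt') as out:
    while True:
        group = list(islice(f, 4))  # the next 4 lines; never more in memory
        if len(group) < 4:          # EOF or an incomplete trailing group
            break
        quality = group[3]
        ave1 = (sum(ord(c) for c in quality) - 10) / float(len(quality) - 1)
        if ave1 >= 84:              # keep the group only if it passes
            out.writelines(group)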
It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:
Read the lines you need into memory (the first 3 lines).
On the 4th line, append the line & perform your calculation.
If your calculation is what you're looking for, flush the values in your collection to the file.
Regardless of what follows, create a new collection instance.
I haven't tried this out, but it could maybe look something like this:
o = gzip.open(file2, 'w')
f = gzip.open(file1, 'r')
LIST = []
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        # If we've found what we want, save them to the file
        if ave1 >= 84:
            o.writelines(LIST)
        # Release the values in the list by starting a clean list to work with
        LIST = []
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.
Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST will become longer and longer. All those lines that you store in LIST will take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.
Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.
Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.
From what I see, it seems you can easily write the output incrementally, i.e. you are currently using a list to store valid lines to write to output as well as temporary lines you may delete at some point. To be efficient with memory, you want to write the lines from your temporary list as soon as you know they are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)
If you do not use the with statement, you must close the file handles:
o.close()
f.close()

memory use in large data-structures manipulation/processing

I have a number of large (~100 Mb) files which I'm regularly processing. While I'm trying to delete unneeded data structures during processing, memory consumption is a bit too high. I was wondering if there is a way to efficiently manipulate large data, e.g.:
def read(self, filename):
    fc = read_100_mb_file(filename)
    self.process(fc)

def process(self, content):
    # do some processing of file content
Is there a duplication of data structures? Isn't it more memory efficient to use a class-wide attribute like self.fc?
When should I use garbage collection? I know about the gc module, but do I call it after I del fc for example?
update
p.s. 100 MB is not a problem in itself, but float conversion and further processing add significantly more to both the working set and the virtual size (I'm on Windows).
I'd suggest looking at the presentation by David Beazley on using generators in Python. This technique allows you to handle a lot of data, and do complex processing, quickly and without blowing up your memory use. IMO, the trick isn't holding a huge amount of data in memory as efficiently as possible; the trick is avoiding loading a huge amount of data into memory at the same time.
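In that spirit, a minimal generator-pipeline sketch (the filename and the parsing step are placeholders, not Beazley's code):

def read_lines(path):
    # Yield lines lazily; nothing beyond the current line is kept around.
    with open(path) as f:
        for line in f:
            yield line

def parse_records(lines):
    # Turn each raw line into a list of fields, still lazily.
    for line in lines:
        yield line.rstrip('\n').split(',')

records = parse_records(read_lines('big_input.csv'))
interesting = sum(1 for fields in records if fields and fields[0] == 'A')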
Before you start tearing your hair out over the garbage collector, you might be able to avoid that 100mb hit of loading the entire file into memory by using a memory-mapped file object. See the mmap module.
Don't read the entire 100 meg file in at a time. Use streams to process a little bit at a time. Check out this blog post that talks about handling large csv and xml files. http://lethain.com/entry/2009/jan/22/handling-very-large-csv-and-xml-files-in-python/
Here is a sample of the code from the article.
from __future__ import with_statement  # for python 2.5

with open('data.in', 'r') as fin:
    with open('data.out', 'w') as fout:
        for line in fin:
            fout.write(','.join(line.split(' ')))
So, from your comments I assume that your file looks something like this:
item1,item2,item3,item4,item5,item6,item7,...,itemn
which you reduce to a single value by repeated application of some combination function. As a solution, only read a single value at a time:
def read_values(f):
    buf = []
    while True:
        c = f.read(1)
        if c == ",":
            yield parse("".join(buf))
            buf = []
        elif c == "":
            yield parse("".join(buf))
            return
        else:
            buf.append(c)

with open("some_file", "r") as f:
    agg = initial
    for v in read_values(f):
        agg = combine(agg, v)
This way, memory consumption stays constant, unless agg grows in time.
You will still need to provide appropriate implementations of initial, parse and combine. Also, don't read the file byte by byte; read into a fixed-size buffer, parse from the buffer, and read more as you need it (a buffered sketch follows the reduce example below).
This is basically what the builtin reduce function does, but I've used an explicit for loop here for clarity. Here's the same thing using reduce:
with open("some_file", "r") as f:
agg = reduce(combine, read_values(f), initial)
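If the byte-at-a-time read() turns out to be the bottleneck, here is a hedged sketch of the buffered variant mentioned above (the 64 KB buffer size is arbitrary; parse is the same placeholder as before):

def read_values_buffered(f, bufsize=64 * 1024):
    pending = ""
    while True:
        chunk = f.read(bufsize)
        if not chunk:            # EOF: whatever is left is the final value
            if pending:
                yield parse(pending)
            return
        pending += chunk
        parts = pending.split(",")
        pending = parts.pop()    # the last piece may be cut off mid-value
        for part in parts:
            yield parse(part)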
I hope I interpreted your problem correctly.
First of all, don't touch the garbage collector. That's not the problem, nor the solution.
It sounds like the real problem you're having is not with the file reading at all, but with the data structures that you're allocating as you process the files.
Consider using del to remove structures that you no longer need during processing. Also, you might consider using marshal to dump some of the processed data to disk while you work through the next 100 MB of input files.
For file reading, you have basically two options: unix-style files as streams, or memory mapped files. For streams-based files, the default python file object is already buffered, so the simplest code is also probably the most efficient:
with open("filename", "r") as f:
for line in f:
# do something with a line of the files
Alternately, you can use f.read([size]) to read blocks of the file. However, usually you do this to gain CPU performance, by multithreading the processing part of your script, so that you can read and process at the same time. But it doesn't help with memory usage; in fact, it uses more memory.
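For reference, a hedged sketch of that block-reading approach (the block size is arbitrary, and handle_block is a hypothetical stand-in for the hand-off to a worker thread or queue):

with open("filename", "r") as f:
    while True:
        block = f.read(1024 * 1024)  # one megabyte at a time
        if not block:
            break
        handle_block(block)  # hypothetical: queue the block for CPU-heavy processing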
The other option is mmap, which looks like this:
with open("filename", "r+") as f:
map = mmap.mmap(f.fileno(), 0)
line = map.readline()
while line != '':
# process a line
line = map.readline()
This sometimes outperforms streams, but it also won't improve memory usage.
In your example code, data is being stored in the fc variable. If you don't keep a reference to fc around, your entire file contents will be removed from memory when the read method ends.
If it is not, then you are keeping a reference somewhere. Maybe the reference is being created in read_100_mb_file, maybe in process. If there is no reference, the CPython implementation will deallocate it almost immediately.
There are some tools to help you find where this reference is, guppy, dowser, pysizer...
