Why does Python require double the RAM to read a file?

Why does Python require double the RAM to read a file? - python

I am reading a file which is 24 GB in size. I am using
lines = open(fname).read().splitlines()
and it seems that when reading the lines, it always uses ~double the amount of RAM which should be needed. It uses about 50 GB for my current script (after it jumps up to 50 it goes back down to 28) but every time I use this kind of line to read a file in Python it generally uses double the file size before dropping down to a size that I would expect.
Any idea why this is happening or how I can avoid it?

RAM Usage: Filesize * 1: Read the entire file into memory
open(fname).read()
RAM Usage Filesize * 2: Allocate enough space in a list to split the newlines
open(fname).read().splitlines()
After this operation is complete, the RAM usage drops back down to about Filesize * 1 because the full text of the file isn't needed anymore and it can be garbage-collected.
If you don't need the full text of the file at once, and are only operating on lines, then just iterate over the file
with open(filename) as f:
for line in f:
# do something

My guess is that read returns a string of the entire file, which is not garbage collected until a list is returned from splitlines. If you need the file in the memory, try readlines method:
with open(fname) as f:
lines = f.readlines()

read() tries to load the whole file in memory. With overhead and buffers, this can exceed the size of the file. Then you split the contents of the file into lines, because python allocates new memory for each line.
Can your code be refactored to use readline() and process the lines one by one instead? this would reduce the amount of memory that your program uses at once.
with open(filename) as f:
for line in f:
# process a single line, maybe keeping some state elsewhere.
However, if you still need to load all of the lines in memory at once, use readlines() instead:
with open(filename) as f:
lines = f.readlines()

read() is returning a single str with the whole file data in it. splitlines is returning the list of lines with the same data. The whole file data isn't cleaned up until after splitlines creates the list, so you store two copies of the data for a brief period.
If you want to minimize this overhead (and still strip newlines), you can try:
with open(fname) as f:
lines = [line.rstrip('\r\n') for line in f]
If you can process line by line (don't need whole list at once), it's even better:
with open(fname) as f:
for line in f:
line = line.rstrip('\r\n')
which avoids storing more than two lines at a time.

If the file contains 25Gb of data, then file_handle.read() will return a string that is 25Gb in size. When you split that string, you create a list that holds strings that add up to 25Gb of data (plus additional string overhead for each one). So you end up using about twice the memory.
The big string will get reaped almost immediately by the garbage collector making the memory available for new python objects to occupy, but that doesn't mean that the memory is completely freed to the operating system (due to optimizations in python's memory allocator).
A better approach is to accumulate a list of lines one at a time:
with open(filename) as f:
lines = list(f)
You'll only hold approximately one line in memory from the file at a time1 so then your memory use will be mostly just the memory to store the list.
1This isn't exactly true ... pythons internal line buffering will probably have a couple kb of data at any given time buffered...
Of course, there might also be the option to process the file iteratively:
with open(filename) as f:
for line in f:
process(line)

You read the whole file into memory with:
open(fname).read()
In a second step you create of list from this string with .splitlines().
During this time the string stays in memory but you copy the parts of the
string into the list, line-by-line. Only after your are finished creating the
list, the string can be garbage collected. So during this time you store
all information twice and hence need twice the memory.
You could use open(fname).readlines() or read the file line-by-line to reduce
the memory footprint.

Related

When should I ever use file.read() or file.readlines()?

I noticed that if I iterate over a file that I opened, it is much faster to iterate over it without "read"-ing it.
i.e.
l = open('file','r')
for line in l:
pass (or code)
is much faster than
l = open('file','r')
for line in l.read() / l.readlines():
pass (or code)
The 2nd loop will take around 1.5x as much time (I used timeit over the exact same file, and the results were 0.442 vs. 0.660), and would give the same result.
So - when should I ever use the .read() or .readlines()?
Since I always need to iterate over the file I'm reading, and after learning the hard way how painfully slow the .read() can be on large data - I can't seem to imagine ever using it again.

The short answer to your question is that each of these three methods of reading bits of a file have different use cases. As noted above, f.read() reads the file as an individual string, and so allows relatively easy file-wide manipulations, such as a file-wide regex search or substitution.
f.readline() reads a single line of the file, allowing the user to parse a single line without necessarily reading the entire file. Using f.readline() also allows easier application of logic in reading the file than a complete line by line iteration, such as when a file changes format partway through.
Using the syntax for line in f: allows the user to iterate over the file line by line as noted in the question.
(As noted in the other answer, this documentation is a very good read):
https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
Note:
It was previously claimed that f.readline() could be used to skip a line during a for loop iteration. However, this doesn't work in Python 2.7, and is perhaps a questionable practice, so this claim has been removed.

Hope this helps!
https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory
Sorry for all the edits!
For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:
for line in f:
print line,
This is the first line of the file.
Second line of the file

Note that readline() is not comparable to the case of reading all lines in for-loop since it reads line by line and there is an overhead which is pointed out by others already.
I ran timeit on two identical snippts but one with for-loop and the other with readlines(). You can see my snippet below:
def test_read_file_1():
f = open('ml/README.md', 'r')
for line in f.readlines():
print(line)
def test_read_file_2():
f = open('ml/README.md', 'r')
for line in f:
print(line)
def test_time_read_file():
from timeit import timeit
duration_1 = timeit(lambda: test_read_file_1(), number=1000000)
duration_2 = timeit(lambda: test_read_file_2(), number=1000000)
print('duration using readlines():', duration_1)
print('duration using for-loop:', duration_2)
And the results:
duration using readlines(): 78.826229238
duration using for-loop: 69.487692794
The bottomline, I would say, for-loop is faster but in case of possibility of both, I'd rather readlines().

readlines() is better than for line in file when you know that the data you are interested starts from, for example, 2nd line. You can simply write readlines()[1:].
Such use cases are when you have a tab/comma separated value file and the first line is a header (and you don't want to use additional module for tsv or csv files).

#The difference between file.read(), file.readline(), file.readlines()
file = open('samplefile', 'r')
single_string = file.read() #Reads all the elements of the file
#into a single string(\n characters might be included)
line = file.readline() #Reads the current line where the cursor as a string
#is positioned and moves to the next line
list_strings = file.readlines()#Makes a list of strings

Does it takes RAM to save a readlines array?

I am using the command lineslist = file.readlines() of a 2GB file.
So, I guess it will create a lineslist array of 2GB or more size. So, basically is it the same as readfile = file.read(), which also creates readfile (instance/variable?) of 2GB exactly?
Why should I prefer readlines in this case?
Adding to that I have one more question, it is also mentioned here https://docs.python.org/2/tutorial/inputoutput.html:
readline(): a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous;
I don't understand the last point. So, does readlines() also have unambiguous value in the last element of its array if there is no \n in the end of the file?
We are dealing with combining the files (which were split on the basis of blocksize) So, I am thinking of choosing readlines or read. As the individual files may not be end with a \n after splitting and if readlines returns unambiguous values, it would be a problem, I think.)
PS: I haven't learnt python. So, forgive me if there is no such thing as instances in python or if I am speaking rubbish. I am just assuming.
EDIT:
Ok, I just found. It's not returning any unambiguous output.
len(lineslist)
6923798
lineslist[6923797]
"\xf4\xe5\xcf1)\xff\x16\x93\xf2\xa3-\....\xab\xbb\xcd"
So, it doesn't end with '\n'. But it's not unambiguous output eiter.
Also, no unambiguous output with readline either for the lastline.

If I understood your issue correctly you just want to combine (ie concatenate) files.
If memory is an issue normally for line in f is the way to go.
I tried benchmarking using a 1.9GB csv file. One possible alternative is to read in large chunks of the data which fit in memory.
Codes:
#read in large chunks - fastest in my test
chunksize = 2**16
with open(fn,'r') as f:
chunk = f.read(chunksize)
while chunk:
chunk = f.read(chunksize)
#1 loop, best of 3: 4.48 s per loop
#read whole file in one go - slowest in my test
with open(fn,'r') as f:
chunk = f.read()
#1 loop, best of 3: 11.7 s per loop
#read file using iterator over each line - most practical for most cases
with open(fn,'r') as f:
for line in f:
s = line
#1 loop, best of 3: 6.74 s per loop
Knowing this you could write something like:
with open(outputfile,'w') as fo:
for inputfile in inputfiles: #assuming inputfiles is a list of filepaths
with open(inputfile,'r') as fi:
for chunk in iter(lambda: fi.read(chunksize), ''):
fo.write(fi.read(chunk))
fo.write('\n') #newline between each file(might not be necessary)

file.read() will read the entire stream of data as 1 long string, whereas file.readlines() will create a list of lines from the stream.
Generally performance will suffer, especially in the case of large files, if you read in the entire thing all at once. The general approach is to iterate over the file object line by line, which it supports.
for line in file_object:
# Process the line
As this way of processing will only consume memory for a line (loosely speaking) and not the entire contents of the file.

Yes, readlines() causes reading all file to variable.
Much better it would be to read file line by line:
f = open("file_path", "r")
for line in f:
print f
It will cause loading only one line to RAM, so you're saving about 1.99 GB of memory :)
As I understood You want to concatenate two files.
target = open("target_file", "w")
f1 = open("f1", "r")
f2 = open("f2", "r")
for line in f1:
print >> target, line
for line in f2:
print >> target, line
target.close()
Or consider using other technology like bash:
cat file1 > target
cat file2 >> target

Cleaner way to read/gunzip a huge file in python

So I have some fairly gigantic .gz files - we're talking 10 to 20 gb each when decompressed.
I need to loop through each line of them, so I'm using the standard:
import gzip
f = gzip.open(path+myFile, 'r')
for line in f.readlines():
#(yadda yadda)
f.close()
However, both the open() and close() commands take AGES, using up 98% of the memory+CPU. So much so that the program exits and prints Killed to the terminal. Maybe it is loading the entire extracted file into memory?
I'm now using something like:
from subprocess import call
f = open(path+'myfile.txt', 'w')
call(['gunzip', '-c', path+myfile], stdout=f)
#do some looping through the file
f.close()
#then delete extracted file
This works. But is there a cleaner way?

I'm 99% sure that your problem is not in the gzip.open(), but in the readlines().
As the documentation explains:
f.readlines() returns a list containing all the lines of data in the file.
Obviously, that requires reading reading and decompressing the entire file, and building up an absolutely gigantic list.
Most likely, it's actually the malloc calls to allocate all that memory that are taking forever. And then, at the end of this scope (assuming you're using CPython), it has to GC that whole gigantic list, which will also take forever.
You almost never want to use readlines. Unless you're using a very old Python, just do this:
for line in f:
A file is an iterable full of lines, just like the list returned by readlines—except that it's not actually a list, it generates more lines on the fly by reading out of a buffer. So, at any given time, you'll only have one line and a couple of buffers on the order of 10MB each, instead of a 25GB list. And the reading and decompressing will be spread out over the lifetime of the loop, instead of done all at once.
From a quick test, with a 3.5GB gzip file, gzip.open() is effectively instant, for line in f: pass takes a few seconds, gzip.close() is effectively instant. But if I do for line in f.readlines(): pass, it takes… well, I'm not sure how long, because after about a minute my system went into swap thrashing hell and I had to force-kill the interpreter to get it to respond to anything…
Since this has come up a dozen more times since this answer, I wrote this blog post which explains a bit more.

Have a look at pandas, in particular IO tools. They support gzip compression when reading files and you can read files in chunks. Besides, pandas is very fast and memory efficient.
As I never tried, I don't know how well the compression and reading in chunks live together, but it might be worth giving a try

Which is faster while retrieving data from file; [with open() as, and then looping over] or [f.readlines, and then looping over]

I have a big files to read and process.
Which is the faster method to read the file through and process it.
with open('file') as file:
for line in file:
print line
OR
file = open('file')
lines = f.readlines()
file.close()
for line in lines:
print line

The former can use buffered reading; the latter requires reading the entire file into memory first before it can start looping.
In general, it's a better idea to use the former; it's not going to be any slower than the latter and it's better on memory usage.

If you have some line-base large file, I strongly suggest using the following lines to achieve your goal:
file = open('file')
for line in f.readlines():
print line
file.close()
There are 2 points,
Read all content to memory is never a good idea, the right way is read them by trunk(line)
Don't call lines=f.readlines(), this will also cause reading all content to memory
PS: The former "with" statement only is short for try:open();execept:pass; readlines is implemented using iterator, so it won't eat all your memory.

"for line in file object" method to read files

I'm trying to find out the best way to read/process lines for super large file.
Here I just try
for line in f:
Part of my script is as below:
o=gzip.open(file2,'w')
LIST=[]
f=gzip.open(file1,'r'):
for i,line in enumerate(f):
if i%4!=3:
LIST.append(line)
else:
LIST.append(line)
b1=[ord(x) for x in line]
ave1=(sum(b1)-10)/float(len(line)-1)
if (ave1 < 84):
del LIST[-4:]
output1=o.writelines(LIST)
My file1 is around 10GB; and when I run the script, the memory usage just keeps increasing to be like 15GB without any output. That means the computer is still trying to read the whole file into memory first, right? This really makes no different than using readlines()
However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
But obviously I still need to worry large files..I'm really confused.
thx
edit:
Every 4 lines is kind of group in my data.
THe purpose is to do some calculations on every 4th line; and based on that calculation, decide if we need to append those 4 lines.So writing lines is my purpose.

The reason the memory keeps inc. even after you use enumerator is because you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously its all sitting in-memory. You need to find a way to not accumulate lines like this. Read, process & move on to next.
One more way you could do is read your file in chunks (in fact reading 1 line at a time can qualify in this criteria, 1chunk == 1line), i.e. read a small part of the file process it then read next chunk etc. I still maintain that this is best way to read files in python large or small.
with open(...) as f:
for line in f:
<do something with line>
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.

It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:
Read the lines you need into memory (the first 3 lines).
On the 4th line, append the line & perform your calculation.
If your calculation is what you're looking for, flush the values in your collection to the file.
Regardless of what follows, create a new collection instance.
I haven't tried this out, but it could maybe look something like this:
o=gzip.open(file2,'w')
f=gzip.open(file1,'r'):
LIST=[]
for i,line in enumerate(f):
if i % 4 != 3:
LIST.append(line)
else:
LIST.append(line)
b1 = [ord(x) for x in line]
ave1 = (sum(b1) - 10) / float(len(line) - 1
# If we've found what we want, save them to the file
if (ave1 >= 84):
o.writelines(LIST)
# Release the values in the list by starting a clean list to work with
LIST = []
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.

Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST we become longer and longer. All those lines that you store in LIST will take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.
Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.

Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.
From what I see it seems you can easily write the output incrementally. ie. You are currently using a list to store valid lines to write to output as well as temporary lines you may delete at some point. To be efficient with memory you want to write the lines from your temporary list as soon as you know these are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)

If you do not use the with statement , you must close the file's handlers:
o.close()
f.close()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.