How do I read several lines in a file faster using Python? - python

As of now I use the following python code:
file = open(filePath, "r")
lines=file.readlines()
file.close()
Say my file has several lines (10,000 or more); then my program becomes slow if I do this for more than one file. Is there a way to speed this up in Python? From reading various links I understand that readlines stores all the lines of the file in memory, and that's why the code gets slow.
I have tried the following code as well, and the time gain I got was 17%.
lines=[line for line in open(filePath,"r")]
Is there any other module in Python 2.4 that I might have missed?
Thanks,
Sandhya

for line in file:
This gives you an iterator that reads the file object one line at a time and then discards the previous line from memory.
A file object is its own iterator, for example iter(f) returns f (unless f is closed). When a file is used as an iterator, typically in a for loop (for example, for line in f: print line), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit. In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer. New in version 2.3.
Short answer: don't assign the lines to a variable, just perform whatever operations you need inside the loop.
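For instance, a minimal sketch along those lines (the "ERROR" keyword and the counting are just placeholders for whatever per-line work you actually need):
count = 0
f = open(filePath, "r")
for line in f:              # reads one line at a time; previous lines are discarded
    if "ERROR" in line:     # do the per-line work here instead of storing the line
        count += 1
f.close()
print count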

Related

Reading files in Python with for loop

To read a file in Python, the file must first be opened, and then a read() function is needed. Why is it that when we use a for loop to read the lines of a file, no read() function is necessary?
filename = 'pi_digits.txt'
with open(filename) as file_object:
    for line in file_object:
        print(line)
I'm used to the code below, showing the read requirement.
for line in file_object.read():
This is because the file_object class has an __iter__ method built in that defines how the file interacts with an iterative statement, like a for loop.
In other words, when you say for line in file_object, the file object's __iter__ method is called, returning an iterator that yields one line of the file at a time.
Python file objects define special behavior when you iterate over them, in this case with the for loop. Every time you hit the top of the loop, the file object implicitly reads the next line. That's all there is to it.
Note that the code you are "used to" will actually iterate character by character, not line by line! That's because you will be iterating over a string (the result of the read()), and when Python iterates over strings, it goes character by character.
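A quick sketch that makes the contrast visible (reusing the pi_digits.txt name from the question):
with open('pi_digits.txt') as file_object:
    for line in file_object:          # yields whole lines
        print(line)

with open('pi_digits.txt') as file_object:
    for char in file_object.read():   # read() returns one string; iterating a string yields characters
        print(char)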
The open command in your with statement handles the reading implicitly. It returns a file object that acts like a generator, yielding the file a record at a time (the reading is hidden within the object). For a text file, each line is one record.
Note that the read command in your second example reads the entire file into a string; this consumes more memory than the line-at-a-time example.

Read file line by line or use the read() method?

It's recommended in The Hitchhiker’s Guide to Python that it's better to use:
for line in f:
    print line
than:
a = f.read()
print a
where f is a file object.
Although I can see that this is not the main point the comparison in the article is trying to make (it's about context managers), I was wondering what the differences between those two approaches are.
Is it better to use the former method even if I only need the entire file contents, rather than having any kind of processing to do on each line?
This has to do with memory management.
If the file you are working with is large (MB's or even GB's in size), then using the read method is very inefficient because it reads in all of the file's contents at once and stores them as a string object. From the docs:
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached.
Emphasis mine. As you can guess, this is not a good thing. Even if you manage to avoid a MemoryError, you will still greatly impact the performance of your program by consuming a huge portion of your available memory.
The for-loop approach however eliminates this problem by working with only one line at a time. Iterating over a file object yields its lines one-by-one like an iterator. From the docs:
A file object is its own iterator, for example iter(f) returns f (unless f is closed). When a file is used as an iterator, typically in a for loop (for example, for line in f: print line.strip()), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit.
Thus, you do not have to worry about excessive memory consumption because there will only ever be one line in memory at any given time.
Nevertheless, if your file is small, then using the read method is perfectly fine because the memory impact is negligible. In fact, with small files, it is convenient to have all of the data at once so that you can work with it as one piece (call string methods on it such as str.count or str.find, slice it into separate portions, etc.).
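As a small illustration of that convenience (the file name and the 'TODO' substring are placeholders):
with open('notes.txt') as f:
    text = f.read()            # the whole file as a single string
print text.count('TODO')       # number of occurrences of the substring
print text.find('TODO')        # index of the first occurrence, or -1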
read() will load the file into memory; if it's not a big file, that will not be a problem.
If it is a big file (say, gigabytes in size), you may run out of memory while loading it, so for big files looping over the file object is better: it will not make you run out of memory or make your PC slow.

Is a file object a list by default?

I've encountered two versions of code that both can accomplish the same task with a little difference in the code itself:
with open("file") as f:
for line in f:
print line
and
with open("file") as f:
data = f.readlines()
for line in data:
print line
My question is, is the file object f a list by default just like data? If not, why does the first chunk of code work? Which version is the better practice?
A file object is not a list - it's an object that conforms to the iterator interface (docs). That is, it implements an __iter__ method that returns an iterator object. That iterator object implements both __iter__ and next methods, allowing iteration over the collection.
It happens that the file object is its own iterator (docs), meaning file.__iter__() returns self.
Both for line in file and lines = file.readlines() are equivalent in that they yield the same result if used to get/iterate over all lines in the file. But file.next() buffers the contents from the file (it reads ahead) to speed up reading, effectively moving the file descriptor to a position at or beyond where the last line ended. This means that if you have used for line in file, read some lines, then stopped the iteration (before reaching the end of the file) and now call file.readlines(), the first line returned might not be the full line following the last line iterated over in the for loop.
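A small sketch of that pitfall on a Python 2 file object ('data.txt' and the 'stop' marker are placeholders):
f = open('data.txt')
for line in f:
    if line.startswith('stop'):    # stop iterating part-way through the file
        break
# The for loop has read ahead into a hidden buffer, so the file position is
# now beyond the last line we actually saw; this call may skip or truncate lines:
rest = f.readlines()
f.close()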
When you use for x in my_it, the interpreter calls my_it.__iter__(). Then the next() method is called repeatedly on the object returned by that call, and for each call its return value is assigned to x. When next() raises StopIteration, the loop ends.
Note: A valid iterator implementation should ensure that once StopIteration is raised, it continues to be raised for all subsequent calls to next().
In both cases, you are getting a file line-by-line. The method is different.
With your first version:
with open("file") as f:
for line in f:
print line
While you are iterating over the file line by line, the file contents are not fully resident in memory (unless it is a 1-line file).
The open built-in function returns a file object -- not a list. That object supports iteration, in this case returning individual strings, each being a group of characters in the file terminated by either a newline or the end of the file.
You can write a loop that is similar to what for line in f: print line is doing under the hood:
with open('file') as f:
    while True:
        try:
            line = f.next()
        except StopIteration:
            break
        else:
            print line
With the second version:
with open("file") as f:
data = f.readlines() # equivelent to data=list(f)
for line in data:
print line
You are using a method of a file object (file.readlines()) that reads the entire file contents into memory as a list of the individual lines. The code is then iterating over that list.
You can write a similar version of that as well that highlights the iterators under the hood:
with open('file') as f:
    data = list(f)
    it = iter(data)
    while True:
        try:
            line = it.next()
        except StopIteration:
            break
        else:
            print line
In both of your examples, you are using a for loop to loop over items in a sequence. The items are the same in each case (individual lines of the file) but the underlying sequence is different. In the first version, the sequence is a file object; in the second version it is a list. Use the first version if you just want to deal with each line. Use the second if you want a list of lines.
Read Ned Batchelder's excellent overview on looping and iteration for more.
f is a filehandle, not a list. It is iterable.
A file is an iterable. Lots of objects, including lists, are iterable, which just means that they can be used in a for loop, yielding one object at a time for the loop variable to be bound to.
Both versions of your code accomplish iteration line by line. The second version reads the whole file into memory and constructs a list; the first does not read the whole file up front. The reason you might prefer the second is that you want to close the file before something else modifies it; the first might be preferred if the file is very large.
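For example, here is a sketch of when the list form pays off (reusing the placeholder file name from the question):
with open("file") as f:
    data = f.readlines()    # a plain list of lines; the file can be closed now
print data[0]               # random access works on the list
print len(data)             # so does len()
for line in data:           # and it can be iterated over as many times as needed
    print line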

Cleaner way to read/gunzip a huge file in python

So I have some fairly gigantic .gz files - we're talking 10 to 20 GB each when decompressed.
I need to loop through each line of them, so I'm using the standard:
import gzip
f = gzip.open(path+myFile, 'r')
for line in f.readlines():
#(yadda yadda)
f.close()
However, both the open() and close() commands take AGES, using up 98% of the memory+CPU. So much so that the program exits and prints Killed to the terminal. Maybe it is loading the entire extracted file into memory?
I'm now using something like:
from subprocess import call
f = open(path+'myfile.txt', 'w')
call(['gunzip', '-c', path+myfile], stdout=f)
#do some looping through the file
f.close()
#then delete extracted file
This works. But is there a cleaner way?
I'm 99% sure that your problem is not in the gzip.open(), but in the readlines().
As the documentation explains:
f.readlines() returns a list containing all the lines of data in the file.
Obviously, that requires reading and decompressing the entire file, and building up an absolutely gigantic list.
Most likely, it's actually the malloc calls to allocate all that memory that are taking forever. And then, at the end of this scope (assuming you're using CPython), it has to GC that whole gigantic list, which will also take forever.
You almost never want to use readlines. Unless you're using a very old Python, just do this:
for line in f:
A file is an iterable full of lines, just like the list returned by readlines—except that it's not actually a list, it generates more lines on the fly by reading out of a buffer. So, at any given time, you'll only have one line and a couple of buffers on the order of 10MB each, instead of a 25GB list. And the reading and decompressing will be spread out over the lifetime of the loop, instead of done all at once.
From a quick test, with a 3.5GB gzip file, gzip.open() is effectively instant, for line in f: pass takes a few seconds, gzip.close() is effectively instant. But if I do for line in f.readlines(): pass, it takes… well, I'm not sure how long, because after about a minute my system went into swap thrashing hell and I had to force-kill the interpreter to get it to respond to anything…
Since this has come up a dozen more times since this answer, I wrote this blog post which explains a bit more.
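Putting that together, a minimal sketch of the streaming version (reusing the path and myFile variables from the question):
import gzip

f = gzip.open(path + myFile, 'r')
for line in f:              # one (decompressed) line at a time, no giant list
    pass                    # (yadda yadda) - do the per-line work here
f.close()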
Have a look at pandas, in particular IO tools. They support gzip compression when reading files and you can read files in chunks. Besides, pandas is very fast and memory efficient.
As I have never tried it, I don't know how well compression and reading in chunks work together, but it might be worth a try.
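As a rough sketch of what that could look like (the file name, the chunk size, and the process() function are assumptions, and I have not benchmarked it):
import pandas as pd

# Read a gzipped CSV in chunks; each chunk is a DataFrame of up to 100000 rows.
for chunk in pd.read_csv('huge.csv.gz', compression='gzip', chunksize=100000):
    process(chunk)          # hypothetical per-chunk function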

Is it possible to modify lines in a file in-place?

Is it possible to parse a file line by line, and edit a line in-place while going through the lines?
It can be simulated using a backup file as stdlib's fileinput module does.
Here's an example script that removes lines that do not satisfy some_condition from files given on the command line or stdin:
#!/usr/bin/env python
# grep_some_condition.py
import fileinput
for line in fileinput.input(inplace=True, backup='.bak'):
    if some_condition(line):
        print line,  # this goes to the current file
Example:
$ python grep_some_condition.py first_file.txt second_file.txt
On completion first_file.txt and second_file.txt files will contain only lines that satisfy some_condition() predicate.
The fileinput module has a very ugly API; I found a nicer module for this task - in_place. Example for Python 3:
import in_place
with in_place.InPlace('data.txt') as file:
    for line in file:
        line = line.replace('test', 'testZ')
        file.write(line)
The main differences from fileinput:
Instead of hijacking sys.stdout, a new filehandle is returned for writing.
The filehandle supports all of the standard I/O methods, not just readline().
Important Notes:
This solution drops every line in the file that you do not write back with the file.write() call.
Also, if the process is interrupted, you lose any line in the file that has not already been re-written.
No. You cannot safely write to a file you are also reading, as any changes you make to the file could overwrite content you have not read yet. To do it safely you'd have to read the file into a buffer, updating any lines as required, and then re-write the file.
If you're replacing the content in the file byte-for-byte (i.e. if the text you are replacing is the same length as the new string you are replacing it with), then you can get away with it, but it's a hornets' nest, so I'd save yourself the hassle and just read the full file, replace the content in memory (or via a temporary file), and write it out again.
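If you do want to try the byte-for-byte route, a sketch might look like this ('data.txt', 'foo' and 'bar' are placeholders of equal length; the file is opened in binary mode so the offsets are reliable):
with open('data.txt', 'rb+') as f:
    while True:
        pos = f.tell()                # remember where this line starts
        line = f.readline()
        if not line:
            break
        if b'foo' in line:
            f.seek(pos)               # jump back to the start of the line
            f.write(line.replace(b'foo', b'bar'))   # same length, so nothing shifts
            f.seek(pos + len(line))   # re-sync the position before the next read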
If you only intend to perform localized changes that do not change the length of the part of the file that is modified (e.g. changing all characters to lower case), then you can actually overwrite the old contents of the file dynamically.
To do that, you can use random file access with the seek() method of a file object.
Alternatively, you may be able to use an mmap object to treat the whole file as a mutable string. Keep in mind that mmap objects may impose a maximum file-size limit in the 2-4 GB range on a 32-bit CPU, depending on your operating system and its configuration.
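A minimal mmap sketch, under the same same-length restriction ('data.txt', 'foo' and 'bar' are placeholders):
import mmap

with open('data.txt', 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)     # map the whole file as a mutable buffer
    pos = mm.find(b'foo')
    while pos != -1:
        mm[pos:pos + 3] = b'bar'      # overwrite in place; the length must not change
        pos = mm.find(b'foo', pos + 3)
    mm.flush()
    mm.close()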
You have to back up by the size of the line in characters. Assuming you used readline, you can get the length of the line and back up using:
file.seek(offset[, whence])
Set whence to SEEK_CUR, set offset to -length.
See Python Docs or look at the manpage for seek.
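A short sketch of that recipe (binary mode so the byte arithmetic is reliable; 'data.txt' is a placeholder and the uppercasing is just an example of a same-length edit):
import os

with open('data.txt', 'rb+') as f:
    line = f.readline()
    f.seek(-len(line), os.SEEK_CUR)   # whence=SEEK_CUR, offset=-length
    # the position is now back at the start of that line, so a same-length
    # write overwrites it in place
    f.write(line.upper())             # upper() keeps the byte length for ASCII text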
