Pythonic way to read file line by line? - python

Which of the two methods below is the more Pythonic way to read a file line by line?
with open('file', 'r') as f:
    for line in f:
        print line
or
with open('file', 'r') as f:
    for line in f.readlines():
        print line
Or is there something I'm missing?

File objects are their own iterators (they implement the iterator protocol), so
with open('file', 'r') as f:
    for line in f:
        pass  # your code here
is the preferred usage. f.readlines() returns a list of lines, which means reading the entire file into memory; that is generally ill-advised, especially for large files.
It should be pointed out that I agree with the sentiment that context managers are worthwhile, and have included one in my code example.
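To make the point concrete, here is a minimal Python 3 sketch. The sample file and the helper name count_lines are made up for illustration. It checks that a file object is its own iterator and counts lines lazily, so only one line is held in memory at a time.

```python
import os
import tempfile


def count_lines(path):
    """Count lines lazily; only one line is held in memory at a time."""
    count = 0
    with open(path, "r") as f:
        for _ in f:
            count += 1
    return count


# Hypothetical sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("foo\nbar\nbla\n")
    sample_path = tmp.name

with open(sample_path, "r") as f:
    is_own_iterator = iter(f) is f   # file objects implement the iterator protocol
    first_line = next(f)             # advance by hand, which is what a for loop does

line_count = count_lines(sample_path)
os.remove(sample_path)
```

With f.readlines() the same count would first build a list holding every line of the file.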

Of the two you presented, the first is recommended practice. As pointed out in the comments, any solution (like that below) which doesn't use a context manager means that the file is left open, which is a bad idea.
Original answer (it leaves a dangling file handle, so it shouldn't be followed):
However, if you don't need f for any purpose other than reading the lines, you can just do:
for line in open('file', 'r'):
    print line

There's no need for the .readlines() method call.
PLUS: About the with statement
The execution behavior of the with statement is described in the comments below:
with open("xxx.txt", 'r') as f:
    # now, f is an opened file in context
    for line in f:
        # code with line
        pass
# when control exits *with*, f is closed
print f  # in Python 2 this prints something like <closed file 'xxx.txt', mode 'r' at 0x...>
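The same behavior can be checked in Python 3 through the closed attribute. This is only a sketch; the sample file is created just for the demonstration.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("AA\nCC\n")

with open(path, "r") as f:
    open_inside = not f.closed      # True: the file is open inside the block
    lines = [line.rstrip("\n") for line in f]

closed_after = f.closed             # True: leaving the block closed the file

try:
    f.read()
    read_failed = False
except ValueError:                  # "I/O operation on closed file"
    read_failed = True

os.remove(path)
```

The file is closed when control leaves the block, whether it leaves normally or through an exception.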

Related

Parsing large, possibly compressed, files in Python

I am trying to parse a large file, line by line, for relevant information.
I may be receiving either an uncompressed or gzipped file (I may have to edit for zip file at a later stage).
I am using the following code, but I feel that, because I am not inside the with statement when I process the lines, I am not parsing the file line by line and am in fact loading the entire file into file_content in memory.
if ".gz" in FILE_LIST['INPUT_FILE']:
    with gzip.open(FILE_LIST['INPUT_FILE']) as input_file:
        file_content = input_file.readlines()
else:
    with open(FILE_LIST['INPUT_FILE']) as input_file:
        file_content = input_file.readlines()
for line in file_content:
    pass  # do stuff
Any suggestions for how I should handle this?
I would prefer not to unzip the file outside the code block, as this needs to be generic, and I would have to tidy up multiple files.
readlines() reads the whole file into memory, so it's a no-go for big files.
Keeping the two context blocks and then iterating over the input_file handle outside them wouldn't work either (operation on closed file).
To get the best of both worlds, I would use a conditional expression to choose the opener (open or gzip.open), then iterate over the lines inside a single with block.
open_function = gzip.open if ".gz" in FILE_LIST['INPUT_FILE'] else open
with open_function(FILE_LIST['INPUT_FILE'], "rt") as input_file:
    for line in input_file:
        pass  # do stuff
Note that I pass the mode explicitly to make sure both openers yield text and not binary data: gzip.open defaults to binary ("rb"), and in Python 3 a plain "r" is still binary for gzip.open, so "rt" is needed there.
Alternative: open_function can be made generic so it doesn't depend on FILE_LIST['INPUT_FILE']:
open_function = lambda f: gzip.open(f, "rt") if ".gz" in f else open(f)
Once defined, you can reuse it at will:
with open_function(FILE_LIST['INPUT_FILE']) as input_file:
    for line in input_file:
        pass  # do stuff
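For reference, a self-contained Python 3 sketch of the same idea. The helper names open_text and count_matching and the sample files are assumptions made for illustration; testing the extension with endswith(".gz") is a slightly stricter variant of the ".gz" in name check used above.

```python
import gzip
import os
import tempfile


def open_text(path):
    """Open a plain or gzip-compressed file for reading text (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")   # "rt" = text mode; gzip.open defaults to "rb"
    return open(path, "r")


def count_matching(path, needle):
    """Stream the file line by line and count the lines containing needle."""
    with open_text(path) as handle:
        return sum(1 for line in handle if needle in line)


# Demonstration with two throwaway files holding the same content.
workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "sample.log")
gz_path = os.path.join(workdir, "sample.log.gz")
content = "ok 1\nerror 2\nok 3\nerror 4\n"
with open(plain_path, "w") as f:
    f.write(content)
with gzip.open(gz_path, "wt") as f:
    f.write(content)

plain_hits = count_matching(plain_path, "error")
gz_hits = count_matching(gz_path, "error")
```

Both calls read one line at a time, so memory use does not grow with the size of the file.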

for each line in file write line to an individual file in python

I have a text file which needs to be separated line by line into individual text files. So if the main file contains the strings:
foo
bar
bla
I would have 3 files which could be named numerically: 1.txt (containing the string "foo"), 2.txt (containing the string "bar") and 3.txt (containing the string "bla").
The straightforward way to do this would be to open three files for writing and write one line into each file. But the problem is when we have a lot of lines, or do not know exactly how many there are. It seems painfully unnecessary to have to create
f1=open('main_file', 'r')
f2=open('1.txt', 'w')
f3=open('2.txt', 'w')
f4=open('3.txt', 'w')
Is there a way to put a counter in this operation, or a library which can handle this type of task?
Read the lines from the file in a loop, maintaining the line number; open a file with the name derived from the line number, and write the line into the file:
f1 = open('main_file', 'r')
for i, text in enumerate(f1):
    open(str(i + 1) + '.txt', 'w').write(text)
You would want something like this. Using with is the preferred way of dealing with files, since it automatically closes them for you when the with block ends.
with open('main_file', 'r') as in_file:
    for line_number, line in enumerate(in_file):
        with open("{}.txt".format(line_number + 1), 'w') as out_file:
            out_file.write(line)
First, you could read the file into a list, where each element stands for a row in the file.
with open('/path/to/data', 'r') as f:
    data = [line.strip() for line in f]
Then you could use a for loop to write each element into a separate file.
for counter in range(len(data)):
    with open('/path/to/file/' + str(counter), 'w') as f:
        f.write(data[counter])
Notes:
Since you're continuously opening numerous files, I highly suggest using
with open(...) as f:
    # your operation
The advantage of using this is that you can make sure Python releases the resources on time.
Details:
What's the advantage of using 'with .. as' statement in Python?
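Putting the answers above together, a Python 3 sketch might look like the following. The function name split_lines and the temporary directory are assumptions for illustration. Passing start=1 to enumerate gives 1-based file names without the + 1 arithmetic.

```python
import os
import tempfile


def split_lines(source_path, out_dir):
    """Write line N of source_path to out_dir/N.txt; return the number of files written."""
    written = 0
    with open(source_path, "r") as source:
        for number, line in enumerate(source, start=1):
            target = os.path.join(out_dir, "{}.txt".format(number))
            with open(target, "w") as out_file:
                out_file.write(line.rstrip("\n"))   # store the bare string, as in the question
            written = number
    return written


# Demonstration in a throwaway directory.
workdir = tempfile.mkdtemp()
main_file = os.path.join(workdir, "main_file")
with open(main_file, "w") as f:
    f.write("foo\nbar\nbla\n")

file_count = split_lines(main_file, workdir)
with open(os.path.join(workdir, "2.txt")) as f:
    second = f.read()
```

Only one output file is open at any moment, so the number of lines in the main file does not matter.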

truncating a text file does not change the file

When a novice (like me) asks for reading/processing a text file in python he often gets answers like:
with open("input.txt", 'r') as f:
    for line in f:
        #do your stuff
Now I would like to truncate everything in the file I'm reading after a special line. After modifying the example above I use:
with open("input.txt", 'r+') as file:
    for line in file:
        print line.rstrip("\n\r") #for debug
        if line.rstrip("\n\r")=="CC":
            print "truncating!" #for debug
            file.truncate()
            break
and expect it to throw away everything after the first "CC" seen. Running this code on input.txt:
AA
CC
DD
the following is printed on the console (as expected):
AA
CC
truncating!
but the file "input.txt" stays unchanged!?!?
How can that be? What am I doing wrong?
Edit: After the operation I want the file to contain:
AA
CC
It looks like you're falling victim to a read-ahead buffer used internally by Python. From the documentation for the file.next() method:
A file object is its own iterator, for example iter(f) returns f (unless f is closed). When a file is used as an iterator, typically in a for loop (for example, for line in f: print line.strip()), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit when the file is open for reading (behavior is undefined when the file is open for writing). In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.
The upshot is that the file's position is not where you would expect it to be when you truncate. One way around this is to use readline to loop over the file, rather than the iterator:
line = file.readline()
while line:
    ...
    line = file.readline()
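The effect is easy to observe. The sketch below assumes CPython 3 and a text-mode file (the documentation quoted above describes Python 2), and the sample file is made up. In Python 3 the text layer refuses tell() outright while a for loop or next() is consuming the file, whereas after readline() it reports the end of the line just read.

```python
import os
import tempfile

# Hypothetical sample file with the same content as input.txt in the question.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", newline="\n") as out:
    out.write("AA\nCC\nDD\n")

# Iterating with next()/for: the position is not available mid-loop.
with open(path, "r", newline="\n") as f:
    next(f)
    try:
        f.tell()
        tell_blocked = False
    except OSError:                 # "telling position disabled by next() call"
        tell_blocked = True

# Reading with readline(): tell() reports the end of the line just read.
with open(path, "r", newline="\n") as f:
    f.readline()
    position_after_first_line = f.tell()

os.remove(path)
```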
In addition to glibdud's answer: truncate() takes the size at which to cut the file, and defaults to the current position. You can get the current position in your file with tell(). As he mentioned, a for loop goes through next(), which interferes with calls like tell(). In the suggested while loop, however, you can truncate at the current tell() position. So the complete code would look like this:
Python 3:
with open("test.txt", 'r+') as file:
    line = file.readline()
    while line:
        print(line.strip())
        if line.strip() == "CC":
            print("truncating")
            file.truncate(file.tell())
            break
        line = file.readline()

python loop won't iterate on second pass

When I run the following in the Python IDLE Shell:
f = open(r"H:\Test\test.csv", "rb")
for line in f:
    print line
#this works fine
however, when I run the following for a second time:
for line in f:
    print line
#this does nothing
This does not work because you've already seeked to the end of the file the first time. You need to rewind (using .seek(0)) or re-open your file.
Some other pointers:
Python has a very good csv module. Do not attempt to implement CSV parsing yourself unless doing so as an educational exercise.
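You probably want to open your file in 'rU' mode, not 'rb'. 'rU' is universal newline mode in Python 2, which will deal with source files coming from platforms with different line endings for you. (In Python 3, text mode handles universal newlines by default, and the 'U' flag was removed in Python 3.11.)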
Use with when working with file objects, since it will clean up the handles for you even in the case of errors. Example:
with open(r"H:\Test\test.csv", "rU") as f:
    for line in f:
        ...
You can read the data from the file into a variable, and then iterate over this data as many times as you want in your script. This is better than seeking back and forth, as long as the file fits in memory.
f = open(r"H:\Test\test.csv", "rb")
data = f.readlines()
for line in data:
    print line
for line in data:
    print line
Output:
# This is test.csv
Line1,This is line 1, there are, some numbers here,321423423
Line2,This is line2 , there are some characters here,sdfdsfdsf
# This is test.csv
Line1,This is line 1, there are, some numbers here,321423423
Line2,This is line2 , there are some characters here,sdfdsfdsf
Because you've gone all the way through the CSV file, and the iterator is exhausted. You'll need to re-open it before the second loop.
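A small Python 3 sketch of the exhaustion and of the seek(0) fix; the sample file is created only for the demonstration.

```python
import os
import tempfile

# Hypothetical two-line CSV file.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as out:
    out.write("Line1,a\nLine2,b\n")

with open(path, "r") as f:
    first_pass = [line for line in f]    # consumes the file up to EOF
    second_pass = [line for line in f]   # nothing left: the position is still at EOF
    f.seek(0)                            # rewind to the beginning
    third_pass = [line for line in f]    # yields every line again

os.remove(path)
```

Rewinding keeps memory use flat, while readlines() trades memory for the convenience of iterating over a list repeatedly.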

Python Overwriting files after parsing

I'm new to Python, and I need to do a parsing exercise. I have a file and I need to parse it (just the headers), but after the process I need to keep the file in the same format, with the same extension, and in the same place on disk, with the new headers as the only difference.
I tried this code...
for line in open ('/home/name/db/str/dir/numbers/str.phy'):
    if line.startswith('ENS'):
        linepars = re.sub ('ENS([A-Z]+)0+([0-9]{6})','\\1\\2',line)
        print linepars
..and it does the job, but I don't know how to "overwrite" the file with the new parsing.
The easiest way, but not the most efficient (by far, and especially for long files) would be to rewrite the complete file.
You could do this by opening a second file handle and rewriting each line, except in the case of the header, you'd write the parsed header. For example,
fr = open('/home/name/db/str/dir/numbers/str.phy')
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w') # Name this whatever makes sense
for line in fr:
    if line.startswith('ENS'):
        linepars = re.sub ('ENS([A-Z]+)0+([0-9]{6})','\\1\\2',line)
        fw.write(linepars)
    else:
        fw.write(line)
fw.close()
fr.close()
EDIT: Note that this does not use readlines(), so it's more memory efficient. It also does not store every output line; it handles only one line at a time, writing it to the file immediately.
Just as a cool trick, you could use the with statement on the input file to avoid having to close it (Python 2.5+):
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w') # Name this whatever makes sense
with open('/home/name/db/str/dir/numbers/str.phy') as fr:
    for line in fr:
        if line.startswith('ENS'):
            linepars = re.sub ('ENS([A-Z]+)0+([0-9]{6})','\\1\\2',line)
            fw.write(linepars)
        else:
            fw.write(line)
fw.close()
P.S. Welcome :-)
As others are saying here, you want to open a file and use that file object's .write() method.
The best approach would be to open an additional file for writing:
import os
current_cfg = open(...)
parsed_cfg = open(..., 'w')
for line in current_cfg:
    new_line = parse(line)
    print new_line
    parsed_cfg.write(new_line + '\n')
current_cfg.close()
parsed_cfg.close()
os.rename(....) # Rename old file to backup name
os.rename(....) # Rename new file into place
Additionally, I'd suggest looking at the tempfile module and using one of its functions for either naming your new file or opening/creating it. Personally, I'd favor putting the new file in the same directory as the existing file to ensure that os.rename will work atomically (the configuration file name is guaranteed to point at either the old file or the new file; in no case would it point at a partially written/copied file).
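As a rough Python 3 sketch of that approach: the function name rewrite_headers and the sample file are assumptions, the regular expression is the one from the question, and os.replace is used in place of the two os.rename calls, so no backup copy is kept.

```python
import os
import re
import tempfile

HEADER = re.compile(r"ENS([A-Z]+)0+([0-9]{6})")   # pattern taken from the question


def rewrite_headers(path):
    """Rewrite path line by line via a temp file in the same directory, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r") as source, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False
    ) as target:
        for line in source:
            if line.startswith("ENS"):
                line = HEADER.sub(r"\1\2", line)
            target.write(line)
        temp_path = target.name
    os.replace(temp_path, path)   # same filesystem, so the swap is atomic on POSIX


# Demonstration on a throwaway file.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "str.phy")
with open(sample, "w") as f:
    f.write("ENSGALP00000123456 header\nACGT\n")

rewrite_headers(sample)
with open(sample) as f:
    result = f.read()
```

Because the temporary file lives in the same directory as the target, the final swap stays on one filesystem, and a reader sees either the old content or the new content but never a half-written file.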
The following code DOES the job.
I mean it DOES overwrite the file IN PLACE; that's what the OP asked for. That's possible because the transformations only remove characters, so the position of the writing handle fo is always BEHIND the position of the reading handle fi. Since the result is shorter than the original, the leftover tail has to be truncated at the end.
import re
regx = re.compile(r'\AENS([A-Z]+)0+([0-9]{6})')
with open('bomo.phy', 'rb+') as fi, open('bomo.phy', 'rb+') as fo:
    fo.writelines(regx.sub('\\1\\2', line) for line in fi)
    fo.truncate()  # cut off the stale bytes left beyond the new, shorter content
I think that the writing isn't performed by the operating system one line at a time but through a buffer, so several lines are read before a batch of transformed lines is written.
newlines = []
for line in open('/home/name/db/str/dir/numbers/str.phy').readlines():
    if line.startswith('ENS'):
        line = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
    newlines.append(line)  # keep non-header lines unchanged
# lines already end with a newline, so join with the empty string
open('/home/name/db/str/dir/numbers/str.phy', 'w').write(''.join(newlines))
(Side note: if you are working with large files, the level of optimization required depends on your situation. Python evaluates eagerly by default, and the following solution reads the whole file into memory, so it is not a good choice for large files such as database dumps or logs. A few tweaks, such as nesting the with clauses and using lazy generators or a line-by-line algorithm, allow O(1)-memory behavior.)
import re

targetFile = '/home/name/db/str/dir/numbers/str.phy'

def replaceIfHeader(line):
    if line.startswith('ENS'):
        return re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
    else:
        return line

with open(targetFile, 'r') as f:
    # lines keep their trailing newline, so join with the empty string
    newText = ''.join(replaceIfHeader(line) for line in f)

try:
    # make backup of targetFile
    with open(targetFile, 'w') as f:
        f.write(newText)
except:
    # error encountered, do something to inform user where backup of targetFile is
    raise
edit: thanks to Jeff for suggestion
