I've encountered two versions of code that both can accomplish the same task with a little difference in the code itself:
with open("file") as f:
    for line in f:
        print line
and
with open("file") as f:
    data = f.readlines()
    for line in data:
        print line
My question is, is the file object f a list by default just like data? If not, why does the first chunk of code work? Which version is the better practice?
A file object is not a list; it is an object that conforms to the iterator interface (docs). That is, it implements an __iter__ method that returns an iterator object, and that iterator object implements both __iter__ and next methods, allowing iteration over the collection.
It happens that the file object is its own iterator (docs), meaning file.__iter__() returns self.
Both for line in file and lines = file.readlines() are equivalent in that they yield the same result when used to iterate over all lines in the file. However, file.next() buffers content from the file (it reads ahead) to speed up reading, effectively moving the file descriptor to a position at or beyond where the last returned line ended. This means that if you use for line in file, read some lines, and then stop the iteration before reaching the end of the file, a subsequent call to file.readlines() may not start at the beginning of the line following the last line the for loop yielded.
When you use for x in my_it, the interpreter calls my_it.__iter__(). The next() method is then called repeatedly on the object returned by that call, and each return value is assigned to x in turn. When next() raises StopIteration, the loop ends.
Note: a valid iterator implementation should ensure that once StopIteration has been raised, it continues to be raised on all subsequent calls to next().
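You can see all of this directly. The sketch below uses io.StringIO as a stand-in for an open text file (real file objects behave the same way); note that Python 3 spells the call next(f), where Python 2 file objects used f.next():

```python
import io

# io.StringIO stands in for an open text file.
f = io.StringIO("alpha\nbeta\n")

# A file object is its own iterator: iter(f) returns f itself.
assert iter(f) is f

# Drive the iterator by hand, exactly as a for loop does under the hood.
lines = []
while True:
    try:
        line = next(f)
    except StopIteration:
        break
    lines.append(line)

print(lines)  # ['alpha\n', 'beta\n']

# Once StopIteration has been raised, it keeps being raised.
exhausted = False
try:
    next(f)
except StopIteration:
    exhausted = True
assert exhausted
```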
In both cases, you are getting a file line-by-line. The method is different.
With your first version:
with open("file") as f:
    for line in f:
        print line
While you are iterating over the file line by line, the file contents are never fully resident in memory (unless it is a one-line file).
The open built-in function returns a file object -- not a list. That object supports iteration; in this case it returns individual strings, each one a group of characters in the file terminated by either a newline or the end of the file.
You can write a loop that is similar to what for line in f: print line is doing under the hood:
with open('file') as f:
    while True:
        try:
            line = f.next()
        except StopIteration:
            break
        else:
            print line
With the second version:
with open("file") as f:
    data = f.readlines()  # equivalent to data = list(f)
    for line in data:
        print line
You are using a method of a file object (file.readlines()) that reads the entire file contents into memory as a list of the individual lines. The code is then iterating over that list.
You can write a similar version of that as well that highlights the iterators under the hood:
with open('file') as f:
    data = list(f)
    it = iter(data)
    while True:
        try:
            line = it.next()
        except StopIteration:
            break
        else:
            print line
In both of your examples, you are using a for loop to loop over items in a sequence. The items are the same in each case (individual lines of the file) but the underlying sequence is different. In the first version, the sequence is a file object; in the second version it is a list. Use the first version if you just want to deal with each line. Use the second if you want a list of lines.
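To make the comparison concrete, here is a small sketch (the throwaway file and its contents are made up for this demo) showing that both versions yield exactly the same items, while only the second one leaves you with a list you can keep after the file is closed:

```python
import os
import tempfile

# Hypothetical throwaway file written just for this comparison.
path = os.path.join(tempfile.mkdtemp(), "file")
with open(path, "w") as f:
    f.write("one\ntwo\nthree\n")

# Version 1: iterate the file object itself (one line in memory at a time).
streamed = []
with open(path) as f:
    for line in f:
        streamed.append(line)

# Version 2: readlines() builds the whole list up front.
with open(path) as f:
    data = f.readlines()

# Same items either way; only the underlying sequence differs.
assert streamed == data == ["one\n", "two\n", "three\n"]

# The list outlives the closed file; the exhausted file object does not.
print(data)
```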
Read Ned Batchelder's excellent overview on looping and iteration for more.
f is a filehandle, not a list. It is iterable.
A file is an iterable. Lots of objects, including lists, are iterable, which just means that they can be used in a for loop, sequentially yielding an object to bind the for loop variable to.
Both versions of your code iterate line by line. The second version reads the whole file into memory and constructs a list; the first reads only as much as it needs at a time. You might prefer the second if you want to close the file quickly, before something else modifies it; the first is preferable when the file is very large.
Related
I've created a while loop to iterate through, read the data.txt file, calculate the perimeter/area/vertices, and the root will print out the answers.
I'm stuck here:
from Polygon import Polygon
from Rectangle import Rectangle

in_file = open('data.txt', 'r')
line = in_file.readlines()
while line:
    type_, num_of_sides = line.split()  ### ERROR: 'list' object has no attribute 'split'
    sides = [map(float, in_file.readline().split()) for _ in range(int(num_of_sides))]
    print(line)
    if type_ == 'P':
        poly_object = Polygon(sides)
    elif type_ == 'R':
        poly_object = Rectangle(sides)
    print('The vertices are {}'.format(poly_object.vertices))
    print('The perimeter is {}'.format(poly_object.perimeter()))
    print('The area is {}'.format(poly_object.area()))
    print()
    line = in_file.readline()
in_file.close()
Should I create a for loop that loops through since readlines is a list of strings and I want the split to read through each line? Or is it just the way that I'm choosing to format, so that's why I'm getting an error?
The immediate issue is the use of readlines, rather than readline, in line = in_file.readlines(). This populates line with a list of all lines from the file rather than a single line, which causes several problems as the wrong type of data propagates through the loop:
1. The while line test passes after reading a non-empty file, because a non-empty list evaluates to true, so the loop body runs.
2. You call split on line. On the first iteration of the loop, line is a list of strings (one per line of the file). Lists do not have a split method; only the individual strings do. This call fails.
3. Even if the call to split had not failed and the loop had run through once, the subsequent call to line = in_file.readline() would not return the next line of the file: all lines were already consumed by the earlier call to readlines, the cursor sits at EOF, and so readline returns an empty string and the loop terminates.
Without that final readline call in step 3, the loop would instead run forever, since line would never be updated and would remain a non-empty list.
A minimal change is to adjust the initial assignment to the line variable to use readline() rather than readlines(), ensuring only a single line is read from the file. The rest of the logic in the code should then work.
You may find the following implementation logic easier to implement and reason about:
collect the machinery for reading from in_file together into a single block before the loop, using a context manager to automatically close the file on leaving the block
producing a list of lines, which a simple for loop is sufficient to iterate over
# This context manager will automatically ensure `in_file` is closed
# on leaving the `with` block.
with open('data.txt', 'r') as in_file:
    lines = in_file.readlines()
    for line in lines:
        # Do something with each line; in particular, you can call
        # split() on line without issue here.
        #
        # You do not require any further calls to in_file.readline()
        # in this block.
        pass
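Filling in the loop body, a sketch of the parsing logic might look like this. Here io.StringIO stands in for open('data.txt'); the "P 2" header format (shape letter, then a count of coordinate lines) is an assumption taken from the question, and the Polygon/Rectangle classes are omitted:

```python
import io

# Stand-in for the data file; the format is assumed from the question.
fake_file = io.StringIO("P 2\n0 0\n1 1\n")

lines = fake_file.readlines()
it = iter(lines)
shapes = []
for line in it:
    # line is a single string here, so split() works.
    type_, num_of_sides = line.split()
    # Pull the coordinate lines off the same iterator.
    sides = [list(map(float, next(it).split()))
             for _ in range(int(num_of_sides))]
    shapes.append((type_, sides))

print(shapes)  # [('P', [[0.0, 0.0], [1.0, 1.0]])]
```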
To read a file in Python, the file must first be opened, and then a read() function is needed. Why is it that when we use a for loop to read the lines of a file, no read() function is necessary?
filename = 'pi_digits.txt'
with open(filename) as file_object:
    for line in file_object:
        print(line)
I'm used to the code below, showing the read requirement.
for line in file_object.read():
This is because file objects have an __iter__ method built in that states how the object behaves in an iterative statement, like a for loop.
In other words, when you say for line in file_object, Python calls the file object's __iter__ method, which returns the file object itself; each pass of the loop then fetches the next line from it. It does not build a list of lines up front.
Python file objects define special behavior when you iterate over them, in this case with the for loop. Every time you hit the top of the loop it implicitly calls readline(). That's all there is to it.
Note that the code you are "used to" will actually iterate character by character, not line by line! That's because you will be iterating over a string (the result of the read()), and when Python iterates over strings, it goes character by character.
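A quick demonstration of that difference, using io.StringIO as a stand-in for an open text file:

```python
import io

f = io.StringIO("hi\nok\n")

# read() returns one big string, and iterating a string yields characters.
chars = list(f.read())
print(chars)  # ['h', 'i', '\n', 'o', 'k', '\n']

# Rewind, then iterate the file object itself: now you get whole lines.
f.seek(0)
lines = list(f)
print(lines)  # ['hi\n', 'ok\n']
```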
The open call in your with statement handles the reading implicitly. It returns an iterator that yields the file one record at a time (the read is hidden inside the object). For a text file, each line is one record.
Note that the read command in your second example reads the entire file into a string; this consumes more memory than the line-at-a-time example.
I process one file: skip the header (comment), process the first line, process other lines.
f = open(filename, 'r')
# skip the header
next(f)
# handle the first line
line = next(f)
process_first_line(line)
# handle other lines
for line in f:
    process_line(line)
If line = next(f) is replaced with line = f.readline(), it will encounter the error.
ValueError: Mixing iteration and read methods would lose data
Therefore, I would like to know the differences among next(f), f.readline() and f.next() in Python?
Quoting official Python documentation,
A file object is its own iterator, for example iter(f) returns f (unless f is closed). When a file is used as an iterator, typically in a for loop (for example, for line in f: print line.strip()), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit when the file is open for reading (behavior is undefined when the file is open for writing). In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right.
Basically, when the next function is called on a Python 2 file object, it fetches a chunk of bytes from the file, buffers them, and returns only the current line (the end of the current line is determined by the newline character). The file pointer has therefore been moved; it is not at the position where the returned line ends. Calling readline on it would give an inconsistent result, which is why mixing the two is not allowed.
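Note that this restriction is specific to Python 2's file objects. In Python 3, the io objects keep a single buffer, so next() and readline() advance the same underlying position and can be mixed freely. A quick sketch, with io.StringIO standing in for an open text file:

```python
import io

f = io.StringIO("first\nsecond\nthird\n")

# In Python 3, both calls advance the same position, so no data is lost.
a = next(f)
b = f.readline()
c = next(f)

assert a == "first\n"
assert b == "second\n"
assert c == "third\n"
print(a, b, c)
```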
I have a file "test.txt":
this is 1st line
this is 2nd line
this is 3rd line
the following code
lines = open("test.txt", 'r')
for line in lines:
    print "loop 1:"+line
for line in lines:
    print "loop 2:"+line
only prints:
loop 1:this is 1st line
loop 1:this is 2nd line
loop 1:this is 3rd line
It doesn't print loop2 at all.
Two questions:
the file object returned by open(), is it an iterable? that's why it can be used in a for loop?
why loop2 doesn't get printed at all?
It is not only an iterable, it is an iterator, which is why it can only traverse the file once. You may reset the file cursor with .seek(0) as many have suggested but you should, in most cases, only iterate a file once.
Yes, file objects are iterators.
Like all iterators, you can only loop over them once, after which the iterator is exhausted. Your file read pointer is at the end of the file. Re-open the file, or use .seek(0) to rewind the file pointer if you need to loop again.
Alternatively, try to avoid looping over a file twice; extract what you need into another datastructure (list, dictionary, set, heap, etc.) during the first loop.
Yes, file objects are iterables but to go back to the beginning of the file you need to use lines.seek(0), since after the first loop you are at the end of the file.
You are already at the end of the file. File objects are iterators. Once you iterate over them, you are at the final position; iterating again won't start from the beginning. If you would like to start at line one again, you need to call lines.seek(0).
It would be better, though, to rewrite code so that the file doesn't need to be iterated twice. Read all the lines into a list of some kind, or perform all the processing in a single loop.
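The exhaustion-and-rewind behavior described above can be seen in a few lines (io.StringIO standing in for the open file):

```python
import io

lines = io.StringIO("a\nb\n")

first_pass = list(lines)    # consumes the iterator
second_pass = list(lines)   # already exhausted: empty
lines.seek(0)               # rewind the cursor
third_pass = list(lines)    # works again

print(first_pass, second_pass, third_pass)
# ['a\n', 'b\n'] [] ['a\n', 'b\n']
```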
As of now I use the following python code:
file = open(filePath, "r")
lines=file.readlines()
file.close()
Say my file has several lines (10,000 or more); my program becomes slow if I do this for more than one file. Is there a way to speed this up in Python? Reading various links, I understand that readlines stores the lines of the file in memory, which is why the code gets slow.
I have tried the following code as well and the time gain I got is 17%.
lines=[line for line in open(filePath,"r")]
Is there any other module in python2.4 (which I might have missed)?
Thanks,
Sandhya
for line in file:
This gives you an iterator that reads the file object one line at a time and then discards the previous line from memory.
A file object is its own iterator, for example iter(f) returns f (unless f is closed). When a file is used as an iterator, typically in a for loop (for example, for line in f: print line), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit. In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer. New in version 2.3.
Short answer: don't assign the lines to a variable, just perform whatever operations you need inside the loop.
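In other words, stream the work instead of collecting the lines. A minimal sketch (io.StringIO stands in for the open file, and summing numbers is just a made-up stand-in for your real per-line work):

```python
import io

# Stand-in for a large file of numbers; with a real file this would be
# `with open(filePath) as f:`.
f = io.StringIO("10\n20\n30\n")

total = 0
for line in f:          # one line in memory at a time
    total += int(line)  # do the work here, then let the line go

print(total)  # 60
```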