how to understand <class '_io.TextIOWrapper'>? - python

The following is the code:
fin = open("myFile.txt", "r")  # opening a file creates a file handle
for line in fin:
    print(line)
fin.close()  # close file handle
My question is: how to understand the TextIOWrapper object fin? I mean, is it safe to say it is something with a sequence structure, where each item of the sequence is a line of your file with "\n" added? What else could I say about it? How do you understand it?
Any comments are greatly appreciated.

The expression for line in file streams the file line by line, splitting on the newline delimiter, until it reaches EOF. Think of it as a stream that reads up to and including the next newline, and then returns the characters it just read.

File-like objects are iterators that produce a line of text on each iteration. An iterator, in general, is just "a thing you can loop over exactly once"; files differ from this pattern only insofar as they can (depending on what they represent) be seeked, which moves the iterator to a new position in the file.
To be clear, they are not sequences; the term "sequence" has specific meaning, and includes the ability to index it, iterate it multiple times in a row or in parallel, all without manually fixing up state.
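The distinction can be checked directly. The following is a minimal sketch, assuming a small scratch file created only for the demonstration (the temporary directory and the file contents are not from the question):

```python
import os
import tempfile
from collections.abc import Iterator, Sequence

# Hypothetical scratch file, created only so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "myFile.txt")
    with open(path, "w") as out:
        out.write("first\nsecond\nthird\n")

    with open(path, "r") as fin:
        type_name = type(fin).__name__            # name of the class open() returned
        is_iterator = isinstance(fin, Iterator)   # has __iter__ and __next__
        is_sequence = isinstance(fin, Sequence)   # no indexing, no len()
        is_own_iterator = iter(fin) is fin        # iter() hands back the same object

        try:
            fin[0]                                # sequences allow this; files do not
            indexable = True
        except TypeError:
            indexable = False

        first_pass = list(fin)                    # consumes every line
        second_pass = list(fin)                   # already exhausted, so empty

print(type_name, is_iterator, is_sequence, is_own_iterator, indexable)
print(first_pass, second_pass)
```

The second list comes back empty because nothing resets the position between the two passes, which is exactly the "manual state fix-up" a real sequence never needs.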

Related

Reading files in Python with for loop

To read a file in Python, the file must be first opened, and then a read() function is needed. Why is that when we use a for loop to read lines of a file, no read() function is necessary?
filename = 'pi_digits.txt'
with open(filename) as file_object:
    for line in file_object:
        print(line)
I'm used to the code below, showing the read requirement.
for line in file_object.read():
This is because the file object has an __iter__ method built in that defines how the file interacts with an iterating statement such as a for loop.
In other words, when you write for line in file_object, Python calls the file object's __iter__ method and gets back an iterator that yields one line of the file at a time. It does not build a list of lines.
Python file objects define special behavior when you iterate over them, in this case with the for loop. Every time you hit the top of the loop, it implicitly fetches the next line, much as if you had called readline(). That's all there is to it.
Note that the code you are "used to" will actually iterate character by character, not line by line! That's because you will be iterating over a string (the result of the read()), and when Python iterates over strings, it goes character by character.
The file object returned by open in your with statement handles the reading implicitly. It behaves like a generator that yields the file one record at a time (the read is hidden inside the iteration). For a text file, each line is one record.
Note that the read command in your second example reads the entire file into a string; this consumes more memory than the line-at-a-time example.
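The difference between the two loops is easy to see side by side. This is a small sketch in which io.StringIO stands in for the opened file (an assumption made so that no pi_digits.txt is needed on disk):

```python
import io

# io.StringIO stands in for an opened text file (an assumption for the sketch).
text = "3.1415\n9265\n"

by_line = [item for item in io.StringIO(text)]          # iterate the file object
by_char = [item for item in io.StringIO(text).read()]   # iterate the string from read()

print(by_line)      # whole lines, newline included
print(by_char[:4])  # single characters
```

Iterating the file object gives two items, one per line; iterating the result of read() gives one item per character.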

How does readline() work behind the scenes when reading a text file?

I would like to understand how readline() takes in a single line from a text file. The specific details I would like to know about, with respect to how the compiler interprets the Python language and how this is handled by the CPU, are:
How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
I am a "beginner" (I have about 4 years of "simpler" programming experience), so I wouldn't be able to understand technical details, but feel free to expand if it could help others understand!
Example using the file file.txt:
fake file
with some text
in a few lines
Question 1: How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
When you open a file in Python, it creates a file object. A file object wraps an operating-system file descriptor and keeps a current position, so at any point in time it points to a specific place in the file. When you first open the file, that position is at the beginning of the file. When you call readline(), it moves the position forward to the character just after the next newline it reads.
Calling the tell() method of a file object returns the location it is currently pointing to.
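with open('file.txt', 'r') as fd:
    print(fd.tell())   # position before anything is read
    fd.readline()      # consume the first line
    print(fd.tell())   # position just after the first newline
# output:
0
10
# Or 11, depending on the line separators in the file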
Question 2: Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
First off, reading a file doesn't really have anything to do with the CPU. It has to do with the operating system and the file system. Both of those determine how files can be read and written to. Barebones explanation of files
For random access in files, you can use the mmap module of python. The Python Module of the Week site has a great tutorial.
Example, jumping to the 2nd line in the example file and reading until the end:
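import mmap
import contextlib
# open in binary mode: mmap works on raw bytes
with open('file.txt', 'rb') as fd:
    with contextlib.closing(mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)) as mm:
        # offset 10 is the start of line 2, assuming "\n" line endings
        print(mm[10:].decode())
# output:
with some text
in a few lines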
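A plain file object can also jump around without mmap, using tell() and seek(). The following is a minimal sketch, assuming a scratch copy of the example file.txt written to a temporary directory (the directory and the variable names are illustrative only):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.txt")
    # Scratch copy of the example file, written only for this sketch.
    with open(path, "w", newline="\n") as out:
        out.write("fake file\nwith some text\nin a few lines\n")

    with open(path, "r") as fd:
        fd.readline()                   # consume the first line
        second_line_start = fd.tell()   # remember where line 2 begins
        fd.read()                       # run to the end of the file
        fd.seek(second_line_start)      # jump back to the remembered position
        rest = fd.read()

print(rest)
```

In text mode, the safe values to pass to seek() are 0 and positions previously returned by tell(), which is why the sketch records the position rather than computing it.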
This is a very broad question and it's unlikely that all details about what the CPU does would fit in an answer. But a high-level answer is possible:
readline reads each line in order. It starts by reading chunks of the file from the beginning. When it encounters a line break, it returns that line. Each successive invocation of readline returns the next line until the last line has been read. Then it returns an empty string.
with open("myfile.txt") as f:
    while True:
        line = f.readline()
        if not line:
            break
        # do something with the line
Readline uses operating system calls under the hood. The file object corresponds to a file descriptor in the OS, and it has a pointer that keeps track of where in the file we are at the moment. The next read will return the next chunk of data from the file from that point on.
You would have to scan through the file first in order to know how many lines there are, and then use some way of starting from the "middle" line. If you meant some arbitrary line other than the first and last lines, you would have to scan the file from the beginning, identifying lines (for example, by repeatedly calling readline and throwing away the result) until you have reached the line you want. There is a ready-made module for this: linecache.
import linecache
linecache.getline("myfile.txt", 5) # we already know we want line 5
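The "call readline repeatedly and throw away the result" approach can be sketched with itertools.islice. The helper name read_line_number is hypothetical, and io.StringIO stands in for an opened file:

```python
import io
from itertools import islice


def read_line_number(stream, n):
    """Return line n (1-based) of stream, or '' if it has fewer than n lines."""
    # islice consumes and discards the first n - 1 lines, then yields line n.
    return next(islice(stream, n - 1, n), "")


sample = io.StringIO("fake file\nwith some text\nin a few lines\n")
print(read_line_number(sample, 2))
```

Like linecache.getline, the sketch returns an empty string when the requested line does not exist, but unlike linecache it keeps nothing cached and still has to read every earlier line.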

Is a file object a list by default?

I've encountered two versions of code that both can accomplish the same task with a little difference in the code itself:
with open("file") as f:
    for line in f:
        print(line)
and
with open("file") as f:
    data = f.readlines()
    for line in data:
        print(line)
My question is, is the file object f a list by default just like data? If not, why does the first chunk of code work? Which version is the better practice?
A file object is not a list; it is an object that conforms to the iterator interface (docs). That is, it implements an __iter__ method that returns an iterator object. That iterator object implements both __iter__ and next (spelled __next__ in Python 3), which allows iteration over the collection.
It happens that the file object is its own iterator (docs), meaning file.__iter__() returns self.
Both for line in file and lines = file.readlines() are equivalent in that they yield the same result when used to iterate over all lines in the file. In Python 2, however, file.next() buffers the contents of the file (it reads ahead) to speed up reading, effectively moving the file descriptor to a position at or beyond where the last line ended. This means that if you used for line in file, read some lines, stopped the iteration before reaching the end of the file, and then called file.readlines(), the first line returned might not be the full line following the last line iterated over in the for loop.
When you use for x in my_it, the interpreter calls my_it.__iter__(). The next() method is then called repeatedly on the object returned by that call, and each return value is assigned to x. When next() raises StopIteration, the loop ends.
Note: A valid iterator implementation should ensure that once StopIteration is raised, it continues to be raised for all subsequent calls to next().
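To make the protocol concrete, here is a minimal sketch of an iterator that behaves like a file object in the respects described above. The class LineIterator is hypothetical and reads from a string rather than a real file; it uses the Python 3 spelling __next__:

```python
class LineIterator:
    """Hypothetical stand-in for a file object that yields lines from a string."""

    def __init__(self, text):
        self._lines = text.splitlines(keepends=True)
        self._position = 0

    def __iter__(self):
        # Like a file object, this iterator is its own iterator.
        return self

    def __next__(self):
        if self._position >= len(self._lines):
            # Keeps raising on every later call, as the protocol requires.
            raise StopIteration
        line = self._lines[self._position]
        self._position += 1
        return line


for line in LineIterator("one\ntwo\n"):
    print(line, end="")
```

Because __iter__ returns self, a second for loop over the same LineIterator instance finds it already exhausted, just as with a file.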
In both cases, you are getting a file line-by-line. The method is different.
With your first version:
with open("file") as f:
    for line in f:
        print(line)
While you are iterating over the file line by line, the file contents are not fully resident in memory (unless it is a one-line file).
The open built-in function returns a file object, not a list. That object supports iteration; in this case it returns individual strings, each of which is a group of characters in the file terminated by either a newline or the end of the file.
You can write a loop that is similar to what for line in f: print(line) is doing under the hood:
with open('file') as f:
    while True:
        try:
            line = next(f)  # ask the file object for its next line
        except StopIteration:
            break
        else:
            print(line)
With the second version:
with open("file") as f:
    data = f.readlines()  # equivalent to data = list(f)
    for line in data:
        print(line)
You are using a method of a file object (file.readlines()) that reads the entire file contents into memory as a list of the individual lines. The code is then iterating over that list.
You can write a similar version of that as well that highlights the iterators under the hood:
with open('file') as f:
    data = list(f)
    it = iter(data)  # a list is not its own iterator, so ask for one
    while True:
        try:
            line = next(it)
        except StopIteration:
            break
        else:
            print(line)
In both of your examples, you are using a for loop to loop over the items of an iterable. The items are the same in each case (individual lines of the file), but the underlying iterable is different. In the first version, it is a file object; in the second version, it is a list. Use the first version if you just want to deal with each line. Use the second if you want a list of lines.
Read Ned Batchelder's excellent overview on looping and iteration for more.
f is a filehandle, not a list. It is iterable.
A file is an iterable. Lots of objects, including lists are iterable, which just means that they can be used in a for loop to sequentially yield an object to bind the for iterator variable to.
Both versions of your code accomplish iteration line by line. The second version reads the whole file into memory and constructs a list; the first may not read the whole file first. You might prefer the second if you want to close the file before something else modifies it; the first might be preferred if the file is very large.

Is file object in python an iterable

I have a file "test.txt":
this is 1st line
this is 2nd line
this is 3rd line
the following code
lines = open("test.txt", 'r')
for line in lines:
    print("loop 1:" + line)
for line in lines:
    print("loop 2:" + line)
only prints:
loop 1:this is 1st line
loop 1:this is 2nd line
loop 1:this is 3rd line
It doesn't print loop2 at all.
Two questions:
Is the file object returned by open() an iterable? Is that why it can be used in a for loop?
Why doesn't loop 2 get printed at all?
It is not only an iterable, it is an iterator, which is why it can only traverse the file once. You may reset the file cursor with .seek(0) as many have suggested but you should, in most cases, only iterate a file once.
Yes, file objects are iterators.
Like all iterators, you can only loop over them once, after which the iterator is exhausted. Your file read pointer is at the end of the file. Re-open the file, or use .seek(0) to rewind the file pointer if you need to loop again.
Alternatively, try to avoid looping over a file twice; extract what you need into another datastructure (list, dictionary, set, heap, etc.) during the first loop.
Yes, file objects are iterables but to go back to the beginning of the file you need to use lines.seek(0), since after the first loop you are at the end of the file.
You are already at the end of the file. File objects are iterators. Once you iterate over them, you are at the final position. Iterating again won't start from the beginning. If you would like to start at line one again, you need to use lines.seek(0).
It would be better, though, to rewrite code so that the file doesn't need to be iterated twice. Read all the lines into a list of some kind, or perform all the processing in a single loop.
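The exhaustion and the effect of seek(0) can be shown in a few lines. This is a sketch only: io.StringIO stands in for open("test.txt", 'r'), and the third loop is added for illustration:

```python
import io

# io.StringIO stands in for open("test.txt") so the sketch runs without a file.
lines = io.StringIO("this is 1st line\nthis is 2nd line\nthis is 3rd line\n")

loop_1 = ["loop 1:" + line for line in lines]   # consumes the whole stream
loop_2 = ["loop 2:" + line for line in lines]   # stream exhausted: nothing to do

lines.seek(0)                                   # rewind to the start
loop_3 = ["loop 3:" + line for line in lines]   # all three lines again

print(len(loop_1), len(loop_2), len(loop_3))
```

If the data is needed more than once, storing the result of the first pass (here, loop_1) avoids touching the file again at all.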

taking a character input in python from a file?

In Python, suppose I have a file data.txt that has 6 lines of data. I want to count the number of lines, which I am planning to do by going through each character and counting the '\n' characters in the file. How do I read one character at a time from the file? readline() takes the whole line.
I think the method you're looking for is readlines, as in
lines = open("inputfilex.txt", "r").readlines()
This will give you a list of each of the lines in the file. To find out how many lines, you can just do:
len(lines)
And then access it using indexes, like lines[3] or lines[-1] as you would any normal Python list.
You can use read(1) to read a single byte. help(file) says:
read(size) -> read at most size bytes, returned as a string.
If the size argument is negative or omitted, read until EOF is reached.
Notice that when in non-blocking mode, less data than what was requested
may be returned, even if no size parameter was given.
Note that reading a file a byte at a time is quite un-"Pythonic". This is par for the course in C, but Python can do a lot more work with far less code. For example, you can read the entire file into a list in one line of code:
lines = f.readlines()
You could then access by line number with a simple lines[lineNumber] lookup.
Or if you don't want to store the entire file in memory at once, you can iterate over it line-by-line:
for line in f:
    # Do whatever you want.
That is much more readable and idiomatic.
It seems the simplest answer for you would be to do:
lines = 0  # running count of lines seen so far
for line in file:
    lines += 1
    # do whatever else you need to do for each line
Or the equivalent construction explicitly using readline(). I'm not sure why you want to look at every character when you said above that readline() is correctly reading each line in its entirety.
To access a file based on its lines, make a list of its lines.
with open('myfile') as f:
    lines = list(f)
then simply access lines[3] to get the fourth line and so forth. (Note that this will not strip the newline characters.)
The linecache module can also be useful for this.
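For the original goal of counting lines, both routes discussed above can be sketched without holding the whole file in memory. The function names count_lines and count_newlines and the chunk_size parameter are hypothetical, and io.StringIO stands in for the opened data.txt:

```python
import io


def count_lines(stream):
    """Count lines by iterating the stream one line at a time."""
    return sum(1 for _ in stream)


def count_newlines(stream, chunk_size=4096):
    """Count '\\n' characters by reading fixed-size chunks of text."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)   # chunk_size=1 reads one character at a time
        if not chunk:                     # empty string means end of file
            break
        total += chunk.count("\n")
    return total


data = "l1\nl2\nl3\nl4\nl5\nl6"              # six lines, no trailing newline
print(count_lines(io.StringIO(data)))       # lines, counted by iteration
print(count_newlines(io.StringIO(data)))    # newline characters only
```

The two counts differ by one when the last line has no trailing newline, which is one reason counting lines by iteration tends to be the less error-prone choice.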
