Confusion about interaction of readline and location of file pointer - python

My question is an elaboration of this Stack Overflow question:
Why does readline() put the file pointer at the end of the file in Python?
In the given answer, they said:
Calling file.readline() doesn't change this behaviour, especially since readline() calls can use a buffer to read in larger blocks.
However, the OP had a seek function before the readline to tell the file pointer to go back to the start. Does the fact that the file pointer went back to the end have something to do with the readline function tampering with the inner workings?
I tested this with files that have many lines, and using the method the OP provided, the pointer always went to the end of file after the first write command. But using the solution post, the pointer just skips to the next line after a write (which is intended behaviour). What does the readline() function do to mess up the behaviour?

Related

Confusion about with statement python

I have a question regarding the use of with statements in python, as given below:
with open(fname, 'wb') as f:
    np.save(f, MyData)
If I'm not mistaken, this opens the file fname in a safe manner, such that if an exception occurs the file is closed properly. Then it writes MyData to the file. But what I would do is simply:
np.save(fname,MyData)
This would give the same result: MyData gets written to fname. I'm not sure I understand why the former is better. I don't understand how this one-liner could keep the file "open" after the line has run. Therefore I also don't see how this could create issues if my code crashed afterwards.
Maybe this is a stupid/basic question, but I always thought that a cleaner code is a nicer code, so not having the extra with-loop seems just better to me.
numpy.save() handles the opening and closing itself when you give it a path. However, if you supply an open file object, it leaves the file open, because it assumes you want to do something else with it; closing it would break that for you.
Try this:
f = open(<file>)
f.close()
f.read()  # boom: ValueError, I/O operation on closed file
See also the hasattr(file, "write") check in NumPy's code ("file" here meaning a file object, buffer, or other IO handle). It tests whether the argument has a write() method, and NumPy treats anything that passes as an open file.
NumPy makes no guarantees if you misuse its API, though: if you pass a custom buffer-like object that has no write() method, it will be treated as a path and fail in the open() call.
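The path-or-file-object pattern described above can be sketched without NumPy. This is only an illustration of the idea, not NumPy's actual code: save_text is a hypothetical helper, the temporary path is made up for the demo, and the syntax is Python 3.

```python
import os
import tempfile


def save_text(target, text):
    """Hypothetical helper: accept either a path or an open file object."""
    if hasattr(target, "write"):
        # Caller supplied an open file object: write to it and leave it open.
        target.write(text)
    else:
        # Caller supplied a path: open, write, and close it ourselves.
        with open(target, "w") as handle:
            handle.write(text)


path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Path form: the helper opens and closes the file internally.
save_text(path, "first\n")

# File-object form: the helper leaves the file open, so the with block
# is what guarantees it gets closed, even if an exception is raised.
with open(path, "a") as f:
    save_text(f, "second\n")
    still_open_inside = not f.closed

closed_after = f.closed
with open(path) as f:
    contents = f.read()
```

So if you only ever pass a file name, the one-liner is fine; the with block matters when you open the file yourself.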

confused about the readline() return in Python

I am a beginner in python. However, I have some problems when I try to use the readline() method.
f=raw_input("filename> ")
a=open(f)
print a.read()
print a.readline()
print a.readline()
print a.readline()
and my txt file is
aaaaaaaaa
bbbbbbbbb
ccccccccc
However, when I tried to run it on a Mac terminal, I got this:
aaaaaaaaa
bbbbbbbbb
ccccccccc
It seems that readline() is not working at all.
But when I disable print a.read(), the readline() gets back to work.
This confuses me a lot. Is there any solution where I can use read() and readline() at the same time?
When you open a file you get a pointer to some place in the file (by default: the beginning). Whenever you run .read() or .readline(), this pointer moves:
.read() reads until the end of the file and moves the pointer to the end (thus any further read call returns an empty string)
.readline() reads until newline is seen and sets the pointer after it
.read(X) reads X bytes and sets the pointer at CURRENT_LOCATION + X (or the end)
If you wish you can manually move that pointer by issuing a.seek(X) call where X is a place in file (seen as an array of bytes). For example this should give you the desired output:
print a.read()
a.seek(0)
print a.readline()
print a.readline()
print a.readline()
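For readers on Python 3, the same idea can be checked with a small self-contained sketch. The temporary file and its contents are assumptions that mirror the three-line file from the question.

```python
import os
import tempfile

# Create a throwaway file with the same three lines as in the question.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("aaaaaaaaa\nbbbbbbbbb\nccccccccc\n")

with open(path) as a:
    whole = a.read()            # consumes everything; position is now at the end
    after_read = a.readline()   # nothing left, so this is an empty string
    a.seek(0)                   # rewind to the beginning
    first = a.readline()
    second = a.readline()
```

Here after_read is empty, while first and second hold the first two lines again because of the seek(0) call.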
You need to understand the concept of file pointers. When you read the file, it is fully consumed, and the pointer is at the end of the file.
It seems that the readline() is not working at all.
It is working as expected. There are no lines to read.
when I disable print a.read(), the readline() gets back to work.
Because the pointer is then still at the beginning of the file, so the lines can be read.
Is there any solution that I can use read() and readline() at the same time?
Sure. Flip the ordering of reading a few lines, then the remainder of the file, or seek the file pointer back to a position that you would like.
Also, don't forget to close the file when you are finished reading it.
The file object a remembers its position in the file.
a.read() reads from the current position to end of the file (moving the position to the end of the file)
a.readline() reads from the current position to the end of the line (moving the position to the next line)
a.seek(n) moves to position n in the file (without returning anything)
a.tell() returns the position in the file.
So try putting the calls to readline first. You'll notice that the read call now won't return the whole file, just the remaining lines (maybe none), depending on how many times you called readline. And play around with seek and tell to confirm what's going on.
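A minimal Python 3 sketch of that experiment follows. It assumes a throwaway file with the three lines from the question, and it opens the file in binary mode so that tell() reports plain byte offsets.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"aaaaaaaaa\nbbbbbbbbb\nccccccccc\n")

with open(path, "rb") as a:
    start = a.tell()            # 0: a freshly opened file starts at the beginning
    first = a.readline()        # reads through the first newline
    after_line = a.tell()       # 10: nine letters plus the newline
    rest = a.read()             # only the remaining two lines
    at_end = a.tell()           # 30: the total size of the file
```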

How does readline() work behind the scenes when reading a text file?

I would like to understand how readline() takes in a single line from a text file. The specific details I would like to know about, with respect to how the compiler interprets the Python language and how this is handled by the CPU, are:
How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
I am a "beginner" (I have about 4 years of "simpler" programming experience), so I wouldn't be able to understand technical details, but feel free to expand if it could help others understand!
Example using the file file.txt:
fake file
with some text
in a few lines
Question 1: How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
When you open a file in Python, it creates a file object. A file object keeps track of a current position in the file (often called the file pointer). When you first open the file, that position is at the beginning of the file. When you call readline(), it moves the position forward to the character just after the next newline it reads.
Calling the tell() method of a file object returns that current position.
with open('file.txt', 'r') as fd:
    print fd.tell()
    fd.readline()
    print fd.tell()
# output:
0
10
# Or 11, depending on the line separators in the file
Question 2: Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
First off, reading a file doesn't really have anything to do with the CPU. It has to do with the operating system and the file system. Both of those determine how files can be read and written to.
For random access in files, you can use the mmap module of python. The Python Module of the Week site has a great tutorial.
Example, jumping to the 2nd line in the example file and reading until the end:
import mmap
import contextlib
with open('file.txt', 'r') as fd:
    with contextlib.closing(mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)) as mm:
        print mm[10:]
# output:
with some text
in a few lines
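As an alternative to mmap, a plain seek() on the file object also lets you start reading from the middle of a file. The sketch below is Python 3 and assumes a throwaway copy of the example file written with \n line endings; the offsets 10 and 15 are specific to that content.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "wb") as f:
    f.write(b"fake file\nwith some text\nin a few lines\n")

with open(path, "rb") as fd:
    fd.seek(10)                 # skip the 10 bytes of "fake file\n"
    second_line = fd.readline()
    fd.seek(15)                 # offsets need not fall on a line boundary
    partial = fd.readline()     # reads from mid-line up to the next newline
```

Note that seek() works on byte offsets, not line numbers, so you need to know (or compute) where the line you want begins.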
This is a very broad question and it's unlikely that all details about what the CPU does would fit in an answer. But a high-level answer is possible:
readline reads each line in order. It starts by reading chunks of the file from the beginning. When it encounters a line break, it returns that line. Each successive invocation of readline returns the next line until the last line has been read. Then it returns an empty string.
with open("myfile.txt") as f:
    while True:
        line = f.readline()
        if not line:
            break
        # do something with the line
Readline uses operating system calls under the hood. The file object corresponds to a file descriptor in the OS, and it has a pointer that keeps track of where in the file we are at the moment. The next read will return the next chunk of data from the file from that point on.
You would have to scan through the file first in order to know how many lines there are, and then use some way of starting from the "middle" line. If you meant some arbitrary line other than the first and last, you would have to scan the file from the beginning, identifying lines, until you have reached the line you want (for example, you could repeatedly call readline, throwing away the result). There is a ready-made module for this: linecache.
import linecache
linecache.getline("myfile.txt", 5) # we already know we want line 5
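The scan-and-discard approach described above can also be written by hand. This is a rough Python 3 sketch, assuming a throwaway seven-line file; get_line is a hypothetical helper, not part of linecache.

```python
import os
import tempfile
from itertools import islice

path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w") as f:
    f.write("".join("line %d\n" % n for n in range(1, 8)))


def get_line(filename, lineno):
    """Hypothetical helper: return line `lineno` (1-based), or '' if missing."""
    with open(filename) as f:
        # islice reads and discards the first lineno - 1 lines, then yields one.
        return next(islice(f, lineno - 1, lineno), "")


fifth = get_line(path, 5)
missing = get_line(path, 100)
```

Like linecache.getline, this returns an empty string when the requested line does not exist, but it rereads the file on every call instead of caching it.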

How to read from file opened in "a+" mode?

By definition, "a+" mode opens the file for both appending and reading. Appending works, but what is the method for reading? I did some searches, but couldn't find it clarified anywhere.
f=open("myfile.txt","a+")
print (f.read())
Tried this, it prints blank.
Use f.seek() to set the file offset to the beginning of the file.
Note: Before Python 2.7, there was a bug that caused the file position not to always point to the end of the file on some operating systems. Because of it, your original code may have worked for some users. For example, on CentOS 6 your code would have worked as you wanted, but not as it should.
f = open("myfile.txt","a+")
f.seek(0)
print f.read()
When you open the file using f=open("myfile.txt","a+"), the file can be both read and written to.
By default the file handle points to the start of the file;
this can be checked with f.tell(), which will return 0L.
In [76]: f=open("myfile.txt","a+")
In [77]: f.tell()
Out[77]: 0L
In [78]: f.read()
Out[78]: '1,2\n3,4\n'
However, f.write will take care of moving the pointer to the end of the file before writing.
There are still quirks in newer versions of Python that depend on the OS; they are due to differences in the implementation of the fopen() function in stdio.
Linux's man fopen:
a+ - Open for reading and appending (writing at end of file). The file is created if it does not exist. The initial file position for reading is at the beginning of the file, but output is always appended to the end of the file.
OS X:
``a+'' - Open for reading and writing. The file is created if it does not exist. The stream is positioned at the end of the file. Subsequent writes to the file will always end up at the then current end of file, irrespective of any intervening fseek(3) or similar.
MSDN doesn't really state where the pointer is initially set, just that it moves to the end on writes.
When a file is opened with the "a" or "a+" access type, all write operations occur at the end of the file. The file pointer can be repositioned using fseek or rewind, but is always moved back to the end of the file before any write operation is carried out. Thus, existing data cannot be overwritten.
Replicating the differences on various systems with both Python 2.7.x and 3k is pretty straightforward using open() and .tell().
When dealing with anything through the OS, it's safer to take precautions like using an explicit .seek(0).
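A small Python 3 sketch of that precaution, assuming a throwaway file that already holds two lines: seek(0) makes the read independent of where the platform initially puts the position, and the write that follows the read still lands at the end of the file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w") as f:
    f.write("1,2\n3,4\n")

with open(path, "a+") as f:
    f.seek(0)                   # do not rely on the platform's initial position
    existing = f.read()         # read the current contents from the start
    f.write("5,6\n")            # appended after the existing data

with open(path) as f:
    final = f.read()
```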
MODES
r+   read and write    starts at the beginning of the file
r    read only         starts at the beginning of the file
a+   read and append   preserves file content by writing to the end of the file
Good Luck!
Isabel Ruiz

How do I read several lines in a file faster using python?

As of now I use the following python code:
file = open(filePath, "r")
lines=file.readlines()
file.close()
Say my file has many lines (10,000 or more); then my program becomes slow if I do this for more than one file. Is there a way to speed this up in Python? From reading various links, I understand that readlines stores all the lines of the file in memory, which is why the code gets slow.
I have tried the following code as well and the time gain I got is 17%.
lines=[line for line in open(filePath,"r")]
Is there any other module in Python 2.4 that I might have missed?
Thanks,
Sandhya
for line in file:
This gives you an iterator that reads the file object one line at a time and then discards the previous line from memory.
A file object is its own iterator, for example iter(f) returns f (unless f is closed). When a file is used as an iterator, typically in a for loop (for example, for line in f: print line), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit. In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer. New in version 2.3.
Short answer: don't assign the lines to a variable, just perform whatever operations you need inside the loop.
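A minimal sketch of that advice in Python 3 syntax. The 10,000-line file and the two statistics computed from it are made up for the demo; the point is only that the work happens inside the loop and no list of lines is kept.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "big.txt")
with open(path, "w") as f:
    for n in range(10000):
        f.write("row %d\n" % n)

line_count = 0
longest = 0
with open(path) as f:
    for line in f:              # one line at a time; no full list is built
        line_count += 1
        longest = max(longest, len(line))
```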
