How does readline() work behind the scenes when reading a text file?

I would like to understand how readline() takes in a single line from a text file. The specific details I would like to know, with respect to how the interpreter handles the Python language and how this is handled by the CPU, are:
How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
I am a "beginner" (I have about 4 years of "simpler" programming experience), so I may not be able to follow deeply technical details, but feel free to expand if it could help others understand!

Example using the file file.txt:
fake file
with some text
in a few lines
Question 1: How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
When you open a file in Python, it creates a file object. File objects wrap operating-system file descriptors, which means that at any point in time they point to a specific place in the file. When you first open the file, that pointer is at the beginning of the file. When you call readline(), it moves the pointer forward to the character just after the next newline it reads.
Calling the tell() method of a file object returns the location the file descriptor is currently pointing to.
with open('file.txt', 'r') as fd:
    print(fd.tell())
    fd.readline()
    print(fd.tell())

# output:
# 0
# 10 (or 11, depending on the line separators in the file)
Question 2: Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
First off, reading a file doesn't really have anything to do with the CPU. It has to do with the operating system and the file system; both of those determine how files can be read and written. (See a bare-bones explanation of files for more background.)
For random access in files, you can use the mmap module of python. The Python Module of the Week site has a great tutorial.
Example, jumping to the 2nd line in the example file and reading until the end:
import mmap
import contextlib

with open('file.txt', 'r') as fd:
    with contextlib.closing(mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)) as mm:
        print(mm[10:].decode(), end='')

# output:
# with some text
# in a few lines
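For this small example, plain seeking works too. A minimal sketch that jumps to the same byte offset 10 reported by tell() above:
with open('file.txt', 'r') as fd:
    fd.seek(10)              # jump straight to where the 2nd line starts
    print(fd.read(), end='')

# output:
# with some text
# in a few lines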

This is a very broad question and it's unlikely that all details about what the CPU does would fit in an answer. But a high-level answer is possible:
readline reads each line in order. It starts by reading chunks of the file from the beginning. When it encounters a line break, it returns that line. Each successive invocation of readline returns the next line until the last line has been read. Then it returns an empty string.
with open("myfile.txt") as f:
while True:
line = f.readline()
if not line:
break
# do something with the line
readline uses operating system calls under the hood. The file object corresponds to a file descriptor in the OS, which keeps a pointer to where in the file we are at the moment. The next read returns the next chunk of data starting from that point.
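A minimal sketch of what that looks like at the OS level, using the os module directly (and the file.txt from the examples above):
import os

fd = os.open('file.txt', os.O_RDONLY)  # an integer OS-level file descriptor
print(os.read(fd, 10))                 # read the next 10 bytes; the OS advances the position
print(os.lseek(fd, 0, os.SEEK_CUR))    # ask for the current position: 10
os.close(fd)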
You would have to scan through the file first to know how many lines there are, and then use some way of starting from the "middle" line. If you mean some arbitrary line other than the first and last, you have to scan the file from the beginning, identifying lines (for example, by repeatedly calling readline and throwing away the result) until you have reached the line you want. There is a ready-made module for this: linecache.
import linecache
linecache.getline("myfile.txt", 5) # we already know we want line 5

Related

How to start reading a file from a particular line in the case of a huge text file as I cannot iterate from line one

This is an issue of trying to reach the line to start from and proceed from there in the shortest time possible.
I have a huge text file that I'm reading and performing operations on line after line. I am currently keeping track of the line number that I have parsed, so that in case of any system crash I know how much I'm done with.
How do I restart reading the file from that point without starting over from the beginning?
count = 0
all_parsed = os.listdir("urltextdir/")
with open(filename, "r") as readfile:
    for eachurl in readfile:
        if str(count) + ".txt" not in all_parsed:
            urltext = getURLText(eachurl)
            with open("urltextdir/" + str(count) + ".txt", "w") as writefile:
                writefile.write(urltext)
            result = processUrlText(urltext)
            saveinDB(result)
        count += 1
This is what I'm currently doing, but when it crashes at a million lines, I have to go through all those lines in the file to reach the point I want to start from. My other alternative is to use readlines and load the entire file into memory.
Is there an alternative that I can consider?
Unfortunately a line number isn't really a basic position for file objects, and the special seeking/telling functions are ruined by next(), which is called in your loop (in text mode, tell() is even disabled while the file is being iterated). You can't jump to a line, but you can jump to a byte position. So one way would be:
line = readfile.readline()                 # must use readline(), not iteration!
while line:
    lastell = readfile.tell()              # location of the cursor in the file after reading the line
    print(lastell)
    print(line)                            # do with line what you would normally do
    line = readfile.readline()
Now you can easily jump back with:
readfile.seek(lastell)  # you need to have kept the last lastell
You would need to keep saving lastell to a file (or printing it) so that on restart you know which byte you're starting at.
Unfortunately you can't derive the position from the files you have already written, as any modification to the number of characters per line would ruin a count based on them.
Here is one full implementation. Create a file called tell and put 0 inside of it, and then you can run:
with open('tell', 'r+') as tfd:
    with open('abcdefg') as fd:
        fd.seek(int(tfd.readline()))        # get last position
        line = fd.readline()                # init loop
        while line:
            print(line.strip(), fd.tell())  # action on line
            tfd.seek(0)                     # clear and
            tfd.write(str(fd.tell()))       # write new position only if successful
            tfd.truncate()                  # drop leftover digits from a longer previous value
            line = fd.readline()            # advance loop
You can check if such a file exists and create it in the program of course.
As @Edwin pointed out in the comments, you may want to fd.flush() and os.fsync(fd.fileno()) (import os if that isn't clear) to make sure that after every write your file contents are actually on disk. This applies to both write operations you are doing, the tell file being the quicker of the two, of course. It may slow things down considerably, so if you are satisfied with the synchronicity as is, do not use it, or only flush the tfd. You can also specify the buffer size when calling open, so that Python flushes more often, as detailed in https://stackoverflow.com/a/3168436/6881240.
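A minimal sketch of that pattern, wrapped in a hypothetical helper (tfd being the 'tell' file from the example above):
import os

def save_position(tfd, pos):
    # persist pos to the tracking file and force it onto the disk
    tfd.seek(0)
    tfd.write(str(pos))
    tfd.truncate()          # drop leftover digits from a longer previous value
    tfd.flush()             # push Python's internal buffer to the OS
    os.fsync(tfd.fileno())  # ask the OS to commit it to the physical disk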
If I got it right, you could make a simple log file to store the count in.
Still, I would recommend using many files, or storing every line or paragraph in a database like SQL or MongoDB.
I guess it depends on what system your script is running on, and what resources (such as memory) you have available.
But with the popular saying "memory is cheap", you can simply read the file into memory.
As a test, I created a file with 2 million lines, each line 1024 characters long with the following code:
ms = 'a' * 1024
with open('c:\\test\\2G.txt', 'w') as out:
    for _ in range(0, 2000000):
        out.write(ms + '\n')
This resulted in a 2 GB file on disk.
I then read the file into a list in memory, like so:
my_file_as_list = open('c:\\test\\2G.txt', 'r').readlines()
I checked the Python process, and it used a little over 2 GB of memory (on a 32 GB system).
Access to the data was very fast, and can be done with list slicing.
You need to keep track of the list index; when your system crashes, you can start from that index again.
But more importantly: if your system is "crashing", then you need to find out why it is crashing. Surely a couple of million lines of data is not a reason to crash these days...

Large File handling in Python

I am doing some file operations in Python. I am using Python version 3.5.2.
I have a large file of 4 GB, and I'm reading the file in chunks of, say, 2 KB.
I have a doubt: if any 2 KB chunk happens to end in the middle of a line (between two newlines), will that line be truncated, or will the half-read line's contents be returned?
Yes, this is a problem. You can see that with a much smaller test:
import io

s = io.BytesIO(b'line\nanother line\nanother\n')
while True:
    buf = s.read(10)
    if not buf:
        break
    print('*** new buffer')
    for line in buf.splitlines():
        print(line.decode())
The output is:
*** new buffer
line
anoth
*** new buffer
er line
an
*** new buffer
other
As you can see, the first buffer has a truncated partial line that finishes in the next buffer, exactly what you were worried about. In fact, this will happen not just occasionally, but most of the time.
The solution is to keep around the overflow (after the last line) from the old buffer, and use it as part of the new buffer. You should try to code this up for yourself, to make sure you understand it (remember to print out the leftover overflow at the end of the file).
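For reference, here is one possible sketch of such a leftover-carrying loop (do try it yourself first); the unpacking trick is just one of several strategies that work:
import io

s = io.BytesIO(b'line\nanother line\nanother\n')
leftover = b''
while True:
    buf = s.read(10)
    if not buf:
        break
    *complete, leftover = (leftover + buf).split(b'\n')  # last piece may be partial
    for line in complete:
        print(line.decode())
if leftover:
    print(leftover.decode())  # the data didn't end with a newline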
But the good news is that you rarely need to do this, because Python file objects do it for you:
s = io.BytesIO(b'line\nanother line\nanother\n')
for line in s:
    print(line.decode(), end='')
That's it. You can test this with a real file from open(path, 'rb') in place of BytesIO, it works just as well. Python will read in about a page at a time, and generate lines one by one, automatically handling all the tricky stuff for you. If "about a page" isn't good enough, you can use something more explicit, e.g., passing buffering=2048 to the open function.
In fact, you can do even better. Open the file in text mode, and Python will still read in about a page at a time, split it into lines, and decode them for you on the fly—and probably a lot more efficiently than anything you would have come up with:
for line in open(path):
    print(line, end='')

Python - Opening a text file for editing before modifying it in place with fileinput.input()

I have a Python script used to edit a text file. First, the first line of the text file is removed. After that, a line is added to the end of the text file.
I noticed a weird phenomenon, but I cannot explain the reason for this behaviour:
This script works as expected (removes the first line and adds a line at the end of the file):
import fileinput

# remove first line of text file
i = 0
for line in fileinput.input('test.txt', inplace=True):
    i += 1
    if i != 1:
        print(line.strip())

# add a line at the end of the file
f = open('test.txt', 'a+')  # <= line that is moved
f.write('test5')
f.close()
But in the following script, as the text file is opened before removing, the removal occurs but the content isn't added (with the write() method):
import fileinput

# file opened before removing
f = open('test.txt', 'a+')  # <= line that is moved

# remove first line of text file
i = 0
for line in fileinput.input('test.txt', inplace=True):
    i += 1
    if i != 1:
        print(line.strip())

# add a line at the end of the file
f.write('test5')
f.close()
Note that in the second example, open() is placed at the beginning, whereas in the first it is called after removing the first line of the text file.
What's the explanation of the behaviour?
When using fileinput with the inplace parameter, the modified content is saved in a backup file. The backup file is renamed to the original file when the output file is closed. In your example, you do not close the fileinput file explicitly, relying on the self-triggered closing, which is not documented and might be unreliable.
The behaviour you describe in the first example is best explained if we assume that opening the same file again triggers fileinput.close(). In your second example, the renaming only happens after f.close() is executed, thus overwriting the other changes (adding "test5").
So apparently you should explicitly call fileinput.close() in order to have full control over when your changes are written to disk. (It is generally recommended to release external resources explicitly as soon as they are not needed anymore.)
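A sketch of the first script with that explicit close added (fileinput.close() is the module-level close function):
import fileinput

i = 0
for line in fileinput.input('test.txt', inplace=True):
    i += 1
    if i != 1:
        print(line.strip())
fileinput.close()  # make sure the backup is renamed over test.txt right here

with open('test.txt', 'a+') as f:
    f.write('test5')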
EDIT:
After more profound testing, this is what I think is happening in your second example:
You open a stream with mode a+ to the text file and bind it to the variable f.
You use fileinput to alter the same file. Under the hood, a new file is created, which is afterwards renamed to what the file was called originally. Note: this doesn't actually change the original file – rather, the original file is made inaccessible, as its former name now points to a new file.
However, the stream f still points to the original file (which no longer has a file name). You can still write to it and close it properly, but you cannot see it anymore, since it no longer has a name in the file system.
Please note that I'm not an expert in this kind of low-level operations; the details might be wrong and the terminology certainly is. Also, the behaviour might be different across OS and Python implementations. However, it might still help you understand why things go different from what you expected.
In conclusion, I'd say that you shouldn't be doing what you do in your second example. Why not just read the file into memory, alter it, and then write it back to disk? Actual in-place (on-disk) altering of files is no fun in Python; it's too high-level a language for that.
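A minimal sketch of that read-modify-write approach for the example above:
# read the whole file, drop the first line, append one, write it back
with open('test.txt') as f:
    lines = f.readlines()

lines = lines[1:]        # remove the first line
lines.append('test5\n')  # add a line at the end

with open('test.txt', 'w') as f:
    f.writelines(lines)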

Is it possible to modify lines in a file in-place?

Is it possible to parse a file line by line, and edit a line in-place while going through the lines?
It can be simulated using a backup file as stdlib's fileinput module does.
Here's an example script that removes lines that do not satisfy some_condition from files given on the command line or stdin:
#!/usr/bin/env python
# grep_some_condition.py
import fileinput

for line in fileinput.input(inplace=True, backup='.bak'):
    if some_condition(line):
        print(line, end='')  # this goes to the current file
Example:
$ python grep_some_condition.py first_file.txt second_file.txt
On completion, first_file.txt and second_file.txt will contain only the lines that satisfy the some_condition() predicate.
The fileinput module has a very ugly API; I found a nicer module for this task: in_place. Example for Python 3:
import in_place

with in_place.InPlace('data.txt') as file:
    for line in file:
        line = line.replace('test', 'testZ')
        file.write(line)
The main differences from fileinput are:
Instead of hijacking sys.stdout, a new filehandle is returned for writing.
The filehandle supports all of the standard I/O methods, not just readline().
Important notes:
This solution deletes every line in the file if you don't re-write it with the file.write() call.
Also, if the process is interrupted, you lose any line in the file that has not already been re-written.
No. You cannot safely write to a file you are also reading, as any changes you make to the file could overwrite content you have not read yet. To do it safely you'd have to read the file into a buffer, updating any lines as required, and then re-write the file.
If you're replacing the content in the file byte-for-byte (i.e. if the text you are replacing is the same length as the new string), then you can get away with it, but it's a hornets' nest, so I'd save yourself the hassle and just read the full file, replace the content in memory (or via a temporary file), and write it out again.
If you only intend to perform localized changes that do not change the length of the part of the file that is modified (e.g. changing all characters to lower case), then you can actually overwrite the old contents of the file dynamically.
To do that, you can use random file access with the seek() method of a file object.
Alternatively, you may be able to use an mmap object to treat the whole file as a mutable string. Keep in mind that mmap objects may impose a maximum file-size limit in the 2-4 GB range on a 32-bit CPU, depending on your operating system and its configuration.
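A minimal sketch of such a same-length overwrite with seek(), assuming a hypothetical file sample.txt in which some lines start with 'old':
with open('sample.txt', 'r+b') as f:
    while True:
        pos = f.tell()               # remember where this line starts
        line = f.readline()
        if not line:
            break
        if line.startswith(b'old'):
            f.seek(pos)
            f.write(b'new')          # same length as 'old', so later offsets are untouched
            f.seek(pos + len(line))  # continue from the start of the next line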
You have to back up by the size of the line in characters. Assuming you used readline, you can get the length of the line and back up using:
file.seek(offset[, whence])
Set whence to SEEK_CUR and offset to -length.
See the Python docs or look at the manpage for seek.
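A short sketch of that, assuming a file myfile.txt opened in binary mode (Python 3 only allows relative seeks on binary-mode files):
import io

with open('myfile.txt', 'rb') as f:
    line = f.readline()
    f.seek(-len(line), io.SEEK_CUR)  # rewind to the start of the line just read
    print(f.readline())              # reads the same line again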

Taking a character input in Python from a file?

In Python, suppose I have a file data.txt which has 6 lines of data. I want to calculate the number of lines, which I am planning to do by going through each character and counting the number of '\n' in the file. How do I read one character at a time from the file? readline() takes the whole line.
I think the method you're looking for is readlines, as in
lines = open("inputfilex.txt", "r").readlines()
This will give you a list of each of the lines in the file. To find out how many lines, you can just do:
len(lines)
And then access it using indexes, like lines[3] or lines[-1] as you would any normal Python list.
You can use read(1) to read a single byte. help(file) says:
read(size) -> read at most size bytes, returned as a string.
If the size argument is negative or omitted, read until EOF is reached.
Notice that when in non-blocking mode, less data than what was requested
may be returned, even if no size parameter was given.
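For instance, the character-by-character count the question describes could look like this sketch (assuming the data.txt from the question):
count = 0
with open('data.txt') as f:
    while True:
        ch = f.read(1)  # read a single character
        if not ch:      # an empty string means end of file
            break
        if ch == '\n':
            count += 1
print(count)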
Note that reading a file a byte at a time is quite un-"Pythonic". This is par for the course in C, but Python can do a lot more work with far less code. For example, you can read the entire file into an array in one line of code:
lines = f.readlines()
You could then access by line number with a simple lines[lineNumber] lookup.
Or if you don't want to store the entire file in memory at once, you can iterate over it line-by-line:
for line in f:
    ...  # do whatever you want with each line
That is much more readable and idiomatic.
It seems the simplest answer for you would be to do:
lines = 0
for line in file:
    lines += 1
    # do whatever else you need to do for each line
Or the equivalent construction explicitly using readline(). I'm not sure why you want to look at every character when you said above that readline() is correctly reading each line in its entirety.
To access a file based on its lines, make a list of its lines.
with open('myfile') as f:
    lines = list(f)
then simply access lines[3] to get the fourth line and so forth. (Note that this will not strip the newline characters.)
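If you do want the newlines stripped, one common variant is:
with open('myfile') as f:
    lines = f.read().splitlines()  # like list(f), but without trailing newlines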
The linecache module can also be useful for this.
