Is it possible to parse a file line by line, and edit a line in-place while going through the lines?
Is it possible to parse a file line by line, and edit a line in-place while going through the lines?
It can be simulated using a backup file as stdlib's fileinput module does.
Here's an example script that removes lines that do not satisfy some_condition from files given on the command line or stdin:
#!/usr/bin/env python
# grep_some_condition.py
import fileinput
for line in fileinput.input(inplace=True, backup='.bak'):
if some_condition(line):
print line, # this goes to the current file
Example:
$ python grep_some_condition.py first_file.txt second_file.txt
On completion first_file.txt and second_file.txt files will contain only lines that satisfy some_condition() predicate.
fileinput module has very ugly API, I find beautiful module for this task - in_place, example for Python 3:
import in_place
with in_place.InPlace('data.txt') as file:
for line in file:
line = line.replace('test', 'testZ')
file.write(line)
main difference from fileinput:
Instead of hijacking sys.stdout, a new filehandle is returned for writing.
The filehandle supports all of the standard I/O methods, not just readline().
Important Notes:
This solution deletes every line in the file if you don't re-write it with the file.write() line.
Also, if the process is interrupted, you lose any line in the file that has not already been re-written.
No. You cannot safely write to a file you are also reading, as any changes you make to the file could overwrite content you have not read yet. To do it safely you'd have to read the file into a buffer, updating any lines as required, and then re-write the file.
If you're replacing byte-for-byte the content in the file (i.e. if the text you are replacing is the same length as the new string you are replacing it with), then you can get away with it, but it's a hornets nest, so I'd save yourself the hassle and just read the full file, replace content in memory (or via a temporary file), and write it out again.
If you only intend to perform localized changes that do not change the length of the part of the file that is modified (e.g. changing all characters to lower case), then you can actually overwrite the old contents of the file dynamically.
To do that, you can use random file access with the seek() method of a file object.
Alternatively, you may be able to use an mmap object to treat the whole file as a mutable string. Keep in mind that mmap objects may impose a maximum file-size limit in the 2-4 GB range on a 32-bit CPU, depending on your operating system and its configuration.
You have to back up by the size of the line in characters. Assuming you used readline, then you can get the length of the line and back up using:
file.seek(offset[, whence])
Set whence to SEEK_CUR, set offset to -length.
See Python Docs or look at the manpage for seek.
Related
To read a file in Python, the file must be first opened, and then a read() function is needed. Why is that when we use a for loop to read lines of a file, no read() function is necessary?
filename = 'pi_digits.txt'
with open(filename,) as file_object:
for line in file_object:
print(line)
I'm used to the code below, showing the read requirement.
for line in file_object.read():
This is because the file_object class has an "iter" method built in that states how the file will interact with an iterative statement, like a for loop.
In other words, when you say for line in file_object the file object is referencing its __iter__ method, and returning a list where each index contains a line of the file.
Python file objects define special behavior when you iterate over them, in this case with the for loop. Every time you hit the top of the loop it implicitly calls readline(). That's all there is to it.
Note that the code you are "used to" will actually iterate character by character, not line by line! That's because you will be iterating over a string (the result of the read()), and when Python iterates over strings, it goes character by character.
The open command in your with statement handles the reading implicitly. It creates a generator that yields the file a record at a time (the read is hidden within the generator). For a text file, each line is one record.
Note that the read command in your second example reads the entire file into a string; this consumes more memory than the line-at-a-time example.
I would like to understand how readline() takes in a single line from a text file. The specific details I would like to know about, with respect to how the compiler interprets the Python language and how this is handled by the CPU, are:
How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
I am a "beginner" (I have about 4 years of "simpler" programming experience), so I wouldn't be able to understand technical details, but feel free to expand if it could help others understand!
Example using the file file.txt:
fake file
with some text
in a few lines
Question 1: How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
When you open a file in python, it creates a file object. File objects act as file descriptors, which means at any one point in time, they point to a specific place in the file. When you first open the file, that pointer is at the beginning of the file. When you call readline(), it moves the pointer forward to the character just after the next newline it reads.
Calling the tell() function of a file object returns the location the file descriptor is currently pointing to.
with open('file.txt', 'r') as fd:
print fd.tell()
fd.readline()
print fd.tell()
# output:
0
10
# Or 11, depending on the line separators in the file
Question 2: Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
First off, reading a file doesn't really have anything to do with the CPU. It has to do with the operating system and the file system. Both of those determine how files can be read and written to. Barebones explanation of files
For random access in files, you can use the mmap module of python. The Python Module of the Week site has a great tutorial.
Example, jumping to the 2nd line in the example file and reading until the end:
import mmap
import contextlib
with open('file.txt', 'r') as fd:
with contextlib.closing(mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)) as mm:
print mm[10:]
# output:
with some text
in a few lines
This is a very broad question and it's unlikely that all details about what the CPU does would fit in an answer. But a high-level answer is possible:
readline reads each line in order. It starts by reading chunks of the file from the beginning. When it encounters a line break, it returns that line. Each successive invocation of readline returns the next line until the last line has been read. Then it returns an empty string.
with open("myfile.txt") as f:
while True:
line = f.readline()
if not line:
break
# do something with the line
Readline uses operating system calls under the hood. The file object corresponds to a file descriptor in the OS, and it has a pointer that keeps track of where in the file we are at the moment. The next read will return the next chunk of data from the file from that point on.
You would have to scan through the file first in order to know how many lines there are, and then use some way of starting from the "middle" line. If you meant some arbitrary line except the first and last lines, you would have to scan the file from the beginning identifying lines (for example, you could repeatedly call readline, throwing away the result), until you have reached the line you want). There is a ready-made module for this: linecache.
import linecache
linecache.getline("myfile.txt", 5) # we already know we want line 5
I have a python script used to edit a text file. Firstly, the first line of the text file is removed. After that, a line is added to the end of the text file.
I noticed a weird phenomenon, but I cannot explain the reason of this behaviour:
This script works as expected (removes the first line and adds a line at the end of the file):
import fileinput
# remove first line of text file
i = 0
for line in fileinput.input('test.txt', inplace=True):
i += 1
if i != 1:
print line.strip()
# add a line at the end of the file
f = open('test.txt', 'a+') # <= line that is moved
f.write('test5')
f.close()
But in the following script, as the text file is opened before removing, the removal occurs but the content isn't added (with the write() method):
import fileinput
# file opened before removing
f = open('test.txt', 'a+') # <= line that is moved
# remove first line of text file
i = 0
for line in fileinput.input('test.txt', inplace=True):
i += 1
if i != 1:
print line.strip()
# add a line at the end of the file
f.write('test5')
f.close()
Note that in the second example, open() is placed a the beginning, whereas in the first it is called after removing the last line of the text file.
What's the explanation of the behaviour?
When using fileinput with the inplace parameter, the modified content is saved in a backup file. The backup file is renamed to the original file when the output file is closed. In your example, you do not close the fileinput file explicitly, relying on the self-triggered closing, which is not documented and might be unreliable.
The behaviour you describe in the first example is best explained if we assume that opening the same file again triggers fileinput.close(). In your second example, the renaming only happens after f.close() is executed, thus overwriting the other changes (adding "test5").
So apparently you should explicitly call fileinput.close() in order to have full control over when your changes are written to disk. (It is generally recommended to release external resources explicitly as soon as they are not needed anymore.)
EDIT:
After more profound testing, this is what I think is happening in your second example:
You open a stream with mode a+ to the text file and bind it to the variable f.
You use fileinput to alter the same file. Under the hood, a new file is created, which is afterwards renamed to what the file was called originally. Note: this doesn't actually change the original file – rather, the original file is made inaccessible, as its former name now points to a new file.
However, the stream f still points to the original file (which has no file name anymore). You can still write to this file and close it properly, but you cannot see it anymore (since it has no filename anymore).
Please note that I'm not an expert in this kind of low-level operations; the details might be wrong and the terminology certainly is. Also, the behaviour might be different across OS and Python implementations. However, it might still help you understand why things go different from what you expected.
In conclusion I'd say that you shouldn't be doing what you do in your second example. Why don't you just read the file into memory, alter it, and then write it back to disk? Actual in-place (on-disk) altering of files is no fun in Python, as it's too high-level a language.
By definition, "a+" mode opens the file for both appending and reading. Appending works, but what is the method for reading? I did some searches, but couldn't find it clarified anywhere.
f=open("myfile.txt","a+")
print (f.read())
Tried this, it prints blank.
Use f.seek() to set the file offset to the beginning of the file.
Note: Before Python 2.7, there was a bug that would cause some operating systems to not have the file position always point to the end of the file. This could cause some users to have your original code work. For example, on CentOS 6 your code would have worked as you wanted, but not as it should.
f = open("myfile.txt","a+")
f.seek(0)
print f.read()
when you open the file using f=open(myfile.txt,"a+"), the file can be both read and written to.
By default the file handle points to the start of the file,
this can be determined by f.tell() which will be 0L.
In [76]: f=open("myfile.txt","a+")
In [77]: f.tell()
Out[77]: 0L
In [78]: f.read()
Out[78]: '1,2\n3,4\n'
However, f.write will take care of moving the pointer to the last line before writing.
There are still quirks in newer version of Python dependant on OS and they are due to differences in implementation of the fopen() function in stdio.
Linux's man fopen:
a+ - Open for reading and appending (writing at end of file). The file is created if it does not exist. The initial file position for reading is at the beginning of the file, but output is always appended to the end of the file.
OS X:
``a+'' - Open for reading and writing. The file is created if it does not exist. The stream is positioned at the end of the file. Subsequent writes to the file will always end up at the then current end of file, irrespective of any intervening fseek(3) or similar.
MSDN doesn't really state where the pointer is initially set, just that it moves to the end on writes.
When a file is opened with the "a" or "a+" access type, all write operations occur at the end of the file. The file pointer can be repositioned using fseek or rewind, but is always moved back to the end of the file before any write operation is carried out. Thus, existing data cannot be overwritten.
Replicating the differences on various systems with both Python 2.7.x and 3k are pretty straightforward with .open .tell
When dealing with anything through the OS, it's safer to take precautions like using an explicit .seek(0).
MODES
r+ read and write Starts at the beginning of the file
r read only Starts at the beginning of the file
a+ Read/Append. Preserves file content by writing to the end of the file
Good Luck!
Isabel Ruiz
in a py module, I write:
outFile = open(fileName, mode='w')
if A:
outFile.write(...)
if B:
outFile.write(...)
and in these lines, I didn't use flush or close method.
Then after these lines, I want to check whether this "outFile" object is empty or not. How can I do with it?
There are a few problems with your code.
You can't .write to a file that you opened with 'r'. You need to open(fileName, 'w').
If A or B then you've certainly written to the file, so it's not empty!
Barring those. you can get the length of a file with
os.stat(outFile.fileno())
EDIT: I'll explain what flush does. Python is often used to do quite large amounts of file reads and writes, which can be slow. It is thus tweaked to make them as fast as possible. One way that is does so is to "buffer" such writes and then do them all in one big block: when you write a small string, Python will remember it but won't actually write it to the file until it thinks it should.
This means that if you want to tell whether you have written data to the file by inspecting the file, you have to tell Python to write all the data it's remembering first, or else you might not see it. flush is the command to write all the buffered data.
Of course, if you ask Python whether it's written anything to the file, say by inspecting the position in the file (.tell()), then it will know about the buffering.
If you've already written to the file, you can use .tell() to check if the current file position is nonzero:
>>> handle = open('/tmp/file.txt', 'w')
>>> handle.write('foo')
>>> handle.tell()
3
This won't work if you .seek() back to the beginning of the file.
You can use os.stat to get file info:
import os
fileSize = os.stat(fileName).st_size
with open("filename.txt", "r+") as f:
if f.read():
# file isn't empty
f.write("something")
# uncomment this line if you want to delete everything else in the file
# f.truncate()
else:
# file is empty
f.write("somethingelse")
"r+" mode always you to read & write.
"with" will automatically close file