Ideal way to read, process then write a file in python - python

There are a lot of files, for each of them I need to read the text content, do some processing of the text, then write the text back (replacing the old content).
I know I can first open the files as rt to read and process the content, and then close and reopen them as wt, but obviously this is not a good way. Can I just open a file once to read and write? How?

See: http://docs.python.org/2/library/functions.html#open
The most commonly-used values of mode are 'r' for reading, 'w' for writing (truncating the file if it already exists), and 'a' for appending (which on some Unix systems means that all writes append to the end of the file regardless of the current seek position). If mode is omitted, it defaults to 'r'. The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.) See below for more possible values of mode.
Modes 'r+', 'w+' and 'a+' open the file for updating (note that 'w+' truncates the file). Append 'b' to the mode to open the file in binary mode, on systems that differentiate between binary and text files; on systems that don’t have this distinction, adding the 'b' has no effect.
So, you can open a file in mode r+, read from it, truncate, then write to the same file object. But you shouldn't do that.
You should open the file in read mode, write to a temporary file, then os.rename the temporary file to overwrite the original file. This way, your actions are atomic; if something goes wrong during the write step (for example, it gets interrupted), you don't end up having lost the original file, and having only partially written out your replacement text.

Check out the fileinput module. It lets you do what others are advising: back up the input file, manipulate its contents, and then write the altered data to the same place.
Optional in-place filtering: if the keyword argument inplace=True is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently). This makes it possible to write a filter that rewrites its input file in place.
Here's an example. Say I have a text file like:
1
2
3
4
I can do (Python 3):
import fileinput
file_path = r"C:\temp\fileinput_test.txt"
with fileinput.FileInput(files=[file_path], inplace=True) as input_data:
for line in input_data:
# Double the number on each line
s = str(int(line.strip()) * 2)
print(s)
And my file becomes:
2
4
6
8

You can use the 'r+' file mode to open a file for reading and writing at the same time.
example:
with open("file.txt", 'r+') as filehandle:
# can read and write to file here

well, you can choose the "r+w" mode, with which you need only open the file once

Related

replacing characters in existing large file

i have a large file in my local disk which contains some fixed length string in first line. I need to programmatically replace that fixed length string using python without reading whole file in memory .
i have tried opening the file in append mode and seeking to 0 position. And then replace the string which is of 9 bytes. The code is also added here , what i tried .
with open ("largefile.txt", 'a') as f:
f.seek(0,0)
f.write("123456789")
I think you just want to open the file for writing without truncating it, which would be r+. to make this reproducible, we first create a file that matches this format:
with open('many_lines.txt', 'w') as fd:
print('abcdefghi', file=fd)
for i in range(10000):
print(f'line {i:09}', file=fd)
then we basically do what you were doing, but with the correct mode:
with open('many_lines.txt', 'r+') as fd:
print('123456789', file=fd)
or you can use write directly, with:
with open('many_lines.txt', 'r+') as fd:
fd.write('123456789')
Note: I'm opening in r+ so that you'll get an FileNotFoundError if it doesn't exist (or the filename is misspelled) rather than just blindly creating a tiny file
The open modes are directly copied from the C/POSIX API for the fopen so your use of a will trigger behaviour that says:
Subsequent writes to the file will always end up at the then current end of file, irrespective of any intervening fseek(3) or similar

Is it possible to only use "wb" when opening and overwriting a file?

If I want to open a file, unpickle an object inside it, then overwrite it later, is it okay to just use
data = {} #Its a dictionary in my code
file = open("filename","wb")
data = pickle.load(file)
data["foo"] = "bar"
pickle.dump(data,file)
file.close()
Or would I have to use "rb" first and then use "wb" later (using with statements for each) which is what I am doing now. Note that in my program, there is a hashing algorithm in between opening the file and closing it, which is where the dictionary data comes from, and I basically want to be able to only open the file once without having to do two with statements
If you want to read, then write the file, do not use modes involving w at all; all of them truncate the file on opening it.
If the file is known to exist, use mode "rb+", which opens an existing file for both read and write.
Your code only needs to change a tiny bit:
# Open using with statement to ensure prompt/proper closing
with open("filename","rb+") as file:
data = pickle.load(file) # Load from file (moves file pointer to end of file)
data["foo"] = "bar"
file.seek(0) # Move file pointer back to beginning of file
pickle.dump(data, file) # Write new data over beginning of file
file.truncate() # If new dump is smaller, make sure to chop off excess data
You can use wb+ which opens the file for both reading and writing
This question is helpful for understanding the differences between each of pythons read and write conditions, but adding + at the end usually always opens the file for both read and write
Confused by python file mode "w+"

Python: Putting Charater in front or behind a file

I want to write a couple characters into a file where there is already text inside. What would be the code to add characters to the front of the file and to the back of the text file if I want the text that was initially in the file to remain in the center?
To add some text to the end of your file, simply open it in append mode and then write to it as usual.
open('file.txt', 'a')
If you want to add something to the beginning of the file, and you don't mind loading the contents of the file temporarily into memory.
addedText = 'Hello World!'
with open('file.txt', 'r+') as myFile:
filecontents = myFile.read()
myFile.seek(0,0)
f.write(addedText.rstrip('\r\n') + '\n' + filecontents)
When you want to open a file and keep its content you have to open the file in append mode. Also have a look at:
file.seek (can be used to set the files current position)
There is no function in any knows underlying file systems that allows to insert bytes into a file. You can only :
add bytes (characters) at the end of the file (append mode)
rewrite bytes in place anywhere in the file
truncate a file at current position.
So if you want to add anything not at the end of the file, the common way (that is used by many text editors) is :
rename the old file to a temp name (it is known as a backup copy)
create a new file with the original name and write what you want to it (here the prefix, the original content and the postfix)
(optionaly) delete the backup copy.
That way allows you to recover your file even if bad things occur while writing the new copy : you can at least get the previous copy and restart your edition.

What is the difference between opening and reading a file in python?

In python's OS module there is a method to open a file and a method to read a file.
The docs for the open method say:
Open the file file and set various flags according to flags and
possibly its mode according to mode. The default mode is 0777 (octal),
and the current umask value is first masked out. Return the file
descriptor for the newly opened file.
The docs for the read method say;
Read at most n bytes from file descriptor fd. Return a string
containing the bytes read. If the end of the file referred to by fd
has been reached, an empty string is returned.
I understand what it means to read n bytes from a file. But how does this differ from open?
"Opening" a file doesn't actually bring any of the data from the file into your program. It just prepares the file for reading (or writing), so when your program is ready to read the contents of the file it can do so right away.
Opening a file allows you to read or write to it (depending on the flag you pass as the second argument), whereas reading it actually pulls the data from a file that is typcially saved into a variable for processing or printed as output.
You do not always read from a file once it is opened. Opening also allows you to write to a file, either by overwriting all the contents or appending to the contents.
To read from a file:
>>> myfile = open('foo.txt', 'r')
>>> myfile.read()
First you open the file with read permission (r)
Then you read() from the file
To write to a file:
>>> myfile = open('foo.txt', 'r')
>>> myfile.write('I am writing to foo.txt')
The only thing that is being done in line 1 of each of these examples is opening the file. It is not until we actually read() from the file that anything is changed
open gets you a fd (file descriptor), you can read from that fd later.
One may also open a file for other purpose, say write to a file.
It seems to me you can read lines from the file handle without invoking the read method but I guess read() truly puts the data in the variable location. In my course we seem to be printing lines, counting lines, and adding numbers from lines without using read().
The rstrip() method needs to be used, however, because printing the line from the file handle using a for in statement also prints the invisible line break symbol at the end of the line, as does the print statement.
From Python for Everybody by Charles Severance, this is the starter code.
"""
7.2
Write a program that prompts for a file name,
then opens that file and reads through the file,
looking for lines of the form:
X-DSPAM-Confidence: 0.8475
Count these lines and extract the floating point
values from each of the lines and compute the
average of those values and produce an output as
shown below. Do not use the sum() function or a
variable named sum in your solution.
You can download the sample data at
http://www.py4e.com/code3/mbox-short.txt when you
are testing below enter mbox-short.txt as the file name.
"""
# Use the file name mbox-short.txt as the file name
fname = input("Enter file name: ")
fh = open(fname)
for line in fh:
if not line.startswith("X-DSPAM-Confidence:") :
continue
print(line)
print("Done")

What is the difference between rb and r+b modes in file objects [duplicate]

This question already has answers here:
Difference between modes a, a+, w, w+, and r+ in built-in open function?
(9 answers)
Closed last month.
I am using pickle module in Python and trying different file IO modes:
# works on windows.. "rb"
with open(pickle_f, 'rb') as fhand:
obj = pickle.load(fhand)
# works on linux.. "r"
with open(pickle_f, 'r') as fhand:
obj = pickle.load(fhand)
# works on both "r+b"
with open(pickle_f, 'r+b') as fhand:
obj = pickle.load(fhand)
I never read about "r+b" mode anywhere, but found mentioning about it in the documentation.
I am getting EOFError on Linux if I use "rb" mode and on Windows if "r" is used. I just gave "r+b" mode a shot and it's working on both.
What's "r+b" mode? What's the difference between "rb" and "r+b"? Why does it work when the others don't?
r+ is used for reading, and writing mode. b is for binary.
r+b mode is open the binary file in read or write mode.
You can read more here.
r opens for reading, whereas r+ opens for reading and writing. The b is for binary.
This is spelled out in the documentation:
The most commonly-used values of mode are 'r' for reading, 'w' for writing (truncating the file if it already exists), and 'a' for appending (which on some Unix systems means that all writes append to the end of the file regardless of the current seek position). If mode is omitted, it defaults to 'r'. The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.) See below for more possible values of mode.
Modes 'r+', 'w+' and 'a+' open the file for updating (note that 'w+' truncates the file). Append 'b' to the mode to open the file in binary mode, on systems that differentiate between binary and text files; on systems that don’t have this distinction, adding the 'b' has no effect.
My understanding is that adding r+ opens for both read and write (just like w+, though as pointed out in the comment, will truncate the file). The b just opens it in binary mode, which is supposed to be less aware of things like line separators (at least in C++).
On Windows, 'b' appended to the mode opens the file in binary mode, so
there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written. This behind-the-scenes modification to file data
is fine for ASCII text files, but it’ll corrupt binary data like that
in JPEG or EXE files. Be very careful to use binary mode when reading
and writing such files. On Unix, it doesn’t hurt to append a 'b' to
the mode, so you can use it platform-independently for all binary
files.
Source: Reading and Writing Files

Categories

Resources