Python - Write in text mode to file opened in binary mode - python

I am asking this out of curiosity.
What I am doing:
creating a temp file
writing data from a Pandas dataframe to it by using to_csv()
pushing the file to a FTP server
The temp file is opened in binary mode by default, but to_csv() writes in text mode by default (which I need, because I want the output encoded as UTF-8). So how can I write in text mode to a file opened in binary mode? I also need binary mode for the transfer to the FTP server.
What I did in detail:
I created a temp file like this:
fp = tempfile.NamedTemporaryFile(delete=False)
As I understand from the documentation, the file is opened in binary mode:
tempfile.NamedTemporaryFile(mode='w+b', buffering=-1, encoding=None, newline=None, suffix=None, prefix=None, dir=None, delete=True, *, errors=None)
Then I saved my dataframe to the temp file like this:
df.to_csv(fp.name)
fp.flush()
fp.seek(0)
Also, the to_csv() documentation states that a file object should be opened with newline='', which only works in text mode. So I couldn't set the newline argument on a file opened in binary mode.
path_or_buf : str or file handle, default None
File path or object, if None is provided the result is returned as a string. If a file object is passed it should be opened with newline='', disabling universal newlines.
Then I used the storbinary() method from the ftplib to push the temp file to the FTP server. As I understand from the documentation the method requires a binary file.
FTP.storbinary(cmd, fp, blocksize=8192, callback=None, rest=None)
Store a file in binary transfer mode. cmd should be an appropriate STOR command: "STOR filename". fp is a file object (opened in binary mode) which is read until EOF using its read() method in blocks of size blocksize to provide the data to be stored. The blocksize argument defaults to 8192. callback is an optional single parameter callable that is called on each block of data after it is sent. rest means the same thing as in the transfercmd() method.
For completeness I afterwards closed and deleted the file like this:
fp.close()
os.unlink(fp.name)
I thought about opening the temp file in w+t mode so that it matches to_csv(), which recommends opening the file with newline='' (only possible in text mode). I also need to specify UTF-8 as the encoding of the CSV file, which only works in text mode. But ftplib's storbinary() method requires a file opened in binary mode (as does storlines()), so that doesn't fit.
So I opened the file in binary mode, wrote to it in text mode and transferred it using binary mode. Everything works and the result looks like I want it to but I am a bit confused if I am doing it the right way. How does writing in text mode to a file opened in binary mode work? I kind of assumed I would have to open the file in text mode in order to write in text mode to it using to_csv().
If anyone has a deeper knowledge about this and could clear up my confusion I would be very grateful. I don't like doing things not knowing why they work or if they should work haha.
Thanks!

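This is quite a broad question, so just briefly: this is mostly about line endings. Besides encoding (in Python 3, text mode takes str and encodes it, while binary mode takes bytes), that is basically the only distinction between the binary and text modes.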
If you "open" a file in the binary mode, all data are written exactly as they are. If you open a file in the text mode, newlines (\n) are converted according to the newline parameter.
I do not think that Pandas needs the file to be opened in text mode. If you open the file in binary mode, then whatever Pandas writes will end up physically in the file. See the line_terminator parameter of DataFrame.to_csv.
It's mostly the same with FTP. If you use storbinary, the file will be uploaded as is. If you use storlines, you let the FTP server convert the line endings.
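A minimal sketch of what the question's code effectively does, using only the standard library. It relies on the fact that `df.to_csv(fp.name)` receives a path rather than the file object, so the writer opens its own text-mode handle on the same file. The plain `open()` call below is a stand-in for pandas, and the file contents are made up for illustration.

```python
import os
import tempfile

# Binary-mode temp file, as in the question (delete=False so it can be reopened by name).
fp = tempfile.NamedTemporaryFile(delete=False)
try:
    # Stand-in for df.to_csv(fp.name): a second, text-mode handle on the same path.
    # newline="" disables newline translation, so "\r\n" is written unchanged.
    with open(fp.name, "w", encoding="utf-8", newline="") as text_handle:
        text_handle.write("name\r\ncaf\u00e9\r\n")

    # The original binary handle sees the encoded bytes exactly as stored on disk.
    fp.seek(0)
    payload = fp.read()
finally:
    fp.close()
    os.unlink(fp.name)

print(payload)
```

The binary handle never does any text handling itself; it only reads back the bytes that the text-mode handle encoded, which is what a method like storbinary() would then upload.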

Related

Does the open Function in Python Execute an EXE?

I wonder whether open(file_name, "rb") as binary_file: pass actually executes a file if it's an exe. I am asking because I am reading some malicious files and viruses stored as ".exe" files using Python.
No, it doesn't, and the flags 'rb' in your open statement stand for read binary. So it only reads the file and puts its contents in a bytes-like object. Not only does it not execute the file (that is not a function of open), the file is also only opened in read mode.
You can read about the open function in the documentation.
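As a small illustration, here is a sketch that reads the first bytes of a file in 'rb' mode. The helper name `read_magic` and the sample file are hypothetical; the sample is a harmless stand-in that merely starts with the `MZ` signature found at the start of Windows executables.

```python
import os
import tempfile

def read_magic(path, count=2):
    """Return the first `count` bytes of a file; nothing is executed."""
    with open(path, "rb") as binary_file:
        return binary_file.read(count)

# Create a harmless stand-in file that only looks like an .exe header.
fd, sample_path = tempfile.mkstemp(suffix=".exe")
try:
    with os.fdopen(fd, "wb") as sample:
        sample.write(b"MZ" + b"\x00" * 62)
    magic = read_magic(sample_path)
finally:
    os.unlink(sample_path)

print(magic)
```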

Is there a better way to handle writing csv in text mode and reading in binary mode?

I have code that looks something like this:
import csv
import os
import tempfile
from azure.storage import CloudStorageAccount
account = CloudStorageAccount(
    account_name=os.environ['AZURE_ACCOUNT'],
    account_key=os.environ['AZURE_ACCOUNT_KEY'],
)
service = account.create_block_blob_service()
with tempfile.NamedTemporaryFile(mode='w') as f:
    writer = csv.DictWriter(f, fieldnames=['foo', 'bar'])
    writer.writerow({'foo': 'just an example', 'bar': 'of what I do'})
    with open(f.name, 'rb') as stream:
        service.create_blob_from_stream(
            container_name='test',
            blob_name='nothing_secret.txt',
            stream=stream,
        )
Now, this is ugly. I don't like having to open the file twice. I know that the Azure API provides a way to upload text and binary, but my file has the potential to be several hundred MB large so I'm not too interested in sticking the whole thing in memory at a time (not that it would be the end of the world, but still).
Azure doesn't support uploading a file in text mode (that I can see), and csv doesn't seem to support writing to a binary file (at least not text data).
Is there a way that I can have two handles to the same file, one in binary and one in text mode? Of course I could write my own file wrapper, but I'd prefer to use something I don't have to maintain. Is there a better way to do this than what I've got?
Files opened in text mode have a buffer attribute. This is the same kind of object you would get by opening the file in binary mode; text mode is just a wrapper on top of it.
Open your file in text mode, use it for writing, then seek back to the start and use the buffer for uploading. Make sure you use a + mode so you can read and write through the same handle.
with tempfile.NamedTemporaryFile(mode='w+') as f:
    ...
    f.seek(0)
    service.create_blob_from_stream(
        ...
        stream=f.buffer,
    )
You can go the other way too, by opening in binary mode and then wrapping the file with io.TextIOWrapper(f).
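The reverse direction might look like the sketch below: open the temp file in binary mode, wrap it for the csv writer, then hand the binary handle to whatever consumes bytes. The function name `csv_bytes_via_wrapper` is hypothetical, and the final `read()` stands in for an upload call such as the one in the question.

```python
import csv
import io
import tempfile

def csv_bytes_via_wrapper(rows, fieldnames):
    """Write CSV text through a wrapper, then read raw bytes from the same file."""
    with tempfile.TemporaryFile(mode="w+b") as raw:
        # newline="" lets the csv module control line endings itself.
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        writer = csv.DictWriter(text, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
        text.flush()   # push buffered text down into the binary file
        text.detach()  # release the wrapper without closing `raw`
        raw.seek(0)
        return raw.read()  # stand-in for passing `raw` to an uploader

data = csv_bytes_via_wrapper([{"foo": 1, "bar": 2}], ["foo", "bar"])
print(data)
```

Calling `detach()` is a design choice here: it keeps the wrapper from closing the underlying file when it is discarded, so the binary handle stays usable.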

How to ensure a file is closed for writing in Python?

The issue described here looked initially like it was solvable by just having the spreadsheet closed in Excel before running the program.
It transpires, however, that having Excel closed is a necessary, but not sufficient, condition. The issue still occurs, but not on every Windows machine, and not every time (sometimes it occurs after a single execution, sometimes two).
I've modified the program such that it now reads from one spreadsheet and writes to a different one, still the issue presents itself. I even go on to programmatically kill any lingering Python processes before running the program. Still no joy.
The openpyxl save() function instantiates ZipFile thus:
archive = ZipFile(filename, 'w', ZIP_DEFLATED, allowZip64=True)
... with Zipfile then using that to attempt to open the file in mode 'wb' thus:
if isinstance(file, basestring):
    self._filePassed = 0
    self.filename = file
    modeDict = {'r' : 'rb', 'w': 'wb', 'a' : 'r+b'}
    try:
        self.fp = open(file, modeDict[mode])
    except IOError:
        if mode == 'a':
            mode = key = 'w'
            self.fp = open(file, modeDict[mode])
        else:
            raise
According to the docs:
On Windows, 'b' appended to the mode opens the file in binary mode, so
there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written. This behind-the-scenes modification to file data
is fine for ASCII text files, but it’ll corrupt binary data like that
in JPEG or EXE files. Be very careful to use binary mode when reading
and writing such files. On Unix, it doesn’t hurt to append a 'b' to
the mode, so you can use it platform-independently for all binary
files.
... which explains why mode 'wb' must be used.
Is there something in Python file opening that could possibly leave the file in some state of "openness"?
Windows: 8
Python: 2.7.10
openpyxl: latest
Two suggestions:
First, use with so that the file is closed correctly.
with open("some.xls", "wb") as excel_file:
    pass  # Do something
At the end of the with block the file is closed automatically.
You can also make a copy of the file and work on the copied file.
import shutil
shutil.copyfile(src, dst)
https://docs.python.org/2/library/shutil.html#shutil.copyfile
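The first suggestion can be checked with a short sketch: a `with` block closes the file even when the body raises. The file name and the simulated error are made up for illustration, and this says nothing about handles held by other processes such as Excel.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "some.xls")  # hypothetical file name
    try:
        with open(path, "wb") as excel_file:
            excel_file.write(b"partial data")
            raise RuntimeError("simulated failure while writing")
    except RuntimeError:
        pass
    # The handle is closed although the block exited through an exception.
    closed_after_error = excel_file.closed

print(closed_after_error)
```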

Ideal way to read, process then write a file in python

There are a lot of files, for each of them I need to read the text content, do some processing of the text, then write the text back (replacing the old content).
I know I can first open the files as rt to read and process the content, and then close and reopen them as wt, but obviously this is not a good way. Can I just open a file once to read and write? How?
See: http://docs.python.org/2/library/functions.html#open
The most commonly-used values of mode are 'r' for reading, 'w' for writing (truncating the file if it already exists), and 'a' for appending (which on some Unix systems means that all writes append to the end of the file regardless of the current seek position). If mode is omitted, it defaults to 'r'. The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.) See below for more possible values of mode.
Modes 'r+', 'w+' and 'a+' open the file for updating (note that 'w+' truncates the file). Append 'b' to the mode to open the file in binary mode, on systems that differentiate between binary and text files; on systems that don’t have this distinction, adding the 'b' has no effect.
So, you can open a file in mode r+, read from it, truncate, then write to the same file object. But you shouldn't do that.
You should open the file in read mode, write to a temporary file, then os.rename the temporary file to overwrite the original file. This way, your actions are atomic; if something goes wrong during the write step (for example, it gets interrupted), you don't end up having lost the original file, and having only partially written out your replacement text.
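A minimal sketch of that approach, assuming Python 3. It uses `os.replace` instead of the `os.rename` named above, because `os.rename` fails on Windows when the destination already exists. The helper name `rewrite_file` and the sample content are hypothetical.

```python
import os
import tempfile

def rewrite_file(path, transform):
    """Replace the contents of `path` with transform(old_text) via a temp file."""
    with open(path, "r", encoding="utf-8") as src:
        new_text = transform(src.read())

    # Create the temp file in the same directory so the final swap stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_text)
        os.replace(tmp_path, path)  # swap in the new file in one step
    except BaseException:
        os.unlink(tmp_path)  # the original file is left untouched
        raise

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "notes.txt")
    with open(target, "w", encoding="utf-8") as f:
        f.write("hello\n")
    rewrite_file(target, str.upper)
    with open(target, encoding="utf-8") as f:
        result = f.read()

print(result)
```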
Check out the fileinput module. It lets you do what others are advising: back up the input file, manipulate its contents, and then write the altered data to the same place.
Optional in-place filtering: if the keyword argument inplace=True is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently). This makes it possible to write a filter that rewrites its input file in place.
Here's an example. Say I have a text file like:
1
2
3
4
I can do (Python 3):
import fileinput
file_path = r"C:\temp\fileinput_test.txt"
with fileinput.FileInput(files=[file_path], inplace=True) as input_data:
    for line in input_data:
        # Double the number on each line
        s = str(int(line.strip()) * 2)
        print(s)
And my file becomes:
2
4
6
8
You can use the 'r+' file mode to open a file for reading and writing at the same time.
example:
with open("file.txt", 'r+') as filehandle:
    # can read and write to file here
    pass
Well, you can use the "r+" mode ("r+w" is not a valid mode in Python 3), with which you only need to open the file once.
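For reference, a sketch of the single-handle approach with 'r+': read, seek back to the start, write, then truncate so that leftovers from a longer original do not survive. The file name and contents are made up, and this variant is not atomic the way the temp-file approach is.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.txt")
    with open(path, "w") as f:
        f.write("100\n200\n")

    with open(path, "r+") as filehandle:
        numbers = filehandle.read().splitlines()
        halved = "".join(f"{int(n) // 2}\n" for n in numbers)
        filehandle.seek(0)     # go back to the start before overwriting
        filehandle.write(halved)
        filehandle.truncate()  # drop the tail of the old, longer content

    with open(path) as f:
        result = f.read()

print(result)
```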

What is the difference between rb and r+b modes in file objects [duplicate]

I am using pickle module in Python and trying different file IO modes:
# works on windows.. "rb"
with open(pickle_f, 'rb') as fhand:
    obj = pickle.load(fhand)
# works on linux.. "r"
with open(pickle_f, 'r') as fhand:
    obj = pickle.load(fhand)
# works on both "r+b"
with open(pickle_f, 'r+b') as fhand:
    obj = pickle.load(fhand)
I had never read about "r+b" mode anywhere, but found it mentioned in the documentation.
I am getting EOFError on Linux if I use "rb" mode and on Windows if "r" is used. I just gave "r+b" mode a shot and it's working on both.
What's "r+b" mode? What's the difference between "rb" and "r+b"? Why does it work when the others don't?
r+ opens the file for both reading and writing. b is for binary.
r+b mode opens the file in binary mode for both reading and writing.
You can read more here.
r opens for reading, whereas r+ opens for reading and writing. The b is for binary.
This is spelled out in the documentation:
The most commonly-used values of mode are 'r' for reading, 'w' for writing (truncating the file if it already exists), and 'a' for appending (which on some Unix systems means that all writes append to the end of the file regardless of the current seek position). If mode is omitted, it defaults to 'r'. The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.) See below for more possible values of mode.
Modes 'r+', 'w+' and 'a+' open the file for updating (note that 'w+' truncates the file). Append 'b' to the mode to open the file in binary mode, on systems that differentiate between binary and text files; on systems that don’t have this distinction, adding the 'b' has no effect.
My understanding is that r+ opens the file for both reading and writing (just like w+, though, as pointed out in the comments, w+ truncates the file). The b just opens it in binary mode, which is supposed to be less aware of things like line separators (at least in C++).
On Windows, 'b' appended to the mode opens the file in binary mode, so
there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written. This behind-the-scenes modification to file data
is fine for ASCII text files, but it’ll corrupt binary data like that
in JPEG or EXE files. Be very careful to use binary mode when reading
and writing such files. On Unix, it doesn’t hurt to append a 'b' to
the mode, so you can use it platform-independently for all binary
files.
Source: Reading and Writing Files
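The difference can be seen in a short sketch, assuming Python 3 and a pickle written in binary mode: both modes load the object, and only the r+b handle is also writable. The pickled object and file name are made up for illustration.

```python
import os
import pickle
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_f = os.path.join(tmp_dir, "obj.pickle")
    with open(pickle_f, "wb") as fhand:
        pickle.dump({"answer": 42}, fhand)

    # 'rb': binary, read-only.
    with open(pickle_f, "rb") as fhand:
        rb_writable = fhand.writable()
        obj_rb = pickle.load(fhand)

    # 'r+b': binary, reading and writing, without truncating the file.
    with open(pickle_f, "r+b") as fhand:
        rplusb_writable = fhand.writable()
        obj_rplusb = pickle.load(fhand)

print(rb_writable, rplusb_writable, obj_rb == obj_rplusb)
```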
