Question:
What is the difference between open(<name>, "w", encoding=<encoding>) and open(<name>, "wb") + str.encode(<encoding>)? They seem to (sometimes) produce different outputs.
Context:
While using PyFPDF (version 1.7.2), I subclassed the FPDF class, and, among other things, added my own output method (taking pathlib.Path objects). While looking at the source of the original FPDF.output() method, I noticed almost all of it is argument parsing - the only relevant bits are
#Finish document if necessary
if(self.state < 3):
self.close()
[...]
f=open(name,'wb')
if(not f):
self.error('Unable to create output file: '+name)
if PY3K:
# manage binary data as latin1 until PEP461 or similar is implemented
f.write(self.buffer.encode("latin1"))
else:
f.write(self.buffer)
f.close()
Seeing that, my own Implementation looked like this:
def write_file(self, file: Path) -> None:
if self.state < 3:
# See FPDF.output()
self.close()
file.write_text(self.buffer, "latin1", "strict")
This seemed to work - a .pdf file was created at the specified path, and chrome opened it. But it was completely blank, even tho I added Images and Text. After hours of experimenting, I finally found a Version that worked (produced a non empty pdf file):
def write_file(self, file: Path) -> None:
if self.state < 3:
# See FPDF.output()
self.close()
# using .write_text(self.buffer, "latin1", "strict") DOES NOT WORK AND I DON'T KNOW WHY
file.write_bytes(self.buffer.encode("latin1", "strict"))
Looking at the pathlib.Path source, it uses io.open for Path.write_text(). As all of this is Python 3.8, io.open and the buildin open() are the same.
Note:
FPDF.buffer is of type str, but holds binary data (a pdf file). Probably because the Library was originally written for Python 2.
Both should be the same (with minor differences).
I like open way, because it is explicit and shorter, OTOH if you want to handle encoding errors (e.g. a way better error to user), one should use decode/encode (maybe after a '\n'.split(s), and keeping line numbers)
Note: if you use the first method (open), you should just use r or w, so without b. For your question's title, it seems you did correct, but check that your example keep b, and probably for this, it used encoding. OTOH the code seems old, and I think the ".encoding" was just done because it would be more natural in Python2 mindset.
Note: I would also replace strict to backslashreplace for debugging. And possibly you may want to check and print (maybe just ord) of the first few characters of the self.buffer on both methods, to see if there are substantial differences before file.write.
I would add a file.flush() on both functions. This is one of the differences: buffering is different, and I'll make sure I close the file. Python will do it, but when debugging, it is important to see the content of the file as quick as possible (and also after an exception). Garbage collector could not guarantee all of this. Maybe you are reading a text file which was not yet flushed.
Aaaand found it: Path.write_bytes() will save the bytes object as is, and str.encoding doesn't touch the line endings.
Path.write_text() will encode the bytes object just like str.encode(), BUT: because the file is opened in text mode, the line endings will be normalized after encoding - in my case converting \n to \r\n because I'm on Windows. And pdfs have to use \n, no matter what platform your on.
Related
I've used Hex-rays IDA to find the bytes of code I need changed in a windows executable. I would like to write a python script that will programmatically edit those bytes.
I know the address (as given in hex-rays IDA) and I know the hexadecimal I wish to overwrite it with. How do I do this in python? I'm sure there is a simple answer, but I can't find it.
(For example: address = 0x00436411, and new hexadecimal = 0xFA)
You just need to open the executable as a file, for writing, in binary mode; then seek to the position you want to write; then write. So:
with open(path, 'r+b') as f:
f.seek(position)
f.write(new_bytes)
If you're going to be changing a lot of bytes, you may find it simpler to use mmap, which lets you treat the file as a giant list:
with open(path, 'r+b') as f:
with contextlib.closing(mmap.mmap(f.fileno(), access=mmap.ACCESS_WRITE)) as m:
m[first_position] = first_new_byte
m[other_position] = other_new_byte
# ...
If you're trying to write multi-byte values (e.g., a 32-bit int), you probably want to use the struct module.
If what you know is an address in memory at runtime, rather than a file position, you have to be able to map that to the right place in the executable file. That may not even be possible (e.g., a memory-mapped region). But if it is, you should be able to find out from the debugger where it's mapped. From inside a debugger, this is easy; from outside, you need to parse the PE header structures and do a lot of complicated logic, and there is no reason to do that.
I believe when using hex-ray IDA as a static disassembler, with all the default settings, the addresses it gives you are the addresses where the code and data segments will be mapped into memory if they aren't forced to relocate. Those are, obviously, not offsets into the file.
It appears that a write() immediately following a read() on a file opened with r+ (or r+b) permissions in Windows doesn't update the file.
Assume there is a file testfile.txt in the current directory with the following contents:
This is a test file.
I execute the following code:
with open("testfile.txt", "r+b") as fd:
print fd.read(4)
fd.write("----")
I would expect the code to print This and update the file contents to this:
This----a test file.
This works fine on at least Linux. However, when I run it on Windows then the message is displayed correctly, but the file isn't altered - it's like the write() is being ignored. If I call tell() on the filehandle it shows that the position has been updated (it's 4 before the write() and 8 afterwards), but no change to the file.
However, if I put an explicit fd.seek(4) just before the write() line then everything works as I'd expect.
Does anybody know the reason for this behaviour under Windows?
For reference I'm using Python 2.7.3 on Windows 7 with an NTFS partition.
EDIT
In response to comments, I tried both r+b and rb+ - the official Python docs seem to imply the former is canonical.
I put calls to fd.flush() in various places, and placing one between the read() and the write() like this:
with open("testfile.txt", "r+b") as fd:
print fd.read(4)
fd.flush()
fd.write("----")
... yields the following interesting error:
IOError: [Errno 0] Error
EDIT 2
Indirectly that addition of a flush() helped because it lead me to this post describing a similar problem. If one of the commenters on it is correct, it's a bug in the underlying Windows C library.
Python's file operation should follow the libc convention as internally its implemented using C file IO functions.
Quoting from fopen man page or fopen page in cplusplus
For files open for appending (those which include a "+" sign), on
which both input and output operations are allowed, the stream should
be flushed (fflush) or repositioned (fseek, fsetpos, rewind) between
either a writing operation followed by a reading operation or a
reading operation which did not reach the end-of-file followed by a
writing operation.
SO to summarize, if you need to read a file after writing, you need to fflush the buffer and a write operation after read should be preceded by a fseek, as fd.seek(0, os.SEEK_CUR)
So just change your code snippet to
with open("test1.txt", "r+b") as fd:
print fd.read(4)
fd.seek(0, os.SEEK_CUR)
fd.write("----")
The behavior is consistent with how a similar C program would behave
#include <cstdio>
int main()
{
char buffer[5] = {0};
FILE *fp = fopen("D:\\Temp\\test1.txt","rb+");
fread(buffer, sizeof(char), 4, fp);
printf("%s\n", buffer);
/*without fseek, file would not be updated*/
fseek(fp, 0, SEEK_CUR);
fwrite("----",sizeof(char), 4, fp);
fclose(fp);
return 0;
}
It appears that this due to the behaviour of the underlying Windows libraries (which personally I regard to be in error) and nothing wrong with Python. On adding a flush() call between reading and writing (which is apparently good practice) I got an IOError with a zero errno, which is the same issue as discussed in this blog post.
From that post I found this Python issue which mentions the problem and says that the seek() call is actually the best workaround, along with a flush() every time you change from reading to writing.
All that taken into account, it seems the best way to write the code above such that it successfully runs on Windows is:
with open("testfile.txt", "r+b") as fd:
print fd.read(4)
fd.flush()
fd.seek(4)
fd.write("----")
Might be something to bear in mind for anybody attempting to write portable code.
have you tried flushing ?
fd.flush()
it is OS-dependant, as write uses the filesystem caching mechanism
Is it possible that the implementation missinterpretest "r+b"? Afaik "rb+" is for reading and writing in binary.
I've done a fair amount of lurking on SO and a fair amount of searching and reading, but I must also confess to being a relative noob at programming in general. I am trying to learn as I go, and so I have been playing with Python's NLTK. In the script below, I can get everything to work, except it only writes what would be the first screen of a multi-screen output, at least that's how I am thinking about it.
Here's the script:
#! /usr/bin/env python
import nltk
# First we have to open and read the file:
thefile = open('all_no_id.txt')
raw = thefile.read()
# Second we have to process it with nltk functions to do what we want
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
# Now we can actually do stuff with it:
concord = text.concordance("cultural")
# Now to save this to a file
fileconcord = open('ccord-cultural.txt', 'w')
fileconcord.writelines(concord)
fileconcord.close()
And here's the beginning of the output file:
Building index...
Displaying 25 of 530 matches:
y . The Baobab Tree : Stories of Cultural Continuity The continuity evident
regardless of ethnicity , and the cultural legacy of Africa as well . This Af
What am I missing here to get the entire 530 matches written to the file?
text.concordance(self, word, width=79, lines=25) seem to have other parameters as per manual.
I see no way to extract the size of concordance index, however, the concordance printing code seem to have this part: lines = min(lines, len(offsets)), therefore you can simply pass sys.maxint as a last argument:
concord = text.concordance("cultural", 75, sys.maxint)
Added:
Looking at you original code now, I can't see a way it could work before. text.concordance does not return anything, but outputs everything to stdout using print. Therefore, the easy option would be redirection stdout to you file, like this:
import sys
....
# Open the file
fileconcord = open('ccord-cultural.txt', 'w')
# Save old stdout stream
tmpout = sys.stdout
# Redirect all "print" calls to that file
sys.stdout = fileconcord
# Init the method
text.concordance("cultural", 200, sys.maxint)
# Close file
fileconcord.close()
# Reset stdout in case you need something else to print
sys.stdout = tmpout
Another option would be to use the respective classes directly and omit the Text wrapper. Just copy bits from here and combine them with bits from here and you are done.
Update:
I found this write text.concordance output to a file Options
from the ntlk usergroup. It's from 2010, and states:
Documentation for the Text class says: "is intended to support
initial exploration of texts (via the interactive console). ... If you
wish to write a program which makes use of these analyses, then you
should bypass the Text class, and use the appropriate analysis
function or class directly instead."
If nothing has changed in the package since then, this may be the source of your problem.
--- previously ---
I don't see a problem with writing to the file using writelines():
file.writelines(sequence)
Write a sequence of strings to the file. The sequence can be any
iterable object producing strings, typically a list of strings. There
is no return value. (The name is intended to match readlines();
writelines() does not add line separators.)
Note the italicized part, did you examine the output file in different editors? Perhaps the data is there, but not being rendered correctly due to missing end of line seperators?
Are you sure this part is generating the data you want to output?
concord = text.concordance("cultural")
I'm not familiar with nltk, so I'm just asking as part of eliminating possible sources for the problem.
In my django application, I provide a form that allows users to upload a file. The file can be in a variety of formats (Excel, CSV), come from a variety of platforms (Mac, Linux, Windows), and be encoded in a variety of encodings (ASCII, UTF-8).
For the purpose of this question, let's assume that I have a view which is receiving request.FILES['file'], which is an instance of InMemoryUploadedFile, called file. My problem is that InMemoryUploadedFile objects (like file):
Do not support UTF-8 encoding (I see a \xef\xbb\xbf at the beginning of the file, which as I understand is a flag meaning 'this file is UTF-8').
Do not support universal newlines (which probably the majority of the files uploaded to this system will need).
Complicating the issue is that I wish to pass the file in to the python csv module, which does not natively support Unicode. I will happily accept answers that avoid this issue - once I get django playing nice with UTF-8 I'm sure I can bludgeon csv into doing the same. (Similarly, please ignore the requirement to support Excel - I am waiting until CSV works before I tackle parsing Excel files.)
I have tried using StringIO,mmap,codec, and any of a wide variety of ways of accessing the data in an InMemoryUploadedFile object. Each approach has yielded differing errors, none so far have been perfect. This shows some of the code that I feel came the closest:
import csv
import codecs
class CSVParser:
def __init__(self,file):
# 'file' is assumed to be an InMemoryUploadedFile object.
dialect = csv.Sniffer().sniff(codecs.EncodedFile(file,"utf-8").read(1024))
file.open() # seek to 0
self.reader = csv.reader(codecs.EncodedFile(file,"utf-8"),
dialect=dialect)
try:
self.field_names = self.reader.next()
except StopIteration:
# The file was empty - this is not allowed.
raise ValueError('Unrecognized format (empty file)')
if len(self.field_names) <= 1:
# This probably isn't a CSV file at all.
# Note that the csv module will (incorrectly) parse ALL files, even
# binary data. This will catch most such files.
raise ValueError('Unrecognized format (too few columns)')
# Additional methods snipped, unrelated to issue
Please note that I haven't spent too much time on the actual parsing algorithm so it may be wildly inefficient, right now I'm more concerned with getting encoding to work as expected.
The problem is that the results are also not encoded, despite being wrapped in the Unicode codecs.EncodedFile file wrapper.
EDIT: It turns out, the above code does in fact work. codecs.EncodedFile(file,"utf-8") is the ticket. It turns out the reason I thought it didn't work was that the terminal I was using does not support UTF-8. Live and learn!
As mentioned above, the code snippet I provided was in fact working as intended - the problem was with my terminal, and not with python encoding.
If your view needs to access a UTF-8 UploadedFile, you can just use utf8_file = codecs.EncodedFile(request.FILES['file_field'],"utf-8") to open a file object in the correct encoding.
I also noticed that, at least for InMemoryUploadedFiles, opening the file through the codecs.EncodedFile wrapper does NOT reset the seek() position of the file descriptor. To return to the beginning of the file (again, this may be InMemoryUploadedFile specific) I just used request.FILES['file_field'].open() to send the seek() position back to 0.
I use the csv.DictReader and it appears to be working well. I attached my code snippet, but it is basically the same as another answer here.
import csv as csv_mod
import codecs
file = request.FILES['file']
dialect = csv_mod.Sniffer().sniff(codecs.EncodedFile(file,"utf-8").read(1024))
file.open()
csv = csv_mod.DictReader( codecs.EncodedFile(file,"utf-8"), dialect=dialect )
For CSV and Excel upload to django, this site may help.
I'm trying to automate downloading of some text files from a z/os PDS, using Python and ftplib.
Since the host files are EBCDIC, I can't simply use FTP.retrbinary().
FTP.retrlines(), when used with open(file,w).writelines as its callback, doesn't, of course, provide EOLs.
So, for starters, I've come up with this piece of code which "looks OK to me", but as I'm a relative Python noob, can anyone suggest a better approach? Obviously, to keep this question simple, this isn't the final, bells-and-whistles thing.
Many thanks.
#!python.exe
from ftplib import FTP
class xfile (file):
def writelineswitheol(self, sequence):
for s in sequence:
self.write(s+"\r\n")
sess = FTP("zos.server.to.be", "myid", "mypassword")
sess.sendcmd("site sbd=(IBM-1047,ISO8859-1)")
sess.cwd("'FOO.BAR.PDS'")
a = sess.nlst("RTB*")
for i in a:
sess.retrlines("RETR "+i, xfile(i, 'w').writelineswitheol)
sess.quit()
Update: Python 3.0, platform is MingW under Windows XP.
z/os PDSs have a fixed record structure, rather than relying on line endings as record separators. However, the z/os FTP server, when transmitting in text mode, provides the record endings, which retrlines() strips off.
Closing update:
Here's my revised solution, which will be the basis for ongoing development (removing built-in passwords, for example):
import ftplib
import os
from sys import exc_info
sess = ftplib.FTP("undisclosed.server.com", "userid", "password")
sess.sendcmd("site sbd=(IBM-1047,ISO8859-1)")
for dir in ["ASM", "ASML", "ASMM", "C", "CPP", "DLLA", "DLLC", "DLMC", "GEN", "HDR", "MAC"]:
sess.cwd("'ZLTALM.PREP.%s'" % dir)
try:
filelist = sess.nlst()
except ftplib.error_perm as x:
if (x.args[0][:3] != '550'):
raise
else:
try:
os.mkdir(dir)
except:
continue
for hostfile in filelist:
lines = []
sess.retrlines("RETR "+hostfile, lines.append)
pcfile = open("%s/%s"% (dir,hostfile), 'w')
for line in lines:
pcfile.write(line+"\n")
pcfile.close()
print ("Done: " + dir)
sess.quit()
My thanks to both John and Vinay
Just came across this question as I was trying to figure out how to recursively download datasets from z/OS. I've been using a simple python script for years now to download ebcdic files from the mainframe. It effectively just does this:
def writeline(line):
file.write(line + "\n")
file = open(filename, "w")
ftp.retrlines("retr " + filename, writeline)
You should be able to download the file as a binary (using retrbinary) and use the codecs module to convert from EBCDIC to whatever output encoding you want. You should know the specific EBCDIC code page being used on the z/OS system (e.g. cp500). If the files are small, you could even do something like (for a conversion to UTF-8):
file = open(ebcdic_filename, "rb")
data = file.read()
converted = data.decode("cp500").encode("utf8")
file = open(utf8_filename, "wb")
file.write(converted)
file.close()
Update: If you need to use retrlines to get the lines and your lines are coming back in the correct encoding, your approach will not work, because the callback is called once for each line. So in the callback, sequence will be the line, and your for loop will write individual characters in the line to the output, each on its own line. So you probably want to do self.write(sequence + "\r\n") rather than the for loop. It still doesn' feel especially right to subclass file just to add this utility method, though - it probably needs to be in a different class in your bells-and-whistles version.
Your writelineswitheol method appends '\r\n' instead of '\n' and then writes the result to a file opened in text mode. The effect, no matter what platform you are running on, will be an unwanted '\r'. Just append '\n' and you will get the appropriate line ending.
Proper error handling should not be relegated to a "bells and whistles" version. You should set up your callback so that your file open() is in a try/except and retains a reference to the output file handle, your write call is in a try/except, and you have a callback_obj.close() method which you use when retrlines() returns to explicitly file_handle.close() (in a try/except) -- that way you get explict error handling e.g. messages "can't (open|write to|close) file X because Y" AND you save having to think about when your files are going to be implicitly closed and whether you risk running out of file handles.
Python 3.x ftplib.FTP.retrlines() should give you str objects which are in effect Unicode strings, and you will need to encode them before you write them -- unless the default encoding is latin1 which would be rather unusual for a Windows box. You should have test files with (1) all possible 256 bytes (2) all bytes that are valid in the expected EBCDIC codepage.
[a few "sanitation" remarks]
You should consider upgrading your Python from 3.0 (a "proof of concept" release) to 3.1.
To facilitate better understanding of your code, use "i" as an identifier only as a sequence index and only if you irredeemably acquired the habit from FORTRAN 3 or more decades ago :-)
Two of the problems discovered so far (appending line terminator to each character, wrong line terminator) would have shown up the first time you tested it.
Use retrlines of ftplib to download file from z/os, each line has no '\n'.
It's different from windows ftp command 'get xxx'.
We can rewrite the function 'retrlines' to 'retrlines_zos' in ftplib.py.
Just copy the whole code of retrlines, and chane the 'callback' line to:
...
callback(line + "\n")
...
I tested and it worked.
you want a lambda function and a callback. Like so:
def writeLineCallback(line, file):
file.write(line + "\n")
ftpcommand = "RETR {}{}{}".format("'",zOsFile,"'")
filename = "newfilename"
with open( filename, 'w' ) as file :
callback_lambda = lambda x: writeLineCallback(x,file)
ftp.retrlines(ftpcommand, callback_lambda)
This will download file 'zOsFile' and write it to 'newfilename'