I'm sending a text file with a string in a python script via POST to my server:
import requests

fo = open('data.txt', 'a')
fo.write("hi, this is my testing data")
fo.close()

with open('data.txt', 'rb') as f:
    r = requests.post("http://XXX.XX.X.X", data={'data.txt': f})
And receiving and handling it here in my server handler script, built off an example found online:
def do_POST(self):
    data = self.rfile.read(int(self.headers.getheader('Content-Length')))
    empty = [data]
    with open('processing.txt', 'wb') as file:
        for item in empty:
            file.write("%s\n" % item)
    self._set_headers()
    self.wfile.write("<html><body><h1>POST!</h1></body></html>")
My question is, how does:
self.rfile.read(int(self.headers.getheader('Content-Length')))
take the length of my data (an integer, # of bytes/characters) and read my file? I am confused how it knows what my data contains. What is going on behind the scenes with HTTP?
It outputs data.txt=hi%2C+this+is+my+testing+data
to my processing.txt, but I am expecting "hi this is my testing data"
I tried and failed to find documentation for what exactly rfile.read() does; if simply finding that answers my question, I'd appreciate a pointer and could just delete this question.
Your client code snippet reads contents from the file data.txt and makes a POST request to your server with data structured as a key-value pair. The data sent to your server in this case is one key data.txt with the corresponding value being the contents of the file.
Your server code snippet reads the entire HTTP request body and dumps it into a file. The key-value pair built and sent from the client arrives in a form-encoded format that can be decoded with Python's built-in urlparse library.
Here is a solution that could work:
import urlparse  # Python 2; on Python 3 use urllib.parse instead

def do_POST(self):
    length = int(self.headers.getheader('content-length'))
    field_data = self.rfile.read(length)
    fields = urlparse.parse_qs(field_data)
This snippet of code was shamelessly borrowed from: https://stackoverflow.com/a/31363982/705471
If you'd like to extract the contents of your text file back, adding the following line to the above snippet could help:
data_file = fields["data.txt"][0]  # parse_qs maps each key to a list of values
To learn more about how such information is encoded for the purposes of HTTP, read more at: https://en.wikipedia.org/wiki/Percent-encoding
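To see that round trip concretely, here is a minimal sketch using only the standard library (shown with the Python 3 module names; in Python 2 the same functions live in urllib and urlparse):

from urllib.parse import urlencode, parse_qs

# this is roughly what requests does to your data dict before sending the POST body
body = urlencode({'data.txt': 'hi, this is my testing data'})
print(body)  # data.txt=hi%2C+this+is+my+testing+data

# and this is how the server can turn that body back into the original value
fields = parse_qs(body)
print(fields['data.txt'][0])  # hi, this is my testing data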
Well, my English is not good, and the title may look weird.
Anyway, I'm now using Flask to build a website that can store files, with MongoDB as the database.
The file upload and document insert functions have no problems; the weird thing is that the file sent by Flask's send_file() is truncated for no apparent reason. Here's my code:
from flask import ..., send_file, ...
import pymongo
import gridfs
# ...

@app.route("/record/download/<record_id>")
def api_softwares_record_download(record_id):
    try:
        # ...
        file = files_gridfs.find_one({"_id": record_id})
        file_ext = filetype.guess_extension(file.read(2048))
        filename = "{}-{}{}".format(
            app["name"],
            record["version"],
            ".{}".format(file_ext) if file_ext else "",
        )
        response = send_file(file, as_attachment=True, attachment_filename=filename)
        return response
    except ...
The original image file, for example, is 553KB, but the response body returns only 549.61KB and the image is broken. However, if I just write the file directly to my disk:
# ...
with open('test.png', 'wb+') as file:
    file.write(files_gridfs.find_one({"_id": record_id}).read())
The image file size is 553KB and the image is readable.
When I compared the two files in VS Code's text editor, I found that the correct file starts with �PNG, but the corrupted file starts with �ϟ8���>�L�y.
(Screenshot: searching for the corrupted file's first bytes inside the correct file.)
And I tried to use Blob object and download it from the browser. No difference.
Is there anything wrong with my code, or did I misuse send_file()? Or should I use flask_pymongo?
Interestingly, I have found out what was wrong with my code.
This is how I solved it
...file.read(2048)
file.seek(0)
...
file.read(2048)
file.seek(0)
...
response = send_file(file, ...)
return response
And here's why:
For some reason, I use filetype to detect the file's extension and MIME type, so I send 2048 bytes to filetype for detection.
file_ext = filetype.guess_extension(file.read(2048))
file_mime = filetype.guess_mime(file.read(2048)) #this line wasn't copied in my question. My fault.
And I have just learned from the pymongo API that pymongo/gridfs reads files using a cursor (completely unknown to me before). When I checked the cursor's position after those reads, it was at 4096. So when send_file() calls file.read() again, it reads starting 4096 bytes past the file head. 549 + 4 = 553, and there's the problem.
Finally, I seek the cursor back to position 0 after every read() call, and it returns the correct file.
Hope this helps if you make the same mistake I did.
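For completeness, here is a minimal sketch of the corrected route. The MongoDB connection details are assumptions, and attachment_filename is the Flask 1.x keyword used in the question (newer Flask versions renamed it to download_name):

import filetype
import gridfs
import pymongo
from flask import Flask, send_file

app = Flask(__name__)
db = pymongo.MongoClient()["mydb"]   # assumed connection and database name
files_gridfs = gridfs.GridFS(db)

@app.route("/record/download/<record_id>")
def api_softwares_record_download(record_id):
    grid_out = files_gridfs.find_one({"_id": record_id})

    # sniff the type from the first bytes, then rewind so send_file() starts at byte 0
    file_ext = filetype.guess_extension(grid_out.read(2048))
    grid_out.seek(0)

    filename = "download.{}".format(file_ext) if file_ext else "download"
    return send_file(grid_out, as_attachment=True, attachment_filename=filename)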
I'm trying to parse a .eml file. The .eml has an Excel attachment that's currently base64-encoded. I'm trying to figure out how to decode it into XML so that I can later turn it into a CSV I can do stuff with.
This is my code right now:
import email

data = file('Openworkorders.eml').read()
msg = email.message_from_string(data)
for part in msg.walk():
    c_type = part.get_content_type()
    c_disp = part.get('Content-Disposition')
    if part.get_content_type() == 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
        excelContents = part.get_payload(decode=True)
        print excelContents
The problem is that when I try to decode it, it spits back what looks like unreadable binary gibberish.
I've used this post to help me write the code above: How can I get an email message's text content using Python?
Update:
This is exactly following the post's solution with my file, but part.get_payload() returns everything still encoded. I haven't figured out how to access the decoded content this way.
import email

data = file('Openworkorders.eml').read()
msg = email.message_from_string(data)
for part in msg.walk():
    if part.get_content_type() == 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
        name = part.get_param('name') or 'MyDoc.doc'
        f = open(name, 'wb')
        f.write(part.get_payload(None, True))
        f.close()
        print part.get("content-transfer-encoding")
As is clear from this table (and as you have already concluded), this file is an .xlsx. You can't just decode it with unicode or base64: you need a special package. Excel files specifically are a bit trickier (for example, this one does PowerPoint and Word, but not Excel). There are a few online; see here - xlrd might be the best.
Here is my solution:
I found 2 things out:
1.) I thought .open() was going inside the .eml and changing the selected decoded elements; I thought I needed to see decoded data before moving forward. What's really happening with open() is that it creates a new file in the same directory for that .xlsx data. You must write the attachment out before you will be able to deal with the data.
2.) You must open an xlrd workbook with the file path.
import email
import xlrd

data = file('EmailFileName.eml').read()
msg = email.message_from_string(data)  # entire message

if msg.is_multipart():
    for payload in msg.get_payload():
        bdy = payload.get_payload()
else:
    bdy = msg.get_payload()

attachment = msg.get_payload()[1]

# open and save excel file to disk
f = open('excelFile.xlsx', 'wb')
f.write(attachment.get_payload(decode=True))
f.close()

xls = xlrd.open_workbook(excelFilePath)  # so something in quotes like '/Users/mymac/thisProjectsFolder/excelFileName.xlsx'

# Here's a bonus for how to start accessing excel cells and rows
for sheets in xls.sheets():
    list = []
    for rows in range(sheets.nrows):
        for col in range(sheets.ncols):
            list.append(str(sheets.cell(rows, col).value))
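Since the end goal was a CSV, here is a hedged follow-up sketch that writes the first sheet of the saved workbook out with the standard csv module (note that xlrd 2.0+ dropped .xlsx support, so this assumes an older xlrd; openpyxl would be the modern alternative):

import csv
import xlrd

book = xlrd.open_workbook('excelFile.xlsx')   # the file written by the snippet above
sheet = book.sheet_by_index(0)

with open('excelFile.csv', 'w') as out:
    writer = csv.writer(out)
    for row in range(sheet.nrows):
        writer.writerow([sheet.cell_value(row, col) for col in range(sheet.ncols)])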
I'm trying to consume the Exchange GetAttachment webservice using requests, lxml and base64io. This service returns a base64-encoded file in a SOAP XML HTTP response. The file content is contained in a single line in a single XML element. GetAttachment is just an example, but the problem is more general.
I would like to stream the decoded file contents directly to disk without storing the entire contents of the attachment in-memory at any point, since an attachment could be several 100 MB.
I have tried something like this:
r = requests.post('https://example.com/EWS/Exchange.asmx', data=..., stream=True)
with open('foo.txt', 'wb') as f:
    for action, elem in lxml.etree.iterparse(GzipFile(fileobj=r.raw)):
        if elem.tag == 't:Content':
            b64_encoder = Base64IO(BytesIO(elem.text))
            f.write(b64_encoder.read())
but lxml still stores a copy of the attachment as elem.text. Is there any way I can create a fully streaming XML parser that also streams the content of an element directly from the input stream?
Don't use iterparse in this case. The iterparse() method can only issue element start and end events, so any text in an element is given to you when the closing XML tag has been found.
Instead, use a SAX parser interface. This is a general standard for XML parsing libraries to pass parsed data on to a content handler. The ContentHandler.characters() callback is passed character data in chunks (assuming that the implementing XML library actually makes use of this possibility). This is a lower-level API than the ElementTree API, and the Python standard library already bundles the Expat parser to drive it.
So the flow then becomes:
Wrap the incoming request stream in a GzipFile for easy decompression. Or, better still, set response.raw.decode_content = True and leave decompression to the requests library based on the content-encoding the server has set.
Pass the GzipFile instance or raw stream to the .parse() method of a parser created with xml.sax.make_parser(). The parser then proceeds to read from the stream in chunks. By using make_parser() you can first enable features such as namespace handling (which ensures your code doesn't break if Exchange decides to alter the short prefixes used for each namespace).
The content handler characters() method is called with chunks of XML data; check for the correct element start event, so you know when to expect base64 data. You can decode that base64 data in chunks of (a multiple of) 4 characters at a time, and write it to a file. I'd not use base64io here, just do your own chunking.
A simple content handler could be:

from xml.sax import handler
from base64 import b64decode

class AttachmentContentHandler(handler.ContentHandler):
    types_ns = 'http://schemas.microsoft.com/exchange/services/2006/types'

    def __init__(self, filename):
        self.filename = filename

    def startDocument(self):
        self._buffer = None
        self._file = None

    def startElementNS(self, name, *args):
        if name == (self.types_ns, 'Content'):
            # we can expect base64 data next
            self._file = open(self.filename, 'wb')
            self._buffer = []

    def endElementNS(self, name, *args):
        if name == (self.types_ns, 'Content'):
            # all attachment data received, close the file
            try:
                if self._buffer:
                    raise ValueError("Incomplete Base64 data")
            finally:
                self._file.close()
                self._file = self._buffer = None

    def characters(self, data):
        if self._buffer is None:
            return
        self._buffer.append(data)
        self._decode_buffer()

    def _decode_buffer(self):
        remainder = ''
        for data in self._buffer:
            available = len(remainder) + len(data)
            overflow = available % 4
            if remainder:
                data = (remainder + data)
                remainder = ''
            if overflow:
                remainder, data = data[-overflow:], data[:-overflow]
            if data:
                self._file.write(b64decode(data))
        self._buffer = [remainder] if remainder else []
and you'd use it like this:
import requests
from xml.sax import make_parser, handler
parser = make_parser()
parser.setFeature(handler.feature_namespaces, True)
parser.setContentHandler(AttachmentContentHandler('foo.txt'))
r = requests.post('https://example.com/EWS/Exchange.asmx', data=..., stream=True)
r.raw.decode_content = True # if content-encoding is used, decompress as we read
parser.parse(r.raw)
This will parse the input XML in chunks of up to 64KB (the default IncrementalParser buffer size), so attachment data is decoded in at most 48KB blocks of raw data.
I'd probably extend the content handler to take a target directory and then look for <t:Name> elements to extract the filename, then use that to extract the data to the correct filename for each attachment found. You'd also want to verify that you are actually dealing with a GetAttachmentResponse document, and handle error responses.
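For example, a hypothetical extension along those lines might look like the sketch below. It assumes that a <t:Name> element precedes each <t:Content> element in the response, so treat it as an illustration rather than a drop-in implementation:

import os

class DirectoryAttachmentHandler(AttachmentContentHandler):
    # Sketch: write each attachment under target_dir, using the text of the
    # preceding <t:Name> element as the filename.
    def __init__(self, target_dir):
        AttachmentContentHandler.__init__(self, filename=None)
        self.target_dir = target_dir
        self._name_chars = None
        self._pending_name = None

    def startElementNS(self, name, *args):
        if name == (self.types_ns, 'Name'):
            self._name_chars = []
        elif name == (self.types_ns, 'Content'):
            # decide the output path just before the base class opens the file
            self.filename = os.path.join(
                self.target_dir, self._pending_name or 'attachment.bin')
            AttachmentContentHandler.startElementNS(self, name, *args)

    def endElementNS(self, name, *args):
        if name == (self.types_ns, 'Name'):
            self._pending_name = ''.join(self._name_chars)
            self._name_chars = None
        else:
            AttachmentContentHandler.endElementNS(self, name, *args)

    def characters(self, data):
        if self._name_chars is not None:
            self._name_chars.append(data)
        else:
            AttachmentContentHandler.characters(self, data)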
I'm using O365 for Python.
I'm sending an email and building the body using the setBodyHTML() function. However, at present I need to write the actual HTML code inside the function. I don't want to do that. I want to just have Python look at an HTML file I saved somewhere and send an email using that file as the body. Is that possible? Or am I confined to copy/pasting my HTML into that function? I'm using Office 365 for business. Thanks.
In other words, instead of this: msg.setBodyHTML("<h3>Hello</h3>") I want to be able to do this: msg.setBodyHTML("C:\somemsg.html")
I guess you can assign the file content to a variable first, i.e.:
file = open('C:/somemsg.html', 'r')
content = file.read()
file.close()
msg.setBodyHTML(content)
You can do this by simply reading that file into a string, which you can then pass to the setBodyHTML function.
Here's a quick function example that will do the trick:
def load_html_from_file(path):
    contents = ""
    with open(path, 'r') as f:
        contents = f.read()
    return contents
Later, you can do something along the lines of
msg.setBodyHTML(load_html_from_file("C:\somemsg.html"))
or
html_contents = load_html_from_file("C:\somemsg.html")
msg.setBodyHTML(html_contents)
My file is like this, but I can't exec the content correctly. I've spent my whole afternoon on this and am still confused. The main reason is that I don't know what file_obj[0]['body'] looks like.
Here is part of my code:
# user_file content
"uid = 'h123456789'"
"data = [something]"
# end of user_file

# code piece
file_obj = req.request.files.get('user_file', None)
for i in file_obj[0]['body']:
    i.strip('\n')  # I tried commenting this line out, still doesn't work
    exec(i)
# I failed
Can you tell me what the user_file content would look like in the file_obj body, so that I can maybe figure out the solution? I submitted it to Tornado with an HTTP form.
Thanks a lot.
Maybe this will help.
# first file object in request
file1 = self.request.files['file1'][0]
# where the file content is actually placed
content = file1['body']
# split content into lines, unix line terminators assumed
lines = content.split(b'\n')
for l in lines:
    # after decoding into strings, you're free to execute them
    try:
        exec(l.decode())
    except:
        pass
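Put together, a minimal Tornado handler using this pattern could look like the sketch below. The 'user_file' field name comes from the question; the handler class and URL are assumptions, and exec-ing uploaded code is of course only safe in a controlled environment:

import tornado.ioloop
import tornado.web

class UploadHandler(tornado.web.RequestHandler):
    def post(self):
        # each uploaded file is a dict with 'filename', 'content_type'
        # and 'body' (the raw bytes of the uploaded file)
        uploaded = self.request.files['user_file'][0]
        namespace = {}
        for line in uploaded['body'].split(b'\n'):
            line = line.strip()
            if line:
                exec(line.decode(), namespace)  # executes e.g. uid = 'h123456789'
        self.write("uid = {}".format(namespace.get('uid')))

if __name__ == "__main__":
    app = tornado.web.Application([(r"/upload", UploadHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()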