Parse excel attachment from .eml file in python


I'm trying to parse a .eml file. The .eml has an Excel attachment that is currently base64-encoded. I'm trying to figure out how to decode it into XML so that I can later turn it into a CSV I can work with.
This is my code right now:
import email

data = file('Openworkorders.eml').read()
msg = email.message_from_string(data)
for part in msg.walk():
    c_type = part.get_content_type()
    c_disp = part.get('Content-Disposition')
    if part.get_content_type() == 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
        excelContents = part.get_payload(decode = True)
        print excelContents
The problem is that when I try to decode it, it spits back something looking like this: [screenshot of the output not included]
I used this post to help me write the code above: "How can I get an email message's text content using Python?"
Update:
This is exactly following the post's solution with my file, but part.get_payload() returns everything still encoded. I haven't figured out how to access the decoded content this way.
import email

data = file('Openworkorders.eml').read()
msg = email.message_from_string(data)
for part in msg.walk():
    if part.get_content_type() == 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
        name = part.get_param('name') or 'MyDoc.doc'
        f = open(name, 'wb')
        f.write(part.get_payload(None, True))
        f.close()
        print part.get("content-transfer-encoding")

As is clear from this table (and as you have already concluded), this file is an .xlsx. Base64-decoding the payload only gives you the raw bytes of that .xlsx file; to read its contents you need a dedicated package. Excel files specifically are a bit trickier (e.g. this one handles PowerPoint and Word, but not Excel). There are a few packages available, see here; xlrd might be the best.

Here is my solution. I found out two things:
1.) I thought open() was going inside the .eml and changing the selected decoded elements, and that I needed to see the decoded data before moving forward. What open() really does is create a new .xlsx file on disk (in the working directory when given a bare file name). You must write the attachment out to that file before you can work with the data.
2.) You must open an xlrd workbook with the file path.
import email
import xlrd

data = open('EmailFileName.eml').read()
msg = email.message_from_string(data)  # entire message

if msg.is_multipart():
    for payload in msg.get_payload():
        bdy = payload.get_payload()
else:
    bdy = msg.get_payload()

# assumes the Excel attachment is the second part of the message
attachment = msg.get_payload()[1]

# open and save excel file to disk
excelFilePath = 'excelFile.xlsx'  # or a full path like '/Users/mymac/thisProjectsFolder/excelFileName.xlsx'
f = open(excelFilePath, 'wb')
f.write(attachment.get_payload(decode=True))
f.close()

xls = xlrd.open_workbook(excelFilePath)

# Here's a bonus for how to start accessing excel cells and rows
for sheet in xls.sheets():
    cells = []
    for row in range(sheet.nrows):
        for col in range(sheet.ncols):
            cells.append(str(sheet.cell(row, col).value))
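
For readers on Python 3, the extraction step can be sketched with the standard library alone. This is a minimal sketch, not the poster's code: the helper name extract_xlsx_attachments and the in-memory demo message are made up for illustration, and the "workbook" bytes are a placeholder rather than a real spreadsheet. It also shows why printing the decoded payload looks like garbage: an .xlsx file is a ZIP archive, so the decoded bytes start with "PK" and are binary, not text.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

XLSX_TYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def extract_xlsx_attachments(raw_eml):
    """Return (filename, decoded_bytes) for every .xlsx attachment in raw .eml bytes."""
    msg = message_from_bytes(raw_eml, policy=policy.default)
    results = []
    for part in msg.iter_attachments():
        if part.get_content_type() == XLSX_TYPE:
            # decode=True undoes the base64 transfer encoding and returns bytes
            results.append((part.get_filename(), part.get_payload(decode=True)))
    return results


# Build a small message in memory so the sketch runs without a real .eml file.
demo = EmailMessage()
demo["Subject"] = "Open work orders"
demo.set_content("Report attached.")
fake_workbook = b"PK\x03\x04 placeholder, not a real workbook"
demo.add_attachment(
    fake_workbook,
    maintype="application",
    subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    filename="Openworkorders.xlsx",
)

attachments = extract_xlsx_attachments(demo.as_bytes())
for filename, payload in attachments:
    print(filename, len(payload), payload[:2])
```

With a real file you would pass open('Openworkorders.eml', 'rb').read() to the helper, write each payload to disk in 'wb' mode, and then hand that path to a spreadsheet reader.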

Related

failed: Network error while downloading Excel file generated by jupyter notebook

My Jupyter notebook saves a DataFrame (with styles) to an Excel file. I then create a link to download this Excel file:
df.to_excel('ABC.xlsx', index=True)
filename = 'ABC.xlsx'
file_link = "<a href='{href}' download='ABC.xlsx'> Download ABC.xlsx</a>"
html = HTML(file_link.format(href=filename))
display(html)
But when I click the "Download ABC.xlsx" link, I get "Failed: Network error".
By contrast, it works fine when I download a CSV file in a similar way.
Here is the CSV code; it adds some base64 encoding, without which the CSV download does not work either:
def func(df, title="Download csv file", filename="ABC.csv"):
    csv = df.to_csv(index=True)
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = "{title}"
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)
I tried editing this function for the Excel file:
def func(df, title="Download excel file", filename="ABC.xlsx"):
    xls = df.to_excel("xyz.xlsx", index=True)
    b64 = base64.b64encode(xls.encode())
    payload = b64.decode()
    html = "{title}"
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)
The Excel version raises an error: 'NoneType' object has no attribute 'encode'

In your CSV code you use csv=df.to_csv(index=True). According to the docs:
If path_or_buf is None, returns the resulting csv format as a string.
Otherwise returns None.
Here you didn't specify path_or_buf, so the return value is the CSV content. This is why you can download the CSV.
The to_excel docs, on the other hand, don't mention any return value: it writes the file and returns None, so xls is None and xls.encode() fails.
To solve this, you can open the file you wrote and read it back as a base64 string:
import base64

def file_to_base64(file):
    # file should be the actual file name you wrote
    with open(file, "rb") as excel_file:
        encoded_string = base64.b64encode(excel_file.read())
    return encoded_string.decode()
Replace the two lines
    b64 =base64.b64encode(xls.encode())
    payload=b64.decode()
with:
    payload = file_to_base64("xyz.xlsx")
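
Putting the pieces together, a download link for a binary file could look roughly like the sketch below. It is an illustration under assumptions: the helper name download_link, the data: URI anchor template, and the demo file contents are not from the original posts, and the placeholder bytes stand in for a workbook that df.to_excel would normally have written.

```python
import base64
import os
import tempfile

# MIME type for .xlsx files
XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def download_link(path, title="Download file"):
    """Return an HTML anchor that embeds the file at `path` as a base64 data URI."""
    with open(path, "rb") as fh:
        payload = base64.b64encode(fh.read()).decode("ascii")
    template = '<a download="{filename}" href="data:{mime};base64,{payload}">{title}</a>'
    return template.format(
        filename=os.path.basename(path), mime=XLSX_MIME, payload=payload, title=title
    )


# Demo with placeholder bytes; in the notebook the file would come from df.to_excel(...).
demo_bytes = b"PK\x03\x04 placeholder bytes"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "ABC.xlsx")
    with open(demo_path, "wb") as fh:
        fh.write(demo_bytes)
    link = download_link(demo_path, title="Download ABC.xlsx")

print(link)
```

In a notebook, the returned string would be wrapped in IPython.display.HTML before being displayed. Because the whole file is embedded in the page, this approach suits small files best.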

Django download a BinaryField as a CSV file

I am working with a legacy project and we need to implement a Django Admin that helps download a csv report that was stored as a BinaryField.
The model is something like this:
class MyModel(models.Model):
    csv_report = models.BinaryField(blank=True, null=True)
Everything seems to be stored as expected, but I have no clue how to decode the field back into a CSV file for later use.
I am using something like this (as an admin action on the MyModelAdmin class):
class MyModelAdmin(admin.ModelAdmin):
    ...
    ...
    actions = ["download_file",]

    def download_file(self, request, queryset):
        # just getting one for testing
        contents = queryset[0].csv_report
        encoded_data = base64.b64encode(contents).decode()
        with open("report.csv", "wb") as binary_file:
            # Write bytes to file
            decoded_image_data = base64.decodebytes(encoded_data)
            binary_file.write(decoded_image_data)
        response = HttpResponse(encoded_data)
        response['Content-Disposition'] = 'attachment; filename=report.csv'
        return response
    download_file.short_description = "file"
But all I download is a scrambled CSV file. I can't tell whether the problem is the format I am using to decode (.decode('utf-8') does nothing either).
P.S.:
I know it is bad practice to use a BinaryField for this. But requirements are requirements; there is nothing I can do about it.
EDIT:
As @TimRoberts pointed out, encoding and then decoding is REALLY silly :$. I've changed the method like so:
def download_file(self, request, queryset):
    # print(self,request)
    contents = queryset[0].csv_report
    # print(type(contents))
    encoded_data = base64.b64decode(contents)
    with open("my_file.csv", "wb") as binary_file:
        binary_file.write(encoded_data)
    response = HttpResponse(encoded_data)
    response['Content-Disposition'] = 'attachment; filename=blob.csv'
    return response
download_file.short_description = "file"
Still, I am getting a CSV file with something like this: [screenshot of the scrambled output not included]

A big fat case of the old RTFM: I was getting carried away by all the base64 stuff. Obviously I didn't have any idea what I was doing.
After tinkering in the shell and reading the docs, I just changed my method to:
def download_file(self, request, queryset):
    contents = bytes(queryset[0].csv_report)
    response = HttpResponse(contents)
    response['Content-Disposition'] = 'attachment; filename=report.csv'
    return response
Note that I was scrambling the data myself by doing the encoded_data = base64.b64decode(contents) stuff. I just needed to apply bytes to my BinaryField value and voilà.
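
The difference between the two approaches can be reproduced without Django. The sketch below is only an illustration: it assumes the field value arrives as a memoryview (which, as far as I know, is what some database backends such as PostgreSQL return for a BinaryField), and the CSV bytes are made up.

```python
import base64

# Made-up CSV content standing in for the stored report.
raw_csv = b"id,name\n1,widget\n2,gadget\n"

# Assumption: the BinaryField value is a memoryview over the stored bytes.
stored_value = memoryview(raw_csv)

# bytes() copies the buffer back into a plain bytes object, unchanged.
contents = bytes(stored_value)

# base64 is a different representation of the same data. Applying it to data
# that was never base64-encoded changes the bytes instead of recovering them.
encoded = base64.b64encode(raw_csv)

print(contents == raw_csv)                   # the report survives bytes()
print(encoded == raw_csv)                    # the base64 text is not the report
print(base64.b64decode(encoded) == raw_csv)  # decoding only undoes a prior encode
```

In other words, base64 only belongs in the pipeline if the data was base64-encoded before it was stored.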

How does rfile.read() work?

I'm sending a text file containing a string via POST from a Python script to my server:
fo = open('data.txt', 'a')
fo.write("hi, this is my testing data")
fo.close()
with open('data.txt', 'rb') as f:
    r = requests.post("http://XXX.XX.X.X", data = {'data.txt':f})
f.close()
I receive and handle it in my server handler script, built from an example found online:
def do_POST(self):
    data = self.rfile.read(int(self.headers.getheader('Content-Length')))
    empty = [data]
    with open('processing.txt', 'wb') as file:
        for item in empty:
            file.write("%s\n" % item)
    file.close()
    self._set_headers()
    self.wfile.write("<html><body><h1>POST!</h1></body></html>")
My question is, how does:
self.rfile.read(int(self.headers.getheader('Content-Length')))
take the length of my data (an integer, # of bytes/characters) and read my file? I am confused how it knows what my data contains. What is going on behind the scenes with HTTP?
It outputs data.txt=hi%2C+this+is+my+testing+data
to my processing.txt, but I am expecting "hi this is my testing data"
I tried but failed to find documentation for what exactly rfile.read() does, and if simply finding that answers my question I'd appreciate it, and I could just delete this question.

Your client code snippet reads the contents of the file data.txt and makes a POST request to your server with the data structured as a key-value pair. The data sent to your server in this case is one key, data.txt, whose value is the contents of the file.
Your server code snippet reads the entire HTTP request body and dumps it into a file. The key-value pair sent by the client arrives URL-encoded (percent-encoded), a format that can be decoded with Python's built-in urlparse module (urllib.parse in Python 3).
Here is a solution that could work:
import urlparse

def do_POST(self):
    length = int(self.headers.getheader('content-length'))
    field_data = self.rfile.read(length)
    fields = urlparse.parse_qs(field_data)
This snippet of code was shamelessly borrowed from: https://stackoverflow.com/a/31363982/705471
If you'd like to extract the contents of your text file, adding the following line to the above snippet could help (parse_qs maps each key to a list of values, hence the [0]):
    data_file = fields["data.txt"][0]
To learn more about how such information is encoded for the purposes of HTTP, read more at: https://en.wikipedia.org/wiki/Percent-encoding
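
As for what rfile.read() does: rfile is just the input stream of the connection, and read(n) returns the next n bytes from it. The server does not know what the data contains; the Content-Length header only tells it how many bytes make up the body. The Python 3 sketch below imitates this with an in-memory stream. The headers dict and the BytesIO stand-in for rfile are assumptions made for the demo, not the real handler objects.

```python
import io
from urllib.parse import parse_qs

# The body exactly as the client sent it (form-encoded key=value).
raw_body = b"data.txt=hi%2C+this+is+my+testing+data"

# Stand-ins for the handler's headers and rfile; the trailing bytes mimic
# whatever else might arrive on the same connection after this request.
headers = {"Content-Length": str(len(raw_body))}
rfile = io.BytesIO(raw_body + b"bytes that belong to a later request")

# Read exactly Content-Length bytes: no more, no less.
length = int(headers["Content-Length"])
body = rfile.read(length)

# Decode the form encoding: '+' becomes a space, %2C becomes a comma.
fields = parse_qs(body.decode("ascii"))
text = fields["data.txt"][0]
print(text)  # hi, this is my testing data
```

Reading fewer bytes would truncate the body, and reading more could block waiting for data the client never sends, which is why the header value is passed to read().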

How to generate a file without saving it to disk in python?

I'm using Python 2.7 and Django 1.7.
I have a method in my admin interface that generates some kind of a csv file.
def generate_csv(args):
    ...
    # some code that generates a dictionary to be written as csv
    ...
    # this creates a directory and returns its filepath
    dirname = create_csv_dir('stock')
    csvpath = os.path.join(dirname, 'mycsv_file.csv')
    fieldnames = [...]  # some field names
    # this function creates the csv file in the directory shown by the csvpath
    newcsv(data, csvheader, csvpath, fieldnames)
    # this automatically starts a download from that directory
    return HttpResponseRedirect('/media/csv/stock/%s' % csvfile)
All in all, I create a CSV file, save it somewhere on disk, and then pass its URL to the user for download.
I was wondering whether all this can be done without writing to disk. I googled around a bit and it seems a Content-Disposition attachment header might help, but I got a bit lost in the documentation.
Anyway, if there's an easier way of doing this I'd love to know.

Thanks to @Ragora for pointing me in the right direction.
I rewrote the newcsv method:
import StringIO  # Python 2.7 module; on Python 3 use io.StringIO instead
import csv

def newcsv(data, csvheader, fieldnames):
    """
    Create a new csv file that represents generated data.
    """
    new_csvfile = StringIO.StringIO()
    wr = csv.writer(new_csvfile, quoting=csv.QUOTE_ALL)
    wr.writerow(csvheader)
    wr = csv.DictWriter(new_csvfile, fieldnames = fieldnames)
    for key in data.keys():
        wr.writerow(data[key])
    return new_csvfile
and in the admin:
csvfile = newcsv(data, csvheader, fieldnames)
response = HttpResponse(csvfile.getvalue(), content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename=stock.csv'
return response
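
For comparison, the same in-memory idea on Python 3 can be sketched with io.StringIO. The function name build_csv and the sample rows are hypothetical; the Django response part is left out so the snippet runs on its own.

```python
import csv
import io


def build_csv(rows, fieldnames):
    """Write a list of dicts to an in-memory buffer and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
    writer.writeheader()    # header row taken from fieldnames
    writer.writerows(rows)  # one CSV row per dict
    return buffer.getvalue()


# Made-up sample data standing in for the generated stock report.
sample_rows = [{"sku": "A1", "qty": 3}, {"sku": "B2", "qty": 0}]
csv_text = build_csv(sample_rows, ["sku", "qty"])
print(csv_text)
```

The returned string can be passed to HttpResponse(csv_text, content_type='text/csv') exactly as in the admin code above, so nothing is ever written to disk.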

If it annoys you that you are saving a file to disk, just serve the file with the application/octet-stream content type together with a Content-Disposition header, then delete the file from disk.
If this header (Content-Disposition) is used in a response with the application/octet-stream content-type, the implied suggestion is that the user agent should not display the response, but directly enter a `save response as...' dialog.

How to save an email attached in another using python smtplib?

I am using Python's imaplib to download emails and save their attachments. But when an email has another email as an attachment, x.get_payload(decode=True) returns None. I think these kinds of mails are sent by some email clients. Since the filename was missing, I tried setting a filename in the 'Content-Disposition' header. The renamed file gets opened, and when I try to write to that file using
fp.write(part.get_payload(decode=True))
it says string or buffer expected but NoneType found.
>>> x.get_payload()
[<email.message.Message instance at 0x7f834eefa0e0>]
>>> type(part.get_payload())
<type 'list'>
>>> type(part.get_payload(decode=True))
<type 'NoneType'>
I removed decode=True and got a list of objects:
>>> x.get_payload()[0]
<email.message.Message instance at 0x7f834eefa0e0>
I tried setting the filename when an email is found as an attachment:
if part.get('Content-Disposition'):
    attachment = str(part.get_filename())  # get filename
    if attachment == 'None':
        attachment = 'somename.mail'
        attachment = self.autorename(attachment)  # append (no. of occurrences) to filename, e.g. filename(1), in case the file exists
        x.add_header('Content-Disposition', 'attachment', filename=attachment)
        attachedmail = 1
    if attachedmail == 1:
        fp.write(str(x.get_payload()))
    else:
        fp.write(x.get_payload(decode=True))  # write contents to the opened file
The file then contains only the object's representation, as shown below:
[ < email.message.Message instance at 0x7fe5e09aa248 > ]
How can I write the contents of these attached emails to files?

I solved it myself. Since [ < email.message.Message instance at 0x7fe5e09aa248 > ] is a list of email.message.Message instances, each one has an .as_string() method. In my case, writing the output of .as_string() to a file let me extract the whole attached message, headers and embedded attachments included. Then I inspected the file line by line and saved the contents based on the encoding and file type.
>>> x.get_payload()
[<email.message.Message instance at 0x7f834eefa0e0>]
>>> fp = open('header', 'wb')
>>> fp.write(x.get_payload()[0].as_string())
>>> fp.close()
>>> file_as_list = []
>>> fp = open('header', 'rb')
>>> file_as_list = fp.readlines()
>>> fp.close()
And then inspecting each line in the file:
for line in file_as_list:
    if 'Content-Transfer-Encoding: quoted-printable' in line:
        print 'qp encoded data found!'
    if 'Content-Transfer-Encoding: base64' in line:
        print 'base64 encoded data found!'
The encoded data representing inline (embedded) attachments can be skipped, as it has already been captured when processing the message downloaded with imaplib.
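
On Python 3 the same idea can be sketched with the standard email package: an attached email shows up as a part of type message/rfc822 whose payload is a one-element list holding the inner message, which can be serialized whole. This is a hedged sketch rather than the poster's code; the helper name extract_attached_emails, the addresses, and the demo messages are invented for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_attached_emails(msg):
    """Return the raw bytes of every email attached to `msg` as message/rfc822."""
    found = []
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of such a part is a list containing the inner message.
            inner_message = part.get_payload()[0]
            found.append(inner_message.as_bytes())
    return found


# Build an outer message with another email attached, then parse it back,
# which mimics what arrives after a download via imaplib.
inner = EmailMessage()
inner["Subject"] = "Inner message"
inner["From"] = "alice@example.com"
inner["To"] = "bob@example.com"
inner.set_content("Body of the attached email.")

outer = EmailMessage()
outer["Subject"] = "Outer message"
outer.set_content("See the attached email.")
outer.add_attachment(inner)

parsed = message_from_bytes(outer.as_bytes(), policy=policy.default)
attached = extract_attached_emails(parsed)
print(len(attached))
```

Each entry in the result can be written to a file opened in 'wb' mode, for example with an .eml extension, and parsed again later to reach its own attachments.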
