Saving attachments from outlook, error when loading with pandas/xlrd - python

I have this script, which has previously worked for other emails, to download attachments:
import win32com.client as win
import xlrd
outlook = win.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder("6")
all_inbox = inbox.Items
subject = 'Email w/Attachment'
attachment1 = 'Attachment - 20160715.xls'
for msg in all_inbox:
    if msg.subject == subject:
        break
for att in msg.Attachments:
    if att.FileName == attachment1:
        break
att.SaveAsFile('L:\\My Documents\\Desktop\\' + attachment1)
workbook = xlrd.open_workbook('L:\\My Documents\\Desktop\\' + attachment1)
However, when I try to open the file using xlrd (or with pandas), I get this:
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\nVisit '
Can anyone explain what's gone wrong here?
Is there a way I can open the attachment, without saving it, and just copy a worksheet and save that copy as a .csv file instead?
Thank you

Take a look at this question. It's possible the file you are trying to download is not a true Excel file, but a CSV (or other text file) saved with an .xls extension. The evidence is the error message Expected BOF record; found b'\r\nVisit '. A genuine .xls file starts with a binary signature (the bytes D0 CF 11 E0), and an .xlsx file starts with PK because it is a zip archive; neither starts with readable text. You could get around it with a try/except:
import pandas as pd
import xlrd
from xlrd import XLRDError
try:  # try to read it as a .xls file
    workbook = xlrd.open_workbook(path)
except XLRDError:  # if that fails, read it as csv
    workbook = pd.read_csv(path)
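Rather than waiting for the reader to fail, it may help to look at the first bytes of the saved attachment. The sketch below is a hypothetical helper (the name sniff_spreadsheet_format and its return labels are my own, not part of xlrd or pandas) that checks for the OLE2 signature used by binary .xls files and the zip signature used by .xlsx files:

```python
def sniff_spreadsheet_format(path):
    """Guess the real format of a file from its first bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head == b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1":
        return "xls"  # OLE2 compound file, the container used by binary .xls
    if head.startswith(b"PK\x03\x04"):
        return "xlsx"  # zip archive, the container used by .xlsx
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16 text"  # byte order mark of a UTF-16 text file
    if head.lstrip().startswith(b"<"):
        return "xml or html"
    return "other text (possibly csv/tsv)"
```

A file whose first bytes are b'\r\nVisit ' would fall into the last branch, which suggests opening it in a text editor before choosing a parser.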

Related

Error open file after saving it with storeFile of pysmb

I am reading an Excel file (.xlsx) with pysmb.
import tempfile
from smb.SMBConnection import SMBConnection
conn = SMBConnection(userID, password, client_machine_name, server_name, use_ntlm_v2 = True)
conn.connect(server_ip, 139)
file_obj = tempfile.TemporaryFile()
file_attributes, filesize = conn.retrieveFile(service_name, 'test.xlsx', file_obj)
This step works; I am able to load the file into a pandas.DataFrame:
import pandas as pd
pd.read_excel(file_obj)
Next, I want to save the file. The file is saved, but when I try to open it with Excel I get the error message "Excel has run into an error".
Here is the code to save the file:
conn.storeFile(service_name, 'test_save.xlsx', file_obj)
file_obj.close()
How can I save the file correctly so that it opens in Excel?
Thank you
I tried with a .txt file and it works. The error occurs with .xlsx, .xls and .pdf files. I have also tried without an extension; same issue, impossible to open the file.
I would like to save the file with .pdf and .xlsx extension, and open it.
Thank you.
I found a solution and I will post it here in case someone faces a similar issue.
An Excel file can be saved as a binary stream.
from io import BytesIO
df = pd.read_excel(file_obj)
output = BytesIO()
writer = pd.ExcelWriter(output, engine='xlsxwriter')
df.to_excel(writer, sheet_name='data', index = False)
writer.close()  # writes the workbook into output; writer.save() was removed in pandas 2.0
output.seek(0)
conn.storeFile(service_name, 'test_save.xlsx', output)
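Note that the working version calls output.seek(0) before storeFile. One possible explanation for the original failure, which I have not verified against pysmb itself, is that file_obj was no longer positioned at the start when it was uploaded. The sketch below uses a stand-in upload function (hypothetical, not a pysmb call) to show how a consumer that reads from the current position behaves with and without rewinding:

```python
import tempfile

def upload(file_obj):
    """Stand-in for an upload call: reads from the current position to EOF."""
    return file_obj.read()

file_obj = tempfile.TemporaryFile()
file_obj.write(b"PK\x03\x04 fake workbook bytes")  # position is now at the end

sent_without_rewind = upload(file_obj)  # nothing left to read
file_obj.seek(0)                        # rewind to the start
sent_after_rewind = upload(file_obj)    # full content
file_obj.close()
```

If that is what happened, calling file_obj.seek(0) right before the original storeFile call might be enough, without rebuilding the workbook through pandas.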

Python - XLRDError: Unsupported format, or corrupt file: Expected BOF record

I am trying to open an Excel file that was given to me for my project; the file is an export from a SAP system. But when I try opening it using pandas I get the following error:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xff\xfe\r\x00\n\x00\r\x00'
The following is my code:
import pandas as pd
# To open an excel file
df = pd.ExcelFile('myexcel.xls').parse('Sheet1')
I don't know whether it will work for you, but it worked for me once, so you can try the following:
from __future__ import unicode_literals
from xlwt import Workbook
import io
filename = r'myexcel.xls'
# Opening the file using 'utf-16' encoding
file1 = io.open(filename, "r", encoding="utf-16")
data = file1.readlines()
# Creating a workbook object
xldoc = Workbook()
# Adding a sheet to the workbook object
sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
# Iterating and saving the data to sheet
for i, row in enumerate(data):
    # Two things are done here
    # Removing the '\n' which comes while reading the file using io.open
    # Getting the values after splitting using '\t'
    for j, val in enumerate(row.replace('\n', '').split('\t')):
        sheet.write(i, j, val)
# Saving the file as an excel file
xldoc.save('myexcel.xls')
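The bytes '\xff\xfe' at the start of the error message are a UTF-16 byte order mark, which is why the snippet above opens the file as UTF-16 text. If the goal is only to get at the data, the rows can be read directly without writing a new workbook. This is a minimal sketch assuming Python 3 and a tab-delimited file, as in the answer above; read_utf16_tsv is a hypothetical helper name:

```python
import csv
import io

def read_utf16_tsv(path):
    """Return the rows of a UTF-16, tab-delimited text file as lists of strings."""
    # newline="" lets the csv module handle line endings itself
    with io.open(path, "r", encoding="utf-16", newline="") as f:
        return [row for row in csv.reader(f, delimiter="\t")]
```

Blank lines in the export come back as empty lists, so they can be filtered out with a simple `if row` check.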
I faced the same xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; error and solved it by writing an XML to XLSX converter. You can call pd.ExcelFile('myexcel.xlsx') after the conversion. The reason is that pandas uses xlrd for reading Excel files, and xlrd does not support XML Spreadsheet (*.xml) files, i.e. files that are NOT in XLS or XLSX format.
import pandas as pd
from bs4 import BeautifulSoup
def convert_to_xlsx():
    with open('sample.xls') as xml_file:
        soup = BeautifulSoup(xml_file.read(), 'xml')
        writer = pd.ExcelWriter('sample.xlsx')
        for sheet in soup.findAll('Worksheet'):
            sheet_as_list = []
            for row in sheet.findAll('Row'):
                sheet_as_list.append([cell.Data.text if cell.Data else '' for cell in row.findAll('Cell')])
            pd.DataFrame(sheet_as_list).to_excel(writer, sheet_name=sheet.attrs['ss:Name'], index=False, header=False)
        writer.close()  # writes sample.xlsx; writer.save() was removed in pandas 2.0
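The same Worksheet / Row / Cell / Data walk can be done with only the standard library, which may be handy when BeautifulSoup is not available. This is a sketch, not the answerer's method: read_xml_spreadsheet is a hypothetical helper, and it ignores ss:Index attributes, so rows with skipped cells may come out shifted.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"  # XML Spreadsheet 2003 namespace

def read_xml_spreadsheet(xml_text):
    """Return {sheet name: rows} for an XML Spreadsheet 2003 document."""
    root = ET.fromstring(xml_text)
    sheets = {}
    for ws in root.iter("{%s}Worksheet" % SS):
        rows = []
        for row in ws.iter("{%s}Row" % SS):
            cells = []
            for cell in row.findall("{%s}Cell" % SS):
                data = cell.find("{%s}Data" % SS)
                # A cell without a Data child is treated as empty
                cells.append((data.text or "") if data is not None else "")
            rows.append(cells)
        sheets[ws.get("{%s}Name" % SS)] = rows
    return sheets
```

All values come back as strings; converting numbers or dates would need the ss:Type attribute of each Data element.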
What worked for me was applying this advice:
How to cope with an XLRDError
There you will also find an explanation that fit my case. It says the problem was a file that had not been saved in the correct format. When I opened the xls file, it offered to save it as html. I saved it as ".xlsx" and that solved the problem.

Parse excel attachment from .eml file in python

I'm trying to parse a .eml file. The .eml has an Excel attachment that's currently base64 encoded. I'm trying to figure out how to decode it into XML so that I can later turn it into a CSV I can do stuff with.
This is my code right now:
import email
data = file('Openworkorders.eml').read()
msg = email.message_from_string(data)
for part in msg.walk():
    c_type = part.get_content_type()
    c_disp = part.get('Content-Disposition')
    if part.get_content_type() == 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
        excelContents = part.get_payload(decode = True)
        print excelContents
The problem is that when I try to decode it, it spits back something looking like this.
I've used this post to help me write the code above.
How can I get an email message's text content using Python?
Update:
This is exactly following the post's solution with my file, but part.get_payload() returns everything still encoded. I haven't figured out how to access the decoded content this way.
import email
data = file('Openworkorders.eml').read()
msg = email.message_from_string(data)
for part in msg.walk():
    if part.get_content_type() == 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
        name = part.get_param('name') or 'MyDoc.doc'
        f = open(name, 'wb')
        f.write(part.get_payload(None, True))
        f.close()
    print part.get("content-transfer-encoding")
As is clear from this table (and as you have already concluded), this file is an .xlsx. You can't just decode it with unicode or base64: you need a special package. Excel files specifically are a bit trickier (e.g. this one does PowerPoint and Word, but not Excel). There are a few online, see here; xlrd might be the best.
Here is my solution:
I found 2 things out:
1.) I thought .open() was going inside the .eml and changing the selected decoded elements. I thought I needed to see decoded data before moving forward. What's really happening with .open() is it's creating a new file in the same directory of that .xlsx file. You must open the attachment before you will be able to deal with the data.
2.) You must open an xlrd workbook with the file path.
import email
import xlrd
data = file('EmailFileName.eml').read()
msg = email.message_from_string(data) # entire message
if msg.is_multipart():
    for payload in msg.get_payload():
        bdy = payload.get_payload()
else:
    bdy = msg.get_payload()
attachment = msg.get_payload()[1]
# open and save excel file to disk
f = open('excelFile.xlsx', 'wb')
f.write(attachment.get_payload(decode=True))
f.close()
xls = xlrd.open_workbook('excelFile.xlsx') # or a full path in quotes like '/Users/mymac/thisProjectsFolder/excelFileName.xlsx'
# Here's a bonus for how to start accessing excel cells and rows
for sheets in xls.sheets():
    list = []
    for rows in range(sheets.nrows):
        for col in range(sheets.ncols):
            list.append(str(sheets.cell(rows, col).value))
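The snippets above are Python 2. For reference, here is a rough Python 3 sketch of the same idea that keeps the decoded attachment in memory instead of writing it to disk first. The helper name xlsx_attachments and the constant XLSX_TYPE are my own; the email calls are from the standard library:

```python
import email
import io
from email import policy

XLSX_TYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

def xlsx_attachments(eml_bytes):
    """Yield (filename, BytesIO) for each .xlsx attachment in a raw .eml message."""
    msg = email.message_from_bytes(eml_bytes, policy=policy.default)
    for part in msg.iter_attachments():
        if part.get_content_type() == XLSX_TYPE:
            # get_payload(decode=True) undoes the base64 transfer encoding
            yield part.get_filename(), io.BytesIO(part.get_payload(decode=True))
```

The raw message would come from open('Openworkorders.eml', 'rb').read(), and each BytesIO can be passed to a reader that accepts file-like objects, such as pandas.read_excel.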

Converting Excel to CSV python

I am using xlrd to convert my .xls Excel file to a CSV file, yet when I try to open the workbook my program crashes with this error message:
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xlrd/book.py", line 1224, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found 'Chrom\tPo'
The Chrom\tPo is part of the header of my Excel file, yet I don't understand what is wrong with the file or how to change it.
The program crashes right when I try to open the Excel file using xlrd.open_workbook('Excel File')
I would use openpyxl for this.
import openpyxl
wb = openpyxl.load_workbook(file_name)
ws = wb.worksheets[page_number]
table = []
for row_num in range(1, ws.max_row + 1):  # openpyxl rows and columns are 1-based
    temp_row = []
    for col_num in range(1, ws.max_column + 1):
        temp_row.append(ws.cell(row=row_num, column=col_num).value)
    table.append(temp_row[:])
This will give you the contents of the sheet as a 2-D list, which you can then write out to a csv or use as you wish.
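Writing that 2-D list out as CSV can be done with the standard csv module. A minimal sketch, where write_table_to_csv is a hypothetical helper and table is the list built above:

```python
import csv

def write_table_to_csv(table, out_file):
    """Write a list of rows (each a list of cell values) to an open text file."""
    writer = csv.writer(out_file)
    # The csv module writes None (an empty cell) as an empty field
    writer.writerows(table)
```

In Python 3 the output file should be opened with newline="", for example `with open("out.csv", "w", newline="") as f: write_table_to_csv(table, f)`, where out.csv is a placeholder name.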
If you're stuck with xlrd for whatever reason, you may just need to convert your file from xls to xlsx.
Here is an answer from a previous question: How to save an Excel worksheet as CSV from Python (Unix)?
The answer goes for openpyxl and xlrd.

Python - download excel file from email attachment then parse it

EDIT - UPDATE
I have created a horrible hack that opens the Excel file, saves it under the same filename, and only then loads it into pandas. This is really horrible, but I can't see any other way to solve the problem, as attachment.SaveAsFile creates an endian problem.
I have the following code that finds an email in my Outlook inbox and then downloads the Excel file to a directory. When I try to open the file to parse it and use it in another part of my script, it comes up with a formatting error.
I know this is caused by the way Python saves it, because when I save it manually it works fine.
Any help greatly appreciated.
from win32com.client import Dispatch
import win32com.client as win32  # needed for win32.gencache below
import email
import datetime as date
import pandas as pd
import os
outlook = Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder("6")
all_inbox = inbox.Items
val_date = date.date.today()
sub_today = 'Hi'
att_today = 'Net - Regional.xls'
## loop through inbox attachments
for msg in all_inbox:
    yourstring = msg.Subject.encode('ascii', 'ignore').decode('ascii')
    if(yourstring.find('Regional Reporting Week') != -1):
        break
## get attachments
for att in msg.Attachments:
    if att.FileName == att_today:
        attachments = msg.Attachments
        break
attachment = attachments.Item(1)
fn = os.getcwd() + '\\' + att_today
attachment.SaveAsFile(fn)
# terrible hack but workable in the short term
excel = win32.gencache.EnsureDispatch('Excel.Application')
excel.DisplayAlerts = False
excel.Visible = True
wb = excel.Workbooks.Open(fn)
wb.SaveAs(fn)
wb.Close(True)
xl = pd.ExcelFile(fn)
data_df = xl.parse("RawData - Global")
print(data_df)
What is the file name string of att_today? Is it using the appropriate extension?
You're saving it as a ".xls" file. Could it possibly be a ".xlsx" extension?
Aside from the ".SaveAsFile()" method, you may want to look into ".ExtractFile" or "WriteToFile".
Lastly, even if Python saves the file differently from how you saved it manually, you could still use a third-party Excel package to read the file properly before re-writing it for manual opening / viewing.
For ".xls" extensions, I would recommend XLRD.
For ".xlsx" extensions, I would recommend OpenPyxl.
