I want to extract a table from the body of a .msg file with Python. I can get the body content, but I can't separate the table from the rest of the body. I need the table on its own, for example as a DataFrame.
import os
import win32com.client

msg_dir = r"C:\Users\Murilo\Desktop\Emails\030"
# Create the MAPI namespace once instead of once per file
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
for file in os.listdir(msg_dir):
    if file.endswith(".msg"):
        msg = outlook.OpenSharedItem(os.path.join(msg_dir, file))
        print(msg.Body)
I only need the table that is in the body, not the whole body.
If it is an HTML table, use MailItem.HTMLBody (instead of the plain text Body) and extract the table from HTML.
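As a rough sketch of the "extract the table from HTML" step, here is one way to do it with only the standard library. The names TableExtractor and extract_tables are made up for this example, and sample_html stands in for whatever msg.HTMLBody returns. It assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in an HTML string as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Hypothetical helper: return all tables found in `html`."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Stand-in for msg.HTMLBody (assumption: the mail body holds a plain HTML table)
sample_html = (
    "<html><body><p>Hi</p><table>"
    "<tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Bolt</td><td>4</td></tr>"
    "</table></body></html>"
)
tables = extract_tables(sample_html)
```

If pandas is installed, the first row can serve as the header: header, *rows = tables[0], then pandas.DataFrame(rows, columns=header). pandas.read_html is another option, but it needs an extra HTML parser package.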
I would look at the extract_msg library. It reads a .msg file directly, without needing Outlook, and gives you the message body, from which you can then extract the table.
import extract_msg

msg = extract_msg.Message(fileLoc)  # fileLoc: path to the .msg file
msg_message = msg.body
content = 'Body: {}'.format(msg_message)
The Outlook object model provides three main ways for working with item bodies:
Body.
HTMLBody.
The Word editor. The WordEditor property of the Inspector class returns an instance of the Word Document that represents the message body, so you can use the Word object model to do whatever you need with the message body. The Copy and Paste methods of the Document will do the trick.
See Chapter 17: Working with Item Bodies for more information.
But I think the easiest and cleanest way is to use the Word object model. You can read more about the Word object model, and how to use it to extract table content, in the post "How to read contents of an Table in MS-Word file Using Python?".
I am trying to load links from a .txt file, search each page for a specific word, and, if the word exists on that webpage, save the link to another .txt file. But I am getting this error: No scheme supplied. Perhaps you meant http://<_io.TextIOWrapper name='import.txt' mode='r' encoding='cp1250'>?
Note: the links do start with https://
The code:
import requests

list_of_pages = open('import.txt', 'r+')
save = open('output.txt', 'a+')
word = "Word"
save.truncate(0)
for page_link in list_of_pages:
    res = requests.get(list_of_pages)
    if word in res.text:
        response = requests.request("POST", url)
        save.write(str(response) + "\n")
Can anyone explain why? Thank you in advance!
Try putting http:// in front of the links.
When you use res = requests.get(list_of_pages), you're asking requests to open an HTTP connection to list_of_pages. But requests.get takes a URL string as its parameter (e.g. http://localhost:8080/static/image01.jpg), and look at what list_of_pages is: an already opened file, not a string. That is why the error message shows <_io.TextIOWrapper ...> where the URL should be. If the file holds one URL per line, pass the loop variable (page_link, with its trailing newline stripped) to requests.get instead of the file object.
If the already opened file is itself what you want to read, you don't need an HTTP request at all. Drop the requests.get() call and parse list_of_pages like a normal, local file.
Or, if you would like to go the other way, don't open the text file as list_of_pages; make it a string holding the URL of that file.
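To make the point about the loop variable concrete, here is a minimal sketch, assuming the file holds one URL per line. The function names (fetch_text, links_containing, filter_file) are invented for this example. The fetch function is passed in as a parameter so the filtering logic can be tried without network access.

```python
def fetch_text(url):
    """Download a page and return its text (needs the requests package)."""
    import requests  # imported lazily so links_containing works without it

    return requests.get(url, timeout=10).text


def links_containing(lines, word, fetch=fetch_text):
    """Return the URLs from `lines` whose page text contains `word`."""
    matches = []
    for line in lines:
        url = line.strip()  # drop the trailing newline read from the file
        if not url:
            continue  # skip blank lines
        if word in fetch(url):
            matches.append(url)
    return matches


def filter_file(src="import.txt", dst="output.txt", word="Word"):
    """Read URLs from `src` and write the matching ones to `dst`."""
    with open(src) as pages:
        found = links_containing(pages, word)
    with open(dst, "w") as out:  # mode "w" empties the file first
        out.writelines(url + "\n" for url in found)


# Offline demo with a fake fetcher (the URLs and page texts are made up)
fake_pages = {
    "https://a.example": "a Word here",
    "https://b.example": "nothing",
}
demo_matches = links_containing(
    ["https://a.example\n", "\n", "https://b.example\n"],
    "Word",
    fetch=fake_pages.get,
)
```

Calling filter_file() with the default arguments would then do what the question describes, using the real requests.get.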
I'm trying to parse a .eml file. The .eml has an Excel attachment that's currently base64 encoded. I'm trying to figure out how to decode it into XML so that I can later turn it into a CSV I can work with.
This is my code right now:
import email

with open('Openworkorders.eml') as f:
    data = f.read()
msg = email.message_from_string(data)
for part in msg.walk():
    c_type = part.get_content_type()
    c_disp = part.get('Content-Disposition')
    if c_type == 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
        excelContents = part.get_payload(decode=True)
        print(excelContents)
The problem is that when I try to decode it, it spits back something looking like this.
I used this post to help me write the code above: "How can I get an email message's text content using Python?"
Update:
This is exactly following the post's solution with my file, but part.get_payload() returns everything still encoded. I haven't figured out how to access the decoded content this way.
import email

with open('Openworkorders.eml') as f:
    data = f.read()
msg = email.message_from_string(data)
for part in msg.walk():
    if part.get_content_type() == 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
        name = part.get_param('name') or 'MyDoc.doc'
        with open(name, 'wb') as out:
            out.write(part.get_payload(None, True))
        print(part.get("content-transfer-encoding"))
As is clear from this table (and as you have already concluded), this file is an .xlsx. You can't just decode it as unicode or base64 and expect readable text: you need a dedicated package to read it. Excel files specifically are a bit trickier (for example, this one does PowerPoint and Word, but not Excel). There are a few packages online, see here; xlrd might be the best.
Here is my solution:
I found 2 things out:
1.) I thought .open() was going inside the .eml and changing the selected decoded elements, and that I needed to see decoded data before moving forward. What's really happening is that .open() creates a new .xlsx file in the working directory. You must write the attachment out to that file before you can work with the data.
2.) You must open an xlrd workbook with the file path.
import email
import xlrd  # note: xlrd 2.0+ reads only .xls; for .xlsx use xlrd < 2.0 or another reader such as openpyxl

with open('EmailFileName.eml') as f:
    data = f.read()
msg = email.message_from_string(data)  # entire message
if msg.is_multipart():
    for payload in msg.get_payload():
        bdy = payload.get_payload()
else:
    bdy = msg.get_payload()
attachment = msg.get_payload()[1]
# open and save excel file to disk
excelFilePath = 'excelFile.xlsx'  # or a full path like '/Users/mymac/thisProjectsFolder/excelFileName.xlsx'
with open(excelFilePath, 'wb') as f:
    f.write(attachment.get_payload(decode=True))
xls = xlrd.open_workbook(excelFilePath)
# Here's a bonus for how to start accessing excel cells and rows
for sheet in xls.sheets():
    values = []
    for row in range(sheet.nrows):
        for col in range(sheet.ncols):
            values.append(str(sheet.cell(row, col).value))
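For anyone on Python 3, the same "pull the decoded attachment out of the .eml" step might look roughly like the sketch below. It uses the standard email package with policy.default. The helper name xlsx_attachments and the demo message are made up, and the demo payload is placeholder bytes rather than a real workbook.

```python
import email
from email import policy
from email.message import EmailMessage

XLSX_TYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def xlsx_attachments(raw_bytes):
    """Return (filename, decoded_bytes) for every .xlsx attachment in an .eml."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        if part.get_content_type() == XLSX_TYPE:
            # decode=True undoes the base64 transfer encoding
            found.append((part.get_filename(), part.get_payload(decode=True)))
    return found


# Demo: build a small message in memory instead of reading a real .eml file
demo_payload = b"PK\x03\x04 placeholder bytes, not a real workbook"
demo = EmailMessage()
demo["Subject"] = "Open work orders"
demo.set_content("Spreadsheet attached.")
demo.add_attachment(
    demo_payload,
    maintype="application",
    subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    filename="orders.xlsx",
)
attachments = xlsx_attachments(demo.as_bytes())
```

With a real file you would pass open('EmailFileName.eml', 'rb').read() and write each decoded payload to disk in 'wb' mode before handing the path to a spreadsheet reader.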
I'm using O365 for Python.
I'm sending an email and building the body by using the setBodyHTML() function. However, at present I need to write the actual HTML code inside the function call, and I don't want to do that. I want Python to read an HTML file I saved somewhere and send an email using that file as the body. Is that possible? Or am I confined to copy/pasting my HTML into that function? I'm using Office 365 for business. Thanks.
In other words instead of this: msg.setBodyHTML("<h3>Hello</h3>") I want to be able to do this: msg.setBodyHTML("C:\somemsg.html")
I guess you can assign the file content to a variable first, i.e.:
file = open('C:/somemsg.html', 'r')
content = file.read()
file.close()
msg.setBodyHTML(content)
You can do this via a simple reading of that file into a string, which you then can pass to the setBodyHTML function.
Here's a quick function example that will do the trick:
def load_html_from_file(path):
    contents = ""
    with open(path, 'r') as f:
        contents = f.read()
    return contents
Later, you can do something along the lines of
msg.setBodyHTML(load_html_from_file(r"C:\somemsg.html"))
or
html_contents = load_html_from_file(r"C:\somemsg.html")
msg.setBodyHTML(html_contents)
I am using Python's imaplib to download and save email attachments. But when an email has another email as an attachment, x.get_payload(decode=True) returns None. I think these kinds of mails are sent by certain email clients. Since the filename was missing, I tried setting a filename in the 'Content-Disposition' header. The renamed file opens, but when I try to write to it using
fp.write(part.get_payload(decode=True))
it says string or buffer expected but Nonetype found.
>>> x.get_payload()
[<email.message.Message instance at 0x7f834eefa0e0>]
>>> type(part.get_payload())
<type 'list'>
>>> type(part.get_payload(decode=True))
<type 'NoneType'>
I removed decode=True and got a list of objects:
>>> x.get_payload()[0]
<email.message.Message instance at 0x7f834eefa0e0>
I tried editing the filename in case email found as attachment.
if part.get('Content-Disposition'):
    attachment = str(part.get_filename())  # get filename
    if attachment == 'None':
        attachment = 'somename.mail'
        # append (number of occurrences) to the filename, e.g. filename(1), in case the file exists
        attachment = self.autorename(attachment)
        x.add_header('Content-Disposition', 'attachment', filename=attachment)
        attachedmail = 1
    if attachedmail == 1:
        fp.write(str(x.get_payload()))
    else:
        fp.write(x.get_payload(decode=True))  # write contents to the opened file
But the file then contains only the object's repr. The file content is given below:
[<email.message.Message instance at 0x7fe5e09aa248>]
How can I write the contents of these attached emails to files?
I solved it myself. Since [<email.message.Message instance at 0x7fe5e09aa248>] is a list of email.message.Message instances, each one has an .as_string() method. In my case, writing the output of .as_string() to a file let me extract the whole message, headers and embedded attachments included. Then I inspected the file line by line and saved contents based on the encoding and file type.
>>> x.get_payload()
[<email.message.Message instance at 0x7f834eefa0e0>]
>>> fp = open('header', 'wb')
>>> fp.write(x.get_payload()[0].as_string())
>>> fp.close()
>>> file_as_list = []
>>> fp = open('header', 'rb')
>>> file_as_list = fp.readlines()
>>> fp.close()
And then I inspected each line in the file:
for line in file_as_list:
    if 'Content-Transfer-Encoding: quoted-printable' in line:
        print('qp encoded data found!')
    if 'Content-Transfer-Encoding: base64' in line:
        print('base64 encoded data found!')
The encoded data representing inline(embedded) attachments can be skipped as imaplib already captures it.
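For reference, the same idea in Python 3 might look like the sketch below. The helper name attached_emails and the demo messages are made up for this example. It relies on the fact that a message/rfc822 part holds a list of Message objects as its payload, so each one is serialised with as_bytes() instead of being decoded.

```python
import email
from email.message import EmailMessage


def attached_emails(raw_bytes):
    """Return each message/rfc822 attachment as raw bytes ready to write to disk."""
    msg = email.message_from_bytes(raw_bytes)
    saved = []
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload is a list of Message objects, so decode=True gives None;
            # serialise each embedded message instead.
            for inner in part.get_payload():
                saved.append(inner.as_bytes())
    return saved


# Demo: an outer mail carrying another mail as an attachment (made-up content)
inner_msg = EmailMessage()
inner_msg["Subject"] = "Forwarded report"
inner_msg.set_content("Original message text.")

outer_msg = EmailMessage()
outer_msg["Subject"] = "See attached mail"
outer_msg.set_content("Forwarding this to you.")
outer_msg.add_attachment(inner_msg)  # stored as a message/rfc822 part

extracted = attached_emails(outer_msg.as_bytes())
inner_subject = email.message_from_bytes(extracted[0])["Subject"]
```

Each item in extracted can be written to a file opened in 'wb' mode, for example with an .eml extension, and parsed again later to reach its own attachments.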
I know there are similar questions out there, but I couldn't find something that would answer my prayers. What I need is a way to access certain data from MS-Word files and save it in an XML file.
Reading up on python-docx did not help, as it only seems to allow one to write into word documents, rather than read.
To present my task exactly (or how I chose to approach it): I would like to search for a keyword or phrase in the document (the document contains tables) and extract the text data from the table where the keyword or phrase is found.
Anybody have any ideas?
A .docx file is a zip archive that contains the document as XML. You can open the zip, read the document and parse the data using ElementTree.
The advantage of this technique is that you don't need any extra python libraries installed.
import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            # node.text is None for an empty text run, so fall back to ''
            print(''.join(node.text or '' for node in cell.iter(TEXT)))
See my stackoverflow answer to How to read contents of an Table in MS-Word file Using Python? for more details and references.
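Building on the script above, the keyword search from the question could be sketched like this. The function name tables_containing is made up, and the demo builds a tiny stand-in archive in memory that holds only word/document.xml, so it is not a complete .docx. With a real file you would pass its path.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def tables_containing(docx_file, keyword):
    """Return each table (as rows of cell strings) that mentions `keyword`."""
    with zipfile.ZipFile(docx_file) as docx:
        root = ET.fromstring(docx.read("word/document.xml"))
    hits = []
    for table in root.iter(W + "tbl"):
        rows = [
            ["".join(t.text or "" for t in cell.iter(W + "t"))
             for cell in row.iter(W + "tc")]
            for row in table.iter(W + "tr")
        ]
        # Keep the table if any of its cells contains the keyword
        if any(keyword in text for row in rows for text in row):
            hits.append(rows)
    return hits


# Demo: minimal stand-in for a .docx with two one-row tables (made-up content)
_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
_demo_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    "<w:tbl><w:tr>"
    "<w:tc><w:p><w:r><w:t>Part</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>Price</w:t></w:r></w:p></w:tc>"
    "</w:tr></w:tbl>"
    "<w:tbl><w:tr>"
    "<w:tc><w:p><w:r><w:t>Other</w:t></w:r></w:p></w:tc>"
    "</w:tr></w:tbl>"
    "</w:body></w:document>" % _NS
)
demo_docx = io.BytesIO()
with zipfile.ZipFile(demo_docx, "w") as archive:
    archive.writestr("word/document.xml", _demo_xml)

found = tables_containing(demo_docx, "Price")
```

Because the text runs of a cell are joined before the comparison, a phrase that Word has split across several runs inside one cell is still found. Nested tables would be reported both on their own and as part of the outer table.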
In answer to a comment below,
Images are not as clear-cut to extract. I created an empty docx and inserted one image into it. I then opened the docx file as a zip archive (using 7zip) and looked at document.xml. All the image information is stored in XML attributes rather than as element text, which is how the document text is stored. So you need to find the tag you are interested in and pull out the information that you are looking for.
For example adding to the script above:
IMAGE = '{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}' + 'docPr'
for image in tree.iter(IMAGE):
    print(image.attrib)
outputs:
{'id': '1', 'name': 'Picture 1'}
I'm no expert at the openxml format but I hope this helps.
I do note that the zip file contains a directory called media which contains a file called image1.jpeg that contains a renamed copy of my embedded image. You can look around in the docx zipfile to investigate what is available.
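To look around in the archive without 7zip, something like the following sketch should work. The helper name list_media is made up, and the demo archive is a stand-in built in memory with placeholder bytes instead of a real image. It assumes the media directory sits under word/ in the archive.

```python
import io
import zipfile


def list_media(docx_file):
    """Return the names of the files stored under word/media/ in a .docx archive."""
    with zipfile.ZipFile(docx_file) as docx:
        return [name for name in docx.namelist() if name.startswith("word/media/")]


# Demo: stand-in archive with one "image" (placeholder bytes, not a real JPEG)
demo_archive = io.BytesIO()
with zipfile.ZipFile(demo_archive, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("word/media/image1.jpeg", b"placeholder image bytes")

media = list_media(demo_archive)

# Reading one embedded file back out of the archive
with zipfile.ZipFile(demo_archive) as archive:
    image_bytes = archive.read(media[0])
```

Writing image_bytes to a file opened in 'wb' mode gives you the embedded image as a normal file.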
To search in a document with python-docx
# Import the module
from docx import *
# Open the .docx file
document = opendocx('A document.docx')
# Search returns true if found
search(document,'your search string')
You also have a function to get the text of a document:
https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910
# Import the module
from docx import *
# Open the .docx file
document = opendocx('A document.docx')
fullText=getdocumenttext(document)
Using https://github.com/mikemaccana/python-docx
It seems that pywin32 does the trick. You can iterate through all the tables in a document and through all the cells inside a table. It's a bit tricky to get the data (the last 2 characters of every entry have to be dropped), but otherwise the code takes about ten minutes to write.
If anyone needs additional details, please say so in the comments.
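In case it helps, a rough sketch of that approach follows. The function names are made up. It needs Windows, Word and pywin32, and it assumes the two trailing characters are Word's end-of-cell marker ("\r\x07"). Table.Cell(row, column) can fail on tables with merged cells, which this sketch does not handle.

```python
def clean_cell_text(raw):
    """Strip the two-character end-of-cell marker from a cell's Range.Text."""
    return raw[:-2] if raw.endswith("\r\x07") else raw


def read_word_tables(path):
    """Return all tables in a Word document as lists of rows of cell text.

    Requires Windows, Microsoft Word and the pywin32 package.
    """
    import win32com.client  # imported lazily: only available on Windows

    word = win32com.client.Dispatch("Word.Application")
    doc = word.Documents.Open(path)
    try:
        tables = []
        for table in doc.Tables:
            rows = []
            # Word collections are 1-based
            for r in range(1, table.Rows.Count + 1):
                rows.append([
                    clean_cell_text(table.Cell(r, c).Range.Text)
                    for c in range(1, table.Columns.Count + 1)
                ])
            tables.append(rows)
        return tables
    finally:
        doc.Close(False)  # close without saving changes
        word.Quit()
```

read_word_tables expects a full path to the document, since Word does not resolve relative paths against the Python working directory.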
A simpler library, with image extraction capability:
pip install docx2txt
Then use the code below to read a docx file.
import docx2txt
text = docx2txt.process("file.docx")
Extracting text from doc/docx file using python
import os
import docx2txt
from win32com import client as wc

def extract_text_from_docx(path):
    temp = docx2txt.process(path)
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    final_text = ' '.join(text)
    return final_text

def extract_text_from_doc(doc_path):
    # Convert the .doc to .docx with Word (16 = wdFormatDocumentDefault), then read the result
    joinedPath = os.path.join(root_path, save_file_name)
    w = wc.Dispatch('Word.Application')
    doc = w.Documents.Open(doc_path)
    doc.SaveAs(joinedPath, 16)
    doc.Close()
    w.Quit()
    text = extract_text_from_docx(joinedPath)
    return text

def extract_text(file_path, extension):
    text = ''
    if extension == '.docx':
        text = extract_text_from_docx(file_path)
    elif extension == '.doc':
        text = extract_text_from_doc(file_path)
    return text

file_path = '<path to the doc/docx file>'
root_path = '<folder where the doc was downloaded>'
save_file_name = "Final2_text_docx.docx"
extension = os.path.splitext(file_path)[1].lower()
final_text = extract_text(file_path, extension)
print(final_text)