I am trying to do some text mining on emails which I exported from my email client (Mail in OS X) by copying and pasting them into an RTF file.
When I run tf-idf on the files, either in Python or RapidMiner, I get features which are clearly not in the message content itself, for example: fonttbl, colortbl, cocoa rtf, paperw, etc. Clearly they are properties of something other than the message text, perhaps the headers? Where do they come from, and how can I remove them from the files or extract only the email contents from the original email messages?
Perhaps this is an encoding issue?
Thanks!
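Tokens such as fonttbl, colortbl and paperw are RTF control words, i.e. formatting markup of the RTF container rather than anything from the email. One way to avoid them is to skip the RTF step and read the text part from the raw message with Python's standard email package. The following is only a minimal sketch: it assumes the messages can be saved in raw RFC 5322 form (for example as .eml files), and plain_text_body is a hypothetical helper name, not something from the question.

```python
from email import message_from_bytes, policy


def plain_text_body(raw: bytes) -> str:
    """Return the text/plain body of a raw RFC 5322 message, or '' if none.

    Assumption: `raw` is the unmodified message source, not an RTF export.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    # get_body() picks the best matching body part and skips attachments.
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""
```

The returned string can then be fed to the tf-idf step in place of the RTF file contents.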
Related
I am trying to extract body of the email from my gmail accounts based off their email ids. I am able to acquire the body, however I am not able to generalize the approach and have to resort to hardcoding.
The following code snippet is a modification of the one from here.
if 'parts' in email['payload']:
    ## path 1
    data = email['payload']['parts'][0]['parts'][0]["body"]["data"]
    ### Only one of the paths will work. Need to find other patterns!
    ## path 2
    data = email['payload']['parts'][0]["body"]["data"]
else:
    data = email['payload']['body']['data']
My problem is, so far I have found only two patterns. But I want to generalize the approach to get the body and not rely on hardcoding the paths.
Assumption: The email body is in HTML and not simple text. Thus, sending a simple text will not work, I've tried.
The API documentation for the message structure is found here. I have created a test file containing the different json structures that I can send over if anyone wanna help me in the investigation.
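One way to avoid hardcoding the paths is to walk the payload recursively and stop at the first part whose mimeType matches. The sketch below assumes the payload follows the Gmail API message-part layout (mimeType, body.data, optional nested parts) and that body.data is base64url-encoded; extract_body and find_body_data are hypothetical names introduced for illustration.

```python
import base64


def find_body_data(part, mime_type="text/html"):
    """Depth-first search for the first part of the given MIME type.

    Returns the still-encoded `body.data` string, or None if nothing matches.
    """
    if part.get("mimeType") == mime_type and part.get("body", {}).get("data"):
        return part["body"]["data"]
    for sub in part.get("parts", []):
        found = find_body_data(sub, mime_type)
        if found is not None:
            return found
    return None


def extract_body(payload, mime_type="text/html"):
    """Decode the matching body to text; `payload` is email['payload']."""
    data = find_body_data(payload, mime_type)
    if data is None:
        return None
    # Restore any stripped padding before decoding base64url.
    padded = data + "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")
```

With this, data = extract_body(email['payload']) would cover both paths above as well as deeper nesting, as long as the assumed layout holds.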
I am trying to read a .msg file to get the sender, recipients, and title.
I'm making this script for my workplace where I'm only allowed to install default python libraries so I want to use the email module to do this.
On the python website I found some examples of using the email module. https://docs.python.org/3/library/email.examples.html
Near the end of the page it talks about getting the sender, subject and recipient. I've tried using this code like this:
# Import the email modules we'll need
from email import policy
from email.parser import BytesParser
with open('test_email.msg', 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)
# Now the header items can be accessed as a dictionary, and any non-ASCII will
# be converted to unicode:
print('To:', msg['to'])
print('From:', msg['from'])
print('Subject:', msg['subject'])
This results in an output:
To: None
From: None
Subject: None
I checked the file test_email.msg, it is a valid email.
When I add a line of code
print(msg)
I get an output of a garbled email the same as if I opened the .msg file in notepad.
Can anybody suggest why the email module isn't finding the sender/recipient/subject correctly?
You are apparently attempting to read some sort of proprietary binary format. The Python email library does not support this; it only handles traditional (basically text) RFC822 / RFC5322 format.
To read Microsoft's OLE formats, you will need a third-party module, and some patience, voodoo, and luck.
Also, for the record, there is no unambiguous definition of .msg. Outlook uses this file extension for its files, but it is used on files in other formats as well, including traditional RFC822 files.
(The second link attempts to link to the MS-OXMSG spec on MSDN; but Microsoft have in the past regarded URLs as some sort of depletable resource which runs out when you use it, so the link will probably stop working if enough people click on it.)
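Using only the standard library, it is at least possible to tell the two kinds of .msg file apart before parsing. OLE compound files start with the eight-byte signature D0 CF 11 E0 A1 B1 1A E1, so a rough check could look like the sketch below. The function names are hypothetical, and the check only detects the container; it does not read Outlook's format.

```python
from email import policy
from email.parser import BytesParser

# Signature found at the start of OLE compound files (Outlook .msg among them).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(data: bytes) -> bool:
    """True if the data starts with the OLE compound file signature."""
    return data.startswith(OLE_MAGIC)


def parse_headers(data: bytes):
    """Parse RFC 822 style data; return None for OLE files the email
    package cannot read."""
    if looks_like_ole(data):
        return None
    return BytesParser(policy=policy.default).parsebytes(data)
```

If parse_headers returns None for test_email.msg, the file is in the binary Outlook format and the To/From/Subject lookups will keep returning None.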
I'm using Getmail (http://pyropus.ca/software/getmail/) to check a POP3 account regularly and download arriving mail to a folder.
On a second step I parse those mails using Python and use the sent data.
The problem is that, as the received mails are in Spanish, they may contain Latin characters, and when saved to a file those characters are replaced. I can't find how to avoid that replacement or even which encoding the file is in. Examples follow:
Ñ is replaced by =F1
á is replaced by =E1
Line breaks are replaced by =20
It looks like URL encoding but it uses the = sign instead of the %.
I also tried to download the mails using Python's poplib and the characters are replaced there too, which leads me to think that the problem may be a configuration issue on my machine.
Hope someone can help me with this.
Thanks!!
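The =F1 / =E1 pattern matches the quoted-printable content transfer encoding, where = is followed by the byte value in hex (F1 and E1 are ñ and á in ISO-8859-1). In that case the file is stored as it was sent, and the decoding is done when parsing. A minimal sketch, assuming each saved file holds one non-multipart message and falling back to latin-1 when no charset is declared (both are assumptions, as is the helper name decode_part):

```python
import email


def decode_part(part) -> str:
    """Undo the transfer encoding of a non-multipart part and decode it
    to text using the charset declared in its Content-Type header."""
    payload = part.get_payload(decode=True)  # reverses quoted-printable/base64
    if payload is None:  # multipart containers have no payload of their own
        return ""
    charset = part.get_content_charset() or "latin-1"  # fallback is a guess
    return payload.decode(charset, errors="replace")


raw = (
    b"Content-Type: text/plain; charset=iso-8859-1\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"Espa=F1a est=E1 aqu=ED\n"
)
text = decode_part(email.message_from_bytes(raw))
```

For multipart messages the same helper could be applied to each part yielded by msg.walk().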
As part of some email batch processing, we need to decode and clean up the messages. One critical part of that process is separating the mail bodies of a message from the mail attachments. The trickiest part is to determine when a Content-Disposition: inline part is to be considered a message body alternative or a file.
So far, this code seems to handle most of the cases:
from email import message_from_string
def split_parts(raw):
    msg = message_from_string(raw)
    bodies = []
    files = []
    for sub in msg.walk():
        if sub.is_multipart():
            continue
        cd = sub.get("Content-Disposition", "")
        if cd.startswith("attachment") or (cd.startswith("inline") and
                                           sub.get_filename()):
            files.append(sub)
        else:
            bodies.append(sub)
    return bodies, files
Note the reliance on the inline parts to have a filename specified in the headers, which Outlook seems to do for all its multipart/related messages. The Content-ID could also be used as a hint, but according to the RFC 2387 it is not such an indicator.
Therefore, if an embedded image is encoded as a message part that has Content-Disposition: inline, defines a Content-ID and doesn't have a filename then the above code can mistakenly classify it as a message body alternative.
From what I've read in the RFCs, there's not much hope of finding an easy check (especially since coding according to the RFCs is almost useless in the real world, because nobody does it); but I was wondering how big the chances are of hitting the misclassification case.
Rationale
I could have a set of functions to treat each multipart/* case and let them indirectly recurse. However, we don't care so much about a faithful display; as a matter of fact, we filter all HTML messages through tidy. Instead, we are more interested in choosing one of the message body alternatives and saving as many attachments as possible, even if they are intended to be embedded.
Furthermore, some user agents do really weird things when composing multipart/alternative messages with embedded attachments that are not intended to be displayed inline (such as PDF files), as a result of the user dragging and dropping an arbitrary file into the composition window.
I'm not quite following you, but, if you want bodies, I would assume anything with a text/plain or text/html content type, with an inline content disposition, with no file name or no content-id, could be a body part.
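That suggestion could be sketched roughly as follows. This is one reading of it, not a definitive rule: a part counts as a body candidate when it is text/plain or text/html, is not marked as an attachment, and has no filename. Treating a missing Content-Disposition the same as inline is an assumption, Content-ID is deliberately ignored here, and is_body_candidate is a hypothetical name.

```python
BODY_TYPES = {"text/plain", "text/html"}


def is_body_candidate(part) -> bool:
    """Heuristic: text part, inline or unspecified disposition, no filename."""
    if part.is_multipart():
        return False
    # get_content_disposition() returns the lowercased disposition or None.
    disposition = part.get_content_disposition()
    return (
        part.get_content_type() in BODY_TYPES
        and disposition in (None, "inline")
        and not part.get_filename()
    )
```

Under this check an inline image without a filename is no longer mistaken for a body alternative, because its content type is not a text type. An inline text part without a filename would still be classified as a body.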
Hello I have a Python program and it is capable of fetching emails from gmail. Everything works fine, except for the fact that there are a bunch of
......"ransition99/xhtml">=0D=0A<head>=0D=
=0A<ml; =0D=0A=0D=0Acharset=3DUTF-8" />=0D=0A<title>Untitled Document</title>=0D=0A</head>=0D=0A=0D=0A=0D=0A<body>=0D=0A=0D=0A<p>=0D=0A border=3D=
"0" =0D=0A=0D=0Asrc=3"......
that kind of stuff. Would stripping the email of HTML clean this up? I'm not even sure how to refer to this content, is there a particular language that emails are written in?
P.S. I had to delete some of it, because I can't post images.
It's encoded in quoted-printable.
>>> import quopri
>>> quopri.decodestring(b'''=0D=0A border=3D=
... "0"''')
b'\r\n border="0"'
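As for stripping the HTML: that only helps once the quoted-printable layer has been decoded. A rough sketch of doing both steps with the standard library is below. It assumes the body is UTF-8 (the charset shown in the fragment above), and TextExtractor and qp_html_to_text are hypothetical names. It keeps every text node, so the contents of title, script and style elements would be included too.

```python
import quopri
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def qp_html_to_text(raw: bytes, charset: str = "utf-8") -> str:
    """Decode quoted-printable bytes, then reduce the HTML to plain text."""
    html = quopri.decodestring(raw).decode(charset, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.chunks).split())
```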