Cannot read email content from Gmail after extracting with Python program

Hello, I have a Python program that can fetch emails from Gmail. Everything works fine, except that the message bodies contain a bunch of
......"ransition99/xhtml">=0D=0A<head>=0D=
=0A<ml; =0D=0A=0D=0Acharset=3DUTF-8" />=0D=0A<title>Untitled Document</title>=0D=0A</head>=0D=0A=0D=0A=0D=0A<body>=0D=0A=0D=0A<p>=0D=0A border=3D=
"0" =0D=0A=0D=0Asrc=3"......
that kind of stuff. Would stripping the HTML from the email clean this up? I'm not even sure what to call this content. Is there a particular language that emails are written in?
P.S. I had to delete some of it, because I can't post images.

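It's encoded in quoted-printable. The standard library's quopri module decodes it (in Python 3 the result is a bytes object):
>>> import quopri
>>> quopri.decodestring(b'''=0D=0A border=3D=
... "0"''')
b'\r\n border="0"'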
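
If the whole message is available rather than just a fragment, the email package can undo the transfer encoding and the charset in one step. A minimal sketch, assuming a single-part HTML message; the raw bytes below are a made-up stand-in for whatever the IMAP fetch returned:

```python
from email import message_from_bytes, policy

# Hypothetical raw message, standing in for the bytes an IMAP fetch returns.
raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<p border=3D=\r\n"
    b'"0">caf=C3=A9</p>\r\n'
)

# policy.default yields an EmailMessage, whose get_content() decodes
# quoted-printable and then applies the declared charset.
msg = message_from_bytes(raw, policy=policy.default)
html = msg.get_content()
print(html)
```

For a multipart message, get_content() would have to be called on the individual part (for example the one returned by msg.get_body()), not on the container.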

Related

Reading attributes of .msg file

I am trying to read a .msg file to get the sender, recipients, and title.
I'm making this script for my workplace where I'm only allowed to install default python libraries so I want to use the email module to do this.
On the python website I found some examples of using the email module. https://docs.python.org/3/library/email.examples.html
Near the end of the page it talks about getting the sender, subject and recipient. I've tried using this code like this:
# Import the email modules we'll need
from email import policy
from email.parser import BytesParser
with open('test_email.msg', 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)
# Now the header items can be accessed as a dictionary, and any non-ASCII will
# be converted to unicode:
print('To:', msg['to'])
print('From:', msg['from'])
print('Subject:', msg['subject'])
This results in an output:
To: None
From: None
Subject: None
I checked the file test_email.msg, it is a valid email.
When I add a line of code
print(msg)
I get an output of a garbled email the same as if I opened the .msg file in notepad.
Can anybody suggest why the email module isn't finding the sender/recipient/subject correctly?

You are apparently attempting to read some sort of proprietary binary format. The Python email library does not support this; it only handles traditional (basically text) RFC822 / RFC5322 format.
To read Microsoft's OLE formats, you will need a third-party module, and some patience, voodoo, and luck.
Also, for the record, there is no unambiguous definition of .msg. Outlook uses this file extension for its files, but it is also used for files in other formats, including traditional RFC822 files.
(The second link attempts to link to the MS-OXMSG spec on MSDN; but Microsoft have in the past regarded URLs as some sort of depletable resource which runs out when you use it, so the link will probably stop working if enough people click on it.)
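
One way to tell the two kinds of .msg file apart with only the standard library is to look at the first bytes. This is a rough sketch, not a full format check: it relies on the 8-byte signature that starts OLE2 compound files, and parse_if_rfc822 is a hypothetical helper name.

```python
from email import policy
from email.parser import BytesParser

# Signature at the start of every OLE2 / Compound File Binary document,
# the container format Outlook uses for its .msg files.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_if_rfc822(data: bytes):
    """Return a parsed EmailMessage, or None if data looks like an OLE file."""
    if data.startswith(OLE_MAGIC):
        return None  # binary Outlook file: the email module cannot read it
    return BytesParser(policy=policy.default).parsebytes(data)


# Hypothetical text-format message used to exercise the helper.
sample = b"From: a@example.com\r\nTo: b@example.com\r\nSubject: Hi\r\n\r\nBody\r\n"
msg = parse_if_rfc822(sample)
print("Subject:", msg["subject"])
```

If the helper returns None for test_email.msg, that would explain the To/From/Subject values of None in the question.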

Py3 imaplib: get only immediate body (no reply) of email [duplicate]

There are two pre-existing questions on the site.
One for Python, one for Java.
Java How to remove the quoted text from an email and only show the new text
Python Reliable way to only get the email text, excluding previous emails
I want to be able to do pretty much exactly the same (in PHP). I've created a mail proxy, where two people can have a correspondence by emailing a unique email address.
The problem I am finding, however, is that when a person receives the email and hits reply, I struggle to accurately capture the text that they have written and discard the quoted text from previous correspondence.
I'm trying to find a solution that will work for both HTML and plaintext emails, because I am sending both.
I can also, if it helps, insert some <*****RESPOND ABOVE HERE*******> tag in the emails, meaning that I can discard everything below it.
What would you recommend I do? Always add that tag to the HTML copy and the plaintext copy then grab everything above it?
I would still then be left with the scenario of knowing how each mail client creates the response. Because for example Gmail would do this:
On Wed, Nov 2, 2011 at 10:34 AM, Message Platform <35227817-7cfa-46af-a190-390fa8d64a23#dev.example.com> wrote:
## In replies all text above this line is added to your message conversation ##
Any suggestions or recommendations of best practices?
Or should I just grab the 50 most popular mail clients and start creating a custom regex for each? Each of these clients also has a bazillion different locale settings, and I'm guessing the user's locale will also influence what is added.
Or should I just always remove the preceding line if it contains a date, etc.?

Unfortunately, you're in for a world of hurt if you want to try to clean up emails meticulously (removing everything that's not part of the actual reply email itself). The ideal way would be to, as you suggest, write up regex for each popular email client/service, but that's a pretty ridiculous amount of work, and I recommend being lazy and dumb about it.
Interestingly enough, even Facebook engineers have trouble with this problem, and Google has a patent on a method for "Detecting quoted text".
There are three solutions you might find acceptable:
Leave It Alone
The first solution is to just leave everything in the message. Most email clients do this, and nobody seems to complain. Of course, online message systems (like Facebook's 'Messages') look pretty odd if they have inception-style replies. One sneaky way to make this work okay is to render the message with any quoted lines collapsed, and include a little link to 'expand quoted text'.
Separate the Reply from the Older Message
The second solution, as you mention, is to put a delineating message at the top of your messages, like --------- please reply above this line ----------, and then strip that line and anything below when processing the replies. Many systems do this, and it's not the worst thing in the world... but it does make your email look more 'automated' and less personal (in my opinion).
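
A minimal sketch of that marker approach in Python (the rest of the code on this page is Python; the same idea ports directly to PHP). The marker string and the function name text_above_marker are assumptions for illustration. The sketch cuts at the start of the line holding the marker, so a "> " quote prefix in front of it is dropped too:

```python
MARKER = "--------- please reply above this line ----------"


def text_above_marker(body: str, marker: str = MARKER) -> str:
    """Return the part of body that precedes the line containing marker."""
    idx = body.find(marker)
    if idx == -1:
        return body  # marker missing: leave the message untouched
    # Back up to the start of the marker's line so a "> " prefix goes as well.
    line_start = body.rfind("\n", 0, idx) + 1
    return body[:line_start].rstrip()


reply = "Sure, see you then.\n\n> " + MARKER + "\n> old text\n"
print(text_above_marker(reply))
```

An attribution line such as "On [date], [person] wrote:" that the client puts above the quoted marker would still survive and needs separate handling.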
Strip Out Quoted Text
The last solution is to simply strip out any new line beginning with a >, which is, presumably, a quoted line from the reply email. Most email clients use this method of indicating quoted text. Here's some regex (in PHP) that would do just that:
$clean_text = preg_replace('/(^\w.+:\n)?(^>.*(\n|$))+/mi', '', $message_body);
There are some problems using this simpler method:
Many email clients also allow people to quote earlier emails, and preface those quote lines with > as well, so you'll also be stripping out quotes the sender included on purpose.
Usually, there's a line above the quoted email with something like On [date], [person] said. This line is hard to remove, because it's not formatted the same among different email clients, and it may be one or two lines above the quoted text you removed. I've implemented this detection method, with moderate success, in my PHP Imap library.
Of course, testing is key, and the tradeoffs might be worth it for your particular system. YMMV.
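
For readers working in Python, the same regex can be ported almost verbatim. This is a sketch with the same limitations as the PHP version; strip_quoted is a hypothetical name and the sample reply is made up:

```python
import re

# Same pattern as the PHP snippet: an optional attribution line ending in ":"
# followed by one or more lines that start with ">".
QUOTED = re.compile(r"(^\w.+:\n)?(^>.*(\n|$))+", re.MULTILINE | re.IGNORECASE)


def strip_quoted(body: str) -> str:
    """Remove ">"-quoted blocks and the attribution line directly above them."""
    return QUOTED.sub("", body).strip()


reply = (
    "Thanks, sounds good.\n"
    "\n"
    "On Wed, Nov 2, 2011 at 10:34 AM, Bob wrote:\n"
    "> Hi there\n"
    "> How are you?\n"
)
print(strip_quoted(reply))
```

The attribution line is only removed when it sits immediately above the quoted block and ends with a colon, which is exactly the fragility described above.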

There are many libraries out there that can help you extract the reply/signature from a message:
Ruby: https://github.com/github/email_reply_parser
Python: https://github.com/zapier/email-reply-parser or https://github.com/mailgun/talon
JavaScript: https://github.com/turt2live/node-email-reply-parser
Java: https://github.com/Driftt/EmailReplyParser
PHP: https://github.com/willdurand/EmailReplyParser
I've also read that Mailgun has a service to parse inbound email and POST its content to a URL of your choice. It will automatically strip quoted text from your emails: https://www.mailgun.com/blog/handle-incoming-emails-like-a-pro-mailgun-api-2-0/
Hope this helps!

Possibly helpful: quotequail is a Python library that helps identify quoted text in emails.

AFAIK, (standard) emails should quote the whole text by adding a ">" in front of every line, which you could strip with the help of strstr(). Otherwise, did you try to port that Java example to PHP? It's nothing more than regex.
Even sites like GitHub and Facebook have this problem.

Just an idea: You have the text which was originally sent, so you can look for it and remove it and additional surrounding noise from the reply. It is not trivial, because additional line breaks, HTML elements, ">" characters are added by the mail client application.
The regex is definitely better if it works, because it is simple and it perfectly cuts the original text, but if you find that it frequently does not work then this can be a fallback method.
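
A rough sketch of this fallback, assuming plaintext bodies and that the mail client only added ">" prefixes and whitespace. strip_known_original is a hypothetical name, and real replies (re-wrapped lines, HTML) will need more normalization than this:

```python
def strip_known_original(reply: str, original: str) -> str:
    """Drop reply lines that match a line of the originally sent text."""

    def norm(line: str) -> str:
        # Ignore quote markers and surrounding whitespace added by the client.
        return line.lstrip("> \t").strip()

    known = {norm(line) for line in original.splitlines() if norm(line)}
    kept = [line for line in reply.splitlines() if norm(line) not in known]
    return "\n".join(kept).strip()


original = "Hello Bob,\nCan we meet Friday?"
reply = "Friday works.\n\n> Hello Bob,\n> Can we meet Friday?\n"
print(strip_known_original(reply, original))
```

Note that a reply line that happens to repeat a line of the original word for word is removed as well, so this is best used only when the regex finds nothing.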

I agree that quoted text in a reply is just text, so there's no fully accurate way to extract it. Still, you can use a regex replace like this:
$filteringMessage = preg_replace('/.*\n\n((^>+\s{1}.*$)+\n?)+/mi', '', $message);
Test
https://regex101.com/r/xO8nI1/2

Email message extract has encoding characters

I am trying to do some text mining on emails which I have exported from my email client (Mail in OS X) just by copying and pasting into an RTF file.
When I attempt to run tf-idf on the files, either in Python or RapidMiner, I get features which are clearly not in the message content itself. I wonder where they come from and how I can get rid of them. Perhaps from the headers? Examples of such features: fonttbl, colortbl, cocoa rtf, paperw, etc. Clearly they are some properties of the email. Where do they come from, and how can I remove them from the files or extract only the email contents from the original email messages?
Perhaps this is an encoding issue??
Thanks!

reading Lotus Notes documents via COM

I am trying to read e-mails in a Lotus Notes database via Python and COM (using pythonwin and win32com).
I can connect to the database and read NotesDocument items but
doc = folder.GetFirstDocument()
doc.GetItemValue('Body')
returns the plain text contents of the email. I can get the headers, subject, date, etc but body is plaintext. I'm trying to fetch the HTML source of the email which includes links and other formatting. I know the stuff is there because within Notes I can view-->show--> page source.
I've tried
doc.GetMIMEEntity('Body')
but this returns None.

Try adding this line right after where you get the session:
session.ConvertMIME = False
Update:
Barry commented that it worked this way:
doc.GetFirstItem("Body").GetMIMEEntity()

The body is a rich text item. You won't be able to access the HTML version of the body field, but you can navigate around the rich text item using the NotesRichText... classes.
The NotesRichTextNavigator class has an example to get you started. It is unfortunately not very easy to get around in that object.

Extra Tabs in IMAP HTML Text

I'm using Python and imaplib to obtain emails from an IMAP server (it supports all kinds of IMAP servers: Gmail, etc.).
My problem is: Using the IMAP BODY[INDEX] command to fetch a specific body part, the HTML comes with extra tabs. As in:
(...)</a>\t\t\t\t\t\t\t\t<a>(...)
When showing the HTML the tabs are obviously extra:
(The screenshot is in Portuguese, but I believe that is not relevant.)
I have searched IMAP documentation but found nothing that helps. I am guessing these \t are always following tag closes (such as \t\t\t\t\t), so I could just find all tabs that come after a tag close and delete them, but I don't know if that would be a reliable method at all.
Thank you

I found a solution (for the time being at least).
When receiving data from an IMAP call response, there are \\r\\n characters delimiting the lines. I remove these.
However, I discovered that in some instances there are also \\t characters coupled with them. For example:
\\r\\n\\t\\t\\t\\t
If I remove the \\t together with the \\r\\n, the HTML is presented perfectly.
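
A minimal sketch of that cleanup, assuming the body part is still the raw bytes object returned by imaplib, so the line breaks and tabs are real control characters rather than escaped text. strip_crlf_and_tabs is a hypothetical name:

```python
import re

# A CRLF line break plus any tabs that immediately follow it.
CRLF_TABS = re.compile(rb"\r\n\t*")


def strip_crlf_and_tabs(part: bytes) -> bytes:
    """Remove CRLF line breaks together with the tabs that trail them."""
    return CRLF_TABS.sub(b"", part)


fragment = b"(...)</a>\r\n\t\t\t\t<a>(...)"
print(strip_crlf_and_tabs(fragment))
```

Removing the line breaks outright can glue two words together where a break separated text rather than tags, so replacing the match with a single space may be the safer choice for some messages.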
