How can I get an email message's text content using Python?

Given an RFC822 message in Python 2.6, how can I get the right text/plain content part? Basically, the algorithm I want is this:
message = email.message_from_string(raw_message)
if has_mime_part(message, "text/plain"):
    mime_part = get_mime_part(message, "text/plain")
    text_content = decode_mime_part(mime_part)
elif has_mime_part(message, "text/html"):
    mime_part = get_mime_part(message, "text/html")
    html = decode_mime_part(mime_part)
    text_content = render_html_to_plaintext(html)
else:
    # fallback
    text_content = str(message)
return text_content
Of these things, I have get_mime_part and has_mime_part down pat, but I'm not quite sure how to get the decoded text from the MIME part. I can get the encoded text using get_payload(), but if I try to use the decode parameter of the get_payload() method (see the doc) I get an error when I call it on the text/plain part:
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/email/message.py", line 189, in get_payload
    raise TypeError('Expected list, got %s' % type(self._payload))
TypeError: Expected list, got <type 'str'>
In addition, I don't know how to take HTML and render it to text as closely as possible.

In a multipart e-mail, email.message.Message.get_payload() returns a list with one item for each part. The easiest way is to walk the message and get the payload on each part:
import email
msg = email.message_from_string(raw_message)
for part in msg.walk():
    # each part is either a non-multipart, or another multipart message
    # that contains further parts... Message is organized like a tree
    if part.get_content_type() == 'text/plain':
        print part.get_payload()  # prints the raw text
For a non-multipart message, no need to do all the walking. You can go straight to get_payload(), regardless of content_type.
msg = email.message_from_string(raw_message)
msg.get_payload()
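If the content is encoded, you need to pass None as the first parameter to get_payload(), followed by True. The decode flag is the second parameter; the first is the index i of a sub-part, which is why passing True on its own raises the TypeError shown in the question. Passing it by keyword, get_payload(decode=True), is equivalent. For example, suppose that my e-mail contains an MS Word document attachment: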
msg = email.message_from_string(raw_message)
for part in msg.walk():
    if part.get_content_type() == 'application/msword':
        name = part.get_param('name') or 'MyDoc.doc'
        f = open(name, 'wb')
        f.write(part.get_payload(None, True))  # You need None as the first param
                                               # because part.is_multipart()
                                               # is False
        f.close()
As for getting a reasonable plain-text approximation of an HTML part, I've found that html2text works pretty darn well.
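The algorithm from the question can be sketched end to end as follows. This is a minimal sketch, assuming Python 3 (the thread itself targets Python 2.6) and using only the standard library; the helper names decode_part, first_part and text_content are hypothetical, and the tag-stripping class is a crude stand-in for html2text, not a replacement for it.

```python
import email
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Very rough HTML-to-text converter: keeps text nodes, drops tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def decode_part(part):
    """Return the part's payload as str, undoing transfer encoding and charset."""
    raw = part.get_payload(decode=True)  # bytes, or None for multipart containers
    if raw is None:
        return ""
    charset = part.get_content_charset() or "utf-8"  # assumption: default to UTF-8
    return raw.decode(charset, errors="replace")


def first_part(message, content_type):
    """Return the first non-multipart part with the given content type, or None."""
    for part in message.walk():
        if not part.is_multipart() and part.get_content_type() == content_type:
            return part
    return None


def text_content(raw_message):
    """Prefer text/plain, then text/html with tags stripped, then the raw message."""
    message = email.message_from_string(raw_message)
    plain = first_part(message, "text/plain")
    if plain is not None:
        return decode_part(plain)
    html = first_part(message, "text/html")
    if html is not None:
        extractor = _TextExtractor()
        extractor.feed(decode_part(html))
        extractor.close()
        return "".join(extractor.chunks)
    return str(message)  # fallback
```

Taking the charset from get_content_charset() matters for parts that are not UTF-8; the errors="replace" argument is a design choice that favours getting some text over raising on a mislabelled part.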

Flat is better than nested ;)
from email.mime.multipart import MIMEMultipart
assert isinstance(msg, MIMEMultipart)
for _ in [k.get_payload() for k in msg.walk() if k.get_content_type() == 'text/plain']:
    print _

To add on to @Jarret Hardie's excellent answer:
I personally like to transform that kind of data structure into a dictionary that I can reuse later, so something like this, where the content_type is the key and the payload is the value (note that if several parts share a content type, only the last one is kept):
import email
[...]
email_message = {
    part.get_content_type(): part.get_payload()
    for part in email.message_from_bytes(raw_email).walk()
}
print(email_message["text/plain"])

This is what I have for a Gmail account using the app password method.
from imap_tools import MailBox
import email
my_email = "your email"
my_pass = "app password"
mailbox = MailBox('imap.gmail.com').login(my_email, my_pass)
for msg in mailbox.fetch('Subject " "', charset='utf8'):
    print("Message id:", msg.uid)
    print("Message Subject:", msg.subject)
    print("Message Date:", msg.date)
    print("Message Text:", msg.text)

Try my lib for IMAP: https://github.com/ikvk/imap_tools
from imap_tools import MailBox, AND
# Get date, subject and body len of all emails from INBOX folder
with MailBox('imap.mail.com').login('test@mail.com', 'pwd') as mailbox:
    for msg in mailbox.fetch():
        print(msg.date, msg.subject, len(msg.text or msg.html))
See html2text: https://pypi.org/project/html2text/.
And maybe msg.text is enough.

Related

AttributeError: "NoneType" object has no attribute 'decode'

I am trying to read an email from my gmail inbox in Python3. So I followed this tutorial : https://www.thepythoncode.com/article/reading-emails-in-python
My code is the following :
username = "*****@gmail.com"
password = "******"
# create an IMAP4 class with SSL
imap = imaplib.IMAP4_SSL("imap.gmail.com")
# authenticate
imap.login(username, password)
status, messages = imap.select("INBOX")
# total number of emails
messages = int(messages[0])
for i in range(messages, 0, -1):
    # fetch the email message by ID
    res, msg = imap.fetch(str(i), "(RFC822)")
    for response in msg:
        if isinstance(response, tuple):
            # parse a bytes email into a message object
            msg = email.message_from_bytes(response[1])
            # decode the email subject
            subject = decode_header(msg["Subject"])[0][0]
            if isinstance(subject, bytes):
                # if it's a bytes, decode to str
                subject = subject.decode()
            # email sender
            from_ = msg.get("From")
            # if the email message is multipart
            if msg.is_multipart():
                # iterate over email parts
                for part in msg.walk():
                    # extract content type of email
                    content_type = part.get_content_type()
                    content_disposition = str(part.get("Content-Disposition"))
                    # get the email body
                    body = part.get_payload(decode=True).decode()
                    print(str(body))
imap.close()
imap.logout()
print('DONE READING EMAIL')
The libraries I am using are:
import imaplib
import email
from email.header import decode_header
However, when I execute it I get the following error message, which I don't understand because I never have an empty email in my inbox ...
Traceback (most recent call last):
File "<ipython-input-19-69bcfd2188c6>", line 38, in <module>
body = part.get_payload(decode=True).decode()
AttributeError: 'NoneType' object has no attribute 'decode'
Does anyone have an idea what my problem could be?

From the documentation:
Optional decode is a flag indicating whether the payload should be
decoded or not, according to the Content-Transfer-Encoding header.
When True and the message is not a multipart, the payload will be
decoded if this header’s value is quoted-printable or base64. If some
other encoding is used, or Content-Transfer-Encoding header is
missing, or if the payload has bogus base64 data, the payload is
returned as-is (undecoded). If the message is a multipart and the
decode flag is True, then None is returned. The default for decode is
False.
(Note: this link is for python2 - for whatever reason the corresponding page for python3 doesn't seem to mention get_payload.)
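So the None most likely comes from a multipart container: msg.walk() yields the message itself and any nested multipart/* parts as well as their leaf parts, and get_payload(decode=True) returns None for each container. The other cases mentioned in the documentation do not produce None; the payload comes back as-is (undecoded) when a part:
- is missing the Content-Transfer-Encoding header (which gives email.message no indication of how it is meant to be decoded), or
- is using an encoding other than QP or base64 (email.message does not support decoding it), or
- claims to be base64-encoded but contains a wrongly encoded string that cannot be decoded.
The best thing to do is probably just to skip over any part that returns None.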
Replace:
body = part.get_payload(decode=True).decode()
with:
payload = part.get_payload(decode=True)
if payload is None:
    continue
body = payload.decode()
Note that the two decoding steps do different things: get_payload(decode=True) undoes the Content-Transfer-Encoding and returns bytes, while the decode() method that you are calling on the payload converts those bytes to str (assuming UTF-8 unless you pass the part's charset). If bytes are all you need, you can omit that second step entirely:
body = part.get_payload(decode=True)
if body is None:
    continue
If you add some print statements regarding from_ and subject, you should be able to identify the affected message(s), and then go to "show original" in gmail to compare, and see exactly what is going on.
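To see where the None comes from without touching a real mailbox, the behaviour can be reproduced on a locally built message. This is only a sketch, assuming Python 3; iter_text_bodies is a hypothetical helper name and the sample message stands in for one fetched over IMAP.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def iter_text_bodies(msg):
    """Yield (content_type, text) for every text/* leaf part of msg."""
    for part in msg.walk():
        if part.is_multipart():
            # Containers have no payload of their own: decode=True gives None.
            continue
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # bytes after undoing base64/QP
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"  # assumption: UTF-8 default
        yield part.get_content_type(), payload.decode(charset, errors="replace")


# Hypothetical sample message, built locally instead of fetched from a server.
sample = MIMEMultipart("alternative")
sample.attach(MIMEText("plain body", "plain", "utf-8"))
sample.attach(MIMEText("<p>html body</p>", "html", "utf-8"))
parsed = email.message_from_bytes(sample.as_bytes())

container_payload = parsed.get_payload(decode=True)  # None: parsed is multipart
bodies = dict(iter_text_bodies(parsed))
```

Skipping on is_multipart() rather than on a None check makes the intent explicit, and decoding with the declared charset avoids a UnicodeDecodeError on parts that are not UTF-8.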

A high-level IMAP library may help here:
from imap_tools import MailBox
# get emails from INBOX folder
with MailBox('imap.mail.com').login('test@mail.com', 'pwd', 'INBOX') as mailbox:
    for msg in mailbox.fetch():
        msg.uid # str or None: '123'
        msg.subject # str: 'some subject 你 привет'
        msg.from_ # str: 'sender@ya.ru'
        msg.to # tuple: ('iam@goo.ru', 'friend@ya.ru', )
        msg.cc # tuple: ('cc@mail.ru', )
        msg.bcc # tuple: ('bcc@mail.ru', )
        msg.reply_to # tuple: ('reply_to@mail.ru', )
        msg.date # datetime.datetime: 1900-1-1 for unparsed, may be naive or with tzinfo
        msg.date_str # str: original date - 'Tue, 03 Jan 2017 22:26:59 +0500'
        msg.text # str: 'Hello 你 Привет'
        msg.html # str: '<b>Hello 你 Привет</b>'
        msg.flags # tuple: ('SEEN', 'FLAGGED', 'ENCRYPTED')
        msg.headers # dict: {'Received': ('from 1.m.ru', 'from 2.m.ru'), 'AntiVirus': ('Clean',)}
        for att in msg.attachments: # list: [Attachment]
            att.filename # str: 'cat.jpg'
            att.content_type # str: 'image/jpeg'
            att.payload # bytes: b'\xff\xd8\xff\xe0\'
https://github.com/ikvk/imap_tools

Python: Encoding message as base64 to solve "!" and line length issue

BACKGROUND
Regarding the following articles:
https://www.drupal.org/project/mimemail/issues/31524
Exclamation Point Randomly In Result of PHP HTML-Email
https://sourceforge.net/p/phpmailer/bugs/53/
All the problems and solutions refer to PHP issues, but I have run into this problem in Python.
If I send the emails directly to recipients, all is well, no exclamation marks appear, and the message displays properly.
However, when utilizing the "Sympa" (https://www.sympa.org/) system that the University uses for its "mailing list" solution, emails from this system have exclamation marks and line breaks inserted in the message, and the HTML breaks, causing display issues.
The problem stems from line length. Any line longer than the magical 998-character limit gets these exclamation marks and line breaks inserted.
NOW THE QUESTION
One of the solutions they mention is encoding the body of a message in base64, which apparently is immune to the line length issue. However, I cannot figure out how to properly form a message in Python with the proper headers and encoding so that the message displays properly in an email client.
Right now, I have only succeeded in sending emails with base64-encoded bodies as attached files. Bleck!
I need to send HTML-encoded emails (tables and some formatting). I create one very long concatenated string of all the HTML squished together. It is ugly but will display properly.
HELP?!
NOTE: If anyone else has had this problem and has a solution that will allow me to send emails that are not plagued by line length issue, I am all ears!
Source Code as Requested
# Add support for emailing reports
import smtplib
# from email.mime.text import MIMEText
from email.mime.message import MIMEMessage
from email.encoders import encode_base64
from email.message import Message
... ...
headerData = {"rDate": datetime.datetime.now().strftime('%Y-%m-%d')}
msg_body = msg_header.format(**headerData) + contact_table + spacer + svc_table
theMsg = Message()
theMsg.set_payload(msg_body)
encode_base64(theMsg)
theMsg.add_header('Content-Transfer-Encoding', 'base64')
envelope = MIMEMessage(theMsg, 'html')
theSubject = "Audit for: "+aService['description']
envelope['Subject'] = theSubject
from_addr = "xxx@xxx"
envelope['From'] = from_addr
to_addrs = "xxx@xxxx"
# to_addrs = aService['contact_email']
envelope['To'] = to_addrs
# Send the message via our own SMTP server.
s = smtplib.SMTP('x.x.x.x')
s.sendmail(from_addr, to_addrs, envelope.as_string())
s.quit()
SOLUTION, thank you @Serge Ballesta
Going back to MIMEText, and specifying a character set seemed to do the trick:
envelope = MIMEText(msg_body, 'html', _charset='utf-8')
assert envelope['Content-Transfer-Encoding'] == 'base64'
envelope['Subject'] = "Audit for: "+aService['description']
from_addr = "f5-admins@utlists.utexas.edu"
envelope['From'] = from_addr
to_addrs = "xx-test@xx.xx.edu"
envelope['To'] = to_addrs
# Send the message via our own SMTP server.
s = smtplib.SMTP('xx.xx.xx.edu')
s.sendmail(from_addr, to_addrs, envelope.as_string())
s.quit()
Apparently I was just stabbing around and did not account for the character set. The fix was using MIMEText and not MIMEMessage.

Normally, email.mime.text.MIMEText automatically sets the Content-Transfer-Encoding to base64 if the body is not declared to be plain ASCII. So, assuming that body contains the HTML text of the body of the message (no mail headers there), declaring it as utf-8 should be enough:
msg = email.mime.text.MIMEText(body, 'html', _charset='utf-8')
# assert the cte:
assert msg['Content-Transfer-Encoding'] == 'base64'
theSubject = "Audit for: "+aService['description']
msg['Subject'] = theSubject
from_addr = "xxx@xxx"
msg['From'] = from_addr
to_addrs = "xxx@xxxx"
# to_addrs = aService['contact_email']
msg['To'] = to_addrs
# optionally set other headers
# msg['Date'] = email.utils.formatdate(localtime=True)  # RFC 5322 date format
# Send the message
s = smtplib.SMTP('x.x.x.x')
s.sendmail(from_addr, to_addrs, msg.as_bytes())
s.quit()
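The effect on line length can be checked locally before sending anything. This is a small sketch, assuming Python 3; the body and subject are made-up stand-ins for the audit report, and nothing here talks to an SMTP server.

```python
from email.mime.text import MIMEText

# Hypothetical body: one very long line of HTML, well past the 998-character limit.
body = "<table>" + "<tr><td>caf\u00e9</td></tr>" * 200 + "</table>"

msg = MIMEText(body, "html", _charset="utf-8")
msg["Subject"] = "Audit report"  # placeholder subject

# Serialize as it would go on the wire and measure the longest line.
wire_text = msg.as_string()
longest_line = max(len(line) for line in wire_text.splitlines())

# The receiving side gets the original single-line HTML back after decoding.
round_tripped = msg.get_payload(decode=True).decode("utf-8")
```

Because the base64 body is wrapped into short lines by the email package, an intermediate mail system that splits long lines has nothing to split, which is why the stray exclamation marks disappear.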

python imap read email body return None after get_payload

Hello there, I am trying to read my email, and my code is:
FROM_EMAIL = "emailadd"
FROM_PWD = "pasword"
SMTP_SERVER = "imapaddress"
SMTP_PORT = 111
mail = imaplib.IMAP4_SSL(SMTP_SERVER)
mail.login(FROM_EMAIL,FROM_PWD)
mail.select('inbox')
type,data = mail.search(None, '(SUBJECT "IP")')
msgList = data[0].split()
last=msgList[len(msgList)-1]
type1,data1 = mail.fetch(last, '(RFC822)')
msg=email.message_from_string(data1[0][1])
content = msg.get_payload(decode=True)
mail.close()
mail.logout()
When I print content, it gives me back None, but my email has body text.
Can anyone help me?

From the documentation,
If the message is a multipart and the decode flag is True, then None is returned.
Moral: Don't set the decode flag when you fetch multipart messages.
If you are going to parse multipart messages, you might want to become familiar with the relevant RFC. Meanwhile, this quick-and-dirty snippet might get you the data you need:
msg=email.message_from_string(data1[0][1])
# If we have a (nested) multipart message, try to get
# past all of the potatoes and straight to the meat
# For production, you might want a more thought-out
# approach, but maybe just fetching the first item
# will be sufficient for your needs
while msg.is_multipart():
    msg = msg.get_payload(0)
content = msg.get_payload(decode=True)

Building on Rob's answer, here is slightly more sophisticated code:
msg = email.message_from_string(data1[0][1])
if msg.is_multipart():
    for part in msg.walk():
        ctype = part.get_content_maintype()
        cdispo = str(part.get('Content-Disposition'))
        # skip any text/plain (txt) attachments
        if ctype == 'text' and 'attachment' not in cdispo:
            body = part.get_payload(decode=True)  # decode
            break
# not multipart - plain text
else:
    body = msg.get_payload(decode=True)
This code is mainly taken from this answer.
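On Python 3.6 or later, the standard library can pick the body part for you. The following is a sketch under that assumption; the sample message is built locally as a stand-in for the bytes in data1[0][1], and its subject and text are invented for the example.

```python
import email
from email import policy
from email.message import EmailMessage

# Hypothetical sample; with IMAP you would use the fetched bytes instead.
sample = EmailMessage()
sample["Subject"] = "IP"
sample.set_content("The address is 192.0.2.1")
sample.add_alternative("<p>The address is <b>192.0.2.1</b></p>", subtype="html")
raw_bytes = sample.as_bytes()

# policy.default makes the parser build EmailMessage objects, which offer get_body().
msg = email.message_from_bytes(raw_bytes, policy=policy.default)

# Prefer the plain-text alternative, fall back to HTML if there is none.
body_part = msg.get_body(preferencelist=("plain", "html"))
content = body_part.get_content() if body_part is not None else ""
```

get_content() already undoes the transfer encoding and the charset, so content is a str; with the older compat32 API used elsewhere in this thread you would get bytes from get_payload(decode=True) and decode them yourself.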

Python HTML Email : How to insert list items in HTML email

I tried hard to find a solution to this issue, but all in vain, so finally I have to ask you guys. I have an HTML email (sent using Python's smtplib). Here is the code:
Message = """From: <abc@abc.com>
To: <abc@abc.com>
MIME-Version: 1.0
Content-type: text/html
Subject: test
Hello,
Following is the message
""" + '\n'.join(mail_body) + """
Thank you.
"""
In the above code, mail_body is a list which contains lines of output from a process. What I want is to display these lines, line by line, in the HTML email. What happens now is that the lines run together on a single line.
I am storing the output (of the process) like this:
for line in cmd.stdout.readlines():
    mail_body.append(line)
Current Output in HTML email is:
Hello,
abc
Thank you.
What i want :
Hello,
a
b
c
Thank you.
I just want to attach my process output in HTML email line by line. Can my output be achieved in any way?
Thanks and Regards

You could generate the email content to send using email package (from stdlib) e.g.:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from cgi import escape  # Python 2; on Python 3 use: from html import escape
from email.header import Header
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from smtplib import SMTP_SSL
login, password = 'me@example.com', 'my password'
# create message
msg = MIMEMultipart('alternative')
msg['Subject'] = Header('subject…', 'utf-8')
msg['From'] = login
msg['To'] = ', '.join([login,])
# Create the body of the message (a plain-text and an HTML version).
text = "Hello,\nFollowing is the message\n%(somelist)s\n\nThank you." % dict(
    somelist='\n'.join(["- " + item for item in mail_body]))
html = """<!DOCTYPE html><title></title><p>Hello,<p>Following is the message
<ul>%(somelist)s</ul><p>Thank you. """ % dict(
    somelist='\n'.join(["<li> " + escape(item) for item in mail_body]))
# According to RFC 2046, the last part of a multipart message, in this case
# the HTML message, is best and preferred.
msg.attach(MIMEText(text, 'plain', 'utf-8'))
msg.attach(MIMEText(html, 'html', 'utf-8'))
# send it
s = SMTP_SSL('smtp.mail.example.com', timeout=10) # no cert check on Python 2
s.set_debuglevel(0)
try:
    s.login(login, password)
    s.sendmail(msg['From'], msg['To'], msg.as_string())
finally:
    s.quit()

In HTML, a new line is not \n; it is <br>, for "line break". Since you are also not using HTML tags in this email, you need to know that in MIME messages a newline is \r\n and not just \n.
So you should write:
'\r\n'.join(mail_body)
That covers newlines in the MIME message itself, but if you are going to use HTML for formatting, then you need to know that <br> is the line break, and it would be:
'<br>'.join(mail_body)
To be comprehensive, you could try:
'\r\n<br>'.join(mail_body)
But I do not know what that would look like...
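As a rough illustration of the HTML side, each output line can be wrapped in its own list item. This sketch assumes Python 3 (html.escape); lines_to_html is a hypothetical helper and the sample mail_body stands in for the process output.

```python
from html import escape


def lines_to_html(lines):
    """Render each output line as its own <li>, escaping HTML special characters."""
    items = "\n".join(
        "<li>{}</li>".format(escape(line.rstrip("\r\n"))) for line in lines
    )
    return (
        "<p>Hello,</p>\n<p>Following is the message</p>\n"
        "<ul>\n{}\n</ul>\n<p>Thank you.</p>".format(items)
    )


# Hypothetical process output standing in for mail_body.
mail_body = ["a\n", "b\n", "x < y\n"]
html_body = lines_to_html(mail_body)
```

Escaping matters because process output may contain characters such as < or &, which would otherwise be interpreted as markup by the mail client.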

How do you check an email server (gmail) for a certain address and run something based on that address?

I've been trying to use 'import poplib' to access gmail, since I have Pop turned on in settings- but how do I actually then check the message for its 'from' address and then run something based on it? Also, what would be the command to strip the 'body' text from the message?

Here is how you can get subject and sender of each message in your GMail inbox using imaplib.
import imaplib
from email.parser import HeaderParser
conn = imaplib.IMAP4_SSL('imap.gmail.com')
conn.login('username@gmail.com', 'password')
# Select the mail box
status, messages = conn.select('INBOX')
if status != "OK":
    print "Incorrect mail box"
    exit()
if int(messages[0]) > 0:
    for message_number in range(1,int(messages[0])+1):
        data = conn.fetch(message_number, '(BODY[HEADER])')
        parser = HeaderParser()
        msg = parser.parsestr(data[1][0][1])
        print "Subject: %s" % msg['subject']
        print "From: %s" % msg['from']
You will probably need more information. Start from the official imaplib documentation.

There is the rfc822 module.
I guess messages from poplib can be downloaded from the server,
then put into a file-like object
>>> import StringIO
>>> f = StringIO.StringIO(message)
>>> import rfc822
and passed to
>>> rfc822.Message(f)
Try this out, and also check out the module documentation.
I hope it helps.
There is another Python module:
>>> import email
>>> email.message_from_string(...)
This should provide you with read access to the headers and also support multiple formats of body content.

From the documentation:
POP3.retr(which)
Retrieve whole message number which, and set its seen flag. Result is in form (response, ['line', ...], octets).
So, assuming you have put the result of retr() into a variable called response, the lines of the message are stored as a list in response[1]. By RFC 2822 we know that the headers are separated from the body of the message by a blank line. The sender of the message will be in the From: header line. So we can just iterate over the lines of the message, stop when we get a blank line, and set a variable to our sender when we see a line that starts with From:.
sender = None
for line in response[1]:
    if line.startswith("From: "):
        sender = line.partition(" ")[2].strip()
    elif line == "":
        break
If you plan to do a lot with the headers, it might be useful to put them into a dictionary by header name. Since each header can appear multiple times, each value in the dictionary should be a list.
headers = {}
for line in response[1]:
    if line == "":
        break
    line = line.partition(" ")
    key = line[0].strip().rstrip(":")
    value = line[2].strip()
    headers.setdefault(key, []).append(value)
After this you can use headers["From"][0] to get the sender of the message.
I wanted to show the basic way of doing this, because it's not very complicated, but Python can do most of the work for you. Again, assuming that your retr() result is in response:
import email.parser
# convert our message back to a string and parse only the headers
headers = email.parser.Parser().parsestr("\n".join(response[1]), headersonly=True)
print headers["From"] # prints the sender
You can find out more about the message object in the documentation for the email module.
The From: line of an e-mail message may have additional text besides the e-mail address, such as the sender's name. You can extract the email address with a regular expression:
import re
sender = re.search(r".*[ <](.+@.+)\b", headers["From"]).group(1)
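The standard library also has a helper for splitting a From: value, which avoids maintaining a regular expression. The sketch below assumes Python 3; the header values, the dispatch function and the handlers mapping are all hypothetical, shown only to illustrate running something based on the sender's address.

```python
from email.utils import parseaddr

# Hypothetical From: header values.
name, address = parseaddr("Jane Doe <jane@example.com>")
bare_name, bare_address = parseaddr("jane@example.com")


def dispatch(from_header, handlers):
    """Call the handler registered for the sender's address, if there is one."""
    _, sender = parseaddr(from_header)
    handler = handlers.get(sender.lower())
    return handler() if handler is not None else None


# Hypothetical mapping from sender address to the action to run.
handlers = {"jane@example.com": lambda: "ran jane's job"}
result = dispatch("Jane Doe <Jane@Example.com>", handlers)
```

Lower-casing the address before the lookup is a convenience assumption here; strictly speaking only the domain part of an address is guaranteed to be case-insensitive.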
