Python IMAP: =?utf-8?Q? in subject string

I am displaying new email with IMAP, and everything looks fine, except for one message subject shows as:
=?utf-8?Q?Subject?=
How can I fix it?

In MIME terminology, those encoded chunks are called encoded-words. You can decode them like this:
import email.header
text, encoding = email.header.decode_header('=?utf-8?Q?Subject?=')[0]
Check out the docs for email.header for more details.

This is a MIME encoded-word. You can parse it with email.header:
import email.header
def decode_mime_words(s):
    return u''.join(
        word.decode(encoding or 'utf8') if isinstance(word, bytes) else word
        for word, encoding in email.header.decode_header(s))
print(decode_mime_words(u'=?utf-8?Q?Subject=c3=a4?=X=?utf-8?Q?=c3=bc?='))

The text is encoded as a MIME encoded-word. This is a mechanism defined in RFC2047 for encoding headers that contain non-ASCII text such that the encoded output contains only ASCII characters.
In Python 3.3+, the parsing classes and functions in email.parser automatically decode "encoded words" in headers if their policy argument is set to policy.default:
>>> import email
>>> from email import policy
>>> msg = email.message_from_file(open('message.txt'), policy=policy.default)
>>> msg['from']
'Pepé Le Pew <pepe@example.com>'
The parsing classes and functions are:
email.parser.BytesParser
email.parser.Parser
email.message_from_bytes
email.message_from_binary_file
email.message_from_string
email.message_from_file
Confusingly, up to at least Python 3.10, the default policy for these parsing functions is not policy.default, but policy.compat32, which does not decode "encoded words".
>>> msg = email.message_from_file(open('message.txt'))
>>> msg['from']
'=?utf-8?q?Pep=C3=A9?= Le Pew <pepe@example.com>'
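The difference between the two policies can also be seen without a message.txt file. The following is a minimal sketch that assumes a small hand-written message (the raw string and the variable names modern and legacy are made up for illustration):

```python
import email
from email import policy

# Hypothetical raw message, inlined so the example needs no message.txt.
raw = (
    "From: =?utf-8?q?Pep=C3=A9?= Le Pew <pepe@example.com>\n"
    "Subject: =?utf-8?Q?Subject?=\n"
    "\n"
    "Body text\n"
)

# With policy.default, encoded-words in headers are decoded on access.
modern = email.message_from_string(raw, policy=policy.default)
# Without a policy argument the parser falls back to policy.compat32.
legacy = email.message_from_string(raw)

print(modern["subject"])  # decoded text
print(legacy["subject"])  # still the raw encoded-word
```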

Try Imbox
imaplib is a very low-level library, and the results it returns are hard to work with.
Installation
pip install imbox
Usage
from imbox import Imbox
with Imbox('imap.gmail.com',
           username='username',
           password='password',
           ssl=True,
           ssl_context=None,
           starttls=False) as imbox:
    all_inbox_messages = imbox.messages()
    for uid, message in all_inbox_messages:
        message.subject

In Python 3, decoding this to an approximated string is as easy as:
from email.header import decode_header, make_header
decoded = str(make_header(decode_header("=?utf-8?Q?Subject?=")))
See the documentation of decode_header and make_header.

High level IMAP lib may be useful here: imap_tools
from imap_tools import MailBox, AND
# get list of email subjects from INBOX folder
with MailBox('imap.mail.com').login('test@mail.com', 'pwd', 'INBOX') as mailbox:
    subjects = [msg.subject for msg in mailbox.fetch()]
Parsed email message attributes
Query builder for searching emails
Actions with emails: copy, delete, flag, move, seen
Actions with folders: list, set, get, create, exists, rename, delete, status
No dependencies

Related

issues with parsing email using python with "imaplib" library, html lines char limit, and additional non unicode chara

I'm using Python 3 and want to validate emails sent to my inbox.
I'm using imaplib,
and I've managed to get the email content.
However, the mail is unreadable and looks corrupted (the variable html123 in the code).
After fetching the mail, I get the content using:
mail_body = email.message_from_string(str(data[1][0][1], 'utf-8'))
This is the original mail I see in the mailbox:
dear blabla,
We’ve added new tasks to your account. Please log in to your account to review and.....
This is the mail I get in Python:
| |
dear blabla,
We=E2=80=99ve added new tasks to your account. Plea= se log in to your account....
So there are 3 issues in this example (the real mail has many more):
1 - the ’ was replaced with =E2=80=99
2 - the word "please" is cut at the end of the line with =
3 - all the extra signs/characters (|| ---) you see above
This is the relevant part of the code:
data = self.mail_conn.fetch(str(any_email_id), f'({fetch_protocol})')
mail_body = email.message_from_string(str(data[1][0][1], 'utf-8'))
html123 = mail_body.get_payload()
x1 = (html2text.html2text(html123))

The data you get from imaplib is in "quoted-printable" encoding. https://en.wikipedia.org/wiki/Quoted-printable
To decode it you can use the built-in quopri module:
import quopri
quopri.decodestring(b"we=E2=80=99ve").decode("utf-8") # -> we’ve
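As a rough sketch of both approaches on Python 3: quopri.decodestring() undoes the =XX escapes and the soft line breaks, while parsing the raw bytes with policy.default lets the email package apply the Content-Transfer-Encoding and the charset itself. The raw message below is a made-up stand-in for data[1][0][1] from the question, not the asker's actual mail.

```python
import email
import quopri
from email import policy

# Soft line breaks ("=" at end of line) and =XX escapes are both undone.
encoded = b"We=E2=80=99ve added new tasks to your account. Plea=\nse log in"
plain = quopri.decodestring(encoded).decode("utf-8")
print(plain)

# Hypothetical raw message standing in for data[1][0][1] from imaplib.
raw = (
    b"Content-Type: text/html; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"<p>We=E2=80=99ve added new tasks. Plea=\n"
    b"se log in.</p>\n"
)
msg = email.message_from_bytes(raw, policy=policy.default)
# get_content() applies the transfer encoding and the declared charset.
html = msg.get_content()
print(html)
```

Parsing from bytes rather than from str(data, 'utf-8') avoids guessing a charset before the headers have been read.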

Python mail inserts space every 171 characters

I am trying to write a python script to send an email that uses html formatting and involves a lot of non-breaking spaces. However, when I run it, some of the &nbsp strings are interrupted by spaces that occur every 171 characters, as can be seen by this example:
#!/usr/bin/env python
import smtplib
import socket
from email.mime.text import MIMEText
emails = ["my@email.com"]
sender = "test@{0}".format(socket.gethostname())
message = "<html><head></head><body>"
for i in range(20):
    message += "&nbsp;" * 50
    message += "<br/>"
message += "</body>"
message = MIMEText(message, "html")
message["Subject"] = "Test"
message["From"] = sender
message["To"] = ", ".join(emails)
mailer = smtplib.SMTP("localhost")
mailer.sendmail(sender, emails, message.as_string())
mailer.quit()
The example should produce a blank email that consists of only spaces, but it ends up looking something like this:
&nbsp ;
&nb sp;
& nbsp;
&nbs p;
&n bsp;
Edit: In case it is important, I am running Ubuntu 15.04 with Postfix for the smtp client, and using python2.6.

I can replicate this, in a way, but my line breaks come every 999 characters. RFC 821 says the maximum length of a line is 1000 characters including the line break, so that's probably why.
The post below gives a different way to send an HTML email in Python, and I believe the MIME type "multipart/alternative" is the correct way:
Sending HTML email using Python
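One way to stay well under that line limit, sketched here for Python 3 rather than the asker's Python 2.6, is to request a quoted-printable transfer encoding so the serialized body is wrapped into short lines that the receiving client joins back together. This is only an illustration of the idea; the addresses are placeholders and whether it cures the 171-character case depends on the mail server involved.

```python
from email import message_from_string, policy
from email.message import EmailMessage

# One very long line of HTML, similar to the body built in the question.
html = "<html><head></head><body>" + ("&nbsp;" * 50 + "<br/>") * 20 + "</body></html>"

msg = EmailMessage()
msg["Subject"] = "Test"
msg["From"] = "sender@example.com"    # placeholder address
msg["To"] = "recipient@example.com"   # placeholder address
# Ask for quoted-printable so the body is soft-wrapped into short lines.
msg.set_content(html, subtype="html", cte="quoted-printable")

wire = msg.as_string()
longest = max(len(line) for line in wire.splitlines())
print(longest)

# A receiver that decodes the transfer encoding gets the original text back.
restored = message_from_string(wire, policy=policy.default).get_content()
```

The resulting wire string could then be passed to smtplib's sendmail() in place of message.as_string().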

I'm the developer of yagmail, a package that tries to make it easy to send emails.
You can use the following code:
import yagmail
yag = yagmail.SMTP('me@gmail.com', 'mypassword')
message = ""
for i in range(20):
    message += "&nbsp;" * 50
    message += "<br/>"
yag.send(contents=message)
Note that by default it will send an HTML message, and that it also automatically adds the alternative part for non-HTML browsers.
Also, note that omitting the subject will leave an empty subject, and without a to argument it will send it to yourself.
Furthermore, note that if you set yagmail up correctly, you can just log in using yagmail.SMTP(), without having to have username & password in the script (while still being secure). Omitting the password will prompt a getpass.
Adding an attachment is as simple as pointing to a local file, e.g.:
yag.send(contents=[message, 'previously a lot of whitespace', '/local/path/file.zip'])
Awesome, isn't it? Thanks for allowing me to show a nice use case for yagmail :)
If you have any feature requests, issues or ideas, please let me know at GitHub.

Export Gmail using Python imaplib - text mangled with newline issues

I'm using the following code to export all the emails in a specific gmail folder.
It's working well, in that it's pulling out all the emails I expect, but it (or I) seem to be mangling the encoding for CR / newlines.
Code:
import imaplib
import email
import codecs
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('myUser#gmail.com', 'myPassword') #user / password
mail.list()
mail.select("myFolder") # connect to folder with matching label
result, data = mail.uid('search', None, "ALL") # search and return uids instead
i = len(data[0].split())
for x in range(i):
    latest_email_uid = data[0].split()[x]
    result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')
    raw_email = email_data[0][1]
    email_message = email.message_from_string(raw_email)
    save_string = str("C:\\\googlemail\\boxdump\\email_" + str(x) + ".eml") #set to save location
    myfile = open(save_string, 'a')
    myfile.write(email_message)
    myfile.close()
My problem is that by the time I get to the object it's littered with '=0A', which I assume is an incorrectly interpreted newline or carriage return flag.
I can find it in hex, [d3 03 03 0a], but because these are not 'characters' I can't find a way for str.replace() to take the parts out. I don't actually want the newline flags.
I could convert the whole string to hex and do a replace of sorts / regex thing, but that seems like overkill when the problem is in the encoding / reading of the source data.
What I see:
====
CAUTION: This email message and any attachments con= tain information that may be confidential and may be LEGALLY PRIVILEGED. If yo= u are not the intended recipient, any use, disclosure or copying of this messag= e or attachments is strictly prohibited. If you have received this email messa= ge in error please notify us immediately and erase all copies of the message an= d attachments. Thank you.
====
what I want:
====
CAUTION: This email message and any attachments contain information that may be confidential and may be LEGALLY PRIVILEGED. If you are not the intended recipient, any use, disclosure or copying of this message or attachments is strictly prohibited. If you have received this email message in error please notify us immediately and erase all copies of the message and attachments. Thank you.
====

What you are looking at is Quoted Printable encoding.
Try changing:
email_message = email.message_from_string(raw_email)
to:
email_message = str(email.message_from_string(raw_email)).decode("quoted-printable")
For more info see Standard Encodings in the Python codecs module. Note that str.decode("quoted-printable") only exists in Python 2.

Just 2 additional items, having gone through the pain of this for a day.
1 Do it at the payload level, so you can still process your email_message to get email addresses etc. from your mail.
2 You need to decode the charset as well. I had trouble with people copying and pasting HTML from web pages and content from Word docs etc. into emails that I was then trying to process.
# Python 2: str.decode("quoted-printable") is not available on Python 3
maintype = email_message.get_content_maintype()
text = ''
if maintype == 'multipart':
    for part in email_message.get_payload():
        if part.get_content_type() == 'text/plain':
            text += part.get_payload().decode("quoted-printable").decode(part.get_content_charset())
Hope this helps someone!
Dave
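On Python 3 the same payload-level idea might look roughly like the sketch below, which lets get_payload(decode=True) undo the transfer encoding and then applies each part's declared charset. The multipart message is invented for the example, and the fallback to UTF-8 when no charset is declared is an assumption, not something the answers above prescribe.

```python
import email
from email import policy

# Hypothetical multipart message; in practice this would be the raw bytes
# returned by an IMAP fetch.
raw = (
    b"Subject: demo\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/alternative; boundary="XYZ"\n'
    b"\n"
    b"--XYZ\n"
    b"Content-Type: text/plain; charset=iso-8859-1\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"caf=E9 con=\n"
    b"tain information\n"
    b"--XYZ\n"
    b"Content-Type: text/html; charset=utf-8\n"
    b"\n"
    b"<p>ignored</p>\n"
    b"--XYZ--\n"
)

email_message = email.message_from_bytes(raw, policy=policy.default)

text = ""
for part in email_message.walk():
    if part.get_content_type() == "text/plain":
        # decode=True undoes quoted-printable or base64 and returns bytes.
        payload = part.get_payload(decode=True)
        # Assumed fallback when the part declares no charset.
        charset = part.get_content_charset() or "utf-8"
        text += payload.decode(charset, errors="replace")

print(text)
```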

How does encoding in email subjects work? (Django/ Python)

I am sending email with EmailMessage object to Gmail box.
The subject of an email looks something like this:
u"You got a letter from Daėrius ęėįęėįęįėęįę---reply3_433441"
When I receive the email and look at the message info, I can see that the Subject line looks like this:
Subject: =?utf-8?b?WW91IGdvdCBhIGxldHRlciBmcm9tIERhxJdyaXVzIMSZxJfEr8SZxJfEr8SZ?=
=?utf-8?b?xK/El8SZxK/EmS0tLXJlcGx5M180MzM0NDE=?=
How to decode this subject line?
I have successfully decoded the email body (text/plain) with this:
for part in msg.walk():
    if part.get_content_type() == 'text/plain':
        msg_encoding = part.get_content_charset()
        msg_text = part.get_payload().decode('quoted-printable')
        msg_text = smart_unicode(msg_text, encoding=msg_encoding, strings_only=False, errors='strict')

See RFC 2047 for a complete description of the format of internationalized email headers. The basic format is "=?" charset "?" encoding "?" encoded-text "?=". So in your case, you have a base-64 encoded UTF-8 string.
You can use the email.header.decode_header and str.decode functions to decode it and get a proper Unicode string:
>>> import email.header
>>> x = email.header.decode_header('=?utf-8?b?WW91IGdvdCBhIGxldHRlciBmcm9tIERhxJdyaXVzIMSZxJfEr8SZxJfEr8SZ?=')
>>> x
[('You got a letter from Da\xc4\x97rius \xc4\x99\xc4\x97\xc4\xaf\xc4\x99\xc4\x97\xc4\xaf\xc4\x99', 'utf-8')]
>>> x[0][0].decode(x[0][1])
u'You got a letter from Da\u0117rius \u0119\u0117\u012f\u0119\u0117\u012f\u0119'
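The session above is Python 2 and decodes only the first encoded-word. As a sketch for Python 3, decode_header() combined with make_header() can handle the whole folded header from the question in one step (the variable names raw_subject and subject are chosen here for illustration):

```python
from email.header import decode_header, make_header

# The folded Subject header from the question: two encoded-words,
# the second one on a continuation line.
raw_subject = (
    "=?utf-8?b?WW91IGdvdCBhIGxldHRlciBmcm9tIERhxJdyaXVzIMSZxJfEr8SZxJfEr8SZ?=\n"
    " =?utf-8?b?xK/El8SZxK/EmS0tLXJlcGx5M180MzM0NDE=?="
)

# decode_header() yields (bytes, charset) pairs; make_header() joins them
# into a Header object whose str() is the decoded text.
subject = str(make_header(decode_header(raw_subject)))
print(subject)
```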

You should look at the email.header module in the Python standard library. In particular, at the end of the documentation, there's a decode_header() function you can use to do most of the hard work for you.

The subject line is UTF-8 but you're reading it as ASCII. You're safest reading it all as UTF-8, as ASCII is effectively only a subset of UTF-8.

emails as email.Message class objects in Python

How do I use poplib to download mails as message instances of the email.Message class from the email module in Python?
I am writing a program that analyzes all emails for specific information, storing parts of each message in a database. I can download the entire mail as text; however, walking through the text searching for attachments is difficult.
The idea is to parse messages for information.

Use the FeedParser class in the email.feedparser module to construct an email.Message object from the messages read from the server with poplib.
Specifically:
import poplib
import email.parser
pop = poplib.POP3("server...")
# [establish connection, authenticate, ...]
raw = pop.retr(1)
pop.close()
parser = email.parser.FeedParser()
for line in raw[1]:
    # each line arrives as bytes without its line ending
    parser.feed(str(line + b'\n', 'us-ascii'))
message = parser.close()

Doesn't deal with character set issues like Adrien Plisson's answer.
import poplib
import email
pop = poplib.POP3( "server..." )
# [establish connection, authenticate, ...]
raw = pop.retr( 1 )
pop.close()
message = email.message_from_string('\n'.join(raw[1]))
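On Python 3, POP3.retr() returns the message lines as bytes objects, so joining them with a str separator fails. A possible variant, sketched here with a hypothetical helper named message_from_pop_lines and a made-up sample in place of a real server response, joins the bytes and lets message_from_bytes() handle header and body decoding:

```python
import email
from email import policy

def message_from_pop_lines(lines):
    """Join the byte lines from POP3.retr() and parse them into a message."""
    return email.message_from_bytes(b"\r\n".join(lines), policy=policy.default)

# Hypothetical stand-in for pop.retr(1)[1], which is a list of bytes objects.
sample_lines = [
    b"Subject: =?utf-8?Q?Subject?=",
    b"Content-Type: text/plain; charset=utf-8",
    b"Content-Transfer-Encoding: 8bit",
    b"",
    "caf\u00e9".encode("utf-8"),
]

message = message_from_pop_lines(sample_lines)
print(message["subject"])
print(message.get_content())
```

With a live connection the call would be message_from_pop_lines(pop.retr(1)[1]).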
