Python supports a quite functional MIME-Library called email.mime.
What I want to achieve is to get a MIME Part containing plain UTF-8 text to be encoded as quoted printables and not as base64. Although all functionallity is available in the library, I did not manage to use it:
Example:
import email.mime.text, email.encoders
m=email.mime.text.MIMEText(u'This is the text containing ünicöde', _charset='utf-8')
m.as_string()
# => Leads to a base64-encoded message, as base64 is the default.
email.encoders.encode_quopri(m)
m.as_string()
# => Leads to a strange message
The last command leads to a strange message:
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Transfer-Encoding: quoted-printable
GhpcyBpcyB0aGUgdGV4dCBjb250YWluaW5nIMO8bmljw7ZkZQ=3D=3D
This is obviously not encoded as quoted printables, the double transfer-encoding header is strange at last (if not illegal).
How can I get my text encoded as quoted printables in the mime-message?
Okay, I got one solution which is very hacky, but at least it leads into some direction: MIMEText assumes base64 and I don't know how to change this. For this reason I use MIMENonMultipart:
import email.mime, email.mime.nonmultipart, email.charset
m=email.mime.nonmultipart.MIMENonMultipart('text', 'plain', charset='utf-8')
#Construct a new charset which uses Quoted Printables (base64 is default)
cs=email.charset.Charset('utf-8')
cs.body_encoding = email.charset.QP
#Now set the content using the new charset
m.set_payload(u'This is the text containing ünicöde', charset=cs)
Now the message seems to be encoded correctly:
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
This is the text containing =C3=BCnic=C3=B6de
One can even construct a new class which hides the complexity:
class MIMEUTF8QPText(email.mime.nonmultipart.MIMENonMultipart):
def __init__(self, payload):
email.mime.nonmultipart.MIMENonMultipart.__init__(self, 'text', 'plain',
charset='utf-8')
utf8qp=email.charset.Charset('utf-8')
utf8qp.body_encoding=email.charset.QP
self.set_payload(payload, charset=utf8qp)
And use it like this:
m = MIMEUTF8QPText(u'This is the text containing ünicöde')
m.as_string()
In Python 3 you do not need your hack:
import email
# Construct a new charset which uses Quoted Printables (base64 is default)
cs = email.charset.Charset('utf-8')
cs.body_encoding = email.charset.QP
m = email.mime.text.MIMEText(u'This is the text containing ünicöde', 'plain', _charset=cs)
print(m.as_string())
Adapted from issue 1525919 and tested on python 2.7:
from email.Message import Message
from email.Charset import Charset, QP
text = "\xc3\xa1 = \xc3\xa9"
msg = Message()
charset = Charset('utf-8')
charset.header_encoding = QP
charset.body_encoding = QP
msg.set_charset(charset)
msg.set_payload(msg._charset.body_encode(text))
print msg.as_string()
will give you:
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
=C3=A1 =3D =C3=A9
Also see this response from a Python committer.
Related
I have a lot of strings from mail bodies, that print as such:
=C3=A9
This should be 'é' for example.
What exactly is this encoding and how to decode it?
I'm using python 3.5
EDIT:
I managed to get the body of the mail properly encoded by applying:
quopri.decodestring(sometext).decode('utf-8')
However I still struggle to get the FROM , TO, SUBJECT, etc... parts get right.
This is how I construct the e-mails:
import imaplib
import email
import quopri
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('mail#gmail.com', '*******')
mail.list()
mail.select('"[Gmail]/All Mail"')
typ, data = mail.search(None, 'SUBJECT', '"{}"'.format('123456'))
data[0].split()
print(data[0].split())
for e_mail in data[0].split():
typ, data = mail.fetch('{}'.format(e_mail.decode()),'(RFC822)')
raw_mail = data[0][1]
email_message = email.message_from_bytes(raw_mail)
if email_message.is_multipart():
for part in email_message.walk():
if part.get_content_type() == 'text/plain':
if part.get_content_type() == 'text/plain':
body = part.get_payload()
to = email_message['To']
utf = quopri.decodestring(to)
text = utf.decode('utf-8')
print(text)
.
.
.
I still got this: =?UTF-8?B?UMOpdGVyIFBldMWRY3o=?=
That's called "quoted-printable" encoding. It's defined by RFC 1521. Its purpose is to replace unusual character values by a sequence of normal, safe characters so that the message can be handled safely by the email system.
In fact there are two levels of encoding here. First the letter 'é' was encoded into UTF-8 which produces '\xc3\xa9', and then that UTF-8 was encoded into the quoted-printable form '=C3=A9'
You can undo the quoted-printable step by using the decode or decodestring method of the quopri module, documented at https://docs.python.org/3/library/quopri.html That will look something like:
import quopri
source = '=C3=A9'
print(quopri.decodestring(source))
That will undo the quoted-printable encoding and show you the UTF-8 bytes '\xc3\xa9'. To get back to the letter 'é' you need to use the decode string method and tell Python that those bytes contain a UTF-8 encoding, something like:
utf = quopri.decodestring(source)
text = utf.decode('utf-8')
print(text)
UTF-8 is only one of many possible ways of encoding letters into bytes. For example, if your 'é' had been encoded as ISO-8859-1 it would have had the byte value '\xe9' and its quoted-printable representation would have been '=E9'.
When you're dealing with email, you should see a Content-Type header that tells you what type of content is being sent and which letter-to-bytes encoding was applied to the text of the message (or to an individual MIME part, in a multipart message). If that text was then encoded again by applying the quoted-printable encoding, that additional step should be indicated by a Content-Transfer-Encoding header. So your message with UTF-8 encoded text carried in quoted-printable format should have had headers that look like this:
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
This solved it:
from email.header import decode_header
def mail_header_decoder(self,header):
if header != None:
mail_header_decoded = decode_header(header)
l=[]
header_new=[]
for header_part in mail_header_decoded:
l.append(header_part[1])
if all(item == None for item in l):
# print(header)
return header
else:
for header_part in mail_header_decoded:
header_new.append(header_part[0].decode())
header_new = ''.join(header_new) # convert list to string
# print(header_new)
return header_new
I have an issues with some encoding when reading emails using the Gmail API.
First i retrieve the email using this:
message = service.users().messages().get(userId='me', id='169481bce75af185', format='raw').execute()
After that I use these line to get a string out of it and convert it into mime message:
msg_str = str(base64.urlsafe_b64decode(message['raw'].encode('utf-8')).decode('utf-8'))
mime_msg = email.message_from_string(msg_str)
Then I print what I got:
print(mime_msg.get_payload()[0])
However I can see some weird characters in the output for example:
Gesch=C3=A4ftsf=C3=BChrer
In the message header I can see this:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
What did I do wrong and how can I get the correct output without the strange characters?
Thank you for your time
Your data has been encoded as UTF-8 and then made safe for 7-bit transmission by further encoding as quoted-printable. That is what the message header is telling you. Use quopri to undo the quoted-printable and then .decode to get Unicode:
>>> import quopri
>>> print(quopri.decodestring("Gesch=C3=A4ftsf=C3=BChrer").decode("utf-8"))
Geschäftsführer
as BoarGules suggested, it displays the characters properly now. Browsing this site also led me to this useful function:
def decode_email(msg_str):
p = Parser()
message = p.parsestr(msg_str)
decoded_message = ''
for part in message.walk():
charset = part.get_content_charset()
if part.get_content_type() == 'text/plain':
part_str = part.get_payload(decode=1)
decoded_message += part_str.decode(charset)
return decoded_message
Which converts the message string into decoded string display the characted properly.
Here is a method which tries to get the html part of an email message:
from __future__ import absolute_import, division, unicode_literals, print_function
import email
html_mail_quoted_printable=b'''Subject: =?ISO-8859-1?Q?WG=3A_Wasenstra=DFe_84_in_32052_Hold_Stau?=
MIME-Version: 1.0
Content-type: multipart/mixed;
Boundary="0__=4EBBF4C4DFD012538f9e8a93df938690918c4EBBF4C4DFD01253"
--0__=4EBBF4C4DFD012538f9e8a93df938690918c4EBBF4C4DFD01253
Content-type: multipart/alternative;
Boundary="1__=4EBBF4C4DFD012538f9e8a93df938690918c4EBBF4C4DFD01253"
--1__=4EBBF4C4DFD012538f9e8a93df938690918c4EBBF4C4DFD01253
Content-type: text/plain; charset=ISO-8859-1
Content-transfer-encoding: quoted-printable
Freundliche Gr=FC=DFe
--1__=4EBBF4C4DFD012538f9e8a93df938690918c4EBBF4C4DFD01253
Content-type: text/html; charset=ISO-8859-1
Content-Disposition: inline
Content-transfer-encoding: quoted-printable
<html><body>
Freundliche Gr=FC=DFe
</body></html>
--1__=4EBBF4C4DFD012538f9e8a93df938690918c4EBBF4C4DFD01253--
--0__=4EBBF4C4DFD012538f9e8a93df938690918c4EBBF4C4DFD01253--
'''
def get_html_part(msg):
for part in msg.walk():
if part.get_content_type() == 'text/html':
return part.get_payload(decode=True)
msg=email.message_from_string(html_mail_quoted_printable)
html=get_html_part(msg)
print(type(html))
print(html)
Output:
<type 'str'>
<html><body>
Freundliche Gr��e
</body></html>
Unfortunately I get a byte string. I would like to have unicode string.
According to this answer msg.get_payload(decode=True) should do the magic. But it does not in this case.
How to decode a mime part of a message and get a unicode string in Python 2.7?
Unfortunately I get a byte string. I would like to have unicode string.
The decode=True parameter to get_payload only decodes the Content-Transfer-Encoding wrapper, the =-encoding in this message. To get from there to characters is one of the many things the email package makes you do yourself:
bytes = part.get_payload(decode=True)
charset = part.get_content_charset('iso-8859-1')
chars = bytes.decode(charset, 'replace')
(iso-8859-1 being the fallback in case the message specifies no encoding.)
I'm using Python email.mime lib to write emails, and I created two MIMEText objects and then attached them to Message as text (not as attachment), and as a result I got the MIME document as follows, as you can see there are two text objects, one is of type plain and the other is of type html, my question is that I can only see the latter text object (here is the html) in some mail clients, while I can see both text objects in some other mail clients (for example, live.com), so what caused this?
Content-Type: multipart/mixed; boundary="===============0542266593=="
MIME-Version: 1.0
FROM: john.smith#NYU.com
TO: john.smith#live.com, john.smith#gmail.com
SUBJECT: =?utf-8?q?A_Greeting_From_Postman?=
--===============0542266593==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
SGkhCkhvdyBhcmUgeW91PwpIZXJlIGlzIHRoZSBsaW5rIHlvdSB3YW50ZWQ6Cmh0dHA6Ly93d3cu
cHl0aG9uLm9yZw==
--===============0542266593==
Content-Type: text/html; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
ICAgICAgICA8aHRtbD4KICAgICAgICAgIDxoZWFkPjwvaGVhZD4KICAgICAgICAgIDxib2R5Pgog
ICAgICAgICAgICA8cD5IaSE8YnI+CiAgICAgICAgICAgICAgIEhvdyBhcmUgeW91Pzxicj4KICAg
ICAgICAgICAgICAgSGVyZSBpcyB0aGUgPGEgaHJlZj0iaHR0cDovL3d3dy5weXRob24ub3JnIj5s
aW5rPC9hPiB5b3Ugd2FudGVkLgogICAgICAgICAgICA8L3A+CiAgICAgICAgICA8L2JvZHk+CiAg
ICAgICAgPC9odG1sPgogICAgICAgIA==
--===============0542266593==--
You have specified 'multipart/mixed' as the mime type. If you want only one item to be displayed, specify 'multipart/alternative', as so:
email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.encoders import encode_base64
# Note: 'alternative' means only display one of the items.
msg = MIMEMultipart('alternative')
msg['Subject'] = "Hello"
msg['From'] = 'me#example.com'
msg['To'] = 'you#example.com'
msg.attach(MIMEText('Hello!', 'plain'))
msg.attach(MIMEText('<b>Hello!</b>', 'html'))
# Not required, but you had it in your example, so I kept it.
for i in msg.get_payload():
encode_base64(i)
print msg.as_string()
I saved the whole message as xx.eml, but some mails body tells that mail is encoding by base64 at the first line, for example:
charset="utf-8" Content-Transfer-Encoding: base64
charset="gb2312" Content-Transfer-Encoding: base64
I tried to get the keys of body[0][1], but there is no content-transfer-encoding field (only content-type).
How can I process that mails?
def saveMail(conn, num):
typ, body = conn.fetch(num, 'RFC822')
message = open(emldirPath + '\\' + num + '.eml', 'w+')
message.write(str(email.message_from_string(body[0][1])))
print email.message_from_string(body[0][1]).keys()
#['Received', 'Return-Path', 'Received', 'Received', 'Date', 'From', 'To',
# 'Subject', 'Message-ID', 'X-mailer', 'Mime-Version', 'X-MIMETrack',
# 'Content-Type', 'X-Coremail-Antispam']
message.close()
I found the problem, it's not decoding problem.
right mail as follow:
------=_Part_446950_1309705579.1326378953207
Content-Type: text/plain; charset=GBK
Content-Transfer-Encoding: base64
what my program download:
------=_Part_446950_1309705579.1326378953207
Content-Type: text/plain;
charset="utf-8"
Content-Transfer-Encoding: base64
when my program save the .eml file, it change line after 'text/plain;'
therefore outlook express can't parse the mail
if I edit the line to ""Content-Type: text/html;charset="utf-8"",
it works
Now the question is: how to edit my program to not let it change line?
Emails that are transfered as BASE64 must set Content-Transfer-Encoding. However you are most likely dealing with a MIME/Multipart message (e.g. both text/plain and HTML in the same message), in which case the transfer encoding is set separately for each part. You can test with is_multipart() or if Content-Type is multipart/alternative. If that is the case you use walk to iterate over the different parts.
EDIT: It is quite normal to send text/plain using quoted-printable and HTML using BASE64.
Content-Type: multipart/alternative; boundary="=_d6644db1a848db3cb25f2a8973539487"
Subject: multipart sample
From: Foo Bar <foo#example.net>
To: Fred Flintstone <fred#example.net>
--=_d6644db1a848db3cb25f2a8973539487
Content-Transfer-Encoding: base64
Content-Type: text/plain; charset=utf-8
SOME BASE64 HERE
--=_d6644db1a848db3cb25f2a8973539487
Content-Transfer-Encoding: base64
Content-Type: text/html; charset=utf-8
AND SOME OTHER BASE64 HERE