Python UTF-8 Hex decoding [duplicate]

I have a lot of strings from mail bodies that print like this:
=C3=A9
This should be 'é', for example.
What exactly is this encoding, and how do I decode it?
I'm using Python 3.5.
EDIT:
I managed to get the body of the mail properly encoded by applying:
quopri.decodestring(sometext).decode('utf-8')
However, I still struggle to get the From, To, Subject, etc. parts decoded correctly.
This is how I construct the e-mails:
import imaplib
import email
import quopri

mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('mail@gmail.com', '*******')
mail.list()
mail.select('"[Gmail]/All Mail"')
typ, data = mail.search(None, 'SUBJECT', '"{}"'.format('123456'))
data[0].split()
print(data[0].split())
for e_mail in data[0].split():
    typ, data = mail.fetch('{}'.format(e_mail.decode()),'(RFC822)')
    raw_mail = data[0][1]
    email_message = email.message_from_bytes(raw_mail)
    if email_message.is_multipart():
        for part in email_message.walk():
            if part.get_content_type() == 'text/plain':
                body = part.get_payload()
                to = email_message['To']
                utf = quopri.decodestring(to)
                text = utf.decode('utf-8')
                print(text)
.
.
.
I still got this: =?UTF-8?B?UMOpdGVyIFBldMWRY3o=?=

That's called "quoted-printable" encoding. It's defined by RFC 2045, which superseded RFC 1521. Its purpose is to replace unusual character values with a sequence of normal, safe characters so that the message can be handled safely by the email system.
In fact there are two levels of encoding here. First the letter 'é' was encoded into UTF-8, which produces the bytes b'\xc3\xa9', and then those bytes were encoded into the quoted-printable form '=C3=A9'.
You can undo the quoted-printable step by using the decode or decodestring function of the quopri module, documented at https://docs.python.org/3/library/quopri.html. That will look something like:
import quopri
source = '=C3=A9'
print(quopri.decodestring(source))
That will undo the quoted-printable encoding and show you the UTF-8 bytes b'\xc3\xa9'. To get back to the letter 'é' you need to call the decode method of that bytes object and tell Python that the bytes contain UTF-8, something like:
utf = quopri.decodestring(source)
text = utf.decode('utf-8')
print(text)
UTF-8 is only one of many possible ways of encoding letters into bytes. For example, if your 'é' had been encoded as ISO-8859-1 it would have had the byte value '\xe9' and its quoted-printable representation would have been '=E9'.
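As a small illustration of the two layers, the sketch below encodes the same letter with two different charsets and then applies quoted-printable. The helper name to_quoted_printable is made up for this example; only quopri.encodestring and quopri.decodestring come from the standard library.

```python
import quopri

def to_quoted_printable(text, charset):
    """Encode text to bytes with the given charset, then apply quoted-printable."""
    raw = text.encode(charset)
    return quopri.encodestring(raw)

utf8_form = to_quoted_printable("é", "utf-8")
latin1_form = to_quoted_printable("é", "iso-8859-1")
print(utf8_form)
print(latin1_form)

# Decoding reverses both steps, and needs the charset that was used to encode
print(quopri.decodestring(latin1_form).decode("iso-8859-1"))
```

The point of the sketch is that the quoted-printable text alone does not say which charset it carries; that information has to come from elsewhere in the message.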
When you're dealing with email, you should see a Content-Type header that tells you what type of content is being sent and which letter-to-bytes encoding was applied to the text of the message (or to an individual MIME part, in a multipart message). If that text was then encoded again by applying the quoted-printable encoding, that additional step should be indicated by a Content-Transfer-Encoding header. So your message with UTF-8 encoded text carried in quoted-printable format should have had headers that look like this:
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

This solved it:
from email.header import decode_header

def mail_header_decoder(self, header):
    if header is not None:
        mail_header_decoded = decode_header(header)
        # No charset on any part means nothing was encoded
        if all(charset is None for _, charset in mail_header_decoded):
            return header
        # Otherwise every part is bytes; decode each with its own charset
        header_new = []
        for value, charset in mail_header_decoded:
            header_new.append(value.decode(charset or 'utf-8'))
        return ''.join(header_new)  # convert list to string
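A shorter variant of the same idea, as a sketch: email.header.make_header can reassemble the pieces returned by decode_header, so the charset of each part is handled by the library. The function name decode_mime_header is hypothetical; the sample value is the To header quoted in the question.

```python
from email.header import decode_header, make_header

def decode_mime_header(value):
    """Return an RFC 2047 encoded header value as a plain str."""
    if value is None:
        return None
    return str(make_header(decode_header(value)))

print(decode_mime_header("=?UTF-8?B?UMOpdGVyIFBldMWRY3o=?="))
print(decode_mime_header("plain ASCII subject"))
```

This form of encoding (=?charset?B?...?= or =?charset?Q?...?=) is used in headers only, which is why quopri alone does not decode the To or Subject fields.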

Related

Gmail API - Seeing strange German characters using 'raw' output and decoding into utf-8

I have an issue with encoding when reading emails using the Gmail API.
First I retrieve the email using this:
message = service.users().messages().get(userId='me', id='169481bce75af185', format='raw').execute()
After that I use these lines to get a string out of it and convert it into a MIME message:
msg_str = str(base64.urlsafe_b64decode(message['raw'].encode('utf-8')).decode('utf-8'))
mime_msg = email.message_from_string(msg_str)
Then I print what I got:
print(mime_msg.get_payload()[0])
However, I can see some weird characters in the output, for example:
Gesch=C3=A4ftsf=C3=BChrer
In the message header I can see this:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
What did I do wrong and how can I get the correct output without the strange characters?
Thank you for your time

Your data has been encoded as UTF-8 and then made safe for 7-bit transmission by further encoding as quoted-printable. That is what the message header is telling you. Use quopri to undo the quoted-printable and then .decode to get Unicode:
>>> import quopri
>>> print(quopri.decodestring("Gesch=C3=A4ftsf=C3=BChrer").decode("utf-8"))
Geschäftsführer

As BoarGules suggested, it displays the characters properly now. Browsing this site also led me to this useful function:
from email.parser import Parser

def decode_email(msg_str):
    p = Parser()
    message = p.parsestr(msg_str)
    decoded_message = ''
    for part in message.walk():
        if part.get_content_type() == 'text/plain':
            # Fall back to UTF-8 when the part declares no charset
            charset = part.get_content_charset() or 'utf-8'
            # decode=True undoes quoted-printable or base64
            part_str = part.get_payload(decode=True)
            decoded_message += part_str.decode(charset)
    return decoded_message
It converts the message string into a decoded string that displays the characters properly.
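An alternative sketch for the Gmail case: keep the decoded 'raw' field as bytes and let the parser apply the charset each part declares, instead of decoding the whole message as UTF-8 first. The function name plain_text_from_raw and the sample message are assumptions for illustration; the sample stands in for message['raw'] from the API response.

```python
import base64
from email import message_from_bytes, policy

def plain_text_from_raw(raw_b64url):
    """Decode a base64url 'raw' message and return its text/plain body."""
    raw_bytes = base64.urlsafe_b64decode(raw_b64url)
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    # get_content() undoes the transfer encoding and the charset in one step
    return body.get_content() if body is not None else ""

# Hypothetical stand-in for message['raw'] from the API response
sample = (
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Gesch=C3=A4ftsf=C3=BChrer\r\n"
)
raw_field = base64.urlsafe_b64encode(sample).decode("ascii")
print(plain_text_from_raw(raw_field))
```

With policy.default the parser returns EmailMessage objects, which is what makes get_body and get_content available.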

Why does this Python program send empty emails when I encode it with utf-8? [duplicate]

This question already has answers here:
smtplib sends blank message if the message contain certain characters
(3 answers)
Closed 2 years ago.
Before encoding the msg variable, I was getting this error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in
position 4: ordinal not in range(128)
So I did some research, and finally encoded the variable:
msg = (os.path.splitext(base)[0] + ': ' + text).encode('utf-8')
server.sendmail('...@gmail.com', '...@gmail.com', msg)
Here's the rest of the code on request:
def remind_me(path, time, day_freq):
    for filename in glob.glob(os.path.join(path, '*.docx')):
        # file_count = sum(len(files))
        # random_file = random.randint(0, file_number-1)
        doc = docx.Document(filename)
        p_number = len(doc.paragraphs)
        text = ''
        while text == '':
            rp = random.randint(0, p_number-1) # random paragraph number
            text = doc.paragraphs[rp].text # gives the entire text in the paragraph
        base = os.path.basename(filename)
        print(os.path.splitext(base)[0] + ': ' + text)
        server = smtplib.SMTP('smtp.gmail.com', 587)
        server.starttls()
        server.login('...@gmail.com', 'password')
        msg = (os.path.splitext(base)[0] + ': ' + text).encode('utf-8')
        server.sendmail('...@gmail.com', '...@gmail.com', msg)
        server.quit()
Now, it sends empty emails instead of delivering the message. Does it return None? If so, why?
Note: Word documents contain some characters like ş, ö, ğ, ç.

The msg argument to smtplib.SMTP.sendmail should be a bytes sequence containing a valid RFC 5322 message. Taking a string and encoding it as UTF-8 is very unlikely to produce one (if it's already ASCII, encoding it does nothing useful; and if it isn't, you are most probably Doing It Wrong).
To explain why that is unlikely to work, let me provide a bit of background. The way to transport non-ASCII strings in MIME messages depends on the context of the string in the message structure. Here is a simple message with the word "Hëlló" embedded in three different contexts which require different encodings, none of which accept raw UTF-8 easily.
From: me <sender@example.org>
To: you <recipient@example.net>
Subject: =?utf-8?Q?H=C3=ABll=C3=B3?= (RFC2047 encoding)
MIME-Version: 1.0
Content-type: multipart/mixed; boundary="fooo"

--fooo
Content-type: text/plain; charset="utf-8"
Content-transfer-encoding: quoted-printable

H=C3=ABll=C3=B3 is bare quoted-printable (RFC2045),
like what you see in the Subject header but without
the RFC2047 wrapping.

--fooo
Content-type: application/octet-stream; filename*=UTF-8''H%C3%ABll%C3%B3

This is a file whose name has been RFC2231-encoded.

--fooo--
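To make the three contexts concrete, here is a minimal sketch that decodes each encoded form of "Hëlló" from the sample message with standard-library calls. The variable names are illustrative, and splitting the RFC 2231 value by hand with decode_rfc2231 plus urllib.parse.unquote is just one way to do it.

```python
import quopri
from email.header import decode_header, make_header
from email.utils import decode_rfc2231
from urllib.parse import unquote

# RFC 2047 encoded word, as used in headers such as Subject
subject = str(make_header(decode_header("=?utf-8?Q?H=C3=ABll=C3=B3?=")))

# Bare quoted-printable, as used in a body part
body = quopri.decodestring(b"H=C3=ABll=C3=B3").decode("utf-8")

# RFC 2231 parameter value: charset'language'percent-encoded-text
charset, language, quoted = decode_rfc2231("UTF-8''H%C3%ABll%C3%B3")
filename = unquote(quoted, encoding=charset)

print(subject, body, filename)
```

In practice the email package applies these decodings for you when it parses a message; the sketch only shows that each context has its own rules.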
There are recent extensions which allow for parts of messages between conforming systems to contain bare UTF-8 (even in the headers!) but I have a strong suspicion that this is not the scenario you are in. Maybe tangentially see also https://en.wikipedia.org/wiki/Unicode_and_email
Returning to your code, I suppose it could work if base is coincidentally also the name of a header you want to add to the start of the message, and text contains a string with the rest of the message. You are not showing enough of your code to reason intelligently about this, but it seems highly unlikely. And if text already contains a valid MIME message, encoding it as UTF-8 should not be necessary or useful (but it clearly doesn't, as you get the encoding error).
Let's suppose base contains Subject and text is defined thusly:
text='''=?utf-8?Q?H=C3=ABll=C3=B3?= (RFC2047 encoding)
MIME-Version: 1.0
Content-type: multipart/mixed; boundary="fooo"
....'''
Now, the concatenation base + ': ' + text actually produces a message similar to the one above (though I reordered some headers to put Subject: first for this scenario) but again, I imagine this is not how things actually are in your code.
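One plausible reason the mails arrive empty, offered here as an assumption rather than something the answer states: a message consisting of a single "name: text" line looks like one header field with no body. The sketch below feeds such a line to the standard parser; the file name and paragraph text are made up.

```python
import email

# Same shape as os.path.splitext(base)[0] + ': ' + text in the question;
# the name and text here are invented for illustration
raw = "notes: a random paragraph from the document".encode("utf-8")

parsed = email.message_from_bytes(raw)
print(parsed.keys())               # header names the parser found
print(repr(parsed.get_payload()))  # body the parser found
```

Python's parser treats the whole line as a header called "notes" and finds no body. Whether a given mail server or client interprets it the same way is not something this check can prove.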
If your goal is to send an extracted piece of text as the body of an email message, the way to do that is roughly
import smtplib
from email.message import EmailMessage

body_text = os.path.splitext(base)[0] + ': ' + text
message = EmailMessage()
message.set_content(body_text)
message["subject"] = "Extracted text"
message["from"] = "you@example.net"
message["to"] = "me@example.org"
with smtplib.SMTP("smtp.gmail.com", 587) as server:
    # ... smtplib setup, login, authenticate?
    server.send_message(message)
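To check what EmailMessage produces without connecting to a server, a sketch like the following serializes a message and parses it back. The addresses and texts are placeholders, and bytes(message) is used here only as a stand-in for the serialization that send_message performs internally.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

message = EmailMessage()
message["Subject"] = "Hëlló"
message["From"] = "you@example.net"
message["To"] = "me@example.org"
message.set_content("Word documents contain characters like ş, ö, ğ, ç.")

wire = bytes(message)  # serialized form, with encoded headers and MIME fields
print(wire.decode("utf-8"))

# Parsing the bytes again recovers the original text
parsed = message_from_bytes(wire, policy=policy.default)
print(parsed["Subject"], parsed.get_content())
```

The printed message shows the Subject as an RFC 2047 encoded word and a Content-Type header with charset="utf-8", neither of which had to be written by hand.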
This answer was updated for the current email library API; the last code block below is the earlier code from the original answer.
The modern EmailMessage API (provisional since Python 3.4, standard since 3.6) rather straightforwardly translates into human concepts, unlike the older API which required you to understand many nitty-gritty details of how the MIME structure of your message should look.
import smtplib
from email.mime.text import MIMEText

body_text = os.path.splitext(base)[0] + ": " + text
sender = "you@example.net"
recipient = "me@example.org"
message = MIMEText(body_text)
message["subject"] = "Extracted text"
message["from"] = sender
message["to"] = recipient
server = smtplib.SMTP("smtp.gmail.com", 587)
# ... smtplib setup, login, authenticate?
# "from" is a reserved word, so pass the variables defined above
server.sendmail(sender, recipient, message.as_string())
The MIMEText() invocation builds an email object with room for a sender, a subject, a list of recipients, and a body; its as_string() method returns a representation which looks roughly similar to the ad hoc example message above (though simpler still, with no multipart structure) and which is suitable for transmitting over SMTP. It transparently takes care of putting in the correct character set and applying suitable content-transfer encodings for non-ASCII header elements and body parts (payloads).
Python's standard library contains fairly low-level functions, so you have to know a fair bit in order to connect all the pieces correctly. There are third-party libraries which hide some of this nitty-gritty; but you would expect anything with email to have at the very least both a subject and a body, as well as of course a sender and recipients.

trying to read emails from gmail using pythonista

I am fairly new to Python and am excited that I have access to Gmail using IMAP4.
Here is the code I am using to access email:
from __future__ import print_function
import getpass
import imaplib
import console
import collections
import re
import email
import codecs
import quopri

console.clear()
mail = imaplib.IMAP4_SSL('imap.gmail.com',993)
mypassword = getpass.getpass("Password: ")
address = 'sch.e@gmail.com'
print('Which email address (TO) would you like to search: ',end='')
EE = raw_input()
SS = r"(TO "+"\""+EE+"\""+r")"
mail.login(address, mypassword)
mail.select("inbox") #select the box on gmail
print("Checking for e-mails TO ",EE)
typ, messageIDs = mail.search(None,'(SINCE "01-Jan-2014")',SS)
MIDs=messageIDs[0].split()
for mailid in MIDs[::-1]:
    resp, data = mail.fetch(mailid,'(RFC822)')
    raw_body=data[0][1]
    print(raw_body.decode('UTF-8','strict'))
    print(quopri.encodestring(raw_body))
    msg=email.message_from_string(raw_body)
    print(msg)
Unfortunately, none of the print statements shows the umlauts correctly
(for example, Beste Grüße).
Could someone please give me a hint on how to deal with encodings? It looks like UTF-8 encoded text except for the "=" characters.
Thank you!!
Erik

The email's body text has been charset-encoded into bytes that are then encoded into 7-bit ASCII using MIME's quoted-printable algorithm. You will have to reverse the QP encoding to get the original bytes, and then you can convert them to a string using the email's charset (which is not utf-8, otherwise the QP encoded text would say Beste Gr=C3=BC=C3=9Fe instead; the charset is more likely iso-8859-1). The email headers will tell you the actual charset, and how the body was encoded (QP, base64, etc). Note that an RFC822 fetch returns the whole message, headers included, so the Content-Type and Content-Transfer-Encoding headers are already in the fetched data; RFC822.HEADER fetches the headers alone.
Let's assume the email is encoded as ISO-8859-1 using quoted-printable (check the email headers to verify). Try something like this to decode it:
raw_body=data[0][1]
raw_body=quopri.decodestring(raw_body)
raw_body=raw_body.decode('ISO-8859-1')
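Rather than assuming the charset, a sketch along these lines reads it from the message itself (Python 3 syntax). The function name decode_text_parts, the fallback charset, and the sample message are assumptions for illustration; the sample mimics what an ISO-8859-1, quoted-printable mail could look like.

```python
import email

def decode_text_parts(raw_bytes, fallback_charset="iso-8859-1"):
    """Return the decoded text/plain parts of a raw RFC 822 message."""
    msg = email.message_from_bytes(raw_bytes)
    texts = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # undoes QP or base64
        charset = part.get_content_charset() or fallback_charset
        texts.append(payload.decode(charset, errors="replace"))
    return "".join(texts)

# Hypothetical message standing in for data[0][1] from the IMAP fetch
sample = (
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Beste Gr=FC=DFe\r\n"
)
print(decode_text_parts(sample))
```

The fallback is only used when a part declares no charset at all; errors="replace" keeps a mislabeled message from raising.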

Encode MIMEText as quoted printables

Python supports a quite functional MIME-Library called email.mime.
What I want to achieve is to get a MIME part containing plain UTF-8 text to be encoded as quoted-printable and not as base64. Although all the functionality is available in the library, I did not manage to use it:
Example:
import email.mime.text, email.encoders
m=email.mime.text.MIMEText(u'This is the text containing ünicöde', _charset='utf-8')
m.as_string()
# => Leads to a base64-encoded message, as base64 is the default.
email.encoders.encode_quopri(m)
m.as_string()
# => Leads to a strange message
The last command leads to a strange message:
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Transfer-Encoding: quoted-printable

GhpcyBpcyB0aGUgdGV4dCBjb250YWluaW5nIMO8bmljw7ZkZQ=3D=3D
This is obviously not encoded as quoted-printable, and the double transfer-encoding header is strange at least (if not illegal).
How can I get my text encoded as quoted printables in the mime-message?

Okay, I got one solution which is very hacky, but at least it leads into some direction: MIMEText assumes base64 and I don't know how to change this. For this reason I use MIMENonMultipart:
import email.mime, email.mime.nonmultipart, email.charset
m=email.mime.nonmultipart.MIMENonMultipart('text', 'plain', charset='utf-8')
#Construct a new charset which uses Quoted Printables (base64 is default)
cs=email.charset.Charset('utf-8')
cs.body_encoding = email.charset.QP
#Now set the content using the new charset
m.set_payload(u'This is the text containing ünicöde', charset=cs)
Now the message seems to be encoded correctly:
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

This is the text containing =C3=BCnic=C3=B6de
One can even construct a new class which hides the complexity:
class MIMEUTF8QPText(email.mime.nonmultipart.MIMENonMultipart):
    def __init__(self, payload):
        email.mime.nonmultipart.MIMENonMultipart.__init__(self, 'text', 'plain',
                                                          charset='utf-8')
        utf8qp = email.charset.Charset('utf-8')
        utf8qp.body_encoding = email.charset.QP
        self.set_payload(payload, charset=utf8qp)
And use it like this:
m = MIMEUTF8QPText(u'This is the text containing ünicöde')
m.as_string()

In Python 3 you do not need your hack:
import email.charset
import email.mime.text

# Construct a new charset which uses Quoted Printables (base64 is default)
cs = email.charset.Charset('utf-8')
cs.body_encoding = email.charset.QP
m = email.mime.text.MIMEText(u'This is the text containing ünicöde', 'plain', _charset=cs)
print(m.as_string())
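With the newer EmailMessage API there is another route, sketched here as an alternative to the Charset approach above rather than a replacement for it: set_content accepts a cte argument that requests a specific Content-Transfer-Encoding.

```python
from email.message import EmailMessage

m = EmailMessage()
# cte= asks the content manager for a specific Content-Transfer-Encoding
m.set_content("This is the text containing ünicöde", cte="quoted-printable")
print(m.as_string())
```

Here no Charset object has to be built by hand; the charset defaults to utf-8 and only the transfer encoding is overridden.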

Adapted from issue 1525919 and tested on python 2.7:
from email.Message import Message
from email.Charset import Charset, QP
text = "\xc3\xa1 = \xc3\xa9"
msg = Message()
charset = Charset('utf-8')
charset.header_encoding = QP
charset.body_encoding = QP
msg.set_charset(charset)
msg.set_payload(msg._charset.body_encode(text))
print msg.as_string()
will give you:
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

=C3=A1 =3D =C3=A9
Also see this response from a Python committer.
