How does encoding in email subjects work? (Django/Python)

I am sending an email with an EmailMessage object to a Gmail inbox.
The subject of an email looks something like this:
u"You got a letter from Daėrius ęėįęėįęįėęįę---reply3_433441"
When I receive the email and look at the message info, I can see that the Subject line looks like this:
Subject: =?utf-8?b?WW91IGdvdCBhIGxldHRlciBmcm9tIERhxJdyaXVzIMSZxJfEr8SZxJfEr8SZ?=
=?utf-8?b?xK/El8SZxK/EmS0tLXJlcGx5M180MzM0NDE=?=
How do I decode this subject line?
I have successfully decoded the email body (text/plain) with this:
for part in msg.walk():
    if part.get_content_type() == 'text/plain':
        msg_encoding = part.get_content_charset()
        msg_text = part.get_payload().decode('quoted-printable')
        msg_text = smart_unicode(msg_text, encoding=msg_encoding, strings_only=False, errors='strict')

See RFC 2047 for a complete description of the format of internationalized email headers. The basic format is "=?" charset "?" encoding "?" encoded-text "?=". So in your case, you have a base-64 encoded UTF-8 string.
You can use the email.header.decode_header and str.decode functions to decode it and get a proper Unicode string:
>>> import email.header
>>> x = email.header.decode_header('=?utf-8?b?WW91IGdvdCBhIGxldHRlciBmcm9tIERhxJdyaXVzIMSZxJfEr8SZxJfEr8SZ?=')
>>> x
[('You got a letter from Da\xc4\x97rius \xc4\x99\xc4\x97\xc4\xaf\xc4\x99\xc4\x97\xc4\xaf\xc4\x99', 'utf-8')]
>>> x[0][0].decode(x[0][1])
u'You got a letter from Da\u0117rius \u0119\u0117\u012f\u0119\u0117\u012f\u0119'
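The session above decodes only the first of the two encoded-words and relies on Python 2's str.decode. As a minimal sketch for Python 3 (an assumption, since the question targets Python 2), both encoded-words can be passed to decode_header and joined with make_header. The variable names are illustrative.

```python
from email.header import decode_header, make_header

# The two encoded-words from the question's Subject header. A folded
# header separates them with whitespace, which the decoder drops.
raw_subject = (
    "=?utf-8?b?WW91IGdvdCBhIGxldHRlciBmcm9tIERhxJdyaXVzIMSZxJfEr8SZxJfEr8SZ?= "
    "=?utf-8?b?xK/El8SZxK/EmS0tLXJlcGx5M180MzM0NDE=?="
)

# decode_header() yields (bytes_or_str, charset) pairs; make_header()
# combines them into a Header object whose str() is the decoded text.
subject = str(make_header(decode_header(raw_subject)))
print(subject)
```

Going through make_header avoids decoding each pair by hand and also copes with headers that mix encoded and plain parts.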

You should look at the email.header module in the Python standard library. In particular, at the end of the documentation, there's a decode_header() function you can use to do most of the hard work for you.

The subject line is UTF-8, but you are reading it as ASCII. It is safest to read everything as UTF-8, since ASCII is a subset of UTF-8.

Related

PDF from another API corrupted after conversion and email via Sendgrid

I was working with an API that sends a binary file of the requested PDF. I wanted to email it as an attachment via SendGrid, but I keep running into problems.
So far I have tried:
Using it directly, with no encoding whatsoever. Result: corrupt file.
Looked in SendGrid's docs. According to SendGrid, the content should be "The Base64 encoded content of the attachment."
Tried both base64.b64encode(attachment_data) & binascii.b2a_base64(attachment_data) . Result: UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-13: ordinal not in range(128)
Tried type(attachment_data), which outputs "<type 'unicode'>".
Tried both base64.b64encode(attachment_data.encode("utf-8")) & binascii.b2a_base64(attachment_data.encode("utf-8")). Result: corrupt file stating "Adobe Acrobat Reader DC could not open 'property_info.pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded)."
I found some Q&A on both Stack Overflow and SendGrid's own Python GitHub, but they mostly seem to work with a PDF file, not binary data. Most just seem to say "encode it to utf-8", but in this case it just ends up corrupted. Any clues on what else I can do to send the file correctly?
Here is the function that sends the email via SendGrid:
def to_sendgrid(my_email, other_email, text, attachment_data):
    subject = "Sample pdf attached."
    from_email = Email(my_email)
    to_email = Email(other_email)
    content = Content("text/plain", text)
    mail = Mail(from_email, subject, to_email, content)
    attachment_data = base64.b64encode(attachment_data.encode("utf-8"))
    attachment = Attachment()
    attachment.set_content(attachment_data)
    attachment.set_type("application/pdf")
    attachment.set_filename("sample.pdf")
    attachment.set_disposition("attachment")
    attachment.set_content_id("Sample")
    mail.add_attachment(attachment)
    try:
        sendgrid_client.client.mail.send.post(request_body=mail.get())
    except Exception as e:
        print str(e)
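The symptoms are consistent with the PDF having been decoded to text before it reached to_sendgrid (hence type 'unicode'). This is an assumption, since the code that fetches the PDF is not shown. The Python 3 sketch below only illustrates why that matters: Base64 over the raw bytes round-trips exactly, while bytes that went through a text decode do not. encode_attachment is a hypothetical helper, and the sample bytes stand in for a real PDF. With the requests library the raw bytes would be response.content rather than response.text, though the question does not say which HTTP client was used.

```python
import base64


def encode_attachment(raw_bytes):
    """Return the Base64 form of binary data as an ASCII string."""
    return base64.b64encode(raw_bytes).decode("ascii")


# Stand-in for PDF content: a header line plus bytes that are not valid UTF-8.
pdf_bytes = b"%PDF-1.4\n\xe2\xe3\xcf\xd3\n"

# Encoding the untouched bytes is lossless.
encoded = encode_attachment(pdf_bytes)
round_tripped = base64.b64decode(encoded)

# Decoding binary data as text and re-encoding it changes the bytes,
# which is enough to make a PDF reader reject the file.
damaged = pdf_bytes.decode("utf-8", errors="replace").encode("utf-8")

print(round_tripped == pdf_bytes)
print(damaged == pdf_bytes)
```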

curl post request failing in the presence of special characters

Ok, I know there are too many questions on this topic already; reading every one of those hasn't helped me solve my problem.
I have " hello'© " on my webpage. My objective is to get this content as JSON, strip the "hello", and write the remaining content, i.e. "'©", back to the page.
I am using a CURL POST request to write back to the webpage. My code for getting the json is as follows:
request = urllib2.Request("http://XXXXXXXX.json")
user = 'xxx'
base64string = base64.encodestring('%s:%s' % (xxx, xxx))
request.add_header("Authorization", "Basic %s" % base64string)
result = urllib2.urlopen(request) #send URL request
newjson = json.loads(result.read().decode('utf-8'))
At this point, my newres is a unicode string. I discovered that my curl POST request works only with percent-encoding (like "%A3" for £).
What is the best way to do this? The code I wrote is as follows:
encode_dict = {'!':'%21',
               '"':'%22',
               '#':'%24',
               '$':'%25',
               '&':'%26',
               '*':'%2A',
               '+':'%2B',
               '@':'%40',
               '^':'%5E',
               '`':'%60',
               '©':'\xa9',
               '®':'%AE',
               '™':'%99',
               '£':'%A3'
               }
for letter in text1:
    print (letter)
    for keyz, valz in encode_dict.iteritems():
        if letter == keyz:
            print(text1.replace(letter, valz))
path = "xxxx"
subprocess.Popen(['curl','-u', 'xxx:xxx', 'Content-Type: text/html','-X','POST','--data',"text="+text1, ""+path])
This code gives me an error saying " UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if letter == keyz:"
Is there a better way to do this?
The problem was with the encoding. result.read() returns a stream of bytes, which needs to be decoded to unicode using the decode() function. Then I replaced all non-ASCII characters by encoding the unicode with the ascii codec, using encode('ascii','xmlcharrefreplace').
newjson = json.loads(result.read().decode('utf-8').encode("ascii","xmlcharrefreplace"))
Also, learning unicode basics helped me a great deal! This is an excellent tutorial.
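As a rough Python 3 illustration of the two encodings discussed above (the thread itself uses Python 2): xmlcharrefreplace turns non-ASCII characters into numeric character references, while urllib.parse.quote produces percent-encoding without a hand-written lookup table. The sample string is the one from the question; the variable names are illustrative.

```python
from urllib.parse import quote

# The characters left after stripping "hello": an apostrophe and (c).
text = "'\u00a9"

# Numeric character references, as produced by the answer's encode() call.
as_char_refs = text.encode("ascii", "xmlcharrefreplace")
print(as_char_refs)

# Percent-encoding of the UTF-8 bytes (the default behaviour of quote()).
as_percent_utf8 = quote(text, safe="")
print(as_percent_utf8)

# Percent-encoding of the Latin-1 bytes, matching the "%A3 for a pound sign" style.
as_percent_latin1 = quote("\u00a3", safe="", encoding="latin-1")
print(as_percent_latin1)
```

Which of the two percent-encoded forms the receiving server expects depends on its configuration, so that choice is left open here.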

Python IMAP: =?utf-8?Q? in subject string

I am displaying new email with IMAP, and everything looks fine, except for one message subject shows as:
=?utf-8?Q?Subject?=
How can I fix it?
In MIME terminology, those encoded chunks are called encoded-words. You can decode them like this:
import email.header
text, encoding = email.header.decode_header('=?utf-8?Q?Subject?=')[0]
Check out the docs for email.header for more details.
This is a MIME encoded-word. You can parse it with email.header:
import email.header
def decode_mime_words(s):
    return u''.join(
        word.decode(encoding or 'utf8') if isinstance(word, bytes) else word
        for word, encoding in email.header.decode_header(s))
print(decode_mime_words(u'=?utf-8?Q?Subject=c3=a4?=X=?utf-8?Q?=c3=bc?='))
The text is encoded as a MIME encoded-word. This is a mechanism defined in RFC2047 for encoding headers that contain non-ASCII text such that the encoded output contains only ASCII characters.
In Python 3.3+, the parsing classes and functions in email.parser automatically decode "encoded words" in headers if their policy argument is set to policy.default:
>>> import email
>>> from email import policy
>>> msg = email.message_from_file(open('message.txt'), policy=policy.default)
>>> msg['from']
'Pepé Le Pew <pepe@example.com>'
The parsing classes and functions are:
email.parser.BytesParser
email.parser.Parser
email.message_from_bytes
email.message_from_binary_file
email.message_from_string
email.message_from_file
Confusingly, up to at least Python 3.10, the default policy for these parsing functions is not policy.default, but policy.compat32, which does not decode "encoded words".
>>> msg = email.message_from_file(open('message.txt'))
>>> msg['from']
'=?utf-8?q?Pep=C3=A9?= Le Pew <pepe@example.com>'
Try Imbox
imaplib is a very low-level library, and it returns results that are hard to work with.
Installation
pip install imbox
Usage
from imbox import Imbox
with Imbox('imap.gmail.com',
        username='username',
        password='password',
        ssl=True,
        ssl_context=None,
        starttls=False) as imbox:
    all_inbox_messages = imbox.messages()
    for uid, message in all_inbox_messages:
        message.subject
In Python 3, decoding this to an approximated string is as easy as:
from email.header import decode_header, make_header
decoded = str(make_header(decode_header("=?utf-8?Q?Subject?=")))
See the documentation of decode_header and make_header.
High level IMAP lib may be useful here: imap_tools
from imap_tools import MailBox, AND
# get list of email subjects from INBOX folder
with MailBox('imap.mail.com').login('test@mail.com', 'pwd', 'INBOX') as mailbox:
    subjects = [msg.subject for msg in mailbox.fetch()]
Parsed email message attributes
Query builder for searching emails
Actions with emails: copy, delete, flag, move, seen
Actions with folders: list, set, get, create, exists, rename, delete, status
No dependencies

Export Gmail using Python imaplib - text mangled with newline issues

I'm using the following code to export all the emails in a specific Gmail folder.
It's working well, in that it's pulling out all the emails I expect, but it (or I) seems to be mangling the encoding of CR / newlines.
Code:
import imaplib
import email
import codecs
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('myUser@gmail.com', 'myPassword') #user / password
mail.list()
mail.select("myFolder") # connect to folder with matching label
result, data = mail.uid('search', None, "ALL") # search and return uids instead
i = len(data[0].split())
for x in range(i):
    latest_email_uid = data[0].split()[x]
    result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')
    raw_email = email_data[0][1]
    email_message = email.message_from_string(raw_email)
    save_string = str("C:\\googlemail\\boxdump\\email_" + str(x) + ".eml") #set to save location
    myfile = open(save_string, 'a')
    myfile.write(email_message)
    myfile.close()
My problem is that by the time I get to the object it's littered with '=0A', which I am assuming is an incorrectly interpreted newline or carriage return flag.
I can find it in hex, [d3 03 03 0a], but because these are not 'characters' I can't find a way for str.replace() to take the parts out. I don't actually want the newline flags.
I could convert the whole string to hex and do a replace of sorts or a regex, but that seems like overkill when the problem is in the encoding / reading of the source data.
What I see:
====
CAUTION: This email message and any attachments con= tain information that may be confidential and may be LEGALLY PRIVILEGED. If yo= u are not the intended recipient, any use, disclosure or copying of this messag= e or attachments is strictly prohibited. If you have received this email messa= ge in error please notify us immediately and erase all copies of the message an= d attachments. Thank you.
====
What I want:
====
CAUTION: This email message and any attachments contain information that may be confidential and may be LEGALLY PRIVILEGED. If you are not the intended recipient, any use, disclosure or copying of this message or attachments is strictly prohibited. If you have received this email message in error please notify us immediately and erase all copies of the message and attachments. Thank you.
====
What you are looking at is Quoted Printable encoding.
Try changing:
email_message = email.message_from_string(raw_email)
to:
email_message = str(email.message_from_string(raw_email)).decode("quoted-printable")
For more info see Standard Encodings in the Python codecs module.
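The .decode("quoted-printable") call works on Python 2 strings only. A minimal Python 3 sketch, assuming the encoded text is available as bytes, uses the standard quopri module instead. The sample is a shortened piece of the disclaimer from the question.

```python
import quopri

# Quoted-printable uses "=" at the end of a line as a soft line break and
# "=XX" for a byte written in hex, so "=0A" stands for a newline.
encoded = b"any attachments con=\ntain information. If yo=\nu are not the recipient=0A"

# decodestring() works on bytes; decode the result with the part's charset
# (UTF-8 is assumed here).
decoded = quopri.decodestring(encoded).decode("utf-8")
print(repr(decoded))
```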
Just two additional items, having gone through the pain of this for a day.
1. Do it at the payload level, so you can still process your email_message to get email addresses etc. from your mail.
2. You need to decode the charset as well. I had trouble with people copying and pasting HTML from web pages and content from Word docs etc. into emails that I was then trying to process.
if maintype == 'multipart':
    for part in email_message.get_payload():
        if part.get_content_type() == 'text/plain':
            text += part.get_payload().decode("quoted-printable").decode(part.get_content_charset())
Hope this helps someone !
Dave
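The chained .decode() calls above are Python 2 idioms. As a hedged Python 3 sketch of the same two steps (undo the transfer encoding, then decode the charset), get_payload(decode=True) can handle the quoted-printable part and return bytes. extract_text is a hypothetical helper, the sample message is made up for illustration, and the UTF-8 fallback is an assumption.

```python
import email

# A small made-up message with a quoted-printable, UTF-8 text body.
raw_message = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "caf=C3=A9 con=\n"
    "tain\n"
)


def extract_text(message):
    """Collect the decoded text of every text/plain part of a message."""
    chunks = []
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            # decode=True reverses the Content-Transfer-Encoding and returns bytes.
            payload = part.get_payload(decode=True)
            # Fall back to UTF-8 when the part declares no charset (an assumption).
            charset = part.get_content_charset() or "utf-8"
            chunks.append(payload.decode(charset, errors="replace"))
    return "".join(chunks)


body_text = extract_text(email.message_from_string(raw_message))
print(repr(body_text))
```

Using walk() rather than a single get_payload() loop also reaches text parts nested inside multipart containers.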

Python: email get_payload decode fails when hitting equal sign?

Running into strangeness with get_payload: it seems to crap out when it sees an equal sign in the message it's decoding. Here's code that displays the error:
import email
data = file('testmessage.txt').read()
msg = email.message_from_string( data )
payload = msg.get_payload(decode=True)
print payload
And here's a sample message: test message.
The message is printed only up to the first "=". The rest is omitted. Does anybody know what's going on?
The same script with "decode=False" returns the full message, so it appears the decode is unhappy with the equal sign.
This is under Python 2.5.
You have a line endings problem. The body of your test message uses bare carriage returns (\r) without newlines (\n). If you fix up the line endings before parsing the email, it all works:
import email, re
data = file('testmessage.txt').read()
data = re.sub(r'\r(?!\n)', '\r\n', data) # Bare \r becomes \r\n
msg = email.message_from_string( data )
payload = msg.get_payload(decode=True)
print payload
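To make the substitution in that fix concrete, here is a small Python 3 sketch of the same regular expression applied to a made-up string. normalize_line_endings is a hypothetical helper name, and the sketch shows only the normalization step, not the payload decoding.

```python
import re


def normalize_line_endings(text):
    """Turn every bare carriage return into CRLF, leaving existing CRLF alone."""
    # (?!\n) is a negative lookahead: match \r only when no \n follows it.
    return re.sub(r"\r(?!\n)", "\r\n", text)


sample = "Subject: test\r\rline one\r\nline two\r"
normalized = normalize_line_endings(sample)
print(repr(normalized))
```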
