Export Gmail using Python imaplib - text mangled with newline issues

I'm using the following code to export all the emails in a specific Gmail folder.
It's working well, in that it's pulling out all the emails I expect, but it (or I) seems to be mangling the encoding for CR / newlines.
Code:
import imaplib
import email
import codecs
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('myUser@gmail.com', 'myPassword') # user / password
mail.list()
mail.select("myFolder") # connect to folder with matching label
result, data = mail.uid('search', None, "ALL") # search and return uids instead
i = len(data[0].split())
for x in range(i):
    latest_email_uid = data[0].split()[x]
    result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')
    raw_email = email_data[0][1]
    email_message = email.message_from_string(raw_email)
    save_string = str("C:\\googlemail\\boxdump\\email_" + str(x) + ".eml") # set to save location
    myfile = open(save_string, 'a')
    myfile.write(email_message)
    myfile.close()
My problem is that by the time I get to the object it's littered with '=0A', which I assume is an incorrectly interpreted newline or carriage return flag.
I can find it in hex, [d3 03 03 0a], but because this is not 'characters' I can't find a way for str.replace() to take the parts out. I don't actually want the newline flags.
I could convert the whole string to hex and do a replace of sorts / regex thing, but that seems like overkill when the problem is in the encoding / reading of the source data.
What I see:
====
CAUTION: This email message and any attachments con= tain information that may be confidential and may be LEGALLY PRIVILEGED. If yo= u are not the intended recipient, any use, disclosure or copying of this messag= e or attachments is strictly prohibited. If you have received this email messa= ge in error please notify us immediately and erase all copies of the message an= d attachments. Thank you.
====
What I want:
====
CAUTION: This email message and any attachments contain information that may be confidential and may be LEGALLY PRIVILEGED. If you are not the intended recipient, any use, disclosure or copying of this message or attachments is strictly prohibited. If you have received this email message in error please notify us immediately and erase all copies of the message and attachments. Thank you.
====

What you are looking at is Quoted Printable encoding.
Try changing:
email_message = email.message_from_string(raw_email)
to:
email_message = str(email.message_from_string(raw_email)).decode("quoted-printable")
For more info see Standard Encodings in the Python codecs module.
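The `.decode("quoted-printable")` call is a Python 2 idiom; in Python 3, `str` has no `decode` method. A minimal Python 3 sketch, assuming the raw message arrives as bytes (as `imaplib` returns it) and is a single-part text message, lets the `email` package undo the transfer encoding through `get_payload(decode=True)`. The sample message and the helper name `decoded_body` are invented for illustration.

```python
import email

# Hypothetical raw message, shaped like the bytes imaplib returns for '(RFC822)'.
RAW_EMAIL = (
    b"From: sender@example.com\r\n"
    b"Subject: Test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=\"utf-8\"\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"This email message and any attachments con=\r\n"
    b"tain information that may be confidential.=0A"
)


def decoded_body(raw_bytes):
    """Return the body of a single-part message as decoded text."""
    message = email.message_from_bytes(raw_bytes)
    # decode=True reverses the Content-Transfer-Encoding (quoted-printable here).
    payload = message.get_payload(decode=True)
    charset = message.get_content_charset() or "utf-8"
    return payload.decode(charset, errors="replace")


print(decoded_body(RAW_EMAIL))
```

Decoding only the payload leaves the headers untouched, so the message object can still be used to read addresses and the subject.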

Just 2 additional items, having gone through the pain of this for a day.
1. Do it at the payload level, so you can still process your email_message to get email addresses etc. from your mail.
2. You need to decode the charset as well. I had trouble with people copying and pasting HTML from web pages and content from Word docs etc. into emails that I was then trying to process.
text = ''
maintype = email_message.get_content_maintype()
if maintype == 'multipart':
    for part in email_message.get_payload():
        if part.get_content_type() == 'text/plain':
            text += part.get_payload().decode("quoted-printable").decode(part.get_content_charset())
Hope this helps someone!
Dave
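For Python 3, the same payload-level idea might look like the sketch below. It assumes the message has already been parsed with the standard `email` package; `plain_text_parts` is a made-up helper name and the demo message is illustrative data, not taken from the thread.

```python
import email
from email.message import EmailMessage


def plain_text_parts(message):
    """Concatenate every text/plain part, decoded to str."""
    chunks = []
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        raw = part.get_payload(decode=True)  # undo quoted-printable / base64
        charset = part.get_content_charset() or "utf-8"
        chunks.append(raw.decode(charset, errors="replace"))
    return "".join(chunks)


# Build a small multipart message to exercise the helper (illustrative data).
demo = EmailMessage()
demo["From"] = "Dave <dave@example.com>"
demo.set_content("We’ve added new tasks.", cte="quoted-printable")
demo.add_alternative("<p>We’ve added new tasks.</p>", subtype="html")

parsed = email.message_from_bytes(demo.as_bytes())
print(parsed["From"])            # headers stay accessible on the message
print(plain_text_parts(parsed))
```

Using `walk()` also reaches text parts nested inside inner multipart containers, which a single pass over `get_payload()` would miss.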

Related

issues with parsing email using python with "imaplib" library, html lines char limit, and additional non unicode chara

I'm using Python 3 and want to validate emails sent to my inbox.
I'm using imaplib and have managed to get the email content.
However, the mail is unreadable and somewhat corrupted (variable html123 in the code)
after I fetch the mail and get the content using:
mail_body = email.message_from_string(str(data[1][0][1], 'utf-8'))
this is the original mail i see in mailbox:
dear blabla,
We’ve added new tasks to your account. Please log in to your account to review and.....
this is the mail i get in python
| |
dear blabla,
We=E2=80=99ve added new tasks to your account. Plea= se log in to your account....
So there are 3 issues in this example (I have many more in the real mail):
1 - the ’ was replaced with =E2=80=99
2 - the word "please" is cut at the end of the line, with =
3 - all the signs/chars (| | ---) you see above
This is the relevant part of the code:
data = self.mail_conn.fetch(str(any_email_id), f'({fetch_protocol})')
mail_body = email.message_from_string(str(data[1][0][1], 'utf-8'))
html123 = mail_body.get_payload()
x1 = (html2text.html2text(html123))

The data you get from imaplib is in "quoted-printable" encoding. https://en.wikipedia.org/wiki/Quoted-printable
To decode it you can use the built-in quopri module:
import quopri
quopri.decodestring("we=E2=80=99ve").decode() # -> we’ve
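As a slightly fuller sketch, the sample from the question can be decoded in one call, assuming the "=" that splits "Please" was followed by a real line break in the raw data (a quoted-printable soft line break). The `encoded` literal below is reconstructed from the question for illustration.

```python
import quopri

# "=E2=80=99" is the UTF-8 encoding of a typographic apostrophe, and a "=" at
# the very end of a line is a soft line break that the decoder removes.
encoded = (
    b"We=E2=80=99ve added new tasks to your account. Plea=\n"
    b"se log in to your account"
)

decoded = quopri.decodestring(encoded).decode("utf-8")
print(decoded)
```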

pyramid_mailer `Message` and `Content-Transfer-Encoding`

I'm sending emails with pyramid_mailer and found this weird issue that when I use Office365 as my SMTP server it adds random = characters into my message. I don't get that issue with any other mail server (I tested this with gmail and also with my own postfix server)
I send emails like below:
from pyramid_mailer.mailer import Mailer
from pyramid_mailer.message import Attachment, Message
mailer = Mailer()
mailer.smtp_mailer.hostname = "test.mail.at.office365"
mailer.smtp_mailer.username = "my_user"
mailer.smtp_mailer.password = "secret"
mailer.smtp_mailer.port = 587
mailer.smtp_mailer.tls = True
message = Message(
    subject="Test",
    sender="my_user@my_domain.com",
    recipients="test_user@test_domain.com",
    body="very long text, at least 75 characters long so Office 365 will break it and insert annoying '=' into message",
    html="very long text, at least 75 characters long so Office 365 will break it and insert annoying '=' into message",
)
mailer.send_immediately(message)
I searched on Google and found this has something to do with line breaks and Content-Transfer-Encoding. And indeed, if I add \r\n every ~50 characters, no = is added. But the problem is that I might want to send a hyperlink that is longer than that...
The pyramid_mailer documentation (https://docs.pylonsproject.org/projects/pyramid_mailer/en/latest/) says I can use an Attachment rather than a plain string. And indeed, as soon as I do that I can set Content-Transfer-Encoding to something like base64 (as suggested here: https://jeremytunnell.com/2009/01/04/really-hairy-problem-with-seemingly-random-crlf-and-spaces-inserted-in-emails/), but my message then shows up as an attachment, not as a regular message...
There seems to be no way to add Content-Transfer-Encoding to the Message object... I tried to use Message.extra_headers = {'Transfer-Content-Encoding': 'base64'} but this did not help.
I'm totally out of ideas, would appreciate any help...
-- Edit --
Thanks to the answer below from @Mess:
from pyramid_mailer.message import Attachment
my_message = "very long text, at least 75 characters long so Office 365 will break it and insert annoying '=' into message"
body_html = Attachment(data=my_message, transfer_encoding="base64", disposition='inline')
body_text = Attachment(data=my_message, transfer_encoding="base64", disposition='inline')
Then pass body_html and body_text to Message constructor.

It is the "Content-Disposition" header that you need to set to control how the content is presented to the recipient.
Set it to "attachment" to offer the content as a downloadable file, or to "inline" to display the content (for example, a logo) directly in your email:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition
I hope this points you in the right direction.
EDIT:
Using pyramid_mailer package it would be something like:
from pyramid_mailer.message import Attachment
attachment = Attachment(data=some_data, transfer_encoding="base64", disposition='inline')
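To see what the transfer encoding changes on the wire, here is a minimal sketch that uses only the standard library's `email.message.EmailMessage` rather than pyramid_mailer, so it illustrates the header's effect and not pyramid_mailer's internals. The sample text is made up.

```python
from email.message import EmailMessage

# One line of 119 characters, longer than the 78-character default line limit.
long_text = " ".join(["very long text"] * 8)

# Quoted-printable wraps long lines and marks each wrap with a trailing "=".
qp_msg = EmailMessage()
qp_msg.set_content(long_text, cte="quoted-printable")

# Base64 re-encodes the whole body, so no "=" soft breaks appear in the text.
b64_msg = EmailMessage()
b64_msg.set_content(long_text, cte="base64")

print(qp_msg.get_payload())                   # encoded body with soft breaks
print(b64_msg["Content-Transfer-Encoding"])
print(b64_msg.get_content().strip() == long_text)
```

A client that decodes quoted-printable correctly removes the soft breaks again; stray "=" characters usually mean the body was re-wrapped or displayed without decoding somewhere along the way.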

Python Gmail Api Base64 Decode Strange Chars In Email Body

I'm using the Gmail API to retrieve emails from my inbox:
query = 'to:me after:{}'.format(weekStartDate)
unreadEmailsQuery = service.users().messages().list(userId='me', q=query).execute()
# For Each Email
for message in unreadEmailsQuery['messages']:
    result = service.users().messages().get(id=message['id'],userId='me').execute()
    email_content = ''
    if 'data' in result['payload']['body'].keys():
        email_content+= result['payload']['body']['data']
    else:
        for part in result['payload']['parts']:
            email_content = part['body']['data'] + email_content
    test = bytes(str(email_content),encoding='utf-8')
    print(base64.decodebytes(test))
prints out simple plain text messages correctly:
b'Got another one with me
But prints out html messages like this:
b'<body\x03B\x83B\x83B\x83B\x88\x08\x0f\x1bY]\x18H\x1a\x1d\x1d\x1c\x0bY\\]Z]\x8fH\x90\xdb\
I can see that it's okay until the first >; from then on the string is printed incorrectly and I'm not sure why.
I am trying to extract words out of my email so that I can train a classifier but I am stuck.
Any help would be greatly appreciated.

I needed to use the URL-safe base64 decoding.
I managed to get this working by changing the last line:
print(base64.decodebytes(test))
to:
print(base64.urlsafe_b64decode(test))
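The reason the output breaks at the first ">" is that the URL-safe alphabet uses "-" and "_" where standard base64 uses "+" and "/", and "<body>" happens to encode to "PGJvZHk-". A small sketch follows; the helper name `decode_gmail_body` is hypothetical, and re-adding "=" padding is a defensive assumption in case the padding was stripped.

```python
import base64


def decode_gmail_body(data):
    """Decode a base64url string (alphabet with '-' and '_') to text."""
    # Restore any stripped '=' padding so the length is a multiple of 4.
    padded = data + "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")


# Encode some HTML the URL-safe way, then decode it without its padding.
sample = base64.urlsafe_b64encode(b"<body><p>Hi</p></body>").decode("ascii")
print(sample)
print(decode_gmail_body(sample.rstrip("=")))
```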

Python mail inserts space every 171 characters

I am trying to write a Python script to send an email that uses HTML formatting and involves a lot of non-breaking spaces. However, when I run it, some of the &nbsp; strings are interrupted by spaces that occur every 171 characters, as can be seen in this example:
#!/usr/bin/env python
import smtplib
import socket
from email.mime.text import MIMEText
emails = ["my@email.com"]
sender = "test@{0}".format(socket.gethostname())
message = "<html><head></head><body>"
for i in range(20):
    message += "&nbsp;" * 50
    message += "<br/>"
message += "</body>"
message = MIMEText(message, "html")
message["Subject"] = "Test"
message["From"] = sender
message["To"] = ", ".join(emails)
mailer = smtplib.SMTP("localhost")
mailer.sendmail(sender, emails, message.as_string())
mailer.quit()
The example should produce a blank email that consists of only spaces, but it ends up looking something like this:
&nbsp ;
&nb sp;
& nbsp;
&nbs p;
&n bsp;
Edit: In case it is important, I am running Ubuntu 15.04 with Postfix for the smtp client, and using python2.6.

I can replicate this in a way, but my line breaks come every 999 characters. RFC 821 says the maximum length of a line is 1000 characters including the line break, so that's probably why.
This post gives a different way to send an HTML email in Python, and I believe the MIME type "multipart/alternative" is the correct way:
Sending HTML email using Python
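One way to keep every transmitted line short, sketched here for Python 3 and not verified against the asker's Python 2.6 setup, is to give `MIMEText` an explicit "utf-8" charset. The standard library then base64-encodes the body, so the server has no long line to break. The HTML string is illustrative.

```python
from email.mime.text import MIMEText

# A single line that is far longer than the SMTP line-length limit.
html = "<html><body>" + "&nbsp;" * 1000 + "</body></html>"

# With charset "utf-8", MIMEText base64-encodes the body into short lines.
message = MIMEText(html, "html", "utf-8")
message["Subject"] = "Test"

longest = max(len(line) for line in message.as_string().splitlines())
print(message["Content-Transfer-Encoding"], longest)

# Decoding the payload gives back the original markup unchanged.
restored = message.get_payload(decode=True).decode("utf-8")
print(restored == html)
```

Inserting ordinary newlines into the HTML source every few hundred characters is another option, since whitespace between tags is not significant for rendering.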

I'm the developer of yagmail, a package that tries to make it easy to send emails.
You can use the following code:
import yagmail
yag = yagmail.SMTP('me@gmail.com', 'mypassword')
message = ""
for i in range(20):
    message += "&nbsp;" * 50
    message += "<br/>"
yag.send(contents = message)
Note that by default it will send an HTML message, and that it also automatically adds the alternative part for non-HTML browsers.
Also, note that omitting the subject will leave an empty subject, and without a to argument it will send the message to yourself.
Furthermore, note that if you set yagmail up correctly, you can just log in using yag.SMTP(), without having to have the username and password in the script (while still being secure). Omitting the password will prompt a getpass.
Adding an attachment is as simple as pointing to a local file, e.g.:
yag.send(contents = [message, 'previously a lot of whitespace', '/local/path/file.zip'])
Awesome, isn't it? Thanks for allowing me to show a nice use case for yagmail :)
If you have any feature requests, issues or ideas please let me know at github.

Python, IMAP and GMail. Mark messages as SEEN

I have a Python script that has to fetch unseen messages, process them, and mark them as seen (or read).
I do this after logging in:
typ, data = self.server.imap_server.search(None, '(UNSEEN)')
for num in data[0].split():
    print "Mensage " + str(num) + " mark"
    self.server.imap_server.store(num, '+FLAGS', '(SEEN)')
The first problem is that, the search returns ALL messages, and not only the UNSEEN.
The second problem is that messages are not marked as SEEN.
Can anybody give me a hand with this?
Thanks!

import imaplib
obj = imaplib.IMAP4_SSL('imap.gmail.com', 993)
obj.login('user', 'password')
obj.select('Inbox') # select the inbox
typ, data = obj.search(None, 'UnSeen')
obj.store(data[0].replace(' ', ','), '+FLAGS', '\\Seen')

I think the flag names need to start with a backslash, eg: \SEEN
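A Python 3 sketch of the search-and-flag loop follows, assuming `conn` is an already logged-in `imaplib.IMAP4_SSL` object with a mailbox selected read-write. The function name `mark_unseen_as_seen` and the credentials in `main` are placeholders, and `main` is deliberately not called because it needs a real server.

```python
import imaplib


def mark_unseen_as_seen(conn):
    """Flag every unseen message in the selected mailbox as \\Seen."""
    status, data = conn.search(None, "UNSEEN")
    if status != "OK":
        return []
    message_numbers = data[0].split()
    for num in message_numbers:
        # System flags carry a leading backslash: \Seen, not SEEN.
        conn.store(num, "+FLAGS", "\\Seen")
    return message_numbers


def main():
    # Placeholder host and credentials; replace with real values.
    conn = imaplib.IMAP4_SSL("imap.gmail.com")
    conn.login("user@example.com", "password")
    conn.select("INBOX")  # read-write by default, so flags can be stored
    print(mark_unseen_as_seen(conn))
    conn.logout()
```

If the mailbox is selected with `readonly=True`, the server will refuse the `store` call, which is one possible reason flags appear not to stick.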

I am not so familiar with imaplib, but I implemented this with the imapclient module:
import imapclient, pyzmail, html2text
from backports import ssl
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
iobj = imapclient.IMAPClient('outlook.office365.com', ssl=True, ssl_context=context)
iobj.login(uname, pwd) # provide your username and password
iobj.select_folder('INBOX', readonly=True) # select the inbox; use readonly=False if you want to change flags later
unread = iobj.search('UNSEEN') # select unread messages; you can add more search criteria here ('FROM', 'SINCE', etc.)
print('There are: ', len(unread), ' UNREAD emails')
for i in unread:
    mail = iobj.fetch(i, ['BODY[]']) # fetch the full message
    mcontent = pyzmail.PyzMessage.factory(mail[i][b'BODY[]']) # parse the raw message
    subject = mcontent.get_subject() # you might not need this
    receiver_name, receiver_email = mcontent.get_address('from')
    mail_body = html2text.html2text(mcontent.html_part.get_payload().decode(mcontent.html_part.charset)) # the HTML body converted to plain text
Let's say I want to just go through the unread emails, reply to the sender and mark the email as read. I'd call the SMTP code from here to compose and send a reply.
import smtplib
smtpobj = smtplib.SMTP('smtp.office365.com', 587)
smtpobj.starttls()
smtpobj.login(uname, pwd) # your username and password go here
sub = 'Subject: ' + str(subject) + '\n\n' # subject of your reply
msg = "Thanks for your email! You're qualified for the next round" # some random reply :(
fullmsg = sub + msg
smtpobj.sendmail(uname, receiver_email, fullmsg) # this sends the email to the sender parsed above
iobj.set_flags(i, ['\\Seen', '\\Answered']) # this marks the email as read and adds the answered flag
iobj.append('Sent Items', fullmsg) # this puts a copy of your reply in your Sent Items
iobj.logout()
smtpobj.quit()
I hope this helps
