I'm parsing email headers and need to use some values from them.
Here is a piece of the code I'm using, in this case for the most recently received email:
result, data = mail.fetch(mails_list[-1], "(RFC822)")
raw_email = data[0][1]
contenido = email.message_from_bytes(raw_email)
print (contenido['From'])
I'm getting the values as follows:
Google <no-reply#accounts.google.com>
=?UTF-8?Q?ihniwid_ingenier=C3=ADa?= <whatever#gmail.com>
I don't know how to get just the value (the email address in this case), without the preceding "name" and the "<>" wrapping it.
As for the "=?UTF-8?Q?" prefix on the second line, I have no idea why it appears, but it happens with every non-English letter, such as "í" in this case. I've read up on text encodings and checked that UTF-8 is set everywhere I know of: Windows configuration, environment variables...
Any help is appreciated.
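For reference, here is a minimal sketch of one way to handle both points with the standard library. The "=?UTF-8?Q?...?=" form is an RFC 2047 encoded-word, which is how non-ASCII text is carried in headers; it is not a sign of a wrong system setting. The sketch assumes the header value is available as a string, and the address used (whatever@example.com) is a made-up placeholder.

```python
from email.header import decode_header, make_header
from email.utils import parseaddr

# Placeholder header value, shaped like the second example in the question.
raw_from = "=?UTF-8?Q?ihniwid_ingenier=C3=ADa?= <whatever@example.com>"

# parseaddr splits the header into (display name, address).
name, address = parseaddr(raw_from)

# decode_header + make_header turn the encoded-word back into readable text.
decoded_name = str(make_header(decode_header(name)))

print(address)       # the bare address, without the name or the "<>"
print(decoded_name)  # the display name with "í" restored
```

With a parsed message, the same two calls could be applied to contenido['From'].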
I have a bot I'm writing using imaplib in Python to fetch emails from Gmail and output some useful data from them. I've hit a snag on selecting the inbox, though: the existing sorting system uses custom labels to separate emails from different customers. I've partially replicated this system in my test email account, but imaplib.select() throws "imaplib.IMAP4.error: SELECT command error: BAD [b'Could not parse command']" with custom labels. Screenshot attached. My bot has no problem with the default Gmail folders, such as INBOX or [Gmail]/Spam. In that case, it hits an error later in the code that deals with a completely different problem I have yet to fix. The point, though, is that imaplib.select() is successful with the default inboxes and only fails with custom labels.
My code iterates through all the available inboxes and compares each name to a user-inputted name; if they match, it saves the name and sets a boolean to true to signal that it found a match. If there was a match (the user-inputted inbox exists) it goes ahead, otherwise it shows an error message and resets. It then attempts to select the inbox the user entered.
I've verified that the variable the program saves the inbox name to matches the name listed by the imap.list() command. I have no idea what the issue is.
I could bypass the process by iterating through all mail to find the emails I'm looking for, but it's far more efficient to use the existing sorting system due to the sheer number of emails on the account I'll be using.
Any help is appreciated!
EDIT: Code attached after request. Thank you to the person who told me to do so.
'''
Fetches emails from the specified inbox and outputs them to a popup
'''
def fetchEmails(self):
    #create an imap object. Must be local otherwise we can only establish a single connection
    #imap states are kinda bad
    imap = imaplib.IMAP4_SSL(host="imap.gmail.com", port="993")
    #Login and fetch a list of available inboxes
    imap.login(username.get(), password.get())
    type, inboxList = imap.list()
    #Set a reference boolean and iterate through the list
    inboxNameExists = False
    for i in inboxList:
        #Finds the name of the inbox
        name = self.inboxNameParser(i.decode())
        #If the given inbox name is encountered, set its existence to true and break
        if name.casefold().__eq__(inboxName.get().casefold()):
            inboxNameExists = True
            break
    #If the inbox name does not exist, break and give error message
    if inboxNameExists != True:
        self.logout(imap)
        tk.messagebox.showerror("Disconnected!", "That Inbox does not exist.")
        return
    '''
    If/else to correctly feed the imap.select() method the inbox name
    Apparently inboxes containing spaces require quotations before and after
    Selects the inbox and pushes it to a variable
    two actually but the first is unnecessary(?)
    imap is weird
    '''
    if(name.count(" ") > 0):
        status, messages = imap.select("\"" + name + "\"")
    else:
        status, messages = imap.select(name);
    #Int containing total number of emails in inbox
    messages = int(messages[0])
    #If there are no messages disconnect and show an infobox
    if messages == 0:
        self.logout(imap)
        tk.messagebox.showinfo("Disconnected!", "The inbox is empty.")
    self.mailboxLoop(imap, messages)
Figured the issue out after a few hours banging through it with a friend. As it turns out, the problem was that imap.select() wants double quotes around the mailbox name if it contains spaces. So imap.select("INBOX") is fine, but with spaces you'd need imap.select("\"" + "Label Name" + "\"")
You can see this reflected in the code I posted with the last if/else statement.
Python imaplib requires mailbox names with spaces to be surrounded by double quotes. So imap.select("INBOX") is fine, but with spaces you'd need imap.select("\"" + "Label Name" + "\"").
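If you want to avoid the if/else, a small helper can do the quoting in one place. This is only a sketch: quote_mailbox is a hypothetical name, not part of imaplib, and it assumes that always quoting is acceptable to the server (a quoted string is a valid way to send a mailbox name in IMAP, with or without spaces).

```python
def quote_mailbox(name):
    """Wrap a mailbox name in double quotes for use with imap.select().

    Hypothetical helper (not an imaplib function). Backslashes and double
    quotes inside the name are escaped so the quoted string stays valid.
    """
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


print(quote_mailbox("Label Name"))
print(quote_mailbox("INBOX"))
```

Usage would then look like imap.select(quote_mailbox(name)) regardless of whether the label contains spaces.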
I am working on some analytics for our email help line. I can see the headers and everything that is in them, but I need to separate each header component into its own field/variable. What is the best way to accomplish this?
Here is the code I currently have.
import win32com.client
import win32com
import pandas as pd
M_date = []
M_sender = []
M_sub = []
M_flag = []
M_cat = []
M_folder = []
outlook = win32com.client.Dispatch("outlook.application").GetNamespace("MAPI")
for i in range(0, 20):
    try:
        inbox = outlook.getdefaultfolder(6).folders[i]
        try:
            for message in inbox.items:
                try:
                    Folder = str(inbox) + " " + str(i)
                    Sender= message.sendername
                    Subject= message.subject
                    Dates= message.ReceivedTime
                    M_import = message.Importance
                    if message.FlagRequest == None :
                        Flag = ""
                    else:
                        Flag = message.FlagRequest
                    if message.Categories == None:
                        cat = ""
                    else:
                        cat = message.Categories
                    msg = message.PropertyAccessor.GetProperty("http://schemas.microsoft.com/mapi/proptag/0x007D001F")
                    print(msg) #debug header
                    M_folder.append(Folder)
                    M_date.append(Dates.strftime("%b %d %Y %H:%M:"))
                    M_sender.append(Sender)
                    M_sub.append(Subject)
                    M_flag.append(Flag)
                    M_cat.append(cat)
                except:
                    pass
        except:
            pass
    except:
        pass
df = pd.DataFrame({
    'In folder': M_folder,
    'Date': M_date,
    'Sender': M_sender,
    'Subject': M_sub,
    'flags': M_flag,
    'Categrories': M_cat})
df.to_csv('email_data.csv', index=False)
Thanks
The transport headers are a single string containing header names and their values, separated by ":". Basically you need to loop through all lines backwards. If a line starts with a space or tab, append it to the previous line and delete the current line. Then loop through all lines and separate each one into the header name (left of the first ":") and the header value (right of the first ":").
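A rough Python sketch of that procedure, for illustration only. parse_transport_headers is a hypothetical name, the sample string is made up, and the sketch walks forwards and appends continuation lines to the previous entry, which should give the same result as the backwards loop described above.

```python
def parse_transport_headers(raw):
    """Split a raw header block into a list of (name, value) pairs."""
    unfolded = []
    for line in raw.splitlines():
        if line[:1] in (" ", "\t") and unfolded:
            # Continuation line: join it to the previous header line.
            unfolded[-1] += " " + line.strip()
        elif line.strip():
            unfolded.append(line)

    headers = []
    for line in unfolded:
        # Split on the first ":" only, since values may contain colons.
        name, sep, value = line.partition(":")
        if sep:
            headers.append((name.strip(), value.strip()))
    return headers


# Made-up sample in the same shape as Outlook's transport headers.
sample = (
    "Received: from a.example.com\r\n"
    "\tby b.example.com;\r\n"
    " Mon, 1 Jan 2024\r\n"
    "Subject: Hello: world\r\n"
    "From: x@example.com\r\n"
)
for name, value in parse_transport_headers(sample):
    print(name, "=", value)
```

A list of pairs is used rather than a dict because some headers, such as Received, normally appear more than once.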
I do not know Python so I cannot provide any code, but I can tell you about the format of the Transport Message Headers. (I must learn Python, my son-in-law swears by it.)
The Transport Message Headers contain an indefinite number of lines separated by carriage return linefeed. In VBA to access the individual lines, you would have something like:
Dim msgParts() As String
msgParts = Split(msg, vbCrLf)
If a line starts with one or more spaces and/or horizontal tabs, it is a continuation of the previous line. Replace all the spaces and tabs at the beginning of a continuation line with one space and append it to the previous line.
A line, together with any continuation lines, starts “Xxxx: ”. “Xxxx” will be “To” or “From” or any of the other specified identifiers or a private identifier.
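For Python readers: the standard library's email package already understands this layout, so the header string can probably be handed to it directly. The following is a sketch under that assumption; headers_to_fields is a hypothetical name and the sample text is invented.

```python
from email.parser import HeaderParser


def headers_to_fields(raw_headers):
    """Parse a raw header block into {header name: [values...]}."""
    parsed = HeaderParser().parsestr(raw_headers)
    fields = {}
    for name, value in parsed.items():
        # Collapse the whitespace left over from continuation lines.
        fields.setdefault(name, []).append(" ".join(value.split()))
    return fields


# Invented sample with a continuation line and a repeated header.
sample_headers = (
    "Received: from a.example.com\n"
    "\tby b.example.com\n"
    "Received: from c.example.com\n"
    "To: someone@example.com\n"
    "Subject: Hi\n"
)
fields = headers_to_fields(sample_headers)
print(fields["Subject"])
print(fields["Received"])
```

Each key of the resulting dict could then become its own column or variable.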
The format of these lines is specified in RFCs (Requests for Comments). I would start with RFC 5321 and follow the references to the related RFCs. Or perhaps I would not.
I have not looked at the RFCs for SMTP (Simple Mail Transfer Protocol) for many years. My recollection is that they were once much simpler. For example, my recollection is that the specification dealt with the continuations and then dealt with the combined line; this would have been standard practice when I was young. I was looking at the specification for email addresses which seemed overly complicated with lots of CRLFs that I did not remember as being allowed within a line. I finally realised that the specification for an email address allowed for a continuation line break between any two elements. In my humble opinion, this made for an unnecessarily complex specification. I would also expect the processing code to be slower since it would be attempting to solve two separate problems at the same time.
In the end, I gave up on the SMTP RFCs. Partly because of the continuation line issue but mainly because they now handle a lot of specialised situations that are quite outside the needs of the simple emails I send and receive. I decided it was easier to analyse the emails I had sent or received than attempt to simplify the specification down to my requirements.
My interest in looking at the Transport Message Headers was because I wanted to identify the other party of every email. For every email in my Outlook folders, I was either the sender or one of the recipients. If I was the sender, I wanted the first or only recipient. If I was a recipient, I wanted the sender. This proved difficult or impossible using properties such as To and From because they usually contain display names. The display names for myself were every possible variation of my name. If this issue is relevant to you, I am happy to share how I handled it.
I have a problem comparing email bodies in Python.
I read the emails from text files and populate a list with their bodies:
for enum in original_list:
    with open(enum, 'r') as f:
        enum = f.read()
    msg = email.message_from_string(enum)
    for part in msg.walk():
        my_body = part.get_payload(decode=True)
        original_data_body.append(my_body)
I get the bodies of the messages from another file, which contains all the messages in mbox format, again with walk and get_payload.
The problem is that the emails in the mbox file have extra license messages at the end.
How can I remove these extra messages and compare the bodies of the emails?
Is the extra license message always the same? If yes, then you can split the string on it and keep only the first part returned by split, which will contain the original message. If it's not exactly the same but there is a pattern that repeats across messages, split on that pattern and keep the first part.
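Something along these lines, as a sketch only. strip_footer and the marker text are made up for illustration; you would substitute whatever the license message actually starts with. Note that get_payload(decode=True) returns bytes, in which case the marker must be bytes as well.

```python
def strip_footer(body, footer_marker):
    """Return body with everything from footer_marker onwards removed.

    Works for str or bytes, as long as body and footer_marker have the
    same type. If the marker is absent, the body is returned unchanged
    apart from trailing whitespace.
    """
    head, _sep, _tail = body.partition(footer_marker)
    return head.rstrip()


# Made-up marker and bodies, for illustration only.
MARKER = "-- This message is covered by the license --"
original_body = "Hello,\nsee you tomorrow.\n"
mbox_body = "Hello,\nsee you tomorrow.\n\n" + MARKER + "\nlegal text here\n"

print(strip_footer(original_body, MARKER) == strip_footer(mbox_body, MARKER))
```

Passing both sides through the same function also evens out trailing newlines, which otherwise tend to break an exact comparison.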
Yes, the message is always the same. I can split, but this means hard-coding the split. I was hoping for a more elegant way. :(
I am using the IMAPClient library in Python. I am able to download the documents attached to an email; I am only interested in Excel files.
I would also like to extract the recipient list from the email. Any idea how to do it in Python?
Here is a code snippet which might be useful:
for ind_mail in emails:
    msg_string = ind_mail['RFC822'].decode("utf-8")
    #print(msg_string.decode("utf-8"))
    email_msg = email.message_from_string(msg_string)
    for part in email_msg.walk():
        # Download only Excel File
        filetype = part.get_content_type()
        if(filetype == 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'):
            #download
The straightforward answer to your question is to get the corresponding headers' values, i.e.:
to_rcpt = email_msg.get_all('to', [])
cc_rcpt = email_msg.get_all('cc', [])
Do this inside that first loop. The MIME standard doesn't enforce uniqueness on the headers (though it strongly suggests it), hence get_all; if the header is not present, you still get an empty list, which a subsequent loop can iterate over safely.
But as tripleee has rightfully pointed out, the MIME headers can easily be censored, spoofed or simply removed.
Yet this is the only information persisted and returned by a server, and it is what all mail clients use to present messages to us :)
Calling msg.get_all returns a list with one entry per occurrence of the header, so if the header appears several times, you get one string per occurrence.
BUT
If a single header holds multiple addresses separated by commas, you only get one string and you have to split it yourself.
The best way to get the list of all the addresses from a specific header is to use getaddresses (https://docs.python.org/3/library/email.utils.html#email.utils.getaddresses)
from email.utils import getaddresses
to_rcpt = getaddresses(email_msg.get_all('to', []))
get_all returns a list of all the "To:" headers, and getaddresses parses each entry and returns every address present in each header. For instance:
import email
message = """To: "Bob" <email1#gmail.com>, "John" <email2#gmail.com>
To: email3#gmail.com, email4#gmail.com

"""
email_msg = email.message_from_string(message)
to_rcpt = getaddresses(email_msg.get_all('to', []))
=> [('Bob', 'email1#gmail.com'), ('John', 'email2#gmail.com'), ('', 'email3#gmail.com'), ('', 'email4#gmail.com')]
Here's an excerpt from the code I'm using. I'm looping through the part that adds the email; my problem is rather than changing the "to" field on each loop, it is appending the "to" data. Obviously this causes some issues, since the to field ends up getting longer and longer. I tried msgRoot.del_param('To') to no avail. I even tried setting the msgRoot['To'] to refer to the first index of a list so I could simply change the value of that list item (also didn't work).
from email.mime.multipart import MIMEMultipart
msgRoot = MIMEMultipart('related')
msgRoot['To'] = 'email#email.com'
You can use the replace_header method.
replace_header(_name, _value)
Replace a header. Replace the first header found in the message that matches _name, retaining header order and field name case. If no matching header was found, a KeyError is raised.
New in version 2.2.2.
For example,
if 'to' in msgRoot:
    msgRoot.replace_header('to', someAddress)
else:
    msgRoot['to'] = 'email#email.com'
I just do this:
del msgRoot["To"]
msgRoot["To"] = "email#email.com"
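To show why the del is needed, here is a small sketch of the loop from the question. The recipient addresses are placeholders and msg_root stands in for msgRoot. Assigning to msg_root["To"] on its own adds another To header each time, whereas deleting first leaves exactly one.

```python
from email.mime.multipart import MIMEMultipart

msg_root = MIMEMultipart("related")
recipients = ["first@example.com", "second@example.com"]  # placeholders

seen = []
for recipient in recipients:
    # Deleting a header that is not there yet is harmless (no exception).
    del msg_root["To"]
    msg_root["To"] = recipient
    # Record every To header currently on the message.
    seen.append(msg_root.get_all("To"))

print(seen)
```

Each pass through the loop should leave a single To header holding only the current recipient.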
My homebrewed blog platform at http://www.royalbarrel.com/ stores its blog posts this way, using Mime messages. Works great. And if someone adds a comment I upgrade the message to MimeMultipart and have the first payload be the actual blog post and subsequent payloads be the comments.