Mail Internet header analyzer in Python? - python

I have been going through old posts but so far have only found solutions for identifying e.g. sender, recipient, and subject. I'm looking to get started on code that would analyze the internet header, similar to tools like https://testconnectivity.microsoft.com/MHA/Pages/mha.aspx and https://toolbox.googleapps.com/apps/messageheader/ .
I would like to be able to extract e.g. From, Reply-to, submitting MX, X-originating IP, X-mailer. Should I create a parser from scratch or is there something I could use? Perhaps a sample or something you can share?
Best,
Fredrik

The email module deals with e-mails quite nicely.
For example:
import email
# message_from_file() expects an open file object, not a file name
with open("some_saved_email.eml") as f:
    msg = email.message_from_file(f)
# To get to headers, you treat the Message() as a dict:
print(msg.keys())  # To get all header names
print(msg["X-Mailer"])  # To get a header's value
# Let us list the whole header:
for header, value in msg.items():
    print(header + ": " + value)
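As a rough sketch of how the fields named in the question could be collected with the same module: the sample message, its addresses, and the function name summarize_headers are made up for illustration. Treating the bottom-most Received header as the submitting hop is an assumption that holds only when no relay has reordered or forged the chain.

```python
import email

# Hypothetical sample message; every host and address is invented.
SAMPLE = (
    "Received: from mx.example.net (mx.example.net [203.0.113.7])\n"
    "\tby inbound.example.org with ESMTP; Mon, 1 Jan 2024 10:00:02 +0000\n"
    "Received: from client.example.net ([198.51.100.23])\n"
    "\tby mx.example.net with ESMTP; Mon, 1 Jan 2024 10:00:00 +0000\n"
    "From: Alice <alice@example.net>\n"
    "Reply-To: replies@example.net\n"
    "X-Originating-IP: [198.51.100.23]\n"
    "X-Mailer: ExampleMailer 1.0\n"
    "Subject: Test\n"
    "\n"
    "Body text\n"
)


def summarize_headers(raw_message):
    """Collect a few headers of interest from a raw message string."""
    msg = email.message_from_string(raw_message)
    # Received can occur many times, so use get_all(); collapse folding whitespace.
    received = [" ".join(value.split()) for value in msg.get_all("Received", [])]
    return {
        "from": msg["From"],
        "reply_to": msg["Reply-To"],
        "x_originating_ip": msg["X-Originating-IP"],
        "x_mailer": msg["X-Mailer"],
        # Each relay adds its Received line on top, so the last entry is the earliest hop.
        "received_hops": received,
    }


summary = summarize_headers(SAMPLE)
for key, value in summary.items():
    print(key, "->", value)
```

Headers that are missing simply come back as None, so the same function can be run over messages that lack X-Mailer or X-Originating-IP.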

Related

separate emails in the email thread based on reference or in-reply-to headers using imap_tools

I am working on a CRM, where I am receiving hundreds of emails for offers/requirements per day. I am building an API that will process the email and will insert entries in the CRM.
I am using imap_tools to fetch the mails in my API, but I am stuck when there is a thread/conversation. I read some articles about using the References or In-Reply-To header from the mail, but have had no luck so far. I have also tried using the Message-ID, but it gave me the same email thread instead of multiple emails.
I am getting an email thread/conversation as a single email, and I want to get the separate emails so I can process them easily.
Here's what I have done so far:
from imap_tools import MailBox
with MailBox('mail.mail.com').login('abc@abc.com', 'password', 'INBOX') as mailbox:
    for msg in mailbox.fetch():
        From = msg.headers['from'][0]
        To = msg.headers['to'][0]
        subject = msg.headers['subject'][0]
        received_date = msg.headers['date'][0]
        raw_email = msg.text
        process_email(raw_email)  # processing the email
The issue you are facing is not related to the References or In-Reply-To headers. Most email clients will append the previous email as quoted text to the new mail when you reply. Hence, in a thread, a mail will contain the bodies of all previous mails as quoted text.
In most cases (and I say most because email conventions vary a lot from client to client) the client will quote the previous mail by prepending > to all quoted lines:
new message
> old message
>> very old message
As a hacky solution, you can drop all lines that start with >.
In Python, you can use splitlines() and filter:
lines = email.splitlines()
new_lines = [i for i in lines if not i.startswith('>')]
or
new_lines = list(filter(lambda i: not i.startswith('>'), lines))
You may use regular expressions or other techniques too.
The issue with this solution is obvious: if an email has new text on a line that starts with >, that text is lost. Hence a more complicated approach is to select the lines starting with > and compare them with the previous emails in the thread (found via the References header), removing only those that match.
Google has their patented implementation here
https://patents.google.com/patent/US7222299
Source: How to remove the quoted text from an email and only show the new text
Edit
I realized that Gmail follows the > quoting convention, while other clients may use other methods. There's a Wikipedia article on it: https://en.wikipedia.org/wiki/Posting_style
Conceptually the approach needed will be similar, but each type of client will have to be handled separately.
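A minimal sketch of both variants described in this answer, assuming > quoting only. The function names and sample bodies are hypothetical, and collecting the earlier bodies of the thread (for example by following References) is left to the caller.

```python
def _unquote(line):
    """Strip surrounding whitespace and any leading run of '>' markers."""
    return line.strip().lstrip("> ").strip()


def strip_quoted_lines(body, earlier_bodies=()):
    """Remove '>'-quoted lines from body.

    With no earlier_bodies, every line starting with '>' is dropped (the hacky
    variant). With earlier_bodies, a quoted line is dropped only when its text
    also appears in one of the earlier messages (the conservative variant).
    """
    known = set()
    for earlier in earlier_bodies:
        for line in earlier.splitlines():
            text = _unquote(line)
            if text:
                known.add(text)

    kept = []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            text = _unquote(line)
            if not earlier_bodies or not text or text in known:
                continue  # quoted material, skip it
        kept.append(line)
    return "\n".join(kept)


reply = "new message\n> old message\n>> very old message\n> this line is new text"
thread = ["old message\n> very old message", "very old message"]

print(strip_quoted_lines(reply))
print(strip_quoted_lines(reply, thread))
```

The second call keeps the last line because it matches nothing earlier in the thread, which is the information the simple filter would lose.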

Receiving just the email value from a header

I'm parsing email headers and need to use some values from them.
This would be a piece of the code I'm using, in this case for the last received email:
result, data = mail.fetch(mails_list[-1], "(RFC822)")
raw_email = data[0][1]
contenido = email.message_from_bytes(raw_email)
print (contenido['From'])
I'm getting the values as follows:
Google <no-reply@accounts.google.com>
=?UTF-8?Q?ihniwid_ingenier=C3=ADa?= <whatever@gmail.com>
I don't know how to get just the value (the email address, in this case) without the preceding "name" and the "<>" wrapping it.
As for the "=?UTF-8?Q?" symbols on the second line, I have no idea why this happens, but it does with every non-English letter, like "í" in that case. I've been reading about text encodings and checked that I have UTF-8 set everywhere I know of: Windows config, environment variables...
Any help is appreciated.
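One possible approach, sketched with the standard library only: email.utils.parseaddr splits a header value into display name and address, and email.header.decode_header handles the "=?UTF-8?Q?...?=" form, which is the RFC 2047 encoding used for non-ASCII text in headers rather than a misconfiguration. The function name split_from_header is hypothetical, and the sample value is the one from the question, written with a normal @ address.

```python
from email.header import decode_header, make_header
from email.utils import parseaddr


def split_from_header(raw_value):
    """Return (decoded display name, bare address) for one From-style header value."""
    name, address = parseaddr(raw_value)
    # decode_header() splits RFC 2047 encoded words; make_header() joins them back
    # into a Header object that str() renders as ordinary text.
    decoded_name = str(make_header(decode_header(name)))
    return decoded_name, address


name, address = split_from_header(
    "=?UTF-8?Q?ihniwid_ingenier=C3=ADa?= <whatever@gmail.com>"
)
print(name)
print(address)
```

In the question's code this would be called as split_from_header(contenido['From']).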

Separate outlook getproperty into variables like message id, in-reply and so on

I am working on some analytics for our email help line. I can see the headers and everything that is in them, but I need to separate each header component into its own field/variable. What is the best way to accomplish this?
Here is the code I currently have.
import win32com.client
import win32com
import pandas as pd
M_date = []
M_sender = []
M_sub = []
M_flag = []
M_cat = []
M_folder = []
outlook = win32com.client.Dispatch("outlook.application").GetNamespace("MAPI")
for i in range(0, 20):
    try:
        inbox = outlook.getdefaultfolder(6).folders[i]
        try:
            for message in inbox.items:
                try:
                    Folder = str(inbox) + " " + str(i)
                    Sender= message.sendername
                    Subject= message.subject
                    Dates= message.ReceivedTime
                    M_import = message.Importance
                    if message.FlagRequest == None :
                        Flag = ""
                    else:
                        Flag = message.FlagRequest
                    if message.Categories == None:
                        cat = ""
                    else:
                        cat = message.Categories
                    msg = message.PropertyAccessor.GetProperty("http://schemas.microsoft.com/mapi/proptag/0x007D001F")
                    print(msg) #debug header
                    M_folder.append(Folder)
                    M_date.append(Dates.strftime("%b %d %Y %H:%M:"))
                    M_sender.append(Sender)
                    M_sub.append(Subject)
                    M_flag.append(Flag)
                    M_cat.append(cat)
                except:
                    pass
        except:
            pass
    except:
        pass
df = pd.DataFrame({
    'In folder': M_folder,
    'Date': M_date,
    'Sender': M_sender,
    'Subject': M_sub,
    'flags': M_flag,
    'Categrories': M_cat})
df.to_csv('email_data.csv', index=False)
Thanks
The transport headers property is a single string containing header names and their values, separated by ":". Basically you need to loop through all the lines backwards. If a line starts with a space or tab, append it to the previous line and delete the current line. Then loop through all the lines again and separate each one into the header name (left of the first ":") and the header value (right of the first ":").
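A minimal Python sketch of that procedure, assuming the string returned by GetProperty is passed in as raw_headers. The function name and sample data are hypothetical. Instead of looping backwards and deleting lines, it walks forward and appends each continuation line to the header collected just before it, which gives the same result.

```python
def parse_transport_headers(raw_headers):
    """Unfold continuation lines, then split each header at its first colon."""
    unfolded = []
    for line in raw_headers.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and unfolded:
            # Continuation line: join it to the previous header with one space.
            unfolded[-1] += " " + line.strip()
        else:
            unfolded.append(line)

    headers = []
    for line in unfolded:
        name, sep, value = line.partition(":")
        if sep:  # ignore anything that is not "Name: value"
            headers.append((name.strip(), value.strip()))
    return headers


# Made-up sample in the same shape as the transport headers string.
sample = (
    "Received: from a.example.com\r\n"
    "\tby b.example.com; Mon, 1 Jan 2024 10:00:00 +0000\r\n"
    "Message-ID: <1@example.com>\r\n"
    "In-Reply-To: <0@example.com>\r\n"
)

parsed = parse_transport_headers(sample)
for name, value in parsed:
    print(name, "=", value)
```

The result is a list of pairs rather than a dict because headers such as Received normally occur more than once. A single-valued field can then be picked out with dict(parsed).get("Message-ID").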
I do not know Python so I cannot provide any code, but I can tell you about the format of the Transport Message Headers. (I must learn Python, my son-in-law swears by it.)
The Transport Message Headers contain an indefinite number of lines separated by carriage return linefeed. In VBA to access the individual lines, you would have something like:
Dim msgParts() As String
msgParts = Split(msg, vbCrLf)
If a line starts with one or more spaces and/or horizontal tabs, it is a continuation of the previous line. Replace all the spaces and tabs at the beginning of a continuation line with one space and append it to the previous line.
A line, together with any continuation lines, starts “Xxxx: ”. “Xxxx” will be “To” or “From” or any of the other specified identifiers or a private identifier.
The lines are specified in RFCs (Requests for Comments). I would start with RFC 5321 (SMTP) and follow the references to the related RFCs, such as RFC 5322, which defines the header format. Or perhaps I would not.
I have not looked at the RFCs for SMTP (Simple Mail Transfer Protocol) for many years. My recollection is that they were once much simpler. For example, my recollection is that the specification dealt with the continuations and then dealt with the combined line; this would have been standard practice when I was young. I was looking at the specification for email addresses which seemed overly complicated with lots of CRLFs that I did not remember as being allowed within a line. I finally realised that the specification for an email address allowed for a continuation line break between any two elements. In my humble opinion, this made for an unnecessarily complex specification. I would also expect the processing code to be slower since it would be attempting to solve two separate problems at the same time.
In the end, I gave up on the SMTP RFCs. Partly because of the continuation line issue but mainly because they now handle a lot of specialised situations that are quite outside the needs of the simple emails I send and receive. I decided it was easier to analyse the emails I had sent or received than attempt to simplify the specification down to my requirements.
My interest in looking at the Transport Message Headers was because I wanted to identify the other party of every email. For every email in my Outlook folders, I was either the sender or I was one of the recipients. If I was the sender, I wanted the first or only recipient. If I was a recipient, I wanted the sender. This proved difficult or impossible from the properties such as To and From because they usually contain display names. The display names for myself, were every possible variation of my name. If this issue is relevant to you, I am happy to share how I handled it.

How to obtain the recipient list from email using IMAPClient in Python

I am using the IMAPClient library in Python. I am able to download the attached document in the email. I am interested in only Excel files.
I would like to extract the recipient list from the email. Any idea how to do it in Python?
Here is the code snippet, which might be useful:
for ind_mail in emails:
    msg_string = ind_mail['RFC822'].decode("utf-8")
    #print(msg_string.decode("utf-8"))
    email_msg = email.message_from_string(msg_string)
    for part in email_msg.walk():
        # Download only Excel File
        filetype = part.get_content_type()
        if(filetype == 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'):
            pass  # download the attachment here
The straightforward answer to your question is to get the corresponding headers' values inside that first loop, i.e.:
to_rcpt = email_msg.get_all('to', [])
cc_rcpt = email_msg.get_all('cc', [])
The MIME standard doesn't enforce uniqueness on the headers (though it strongly suggests it), hence get_all; if a header is not present, you'll still get an empty list for a subsequent loop.
But as tripleee has rightfully pointed out, the MIME headers can easily be censored, spoofed, or simply removed.
Yet this is the only info persisted and returned by a server, and it is what all mail clients use to present the message to us :)
Calling msg.get_all will return a list containing one entry per header, so if the header occurs multiple times, you'll get one entry for each occurrence.
BUT
If one header holds multiple emails in a comma-separated way, you will only get one string and you'll have to split it.
The best way to have the list of all the emails from a specific header is to use getaddresses (https://docs.python.org/3/library/email.utils.html#email.utils.getaddresses)
from email.utils import getaddresses
to_rcpt = getaddresses(email_msg.get_all('to', []))
get_all will return a list of all the "To:" headers, and getaddresses will parse each entry and return as many emails as are present in each header. For instance:
import email
# The backslash keeps the string from starting with a blank line,
# which would otherwise end the header section immediately.
message = """\
To: "Bob" <email1@gmail.com>, "John" <email2@gmail.com>
To: email3@gmail.com, email4@gmail.com
"""
email_msg = email.message_from_string(message)
to_rcpt = getaddresses(email_msg.get_all('to', []))
# => [('Bob', 'email1@gmail.com'), ('John', 'email2@gmail.com'), ('', 'email3@gmail.com'), ('', 'email4@gmail.com')]

Sending the variable's content to my mailbox in Python?

I have asked this question here about a Python command that fetches a URL of a web page and stores it in a variable. The first thing that I wanted to know then was whether or not the variable in this code contains the HTML code of a web-page:
from google.appengine.api import urlfetch
url = "http://www.google.com/"
result = urlfetch.fetch(url)
if result.status_code == 200:
doSomethingWithResult(result.content)
The answer that I received was "yes", i.e. the variable "result" in the code did contain the HTML code of a web page, and the programmer who was answering said that I needed to "check the Content-Type header and verify that it's either text/html or application/xhtml+xml". I've looked through several Python tutorials, but couldn't find anything about headers. So my question is where is this Content-Type header located and how can I check it? Could I send the content of that variable directly to my mailbox?
Here is where I got this code. It's on Google App Engines.
If you look at the Google App Engine documentation for the response object, the result of urlfetch.fetch() contains the member headers which contains the HTTP response headers, as a mapping of names to values. So, all you probably need to do is:
if result.headers['Content-Type'] in ('text/html', 'application/xhtml+xml'):
    # assuming you want to do something with the content
    doSomethingWithXHTML(result.content)
else:
    # use content for something else
    doTheOtherThing(result.content)
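One caveat with an exact comparison like the one above: servers often send the header with parameters, for example "text/html; charset=utf-8", and such a value would not match. A small sketch of a more tolerant check follows. The helper name is_html_content_type is hypothetical and it works on the plain header string only, so it does not depend on App Engine.

```python
def is_html_content_type(header_value):
    """Return True when a Content-Type header value names HTML or XHTML."""
    if not header_value:
        return False
    # Keep only the media type: drop parameters such as "; charset=utf-8".
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


print(is_html_content_type("text/html; charset=utf-8"))
print(is_html_content_type("image/png"))
```

With the fetch result from the question, this could be called as is_html_content_type(result.headers.get('Content-Type')), assuming the headers mapping supports get().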
As far as emailing the variable's contents, I suggest the Python email module.
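A rough sketch of what building such a message could look like with the email module in Python 3 (the code in the question is Python 2 on App Engine, so this is an illustration of the module, not a drop-in). The addresses, subject, and function name are placeholders, and actually sending the message is not shown because it depends on the mail service available in your environment.

```python
from email.message import EmailMessage


def build_html_message(html, sender, recipient, subject):
    """Wrap fetched HTML in a message with a plain-text fallback part."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Plain-text part for clients that do not render HTML.
    msg.set_content("This message contains an HTML page.")
    # HTML part holding the fetched content.
    msg.add_alternative(html, subtype="html")
    return msg


message = build_html_message(
    "<html><body><p>Hello</p></body></html>",
    "me@example.com",  # placeholder sender
    "me@example.com",  # placeholder recipient
    "Fetched page",
)
print(message.get_content_type())
```

Calling message.as_string() gives the full text of the message, which can be handed to whatever sending mechanism is in use.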
For information on sending a Content-Type header, see: http://code.google.com/appengine/docs/python/urlfetch/overview.html#Request_Headers
