Parsing an email message body - python

I'm using the Gmail API to parse the body of my Gmail messages. It works except when the body is HTML. Does anyone know how I can extract just the text from the email? If not, how can I simply ignore emails that contain HTML?
Eventually I want to use this for personal/professional emails, which are unlikely to contain HTML.
import base64
import email

def message_converter(message_id):
    # `service` is the authorized Gmail API client, built elsewhere
    message = service.users().messages().get(userId='me', id=message_id, format='raw').execute()
    msg_str = str(base64.urlsafe_b64decode(message['raw'].encode('ASCII')), 'UTF-8')
    mime_msg = email.message_from_string(msg_str)
    if mime_msg.is_multipart():
        for payload in mime_msg.get_payload():
            # if payload.is_multipart(): ...
            print(payload.get_payload())
    else:
        print(mime_msg.get_payload())

html2text does a pretty good job: it converts HTML into plain text (its output is Markdown-formatted).
You may need to do additional parsing/formatting afterwards, however.
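If installing a third-party package is not an option, something similar can be approximated with the standard library alone. The following is only a minimal sketch: it prefers the first text/plain part and otherwise strips tags from the first text/html part. The names extract_text and _TextOnly are made up for this example, attachments are not treated specially, and the tag stripping is crude (text inside script or style elements would be kept).

```python
from email.message import EmailMessage
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(mime_msg):
    """Return the text/plain body if present, else text stripped from text/html."""
    html_fallback = None
    for part in mime_msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        content_type = part.get_content_type()
        if content_type == "text/plain":
            return text
        if content_type == "text/html" and html_fallback is None:
            html_fallback = text
    if html_fallback is None:
        return ""
    parser = _TextOnly()
    parser.feed(html_fallback)
    parser.close()
    return "".join(parser.chunks)


# Small demo message with both a plain-text and an HTML alternative
demo = EmailMessage()
demo.set_content("plain body")
demo.add_alternative("<p>html <b>body</b></p>", subtype="html")
print(extract_text(demo))
```

In the question's function, extract_text(mime_msg) could replace the loop that prints each payload.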

I don't know if this will help you, but the Gmail API has the same structure in every language. In C#, you can find the message text in one of three places (it depends on the mail server):
msg.Payload.Parts[1].Body.Data; // here you can find the text message without HTML tags
msg.Payload.Parts[0].Body.Data; // here you can find the text message with HTML tags
msg.Payload.Body.Data; // and here you don't have a choice, you get the HTML tags
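The same lookup can be sketched in Python. This assumes the message was fetched with format='full' (not 'raw' as in the question), so that the response carries a payload dictionary with mimeType, parts and body.data fields. Searching by MIME type avoids relying on a fixed part index. The function name find_body is made up for this example.

```python
import base64


def find_body(payload, mime_type="text/plain"):
    """Depth-first search of a Gmail API payload dict for the first part
    with the given MIME type. Returns the decoded text, or None."""
    if payload.get("mimeType") == mime_type:
        data = payload.get("body", {}).get("data")
        if data:
            # body.data is URL-safe base64; pad defensively in case padding is missing
            padded = data + "=" * (-len(data) % 4)
            return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")
    for part in payload.get("parts", []):
        found = find_body(part, mime_type)
        if found is not None:
            return found
    return None
```

With a real response this would be called as find_body(message['payload']), falling back to find_body(message['payload'], 'text/html') when no plain-text part exists.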

This answer may help you with what you are trying to do. I understand that you want to get certain pieces of text out of the body of the email. You can use regular expressions to do that. I made a video explaining how to get data out of a Gmail email body, but using Google Apps Script (JavaScript):
https://youtu.be/nI1OH3pAz6s?t=8
You can download the code from this GitHub link:
https://gist.github.com/MoayadAbuRmilah/5835369fdebbecf980029f7339e4d769
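In Python, the regular-expression idea might look like the sketch below. The email body, the "Order number" field and the function name are all invented for illustration; a real pattern depends on what the emails actually contain.

```python
import re

# Hypothetical email body, for illustration only
body = "Hello,\nOrder number: 48213\nTotal: 19.99 USD\nThanks"

# Match a whole line of the form "Order number: <digits>"
ORDER_RE = re.compile(r"^Order number:\s*(\d+)\s*$", re.MULTILINE)


def extract_order_number(text):
    """Return the digits after 'Order number:' or None if the line is absent."""
    match = ORDER_RE.search(text)
    return match.group(1) if match else None


print(extract_order_number(body))
```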

Related

Sending emojis with Selenium Python

I know that questions like this have already been asked, but I have not found a clear answer that really works.
I want to send DMs automatically on social media, and I want to include emojis in them. I don't understand how people manage this, because ChromeDriver does not allow it (it reports that only BMP characters are supported).
I have found a way to add text to an input, which is as follows:
JS_ADD_TEXT_TO_INPUT = """
var elm = arguments[0], txt = arguments[1];
elm.value += txt;
elm.dispatchEvent(new Event('change'));
"""
In my case, the place where I want to add the emoji is not always an input (it can be a span or a div). Does anyone have an idea what code could help me do that? Thanks in advance!
If you're trying to add an emoji to a non-input element, you could set its textContent directly.
script = """document.querySelector('%s').textContent='%s';""" % (
    css_selector,
    value,
)
driver.execute_script(script)
Be sure to escape any quotes and special characters before interpolating a selector into the script, e.g. import re; re.escape(css_selector)
That will let you set any text on a web page element to anything, including emojis if you're not typing into an input field.
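One caveat on the escaping step: since Python 3.7, re.escape() no longer escapes quote characters, so it does not protect a string that is pasted between single quotes in JavaScript. A possible alternative, sketched below, is to let json.dumps() produce the JavaScript string literals. The function name build_set_text_script is made up for this example, and the selector in the usage note is hypothetical.

```python
import json


def build_set_text_script(css_selector, value):
    """Build a JavaScript snippet that sets textContent on the first match."""
    # json.dumps() returns a double-quoted literal that is also valid JavaScript:
    # quotes and backslashes are escaped, and non-ASCII characters (including
    # emojis) are written as \uXXXX escapes.
    return "document.querySelector(%s).textContent = %s;" % (
        json.dumps(css_selector),
        json.dumps(value),
    )


print(build_set_text_script("div.message-box", "It's working \U0001F600"))
```

The resulting string would then be passed to driver.execute_script() exactly as in the answer above.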

Separate emails in an email thread based on References or In-Reply-To headers using imap_tools

I am working on a CRM that receives hundreds of emails with offers/requirements per day. I am building an API that will process each email and insert entries into the CRM.
I am using imap_tools to fetch the mails in my API, but I am stuck when there's a thread/conversation. I read some articles about using the References or In-Reply-To header of the mail, but no luck so far. I have also tried using the Message-ID, but it gave me the same email thread instead of multiple emails.
I am getting an email thread/conversation as a single email, and I want to get the separate emails so I can process them easily.
Here's what I have done so far.
from imap_tools import MailBox
with MailBox('mail.mail.com').login('abc#abc.com', 'password', 'INBOX') as mailbox:
    for msg in mailbox.fetch():
        From = msg.headers['from'][0]
        To = msg.headers['to'][0]
        subject = msg.headers['subject'][0]
        received_date = msg.headers['date'][0]
        raw_email = msg.text
        process_email(raw_email)  # processing the email
The issue you are facing is not related to the References or In-Reply-To headers. Most email clients append the previous email as quoted text to the new mail when you reply. Hence, in a thread, a mail contains the bodies of all previous mails as quoted text.
In most cases (and I say most because email conventions vary a lot from client to client), the client quotes the previous mail by prepending > to every quoted line:
new message
> old message
>> very old message
As a hacky solution, you can drop all lines that start with >.
In Python, you can use splitlines() and filter:
lines = email.splitlines()
new_lines = [i for i in lines if not i.startswith('>')]
or
new_lines = list(filter(lambda i: not i.startswith('>'), lines))
You may use regular expressions or other techniques too.
The issue with this solution is obvious: if an email contains > at the start of a line for any other reason, information will be lost. Hence, a more complicated approach is to select the lines starting with >, compare them with the previous emails in the thread (found via References), and remove those that match.
Google has a patented implementation here:
https://patents.google.com/patent/US7222299
Source: How to remove the quoted text from an email and only show the new text
Edit
I realized that Gmail follows the > quoting convention and other clients may use other methods. There's a Wikipedia article on it: https://en.wikipedia.org/wiki/Posting_style
Conceptually the approach will be similar, but each type of client will need to be handled separately.
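The line-filtering idea from this answer can be packaged as a small function. This is only a sketch under two assumptions: quoted lines start with > (optionally after whitespace), and the reply is introduced by a single English attribution line of the form "On ... wrote:". Other clients and languages need their own patterns, and attribution lines that wrap onto two lines are not handled. The function name strip_quoted_text is made up for this example.

```python
import re

# A line is treated as quoted if it starts with optional whitespace and '>'
QUOTED_LINE = re.compile(r"^\s*>")

# Assumed attribution format, e.g. "On Mon, Jan 1, 2024 at 10:00 AM Bob wrote:"
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def strip_quoted_text(body):
    """Return the body without quoted lines and without the attribution line."""
    kept = []
    for line in body.splitlines():
        if QUOTED_LINE.match(line) or ATTRIBUTION.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


thread = (
    "new message\n"
    "\n"
    "On Mon, Jan 1, 2024 at 10:00 AM Bob <bob@example.com> wrote:\n"
    "> old message\n"
    ">> very old message\n"
)
print(strip_quoted_text(thread))
```

In the question's loop, process_email(strip_quoted_text(msg.text)) would then receive only the newest part of each mail.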

Mail Internet header analyzer in Python?

I've been going through old posts, but so far I have only found solutions for identifying e.g. sender, recipient, subject. I'm looking to get started on code that would analyze the internet headers, similar to tools like https://testconnectivity.microsoft.com/MHA/Pages/mha.aspx and https://toolbox.googleapps.com/apps/messageheader/ .
I would like to be able to extract e.g. From, Reply-to, submitting MX, X-originating IP, X-mailer. Should I create a parser from scratch or is there something I could use? Perhaps a sample or something you can share?
Best,
Fredrik
The email module deals with e-mails quite nicely.
For example:
import email
# message_from_file() expects an open file object, not a filename
with open("some_saved_email.eml") as f:
    msg = email.message_from_file(f)
# To get to headers, you treat the Message() as a dict:
print(msg.keys())  # To get all headers
print(msg["X-Mailer"])  # To get to header's value
# Let us list the whole header:
for header, value in msg.items():
    print(header + ": " + value)
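For the specific fields mentioned in the question, a slightly fuller sketch is shown below. The sample message, its addresses and the function name summarize_headers are invented for illustration. It assumes that relay IP addresses appear in square brackets inside the Received headers, which is common but not guaranteed, and it only handles IPv4.

```python
import email
import re

# Invented sample message using documentation-only addresses
SAMPLE = """\
Received: from mx1.example.net (mx1.example.net [203.0.113.7])
\tby mail.example.org with ESMTP id abc123; Mon, 1 Jan 2024 10:00:00 +0000
Received: from laptop.local (unknown [198.51.100.23])
\tby mx1.example.net with ESMTPSA id def456; Mon, 1 Jan 2024 09:59:58 +0000
From: Alice <alice@example.net>
Reply-To: replies@example.net
X-Mailer: ExampleMailer 1.0
X-Originating-IP: [198.51.100.23]
Subject: Hello

Body text.
"""

# IPv4 address in square brackets, as many servers write it in Received headers
IPV4 = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def summarize_headers(raw):
    """Collect a few headers of interest from a raw message string."""
    msg = email.message_from_string(raw)
    hops = msg.get_all("Received", [])  # a header may occur several times
    return {
        "from": msg.get("From"),
        "reply_to": msg.get("Reply-To"),
        "x_mailer": msg.get("X-Mailer"),
        "x_originating_ip": msg.get("X-Originating-IP"),
        # Each server prepends its own Received header, so the last one in the
        # message is the earliest hop; reverse to list hops in travel order.
        "relay_ips": [ip for hop in reversed(hops) for ip in IPV4.findall(hop)],
    }


for key, value in summarize_headers(SAMPLE).items():
    print(key, "=", value)
```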

Python e-mail CGI script sends duplicate e-mail

I've been experimenting with a Python CGI script to send an e-mail (hosted with a commercial web host, 123reg), and the problem is that whenever I run the script from my web browser, it sends two identical e-mails.
The code to send the mail is definitely only being executed once, there are no loops which could cause it to happen twice, I am definitely not clicking the button twice. No exceptions are thrown and the "success" page is sent to the browser as normal.
The strangest thing is that when I comment out the code to print the result page (which is very simple and has no side effects, just 3 print statements in a row) and replace it with a dummy print statement (print "Content-type: text/plain\n\ntest"), it works properly and only sends one e-mail.
I have tried googling the problem to no avail.
I am at my wit's end because this problem doesn't make any sense to me. I'm pretty sure it must be my script since inexplicably it works when you comment out those print statements.
I'd appreciate any help, thanks.
EDIT:
Here's the code which, when commented out, fixes the problem:
print "Content-type: text/html"
print
print page
EDIT:
The code to send the e-mail:
#send_email function: sends message from from_addr, assumes valid input
def send_email(from_addr, message):
    #form the email headers/text:
    email = "From: " + from_addr + "\n"
    email += "To: " + TO[0] + "\n"
    email += "Subject: " + SUBJECT + "\n"
    email += "\n"
    email += message
    #return true for success, false for failure:
    try:
        server = smtplib.SMTP(SERVER)
        server.sendmail(from_addr, TO, email)
        server.quit()
        return True
    except smtplib.SMTPException:
        return False
#end of send_email function
I'd post the code to format the page variable, but all it does is read from a file, format a string and return the string. Nothing unusual going on.
EDIT
OK, I've commented out the file IO code in the create_page function and it solves the issue, but I don't understand why, and I don't know how to modify it so that it'll work properly.
The create_page function, and therefore the file IO, was still being executed when I found that commenting out the print statements solved the problem.
This is the file IO code from before I commented it out (it's at the very start of the create_page function and the rest of the function simply modifies the page string, then returns it):
#read the template from the file:
frame_f = open(FRAME)
page = frame_f.read()
frame_f.close()
EDIT:
I have just replaced the file IO by copying and pasting the file text directly into a string in my source file, so there is no longer any file IO. This still hasn't fixed the problem. At this point my only theory is that computers hate me...
EDIT:
I'll have to post this here since stackoverflow won't let me answer my own question since I'm a newbie here...
EDIT:
OK, I posted it as an actual answer now.
PROBLEM SOLVED!
It turns out that it was the browser's fault all along. The reason I didn't notice this sooner was because I tested it in both Firefox and Chrome ages ago to rule the browser out, however it turns out that both Chrome and Firefox share this same bug.
I realised what was happening when the server logs finally updated: GET requests were often followed immediately (1 second later) by another GET request. I did some googling and found this:
What causes Firefox to make a GET request after submitting a form via the POST method?
It turns out that if you have an img tag with an empty src attribute e.g.
<img src=""/>
(I had some javascript which modified that tag), Firefox will send a duplicate GET request in place of a request for the image. It also turns out that Chrome has the same problem. This also explains why the problem was only happening when I was trying to include my html template.
It would help if you posted more code, but does the "page" variable contain code that would run the email code a second time, or cause a page refresh that would trigger the email a second time?
The same thing will happen if you have a script tag with an empty src or "#" as src:
<script type="text/javascript" src="#"></script>
Perhaps also with an empty href for a css link. I haven't experienced that, but I'd expect the same behavior.
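A template can be checked for this kind of tag before it is served. The sketch below uses the standard html.parser module to flag any src attribute, and any href on a link tag, whose value is empty or "#". The class and function names are made up for this example, and treating "#" as suspicious follows the answers above rather than any specification.

```python
from html.parser import HTMLParser


class EmptyUrlFinder(HTMLParser):
    """Hypothetical helper: records tags whose src/href is empty or '#'."""

    SUSPECT_VALUES = ("", "#")

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img src=""/>
        for name, value in attrs:
            is_url_attr = name == "src" or (name == "href" and tag == "link")
            if is_url_attr and (value or "").strip() in self.SUSPECT_VALUES:
                line_number = self.getpos()[0]
                self.problems.append((tag, name, line_number))


def find_empty_urls(html):
    """Return a list of (tag, attribute, line) tuples for suspicious URLs."""
    finder = EmptyUrlFinder()
    finder.feed(html)
    finder.close()
    return finder.problems


page = (
    "<html>\n"
    '<img src=""/>\n'
    '<script type="text/javascript" src="#"></script>\n'
    '<a href="#">top</a>\n'
    '<img src="logo.png">\n'
    "</html>"
)
for tag, attr, line in find_empty_urls(page):
    print("line %d: <%s> has an empty %s" % (line, tag, attr))
```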

Sending the variable's content to my mailbox in Python?

I have asked this question here about a Python command that fetches a URL of a web page and stores it in a variable. The first thing that I wanted to know then was whether or not the variable in this code contains the HTML code of a web-page:
from google.appengine.api import urlfetch
url = "http://www.google.com/"
result = urlfetch.fetch(url)
if result.status_code == 200:
doSomethingWithResult(result.content)
The answer that I received was "yes", i.e. the variable "result" in the code did contain the HTML code of a web page, and the programmer who was answering said that I needed to "check the Content-Type header and verify that it's either text/html or application/xhtml+xml". I've looked through several Python tutorials, but couldn't find anything about headers. So my question is where is this Content-Type header located and how can I check it? Could I send the content of that variable directly to my mailbox?
Here is where I got this code. It's on Google App Engine.
If you look at the Google App Engine documentation for the response object, the result of urlfetch.fetch() contains the member headers which contains the HTTP response headers, as a mapping of names to values. So, all you probably need to do is:
if result.headers['Content-Type'] in ('text/html', 'application/xhtml+xml'):
    # assuming you want to do something with the content
    doSomethingWithXHTML(result.content)
else:
    # use content for something else
    doTheOtherThing(result.content)
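One practical wrinkle: servers often send a parameter along with the media type, for example "text/html; charset=UTF-8", and an exact comparison against 'text/html' would then fail. A small sketch of a more tolerant check follows. The function names are made up for this example, and headers is assumed to be any mapping with a get() method, such as result.headers.

```python
HTML_TYPES = ("text/html", "application/xhtml+xml")


def media_type(content_type_header):
    """Return the lower-cased media type without parameters such as charset."""
    if not content_type_header:
        return ""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_html(headers):
    """True if the Content-Type header names an HTML or XHTML document."""
    return media_type(headers.get("Content-Type")) in HTML_TYPES


print(is_html({"Content-Type": "text/html; charset=UTF-8"}))
```

With the fetch result from the question, the condition would become is_html(result.headers).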
As for emailing the variable's contents, I suggest the Python email module.
For info on sending a Content-Type header, see here: http://code.google.com/appengine/docs/python/urlfetch/overview.html#Request_Headers
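As a rough illustration of the email module suggestion, the sketch below wraps a fetched HTML page in a message with a plain-text fallback. The addresses and the function name build_html_email are placeholders. Only the construction of the message is shown; actually sending it depends on the hosting environment (on App Engine this may have to go through the platform's own mail service rather than smtplib), so that step is left out.

```python
from email.message import EmailMessage


def build_html_email(sender, recipient, subject, html):
    """Build a multipart/alternative message carrying the HTML page."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Plain-text part for clients that do not render HTML
    msg.set_content("This message contains an HTML page.")
    # HTML part; email clients that support HTML will prefer this one
    msg.add_alternative(html, subtype="html")
    return msg


# Placeholder addresses and content, for illustration only
message = build_html_email(
    "me@example.com", "me@example.com", "Fetched page", "<p>hi</p>"
)
print(message["Subject"], message.get_content_type())
```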
