Finding links in an email's body with Python

I am working on a Python project that connects to an email server and checks the latest email to tell the user whether it contains an attachment or an embedded link. I have the attachment check working but not the link check.
I suspect the problem is in the if any() part of my script, since it only half works when I test it. It may also be due to how the email string is printed out.
Here is my code for connecting to Gmail and then looking for the link:
import imaplib
import email
word = ["http://", "https://", "www.", ".com", ".co.uk"]  # list of strings to search for in email body
# connection to the email server
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('email@gmail.com', 'password')
mail.list()
# Out: list of "folders" aka labels in gmail.
mail.select("Inbox", readonly=True)  # connect to inbox.
result, data = mail.uid('search', None, "ALL")  # search and return uids instead
ids = data[0]  # data is a list.
id_list = ids.split()  # ids is a space separated string
latest_email_uid = data[0].split()[-1]
result, data = mail.uid('fetch', latest_email_uid, '(RFC822)')  # fetch the email headers and body (RFC822) for the given ID
raw_email = data[0][1]  # here's the body, which is raw headers and html and body of the whole email
# including headers and alternate payloads
print "---------------------------------------------------------"
print "Are there links in the email?"
print "---------------------------------------------------------"
msg = email.message_from_string(raw_email)
for part in msg.walk():
    # each part is either non-multipart, or another multipart message
    # that contains further parts... Message is organized like a tree
    if part.get_content_type() == 'text/plain':
        plain_text = part.get_payload()
        print plain_text  # prints the raw text
        if any(word in plain_text for word in word):
            print '****'
            print 'found link in email body'
            print '****'
        else:
            print '****'
            print 'no link in email body'
            print '****'
As you can see, I have a variable called 'word' which holds a list of keywords to search for in the plain-text email body.
When I send a test email with an embedded link in the 'http://' or 'https://' format, the script prints the email body with the link in the text, like this:
---------------------------------------------------------
Are there links in the email?
---------------------------------------------------------
Test Link <http://www.google.com/>
****
found link in email body
****
And I get my message saying 'found link in email body', which is the result I am looking for in my test phase; in the final program this will trigger something else.
However, if I embed a link with no http://, such as google.com, the link doesn't print out and I don't get that result, even though the email contains an embedded link.
Is there a reason for this? I also suspect my if any() check is not the best approach. I didn't really understand it when I added it, but it worked for http:// links. Then I tried just a .com and ran into this problem, which I am having trouble solving.

To check whether an e-mail has attachments, look at its Content-Type header: if it says "multipart/*", the message may contain attachments.
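As a rough illustration of the attachment check, here is a minimal Python 3 sketch using the standard email package. The helper name list_attachment_names and the demo message are hypothetical and not part of the original script; the sketch assumes attachments are marked with a Content-Disposition of "attachment", which is common but not guaranteed for every mail client.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def list_attachment_names(msg):
    """Return the filenames of all parts marked as attachments."""
    names = []
    for part in msg.walk():
        # get_content_disposition() returns 'attachment', 'inline' or None
        if part.get_content_disposition() == "attachment":
            names.append(part.get_filename())
    return names


# Build a small demo message locally instead of fetching one over IMAP.
demo = EmailMessage()
demo["Subject"] = "demo"
demo.set_content("See the attached file.")
demo.add_attachment(
    b"hello",
    maintype="application",
    subtype="octet-stream",
    filename="hello.bin",
)

parsed = message_from_bytes(demo.as_bytes(), policy=policy.default)
print(parsed.get_content_type())      # the top-level Content-Type
print(list_attachment_names(parsed))  # filenames of attachment parts
```

A multipart Content-Type alone only says the message has several parts; checking each part's disposition is one way to tell real attachments from alternative body formats.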
To inspect the text for links, images, etc., you can use regular expressions; in my opinion this is probably your best option. With a regular expression (regex) you can find strings that match a given pattern. The pattern "<a[^>]+href=\"(.*?)\"[^>]*>(.*)?</a>", for example, should match the HTML links in your email message regardless of whether the visible link text is a single word or a full URL. I hope that helps!
Here's an example of how you can implement this in Python:
import re
text = "This is your e-mail body. It contains a link to <a href='http://www.google.com'>Google</a>."
# group 1 captures the href value, group 2 the visible link text
link_pattern = re.compile('<a[^>]+href=\'(.*?)\'[^>]*>(.*)?</a>')
search = link_pattern.search(text)
if search is not None:
    print("Link found! -> " + search.group(0))
else:
    print("No links were found.")
To the end user the link will just appear as "Google", without www and certainly without http(s). However, the source of the message will have the HTML wrapping it, so by inspecting the raw body of the message you can find all links.
My code is not perfect, but I hope it gives you a general direction. You can look up multiple patterns in your e-mail body text, for occurrences of images, videos, etc. To learn regular expressions you'll need to do a little research; Wikipedia's article on regular expressions is one place to start.
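Building on the regex idea, the following is a minimal Python 3 sketch that looks for links in both a plain-text part and an HTML part. The names URL_PATTERN, HrefCollector and find_links are hypothetical, and the list of top-level domains (com, org, net, co.uk) is an assumption chosen for illustration; a bare-domain pattern like this will also match the domain of an email address, so treat it as a starting point only.

```python
import re
from html.parser import HTMLParser

# Explicit URLs (http://, https://, www.) or bare domains such as google.com.
URL_PATTERN = re.compile(
    r"(?:https?://|www\.)[^\s<>\"']+"
    r"|\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|org|net|co\.uk)\b(?:/[^\s<>\"']*)?",
    re.IGNORECASE,
)


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_links(plain_text="", html_text=""):
    """Return links found in the plain text, then hrefs found in the HTML."""
    links = URL_PATTERN.findall(plain_text)
    collector = HrefCollector()
    collector.feed(html_text)
    return links + collector.hrefs


print(find_links("Test Link <http://www.google.com/>"))
print(find_links("see google.com for details"))
print(find_links(html_text="<a href='http://www.google.com'>Google</a>"))
```

Using html.parser for the HTML part avoids depending on the quoting style of the href attribute, which a single regex over raw HTML has to guess.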

Related

How do I send multipart HTML and PLAIN Formatted emails, through the GMAIL-API for python

I have a question related to the last answer in "How do I send HTML Formatted emails, through the gmail-api for python", but unfortunately that answer does not work for me. If I attach both the 'plain' and 'html' parts, only the LAST 'attach' call seems to take effect. That is, if I attach 'plain' AFTER 'html', the email is sent only as 'plain' (which looks unappealing on devices/apps with HTML rendering), but if I attach 'html' AFTER 'plain', it is sent only as 'html' (which looks bad on devices/apps without HTML rendering). Unlike the person who posted that question, I do need both parts, because some of the devices/apps that receive my emails do not render HTML and need the plain-text part.
This is not a problem if I use 'smtplib' instead of the Gmail API, but I want to use the Gmail API for better security in my app.
Here is my code:
import base64
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
message = MIMEMultipart('alternative')
message['to'] = to_email
message['from'] = from_email
message['subject'] = subject
body_plain = MIMEText(email_body,'plain')
message.attach(body_plain)
body_html_format="<u><b>html:<br>"+email_body+"</b></u>"
body_html = MIMEText(body_html_format,'html')
message.attach(body_html) # PROBLEM: Will only send as HTML since this was added LAST.
raw_string = base64.urlsafe_b64encode(message.as_bytes()).decode()
request = service.users().messages().send(userId='my.email@gmail.com',body={'raw':raw_string})
message = request.execute()
Thanks and regards,
Doug
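One way to check whether a part is really being dropped is to decode the raw string and list the parts it contains. The Python 3 sketch below does this without calling the Gmail API; the helper build_alternative and the sample subject and body are hypothetical. Note that in a multipart/alternative message the parts are ordered from least to most preferred, and a client normally displays the last part it can render (RFC 2046), so seeing only the HTML version in an HTML-capable client may be expected behavior rather than a sign that the plain part is missing.

```python
import base64
from email import message_from_bytes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_alternative(subject, body_text):
    """Build a multipart/alternative message with a plain and an HTML part."""
    message = MIMEMultipart("alternative")
    message["subject"] = subject
    message.attach(MIMEText(body_text, "plain"))  # fallback, attached first
    message.attach(MIMEText("<b>" + body_text + "</b>", "html"))  # preferred, attached last
    return message


message = build_alternative("demo", "hello")
raw_string = base64.urlsafe_b64encode(message.as_bytes()).decode()

# Decode the raw string again and list the content type of every part.
decoded = message_from_bytes(base64.urlsafe_b64decode(raw_string))
part_types = [part.get_content_type() for part in decoded.walk()]
print(part_types)
```

If both text/plain and text/html appear in the list, the encoded message carries both versions and the choice of which one to show is made by the receiving client.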

How to send hyperlink with SendGrid using Python

I'm trying to send a simple mail with SendGrid which must contain one hyperlink.
I'm not doing anything fancy, just following the documentation example with some changes:
import os
import sendgrid
from sendgrid.helpers.mail import *
sg = sendgrid.SendGridAPIClient(api_key=os.environ.get('SENDGRID_API_KEY'))
from_email = Email("test@example.com")
to_email = To("test@example.com")
subject = "Sending with SendGrid is Fun"
content = Content("text/html", '<html>google</html>')
mail = Mail(from_email, to_email, subject, content)
response = sg.client.mail.send.post(request_body=mail.get())
It looks fine to me, but once I run the script and the mail is sent, it shows up like plain text I cannot click on.
I also tried many other combinations removing the <html> tag, using single and double quotes with the backslash, but nothing really worked. I even tried to do the same thing without the Mail Helper Class, but it didn't work.
Thanks very much for the help.

content = Content(
    "text/html",
    "Hi User, \n This is a test email.\n This is to also check if hyperlinks work <a href='https://www.google.com'> Google </a> Regards Karthik")
This helped me. I believe you don't need to include the <html> tags.

How can I get the body of an email with Python and Google's gmail API?

I just started up with APIs and figured I'd play around with Gmail's. I'm looking to scrape all emails sent to me in the past month for some text analysis. I might just be being thick here or have missed some documentation somewhere (probably), but I can't figure out how to get the body of emails that had attachments. I'm not interested in the attachment, just the body of the email.
results = gmail.users().messages().list(labelIds=['INBOX'], q='to:me after: '+str(date),userId='me').execute()
mssg_list = results['messages']
for mssg in mssg_list:
    m_id = mssg['id']
    message = gmail.users().messages().get(userId='me', id=m_id).execute()
    body = message['payload']['parts'][1]['body']
    final_body = base64.urlsafe_b64decode(body['data'].decode("utf-8"))
For messages with attachments, it returns only attachmentId and size, rather than size and data. I tried reading the attachmentId to see if perhaps data was kept there instead, but no dice; it appears to just refer to the attachment. Where is the actual text body living?
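A possible explanation, offered as an assumption rather than a confirmed answer: when a message has attachments, the text body is often nested one level deeper (for example inside a multipart/alternative part), so a fixed index such as parts[1] can land on the attachment instead. The Python 3 sketch below searches the payload tree recursively. The helper find_plain_text and the sample_payload dictionary are hypothetical; the dictionary only mimics the mimeType, body and parts fields seen in the question and is not real API output.

```python
import base64


def find_plain_text(payload):
    """Depth-first search for the first text/plain part that carries inline data."""
    body = payload.get("body", {})
    if payload.get("mimeType") == "text/plain" and "data" in body:
        return base64.urlsafe_b64decode(body["data"]).decode("utf-8")
    for part in payload.get("parts", []):
        text = find_plain_text(part)
        if text is not None:
            return text
    return None


# Hypothetical payload shaped like a message with one attachment.
sample_payload = {
    "mimeType": "multipart/mixed",
    "parts": [
        {
            "mimeType": "multipart/alternative",
            "parts": [
                {
                    "mimeType": "text/plain",
                    "body": {"size": 5, "data": base64.urlsafe_b64encode(b"hello").decode()},
                },
                {
                    "mimeType": "text/html",
                    "body": {"size": 12, "data": base64.urlsafe_b64encode(b"<b>hello</b>").decode()},
                },
            ],
        },
        {
            "mimeType": "application/pdf",
            "filename": "report.pdf",
            "body": {"attachmentId": "abc123", "size": 1024},
        },
    ],
}

print(find_plain_text(sample_payload))
```

In the loop from the question, this would be called as find_plain_text(message['payload']) in place of indexing into parts directly.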

What fields are available after parsing an email message?

I am using email.message_from_string to parse an email message into Python. The documentation doesn't seem to say what standard fields there are.
How do I know what fields are available to read from, such as msg['to'], msg['from'], etc.? Can I still find this if I don't have an email message to experiment with on the command line?

email.message_from_string() has no fixed set of fields; it simply parses whatever headers the message contains. Calling keys() gives you all the headers present in the email.
import email
e = """Sender: test@test.dk
From: test@test.dk
HelloWorld: test

test email
"""
a = email.message_from_string(e)
print(a.keys())
Outputs: ['Sender', 'From', 'HelloWorld']
Therefore, you will never find a manual that lists from, to, sender, etc., as they are not part of the API; they are just parsed from the headers.
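To make this concrete, here is a small Python 3 sketch showing how header lookup behaves on a parsed message. The sample message and the X-Custom header are made up for illustration.

```python
import email

raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "X-Custom: 1\n"
    "\n"
    "body text\n"
)
msg = email.message_from_string(raw)

print(msg.keys())         # every header present, in order of appearance
print(msg["to"])          # header lookup is case-insensitive
print(msg["subject"])     # a missing header gives None instead of raising
print(msg.get_payload())  # the body of this non-multipart message
```

Because a missing header returns None, code can probe for msg['to'], msg['from'] and so on without knowing in advance which headers a given message carries.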

Extracting URL from email inbox

OK, there has been some confusion about what I am trying to do, so I am asking this again. I want to write a script to run against my inbox that gives me the From address, Subject, and URL in the email body. The issue is that the URL-parsing part of the script pulls all URLs from the email and not just the one from the body. Here is an example:
To: Tom@mail.com
From: Joe@test.com
Subject: Confirm your test score
Please go to the following URL to confirm your test score. WWW.test.com/confirmation
Thanks again for your input.
Signed
Joe
(Part of Joe's signature has an image)
The URL for the image is
http://www.test.com/wp-content/uploads/_client_image/66-dcfc0fc8.png
I want my output to be
From: Joe@test.com
Subject: Confirm your test score
URL: WWW.test.com/confirmation
I get this instead
From: Joe@test.com
Subject: Confirm your test score
URL: WWW.test.com/confirmation, http://www.test.com/wp-content/uploads/_client_image/66-dcfc0fc8.png
And here is my script
import re
import mailbox
import urlparse
mbx=mailbox.mbox("Mail Box Path")
url_pattern = re.compile('''["']http://[^+]*?['"]''')
for k, m in mbx.iteritems():
    print "From %s\n" % m['from']
    print "Subject %s\n" % m['subject']
    print "URL %s\n" % url_pattern.findall(m.as_string())

Signatures count as part of the body of the email, so you can't really separate the two.
If you're sure there's only one link in the email that you care about, you could try looking at only the first URL you match, but there isn't a reliable way to make sure that you're only working with the body of the email and not the signature as well.
Someone even wrote a paper on this. It's extremely difficult, especially when you can't control the format of the emails you're dealing with.
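The "first URL only" idea can be sketched as follows in Python 3. It searches only the text/plain parts rather than the whole message string, and returns the first match. The names URL_PATTERN and first_body_url are hypothetical, the pattern is an assumption that accepts http://, https:// and www. prefixes, and the sample message is adapted from the example in the question. This still cannot tell body from signature; it only relies on the link of interest appearing first.

```python
import email
import re

# Assumed pattern: a URL starts with http://, https:// or www. (any case).
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def first_body_url(msg):
    """Return the first URL found in a text/plain part, or None."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes for a leaf part
            charset = part.get_content_charset() or "utf-8"
            text = payload.decode(charset, errors="replace")
            match = URL_PATTERN.search(text)
            if match:
                return match.group(0)
    return None


raw = (
    "From: Joe@test.com\n"
    "Subject: Confirm your test score\n"
    "\n"
    "Please go to the following URL to confirm your test score. "
    "WWW.test.com/confirmation\n"
    "Signed\n"
    "Joe\n"
    "http://www.test.com/wp-content/uploads/_client_image/66-dcfc0fc8.png\n"
)
msg = email.message_from_string(raw)
print(first_body_url(msg))
```

With the mailbox loop from the question, first_body_url(m) could replace url_pattern.findall(m.as_string()), since mailbox messages are email.message.Message objects.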
