How to scrape a link from a multipart email in python

How to scrape a link from a multipart email in python - python

I have a program which logs on to a specified gmail account and gets all the emails in a selected inbox that were sent from an email that you input at runtime.
I would like to be able to grab all the links from each email and append them to a list so that i can then filter out the ones i don't need before outputting them to another file. I was using a regex to do this which requires me to convert the payload to a string. The problem is that the regex i am using doesn't work for findall(), it only works when i use search() (I am not too familiar with regexes). I was wondering if there was a better way to extract all links from an email that doesn't involve me messing around with regexes?
My code currently looks like this:
print(f'[{Mail.timestamp}] Scanning inbox')
sys.stdout.write(Style.RESET)
self.search_mail_status, self.amount_matching_criteria = self.login_session.search(Mail.CHARSET,search_criteria)
if self.amount_matching_criteria == 0 or self.amount_matching_criteria == '0':
print(f'[{Mail.timestamp}] No mails from that email address could be found...')
Mail.enter_to_continue()
import main
main.main_wrapper()
else:
pattern = '(?P<url>https?://[^\s]+)'
prog = re.compile(pattern)
self.amount_matching_criteria = self.amount_matching_criteria[0]
self.amount_matching_criteria_str = str(self.amount_matching_criteria)
num_mails = re.search(r"\d.+",self.amount_matching_criteria_str)
num_mails = ((num_mails.group())[:-1]).split(' ')
sys.stdout.write(Style.GREEN)
print(f'[{Mail.timestamp}] Status code of {self.search_mail_status}')
sys.stdout.write(Style.RESET)
sys.stdout.write(Style.YELLOW)
print(f'[{Mail.timestamp}] Found {len(num_mails)} emails')
sys.stdout.write(Style.RESET)
num_mails = self.amount_matching_criteria.split()
for message_num in num_mails:
individual_response_code, individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
message = email.message_from_bytes(individual_response_data[0][1])
if message.is_multipart():
print('multipart')
multipart_payload = message.get_payload()
for sub_message in multipart_payload:
string_payload = str(sub_message.get_payload())
print(prog.search(string_payload).group("url"))

Ended up using this for loop with a recursive function and a regex to get the links, i then removed all links without a the substring that you can input earlier on in the program before appending to a set
for message_num in self.amount_matching_criteria.split():
counter += 1
_, self.individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
self.raw = email.message_from_bytes(self.individual_response_data[0][1])
raw = self.raw
self.scraped_email_value = email.message_from_bytes(Mail.scrape_email(raw))
self.scraped_email_value = str(self.scraped_email_value)
self.returned_links = prog.findall(self.scraped_email_value)
for i in self.returned_links:
if self.substring_filter in i:
self.link_set.add(i)
self.timestamp = time.strftime('%H:%M:%S')
print(f'[{self.timestamp}] Links scraped: [{counter}/{len(num_mails)}]')
The function used:
def scrape_email(raw):
if raw.is_multipart():
return Mail.scrape_email(raw.get_payload(0))
else:
return raw.get_payload(None,True)

Related

Unable to loop correctly

I am working on assignment to extract emails from the mailbox.
Below are my codes, I am referencing from this case and combine with some other research online:
import win32com.client
import pandas as pd
import os
outlook = win32com.client.Dispatch("Outlook.Aplication").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6).Folders["Testmails"]
condition = pd.read_excel(r"C:\Users\Asus\Desktop\Python\Condition.xlsx", sheet_name = 'endword')
emails = condition.iloc[:,1].tolist()
done = outlook.GetDefaultFolder(6).Folders["Testmails"].Folders["Done"]
Item = inbox.Items.GetFirst()
add = Item.SenderEmailAddress
for attachment in Item.Attachments:
if any([add.endswith(m) for m in condition]) and Item.Attachments.Count > 0:
print(attachment.FileName)
dir = "C:\\Users\\Asus\\Desktop\\Python\\Output\\"
fname = attachment.FileName
outpath = os.path.join(dir, fname)
attachment.SaveAsFile(outpath)
Item.Move(done)
The code above is running, but it only saves the first email attachment, and the other email that matches the condition is not saving.
The condition file is like below, if is gmail to save in file A. But I am not sure if we can do by vlookup in loops.
mail end Directory
0 gmail.com "C:\\Users\\Asus\\Desktop\\Output\\A\\"
1 outlook.com "C:\\Users\\Asus\\Desktop\\Output\\A\\"
2 microsoft.com "C:\\Users\\Asus\\Desktop\\Output\\B\\"
Thanks for all the gurus who is helping much. I have edited the codes above but now is facing other issues on looping.

Fix Application on Dispatch("Outlook.Aplication") should be double p
On filter add single quotation mark round 'emails'
Example
Filter = "[SenderEmailAddress] = 'emails'"
for loop, you are using i but then you have print(attachment.FileName) / attachment.SaveAsFile
use i for all - print(i.FileName) / i.SaveAsFile or attachment
import win32com.client
Outlook = win32com.client.Dispatch("Outlook.Application")
olNs = Outlook.GetNamespace("MAPI")
Inbox = olNs.GetDefaultFolder(6)
Filter = "[SenderEmailAddress] = '0m3r#email.com'"
Items = Inbox.Items.Restrict(Filter)
Item = Items.GetFirst()
if Item.Attachments.Count > 0:
for attachment in Item.Attachments:
print(Item.Attachments.Count)
print(attachment.FileName)
attachment.SaveAsFile(r"C:\path\to\my\folder\Attachment.xlsx")

The 'NoneType' object has no attribute 'Attachments' error means that you're trying to get attachments from something that is None.
You're getting attachments in only one place:
for i in Item.Attachments:
...
so we can conclude that the Item here is None.
By looking at Microsoft's documentation we can see that the method...
Returns Nothing if no first object exists, for example, if there are no objects in the collection
Therefore, I'd imagine there's an empty collection, or no emails matching your filter
To handle this you could use an if statement
if Item is not None:
for i in Item.Attachments:
...
else:
pass # Do something here if there's nothing matching your filter

How to make a time restriction in outlook using python?

I am making a program that:
opens outlook
find emails per subject
extract some date from emails (code and number)
fills these data in excel file in.
Standard email looks like this:
Subject: Test1
Hi,
You got a new answer from user Alex.
Code: alex123fj
Number1: 0611111111
Number2: 1020
Number3: 3032
I encounter 2 main problems in the process.
Firstly, I do not get how to make time restriction for emails in outlook. For example, if I want to read emails only from yesterday.
Secondly, all codes and numbers from email I save in lists. But every item gets this ["alex123fj/r"] in place from this ["alex123fj"]
I would appreciate any help or advice, that is my first ever program in Python.
Here is my code:
import win32com.client
import re
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.Folders('myemail#....').Folders('Inbox')
messages = inbox.Items
def get_code(messages):
codes_lijst = []
for message in messages:
subject = message.subject
if subject == "Test1":
body = message.body
matches = re.finditer("Code:\s(.*)$", body, re.MULTILINE)
for match in matches:
codes_lijst.append(match.group(1))
return codes_lijst
def get_number(messages):
numbers_lijst = []
for message in messages:
subject = message.subject
if subject == "Test1":
body = message.body
matches = re.finditer("Number:\s(.*)$", body, re.MULTILINE)
for match in matches:
numbers_lijst.append(match.group(1))
return numbers_lijst
code = get_code(messages)
number = get_number(messages)
print(code)
print(number)

Firstly, never loop through all items in a folder. Use Items.Find/FindNext or Items.Restrict with a restriction on ConversationTopic (e.g. [ConversationTopic] = 'Test1').
To create a date/time restriction, add a range restriction ([ReceivedTime] > 'some value') and [ReceivedTime] < 'other value'

monitoring a text site (json) using python

IM working on a program to grab variant ID from this website
https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json
Im using the code
import json
import requests
import time
endpoint = "https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json"
req = requests.get(endpoint)
reqJson = json.loads(req.text)
for id in reqJson['product']:
name = (id['title'])
print (name)
I dont know what to do here in order to grab the Name of the items. If you visit the link you will see that the name is under 'title'. If you could help me with this that would be awesome.
I get the error message "TypeError: string indices must be integers" so im not too sure what to do.

Your biggest problem right now is that you are adding items to the list before you're checking if they're in it, so everything is coming back as in the list.
Looking at your code right now, I think what you want to do is combine things into a single for loop.
Also as a heads up you shouldn't use a variable name like list as it is shadowing the built-in Python function list().
list = [] # You really should change this to something else
def check_endpoint():
endpoint = ""
req = requests.get(endpoint)
reqJson = json.loads(req.text)
for id in reqJson['threads']: # For each id in threads list
PID = id['product']['globalPid'] # Get current PID
if PID in list:
print('checking for new products')
else:
title = (id['product']['title'])
Image = (id['product']['imageUrl'])
ReleaseType = (id['product']['selectionEngine'])
Time = (id['product']['effectiveInStockStartSellDate'])
send(title, PID, Image, ReleaseType, Time)
print ('added to database'.format(PID))
list.append(PID) # Add PID to the list
return
def main():
while(True):
check_endpoint()
time.sleep(20)
return
if __name__ == "__main__":
main()

script to serve from url, for requests matching regular expression

I am a complete n00b in Python and am trying to figure out a stub for mitmproxy.
I have tried the documentation but they assume we know Python so i am at a stalemate.
I've been working with a script:
original_url = 'http://production.domain.com/1/2/3'
new_content_path = '/home/andrepadez/proj/main.js'
body = open(new_content_path, 'r').read()
def response(context, flow):
url = flow.request.get_url()
if url == original_url:
flow.response.content = body
As you can predict, the proxy takes every request to 'http://production.domain.com/1/2/3' and serves the content of my file.
I need this to be more dynamic:
for every request to 'http://production.domain.com/*', i need to serve a correspondent URL, for example:
http://production.domain.com/1/4/3 -> http://develop.domain.com/1/4/3
I know i have to use a regular expression, so i can capture and map it correctly, but i don't know how to serve the contents of the develop url as "flow.response.content".
Any help will be welcome

You would have to do something like this:
import re
# In order not to re-read the original file every time, we maintain
# a cache of already-read bodies.
bodies = { }
def response(context, flow):
# Intercept all URLs
url = flow.request.get_url()
# Check if this URL is one of "ours" (check out Python regexps)
m = re.search('REGEXP_FOR_ORIGINAL_URL/(\d+)/(\d+)/(\d+)', url)
if None != m:
# It is, and m will contain this information
# The three numbers are in m.group(1), (2), (3)
key = "%d.%d.%d" % ( m.group(1), m.group(2), m.group(3) )
try:
body = bodies[key]
except KeyError:
# We do not yet have this body
body = // whatever is necessary to retrieve this body
= open("%s.txt" % ( key ), 'r').read()
bodies[key] = body
flow.response.content = body

Is there a good regular expression for multiline matching of received SIP invites?

I really need python regexp which would give me this information:
Data:
Received from 1.1.1.1 18:41:51:330
(123 bytes):
INVITE: sip:dsafsdf#fsdafas.com To:
sdfasdfasdfas From: "test"
Via:
sdafsdfasdfasd
Sent from 1.1.1.1 18:42:51:330
(123 bytes):
INVITE: sip:dsafsdf#fsdafas.com
From: "test"
To:
sdfasdfasdfas Via:
sdafsdfasdfasd
Received from 1.1.1.1 18:50:51:330
(123 bytes):
INVITE: sip:dsafsdf#fsdafas.com
Via: sdafsdfasdfasd
From: "test"
To:
sdfasdfasdfas
What I need to achieve, is to find the newest INVITE that was "Received" in order to get From: header value. So searching the data backwards.
Is it possible with unique regexp ? :)
Thanks.

One-line answer, assuming you suck the entire header into a string with embedded newlines (or cr/nl's):
sorted(re.findall("Received [^\r\n]+ (\d{2}:\d{2}:\d{2}:\d{3})[^\"]+From: \"([^\r\n]+)\"", data))[-1][1]
The trick to doing it with one RE is using [^\r\n] instead of . when you want to scan over stuff. This works assuming from string always has the double quotes. The double quotes are used to keep the scanner from swallowing the entire string at the first Received... ;)

I do not think a single regular expression is the answer. I think a stateful line-by-line matcher is what you're looking for here.
import re
import collections
_msg_start_re = re.compile('^(Received|Sent)\s+from\s+(\S.*):\s*$')
_msg_field_re = re.compile('^([A-Za-z](?:(?:\w|-)+)):\s+(\S(?:.*\S)?)\s*$')
def message_parser():
hdr = None
fields = collections.defaultdict(list)
msg = None
while True:
if msg is not None:
line = (yield msg)
msg = None
hdr = None
fields = collections.defaultdict(list)
else:
line = (yield None)
if hdr is None:
hdr_match = _msg_start_re.match(line)
hdr = None if hdr_match is None else hdr_match.groups()
elif len(fields) <= 0:
field_match = _msg_field_re.match(line)
if field_match is not None:
fields[field_match.group(1)].append(field_match.group(2))
else: # Waiting for the end of the message
if line.strip() == '':
msg = (hdr, dict(fields))
else:
field_match = _msg_field_re.match(line)
fields[field_match.group(1)].append(field_match.group(2))
Example of use:
parser = msg_parser()
parser.next()
recvd_invites = [msg for msg in (parser.send(line) for line in linelst) \
if (msg is not None) and \
(msg[0][0] == 'Received') and \
('INVITE' in msg[1])]
You might be able to do this with a multiple line regex, but if you do it this way you get the message nicely parsed into its various fields. Presumably you want to do something interesting with the messages, and this will let you do a whole bunch more with them without having to use more regexps.
This also allows you to parse something other than an already existing file or a giant string with all the messages in it. For example, if you want to parse the output of a pipe that's printing out these requests as they happen you can simply do msg = parser.send(line) every time you receive a line and get a new message out as soon as its all been printed (if the line isn't the end of a message then msg will be None).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to scrape a link from a multipart email in python - python

Related

Unable to loop correctly

How to make a time restriction in outlook using python?

monitoring a text site (json) using python

script to serve from url, for requests matching regular expression

Is there a good regular expression for multiline matching of received SIP invites?

Categories

Resources