script to serve from url, for requests matching regular expression

I am a complete beginner in Python and am trying to figure out a stub for mitmproxy.
I have tried the documentation, but it assumes you already know Python, so I am at a stalemate.
I've been working with this script:
original_url = 'http://production.domain.com/1/2/3'
new_content_path = '/home/andrepadez/proj/main.js'
body = open(new_content_path, 'r').read()

def response(context, flow):
    url = flow.request.get_url()
    if url == original_url:
        flow.response.content = body
As you can predict, the proxy takes every request to 'http://production.domain.com/1/2/3' and serves the content of my file.
I need this to be more dynamic:
for every request to 'http://production.domain.com/*', I need to serve the corresponding URL, for example:
http://production.domain.com/1/4/3 -> http://develop.domain.com/1/4/3
I know I have to use a regular expression so I can capture and map it correctly, but I don't know how to serve the contents of the develop URL as "flow.response.content".
Any help will be welcome.

You would have to do something like this:
import re

# In order not to re-read the original file every time, we maintain
# a cache of already-read bodies.
bodies = {}

def response(context, flow):
    # Intercept all URLs
    url = flow.request.get_url()
    # Check if this URL is one of "ours" (check out Python regexps)
    m = re.search(r'REGEXP_FOR_ORIGINAL_URL/(\d+)/(\d+)/(\d+)', url)
    if m is not None:
        # It is, and m will contain this information.
        # The three numbers are in m.group(1), (2), (3). Groups are
        # strings, so they are formatted with %s rather than %d.
        key = "%s.%s.%s" % (m.group(1), m.group(2), m.group(3))
        try:
            body = bodies[key]
        except KeyError:
            # We do not yet have this body: do whatever is necessary
            # to retrieve it, for example read it from a file.
            body = open("%s.txt" % key, 'r').read()
            bodies[key] = body
        flow.response.content = body
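
The question asks for the body to come from the matching develop URL rather than from a file. A minimal sketch of that mapping follows, written for Python 3 with only the standard library. The host names are taken from the example in the question; `map_to_develop` and `fetch_body` are hypothetical helper names, not part of mitmproxy.

```python
import re
import urllib.request

# Hosts assumed from the example in the question.
PRODUCTION_PATTERN = re.compile(r'^http://production\.domain\.com(/.*)?$')
DEVELOP_PREFIX = 'http://develop.domain.com'


def map_to_develop(url):
    """Return the develop URL for a production URL, or None if it does not match."""
    m = PRODUCTION_PATTERN.match(url)
    if m is None:
        return None
    # group(1) is the path (and query string), or None for a bare host.
    return DEVELOP_PREFIX + (m.group(1) or '')


def fetch_body(url, timeout=10):
    """Download the raw body of a URL (hypothetical helper, plain urllib)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

Inside the `response(context, flow)` hook from the question, one could call `map_to_develop(flow.request.get_url())` and, when the result is not None, assign `fetch_body(...)` of that result to `flow.response.content`. Combining this with the `bodies` cache above would avoid one download per request.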

Related

Using a variable from a dictionary in a loop to attach to an API call

I'm calling a LinkedIn API with the code below and it does what I want.
However, when I use almost identical code inside a loop, it returns a type error:
  File "C:\Users\pchmurzynski\OneDrive - Centiq Ltd\Documents\Python\mergedreqs.py", line 54, in <module>
    auth_headers = headers(access_token)
TypeError: 'dict' object is not callable
It has a problem with this line (which, again, works fine outside of the loop):
headers = headers(access_token)
I tried changing it to
headers = headers.get(access_token)
or
headers = headers[access_token]
but it didn't help.
EDIT:
I have also tried this, with the same error:
auth_headers = headers(access_token)
What am I doing wrong? Why does the dictionary work fine outside of the loop but not inside it, and what should I do to make it work?
What I am hoping to achieve is a list, which I can save as JSON, with the share statistics for each ID from the "shids" list. That can be done with individual requests, one link per ID:
f'https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&ugcPosts=List(urn%3Ali%3AugcPost%3A{shid})'
or with a single request containing a list of IDs:
f'https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&ugcPosts=List(urn%3Ali%3AugcPost%3A{shid},urn%3Ali%3AugcPost%3A{shid2},...,urn%3Ali%3AugcPost%3A{shidx})'
Updated code, thanks to your comments:
shlink = "https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&shares=List(urn%3Ali%3Ashare%3A{})"
# loop through the list of share ids and make an api request for each of them
shares = []
token = auth(credentials)  # Authenticate the API
auth_headers = fheaders(token)  # Make the headers to attach to the API call.
for shid in shids:
    # create a request link for each share id
    r = shlink.format(shid)
    # call the api
    res = requests.get(r, headers=auth_headers)
    share_stats = res.json()
    # append the response to the shares list
    shares.append(share_stats["elements"])

works fine outside the loop
Because in the loop, you redefine the variable. I added print statements to show it:
from liapiauth import auth, headers  # one type
for ...:
    ...
    print(type(headers))
    headers = headers(access_token)  # now set to another type
    print(type(headers))
Lesson learned: don't overwrite your imports.
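
The effect can be reproduced without the LinkedIn module. In the sketch below, `make_headers` is a hypothetical stand-in for the imported `headers` function, and the token values are made up; the first pass through the loop rebinds the name to a dict, so the second pass fails.

```python
def make_headers(token):
    """Hypothetical stand-in for the imported headers() function."""
    return {"Authorization": f"Bearer {token}"}


# The name starts out bound to a function, as it is after the import.
headers = make_headers

errors = []
for token in ["token-a", "token-b"]:
    try:
        # First pass: calls the function and rebinds `headers` to a dict.
        # Second pass: tries to call that dict and raises TypeError.
        headers = headers(token)
    except TypeError as exc:
        errors.append(str(exc))

# Using a different name for the result keeps the function callable.
auth_headers = make_headers("token-a")
```

After the loop, `errors` holds the same "'dict' object is not callable" message as the traceback in the question.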
Some refactors: your auth token isn't changing, so don't put it in the loop. You can also use one method for all LinkedIn API queries.
from liapiauth import auth, headers
import requests

API_PREFIX = 'https://api.linkedin.com/v2'
SHARES_ENDPOINT_FMT = '/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&shares=List(urn%3Ali%3Ashare%3A{})'

def get_linkedin_response(endpoint, headers):
    return requests.get(API_PREFIX + endpoint, headers=headers)

def main(access_token=None):
    if access_token is None:
        raise ValueError('Access-Token not defined')
    auth_headers = headers(access_token)
    shares = []
    # shids is the list of share IDs from the question; it must be defined before main() runs
    for shid in shids:
        endpoint = SHARES_ENDPOINT_FMT.format(shid)
        resp = get_linkedin_response(endpoint, auth_headers)
        if resp.status_code // 100 == 2:
            share_stats = resp.json()
            shares.append(share_stats["elements"])
    # TODO: extract your data here, e.g.
    # idlist = [el["id"] for el in shares_list["elements"]]

if __name__ == '__main__':
    credentials = 'credentials.json'
    main(auth(credentials))

Flask loop takes long time to complete

I have this loop in my app.py. For some reason it extends the load time by over 3 seconds. Are there any solutions?
import dateutil.parser as dp

# Converts date from ISO-8601 string to formatted string and returns it
def dateConvert(date):
    return dp.parse(date).strftime("%H:%M # %e/%b/%y")

def nameFromID(userID):
    if userID is None:
        return 'Unknown'
    else:
        response = requests.get("https://example2.org/" + str(userID), headers=headers)
        return response.json()['firstName'] + ' ' + response.json()['lastName']

logs = []
response = requests.get("https://example.org", headers=headers)
for response in response.json():
    logs.append([nameFromID(response['member']), dateConvert(response['createdAt'])])

It extends the load time by over 3 seconds because it does a lot of unnecessary work:
You're not using requests Sessions. Each request will require creating and tearing down an HTTPS connection. That's slow.
You're doing another HTTPS request for each name conversion. (See above.)
You're parsing the JSON you get in that function twice.
Whatever dp.parse() is (dateutil?), it's probably doing a lot of extra work parsing from a free-form string. If you know the input format, use strptime.
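
As an illustration of the last point, here is a small sketch of parsing with a fixed format. The input format is an assumption (the question only says ISO-8601), and `parse_created_at` and `format_for_log` are hypothetical names.

```python
from datetime import datetime

# Assumed input format: ISO-8601 in UTC with a trailing "Z",
# for example "2021-03-04T05:06:07Z". Adjust to the real API output.
ASSUMED_INPUT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_created_at(value):
    """Parse a timestamp string that is known to follow ASSUMED_INPUT_FORMAT."""
    return datetime.strptime(value, ASSUMED_INPUT_FORMAT)


def format_for_log(moment):
    """Format like the question does, using %d because %e is not supported on every platform."""
    return moment.strftime("%H:%M # %d/%b/%y")


parsed = parse_created_at("2021-03-04T05:06:07Z")
label = format_for_log(parsed)
```

If the real timestamps carry fractional seconds or a numeric UTC offset, the format string needs the matching directives (`%f`, `%z`).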
Here's a rework that should be significantly faster. Please see the TODO points first, of course.
Also, if you know that the member id -> name mapping doesn't change, you can make name_cache a suitably named global variable too (but remember that it may then persist between requests).
import datetime
import requests

INPUT_DATE_FORMAT = "TODO_FILL_ME_IN"  # TODO: FILL ME IN.

def dateConvert(date: str):
    return datetime.datetime.strptime(date, INPUT_DATE_FORMAT).strftime(
        "%H:%M # %e/%b/%y"
    )

def nameFromID(sess: requests.Session, userID):
    if userID is None:
        return "Unknown"
    response = sess.get(f"https://example2.org/{userID}")
    response.raise_for_status()
    data = response.json()
    return "{firstName} {lastName}".format_map(data)

def do_thing():
    headers = {}  # TODO: fill me in
    name_cache = {}
    with requests.Session() as sess:
        sess.headers.update(headers)
        logs = []
        response = sess.get("https://example.org")
        for response in response.json():
            member_id = response["member"]
            name = name_cache.get(member_id)
            if not name:
                name = name_cache[member_id] = nameFromID(sess, member_id)
            logs.append([name, dateConvert(response["createdAt"])])
    # Hand the collected rows back to the caller (for example the Flask view).
    return logs
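
The caching idea in the rework above can be tried without any network access. In this sketch `make_cached_lookup` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the per-member HTTPS request; it records every call so the saving is visible.

```python
def make_cached_lookup(fetch):
    """Wrap a one-argument fetch function so each key is fetched at most once."""
    cache = {}

    def lookup(key):
        if key not in cache:
            cache[key] = fetch(key)
        return cache[key]

    return lookup


calls = []


def fake_fetch(member_id):
    """Stand-in for the slow per-member request; records each call."""
    calls.append(member_id)
    return f"Member {member_id}"


lookup = make_cached_lookup(fake_fetch)
names = [lookup(member_id) for member_id in [1, 2, 1, 1, 2]]
```

Five lookups trigger only two fetches here. Unlike the `if not name` check above, this version tests for key membership, so an empty name would also be cached.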

How to scrape a link from a multipart email in python

I have a program which logs on to a specified Gmail account and gets all the emails in a selected inbox that were sent from an address that you enter at runtime.
I would like to grab all the links from each email and append them to a list, so that I can filter out the ones I don't need before writing them to another file. I was using a regex to do this, which requires me to convert the payload to a string. The problem is that the regex I am using doesn't work with findall(); it only works when I use search() (I am not too familiar with regexes). Is there a better way to extract all links from an email that doesn't involve messing around with regexes?
My code currently looks like this:
print(f'[{Mail.timestamp}] Scanning inbox')
sys.stdout.write(Style.RESET)
self.search_mail_status, self.amount_matching_criteria = self.login_session.search(Mail.CHARSET,search_criteria)
if self.amount_matching_criteria == 0 or self.amount_matching_criteria == '0':
    print(f'[{Mail.timestamp}] No mails from that email address could be found...')
    Mail.enter_to_continue()
    import main
    main.main_wrapper()
else:
    pattern = '(?P<url>https?://[^\s]+)'
    prog = re.compile(pattern)
    self.amount_matching_criteria = self.amount_matching_criteria[0]
    self.amount_matching_criteria_str = str(self.amount_matching_criteria)
    num_mails = re.search(r"\d.+",self.amount_matching_criteria_str)
    num_mails = ((num_mails.group())[:-1]).split(' ')
    sys.stdout.write(Style.GREEN)
    print(f'[{Mail.timestamp}] Status code of {self.search_mail_status}')
    sys.stdout.write(Style.RESET)
    sys.stdout.write(Style.YELLOW)
    print(f'[{Mail.timestamp}] Found {len(num_mails)} emails')
    sys.stdout.write(Style.RESET)
    num_mails = self.amount_matching_criteria.split()
    for message_num in num_mails:
        individual_response_code, individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
        message = email.message_from_bytes(individual_response_data[0][1])
        if message.is_multipart():
            print('multipart')
            multipart_payload = message.get_payload()
            for sub_message in multipart_payload:
                string_payload = str(sub_message.get_payload())
                print(prog.search(string_payload).group("url"))

I ended up using this for loop with a recursive function and a regex to get the links. I then dropped every link that does not contain a substring (which you can enter earlier in the program) and added the rest to a set.
for message_num in self.amount_matching_criteria.split():
    counter += 1
    _, self.individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
    self.raw = email.message_from_bytes(self.individual_response_data[0][1])
    raw = self.raw
    self.scraped_email_value = email.message_from_bytes(Mail.scrape_email(raw))
    self.scraped_email_value = str(self.scraped_email_value)
    self.returned_links = prog.findall(self.scraped_email_value)
    for i in self.returned_links:
        if self.substring_filter in i:
            self.link_set.add(i)
    self.timestamp = time.strftime('%H:%M:%S')
    print(f'[{self.timestamp}] Links scraped: [{counter}/{len(num_mails)}]')
The function used:
def scrape_email(raw):
    if raw.is_multipart():
        return Mail.scrape_email(raw.get_payload(0))
    else:
        return raw.get_payload(None, True)
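
An alternative that visits every text part instead of only the first one is the standard library's `Message.walk()`. The sketch below is not the poster's code: `extract_links` and the URL pattern are illustrative assumptions, and the sample message is built locally so nothing is fetched from a mail server.

```python
import email
import re
from email.message import EmailMessage

# Illustrative pattern: stops at whitespace, quotes and angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_links(message):
    """Collect URLs from every text/* part of a (possibly multipart) message."""
    links = []
    for part in message.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)  # bytes, transfer-decoded
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        links.extend(URL_PATTERN.findall(text))
    return links


# Build a small multipart/alternative message to try the function on.
sample = EmailMessage()
sample.set_content("Plain link: https://example.com/a")
sample.add_alternative('<a href="https://example.com/b">b</a>', subtype="html")

parsed = email.message_from_bytes(sample.as_bytes())
links = extract_links(parsed)
```

Because the pattern has no capturing group, `findall()` returns the full matched URLs. The substring filter from the answer above could then be applied to `links` before adding them to the set.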

Python TCPServer rfile.read blocks

I am writing a simple SocketServer.TCPServer request handler (StreamRequestHandler) that will capture the request, along with the headers and the message body. This is for faking out an HTTP server that we can use for testing.
I have no trouble grabbing the request line or the headers.
If I try to grab more from the rfile than exists, the code blocks. How can I grab all of the request body without knowing its size? In other words, I don't have a Content-Size header.
Here's a snippet of what I have now:
def _read_request_line(self):
    server.request_line = self.rfile.readline().rstrip('\r\n')

def _read_headers(self):
    headers = []
    for line in self.rfile:
        line = line.rstrip('\r\n')
        if not line:
            break
        parts = line.split(':', 1)
        # name is parts[0], value is parts[1]
        header = (parts[0].strip(), parts[1].strip())
        headers.append(header)
    server.request_headers = headers

def _read_content(self):
    server.request_content = self.rfile.read()  # blocks

Keith's comment is correct. Read only as many bytes as the Content-Length header announces; here's what it looks like:
length = int(self.headers.getheader('content-length'))
data = self.rfile.read(length)
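
The same idea can be combined with the header loop from the question. The sketch below is written for Python 3 and works on any binary file-like object, so it can be tried with `io.BytesIO` instead of a live socket; `read_http_request` is a hypothetical helper name, and treating a missing Content-Length as an empty body is an assumption.

```python
import io


def read_http_request(rfile):
    """Read request line, headers, and exactly Content-Length bytes of body."""
    request_line = rfile.readline().decode("iso-8859-1").rstrip("\r\n")

    headers = []
    # readline() returns b"" at end of stream, which stops the iteration.
    for raw_line in iter(rfile.readline, b""):
        line = raw_line.decode("iso-8859-1").rstrip("\r\n")
        if not line:
            break  # blank line marks the end of the headers
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))

    # Assumption: no Content-Length header means there is no body.
    length = 0
    for name, value in headers:
        if name.lower() == "content-length":
            length = int(value)

    body = rfile.read(length) if length else b""
    return request_line, headers, body


sample = io.BytesIO(
    b"POST /test HTTP/1.1\r\n"
    b"Host: localhost\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
request_line, headers, body = read_http_request(sample)
```

Inside a `StreamRequestHandler.handle()` method, the same function could be called with `self.rfile`. Reading a bounded number of bytes is what keeps the call from waiting for the client to close the connection.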

Fetching language detection from Google api

I have a CSV with keywords in one column and the number of impressions in a second column.
I'd like to pass each keyword in a URL (while looping) and have the Google language API return which language the keyword is in.
I have it working manually. If I enter (with the correct API key):
http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&key=myapikey&q=merde
I get:
{"responseData": {"language":"fr","isReliable":false,"confidence":6.213709E-4}, "responseDetails": null, "responseStatus": 200}
which is correct: 'merde' is French.
So far I have this code, but I keep getting server unreachable errors:
import time
import csv
from operator import itemgetter
import sys
import fileinput
import urllib2
import json

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

#not working
def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    # Deserialize JSON to Python objects
    result_object = json.loads(result)
    #Get the rows in the table, then get the second column's value
    # for each row
    return row in result_object

#not working
def retrieve_terms(seedterm):
    print(seedterm)
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&key=myapikey&q=%(seed)s'
    url = url_template % {"seed": seedterm}
    try:
        with urllib2.urlopen(url) as data:
            data = perform_request(seedterm)
            result = data.read()
    except:
        sys.stderr.write('%s\n' % 'Could not request data from server')
        exit(E_OPERATION_ERROR)
    #terms = parse_result(result)
    #print terms
    print result

def main(argv):
    filename = argv[1]
    csvfile = open(filename, 'r')
    csvreader = csv.DictReader(csvfile)
    rows = []
    for row in csvreader:
        rows.append(row)
    sortedrows = sorted(rows, key=itemgetter('impressions'), reverse = True)
    keys = sortedrows[0].keys()
    for item in sortedrows:
        retrieve_terms(item['keywords'])
    try:
        outputfile = open('Output_%s.csv' % (filename),'w')
    except IOError:
        print("The file is active in another program - close it first!")
        sys.exit()
    dict_writer = csv.DictWriter(outputfile, keys, lineterminator='\n')
    dict_writer.writer.writerow(keys)
    dict_writer.writerows(sortedrows)
    outputfile.close()
    print("File is Done!! Check your folder")

if __name__ == '__main__':
    start_time = time.clock()
    main(sys.argv)
    print("\n")
    print time.clock() - start_time, "seconds for script time"
Any idea how to finish the code so that it will work? Thank you!

Try to add referrer, userip as described in the docs:
"An area to pay special attention to relates to correctly identifying yourself in your requests. Applications MUST always include a valid and accurate http referer header in their requests. In addition, we ask, but do not require, that each request contains a valid API Key. By providing a key, your application provides us with a secondary identification mechanism that is useful should we need to contact you in order to correct any problems. Read more about the usefulness of having an API key
Developers are also encouraged to make use of the userip parameter (see below) to supply the IP address of the end-user on whose behalf you are making the API request. Doing so will help distinguish this legitimate server-side traffic from traffic which doesn't come from an end-user."
Here's an example based on the answer to the question "access to google with python":
#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import urllib, urllib2
from pprint import pprint

api_key, userip = None, None
query = {'q' : 'матрёшка'}
referrer = "https://stackoverflow.com/q/4309599/4279"
if userip:
    query.update(userip=userip)
if api_key:
    query.update(key=api_key)
url = 'http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&%s' %(
    urllib.urlencode(query))
request = urllib2.Request(url, headers=dict(Referer=referrer))
json_data = json.load(urllib2.urlopen(request))
pprint(json_data['responseData'])
Output
{u'confidence': 0.070496580000000003, u'isReliable': False, u'language': u'ru'}
Another issue might be that seedterm is not properly quoted:
if isinstance(seedterm, unicode):
    value = seedterm
else:  # bytes
    value = seedterm.decode(put_encoding_here)
url = 'http://...q=%s' % urllib.quote_plus(value.encode('utf-8'))
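
For readers on Python 3, the quoting and the Referer header can be handled with `urllib.parse` and `urllib.request`. The sketch below only builds the request object and does not send it; `build_detect_request` is a hypothetical helper name, and the referrer value is a placeholder.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Endpoint as given in the question.
BASE_URL = 'http://ajax.googleapis.com/ajax/services/language/detect'


def build_detect_request(term, referrer, api_key=None, userip=None):
    """Build a GET request with a percent-encoded query and a Referer header."""
    query = {'v': '1.0', 'q': term}
    if api_key:
        query['key'] = api_key
    if userip:
        query['userip'] = userip
    # urlencode() percent-encodes non-ASCII text as UTF-8 by default.
    url = BASE_URL + '?' + urlencode(query)
    return Request(url, headers={'Referer': referrer})


request = build_detect_request('матрёшка', 'https://example.com/my-app')
# Decode the query string again to confirm the term survived the round trip.
decoded_query = parse_qs(urlsplit(request.full_url).query)
```

Passing the resulting object to `urllib.request.urlopen()` would send it. Letting `urlencode()` build the whole query string avoids quoting each parameter by hand.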
