How to make rp and xrange dynamic? - python

Hey guys, many thanks for taking the time to look at my problem. I have been working on this code for about 1 week (I am new to coding, and to Python, for about 1 week as well). Currently the loop only works when the x in xrange(x) and the 'rp' : 'x' parameter match the number of rows actually available from the XML. The XML updates throughout the day, so I was wondering if anyone can offer a solution to make x dynamic?
import mechanize
import urllib
import json
import re
from sched import scheduler
from time import time, sleep

s = scheduler(time, sleep)

def run_periodically(start, end, interval, func):
    event_time = start
    while event_time < end:
        s.enterabs(event_time, 0, func, ())
        event_time += interval
    s.run()

def getData():
    post_url = "urlofinterest_xml"
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
    # These are the parameters you've got from checking with the aforementioned tools
    parameters = {'page': '1',
                  'rp': '8',
                  'sortname': 'roi',
                  'sortorder': 'desc'}
    # Encode the parameters
    data = urllib.urlencode(parameters)
    trans_array = browser.open(post_url, data).read().decode('UTF-8')
    xmlload1 = json.loads(trans_array)
    pattern1 = re.compile('> (.*)<')
    pattern2 = re.compile('/control/profile/view/(.*)\' title=')
    pattern3 = re.compile('<span style=\'font-size:12px;\'>(.*)</span>')
    pattern4 = re.compile('title=\'Naps posted: (.*) Winners:')
    pattern5 = re.compile('Winners: (.*)\'><img src=')
    for i in xrange(8):
        user_delimiter = xmlload1['rows'][i]['cell']['username']
        selection_delimiter = xmlload1['rows'][i]['cell']['race_horse']
        username_delimiter_results = re.findall(pattern1, user_delimiter)[0]
        userid_delimiter_results = int(re.findall(pattern2, user_delimiter)[0])
        user_selection = re.findall(pattern3, selection_delimiter)[0]
        user_numberofselections = float(re.findall(pattern4, user_delimiter)[0])
        user_numberofwinners = float(re.findall(pattern5, user_delimiter)[0])
        strikeratecalc1 = user_numberofwinners / user_numberofselections
        strikeratecalc2 = strikeratecalc1 * 100
        print "user id = ", userid_delimiter_results
        print "username = ", username_delimiter_results
        print "user selection = ", user_selection
        print "best price available as decimal = ", xmlload1['rows'][i]['cell']['tws.best_price']
        print "race time = ", xmlload1['rows'][i]['cell']['race_time']
        print "race meeting = ", xmlload1['rows'][i]['cell']['race_meeting']
        print "ROI = ", xmlload1['rows'][i]['cell']['roi']
        print "number of selections = ", user_numberofselections
        print "number of winners = ", user_numberofwinners
        print "Strike rate = ", strikeratecalc2, "%"
        print ""

getData()
run_periodically(time() + 5, time() + 1000000, 15, getData)
Kind regards AEA

First, I'm going to talk about how you iterate over your results. Based on your code, xmlload1['rows'] is a list of dicts, so instead of choosing an arbitrary number of iterations, you can iterate over it directly. To make this clear, I'm going to set up some arbitrary sample data:
xmlload1 = {
    "rows": [{"cell": {"username": "one", "race_horse": "b"}},
             {"cell": {"username": "two", "race_horse": "c"}}]
}
So, given the data above, you can just iterate through rows in a for loop, like this:
for row in xmlload1['rows']:
    cell = row["cell"]
    print cell["username"]
    print cell["race_horse"]
On each iteration, row takes on the value of the next element in the iterable (the list in xmlload1['rows']), and cell is simply that row's "cell" dict. This works with any container or sequence that supports iteration (lists, tuples, dicts, generators, etc.).
Notice how I haven't used any magic numbers anywhere, so xmlload1['rows'] could be arbitrarily long and it would still work.
You can set the requests to be dynamic by using a function, like this:
def get_data(rp=8, page=1):
    parameters = {'page': str(page),
                  'rp': str(rp),
                  'sortname': 'roi',
                  'sortorder': 'desc'}
    data = urllib.urlencode(parameters)
    trans_array = browser.open(post_url, data).read().decode('UTF-8')
    return json.loads(trans_array)
Now, you can call get_data(rp=5) to get 5 rows, get_data(rp=8) to get 8 rows [and get_data(rp=8, page=3) to get the third page], etc. You can of course add further parameters, or even pass the whole parameters dict into the function directly.
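Putting the two pieces together (just a sketch, and it assumes the same post_url and browser globals from your original script), the loop no longer cares how many rows actually came back:
def print_rows(rp=8, page=1):
    # fetch however many rows the server returns for this request
    xmlload1 = get_data(rp=rp, page=page)
    # iterate over whatever is there instead of a hard-coded count
    for row in xmlload1['rows']:
        cell = row['cell']
        print cell['username']
        print cell['race_horse']

print_rows(rp=20)  # ask for 20 rows; the loop adapts automatically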

I'm not sure I understand your question, but I think what you want is this:
rows = xmlload1['rows']
for row in rows:
    user_delimiter = row['cell']['username']
    selection_delimiter = row['cell']['race_horse']
    # ...
If you need the row index as well as the row itself, use enumerate:
rows = xmlload1['rows']
for i, row in enumerate(rows):
    user_delimiter = row['cell']['username']
    selection_delimiter = row['cell']['race_horse']
    # ...
In general, if you're doing for i in range(…) for any purpose other than a fixed number of iterations, you're doing it wrong. There's usually a collection you want to iterate over; just find it and iterate over it.

Related

Writing to databases on Python

I have been trying to write to a database and am having trouble setting data using two different classes in the one function.
Firstly, all values are being passed through a GUI, and in this case only the following entries were passed: 'Categories' = C, 'Usage' = Charter, 'DispTHR' = 5000.
Here you can see that I have two classes I want to access (airport and runway), where the function set_opsdatabase_details() goes to the appropriate class and writes to our database. This is all well and good when the airport and runway are handled separately from each other; however, when integrating them in the same function I can't seem to get the corresponding airportvalueDict = {} and runwayvalueDict = {} to hold the values I want. Could someone help me understand how to write the correct entry box values into the corresponding valueDict dictionaries?
Thank you in advance! (A screenshot of the output from the print statements is attached.)
[Screenshot: the edit function in Python]
[Screenshot: output of the function with the print statements]
Example of code in text format:
# edit function for first part of operational window
def edit_details1(self):
    airport = self.manager.get_chosen_airport()
    runway = airport.get_chosen_runway()
    airportlabels = ['Category', 'Usage', 'Curfew',
                     'Surveyor', 'LatestSurvey',
                     'FuelAvail', 'FuelPrice', 'TankerPort', 'RFF']
    runwaylabels = ['DispTHR', 'VASI', 'TCH']
    airportvalueDict = {}
    runwayvalueDict = {}
    print(self.entryDict['Category']["state"])
    if self.entryDict['Category']["state"] == 'disabled':
        for entry in self.entryDict.values():
            entry.config(state=tk.NORMAL)
        self.editButton.config(text="Confirm")
    elif self.entryDict['Category']['state'] == 'normal':
        airport = self.manager.get_chosen_airport()
        runway = airport.get_chosen_runway()
        values = [x.get() for x in self.varDict.values()]
        for label, value in zip(airportlabels, values):
            airportvalueDict[label] = value
        airport.set_opsdatabase_details(airportvalueDict)
        print(f'airport: {airportvalueDict}')
        for label, value in zip(runwaylabels, values):
            runwayvalueDict[label] = value
        runway.set_opsdatabase_details(runwayvalueDict)
        print(f'runway: {runwayvalueDict}')
        for entry in self.entryDict.values():
            entry.config(state=tk.DISABLED)
        self.editButton.config(text="Edit...")
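For what it's worth, one possible way to keep the two dictionaries straight (a sketch only, and it assumes self.varDict is keyed by the same label strings used in airportlabels and runwaylabels, which the question does not show) is to look each value up by its label instead of zipping both label lists against the same flat values list:
# hypothetical sketch: assumes self.varDict maps label -> tk variable,
# e.g. self.varDict['Category'], self.varDict['DispTHR'], ...
airportvalueDict = {label: self.varDict[label].get() for label in airportlabels}
runwayvalueDict = {label: self.varDict[label].get() for label in runwaylabels}

airport.set_opsdatabase_details(airportvalueDict)
runway.set_opsdatabase_details(runwayvalueDict)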

How to optimize retrieval of 10 most frequent words inside a json data object?

I'm looking for ways to make the code more efficient (runtime and memory complexity)
Should I use something like a Max-Heap?
Is the bad performance due to the string concatenation or sorting the dictionary not in-place or something else?
Edit: I replaced the dictionary/map object with a Counter applied to a list of all the retrieved names (with duplicates).
Minimal requirement: the script should take less than 30 seconds.
Current runtime: 54 seconds.
# Try to implement the program efficiently (running the script should take less than 30 seconds)
import requests
# Requests is an elegant and simple HTTP library for Python, built for human beings.
# Requests is the only Non-GMO HTTP library for Python, safe for human consumption.
# Requests is not a built in module (does not come with the default python installation), so you will have to install it:
# http://docs.python-requests.org/en/v2.9.1/
# installing it for pyCharm is not so easy and takes a lot of troubleshooting (problems with pip's main version)
# use conda/pip install requests instead
import json
# dict subclass for counting hashable objects
from collections import Counter
#import heapq
import datetime
url = 'https://api.namefake.com'
# a "global" list object. TODO: try to make it "static" (local to the file)
words = []
#####################################################################################
# Calls the site http://www.namefake.com 100 times and retrieves random names
# Examples for the format of the names from this site:
# Dr. Willis Lang IV
# Lily Purdy Jr.
# Dameon Bogisich
# Ms. Zora Padberg V
# Luther Krajcik Sr.
# Prof. Helmer Schaden etc....
#####################################################################################
requests.packages.urllib3.disable_warnings()
t = datetime.datetime.now()
for x in range(100):
    # for each name, break it to first and last name
    # no need for authentication
    # http://docs.python-requests.org/en/v2.3.0/user/quickstart/#make-a-request
    responseObj = requests.get(url, verify=False)
    # Decoding JSON data from returned response object text
    # Deserialize ``s`` (a ``str``, ``bytes`` or ``bytearray`` instance
    # containing a JSON document) to a Python object.
    jsonData = json.loads(responseObj.text)
    x = jsonData['name']
    newName = ""
    for full_name in x:
        # make a string from the decoded python object concatenation
        newName += str(full_name)
    # split by whitespaces
    y = newName.split()
    # parse the first name (check first if a header exists (Prof., Dr., Mr., Miss))
    if "." in y[0] or "Miss" in y[0]:
        words.append(y[2])
    else:
        words.append(y[0])
    words.append(y[1])
# Return the top 10 words that appear most frequently, together with the number of times, each word appeared.
# Output example: ['Weber', 'Kris', 'Wyman', 'Rice', 'Quigley', 'Goodwin', 'Lebsack', 'Feeney', 'West', 'Marlen']
# (We don't care whether the word was a first or a last name)
# list of tuples
top_ten = Counter(words).most_common(10)
top_names_list = [name[0] for name in top_ten]
print((datetime.datetime.now()-t).total_seconds())
print(top_names_list)
You are calling an endpoint of an API that generates dummy information one person at a time - that takes a considerable amount of time.
The rest of the code takes almost no time.
Change the endpoint you are using (there is no bulk name-gathering on the one you use) or use built-in dummy data provided by Python modules.
You can clearly see that "counting and processing names" is not the bottleneck here:
from faker import Faker # python module that generates dummy data
from collections import Counter
import datetime
fake = Faker()
c = Counter()
# get 10.000 names, split them and add 1st part
t = datetime.datetime.now()
c.update( (fake.name().split()[0] for _ in range(10000)) )
print(c.most_common(10))
print((datetime.datetime.now()-t).total_seconds())
Output for 10000 names:
[('Michael', 222), ('David', 160), ('James', 140), ('Jennifer', 134),
('Christopher', 125), ('Robert', 124), ('John', 120), ('William', 111),
('Matthew', 111), ('Lisa', 101)]
in
1.886564 # seconds
General advice for code optimization: measure first, then optimize the bottlenecks.
If you need a code review, check https://codereview.stackexchange.com/help/on-topic and see if your code fits the requirements of the Code Review Stack Exchange site. As on SO, some effort should be put into the question first - i.e. analysing where the majority of your time is being spent.
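One way to do that measuring (a generic sketch, not specific to this script) is the standard-library cProfile module, which reports how much cumulative time each function took; it assumes the code has been wrapped in a main() function:
import cProfile
import pstats

# profile a call to main() and print the 10 most expensive entries by cumulative time
cProfile.run('main()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)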
Edit - with performance measurements:
import requests
import json
from collections import defaultdict
import datetime
# defaultdict is (in this case) better than Counter because you add 1 name at a time
# Counter is superior if you update whole iterables of names at a time
d = defaultdict(int)

def insertToDict(n):
    d[n] += 1

url = 'https://api.namefake.com'
api_times = []
process_times = []
requests.packages.urllib3.disable_warnings()
for x in range(10):
    # for each name, break it to first and last name
    try:
        t = datetime.datetime.now()  # start time for API call
        # no need for authentication
        responseObj = requests.get(url, verify=False)
        jsonData = json.loads(responseObj.text)
        # end time for API call
        api_times.append((datetime.datetime.now() - t).total_seconds())
        x = jsonData['name']
        t = datetime.datetime.now()  # start time for name processing
        newName = ""
        for name_char in x:
            # make a string from the decoded python object concatenation
            newName = newName + str(name_char)
        # split by whitespaces
        y = newName.split()
        # parse the first name (check first if a header exists (Prof., Dr., Mr., Miss))
        if "." in y[0] or "Miss" in y[0]:
            insertToDict(y[2])
        else:
            insertToDict(y[0])
        insertToDict(y[1])
        # end time for name processing
        process_times.append((datetime.datetime.now() - t).total_seconds())
    except:
        continue

newA = sorted(d, key=d.get, reverse=True)[:10]
print(newA)
print(sum(api_times))
print(sum(process_times))
Output:
['Ruecker', 'Clare', 'Darryl', 'Edgardo', 'Konopelski', 'Nettie', 'Price',
'Isobel', 'Bashirian', 'Ben']
6.533625
0.000206
You can make the parsing part better - I did not, because it does not matter for the overall runtime.
It is better to use timeit for performance testing (it calls the code multiple times and averages, smoothing out artifacts due to caching/lag/...) (thx #bruno desthuilliers) - in this case I did not use timeit because I did not want to call the API 100000 times to average the results.
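As a rough illustration of what that would look like for the local parsing step only (a sketch, timing a fixed sample string rather than the API call):
import timeit

# time just the split-and-pick step, repeated 100000 times
setup = "name = 'Dr. Willis Lang IV'"
stmt = "parts = name.split(); first = parts[1] if '.' in parts[0] else parts[0]"
print(timeit.timeit(stmt, setup=setup, number=100000))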

Make biopython Entrez.esearch loop through parameters

I'm trying to adapt a script (found here: https://gist.github.com/bonzanini/5a4c39e4c02502a8451d) to search and retrieve data from PubMed.
Here's what I have so far:
#!/usr/bin/env python
from Bio import Entrez
import datetime
import json

# Create dictionary of journals (the official abbreviations are not used here...)
GroupA = ["Nature", "Science", "PNAS", "JACS"]
GroupB = ["E-life", "Mol Cell", "Plos Computational", "Nature communication", "Cell"]
GroupC = ["Nature Biotech", "Nature Chem Bio", "Nature Str Bio", "Nature Methods"]
Journals = {}
for item in GroupA:
    Journals.setdefault("A", []).append(item)
for item in GroupB:
    Journals.setdefault("B", []).append(item)
for item in GroupC:
    Journals.setdefault("C", []).append(item)

# Set dates for search
today = datetime.datetime.today()
numdays = 15
dateList = []
for x in range(0, numdays):
    dateList.append(today - datetime.timedelta(days=x))
dateList[1:numdays-1] = []
today = dateList[0].strftime("%Y/%m/%d")
lastdate = dateList[1].strftime("%Y/%m/%d")
print 'Retrieving data from %s to %s' % (lastdate, today)

for value in Journals['A']:
    Entrez.email = "email"
    handle = Entrez.esearch(db="pubmed", term="gpcr[TI] AND value[TA]",
                            sort="pubdate", retmax="10", retmode="xml", datetype="pdat",
                            mindate=lastdate, maxdate=today)
    record = Entrez.read(handle)
    print(record["IdList"])
I would like to use each "value" of the for loop (in this case journal titles) as a parameter of the Entrez.esearch function. There is no built-in parameter for this, so it has to go inside the term parameter, but it doesn't work as shown.
Once I have an ID list, I will then use Entrez.fetch to retrieve and print the data that I want, but that's another question...
I hope this is clear enough, first question for me! Thanks!
If I understand you correctly, I think this is what you are looking for:
term="gpcr[TI] AND {}[TA]".format(value)
Using this, the term in each iteration will be:
"gpcr[TI] AND Nature[TA]"
"gpcr[TI] AND Science[TA]"
"gpcr[TI] AND PNAS[TA]"
"gpcr[TI] AND JACS[TA]"

Optimizing Algorithm of large dataset calculations

Once again I find myself stumped with pandas, and how best to perform a 'vector operation'. My code works; however, it takes a long time to iterate through everything.
What the code is trying to do is loop through shapes.csv and determine which shape_pt_sequence is a stop_id, then assign the stop_lat and stop_lon to shape_pt_lat and shape_pt_lon, while also marking that shape_pt_sequence as is_stop.
GISTS
stop_times.csv LINK
trips.csv LINK
shapes.csv LINK
Here is my code:
import pandas as pd
from haversine import *

'''
iterate through shapes and match stops along a shape_pt_sequence within
x amount of distance. for the shape_pt_sequence that is closest, replace the stop
lat/lon with the shape_pt_lat/shape_pt_lon, and mark the is_stop column with 1.
'''

# readability assignments for shapes.csv
shapes = pd.read_csv('csv/shapes.csv')
shapes_index = list(set(shapes['shape_id']))
shapes_index.sort(key=int)
shapes.set_index(['shape_id', 'shape_pt_sequence'], inplace=True)

# readability assignments for trips.csv
trips = pd.read_csv('csv/trips.csv')
trips_index = list(set(trips['trip_id']))
trips.set_index(['trip_id'], inplace=True)

# readability assignments for stop_times.csv
stop_times = pd.read_csv('csv/stop_times.csv')
stop_times.set_index(['trip_id', 'stop_sequence'], inplace=True)
print(len(stop_times.loc[1423492]))

# readability assignments for stops.csv
stops = pd.read_csv('csv/stops.csv')
stops.set_index(['stop_id'], inplace=True)

# for each trip_id
for i in trips_index:
    print('******NEW TRIP_ID******')
    print(i)
    i = i.astype(int)
    # for each stop_sequence in stop_times
    for x in range(len(stop_times.loc[i])):
        stop_lat = stop_times.loc[i, ['stop_lat', 'stop_lon']].iloc[x, [0, 1]][0]
        stop_lon = stop_times.loc[i, ['stop_lat', 'stop_lon']].iloc[x, [0, 1]][1]
        stop_coordinate = (stop_lat, stop_lon)
        print(stop_coordinate)
        # shape_id that matches trip_id
        print('**SHAPE_ID**')
        trips_shape_id = trips.loc[i, ['shape_id']].iloc[0]
        trips_shape_id = int(trips_shape_id)
        print(trips_shape_id)
        smallest = 0
        for y in range(len(shapes.loc[trips_shape_id])):
            shape_lat = shapes.loc[trips_shape_id].iloc[y, [0, 1]][0]
            shape_lon = shapes.loc[trips_shape_id].iloc[y, [0, 1]][1]
            shape_coordinate = (shape_lat, shape_lon)
            haversined = haversine_mi(stop_coordinate, shape_coordinate)
            if smallest == 0 or haversined < smallest:
                smallest = haversined
                smallest_shape_pt_indexer = y
            else:
                pass
            print(haversined)
            print('{0:.20f}'.format(smallest))
        print('{0:.20f}'.format(smallest))
        print(smallest_shape_pt_indexer)
        # mark is_stop as 1
        shapes.iloc[smallest_shape_pt_indexer, [2]] = 1
        # replace coordinate value
        shapes.loc[trips_shape_id].iloc[y, [0, 1]][0] = stop_lat
        shapes.loc[trips_shape_id].iloc[y, [0, 1]][1] = stop_lon
shapes.to_csv('csv/shapes.csv', index=False)
What you could do to optimise this code is use some threads/workers instead of those for loops.
I recommend using a Pool of workers, as it is very simple to use.
Instead of:
for i in trips_index:
you could use something like:
from multiprocessing import Pool

pool = Pool(processes=4)
results = pool.map(func, trips_index)
And then the method func would look like:
def func(i):
    # the body of the original for loop goes here
You can simply put the whole for-loop body inside this method. That would make it run with 4 worker processes in this example, giving a nice improvement.
One thing to consider is that a collection of trips will often have the same sequence of stops and the same shape data (the only difference between trips is the timing). So it might make sense to cache the find-closest-point-on-shape operation for (stop_id, shape_id). I bet that would reduce your runtime by an order-of-magnitude.
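A minimal sketch of that caching idea (the helper name and body are placeholders for the inner haversine loop from the question; functools.lru_cache simply memoises the result per (stop_id, shape_id) pair):
from functools import lru_cache

@lru_cache(maxsize=None)
def closest_shape_pt_index(stop_id, shape_id):
    # placeholder: scan shapes.loc[shape_id] with haversine_mi and return
    # the index of the shape point nearest to this stop
    ...

# inside the trip/stop loops, call closest_shape_pt_index(stop_id, trips_shape_id);
# repeated (stop_id, shape_id) pairs are answered from the cache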

How do I deal with list of lists correctly in python? Specifically with a list of functions

I am trying to grep some results pages for work, and then eventually print them out to an HTML page so someone does not have to manually look through each section.
How I would eventually use it: I feed this function a results page, it greps through the 5 different sections, and then I can do an HTML output (that's what that print-substitute area is for) with all the different results.
OK, MASSIVE EDIT: I actually removed the old code because I was asking too many questions. I fixed my code taking some suggestions, but I am still interested in the advantage of using a human-readable dict instead of just a list. Here is my working code that gets all the right results into a 'list of lists'; I then output the first section in my eventual HTML block.
import urllib
import re
import string
import sys

def ipv6_results(input_page):
    sections = ['/spec.p2/summary.html', '/nd.p2/summary.html',
                '/addr.p2/summary.html', '/pmtu.p2/summary.html',
                '/icmp.p2/summary.html']
    variables_output = []
    for s in sections:
        temp_list = []
        page = input_page + s
        #print page
        url_reference = urllib.urlopen(page)
        html_page = url_reference.read()
        m = re.search(r'TOTAL</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        temp_list.append(int(m.group(1)))
        m = re.search(r'PASS</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        temp_list.append(int(m.group(1)))
        m = re.search(r'FAIL</FONT></B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        temp_list.append(int(m.group(1)))
        variables_output.append(temp_list)
    # print variables to check them :)
    print "------"
    print variables_output
    print "Ready Logo Phase 2"
    print "Section | Total | Pass | Fail |"
    # this next part is eventually going to output an html block
    output = string.Template("""
    1 - RFC2460-IPv6 Specs $spec_total $spec_pass $spec_fail
    """)
    print output.substitute(spec_total=variables_output[0][0],
                            spec_pass=variables_output[0][1],
                            spec_fail=variables_output[0][2])
    return 1
imagine the tabbing is correct :( I wish this was more like paste bin, suggestions welcome on pasting code in here
Generally, you don't declare the shape of the list first and then fill in the values. Instead, you build the list as you discover the values.
Your variables has a lot of structure. You've got inner lists of 3 elements, always in the order 'total', 'pass', 'fail'. Perhaps these 3-tuples should be made namedtuples. That way, you can access the three parts with humanly recognizable names (data.total, data.passed, data.failed) instead of cryptic index numbers (data[0], data[1], data[2]). (Note that pass is a Python keyword, so it can't be used as a field name; the fields below are called total, passed and failed.)
Next, your 3-tuples differ by prefixes: 'spec', 'nd', 'addr', etc.
These sound like keys to a dict rather than elements of a list.
So perhaps consider making variables a dict. That way, you can access the particular 3-tuple you want with the humanly recognizable variables['nd'] instead of variables[1]. And you can access the nd fail count with variables['nd'].failed instead of variables[1][2]:
import collections

# define the namedtuple class Point (used below).
# 'pass' is a reserved word, so the fields are named total/passed/failed.
Point = collections.namedtuple('Point', 'total passed failed')

# Notice we declare `variables` empty at first; we'll fill in the values later.
variables = {}
keys = ('spec', 'nd', 'addr', 'pmtu', 'icmp')
for key, s in zip(keys, sections):
    page = input_page + s
    url_reference = urllib.urlopen(page)
    html_page = url_reference.read()
    m = re.search(r'TOTAL</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    ntotal = int(m.group(1))
    m = re.search(r'PASS</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    npass = int(m.group(1))
    m = re.search(r'FAIL</FONT></B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    nfail = int(m.group(1))
    # We create an instance of the namedtuple on the right-hand side
    # and store the value in `variables[key]`, thus building the
    # variables dict incrementally.
    variables[key] = Point(ntotal, npass, nfail)
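With that in place, a lookup reads naturally, for example:
# the figures for the 'nd' section
print variables['nd'].total, variables['nd'].passed, variables['nd'].failed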
The first thing is that those lists will only hold the values of the variables at the time of assignment; you would be changing the list values, not the variables.
I would seriously consider using classes and building structures of those, including lists of class instances.
For example (note that pass is a Python keyword, so the attributes are named passed/failed instead):
class SectionResult:
    def __init__(self, total=0, passed=0, failed=0):
        self.total = total
        self.passed = passed
        self.failed = failed
Since it looks like each group should link up with a section, you can create a list of dictionaries (or perhaps a list of classes?) with the bits associated with a section:
sections = [{'results': SectionResult(), 'filename': '/addr.p2/summary.html'}, ....]
Then in the loop:
for section in sections:
    page = input_page + section['filename']
    url_reference = urllib.urlopen(page)
    html_page = url_reference.read()
    m = re.search(r'TOTAL</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    section['results'].total = int(m.group(1))
    m = re.search(r'PASS</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    section['results'].passed = int(m.group(1))
    m = re.search(r'FAIL</FONT></B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    section['results'].failed = int(m.group(1))
I would use a dictionary inside a list. Maybe something like:
def ipv6_results(input_page):
    results = [{'file_name': '/spec.p2/summary.html', 'total': 0, 'pass': 0, 'fail': 0},
               {'file_name': '/nd.p2/summary.html', 'total': 0, 'pass': 0, 'fail': 0},
               {'file_name': '/addr.p2/summary.html', 'total': 0, 'pass': 0, 'fail': 0},
               {'file_name': '/pmtu.p2/summary.html', 'total': 0, 'pass': 0, 'fail': 0},
               {'file_name': '/icmp.p2/summary.html', 'total': 0, 'pass': 0, 'fail': 0}]
    for r in results:
        url_reference = urllib.urlopen(input_page + r['file_name'])
        html_page = url_reference.read()
        m = re.search(r'TOTAL</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        r['total'] = int(m.group(1))
        m = re.search(r'PASS</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        r['pass'] = int(m.group(1))
        m = re.search(r'FAIL</FONT></B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        r['fail'] = int(m.group(1))
    for r in results:
        print r['total']
        print r['pass']
        print r['fail']
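Building on that structure, the summary block the question is working towards could then come from one loop over results (a sketch; the row format is illustrative):
# one row per section in the "Section | Total | Pass | Fail |" layout
for idx, r in enumerate(results, start=1):
    print "%d - %s | %d | %d | %d |" % (idx, r['file_name'], r['total'], r['pass'], r['fail'])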
