Project structure for making API calls in Python

I am doing my first personal project and I don't know the best way to structure it. The goal is a penny-stock screener that finds stocks that gap up overnight and then filters them by market cap, premarket volume, price, public float, etc.
Right now I've written separate functions that each do their own little thing. I have a function that connects to the REST endpoint to retrieve data, one that uses that data to return a list of stocks that gap up over 50%, and then ones that filter the list of gap-ups and return a dictionary of filtered tickers (e.g. ticker name, its float, whether it satisfies the filter's condition). Then I use the values in that dictionary to build another filter (like float rotations, based on the float values I've stored in the float dictionary). Finally, I have a main function that gets the final list of stocks satisfying all the conditions.
The problem is that I don't know how to structure the project better. What I have right now is very inefficient. Take the following snippet as an example: for ticker names, I call the connect-REST function; for top gainers, I call the ticker-names function; for the float filter, I call the top-gainers function; for the float-rotation filter, I call the float filter; and finally the main function calls everything. I'm making API calls in every function. That produces far too many requests, and I'd like to store the data once after I make the requests.
I don't know whether to use nested functions, make global variables, or write the functions in separate files and import them. I'm also confused about whether I need to create classes.
def connect_REST(data):
    return data

def get_ticker_names(ticker_names, US_listed):
    data = []
    data = connect_REST(data)
    return ticker_names, US_listed

# Filter by percent change since last close to get top gainers (>= 50% gap-up pre-market)
def get_top_gainer(top_gainer_list):
    return top_gainer_list

# Condition 2: Float > 2M and < 30M
def filter_SharesFloat(floatData, backup_list2):
    top_gainer_list = []
    top_gainer_list = get_top_gainer(top_gainer_list)
    floatData = {}
    backup_list2 = []
    print(floatData)
    print("Float data not available: ", backup_list2)
    return floatData, backup_list2

def filter_float_rotation():
    top_gainer_list = []
    predicted_intra_volume = []
    predicted_intra_volume = get_predicted_intra_volume()
    top_gainer_list = get_top_gainer(top_gainer_list)
    floatData = {}
    floatData = filter_SharesFloat(floatData)
    for ticker in top_gainer_list:
        floatRotations = predicted_intra_volume / floatData[ticker]['float']
        if floatRotations < 1:
            cond_5 = True
        else:
            cond_5 = False
    return floatRotationData

def main():
    # get the ticker that satisfies conditions 1, 2, 3, 4, 5

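One way to cut the repeated requests is to call the REST endpoint once, keep the result, and have every filter read from that cached copy. Below is a minimal sketch of that idea as a class; the snapshot dict shape and the field names (gap_pct, float) are invented stand-ins for whatever the API actually returns:

```python
from dataclasses import dataclass, field

@dataclass
class Screener:
    """Holds one snapshot of market data; every filter reads from it."""
    # ticker -> {"gap_pct": ..., "float": ...} -- this shape is an assumption
    snapshot: dict = field(default_factory=dict)

    def top_gainers(self, min_gap=50.0):
        # Condition 1: >= 50% gap-up pre-market
        return [t for t, d in self.snapshot.items() if d["gap_pct"] >= min_gap]

    def filter_float(self, tickers, low=2_000_000, high=30_000_000):
        # Condition 2: float > 2M and < 30M
        return [t for t in tickers if low < self.snapshot[t]["float"] < high]

    def screen(self):
        # Chain the filters without re-fetching anything
        return self.filter_float(self.top_gainers())

# Made-up data standing in for the REST response:
data = {
    "AAA": {"gap_pct": 80.0, "float": 5_000_000},
    "BBB": {"gap_pct": 20.0, "float": 5_000_000},
    "CCC": {"gap_pct": 60.0, "float": 50_000_000},
}
print(Screener(data).screen())  # -> ['AAA']
```

The real connect_REST call would populate snapshot once (in main() or in __init__), and none of the filter methods would need to make their own requests.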
Related

Pythonic Way of Storing Multiple Set of Lists (same names - different params) to be Called by Functions

I've built a few generalized functions that loop over a set of parameters to help in a price optimisation exercise. Given that each product has a different set of cost inputs, the number of parameters for each can vary. The issue is that each product will have a different set of configs, so I need to store these somehow. To build a quick demo to test my code I just added suffixes _1, _2, _3, etc., but I'm now looking for a more structured way to build and maintain it.
import pandas as pd

# Config params - Product 1
factors_1 = ['BaseAmount', 'Factor1', 'Factor2', 'Factor3', 'Factor4', 'Factor5']
operation_1 = [None, 'x', 'x', 'x', 'x', '+']
custom_functions_1 = [None, None, None, None, 'custom_function_1(df, 0.15)', None]  # To call custom function
rounding_1 = [None, False, True, True, False, True]
rounding_decimals_1 = [None, None, 1, 0, None, 0]
operation_summary_1 = pd.DataFrame(
    list(zip(operation_1, custom_functions_1, rounding_1, rounding_decimals_1)),
    index=factors_1,
    columns=['operation', 'custom_functions', 'rounding', 'rounding_decimals'])
operation_summary_1

# Config params - Product 2
factors_2 = ['BaseAmount', 'Factor1', 'Factor6', 'Factor5', 'Factor7']
operation_2 = [None, 'x', 'x', '+', '+']
custom_functions_2 = [None, None, None, 'custom_function_2(df, 0.15)', None]  # To call custom function
rounding_2 = [None, False, True, True, True]
rounding_decimals_2 = [None, None, 0, 0, 0]
operation_summary_2 = pd.DataFrame(
    list(zip(operation_2, custom_functions_2, rounding_2, rounding_decimals_2)),
    index=factors_2,
    columns=['operation', 'custom_functions', 'rounding', 'rounding_decimals'])
operation_summary_2
What I'm looking for is a recommendation on the best way to store these lists for hundreds of products, which I would then want to load and iterate over as lists. I was thinking classes could be one good way of storing these, but I don't have much experience with them.
I'm thinking of doing something like the following, but first I'm not sure how to get it to work and, more importantly, I'm not sure it's good coding practice.
class product_1:
    def __init__(self):
        self.factors = ['BaseAmount', 'Factor1', 'Factor2', 'Factor3', 'Factor4', 'Factor5']

product_1.self.factors
You need to create a data holder, and the best fit is a dataclass: https://docs.python.org/3/library/dataclasses.html
You create one class that describes your data container, then call that class as many times as you need to create new instances; see the examples in the link above.
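A minimal sketch of what that could look like for the configs above (the field names are assumptions based on the question's lists, and the data is truncated for brevity):

```python
from dataclasses import dataclass, field

@dataclass
class ProductConfig:
    """One data container class; one instance per product."""
    name: str
    factors: list = field(default_factory=list)
    operations: list = field(default_factory=list)
    rounding: list = field(default_factory=list)

# Hundreds of these could be built from a file or database instead
products = [
    ProductConfig("product_1",
                  factors=['BaseAmount', 'Factor1', 'Factor2'],
                  operations=[None, 'x', 'x'],
                  rounding=[None, False, True]),
    ProductConfig("product_2",
                  factors=['BaseAmount', 'Factor1', 'Factor6'],
                  operations=[None, 'x', 'x'],
                  rounding=[None, False, True]),
]

for p in products:
    print(p.name, p.factors)
```

Because every product is the same type, iterating over all of them is a plain loop rather than a hunt for suffixed variable names.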

Get a dictionary from a class?

I want to:
Take a list of lists
Make a frequency table in a dictionary
Do things with the resulting dictionary
The class works, the code works, the frequency table is correct.
I want the class to return a dictionary, but what I actually get back is an instance of the class.
I can see that it has the right content in there, but I just can't get it out.
Can someone show me how to turn the output of the class to a dictionary type?
I am working with HN post data: a few columns, a few thousand rows.
freq_pph = {}
freq_cph = {}
freq_uph = {}

# Creates a binned frequency table:
# - key is bin_minutes (size of bin in minutes).
# - value is freq_value which sums/counts the number of things in that column.
class BinFreq:
    def __init__(self, dataset, bin_minutes, freq_value, dict_name):
        self.dataset = dataset
        self.bin_minutes = bin_minutes
        self.freq_value = freq_value
        self.dict_name = dict_name

    def make_table(self):
        # Sets bin size
        # Counts the number of posts in that timedelta
        if (self.bin_minutes == 60) and (self.freq_value == "None"):
            for post in self.dataset:
                hour_dt = post[-1]
                hour_str = hour_dt.strftime("%H")
                if hour_str in self.dict_name:
                    self.dict_name[hour_str] += 1
                else:
                    self.dict_name[hour_str] = 1

        # Sets bin size
        # Sums the values of a given index/column
        if (self.bin_minutes == 60) and (self.freq_value != "None"):
            for post in self.dataset:
                hour_dt = post[-1]
                hour_str = hour_dt.strftime("%H")
                if hour_str in self.dict_name:
                    self.dict_name[hour_str] += int(row[self.freq_value])
                else:
                    self.dict_name[hour_str] = int(row[self.freq_value])
Instantiate:
pph = BinFreq(ask_posts, 60, "None", freq_pph)
pph.make_table()
How can pph be turned into a real dictionary?
If you want the make_table function to return a dictionary, then you have to add a return statement at the end of it, for example: return self.dict_name.
If you then want to use it outside of the class, you have to assign it to a variable, so in the second snippet do: my_dict = pph.make_table().
Classes can't return things; functions in classes can. However, the function in your class doesn't: it just modifies self.dict_name (which is a misnomer, since it's really a reference to a dict, not a name, which one might imagine is a string), and the caller then reads that dict (or should, anyway).
In addition, there seems to be a bug: the second if block (which is never reached anyway) refers to row, an undefined name.
Anyway, your class doesn't need to be a class at all, and is easiest implemented with the built-in collections.Counter() class:
from collections import Counter

def bin_by_hour(dataset, value_key=None):
    counter = Counter()
    for post in dataset:
        hour = post[-1].hour  # assuming it's a `datetime` object
        if value_key:  # count using `post[value_key]`
            counter[hour] += post[value_key]
        else:  # just count
            counter[hour] += 1
    return dict(counter.items())  # make the Counter a regular dict

freq_pph = bin_by_hour(ask_posts)
freq_cph = bin_by_hour(ask_posts, value_key="num_comments")  # or whatever

Best search algorithm to find 'similar' strings in excel spreadsheet

I am trying to figure out the most efficient way of finding similar values to a specific cell in a specified column (not all columns) of an Excel .xlsx document. The code I have currently assumes all of the strings are unsorted. However, the file I am using, and the files I will be using, all have strings sorted from A-Z. So instead of doing a linear search, I wonder what other search algorithm I could use (binary search, etc.), as well as how to fix my code.
So far I have created a function, find(). Before it runs, the program takes a value from the user's input, which is set as the sheet name; I print out all available sheet names in the Excel doc just to help the user. I create an empty list, results[], to store, well... the results. A for loop iterates through only column A, because I only want to iterate through a custom column. A variable called start holds the current coordinate in column A (e.g. A1 or A400, depending on the iteration the loop is on), and a variable called next gets compared with start. next is technically just start + 1; since I can't add 1 to a string, I concatenate and type-cast everything so the iteration becomes a range from A1 to however many cells are in column A. My function getVal() gets called with two parameters: the coordinate of the cell and the worksheet we are working from. The value returned from getVal() is passed to my function similar(), which just calls SequenceMatcher() from difflib and returns the percentage of how similar two strings are, e.g. similar("hello", "helloo") returns roughly 90. If the strings are more than 40 percent similar, the coordinates are appended to the results[] list.
def setSheet(ws):
    sheet = wb[ws]
    return sheet

def getVal(coordinate, worksheet):
    value = worksheet[coordinate].value
    return value

def similar(first, second):
    percent = SequenceMatcher(None, first, second).ratio() * 100
    return percent

def find():
    column = "A"
    print("\n")
    print("These are all available sheets: ", wb.sheetnames)
    print("\n")
    name = input("What sheet are we working out of> ")
    results = []
    ws = setSheet(name)
    for i in range(1, ws.max_row):
        start = column + str(i)
        next = column + str(i + 1)
        if similar(getVal(start, ws), getVal(next, ws)) > 40:
            results.append(getVal(start, ws))
    return results
This is some nasty-looking code, so I do apologize in advance. The expected result should just be a list of strings that are "similar".
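For what it's worth, because the column is already sorted A-Z, comparing each value only to its immediate neighbour (as the loop above does) is already a single O(n) pass. Stripped of the openpyxl plumbing, the core logic might look like this sketch (the 40% threshold is the one from the question):

```python
from difflib import SequenceMatcher

def similar(first, second):
    # Percentage similarity of two strings, e.g. similar("hello", "helloo") ~= 91
    return SequenceMatcher(None, first, second).ratio() * 100

def find_similar(values, threshold=40):
    """Return each value whose next (sorted) neighbour is more than threshold% similar."""
    results = []
    for current, nxt in zip(values, values[1:]):
        if similar(current, nxt) > threshold:
            results.append(current)
    return results

print(find_similar(["hello", "helloo", "zebra"]))  # -> ['hello']
```

A binary search only helps when looking up one target string; for flagging every similar adjacent pair, this neighbour scan is already as cheap as it gets.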

Python recursive function variable scope [duplicate]

This question already has answers here (closed 10 years ago).
Possible Duplicate: "Least Astonishment" in Python: The Mutable Default Argument
I'm using the MailSnake in Python, which is a wrapper for the MailChimp API.
Now I'm getting some curious behaviour for a function I've written to pull lists of subscribers we have. This is the code I'm using:
from mailsnake import MailSnake
from mailsnake.exceptions import *
ms = MailSnake('key here')
def return_members(status, list_id, members=[], start=0, limit=15000, done=0):
    temp_list = ms.listMembers(status=status, id=list_id, start=start, limit=limit, since='2000-01-01 01:01:01')
    for item in temp_list['data']:  # Add latest pulled data to our list
        members.append(item)
    done = limit + done
    if done < temp_list['total']:  # Continue if we have yet to pull everything
        start = start + 1
        if limit > (temp_list['total'] - done):  # Restrict how many more results we get if we are on the penultimate page
            limit = temp_list['total'] - done
        print 'Making another API call to get complete list'
        return_members(status, list_id, members, start, limit, done)
    return members
for id in lists:
    unsubs = return_members('subscribed', id)
    for person in unsubs:
        print person['email']

print 'Finished getting information'
So this function runs recursively until we have pulled all members from a given list.
But what I've noticed is that the variable unsubs seems to just get bigger and bigger: when return_members is called with different list IDs, I get an amalgamation of the emails from every list I have called so far, rather than just the one list.
If I call return_members('subscribed', id, []), which explicitly passes a fresh list, then it's fine. But I don't see why I need to do that: when I call the function with a different list ID it's not running recursively, and since I haven't specified the members argument, it should default to [].
I think this may be a quirk of python, or I've just missed something!
The linked infamous SO question by Martijn will help you understand the underlying issue, but to get this sorted out you can change the following loop
for item in temp_list['data']:  # Add latest pulled data to our list
    members.append(item)
to the more Pythonic version
members = members + temp_list['data']  # Add latest pulled data to our list
This small change ensures that you are working with an instance different from the one passed as the parameter to return_members.
Try replacing:
def return_members(status, list_id, members=[], start=0, limit=15000, done=0):
with:
def return_members(status, list_id, members=None, start=0, limit=15000, done=0):
    if not members:
        members = []
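A quick demonstration of why this matters: a default argument is evaluated once, when the def statement runs, so a mutable default like [] is shared by every call that doesn't supply its own list:

```python
def append_to(item, bucket=[]):  # the default list is created once, at definition time
    bucket.append(item)
    return bucket

print(append_to(1))  # -> [1]
print(append_to(2))  # -> [1, 2]  -- same list as the first call!

def append_to_fixed(item, bucket=None):
    if bucket is None:  # a fresh list for every call
        bucket = []
    bucket.append(item)
    return bucket

print(append_to_fixed(1))  # -> [1]
print(append_to_fixed(2))  # -> [2]
```

This is exactly why return_members accumulated emails across different list IDs: every call without an explicit members argument appended into the same list object.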

Python data structure recommendation?

I currently have a structure that is a dict: each value is a list of numeric values. Each entry is identified by what (to borrow a SQL idiom) you could call a primary key of three values: a year, a player identifier, and a team identifier. That tuple is the key for the dict.
So you can get a unique row by passing in a value for the year, player ID, and team ID, like so:
statline = stats[(2001, 'SEA', 'suzukic01')]
Which yields something like
[305, 20, 444, 330, 45]
I'd like to alter this data structure so it can be quickly summed by any one of these three keys: you could easily slice the totals for a given index in the numeric lists by passing in ONE of year, player ID, or team ID, plus the index. I want to be able to do something like
hr_total = stats[year=2001, idx=3]
Where that idx of 3 corresponds to the third column in the numeric list(s) that would be retrieved.
Any ideas?
Read up on Data Warehousing. Any book.
Read up on Star Schema Design. Any book. Seriously.
You have several dimensions: Year, Player, Team.
You have one fact: score
You want a structure like the following. First, create a set of dimension indexes:
years = collections.defaultdict(list)
players = collections.defaultdict(list)
teams = collections.defaultdict(list)
Your fact table can be a collections.namedtuple, or a simple class like this:
class ScoreFact(object):
    def __init__(self, year, player, team, score):
        self.year = year
        self.player = player
        self.team = team
        self.score = score
        years[self.year].append(self)
        players[self.player].append(self)
        teams[self.team].append(self)
Now you can find all items in a given dimension value. It's a simple list attached to a dimension value.
years['2001'] is all scores for the given year.
players['suzukic01'] is all scores for the given player.
etc. You can simply use sum() to add them up. A multi-dimensional query looks something like this:
[x for x in players['suzukic01'] if x.year == '2001']
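Putting those pieces together, a runnable sketch (the score values here are made-up examples, not real stats):

```python
import collections

years = collections.defaultdict(list)
players = collections.defaultdict(list)
teams = collections.defaultdict(list)

class ScoreFact(object):
    def __init__(self, year, player, team, score):
        self.year = year
        self.player = player
        self.team = team
        self.score = score
        # Register this fact under every dimension index
        years[year].append(self)
        players[player].append(self)
        teams[team].append(self)

ScoreFact('2001', 'suzukic01', 'SEA', 242)
ScoreFact('2001', 'boonebr01', 'SEA', 206)
ScoreFact('2002', 'suzukic01', 'SEA', 208)

print(sum(x.score for x in years['2001']))                     # -> 448
print(sum(x.score for x in players['suzukic01']))              # -> 450
print(sum(x.score for x in teams['SEA'] if x.year == '2001'))  # -> 448
```

Each fact registers itself in all three dimension dicts, so a single-dimension sum is just a list traversal, and a multi-dimensional query is a filter on top of that.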
Put your data into SQLite, and use its relational engine to do the work. You can create an in-memory database and not even have to touch the disk.
The syntax stats[year=2001, idx=3] is invalid Python and there is no way you can make it work with those square brackets and "keyword arguments"; you'll need to have a function or method call in order to accept keyword arguments.
So, say we make it a function, to be called like wells(stats, year=2001, idx=3). I imagine the idx argument is mandatory (which is peculiar given that call syntax, but you give no indication of what omitting idx could possibly mean), and exactly one of year, playerid, and teamid must be given.
With your current data structure, wells can already be implemented:
def wells(stats, year=None, playerid=None, teamid=None, idx=None):
    if idx is None:
        raise ValueError('idx must be specified')
    specifiers = [(i, x) for i, x in enumerate((year, playerid, teamid)) if x is not None]
    if len(specifiers) != 1:
        raise ValueError('Exactly one of year, playerid, teamid must be given')
    ikey, keyv = specifiers[0]
    return sum(v[idx] for k, v in stats.iteritems() if k[ikey] == keyv)
Of course, this is O(N) in the size of stats: it must examine every entry. Please measure correctness and performance with this simple implementation as a baseline. An alternative solution (much speedier in use, but requiring much more preparation time) is to keep three dicts of lists (one each for year, playerid, and teamid) alongside stats, with each entry indicating (or copying, though I think indicating by full key should suffice) all entries of stats that match that ikey/keyv pair. But it's not clear at this time whether that implementation might be premature, so please try the simple-minded idea first!-)
def getSum(d, year, idx):
    total = 0  # avoid shadowing the built-in sum()
    for key in d.keys():
        if key[0] == year:
            total += d[key][idx]
    return total
This should get you started. I have made the assumption in this code that ONLY year will be asked for, but it should be easy enough to adapt it to check for the other parameters as well.
Cheers
