Creating multiple dataframes from a stored procedure - python

I'm working with a stored procedure to which I pass a start and end date, and it returns data. I'm passing it ten different dates and making ten calls to it, see below:
match1 = sp_data(startDate = listOfDates[0], endDate=listOfDates[0])
match2 = sp_data(startDate = listOfDates[1], endDate=listOfDates[1])
match3 = sp_data(startDate = listOfDates[2], endDate=listOfDates[2])
match4 = sp_data(startDate = listOfDates[3], endDate=listOfDates[3])
match5 = sp_data(startDate = listOfDates[4], endDate=listOfDates[4])
match6 = sp_data(startDate = listOfDates[5], endDate=listOfDates[5])
match7 = sp_data(startDate = listOfDates[6], endDate=listOfDates[6])
match8 = sp_data(startDate = listOfDates[7], endDate=listOfDates[7])
match9 = sp_data(startDate = listOfDates[8], endDate=listOfDates[8])
match10 = sp_data(startDate = listOfDates[9], endDate=listOfDates[9])
See listOfDates pandas series below:
print(listOfDates)
0 20220524
1 20220613
2 20220705
3 20220713
4 20220720
5 20220805
6 20220903
7 20220907
8 20220928
9 20221024
Name: TradeDate, dtype: object
Is there a better and more efficient way of doing this? Potentially in a loop of some kind?
Any help greatly appreciated, thanks!

You could use a list comprehension to make a list of matches:
matches = [sp_data(startDate=trade_date, endDate=trade_date) for trade_date in listOfDates]
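If you also want to know which date each result belongs to, a dict comprehension keyed by date may be handier than a plain list. The sketch below is only an illustration: sp_data here is a hypothetical stand-in that returns a one-row DataFrame (the real function calls the stored procedure), and the series holds just three of the dates.

```python
import pandas as pd


def sp_data(startDate, endDate):
    # Hypothetical stand-in for the real stored-procedure wrapper.
    return pd.DataFrame({"startDate": [startDate], "endDate": [endDate]})


listOfDates = pd.Series(["20220524", "20220613", "20220705"], name="TradeDate")

# One DataFrame per date, keyed by the date itself.
matches = {d: sp_data(startDate=d, endDate=d) for d in listOfDates}

# Optionally stack all results into a single DataFrame.
combined = pd.concat(list(matches.values()), ignore_index=True)
```

With this layout, matches["20220613"] replaces the separate match2 variable, and combined holds every row in one frame if that is easier to work with.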

Why "NameError: name 'product_id_list' is not defined"?

I wrote this and I don't know why product_id_list is not defined when I defined it about four lines earlier.
Any suggestions? I think the indentation is all right, so I don't have any more ideas. I also searched around without luck.
Thank you!!
def make_dataSet_rowWise(reorder_product):
    print('unique Product in dataset = ', len(reorder_product.product_id.unique()))
    print('unique order_id in dataset = ', len(reorder_product.order_id.unique()))
    product_id_list = reorder_product.product_id.unique().tolist()
    product_id_list.append("order_id")
    product_id_dict = {}
    i = 0

for prod_id in product_id_list:
    product_id_dict[prod_id] = i
    i = i+1
product_id_df = pd.Dataframe(columns = product_id_list)
row_list_all = []
order_id_list = reorder_product.order_id.unique()
i = 1
for id in order_id_list:
    #print(i)
    i = i+1
    np_zeros = np.zeros(shape = [len(product_id_list)-1])
    ordered_product_list = reorder_product.loc[reorder_product.order_id == id]["product_id"].tolist()
    for order_prod in ordered_product_list:
        np_zeros[product_id_dict.get(order_prod)] = 1
    row_list = np_zeros.tolist()
    row_list.append(id)
    row_list_all.append(row_list)
return (row_list_all, product_id_list)
df_row_wise = make_dataSet_rowWise(reorder_product_99Pct)
product_id_df = pd.DataFrame(df_row_wise[0], columns = df_row_wise[1])
product_id_df.head()
The error I have is this one:
NameError Traceback (most recent call last)
<ipython-input-343-07bcac1b3b48> in <module>
7 i = 0
8
----> 9 for prod_id in product_id_list:
10 product_id_dict[prod_id] = i
11 i = i+1
NameError: name 'product_id_list' is not defined
As already mentioned in the other answers, your indentation is wrong.
My recommendation is to use an IDE like VSCode; there is also a free web version at https://vscode.dev/
With an IDE like that you can see that your indentation is wrong (check the screenshot and line 27).
The three for loops are also indented incorrectly. The correct indentation is shown in the code below.
I think your indentation may be wrong: with your indentation, the for loops and the return statement are outside the function, so I indented them so that they are still part of the function...
import numpy as np
import pandas as pd

def make_dataSet_rowWise(reorder_product):
    print('unique Product in dataset = ', len(reorder_product.product_id.unique()))
    print('unique order_id in dataset = ', len(reorder_product.order_id.unique()))
    product_id_list = reorder_product.product_id.unique().tolist()
    product_id_list.append("order_id")
    # map each product id (and the trailing "order_id" label) to a column position
    product_id_dict = {}
    i = 0
    for prod_id in product_id_list:
        product_id_dict[prod_id] = i
        i = i+1
    product_id_df = pd.DataFrame(columns = product_id_list)
    row_list_all = []
    order_id_list = reorder_product.order_id.unique()
    i = 1
    for id in order_id_list:
        #print(i)
        i = i+1
        # one 0/1 flag per product, followed by the order id
        np_zeros = np.zeros(shape = [len(product_id_list)-1])
        ordered_product_list = reorder_product.loc[reorder_product.order_id == id]["product_id"].tolist()
        for order_prod in ordered_product_list:
            np_zeros[product_id_dict.get(order_prod)] = 1
        row_list = np_zeros.tolist()
        row_list.append(id)
        row_list_all.append(row_list)
    return (row_list_all, product_id_list)
I'm new here, but I think you either need to define the variable outside the scope of
def make_dataSet_rowWise(reorder_product):
or indent the for loops so that they are inside
make_dataSet_rowWise
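To illustrate the scoping point made in these answers, here is a minimal sketch that does not use the asker's data. The names build_index and item_list are made up for the example. A name assigned inside a function is local to that function, so code at module level cannot see it.

```python
def build_index(items):
    # item_list is assigned inside the function, so it is a local name.
    item_list = list(items)
    index = {}
    for position, item in enumerate(item_list):
        index[item] = position
    return index


result = build_index(["a", "b", "c"])

# At module level the local name does not exist, which is exactly
# the situation that produces the NameError in the question.
try:
    item_list
    leaked = True
except NameError:
    leaked = False
```

Keeping the loops indented inside the function, as in the corrected code above, means they run where the list is actually defined.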

For loop for web scraping in python

I have a small project that scrapes Google search results for a list of keywords. I have built a nested for loop to scrape the search results. The problem is that the loop over the keywords does not work as I intended, which is to scrape the data from each search result. The output contains only the result for the last keyword; the results for the first two keywords are missing.
Here is the code:
browser = webdriver.Chrome(r"C:\...\chromedriver.exe")
df = pd.DataFrame(columns = ['ceo', 'value'])
baseUrl = 'https://www.google.com/search?q='
html = browser.page_source
soup = BeautifulSoup(html)
ceo_list = ["Bill Gates", "Elon Musk", "Warren Buffet"]
values =[]
for ceo in ceo_list:
    browser.get(baseUrl + ceo)
    r = soup.select('div.g.rhsvw.kno-kp.mnr-c.g-blk')
    df = pd.DataFrame()
    for i in r:
        value = i.select_one('div.Z1hOCe').text
        ceo = i.select_one('.kno-ecr-pt.PZPZlf.gsmt.i8lZMc').text
        values = [ceo, value]
        s = pd.Series(values)
        df = df.append(s,ignore_index=True)
print(df)
The output:
0 1
0 Warren Buffet Born: October 28, 1955 (age 64 years), Seattle...
The output that I am expecting is as this:
0 1
0 Bill Gates Born:..........
1 Elon Musk Born:...........
2 Warren Buffett Born: August 30, 1930 (age 89 years), Omaha, N...
Any suggestions or comments are welcome here.
Declare df = pd.DataFrame() outside the for loop.
Because it is currently defined inside the loop, a new data frame is initialized for each keyword in your list and the previous one is replaced. That's why you only get the result for the last keyword.
Try this:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

browser = webdriver.Chrome(r"C:\...\chromedriver.exe")
baseUrl = 'https://www.google.com/search?q='
ceo_list = ["Bill Gates", "Elon Musk", "Warren Buffet"]
df = pd.DataFrame()  # created once, before the loop
for ceo in ceo_list:
    browser.get(baseUrl + ceo)
    # parse the page after it has loaded, so each keyword gets its own results
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    r = soup.select('div.g.rhsvw.kno-kp.mnr-c.g-blk')
    for i in r:
        value = i.select_one('div.Z1hOCe').text
        ceo = i.select_one('.kno-ecr-pt.PZPZlf.gsmt.i8lZMc').text
        s = pd.Series([ceo, value])
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, s.to_frame().T], ignore_index=True)
print(df)
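A common variation on the same idea is to collect plain rows in a list that is created before the loop and build the DataFrame once at the end. The sketch below leaves out the browser entirely: fetch_summary is a hypothetical placeholder for the scraping step and is not part of the original code.

```python
import pandas as pd


def fetch_summary(name):
    # Hypothetical placeholder for the scraping step.
    return "summary for " + name


ceo_list = ["Bill Gates", "Elon Musk", "Warren Buffet"]

rows = []  # created once, before the loop, so nothing is overwritten
for ceo in ceo_list:
    rows.append({"ceo": ceo, "value": fetch_summary(ceo)})

# Build the DataFrame a single time from the accumulated rows.
df = pd.DataFrame(rows, columns=["ceo", "value"])
```

Building the frame once also avoids repeated copying, since growing a DataFrame row by row creates a new object on every iteration.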

python list of dictionaries only updating 1 attribute and skipping others

I have a list of lists containing company objects:
companies_list = [companies1, companies2]
I have the following function:
def get_fund_amount_by_year(companies_list):
    companies_length = len(companies_list)
    for idx, companies in enumerate(companies_list):
        companies1 = companies.values_list('id', flat=True)
        funding_rounds = FundingRound.objects.filter(company_id__in=companies1).order_by('announced_on')
        amount_per_year_list = []
        for fr in funding_rounds:
            fr_year = fr.announced_on.year
            fr_amount = fr.raised_amount_usd
            if not any(d['year'] == fr_year for d in amount_per_year_list):
                year_amount = {}
                year_amount['year'] = fr_year
                for companies_idx in range(companies_length):
                    year_amount['amount'+str(companies_idx)] = 0
                    if companies_idx == idx:
                        year_amount['amount'+str(companies_idx)] = fr_amount
                amount_per_year_list.append(year_amount)
            else:
                for year_amount in amount_per_year_list:
                    if year_amount['year'] == fr_year:
                        year_amount['amount'+str(idx)] += fr_amount
    return amount_per_year_list
The problem is the resulting list of dictionaries has only one amount attribute updated.
As you can see "amount0" contains all "0" amounts:
[{'amount1': 12100000L, 'amount0': 0, 'year': 1999}, {'amount1': 8900000L, 'amount0': 0, 'year': 2000}]
What am I doing wrong?
I was creating the list of dictionaries inside the loop, so each iteration overwrote the results of the previous one. I changed it to look like this:
def get_fund_amount_by_year(companies_list):
    companies_length = len(companies_list)
    amount_per_year_list = []  # moved out of the loop
    for idx, companies in enumerate(companies_list):
        companies1 = companies.values_list('id', flat=True)
        funding_rounds = FundingRound.objects.filter(company_id__in=companies1).order_by('announced_on')
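The same accumulation can be sketched without the Django models. In the example below, each group is assumed to be a plain list of (year, amount) tuples, and the name fund_amount_by_year and the sample figures are invented for illustration. The point it shows is that the container holding the per-year rows is created once, before the outer loop.

```python
def fund_amount_by_year(groups):
    """groups: one list of (year, amount) tuples per company group."""
    totals = {}  # created once, shared by every group
    for idx, rounds in enumerate(groups):
        for year, amount in rounds:
            # Create the row for a new year with every amountN set to 0.
            row = totals.setdefault(
                year,
                {"year": year, **{"amount%d" % i: 0 for i in range(len(groups))}},
            )
            row["amount%d" % idx] += amount
    return [totals[year] for year in sorted(totals)]


groups = [
    [(1999, 100), (2000, 50)],
    [(1999, 10), (1999, 5), (2001, 7)],
]
result = fund_amount_by_year(groups)
```

Keying the rows by year in a dict also removes the need to scan the list with any() for every funding round.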

TypeError: 'DataFrame' object is not callable python function

I have two functions: one creates a dataframe from a CSV file and the other manipulates that dataframe. There is no problem the first time I pass the raw data through lsc_age(import_data()). However, I get the above-referenced error (TypeError: 'DataFrame' object is not callable) on the second and subsequent calls. Any ideas for how to solve the problem?
def import_data(csv,date1,date2):
    global data
    data = pd.read_csv(csv,header=1)
    data = data.iloc[:,[0,1,4,6,7,8,9,11]]
    data = data.dropna(how='all')
    data = data.rename(columns={"National: For Dates 9//1//"+date1+" - 8//31//"+date2:'event','Unnamed: 1':'time','Unnamed: 4':'points',\
                                'Unnamed: 6':'name','Unnamed: 7':'age','Unnamed: 8':'lsc','Unnamed: 9':'club','Unnamed: 11':'date'})
    data = data.reset_index().drop('index',axis=1)
    data = data[data.time!='Time']
    data = data[data.points!='Power ']
    data = data[data['event']!="National: For Dates 9//1//"+date1+" - 8//31//"+date2]
    data = data[data['event']!='USA Swimming, Inc.']
    data = data.reset_index().drop('index',axis=1)
    for i in range(len(data)):
        if len(str(data['event'][i])) <= 3:
            data['event'][i] = data['event'][i-1]
        else:
            data['event'][i] = data['event'][i]
    data = data.dropna()
    age = []
    event = []
    gender = []
    for row in data.event:
        gender.append(row.split(' ')[0])
        if row[:9]=='Female 10':
            n = 4
            groups = row.split(' ')
            age.append(' '.join(groups[1:n]))
            event.append(' '.join(groups[n:]))
        elif row[:7]=='Male 10':
            n = 4
            groups = row.split(' ')
            age.append(' '.join(groups[1:n]))
            event.append(' '.join(groups[n:]))
        else:
            n = 2
            groups = row.split(' ')
            event.append(' '.join(groups[n:]))
            groups = row.split(' ')
            age.append(groups[1])
    data['age_group'] = age
    data['event_simp'] = event
    data['gender'] = gender
    data['year'] = date2
    return data
def lsc_age(data_two):
    global lsc, lsc_age, top, all_performers
    lsc = pd.DataFrame(data_two['event'].groupby(data_two['lsc']).count()).reset_index().sort_values(by='event',ascending=False)
    lsc_age = data_two.groupby(['year','age_group','lsc'])['event'].count().reset_index().sort_values(by=['age_group','event'],ascending=False)
    top = pd.concat([lsc_age[lsc_age.age_group=='10 & under'].head(),lsc_age[lsc_age.age_group=='11-12'].head(),\
                     lsc_age[lsc_age.age_group=='13-14'].head(),lsc_age[lsc_age.age_group=='15-16'].head(),\
                     lsc_age[lsc_age.age_group=='17-18'].head()],ignore_index=True)
    all_performers = pd.concat([lsc_age[lsc_age.age_group=='10 & under'],lsc_age[lsc_age.age_group=='11-12'],\
                                lsc_age[lsc_age.age_group=='13-14'],lsc_age[lsc_age.age_group=='15-16'],\
                                lsc_age[lsc_age.age_group=='17-18']],ignore_index=True)
    all_performers = all_performers.rename(columns={'event':'no. top 100'})
    all_performers['age_year_lsc'] = all_performers.age_group+' '+all_performers.year.astype(str)+' '+all_performers.lsc
    return all_performers
years = [i for i in range(2008,2018)]
for i in range(len(years)-1):
    lsc_age(import_data(str(years[i+1])+"national100.csv",\
                        str(years[i]),str(years[i+1])))
During the first call to your function lsc_age() in line
lsc_age = data_two.groupby(['year','age_group','lsc'])['event'].count().reset_index().sort_values(by=['age_group','event'],ascending=False)
you are overwriting your function object with a dataframe. This happens because the name lsc_age was declared global inside the function with
global lsc, lsc_age, top, all_performers
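Functions in Python are objects, so the name lsc_age can be rebound to another value like any other variable. After the first call it refers to a dataframe, and the second call tries to call that dataframe.
To solve your problem, avoid the global declarations. They do not seem to be necessary. Pass your data around through function arguments and return values instead.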
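The effect can be reproduced in a few lines without pandas. In this sketch the names summarize and summarize_fixed are invented for the example: the first function rebinds its own global name to a number, so the second call fails in the same way as in the question, while the second function keeps its result in a local variable.

```python
def summarize(values):
    global summarize          # the assignment below rebinds the module-level name
    summarize = sum(values)   # the function object is replaced by an int
    return summarize


first = summarize([1, 2, 3])  # works: the name still refers to the function

try:
    summarize([4, 5, 6])      # the name now refers to the int 6
    error = None
except TypeError as exc:
    error = str(exc)


def summarize_fixed(values):
    total = sum(values)       # local name, nothing global is touched
    return total
```

summarize_fixed can be called any number of times because it never reassigns its own name.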

Set limit feature_importances_ in DataFrame Pandas

I want to limit the number of rows shown in my feature_importances_ output using a DataFrame.
Below is my code (adapted from this blog):
train = df_visualization.sample(frac=0.9,random_state=639)
test = df_visualization.drop(train.index)
train.to_csv('train.csv',encoding='utf-8')
test.to_csv('test.csv',encoding='utf-8')
train_dis = train.iloc[:,:66]
train_val = train_dis.values
train_in = train_val[:,:65]
train_out = train_val[:,65]
test_dis = test.iloc[:,:66]
test_val = test_dis.values
test_in = test_val[:,:65]
test_out = test_val[:,65]
dt = tree.DecisionTreeClassifier(random_state=59,criterion='entropy')
dt = dt.fit(train_in,train_out)
score = dt.score(train_in,train_out)
test_predicted = dt.predict(test_in)
# Print the feature ranking
print("Feature ranking:")
print (DataFrame(dt.feature_importances_, columns = ["Imp"], index = train.iloc[:,:65].columns).sort_values(['Imp'], ascending = False))
My problem is that it displays all 65 features.
Output :
Imp
wbc 0.227780
age 0.100949
gcs 0.069359
hr 0.069270
rbs 0.053418
sbp 0.052067
Intubation-No 0.050729
... ...
Babinski-Normal 0.000000
ABG-Metabolic Alkolosis 0.000000
ABG-Respiratory Acidosis 0.000000
Reflexes-Unilateral Hyperreflexia 0.000000
NS-No 0.000000
For example, I want only the top 5 features.
Expected output:
Imp
wbc 0.227780
age 0.100949
gcs 0.069359
hr 0.069270
rbs 0.053418
Update :
I found a way to display them using itertuples.
display = pd.DataFrame(dt.feature_importances_, columns = ["Imp"], index = train.iloc[:,:65].columns).sort_values(['Imp'], ascending = False)
x=0
for row,col in display.itertuples():
    if x<5:
        print(row,"=",col)
    else:
        break
    x += 1
Output :
Feature ranking:
wbc = 0.227780409582
age = 0.100949241154
gcs = 0.0693593476192
hr = 0.069270425399
rbs = 0.0534175402602
But I want to know whether this is an efficient way to get the output.
Try this:
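import numpy as np
feature_cols = train.iloc[:,:65].columns  # feature names, as in the question
indices = np.argsort(dt.feature_importances_)[::-1]  # positions sorted by descending importance
for i in range(5):
    print(" %s = %s" % (feature_cols[indices[i]], dt.feature_importances_[indices[i]]))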
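Since the importances are already in a DataFrame in the question, pandas can also do the truncation itself. The sketch below does not fit a model: it reuses a few of the feature names and values printed in the question, in shuffled order, as hard-coded sample data, and shows two equivalent ways to keep the top five.

```python
import pandas as pd

# Sample data copied from the output shown in the question (shuffled).
feature_names = ["sbp", "hr", "wbc", "Intubation-No", "age", "rbs", "gcs"]
importances = [0.052067, 0.069270, 0.227780, 0.050729, 0.100949, 0.053418, 0.069359]

ranking = pd.DataFrame({"Imp": importances}, index=feature_names)

# Option 1: ask directly for the five largest values.
top5 = ranking.nlargest(5, "Imp")

# Option 2: sort as in the question, then keep the first five rows.
top5_sorted = ranking.sort_values("Imp", ascending=False).head(5)

print(top5)
```

Either form can be appended to the sort_values call in the original print statement, so no explicit loop or counter is needed.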
