Appending floats to empty pandas DataFrame - python

I am trying to build a recursive function that will take data from a specific dataframe, do a little math with it, and append the result to a new dataframe. My current code looks as follows:
div1, div2, div3 = [pd.DataFrame(index=range(1), columns=['g']) for i in range(3)]

# THIS IS NOT WORKING FOR SOME REASON
def stats(div, obp):
    loop = 1
    while loop <= 3:
        games = obp['g'].sum() / 2
        div = div.append(games)
        loop += 1
        if loop == 2:
            stats(div2, dii_obp, dii_hr)
        elif loop == 3:
            stats(div3, diii_obp, diii_hr)
        else:
            print('Invalid')

stats(div1, di_obp)

I get an error that reads:
TypeError: cannot concatenate object of type '<class 'numpy.float64'>'; only Series and DataFrame objs are valid
div1 and di_obp are dataframes, and 'g' is a column in the di_obp dataframe.
I have tried making the variable games an empty list and a Series, and got different errors. I am not sure what I should try next. Any help is much appreciated!
Here is the head of the di_obp dataframe; the dii_obp and diii_obp dataframes are the same but with different values.
print(di_obp.head())
rank team g ab h bb hbp sf sh
608 213.0 None 56.0 1947.0 526.0 182.0 55.0 19.0 22.0
609 214.0 None 36.0 1099.0 287.0 124.0 25.0 11.0 24.0
610 215.0 None 35.0 1099.0 247.0 159.0 51.0 11.0 24.0
611 216.0 None 36.0 1258.0 317.0 157.0 30.0 11.0 7.0
612 217.0 None 38.0 1136.0 281.0 138.0 41.0 14.0 10.0
CURRENT PROBLEM:
div1, div2, div3 = [pd.DataFrame(index=range(1), columns=['g']) for i in range(3)]

def stats(div, obp):
    loop = 1
    while loop <= 3:
        games = obp['g'].sum() / 2
        div[0] = div[0].append({'g': games}, ignore_index=True)
        loop += 1
        if loop == 2:
            stats([div2], dii_obp)
        elif loop == 3:
            stats([div3], diii_obp)
        else:
            print('Done')

stats([div1], di_obp)
This does not return an error, but my dataframes are still empty.
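The frames stay empty because append returns a new object and rebinding a name inside a function never changes the caller's variable. A minimal sketch in plain Python (a list standing in for the DataFrame, not the real data):

```python
# Rebinding a parameter inside a function only changes the local name;
# the object the caller holds is untouched.
def add_one(x):
    x = x + [1]  # creates a NEW list and rebinds the local name x

lst = []
add_one(lst)
print(lst)  # → []
```

DataFrame.append behaves the same way: `div = div.append(...)` rebinds the local `div` without modifying the frame that was passed in.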

Appending to a dataframe is generally not recommended. Instead, you should accumulate your data in lists and then create dataframes from those lists:
div1, div2, div3 = [[] for _ in range(3)]

def stats(div, obp):
    loop = 1
    while loop <= 3:
        games = obp['g'].sum() / 2
        div.append(games)
        loop += 1
        if loop == 2:
            stats(div2, dii_obp)
        elif loop == 3:
            stats(div3, diii_obp)
        else:
            print('Done')

stats(div1, di_obp)

div1_df, div2_df, div3_df = [pd.DataFrame({'g': div}) for div in [div1, div2, div3]]
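A self-contained sketch of the accumulate-then-construct pattern, using made-up stand-ins for the obp frames:

```python
import pandas as pd

# Stand-in frames; the real di_obp/dii_obp/diii_obp come from elsewhere.
frames = [pd.DataFrame({'g': [56.0, 36.0]}),
          pd.DataFrame({'g': [40.0, 20.0]}),
          pd.DataFrame({'g': [10.0, 30.0]})]

results = []                      # accumulate plain floats in a list
for obp in frames:
    results.append(obp['g'].sum() / 2)

result_df = pd.DataFrame({'g': results})  # build the DataFrame once, at the end
print(result_df['g'].tolist())  # → [46.0, 30.0, 20.0]
```

Building the DataFrame once at the end avoids the quadratic cost of repeatedly copying a growing frame.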

Related

offsetting row values using pandas.dataframe.shift()

I have a data frame similar to the one below:
absences_df = pd.DataFrame({'PersonNumber': ['1234', '1234', '1234', '5678', '5678', '5678', '997', '998', '998'],
                            'Start': ['2022-03-07', '2022-03-08', '2022-03-09', '2022-03-09', '2022-03-10', '2022-03-11', '2022-03-07', '2022-03-07', '2022-03-08'],
                            'End': ['2022-03-07', '2022-03-08', '2022-03-09', '2022-03-09', '2022-03-10', '2022-03-11', '2022-03-07', '2022-03-07', '2022-03-08'],
                            'hours': ['1', '1', '1', '1', '2', '2', '3.5', '1', '2']})
I have another dataframe like the one below:
input_df = pd.DataFrame({'PersonNumber': ['1234', '5678', '997', '998'],
                         'W03': ['1.0', '11.0', '1.0', '22.0'],
                         'W3_5': ['2.0', '12.0', '2.0', '23.0'],
                         'W04': ['3.0', '13.0', '3.0', '24.0'],
                         'W4_5': ['4.0', '14.0', '4.0', '25.0'],
                         'W05': ['5.0', '15.0', '5.0', '26.0'],
                         'W5_5': ['0.0', '16.0', '6.0', '27.0'],
                         'W06': ['0.0', '17.0', '7.0', '28.0'],
                         'W6_5': ['6.0', '18.0', '8.0', '29.0'],
                         'W07': ['7.0', '19.0', '9.0', '0.0'],
                         'W7_5': ['8.0', '0.0', '10.0', '0.0'],
                         'W08': ['9.0', '0.0', '11.0', '31.0'],
                         'W8_5': ['10.0', '0.0', '12.0', '32.0'],
                         'W09': ['11.0', '22.0', '13.0', '34.0'],
                         })
I want to offset the row values in my second data frame (input_df) based on the value in the "hours" column of my first data frame (absences_df). After offsetting, the last value should be repeated for the remaining columns.
I want an output similar to the one below:
output_df = pd.DataFrame({'PersonNumber': ['1234', '5678', '997', '998'],
                          'W03': ['0.0', '0.0', '7.0', '27.0'],
                          'W3_5': ['0.0', '0.0', '8.0', '28.0'],
                          'W04': ['6.0', '0.0', '9.0', '29.0'],
                          'W4_5': ['7.0', '22.0', '10.0', '0.0'],
                          'W05': ['8.0', '22.0', '11.0', '0.0'],
                          'W5_5': ['9.0', '22.0', '12.0', '31.0'],
                          'W06': ['10.0', '22.0', '13.0', '32.0'],
                          'W6_5': ['11.0', '22.0', '13.0', '34.0'],
                          'W07': ['11.0', '22.0', '13.0', '34.0'],
                          'W7_5': ['11.0', '22.0', '13.0', '34.0'],
                          'W08': ['11.0', '22.0', '13.0', '34.0'],
                          'W8_5': ['11.0', '22.0', '13.0', '34.0'],
                          'W09': ['11.0', '22.0', '13.0', '34.0']
                          })
Simply put:
1) Employee 1234 is absent for 3 days and the sum of his per-day hours is 3 (1+1+1). So 3 (total hours) + 2 (common for everyone) = 5, and the offset starts from W5_5.
2) Employee 5678 is absent for 3 days and the sum of his per-day hours is 5 (1+2+2). So 5 + 2 = 7, and the offset starts from W7_5.
3) Employee 997 is absent for 1 day with 3.5 hours. So 3.5 + 2 = 5.5, and the offset starts from W06.
4) Employee 998 is absent for 2 days and the sum of his per-day hours is 3 (1+2). So 3 + 2 = 5, and the offset starts from W5_5.
I have tried using shift() and a few other ways, but nothing helped me. Posting what I have tried here:
A = absences_df['PersonNumber'].value_counts()
dfNew_employee = []
dfNew_repeat_time = []
dfNew_Individual_hrs = []
df_new_average_hours = []
dfNew_total_hrs = []
for i in A.index:
    individual_employee = absences_df.loc[absences_df['PersonNumber'] == i]
    hr_per_day = individual_employee['Duration'].iloc[0]
    dfNew_employee.append(i)
    dfNew_repeat_time.append(A[i])
    dfNew_Individual_hrs.append(hr_per_day)
    dfNew_total_hrs.append(str(sum(individual_employee['Duration']) + 2))
    df_new_average_hours.append(str((int(hr_per_day) * int(A[i])) + 2))
    print('employee id:', i, '; Repeated:', A[i], '; Hours=', hr_per_day, '; Total hours=', sum(individual_employee['Duration']) + 2)

main_cnt = 0
b = weekly_penality_df.copy()
df_final = pd.DataFrame(columns=b.columns)
for k in dfNew_employee:
    i = dfNew_total_hrs[main_cnt]
    i = int(float(i) * 2) - 5
    # if main_cnt > 0:
    #     b = a3.copy()
    print(i)
    a = b[b['PersonNumber'] == str(k)]
    if a.shape[0] == 0:
        print(main_cnt)
        continue
    a_ref_index = a.index.values.astype(int)[0]
    a1 = b[['PersonNumber']].copy()
    a2 = b.copy()
    a2.drop(['PersonNumber'], axis=1, inplace=True)
    a21 = a2.iloc[[a_ref_index], :].copy()
    a21.dropna(axis=1, inplace=True)
    a21_last_value = a21[a21.columns[-1]]
    a2.iloc[[a_ref_index], :] = a2.iloc[[a_ref_index], :].shift(i * -1, axis=1, fill_value=float(a21_last_value))
    a3 = pd.concat([a1, a2], axis=1)
    temp = a3[a3['PersonNumber'] == str(k)]
    # df_final = df_final.append(temp, ignore_index=True)
    b.loc[temp.index, :] = temp[:]
    a3 = a3.reset_index(drop=True)
    main_cnt = main_cnt + 1
Please help me with an easier/simpler solution.
Thanks in advance.
This is the function to get the exact column name from absences_df:
def get_offset_amount(person_number):
    # Sum of all the absence hours for a particular person
    offset = absences_df[absences_df['PersonNumber'] == person_number]['hours'].astype(float).sum()
    # If the sum is zero, there is no change in the output dataframe
    if offset == 0:
        return 0
    # Add 2, as per your requirement
    offset += 2
    # Create the column name
    if offset.is_integer():
        column_name = 'W{offset}_5'.format(offset=int(offset))
    else:
        column_name = 'W0{offset}'.format(offset=int(offset + 1))
    # Fetch the column number using the column name just created
    return input_df.columns.tolist().index(column_name)
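The column-naming rule on its own, as a minimal sketch detached from the dataframes (it assumes the single-digit W-columns shown here; the zero-padding would need adjusting for W10 and beyond):

```python
# Minimal sketch of the column-name rule used above (single-digit columns only).
def offset_column(total_hours):
    offset = total_hours + 2  # the "+2 common for everyone"
    if float(offset).is_integer():
        return 'W{}_5'.format(int(offset))   # whole hours → half-column, e.g. 5 → 'W5_5'
    return 'W0{}'.format(int(offset + 1))    # half hours → next whole column, e.g. 5.5 → 'W06'

print(offset_column(3))    # employee 1234 → W5_5
print(offset_column(5))    # employee 5678 → W7_5
print(offset_column(3.5))  # employee 997  → W06
```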
Iterating over the input DF and creating the offset list, using the same shift function from your attempt:
output_lst = []
for person_number in input_df['PersonNumber']:
    shift_amount = get_offset_amount(person_number)
    last_value = input_df[input_df['PersonNumber'] == person_number].iloc[0, -1]
    lst = input_df[input_df['PersonNumber'] == person_number] \
        .shift(periods=shift_amount * -1, axis=1, fill_value=last_value) \
        .iloc[0, :].tolist()[:-1]
    new_lst = [person_number, *lst]
    output_lst.append(new_lst)

output_df = pd.DataFrame(output_lst)
output_df.columns = input_df.columns
output_df:
PersonNumber W03 W3_5 W04 W4_5 W05 W5_5 W06 W6_5 W07 W7_5 \
0 1234 0.0 0.0 6.0 7.0 8.0 9.0 10.0 11.0 11.0 11.0
1 5678 0.0 0.0 0.0 22.0 22.0 22.0 22.0 22.0 22.0 22.0
2 997 7.0 8.0 9.0 10.0 11.0 12.0 13.0 13.0 13.0 13.0
3 998 27.0 28.0 29.0 0.0 0.0 31.0 32.0 34.0 34.0 34.0
W08 W8_5 W09
0 11.0 11.0 11.0
1 22.0 22.0 22.0
2 13.0 13.0 13.0
3 34.0 34.0 34.0

Panda(Python): add a new column in a data frame which depends on its row value and aggregated value from another data frame

I am new to Python and pandas, so my question may be silly.
Problem:
So I have two data frames let's say df1 and df2 where
df1 is like
treatment1 treatment2 value comparision test adjustment statsig p_value
0 Treatment Control 0.795953 Treatment:Control t-test Benjamini-Hochberg False 0.795953
1 Treatment2 Control 0.795953 Treatment2:Control t-test Benjamini-Hochberg False 0.795953
2 Treatment2 Treatment 0.795953 Treatment2:Treatment t-test Benjamini-Hochberg False 0.795953
and df2 is like
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
.. ... ...
336 Treatment3 35.0
337 Treatment3 9.0
338 Treatment3 35.0
339 Treatment3 9.0
340 Treatment3 35.0
I want to add a column mean_percentage_lift in df1 where
lift_mean_percentage = (mean(treatment1)/mean(treatment2) -1) * 100
where `treatment1` and `treatment2` can be anything in `[Treatment, Control, Treatment2]`
My Approach:
I am using the assign function of the data frame.
df1.assign(mean_percentage_lift = lambda dataframe: lift_mean_percentage(df2, dataframe['treatment1'], dataframe['treatment2']))
where
def lift_mean_percentage(df, treatment1, treatment2):
    treatment1_data = df[df[group_type_col] == treatment1]
    treatment2_data = df[df[group_type_col] == treatment2]
    mean1 = treatment1_data['metric'].mean()
    mean2 = treatment2_data['metric'].mean()
    return (mean1 / mean2 - 1) * 100
But I am getting the error Can only compare identically-labeled Series objects for the line treatment1_data = df[df[group_type_col] == treatment1]. Is there something I am doing wrong? Is there an alternative to this?
For dataframe df2:
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
5 Treatment3 35.0
6 Treatment3 9.0
7 Treatment 35.0
8 Treatment3 9.0
9 Control 5.0
You can try:
def lift_mean_percentage(df, T1, T2):
    treatment1 = df['metric'][df['group_type'] == T1].mean()
    treatment2 = df['metric'][df['group_type'] == T2].mean()
    return (treatment1 / treatment2 - 1) * 100
Running:
lift_mean_percentage(df2,'Treatment2','Control')
the result:
260.8695652173913
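The original error comes from assign handing the function whole Series rather than scalars; calling the function once per row with apply(axis=1) avoids comparing two differently-indexed Series. A self-contained sketch (the toy frames below stand in for the real df1/df2):

```python
import pandas as pd

# Toy stand-ins for the question's data.
df2 = pd.DataFrame({'group_type': ['Treatment', 'Treatment2', 'Treatment',
                                   'Treatment', 'Control', 'Control'],
                    'metric': [31.0, 83.0, 51.0, 20.0, 41.0, 5.0]})
df1 = pd.DataFrame({'treatment1': ['Treatment2'], 'treatment2': ['Control']})

def lift_mean_percentage(df, T1, T2):
    treatment1 = df['metric'][df['group_type'] == T1].mean()
    treatment2 = df['metric'][df['group_type'] == T2].mean()
    return (treatment1 / treatment2 - 1) * 100

# Per-row scalars, not whole Series:
df1['mean_percentage_lift'] = df1.apply(
    lambda row: lift_mean_percentage(df2, row['treatment1'], row['treatment2']),
    axis=1)
print(df1['mean_percentage_lift'].iloc[0])  # ≈ 260.87
```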

Iterating through a series to find values >= x then use values

I have a series (of length 201) created from reading a .xlsx spread sheet, as follows:
xl = pandas.ExcelFile(file)
data = xl.parse('Sheet1')
data.columns = ["a", "b", "c", "d", "e", "f", "g", "h"]
A = data.a
So I am working with A and if I print (A) I get
0       76.0
1      190.0
2        0.0
3       86.0
4        0.0
       ...
196    156.0
197      0.0
198      0.0
199    320.0
200      0.0
Name: Vazi, Length: 201, dtype: float64
I want to iterate through A and find all the values >= 180, and make a new array (or series) where, for values in A >= 180, I subtract 180, but for values in A < 180 I keep the original value. I have tried the following but I get errors:
nx = len(A)
for i in range(nx):
    if A_new(i) >= A(i) + 180:
    else A_new(i) == A(i)
Use Series.mask / Series.where:
new_s = s.mask(s.ge(180), s.sub(180))
#new_s = s.sub(180).where(s.ge(180), s)  # or Series.where
or np.where:
new_s = pd.Series(data=np.where(s.ge(180), s.sub(180), s),
                  index=s.index,
                  name=s.name)
We could also use Series.loc:
new_s = s.copy()
new_s.loc[s.ge(180)] = s.sub(180)
new_s output:
0 76.0
1 10.0
2 0.0
3 86.0
4 0.0
196 156.0
197 0.0
198 0.0
199 140.0
200 0.0
Name: Vazi, Length: 201, dtype: float64
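The Series.mask variant on a tiny stand-in series (made-up values, not the spreadsheet data):

```python
import pandas as pd

# Toy stand-in for the question's column A.
s = pd.Series([76.0, 190.0, 0.0, 86.0, 320.0])

# Where the value is >= 180 replace it with value - 180, otherwise keep it.
new_s = s.mask(s.ge(180), s.sub(180))
print(new_s.tolist())  # → [76.0, 10.0, 0.0, 86.0, 140.0]
```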

Speed optimization for loop

I'm trying to predict the outcome of sports games and therefore want to transform my dataframe in such a way that I can train a model. Currently I am using a for loop to loop through all played games, pick the two players of the game and check how they performed the x games before the actual game took place. After this I want to take the mean of the statistics of previous games of these players and concatenate these together. In the end I add the true outcome of the actual game so I can train a model on the true outcome.
Now I have some performance issues: my current code takes about 9 minutes to complete for 20000 games (with ~200 variables). I already managed to go from 20 to 9 minutes.
I started by adding each game to a dataframe; later I changed this to appending each separate dataframe to a list and building one big dataframe from that list at the end.
I also included if statements which make sure that the loop continues if a player did not play at least x games.
I think this can be much faster than 9 minutes.
Hope you guys can help me!
import pandas as pd
import numpy as np
import random
import string

letters = list(string.ascii_lowercase)
datelist = pd.date_range(start='1/1/2017', end='1/1/2019')

data = pd.DataFrame({'Date': np.random.choice(datelist, 5000),
                     'League': np.random.choice(['LeagueA', 'LeagueB'], 5000),
                     'Home_player': np.random.choice(letters, 5000),
                     'Away_player': np.random.choice(letters, 5000),
                     'Home_strikes': np.random.randint(1, 20, 5000),
                     'Home_kicks': np.random.randint(1, 20, 5000),
                     'Away_strikes': np.random.randint(1, 20, 5000),
                     'Away_kicks': np.random.randint(1, 20, 5000),
                     'Winner': np.random.randint(0, 2, 5000)})

leagues = list(data['League'].unique())
home_columns = [col for col in data if col.startswith('Home')]
away_columns = [col for col in data if col.startswith('Away')]

# Determine how many of the last x games to take statistics from
total_games = 5
final_df = []

# Make a subframe per league
for league in leagues:
    league_data = data[data.League == league]
    league_data = league_data.sort_values(by='Date').reset_index(drop=True)
    league_data = league_data.head(500)
    for i in range(0, len(league_data)):
        if i < 1:
            league_copy = league_data.sort_values(by='Date').reset_index(drop=True)
        else:
            league_copy = league_data[:-i].reset_index(drop=True)
        # Loop back from the last game
        last_game = league_copy.iloc[-1:].reset_index(drop=True)
        # Take home and away player
        Home_player = last_game.loc[0, "Home_player"]  # Pick home team
        Away_player = last_game.loc[0, 'Away_player']  # Pick away team
        # Remove the last row so the current game is not picked
        df = league_copy[:-1]
        # Now check the statistics of the games before this game was played
        Home = df[df.Home_player == Home_player].tail(total_games)  # Pick data from home team
        # If the player did not play at least x games, continue
        if len(Home) < total_games:
            continue
        else:
            Home = Home[home_columns].reset_index(drop=True)  # Pick all column names that start with "Home"
        # Do the same for the away team
        Away = df[df.Away_player == Away_player].tail(total_games)  # Pick data from away team
        if len(Away) < total_games:
            continue
        else:
            Away = Away[away_columns].reset_index(drop=True)  # Pick all column names that start with "Away"
        # Concat home and away player data
        Home_away = pd.concat([Home, Away], axis=1)
        Home_away.drop(['Away_player', 'Home_player'], inplace=True, axis=1)
        # Take the mean of all columns
        Home_away = pd.DataFrame(Home_away.mean().to_dict(), index=[0])
        # Add home team and away team back to the dataframe
        Home_away["Home_player"] = Home_player
        Home_away["Away_player"] = Away_player
        winner = last_game.loc[0, "Winner"]
        date = last_game.loc[0, "Date"]
        Home_away['Winner'] = winner
        Home_away['Date'] = date
        final_df.append(Home_away)

final_df = pd.concat(final_df, axis=0)
final_df = final_df[['Date', 'Home_player', 'Away_player', 'Home_kicks', 'Away_kicks', 'Home_strikes', 'Away_strikes', 'Winner']]
This doesn't answer your question but you can leverage the package line_profiler to find the slow parts of your code.
Resource:
http://gouthamanbalaraman.com/blog/profiling-python-jupyter-notebooks.html
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 1 35.0 35.0 0.0 letters = list(string.ascii_lowercase)
3 1 11052.0 11052.0 0.0 datelist = pd.date_range(start='1/1/2017', end='1/1/2019')
4
5 1 3483.0 3483.0 0.0 data = pd.DataFrame({'Date':np.random.choice(datelist,5000),
6 1 1464.0 1464.0 0.0 'League': np.random.choice(['LeagueA','LeagueB'], 5000),
7 1 2532.0 2532.0 0.0 'Home_player':np.random.choice(letters, 5000),
8 1 1019.0 1019.0 0.0 'Away_player':np.random.choice(letters, 5000),
9 1 693.0 693.0 0.0 'Home_strikes':np.random.randint(1,20,5000),
10 1 682.0 682.0 0.0 'Home_kicks':np.random.randint(1,20,5000),
11 1 682.0 682.0 0.0 'Away_strikes':np.random.randint(1,20,5000),
12 1 731.0 731.0 0.0 'Away_kicks':np.random.randint(1,20,5000),
13 1 40409.0 40409.0 0.0 'Winner':np.random.randint(0,2,5000)})
14
15 1 6560.0 6560.0 0.0 leagues = list(data['League'].unique())
16 1 439.0 439.0 0.0 home_columns = [col for col in data if col.startswith('Home')]
17 1 282.0 282.0 0.0 away_columns = [col for col in data if col.startswith('Away')]
18
19 # Determine to how many last x games to take statistics
20 1 11.0 11.0 0.0 total_games = 5
21 1 12.0 12.0 0.0 final_df = []
22
23 # Make subframe of league
24 3 38.0 12.7 0.0 for league in leagues:
25
26 2 34381.0 17190.5 0.0 league_data = data[data.League == league]
27 2 30815.0 15407.5 0.0 league_data = league_data.sort_values(by='Date').reset_index(drop=True)
28 # Pick the last game
29 2 5045.0 2522.5 0.0 league_data = league_data.head(500)
30 1002 14202.0 14.2 0.0 for i in range(0,len(league_data)):
31 1000 11943.0 11.9 0.0 if i < 1:
32 2 28407.0 14203.5 0.0 league_copy = league_data.sort_values(by='Date').reset_index(drop=True)
33 else:
34 998 5305364.0 5316.0 4.2 league_copy = league_data[:-i].reset_index(drop=True)
35
36 # Loop back from the last game
37 1000 4945240.0 4945.2 3.9 last_game = league_copy.iloc[-1:].reset_index(drop=True)
38
39 # Take home and away player
40 1000 1504055.0 1504.1 1.2 Home_player = last_game.loc[0,"Home_player"] # Pick home team
41 1000 899081.0 899.1 0.7 Away_player = last_game.loc[0,'Away_player'] # pick away team
42
43 # # Remove last row so current game is not picked
44 1000 2539351.0 2539.4 2.0 df = league_copy[:-1]
45
46 # Now check the statistics of the games befóre this game was played
47 1000 16428854.0 16428.9 13.0 Home = df[df.Home_player == Home_player].tail(total_games) # Pick data from home team
48
49 # If the player did not play at least x number of games, then continue
50 1000 49133.0 49.1 0.0 if len(Home) < total_games:
51 260 2867.0 11.0 0.0 continue
52 else:
53 740 12968016.0 17524.3 10.2 Home = Home[home_columns].reset_index(drop=True) # Pick all columnnames that start with "Home"
54
55
56 # Do the same for the away team
57 740 12007650.0 16226.6 9.5 Away = df[df.Away_player == Away_player].tail(total_games) # Pick data from home team
58
59 740 33357.0 45.1 0.0 if len(Away) < total_games:
60 64 825.0 12.9 0.0 continue
61 else:
62 676 11598741.0 17157.9 9.1 Away = Away[away_columns].reset_index(drop=True) # Pick all columnnames that start with "Home"
63
64
65 # Now concat home and away player data
66 676 5114022.0 7565.1 4.0 Home_away = pd.concat([Home, Away], axis=1)
67 676 9702001.0 14352.1 7.6 Home_away.drop(['Away_player','Home_player'],inplace=True,axis=1)
68
69 # Take the mean of all columns
70 676 12171184.0 18004.7 9.6 Home_away = pd.DataFrame(Home_away.mean().to_dict(),index=[0])
71
72 # Now again add home team and away team to dataframe
73 676 5112558.0 7563.0 4.0 Home_away["Home_player"] = Home_player
74 676 4880017.0 7219.0 3.8 Home_away["Away_player"] = Away_player
75
76 676 791718.0 1171.2 0.6 winner = last_game.loc[0,"Winner"]
77 676 696925.0 1031.0 0.5 date = last_game.loc[0,"Date"]
78 676 5142111.0 7606.7 4.1 Home_away['Winner'] = winner
79 676 9630466.0 14246.3 7.6 Home_away['Date'] = date
80
81 676 16125.0 23.9 0.0 final_df.append(Home_away)
82 1 5088063.0 5088063.0 4.0 final_df = pd.concat(final_df, axis=0)
83 1 18424.0 18424.0 0.0 final_df = final_df[['Date','Home_player','Away_player','Home_kicks','Away_kicks','Home_strikes','Away_strikes','Winner']]
IIUC, you can obtain the last 5 game statistics, including the current one by:
# Replace this with your statistic columns
stat_cols = data.columns[4:]
total_games = 5
data.groupby(['League', 'Home_player', 'Away_player'])[stat_cols].rolling(total_games).mean()
If you want to exclude the current one:
last_stats = data.groupby(['League','Home_player', 'Away_player']).apply(lambda x: x[stat_cols].shift().rolling(total_games).mean())
This last_stats data frame should have the same index as the original one, so you can do:
train_data = data.copy()
# backup the actual outcome
train_data['Actual'] = train_data['Winner']
# copy the average statistics
train_data[stat_cols] = last_stats
All together should not take more than 1 min.
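The key idea, groupby + shift + rolling mean, on a toy frame (made-up players and scores, not the question's data):

```python
import pandas as pd

# Toy data: each row is a game played by a player, in chronological order.
df = pd.DataFrame({'player': ['a', 'a', 'a', 'a', 'b', 'b'],
                   'score':  [10,  20,  30,  40,  5,   15]})

# For each row: mean of that player's previous 2 games, excluding the current one.
# shift() drops the current game out of the window; rolling(2).mean() averages
# the two games before it.
prev_mean = df.groupby('player')['score'].transform(
    lambda s: s.shift().rolling(2).mean())
print(prev_mean.tolist())  # → [nan, nan, 15.0, 25.0, nan, nan]
```

Rows without enough history come out as NaN, which replaces the explicit `if len(Home) < total_games: continue` checks in the loop version.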

How do I iterate over a DataFrame when apply won't work without a for loop?

I am trying to find the best way to apply my function to each individual row of a pandas DataFrame without using iterrows() or itertuples(). Note that I am pretty sure apply() will not work in this case.
Here the first 5 rows of the DataFrame that I'm working with:
In [2470]: home_df.head()
Out[2470]:
GameId GameId_real team FTHG FTAG homeElo awayElo homeGame
0 0 -1 Charlton 1.0 2.0 1500.0 1500.0 1
1 1 -1 Derby 2.0 1.0 1500.0 1500.0 1
2 2 -1 Leeds 2.0 0.0 1500.0 1500.0 1
3 3 -1 Leicester 0.0 5.0 1500.0 1500.0 1
4 4 -1 Liverpool 2.0 1.0 1500.0 1500.0 1
Here is my function and the code that I am currently using:
def wt_goals_elo(df, game_id_row, team_row):
    wt_goals = (df[(df.GameId < game_id_row) & (df.team == team_row)]
                .pipe(lambda df:
                      (df.awayElo * df.FTHG).sum() / df.awayElo.sum()))
    return wt_goals

game_id_idx = home_df.columns.get_loc('GameId')
team_idx = home_df.columns.get_loc('team')
wt_goals = [wt_goals_elo(home_df, row[game_id_idx + 1], row[team_idx + 1]) for row in home_df.itertuples()]
FTHG = Full time home goals.
I am basically trying to find the weighted average of full time home goals, weighted by away elo for previous games. I can do this using a for loop but am unable to do it using apply, as I need to refer to the original DataFrame to filter by GameId and team.
Any ideas?
Thanks so much in advance.
I believe you need:
def wt_goals_elo(game_id_row, team_row):
    print(game_id_row)
    wt_goals = (home_df[(home_df.GameId.shift() < game_id_row) &
                        (home_df.team.shift() == team_row)]
                .pipe(lambda df:
                      (df.awayElo * df.FTHG).sum() / df.awayElo.sum()))
    return wt_goals

home_df['w'] = home_df.apply(lambda x: wt_goals_elo(x['GameId'], x['team']), axis=1)
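The weighted average itself, isolated on made-up numbers (a toy stand-in for one team's prior games, not the real home_df):

```python
import pandas as pd

# Toy prior games for one team: away Elo as the weight, home goals as the value.
prev = pd.DataFrame({'awayElo': [1500.0, 1600.0, 1400.0],
                     'FTHG':    [2.0,    1.0,    3.0]})

# Elo-weighted mean of home goals: sum(w * x) / sum(w)
wt_goals = (prev.awayElo * prev.FTHG).sum() / prev.awayElo.sum()
print(wt_goals)  # (3000 + 1600 + 4200) / 4500 ≈ 1.956
```

Goals scored against stronger (higher-Elo) away sides pull the average harder, which is the point of weighting by awayElo.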
