I have a DataFrame similar to the one below:
import pandas as pd

absences_df = pd.DataFrame({
    'PersonNumber': ['1234', '1234', '1234', '5678', '5678', '5678', '997', '998', '998'],
    'Start': ['2022-03-07', '2022-03-08', '2022-03-09', '2022-03-09', '2022-03-10',
              '2022-03-11', '2022-03-07', '2022-03-07', '2022-03-08'],
    'End': ['2022-03-07', '2022-03-08', '2022-03-09', '2022-03-09', '2022-03-10',
            '2022-03-11', '2022-03-07', '2022-03-07', '2022-03-08'],
    'hours': ['1', '1', '1', '1', '2', '2', '3.5', '1', '2'],
})
I have another DataFrame like the one below:
input_df = pd.DataFrame({
    'PersonNumber': ['1234', '5678', '997', '998'],
    'W03': ['1.0', '11.0', '1.0', '22.0'],
    'W3_5': ['2.0', '12.0', '2.0', '23.0'],
    'W04': ['3.0', '13.0', '3.0', '24.0'],
    'W4_5': ['4.0', '14.0', '4.0', '25.0'],
    'W05': ['5.0', '15.0', '5.0', '26.0'],
    'W5_5': ['0.0', '16.0', '6.0', '27.0'],
    'W06': ['0.0', '17.0', '7.0', '28.0'],
    'W6_5': ['6.0', '18.0', '8.0', '29.0'],
    'W07': ['7.0', '19.0', '9.0', '0.0'],
    'W7_5': ['8.0', '0.0', '10.0', '0.0'],
    'W08': ['9.0', '0.0', '11.0', '31.0'],
    'W8_5': ['10.0', '0.0', '12.0', '32.0'],
    'W09': ['11.0', '22.0', '13.0', '34.0'],
})
I want to offset the row values in my second DataFrame (input_df) based on the value present in the "hours" column of my first DataFrame (absences_df). After offsetting, the last value should be repeated for the remaining columns.
I want an output similar to the one below.
output_df = pd.DataFrame({
    'PersonNumber': ['1234', '5678', '997', '998'],
    'W03': ['0.0', '0.0', '7.0', '27.0'],
    'W3_5': ['0.0', '0.0', '8.0', '28.0'],
    'W04': ['6.0', '0.0', '9.0', '29.0'],
    'W4_5': ['7.0', '22.0', '10.0', '0.0'],
    'W05': ['8.0', '22.0', '11.0', '0.0'],
    'W5_5': ['9.0', '22.0', '12.0', '31.0'],
    'W06': ['10.0', '22.0', '13.0', '32.0'],
    'W6_5': ['11.0', '22.0', '13.0', '34.0'],
    'W07': ['11.0', '22.0', '13.0', '34.0'],
    'W7_5': ['11.0', '22.0', '13.0', '34.0'],
    'W08': ['11.0', '22.0', '13.0', '34.0'],
    'W8_5': ['11.0', '22.0', '13.0', '34.0'],
    'W09': ['11.0', '22.0', '13.0', '34.0'],
})
Simply put (a sketch of this calculation follows the list):
1) Employee 1234 is absent for 3 days and the sum of their daily hours is 3 (1+1+1). So 3 (total hours) + 2 (common for everyone) = 5, and the offset starts from W5_5.
2) Employee 5678 is absent for 3 days and the sum of their daily hours is 5 (1+2+2). So 5 + 2 = 7, and the offset starts from W7_5.
3) Employee 997 is absent for 1 day with 3.5 hours. So 3.5 + 2 = 5.5, and the offset starts from W06.
4) Employee 998 is absent for 2 days and the sum of their daily hours is 3 (1+2). So 3 + 2 = 5, and the offset starts from W5_5.
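A minimal sketch of that arithmetic, assuming only that the string "hours" column is cast to float:

totals = absences_df['hours'].astype(float).groupby(absences_df['PersonNumber']).sum() + 2
print(totals)
# PersonNumber
# 1234    5.0
# 5678    7.0
# 997     5.5
# 998     5.0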
I have tried using shift() and a few other ways, but nothing helped.
Posting what I have tried here:
A = absences_df['PersonNumber'].value_counts()
dfNew_employee = []
dfNew_repeat_time = []
dfNew_Individual_hrs = []
df_new_average_hours = []
dfNew_total_hrs = []
for i in A.index:
    individual_employee = absences_df.loc[absences_df['PersonNumber'] == i]
    # the sample frame's column is 'hours' (was 'Duration' in my real data)
    hr_per_day = individual_employee['hours'].iloc[0]
    dfNew_employee.append(i)
    dfNew_repeat_time.append(A[i])
    dfNew_Individual_hrs.append(hr_per_day)
    total_hrs = individual_employee['hours'].astype(float).sum() + 2
    dfNew_total_hrs.append(str(total_hrs))
    df_new_average_hours.append(str((float(hr_per_day) * int(A[i])) + 2))
    print('employee id:', i, '; Repeated:', A[i], '; Hours=', hr_per_day,
          '; Total hours=', total_hrs)

main_cnt = 0
b = input_df.copy()  # was weekly_penality_df in my real code
df_final = pd.DataFrame(columns=b.columns)
for k in dfNew_employee:
    i = dfNew_total_hrs[main_cnt]
    i = int(float(i) * 2) - 5
    print(i)
    a = b[b['PersonNumber'] == str(k)]
    if a.shape[0] == 0:
        print(main_cnt)
        continue
    a_ref_index = a.index.values.astype(int)[0]
    a1 = b[['PersonNumber']].copy()
    a2 = b.copy()
    a2.drop(['PersonNumber'], axis=1, inplace=True)
    a21 = a2.iloc[[a_ref_index], :].copy()
    a21.dropna(axis=1, inplace=True)
    a21_last_value = a21[a21.columns[-1]]
    a2.iloc[[a_ref_index], :] = a2.iloc[[a_ref_index], :].shift(
        i * -1, axis=1, fill_value=float(a21_last_value))
    a3 = pd.concat([a1, a2], axis=1)
    temp = a3[a3['PersonNumber'] == str(k)]
    b.loc[temp.index, :] = temp[:]
    a3 = a3.reset_index(drop=True)
    main_cnt = main_cnt + 1
Please help me with any easier/simpler solution.
Thanks in advance
This is the function to get the offset (the position of the target column) from absences_df:
def get_offset_amount(person_number):
    # Calculate the sum of all the absent hours for a particular person
    offset = absences_df[absences_df['PersonNumber'] == person_number]['hours'].astype(float).sum()
    # If the sum is zero, then there is no change in the output DataFrame
    if offset == 0:
        return 0
    # Adding 2 as per your requirement
    offset += 2
    # Creating the column name
    if offset.is_integer():
        column_name = 'W{offset}_5'.format(offset=int(offset))
    else:
        column_name = 'W0{offset}'.format(offset=int(offset + 1))
    # Fetching the column number using the column name just created
    return input_df.columns.tolist().index(column_name)
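A quick check against the sample frames above (the returned value is a position in input_df.columns, with PersonNumber at position 0, and is used directly as the shift amount below):

get_offset_amount('1234')  # 3 + 2 = 5    -> 'W5_5' -> column position 6
get_offset_amount('997')   # 3.5 + 2 = 5.5 -> 'W06' -> column position 7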
Iterating over the input DataFrame and creating the offset list, using the same shift function from your attempt:
output_lst = []
for person_number in input_df['PersonNumber']:
    shift_amount = get_offset_amount(person_number)
    last_value = input_df[input_df['PersonNumber'] == person_number].iloc[0, -1]
    lst = input_df[input_df['PersonNumber'] == person_number] \
        .shift(periods=shift_amount * -1, axis=1, fill_value=last_value) \
        .iloc[0, :].tolist()[:-1]
    new_lst = [person_number, *lst]
    output_lst.append(new_lst)
output_df = pd.DataFrame(output_lst)
output_df.columns = input_df.columns
output_df:
PersonNumber W03 W3_5 W04 W4_5 W05 W5_5 W06 W6_5 W07 W7_5 \
0 1234 0.0 0.0 6.0 7.0 8.0 9.0 10.0 11.0 11.0 11.0
1 5678 0.0 0.0 0.0 22.0 22.0 22.0 22.0 22.0 22.0 22.0
2 997 7.0 8.0 9.0 10.0 11.0 12.0 13.0 13.0 13.0 13.0
3 998 27.0 28.0 29.0 0.0 0.0 31.0 32.0 34.0 34.0 34.0
W08 W8_5 W09
0 11.0 11.0 11.0
1 22.0 22.0 22.0
2 13.0 13.0 13.0
3 34.0 34.0 34.0
I am new to Python and pandas, so my question may be silly.
Problem:
So I have two DataFrames, let's say df1 and df2, where
df1 is like:
treatment1 treatment2 value comparision test adjustment statsig p_value
0 Treatment Control 0.795953 Treatment:Control t-test Benjamini-Hochberg False 0.795953
1 Treatment2 Control 0.795953 Treatment2:Control t-test Benjamini-Hochberg False 0.795953
2 Treatment2 Treatment 0.795953 Treatment2:Treatment t-test Benjamini-Hochberg False 0.795953
and df2 is like
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
.. ... ...
336 Treatment3 35.0
337 Treatment3 9.0
338 Treatment3 35.0
339 Treatment3 9.0
340 Treatment3 35.0
I want to add a column mean_percentage_lift in df1 where
lift_mean_percentage = (mean(treatment1)/mean(treatment2) -1) * 100
where `treatment1` and `treatment2` can be anything in `[Treatment, Control, Treatment2]`
My Approach:
I am using the assign function of the DataFrame:
df1.assign(mean_percentage_lift = lambda dataframe: lift_mean_percentage(df2, dataframe['treatment1'], dataframe['treatment2']))
where
group_type_col = 'group_type'  # the grouping column in df2

def lift_mean_percentage(df, treatment1, treatment2):
    treatment1_data = df[df[group_type_col] == treatment1]
    treatment2_data = df[df[group_type_col] == treatment2]
    mean1 = treatment1_data['metric'].mean()
    mean2 = treatment2_data['metric'].mean()
    return (mean1 / mean2 - 1) * 100
But I am getting the error Can only compare identically-labeled Series objects on the line
treatment1_data = df[df[group_type_col] == treatment1]. Is there something I am doing wrong? Is there an alternative to this?
For dataframe df2:
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
5 Treatment3 35.0
6 Treatment3 9.0
7 Treatment 35.0
8 Treatment3 9.0
9 Control 5.0
You can try:
def lift_mean_percentage(df, T1, T2):
    treatment1 = df['metric'][df['group_type'] == T1].mean()
    treatment2 = df['metric'][df['group_type'] == T2].mean()
    return (treatment1 / treatment2 - 1) * 100
Running:
lift_mean_percentage(df2,'Treatment2','Control')
The result:
260.8695652173913
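For context, the error in the question arises because assign passes the whole DataFrame to the lambda, so dataframe['treatment1'] is a Series rather than a single label, and comparing it with the differently-indexed df2['group_type'] fails. To fill the new column row by row, a sketch using apply with axis=1 so the function receives scalars:

df1['mean_percentage_lift'] = df1.apply(
    lambda row: lift_mean_percentage(df2, row['treatment1'], row['treatment2']),
    axis=1)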
I have a Series (of length 201) created from reading a .xlsx spreadsheet, as follows:
import pandas

xl = pandas.ExcelFile(file)
data = xl.parse('Sheet1')
data.columns = ["a", "b", "c", "d", "e", "f", "g", "h"]
A = data.a
So I am working with A, and if I print(A) I get:
0       76.0
1      190.0
2        0.0
3       86.0
4        0.0
        ...
196    156.0
197      0.0
198      0.0
199    320.0
200      0.0
Name: Vazi, Length: 201, dtype: float64
I want to iterate through A, find all the values >= 180, and make a new array (or Series) that holds the value minus 180 where A >= 180 and the original value otherwise. I tried to do it with a loop along these lines:
nx = len(A)
A_new = A.copy()
for i in range(nx):
    if A[i] >= 180:
        A_new[i] = A[i] - 180
    else:
        A_new[i] = A[i]
Use Series.mask / Series.where:
new_s = s.mask(s.ge(180),s.sub(180))
#new_s = s.sub(180).where(s.ge(180),s) #or series.where
or np.where:
new_s = pd.Series(data = np.where(s.ge(180),s.sub(180),s),
index = s.index,
name = s.name)
We could also use Series.loc
new_s = s.copy()
new_s.loc[s.ge(180)] = s.sub(180)
new_s output:
0       76.0
1       10.0
2        0.0
3       86.0
4        0.0
        ...
196    156.0
197      0.0
198      0.0
199    140.0
200      0.0
Name: Vazi, Length: 201, dtype: float64
I'm trying to predict the outcome of sports games and therefore want to transform my DataFrame in such a way that I can train a model. Currently I use a for loop to loop through all played games, pick the two players of each game, and check how they performed in the x games before the actual game took place. I then take the mean of the statistics of those previous games for both players and concatenate them, and finally add the true outcome of the actual game so I can train a model on it.
Now I have some performance issues: my current code takes about 9 minutes to complete for 20,000 games (with ~200 variables). I already managed to go from 20 to 9 minutes.
I started by appending each game to a DataFrame; later I changed this to appending each separate DataFrame to a list and building one big DataFrame from that list at the end.
I also included if-statements that make the loop continue if a player did not play at least x games.
I expect this can be much faster than 9 minutes.
Hope you guys can help me!
import pandas as pd
import numpy as np
import random
import string
letters = list(string.ascii_lowercase)
datelist = pd.date_range(start='1/1/2017', end='1/1/2019')
data = pd.DataFrame({'Date':np.random.choice(datelist,5000),
'League': np.random.choice(['LeagueA','LeagueB'], 5000),
'Home_player':np.random.choice(letters, 5000),
'Away_player':np.random.choice(letters, 5000),
'Home_strikes':np.random.randint(1,20,5000),
'Home_kicks':np.random.randint(1,20,5000),
'Away_strikes':np.random.randint(1,20,5000),
'Away_kicks':np.random.randint(1,20,5000),
'Winner':np.random.randint(0,2,5000)})
leagues = list(data['League'].unique())
home_columns = [col for col in data if col.startswith('Home')]
away_columns = [col for col in data if col.startswith('Away')]
# Determine how many previous games (x) to take statistics from
total_games = 5
final_df = []
# Make subframe of league
for league in leagues:
    league_data = data[data.League == league]
    league_data = league_data.sort_values(by='Date').reset_index(drop=True)
    # Pick the last game
    league_data = league_data.head(500)
    for i in range(0, len(league_data)):
        if i < 1:
            league_copy = league_data.sort_values(by='Date').reset_index(drop=True)
        else:
            league_copy = league_data[:-i].reset_index(drop=True)
        # Loop back from the last game
        last_game = league_copy.iloc[-1:].reset_index(drop=True)
        # Take home and away player
        Home_player = last_game.loc[0, "Home_player"]  # Pick home team
        Away_player = last_game.loc[0, 'Away_player']  # Pick away team
        # Remove last row so the current game is not picked
        df = league_copy[:-1]
        # Now check the statistics of the games before this game was played
        Home = df[df.Home_player == Home_player].tail(total_games)  # Pick data from home team
        # If the player did not play at least x games, continue
        if len(Home) < total_games:
            continue
        else:
            Home = Home[home_columns].reset_index(drop=True)  # Pick all column names that start with "Home"
        # Do the same for the away team
        Away = df[df.Away_player == Away_player].tail(total_games)  # Pick data from away team
        if len(Away) < total_games:
            continue
        else:
            Away = Away[away_columns].reset_index(drop=True)  # Pick all column names that start with "Away"
        # Now concat home and away player data
        Home_away = pd.concat([Home, Away], axis=1)
        Home_away.drop(['Away_player', 'Home_player'], inplace=True, axis=1)
        # Take the mean of all columns
        Home_away = pd.DataFrame(Home_away.mean().to_dict(), index=[0])
        # Now again add home team and away team to the dataframe
        Home_away["Home_player"] = Home_player
        Home_away["Away_player"] = Away_player
        winner = last_game.loc[0, "Winner"]
        date = last_game.loc[0, "Date"]
        Home_away['Winner'] = winner
        Home_away['Date'] = date
        final_df.append(Home_away)

final_df = pd.concat(final_df, axis=0)
final_df = final_df[['Date', 'Home_player', 'Away_player', 'Home_kicks',
                     'Away_kicks', 'Home_strikes', 'Away_strikes', 'Winner']]
This doesn't answer your question but you can leverage the package line_profiler to find the slow parts of your code.
Resource:
http://gouthamanbalaraman.com/blog/profiling-python-jupyter-notebooks.html
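For example, in a Jupyter notebook (a sketch: %lprun profiles a function, so assume the loop above has been wrapped in a hypothetical build_final_df(data)):

%load_ext line_profiler
%lprun -f build_final_df build_final_df(data)

That yields a per-line timing listing like the one below; here the row filters (df[df.Home_player == Home_player]) and the per-iteration column selections account for the biggest shares: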
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 1 35.0 35.0 0.0 letters = list(string.ascii_lowercase)
3 1 11052.0 11052.0 0.0 datelist = pd.date_range(start='1/1/2017', end='1/1/2019')
4
5 1 3483.0 3483.0 0.0 data = pd.DataFrame({'Date':np.random.choice(datelist,5000),
6 1 1464.0 1464.0 0.0 'League': np.random.choice(['LeagueA','LeagueB'], 5000),
7 1 2532.0 2532.0 0.0 'Home_player':np.random.choice(letters, 5000),
8 1 1019.0 1019.0 0.0 'Away_player':np.random.choice(letters, 5000),
9 1 693.0 693.0 0.0 'Home_strikes':np.random.randint(1,20,5000),
10 1 682.0 682.0 0.0 'Home_kicks':np.random.randint(1,20,5000),
11 1 682.0 682.0 0.0 'Away_strikes':np.random.randint(1,20,5000),
12 1 731.0 731.0 0.0 'Away_kicks':np.random.randint(1,20,5000),
13 1 40409.0 40409.0 0.0 'Winner':np.random.randint(0,2,5000)})
14
15 1 6560.0 6560.0 0.0 leagues = list(data['League'].unique())
16 1 439.0 439.0 0.0 home_columns = [col for col in data if col.startswith('Home')]
17 1 282.0 282.0 0.0 away_columns = [col for col in data if col.startswith('Away')]
18
19 # Determine to how many last x games to take statistics
20 1 11.0 11.0 0.0 total_games = 5
21 1 12.0 12.0 0.0 final_df = []
22
23 # Make subframe of league
24 3 38.0 12.7 0.0 for league in leagues:
25
26 2 34381.0 17190.5 0.0 league_data = data[data.League == league]
27 2 30815.0 15407.5 0.0 league_data = league_data.sort_values(by='Date').reset_index(drop=True)
28 # Pick the last game
29 2 5045.0 2522.5 0.0 league_data = league_data.head(500)
30 1002 14202.0 14.2 0.0 for i in range(0,len(league_data)):
31 1000 11943.0 11.9 0.0 if i < 1:
32 2 28407.0 14203.5 0.0 league_copy = league_data.sort_values(by='Date').reset_index(drop=True)
33 else:
34 998 5305364.0 5316.0 4.2 league_copy = league_data[:-i].reset_index(drop=True)
35
36 # Loop back from the last game
37 1000 4945240.0 4945.2 3.9 last_game = league_copy.iloc[-1:].reset_index(drop=True)
38
39 # Take home and away player
40 1000 1504055.0 1504.1 1.2 Home_player = last_game.loc[0,"Home_player"] # Pick home team
41 1000 899081.0 899.1 0.7 Away_player = last_game.loc[0,'Away_player'] # pick away team
42
43 # # Remove last row so current game is not picked
44 1000 2539351.0 2539.4 2.0 df = league_copy[:-1]
45
46 # Now check the statistics of the games befóre this game was played
47 1000 16428854.0 16428.9 13.0 Home = df[df.Home_player == Home_player].tail(total_games) # Pick data from home team
48
49 # If the player did not play at least x number of games, then continue
50 1000 49133.0 49.1 0.0 if len(Home) < total_games:
51 260 2867.0 11.0 0.0 continue
52 else:
53 740 12968016.0 17524.3 10.2 Home = Home[home_columns].reset_index(drop=True) # Pick all columnnames that start with "Home"
54
55
56 # Do the same for the away team
57 740 12007650.0 16226.6 9.5 Away = df[df.Away_player == Away_player].tail(total_games) # Pick data from home team
58
59 740 33357.0 45.1 0.0 if len(Away) < total_games:
60 64 825.0 12.9 0.0 continue
61 else:
62 676 11598741.0 17157.9 9.1 Away = Away[away_columns].reset_index(drop=True) # Pick all columnnames that start with "Home"
63
64
65 # Now concat home and away player data
66 676 5114022.0 7565.1 4.0 Home_away = pd.concat([Home, Away], axis=1)
67 676 9702001.0 14352.1 7.6 Home_away.drop(['Away_player','Home_player'],inplace=True,axis=1)
68
69 # Take the mean of all columns
70 676 12171184.0 18004.7 9.6 Home_away = pd.DataFrame(Home_away.mean().to_dict(),index=[0])
71
72 # Now again add home team and away team to dataframe
73 676 5112558.0 7563.0 4.0 Home_away["Home_player"] = Home_player
74 676 4880017.0 7219.0 3.8 Home_away["Away_player"] = Away_player
75
76 676 791718.0 1171.2 0.6 winner = last_game.loc[0,"Winner"]
77 676 696925.0 1031.0 0.5 date = last_game.loc[0,"Date"]
78 676 5142111.0 7606.7 4.1 Home_away['Winner'] = winner
79 676 9630466.0 14246.3 7.6 Home_away['Date'] = date
80
81 676 16125.0 23.9 0.0 final_df.append(Home_away)
82 1 5088063.0 5088063.0 4.0 final_df = pd.concat(final_df, axis=0)
83 1 18424.0 18424.0 0.0 final_df = final_df[['Date','Home_player','Away_player','Home_kicks','Away_kicks','Home_strikes','Away_strikes','Winner']]
IIUC, you can obtain the statistics of the last 5 games, including the current one, by:
# replace this with your statistic columns
stat_cols = data.columns[4:]
total_games = 5
data.groupby(['League','Home_player', 'Away_player'])[stat_cols].rolling(total_games).mean()
If you want to exclude the current one:
last_stats = data.groupby(['League','Home_player', 'Away_player']).apply(lambda x: x[stat_cols].shift().rolling(total_games).mean())
This last_stats DataFrame should have the same index as the original one, so you can do:
train_data = data.copy()
# backup the actual outcome
train_data['Actual'] = train_data['Winner']
# copy the average statistics
train_data[stat_cols] = last_stats
Altogether this should not take more than a minute.
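One caveat, as a sketch under the same assumptions: players with fewer than total_games prior games end up with NaN statistics in last_stats, which mirrors the continue checks in the original loop, so those rows can simply be dropped before training:

train_data = train_data.dropna(subset=stat_cols)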
I am trying to find the best way to apply my function to each individual row of a pandas DataFrame without using iterrows() or itertuples(). Note that I am pretty sure apply() will not work in this case.
Here the first 5 rows of the DataFrame that I'm working with:
In [2470]: home_df.head()
Out[2470]:
GameId GameId_real team FTHG FTAG homeElo awayElo homeGame
0 0 -1 Charlton 1.0 2.0 1500.0 1500.0 1
1 1 -1 Derby 2.0 1.0 1500.0 1500.0 1
2 2 -1 Leeds 2.0 0.0 1500.0 1500.0 1
3 3 -1 Leicester 0.0 5.0 1500.0 1500.0 1
4 4 -1 Liverpool 2.0 1.0 1500.0 1500.0 1
Here is my function and the code that I am currently using:
def wt_goals_elo(df, game_id_row, team_row):
    wt_goals = (df[(df.GameId < game_id_row) & (df.team == team_row)]
                .pipe(lambda df:
                      (df.awayElo * df.FTHG).sum() / df.awayElo.sum()))
    return wt_goals
game_id_idx = home_df.columns.get_loc('GameId')
team_idx = home_df.columns.get_loc('team')
wt_goals = [wt_goals_elo(home_df, row[game_id_idx + 1], row[team_idx + 1]) for row in home_df.itertuples()]
FTHG = Full time home goals.
I am basically trying to find the weighted average of full time home goals, weighted by away elo for previous games. I can do this using a for loop but am unable to do it using apply, as I need to refer to the original DataFrame to filter by GameId and team.
Any ideas?
Thanks so much in advance.
I believe you need:
def wt_goals_elo(game_id_row, team_row):
    print(game_id_row)
    wt_goals = (home_df[(home_df.GameId.shift() < game_id_row) &
                        (home_df.team.shift() == team_row)]
                .pipe(lambda df:
                      (df.awayElo * df.FTHG).sum() / df.awayElo.sum()))
    return wt_goals

home_df['w'] = home_df.apply(lambda x: wt_goals_elo(x['GameId'], x['team']), axis=1)
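If apply is still too slow, here is a vectorized sketch of the same weighted mean, under the assumptions that home_df is sorted by GameId and each team's GameIds are strictly increasing: per-team cumulative sums, shifted by one row to exclude the current game, reproduce the GameId < game_id_row filter without any row-wise work.

# Weighted mean of prior FTHG, weighted by awayElo, per team
home_df = home_df.sort_values('GameId')
num = (home_df['awayElo'] * home_df['FTHG']).groupby(home_df['team']).transform(
    lambda s: s.cumsum().shift())  # sum of awayElo*FTHG over the team's earlier games
den = home_df['awayElo'].groupby(home_df['team']).transform(
    lambda s: s.cumsum().shift())  # sum of awayElo over the team's earlier games
home_df['w'] = num / den  # a team's first game gives NaN, matching sum()/sum() over an empty selection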