I’m new to Pandas.
I have a data set of horse racing results. Example here:
RaceID RaceDate RaceMeet Position Horse Jockey Trainer RaceLength race win HorseWinPercentage
446252 01/01/2008 Southwell (AW) 1 clear reef tom mclaughlin jane chapple-hyam 3101 1 1 0
447019 14/01/2008 Southwell (AW) 5 clear reef tom mclaughlin jane chapple-hyam 2654 1 0 100
449057 21/02/2008 Southwell (AW) 2 clear reef tom mclaughlin jane chapple-hyam 3101 1 0 50
463805 26/08/2008 Chelmsford (AW) 6 clear reef tom mclaughlin jane chapple-hyam 3080 1 0 33.33333333
469220 27/11/2008 Chelmsford (AW) 3 clear reef tom mclaughlin jane chapple-hyam 3080 1 0 25
470195 11/12/2008 Chelmsford (AW) 5 clear reef tom mclaughlin jane chapple-hyam 3080 1 0 20
471052 26/12/2008 Wolhampton (AW) 1 clear reef andrea atzeni jane chapple-hyam 2690 1 1 16.66666667
471769 07/01/2009 Wolhampton (AW) 6 clear reef ian mongan jane chapple-hyam 2690 1 0 28.57142857
472137 13/01/2009 Chelmsford (AW) 2 clear reef jamie spencer jane chapple-hyam 3080 1 0 25
472213 20/01/2009 Southwell (AW) 5 clear reef jamie spencer jane chapple-hyam 2654 1 0 22.22222222
476595 25/03/2009 Kempton (AW) 4 clear reef pat cosgrave jane chapple-hyam 2639 1 0 20
477674 08/04/2009 Kempton (AW) 5 clear reef pat cosgrave jane chapple-hyam 2639 1 0 18.18181818
479098 21/04/2009 Kempton (AW) 3 clear reef andrea atzeni jane chapple-hyam 2639 1 0 16.66666667
492913 14/11/2009 Wolhampton (AW) 1 clear reef andrea atzeni jane chapple-hyam 3639 1 1 15.38461538
493720 25/11/2009 Kempton (AW) 3 clear reef andrea atzeni jane chapple-hyam 3518 1 0 21.42857143
495863 29/12/2009 Southwell (AW) 1 clear reef shane kelly jane chapple-hyam 3101 1 1 20
I want to be able to groupby() multiple columns to count up wins and create combination win percentages, or results at specific tracks and lengths.
When I just need to groupby a single column it works great:
df['horse_win_count'] = df.groupby(['Horse'])['win'].cumsum()
df['horse_race_count'] = df.groupby(['Horse'])['race'].cumsum()
df['HorseWinPercentage2'] = df['horse_win_count'] / df['horse_race_count'] * 100
df['HorseWinPercentage'] = df.groupby('Horse')['HorseWinPercentage2'].shift(+1)
However, when I need to groupby more than one column I get some really weird results.
For example, I want to create a win percentage for when a specific Jockey rides a specific Trainer's horse – groupby(['Jockey','Trainer']). Then I need to know the percentage as it changes for each individual row (race).
df['jt_win_count'] = df.groupby(['Jockey','Trainer'])['win'].cumsum()
df['jt_race_count'] = df.groupby(['Jockey','Trainer'])['race'].cumsum()
df['JTWinPercentage2'] = df['jt_win_count'] / df['jt_race_count'] * 100
df['JTWinPercentage'] = df.groupby(['Jockey','Trainer'])['JTWinPercentage2'].shift(+1)
df['JTWinPercentage'].fillna(0, inplace=True)
Or I want to count up the number of times a horse has won over that course and distance, so I need to groupby(['Horse','RaceMeet','RaceLength']):
df['CD'] = df.groupby(['RaceMeet','RaceLength','Horse'])['win'].cumsum()
df['CD'] = df.groupby(["RaceMeet","RaceLength","Horse"]).shift(+1)
I get results in the tens of thousands.
How can I groupby several columns, make a computation and shift the results back by one row while still grouped?
And even better, can you explain why my code above doesn't work? Like I say, I'm new to Pandas and keen to learn.
Cheers.
This question was already asked: Pandas DataFrame Groupby two columns and get counts, and here: python pandas groupby() result.
I do not really know what your goal is, though.
I guess you should first add another column with the combined key you want to group by, for example: df['jockeyTrainer'] = df['Jockey'] + df['Trainer']. Then you can use it to groupby. Or you can follow the information in the links.
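That said, here is a minimal sketch of the course-and-distance counter using the question's own column names (untested against your data). I suspect the weird numbers come from the second line of that snippet: it calls shift() on the whole grouped frame without selecting the 'CD' column, so the result is not the shifted win count you expect:
# cumulative wins per horse at this course and distance
df['CD'] = df.groupby(['RaceMeet', 'RaceLength', 'Horse'])['win'].cumsum()
# shift within the same groups so each row only reflects earlier races;
# the explicit ['CD'] selection is the important part
df['CD'] = df.groupby(['RaceMeet', 'RaceLength', 'Horse'])['CD'].shift(1)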
Related
I have a dataframe of actor names:
df1
actor_id actor_name
1 Brad Pitt
2 Nicole Kidman
3 Matthew Goode
4 Uma Thurman
5 Ethan Hawke
And another dataframe of movies that the actors were in:
df2
actor_id actor_movie movie_revenue_m
1 Once Upon a Time in Hollywood 150
2 Moulin Rouge 50
2 The Others 200
3 Stoker 75
4 Kill Bill 125
5 Gattaca 85
I want to merge the two dataframes together to show the actors with their movie names and movie revenues, so I use the merge function:
df3 = df1.merge(df2, on = 'actor_id', how = 'left')
df3
actor_id actor_name actor_movie movie_revenue
1 Brad Pitt Once Upon a Time in Hollywood 150
2 Nicole Kidman Moulin Rouge 50
2 Nicole Kidman The Others 200
3 Matthew Goode Stoker 75
4 Uma Thurman Kill Bill 125
5 Ethan Hawke Gattaca 85
But this pulls in all movies, so Nicole Kidman gets duplicated, and I only want to show one movie per actor. How can I merge the dataframes without "duplicating" my list of actors?
How would I merge the movie title that is alphabetically first?
How would I merge the movie title with the highest revenue?
Thank you!
One way is to continue with the merge and then filter the result set
movie title that is alphabetically first
# sort by name and movie, then pick the first row per actor
df3.sort_values(['actor_name','actor_movie']).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman Moulin Rouge 50
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85
movie title with the highest revenue
# sort by name and revenue (descending), group by actor, pick the first
df3.sort_values(['actor_name','movie_revenue'], ascending=[True, False]).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman The Others 200
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85
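If you'd rather not groupby at all, a sort plus drop_duplicates sketch should give the same one-row-per-actor result (shown here for the highest-revenue variant, reusing the merged df3 from the question):
# keep only the top-revenue row per actor, then restore actor order
df3.sort_values('movie_revenue', ascending=False).drop_duplicates('actor_id').sort_values('actor_id')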
Edited to add an easier-to-reproduce dataframe.
I have two dataframes that look something like this:
df1
import numpy as np
import pandas as pd

index = [0,1,2,3,4,5,6,7,8]
a = pd.Series(["John Smith", "John Smith", "John Smith", "Kobe Bryant", "Kobe Bryant", "Kobe Bryant", "Jeff Daniels", "Jeff Daniels", "Jeff Daniels"], index=index)
b = pd.Series(["7/29/2022", "8/7/2022", "8/29/2022", "7/9/2022", "7/29/2022", "8/9/2022", "7/28/2022", "8/8/2022", "8/28/2022"], index=index)
c = pd.Series([185, 187, 186.5, 212.5, 217.5, 220.5, 211.1, 210.5, 213], index=index)
d = pd.Series([np.nan] * 9, index=index)
df1 = pd.DataFrame(np.c_[a, b, c, d], columns=["Name", "Date", "Weight", "Goal"])
or df1 in this format:
Name          Date       Weight  Goal
John Smith    7/29/2022  185     NaN
John Smith    8/7/2022   187     NaN
John Smith    8/29/2022  186.5   NaN
Kobe Bryant   7/9/2022   212.5   NaN
Kobe Bryant   7/29/2022  217.5   NaN
Kobe Bryant   8/9/2022   220.5   NaN
Jeff Daniels  7/28/2022  211.1   NaN
Jeff Daniels  8/8/2022   210.5   NaN
Jeff Daniels  8/28/2022  213     NaN
df2
index = [0,1,2]
a = pd.Series(["John Smith", "Kobe Bryant", "Jeff Daniels"], index=index)
b = pd.Series([195, 230, 220], index=index)
df2 = pd.DataFrame(np.c_[a, b], columns=["Name", "Weight Goal"])
or df2 in this format:
Name          Weight Goal
John Smith    195
Kobe Bryant   230
Jeff Daniels  220
What I want to do is iterate through df1 and set the respective weight goal from df2 for each player... but I only want to do this for the August dates; I want to ignore the July dates.
I know that I shouldn't be using a for loop with a dataframe/pandas, but I think showing my mental thought process with one might make the intent of my code attempts clearer.
for player in df1['Name']:
df1 = df1.loc[(df1['Name'] == f'{player}') & (df1['Date'] > '8/1/2022')]
df1.at[df2['Name'] == f'{player}', 'Goal'] = (df2.loc[df2.Name == f'{player}']['Weight Goal'])
This just ends up delivering an empty dataframe and a SettingWithCopyWarning. I know this is not the right way to do this, but I thought it might help to direct me.
Thank You.
If I correctly understand the output you are after (Stack Overflow tip: it can be useful to provide a sample of your desired output to help people trying to answer your question), then this should work:
# make the Date column into datetime type so it is easier to filter on
df1 = df1.assign(Date=pd.to_datetime(df1.Date))
# separate out the august rows from the other months
df1_august = df1.loc[df1.Date.apply(lambda x: x.month == 8)]
df1_other_months = df1.loc[df1.Date.apply(lambda x: x.month != 8)]
# use a merge rather than a loop to get WeightGoal column in place
df1_august_merged = df1_august.merge(df2, on="Name")
# finally add the rows for the other months back in
final_df = pd.concat([df1_august_merged, df1_other_months])
print(final_df)
Name Date Weight Goal Weight Goal
0 John Smith 2022-08-07 187.0 NaN 195.0
1 John Smith 2022-08-29 186.5 NaN 195.0
2 Kobe Bryant 2022-08-09 220.5 NaN 230.0
3 Jeff Daniels 2022-08-08 210.5 NaN 220.0
4 Jeff Daniels 2022-08-28 213.0 NaN 220.0
0 John Smith 2022-07-29 185.0 NaN NaN
3 Kobe Bryant 2022-07-09 212.5 NaN NaN
4 Kobe Bryant 2022-07-29 217.5 NaN NaN
6 Jeff Daniels 2022-07-28 211.1 NaN NaN
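One optional cleanup, sketched under the assumption you keep the final_df built above: the original all-NaN Goal column survives the merge next to the new 'Weight Goal' column, so you may want to drop it and rename:
# drop the stale Goal column and promote the merged 'Weight Goal'
final_df = final_df.drop(columns='Goal').rename(columns={'Weight Goal': 'Goal'})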
Right now, I am working with this dataframe:
Name   DateSolved  Points
Jimmy  12/3        100
Tim    12/4        50
Jo     12/5        25
Jonny  12/5        25
Jimmy  12/8        10
Tim    12/8        10
At this moment, if there are duplicate names in the dataset, I just drop the oldest one (by date) from the dataframe by utilizing df.sort_values('DateSolved').drop_duplicates('Name', keep='last') leading to a dataset like this
Name   DateSolved  Points
Jo     12/5        25
Jonny  12/5        25
Jimmy  12/8        10
Tim    12/8        10
However, instead of dropping the oldest one, I wish to keep it but give it a 50% points reduction. Something like this
Name   DateSolved  Points
Jimmy  12/3        50 (-50%)
Tim    12/4        25 (-50%)
Jo     12/5        25
Jonny  12/5        25
Jimmy  12/8        10
Tim    12/8        10
How could I go about doing this? I cannot find a way to both find the duplicates based on "Name" and change the value of the "Points" column in the same row.
Thank you!
IIUC, use DataFrame.duplicated to select all duplicates except the last, then select the Points column and divide by 2:
df.loc[df.duplicated('Name', keep='last'), 'Points'] /= 2
print (df)
Name DateSolved Points
0 Jimmy 12/3 50.0
1 Tim 12/4 25.0
2 Jo 12/5 25.0
3 Jonny 12/5 25.0
4 Jimmy 12/8 10.0
5 Tim 12/8 10.0
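One caveat: duplicated(keep='last') works on row position, so this assumes the frame is already in date order. If it might not be, sorting first keeps 'last' meaning 'newest' (a small sketch):
# make sure the newest solve is the bottom row before halving older ones
df = df.sort_values('DateSolved')
df.loc[df.duplicated('Name', keep='last'), 'Points'] /= 2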
I have an input dataframe like the following:
NAME TEXT START END
Tim Tim Wagner is a teacher. 10 20.5
Tim He is from Cleveland, Ohio. 20.5 40
Frank Frank is a musician. 40 50
Tim He like to travel with his family 50 62
Frank He is a performing artist who plays the cello. 62 70
Frank He performed at the Carnegie Hall last year. 70 85
Frank It was fantastic listening to him. 85 90
Frank I really enjoyed 90 93
I want an output dataframe as follows:
NAME TEXT START END
Tim Tim Wagner is a teacher. He is from Cleveland, Ohio. 10 40
Frank Frank is a musician. 40 50
Tim He like to travel with his family 50 62
Frank He is a performing artist who plays the cello. He performed at the Carnegie Hall last year. 62 85
Frank It was fantastic listening to him. I really enjoyed 85 93
My current code:
grp = (df['NAME'] != df['NAME'].shift()).cumsum().rename('group')
df.groupby(['NAME', grp], sort=False)[['TEXT','START','END']]\
    .agg({'TEXT': lambda x: ' '.join(x), 'START': 'min', 'END': 'max'})\
    .reset_index().drop('group', axis=1)
This combines the last 4 rows into one. Instead I want to combine only 2 rows (say any n rows) even if the 'NAME' has the same value.
Appreciate your help on this.
Thanks
You can group by the blocks plus a cumulative count within each block, capping every chunk at two (or any n) rows:
blocks = df.NAME.ne(df.NAME.shift()).cumsum()
(df.groupby([blocks, df.groupby(blocks).cumcount()//2])
.agg({'NAME':'first', 'TEXT':' '.join,
'START':'min', 'END':'max'})
)
Output:
NAME TEXT START END
NAME
1 0 Tim Tim Wagner is a teacher. He is from Cleveland,... 10.0 40.0
2 0 Frank Frank is a musician. 40.0 50.0
3 0 Tim He like to travel with his family 50.0 62.0
4 0 Frank He is a performing artist who plays the cello.... 62.0 85.0
1 Frank It was fantastic listening to him. I really en... 85.0 93.0
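If you don't want the block labels left in the index, resetting it at the end flattens the result (a small optional addition to the snippet above):
out = (df.groupby([blocks, df.groupby(blocks).cumcount() // 2])
         .agg({'NAME': 'first', 'TEXT': ' '.join, 'START': 'min', 'END': 'max'})
         .reset_index(drop=True))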
I have a multiindex dataframe like this:
Distance
Company Driver Document_id
Salt Fred 1 592.0
2 550.0
John 3 961.0
4 346.0
Bricks James 10 244.0
20 303.0
30 811.0
Fred 40 449.0
James 501 265.0
Sand Donald 15 378.0
800 359.0
How can I slice that df to see only the drivers who worked for different companies? The result should look like this:
Distance
Company Driver Document_id
Salt Fred 1 592.0
2 550.0
Bricks Fred 40 449.0
UPD: My original dataframe is 400k rows long, so I can't just slice it by index. I'm trying to find a general solution to problems like these.
To get the number of unique companies a person has worked for, use groupby and nunique:
v = (df.index.get_level_values(0)
.to_series()
.groupby(df.index.get_level_values(1))
.nunique())
# Alternative involving resetting the index, may not be as efficient.
# v = df.reset_index().groupby('Driver').Company.nunique()
v
Driver
Donald 1
Fred 2
James 1
John 1
Name: Company, dtype: int64
Now, you can run a query:
names = v[v.gt(1)].index.tolist()
df.query("Driver in #names")
Distance
Company Driver Document_id
Salt Fred 1 592.0
2 550.0
Bricks Fred 40 449.0
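For a single-expression alternative (a sketch on the same multiindex frame; transform('nunique') broadcasts each driver's company count back to every row):
# mark rows whose driver appears under more than one company
mask = (df.reset_index()
          .groupby('Driver')['Company']
          .transform('nunique')
          .gt(1)
          .to_numpy())
df[mask]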