Group by when condition is not just by row in Python Pandas

I have two dataframes: one at the lower level and one that summarizes the data at a higher level. I'm trying to add a new column to the summary table that sums the total spending of all people who are fans of a particular sport. I.e., in the summary row for soccer I do NOT want to sum the total soccer spending, but the total sports spending of anyone who spends anything on soccer.
df = pd.DataFrame({'Person': [1, 2, 3, 3, 3],
                   'Sport': ['Soccer', 'Tennis', 'Tennis', 'Football', 'Soccer'],
                   'Ticket_Cost': [10, 20, 10, 10, 20]})
df2 = pd.DataFrame({'Sport': ['Soccer', 'Tennis', 'Football']})
I can currently do this in many steps, but I'm sure there is a more efficient/quicker way. Here is how I currently do it.
# Calculate the total spend for each person in a temporary dataframe
df_intermediate = df.groupby(['Person'])['Ticket_Cost'].sum()
df_intermediate = df_intermediate.rename("Total_Sports_Spend")
Person  Total_Sports_Spend
1       10
2       20
3       40
# Place this total in the detailed table
df = pd.merge(df, df_intermediate, how='left', on='Person')
# Create a second temporary dataframe
df_intermediate2 = df.groupby(['Sport'])['Total_Sports_Spend'].sum()
Sport     Total_Sports_Spend
Football  40
Soccer    50
Tennis    60
# Merge this table with the summary table
df2 = pd.merge(df2, df_intermediate2, how='left', on='Sport')
   Sport     Total_Sports_Spend
0  Soccer    50
1  Tennis    60
2  Football  40
Finally, I clean up the temporary dataframes and remove the extra column from the detailed table. I'm sure there is a better way.

You might want to pivot your DataFrame into 2D:
df2 = df.pivot_table(index='Person', columns='Sport', values='Ticket_Cost')
You get
Sport   Football  Soccer  Tennis
Person
1            NaN    10.0     NaN
2            NaN     NaN    20.0
3           10.0    20.0    10.0
Now you can compute the total spending per person:
total = df2.sum(axis=1)
which is
Person
1    10.0
2    20.0
3    40.0
dtype: float64
Next you place each person's value from total in the cells of df2 where that person spent anything (NaN compares as False in df2 > 0):
df3 = (df2>0).mul(total, axis=0)
which is here:
Sport   Football  Soccer  Tennis
Person
1            0.0    10.0     0.0
2            0.0     0.0    20.0
3           40.0    40.0    40.0
Finally you just have to sum each column to get what you want:
spending = df3.sum(axis=0)
and you will get what you expect.
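Putting the answer's steps together, a minimal self-contained sketch with the question's sample data (the pivot is named wide here to avoid clobbering the summary df2):
import pandas as pd

df = pd.DataFrame({'Person': [1, 2, 3, 3, 3],
                   'Sport': ['Soccer', 'Tennis', 'Tennis', 'Football', 'Soccer'],
                   'Ticket_Cost': [10, 20, 10, 10, 20]})

# Pivot to a Person x Sport grid of ticket costs
wide = df.pivot_table(index='Person', columns='Sport', values='Ticket_Cost')
# Total spend per person, broadcast onto every sport that person buys
total = wide.sum(axis=1)
spending = (wide > 0).mul(total, axis=0).sum(axis=0)
print(spending)   # Football 40.0, Soccer 50.0, Tennis 60.0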

Related

How can I combine two newly created columns and also create a column for the division of both

I have the first column:
df_weight = df2.groupby(['Genre']).agg(total_weight=('weighted score', 'sum')).reset_index()
                  Genre  total_weight
0  Action and Adventure        1000.0
1   Classic and cult TV         500.0
and the second column:
df_TotalShow = df2.groupby(['Genre']).agg(total_shows=('No. of shows', 'sum')).reset_index()
                  Genre  total_shows
0  Action and Adventure        200.0
1   Classic and cult TV        150.0
I want to combine the two and make something similar to the below, but I am unsure of what the code should look like.
                  Genre  total_weight  total_shows
0  Action and Adventure        1000.0        200.0
1   Classic and cult TV         500.0        150.0
Next, I want to create another column with the division of total_weight / total_shows.
So far, I tried
df = df_weight['total_weight'].div(df_TotalShow['total_shows'])
but this gives me a new Series. Is there a way to make this its own column, with the final product looking something like
                  Genre  total_weight  total_shows   Avg
0  Action and Adventure        1000.0        200.0  5.00
1   Classic and cult TV         500.0        150.0  3.33
Use this code:
df = df2.groupby(['Genre']).agg({'No. of shows': 'sum', 'weighted score': 'sum'}).reset_index()
df['Avg'] = df['weighted score'] / df['No. of shows']
Replace the column names with your own. You only need to group and aggregate once for this problem, not twice.
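Alternatively, a sketch using named aggregation (pandas 0.25+) that produces the exact column names from the question in one pass; 'weighted score' and 'No. of shows' are assumed to be the source column names:
df = (df2.groupby('Genre')
         .agg(total_weight=('weighted score', 'sum'),
              total_shows=('No. of shows', 'sum'))
         .reset_index())
df['Avg'] = df['total_weight'] / df['total_shows']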

How do I get all values from one pandas column that correspond to a specific value in a second column?

I have a DataFrame that looks like this:
name age occ salary
0 Vinay 22.0 engineer 60000.0
1 Kushal NaN doctor 70000.0
2 Aman 24.0 engineer 80000.0
3 Rahul NaN doctor 65000.0
4 Ramesh 25.0 doctor 70000.0
and I'm trying to get all salary values that correspond to a specific occupation, to then compute the mean salary of that occupation.
Here is an answer in a few steps:
temp_df = df.loc[df['occ'] == 'engineer']
temp_df.salary.mean()
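The two steps can also be collapsed into a single expression:
df.loc[df['occ'] == 'engineer', 'salary'].mean()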
All averages at once:
df_averages = df[['occ', 'salary']].groupby('occ').mean()
                salary
occ
doctor    68333.333333
engineer  70000.000000

Python get value counts from multiple columns and average from another column

I have a dataframe with the following columns
Movie Rating Genre_0 Genre_1 Genre_2
MovieA 8.9 Action Comedy Family
MovieB 9.1 Horror NaN NaN
MovieC 4.4 Comedy Family Adventure
MovieD 7.7 Action Adventure NaN
MovieE 9.5 Adventure Comedy NaN
MovieF 7.5 Horror NaN NaN
MovieG 8.6 Horror NaN NaN
I'd like to get a dataframe which has value counts for each genre and the average rating across the times the genre appears
Genre value_count Average_Rating
Action 2 8.3
Comedy 3 7.6
Horror 3 8.4
Family 2 6.7
Adventure 3 7.2
I have tried the following code and am able to get the value counts. However, I am unable to get the average rating of each genre across the times it appears. Any form of help is much appreciated, thank you.
#create a list for the genre columns
genre_col = [col for col in df if col.startswith('Genre_')]
#get value counts of genres
genre_counts = df[genre_col].apply(pd.Series.value_counts).sum(1).to_frame(name='Count')
genre_counts.index.name = 'Genre'
genre_counts = genre_counts.reset_index()
You can .melt the dataframe, then group the melted frame on genre and aggregate using a dictionary that specifies the columns and their corresponding aggregation functions:
# filter and melt the dataframe
m = df.filter(regex=r'Rating|Genre').melt('Rating', value_name='Genre')
# group and aggregate
dct = {'Value_Count': ('Genre', 'count'), 'Average_Rating': ('Rating', 'mean')}
df_out = m.groupby('Genre', as_index=False).agg(**dct)
>>> df_out
Genre Value_Count Average_Rating
0 Action 2 8.30
1 Adventure 3 7.20
2 Comedy 3 7.60
3 Family 2 6.65
4 Horror 3 8.40
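Note that Family comes out as 6.65 rather than the question's rounded 6.7, since (8.9 + 4.4) / 2 = 6.65. For reference, a self-contained version of the above with the sample data rebuilt:
import pandas as pd

df = pd.DataFrame({
    'Movie': ['MovieA', 'MovieB', 'MovieC', 'MovieD', 'MovieE', 'MovieF', 'MovieG'],
    'Rating': [8.9, 9.1, 4.4, 7.7, 9.5, 7.5, 8.6],
    'Genre_0': ['Action', 'Horror', 'Comedy', 'Action', 'Adventure', 'Horror', 'Horror'],
    'Genre_1': ['Comedy', None, 'Family', 'Adventure', 'Comedy', None, None],
    'Genre_2': ['Family', None, 'Adventure', None, None, None, None],
})

m = df.filter(regex=r'Rating|Genre').melt('Rating', value_name='Genre')
dct = {'Value_Count': ('Genre', 'count'), 'Average_Rating': ('Rating', 'mean')}
print(m.groupby('Genre', as_index=False).agg(**dct))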
The process of encoding each genre to its value count is frequency encoding. It can be done with this code:
df_frequency_map = df.Genre_0.value_counts().to_dict()
df['Genre0_frequency_map'] = df.Genre_0.map(df_frequency_map)
To add the average rating as a feature, you can do the same thing, but compute the mean rating per genre before calling to_dict():
df_mean_map = df.groupby('Genre_0')['Rating'].mean().to_dict()
df['Genre0_mean_rating_map'] = df.Genre_0.map(df_mean_map)

Use Groupby to Calculate Average if Date < X

I am trying to use a data frame that includes historical game statistics like the below Df1, and build a second data frame that shows what the various column averages were going into each game (as I show in Df2). How can I use groupby or something else to find the various averages for each team, but only over games with a date prior to the date in that specific row? Example of the historical games data:
Df1 =
      Date    Team   Opponent  Points  Points Against  1st Downs  Win?
   4/16/20  Eagles     Ravens      10              20         10     0
   2/10/20  Eagles    Falcons      30              40          8     0
  12/15/19  Eagles  Cardinals      40              10          7     1
  11/15/19  Eagles     Giants      20              15          5     1
  10/12/19    Jets     Giants      10              18          2     1
Below is the dataframe that I'm trying to create. As you can see, it shows the averages for each column, but only over the games that happened prior to each game. Note: this is a simplified example of a much larger data set that I'm working with. In case the context helps, I'm creating this dataframe so I can analyze the correlation between the averages and whether the team won.
Df2 =
      Date    Team   Opponent  Avg Pts  Avg Pts Against  Avg 1st Downs  Win %
   4/16/20  Eagles     Ravens     25.0             21.3            7.5    75%
   2/10/20  Eagles    Falcons     30.0             12.0            6.0   100%
  12/15/19  Eagles  Cardinals     20.0             15.0            5.0   100%
  11/15/19  Eagles     Giants      NaN              NaN            NaN    NaN
  10/12/19    Jets     Giants      NaN              NaN            NaN    NaN
Let me know if anything above isn't clear, appreciate the help.
The easiest way is to turn your dataframe into a time series.
Run this for a file:
data = pd.read_csv(r'C:\Users\...csv', index_col='Date', parse_dates=True)
This is an example with a CSV file.
You can run this after:
data[:'cutoff-date']   # all rows up to and including the cutoff date
If you want to build a Series that is time-indexed:
index = pd.DatetimeIndex(['2014-07-04', ..., '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
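For illustration, a minimal sketch of that slicing, using hypothetical dates drawn from the question's data:
import pandas as pd

index = pd.DatetimeIndex(['2019-10-12', '2019-11-15', '2019-12-15', '2020-02-10'])
data = pd.Series([0, 1, 2, 3], index=index)
print(data[:'2019-12-15'])   # keeps rows dated 2019-10-12 through 2019-12-15, inclusive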
Define your own function:
import numpy as np

def aggs_under_date(df, date):
    first_team = df.Team.iloc[0]
    first_opponent = df.Opponent.iloc[0]
    if df.Date.iloc[0] <= date:
        avg_points = df.Points.mean()
        avg_against = df['Points Against'].mean()
        avg_downs = df['1st Downs'].mean()
        wins = df['Win?']
        win_perc = f'{wins.sum() / wins.count() * 100} %'
        return [first_team, first_opponent, avg_points, avg_against, avg_downs, win_perc]
    else:
        return [first_team, first_opponent, np.nan, np.nan, np.nan, np.nan]
And do the groupby, applying the function you just defined (note .apply rather than .agg, since the function needs the whole group, and Date must be datetime for the comparison):
Df1['Date'] = pd.to_datetime(Df1['Date'])
date_max = pd.to_datetime('11/15/19')
Df1.groupby(['Date']).apply(aggs_under_date, date_max)
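For the per-team "averages over strictly earlier games" the question asks about, here is a hedged sketch using expanding means, a different technique from the answers above; the column names are assumed from the question's Df1:
import pandas as pd

df1 = Df1.copy()
df1['Date'] = pd.to_datetime(df1['Date'])
df1 = df1.sort_values('Date')

stats = ['Points', 'Points Against', '1st Downs', 'Win?']
# Running mean within each team; shift(1) drops the current game so each
# row only reflects games played before it. Note 'Win?' averages to a
# 0-1 fraction rather than a percentage.
prior = df1.groupby('Team')[stats].transform(lambda s: s.expanding().mean().shift(1))
prior.columns = ['Avg Pts', 'Avg Pts Against', 'Avg 1st Downs', 'Win %']
df2 = pd.concat([df1[['Date', 'Team', 'Opponent']], prior], axis=1)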

Pandas compare the same columns between merged dfs

I have two dfs that look like the following:
Df1:
area team score
ontario team 1 60
ontario team 3 30
ontario team 2 50
new york team 1 90
new york team 2 30
Df2:
area team score
ontario team 1 60
ontario team 3 30
ontario team 2 50
new york team 1 90
new york team 2 70
If I do the following:
merge = pd.merge(df1, df2, on=['area', 'team'])
I get:
merge:
area team score_x score_y
ontario team 1 60 60
ontario team 3 30 30
ontario team 2 50 50
new york team 1 90 90
new york team 2 30 70
It can be noted that the score in the last row of both dfs is different.
I would like to find what the percent difference is in between score_x and score_y.
However, I actually have hundreds of metrics such as "score". How can I find the percent difference for every column of the merged df that had the same name before the merge, once the _x and _y suffixes are appended?
What's the best way to do this? I could get a list of the common keys, append _x and _y to each, and then go through the list checking the percent difference of both columns, but is there a better way?
Just set 'area' and 'team' as the frame index and do the "normal" math:
df1.set_index(['area','team'], inplace=True)
df2.set_index(['area','team'], inplace=True)
(df1 - df2) / df1
# score
#area team
#ontario team 1 0.000000
# team 3 0.000000
# team 2 0.000000
#new york team 1 0.000000
# team 2 -1.333333
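If you do want to work from the merged frame with its _x/_y suffixes, a minimal sketch of the loop the question describes (assuming the merge above):
# Metric columns are those that picked up an _x suffix during the merge.
metrics = [c[:-2] for c in merge.columns if c.endswith('_x')]
for m in metrics:
    merge[m + '_pct_diff'] = (merge[m + '_x'] - merge[m + '_y']) / merge[m + '_x']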
