I'm relatively new to Python and wasn't able to find an answer to my question.
Let's say I have saved a DataFrame into the variable movies. The DataFrame looks somewhat like this:
Genre1 Genre2 Genre3 sales
Fantasy Drama Romance 5
Action Fantasy Comedy 3
Comedy Drama ScienceFiction 4
Drama Romance Action 8
What I want to do is get the average sales for every unique genre that appears in any of the columns Genre1, Genre2 or Genre3.
I've tried a few different things. What I have right now is:
for x in pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel()):
    mean_genre = np.mean(movies['sales'])
    print(x, mean_genre)
What I get as a result is:
Fantasy 5.0
Drama 5.0
Romance 5.0
Action 5.0
Comedy 5.0
ScienceFiction 5.0
So it does get me the unique genres across the three columns, but it calculates the average of the whole sales column each time. How do I get it to calculate the average sales for every unique genre that appears in any of the three columns Genre1, Genre2 and Genre3? For example, for the genre 'Fantasy' it should use rows 1 and 2 to calculate the average sales.
Here is an even shorter version:
allGenre=pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
for genre in allGenre:
    # keep the rows where any column equals this genre, then average their sales
    print("%s : %f" % (genre, movies[movies.isin([genre]).any(axis=1)].sales.mean()))
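To see what the filter in that one-liner is doing, it can help to build the mask for a single genre by hand. The sketch below rebuilds the sample frame from the question (the names genre_cols and mask are only for illustration) and restricts isin to the three genre columns, so a value in sales can never match by accident.

```python
import pandas as pd

# Sample data copied from the question.
movies = pd.DataFrame({
    "Genre1": ["Fantasy", "Action", "Comedy", "Drama"],
    "Genre2": ["Drama", "Fantasy", "Drama", "Romance"],
    "Genre3": ["Romance", "Comedy", "ScienceFiction", "Action"],
    "sales": [5, 3, 4, 8],
})
genre_cols = ["Genre1", "Genre2", "Genre3"]

# True for every row in which any genre column equals "Fantasy".
mask = movies[genre_cols].isin(["Fantasy"]).any(axis=1)
print(mask.tolist())

# Average sales over the matching rows only.
fantasy_mean = movies.loc[mask, "sales"].mean()
print(fantasy_mean)
```

Only the first two rows match, so the mean is taken over sales of 5 and 3.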
I'm not sure this is exactly what you want to achieve, but it accumulates the sales value for each genre every time that genre is encountered:
all_genres = pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
dff = pd.DataFrame(columns=['Nb_sales', 'Nb_view'],
                   index=all_genres, data=0)
for col in ['Genre1', 'Genre2', 'Genre3']:
    for genre, value in zip(movies[col].values, movies['sales'].values):
        dff.loc[(genre, 'Nb_sales')] += value
        dff.loc[(genre, 'Nb_view')] += 1
Then you can compute the mean value :
>>> dff['Mean'] = dff.Nb_sales / dff.Nb_view
>>> dff
                Nb_sales  Nb_view      Mean
Romance               13        2  6.500000
Comedy                 7        2  3.500000
ScienceFiction         4        1  4.000000
Fantasy                8        2  4.000000
Drama                 17        3  5.666667
Action                11        2  5.500000
More compact solutions could be:
all_genres = pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
# give the empty Series an explicit float dtype so the means are stored as numbers
mean_series = pd.Series(index=all_genres, dtype=float)
for genre in all_genres:
    mean_series[genre] = movies.sales.loc[movies.eval(
        'Genre1 == "{0}" or Genre2 == "{0}" or Genre3 == "{0}"'
        .format(genre)).values].mean()
# Or in one (long) line:
mean_df = pd.DataFrame(columns=['Genre'], data=all_genres)
mean_df['mean'] = mean_df.Genre.apply(
    lambda x: movies.sales.loc[movies.eval(
        'Genre1 == "{0}" or Genre2 == "{0}" or Genre3 == "{0}"'
        .format(x)).values].mean())
Both give the result you are after:
>>> print(mean_series)
Fantasy 4.000000
Drama 5.666667
(....)
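The same per-genre means can probably also be obtained without an explicit loop, by reshaping the three genre columns into one long column and grouping on it. This is a minimal sketch on the sample data from the question; long_form and genre_means are illustrative names, not something from the answers above.

```python
import pandas as pd

# Sample data copied from the question.
movies = pd.DataFrame({
    "Genre1": ["Fantasy", "Action", "Comedy", "Drama"],
    "Genre2": ["Drama", "Fantasy", "Drama", "Romance"],
    "Genre3": ["Romance", "Comedy", "ScienceFiction", "Action"],
    "sales": [5, 3, 4, 8],
})

# One row per (movie, genre) pair; sales is repeated for each genre of a movie.
long_form = movies.melt(
    id_vars="sales",
    value_vars=["Genre1", "Genre2", "Genre3"],
    value_name="genre",
)

# Average the sales of every movie that lists the genre in any column.
genre_means = long_form.groupby("genre")["sales"].mean()
print(genre_means)
```

The index of genre_means comes back sorted alphabetically, so the order differs from the loop-based output even though the values are the same.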
I have a dataframe, df1, that shows the audience's rating and the genre of each movie:
movie_id  rating  action  comedy  drama
0         4       1       1       1
1         5       0       1       0
2         3       0       1       1
1 for action means it is an action movie, and 0 means it is not.
I extracted the average rating for a single genre. For action, for example, I did this:
new=df1[df1["action"]==1]
new['rating'].mean()
which shows 4. But now I have to extract the average rating for all genres, which should look like this:
action  comedy  drama
4       4       3.5
Any advice on how to approach this?
In your case we can select the genre columns, use where to turn every 0 into NaN, multiply by the rating column, and take the mean:
out = df.loc[:,['action','comedy','drama']].where(lambda x : x==1).mul(df.rating,axis=0).mean()
Out[377]:
action 4.0
comedy 4.0
drama 3.5
dtype: float64
If you would like a DataFrame instead:
out = out.to_frame().T
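To make the intermediate step visible, here is a small sketch of the same idea on the sample data from the question. The names genres, flags and ratings_by_genre are only illustrative; the point is that each cell ends up holding either the movie's rating or NaN, and mean skips the NaN cells.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "movie_id": [0, 1, 2],
    "rating": [4, 5, 3],
    "action": [1, 0, 0],
    "comedy": [1, 1, 1],
    "drama": [1, 0, 1],
})
genres = ["action", "comedy", "drama"]

# Keep the 1 flags and turn every 0 into NaN.
flags = df[genres].where(df[genres] == 1)

# Row-wise multiplication: a flagged cell becomes the movie's rating.
ratings_by_genre = flags.mul(df["rating"], axis=0)
print(ratings_by_genre)

# mean() ignores NaN, so each column averages only the movies in that genre.
out = ratings_by_genre.mean()
print(out)
```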
You can melt the genre columns and filter to only keep values equal to 1. Then group by the genres and calculate the mean.
pd.melt(
df,
value_vars=["action", "comedy", "drama"],
var_name="genre",
id_vars=["movie_id", "rating"],
).query("value == 1").groupby("genre")["rating"].mean()
which gives
genre
action 4.0
comedy 4.0
drama 3.5
Name: rating, dtype: float64
Multiply the rating column with action, comedy and drama columns, replace 0 with np.nan, and compute the mean:
(df.iloc[:, 2:]
.mul(df.rating, axis = 0)
# mean implicitly excludes nulls during computations
.replace(0, np.nan)
.mean()
)
action 4.0
comedy 4.0
drama 3.5
dtype: float64
The above returns a Series. If you want DataFrame-like output, pass mean to agg:
(df.iloc[:, 2:]
.mul(df.rating, axis = 0)
.replace(0, np.nan)
.agg(['mean']) # note the `mean` is in a list
)
action comedy drama
mean 4.0 4.0 3.5
I have a data frame like this, for example:
user  Top Genre
a     Horror
b     Romance
and I have a content-based table for genre, for example:
Genre    Rec     Rank
Horror   Action  1
Horror   Comedy  2
Romance  Asian   1
Romance  Comedy  2
I want to join the tables so that the output will be, for example:
User  Rec      Rank
a     Horror   1
a     Action   2
a     Comedy   3
b     Romance  1
b     Asian    2
b     Comedy   3
How can I process the two tables with pandas so that the output looks like the table above?
Use DataFrame.merge with a right join, then concatenate a copy of the renamed frame in which DataFrame.assign sets Rec to the top genre and Rank to 0. Sort by both columns and finally add 1 to Rank (pd.concat is used here because DataFrame.append was removed in pandas 2.0):
df11 = df1.rename(columns={'Top Genre':'Genre'})
df = pd.concat([df11.merge(df2, how='right'), df11.assign(Rec=df11['Genre'], Rank=0)])
df = df.sort_values(['user','Rank'], ignore_index=True)
df['Rank'] += 1
print(df)
user Genre Rec Rank
0 a Horror Horror 1
1 a Horror Action 2
2 a Horror Comedy 3
3 b Romance Romance 1
4 b Romance Asian 2
5 b Romance Comedy 3
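For an end-to-end check, the sketch below rebuilds the two sample tables from the question and trims the result to the three columns the question asks for. It uses pd.concat, since DataFrame.append is no longer available in pandas 2.0 and later; the names users, own, recs and result are illustrative only.

```python
import pandas as pd

# Sample tables copied from the question.
df1 = pd.DataFrame({"user": ["a", "b"], "Top Genre": ["Horror", "Romance"]})
df2 = pd.DataFrame({
    "Genre": ["Horror", "Horror", "Romance", "Romance"],
    "Rec": ["Action", "Comedy", "Asian", "Comedy"],
    "Rank": [1, 2, 1, 2],
})

users = df1.rename(columns={"Top Genre": "Genre"})

# The user's own top genre gets rank 0 so that it sorts first.
own = users.assign(Rec=users["Genre"], Rank=0)

# Attach the recommendations for each user's top genre.
recs = users.merge(df2, on="Genre")

result = (
    pd.concat([own, recs], ignore_index=True)
    .sort_values(["user", "Rank"], ignore_index=True)
    .assign(Rank=lambda d: d["Rank"] + 1)  # shift ranks to start at 1
    [["user", "Rec", "Rank"]]
)
print(result)
```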
I have a dataframe with the following columns
Movie Rating Genre_0 Genre_1 Genre_2
MovieA 8.9 Action Comedy Family
MovieB 9.1 Horror NaN NaN
MovieC 4.4 Comedy Family Adventure
MovieD 7.7 Action Adventure NaN
MovieE 9.5 Adventure Comedy NaN
MovieF 7.5 Horror NaN NaN
MovieG 8.6 Horror NaN NaN
I'd like to get a dataframe that has the value counts for each genre and the average rating over all the movies in which the genre appears:
Genre value_count Average_Rating
Action 2 8.3
Comedy 3 7.6
Horror 3 8.4
Family 2 6.7
Adventure 3 7.2
I have tried the following code and am able to get the value counts. However, I am unable to get the average rating of each genre based on the number of times the genre appears. Any form of help is much appreciated, thank you.
#create a list for the genre columns
genre_col = [col for col in df if col.startswith('Genre_')]
#get value counts of genres
genre_counts = df[genre_col].apply(pd.Series.value_counts).sum(1).to_frame(name='Count')
genre_counts.index.name = 'Genre'
genre_counts = genre_counts.reset_index()
You can .melt the dataframe, then group the melted frame on genre and aggregate using a dictionary that specifies the output columns and their corresponding aggregation functions:
# filter and melt the dataframe
m = df.filter(regex=r'Rating|Genre').melt('Rating', value_name='Genre')
# group and aggregate
dct = {'Value_Count': ('Genre', 'count'), 'Average_Rating': ('Rating', 'mean')}
df_out = m.groupby('Genre', as_index=False).agg(**dct)
>>> df_out
Genre Value_Count Average_Rating
0 Action 2 8.30
1 Adventure 3 7.20
2 Comedy 3 7.60
3 Family 2 6.65
4 Horror 3 8.40
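One detail worth checking is what happens to the missing genres. The sketch below rebuilds the sample frame from the question and counts the NaN rows after melting; groupby drops NaN keys by default, so they do not show up as a group. The aggregation here is written on the Rating column (an assumption made to keep the example simple, relying on Rating having no missing values), so it is a variation on the answer rather than a copy of it.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Movie": ["MovieA", "MovieB", "MovieC", "MovieD", "MovieE", "MovieF", "MovieG"],
    "Rating": [8.9, 9.1, 4.4, 7.7, 9.5, 7.5, 8.6],
    "Genre_0": ["Action", "Horror", "Comedy", "Action", "Adventure", "Horror", "Horror"],
    "Genre_1": ["Comedy", np.nan, "Family", "Adventure", "Comedy", np.nan, np.nan],
    "Genre_2": ["Family", np.nan, "Adventure", np.nan, np.nan, np.nan, np.nan],
})

# Long format: one row per (movie, genre slot), including the empty slots.
m = df.filter(regex=r"Rating|Genre").melt("Rating", value_name="Genre")
n_missing = int(m["Genre"].isna().sum())
print(len(m), n_missing)

# NaN genres are excluded from the groups automatically.
summary = m.groupby("Genre")["Rating"].agg(Value_Count="count", Average_Rating="mean")
print(summary)
```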
Encoding each genre by its value count is called frequency encoding. It can be done with this code:
df_frequency_map = df.Genre_0.value_counts().to_dict()
df['Genre0_frequency_map'] = df.Genre_0.map(df_frequency_map)
To add the average as a feature in your dataset, I think you can do the same thing, but compute the per-genre mean of Rating before calling to_dict():
df_mean_map = df.groupby('Genre_0')['Rating'].mean().to_dict()
df['Genre0_mean_frequency_map'] = df.Genre_0.map(df_mean_map)
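As a worked check on the sample data, the sketch below adds both features for the Genre_0 column. It assumes that "the average" means the mean Rating of the movies sharing that Genre_0 value, and it uses groupby(...).transform as an alternative to building a dictionary; the new column names are illustrative.

```python
import pandas as pd

# Rating and Genre_0 columns copied from the question.
df = pd.DataFrame({
    "Rating": [8.9, 9.1, 4.4, 7.7, 9.5, 7.5, 8.6],
    "Genre_0": ["Action", "Horror", "Comedy", "Action", "Adventure", "Horror", "Horror"],
})

# Frequency encoding: map each genre to how often it occurs in the column.
df["Genre0_frequency"] = df["Genre_0"].map(df["Genre_0"].value_counts())

# Mean encoding: broadcast each genre's mean rating back onto its rows.
df["Genre0_mean_rating"] = df.groupby("Genre_0")["Rating"].transform("mean")
print(df)
```

Note that this only looks at Genre_0, so a genre that appears in Genre_1 or Genre_2 is not counted here.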
I have two dataframes: one at the lower (detail) level and one that summarizes the data at a higher level. I'm trying to add a new column to the summary table that sums the total spending of all people who are fans of a particular sport. That is, in the summary row for soccer I do NOT want the total soccer spending, but the total sports spending of anyone who spends anything on soccer.
df = pd.DataFrame({'Person': [1,2,3,3,3],
'Sport': ['Soccer','Tennis','Tennis','Football','Soccer'],
'Ticket_Cost': [10,20,10,10,20]})
df2 = pd.DataFrame({'Sport': ['Soccer','Tennis','Football']})
I can currently do this in many steps, but I'm sure there is a more efficient/quicker way. Here is how I currently do it.
#Calculate the total spend for each person in a temporary dataframe
df_intermediate = df.groupby(['Person'])['Ticket_Cost'].sum()
df_intermediate= df_intermediate.rename("Total_Sports_Spend")
Person Total_Sports_Spend
1 10
2 20
3 40
#place this total in the detailed table
df = pd.merge(df,df_intermediate,how='left',on='Person')
#Create a second temporary dataframe
df_intermediate2 = df.groupby(['Sport'])['Total_Sports_Spend'].sum()
Sport Total_Sports_Spend
Football 40
Soccer 50
Tennis 60
#Merge this table with the summary table
df2 = pd.merge(df2,df_intermediate2,how='left',on='Sport')
Sport Total_Sports_Spend
0 Soccer 50
1 Tennis 60
2 Football 40
Finally, I clean up the temporary dataframes and remove the extra column from the detailed table. I'm sure there is a better way.
You might want to pivot your DataFrame into a Person-by-Sport table:
df2 = df.pivot_table(index = 'Person', columns = 'Sport', values = 'Ticket_Cost')
You get
Sport Football Soccer Tennis
Person
1 NaN 10.0 NaN
2 NaN NaN 20.0
3 10.0 20.0 10.0
Now you can compute the total spending per person:
total = df2.sum(axis=1)
which is
Person
1 10.0
2 20.0
3 40.0
dtype: float64
Next, place each person's total spending in the cells of df2 that hold a positive value:
df3 = (df2>0).mul(total, axis=0)
which is here:
Sport Football Soccer Tennis
Person
1 0.0 10.0 0.0
2 0.0 0.0 20.0
3 40.0 40.0 40.0
Finally, you just have to sum down each column to get what you want:
spending = df3.sum(axis=0)
which gives the totals you expect: 40 for Football, 50 for Soccer and 60 for Tennis.
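If the pivoted table is not needed for anything else, the same totals can probably be reached with two groupby calls. This is a sketch under one stated assumption: a person who has several rows for the same sport should count once for that sport, which is why drop_duplicates is applied (the pivot above behaves the same way). The names total, per_pair and summary are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Person": [1, 2, 3, 3, 3],
    "Sport": ["Soccer", "Tennis", "Tennis", "Football", "Soccer"],
    "Ticket_Cost": [10, 20, 10, 10, 20],
})
summary = pd.DataFrame({"Sport": ["Soccer", "Tennis", "Football"]})

# Each person's total spend, repeated on every row of that person.
total = df.groupby("Person")["Ticket_Cost"].transform("sum")

# One row per (person, sport) pair so nobody is counted twice for a sport.
per_pair = df.assign(Total_Sports_Spend=total).drop_duplicates(["Person", "Sport"])

# Sum the fans' total spend per sport and attach it to the summary table.
spending = per_pair.groupby("Sport")["Total_Sports_Spend"].sum()
summary["Total_Sports_Spend"] = summary["Sport"].map(spending)
print(summary)
```

This avoids modifying the detailed table, so there are no temporary columns to clean up afterwards.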
Assuming that I have a dataframe of pastries
Pastry Flavor Qty
0 Cupcake Cheese 3
1 Cakeslice Chocolate 2
2 Tart Honey 2
3 Croissant Raspberry 1
And I get the value count of a specific flavor per pastry
df[df['Flavor'] == 'Cheese']['Pastry'].value_counts()
Cupcake 4
Tart 4
Cakeslice 3
Turnover 3
Creampie 2
Danish 2
Bear Claw 2
Then to get the percentile of that flavor qty, I could do this
df[df['Flavor'] == 'Cheese']['Pastry'].value_counts().describe(percentiles=[.75, .85, .95])
And I'd get something like this (from full dataframe)
count 35.00000
mean 1.485714
std 0.853072
min 1.000000
50% 1.000000
75% 2.000000
85% 2.000000
95% 3.300000
max 4.000000
Where the total different pastries that are cheese flavored is 35, so the total cheese qty is distributed amongst those 35 pastries. The mean of qty is 1.48, max qty is 4 (cupcake and tart) etc, etc.
What I want to do is bring that 95th percentile down by counting all other values which are not 'Cheese' in the flavor column, however value_counts() is only counting the ones that are 'Cheese' because I filtered the dataframe. How can I also count the non Cheese rows, so that my percentiles will go down and will represent the distribution of Cheese total in the entire dataframe?
This is an example output:
Cupcake 4
Tart 4
Cakeslice 3
Turnover 3
Creampie 2
Danish 2
Bear Claw 2
Swiss Roll 1
Baklava 0
Cannoli 0
Where the non-cheese flavor pastries are being included with 0 as qty, from there I can just get the percentiles and they will be reduced since there are 0 values now diluting them.
I decided to go and try the long way to try and solve this question and my result gave me the same answer as this question
Here is the long way, in case anyone is curious.
pastries = {}
for p in df['Pastry'].unique():
    pastries[p] = df[(df['Flavor'] == 'Cheese') & (df['Pastry'] == p)]['Pastry'].count()
newdf = pd.DataFrame.from_dict(pastries.items())
newdf.describe(percentiles=[.75, .85, .95])
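A shorter route to the same zero-filled counts might be to reindex the filtered value_counts against every pastry in the frame. The sketch below uses a small made-up frame, because the full data is not shown in the question; the rows and the name cheese_counts are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the real frame from the question is much larger.
df = pd.DataFrame({
    "Pastry": ["Cupcake", "Cupcake", "Tart", "Cakeslice", "Croissant", "Baklava"],
    "Flavor": ["Cheese", "Cheese", "Cheese", "Chocolate", "Raspberry", "Honey"],
})

# Count the cheese rows per pastry, then add every other pastry with a count of 0.
cheese_counts = (
    df.loc[df["Flavor"] == "Cheese", "Pastry"]
    .value_counts()
    .reindex(df["Pastry"].unique(), fill_value=0)
)
print(cheese_counts)

# The percentiles are now computed over all pastries, including the zeros.
stats = cheese_counts.describe(percentiles=[.75, .85, .95])
print(stats)
```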