I have a dataframe1 that shows the audience's rating and the genre of each movie:
movie_id | rating | action | comedy | drama
0        | 4      | 1      | 1      | 1
1        | 5      | 0      | 1      | 0
2        | 3      | 0      | 1      | 1
1 for action means it is an action movie, and 0 means it is not.
I extracted the average rating for a single genre. For action, for example, I did this:
new=df1[df1["action"]==1]
new['rating'].mean()
which gives 4. But now I have to extract the average rating for all genres, which should look like this:
action | comedy | drama
4      | 4      | 3.5
Any advice on how to approach this?
In your case you can select the genre columns, use where to turn every 0 into NaN, multiply by the rating, and take the mean (which skips NaN):
out = df.loc[:,['action','comedy','drama']].where(lambda x : x==1).mul(df.rating,axis=0).mean()
Out[377]:
action 4.0
comedy 4.0
drama 3.5
dtype: float64
If you would like a dataframe
out = out.to_frame().T
You can melt the genre columns and filter to only keep values equal to 1. Then group by the genres and calculate the mean.
pd.melt(
    df,
    value_vars=["action", "comedy", "drama"],
    var_name="genre",
    id_vars=["movie_id", "rating"],
).query("value == 1").groupby("genre")["rating"].mean()
which gives
genre
action 4.0
comedy 4.0
drama 3.5
Name: rating, dtype: float64
Multiply the rating column with action, comedy and drama columns, replace 0 with np.nan, and compute the mean:
(df.iloc[:, 2:]
   .mul(df.rating, axis = 0)
   # mean implicitly excludes nulls during computations
   .replace(0, np.nan)
   .mean()
)
action 4.0
comedy 4.0
drama 3.5
dtype: float64
The above returns a Series; if you want DataFrame-like output, pass mean to agg as a list:
(df.iloc[:, 2:]
   .mul(df.rating, axis = 0)
   .replace(0, np.nan)
   .agg(['mean']) # note the `mean` is in a list
)
action comedy drama
mean 4.0 4.0 3.5
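The same per-genre mean can also be written as a ratio of sums: the total rating of the movies in a genre divided by the number of movies in that genre. This avoids converting zeros to NaN, so a movie whose rating is genuinely 0 would still be counted. A minimal sketch, assuming the sample data from the question (the names genres, rating_sums, movie_counts and avg are illustrative, not from the original posts):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "movie_id": [0, 1, 2],
    "rating": [4, 5, 3],
    "action": [1, 0, 0],
    "comedy": [1, 1, 1],
    "drama": [1, 0, 1],
})

genres = ["action", "comedy", "drama"]

# Sum of ratings over the movies flagged for each genre.
rating_sums = df[genres].mul(df["rating"], axis=0).sum()
# Number of movies flagged for each genre.
movie_counts = df[genres].sum()

avg = rating_sums / movie_counts
print(avg)
```

A genre with no movies at all would give 0/0, which pandas reports as NaN rather than raising an error.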
Related
I have the first dataframe:
df_weight = df2.groupby(['Genre']).agg(total = ('weighted score', 'sum')).reset_index()
   Genre                 total_weight
0  Action and Adventure  1000.0
1  Classic and cult TV   500.0
and the second dataframe:
df_TotalShow = df2.groupby(['Genre']).agg(total = ('No. of shows', 'sum')).reset_index()
   Genre                 total_shows
0  Action and Adventure  200.0
1  Classic and cult TV   150.0
I want to combine the two into something like the table below, but I am unsure what the code should look like.
   Genre                 total_weight  total_shows
0  Action and Adventure  1000.0        200.0
1  Classic and cult TV   500.0         150.0
Next, I want to create another column with the division of total_weight / total_shows.
So far, I tried
df = df_weight['total'].div(df_TotalShow['total'])
but this gives me a new Series. Is there a way to make this a column of its own, so that the final result looks something like this?
   Genre                 total_weight  total_shows  Avg
0  Action and Adventure  1000.0        200.0        5.0
1  Classic and cult TV   500.0         150.0        3.33
Use this code:
df = df2.groupby(['Genre']).agg({'weighted score': 'sum', 'No. of shows': 'sum'}).reset_index()
df['Avg'] = df['weighted score'] / df['No. of shows']
Adjust the column names to match your data. For this problem you should group and aggregate once, not twice.
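A self-contained sketch of the single-groupby approach, using named aggregation so the output columns get the names shown in the question. The raw rows in df2 are hypothetical (only the column names and the per-genre totals come from the question), and summary is an illustrative name:

```python
import pandas as pd

# Hypothetical raw rows; chosen so the per-genre totals match the question.
df2 = pd.DataFrame({
    "Genre": ["Action and Adventure", "Action and Adventure",
              "Classic and cult TV", "Classic and cult TV"],
    "weighted score": [600.0, 400.0, 300.0, 200.0],
    "No. of shows": [120.0, 80.0, 100.0, 50.0],
})

# One groupby, two sums, with explicit output column names.
summary = (
    df2.groupby("Genre")
    .agg(total_weight=("weighted score", "sum"),
         total_shows=("No. of shows", "sum"))
    .reset_index()
)

# The ratio becomes an ordinary new column.
summary["Avg"] = summary["total_weight"] / summary["total_shows"]
print(summary.round(2))
```

If the two aggregated frames already exist separately, merging them on Genre and then dividing the two columns should give the same table.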
I have a data frame like this:
for example
user Top Genre
a Horror
b Romance
and I have the content-based table for genre:
for example
Genre Rec Rank
Horror Action 1
Horror Comedy 2
Romance Asian 1
Romance Comedy 2
I want to join the tables so that the output will be:
for example
User Rec Rank
a Horror 1
a Action 2
a Comedy 3
b Romance 1
b Asian 2
b Comedy 3
How can I process the two tables with pandas so that the output looks like the table above?
Use DataFrame.merge with a right join, stack the result with the same DataFrame extended by new columns via DataFrame.assign (using pd.concat, since DataFrame.append was removed in pandas 2.0), sort by both columns, and finally add 1 to Rank:
df11 = df1.rename(columns={'Top Genre':'Genre'})
df = pd.concat([df11.merge(df2, how='right'), df11.assign(Rec = df11['Genre'], Rank=0)])
df = df.sort_values(['user','Rank'], ignore_index=True)
df['Rank'] +=1
print (df)
user Genre Rec Rank
0 a Horror Horror 1
1 a Horror Action 2
2 a Horror Comedy 3
3 b Romance Romance 1
4 b Romance Asian 2
5 b Romance Comedy 3
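A self-contained sketch of the same idea, with the two sample tables from the question built inline. It stacks the frames with pd.concat because DataFrame.append is not available in pandas 2.0 and later. The names users, recs, own_genre and result are illustrative:

```python
import pandas as pd

# Sample tables copied from the question.
users = pd.DataFrame({"user": ["a", "b"], "Top Genre": ["Horror", "Romance"]})
recs = pd.DataFrame({
    "Genre": ["Horror", "Horror", "Romance", "Romance"],
    "Rec": ["Action", "Comedy", "Asian", "Comedy"],
    "Rank": [1, 2, 1, 2],
})

users = users.rename(columns={"Top Genre": "Genre"})

# Each user's own top genre goes first, so give it rank 0 for now.
own_genre = users.assign(Rec=users["Genre"], Rank=0)

# Attach the recommendations for each user's top genre, then stack both parts.
result = pd.concat([users.merge(recs, on="Genre", how="right"), own_genre])
result = result.sort_values(["user", "Rank"], ignore_index=True)

# Shift ranks so they start at 1.
result["Rank"] += 1
print(result[["user", "Rec", "Rank"]])
```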
I have a dataframe with the following columns
Movie Rating Genre_0 Genre_1 Genre_2
MovieA 8.9 Action Comedy Family
MovieB 9.1 Horror NaN NaN
MovieC 4.4 Comedy Family Adventure
MovieD 7.7 Action Adventure NaN
MovieE 9.5 Adventure Comedy NaN
MovieF 7.5 Horror NaN NaN
MovieG 8.6 Horror NaN NaN
I'd like to get a dataframe which has the value counts for each genre and the average rating over all the times the genre appears:
Genre value_count Average_Rating
Action 2 8.3
Comedy 3 7.6
Horror 3 8.4
Family 2 6.7
Adventure 3 7.2
I have tried the following code and am able to get the value counts. However, I am unable to get the average rating of each genre based on the number of times each genre appears. Any form of help is much appreciated, thank you.
#create a list for the genre columns
genre_col = [col for col in df if col.startswith('Genre_')]
#get value counts of genres
genre_counts = df[genre_col].apply(pd.Series.value_counts).sum(1).to_frame(name='Count')
genre_counts.index.name = 'Genre'
genre_counts = genre_counts.reset_index()
You can .melt the dataframe, then group the melted frame on genre and aggregate using a dictionary that specifies the columns and their corresponding aggregation functions:
# filter and melt the dataframe
m = df.filter(regex=r'Rating|Genre').melt('Rating', value_name='Genre')
# group and aggregate
dct = {'Value_Count': ('Genre', 'count'), 'Average_Rating': ('Rating', 'mean')}
df_out = m.groupby('Genre', as_index=False).agg(**dct)
>>> df_out
Genre Value_Count Average_Rating
0 Action 2 8.30
1 Adventure 3 7.20
2 Comedy 3 7.60
3 Family 2 6.65
4 Horror 3 8.40
Encoding each genre as its value count is called frequency encoding. It can be done with this code:
df_frequency_map = df.Genre_0.value_counts().to_dict()
df['Genre0_frequency_map'] = df.Genre_0.map(df_frequency_map)
To add the average as a feature in your dataset, I think you can do the same thing, but compute the per-genre mean of Rating before calling to_dict():
df_mean_map = df.groupby('Genre_0')['Rating'].mean().to_dict()
df['Genre0_mean_frequency_map'] = df.Genre_0.map(df_mean_map)
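As a rough illustration of both encodings on the Genre_0 column, here is a minimal sketch using the sample rows from the question. The per-genre mean is computed with a groupby on Rating, which is an assumption about what "the average" is meant to be; freq_map, mean_map and the new column names are illustrative:

```python
import pandas as pd

# Movie, Rating and Genre_0 columns copied from the question.
df = pd.DataFrame({
    "Movie": ["MovieA", "MovieB", "MovieC", "MovieD", "MovieE", "MovieF", "MovieG"],
    "Rating": [8.9, 9.1, 4.4, 7.7, 9.5, 7.5, 8.6],
    "Genre_0": ["Action", "Horror", "Comedy", "Action", "Adventure", "Horror", "Horror"],
})

# Frequency encoding: how often each genre occurs in Genre_0.
freq_map = df["Genre_0"].value_counts().to_dict()
df["Genre0_frequency"] = df["Genre_0"].map(freq_map)

# Mean encoding: average Rating of the rows sharing the same Genre_0.
mean_map = df.groupby("Genre_0")["Rating"].mean().to_dict()
df["Genre0_mean_rating"] = df["Genre_0"].map(mean_map)

print(df)
```

This only looks at Genre_0, so the numbers differ from the melted result, which counts a genre wherever it appears in Genre_0, Genre_1 or Genre_2.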
According to the docs, the fillna value parameter can be one of the following:
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
I have a data frame that looks like:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
And that is what I want to do:
NaN Cabin will be filled according to the median value given the Pclass feature value
NaN Age will be filled according to its median value across the data set
NaN Embarked will be filled according to the median value given the Pclass feature value
So after some data manipulation, I got this data frame:
Pclass Cabin Embarked Ticket
0 1 C S 50
1 2 F S 13
2 3 G S 5
What it says is that for the Pclass == 1 the most common Cabin is C. Given that, in my original data frame df I want to fill every df['Cabin'] == null with C.
This is a small example and I could treat each possible null combination by hand with something like:
df_both.loc[(df_both['Pclass'] == 1) & df_both['Cabin'].isna(), 'Cabin'] = 'C'
However, I wonder if I can use this derived data frame to do all this filling automatically.
Thank you.
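If you want to fill all NaNs with something like the median or the mean of the specific column, you can do the following (numeric_only=True skips string columns such as Name or Cabin).
For the median:
df.fillna(df.median(numeric_only=True))
For the mean:
df.fillna(df.mean(numeric_only=True))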
see https://pandas.pydata.org/pandas-docs/stable/missing_data.html#filling-with-a-pandasobject for more information.
Edit:
Alternatively you can use a dictionary with specified values. The keys need to map to column names. This way you can also impute for missing values in strings.
df.fillna({'col1':'a','col2': 1})
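To fill from the derived per-Pclass table automatically, one option is to turn that table into a lookup keyed by Pclass, map it onto each row, and pass the resulting Series to fillna. A minimal sketch under that assumption, with a tiny made-up passenger frame (only the Pclass to Cabin lookup values come from the question; lookup and fill_values are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up passenger rows with some missing Cabin values.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Cabin": ["C", np.nan, np.nan, "G", np.nan],
})

# Derived table from the question: most common Cabin per Pclass.
lookup = pd.DataFrame({"Pclass": [1, 2, 3], "Cabin": ["C", "F", "G"]})

# For every row, look up the Cabin that belongs to its Pclass.
fill_values = df["Pclass"].map(lookup.set_index("Pclass")["Cabin"])

# fillna with a Series aligns on the index and only touches missing entries.
df["Cabin"] = df["Cabin"].fillna(fill_values)
print(df)
```

The same pattern should work for Embarked by swapping the column name in the lookup.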
I'm relatively new to Python and wasn't able to find an answer to my question.
Let's say I have saved a DataFrame into the variable movies. The DataFrame looks somewhat like this:
Genre1 Genre2 Genre3 sales
Fantasy Drama Romance 5
Action Fantasy Comedy 3
Comedy Drama ScienceFiction 4
Drama Romance Action 8
What I wanna do is get the average sales for every unique Genre that appears in any of the columns Genre1, Genre2 or Genre3.
I've tried a few different things. What I have right now is:
for x in pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel()):
    mean_genre = np.mean(movies['sales'])
    print(x, mean_genre)
What I get as a result is:
Fantasy 5.0
Drama 5.0
Romance 5.0
Action 5.0
Comedy 5.0
ScienceFiction 5.0
So it does get me the unique Genres across the three columns but it calculates the average for the whole column sales. How do I get it to calculate the average sales for every unique Genre that appears in any of the three columns Genre1, Genre2 and Genre3? e.g. for the Genre 'Fantasy' it should use row 1 and 2 to calculate the average sales.
Here is an even shorter version:
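allGenre=pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
for genre in allGenre:
    # rows where the genre appears in any of the three genre columns
    mask = movies[['Genre1','Genre2','Genre3']].isin([genre]).any(axis=1)
    print("%s : %f" % (genre, movies.loc[mask, 'sales'].mean()))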
I'm not sure this is what you want to achieve, but this accumulates the sales value for each genre (each time it is encountered):
all_genres = pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
dff = pd.DataFrame(columns=['Nb_sales', 'Nb_view'],
                   index=all_genres, data=0)
for col in ['Genre1', 'Genre2', 'Genre3']:
    for genre, value in zip(movies[col].values, movies['sales'].values):
        dff.loc[(genre, 'Nb_sales')] += value
        dff.loc[(genre, 'Nb_view')] += 1
Then you can compute the mean value :
>>> dff['Mean'] = dff.Nb_sales / dff.Nb_view
>>> dff
Nb_sales Nb_view Mean
Romance 13 2 6.500000
Comedy 7 2 3.500000
ScienceFiction 4 1 4.000000
Fantasy 8 2 4.000000
Drama 17 3 5.666667
Action 11 2 5.500000
More compact solutions could be :
all_genres = pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
mean_series = pd.Series(index=all_genres, dtype=float)
for genre in all_genres:
    mean_series[genre] = movies.sales.loc[movies.eval(
        'Genre1 == "{0}" or Genre2 == "{0}" or Genre3 == "{0}"'
        .format(genre)).values].mean()
# Or in one (long) line:
mean_df = pd.DataFrame(columns=['Genre'], data=all_genres)
mean_df['mean'] = mean_df.Genre.apply(
    lambda x: movies.sales.loc[movies.eval(
        'Genre1 == "{0}" or Genre2 == "{0}" or Genre3 == "{0}"'
        .format(x)).values].mean())
Where they both will print your results:
>>> print(mean_series)
Fantasy 4.000000
Drama 5.666667
(....)
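A loop-free alternative is to reshape the three genre columns into one long column with melt and then group by genre. A minimal sketch, assuming the sample movies frame from the question (long and genre_means are illustrative names):

```python
import pandas as pd

# Sample data copied from the question.
movies = pd.DataFrame({
    "Genre1": ["Fantasy", "Action", "Comedy", "Drama"],
    "Genre2": ["Drama", "Fantasy", "Drama", "Romance"],
    "Genre3": ["Romance", "Comedy", "ScienceFiction", "Action"],
    "sales": [5, 3, 4, 8],
})

# One row per (movie, genre) pair; sales is repeated for each genre slot.
long = movies.melt(id_vars="sales", value_name="Genre")

# Average sales over every row in which the genre appears.
genre_means = long.groupby("Genre")["sales"].mean()
print(genre_means)
```

If a movie could list the same genre in two columns, it would be counted twice here, so dropping duplicates per movie first may be needed in that case.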