I have a dataframe with the following columns
Movie Rating Genre_0 Genre_1 Genre_2
MovieA 8.9 Action Comedy Family
MovieB 9.1 Horror NaN NaN
MovieC 4.4 Comedy Family Adventure
MovieD 7.7 Action Adventure NaN
MovieE 9.5 Adventure Comedy NaN
MovieF 7.5 Horror NaN NaN
MovieG 8.6 Horror NaN NaN
I'd like to get a dataframe which has the value count for each genre and the average rating over all the rows where that genre appears:
Genre value_count Average_Rating
Action 2 8.3
Comedy 3 7.6
Horror 3 8.4
Family 2 6.7
Adventure 3 7.2
I have tried the following code and am able to get the value counts. However, I am unable to get the average rating for each genre based on the number of times it appears. Any help is much appreciated, thank you.
#create a list for the genre columns
genre_col = [col for col in df if col.startswith('Genre_')]
#get value counts of genres
genre_counts = df[genre_col].apply(pd.Series.value_counts).sum(1).to_frame(name='Count')
genre_counts.index.name = 'Genre'
genre_counts = genre_counts.reset_index()
You can .melt the dataframe, then group the melted frame on genre and aggregate using a dictionary that specifies the columns and their corresponding aggregation functions:
# filter and melt the dataframe
m = df.filter(regex=r'Rating|Genre').melt('Rating', value_name='Genre')
# group and aggregate
dct = {'Value_Count': ('Genre', 'count'), 'Average_Rating': ('Rating', 'mean')}
df_out = m.groupby('Genre', as_index=False).agg(**dct)
>>> df_out
Genre Value_Count Average_Rating
0 Action 2 8.30
1 Adventure 3 7.20
2 Comedy 3 7.60
3 Family 2 6.65
4 Horror 3 8.40
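If you want the ratings rounded to one decimal place as in the expected output (e.g. 6.65 shown as 6.7), a small follow-up step, not part of the answer above, would be:
df_out['Average_Rating'] = df_out['Average_Rating'].round(1)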
The process of encoding each genre to its value count is called frequency encoding; it can be done with this code:
df_frequency_map = df.Genre_0.value_counts().to_dict()
df['Genre0_frequency_map'] = df.Genre_0.map(df_frequency_map)
To add the average as a feature in your dataset, I think you can do the same thing but calculate the average rating per genre before calling to_dict():
df_mean_map = df.groupby('Genre_0')['Rating'].mean().to_dict()
df['Genre0_mean_frequency_map'] = df.Genre_0.map(df_mean_map)
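If you want the same two features for every genre column, a small loop over the genre_col list from the question should work (a sketch, assuming the column names shown above):
# build a count map and a mean-rating map for each Genre_ column separately
for col in genre_col:
    count_map = df[col].value_counts().to_dict()
    mean_map = df.groupby(col)['Rating'].mean().to_dict()
    df[f'{col}_frequency'] = df[col].map(count_map)
    df[f'{col}_mean_rating'] = df[col].map(mean_map)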
I have the first dataframe:
df_weight = df2.groupby(['Genre']).agg(total = ('weighted score', 'sum')).reset_index()
                  Genre  total_weight
0  Action and Adventure        1000.0
1   Classic and cult TV         500.0
and the second dataframe:
df_TotalShow = df2.groupby(['Genre']).agg(total = ('No. of shows', 'sum')).reset_index()
                  Genre  total_shows
0  Action and Adventure        200.0
1   Classic and cult TV        150.0
I want to combine the two into something similar to the below, but I am unsure of what the code should look like.
                  Genre  total_weight  total_shows
0  Action and Adventure        1000.0        200.0
1   Classic and cult TV         500.0        150.0
Next, I want to create another column with the division of total_weight / total_shows.
So far, I tried
df = df_weight['total'].div(df_TotalShow['total'])
but this gives me a new series. Is there a way to make this another column by itself, with the final product looking something like
                  Genre  total_weight  total_shows   Avg
0  Action and Adventure        1000.0        200.0  5.00
1   Classic and cult TV         500.0        150.0  3.33
Use this code:
df = df2.groupby(['Genre']).agg({'No. of shows': 'sum', 'weighted score': 'sum'}).reset_index()
df['division'] = df.col1 / df.col2
Replace col1 and col2 with the names of your columns. You only need to perform the grouping and aggregation once for this problem, not twice.
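Applied to the column names from the question, a minimal sketch (assuming df2 has the 'Genre', 'weighted score' and 'No. of shows' columns) could look like:
# single grouping with two named aggregations, then the derived Avg column
out = (df2.groupby('Genre', as_index=False)
          .agg(total_weight=('weighted score', 'sum'),
               total_shows=('No. of shows', 'sum')))
out['Avg'] = out['total_weight'] / out['total_shows']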
You’ll need to bring all your filtering skills together for this task. We’ve provided you a list of companies in the developers variable. Filter df however you choose so that you only get games that meet the following conditions:
Sold in all 3 regions (North America, Europe, and Japan)
The Japanese sales were greater than the combined sales from North America and Europe
The game developer is one of the companies in the developers list
There is no column that explicitly says whether a game was sold in each region, but you can infer that a game was not sold in a region if its sales are 0 for that region.
Use the cols variable to select only the 'name', 'developer', 'na_sales', 'eu_sales', and 'jp_sales' columns from the filtered DataFrame, and assign the result to a variable called df_filtered. Print the whole DataFrame.
You can use a filter mask or query string for this task. In either case, you need to check if the 'jp_sales' column is greater than the sum of 'na_sales' and 'eu_sales', check if each sales column is greater than 0, and use isin() to check if the 'developer' column contains one of the values in developers. Use [cols] to select only those columns and then print df_filtered.
  developer  na_sales  eu_sales  jp_sales  critic_score  user_score
0  Nintendo     41.36     28.96      3.77          76.0         8.0
1       NaN     29.08      3.58      6.81           NaN         NaN
2  Nintendo     15.68     12.76      3.79          82.0         8.3
3  Nintendo     15.61     10.93      3.28          80.0         8.0
4       NaN     11.27      8.89     10.22           NaN         NaN
This is my code. I'm finding this pretty difficult and am having trouble producing a df_filtered variable with code that runs.
import pandas as pd
df = pd.read_csv('/datasets/vg_sales.csv')
df['user_score'] = pd.to_numeric(df['user_score'], errors='coerce')
developers = ['SquareSoft', 'Enix Corporation', 'Square Enix']
cols = ['name', 'developer', 'na_sales', 'eu_sales', 'jp_sales']
df_filtered = df([cols ]> 0 | cols['jp_sales'] > sum(cols['eu_sales']+cols['na_sales']) | df['developer'].isin(developers))
print(df_filtered)
If I understand correctly, it looks like a multi-condition dataframe filtering:
df[
    df["developer"].isin(developers)
    & (df["jp_sales"] > df["na_sales"] + df["eu_sales"])
    & ~df["na_sales"].isnull()
    & ~df["eu_sales"].isnull()
    & ~df["jp_sales"].isnull()
]
It will not return results for the sample dataset given in the question, because the conditions that JP sales should exceed NA plus EU sales and that the developer should be from the given list are not met. But it works for proper data:
import numpy as np
import pandas as pd

data = [
    ("SquareSoft", 41.36, 28.96, 93.77, 76.0, 8.0),
    (np.nan, 29.08, 3.58, 6.81, np.nan, np.nan),
    ("SquareSoft", 15.68, 12.76, 3.79, 82.0, 8.3),
    ("Nintendo", 15.61, 10.93, 30.28, 80.0, 8.0),
    (np.nan, 11.27, 8.89, 10.22, np.nan, np.nan),
]
columns = ["developer", "na_sales", "eu_sales", "jp_sales", "critic_score", "user_score"]
developers = ['SquareSoft', 'Enix Corporation', 'Square Enix']
df = pd.DataFrame(data=data, columns=columns)
[Out]:
developer na_sales eu_sales jp_sales critic_score user_score
0 SquareSoft 41.36 28.96 93.77 76.0 8.0
Try this:
developers = ['SquareSoft', 'Enix Corporation', 'Square Enix']
cols = ['name', 'developer', 'na_sales', 'eu_sales', 'jp_sales']
cond = (
# Sold in all 3 regions
df[["na_sales", "eu_sales", "jp_sales"]].gt(0).all(axis=1)
# JP sales greater than NA and EU sales combined
& df["jp_sales"].gt(df["na_sales"] + df["eu_sales"])
# Developer is in a predefined list
& df["developer"].isin(developers)
)
if cond.any():
df_filtered = df.loc[cond, cols]
else:
print("No match found")
I have a dataframe1 that shows the audience's rating and the genre of each movie:
movie_id  rating  action  comedy  drama
       0       4       1       1      1
       1       5       0       1      0
       2       3       0       1      1
1 for action means it is an action movie, and 0 means it is not.
I extracted the average rating for a single genre. For Action, for example, I did this:
new=df1[df1["action"]==1]
new['rating'].mean()
which shows 4. But now I have to extract the average rating for all genres, which should look like this:
action | comedy | drama
4 4 3.5
Any advice on how to approach this?
In your case we can select the genre columns, use where to turn the 0s into NaN, multiply by the rating, and take the mean:
out = df.loc[:,['action','comedy','drama']].where(lambda x : x==1).mul(df.rating,axis=0).mean()
Out[377]:
action 4.0
comedy 4.0
drama 3.5
dtype: float64
If you would like a dataframe
out = out.to_frame().T
You can melt the genre columns and filter to only keep values equal to 1. Then group by the genres and calculate the mean.
pd.melt(
df,
value_vars=["action", "comedy", "drama"],
var_name="genre",
id_vars=["movie_id", "rating"],
).query("value == 1").groupby("genre")["rating"].mean()
which gives
genre
action 4.0
comedy 4.0
drama 3.5
Name: rating, dtype: float64
Multiply the rating column with action, comedy and drama columns, replace 0 with np.nan, and compute the mean:
(df.iloc[:, 2:]
.mul(df.rating, axis = 0)
# mean implicitly excludes nulls during computations
.replace(0, np.nan)
.mean()
)
action 4.0
comedy 4.0
drama 3.5
dtype: float64
The above returns a Series, if you want a dataframe like output, pass mean to agg:
(df.iloc[:, 2:]
.mul(df.rating, axis = 0)
.replace(0, np.nan)
.agg(['mean']) # note the `mean` is in a list
)
action comedy drama
mean 4.0 4.0 3.5
I have two dataframes, one at the lower level and one that summarizes the data at a higher level. I'm trying to add a new column to the summary table that sums the total spending of all people who are fans of a particular sport. I.e., in the summary row for soccer I do NOT want to sum the total soccer spending, but the total sports spending of anyone who spends anything on soccer.
df = pd.DataFrame({'Person': [1,2,3,3,3],
'Sport': ['Soccer','Tennis','Tennis','Football','Soccer'],
'Ticket_Cost': [10,20,10,10,20]})
df2 = pd.DataFrame({'Sport': ['Soccer','Tennis','Football']})
I can currently do this in many steps, but I'm sure there is a more efficient/quicker way. Here is how I currently do it.
#Calculate the total spend for each person in an temporary dataframe
df_intermediate = df.groupby(['Person'])['Ticket_Cost'].sum()
df_intermediate = df_intermediate.rename("Total_Sports_Spend")
Person Total_Sports_Spend
1 10
2 20
3 40
#place this total in the detailed table
df = pd.merge(df,df_intermediate,how='left',on='Person')
#Create a second temporary dataframe
df_intermediate2 = df.groupby(['Sport'])['Total_Sports_Spend'].sum()
Sport Total_Sports_Spend
Football 40
Soccer 50
Tennis 60
#Merge this table with the summary table
df2 = pd.merge(df2,df_intermediate2,how='left',on='Sport')
Sport Total_Sports_Spend
0 Soccer 50
1 Tennis 60
2 Football 40
Finally, I clean up the temporary dataframes and remove the extra column from the detailed table. I'm sure there is a better way.
You might want to pivot your DataFrame into a 2D table:
df2 = df.pivot_table(index = 'Person', columns = 'Sport', values = 'Ticket_Cost')
You get
Sport Football Soccer Tennis
Person
1 NaN 10.0 NaN
2 NaN NaN 20.0
3 10.0 20.0 10.0
Now you can compute the total spending per person:
total = df2.sum(axis=1)
which is
Person
1 10.0
2 20.0
3 40.0
dtype: float64
Then you place the total spending values from total in the cells of df2 where the cell has a positive value:
df3 = (df2>0).mul(total, axis=0)
which is here:
Sport Football Soccer Tennis
Person
1 0.0 10.0 0.0
2 0.0 0.0 20.0
3 40.0 40.0 40.0
Finally you just have to sum along columns to get what you want:
spending = df3.sum(axis=0)
and you will get what you expect.
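For comparison, a more compact sketch of the merge-based route from the question, using transform so no intermediate frames need to be kept (assuming at most one row per person/sport pair, as in the sample data):
# each person's total spend, broadcast back onto the detail rows
totals = df.groupby('Person')['Ticket_Cost'].transform('sum')

# sum those per-person totals within each sport and attach them to the summary table
per_sport = (df.assign(Total_Sports_Spend=totals)
               .groupby('Sport')['Total_Sports_Spend'].sum()
               .reset_index())
df2 = df2.merge(per_sport, how='left', on='Sport')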
I'm relatively new to Python and wasn't able to find an answer to my question.
Let's say I have saved a DataFrame into the variable movies. The DataFrame looks somewhat like this:
Genre1 Genre2 Genre3 sales
Fantasy Drama Romance 5
Action Fantasy Comedy 3
Comedy Drama ScienceFiction 4
Drama Romance Action 8
What I want to do is get the average sales for every unique Genre that appears in any of the columns Genre1, Genre2 or Genre3.
I've tried a few different things. What I have right now is:
for x in pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel()):
    mean_genre = np.mean(movies['sales'])
    print(x, mean_genre)
What I get as a result is:
Fantasy 5.0
Drama 5.0
Romance 5.0
Action 5.0
Comedy 5.0
ScienceFiction 5.0
So it does get me the unique Genres across the three columns but it calculates the average for the whole column sales. How do I get it to calculate the average sales for every unique Genre that appears in any of the three columns Genre1, Genre2 and Genre3? e.g. for the Genre 'Fantasy' it should use row 1 and 2 to calculate the average sales.
Here is an even shorter version:
allGenre = pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
for genre in allGenre:
    print("%s : %f" % (genre, movies[movies.isin([genre]).any(axis=1)].sales.mean()))
I'm not sure it is exactly what you want to achieve, but this accumulates the sales value for each genre (each time it is encountered):
all_genres = pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
dff = pd.DataFrame(columns=['Nb_sales', 'Nb_view'],
                   index=all_genres, data=0)
for col in ['Genre1', 'Genre2', 'Genre3']:
    for genre, value in zip(movies[col].values, movies['sales'].values):
        dff.loc[genre, 'Nb_sales'] += value
        dff.loc[genre, 'Nb_view'] += 1
Then you can compute the mean value :
>>> dff['Mean'] = dff.Nb_sales / dff.Nb_view
>>> dff
Nb_sales Nb_view Mean
Romance 13 2 6.500000
Comedy 7 2 3.500000
ScienceFiction 4 1 4.000000
Fantasy 8 2 4.000000
Drama 17 3 5.666667
Action 11 2 5.500000
More compact solutions could be :
all_genres = pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
mean_series = pd.Series(index=all_genres)
for genre in all_genres:
    mean_series[genre] = movies.sales.loc[movies.eval(
        'Genre1 == "{0}" or Genre2 == "{0}" or Genre3 == "{0}"'
        .format(genre)).values].mean()

# Or in one (long) line:
mean_df = pd.DataFrame(columns=['Genre'], data=all_genres)
mean_df['mean'] = mean_df.Genre.apply(
    lambda x: movies.sales.loc[movies.eval(
        'Genre1 == "{0}" or Genre2 == "{0}" or Genre3 == "{0}"'
        .format(x)).values].mean())
Both of which will print your results:
>>> print(mean_series)
Fantasy 4.000000
Drama 5.666667
(....)