I am working on an assignment for my Data Science class. I just need help getting started, as I'm having trouble understanding how to use pandas to group rows and select DISTINCT values.
I need to find the movies with the HIGHEST RATINGS by FEMALES. My code returns movies with rating = 5 and gender = 'F', but it repeats the same movie over and over, since more than one user rated it. I'm not sure how to show just the movie, the count of its 5-star ratings, and gender = 'F'. Below is my code:
import pandas as pd

m = pd.read_csv('movies.csv')
u = pd.read_csv('users.csv')
r = pd.read_csv('ratings.csv')

# merge on the columns the frames share (users to ratings, then movies)
ur = pd.merge(u, r)
df = pd.merge(m, ur)  # already a DataFrame; no pd.DataFrame() wrapper needed

top10 = df.loc[(df.gender == 'F') & (df.rating == 5)]
print(top10)
The data files can be downloaded here.
I just need some help getting started; there's a lot more to the homework, but once I figure this out I can do the rest. Just need a jump-start. Thank you very much.
mv_id  title                               genres                        rating  user_id  gender
1      Toy Story (1995)                    Animation|Children's|Comedy   5       1        F
2      Jumanji (1995)                      Adventure|Children's|Fantasy  5       2        F
3      Grumpier Old Men (1995)             Comedy|Romance                5       3        F
4      Waiting to Exhale (1995)            Comedy|Drama                  5       4        F
5      Father of the Bride Part II (1995)  Comedy                        5       5        F
I would do the filtering on as little data as possible. To select the 5-star ratings of female users, there's no need for the movie metadata (movies.csv): it can be done on ur, which is smaller than df.
# filter the data in `ur`
f_5s_ratings = ur.loc[(ur.gender == 'F') & (ur.rating == 5)]
# count rows per `movie_id`
abs_num_f_5s_ratings = f_5s_ratings.groupby("movie_id").size()
In abs_num_f_5s_ratings you now have a Series counting the total number of 5-star ratings by female users per movie_id:
movie_id
1 253
2 15
3 14
...
If you join that data on the key movie_id with m as a new column (I'll leave that as an exercise), you can then sort by this value to get your top 10 movies by absolute number of 5-star ratings from female users.
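If you get stuck, here is a minimal sketch of that join-and-sort step. It assumes m has a movie_id column matching the grouped index; adjust the column names to your CSVs:

# give the counts a name, then attach them to the movie metadata
counts = abs_num_f_5s_ratings.rename('n_5s_ratings')
top10 = (m.merge(counts, left_on='movie_id', right_index=True)
          .sort_values('n_5s_ratings', ascending=False)
          .head(10))
print(top10[['title', 'n_5s_ratings']])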
I have 2 CSV files. They have one common column, ID. What I want to do is extract the common rows and build another dataframe. First I want to select Job, and then, since they share the ID column, find the rows whose IDs are the same. Visually, the dataframes should look like this:
Let the first DataFrame be:

ID  Gender  Job           Shift  Wage
1   Male    Engineer      Night  8000
2   Male    Engineer      Night  7865
3   Female  Worker        Day    5870
4   Male    Accountant    Day    5870
5   Female  Architecture  Day    4900
Let the second one be:

ID  Department
1   IT
2   Quality Control
5   Construction
7   Construction
8   Human Resources
And the new DataFrame should look like:

ID  Department       Job           Wage
1   IT               Engineer      8000
2   Quality Control  Engineer      7865
5   Construction     Architecture  4900
You can use:
df_result = df1.merge(df2, on='ID', how='inner')
If you want to select only certain columns from each dataframe, use:
df_result = df1[['ID', 'Job', 'Wage']].merge(df2[['ID', 'Department']], on='ID', how='inner')
Use:
# merge defaults to an inner join, so only IDs present in both frames are kept
df = df2.merge(df1[['ID', 'Job', 'Wage']], on='ID')
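As a quick, self-contained check, here is a sketch with the two frames re-typed from the tables in the question:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female'],
                    'Job': ['Engineer', 'Engineer', 'Worker', 'Accountant', 'Architecture'],
                    'Shift': ['Night', 'Night', 'Day', 'Day', 'Day'],
                    'Wage': [8000, 7865, 5870, 5870, 4900]})
df2 = pd.DataFrame({'ID': [1, 2, 5, 7, 8],
                    'Department': ['IT', 'Quality Control', 'Construction',
                                   'Construction', 'Human Resources']})

# the inner join keeps only IDs 1, 2 and 5, which appear in both frames
print(df2.merge(df1[['ID', 'Job', 'Wage']], on='ID'))
#    ID       Department           Job  Wage
# 0   1               IT      Engineer  8000
# 1   2  Quality Control      Engineer  7865
# 2   5     Construction  Architecture  4900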
I am new to Pandas and created the following example to illustrate a problem I would like to solve.
Data
Consider following dataframe:
df = pd.DataFrame({ 'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana', 'Erika', 'Erika'],
'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike', 'House', 'Car'],
'Value': [300, 10, 12, 450, 15, 2, 600, 11],
})
Which looks like this:
Person Belonging Value
0 Adam House 300
1 Adam Car 10
2 Cesar Car 12
3 Diana House 450
4 Diana Car 15
5 Diana Bike 2
6 Erika House 600
7 Erika Car 11
Question
How do I find the Value of each Person's Car(s) if they have a House valued at more than 400?
The result I am looking for is this:
Person Belonging Value
4 Diana Car 15
7 Erika Car 11
How can I achieve this in Pandas, and is there something similar to sub-queries?
Sub-query
In SQL there is something called a sub-query. Perhaps there is something similar in Pandas.
SELECT *
FROM df
WHERE person IN
(SELECT person
FROM df
WHERE belonging='House' AND value>400)
AND belonging='Car';
person belonging value
---------- ---------- ----------
Diana Car 15
Erika Car 11
One approach you can use is very similar to the SQL statement.
Start by finding the people with houses with value over 400:
persons = df.loc[(df['Belonging'] == 'House') & (df['Value'] > 400), 'Person']
This will return a series with "Diana" and "Erika".
Then find the cars for such people:
df[df['Person'].isin(persons) & (df['Belonging'] == 'Car')]
This will return your expected result.
Using a join is also possible with merge(), which might be more efficient than using isin() for a large dataset:
df_join = df.merge(persons, on='Person')
And then you can filter to find the cars:
df_join[df_join['Belonging'] == 'Car']
This will also return your expected result.
One different approach to this problem is to pivot the data by turning the belongings into columns, so you'd have a single row per person with all their belongings listed.
You can use pivot_table() to get this data into a relatively flat dataframe:
df_pivot = df.pivot_table(values='Value', index='Person', columns='Belonging', fill_value=-1)
At that point, you can find the value of the cars for people with houses worth more than 400 with:
df_pivot.loc[df_pivot['House'] > 400, 'Car']
Note that this last one returns a Series rather than a DataFrame, since Person has been turned into the index. The pivoted frame is really useful if you want to gather more information about a person: with each person on a single row, all the data related to that person is easy to access.
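For example, once the data is pivoted, everything about one person is a single lookup (a small sketch using the df_pivot built above):

print(df_pivot.loc['Diana'])   # all of Diana's belongings in one row
# Belonging
# Bike       2
# Car       15
# House    450
print(df_pivot.loc['Cesar'])   # missing belongings show the -1 fill value
# Belonging
# Bike      -1
# Car       12
# House     -1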
print(df[df.Person.isin(df.loc[(df.Belonging == 'House') & (df.Value > 400), 'Person']) & (df.Belonging == 'Car')])
Prints:
Person Belonging Value
4 Diana Car 15
7 Erika Car 11
Consider a set-based (similar to SQL) approach with merge and query retaining your WHERE clauses:
final_df = (
df.query("Belonging == 'Car'")
.merge(df.query("Belonging == 'House' & Value > 400"),
on="Person", suffixes=["_Car","_House"])
)
# Person Belonging_Car Value_Car Belonging_House Value_House
# 0 Diana Car 15 House 450
# 1 Erika Car 11 House 600
Or without the house columns:
final_df = (
df.query("Belonging == 'Car'")
.merge((df.query("Belonging == 'House' & Value > 400")
.reindex(["Person"], axis="columns")),
on="Person")
)
# Person Belonging Value
# 0 Diana Car 15
# 1 Erika Car 11
I have two tables, one of customer information and one of transaction info.
Customer information includes each person's quality of health (from 0 to 100),
e.g. if I extract just the Name and HealthQuality columns:
John: 70
Mary: 20
Paul: 40
etc etc.
After applying featuretools I noticed a new DIFF(HealthQuality) variable.
According to the docs, this is what DIFF does:
"Compute the difference between the value in a list and the previous value in that list."
Is featuretools calculating the difference between Mary and John's health quality in this instance?
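If DIFF behaves like pandas' own Series.diff(), i.e. a plain row-over-row difference, then yes: it compares each customer with whoever happens to be the previous row. A small sketch with the values above:

import pandas as pd

health = pd.Series([70, 20, 40], index=['John', 'Mary', 'Paul'])
print(health.diff())
# John     NaN
# Mary   -50.0
# Paul    20.0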
I don't think this kind of feature synthesis really works for customer records, e.g. CUM_SUM(emails_sent) for John: John's record is one row, and he has one value for the number of emails we sent him.
For now I am using the ignore_variables=[all_customer_info] option to remove all of the customer data, except for the transactions table of course.
This also leads me into another question.
Using data from the transactions table, John now has a DIFF(MEAN(transactions.amount)). What is DIFF measuring in this instance?
id MEAN(transactions.amount) DIFF(MEAN(transactions.amount))
0 1 21.950000 NaN
1 2 20.000000 -1.950000
2 3 35.604581 15.604581
3 4 NaN NaN
4 5 22.782682 NaN
5 6 35.616306 12.833624
6 7 24.560536 -11.055771
7 8 331.316552 306.756016
8 9 60.565852 -270.750700
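The same row-over-row reading seems to explain this table: diffing the MEAN column by hand reproduces the DIFF column, NaNs included (a sketch with the values copied from above):

import pandas as pd

mean_amount = pd.Series([21.95, 20.0, 35.604581, None, 22.782682,
                         35.616306, 24.560536, 331.316552, 60.565852])
print(mean_amount.diff())
# NaN, -1.95, 15.604581, NaN, NaN, 12.833624, ... matching the DIFF column above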
I have a rather "cross-platform" question. I hope it is not too general.
One of my tables, say customers, consists of my customer id's and their associated demographic information. Another table, say transaction, contains all purchases from the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in python. Hence, I would like a dataframe with the shops as columns and each customer's summed spend at those shops.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id age gender
1 35 MALE
2 57 FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id shop amount
1 2 250
1 2 500
2 3 100
2 7 200
2 11 125
Which should end up in a (preferably) Pandas dataframe as
id age gender shop_2 shop_3 shop_7 shop_11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
The last columns are then the customers' aggregated baskets.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id basket
1 ['2 : 250', '2 : 500']
2 ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal, since the values inside the [] are strings rather than integers. Hence, it involves a lot of manipulation and looping in python to get it into the format I want.
Is there any way where I can aggregate the purchases in SQL making it easier for python to read and aggregate into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id', aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id').drop(columns='customer_id')  # drop the duplicated join key
Resulting dataframe:
id age gender 2 3 7 11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
I'm relatively new to Python and wasn't able to find an answer to my question.
Let's say I have saved a DataFrame into the variable movies. The DataFrame looks somewhat like this:
Genre1 Genre2 Genre3 sales
Fantasy Drama Romance 5
Action Fantasy Comedy 3
Comedy Drama ScienceFiction 4
Drama Romance Action 8
What I want to do is get the average sales for every unique genre that appears in any of the columns Genre1, Genre2 or Genre3.
I've tried a few different things. What I have right now is:
for x in pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel()):
    mean_genre = np.mean(movies['sales'])
    print(x, mean_genre)
What I get as a result is:
Fantasy 5.0
Drama 5.0
Romance 5.0
Action 5.0
Comedy 5.0
ScienceFiction 5.0
So it does get me the unique genres across the three columns, but it calculates the average of the whole sales column. How do I get it to calculate the average sales for every unique genre that appears in any of the three columns Genre1, Genre2 and Genre3? E.g. for the genre 'Fantasy' it should use rows 1 and 2 to calculate the average sales.
Here is an even shorter version:
allGenre = pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
for genre in allGenre:
    print("%s : %f" % (genre, movies[movies.isin([genre]).any(axis=1)].sales.mean()))
I'm not sure this is exactly what you want to achieve, but this accumulates the sales value for each genre, each time it is encountered:
all_genres = pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
dff = pd.DataFrame(columns=['Nb_sales', 'Nb_view'],
                   index=all_genres, data=0)
for col in ['Genre1', 'Genre2', 'Genre3']:
    for genre, value in zip(movies[col].values, movies['sales'].values):
        dff.loc[genre, 'Nb_sales'] += value  # total sales of rows listing this genre
        dff.loc[genre, 'Nb_view'] += 1       # number of rows listing it
Then you can compute the mean value :
>>> dff['Mean'] = dff.Nb_sales / dff.Nb_view
>>> dff
Nb_sales Nb_view Mean
Romance 13 2 6.500000
Comedy 7 2 3.500000
ScienceFiction 4 1 4.000000
Fantasy 8 2 4.000000
Drama 17 3 5.666667
Action 11 2 5.500000
More compact solutions could be :
all_genres = pd.unique(movies[['Genre1','Genre2','Genre3']].values.ravel())
mean_series = pd.Series(index=all_genres, dtype=float)
for genre in all_genres:
    mean_series[genre] = movies.sales.loc[movies.eval(
        'Genre1 == "{0}" or Genre2 == "{0}" or Genre3 == "{0}"'
        .format(genre)).values].mean()

# Or in one (long) line:
mean_df = pd.DataFrame(columns=['Genre'], data=all_genres)
mean_df['mean'] = mean_df.Genre.apply(
    lambda x: movies.sales.loc[movies.eval(
        'Genre1 == "{0}" or Genre2 == "{0}" or Genre3 == "{0}"'
        .format(x)).values].mean())
Where they both will print your results:
>>> print(mean_series)
Fantasy 4.000000
Drama 5.666667
(....)