python - How to get rid of useless data in open dataset

I am using the open MovieLens dataset, specifically this file: http://files.grouplens.org/datasets/movielens/ml-100k/u.item. I am trying to parse it, and I load it into pandas like this:
movie_cols = ['movie_id', 'title','release_date','imdb_url']
movies = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.item',sep='|',names=movie_cols)
When I attempt to run
movies.head()
It shows this:

Use the usecols parameter of read_csv to keep only the 1st, 2nd, 3rd and 5th columns (positions 0, 1, 2 and 4):
movie_cols = ['movie_id', 'title', 'release_date', 'imdb_url']
movies = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.item',
sep='|',
names=movie_cols,
encoding='latin-1',
usecols = [0,1,2,4])
print (movies.head())
movie_id title release_date \
0 1 Toy Story (1995) 01-Jan-1995
1 2 GoldenEye (1995) 01-Jan-1995
2 3 Four Rooms (1995) 01-Jan-1995
3 4 Get Shorty (1995) 01-Jan-1995
4 5 Copycat (1995) 01-Jan-1995
imdb_url
0 http://us.imdb.com/M/title-exact?Toy%20Story%2...
1 http://us.imdb.com/M/title-exact?GoldenEye%20(...
2 http://us.imdb.com/M/title-exact?Four%20Rooms%...
3 http://us.imdb.com/M/title-exact?Get%20Shorty%...
4 http://us.imdb.com/M/title-exact?Copycat%20(1995)
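
If you want to try the usecols approach without downloading the file, here is a minimal sketch. The two sample rows are a shortened, hypothetical stand-in for u.item (the real file has more genre flag columns), so treat the data as an assumption and only the read_csv call as the point of the example.

```python
import io
import pandas as pd

# Hypothetical, shortened sample in the same pipe-delimited layout as u.item:
# id | title | release date | video release date | IMDb URL | genre flags...
sample = io.StringIO(
    "1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|1\n"
    "2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|0\n"
)

movie_cols = ['movie_id', 'title', 'release_date', 'imdb_url']

# usecols selects columns by position; names labels the kept columns in order.
movies = pd.read_csv(sample, sep='|', names=movie_cols, usecols=[0, 1, 2, 4])

print(movies.columns.tolist())
print(movies.shape)
```

The empty fourth field (position 3) and the genre flags are never loaded, so names only needs one label per kept column.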

Related

How to fit a statistical model with a one hot encoded variable

I have my data frame that initially looked like this:
item_id title user_id gender .....
0 1 Toy Story (1995) 308 M
1 4 Get Shorty (1995) 308 M
2 5 Copycat (1995) 308 M
Then I ran a mixed effects regression, which worked fine:
import statsmodels.api as sm
import statsmodels.formula.api as smf
md = smf.mixedlm("rating ~ C(gender) + C(genre) + C(gender)*C(genre)", data, groups=data["user_id"])
mdf=md.fit()
print(mdf.summary())
However, afterwards I did a one hot encoding on the gender variable and the dataframe became like this:
item_id title user_id gender_M gender_F .....
0 1 Toy Story (1995) 308 1 0
1 4 Get Shorty (1995) 308 1 0
2 5 Copycat (1995) 308 1 0
Would it be correct to run the model like this (changing gender with gender_M and gender_F)? Is it the same? Or is there a better way?
md = smf.mixedlm("rating ~ gender_M + gender_F + C(genre) + C(gender)*C(genre)", data, groups=data["user_id"])
mdf=md.fit()
print(mdf.summary())

Splitting a column and assigning to the same data frame

I have a dataset for movie recommendation and want to split the genres feature into two genre columns (genre_1, genre_2) and assign them to the same dataframe. The column holds all the genres together, separated by '|'. If a movie does not have two genres, then genre_1 should be copied into genre_2.
What is the best way to do it?
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
Thanks

The split function will take apart that string when given '|' as the separator. Pro tip: keeping the genres as a list will work much better than keeping them as two variables; you can iterate over the list instead of naming each variable, and if some flick is counted as more than two genres, you're home free.
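
A minimal sketch of the list idea, assuming a dataframe shaped like the one in the question (the three rows below are copied from it, and the genre_list and n_genres column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'movieId': [1, 2, 5],
    'title': ['Toy Story (1995)', 'Jumanji (1995)',
              'Father of the Bride Part II (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy'],
})

# Without expand=True, str.split stores one list of genres per row.
df['genre_list'] = df['genres'].str.split('|')

# Lists of any length are handled the same way, e.g. counting genres per movie.
df['n_genres'] = df['genre_list'].apply(len)

print(df[['title', 'genre_list', 'n_genres']])
```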

As suggested in the comment, you should provide an example of the output you're looking for; it's not completely clear from your question.
Anyway, you can split the genre list into separate columns using:
df['genres'].str.split('|',expand=True)
e.g.:
df['genres']
Out[13]:
0 Adventure|Animation|Children|Comedy|Fantasy
1 Adventure|Children|Fantasy
2 Comedy|Romance
3 Comedy|Drama|Romance
4 Comedy
df['genres'].str.split('|',expand=True)
Out[14]:
0 1 2 3 4
0 Adventure Animation Children Comedy Fantasy
1 Adventure Children Fantasy None None
2 Comedy Romance None None None
3 Comedy Drama Romance None None
4 Comedy None None None None
.str tells pandas to treat the values in that column as strings, which makes most Python string manipulation methods available on the whole column.
expand=True causes each piece of the split to be stored in a separate column.

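Thanks for the replies. I solved this problem in the following way (with help from a friend):
# Split on the first two '|' only; recent pandas versions need n as a keyword and expand=True
df[['genre_1', 'genre_2', 'genre_3']] = df['genres'].str.split('|', n=2, expand=True)
# Movies with a single genre get genre_1 copied into genre_2
df['genre_2'] = df['genre_2'].fillna(df['genre_1'])
# genre_3 only holds the unsplit remainder, so drop it
df = df.drop('genre_3', axis=1)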

python data science Find movies with highest female ratings

I am working on an assignment for my Data Science class. I just need help getting started, as I'm having trouble understanding how to use pandas to group rows and select distinct values.
I need to find the movies with the highest ratings by females. My code returns movies with rating = 5 and gender = 'F', but it repeats the same movie over and over, since more than one user rated it. I'm not sure how to show just the movie, the count of 5-star ratings, and gender = F. Below is my code:
import pandas as pd
import os
m = pd.read_csv('movies.csv')
u = pd.read_csv('users.csv')
r = pd.read_csv('ratings.csv')
ur = pd.merge(u,r)
data = pd.merge(m,ur)
df = pd.DataFrame(data)
top10 = df.loc[(df.gender == 'F')&(df.rating == 5)]
print(top10)
The data files can be downloaded here
I just need some help getting started. There's a lot more to the homework, but once I figure this out I can do the rest. I just need a jump-start. Thank you very much.
mv_id title genres rating user_id gender
1 Toy Story (1995) Animation|Children's|Comedy 5 1 F
2 Jumanji (1995) Adventure|Children's|Fantasy 5 2 F
3 Grumpier Old Men (1995) Comedy|Romance 5 3 F
4 Waiting to Exhale (1995) Comedy|Drama 5 4 F
5 Father of the Bride Part II (1995) Comedy 5 5 F

I would try to do the filtering operation on as little data as possible. To select 5-star ratings of female users, there's no need for the movie metadata (movies.csv). It can be done on the ur data, which is easier than on the df.
# filter the data in `ur`
f_5s_ratings = ur.loc[(ur.gender == 'F')&(ur.rating == 5)]
# count rows per `movie_id`
abs_num_f_5s_ratings = f_5s_ratings.groupby("movie_id").size()
In abs_num_f_5s_ratings you now have a Series, indexed by movie_id, counting the total number of 5-star ratings by female users per movie:
movie_id
1 253
2 15
3 14
...
If you join that data on the key movie_id with m as a new column (I'll leave it as an exercise to you), you can then sort by this value to get your top 10 movies by absolute number of 5-star ratings from females.
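
One possible way to do that join, as a rough sketch only. The small m and ur frames below are hypothetical stand-ins for the real CSV files, and the column name n_f_5s is made up; the key name movie_id follows the answer above and may differ in your files.

```python
import pandas as pd

# Hypothetical stand-ins for movies.csv and the merged users/ratings frame `ur`.
m = pd.DataFrame({'movie_id': [1, 2, 3],
                  'title': ['Toy Story (1995)', 'Jumanji (1995)',
                            'Grumpier Old Men (1995)']})
ur = pd.DataFrame({'user_id':  [1, 2, 3, 4, 5, 6],
                   'gender':   ['F', 'F', 'M', 'F', 'F', 'F'],
                   'movie_id': [1, 1, 1, 2, 3, 1],
                   'rating':   [5, 5, 5, 5, 4, 5]})

f_5s_ratings = ur.loc[(ur.gender == 'F') & (ur.rating == 5)]

# size() yields a Series indexed by movie_id; name it and turn it into a column.
counts = f_5s_ratings.groupby('movie_id').size().rename('n_f_5s').reset_index()

# Attach the titles, then keep the 10 movies with the most 5-star ratings.
top = (m.merge(counts, on='movie_id', how='inner')
        .sort_values('n_f_5s', ascending=False)
        .head(10))
print(top)
```

The inner join drops movies that have no 5-star rating from a female user; a left join followed by fillna(0) would keep them with a count of zero.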

Exception when using set_index in Pandas

I am trying out the set_index() method in Pandas but I get an exception I can not explain:
df
movieId title genres
1 2 Jumanji (1995) Adventure|Children|Fantasy
5 6 Heat (1995) Action|Crime|Thriller
10 11 American President, The (1995) Comedy|Drama|Romance
df.set_index(['a' , 'b' , 'c'], inplace = True)
df
KeyError: 'a'

If you want to set the index from a list of labels, pass a nested list (double []) with the same length as df:
df.set_index([['a' , 'b' , 'c']], inplace = True)
print (df)
movieId title genres
a 2 Jumanji (1995) Adventure|Children|Fantasy
b 6 Heat (1995) Action|Crime|Thriller
c 11 American President, The (1995) Comedy|Drama|Romance
If you pass a flat list ([]), pandas tries to build a MultiIndex from the columns a, b and c, and because those columns do not exist, an error is raised.
So if you want to set the index from existing columns:
df.set_index(['movieId' , 'title'], inplace = True)
print (df)
genres
movieId title
2 Jumanji (1995) Adventure|Children|Fantasy
6 Heat (1995) Action|Crime|Thriller
11 American President, The (1995) Comedy|Drama|Romance
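
The three cases can be checked side by side with a small sketch. The make_df helper is made up for illustration and simply rebuilds the three-row frame from the question.

```python
import pandas as pd

def make_df():
    # Hypothetical helper: rebuilds the sample frame from the question.
    return pd.DataFrame({
        'movieId': [2, 6, 11],
        'title': ['Jumanji (1995)', 'Heat (1995)',
                  'American President, The (1995)'],
        'genres': ['Adventure|Children|Fantasy', 'Action|Crime|Thriller',
                   'Comedy|Drama|Romance'],
    })

# A flat list is read as column names, so unknown names raise KeyError.
try:
    make_df().set_index(['a', 'b', 'c'])
    raised = False
except KeyError:
    raised = True

# A nested list is read as one array of index labels, one label per row.
by_labels = make_df().set_index([['a', 'b', 'c']])

# A flat list of existing column names builds a MultiIndex from those columns.
by_columns = make_df().set_index(['movieId', 'title'])

print(raised)
print(by_labels.index.tolist())
print(list(by_columns.index.names))
```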

pandas - How to aggregate two columns and keeping all other columns

I have the below synopsis of a df:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 5 3
1 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 268 2
2 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 276 4
3 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 217 3
4 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 87 4
What I'm looking for is to count 'user id' and average 'rating' per movie while keeping all other columns intact. So the result would be something like this:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 50 3.75
1 3 Four Rooms (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 35 2.34
Any idea how to do that?
Thanks

If the values in the columns you are not aggregating are the same within each group, you can avoid a join by adding those columns to the group keys.
Then pass a dictionary of functions to agg, and set as_index to False to keep the group-by columns as regular columns:
df.groupby(['movie id','movie title','release date','IMDb URL','genre'], as_index=False).agg({'user id':len,'rating':'mean'})
Note that len is used here to count the rows in each group.
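
Here is a small self-contained sketch of the same pattern. The user ids and ratings are invented sample values, and only two of the descriptive columns are included to keep it short. It uses the string 'count' instead of len; the two differ only in that 'count' skips missing values.

```python
import pandas as pd

# Hypothetical sample with invented user ids and ratings.
df = pd.DataFrame({
    'movie id': [2, 2, 2, 3, 3],
    'movie title': ['GoldenEye (1995)'] * 3 + ['Four Rooms (1995)'] * 2,
    'user id': [5, 268, 276, 1, 7],
    'rating': [3, 2, 4, 4, 2],
})

# Columns that are constant within a movie go into the group keys,
# so they survive the aggregation without a separate join.
result = (df.groupby(['movie id', 'movie title'], as_index=False)
            .agg({'user id': 'count', 'rating': 'mean'}))
print(result)
```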

When you have too many columns, you probably do not want to type all of the column names. So here is what I came up with:
column_map = {col: "first" for col in df.columns if col != "col_to_group"} # leave the grouping column out of the map
column_map["col_name1"] = "sum"
column_map["col_name2"] = lambda x: set(x) # it can also be a function or lambda
Now you can simply do:
df.groupby(["col_to_group"], as_index=False).aggregate(column_map)
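
Applied to data shaped like the question's, the column map idea might look like the sketch below. The rows are invented, group_key is a name introduced here for illustration, and the grouping column is deliberately left out of the map.

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration.
df = pd.DataFrame({
    'movie id': [2, 2, 3],
    'movie title': ['GoldenEye (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)'],
    'genre': ['Action|Adventure|Thriller', 'Action|Adventure|Thriller', 'Thriller'],
    'user id': [5, 268, 1],
    'rating': [3, 2, 4],
})

group_key = 'movie id'

# Default every non-key column to "first", then override the ones to aggregate.
column_map = {col: 'first' for col in df.columns if col != group_key}
column_map['user id'] = 'count'
column_map['rating'] = 'mean'

result = df.groupby(group_key, as_index=False).aggregate(column_map)
print(result)
```

Taking "first" is only safe for columns whose values really are constant within each group; otherwise it silently picks one value.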
