Exception when using set_index in Pandas - python

I am trying out the set_index() method in Pandas but I get an exception I cannot explain:
df
movieId title genres
1 2 Jumanji (1995) Adventure|Children|Fantasy
5 6 Heat (1995) Action|Crime|Thriller
10 11 American President, The (1995) Comedy|Drama|Romance
df.set_index(['a' , 'b' , 'c'], inplace = True)
df
KeyError: 'a'

If you want to set the index from a nested list (double []) with the same length as the df:
df.set_index([['a' , 'b' , 'c']], inplace = True)
print (df)
movieId title genres
a 2 Jumanji (1995) Adventure|Children|Fantasy
b 6 Heat (1995) Action|Crime|Thriller
c 11 American President The (1995) Comedy|Drama|Romance
If you use a flat list ([]), pandas tries to set the columns a, b, c as a MultiIndex, and because those columns do not exist a KeyError is raised.
So if you want to set the index from existing columns:
df.set_index(['movieId' , 'title'], inplace = True)
print (df)
genres
movieId title
2 Jumanji (1995) Adventure|Children|Fantasy
6 Heat (1995) Action|Crime|Thriller
11 American President The (1995) Comedy|Drama|Romance
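As a side note, here is a minimal, self-contained sketch of both behaviours; the sample frame is rebuilt from the values shown in the question:
import pandas as pd

df = pd.DataFrame({'movieId': [2, 6, 11],
                   'title': ['Jumanji (1995)', 'Heat (1995)',
                             'American President, The (1995)'],
                   'genres': ['Adventure|Children|Fantasy',
                              'Action|Crime|Thriller',
                              'Comedy|Drama|Romance']},
                  index=[1, 5, 10])

# flat list -> interpreted as column names, so this raises KeyError: 'a'
# df.set_index(['a', 'b', 'c'], inplace=True)

# nested list -> interpreted as the new index values themselves
df.set_index([['a', 'b', 'c']], inplace=True)

# flat list of existing column names -> MultiIndex built from those columns
print(df.set_index(['movieId', 'title']))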

Related

Python Pandas concatenate every 2nd row to previous row

I have a Pandas dataframe similar to this one:
age name sex
0 30 jon male
1 blue php null
2 18 jane female
3 orange c++ null
and I am trying to concatenate every second row to the previous one adding extra columns:
age name sex colour language other
0 30 jon male blue php null
1 18 jane female orange c++ null
I tried shift() but it duplicated every row.
How can this be done?
You can create a new dataframe by slicing the dataframe using iloc with a step of 2:
cols = ['age', 'name', 'sex']
new_cols = ['colour', 'language', 'other']
d = dict()
for col, ncol in zip(cols, new_cols):
    d[col] = df[col].iloc[::2].values
    d[ncol] = df[col].iloc[1::2].values
pd.DataFrame(d)
Result:
age colour name language sex other
0 30 blue jon php male NaN
1 18 orange jane c++ female NaN
TRY:
df = pd.concat([df.iloc[::2].reset_index(drop=True),
                pd.DataFrame(df.iloc[1::2].values,
                             columns=['colour', 'language', 'other'])],
               axis=1)
OUTPUT:
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
Reshape the values and create a new dataframe
pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
             columns=['age', 'name', 'sex', 'colour', 'language', 'other'])
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
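None of the snippets above define df, so here is a sketch that rebuilds the sample frame from the question (the 'null' cells are assumed to be NaN) and runs the reshape approach end to end:
import numpy as np
import pandas as pd

df = pd.DataFrame({'age':  ['30', 'blue', '18', 'orange'],
                   'name': ['jon', 'php', 'jane', 'c++'],
                   'sex':  ['male', np.nan, 'female', np.nan]})

# pair every second row with the previous one by reshaping to twice the width
out = pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
                   columns=['age', 'name', 'sex', 'colour', 'language', 'other'])
print(out)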

How to append one table to another with a condition and re-rank the result with pandas

I have a data frame like this:
for example
user Top Genre
a Horror
b Romance
and I have the content-based table for each genre:
for example
Genre Rec Rank
Horror Action 1
Horror Comedy 2
Romance Asian 1
Romance Comedy 2
I want to join the tables so the output will be:
for example
User Rec Rank
a Horror 1
a Action 2
a Comedy 3
b Romance 1
b Asian 2
b Comedy 3
How can I process the two tables with pandas so that the output is like the table above?
Use DataFrame.merge with a right join, then append the same DataFrame with the genre itself added as a row via DataFrame.assign, sort by both columns, and finally add 1 to Rank:
df11 = df1.rename(columns={'Top Genre':'Genre'})
df = df11.merge(df2, how='right').append(df11.assign(Rec = df11['Genre'], Rank=0))
df = df.sort_values(['user','Rank'], ignore_index=True)
df['Rank'] +=1
print (df)
user Genre Rec Rank
0 a Horror Horror 1
1 a Horror Action 2
2 a Horror Comedy 3
3 b Romance Romance 1
4 b Romance Asian 2
5 b Romance Comedy 3
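Note that DataFrame.append was removed in pandas 2.0; the same idea can be written with pd.concat. A self-contained sketch, with df1 and df2 rebuilt from the question (column names are assumed as shown there):
import pandas as pd

df1 = pd.DataFrame({'user': ['a', 'b'],
                    'Top Genre': ['Horror', 'Romance']})
df2 = pd.DataFrame({'Genre': ['Horror', 'Horror', 'Romance', 'Romance'],
                    'Rec': ['Action', 'Comedy', 'Asian', 'Comedy'],
                    'Rank': [1, 2, 1, 2]})

df11 = df1.rename(columns={'Top Genre': 'Genre'})
# one extra row per user for the genre itself, ranked 0 so it sorts first
self_rows = df11.assign(Rec=df11['Genre'], Rank=0)
df = pd.concat([df11.merge(df2, how='right'), self_rows])
df = df.sort_values(['user', 'Rank'], ignore_index=True)
df['Rank'] += 1
print(df)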

python - How to get rid of useless data in open dataset

I am using an open dataset. Specifically I am using this file: http://files.grouplens.org/datasets/movielens/ml-100k/u.item. I am attempting to parse it, loading it into pandas as follows:
movie_cols = ['movie_id', 'title','release_date','imdb_url']
movies = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.item',sep='|',names=movie_cols)
When I attempt to run
movies.head()
it shows a lot of data I do not need (the screenshot from the original post is not reproduced here).
You need the usecols parameter to select the 1st, 2nd, 3rd and 5th columns in read_csv:
movie_cols = ['movie_id', 'title', 'release_date', 'imdb_url']
movies = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.item',
                     sep='|',
                     names=movie_cols,
                     encoding='latin-1',
                     usecols=[0, 1, 2, 4])
print (movies.head())
movie_id title release_date \
0 1 Toy Story (1995) 01-Jan-1995
1 2 GoldenEye (1995) 01-Jan-1995
2 3 Four Rooms (1995) 01-Jan-1995
3 4 Get Shorty (1995) 01-Jan-1995
4 5 Copycat (1995) 01-Jan-1995
imdb_url
0 http://us.imdb.com/M/title-exact?Toy%20Story%2...
1 http://us.imdb.com/M/title-exact?GoldenEye%20(...
2 http://us.imdb.com/M/title-exact?Four%20Rooms%...
3 http://us.imdb.com/M/title-exact?Get%20Shorty%...
4 http://us.imdb.com/M/title-exact?Copycat%20(1995)
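If you also want release_date parsed as a datetime (an extra step, not part of the original question), the same call can take parse_dates:
movie_cols = ['movie_id', 'title', 'release_date', 'imdb_url']
movies = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.item',
                     sep='|',
                     names=movie_cols,
                     encoding='latin-1',
                     usecols=[0, 1, 2, 4],
                     parse_dates=['release_date'])  # dates like 01-Jan-1995
print(movies.dtypes)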

Group by and Count distinct words in Pandas DataFrame

By year and name, I am hoping to count the occurrence of words in a dataframe imported from Excel; the results will also be exported to Excel.
This is the sample code:
source = pd.DataFrame({'Name' : ['John', 'Mike', 'John', 'John'],
                       'Year' : ['1999', '2000', '2000', '2000'],
                       'Message' : ['I Love You', 'Will Remember You',
                                    'Love', 'I Love You']})
Expected results are the following, in a dataframe. Any ideas?
Year Name Message Count
1999 John I 1
1999 John love 1
1999 John you 1
2000 Mike Will 1
2000 Mike Remember 1
2000 Mike You 1
2000 John Love 2
2000 John I 1
2000 John You 1
I think you can first split the column Message, create a Series from it and join it back to the original source. Last, groupby with size:
# split column Message and create a Series by stack
s = source.Message.str.split(expand=True).stack()
# remove the last level of the MultiIndex
s.index = s.index.droplevel(-1)
s.name = 'Message'
print(s)
0 I
0 Love
0 You
1 Will
1 Remember
1 You
2 Love
3 I
3 Love
3 You
Name: Message, dtype: object
# remove the old column Message
source = source.drop(['Message'], axis=1)
# join Series s to df source
df = source.join(s)
#aggregate size
print (df.groupby(['Year', 'Name', 'Message']).size().reset_index(name='count'))
Year Name Message count
0 1999 John I 1
1 1999 John Love 1
2 1999 John You 1
3 2000 John I 1
4 2000 John Love 2
5 2000 John You 1
6 2000 Mike Remember 1
7 2000 Mike Will 1
8 2000 Mike You 1
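On pandas 0.25 or newer the split/stack/join steps can be replaced with DataFrame.explode; a sketch of the equivalent, starting again from the original source frame (before Message was dropped above):
out = (source.assign(Message=source['Message'].str.split())  # split into word lists
             .explode('Message')                             # one row per word
             .groupby(['Year', 'Name', 'Message'])
             .size()
             .reset_index(name='count'))
print(out)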

pandas - How to aggregate two columns and keeping all other columns

I have the below synopsis of a df:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 5 3
1 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 268 2
2 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 276 4
3 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 217 3
4 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 87 4
What I'm looking for is to count 'user id' and average 'rating' while keeping all other columns intact. So the result will be something like this:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 50 3.75
1 3 Four Rooms (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 35 2.34
any idea how to do that?
Thanks
If all the values in the columns you are aggregating over are the same for each group, then you can avoid a join by putting those columns into the groupby.
Then pass a dictionary of functions to agg, and set as_index to False to keep the grouped-by columns as columns:
df.groupby(['movie id', 'movie title', 'release date', 'IMDb URL', 'genre'],
           as_index=False).agg({'user id': len, 'rating': 'mean'})
Note that len is used to count the rows in each group.
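On pandas 0.25 or newer the same aggregation can also be written with named aggregation; since keyword names must be valid identifiers, the result columns are renamed here:
df.groupby(['movie id', 'movie title', 'release date', 'IMDb URL', 'genre'],
           as_index=False).agg(user_count=('user id', 'size'),
                               rating_mean=('rating', 'mean'))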
When you have too many columns, you probably do not want to type all of the column names. So here is what I came up with:
column_map = {col: "first" for col in df.columns}
column_map["col_name1"] = "sum"
column_map["col_name2"] = lambda x: set(x) # it can also be a function or lambda
Now you can simply do:
df.groupby(["col_to_group"], as_index=False).aggregate(column_map)
