pandas - How to aggregate two columns and keep all other columns - python

I have the below synopsis of a df:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 5 3
1 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 268 2
2 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 276 4
3 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 217 3
4 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 87 4
What I'm looking for is to count 'user id' and average 'rating' while keeping all other columns intact, so the result would be something like this:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 50 3.75
1 3 Four Rooms (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 35 2.34
Any idea how to do that?
Thanks

If the values in the columns you are not aggregating are the same within each group, you can avoid a join by adding those columns to the group keys.
Then pass a dictionary of functions to agg, and set as_index to False to keep the grouped-by columns as regular columns:
df.groupby(['movie id','movie title','release date','IMDb URL','genre'], as_index=False).agg({'user id':len,'rating':'mean'})
Note that len is used to count the rows in each group.
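As a minimal sketch of the approach above, here is a self-contained version on toy data. Only the column names come from the question; the rows are invented, and 'IMDb URL' and 'release date' are left out to keep it short. It uses the string 'count' instead of len, which counts non-null values rather than all rows.

```python
import pandas as pd

# Toy data, invented for illustration; only the column names come from the question.
df = pd.DataFrame({
    "movie id": [2, 2, 2, 3, 3],
    "movie title": ["GoldenEye (1995)"] * 3 + ["Four Rooms (1995)"] * 2,
    "genre": ["Action|Adventure|Thriller"] * 3 + ["Thriller"] * 2,
    "user id": [5, 268, 276, 1, 7],
    "rating": [3, 2, 4, 4, 2],
})

# Columns that are constant within each movie go into the group keys,
# so they survive the aggregation unchanged.
keys = ["movie id", "movie title", "genre"]

# Count the users and average the ratings per movie.
result = df.groupby(keys, as_index=False).agg({"user id": "count", "rating": "mean"})
print(result)
```

Each movie ends up on one row, with 'user id' holding the number of ratings and 'rating' holding their mean.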

When you have too many columns, you probably do not want to type all of the column names. So here is what I came up with:
column_map = {col: "first" for col in df.columns}
column_map["col_name1"] = "sum"
column_map["col_name2"] = lambda x: set(x) # it can also be a function or lambda
Now you can simply do:
df.groupby(["col_to_group"], as_index=False).aggregate(column_map)
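A runnable sketch of the column-map idea on invented data follows. The column 'label' and all values are hypothetical. Two choices here are assumptions rather than part of the answer: the group key is left out of the map so it is not aggregated itself, and the lambda joins the distinct values into a string instead of returning a set.

```python
import pandas as pd

# Hypothetical frame; the names col_to_group, col_name1, col_name2 follow the answer.
df = pd.DataFrame({
    "col_to_group": ["a", "a", "b"],
    "label": ["x", "x", "y"],
    "col_name1": [1, 2, 3],
    "col_name2": ["p", "q", "p"],
})

# Default every non-key column to "first", then override the ones to aggregate.
column_map = {col: "first" for col in df.columns if col != "col_to_group"}
column_map["col_name1"] = "sum"
column_map["col_name2"] = lambda x: ", ".join(sorted(set(x)))

result = df.groupby("col_to_group", as_index=False).agg(column_map)
print(result)
```

The "first" default only makes sense for columns that are constant within a group; otherwise it silently picks the first value it sees.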

Related

Applying a function to columns in a dataframe whose column headings contain a specific string

I have a dataframe called passenger_details which is shown below
Passenger Age Gender Commute_to_work Commute_mode Commute_time ...
Passenger1 32 Male I drive to work car 1 hour
Passenger2 26 Female I take the metro train NaN ...
Passenger3 33 Female NaN NaN 30 mins ...
Passenger4 29 Female I take the metro train NaN ...
...
I want to apply a function that turns missing values (NaN) into 0 and present values into 1, for the columns whose headings contain the string 'Commute'.
This is basically what I'm trying to achieve
Passenger Age Gender Commute_to_work Commute_mode Commute_time ...
Passenger1 32 Male 1 1 1
Passenger2 26 Female 1 1 0 ...
Passenger3 33 Female 0 0 1 ...
Passenger4 29 Female 1 1 0 ...
...
However, I'm struggling with how to phrase my code. This is what I have done
passenger_details = passenger_details.filter(regex = 'Location_', axis = 1).apply(lambda value: str(value).replace('value', '1', 'NaN','0'))
But I get a TypeError:
'replace() takes at most 3 arguments (4 given)'
Any help would be appreciated
Select the columns with Index.str.contains, test for non-missing values with DataFrame.notna, and finally cast to integer to map True/False to 1/0:
c = df.columns.str.contains('Commute')
df.loc[:, c] = df.loc[:, c].notna().astype(int)
print (df)
Passenger Age Gender Commute_to_work Commute_mode Commute_time
0 Passenger1 32 Male 1 1 1
1 Passenger2 26 Female 1 1 0
2 Passenger3 33 Female 0 0 1
3 Passenger4 29 Female 1 1 0
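For reference, a self-contained variant of the same idea, assuming a cut-down passenger_details frame with invented values. Instead of assigning back through loc, it builds the 0/1 flags separately and joins them on, which is one way to make sure the resulting columns have an integer dtype.

```python
import numpy as np
import pandas as pd

# Reduced, invented version of the frame from the question.
passenger_details = pd.DataFrame({
    "Passenger": ["Passenger1", "Passenger2", "Passenger3"],
    "Age": [32, 26, 33],
    "Commute_to_work": ["I drive to work", "I take the metro", np.nan],
    "Commute_time": ["1 hour", np.nan, "30 mins"],
})

# Pick the columns whose heading contains 'Commute'.
commute_cols = [c for c in passenger_details.columns if "Commute" in c]

# notna() gives True/False; astype(int) maps that to 1/0.
flags = passenger_details[commute_cols].notna().astype(int)

# Swap the original text columns for the flag columns.
result = passenger_details.drop(columns=commute_cols).join(flags)
print(result)
```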

pandas - Can't merge df/series and groupby then count

TL;DR:
I have 2 dataframes of different sizes that share an 'id' column, which is supposed to act as the index. I need to merge them, group by 'sector' and 'gender', and count the entries in each group.
Long version:
I have a dataframe of company personnel with 'id' and 'sector', among other columns, and another dataframe with 'id' and 'gender'. Examples below:
df1:
row* id sector other columns
1 0 Operational ...
2 0 Administrative ...
3 1 Sales ...
4 2 IT ...
5 3 Operational ...
6 3 IT ...
7 4 Sales ...
[...]
150 100 Operational ...
151 100 Sales ...
152 101 IT ...
*I don't really have a 'row' column, it's there just to make it easier to understand my problem.
df2:
row* id gender
1 0 Male
2 1 Female
3 2 Female
4 3 Male
5 4 Male
[...]
101 100 Male
102 101 Female
As you can see, one person can be in more than one sector (which seems to make my problem more complicated).
I need to merge them and then count how many males and females are in each sector.
FIRST PROBLEM
I decided to make a new df with only the columns 'id' and 'sector'.
df3 = df1[['id','sector']]
df3 = df3.merge(df2)
I get:
No common columns to perform merge on. Merge options: left_on=None,
right_on=None, left_index=False, right_index=False
I tried using .join() instead of .merge() and I get:
"['id'] not in index"
I then tried reset_index(), found in some of the answers around here, but it didn't really solve my issue.
df1 = df1.reset_index()
df3 = df1[['id','sector']]
df3 = df3.join(df2)
What I got was this:
row* id sector gender
1 0 Operational Male
2 0 Administrative Female
3 1 Sales Female
4 2 IT Male
5 3 Operational Male
6 3 IT ...
7 4 Sales ...
[...]
150 100 Operational NaN
151 100 Sales NaN
152 101 IT NaN
It didn't respect the 'id' and just concatenated the column to the side. Since df2 only had 102 rows, I got NaN in the other rows (103 to 152), aside from the fact that the 'gender' was no longer accurate.
SECOND PROBLEM
Decided to power through that in order to get the rest of the work done. I tried this:
df3 = df3.groupby('sector','gender').size()
It raises:
No axis named gender for object type < class 'pandas.core.frame.DataFrame'>
This doesn't really make sense to me, because I can call df3.gender and get the (entire) expected series. If I remove 'gender' from the line above, it actually groups, but that alone doesn't work for me. I also tried passing the column names before groupby, to no avail.
Expected result should be something like this:
sector gender sum
operational male 20
operational female 5
administrative male 10
administrative female 17
sales male 12
sales female 13
IT male 1
IT female 11
Not sure if I can answer my own question, but I think I should since the issue is resolved.
The solutions were very simple, even though I don't understand some of the issues I got.
First problem: I added on='id' to the merge.
df3 = df1[['id','sector']].merge(df2, on='id')
Second problem: I was just missing a list, as pointed out by @DYZ.
df3.groupby(['sector','gender']).size()
Feeling quite stupid right now... Must be tired. Thanks DYZ and sorry for the trouble.
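Putting the two fixes together, here is a small sketch using the first rows of df1 and df2 from the question. The name 'count' for the result column is my choice. On the second error: the message suggests that in groupby('sector','gender') the second positional argument was read as the axis parameter, which is why pandas complained about an axis named gender.

```python
import pandas as pd

# First rows of the two frames shown in the question.
df1 = pd.DataFrame({
    "id": [0, 0, 1, 2, 3, 3, 4],
    "sector": ["Operational", "Administrative", "Sales", "IT",
               "Operational", "IT", "Sales"],
})
df2 = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male", "Male"],
})

# Merge on the shared 'id' column so each sector row gets the right gender.
merged = df1[["id", "sector"]].merge(df2, on="id")

# Group keys must be passed as one list; size() counts the rows per group.
counts = merged.groupby(["sector", "gender"]).size().reset_index(name="count")
print(counts)
```

Because the merge matches on 'id', a person listed in two sectors is counted once in each of them.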

Exception when using set_index in Pandas

I am trying out the set_index() method in Pandas but I get an exception I can not explain:
df
movieId title genres
1 2 Jumanji (1995) Adventure|Children|Fantasy
5 6 Heat (1995) Action|Crime|Thriller
10 11 American President, The (1995) Comedy|Drama|Romance
df.set_index(['a' , 'b' , 'c'], inplace = True)
df
KeyError: 'a'
If you want to set the index from a list of labels, pass a nested list (double []) whose inner list has the same length as df:
df.set_index([['a' , 'b' , 'c']], inplace = True)
print (df)
movieId title genres
a 2 Jumanji (1995) Adventure|Children|Fantasy
b 6 Heat (1995) Action|Crime|Thriller
c 11 American President The (1995) Comedy|Drama|Romance
If you pass a flat list ([]), pandas treats a, b and c as column names to build a MultiIndex from, and because those columns do not exist, an error is raised.
So if you want to set the index from existing columns:
df.set_index(['movieId' , 'title'], inplace = True)
print (df)
genres
movieId title
2 Jumanji (1995) Adventure|Children|Fantasy
6 Heat (1995) Action|Crime|Thriller
11 American President The (1995) Comedy|Drama|Romance
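The three cases can be compared side by side in a short sketch. The frame is rebuilt from the rows shown in the question, and the calls return new frames instead of using inplace=True so that df stays unchanged between steps.

```python
import pandas as pd

df = pd.DataFrame({
    "movieId": [2, 6, 11],
    "title": ["Jumanji (1995)", "Heat (1995)", "American President, The (1995)"],
    "genres": ["Adventure|Children|Fantasy", "Action|Crime|Thriller",
               "Comedy|Drama|Romance"],
})

# Flat list: the entries are looked up as column names, which do not exist.
try:
    df.set_index(["a", "b", "c"])
    error = None
except KeyError as exc:
    error = exc

# Nested list: the inner list is used as the index labels themselves.
labelled = df.set_index([["a", "b", "c"]])

# Flat list of existing columns: builds a MultiIndex from them.
by_columns = df.set_index(["movieId", "title"])

print(type(error).__name__)
print(labelled)
print(by_columns)
```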

Select rows with EXCEPT in sqlite3

I have a database with a table that contains the columns Name, Award, Winner (1 means won and 0 means did not win) and some others that are irrelevant for this question.
I want to make a dataframe with the names of people who were selected for an actress award (all awards with the name actress in them count) but never won, using sqlite3 in Python.
These are the first five rows of the dataframe:
Unnamed: 0 CeremonyNumber CeremonyYear CeremonyMonth CeremonyDay FilmYear Award Winner Name FilmDetails
0 0 1 1929 5 16 1927 Actor 1 Emil Jannings The Last Command
1 1 1 1929 5 16 1927 Actor 0 Richard Barthelmess The Noose
2 2 1 1929 5 16 1927 Actress 1 Janet Gaynor 7th Heaven
3 3 1 1929 5 16 1927 Actress 0 Louise Dresser A Ship Comes In
4 4 1 1929 5 16 1927 Actress 0 Gloria Swanson Sadie Thompson
I tried this query, but it did not give the correct result.
query = '''
select Name
from oscars
where Award like "Actress%"
except select Name
from oscars
where Award like "Actress%" and Winner == 1
'''
The outcome of this query should be a dataframe like this:
Name
0 Abigail Breslin
1 Adriana Barraza
2 Agnes Moorehead
3 Alfre Woodard
4 Ali MacGraw
In order to select all the actresses who were selected for the award and never won, you should use AND rather than EXCEPT. Something like this should work:
SELECT Name from Oscars WHERE Award LIKE "Actress%" AND Winner = 0
Refer to the sqlite docs at https://www.sqlite.org/index.html for more information.
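One thing worth checking with the AND filter: it keeps every name that has at least one losing nomination, so someone who lost one year and won in another would still appear. If "never won" has to hold across all of a person's nominations, grouping by name is one option. The sketch below uses an in-memory database with invented rows and placeholder names; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oscars (Name TEXT, Award TEXT, Winner INTEGER)")
conn.executemany(
    "INSERT INTO oscars VALUES (?, ?, ?)",
    [
        # Invented rows with placeholder names, for illustration only.
        ("Person A", "Actress", 1),
        ("Person A", "Actress in a Supporting Role", 0),  # lost once, won once
        ("Person B", "Actress", 0),
        ("Person C", "Actress in a Leading Role", 0),
        ("Person D", "Actor", 0),
    ],
)

# Names with at least one losing actress nomination (the AND filter).
lost_once = [row[0] for row in conn.execute(
    "SELECT DISTINCT Name FROM oscars "
    "WHERE Award LIKE 'Actress%' AND Winner = 0 ORDER BY Name"
)]

# Names whose actress nominations never include a win.
never_won = [row[0] for row in conn.execute(
    "SELECT Name FROM oscars WHERE Award LIKE 'Actress%' "
    "GROUP BY Name HAVING MAX(Winner) = 0 ORDER BY Name"
)]

print(lost_once)
print(never_won)
conn.close()
```

To get a dataframe rather than a list, the same SQL string can be passed to pandas.read_sql_query together with the connection.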

Group by and Count distinct words in Pandas DataFrame

I am hoping to count the occurrences of words by year and name in a dataframe imported from Excel; the results will also be exported to Excel.
This is the sample code:
import pandas as pd

source = pd.DataFrame({'Name' : ['John', 'Mike', 'John','John'],
                       'Year' : ['1999', '2000', '2000','2000'],
                       'Message' : [
                           'I Love You','Will Remember You','Love','I Love You']})
The expected results are the following, in a dataframe. Any ideas?
Year Name Message Count
1999 John I 1
1999 John love 1
1999 John you 1
2000 Mike Will 1
2000 Mike Remember 1
2000 Mike You 1
2000 John Love 2
2000 John I 1
2000 John You 1
I think you can first split the column Message, create a Series and join it to the original source. Then use groupby with size:
#split column Message to new df, create Series by stack
s = (source.Message.str.split(expand=True).stack())
#remove multiindex
s.index = s.index.droplevel(-1)
s.name= 'Message'
print(s)
0 I
0 Love
0 You
1 Will
1 Remember
1 You
2 Love
3 I
3 Love
3 You
Name: Message, dtype: object
#remove old column Message
source = source.drop(['Message'], axis=1)
#join Series s to df source
df = (source.join(s))
#aggregate size
print (df.groupby(['Year', 'Name', 'Message']).size().reset_index(name='count'))
Year Name Message count
0 1999 John I 1
1 1999 John Love 1
2 1999 John You 1
3 2000 John I 1
4 2000 John Love 2
5 2000 John You 1
6 2000 Mike Remember 1
7 2000 Mike Will 1
8 2000 Mike You 1
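On pandas 0.25 or later, the split, stack and join steps can be collapsed with DataFrame.explode. This is an alternative sketch rather than the method used above, and it should give the same counts on the sample data.

```python
import pandas as pd

source = pd.DataFrame({
    "Name": ["John", "Mike", "John", "John"],
    "Year": ["1999", "2000", "2000", "2000"],
    "Message": ["I Love You", "Will Remember You", "Love", "I Love You"],
})

# Split each message into a list of words, then give every word its own row.
words = source.assign(Message=source["Message"].str.split()).explode("Message")

# Count how often each word occurs per year and name.
counts = (
    words.groupby(["Year", "Name", "Message"])
    .size()
    .reset_index(name="count")
)
print(counts)
```

The counts are case-sensitive, so 'Love' and 'love' would be tallied separately; lowercasing the Message column first would merge them.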
