Filtering in Pandas dataframe - python

I'm grouping Rotten Tomatoes scores by director with the following:
director_counts = bigbadpanda.groupby(["Director"]).size().sort_values(ascending=False)
print(director_counts)  # prints:
Director
Woody Allen 44
Alfred Hitchcock 38
Clint Eastwood 32
Martin Scorsese 29
Steven Spielberg 29
Sidney Lumet 25
...
Question:
What's the best way to keep only the directors with more than 2 movies?
To filter by the average number of movies per director, would this work? bigbadpanda.groupby(["Director"]).size().mean()

Sample data I created based on your info (saved as data.txt):
Director,Movies
Woody Allen,44
Alfred Hitchcock,38
Clint Eastwood,32
Someone,2
Someone else,1
Simply do this:
import pandas as pd
df = pd.read_csv('data.txt')
print(df[df.Movies > 2])
Output:
Director Movies
0 Woody Allen 44
1 Alfred Hitchcock 38
2 Clint Eastwood 32
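The answer above starts from a CSV that already holds the counts. Starting instead from the question's `groupby(...).size()` result, which is a Series, the same boolean-mask idea applies directly, and the mean can serve as the threshold. A minimal sketch, assuming a made-up stand-in for `bigbadpanda` with one row per movie (the `Title` column and all values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the asker's `bigbadpanda`: one row per movie.
bigbadpanda = pd.DataFrame({
    "Director": ["Woody Allen"] * 4 + ["Alfred Hitchcock"] * 3
                + ["Someone"] * 2 + ["Someone else"],
    "Title": [f"Movie {i}" for i in range(10)],
})

# Movies per director, largest first.
director_counts = (
    bigbadpanda.groupby("Director").size().sort_values(ascending=False)
)

# Directors with more than 2 movies: mask the Series with a boolean condition.
more_than_two = director_counts[director_counts > 2]

# Directors above the average number of movies per director.
above_average = director_counts[director_counts > director_counts.mean()]

# Original rows (not counts) for directors with more than 2 movies.
movies_of_prolific = bigbadpanda.groupby("Director").filter(lambda g: len(g) > 2)

print(more_than_two)
print(above_average)
```

`.mean()` on its own only returns a single number; it becomes a filter once it is compared against the counts, as in `above_average`.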

Related

Python pandas merge map with multiple values xlookup

I have a dataframe of actor names:
df1
actor_id actor_name
1 Brad Pitt
2 Nicole Kidman
3 Matthew Goode
4 Uma Thurman
5 Ethan Hawke
And another dataframe of movies that the actors were in:
df2
actor_id actor_movie movie_revenue
1 Once Upon a Time in Hollywood 150
2 The Others 200
2 Moulin Rouge 50
3 Stoker 75
4 Kill Bill 125
5 Gattaca 85
I want to merge the two dataframes together to show the actors with their movie names and movie revenues, so I use the merge function:
df3 = df1.merge(df2, on = 'actor_id', how = 'left')
df3
actor_id actor_name actor_movie movie_revenue
1 Brad Pitt Once Upon a Time in Hollywood 150
2 Nicole Kidman Moulin Rouge 50
2 Nicole Kidman The Others 200
3 Matthew Goode Stoker 75
4 Uma Thurman Kill Bill 125
5 Ethan Hawke Gattaca 85
But this pulls in all movies, so Nicole Kidman gets duplicated, and I only want to show one movie per actor. How can I merge the dataframes without "duplicating" my list of actors?
How would I merge the movie title that is alphabetically first?
How would I merge the movie title with the highest revenue?
Thank you!
One way is to keep the merge and then filter the result set.
Movie title that is alphabetically first:
# sort by actor name and movie title, then take the first row per actor
df3.sort_values(['actor_name', 'actor_movie']).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman Moulin Rouge 50
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85
Movie title with the highest revenue:
# sort by actor name and revenue (descending), then take the first row per actor
df3.sort_values(['actor_name', 'movie_revenue'], ascending=[True, False]).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman The Others 200
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85
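An alternative is to reduce the movie table to one row per actor before merging, so the duplicates never appear. The sketch below is one possible way to do that, not the answerer's method. The frames are rebuilt by hand, the revenue column is assumed to be named `movie_revenue`, and the figures follow the merged output shown above (Moulin Rouge 50, The Others 200). One reason to consider it: `groupby(...).first()` takes the first non-null value of each column separately, whereas `drop_duplicates` and `idxmax` keep whole rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    "actor_id": [1, 2, 3, 4, 5],
    "actor_name": ["Brad Pitt", "Nicole Kidman", "Matthew Goode",
                   "Uma Thurman", "Ethan Hawke"],
})
df2 = pd.DataFrame({
    "actor_id": [1, 2, 2, 3, 4, 5],
    "actor_movie": ["Once Upon a Time in Hollywood", "The Others",
                    "Moulin Rouge", "Stoker", "Kill Bill", "Gattaca"],
    "movie_revenue": [150, 200, 50, 75, 125, 85],  # assumed column name
})

# Alphabetically first title: sort by title, keep the first row per actor.
first_alpha = df2.sort_values("actor_movie").drop_duplicates("actor_id")
by_title = df1.merge(first_alpha, on="actor_id", how="left")

# Highest revenue: idxmax returns the row label of the maximum in each group.
top_rows = df2.loc[df2.groupby("actor_id")["movie_revenue"].idxmax()]
by_revenue = df1.merge(top_rows, on="actor_id", how="left")

print(by_title)
print(by_revenue)
```

Because `df2` is deduplicated first, the left merge returns exactly one row per actor in `df1`.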

How do you create new column from two distinct categorical column values in a dataframe by same column ID in pandas?

Sorry for the confusing title. I am practicing how to manipulate dataframes in Python through pandas. How do I make this kind of table:
id role name
0 11 ACTOR Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1 11 DIRECTOR Christian Schwochow
2 22 ACTOR Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...
3 22 DIRECTOR Andrew Baird
4 33 ACTOR Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...
5 33 DIRECTOR Saron Sakina
Into this kind:
id director actors name
0 11 Christian Schwochow Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1 22 Andrew Baird Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...
2 33 Saron Sakina Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...
Try this way
df.pivot(index='id', columns='role', values='name')
In addition to @Tejas's answer, you can reset the index and rename the columns:
df = (df.pivot(index='id', columns='role', values='name')
        .reset_index()
        .rename_axis('', axis=1)
        .rename(columns={'ACTOR': 'actors name', 'DIRECTOR': 'director'}))
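`pivot` raises a `ValueError` when an `(id, role)` pair occurs more than once, for example two DIRECTOR rows for the same id. A minimal sketch of a variant that tolerates such duplicates by joining the names first; the frame is rebuilt from the table in the question, with the truncated names left exactly as shown there:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [11, 11, 22, 22, 33, 33],
    "role": ["ACTOR", "DIRECTOR"] * 3,
    "name": [
        "Luna Wedler, Jannis Niewöhner, Milan Peschel, ...",
        "Christian Schwochow",
        "Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...",
        "Andrew Baird",
        "Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...",
        "Saron Sakina",
    ],
})

wide = (
    df.groupby(["id", "role"])["name"]
      .agg(", ".join)              # collapse repeated (id, role) pairs into one string
      .unstack("role")             # roles become columns, one row per id
      .rename(columns={"ACTOR": "actors name", "DIRECTOR": "director"})
      .rename_axis(None, axis=1)   # drop the leftover "role" label on the columns
      .reset_index()[["id", "director", "actors name"]]
)
print(wide)
```

With no duplicate pairs, as in the question's data, this produces the same values as the `pivot` version.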

How Can I get this output using fuzzywuzzy?

Suppose I have two dataframes, one with the names (John, Alex, harry) and one with (ryan, kane, king). How can I use fuzzywuzzy in Python to get the following output?
fuzz.Ratio
John ryan 25
John kane 54
John king 44
alex ryan 23
alex kane 14
alex king 55
harry ryan 47
harry kane 47
harry king 50
Your ratios are wrong. What you are looking for is the Cartesian product of the corresponding columns of the two dataframes.
Sample code:
import itertools
import pandas as pd
from fuzzywuzzy import fuzz
df1 = pd.DataFrame({'name': ['John','Alex','harry']})
df2 = pd.DataFrame({'name': ['ryan','kane','king']})
# score every pair in the Cartesian product of the lower-cased names
for w1, w2 in itertools.product(df1['name'].str.lower(), df2['name'].str.lower()):
    print(f"{w1}, {w2}, {fuzz.ratio(w1, w2)}")
Output:
john, ryan, 25
john, kane, 25
john, king, 25
alex, ryan, 25
alex, kane, 50
alex, king, 0
harry, ryan, 44
harry, kane, 22
harry, king, 0
IIUC, you could do:
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd
a = ('John','Alex','harry')
b = ('ryan', 'kane', 'king')
# compute the ratios for each pair
res = ((ai, bi, fuzz.ratio(ai, bi)) for ai, bi in product(a, b))
# build the DataFrame, filtering out the pairs whose ratio is 0
out = pd.DataFrame([e for e in res if e[2] > 0], columns=['name_a', 'name_b', 'fuzz_ratio'])
print(out)
Output
name_a name_b fuzz_ratio
0 John ryan 25
1 John kane 25
2 John king 25
3 Alex kane 25
4 harry ryan 44
5 harry kane 22
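Since the inputs are dataframes, the Cartesian product can also be built inside pandas with a cross merge (`how="cross"`, available in pandas 1.2 and later). A minimal sketch: the `ratio` helper here is a hypothetical `difflib`-based stand-in so the example needs nothing beyond pandas and the standard library, and the column names `name_a`, `name_b`, and `score` are chosen for illustration. With fuzzywuzzy installed, `fuzz.ratio` can be called in its place.

```python
from difflib import SequenceMatcher

import pandas as pd

df1 = pd.DataFrame({"name_a": ["John", "Alex", "harry"]})
df2 = pd.DataFrame({"name_b": ["ryan", "kane", "king"]})


def ratio(a: str, b: str) -> int:
    """Stand-in scorer (assumption): 0-100 similarity of the lower-cased strings."""
    matcher = SequenceMatcher(None, a.lower(), b.lower())
    return int(round(100 * matcher.ratio()))


# Every row of df1 paired with every row of df2: 3 x 3 = 9 rows.
pairs = df1.merge(df2, how="cross")

# Score each pair; replace `ratio` with `fuzz.ratio` if fuzzywuzzy is available.
pairs["score"] = [ratio(a, b) for a, b in zip(pairs["name_a"], pairs["name_b"])]
print(pairs)
```

Keeping the result as a dataframe makes follow-up steps, such as `pairs[pairs["score"] > 0]` or picking the best match per name, ordinary pandas operations.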

How to run hypothesis test with pandas data frame and specific conditions?

I am trying to run a hypothesis test using an OLS model. I want to model tweet count based on the four groups in my data frame: Athletes, CEOs, Politicians, and Celebrities. Each name is labeled with its group in a single column.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
frames = [CEO_df, athletes_df, Celebrity_df, politicians_df]
final_df = pd.concat(frames)
final_df=final_df.reindex(columns=["name","group","tweet_count","retweet_count","favorite_count"])
final_df
model=ols("tweet_count ~ C(group)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
I want to do something along the lines of:
model=ols("tweet_count ~ C(Athlete) + C(Celebrity) + C(CEO) + C(Politicians)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
Is that even possible? How else will I be able to run a hypothesis test with those conditions?
Here is my printed final_df:
name group tweet_count retweet_count favorite_count
0 #aws_cloud # #ReInvent R “Ray” Wang 王瑞光 #1A CEO 6 6 0
1 Aaron Levie CEO 48 1140 18624
2 Andrew Mason CEO 24 0 0
3 Bill Gates CEO 114 78204 439020
4 Bill Gross CEO 36 486 1668
... ... ... ... ... ...
56 Tim Kaine Politician 48 8346 50898
57 Tim O'Reilly Politician 14 28 0
58 Trey Gowdy Politician 12 1314 6780
59 Vice President Mike Pence Politician 84 1146408 0
60 klay thompson Politician 48 41676 309924
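In a formula, `C(group)` expands the single categorical column into indicator (dummy) variables with one level held back as the baseline, so the first formula already compares all four groups. Separate `C(Athlete) + C(Celebrity) + ...` terms would need columns with those names, and the printed `final_df` has none. A minimal sketch of that expansion using pandas only; the numbers are made up for illustration and are not the asker's data:

```python
import pandas as pd

# Made-up values for illustration only.
final_df = pd.DataFrame({
    "group": ["CEO", "CEO", "Athlete", "Athlete",
              "Celebrity", "Celebrity", "Politician", "Politician"],
    "tweet_count": [48, 24, 30, 36, 60, 72, 12, 84],
})

# One indicator column per group except the baseline. drop_first drops the
# alphabetically first level ("Athlete"), which mirrors treatment coding.
indicators = pd.get_dummies(final_df["group"], drop_first=True)

# Per-group means: the quantities a one-way ANOVA on C(group) compares.
group_means = final_df.groupby("group")["tweet_count"].mean()

print(indicators)
print(group_means)
```

Under this reading, the `anova_lm` table for `tweet_count ~ C(group)` tests whether the four group means are equal, which is usually the hypothesis wanted here.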

Iterate through row if statement and add to new columns [Pandas/Python]

I have a dataframe with 3 columns
Hospital 2009-10 2010-11
Aberystwyth Mental Health Unit 19 19
Bro Ddyfi Community Hospital 16 10
Bronglais General Hospital 160 148
Caebryn Mental Health Unit 37 39
Carmarthen Mental Health Unit 38 31
I am trying to create a function that checks whether a keyword appears in the Hospital column and, if so, puts that keyword in a new column, like so:
Hospital 2009-10 2010-11 Hospital Type
Aberystwyth Mental Health Unit 19 19 Mental
Bro Ddyfi Community Hospital 16 10 Community
Bronglais General Hospital 160 148 General
Caebryn Mental Health Unit 37 39 Mental
Carmarthen Mental Health Unit 38 31 Mental
Here's the code I have tried:
def find_type(x):
    if df['Hospital'].str.contains("Mental").any():
        return "Mental"
    if df['Hospital'].str.contains("Community").any():
        return "Community"
    else:
        return "Other"
df['Hospital Type'] = df.apply(find_type)
The output I get instead is this:
Hospital 2009-10 2010-11 Hospital Type
Aberystwyth Mental Health Unit 19 19 NaN
Bro Ddyfi Community Hospital 16 10 NaN
Bronglais General Hospital 160 148 NaN
Caebryn Mental Health Unit 37 39 NaN
Carmarthen Mental Health Unit 38 31 NaN
How can I get it so it comes out like the expected output?
Thank you!
Use str.extract with the keywords separated by | in a regex, then fillna for the rows that match none of them:
pat = r"(Mental|Community)"
df['Hospital Type'] = df['Hospital'].str.extract(pat, expand=False).fillna('Other')
print (df)
Hospital 2009-10 2010-11 Hospital Type
0 Aberystwyth Mental Health Unit 19 19 Mental
1 Bro Ddyfi Community Hospital 16 10 Community
2 Bronglais General Hospital 160 148 Other
3 Caebryn Mental Health Unit 37 39 Mental
4 Carmarthen Mental Health Unit 38 31 Mental
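As for why the original attempt produced NaN: `df.apply(find_type)` calls the function once per column, not once per row, so the result is indexed by column name and does not line up with the row index when assigned. The function also tests the whole `Hospital` column rather than the value it was given. A minimal sketch of a row-wise version that keeps the asker's function style; the frame is rebuilt from the question and the keyword tuple is an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "Hospital": [
        "Aberystwyth Mental Health Unit",
        "Bro Ddyfi Community Hospital",
        "Bronglais General Hospital",
        "Caebryn Mental Health Unit",
        "Carmarthen Mental Health Unit",
    ],
    "2009-10": [19, 16, 160, 37, 38],
    "2010-11": [19, 10, 148, 39, 31],
})


def find_type(name: str) -> str:
    """Return the first keyword found in one hospital name, else "Other"."""
    for keyword in ("Mental", "Community"):  # assumed keyword list
        if keyword in name:
            return keyword
    return "Other"


# Apply to the Hospital Series so the function receives one name at a time.
df["Hospital Type"] = df["Hospital"].apply(find_type)
print(df)
```

Adding "General" to the keyword tuple (or to the regex pattern above) would label Bronglais General Hospital as General instead of Other.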
