Groupby for 3 columns - python

I would like to use the groupby function on 3 columns. The first column has the surname of each family, the second column has the names of the individuals in those families, and the third column has the animal each individual owns. For every person (name and surname) who has a cat or a dog, I want to know how many cats or dogs that individual has.
My data looks like
Family   SubFamily  Animal
Smith    Karen      Cat
Smith    Karen      Cow
Smith    Karen      Dog
Jackson  Jason      Dog
I tried
merged_family.groupby(["Family","Animal","SubFamily"]).size().loc[:,'Cat'].loc[:,'Dog']
The result might be
Family  SubFamily  Cat  Dog
Smith   Karen      1    1
or something similar
It did not work. Could you help me?

I think this is a better task for pivot_table:
(df_merged.query("Animal.isin(['Cat', 'Dog'])")
 .pivot_table(columns='Animal', index=['Family', 'SubFamily'], aggfunc='size')
 .fillna(0)
 .reset_index()
 .rename_axis(None, axis=1))
#     Family SubFamily  Cat  Dog
# 0  Jackson     Jason  0.0  1.0
# 1    Smith     Karen  1.0  1.0
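If you would rather stay with groupby, as the question asked, an equivalent route is size plus unstack; a minimal sketch against the same df_merged:

(df_merged[df_merged['Animal'].isin(['Cat', 'Dog'])]
 .groupby(['Family', 'SubFamily', 'Animal'])
 .size()                           # count rows per (Family, SubFamily, Animal)
 .unstack('Animal', fill_value=0)  # pivot the Animal values into columns
 .reset_index()
 .rename_axis(None, axis=1))

Here fill_value=0 keeps the counts as integers instead of the 0.0/1.0 floats that fillna(0) produces.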


Most elegant way to transform this type of table?

I have a dataframe that looks something like this:
id  name   last   attribute_1_name  attribute_1_rating  attribute_2_name  attribute_2_rating
1   Linda  Smith  Age               23                  Hair              Brown
3   Brian  Lin    Hair              Black               Job               Barista
Essentially I'd like to transform this table to look like so:
id  name   last   attribute_name  attribute_rating
1   Linda  Smith  Age             23
1   Linda  Smith  Hair            Brown
3   Brian  Lin    Hair            Black
3   Brian  Lin    Job             Barista
What's the most elegant and efficient way to perform this transformation in Python, assuming there are many more rows and the attribute numbers go up to 13?
Assuming attribute columns are named coherently, you can do this:
import pandas as pd

result = pd.DataFrame()
# n is the number of attribute column pairs (up to 13 here)
for i in range(1, n + 1):
    attribute_name_col = f'attribute_{i}_name'
    attribute_rating_col = f'attribute_{i}_rating'
    # melt one name/rating pair into long format
    melted = pd.melt(
        df,
        id_vars=['id', 'name', 'last', attribute_name_col],
        value_vars=[attribute_rating_col],
    )
    melted = melted.rename(
        columns={attribute_name_col: 'attribute_name',
                 'value': 'attribute_rating'})
    melted = melted.drop('variable', axis=1)
    result = pd.concat([result, melted])
where df is your original dataframe. Then printing result gives
id  name   last   attribute_name  attribute_rating
1   Linda  Smith  Age             23
3   Brian  Lin    Hair            Black
1   Linda  Smith  Hair            Brown
3   Brian  Lin    Job             Barista
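A loop-free alternative is to split the attribute columns into a two-level column index and stack it; a minimal sketch, assuming every column follows the attribute_{i}_name / attribute_{i}_rating pattern:

import pandas as pd

wide = df.set_index(['id', 'name', 'last'])
# 'attribute_1_name' -> levels ('1', 'name'), 'attribute_1_rating' -> ('1', 'rating'), ...
wide.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_')[1:]) for c in wide.columns], names=['attr_no', 'field'])
result = (wide.stack('attr_no')  # one row per (person, attribute number)
              .rename(columns={'name': 'attribute_name',
                               'rating': 'attribute_rating'})
              .dropna(how='all')  # drop attribute slots a row doesn't use
              .reset_index()
              .drop(columns='attr_no'))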

Python - Pandas - finding matches between two data frames

Suppose I have 2 pandas data frames, both sharing the same column names, like this:
name:         dob:       role:
James Franco  1-1-1980   Actor
Cameron Diaz  4-2-1976   Actor
Jim Carey     12-1-1968  Actor
Miley Cyrus   5-23-1987  Actor
name:        dob:       role:
50 cent      4-6-1984   Singer
lil baby     12-1-1990  Singer
ghostmane    8-10-1989  Singer
Miley Cyrus  5-23-1987  Singer
And say I wanted to identify individuals who share the same name and dob, and exist in both dataframes (and thus, have 2 different roles).
How can I do this?
similar to what I would get if everything existed in one dataframe and I did df.groupby(["name", "dob"]).count()
I would like to be able to identify these individuals, print them, and count the number of occurrences.
Thank you
df2 = pd.concat([df, df1])  # append the two dfs (DataFrame.append was removed in pandas 2.0)
dfnew = df2[df2.duplicated(subset=['name:', 'dob:'], keep=False)]  # keep all rows duplicated on the columns you wish to check
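To also get the count of occurrences the question asks for, group the filtered frame:

dfnew.groupby(['name:', 'dob:']).size()  # occurrences of each matched (name, dob) pair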
Well, this will give you just the matches:
df1.merge(df2, on=["name:", "dob:"])
output:
         name:       dob: role:_x role:_y
0  Miley Cyrus  5-23-1987   Actor  Singer
You can use an outer join to get all the results and filter them as you see fit:
df1.merge(df2, how="outer", on=["name:", "dob:"])
Output:
          name:       dob: role:_x role:_y
0  James Franco   1-1-1980   Actor     NaN
1  Cameron Diaz   4-2-1976   Actor     NaN
2     Jim Carey  12-1-1968   Actor     NaN
3   Miley Cyrus  5-23-1987   Actor  Singer
4       50 cent   4-6-1984     NaN  Singer
5      lil baby  12-1-1990     NaN  Singer
6     ghostmane  8-10-1989     NaN  Singer
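A convenient way to do that filtering is merge's indicator flag, which labels where each row came from; a short sketch:

both = df1.merge(df2, how='outer', on=['name:', 'dob:'], indicator=True)
matches = both[both['_merge'] == 'both']        # rows present in both frames
only_df1 = both[both['_merge'] == 'left_only']  # rows present only in df1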

Checking unique value for a variable in a different column

I currently have a dataframe which looks like this:
    Owner Vehicle_Color
0   James           Red
1   Peter         Green
2   James          Blue
3   Sally          Blue
4  Steven           Red
5   James          Blue
6   James           Red
7   Peter          Blue
And I am trying to verify whether each Owner has one or multiple vehicle colors assigned. Keeping in mind that my dataframe has more than a million entries for owners (which can be duplicated), what would be the best solution?
One way may be to use groupby and nunique:
df.groupby('Owner')['Vehicle_Color'].nunique()
Results:
Owner
James     2
Peter     2
Sally     1
Steven    1
Name: Vehicle_Color, dtype: int64
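If you only need the owners with more than one color, filter that result (groupby plus nunique scales fine to a million rows):

counts = df.groupby('Owner')['Vehicle_Color'].nunique()
print(counts[counts > 1])  # only the owners with multiple colors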

How to 'explode' Pandas Column Value to unique row

So, what I mean by explode is this: I want to transform a dataframe like
ID | Name    | Food          | Drink
1  | John    | Apple, Orange | Tea , Water
2  | Shawn   |               | Milk
3  | Patrick | Chichken      |
4  | Halley  | Fish Nugget   |
into this dataframe:
ID | Name    | Order Type | Items
1  | John    | Food       | Apple
2  | John    | Food       | Orange
3  | John    | Drink      | Tea
4  | John    | Drink      | Water
5  | Shawn   | Drink      | Milk
6  | Patrick | Food       | Chichken
I don't know how to make this happen. Any help would be appreciated!
IIUC, this is stack followed by an unnest step. Here I would not change the ID; I think keeping the original one is better:
s = df.set_index(['ID', 'Name']).stack()  # long Series indexed by (ID, Name, Food/Drink)
parts = s.str.split(',')
pd.DataFrame(data=parts.sum(), index=s.index.repeat(parts.str.len())).reset_index()
Out[289]:
   ID     Name level_2            0
0   1     John    Food        Apple
1   1     John    Food       Orange
2   1     John   Drink          Tea
3   1     John   Drink        Water
4   2    Shawn   Drink         Milk
5   3  Patrick    Food     Chichken
6   4   Halley    Food  Fish Nugget
# if you need to rename the value column to Item, try:
# pd.DataFrame(data=parts.sum(), index=s.index.repeat(parts.str.len())).rename(columns={0: 'Item'}).reset_index()
You can use pd.melt to convert the data from wide to long format. I think this will be easier to understand step by step.
# first split into separate columns
df[['Food1','Food2']] = df.Food.str.split(',', expand=True)
df[['Drink1','Drink2']] = df.Drink.str.split(',', expand=True)
# now melt the df into long format
df = pd.melt(df, id_vars=['Name'], value_vars=['Food1','Food2','Drink1','Drink2'])
# remove unwanted rows and filter data
df = df[df['value'].notnull()].sort_values('Name').reset_index(drop=True)
# rename the column names and values
df.rename(columns={'variable':'Order Type', 'value':'Items'}, inplace=True)
df['Order Type'] = df['Order Type'].str.replace(r'\d+', '', regex=True)  # Food1 -> Food, Drink2 -> Drink, etc.
# output
print(df)
      Name Order Type        Items
0   Halley       Food  Fish Nugget
1     John       Food        Apple
2     John       Food       Orange
3     John      Drink          Tea
4     John      Drink        Water
5  Patrick       Food     Chichken
6    Shawn      Drink         Milk
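For completeness: on pandas 0.25 or later, DataFrame.explode does the unnesting directly, so the whole transformation can be one chain; a sketch against the same df:

(df.melt(id_vars=['ID', 'Name'], var_name='Order Type', value_name='Items')
   .dropna(subset=['Items'])                        # drop the empty Food/Drink cells
   .assign(Items=lambda d: d['Items'].str.split(','))
   .explode('Items')                                # one row per list element
   .assign(Items=lambda d: d['Items'].str.strip())  # trim stray spaces around the commas
   .sort_values('ID')
   .reset_index(drop=True))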

Concatenate a set of column values based on another column in Pandas

Given a Pandas dataframe which has a few labeled series in it, say Name and Villain.
Say the dataframe has values such:
Name: {'Batman', 'Batman', 'Spiderman', 'Spiderman', 'Spiderman', 'Spiderman'}
Villain: {'Joker', 'Bane', 'Green Goblin', 'Electro', 'Venom', 'Dr Octopus'}
In total the above dataframe has 2 series (or columns), each with six datapoints.
Now, based on the Name, I want to attach 3 more columns to each row: FirstName, LastName, LoveInterest.
The result of this adds 'Bruce; Wayne; Catwoman' to every row whose Name is Batman, and 'Peter; Parker; MaryJane' to every row whose Name is Spiderman.
The final result should be a dataframe containing 5 columns (series) and 6 rows.
This is a classic inner-join scenario. In pandas, use merge (either the module-level pd.merge or the DataFrame.merge method):
In [13]: df1
Out[13]:
        Name       Villain
0     Batman         Joker
1     Batman          Bane
2  Spiderman  Green Goblin
3  Spiderman       Electro
4  Spiderman         Venom
5  Spiderman   Dr. Octopus

In [14]: df2
Out[14]:
  FirstName LastName LoveInterest       Name
0     Bruce    Wayne     Catwoman     Batman
1     Peter   Parker     MaryJane  Spiderman

In [15]: pd.DataFrame.merge(df1, df2, on='Name')
Out[15]:
        Name       Villain FirstName LastName LoveInterest
0     Batman         Joker     Bruce    Wayne     Catwoman
1     Batman          Bane     Bruce    Wayne     Catwoman
2  Spiderman  Green Goblin     Peter   Parker     MaryJane
3  Spiderman       Electro     Peter   Parker     MaryJane
4  Spiderman         Venom     Peter   Parker     MaryJane
5  Spiderman   Dr. Octopus     Peter   Parker     MaryJane
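Since df2 is a one-row-per-Name lookup table, an equivalent spelling is join against its index:

result = df1.join(df2.set_index('Name'), on='Name')  # left-joins df2's columns onto each matching Name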
