Fill column with conditional mode of another column - python

Given the table below, I'd like to fill in the 'Color Guess' column with the mode of the 'Color' column, conditional on 'Type' and 'Size', ignoring NULL, #N/A, etc.
For example, what's the most common color for SMALL CATS, what's the most common color for MEDIUM DOGS, etc.
Type Size Color Color Guess
Cat small brown
Dog small black
Dog large black
Cat medium white
Cat medium #N/A
Dog large brown
Cat large white
Cat large #N/A
Dog large brown
Dog medium #N/A
Cat small #N/A
Dog small white
Dog small black
Dog small brown
Dog medium white
Dog medium #N/A
Cat large brown
Dog small white
Dog large #N/A

As BarMar already stated in the comments, we can use pd.Series.mode here, from the linked answer. The only trick is that we have to use groupby.transform, since we want the data back in the same shape as your dataframe:
df['Color Guess'] = df.groupby(['Type', 'Size'])['Color'].transform(lambda x: pd.Series.mode(x)[0])
Type Size Color Color Guess
0 Cat small brown brown
1 Dog small black black
2 Dog large black brown
3 Cat medium white white
4 Cat medium NaN white
5 Dog large brown brown
6 Cat large white brown
7 Cat large NaN brown
8 Dog large brown brown
9 Dog medium NaN white
10 Cat small NaN brown
11 Dog small white black
12 Dog small black black
13 Dog small brown black
14 Dog medium white white
15 Dog medium NaN white
16 Cat large brown brown
17 Dog small white black
18 Dog large NaN brown
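For reference, a self-contained sketch of the same approach on a small slice of the example data (the slice is mine, not the full table). Note that s.mode() drops NaN by default, and .iloc[0] picks the first value when several colors tie; a group that is entirely NaN would raise an IndexError here.

```python
import pandas as pd

# A slice of the example data; '#N/A' entries become real NaN values.
df = pd.DataFrame({
    'Type':  ['Cat', 'Dog', 'Dog', 'Cat', 'Cat', 'Dog'],
    'Size':  ['small', 'small', 'large', 'medium', 'medium', 'large'],
    'Color': ['brown', 'black', 'black', 'white', None, 'brown'],
})

# mode() ignores NaN, so missing colors don't influence the guess.
df['Color Guess'] = (
    df.groupby(['Type', 'Size'])['Color']
      .transform(lambda s: s.mode().iloc[0])
)
```

For ties (here Dog/large has one 'black' and one 'brown'), mode() returns the candidates in sorted order, so .iloc[0] picks the alphabetically first one.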

Related

Calculating the Entropy of Data in a Table or Matrix

Color Height Sex
----------------------
Red Short Male
Red Tall Male
Blue Medium Female
Green Medium Female
Green Tall Female
Green Short Male
How can I compute the entropy of the table as a whole in Python?
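No answer is quoted here; as a hedged sketch, one common reading of "entropy of the table as a whole" is the Shannon entropy of the empirical joint distribution over rows, where each distinct (Color, Height, Sex) row is one outcome:

```python
import math
from collections import Counter

# Each (Color, Height, Sex) row is treated as one outcome of a joint distribution.
rows = [
    ('Red', 'Short', 'Male'),
    ('Red', 'Tall', 'Male'),
    ('Blue', 'Medium', 'Female'),
    ('Green', 'Medium', 'Female'),
    ('Green', 'Tall', 'Female'),
    ('Green', 'Short', 'Male'),
]

counts = Counter(rows)
n = len(rows)
# Shannon entropy in bits: H = -sum(p * log2(p))
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
```

All six rows are distinct here, so the entropy is log2(6), about 2.585 bits.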

How can I get similarity rate between columns in different dataframes and label the data

I need to compute differences between rows in different dataframes. On one hand I have df1 with 150 different rows with specific phrases. On the other hand I have df2 with 250k rows with any type of phrases. I want to compare each row in df1 with each row in df2, labeling every row in df2 whose similarity is higher than a threshold (to be determined). I use SequenceMatcher to get the similarity between them.
How could I do it? Any advice on doing it in the most efficient way? How would you code that?
df1 (len = 150)
Phrases
My cat is black
Dog is white
Peter is waiting
Dog is white
df2 (len = 250k)
Phrases
My cat is white
Dog is jumping
Marcos is waiting
Dog is white
Output:
Phrases            Labels
My cat is white    0.75
My cat is white    0
My cat is white    0
My cat is white    0.33
Dog is jumping     0.33
Dog is jumping     0.66
Dog is jumping     0.33
Dog is jumping     0.66
Marcos is waiting  0.33
Marcos is waiting  0.33
Marcos is waiting  0.66
Marcos is waiting  0.33
Dog is white       0.5
Dog is white       0.66
Dog is white       0.33
Dog is white       1
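No answer is quoted here; a minimal sketch of the all-pairs SequenceMatcher comparison described above, with the dataframes shown as plain lists:

```python
from difflib import SequenceMatcher

# The example phrases from the question (df1 and df2 shown as plain lists).
df1_phrases = ['My cat is black', 'Dog is white', 'Peter is waiting', 'Dog is white']
df2_phrases = ['My cat is white', 'Dog is jumping', 'Marcos is waiting', 'Dog is white']

# Compare every df2 phrase against every df1 phrase.
rows = []
for p2 in df2_phrases:
    for p1 in df1_phrases:
        rows.append((p2, round(SequenceMatcher(None, p1, p2).ratio(), 2)))
```

For 150 x 250k rows this is roughly 37.5 million comparisons; SequenceMatcher.quick_ratio() gives a cheap upper bound on ratio() and can be used as a pre-filter before computing the exact score.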

pandas given two columns are same, find similar elements in rows to make new column

My dataframe looks like this:
df =
query subject HPSame
0 cat dog HPS_1
1 cat horse HPS_2
2 king queen HPS_3
3 queen people HPS_4
4 CAR VAN HPS_5
5 dog tiger HPS_6
6 CAR TRUCK HPS_7
7 horse deer HPS_8
8 CAR JEEP HPS_9
9 TRUCK LORRY HPS_10
10 VAN TRAIN HPS_11
11 people children HPS_12
In the df, query is similar to subject, i.e., cat is similar to dog and hence gets label HPS_1. Also, cat is similar to horse, and dog is similar to tiger; therefore, they should have the same match label, HPS_1. I am looking to find similar elements, as if a = b = c = d, and give them the same label in a new column. I have tried to simplify my question. The subject and query essentially consist of alphanumeric elements, e.g. WP_020314852.1 = WP_004217899.1 = WP_150395973.1 signifying the same kind. The expected result is as follows.
df =
query subject HPSame match
0 cat dog HPS_1 HPS_1
1 cat horse HPS_2 HPS_1
2 king queen HPS_3 HPS_3
3 queen people HPS_4 HPS_3
4 CAR VAN HPS_5 HPS_5
5 dog tiger HPS_6 HPS_1
6 CAR TRUCK HPS_7 HPS_5
7 horse deer HPS_8 HPS_1
8 CAR JEEP HPS_9 HPS_5
9 TRUCK LORRY HPS_10 HPS_5
10 VAN TRAIN HPS_11 HPS_5
11 people children HPS_12 HPS_3
I tried,
df['query_s'] = df['query'].shift(-1)
df['HPSame_s'] = df['HPSame'].shift(-1)
condition = [(df['query'] == df['query_s'])]
ifTrue = df['HPSame']
ifFalse = df['HPSame_s']
df['match'] = np.where(condition, ifTrue, ifFalse)
This throws me ValueError: Length of values does not match length of index
We can do this using the NetworkX library and connected components from graph theory:
import pandas as pd
import networkx as nx
import numpy as np
# Copy your input dataframe from question
df = pd.read_clipboard()
# Create a graph network
G = nx.from_pandas_edgelist(df, 'query', 'subject')
# Use connected_components method to find groups
grps = dict(enumerate(nx.connected_components(G)))
# Match back to dataframe
df['match'] = [k for i in df['query'] for k, v in grps.items() if i in v]
df['match'] = df.groupby('match')['HPSame'].transform('first')
print(df)
Output:
query subject HPSame match
0 cat dog HPS_1 HPS_1
1 cat horse HPS_2 HPS_1
2 king queen HPS_3 HPS_3
3 queen people HPS_4 HPS_3
4 CAR VAN HPS_5 HPS_5
5 dog tiger HPS_6 HPS_1
6 CAR TRUCK HPS_7 HPS_5
7 horse deer HPS_8 HPS_1
8 CAR JEEP HPS_9 HPS_5
9 TRUCK LORRY HPS_10 HPS_5
10 VAN TRAIN HPS_11 HPS_5
11 people children HPS_12 HPS_3
Image of the graph network from the dataframe:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 8))
nx.draw_networkx(G, node_color='y')
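Because pd.read_clipboard() depends on what happens to be on the clipboard, here is a self-contained sketch of the same connected-components technique on the first six rows of the example:

```python
import pandas as pd
import networkx as nx

# First six rows of the example dataframe.
df = pd.DataFrame({
    'query':   ['cat', 'cat', 'king', 'queen', 'CAR', 'dog'],
    'subject': ['dog', 'horse', 'queen', 'people', 'VAN', 'tiger'],
    'HPSame':  ['HPS_1', 'HPS_2', 'HPS_3', 'HPS_4', 'HPS_5', 'HPS_6'],
})

# Build an undirected graph with one edge per row, then group nodes
# into connected components.
G = nx.from_pandas_edgelist(df, 'query', 'subject')
grps = dict(enumerate(nx.connected_components(G)))

# Map every query to its component id, then take the first HPSame per component.
df['match'] = [k for q in df['query'] for k, v in grps.items() if q in v]
df['match'] = df.groupby('match')['HPSame'].transform('first')
```

The transform('first') step is what makes the result deterministic: whatever order the components come back in, each group is labeled with the HPSame of its earliest row.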

Remove Duplicates values in a Panda's Record

I want to remove duplicates within each row of the column animals.
I need something like this post, but in Python. I cannot figure this out right now for some reason; I am hitting a block.
Remove duplicate records in dataframe
I have tried drop_duplicates, unique, nunique, etc., with no luck:
df.drop_duplicates(subset=None, keep="first", inplace=False)
df
df = pd.DataFrame ({'animals':['pink pig, pink pig, pink pig','brown cow, brown cow','pink pig, black cow','brown horse, pink pig, brown cow, black cow, brown cow']})
#input:
animals
0 pink pig, pink pig, pink pig
1 brown cow, brown cow
2 pink pig, black cow
3 brown horse, pink pig, brown cow, black cow, brown cow
#I would like the output to look like this:
animals
0 pink pig
1 brown cow
2 pink pig, black cow
3 brown horse, pink pig, brown cow, black cow
This does it:
df = pd.DataFrame ({'animals':['pink pig, pink pig, pink pig','brown cow, brown cow','pink pig, black cow','brown horse, pink pig, brown cow, black cow, brown cow']})
df['animals2'] = df.animals.apply(lambda x: ', '.join(list(set(x.split(', ')))))
Output:
0 pink pig
1 brown cow
2 pink pig, black cow
3 brown cow, brown horse, pink pig, black cow
Explanation:
I turned your strings into a list, then turned the list into a set to remove duplicates. Then I turned the set back into a list and joined it into a single string again. Please tell me if something isn't clear!
If you wish to retain the original order of the items (converting to sets makes them unordered), the following function should work.
def drop_duplicates(items):
    # `items` is a comma separated string, e.g. "dog, dog, cat".
    result = []
    seen = set()
    for item in items.split(','):
        item = item.strip()
        if item not in seen:
            seen.add(item)
            result.append(item)
    return ', '.join(result)
>>> df['animals'].apply(drop_duplicates)
0                                       pink pig
1                                      brown cow
2                            pink pig, black cow
3    brown horse, pink pig, brown cow, black cow
Name: animals, dtype: object
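An order-preserving one-liner is also possible with dict.fromkeys, since dicts keep insertion order in Python 3.7+; a sketch on the same dataframe:

```python
import pandas as pd

df = pd.DataFrame({'animals': ['pink pig, pink pig, pink pig',
                               'brown cow, brown cow',
                               'pink pig, black cow',
                               'brown horse, pink pig, brown cow, black cow, brown cow']})

# dict.fromkeys drops duplicates while keeping first-seen order.
df['animals'] = df['animals'].apply(
    lambda s: ', '.join(dict.fromkeys(part.strip() for part in s.split(',')))
)
```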

Merging 2 dataframe using similar columns

I have 2 dataframes, listed as follows:
df
Type Breed Common Color Other Color Behaviour
Golden Big Gold White Fun
Corgi Small Brown White Crazy
Bulldog Medium Black Grey Strong
df2
Type Breed Behaviour Bark Sound
Pug Small Sleepy Ak
German Shepard Big Cool Woof
Puddle Small Aggressive Ek
I want to merge the 2 dataframes on the columns Type, Breed and Behaviour.
Therefore, my desired output would be:
Type Breed Behaviour
Golden Big Fun
Corgi Small Crazy
Bulldog Medium Strong
Pug Small Sleepy
German Shepard Big Cool
Puddle Small Aggressive
You need concat:
print (pd.concat([df1[['Type','Breed','Behaviour']],
df2[['Type','Breed','Behaviour']]], ignore_index=True))
Type Breed Behaviour
0 Golden Big Fun
1 Corgi Small Crazy
2 Bulldog Medium Strong
3 Pug Small Sleepy
4 German Shepard Big Cool
5 Puddle Small Aggressive
More general is use intersection for columns of both DataFrames:
cols = df1.columns.intersection(df2.columns)
print (cols)
Index(['Type', 'Breed', 'Behaviour'], dtype='object')
print (pd.concat([df1[cols], df2[cols]], ignore_index=True))
Type Breed Behaviour
0 Golden Big Fun
1 Corgi Small Crazy
2 Bulldog Medium Strong
3 Pug Small Sleepy
4 German Shepard Big Cool
5 Puddle Small Aggressive
More generally, if df1 and df2 otherwise contain no NaN values, you can concat everything and use dropna to remove the columns with NaN (i.e. the non-overlapping ones):
print (pd.concat([df1 ,df2], ignore_index=True))
Bark Sound Behaviour Breed Common Color Other Color Type
0 NaN Fun Big Gold White Golden
1 NaN Crazy Small Brown White Corgi
2 NaN Strong Medium Black Grey Bulldog
3 Ak Sleepy Small NaN NaN Pug
4 Woof Cool Big NaN NaN German Shepard
5 Ek Aggressive Small NaN NaN Puddle
print (pd.concat([df1, df2], ignore_index=True).dropna(axis=1))
Behaviour Breed Type
0 Fun Big Golden
1 Crazy Small Corgi
2 Strong Medium Bulldog
3 Sleepy Small Pug
4 Cool Big German Shepard
5 Aggressive Small Puddle
Using join and dropping columns that don't overlap:
df1.T.join(df2.T, lsuffix='_').dropna().T.reset_index(drop=True)
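A self-contained sketch of the column-intersection approach above, using trimmed versions of the example frames (the row subset is mine):

```python
import pandas as pd

df1 = pd.DataFrame({'Type': ['Golden', 'Corgi'], 'Breed': ['Big', 'Small'],
                    'Common Color': ['Gold', 'Brown'], 'Behaviour': ['Fun', 'Crazy']})
df2 = pd.DataFrame({'Type': ['Pug'], 'Breed': ['Small'],
                    'Behaviour': ['Sleepy'], 'Bark Sound': ['Ak']})

# Keep only the columns both frames share, then stack the rows.
cols = df1.columns.intersection(df2.columns)
out = pd.concat([df1[cols], df2[cols]], ignore_index=True)
```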
