I want to remove duplicates in each row for the column animals.
I need something like this post, but in Python. I cannot figure this out right now for some reason and I am hitting a block.
Remove duplicate records in dataframe
I have tried using drop_duplicates, unique, nunique, etc. No luck.
df.drop_duplicates(subset=None, keep="first", inplace=False)
df
df = pd.DataFrame({'animals':['pink pig, pink pig, pink pig','brown cow, brown cow','pink pig, black cow','brown horse, pink pig, brown cow, black cow, brown cow']})
#input:
animals
0 pink pig, pink pig, pink pig
1 brown cow, brown cow
2 pink pig, black cow
3 brown horse, pink pig, brown cow, black cow, brown cow
#I would like the output to look like this:
animals
0 pink pig
1 brown cow
2 pink pig, black cow
3 brown horse, pink pig, brown cow, black cow
This does it:
df = pd.DataFrame({'animals':['pink pig, pink pig, pink pig','brown cow, brown cow','pink pig, black cow','brown horse, pink pig, brown cow, black cow, brown cow']})
df['animals2'] = df.animals.apply(lambda x: ', '.join(list(set(x.split(', ')))))
Output:
0 pink pig
1 brown cow
2 pink pig, black cow
3 brown cow, brown horse, pink pig, black cow
Explanation:
I split your string into a list. Then I turned the list into a set to remove duplicates. Then I turned the set back into a list and joined it into a string again. Please tell me if something isn't clear!
If you wish to retain the original order of the items (converting to sets makes them unordered), the following function should work.
def drop_duplicates(items):
    # `items` is a comma separated string, e.g. "dog, dog, cat".
    result = []
    seen = set()
    for item in items.split(','):
        item = item.strip()
        if item not in seen:
            seen.add(item)
            result.append(item)
    return ', '.join(result)
>>> df['animals'].apply(drop_duplicates)
0 pink pig
1 brown cow
2 pink pig, black cow
3 brown horse, pink pig, brown cow, black cow
Name: animals, dtype: object
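Note (my addition, not part of the original answer): on Python 3.7+ a plain dict preserves insertion order, so `dict.fromkeys` gives the same order-preserving dedup in one line:

```python
import pandas as pd

df = pd.DataFrame({'animals': ['pink pig, pink pig, pink pig',
                               'brown cow, brown cow']})

# dict.fromkeys keeps the first occurrence of each key and preserves
# insertion order, so duplicates vanish without losing the original order
df['animals'] = df['animals'].apply(
    lambda s: ', '.join(dict.fromkeys(x.strip() for x in s.split(',')))
)
print(df['animals'].tolist())  # ['pink pig', 'brown cow']
```

This keeps the first occurrence of each item, just like the explicit loop above.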
I have the dataframe below:
details = {
'container_id' : [1, 2, 3, 4, 5, 6 ],
'container' : ['black box', 'orange box', 'blue box', 'black box','blue box', 'white box'],
'fruits' : ['apples, black currant', 'oranges','peaches, oranges', 'apples','apples, peaches, oranges', 'black berries, peaches, oranges, apples'],
}
# creating a Dataframe object
df = pd.DataFrame(details)
I want to find the frequency of each fruit separately in the list.
I tried this code
df['fruits'].str.split(expand=True).stack().value_counts()
but 'black' gets counted twice, instead of 1 for 'black currant' and 1 for 'black berries', because the default split is on whitespace and breaks the two-word fruits apart.
You can do it the way you did, but specify the delimiter. Be aware that when splitting the data you get leading whitespace unless your delimiter is a comma followed by a space; to be safe, add another step with str.strip.
df['fruits'].str.split(',', expand=False).explode().str.strip().value_counts()
Your way (you can also chain str.strip after the stack command if you want to):
df['fruits'].str.split(', ', expand=True).stack().value_counts()
Output:
apples 4
oranges 4
peaches 3
black currant 1
black berries 1
Name: fruits, dtype: int64
Specify the comma separator followed by an optional space:
df['fruits'].str.split(r',\s?', expand=True).stack().value_counts()
OUTPUT:
apples 4
oranges 4
peaches 3
black currant 1
black berries 1
dtype: int64
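For comparison (my addition, not from either answer), the same frequencies can be computed without pandas string methods by feeding the split, stripped items into `collections.Counter`:

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({'fruits': ['apples, black currant', 'oranges',
                              'peaches, oranges', 'apples',
                              'apples, peaches, oranges',
                              'black berries, peaches, oranges, apples']})

# Split each row on commas, strip surrounding whitespace,
# and count occurrences across all rows
counts = Counter(item.strip() for row in df['fruits'] for item in row.split(','))
print(counts['apples'])         # 4
print(counts['black currant'])  # 1
```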
I have the following pandas dataframe:
import pandas as pd
foo_dt = pd.DataFrame({'var_1': ['filter coffee', 'american cheesecake', 'espresso coffee', 'latte tea'],
'var_2': ['coffee', 'coffee black', 'tea', 'strawberry cheesecake']})
and the following dictionary:
foo_colors = {'coffee': 'brown', 'cheesecake': 'white', 'tea': 'green'}
I want to add two columns to foo_dt (color_var_1 and color_var_2), whose values are the foo_colors dictionary value for whichever key appears inside the string in var_1 or var_2 respectively.
EDIT
In other words, for every key in foo_colors, check whether it is contained in the strings of both columns var_1 and var_2, and fill the respective column (color_var_1 / color_var_2) with the corresponding dictionary value.
The resulting dataframe should look like this:
var_1 var_2 color_var_1 color_var_2
0 filter coffee coffee brown brown
1 american cheesecake coffee black white brown
2 espresso coffee tea brown green
3 latte tea strawberry cheesecake green white
Any idea how can I do this ?
Use Series.str.extract to get the first matched substring, with the regex pattern built by joining the dict keys with |, then map the matches to their values with Series.map:
pat = '|'.join(r"\b{}\b".format(x) for x in foo_colors)
for c in ['var_1', 'var_2']:
    foo_dt[f'color_{c}'] = foo_dt[c].str.extract(f'({pat})', expand=False).map(foo_colors)
print(foo_dt)
var_1 var_2 color_var_1 color_var_2
0 filter coffee coffee brown brown
1 american cheesecake coffee black white brown
2 espresso coffee tea brown green
3 latte tea strawberry cheesecake green white
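One caveat (my note, not from the answer): if the dict keys could ever contain regex metacharacters, escape them with `re.escape` before joining, otherwise the pattern may not mean what you expect:

```python
import re
import pandas as pd

foo_colors = {'coffee': 'brown', 'cheesecake': 'white', 'tea': 'green'}

# re.escape guards against keys like 'c++' or 'a.b' being read as regex syntax
pat = '|'.join(r"\b{}\b".format(re.escape(k)) for k in foo_colors)

s = pd.Series(['filter coffee', 'latte tea'])
print(s.str.extract(f'({pat})', expand=False).map(foo_colors).tolist())
# ['brown', 'green']
```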
My dataframe looks like this:
df =
query subject HPSame
0 cat dog HPS_1
1 cat horse HPS_2
2 king queen HPS_3
3 queen people HPS_4
4 CAR VAN HPS_5
5 dog tiger HPS_6
6 CAR TRUCK HPS_7
7 horse deer HPS_8
8 CAR JEEP HPS_9
9 TRUCK LORRY HPS_10
10 VAN TRAIN HPS_11
11 people children HPS_12
In the df, query is similar to subject, i.e., cat is similar to dog and hence the label HPS_1. Also, cat is similar to horse, and dog is similar to tiger, so these should all share the same match label, HPS_1. I am looking to find chains of similar elements (if a = b = c = d, give them the same label in a new column). I have simplified my question; the subject and query actually consist of alphanumeric elements, e.g. WP_020314852.1 = WP_004217899.1 = WP_150395973.1 signifying the same kind. The expected result is as follows.
df =
query subject HPSame match
0 cat dog HPS_1 HPS_1
1 cat horse HPS_2 HPS_1
2 king queen HPS_3 HPS_3
3 queen people HPS_4 HPS_3
4 CAR VAN HPS_5 HPS_5
5 dog tiger HPS_6 HPS_1
6 CAR TRUCK HPS_7 HPS_5
7 horse deer HPS_8 HPS_1
8 CAR JEEP HPS_9 HPS_5
9 TRUCK LORRY HPS_10 HPS_5
10 VAN TRAIN HPS_11 HPS_5
11 people children HPS_12 HPS_3
I tried,
df['query_s'] = df['query'].shift(-1)
df['HPSame_s'] = df['HPSame'].shift(-1)
condition = [(df['query'] == df['query_s'])]
ifTrue = df['HPSame']
ifFalse = df['HPSame_s']
df['match'] = np.where(condition, ifTrue, ifFalse)
This throws me ValueError: Length of values does not match length of index
We can do this using the NetworkX library and its connected components from graph theory:
import pandas as pd
import networkx as nx
import numpy as np
# Copy your input dataframe from question
df = pd.read_clipboard()
# Create a graph network
G = nx.from_pandas_edgelist(df, 'query', 'subject')
# Use connected_components method to find groups
grps = dict(enumerate(nx.connected_components(G)))
# Match back to dataframe
df['match'] = [k for i in df['query'] for k, v in grps.items() if i in v]
df['match'] = df.groupby('match')['HPSame'].transform('first')
print(df)
Output:
query subject HPSame match
0 cat dog HPS_1 HPS_1
1 cat horse HPS_2 HPS_1
2 king queen HPS_3 HPS_3
3 queen people HPS_4 HPS_3
4 CAR VAN HPS_5 HPS_5
5 dog tiger HPS_6 HPS_1
6 CAR TRUCK HPS_7 HPS_5
7 horse deer HPS_8 HPS_1
8 CAR JEEP HPS_9 HPS_5
9 TRUCK LORRY HPS_10 HPS_5
10 VAN TRAIN HPS_11 HPS_5
11 people children HPS_12 HPS_3
Image of the graph network from the dataframe:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 8))
nx.draw_networkx(G, node_color='y', ax=ax)
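If NetworkX is not available, the same connected-components grouping can be sketched with a small union-find; this is my own sketch, not part of the answer, shown on a trimmed-down version of the dataframe:

```python
import pandas as pd

df = pd.DataFrame({'query': ['cat', 'cat', 'king', 'dog'],
                   'subject': ['dog', 'horse', 'queen', 'tiger'],
                   'HPSame': ['HPS_1', 'HPS_2', 'HPS_3', 'HPS_4']})

parent = {}

def find(x):
    # Path-compressing find: walk to the root, flattening links as we go
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Each (query, subject) pair is an edge joining two components
for q, s in zip(df['query'], df['subject']):
    union(q, s)

# Label each row by its query's root, then take the first HPSame per group
df['match'] = df.groupby(df['query'].map(find))['HPSame'].transform('first')
print(df['match'].tolist())  # ['HPS_1', 'HPS_1', 'HPS_3', 'HPS_1']
```

This avoids the extra dependency at the cost of a few lines of bookkeeping.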
Given the below list, I'd like to fill in the 'Color Guess' column with the mode of the 'Color' column conditional on 'Type' and 'Size' and ignoring NULL, #N/A, etc.
For example, what's the most common color for SMALL CATS, what's the most common color for MEDIUM DOGS, etc.
Type Size Color Color Guess
Cat small brown
Dog small black
Dog large black
Cat medium white
Cat medium #N/A
Dog large brown
Cat large white
Cat large #N/A
Dog large brown
Dog medium #N/A
Cat small #N/A
Dog small white
Dog small black
Dog small brown
Dog medium white
Dog medium #N/A
Cat large brown
Dog small white
Dog large #N/A
As BarMar already stated in the comments, we can use pd.Series.mode here from the linked answer. The only trick is that we have to use groupby.transform, since we want the data back in the same shape as your dataframe:
df['Color Guess'] = df.groupby(['Type', 'Size'])['Color'].transform(lambda x: pd.Series.mode(x)[0])
Type Size Color Color Guess
0 Cat small brown brown
1 Dog small black black
2 Dog large black brown
3 Cat medium white white
4 Cat medium NaN white
5 Dog large brown brown
6 Cat large white brown
7 Cat large NaN brown
8 Dog large brown brown
9 Dog medium NaN white
10 Cat small NaN brown
11 Dog small white black
12 Dog small black black
13 Dog small brown black
14 Dog medium white white
15 Dog medium NaN white
16 Cat large brown brown
17 Dog small white black
18 Dog large NaN brown
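One edge case worth guarding against (my note, not part of the answer): `pd.Series.mode(x)[0]` can fail if an entire group is NaN, because `mode()` drops NaN and returns an empty Series. A slightly more defensive sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': ['Cat', 'Cat', 'Dog'],
                   'Size': ['small', 'small', 'tiny'],
                   'Color': ['brown', np.nan, np.nan]})

def safe_mode(x):
    # mode() ignores NaN; an all-NaN group yields an empty result
    m = x.mode()
    return m.iat[0] if not m.empty else np.nan

df['Color Guess'] = df.groupby(['Type', 'Size'])['Color'].transform(safe_mode)
print(df['Color Guess'].tolist())  # ['brown', 'brown', nan]
```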
This question is related to another question I had posted.
Pandas - check if a string column in one dataframe contains a pair of strings from another dataframe
My goal is to check if two different columns of a dataframe contain a pair of string values and if the condition is met, then extract one of the values.
I have two dataframes like this:
df1 = pd.DataFrame({'consumption':['squirrelate apple', 'monkey likesapple',
'monkey banana gets', 'badger/getsbanana', 'giraffe eats grass', 'badger apple.loves', 'elephant is huge', 'elephant/eats/', 'squirrel.digsingrass'],
'name': ['apple', 'appleisred', 'banana is tropical', 'banana is soft', 'lemon is sour', 'washington apples', 'kiwi', 'bananas', 'apples']})
df2 = pd.DataFrame({'food':['apple', 'apple', 'banana', 'banana'], 'creature':['squirrel', 'badger', 'monkey', 'elephant']})
In [187]:df1
Out[187]:
consumption name
0 squirrelate apple apple
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples
6 elephant is huge kiwi
7 elephant/eats/ bananas
8 squirrel.digsingrass apples
In[188]: df2
Out[188]:
creature food
0 squirrel apple
1 badger apple
2 monkey banana
3 elephant banana
What I want to do is test if 'apple' occurs in df1['name'] and 'squirrel' occurs in df1['consumption'] and if both conditions are met then extract 'squirrel' from df1['consumption'] into a new column df['creature']. The result should look like:
Out[189]:
consumption creature name
0 squirrelate apple squirrel apple
1 monkey likesapple NaN appleisred
2 monkey banana gets monkey banana is tropical
3 badger/getsbanana NaN banana is soft
4 giraffe eats grass NaN lemon is sour
5 badger apple.loves badger washington apples
6 elephant is huge NaN kiwi
7 elephant/eats/ elephant bananas
8 squirrel.digsingrass NaN apples
If there was no paired value constraint, I could have done something simple like :
np.where((df1['consumption'].str.contains(<creature_string>, case=False)) & (df1['name'].str.contains(<food_string>, case=False)), df1['consumption'].str.extract(<creature_string>), np.nan)
but I must check for pairs, so I tried to make a dictionary with food as keys and creatures as values, then build a string of all the creatures for a given food key and look for those using str.contains:
unique_food = df2.food.unique()
food_dict = {elem : pd.DataFrame for elem in unique_food}
for key in food_dict.keys():
    food_dict[key] = df2[:][df2.food == key]
# create key:value pairs of food key and creature strings
food_strings = {}
for key, values in food_dict.items():
    food_strings.update({key: '|'.join(map(str, list(food_dict[key]['creature'].unique())))})
In[199]: food_strings
Out[199]: {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'}
The problem is when I try to now apply str.contains:
for key, value in food_strings.items():
    np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
             (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumptions'].str.extract('('+food_strings[value]+')'), np.nan)
I get a KeyError:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-62-7ab718066040> in <module>()
1 for key, value in food_strings.items():
2 np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
----> 3 (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan)
KeyError: 'squirrel|badger'
When I just try for only the value and not the key, it works for the first key:value pair but not the second:
for key in food_strings.keys():
    df1['test'] = np.where(df1['consumption'].str.contains('('+food_strings[key]+')', case=False),
                           df1['consumption'].str.extract('('+food_strings[key]+')', expand=False),
                           np.nan)
df1
Out[196]:
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred NaN
2 monkey banana gets banana is tropical NaN
3 badger/getsbanana banana is soft badger
4 giraffe eats grass lemon is sour NaN
5 badger apple.loves washington apples badger
6 elephant is huge kiwi NaN
7 elephant/eats/ bananas NaN
8 squirrel.digsingrass apples squirrel
I got the ones matching apple and squirrel|badger but missed banana:monkey|elephant.
Can someone please help?
d1 = df1.dropna()
d2 = df2.dropna()
sump = d1.consumption.values.tolist()
name = d1.name.values.tolist()
cret = d2.creature.values.tolist()
food = d2.food.values.tolist()
check = np.array(
    [
        [c in s and f in n for c, f in zip(cret, food)]
        for s, n in zip(sump, name)
    ]
)
# create a new series with the index of `d1` where we dropped na
# then reindex with `df1.index` prior to `assign`
test = pd.Series(check.dot(d2[['creature']].values).ravel(), d1.index)
test = test.reindex(df1.index, fill_value='')
df1.assign(test=test)
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical monkey
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples badger
6 elephant is huge kiwi
7 elephant/eats/ bananas elephant
8 squirrel.digsingrass apples squirrel
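To make the `check.dot(...)` step concrete (my illustration, on a trimmed-down input): `check` is a boolean matrix of df1 rows by df2 rows, and dotting it with the creature column concatenates the names of the matching pairs for each row, leaving an empty string where nothing matched:

```python
import numpy as np

sump = ['squirrelate apple', 'monkey banana gets']
name = ['apple', 'banana is tropical']
cret = ['squirrel', 'monkey']
food = ['apple', 'banana']

# check[i][j] is True when pair j's creature appears in row i's consumption
# AND pair j's food appears in row i's name
check = np.array([[c in s and f in n for c, f in zip(cret, food)]
                  for s, n in zip(sump, name)])

# Boolean matrix dotted with an object column: True * 'name' is 'name',
# False * 'name' is '', and summing concatenates the survivors
matched = check.dot(np.array(cret, dtype=object).reshape(-1, 1)).ravel()
print(matched.tolist())  # ['squirrel', 'monkey']
```

This is why the answer reindexes with `fill_value=''`: unmatched rows naturally come out as empty strings rather than NaN.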