Groupby drop duplicates - python

I have a df where the category is separated by underscores.
df
fruit cat
0 apple green_heavy_pricy
1 apple heavy_cheap
2 banana yellow
3 pear green
4 banana brown_raw_yellow
...
I want to create an agg column that gathers all the unique information per fruit. I tried df.groupby("fruit")["cat"].transform("unique"), but it does not give me what I need. Expected output:
fruit cat agg
0 apple green_heavy_pricy green_heavy_pricy_cheap
1 apple heavy_cheap green_heavy_pricy_cheap
2 banana yellow yellow_brown_raw
3 pear green green
4 banana brown_raw_yellow yellow_brown_raw

Use a custom lambda function with dict.fromkeys (which removes duplicates while preserving first-seen order) in GroupBy.transform:
f = lambda x: '_'.join(dict.fromkeys('_'.join(x).split('_')))
#alternative solution
#f = lambda x: '_'.join(pd.unique('_'.join(x).split('_')))
#alternative2 solution (split on '_', with distinct loop variables)
#f = lambda x: '_'.join(dict.fromkeys(z for y in x for z in y.split('_')))
df['agg'] = df.groupby("fruit")["cat"].transform(f)
print (df)
fruit cat agg
0 apple green_heavy_pricy green_heavy_pricy_cheap
1 apple heavy_cheap green_heavy_pricy_cheap
2 banana yellow yellow_brown_raw
3 pear green green
4 banana brown_raw_yellow yellow_brown_raw
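For reference, a self-contained sketch (DataFrame reconstructed from the question) showing why this works: join the group's strings, split on '_', and let dict.fromkeys drop duplicate tokens while keeping first-seen order:

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "apple", "banana", "pear", "banana"],
    "cat": ["green_heavy_pricy", "heavy_cheap", "yellow",
            "green", "brown_raw_yellow"],
})

# join all cat values in the group, split into tokens,
# then dedupe while preserving first-seen order via dict.fromkeys
f = lambda x: "_".join(dict.fromkeys("_".join(x).split("_")))
df["agg"] = df.groupby("fruit")["cat"].transform(f)
# df.loc[0, "agg"] -> "green_heavy_pricy_cheap"
```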

Related

Creating DataFrame based on two or more non-equal lists

I have two lists, let's say:
list1 = ["apple","banana"]
list2 = ["M","T","W","TR","F","S"]
I want to create a data frame of two columns fruit and day so that the result will look something like this
fruit   day
apple   M
apple   T
apple   W
apple   TR
apple   F
apple   S
banana  M
and so on...
Currently my actual data is columnar, meaning the items in list2 are in columns, but I want them in rows. Any help would be appreciated.
try this:
from itertools import product
import pandas as pd
list1 = ["apple","banana"]
list2 = ["M","T","W","TR","F","S"]
df = pd.DataFrame(
    product(list1, list2),
    columns=['fruit', 'day']
)
print(df)
>>>
fruit day
0 apple M
1 apple T
2 apple W
3 apple TR
4 apple F
5 apple S
6 banana M
7 banana T
8 banana W
9 banana TR
10 banana F
11 banana S
same result with merge:
df = pd.merge(pd.Series(list1, name='fruit'),
              pd.Series(list2, name='day'), how='cross')
print(df)
fruit day
0 apple M
1 apple T
2 apple W
3 apple TR
4 apple F
5 apple S
6 banana M
7 banana T
8 banana W
9 banana TR
10 banana F
11 banana S
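A third, equivalent sketch (my addition, not from the answer) using pd.MultiIndex.from_product, which builds the cross product without itertools:

```python
import pandas as pd

list1 = ["apple", "banana"]
list2 = ["M", "T", "W", "TR", "F", "S"]

# from_product builds the cartesian product as a MultiIndex;
# to_frame(index=False) turns the two levels into ordinary columns
df = (pd.MultiIndex.from_product([list1, list2], names=["fruit", "day"])
        .to_frame(index=False))
```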

Fill values based on adjacent column

How could I create create_col? For each row, find the previous time where that fruit was mentioned and check if the wanted column was yes?
wanted fruit create_col
0 yes apple
1 pear
2 pear < last time pear was mentioned, wanted was not yes, so blank
3 apple True < last time apple was mentioned, wanted was yes, so True
df
###
wanted fruit
0 yes apple
1 pear
2 yes pear
3 apple
4 mango
5 pear
import numpy as np

df['cum_list'] = df[df['wanted'].eq('yes')]['fruit'].cumsum()
df['cum_list'] = df['cum_list'].shift(1).ffill()
df.fillna('', inplace=True)
df['create_col'] = np.where(df.apply(lambda x: x['fruit'] in x['cum_list'], axis=1), True, '')
df.drop(columns=['cum_list'], inplace=True)
df
###
wanted fruit create_col
0 yes apple
1 pear
2 yes pear
3 apple True
4 mango
5 pear True
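An alternative sketch (my addition) that follows the question's "previous time that fruit was mentioned" wording literally, using GroupBy.shift instead of a cumulative string. It also avoids the substring pitfall of the `in` check above (e.g. 'pear' matches inside the concatenation 'applepear'):

```python
import pandas as pd

df = pd.DataFrame({
    "wanted": ["yes", "", "yes", "", "", ""],
    "fruit": ["apple", "pear", "pear", "apple", "mango", "pear"],
})

# For each row, look up the "wanted" value of the previous row
# with the same fruit; True only if that previous value was "yes"
prev_wanted = df.groupby("fruit")["wanted"].shift(1)
df["create_col"] = prev_wanted.eq("yes").map({True: True, False: ""})
```

Note this differs from the cum_list approach when a fruit was wanted once and then mentioned again without being wanted: shift only looks at the immediately preceding mention, which is what the question asks for.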

filter df using key words

I have a df:
item_name price stock
red apple 2 2
green apple 4 1
green grape 4 3
yellow apple 1 2
purple grape 4 1
I have another df:
Key Word Min_stock
red;grape 2
The result I would like to get is:
item_name price stock
red apple 2 2
green grape 4 3
I would like to filter the first df based on the second df: for the keyword, I would like to select item_name values that contain either key word in the Key Word column.
Is there any way to achieve it?
Assuming df1 and df2 are the DataFrames, you can compute a regex from the split and exploded df2, then extract and map the min values and filter with boolean indexing:
s = (df2.assign(kw=df2['Key Word'].str.split(';'))
        .explode('kw')
        .set_index('kw')['Min_stock']
     )
# red 2
# grape 2
# blue 10
# apple 10
regex = '|'.join(s.index)
# 'red|grape|blue|apple'
mask = df1['item_name'].str.extract(f'({regex})', expand=False).map(s)
# 0 2
# 1 10
# 2 2
# 3 10
# 4 2
out = df1[mask.notna() & df1['stock'].ge(mask)]
output:
item_name price stock
0 red apple 2 2
2 green grape 4 3
NB. for generalization, I used a different df2 as input:
Key Word Min_stock
red;grape 2
blue;apple 10
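Putting it together as a reproducible sketch (both DataFrames rebuilt from the question, using the generalized df2):

```python
import pandas as pd

df1 = pd.DataFrame({
    "item_name": ["red apple", "green apple", "green grape",
                  "yellow apple", "purple grape"],
    "price": [2, 4, 4, 1, 4],
    "stock": [2, 1, 3, 2, 1],
})
df2 = pd.DataFrame({"Key Word": ["red;grape", "blue;apple"],
                    "Min_stock": [2, 10]})

# keyword -> min-stock lookup
s = (df2.assign(kw=df2["Key Word"].str.split(";"))
        .explode("kw")
        .set_index("kw")["Min_stock"])

# first matching keyword per item, mapped to its threshold
mask = df1["item_name"].str.extract(f"({'|'.join(s.index)})", expand=False).map(s)
out = df1[mask.notna() & df1["stock"].ge(mask)]
```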

Extract only if there is only one matching word from the list

fruit_type = ['Apple','Banana','Cherries','Dragonfruit']
for row in df['sentence']:
sentence['fruit_type'] = df['sentence'].str.extract("(" + "|".join(fruit_type) +")", expand=False)
Result of the code above is:
df
sentence | fruit_type
here is an apple | apple
here is a banana, an apple | banana
here is an orange, a banana | orange
How do I revise the code so that if there are more than 1 fruit type in df['sentence'], df['fruit_type'] will return a NaN?
Instead of extract you can use extractall combined with groupby and apply.
First, to get all matches:
df['sentence'].str.extractall("(" + "|".join(fruit_type) +")")
             0
  match
0 0      apple
1 0     banana
  1      apple
2 0     banana
Note that the result has a pandas.MultiIndex.
Then, using .groupby(level=0)[0].apply(list) you will get:
0 [apple]
1 [banana, apple]
2 [banana]
And finally, after using .apply(lambda x: x[0] if len(x) == 1 else np.nan):
0 apple
1 NaN
2 banana
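Combined into one runnable sketch (my assembly of the steps above; note the question's fruit_type entries are capitalized while the sentences are lowercase, so passing flags=re.IGNORECASE is my assumption):

```python
import re
import numpy as np
import pandas as pd

fruit_type = ['Apple', 'Banana', 'Cherries', 'Dragonfruit']
df = pd.DataFrame({'sentence': ['here is an apple',
                                'here is a banana, an apple',
                                'here is an orange, a banana']})

pattern = "(" + "|".join(fruit_type) + ")"
# all matches per row -> list per row -> keep only rows with exactly one match
matches = (df['sentence'].str.extractall(pattern, flags=re.IGNORECASE)[0]
             .groupby(level=0).apply(list))
df['fruit_type'] = matches.apply(lambda x: x[0] if len(x) == 1 else np.nan)
```

Rows with no match at all are absent from `matches` and end up as NaN automatically via index alignment.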

combine columns containing empty strings into one column in python pandas

I have a dataframe like below.
df=pd.DataFrame({'apple': [1,0,1,0],
'red grape': [1,0,0,1],
'banana': [0,1,0,1]})
I need to create another column that combines these columns, separated with ';', like below:
fruits apple red grape banana
0 apple;red grape 1 1 0
1 banana 0 0 1
2 apple 1 0 0
3 red grape;banana 0 1 1
What I did was convert 1/0 to a string/empty string, then concatenate the columns:
df['apple'] = df.apple.apply(lambda x: 'apple' if x==1 else '')
df['red grape'] = df['red grape'].apply(lambda x: 'red grape' if x==1 else '')
df['banana'] = df['banana'].apply(lambda x: 'banana' if x==1 else '')
df['fruits'] = df['apple']+';'+df['red grape']+';'+df['banana']
apple red grape banana fruits
0 apple red grape apple;red grape;
1 banana ;;banana
2 apple apple;;
3 red grape banana ;red grape;banana
The separators are all screwed up because of the empty strings. Also, I want the solution to be more general: for example, I might have lots of such columns to combine, and I do not want to hardcode everything...
Does anyone know the best way to do this? Thanks a lot.
Use DataFrame.dot for matrix multiplication of the 0/1 values with the column names plus separator, insert the result as the first column with DataFrame.insert, and finally strip the trailing separator with Series.str.rstrip:
df.insert(0, 'fruits', df.dot(df.columns + ';').str.rstrip(';'))
print (df)
fruits apple red grape banana
0 apple;red grape 1 1 0
1 banana 0 0 1
2 apple 1 0 0
3 red grape;banana 0 1 1
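A row-wise alternative sketch (my addition; slower on large frames but arguably more explicit): for each row, keep the names of the columns equal to 1 and join them.

```python
import pandas as pd

df = pd.DataFrame({'apple': [1, 0, 1, 0],
                   'red grape': [1, 0, 0, 1],
                   'banana': [0, 1, 0, 1]})

# for each row, select the column labels where the value is 1
fruits = df.apply(lambda row: ';'.join(row.index[row.eq(1)]), axis=1)
df.insert(0, 'fruits', fruits)
```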
