Groupby drop duplicates - python

I have a df where the category is separated by underscores.
df
fruit cat
0 apple green_heavy_pricy
1 apple heavy_cheap
2 banana yellow
3 pear green
4 banana brown_raw_yellow
...
I want to create an agg column that gathers all the unique information per fruit. I tried df.groupby("fruit")["cat"].transform("unique"), but it does not give me what I need. Expected output:
fruit cat agg
0 apple green_heavy_pricy green_heavy_pricy_cheap
1 apple heavy_cheap green_heavy_pricy_cheap
2 banana yellow yellow_brown_raw
3 pear green green
4 banana brown_raw_yellow yellow_brown_raw

Use a custom lambda function with dict.fromkeys (which removes duplicates while preserving first-seen order) in GroupBy.transform:
f = lambda x: '_'.join(dict.fromkeys('_'.join(x).split('_')))
#alternative solution
#f = lambda x: '_'.join(pd.unique('_'.join(x).split('_')))
#alternative2 solution (split on '_', with distinct loop variables)
#f = lambda x: '_'.join(dict.fromkeys(z for y in x for z in y.split('_')))
df['agg'] = df.groupby("fruit")["cat"].transform(f)
print (df)
fruit cat agg
0 apple green_heavy_pricy green_heavy_pricy_cheap
1 apple heavy_cheap green_heavy_pricy_cheap
2 banana yellow yellow_brown_raw
3 pear green green
4 banana brown_raw_yellow yellow_brown_raw
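For reference, a self-contained sketch (DataFrame reconstructed from the question) showing why this works: join the group's strings, split on '_', and let dict.fromkeys drop duplicate tokens while keeping first-seen order:

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "apple", "banana", "pear", "banana"],
    "cat": ["green_heavy_pricy", "heavy_cheap", "yellow",
            "green", "brown_raw_yellow"],
})

# join all cat values in the group, split into tokens,
# then dedupe while preserving first-seen order via dict.fromkeys
f = lambda x: "_".join(dict.fromkeys("_".join(x).split("_")))
df["agg"] = df.groupby("fruit")["cat"].transform(f)
# df.loc[0, "agg"] -> "green_heavy_pricy_cheap"
```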

Related

Creating DataFrame based on two or more non-equal lists

I have two lists, let's say:
list1 = ["apple","banana"]
list2 = ["M","T","W","TR","F","S"]
I want to create a data frame of two columns fruit and day so that the result will look something like this
fruit   day
apple   M
apple   T
apple   W
apple   TR
apple   F
apple   S
banana  M
and so on...
Currently my actual data is columnar, meaning the items in list2 are in columns, but I want them in rows. Any help would be appreciated.
try this:
from itertools import product
import pandas as pd
list1 = ["apple","banana"]
list2 = ["M","T","W","TR","F","S"]
df = pd.DataFrame(
    product(list1, list2),
    columns=['fruit', 'day']
)
print(df)
>>>
fruit day
0 apple M
1 apple T
2 apple W
3 apple TR
4 apple F
5 apple S
6 banana M
7 banana T
8 banana W
9 banana TR
10 banana F
11 banana S
same result with merge:
df = pd.merge(pd.Series(list1, name='fruit'),
              pd.Series(list2, name='day'), how='cross')
print(df)
fruit day
0 apple M
1 apple T
2 apple W
3 apple TR
4 apple F
5 apple S
6 banana M
7 banana T
8 banana W
9 banana TR
10 banana F
11 banana S
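A third, equivalent sketch (my addition, not from the answer) using pd.MultiIndex.from_product, which builds the cross product without itertools:

```python
import pandas as pd

list1 = ["apple", "banana"]
list2 = ["M", "T", "W", "TR", "F", "S"]

# from_product builds the cartesian product as a MultiIndex;
# to_frame(index=False) turns the two levels into ordinary columns
df = (pd.MultiIndex.from_product([list1, list2], names=["fruit", "day"])
        .to_frame(index=False))
```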

Fill values based on adjacent column

How could I create create_col? For each row, find the previous time where that fruit was mentioned and check if the wanted column was yes?
wanted fruit create_col
0 yes apple
1 pear
2 pear < last time pear was mentioned, wanted was not yes, so blank
3 apple True < last time apple was mentioned, wanted was yes, so True
df
###
wanted fruit
0 yes apple
1 pear
2 yes pear
3 apple
4 mango
5 pear
import numpy as np

df['cum_list'] = df[df['wanted'].eq('yes')]['fruit'].cumsum()
df['cum_list'] = df['cum_list'].shift(1).ffill()
df.fillna('', inplace=True)
df['create_col'] = np.where(df.apply(lambda x: x['fruit'] in x['cum_list'], axis=1), True, '')
df.drop(columns=['cum_list'], inplace=True)
df
###
wanted fruit create_col
0 yes apple
1 pear
2 yes pear
3 apple True
4 mango
5 pear True
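An alternative sketch (my addition) that follows the question's "previous time that fruit was mentioned" wording literally, using GroupBy.shift instead of a cumulative string. It also avoids the substring pitfall of the `in` check above (e.g. 'pear' matches inside the concatenation 'applepear'):

```python
import pandas as pd

df = pd.DataFrame({
    "wanted": ["yes", "", "yes", "", "", ""],
    "fruit": ["apple", "pear", "pear", "apple", "mango", "pear"],
})

# For each row, look up the "wanted" value of the previous row
# with the same fruit; True only if that previous value was "yes"
prev_wanted = df.groupby("fruit")["wanted"].shift(1)
df["create_col"] = prev_wanted.eq("yes").map({True: True, False: ""})
```

Note this differs from the cum_list approach when a fruit was wanted once and then mentioned again without being wanted: shift only looks at the immediately preceding mention, which is what the question asks for.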

filter df using key words

I have a df:
item_name price stock
red apple 2 2
green apple 4 1
green grape 4 3
yellow apple 1 2
purple grape 4 1
I have another df:
Key Word Min_stock
red;grape 2
The result I would like to get is:
item_name price stock
red apple 2 2
green grape 4 3
I would like to filter the first df based on the second df: for the keyword, I would like to select item_name values that contain either key word in the Key Word column.
Is there any way to achieve it?
Assuming df1 and df2 are the DataFrames, you can compute a regex from the split and exploded df2, then extract and map the min values and filter with boolean indexing:
s = (df2.assign(kw=df2['Key Word'].str.split(';'))
        .explode('kw')
        .set_index('kw')['Min_stock']
     )
# red 2
# grape 2
# blue 10
# apple 10
regex = '|'.join(s.index)
# 'red|grape|blue|apple'
mask = df1['item_name'].str.extract(f'({regex})', expand=False).map(s)
# 0 2
# 1 10
# 2 2
# 3 10
# 4 2
out = df1[mask.notna() & df1['stock'].ge(mask)]
output:
item_name price stock
0 red apple 2 2
2 green grape 4 3
NB. for generalization, I used a different df2 as input:
Key Word Min_stock
red;grape 2
blue;apple 10
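Putting it together as a reproducible sketch (both DataFrames rebuilt from the question, using the generalized df2):

```python
import pandas as pd

df1 = pd.DataFrame({
    "item_name": ["red apple", "green apple", "green grape",
                  "yellow apple", "purple grape"],
    "price": [2, 4, 4, 1, 4],
    "stock": [2, 1, 3, 2, 1],
})
df2 = pd.DataFrame({"Key Word": ["red;grape", "blue;apple"],
                    "Min_stock": [2, 10]})

# keyword -> min-stock lookup
s = (df2.assign(kw=df2["Key Word"].str.split(";"))
        .explode("kw")
        .set_index("kw")["Min_stock"])

# first matching keyword per item, mapped to its threshold
mask = df1["item_name"].str.extract(f"({'|'.join(s.index)})", expand=False).map(s)
out = df1[mask.notna() & df1["stock"].ge(mask)]
```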

Extract only if there is only one matching word from the list

fruit_type = ['Apple','Banana','Cherries','Dragonfruit']
for row in df['sentence']:
sentence['fruit_type'] = df['sentence'].str.extract("(" + "|".join(fruit_type) +")", expand=False)
Result of the code above is:
df
sentence | fruit_type
here is an apple | apple
here is a banana, an apple | banana
here is an orange, a banana | orange
How do I revise the code so that if there are more than 1 fruit type in df['sentence'], df['fruit_type'] will return a NaN?
Instead of extract you can use extractall combined with groupby and apply.
First, to get all matches:
df['sentence'].str.extractall("(" + "|".join(fruit_type) +")")
             0
  match
0 0      apple
1 0     banana
  1      apple
2 0     banana
Note that the result has a pandas.MultiIndex.
Then, using .groupby(level=0)[0].apply(list) you will get:
0 [apple]
1 [banana, apple]
2 [banana]
And finally, after using .apply(lambda x: x[0] if len(x) == 1 else np.nan):
0 apple
1 NaN
2 banana
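Combined into one runnable sketch (my assembly of the steps above; note the question's fruit_type entries are capitalized while the sentences are lowercase, so passing flags=re.IGNORECASE is my assumption):

```python
import re
import numpy as np
import pandas as pd

fruit_type = ['Apple', 'Banana', 'Cherries', 'Dragonfruit']
df = pd.DataFrame({'sentence': ['here is an apple',
                                'here is a banana, an apple',
                                'here is an orange, a banana']})

pattern = "(" + "|".join(fruit_type) + ")"
# all matches per row -> list per row -> keep only rows with exactly one match
matches = (df['sentence'].str.extractall(pattern, flags=re.IGNORECASE)[0]
             .groupby(level=0).apply(list))
df['fruit_type'] = matches.apply(lambda x: x[0] if len(x) == 1 else np.nan)
```

Rows with no match at all are absent from `matches` and end up as NaN automatically via index alignment.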

combine columns containing empty strings into one column in python pandas

I have a dataframe like below.
df=pd.DataFrame({'apple': [1,0,1,0],
'red grape': [1,0,0,1],
'banana': [0,1,0,1]})
I need to create another column that combines these columns, separated with ';', like below:
fruits apple red grape banana
0 apple;red grape 1 1 0
1 banana 0 0 1
2 apple 1 0 0
3 red grape;banana 0 1 1
What I did was convert 1/0 to a string/empty string, then concatenate the columns:
df['apple'] = df.apple.apply(lambda x: 'apple' if x==1 else '')
df['red grape'] = df['red grape'].apply(lambda x: 'red grape' if x==1 else '')
df['banana'] = df['banana'].apply(lambda x: 'banana' if x==1 else '')
df['fruits'] = df['apple']+';'+df['red grape']+';'+df['banana']
apple red grape banana fruits
0 apple red grape apple;red grape;
1 banana ;;banana
2 apple apple;;
3 red grape banana ;red grape;banana
The separators are all screwed up because of the empty strings. Also, I want the solution to be more general: for example, I might have lots of such columns to combine, and I do not want to hardcode everything...
Does anyone know the best way to do this? Thanks a lot.
Use DataFrame.dot for matrix multiplication of the 0/1 values with the column names plus separator, insert the result as the first column with DataFrame.insert, and finally strip the trailing separator with Series.str.rstrip:
df.insert(0, 'fruits', df.dot(df.columns + ';').str.rstrip(';'))
print (df)
fruits apple red grape banana
0 apple;red grape 1 1 0
1 banana 0 0 1
2 apple 1 0 0
3 red grape;banana 0 1 1
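A row-wise alternative sketch (my addition; slower on large frames but arguably more explicit): for each row, keep the names of the columns equal to 1 and join them.

```python
import pandas as pd

df = pd.DataFrame({'apple': [1, 0, 1, 0],
                   'red grape': [1, 0, 0, 1],
                   'banana': [0, 1, 0, 1]})

# for each row, select the column labels where the value is 1
fruits = df.apply(lambda row: ';'.join(row.index[row.eq(1)]), axis=1)
df.insert(0, 'fruits', fruits)
```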
