I have a df where the category is separated by underscores.
df
fruit cat
0 apple green_heavy_pricy
1 apple heavy_cheap
2 banana yellow
3 pear green
4 banana brown_raw_yellow
...
I want to create an agg column that gathers all the unique information per fruit. I tried df.groupby("fruit")["cat"].transform("unique"), but it did not give the result I need. Expected output:
fruit cat agg
0 apple green_heavy_pricy green_heavy_pricy_cheap
1 apple heavy_cheap green_heavy_pricy_cheap
2 banana yellow yellow_brown_raw
3 pear green green
4 banana brown_raw_yellow yellow_brown_raw
Use a custom lambda function with dict.fromkeys in GroupBy.transform:
f = lambda x: '_'.join(dict.fromkeys('_'.join(x).split('_')))
#alternative solution
#f = lambda x: '_'.join(pd.unique('_'.join(x).split('_')))
#alternative2 solution
#f = lambda x: '_'.join(dict.fromkeys(z for y in x for z in y.split('_')))
df['agg'] = df.groupby("fruit")["cat"].transform(f)
print(df)
fruit cat agg
0 apple green_heavy_pricy green_heavy_pricy_cheap
1 apple heavy_cheap green_heavy_pricy_cheap
2 banana yellow yellow_brown_raw
3 pear green green
4 banana brown_raw_yellow yellow_brown_raw
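As a self-contained sketch, rebuilding the sample frame from the question (dict.fromkeys preserves first-seen order, which is what keeps the tokens deduplicated without reordering):

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "apple", "banana", "pear", "banana"],
    "cat": ["green_heavy_pricy", "heavy_cheap", "yellow", "green", "brown_raw_yellow"],
})

# Join each group's strings, re-split on '_', and deduplicate preserving order
f = lambda x: "_".join(dict.fromkeys("_".join(x).split("_")))
df["agg"] = df.groupby("fruit")["cat"].transform(f)
print(df)
```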
I have a df:
item_name price stock
red apple 2 2
green apple 4 1
green grape 4 3
yellow apple 1 2
purple grape 4 1
I have another df:
Key Word Min_stock
red;grape 2
The result I would like to get is:
item_name price stock
red apple 2 2
green grape 4 3
I would like to filter the first df based on the second df: for each row, I would like to select item_name values that contain either key word from the Key Word column.
Is there any way to achieve it?
Assuming df1 and df2 are the DataFrames, you can compute a regex from the split and exploded df2, then extract and map the minimum values and filter with boolean indexing:
s = (df2.assign(kw=df2['Key Word'].str.split(';'))
.explode('kw')
.set_index('kw')['Min_stock']
)
# red 2
# grape 2
# blue 10
# apple 10
regex = '|'.join(s.index)
# 'red|grape|blue|apple'
mask = df1['item_name'].str.extract(f'({regex})', expand=False).map(s)
# 0 2
# 1 10
# 2 2
# 3 10
# 4 2
out = df1[mask.notna() & df1['stock'].ge(mask)]
Output:
item_name price stock
0 red apple 2 2
2 green grape 4 3
NB: to generalize, I used a different df2 as input:
Key Word Min_stock
red;grape 2
blue;apple 10
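For reference, the whole answer as a self-contained sketch, using the generalized df2:

```python
import pandas as pd

df1 = pd.DataFrame({
    "item_name": ["red apple", "green apple", "green grape", "yellow apple", "purple grape"],
    "price": [2, 4, 4, 1, 4],
    "stock": [2, 1, 3, 2, 1],
})
df2 = pd.DataFrame({"Key Word": ["red;grape", "blue;apple"], "Min_stock": [2, 10]})

# keyword -> minimum stock lookup, one row per keyword
s = (df2.assign(kw=df2["Key Word"].str.split(";"))
        .explode("kw")
        .set_index("kw")["Min_stock"])

# leftmost keyword found in each item_name, mapped to its Min_stock (NaN if none)
regex = "|".join(s.index)
mask = df1["item_name"].str.extract(f"({regex})", expand=False).map(s)

out = df1[mask.notna() & df1["stock"].ge(mask)]
print(out)
```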
I have some data that looks like this:
import pandas as pd
fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1':['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash']})
print(df.head())
col1
0 i want an apple
1 i hate pears
2 please buy a peach and an apple
3 I want squash
I need a solution that creates a column for each item in fruits and gives a 1 or 0 value indicating whether or not col contains that value. Ideally, the output will look like this:
goal_df = pd.DataFrame({'col1':['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash'],
'apple': [1, 0, 1, 0],
'pear': [0, 1, 0, 0],
'peach': [0, 0, 1, 0]})
print(goal_df.head())
col1 apple pear peach
0 i want an apple 1 0 0
1 i hate pears 0 1 0
2 please buy a peach and an apple 1 0 1
3 I want squash 0 0 0
I tried this, but it did not work (using a whole Series in an if raises "ValueError: The truth value of a Series is ambiguous"):
for i in fruits:
    if df['col1'].str.contains(i):
        df[i] = 1
    else:
        df[i] = 0
items = ['apple', 'pear', 'peach']
for it in items:
    df[it] = df['col1'].str.contains(it, case=False).astype(int)
Output:
>>> df
col1 apple pear peach
0 i want an apple 1 0 0
1 i hate pears 0 1 0
2 please buy a peach and an apple 1 0 1
3 I want squash 0 0 0
Use str.extractall to extract the words, then pd.crosstab:
pattern = f"({'|'.join(fruits)})"
s = df['col1'].str.extractall(pattern)
df[fruits] = (pd.crosstab(s.index.get_level_values(0), s[0].values)
.reindex(index=df.index, columns=fruits, fill_value=0)
)
Output:
col1 apple pear peach
0 i want an apple 1 0 0
1 i hate pears 0 1 0
2 please buy a peach and an apple 1 0 1
3 I want squash 0 0 0
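The crosstab route as a self-contained, runnable sketch (the reindex restores the original row order, lines the columns up with fruits, and fills unmatched rows with 0):

```python
import pandas as pd

fruits = ["apple", "pear", "peach"]
df = pd.DataFrame({"col1": ["i want an apple", "i hate pears",
                            "please buy a peach and an apple", "I want squash"]})

pattern = f"({'|'.join(fruits)})"
s = df["col1"].str.extractall(pattern)  # one row per match, MultiIndex (row, match)
df[fruits] = (pd.crosstab(s.index.get_level_values(0), s[0].values)
                .reindex(index=df.index, columns=fruits, fill_value=0))
print(df)
```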
You can use the function below for the apple column and do the same for the others:
def has_apple(st):
    if "apple" in st.lower():
        return 1
    return 0

df['apple'] = df['col1'].apply(has_apple)
Try:
- Get all matching fruits using str.extractall
- Use pd.get_dummies to get indicator values
- join to the original DataFrame
matches = pd.get_dummies(df["col1"].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1))
output = df.join(matches.groupby(level=0).sum()).fillna(0)
>>> output
col1 apple peach pear
0 i want an apple 1.0 0.0 0.0
1 i hate pears 0.0 0.0 1.0
2 please buy a peach and an apple 1.0 1.0 0.0
3 I want squash 0.0 0.0 0.0
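Because the join introduces NaN for rows with no match, the indicator columns come back as floats; a fillna/astype pass (a small cleanup, not part of the original answer) restores integer dtype:

```python
import pandas as pd

fruits = ["apple", "pear", "peach"]
df = pd.DataFrame({"col1": ["i want an apple", "i hate pears",
                            "please buy a peach and an apple", "I want squash"]})

matches = pd.get_dummies(
    df["col1"].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1))
output = (df.join(matches.groupby(level=0).sum())
            .fillna(0)
            .astype({f: int for f in fruits}))
print(output)
```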
I thought of another, completely different one-liner:
df[items] = (df['col1'].str.findall('|'.join(items))
               .str.join('|').str.get_dummies('|')
               .reindex(columns=items, fill_value=0))
Note: str.get_dummies sorts its columns alphabetically (apple, peach, pear), so the reindex is needed to line the columns up with items before assigning.
Output:
>>> df
                              col1  apple  pear  peach
0                  i want an apple      1     0     0
1                     i hate pears      0     1     0
2  please buy a peach and an apple      1     0     1
3                    I want squash      0     0     0
Try using np.where from the numpy library:
import numpy as np

fruit = ['apple', 'pear', 'peach']
for i in fruit:
    df[i] = np.where(df.col1.str.contains(i), 1, 0)
fruit_type = ['apple', 'banana', 'cherries', 'dragonfruit']
df['fruit_type'] = df['sentence'].str.extract("(" + "|".join(fruit_type) + ")", expand=False)
Result of the code above is:
df
sentence                    | fruit_type
here is an apple            | apple
here is a banana, an apple  | banana
here is an orange, a banana | banana
How do I revise the code so that if there are more than 1 fruit type in df['sentence'], df['fruit_type'] will return a NaN?
Instead of extract you can use extractall combined with groupby and apply:
First, to get all matches:
df['sentence'].str.extractall("(" + "|".join(fruit_type) +")")
0
match
0 0 apple
1 0 banana
1 apple
2 0 banana
Note that the result has a pandas MultiIndex (outer level: original row, inner level: match number).
Then, using .groupby(level=0)[0].apply(list) you will get:
0 [apple]
1 [banana, apple]
2 [banana]
And finally, after using .apply(lambda x: x[0] if len(x) == 1 else np.nan):
0 apple
1 NaN
2 banana
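Putting the three steps together as a self-contained sketch (using a lowercase fruit list so it actually matches the lowercase sentences; the sample rows are illustrative):

```python
import numpy as np
import pandas as pd

fruit_type = ["apple", "banana", "cherries", "dragonfruit"]
df = pd.DataFrame({"sentence": ["here is an apple",
                                "here is a banana, an apple",
                                "here is a cherries pie"]})

# one row per match, then collect per original row and keep only single matches
matches = df["sentence"].str.extractall("(" + "|".join(fruit_type) + ")")
df["fruit_type"] = (matches.groupby(level=0)[0].apply(list)
                           .apply(lambda x: x[0] if len(x) == 1 else np.nan))
print(df)
```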
Let's say I have a data frame that looks like this:
A
0 Apple
1 orange
2 pear
3 apple
For index values 4-1000, I want all of them to say "watermelon".
Any suggestions?
Reindex and fill NaNs:
df.reindex(np.r_[:1000]).fillna('watermelon')
Or,
df = df.reindex(np.r_[:1000])
df.iloc[df['A'].last_valid_index() + 1:, 0] = 'watermelon' # df.iloc[4:, 0] = "..."
A
0 Apple
1 orange
2 pear
3 apple
4 watermelon
5 watermelon
...
999 watermelon
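The second variant as a self-contained sketch (np.r_[:1000] is just the integer range 0..999 used as the new index):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["Apple", "orange", "pear", "apple"]})

df = df.reindex(np.r_[:1000])  # extend the index to 0..999; new rows are NaN
df.iloc[df["A"].last_valid_index() + 1:, 0] = "watermelon"
print(df)
```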