I have a df where the category is separated by underscores.
df
fruit cat
0 apple green_heavy_pricy
1 apple heavy_cheap
2 banana yellow
3 pear green
4 banana brown_raw_yellow
...
I want to create an agg column that gathers all the unique information per fruit. I tried df.groupby("fruit")["cat"].transform("unique"), but it did not give the result I need. Expected output:
fruit cat agg
0 apple green_heavy_pricy green_heavy_pricy_cheap
1 apple heavy_cheap green_heavy_pricy_cheap
2 banana yellow yellow_brown_raw
3 pear green green
4 banana brown_raw_yellow yellow_brown_raw
Use a custom lambda function with dict.fromkeys in GroupBy.transform:
f = lambda x: '_'.join(dict.fromkeys('_'.join(x).split('_')))
#alternative solution
#f = lambda x: '_'.join(pd.unique('_'.join(x).split('_')))
#alternative2 solution
#f = lambda x: '_'.join(dict.fromkeys(z for y in x for z in y.split('_')))
df['agg'] = df.groupby("fruit")["cat"].transform(f)
print(df)
fruit cat agg
0 apple green_heavy_pricy green_heavy_pricy_cheap
1 apple heavy_cheap green_heavy_pricy_cheap
2 banana yellow yellow_brown_raw
3 pear green green
4 banana brown_raw_yellow yellow_brown_raw
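As a self-contained sketch, rebuilding the sample frame from the question (dict.fromkeys preserves first-seen order, which is what keeps the tokens deduplicated without reordering):

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "apple", "banana", "pear", "banana"],
    "cat": ["green_heavy_pricy", "heavy_cheap", "yellow", "green", "brown_raw_yellow"],
})

# Join each group's strings, re-split on '_', and deduplicate preserving order
f = lambda x: "_".join(dict.fromkeys("_".join(x).split("_")))
df["agg"] = df.groupby("fruit")["cat"].transform(f)
print(df)
```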
I have a df:
item_name price stock
red apple 2 2
green apple 4 1
green grape 4 3
yellow apple 1 2
purple grape 4 1
I have another df:
Key Word Min_stock
red;grape 2
The result I would like to get is:
item_name price stock
red apple 2 2
green grape 4 3
I would like to filter the first df based on the second df: for each row, I would like to select item_name values that contain either key word from the Key Word column.
Is there any way to achieve it?
Assuming df1 and df2 are the DataFrames, you can compute a regex from the split and exploded df2, then extract and map the minimum values and filter with boolean indexing:
s = (df2.assign(kw=df2['Key Word'].str.split(';'))
.explode('kw')
.set_index('kw')['Min_stock']
)
# red 2
# grape 2
# blue 10
# apple 10
regex = '|'.join(s.index)
# 'red|grape|blue|apple'
mask = df1['item_name'].str.extract(f'({regex})', expand=False).map(s)
# 0 2
# 1 10
# 2 2
# 3 10
# 4 2
out = df1[mask.notna() & df1['stock'].ge(mask)]
Output:
item_name price stock
0 red apple 2 2
2 green grape 4 3
NB: to generalize, I used a different df2 as input:
Key Word Min_stock
red;grape 2
blue;apple 10
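For reference, the whole answer as a self-contained sketch, using the generalized df2:

```python
import pandas as pd

df1 = pd.DataFrame({
    "item_name": ["red apple", "green apple", "green grape", "yellow apple", "purple grape"],
    "price": [2, 4, 4, 1, 4],
    "stock": [2, 1, 3, 2, 1],
})
df2 = pd.DataFrame({"Key Word": ["red;grape", "blue;apple"], "Min_stock": [2, 10]})

# keyword -> minimum stock lookup, one row per keyword
s = (df2.assign(kw=df2["Key Word"].str.split(";"))
        .explode("kw")
        .set_index("kw")["Min_stock"])

# leftmost keyword found in each item_name, mapped to its Min_stock (NaN if none)
regex = "|".join(s.index)
mask = df1["item_name"].str.extract(f"({regex})", expand=False).map(s)

out = df1[mask.notna() & df1["stock"].ge(mask)]
print(out)
```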
I have some data that looks like this:
import pandas as pd
fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1':['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash']})
print(df.head())
col1
0 i want an apple
1 i hate pears
2 please buy a peach and an apple
3 I want squash
I need a solution that creates a column for each item in fruits and gives a 1 or 0 value indicating whether or not col contains that value. Ideally, the output will look like this:
goal_df = pd.DataFrame({'col1':['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash'],
'apple': [1, 0, 1, 0],
'pear': [0, 1, 0, 0],
'peach': [0, 0, 1, 0]})
print(goal_df.head())
col1 apple pear peach
0 i want an apple 1 0 0
1 i hate pears 0 1 0
2 please buy a peach and an apple 1 0 1
3 I want squash 0 0 0
I tried this, but it did not work (using a whole Series in an if raises "ValueError: The truth value of a Series is ambiguous"):
for i in fruits:
    if df['col1'].str.contains(i):
        df[i] = 1
    else:
        df[i] = 0
items = ['apple', 'pear', 'peach']
for it in items:
    df[it] = df['col1'].str.contains(it, case=False).astype(int)
Output:
>>> df
col1 apple pear peach
0 i want an apple 1 0 0
1 i hate pears 0 1 0
2 please buy a peach and an apple 1 0 1
3 I want squash 0 0 0
Use str.extractall to extract the words, then pd.crosstab:
pattern = f"({'|'.join(fruits)})"
s = df['col1'].str.extractall(pattern)
df[fruits] = (pd.crosstab(s.index.get_level_values(0), s[0].values)
.reindex(index=df.index, columns=fruits, fill_value=0)
)
Output:
col1 apple pear peach
0 i want an apple 1 0 0
1 i hate pears 0 1 0
2 please buy a peach and an apple 1 0 1
3 I want squash 0 0 0
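The crosstab route as a self-contained, runnable sketch (the reindex restores the original row order, lines the columns up with fruits, and fills unmatched rows with 0):

```python
import pandas as pd

fruits = ["apple", "pear", "peach"]
df = pd.DataFrame({"col1": ["i want an apple", "i hate pears",
                            "please buy a peach and an apple", "I want squash"]})

pattern = f"({'|'.join(fruits)})"
s = df["col1"].str.extractall(pattern)  # one row per match, MultiIndex (row, match)
df[fruits] = (pd.crosstab(s.index.get_level_values(0), s[0].values)
                .reindex(index=df.index, columns=fruits, fill_value=0))
print(df)
```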
You can use the function below for the apple column and do the same for the others:
def has_apple(st):
    if "apple" in st.lower():
        return 1
    return 0

df['apple'] = df['col1'].apply(has_apple)
Try:
- Get all matching fruits using str.extractall
- Use pd.get_dummies to get indicator values
- join to the original DataFrame
matches = pd.get_dummies(df["col1"].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1))
output = df.join(matches.groupby(level=0).sum()).fillna(0)
>>> output
col1 apple peach pear
0 i want an apple 1.0 0.0 0.0
1 i hate pears 0.0 0.0 1.0
2 please buy a peach and an apple 1.0 1.0 0.0
3 I want squash 0.0 0.0 0.0
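Because the join introduces NaN for rows with no match, the indicator columns come back as floats; a fillna/astype pass (a small cleanup, not part of the original answer) restores integer dtype:

```python
import pandas as pd

fruits = ["apple", "pear", "peach"]
df = pd.DataFrame({"col1": ["i want an apple", "i hate pears",
                            "please buy a peach and an apple", "I want squash"]})

matches = pd.get_dummies(
    df["col1"].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1))
output = (df.join(matches.groupby(level=0).sum())
            .fillna(0)
            .astype({f: int for f in fruits}))
print(output)
```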
I thought of another, completely different one-liner:
df[items] = (df['col1'].str.findall('|'.join(items))
               .str.join('|').str.get_dummies('|')
               .reindex(columns=items, fill_value=0))
Note: str.get_dummies sorts its columns alphabetically (apple, peach, pear), so the reindex is needed to line the columns up with items before assigning.
Output:
>>> df
                              col1  apple  pear  peach
0                  i want an apple      1     0     0
1                     i hate pears      0     1     0
2  please buy a peach and an apple      1     0     1
3                    I want squash      0     0     0
Try using np.where from the numpy library:
import numpy as np

fruit = ['apple', 'pear', 'peach']
for i in fruit:
    df[i] = np.where(df.col1.str.contains(i), 1, 0)
fruit_type = ['apple', 'banana', 'cherries', 'dragonfruit']
df['fruit_type'] = df['sentence'].str.extract("(" + "|".join(fruit_type) + ")", expand=False)
Result of the code above is:
df
sentence                    | fruit_type
here is an apple            | apple
here is a banana, an apple  | banana
here is an orange, a banana | banana
How do I revise the code so that if there are more than 1 fruit type in df['sentence'], df['fruit_type'] will return a NaN?
Instead of extract you can use extractall combined with groupby and apply:
First, to get all matches:
df['sentence'].str.extractall("(" + "|".join(fruit_type) +")")
0
match
0 0 apple
1 0 banana
1 apple
2 0 banana
Note that the result has a pandas MultiIndex (outer level: original row, inner level: match number).
Then, using .groupby(level=0)[0].apply(list) you will get:
0 [apple]
1 [banana, apple]
2 [banana]
And finally, after using .apply(lambda x: x[0] if len(x) == 1 else np.nan):
0 apple
1 NaN
2 banana
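Putting the three steps together as a self-contained sketch (using a lowercase fruit list so it actually matches the lowercase sentences; the sample rows are illustrative):

```python
import numpy as np
import pandas as pd

fruit_type = ["apple", "banana", "cherries", "dragonfruit"]
df = pd.DataFrame({"sentence": ["here is an apple",
                                "here is a banana, an apple",
                                "here is a cherries pie"]})

# one row per match, then collect per original row and keep only single matches
matches = df["sentence"].str.extractall("(" + "|".join(fruit_type) + ")")
df["fruit_type"] = (matches.groupby(level=0)[0].apply(list)
                           .apply(lambda x: x[0] if len(x) == 1 else np.nan))
print(df)
```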
Let's say I have a data frame that looks like this:
A
0 Apple
1 orange
2 pear
3 apple
For index values 4-1000, I want all of them to say "watermelon".
Any suggestions?
Reindex and fill NaNs:
df.reindex(np.r_[:1000]).fillna('watermelon')
Or,
df = df.reindex(np.r_[:1000])
df.iloc[df['A'].last_valid_index() + 1:, 0] = 'watermelon' # df.iloc[4:, 0] = "..."
A
0 Apple
1 orange
2 pear
3 apple
4 watermelon
5 watermelon
...
999 watermelon
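The second variant as a self-contained sketch (np.r_[:1000] is just the integer range 0..999 used as the new index):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["Apple", "orange", "pear", "apple"]})

df = df.reindex(np.r_[:1000])  # extend the index to 0..999; new rows are NaN
df.iloc[df["A"].last_valid_index() + 1:, 0] = "watermelon"
print(df)
```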