Create new indicator columns based on values in another column - python

I have some data that looks like this:
import pandas as pd
fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1':['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash']})
print(df.head())
                              col1
0                  i want an apple
1                     i hate pears
2  please buy a peach and an apple
3                    I want squash
I need a solution that creates a column for each item in fruits, with a 1 or 0 indicating whether col1 contains that fruit. Ideally, the output will look like this:
goal_df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash'],
                        'apple': [1, 0, 1, 0],
                        'pear': [0, 1, 0, 0],
                        'peach': [0, 0, 1, 0]})
print(goal_df.head())
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0
I tried this, but it fails with ValueError: The truth value of a Series is ambiguous, because str.contains returns a whole boolean Series rather than a single True/False:
for i in fruits:
    if df['col1'].str.contains(i):
        df[i] = 1
    else:
        df[i] = 0

items = ['apple', 'pear', 'peach']
for it in items:
    df[it] = df['col1'].str.contains(it, case=False).astype(int)
Output:
>>> df
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0
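Note that str.contains does plain substring matching here, which is what the goal output relies on ('pears' still counts as a pear mention). If whole-word matches were ever required instead, a word-boundary regex is one option (a sketch; it would set pear to 0 for 'i hate pears'):
for it in items:
    # \b anchors the match at word boundaries, so 'pear' no longer matches 'pears'
    df[it] = df['col1'].str.contains(rf'\b{it}\b', case=False, regex=True).astype(int)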

Use str.extractall to extract the words, then pd.crosstab:
pattern = f"({'|'.join(fruits)})"
s = df['col1'].str.extractall(pattern)
df[fruits] = (pd.crosstab(s.index.get_level_values(0), s[0].values)
                .reindex(index=df.index, columns=fruits, fill_value=0))
Output:
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0
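For reference, the intermediate s is a one-column frame indexed by (row, match number); pd.crosstab pivots it into counts, and the reindex restores rows with no match (like index 3) as all zeros:
>>> s
              0
  match
0 0       apple
1 0        pear
2 0       peach
  1       apple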

You can use the function below for the apple column and do the same for the others:
def has_apple(st):
    if "apple" in st.lower():
        return 1
    return 0

df['apple'] = df['col1'].apply(has_apple)
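To avoid writing one function per fruit, the same check can be parameterized and looped (a small generalization of the above):
def has_fruit(st, fruit):
    return 1 if fruit in st.lower() else 0

for fruit in fruits:
    # extra keyword arguments to Series.apply are forwarded to the function
    df[fruit] = df['col1'].apply(has_fruit, fruit=fruit)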

Try:
1. Get all matching fruits using str.extractall
2. Use pd.get_dummies to get indicator values
3. join the result back to the original DataFrame
matches = pd.get_dummies(df["col1"].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1, 0))
output = df.join(matches.groupby(level=0).sum()).fillna(0)
>>> output
                              col1  apple  peach  pear
0                  i want an apple    1.0    0.0   0.0
1                     i hate pears    0.0    0.0   1.0
2  please buy a peach and an apple    1.0    1.0   0.0
3                    I want squash    0.0    0.0   0.0
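The NaN values introduced by the join for non-matching rows force the columns to float; if integer flags are needed to match the goal, a cast after the fillna fixes that (sketch):
output = (df.join(matches.groupby(level=0).sum())
            .fillna(0)
            .astype({c: int for c in matches.columns}))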

I thought of another, completely different one-liner. One caveat: str.get_dummies returns its columns in alphabetical order (apple, peach, pear), so the result should be reindexed to the order of items before assigning, otherwise the assignment can pair values with the wrong columns:
df[items] = (df['col1'].str.findall('|'.join(items))
                       .str.join('|')
                       .str.get_dummies('|')
                       .reindex(columns=items, fill_value=0))
Output:
>>> df
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0

Try using np.where from the numpy library:
import numpy as np

fruit = ['apple', 'pear', 'peach']
for i in fruit:
    df[i] = np.where(df.col1.str.contains(i), 1, 0)

Related

Fill values based on adjacent column

How could I create create_col? For each row, I want to find the previous time that fruit was mentioned and check whether the wanted column was yes at that point.
  wanted  fruit create_col
0    yes  apple
1          pear
2          pear             <- last time pear was mentioned, wanted was not yes, so blank
3         apple       True  <- last time apple was mentioned, wanted was yes, so True
df
###
  wanted  fruit
0    yes  apple
1          pear
2    yes   pear
3         apple
4         mango
5          pear
import numpy as np

# running concatenation of the fruits whose row had wanted == 'yes'
df['cum_list'] = df[df['wanted'].eq('yes')]['fruit'].cumsum()
# shift so each row only sees fruits wanted *before* it, then forward-fill
df['cum_list'] = df['cum_list'].shift(1).ffill()
df.fillna('', inplace=True)
df['create_col'] = np.where(df.apply(lambda x: x['fruit'] in x['cum_list'], axis=1), True, '')
df.drop(columns=['cum_list'], inplace=True)
df
###
  wanted  fruit create_col
0    yes  apple
1          pear
2    yes   pear
3         apple       True
4         mango
5          pear       True
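One caveat: cum_list is a plain string concatenation of fruit names, so the in test could false-positive if one fruit name is a substring of another (say, 'apple' inside 'pineapple'). A variant that tracks the wanted fruits in a set avoids that (a sketch, using the same column names as the question):
wanted_so_far = set()
create_col = []
for _, row in df.iterrows():
    # True if this fruit was previously mentioned on a row where wanted == 'yes'
    create_col.append(True if row['fruit'] in wanted_so_far else '')
    if row['wanted'] == 'yes':
        wanted_so_far.add(row['fruit'])
df['create_col'] = create_col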

Pandas, find the number of times a combination of rows appear under a different column ID

I have a dataset that looks as follows:
df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})
df
   purchase    item
0         1   apple
1         1  banana
2         2   apple
3         2  banana
4         2    pear
5         3   apple
And I need an output such as
   item_1  item_2  purchase
0   apple  banana         2
1  banana    pear         1
2   apple    pear         1
A table counting how many times a combination of two fruits was purchased in the same purchase.
In this example's first row, the values are apple, banana, 2 because there are two purchases (purchase ID 1 and purchase ID 2) where the person bought both apple and banana. The apple, pear row has a count of 1 because there's only one purchase (purchase ID 2) where the person bought both apple and pear.
My code so far:
df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})
dummies = pd.get_dummies(df['item'])
df2 = pd.concat([df['purchase'], dummies], axis=1)
Creates a table like this:
   purchase  apple  banana  pear
0         1      1       0     0
1         1      0       1     0
2         2      1       0     0
3         2      0       1     0
4         2      0       0     1
5         3      1       0     0
Now, I don't know how to proceed to get the wanted result (and I'm aware my output is far from the wanted one). I tried some group by's but it didn't work.
This is probably not the most efficient, but it seems to get the job done:
In [3]: from itertools import combinations
In [4]: combos = df.groupby("purchase")["item"].apply(lambda row: list(combinations(row, 2))).explode().value_counts()
In [5]: combos.reset_index()
Out[5]:
             index  item
0  (apple, banana)     2
1    (apple, pear)     1
2   (banana, pear)     1
From there,
In [6]: pd.DataFrame([[*x, y] for x, y in zip(combos.index, combos)], columns=["item_1", "item_2", "combo_qty"])
Out[6]:
item_1 item_2 combo_qty
0 apple banana 2
1 apple pear 1
2 banana pear 1
Here is another take that uses the behavior of join with a duplicated index:
df2 = df.set_index("purchase")
df2 = df2.join(df2, rsuffix="_other")\
.groupby(["item", "item_other"])\
.size().rename("count").reset_index()
result = df2[df2.item < df2.item_other].reset_index(drop=True)
# item item_other count
# 0 apple banana 2
# 1 apple pear 1
# 2 banana pear 1
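(The df2.item < df2.item_other filter drops the self-pairs produced by the self-join and keeps only one orientation of each mirrored pair, which is why each combination is counted once.)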
I get around 10x speedup over using builtin combinations in the following benchmark:
import numpy as np

num_orders = 200
max_order_size = 10
num_items = 50
purchases = np.repeat(np.arange(num_orders),
                      np.random.randint(1, max_order_size, num_orders))
items = np.random.randint(1, num_items, size=purchases.size)
test_df = pd.DataFrame({
    "purchase": purchases,
    "item": items,
})
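To reproduce the comparison, one way is to wrap each approach in a function (the function names here are illustrative) and time them on test_df, e.g. with %timeit in IPython:
from itertools import combinations

def combos_itertools(df):
    # enumerate all 2-item combinations within each purchase, then count
    return (df.groupby("purchase")["item"]
              .apply(lambda row: list(combinations(row, 2)))
              .explode().value_counts())

def combos_join(df):
    # self-join on purchase, count pairs, keep one orientation of each pair
    df2 = df.set_index("purchase")
    df2 = (df2.join(df2, rsuffix="_other")
              .groupby(["item", "item_other"])
              .size().rename("count").reset_index())
    return df2[df2.item < df2.item_other].reset_index(drop=True)

# %timeit combos_itertools(test_df)
# %timeit combos_join(test_df)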

Extract only if there is only one matching word from the list

fruit_type = ['Apple','Banana','Cherries','Dragonfruit']
df['fruit_type'] = df['sentence'].str.extract("(" + "|".join(fruit_type) + ")", expand=False)
Result of the code above is:
df
sentence | fruit_type
here is an apple | apple
here is a banana, an apple | banana
here is an orange, a banana | orange
How do I revise the code so that if there is more than one fruit type in df['sentence'], df['fruit_type'] will return NaN?
Instead of extract, you can use extractall combined with groupby and apply:
First, to get all matches:
df['sentence'].str.extractall("(" + "|".join(fruit_type) + ")")
              0
  match
0 0       apple
1 0      banana
  1       apple
2 0      banana
Note that the result has a pandas MultiIndex (the original row index plus a match number).
Then, using .groupby(level=0)[0].apply(list) you will get:
0 [apple]
1 [banana, apple]
2 [banana]
And finally, after using .apply(lambda x: x[0] if len(x) == 1 else np.nan):
0 apple
1 NaN
2 banana
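Putting the steps together (assuming the pattern needs re.IGNORECASE so the capitalized fruit_type entries match the lowercase sentences):
import re
import numpy as np

matches = df['sentence'].str.extractall("(" + "|".join(fruit_type) + ")", flags=re.IGNORECASE)
# rows with no match are absent from `matches`, so they become NaN on assignment
df['fruit_type'] = (matches.groupby(level=0)[0]
                           .apply(list)
                           .apply(lambda x: x[0] if len(x) == 1 else np.nan))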

combine columns containing empty strings into one column in python pandas

I have a dataframe like below.
df = pd.DataFrame({'apple': [1, 0, 1, 0],
                   'red grape': [1, 0, 0, 1],
                   'banana': [0, 1, 0, 1]})
I need to create another column that combines these columns, separated by ';', like below:
             fruits  apple  red grape  banana
0   apple;red grape      1          1       0
1            banana      0          0       1
2             apple      1          0       0
3  red grape;banana      0          1       1
What I did was convert the 1/0 values to the fruit name or an empty string, then concatenate the columns:
df['apple'] = df.apple.apply(lambda x: 'apple' if x==1 else '')
df['red grape'] = df['red grape'].apply(lambda x: 'red grape' if x==1 else '')
df['banana'] = df['banana'].apply(lambda x: 'banana' if x==1 else '')
df['fruits'] = df['apple']+';'+df['red grape']+';'+df['banana']
   apple  red grape  banana             fruits
0  apple  red grape           apple;red grape;
1                     banana           ;;banana
2  apple                               apple;;
3         red grape  banana  ;red grape;banana
The separators are all screwed up because of the empty strings. I also want the solution to be more general: I might have lots of such columns to combine and don't want to hardcode everything.
Does anyone know the best way to do this? Thanks a lot.
Use DataFrame.insert to add the new first column, with DataFrame.dot for matrix multiplication against the separator-suffixed column names, and finally remove the trailing separator from the right side with Series.str.rstrip:
df.insert(0, 'fruits', df.dot(df.columns + ';').str.rstrip(';'))
print (df)
             fruits  apple  red grape  banana
0   apple;red grape      1          1       0
1            banana      0          0       1
2             apple      1          0       0
3  red grape;banana      0          1       1
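To see why this works, consider the dot product on the original frame (before the fruits column is inserted): each 0/1 entry multiplies a semicolon-suffixed column name (string repetition), and summing across the row concatenates the pieces:
>>> df.dot(df.columns + ';')
0     apple;red grape;
1              banana;
2               apple;
3    red grape;banana;
dtype: object
str.rstrip(';') then removes the trailing separator.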

Python Pandas Data Frame Inserting Many Arbitrary Values

Let's say I have a data frame that looks like this:
        A
0   Apple
1  orange
2    pear
3   apple
For index values 4-1000, I want all of them to say "watermelon".
Any suggestions?
Reindex and fill NaNs (np.r_[:1000] is just the integer range 0..999):
import numpy as np

df.reindex(np.r_[:1000]).fillna('watermelon')
Or,
df = df.reindex(np.r_[:1000])
df.iloc[df['A'].last_valid_index() + 1:, 0] = 'watermelon'  # i.e. df.iloc[4:, 0] = 'watermelon'
              A
0         Apple
1        orange
2          pear
3         apple
4    watermelon
5    watermelon
...
999  watermelon
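Alternatively, reindex can fill the new rows directly via its fill_value argument, avoiding the NaN round-trip:
df = df.reindex(range(1000), fill_value='watermelon')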
