I have some data that looks like this:
import pandas as pd
fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1':['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash']})
print(df.head())
col1
0 i want an apple
1 i hate pears
2 please buy a peach and an apple
3 I want squash
I need a solution that creates a column for each item in fruits and gives a 1 or 0 value indicating whether or not col1 contains that value. Ideally, the output will look like this:
goal_df = pd.DataFrame({'col1':['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash'],
'apple': [1, 0, 1, 0],
'pear': [0, 1, 0, 0],
'peach': [0, 0, 1, 0]})
print(goal_df.head())
col1 apple pear peach
0 i want an apple 1 0 0
1 i hate pears 0 1 0
2 please buy a peach and an apple 1 0 1
3 I want squash 0 0 0
I tried this but it did not work:
for i in fruits:
    if df['col1'].str.contains(i):
        df[i] = 1
    else:
        df[i] = 0
items = ['apple', 'pear', 'peach']
for it in items:
    df[it] = df['col1'].str.contains(it, case=False).astype(int)
Output:
>>> df
col1 apple pear peach
0 i want an apple 1 0 0
1 i hate pears 0 1 0
2 please buy a peach and an apple 1 0 1
3 I want squash 0 0 0
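For context on why the attempt in the question fails: str.contains returns one boolean per row, and Python cannot reduce a whole Series to a single True/False inside an if statement. The sketch below reproduces the error and then applies the loop from this answer; the variable names mask and error_message are only illustrative.

```python
import pandas as pd

fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash']})

# str.contains yields a boolean Series (one value per row), not a single bool
mask = df['col1'].str.contains('apple')

error_message = None
try:
    if mask:  # this is effectively what the loop in the question did
        pass
except ValueError as err:
    # pandas refuses to guess whether "any" or "all" rows was meant
    error_message = str(err)

# Converting the whole mask to integers sidesteps the per-row if/else entirely
for fruit in fruits:
    df[fruit] = df['col1'].str.contains(fruit, case=False).astype(int)
```

The if/else is not needed at all, because the boolean mask already holds the per-row answer.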
Use str.extractall to extract the words, then pd.crosstab:
pattern = f"({'|'.join(fruits)})"
s = df['col1'].str.extractall(pattern)
df[fruits] = (pd.crosstab(s.index.get_level_values(0), s[0].values)
                .reindex(index=df.index, columns=fruits, fill_value=0)
             )
Output:
col1 apple pear peach
0 i want an apple 1 0 0
1 i hate pears 0 1 0
2 please buy a peach and an apple 1 0 1
3 I want squash 0 0 0
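One thing to keep in mind with this approach: pd.crosstab counts matches, so a sentence that mentions the same fruit twice would get a 2 rather than a 1. A minimal sketch, using made-up sentences that are not from the question, of capping the counts with clip:

```python
import pandas as pd

fruits = ['apple', 'pear', 'peach']
# Hypothetical sentences chosen so that one row mentions "apple" twice
df = pd.DataFrame({'col1': ['apple pie with apple sauce', 'a pear', 'nothing here']})

pattern = f"({'|'.join(fruits)})"
s = df['col1'].str.extractall(pattern)

# One row per original sentence, one column per fruit, values are match counts
counts = (pd.crosstab(s.index.get_level_values(0), s[0].values)
            .reindex(index=df.index, columns=fruits, fill_value=0))

# Cap every count at 1 to get 0/1 indicator values
flags = counts.clip(upper=1)
```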
You can use the function below for the apple column and do the same for the other fruits:
def has_apple(st):
    if "apple" in st.lower():
        return 1
    return 0
df['apple'] = df['col1'].apply(has_apple)
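To avoid writing one function per fruit, the same idea can be parameterized. This is only a sketch: has_word is a hypothetical helper name, and it does the same case-insensitive substring test as has_apple.

```python
import pandas as pd

fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash']})

def has_word(text, word):
    """Return 1 if word occurs in text (case-insensitive substring test), else 0."""
    return int(word in text.lower())

for fruit in fruits:
    # Series.apply forwards extra keyword arguments to the function
    df[fruit] = df['col1'].apply(has_word, word=fruit)
```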
Try:
1. Get all matching fruits using str.extractall
2. Use pd.get_dummies to get indicator values
3. Join to the original DataFrame
matches = pd.get_dummies(df["col1"].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1, 0))
output = df.join(matches.groupby(level=0).sum()).fillna(0)
>>> output
col1 apple peach pear
0 i want an apple 1.0 0.0 0.0
1 i hate pears 0.0 0.0 1.0
2 please buy a peach and an apple 1.0 1.0 0.0
3 I want squash 0.0 0.0 0.0
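I thought of another, completely different one-liner. str.get_dummies returns its columns in alphabetical order, so reindex them to match items before assigning:
df[items] = df['col1'].str.findall('|'.join(items)).str.join('|').str.get_dummies('|').reindex(columns=items, fill_value=0)
Output:
>>> df
col1 apple pear peach
0 i want an apple 1 0 0
1 i hate pears 0 1 0
2 please buy a peach and an apple 1 0 1
3 I want squash 0 0 0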
Try using np.where from the numpy library:
import numpy as np
fruit = ['apple', 'pear', 'peach']
for i in fruit:
    df[i] = np.where(df.col1.str.contains(i), 1, 0)
How can I create create_col? For each row, I want to find the previous row where the same fruit was mentioned and check whether its wanted column was yes.
   wanted  fruit  create_col
0  yes     apple
1          pear
2          pear               < last time pear was mentioned, wanted was not yes, so blank
3          apple  True        < last time apple was mentioned, wanted was yes, so True
df
###
   wanted  fruit
0  yes     apple
1          pear
2  yes     pear
3          apple
4          mango
5          pear
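import numpy as np
# Running concatenation of the fruits that were marked wanted so far
df['cum_list'] = df[df['wanted'].eq('yes')]['fruit'].cumsum()
# Shift by one row so each row only sees earlier rows, then carry the value forward
df['cum_list'] = df['cum_list'].shift(1).ffill()
df.fillna('', inplace=True)
# Substring test: does this row's fruit appear among the previously wanted fruits?
df['create_col'] = np.where(df.apply(lambda x: x['fruit'] in x['cum_list'], axis=1), True, '')
df.drop(columns=['cum_list'], inplace=True)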
df
###
   wanted  fruit  create_col
0  yes     apple
1          pear
2  yes     pear
3          apple  True
4          mango
5          pear   True
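An alternative sketch that follows the wording of the question more literally (look at the previous mention of the same fruit) is to shift wanted within each fruit group. It assumes the blank cells are empty strings, and it produces a boolean column rather than True/blank.

```python
import pandas as pd

# Assumption: blank "wanted" cells are empty strings
df = pd.DataFrame({'wanted': ['yes', '', 'yes', '', '', ''],
                   'fruit': ['apple', 'pear', 'pear', 'apple', 'mango', 'pear']})

# Within each fruit, shift() exposes the "wanted" value of the previous mention;
# the first mention of a fruit has no predecessor and becomes NaN
prev_wanted = df.groupby('fruit')['wanted'].shift()

# NaN compares as not equal, so first mentions come out False
df['create_col'] = prev_wanted.eq('yes')
```

Unlike the substring test on a concatenated string, this cannot confuse one fruit name with another that contains it.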
I have a dataset that looks as follows:
df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})
df
purchase item
0 1 apple
1 1 banana
2 2 apple
3 2 banana
4 2 pear
5 3 apple
And I need an output such as
item_1  item_2  purchase
apple   banana  2
banana  pear    1
apple   pear    1
The table should count how many times each combination of two fruits was bought in the same purchase.
For example, the first row is apple, banana, 2 because there are two purchases (purchase ID 1 and purchase ID 2) that contain both apple and banana. The apple, pear row has 1 because only one purchase (purchase ID 2) contains both apple and pear.
My code so far:
df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})
dummies = pd.get_dummies(df['item'])
df2 = pd.concat([df['purchase'], dummies], axis=1)
Creates a table like this:
purchase apple banana pear
0 1 1 0 0
1 1 0 1 0
2 2 1 0 0
3 2 0 1 0
4 2 0 0 1
5 3 1 0 0
Now, I don't know how to proceed to get the wanted result (and I'm aware my output is far from the wanted one). I tried some group by's but it didn't work.
This is probably not the most efficient, but it seems to get the job done:
In [3]: from itertools import combinations
In [4]: combos = df.groupby("purchase")["item"].apply(lambda row: list(combinations(row, 2))).explode().value_counts()
In [5]: combos.reset_index()
Out[5]:
index item
0 (apple, banana) 2
1 (apple, pear) 1
2 (banana, pear) 1
From there,
In [6]: pd.DataFrame([[*x, y] for x, y in zip(combos.index, combos)], columns=["item_1", "item_2", "combo_qty"])
Out[6]:
item_1 item_2 combo_qty
0 apple banana 2
1 apple pear 1
2 banana pear 1
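Note that combinations preserves the order in which items appear, so (banana, apple) and (apple, banana) would be counted as different pairs if the rows were ordered differently. A sketch of one way to make the count order-independent, sorting and de-duplicating the items of each purchase first; the reordered sample data and the name pair_counts are illustrative only.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Same purchases as in the question, but purchase 1 lists banana before apple
df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['banana', 'apple', 'apple', 'banana', 'pear', 'apple']})

pair_counts = Counter()
for _, items in df.groupby('purchase')['item']:
    # sorted(set(...)) gives each pair one canonical (alphabetical) orientation
    pair_counts.update(combinations(sorted(set(items)), 2))

result = pd.DataFrame([(a, b, n) for (a, b), n in sorted(pair_counts.items())],
                      columns=['item_1', 'item_2', 'combo_qty'])
```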
Here is another take that uses the behavior of join with duplicated index:
df2 = df.set_index("purchase")
df2 = df2.join(df2, rsuffix="_other")\
    .groupby(["item", "item_other"])\
    .size().rename("count").reset_index()
result = df2[df2.item < df2.item_other].reset_index(drop=True)
# item item_other count
# 0 apple banana 2
# 1 apple pear 1
# 2 banana pear 1
I get around 10x speedup over using builtin combinations in the following benchmark:
import numpy as np
num_orders = 200
max_order_size = 10
num_items = 50
purchases = np.repeat(np.arange(num_orders),
                      np.random.randint(1, max_order_size, num_orders))
items = np.random.randint(1, num_items, size=purchases.size)
test_df = pd.DataFrame({
    "purchase": purchases,
    "item": items,
})
fruit_type = ['apple', 'banana', 'cherries', 'dragonfruit']
# str.extract is vectorized, so no loop over the rows is needed
df['fruit_type'] = df['sentence'].str.extract("(" + "|".join(fruit_type) + ")", expand=False)
Result of the code above is:
df
sentence | fruit_type
here is an apple | apple
here is a banana, an apple | banana
here is an orange, a banana | banana
How do I revise the code so that df['fruit_type'] is NaN when df['sentence'] contains more than one fruit type?
Instead of extract you can use extractall combined with groupby and apply:
First, to get all matches:
df['sentence'].str.extractall("(" + "|".join(fruit_type) +")")
              0
   match
0  0      apple
1  0     banana
   1      apple
2  0     banana
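Note that the result is indexed by a pandas.MultiIndex: the first level is the original row and the second (match) numbers the matches within that row.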
Then, using .groupby(level=0)[0].apply(list) you will get:
0 [apple]
1 [banana, apple]
2 [banana]
And finally, after using .apply(lambda x: x[0] if len(x) == 1 else np.nan):
0 apple
1 NaN
2 banana
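Putting the three steps together, a minimal end-to-end sketch might look like the following. The lowercase fruit list and the fourth sentence (one with no fruit at all) are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: lowercase names, so they match the lowercase sentences
fruit_type = ['apple', 'banana', 'cherries', 'dragonfruit']
df = pd.DataFrame({'sentence': ['here is an apple',
                                'here is a banana, an apple',
                                'here is an orange, a banana',
                                'nothing to see']})  # hypothetical row with no match

pattern = "(" + "|".join(fruit_type) + ")"

# Step 1: every match, indexed by (original row, match number)
matches = df['sentence'].str.extractall(pattern)

# Step 2: collect the matches of each original row into a list
per_row = matches.groupby(level=0)[0].apply(list)

# Step 3: keep the fruit only when exactly one was found
single = per_row.apply(lambda x: x[0] if len(x) == 1 else np.nan)

# Assignment aligns on the index, so rows without any match also become NaN
df['fruit_type'] = single
```

A sentence that repeats the same fruit twice would also produce two matches and therefore NaN under this sketch.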
I have a dataframe like below.
df=pd.DataFrame({'apple': [1,0,1,0],
'red grape': [1,0,0,1],
'banana': [0,1,0,1]})
I need to create another column that combines these columns, with the names separated by ';', like below:
fruits apple red grape banana
0 apple;red grape 1 1 0
1 banana 0 0 1
2 apple 1 0 0
3 red grape;banana 0 1 1
What I did was convert 1/0 to the column name/empty string, then concatenate the columns:
df['apple'] = df.apple.apply(lambda x: 'apple' if x==1 else '')
df['red grape'] = df['red grape'].apply(lambda x: 'red grape' if x==1 else '')
df['banana'] = df['banana'].apply(lambda x: 'banana' if x==1 else '')
df['fruits'] = df['apple']+';'+df['red grape']+';'+df['banana']
   apple  red grape  banana             fruits
0  apple  red grape           apple;red grape;
1                    banana           ;;banana
2  apple                               apple;;
3         red grape  banana  ;red grape;banana
The separators are all wrong because of the empty strings. I also want the solution to be more general: I might have lots of such columns to combine and do not want to hardcode everything.
Does anyone know the best way to do this? Thanks a lot.
Use DataFrame.insert to add the new first column, DataFrame.dot for matrix multiplication of the 0/1 values with the column names plus separator, and Series.str.rstrip to remove the trailing separator:
df.insert(0, 'fruits', df.dot(df.columns + ';').str.rstrip(';'))
print (df)
fruits apple red grape banana
0 apple;red grape 1 1 0
1 banana 0 0 1
2 apple 1 0 0
3 red grape;banana 0 1 1
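If the dot trick feels too implicit, a more explicit (and likely slower) sketch is to join, row by row, the names of the columns whose value is 1. The names flag_cols and combined are illustrative; flag_cols could be any list of 0/1 columns.

```python
import pandas as pd

df = pd.DataFrame({'apple': [1, 0, 1, 0],
                   'red grape': [1, 0, 0, 1],
                   'banana': [0, 1, 0, 1]})

# Columns to combine; nothing is hardcoded beyond this list
flag_cols = list(df.columns)

# For each row, keep the names of the columns set to 1 and join them with ';'
combined = df[flag_cols].apply(
    lambda row: ';'.join(col for col in flag_cols if row[col] == 1),
    axis=1,
)

df.insert(0, 'fruits', combined)
```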
Let's say I have a data frame that looks like this:
A
0 Apple
1 orange
2 pear
3 apple
For index values 4-1000, I want all of them to say "watermelon".
Any suggestions?
Reindex and fill NaNs:
df.reindex(np.r_[:1000]).fillna('watermelon')
Or,
df = df.reindex(np.r_[:1000])
df.iloc[df['A'].last_valid_index() + 1:, 0] = 'watermelon' # df.iloc[4:, 0] = "..."
A
0 Apple
1 orange
2 pear
3 apple
4 watermelon
5 watermelon
...
999 watermelon
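A slightly shorter variant, sketched here with a plain range instead of np.r_, is to pass fill_value to reindex so that the new rows are filled in one step. Only the newly created rows receive the fill value; existing values are left untouched.

```python
import pandas as pd

df = pd.DataFrame({'A': ['Apple', 'orange', 'pear', 'apple']})

# Rows 0-3 keep their values; the new labels 4..999 get the fill value
out = df.reindex(range(1000), fill_value='watermelon')
```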