filter df using key words - python

I have a df:
item_name     price  stock
red apple     2      2
green apple   4      1
green grape   4      3
yellow apple  1      2
purple grape  4      1
I have another df:
Key Word   Min_stock
red;grape  2
The result I would like to get is:
item_name    price  stock
red apple    2      2
green grape  4      3
I would like to filter the first df based on the second df: for the keyword part, I would like to select item_name values that contain either keyword in the Key Word column.
Is there any way to achieve it?

Assuming df1 and df2 are the DataFrames, you can compute a regex from the split and exploded df2, then extract and map the min values and filter with boolean indexing:
s = (df2.assign(kw=df2['Key Word'].str.split(';'))
        .explode('kw')
        .set_index('kw')['Min_stock']
     )
# red       2
# grape     2
# blue     10
# apple    10
regex = '|'.join(s.index)
# 'red|grape|blue|apple'
mask = df1['item_name'].str.extract(f'({regex})', expand=False).map(s)
# 0     2
# 1    10
# 2     2
# 3    10
# 4     2
out = df1[mask.notna() & df1['stock'].ge(mask)]
output:
   item_name    price  stock
0  red apple    2      2
2  green grape  4      3
NB. for generalization, I used a different df2 as input:
Key Word    Min_stock
red;grape   2
blue;apple  10
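For reference, here is the whole thing assembled into a runnable snippet (a sketch using the generalized df2 from the NB above):
import pandas as pd

df1 = pd.DataFrame({'item_name': ['red apple', 'green apple', 'green grape',
                                  'yellow apple', 'purple grape'],
                    'price': [2, 4, 4, 1, 4],
                    'stock': [2, 1, 3, 2, 1]})
df2 = pd.DataFrame({'Key Word': ['red;grape', 'blue;apple'],
                    'Min_stock': [2, 10]})

# keyword -> minimum stock lookup
s = (df2.assign(kw=df2['Key Word'].str.split(';'))
        .explode('kw')
        .set_index('kw')['Min_stock'])

# first keyword found in each item_name, mapped to its minimum stock
regex = '|'.join(s.index)
mask = df1['item_name'].str.extract(f'({regex})', expand=False).map(s)

# keep rows that matched a keyword and meet the minimum stock
out = df1[mask.notna() & df1['stock'].ge(mask)]
print(out)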

Related

Python Pandas: create variable for unique combinations of 2 categorical variables?

Say I have some data:
df = pd.DataFrame({'location': ['store', 'online', 'store', 'online', 'online'],
                   'item': ['apple', 'apple', 'orange', 'orange', 'orange']})
df
>>>
  location    item
0    store   apple
1   online   apple
2    store  orange
3   online  orange
4   online  orange
As you'll note there are four possible variable combinations: (store,apple), (online,apple), (store,orange), (online,orange). I'd like to assign a dummy variable column. My naive approach creates four dummy variables, whereas I want a single label column:
pd.get_dummies(df,['location','item'])
>>>
   location_online  location_store  item_apple  item_orange
0                0               1           1            0
1                1               0           1            0
2                0               1           0            1
3                1               0           0            1
4                1               0           0            1
Whereas I'd prefer it to look like:
df
>>>
location item combination dummy
0 store apple (store, apple) 0
1 online apple (online, apple) 1
2 store orange (store, orange) 2
3 online orange (online, orange) 3
4 online orange (online, orange) 3
Note, the dummy only equals the index because there are only 4 rows. This obviously would not be universally true.
Edit1: Above edited in response to comment.
Edit2: I've added a 5th row to illustrate that a row can be repeated, however, it should have the same dummy/combination as its duplicate.
Let's try agg:
df['combination'] = df[['location','item']].agg(tuple, axis=1)
df['dummy'] = df['combination'].factorize()[0]
Output:
  location    item      combination  dummy
0    store   apple    (store, apple)      0
1   online   apple   (online, apple)      1
2    store  orange   (store, orange)      2
3   online  orange  (online, orange)      3
4   online  orange  (online, orange)      3
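If factorize is unfamiliar: it returns a tuple of (integer codes, unique values), and the [0] keeps just the codes. A quick illustration:
import pandas as pd

codes, uniques = pd.Series(['a', 'b', 'a', 'c']).factorize()
print(codes)    # [0 1 0 2]
print(uniques)  # Index(['a', 'b', 'c'], dtype='object')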
If you don't care about combination, you can use groupby.ngroup():
df['dummy'] = df.groupby(['location','item'], sort=False).ngroup()
Output:
  location    item  dummy
0    store   apple      0
1   online   apple      1
2    store  orange      2
3   online  orange      3
4   online  orange      3
Let's create combinations by concatenating location and item, then use factorize to encode these combinations to get dummy variables:
df['combination'] = df['location'].add(', ' + df['item'])
df['dummy'] = df['combination'].factorize()[0]
  location    item     combination  dummy
0    store   apple    store, apple      0
1   online   apple   online, apple      1
2    store  orange   store, orange      2
3   online  orange  online, orange      3
4   online  orange  online, orange      3
You can apply a lambda function on the first two columns; see below. d is a dictionary with the dummy value per pair.
d = {('store', 'apple'): 0, ('online', 'apple'): 1,
     ('store', 'orange'): 2, ('online', 'orange'): 3}

def f(x, y):
    return d[(x, y)]

df['dummy'] = df[['location', 'item']].apply(lambda x: f(*x), axis=1)
>>> print(df)
  location    item  dummy
0    store   apple      0
1   online   apple      1
2    store  orange      2
3   online  orange      3
4   online  orange      3
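One caveat with the dictionary approach: a (location, item) pair missing from d raises a KeyError. Using dict.get with a default avoids that (a sketch; -1 is an arbitrary sentinel for unseen pairs):
df['dummy'] = df[['location', 'item']].apply(lambda x: d.get(tuple(x), -1), axis=1)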

python: merge two database based on matching multiple columns in both the datasets and apply a script on the result

I have two databases with multiple columns: dataset 1 (df1) has more than a couple of thousand rows, and dataset 2 (df2) is smaller, about 300 rows.
I need to pick up a 'value' from column 3 in df2 based on matching 'fruit' in df1 with 'type' in df2 and 'expiry' in df1 with 'expiry' in df2.
Furthermore, instead of storing the 'value' directly in a new column in df1, I need to perform a multiplication on the value in each row and store the output in a new column in df1.
So, for example, if expiry is 2 the value gets multiplied by 2, and if it's 3 the value gets multiplied by 3, and so on and so forth!
I was able to solve this by using the code below, but:
for i in range(len(df1)):
    df1_value = df2.loc[(df2['type'] == df1.iloc[i]['fruit'])
                        & (df2['expiry'] == str(df1.iloc[i]['expiry']))].iloc[0]['value']
    df1.loc[i, 'df_value'] = df1.iloc[i]['expiry'] * df1_value
This creates two issues:
1. If an iteration finds no match (for example, there is no 'value' for banana with an expiry of 3 in df2), the process stops and gives me an error: IndexError: single positional indexer is out-of-bounds.
2. Because df1 has a very large number of rows, the individual iterations take a lot of time.
Is there a better way to handle this?
say df1:
fruit      expiry  category
apple      3       a
apple      3       b
apple      4       c
apple      4       d
orange     2       a
orange     2       b
orange     3       c
orange     3       d
orange     3       e
banana     3       a
banana     3       b
banana     3       c
banana     4       d
pineapple  2       a
pineapple  3       b
pineapple  3       c
pineapple  4       d
pineapple  4       e
df2:
type       expiry  value
apple      2       100
apple      3       110
apple      4       120
orange     2       200
orange     3       210
orange     4       220
banana     2       310
banana     4       320
pineapple  2       410
pineapple  3       420
pineapple  4       430
output (revised df1):
fruit      expiry  category  df_value
apple      3       a         110*3=330
apple      3       b         110*3=330
apple      4       c         120*4=480
apple      4       d         120*4=480
orange     2       a         200...
orange     2       b         200...
orange     3       c         210...
orange     3       d         210...
orange     3       e         210...
banana     3       a         0
banana     3       b         0
banana     3       c         0
banana     4       d         320*4=1280
pineapple  2       a         410*2=820
pineapple  3       b         420...
pineapple  3       c         420...
pineapple  4       d         430...
pineapple  4       e         430...
As far as I know, you can only do this by using SQL within Python. SQL is used for relating different databases that have at least one relatable column (if you've used Power BI or Tableau you know what I mean) and for querying multiple dataframes through their mutual relationships. I do not know this language, so I cannot help you further than this.
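That said, pandas' own merge covers the same join semantics without leaving Python. A minimal sketch, assuming df1 and df2 as posted:
# align the key names, then left-join so every df1 row is kept
lookup = df2.rename(columns={'type': 'fruit'})
# if the expiry dtypes differ (the str() cast in the loop suggests they might),
# align them first, e.g. lookup['expiry'] = lookup['expiry'].astype(df1['expiry'].dtype)
out = df1.merge(lookup, on=['fruit', 'expiry'], how='left')

# multiply the looked-up value by expiry; unmatched rows (NaN) become 0
out['df_value'] = (out['value'] * out['expiry']).fillna(0).astype(int)
out = out.drop(columns='value')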

combine columns containing empty strings into one column in python pandas

I have a dataframe like below.
df = pd.DataFrame({'apple': [1, 0, 1, 0],
                   'red grape': [1, 0, 0, 1],
                   'banana': [0, 1, 0, 1]})
I need to create another column that combines these columns, separated by ';', like below:
   fruits            apple  red grape  banana
0  apple;red grape   1      1          0
1  banana            0      0          1
2  apple             1      0          0
3  red grape;banana  0      1          1
What I did was convert 1/0 to the column name/an empty string, then concatenate the columns:
df['apple'] = df.apple.apply(lambda x: 'apple' if x==1 else '')
df['red grape'] = df['red grape'].apply(lambda x: 'red grape' if x==1 else '')
df['banana'] = df['banana'].apply(lambda x: 'banana' if x==1 else '')
df['fruits'] = df['apple']+';'+df['red grape']+';'+df['banana']
   apple  red grape  banana  fruits
0  apple  red grape          apple;red grape;
1                    banana  ;;banana
2  apple                     apple;;
3         red grape  banana  ;red grape;banana
The separators are all screwed up because of the empty strings. I also want the solution to be more general; for example, I might have lots of such columns to combine, and I do not want to hardcode everything.
Does anyone know the best way to do this? Thanks a lot.
Use DataFrame.insert to add the result as the first column, DataFrame.dot for matrix multiplication with the separator, and finally remove the trailing separator from the right side with Series.str.rstrip:
df.insert(0, 'fruits', df.dot(df.columns + ';').str.rstrip(';'))
print (df)
   fruits            apple  red grape  banana
0  apple;red grape   1      1          0
1  banana            0      0          1
2  apple             1      0          0
3  red grape;banana  0      1          1
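The dot trick works because multiplying a string by 1 keeps it and multiplying by 0 yields an empty string, so the matrix product concatenates exactly the column names flagged in each row. If the indicator columns might not be exact 0/1 integers, normalizing them first is a safe variant (a sketch on the original 0/1 frame, before the insert):
indicators = df[['apple', 'red grape', 'banana']].ne(0).astype(int)
fruits = indicators.dot(indicators.columns + ';').str.rstrip(';')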

Python Pandas Data Frame Inserting Many Arbitrary Values

Let's say I have a data frame that looks like this:
        A
0   Apple
1  orange
2    pear
3   apple
For index values 4-1000, I want all of them to say "watermelon".
Any suggestions?
Reindex and fill NaNs:
import numpy as np

df.reindex(np.r_[:1000]).fillna('watermelon')
Or,
df = df.reindex(np.r_[:1000])
df.iloc[df['A'].last_valid_index() + 1:, 0] = 'watermelon' # df.iloc[4:, 0] = "..."
              A
0         Apple
1        orange
2          pear
3         apple
4    watermelon
5    watermelon
...
999  watermelon
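Note that np.r_[:1000] simply builds the integer positions 0 through 999; a plain range works just as well as the new index:
df.reindex(range(1000)).fillna('watermelon')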

How to return a dataframe of number of duplicates based on filtering the two columns in python

ID  group  categories
1   0      red
2   1      blue
3   1      green
4   1      green
1   0      blue
1   0      blue
2   1      red
3   0      red
4   0      red
4   1      red
Hi, I am new to Python. I am trying to get the count of duplicates in the ID column based on multiple conditions on the other two columns. So I am filtering for 'red' and group 0, and then I want the IDs that repeat more than once.
df1 = df[(df['categories'] == 'red') & (df['group'] == 0)]
df1['ID'].value_counts()[df1['ID'].value_counts() > 1]
There are almost 10 categories in the categories column, so I was wondering if there is any way to write a function or for loop instead of repeating the same steps. The final goal is to see how many duplicate IDs there are in each group given the category is 'red'/'blue'/'green'. Thanks in advance.
P.S.: the group values don't change; it is a binomial variable.
output
ID  count
1   3
2   2
3   2
4   3
I think you can use groupby with SeriesGroupBy.value_counts:
s = df.groupby(['ID','group'])['categories'].value_counts()
print (s)
ID  group  categories
1   0      blue          2
           red           1
2   1      blue          1
           red           1
3   0      red           1
    1      green         1
4   0      red           1
    1      green         1
           red           1
Name: categories, dtype: int64
out = s[s > 1].reset_index(name='count')
print (out)
   ID  group categories  count
0   1      0       blue      2
Another solution is to get the duplicates first by filtering with duplicated, and then count:
df = df[df.duplicated(['ID','group','categories'], keep=False)]
print (df)
   ID  group categories
4   1      0       blue
5   1      0       blue
df1 = df.groupby(['ID','group'])['categories'].value_counts().reset_index(name='count')
print (df1)
   ID  group categories  count
0   1      0       blue      2
EDIT: For count categories (all rows) per ID use GroupBy.size:
df1 = df.groupby('ID').size().reset_index(name='count')
print (df1)
   ID  count
0   1      3
1   2      2
2   3      2
3   4      3
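To cover every category without repeating the filter by hand, one groupby pass over all three columns does it (a sketch; 'duplicate' here means an (ID, group, categories) combination occurring more than once):
counts = (df.groupby(['categories', 'group', 'ID'])
            .size()
            .reset_index(name='count'))
print(counts[counts['count'] > 1])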
