Check if a value exists using multiple conditions within group in pandas - python

Following is what my dataframe looks like. Expected_Output is my desired/target column.
Group Value1 Value2 Expected_Output
0 1 3 9 True
1 1 7 6 True
2 1 9 7 True
3 2 3 8 False
4 2 8 5 False
5 2 7 6 False
If any Value1 == 7 AND if any Value2 == 9 within a given Group, then I want to return True.
I tried to no avail:
df['Expected_Output']= df.groupby('Group').Value1.isin(7) & df.groupby('Group').Value2.isin(9)
N.B. The output can be either True/False or 1/0.

Group by the Group column, then use transform with a lambda function:
g = df.groupby('Group')
df['Expected'] = (g['Value1'].transform(lambda x: x.eq(7).any()))&(g['Value2'].transform(lambda x: x.eq(9).any()))
Or use groupby and apply, then merge the per-group result back with how='left':
df.merge(df.groupby('Group').apply(lambda x: x['Value1'].eq(7).any()&x['Value2'].eq(9).any()).reset_index(),how='left').rename(columns={0:'Expected_Output'})
Or use groupby and apply, then map the per-group result onto the Group column:
df['Expected_Output'] = df['Group'].map(df.groupby('Group').apply(lambda x: x['Value1'].eq(7).any()&x['Value2'].eq(9).any()))
print(df)
Group Value1 Value2 Expected_Output
0 1 3 9 True
1 1 7 6 True
2 1 9 7 True
3 2 3 8 False
4 2 8 5 False
5 2 7 6 False
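As a lambda-free variant of the transform approach above, the row-level comparisons can be done first and then reduced per group with the built-in "any" aggregation. This is a minimal sketch that rebuilds the question's dataframe; the helper names has_7 and has_9 are only illustrative.

```python
import pandas as pd

# Rebuild the dataframe from the question.
df = pd.DataFrame({
    "Group":  [1, 1, 1, 2, 2, 2],
    "Value1": [3, 7, 9, 3, 8, 7],
    "Value2": [9, 6, 7, 8, 5, 6],
})

# Test each row, then broadcast "any match within the group" back to every row.
has_7 = df["Value1"].eq(7).groupby(df["Group"]).transform("any")
has_9 = df["Value2"].eq(9).groupby(df["Group"]).transform("any")

# A group qualifies only if both conditions are met somewhere in it.
df["Expected_Output"] = has_7 & has_9
print(df)
```

Group 2 contains a 7 in Value1 but no 9 in Value2, so all of its rows come out False.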

You can create a dataframe of the expected result by group and then merge it back to the original dataframe.
expected = (
    df.groupby('Group')
    .apply(lambda x: x['Value1'].eq(7).any() & x['Value2'].eq(9).any())
    .to_frame('Expected_Output'))
>>> expected
Expected_Output
Group
1 True
2 False
>>> df.merge(expected, left_on='Group', right_index=True)
Group Value1 Value2 Expected_Output
0 1 3 9 True
1 1 7 6 True
2 1 9 7 True
3 2 3 8 False
4 2 8 5 False
5 2 7 6 False

Related

Pandas, create column using previous new column value

I am using Python and have the following Pandas Dataframe:
idx result grouping
1 False
2 True
3 True
4 False
5 True
6 True
7 True
8 False
9 True
10 True
11 True
12 True
What I would like is to do the following logic...
if the result is False then I want grouping to be the idx value.
if the result is True then I want the grouping to be the previous grouping value
So the end result will be:
idx result grouping
1 False 1
2 True 1
3 True 1
4 False 4
5 True 4
6 True 4
7 True 4
8 False 8
9 True 8
10 True 8
11 True 8
12 True 8
I have tried all sorts to get this working from using the Pandas shift() command to using lambda, but I am just not getting it.
I know I could iterate through the dataframe and perform the calculation but there has to be a better method.
examples of what I have tried and failed with are:
df['grouping'] = df['idx'] if not df['result'] else df['grouping'].shift(1)
df['grouping'] = df.apply(lambda x: x['idx'] if not x['result'] else x['grouping'].shift(1), axis=1)
Many Thanks for any assistance you can provide.
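Mask the True values, then forward fill. Masking turns idx into NaN wherever result is True, and the forward fill carries the last unmasked idx down; downcast='infer' restores the integer dtype afterwards (this argument is deprecated in recent pandas releases, where .astype(int) on the filled result is an alternative):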
df['grouping'] = df['idx'].mask(df['result']).ffill(downcast='infer')
idx result grouping
0 1 False 1
1 2 True 1
2 3 True 1
3 4 False 4
4 5 True 4
5 6 True 4
6 7 True 4
7 8 False 8
8 9 True 8
9 10 True 8
10 11 True 8
11 12 True 8
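The mask-then-forward-fill idea can be tried end to end with the sketch below. It rebuilds the question's data and, as an assumption on my part, replaces the downcast argument with an explicit integer cast, which relies on the first row having result False so that no NaN is left after the fill.

```python
import pandas as pd

# Rebuild the idx/result columns from the question.
df = pd.DataFrame({
    "idx": range(1, 13),
    "result": [False, True, True, False, True, True,
               True, False, True, True, True, True],
})

# Hide idx where result is True (those cells become NaN),
# then carry the last visible idx forward over the gaps.
filled = df["idx"].mask(df["result"]).ffill()

# The NaN step produces floats; cast back to int.
# Assumes the first row is False, so nothing is left unfilled.
df["grouping"] = filled.astype(int)
print(df)
```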

Python pandas pivot

I have a pandas dataframe like below.
id A B C
0 1 1 1 1
1 1 5 7 2
2 2 6 9 3
3 3 1 5 4
4 3 4 6 2
After evaluating conditions,
id A B C a_greater_than_b b_greater_than_c c_greater_than_a
0 1 1 1 1 False False False
1 1 5 7 2 False True False
2 2 6 9 3 False True False
3 3 1 5 4 False True True
4 3 4 6 2 False True False
After evaluating the conditions, I want to aggregate the results per id:
id a_greater_than_b b_greater_than_c c_greater_than_a
1 False False False
2 False True False
3 False True False
The logic is not fully clear, but you can combine pandas.get_dummies and aggregation per group (here I am assuming the min as your example showed that 1/1/0 -> 0 and 1/1/1 -> 1, but you can use other logics, e.g. last if you want to get the last row per group after sorting by date):
out = (pd
.get_dummies(df[['color', 'size']])
.groupby(df['id'])
.min()
)
print(out)
Output:
color_blue color_yellow size_l
id
A1 0 0 1
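The answer above works on color and size columns that do not appear in the question. For the question's own A/B/C data, the same "minimum per group" idea might look like the sketch below. It assumes the intended rule is that a flag is True for an id only when it is True on every row of that id, which is what the expected table suggests; the flags and out names are illustrative.

```python
import pandas as pd

# Rebuild the dataframe from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "A":  [1, 5, 6, 1, 4],
    "B":  [1, 7, 9, 5, 6],
    "C":  [1, 2, 3, 4, 2],
})

# Evaluate the three row-level conditions.
flags = pd.DataFrame({
    "a_greater_than_b": df["A"] > df["B"],
    "b_greater_than_c": df["B"] > df["C"],
    "c_greater_than_a": df["C"] > df["A"],
})

# Aggregate per id: all() on booleans is the same as taking the minimum.
out = flags.groupby(df["id"]).all()
print(out)
```

If the rule should instead be "True when any row is True", swapping all() for any() is the only change needed.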

Create mask to identify final two rows in groups in Pandas dataframe

I have a Pandas dataframe that includes a grouping variable. An example can be produced using:
df = pd.DataFrame({'grp':['a','a','b','b','b','c','d','d','d','d'],
'data':[4,5,3,6,7,8,9,8,7,3]})
...which looks like:
grp data
0 a 4
1 a 5
2 b 3
3 b 6
4 b 7
5 c 8
6 d 9
7 d 8
8 d 7
9 d 3
I can retrieve the last two rows of each group using:
dfgrp = df.groupby('grp').tail(2)
However, I would like to produce a mask that identifies the last two rows (or 1 row if only 1 exists), ideally producing an output that looks like:
0 True
1 True
2 False
3 True
4 True
5 True
6 False
7 False
8 True
9 True
I thought this would be relatively straight-forward but I haven't been able to find the solution. Suggestions would be greatly appreciated.
If your index is unique, you could do this by using isin.
import pandas as pd
df = pd.DataFrame({'grp':['a','a','b','b','b','c','d','d','d','d'],
'data':[4,5,3,6,7,8,9,8,7,3]})
df['mask'] = df.index.isin(df.groupby('grp').tail(2).index)
df
grp data mask
0 a 4 True
1 a 5 True
2 b 3 False
3 b 6 True
4 b 7 True
5 c 8 True
6 d 9 False
7 d 8 False
8 d 7 True
9 d 3 True
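If the index is not guaranteed to be unique, one possible alternative is to number the rows from the end of each group with cumcount and compare against 2. This is a sketch of that variant, not the approach from the answer above.

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd', 'd'],
                   'data': [4, 5, 3, 6, 7, 8, 9, 8, 7, 3]})

# cumcount(ascending=False) counts backwards within each group:
# the last row of a group gets 0, the one before it gets 1, and so on.
position_from_end = df.groupby('grp').cumcount(ascending=False)

# The last two rows of each group are those at positions 0 and 1.
df['mask'] = position_from_end < 2
print(df)
```

Because the comparison works on positions within each group rather than on index labels, it does not depend on the labels being unique.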

Pandas: Filter a data-frame, and assign values to top n number of rows

import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,2,5,6,7,1,8,9,2], 'city':[1,2,3,4,2,5,6,7,1,8,9,2]})
# The following code, creates a boolean filter,
filter = df.city==2
# Assigns True to all rows where filter is True
df.loc[filter,'selected']= True
What I need is a change to the code so that it assigns True to only a given number n of the matching rows.
The actual data frame has more than 3 million rows. Sometimes I would want
df.loc[filter,'selected']= True to apply to only 100 rows (the actual number could be more or less than 100).
I believe you need to filter by the values defined in a list with isin first, and then use GroupBy.head to take the top 2 rows per city (df1 here is the original dataframe):
cities= [2,3]
df = df1[df1.city.isin(cities)].groupby('city').head(2)
print (df)
col1 city
1 2 2
2 3 3
4 2 2
If you need to assign True in a new column:
cities= [2,3]
idx = df1[df1.city.isin(cities)].groupby('city').head(2).index
df1.loc[idx, 'selected'] = True
print (df1)
col1 city selected
0 1 1 NaN
1 2 2 True
2 3 3 True
3 4 4 NaN
4 2 2 True
5 5 5 NaN
6 6 6 NaN
7 7 7 NaN
8 1 1 NaN
9 8 8 NaN
10 9 9 NaN
11 2 2 NaN
Define a list of the elements to check and pass it to isin on the city column, creating a new column of True/False booleans:
>>> check
[2, 3]
>>> df['Citis'] = df.city.isin(check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
OR
>>> df['Citis'] = df['city'].apply(lambda x: x in check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
If you only need the starting rows (say, the first 5 values), take head on the mask. Note that the assignment aligns on the index, so rows after the fifth are left as NaN:
df['Citis'] = df.city.isin(check).head(5)
OR
df['Citis'] = df['city'].apply(lambda x: x in check).head(5)
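To mark only the first n rows that satisfy the filter, as the question asks, one option is to take the index labels of the first n matching rows and assign through .loc. This is a minimal sketch; n = 2 is a hypothetical cap chosen for the small example, and the default of False for the other rows is my own assumption.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 2, 5, 6, 7, 1, 8, 9, 2],
                   'city': [1, 2, 3, 4, 2, 5, 6, 7, 1, 8, 9, 2]})

n = 2  # hypothetical cap on how many matching rows to mark

# Boolean filter from the question (renamed to avoid shadowing the built-in filter).
matches = df['city'].eq(2)

# Index labels of the first n matching rows, in row order.
first_n = df[matches].head(n).index

# Start from False everywhere, then flip only the selected rows.
df['selected'] = False
df.loc[first_n, 'selected'] = True
print(df)
```

Rows 1, 4 and 11 match the filter, and only rows 1 and 4 are marked. If fewer than n rows match, head(n) simply returns all of them.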

How to use .loc to set as other column values in pandas

For example, I have a dataframe:
cond value1 value2
0 True 1 1
1 False 3 5
2 True 34 2
3 True 23 23
4 False 4 2
I want to replace value1 with value2*2 when cond is True, so the result should be:
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2
I can achieve it with the following code:
def convert(x):
    if x.cond:
        x.value1 = x.value2*2
    return x
data = data.apply(lambda x: convert(x),axis=1)
This is very slow when the data is big. I tried using .loc, but I don't know how to set the values.
How can I achieve this with .loc or another simple approach? Thanks in advance.
Create a boolean mask and multiply only the filtered rows:
mask = df.cond
df.loc[mask, 'value1'] = df.loc[mask, 'value2'] * 2
print (df)
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2
You can use where/mask:
df.value1 = df.value1.mask(df.cond, df.value2*2)
# Or,
# df.value1 = df.value1.where(~df.cond, df.value2*2)
print(df)
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2
Using np.where:
df['value1'] = np.where(df.cond,df.value2*2,df.value1)
print(df)
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2
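The three approaches above (.loc, mask and np.where) can be compared side by side with the sketch below. It rebuilds the question's dataframe and computes the new value1 column each way without modifying the original; the by_loc, by_mask and by_where names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cond":   [True, False, True, True, False],
    "value1": [1, 3, 34, 23, 4],
    "value2": [1, 5, 2, 23, 2],
})
mask = df["cond"]

# 1) .loc: assign only on the rows where the mask is True (on a copy).
by_loc = df.copy()
by_loc.loc[mask, "value1"] = by_loc.loc[mask, "value2"] * 2

# 2) Series.mask: replace value1 where the condition holds.
by_mask = df["value1"].mask(mask, df["value2"] * 2)

# 3) np.where: pick value2*2 where cond is True, otherwise keep value1.
by_where = np.where(mask, df["value2"] * 2, df["value1"])

print(by_loc["value1"].tolist())
print(by_mask.tolist())
print(by_where.tolist())
```

All three are vectorized, so they avoid the row-by-row apply from the question.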
