Finding rows which match both conditions - python

Say I have a dataframe:
Do  Re     Mi   Fa  So
 1   0    Foo  100  50
 1   1    Bar   75  20
 0   0   True   59  59
 1   1  False    0  12
How would I go about finding all rows where the value in columns "Do" and "Re" BOTH equal 1 AND "Fa" is higher than "So"?
I've tried a couple of ways, but the first returns an error complaining of ambiguity:
df['both_1']=((df['Do'] > df['Re']) & (df['Mi'] == df['Fa'] == 1))
I also tried breaking it down into steps, but I realised the last step will result in me bringing in both True and False statements. I only want True.
df['Do_1'] = df['Do'] == 1
df['Re_1'] = df['Re'] == 1
# This is where I realised I'm bringing in the False rows too
df['both_1'] = (df['Do_1'] == df['Re_1'])

Chain the masks together with & for bitwise AND:
df['both_1']= (df['Fa'] > df['So']) & (df['Do'] == 1) & (df['Re'] == 1)
print (df)
   Do  Re     Mi   Fa  So  both_1
0   1   0    Foo  100  50   False
1   1   1    Bar   75  20    True
2   0   0   True   59  59   False
3   1   1  False    0  12   False
Or, if there are multiple columns, compare the subset df[['Do', 'Re']] and test whether all values per row are True with DataFrame.all:
df['both_1']= (df['Fa'] > df['So']) & (df[['Do', 'Re']] == 1).all(axis=1)
If you need to filter, use boolean indexing:
df1 = df[(df['Fa'] > df['So']) & (df['Do'] == 1) & (df['Re'] == 1)]
Or, with the second solution:
df1 = df[(df['Fa'] > df['So']) & (df[['Do', 'Re']] == 1).all(axis=1)]

You can combine the three conditions using & (bitwise and):
df[(df['Do'] == 1) & (df['Re'] == 1) & (df['Fa'] > df['So'])]
Output (for your sample data):
   Do  Re   Mi  Fa  So
1   1   1  Bar  75  20
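The same filter can also be written with DataFrame.query, which some find easier to read; a minimal sketch assuming the same column names as above (not part of the original answer):
df.query("Do == 1 and Re == 1 and Fa > So")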

Try:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Do':[1,1,0,1], 'Re':[0,1,0,1], 'Mi':['Foo','Bar',True,False], 'Fa':[100,75,59,0], 'So':[50,20,59,12]})
df.loc[np.where((df.Do == 1) & (df.Re == 1) & (df.Fa > df.So))[0]]
Output:
   Do  Re   Mi  Fa  So
1   1   1  Bar  75  20
If you do not want to import anything else, just do this:
df.assign(keep=[1 if (df.loc[i,'Do'] == 1 and df.loc[i,'Re'] == 1 and df.loc[i,'Fa'] > df.loc[i,'So']) else 0 for i in df.index]).query("keep == 1").drop('keep', axis=1)

Related

Computing a number table for two types of records in Pandas

I have this data frame of many records with columns, let's say, from 'a' to 'z' in pandas.
I need to build a table of 4 cells with the numbers of records that have 'a' = 0 and 'z'= 0, 'a' = 0 and 'z' != 0, 'a' != 0 and 'z' = 0, 'a' != 0 and 'z' != 0.
What is the best Pythonic pandas way to do it? I can compute the four sums by indexing and summing, but I'm sure there should be a built-in, elegant way of doing it.
What is that way?
You can group by the values of whether each column is 0 and get the size of these groups:
>>> df.groupby([df['a'] == 0, df['z'] == 0]).size()
a      z
False  False    19
       True      2
True   False     4
dtype: int64
Alternately you can also create a 2-column dataframe with these series and use value_counts which yields the same result:
>>> pd.concat([df['a'] == 0, df['z'] == 0], axis='columns').value_counts()
a      z
False  False    19
True   False     4
False  True      2
dtype: int64
Calling this result counts, we can then simply unstack it if you want a 2-by-2 table:
>>> counts.unstack('z', fill_value=0)
z      False  True
a
False     19      2
True       4      0
In the first level of counts’ index, or the index in the 2x2 table, True means a == 0 and False means a != 0. In the second level or columns, True means z == 0. In this sample data no rows have a == 0 and z == 0 at the same time.
If you want to generically rename them you can do something like this:
>>> for level, name in enumerate(counts.index.names):
...     counts.rename({True: f'{name} == 0', False: f'{name} != 0'}, level=level, inplace=True)
>>> counts
a       z
a != 0  z != 0    19
a == 0  z != 0     4
a != 0  z == 0     2
dtype: int64
>>> counts.unstack('z', fill_value=0)
z       z != 0  z == 0
a
a != 0      19       2
a == 0       4       0
Alternately, to flatten the index to a single level (this also works generically with any number of variables):
>>> counts.index = counts.index.map(lambda tup: ' & '.join(f'~{var}' if truth else var for truth, var in zip(tup, counts.index.names)))
>>> counts
a & z     19
~a & z     4
a & ~z     2
a = df['a'] == 0
z = df['z'] == 0
count = pd.Series([sum(a & z), sum(a & ~z), sum(~a & z), sum(~a & ~z)],
                  index=['a & z', 'a & ~z', '~a & z', '~a & ~z'])
>>> count
a & z       3
a & ~z      2
~a & z      2
~a & ~z    13
dtype: int64
Use pd.crosstab(), as follows:
a = df['a'] == 0
z = df['z'] == 0
count_tab = pd.crosstab(a, z, rownames=["'a' = 0 ?"], colnames=["'z' = 0 ?"])
Result:
print(count_tab)
'z' = 0 ?  False  True
'a' = 0 ?
False         13      2
True           2      3
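If row and column totals are also useful, crosstab can add them with margins; a small sketch extending the call above (the count_tab_totals name is just illustrative, not part of the original answer):
count_tab_totals = pd.crosstab(a, z, rownames=["'a' = 0 ?"], colnames=["'z' = 0 ?"],
                               margins=True)  # adds an 'All' row and column holding the totals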

Vectorized function with counter on pandas dataframe column

Consider this pandas dataframe where the condition column is 1 when value is below 5 (any threshold).
import pandas as pd
d = {'value': [30,100,4,0,80,0,1,4,70,70],'condition':[0,0,1,1,0,1,1,1,0,0]}
df = pd.DataFrame(data=d)
df
Out[1]:
   value  condition
0     30          0
1    100          0
2      4          1
3      0          1
4     80          0
5      0          1
6      1          1
7      4          1
8     70          0
9     70          0
What I want is for all consecutive values below 5 to share the same id, and for all other values to have 0 (or NA or a negative value; it doesn't matter, they just need to be the same). I want to create a new column called new_id that contains these cumulative ids as follows:
   value  condition  new_id
0     30          0       0
1    100          0       0
2      4          1       1
3      0          1       1
4     80          0       0
5      0          1       2
6      1          1       2
7      4          1       2
8     70          0       0
9     70          0       0
In a very inefficient for loop I would do this (which works):
counter = 1  # starting id (assumed; needed for the snippet to run)
for i in range(0, df.shape[0]):
    if (df.loc[df.index[i], 'condition'] == 1) & (df.loc[df.index[i-1], 'condition'] == 0):
        new_id = counter  # assign new id
        counter += 1
    elif (df.loc[df.index[i], 'condition'] == 1) & (df.loc[df.index[i-1], 'condition'] != 0):
        new_id = counter - 1  # assign current id
    elif (df.loc[df.index[i], 'condition'] == 0):
        new_id = df.loc[df.index[i], 'condition']  # assign 0
    df.loc[df.index[i], 'new_id'] = new_id
df
But this is very inefficient and I have a very big dataset. Therefore I tried different kinds of vectorization, but so far I have failed to keep it from counting up inside each "cluster" of consecutive points:
# First try using cumsum():
df['new_id'] = 0
df['new_id_temp'] = ((df['condition'] == 1)).astype(int).cumsum()
df.loc[(df['condition'] == 1), 'new_id'] = df['new_id_temp']
df[['value', 'condition', 'new_id']]
# Another try using list comprehension but this just does +1:
[row+1 for ind, row in enumerate(df['condition']) if (row != row-1)]
I also tried using apply() with a custom if else function but it seems like this does not allow me to use a counter.
There is already a ton of similar posts about this but none of them keep the same id for consecutive rows.
Example posts are:
Maintain count in python list comprehension
Pandas cumsum on a separate column condition
Python - keeping counter inside list comprehension
python pandas conditional cumulative sum
Conditional count of cumulative sum Dataframe - Loop through columns
You can use cumsum(), as you did in your first try, just modified a bit:
# calculate delta
df['delta'] = df['condition']-df['condition'].shift(1)
# get rid of -1 for the cumsum (replace it by 0)
df['delta'] = df['delta'].replace(-1,0)
# cumulative sum conditional: multiply with condition column
df['cumsum_x'] = df['delta'].cumsum()*df['condition']
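The same idea can also be written without the helper columns; a minimal sketch, assuming the df from the question (not part of the original answer):
starts = (df['condition'] == 1) & (df['condition'].shift(fill_value=0) == 0)  # 0 -> 1 transitions
df['new_id'] = starts.cumsum() * df['condition']  # same id within each cluster, 0 elsewhere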
Welcome to SO! Why not just rely on base Python for this?
def counter_func(l):
    new_id = [0]  # First value is zero in any case
    counter = 0
    for i in range(1, len(l)):
        if l[i] == 0:
            new_id.append(0)
        elif l[i] == 1 and l[i-1] == 0:
            counter += 1
            new_id.append(counter)
        elif l[i] == l[i-1] == 1:
            new_id.append(counter)
        else:
            new_id.append(None)
    return new_id

df["new_id"] = counter_func(df["condition"])
Looks like this
   value  condition  new_id
0     30          0       0
1    100          0       0
2      4          1       1
3      0          1       1
4     80          0       0
5      0          1       2
6      1          1       2
7      4          1       2
8     70          0       0
9     70          0       0
Edit:
You can also use numba, which sped up the function quite a lot for me: from about 1 s to ~60 ms.
You should pass numpy arrays into the function to use it, meaning you'll have to use df["condition"].values.
from numba import njit
import numpy as np

@njit
def func(arr):
    res = np.empty(arr.shape[0])
    counter = 0
    res[0] = 0  # First value is zero anyway
    for i in range(1, arr.shape[0]):
        if arr[i] == 0:
            res[i] = 0
        elif arr[i] and arr[i-1] == 0:
            counter += 1
            res[i] = counter
        elif arr[i] == arr[i-1] == 1:
            res[i] = counter
        else:
            res[i] = np.nan
    return res

df["new_id"] = func(df["condition"].values)

Can we combine conditional statements using column indexes in Python?

For the following dataframe:
  Name  P1  P2  P3
0    A   1   0   2
1    B   1   1   1
2    C   0   5   6
I want to get Yes where P1, P2 and P3 are all greater than 0.
Currently I am using either of the following methods:
Method1:
df['Check']= np.where((df['P1'] > 0) & (df['P2'] > 0) & (df['P3'] > 0),'Yes','No')
Method2:
df.loc[(df['P1'] > 0) & (df['P2'] > 0) & (df['P3'] > 0), 'Check'] = "Yes"
I have a large dataset with a lot of columns where the conditions are to be applied.
Is there a shorter alternative to the multiple & conditions, wherein I won't have to write the conditions for each and every variable and instead use a combined index range for the multiple columns?
I think you need DataFrame.all to check whether all values per row are True:
cols = ['P1','P2','P3']
df['Check']= np.where((df[cols] > 0).all(axis=1),'Yes','No')
print (df)
  Name  P1  P2  P3 Check
0    A   1   0   2    No
1    B   1   1   1   Yes
2    C   0   5   6    No
print ((df[cols] > 0))
      P1     P2    P3
0   True  False  True
1   True   True  True
2  False   True  True
print ((df[cols] > 0).all(axis=1))
0    False
1     True
2    False
dtype: bool
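If the columns of interest also form a contiguous positional range, they can be selected with iloc instead of listing every name; a sketch assuming P1 through P3 sit at positions 1 to 3:
df['Check'] = np.where((df.iloc[:, 1:4] > 0).all(axis=1), 'Yes', 'No')  # columns at positions 1, 2, 3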

How to use previous values from Dataframe in if statement

I have DataFrame df
df=pd.DataFrame([[47,55,47,50], [33,37,30,25],[61,65,54,57],[25,26,21,22], [25,29,23,28]], columns=['open','high','low','close'])
print(df)
   open  high  low  close
0    47    55   47     50
1    33    37   30     25
2    61    65   54     57
3    25    26   21     22
4    25    29   23     28
I want to use the previous row's values in an if statement for comparison.
The logic is as below:
if (close[i-1]/close[i] > 2) and (high[i] < low[i-1]) and
((open[i] > high[i-1]) or (open[i] <low[i-1])) :
I have written this code:
for idx, i in df.iterrows():
    if idx != 0:
        if ((prv_close / i['close']) > 2) and (i['high'] < prv_low) and ((i['open'] > prv_high) or (i['open'] < prv_low)):
            print("Successful")
        else:
            print("Not Successful")
    prv_close = i['close']
    prv_open = i['open']
    prv_low = i['low']
    prv_high = i['high']
output:
Not Successful
Not Successful
Successful
Not Successful
But For millions of rows it's taking too much time. Is there any other way to implement it faster?
P.S.: The actions to be taken on the data are different; for simplicity I'm using a print statement here. My columns may not have the same order, which is why I'm using iterrows() instead of itertuples().
Any suggestions are welcome.
Thanks.
d0 = df.shift()
cond0 = (d0.close / df.close) > 2
cond1 = df.high < d0.low
cond2 = df.open > d0.high
cond3 = df.open < d0.low
mask = cond0 & cond1 & (cond2 | cond3)
mask
0    False
1    False
2    False
3     True
4    False
dtype: bool
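To reproduce the per-row messages from the question, one way is to loop over the mask (a sketch; row 0 is skipped since it has no previous row):
for ok in mask.iloc[1:]:
    print("Successful" if ok else "Not Successful")
Or, to act on the matching rows directly, df[mask] selects them with boolean indexing.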

python value conditional on two columns

I have the following dataset:
   id  Rank  condition1  condition2  result
0   1     2          50           0       0
1   1     2          50           0       0
2   2    55          50           1       0
3   2    55          50           1       0
I want to set the result column to 1 conditional on the two columns condition1 and condition2.
result should become 1 if Rank <= condition1 AND condition2 == 0:
   id  Rank  condition1  condition2  result
0   1     2          50           0       1
1   1     2          50           0       1
2   2    55          50           1       0
3   2    55          50           1       0
I have tried the following code but get "invalid syntax".
df["result"][df[condition2] = 0 & df["Rank"]<= df["condition1"]] = 1
Can somebody help me in finding the error? I know how to make this command conditional on one condition, but I do not know how to incorporate the second condition with the AND command.
You need to use == for equality checks, the single = is for assignments not for comparisons:
df["result"][(df['condition2'] == 0) & (df["Rank"]<= df["condition1"])] = 1
You also forgot the quotes around 'condition2', and I included some parentheses to separate the conditions because & has higher precedence than == or <=.
Pandas also provides methods for comparisons (eq and le in this case), so you could also use:
df["result"][df['condition2'].eq(0) & df['Rank'].le(df['condition1'])] = 1
