Pandas dataframe conditional statement didn't give me what i expected - python

I have a dataframe like this
import numpy as np
import pandas as pd
lbl = [0, 1, 2, 3]
lbl2 = [0, 1, 2, 3, 4, 5]
label = lbl + lbl2
df = pd.DataFrame({"label":label})
#matching lbl and lbl2
pairs = []
for i in range(3):
    pair = (i, i + 1)
    pairs.append(pair)
When I ran this in the terminal
(df.loc[:,'label'][df.index > (num_old_lbl - 1)]) & (df['label'] == pairs[0][1])
I got what I expected, but when I changed the line to this
(df.loc[:,'label'][df.index > (num_old_lbl - 1)]) & (df['label'] == pairs[1][1])
every value shows False. I expected row 6 to be True.

We get the expected result with the following (setting num_old_lbl to 3):
>>> (df.index > (num_old_lbl - 1)) & (df['label'] == pairs[1][1])
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
9 False
Name: label, dtype: bool
It also works as expected with the given example:
>>> (df.index > (num_old_lbl - 1)) & (df['label'] == pairs[0][1])
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
Name: label, dtype: bool
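The reason the original expression fails may be worth spelling out: df.loc[:,'label'][...] is a Series of integer label values, not a boolean mask, so & performs a bitwise AND between integers and booleans. A minimal sketch reproducing this with the question's data:

```python
import pandas as pd

lbl = [0, 1, 2, 3]
lbl2 = [0, 1, 2, 3, 4, 5]
df = pd.DataFrame({"label": lbl + lbl2})
num_old_lbl = 3

# 2 & True == 2 & 1 == 0, so row 6 (label == 2) comes out False,
# while 1 & True == 1, which is why the pairs[0][1] == 1 case seemed to work.
assert (2 & True) == 0
assert (1 & True) == 1

# Comparing two boolean masks instead gives the intended result:
mask = (df.index > (num_old_lbl - 1)) & (df["label"] == 2)
```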


Count occurrences of strings in a row Pandas

I'm trying to count the number of instances of a certain string in a row of a pandas dataframe.
In the example here I used a lambda function and pandas .count() to try to count the number of times 'True' exists in each row.
Though instead of a count of 'True' it just returns a boolean for whether or not it exists in the row...
#create dataframe
d = {'Period': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
'Result': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
'Result1': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
'Result2': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False']}
df = pd.DataFrame(data=d)
#count instances of 'True' in each row
df['Count'] = df.apply(lambda row: row.astype(str).str.count('True').any(), axis=1)
print(df)
The desired outcome is:
Period Result Result1 Result2 Count
1 True True True 3
2 None None None 0
3 False False False 0
4 True True True 3
1 False False False 0
2 True True True 3
3 False False False 0
... ... ... ... ......
You can use np.where:
df['count'] = np.where(df == 'True', 1, 0).sum(axis=1)
Regarding why your apply returns a boolean: both any and all return booleans, not numbers.
Edit: You can include df.isin for multiple conditions:
df['count'] = np.where(df.isin(['True', 'False']), 1, 0).sum(axis=1)
Use eq with sum:
df.eq("True").sum(axis=1)
Use apply with a lambda function.
df.apply(lambda x: x.eq("True").sum(), axis=1)
For matching more than one string, try
df.iloc[:,1:].apply(lambda x: x.str.contains("True|False")).sum(axis=1)
Avoid using the apply function, as it can be slow:
df[["Result", "Result1", "Result2"]].sum(axis=1).str.count("True")
This also will work for when you have strings that are like:
"this sentence contains True"
Your lambda is not working correctly, try this:
import pandas as pd
#create dataframe
d = {'Period': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
'Result': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
'Result1': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False'],
'Result2': ['True','None','False','True','False','True','False','True','True','False','False','True','False','True','False','False']}
df = pd.DataFrame(data=d)
#count instances of 'True' in each row
df['Count'] = df.apply(lambda row: sum(row[1:4] == 'True') ,axis=1)
print(df)
# Output:
# >> Period Result Result1 Result2 Count
# >> 0 1 True True True 3
# >> 1 2 None None None 0
# >> 2 3 False False False 0
# >> 3 4 True True True 3
# >> 4 1 False False False 0
# >> 5 2 True True True 3
# >> 6 3 False False False 0
# >> 7 4 True True True 3
# >> 8 1 True True True 3
# >> 9 2 False False False 0
# >> 10 3 False False False 0
# >> 11 4 True True True 3
# >> 12 1 False False False 0
# >> 13 2 True True True 3
# >> 14 3 False False False 0
# >> 15 4 False False False 0
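For reference, the vectorized answers above can be cross-checked on a trimmed version of the sample data; all three produce the same per-row counts (a sketch, keeping the question's column names):

```python
import pandas as pd
import numpy as np

d = {'Period': [1, 2, 3, 4],
     'Result':  ['True', 'None', 'False', 'True'],
     'Result1': ['True', 'None', 'False', 'True'],
     'Result2': ['True', 'None', 'False', 'True']}
df = pd.DataFrame(data=d)

cols = ['Result', 'Result1', 'Result2']
a = np.where(df[cols] == 'True', 1, 0).sum(axis=1)  # np.where approach
b = df[cols].eq('True').sum(axis=1)                 # eq + sum approach
c = df[cols].sum(axis=1).str.count('True')          # string-concat approach
```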

find groups of neighboring True in pandas series

I have a series with True and False and need to find all groups of True.
This means that I need to find the start index and end index of neighboring True values.
The following code gives the intended result but is very slow, inefficient and clumsy.
import pandas as pd

def groups(ser):
    g = []
    flag = False
    start = None
    for idx, s in ser.items():
        if flag and not s:
            g.append((start, idx - 1))
            flag = False
        elif not flag and s:
            start = idx
            flag = True
    if flag:
        g.append((start, idx))
    return g

if __name__ == "__main__":
    ser = pd.Series([1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1], dtype=bool)
    print(ser)
    g = groups(ser)
    print("\ngroups of True:")
    for start, end in g:
        print("from {} until {}".format(start, end))
output is:
0 True
1 True
2 False
3 False
4 True
5 False
6 False
7 True
8 True
9 True
10 True
11 False
12 True
13 False
14 True
groups of True:
from 0 until 1
from 4 until 4
from 7 until 10
from 12 until 12
from 14 until 14
There are similar questions out there but none is looking to find the indices of the group starts/ends.
Label contiguous groups of True elements within a pandas Series
Streaks of True or False in pandas Series
It's common to use cumsum on the negation to label consecutive blocks. For example (with s being the boolean series, ser above):
for _, x in s[s].groupby((1 - s).cumsum()):
    print(f'from {x.index[0]} to {x.index[-1]}')
Output:
from 0 to 1
from 4 to 4
from 7 to 10
from 12 to 12
from 14 to 14
You can use itertools:
In [478]: from operator import itemgetter
...: from itertools import groupby
In [489]: a = ser[ser].index.tolist() # Create a list of indexes having `True` in `ser`
In [498]: for k, g in groupby(enumerate(a), lambda ix: ix[0] - ix[1]):
     ...:     l = list(map(itemgetter(1), g))
     ...:     print(f'from {l[0]} to {l[-1]}')
     ...:
from 0 to 1
from 4 to 4
from 7 to 10
from 12 to 12
from 14 to 14
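The cumsum idea packs into a small reusable function that returns the same (start, end) tuples as the question's groups(); true_runs is a name introduced here for illustration:

```python
import pandas as pd

def true_runs(ser):
    # Label each run by counting the Falses seen so far; rows within one
    # all-True block share the same label, so grouping by it splits the runs.
    block = (~ser).cumsum()
    return [(grp.index[0], grp.index[-1])
            for _, grp in ser[ser].groupby(block[ser])]

ser = pd.Series([1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1], dtype=bool)
runs = true_runs(ser)
```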

Pandas: How to create a column that indicates when a value is present in another column a set number of rows in advance?

I'm trying to work out how to create a column that indicates, X rows in advance, when the next occurrence of a value in another column will occur. In essence, it should perform the following (in this instance X = 3):
df
rowid event indicator
1 True 1 # Event occurs
2 False 0
3 False 0
4 False 1 # Starts indicator
5 False 1
6 True 1 # Event occurs
7 False 0
Apart from doing an iterative/recursive loop through every row:
X = 3
i = df.index[df['event'] == True]
dfx = [df.index[z - X:z] for z in i]
for ix in dfx:
    df.loc[ix, 'indicator'] = 1
df['indicator'] = df['indicator'].fillna(0)
However, this seems inefficient. Is there a more succinct method of achieving the aforementioned example? Thanks
Here's a NumPy based approach using flatnonzero:
X = 3
# ndarray of indices where indicator should be set to one
nd_ixs = np.flatnonzero(df.event)[:,None] - np.arange(X-1, -1, -1)
# flatten the indices
ixs = nd_ixs.ravel()
# filter out negative indices and set to 1
df['indicator'] = 0
df.loc[ixs[ixs>=0], 'indicator'] = 1
print(df)
rowid event indicator
0 1 True 1
1 2 False 0
2 3 False 0
3 4 False 1
4 5 False 1
5 6 True 1
6 7 False 0
Where nd_ixs is obtained through the broadcasted subtraction of the indices where event is True and an arange up to X:
print(nd_ixs)
array([[-2, -1, 0],
[ 3, 4, 5]], dtype=int64)
A pandas and numpy solution:
# Make a variable shift:
def var_shift(series, X):
return [series] + [series.shift(i) for i in range(-X + 1, 0, 1)]
X = 3
# Set indicator to default to 1
df["indicator"] = 1
# Use pd.Series.where and np.logical_or with the
# var_shift function to get a bool array, setting
# 0 when False
df["indicator"] = df["indicator"].where(
np.logical_or.reduce(var_shift(df["event"], X)),
0,
)
# rowid event indicator
# 0 1 True 1
# 1 2 False 0
# 2 3 False 0
# 3 4 False 1
# 4 5 False 1
# 5 6 True 1
# 6 7 False 0
In [77]: np.logical_or.reduce(var_shift(df["event"], 3))
Out[77]: array([True, False, False, True, True, True, nan], dtype=object)
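Another vectorized option (a sketch under the same X = 3 setup, not from the original answers) is to reverse the event column so that looking X rows ahead becomes an ordinary backward rolling maximum:

```python
import pandas as pd

df = pd.DataFrame({'rowid': [1, 2, 3, 4, 5, 6, 7],
                   'event': [True, False, False, False, False, True, False]})
X = 3

# In the reversed series, a rolling window of X ending at a row covers,
# in original order, that row and the X - 1 rows after it; the max is 1
# whenever any of those rows contains an event.
ind = (df['event'].astype(int)[::-1]
         .rolling(X, min_periods=1).max()[::-1]
         .astype(int))
df['indicator'] = ind
```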

Rolling Conditional Pandas DataFrame Column

How would I be able to write a rolling condition apply to a column in pandas?
import pandas as pd
import numpy as np
lst = np.random.randint(low=-10, high=11, size=10)
lst2 = np.random.randint(low=-10, high=11, size=10)
#lst = [ -2 10 -10 -6 4 2 -5 4 9 3]
#lst2 = [-7 5 6 -4 7 1 -4 -6 -1 -4]
df = pd.DataFrame({'a': lst, 'b': lst2})
Given a dataframe, namely 'df', I want to create a column 'C' that is True if a > 0 and b > 0, and False if a < 0 and b < 0.
For rows that meet neither condition, I want to carry the previous row's value forward (i.e. if the previous row is True and the current row meets neither condition, the current row should also be True).
How can I do this?
Follow-up Question : How would I do this for conditions a > 1 and b > 1 returns True or a < -1 and b < -1 returns False?
I prefer to do this with a little mathemagic on the signs.
i = np.sign(df.a)
j = np.sign(df.b)
i = i.mask(i != j).ffill()
i >= 0
# for your `lst` and `lst2` input
0 False
1 True
2 True
3 False
4 True
5 True
6 False
7 False
8 False
9 False
Name: a, dtype: bool
As long as you don't have to worry about integer overflows, this works just fine.
i = np.sign(df.a)
j = np.sign(df.b)
i.mask(i != j).ffill().ge(0)
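For the follow-up question (a > 1 and b > 1 gives True, a < -1 and b < -1 gives False), a sketch along the same forward-fill lines, using np.select; note that any rows before the first decided row would stay NaN and read as False here:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [-2, 10, -10, -6, 4, 2, -5, 4, 9, 3],
                   'b': [-7, 5, 6, -4, 7, 1, -4, -6, -1, -4]})

# 1.0 where both exceed 1, 0.0 where both are below -1, NaN otherwise,
# then forward-fill the undecided rows from the previous decision.
raw = np.select([(df.a > 1) & (df.b > 1), (df.a < -1) & (df.b < -1)],
                [1.0, 0.0], default=np.nan)
df['C'] = pd.Series(raw, index=df.index).ffill().eq(1)
```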

Find first 'True' value in blocks in pandas data frame

I have a dataframe, where one column contains only True or False values in blocks. For example:
df =
b
0 False
1 True
2 True
3 False
4 True
5 True
6 True
7 True
8 False
9 False
10 False
11 False
12 False
13 True
14 True
15 True
I need to find the beginning of block with True:
>> find_first_true(df)
>> array([1, 4, 13])
Any elegant solutions?
EDIT
Thanks for the proposed solution. I am wondering, what's the easiest way to extract blocks of a certain length, starting from the indices I found?
For example, I need to take blocks (number of rows) of length 4 before the indices. So, if my indices (found previously)
index = array([1, 4, 13])
then I need blocks:
[df.loc[0:4], df.loc[9:13]]
or
b
0 False
1 True
2 True
3 False
4 True
9 False
10 False
11 False
12 False
13 True
I am looping over the indices, but wonder about a more pandasian solution.
In [2]: df = pd.read_clipboard()
In [3]: df
Out[3]:
b
0 False
1 True
2 True
3 False
4 True
5 True
6 True
7 True
8 False
9 False
10 False
11 False
12 False
13 True
14 True
15 True
In [11]: np.where(((df.b != df.b.shift(1)) & df.b).values)[0]
Out[11]: array([ 1, 4, 13], dtype=int64)
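The same shift trick also yields the block ends, which helps with the EDIT about extracting rows around each start (a sketch using shift's fill_value to handle the series boundaries):

```python
import pandas as pd
import numpy as np

b = pd.Series([False, True, True, False, True, True, True, True,
               False, False, False, False, False, True, True, True])

# A block starts where the value is True and the previous one wasn't,
# and ends where the value is True and the next one isn't.
starts = np.where(b & ~b.shift(1, fill_value=False))[0]
ends = np.where(b & ~b.shift(-1, fill_value=False))[0]
```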
def find_first_true(df):
    # find indexes of true elements
    a = list(map(lambda e: e[0] + 1 if e[1] else 0, enumerate(df)))
    a = list(filter(bool, a))
    a = list(map(lambda x: x - 1, a))
    # keep only the first index of each consecutive run
    ta = [0] + list(filter(lambda x: a[x] - a[x - 1] != 1, range(1, len(a))))
    a = list(map(lambda x: a[x], ta))
    return a
find_first = []
for i in range(len(df) - 1):
    if df.loc[i, 'b'] == False and df.loc[i + 1, 'b'] == True:
        find_first.append(i + 1)
# note: this misses a block that starts at row 0
