Broadcasting boolean operations in Pandas - python

Let's say I have an NxM boolean DataFrame X and an Nx1 boolean DataFrame Y. I would like to apply a boolean operation between Y and each column of X, returning a new NxM DataFrame. For example:
x = pd.DataFrame([[True, True, True], [True, False, True], [False, False, True]])
y = pd.DataFrame([[False], [True], [True]])
I would like x & y to return:
0 1 2
0 False False False
1 True False True
2 False False True
But instead it returns:
0 1 2
0 False NaN NaN
1 True NaN NaN
2 False NaN NaN
Treating y as a Series instead, with
x & y[0]
gives:
0 1 2
0 False True True
1 False False True
2 False False True
Which appears to be broadcasting by row. Is there a correct way to do this other than transposing, applying the operation with the Series, and then transposing back?
(x.T & y[0]).T
0 1 2
0 False False False
1 True False True
2 False False True
However, this seems to fail when the row index is not the same as the column labels.

You could call apply with a lambda, using squeeze to collapse the single-column DataFrame y into a Series:
In [152]:
x.apply(lambda s: s & y.squeeze())
Out[152]:
0 1 2
0 False False False
1 True False True
2 False False True
I'm not sure whether this is any quicker, though. Here the mask is applied column-wise by calling apply on the df, which is why transposing is unnecessary.
Actually you could use np.logical_and:
In [156]:
np.logical_and(x,y)
Out[156]:
0 1 2
0 False False False
1 True False True
2 False False True
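A sketch of another way to sidestep label alignment altogether: drop to the underlying NumPy arrays, let NumPy broadcast the (N, 1) mask across the (N, M) array, and wrap the result back in a DataFrame. This also avoids depending on how np.logical_and treats two DataFrames, which, as far as I know, may align by label in newer pandas releases. The name result is only illustrative.

```python
import pandas as pd

x = pd.DataFrame([[True, True, True], [True, False, True], [False, False, True]])
y = pd.DataFrame([[False], [True], [True]])

# (3, 3) & (3, 1): NumPy broadcasts the single column of y across every column of x.
# Labels are ignored entirely, so x and y must already be in the same row order.
result = pd.DataFrame(x.to_numpy() & y.to_numpy(), index=x.index, columns=x.columns)
print(result)
```

Because the arrays carry no labels, this is only appropriate when the rows of x and y are known to line up positionally.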

Related

Count number of consecutive True in column, restart when False

I work with the following column in a pandas df:
A
True
True
True
False
True
True
I want to add a column B that counts the number of consecutive "True" values in A, restarting every time a "False" comes up. Desired output:
A B
True 1
True 2
True 3
False 0
True 1
True 2
Use cumsum on the negated column to identify the blocks of rows where column A stays True, then group column A by these blocks and take a cumulative sum within each block to number the rows:
df['B'] = df['A'].groupby((~df['A']).cumsum()).cumsum()
A B
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
5 True 2
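To make the two steps easier to follow, here is the same idea split into named intermediate results. This is a minimal sketch: the name block_id is illustrative, and the explicit astype(int) is an addition of mine so that the cumulative sum is unambiguously an integer count.

```python
import pandas as pd

df = pd.DataFrame({'A': [True, True, True, False, True, True]})

# Every False starts a new block, so the running count of False values is a block id.
block_id = (~df['A']).cumsum()

# Within each block, the cumulative sum of A counts the True values seen so far;
# the False that opens a block contributes 0.
df['B'] = df['A'].astype(int).groupby(block_id).cumsum()
print(df)
```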
Using a simple, plain-Python approach
(it worked fine on a small sample):
import pandas as pd
df = pd.DataFrame({'A': [True, False, True, True, True, False, True, True]})
class ToNums:
    counter = 0  # running count of consecutive True values
    @staticmethod
    def convert(bool_val):
        # Increment on True, reset to zero on False
        if bool_val:
            ToNums.counter += 1
        else:
            ToNums.counter = 0
        return ToNums.counter
df['B'] = df.A.map(ToNums.convert)
df
A B
0 True 1
1 False 0
2 True 1
3 True 2
4 True 3
5 False 0
6 True 1
7 True 2
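Here's an example using an explicit loop:
v = 0
for i, val in enumerate(df['A']):
    if val:  # test the boolean itself; comparing with the string "True" never matches
        v += 1
    else:
        v = 0
    df.loc[i, "C"] = v  # assumes the default 0..n-1 index
df.head()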
This will give the desired output
A C
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
You can use a combination of groupby, cumsum, and cumcount:
df['B'] = (df.groupby((df['A']
                       # row is True and the previous row is False;
                       # fill_value=False keeps the shifted column boolean
                       & ~df['A'].shift(1, fill_value=False)
                       ).cumsum()   # make group id
                      )
             .cumcount().add(1)     # cumulative count within each group
           * df['A']                # multiply by 0 where initially False, 1 otherwise
           )
output:
A B
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
5 True 2

Python dataframe : Add a column that increments when another column changes

I have a dataframe as follows
df = pd.DataFrame({
'Values' : [False, False, True, False, False, True, True, False, False, True]
})
df
Values
0 False
1 False
2 True
3 False
4 False
5 True
6 True
7 False
8 False
9 True
I would like to add another column named 'count' which increments by one whenever a True is detected in the 'Values' column.
My expected output is as follows
Values Count
0 False 0
1 False 0
2 True 1
3 False 1
4 False 1
5 True 2
6 True 3
7 False 3
8 False 3
9 True 4
Now I am doing it as follows:
counter = [0]
def handleOneRow(row):
    if row['Values'] == True:
        counter[0] = counter[0] + 1
    return counter[0]
df['count'] = df.apply(lambda x : handleOneRow(x), axis=1)
Is there a simpler way to do this with the DataFrame?
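One possible simplification, offered as a sketch rather than as an answer taken from the thread: since True counts as 1 and False as 0, a cumulative sum over the boolean column produces exactly this kind of running counter, with no mutable state or row-wise apply.

```python
import pandas as pd

df = pd.DataFrame({
    'Values': [False, False, True, False, False, True, True, False, False, True]
})

# True counts as 1 and False as 0, so the running total goes up by one at each True.
df['count'] = df['Values'].cumsum()
print(df)
```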

Mask DataFrame with list of values as condition

My DataFrame contains multiple time series, and I want to flag every point in each series that is more than one standard deviation above that series' mean.
df = pd.DataFrame(np.random.rand(3, 10), index=['ts_A', 'ts_B','ts_C'])
std = df.std(axis=1)
mean = df.mean(axis=1)
And then I was hoping to be able to do:
df.mask(df > (std + mean), 'True', inplace=True)
Which should return the original DataFrame where any value which is more than one standard deviation above the mean for that row/time series is replaced by True.
However, instead this returns false for every element. If I use df.where instead the entire DataFrame gets filled with True.
I could do this by iterating through the index and masking each row in turn but I'm sure there must be a better way.
Use gt with axis=0, so that the Series is matched against the row index rather than the column labels:
df.mask(df.gt(std + mean,axis=0), 'True', inplace=True)
df
0 1 2 3 4 5 6
ts_A 0.003797 0.060297 0.265496 0.442663 True 0.498443 0.436738
ts_B 0.127535 0.644332 True 0.079317 0.0411021 True 0.830672
ts_C 0.693698 0.429689 0.371802 0.312407 0.0555868 True True
7 8 9
ts_A 0.403529 0.392445 0.238355
ts_B 0.732539 0.030451 0.895976
ts_C 0.907143 0.912002 0.098821
If you need a True/False DataFrame instead:
TorF=df.gt(std + mean,axis=0)
TorF
Out[31]:
0 1 2 3 4 5 6 7 8 9
ts_A False False False False True False False False False False
ts_B False False True False False True False False False False
ts_C False False False False False True True False False False
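A small reproducible sketch of why axis=0 matters. As far as I can tell, the plain > operator tries to match the Series index (ts_A, ts_B, ...) against the column labels (0, 1, ...), which have nothing in common, whereas gt(..., axis=0) matches it against the row labels. The hand-made numbers and the names threshold and flags are illustrative assumptions, chosen so the result can be checked by hand.

```python
import pandas as pd

# Small hand-made frame instead of random data, so the result is reproducible.
df = pd.DataFrame(
    [[1.0, 1.0, 1.0, 5.0],
     [2.0, 2.0, 8.0, 2.0]],
    index=['ts_A', 'ts_B'],
)

# One threshold per row, indexed by ts_A and ts_B.
threshold = df.mean(axis=1) + df.std(axis=1)

# axis=0 matches the Series index against the row labels, so each row is
# compared with its own threshold.
flags = df.gt(threshold, axis=0)
print(flags)
```

Here only the 5.0 in ts_A and the 8.0 in ts_B lie more than one (sample) standard deviation above their row means.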

Shift boolean horizontally in pandas dataframe

Say I have a dataframe of booleans, called original:
original = pd.DataFrame([
[True, False, False, True, False],
[False, True, False, False, False]
])
0 1 2 3 4
0 True False False True False
1 False True False False False
And I want to create the following boolean dataframe (all to the right of a True should now be True):
0 1 2 3 4
0 False True True True True
1 False False True True True
I've accomplished this as follows, but was wondering if anyone had a less cumbersome method:
original.shift(axis=1).fillna(False).astype(int) \
.T.replace(to_replace=0, method='ffill').T.astype(bool)
Use cummax:
original.cummax(1).shift(axis=1).fillna(False)
0 1 2 3 4
0 False True True True True
1 False False True True True
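The cummax approach can be read as two steps, sketched below with illustrative names (seen, shifted, result). Passing fill_value=False to shift and the final astype(bool) are my own additions, meant to keep the result boolean instead of relying on fillna.

```python
import pandas as pd

original = pd.DataFrame([
    [True, False, False, True, False],
    [False, True, False, False, False],
])

# Running maximum along each row: True from the first True onward.
seen = original.cummax(axis=1)

# Move everything one column to the right, padding the first column with False.
shifted = seen.shift(1, axis=1, fill_value=False)

# Guard against an object-dtype result.
result = shifted.astype(bool)
print(result)
```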
IIUC
original[original].shift(1,axis=1).ffill(axis=1).fillna(0).astype(bool)
Out[77]:
0 1 2 3 4
0 False True True True True
1 False False True True True

inconsistent any vs all pd dataframe

This was asked in other forums but with focus on nan.
I have a simple dataframe:
y=[[1,2,3,4,1],[1,2,0,4,5]]
df = pd.DataFrame(y)
I am having difficulties understanding how any and all work. According to the pandas documentation 'any' returns "...whether any element is True over requested axis".
If I use:
~(df == 0)
Out[77]:
0 1 2 3 4
0 True True True True True
1 True True False True True
~(df == 0).any(1)
Out[78]:
0 True
1 False
dtype: bool
From my understanding the second command means: Return 'True' if any element is True over requested axis, and it should return True, True for both rows (since both contain at least one true value) but instead I get True, False. Why is that?
You need an extra pair of parentheses because of operator precedence:
print (df == 0)
0 1 2 3 4
0 False False False False False
1 False False True False False
print (~(df == 0))
0 1 2 3 4
0 True True True True True
1 True True False True True
print ((~(df == 0)).any(1))
0 True
1 True
dtype: bool
Because:
print ((df == 0).any(1))
0 False
1 True
dtype: bool
print (~(df == 0).any(1))
0 True
1 False
dtype: bool
Python interprets your call as:
~ ( (df == 0).any(1) )
So it evaluates any first. Now if we take a look at df == 0, we see:
>>> df == 0
0 1 2 3 4
0 False False False False False
1 False False True False False
So this means that in the first row, there is no such True, in the second one there is, so:
>>> (df == 0).any(1)
0 False
1 True
dtype: bool
Now we negate this with ~, so False becomes True and vice versa:
>>> ~ (df == 0).any(1)
0 True
1 False
dtype: bool
In case we first negate, we see:
>>> (~ (df == 0)).any(1)
0 True
1 True
dtype: bool
Both are True, since in both rows there is at least one column that is True.
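The three readings can be put side by side in a short sketch. The variable names are illustrative, and axis is passed by keyword here on the assumption that newer pandas releases no longer accept it positionally for any.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 1], [1, 2, 0, 4, 5]])

# Method calls bind tighter than unary ~, so this is ~((df == 0).any(axis=1)):
# "there is no zero anywhere in the row".
no_zero_in_row = ~(df == 0).any(axis=1)

# Parenthesising the negation first asks "is at least one value in the row non-zero?".
some_nonzero_in_row = (~(df == 0)).any(axis=1)

# By De Morgan's law, the first form is equivalent to all(), not any().
all_nonzero_in_row = (df != 0).all(axis=1)

print(no_zero_in_row.tolist(), some_nonzero_in_row.tolist(), all_nonzero_in_row.tolist())
```

So the result in the question is not an inconsistency between any and all: negating after any answers the same question as all on the negated frame.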
