Pandas dataframe: count consecutive True / False values - python

I have a boolean True/False-column "Mask" in a dataframe, e.g.:
Mask
True
True
True
False
False
True
False
False
Now I am trying to add a column with the count of consecutive True/False rows, where each run of True values is labeled with its length (positive) and each run of False values with its negative length, e.g.
Mask Count
True 3
True 3
True 3
False -2
False -2
True 1
False -2
False -2
I tried it with groupby and sum, but now I've got a knot in my head.
I tried something like
mask.groupby((~mask).cumsum()).cumsum().astype(int)
(mask is the boolean condition), but this only counts the True values and produces a running count rather than the signed run length.
Would really appreciate any suggestions!

You can assign a group number to each run of consecutive True/False values with .cumsum() and store it in g.
Then group by g and get the size of each group with .transform('size'). Set the sign by multiplying by np.where(df['Mask'], 1, -1), which returns 1 for True and -1 for False, as follows:
import numpy as np

g = df['Mask'].ne(df['Mask'].shift()).cumsum()
df['Count'] = df.groupby(g)['Mask'].transform('size') * np.where(df['Mask'], 1, -1)
Result:
print(df)
Mask Count
0 True 3
1 True 3
2 True 3
3 False -2
4 False -2
5 True 1
6 False -2
7 False -2
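To see how the grouping works, here is a self-contained sketch of the approach above on the example data, printing the intermediate group labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Mask': [True, True, True, False, False, True, False, False]})

# A new run starts wherever the value differs from the previous row,
# so the cumulative sum of those change points labels each run.
g = df['Mask'].ne(df['Mask'].shift()).cumsum()
print(g.tolist())  # → [1, 1, 1, 2, 2, 3, 4, 4]

# Size of each run, signed by the run's value.
df['Count'] = df.groupby(g)['Mask'].transform('size') * np.where(df['Mask'], 1, -1)
print(df['Count'].tolist())  # → [3, 3, 3, -2, -2, 1, -2, -2]
```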

Related

Count number of consecutive True in column, restart when False

I work with the following column in a pandas df:
A
True
True
True
False
True
True
I want to add column B that counts the number of consecutive "True" values in A, restarting every time a "False" comes up. Desired output:
A B
True 1
True 2
True 3
False 0
True 1
True 2
Using cumsum, identify the blocks of rows where column A stays True; then group column A by these blocks and take the cumulative sum within each block to assign the ordinal numbers:
df['B'] = df['A'].groupby((~df['A']).cumsum()).cumsum()
A B
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
5 True 2
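As a self-contained check of the one-liner above, with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'A': [True, True, True, False, True, True]})

# (~df['A']).cumsum() increments at every False, so each False together
# with the run of Trues that follows it forms one group; the cumulative
# sum within a group numbers the Trues 1, 2, 3, ... and leaves False at 0.
df['B'] = df['A'].groupby((~df['A']).cumsum()).cumsum()
print(df)
```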
Using a simple & native approach
(for a small DataFrame it works fine):
import pandas as pd

df = pd.DataFrame({'A': [True, False, True, True, True, False, True, True]})

class ToNums:
    counter = 0

    @staticmethod
    def convert(bool_val):
        if bool_val:
            ToNums.counter += 1
        else:
            ToNums.counter = 0
        return ToNums.counter

df['B'] = df.A.map(ToNums.convert)
df
A B
0 True 1
1 False 0
2 True 1
3 True 2
4 True 3
5 False 0
6 True 1
7 True 2
Here's an example
v = 0
for i, val in enumerate(df['A']):
    if val:  # val is already a bool; comparing it to the string "True" would always be False
        df.loc[i, "C"] = v = v + 1
    else:
        df.loc[i, "C"] = v = 0
df.head()
This will give the desired output
A C
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
You can use a combination of groupby, cumsum, and cumcount:
df['B'] = (df.groupby((df['A'] & ~df['A'].shift(1, fill_value=False))  # row is True and previous row is False: run start
                      .cumsum()   # make group id
                      )
             .cumcount().add(1)   # cumulative count within each group
           * df['A']              # multiply by 0 where A is False, 1 otherwise
           )
output:
A B
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
5 True 2

In Python, how to shift and fill all the shifted rows in a DataFrame with specific values?

I have a following dataframe.
y = pd.DataFrame(np.zeros((10,1), dtype = 'bool'), columns = ['A'])
y.iloc[[3,5], 0] = True
A
0 False
1 False
2 False
3 True
4 False
5 True
6 False
7 False
8 False
9 False
And I want each True to also fill the next two rows with True, wherever a True is found in the above DataFrame. The expected result is shown below.
A
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 False
I can do that in the following way, but I wonder if there is a smarter way to do so.
y['B'] = y['A'].shift()
y['C'] = y['B'].shift()
y['D'] = y.any(axis = 1)
y['A'] = y['D']
y = y['A']
Thank you for the help in advance.
Use the limit parameter of forward fill: replace False with missing values, forward-fill with limit=2, and finally replace the remaining NaNs with False:
y.A = y.A.replace(False, np.nan).ffill(limit=2).fillna(False)
print (y)
A
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 False
Another idea with Rolling.apply and any, testing for at least one True per window:
y.A = y.A.rolling(3, min_periods=1).apply(lambda x: x.any()).astype(bool)
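A runnable sketch of the forward-fill approach on the question's data (the final astype(bool) is my addition, to restore the boolean dtype after the NaN round-trip):

```python
import numpy as np
import pandas as pd

y = pd.DataFrame(np.zeros((10, 1), dtype='bool'), columns=['A'])
y.iloc[[3, 5], 0] = True

# Replace False with NaN so that forward fill only propagates True;
# limit=2 extends each True over at most the next two rows.
y['A'] = y['A'].replace(False, np.nan).ffill(limit=2).fillna(False).astype(bool)
print(y['A'].tolist())  # → [False, False, False, True, True, True, True, True, False, False]
```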

Mask DataFrame with list of values as condition

My DataFrame contains multiple time series, I want to flag whenever a point in each time series goes one standard deviation above the mean.
df = pd.DataFrame(np.random.rand(3, 10), index=['ts_A', 'ts_B','ts_C'])
std = df.std(axis=1)
mean = df.mean(axis=1)
And then I was hoping to be able to do:
df.mask(df > (std + mean), 'True', inplace=True)
Which should return the original DataFrame where any value which is more than one standard deviation above the mean for that row/time series is replaced by True.
However, this instead returns False for every element. If I use df.where instead, the entire DataFrame gets filled with True.
I could do this by iterating through the index and masking each row in turn but I'm sure there must be a better way.
Using gt with axis=0
df.mask(df.gt(std + mean,axis=0), 'True', inplace=True)
df
0 1 2 3 4 5 6
ts_A 0.003797 0.060297 0.265496 0.442663 True 0.498443 0.436738
ts_B 0.127535 0.644332 True 0.079317 0.0411021 True 0.830672
ts_C 0.693698 0.429689 0.371802 0.312407 0.0555868 True True
7 8 9
ts_A 0.403529 0.392445 0.238355
ts_B 0.732539 0.030451 0.895976
ts_C 0.907143 0.912002 0.098821
If you need the result as True/False values:
TorF=df.gt(std + mean,axis=0)
TorF
Out[31]:
0 1 2 3 4 5 6 7 8 9
ts_A False False False False True False False False False False
ts_B False False True False False True False False False False
ts_C False False False False False True True False False False
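A reproducible sketch of the gt(..., axis=0) comparison, with fixed numbers instead of np.random.rand so the flagged cells are deterministic:

```python
import pandas as pd

df = pd.DataFrame([[0.1, 0.2, 0.9],
                   [0.5, 0.5, 0.5],
                   [0.0, 0.0, 1.0]],
                  index=['ts_A', 'ts_B', 'ts_C'])

mean = df.mean(axis=1)
std = df.std(axis=1)

# axis=0 aligns the per-row Series (mean + std) along the index,
# so each row is compared against its own threshold.
flags = df.gt(mean + std, axis=0)
print(flags)
```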

What's the fastest way to loop through a DataFrame and count occurrences within the DataFrame whilst some condition is fulfilled (in Python)?

I have a dataframe with two Boolean fields (as below).
import pandas as pd
d = [{'a1':False, 'a2':False}, {'a1':True, 'a2':False}, {'a1':True, 'a2':False}, {'a1':False, 'a2':False}, {'a1':False, 'a2':True},
{'a1': False, 'a2': False}, {'a1':False, 'a2':False}, {'a1':True, 'a2':False}, {'a1':False, 'a2':True}, {'a1':False, 'a2':False},]
df = pd.DataFrame(d)
df
Out[1]:
a1 a2
0 False False
1 True False
2 True False
3 False False
4 False True
5 False False
6 False False
7 True False
8 False True
9 False False
I am trying to find the fastest and most "Pythonic" way of achieving the following:
If a1==True, count instances from current row where a2==False (e.g. row 1: a1=True, a2 is False for three rows from row 1)
At first instance of a2==True, stop counting (e.g. row 4, count = 3)
Set value of 'count' to new df column 'a3' on row where counting began (e.g. 'a3' = 3 on row 1)
Target result set as follows.
a1 a2 a3
0 False False 0
1 True False 3
2 True False 2
3 False False 0
4 False True 0
5 False False 0
6 False False 0
7 True False 1
8 False True 0
9 False False 0
I have been trying to accomplish this using for loops, iterrows and while loops and so far haven't been able to produce a good nested combination which provides the results I want. Any help appreciated. I apologize if the problem is not totally clear.
How about this:
df['a3'] = df.apply(lambda x: 0 if not x.a1 else len(df.a2[x.name:df.a2.tolist()[x.name:].index(True)+x.name]), axis=1)
So, if a1 is False, write 0; otherwise write the length of the a2 slice that runs from that row up to the next True. (Note this assumes a later True exists in a2; otherwise .index(True) raises a ValueError.)
This will do the trick:
df['a3'] = 0
# loop through every value of 'a1'
for i in range(len(df['a1'])):
    # if 'a1' at position i is True...
    if df['a1'][i]:
        count = 0
        # loop over the remaining items in 'a2', starting at position i
        for j in range(len(df['a2']) - i):
            # count the consecutive False values in 'a2'...
            if not df['a2'][j + i]:
                count += 1
            else:
                # ...and stop at the first True
                break
        # write the count at position i in 'a3'
        df.loc[i, 'a3'] = count
and produce the following output:
a1 a2 a3
0 False False 0
1 True False 3
2 True False 2
3 False False 0
4 False True 0
5 False False 0
6 False False 0
7 True False 1
8 False True 0
9 False False 0
Edit: added comments in the code
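For larger frames, a vectorized alternative to the loop (my sketch, not from the answers above): a2.cumsum() increments at each True, so every group of equal cumsum values ends just before the next True in a2, and within a group the number of rows remaining (including the current one) is exactly the count the question asks for. This assumes each counting run is eventually terminated by a True in a2, as in the example; a trailing run would simply count to the end of the frame.

```python
import pandas as pd

d = [{'a1': False, 'a2': False}, {'a1': True, 'a2': False}, {'a1': True, 'a2': False},
     {'a1': False, 'a2': False}, {'a1': False, 'a2': True}, {'a1': False, 'a2': False},
     {'a1': False, 'a2': False}, {'a1': True, 'a2': False}, {'a1': False, 'a2': True},
     {'a1': False, 'a2': False}]
df = pd.DataFrame(d)

# Group id increments at every True in a2, so each group runs up to
# (but not including) the next True in a2.
g = df['a2'].cumsum()

# cumcount(ascending=False) + 1 = rows remaining in the group, counting
# the current row; keep it only where a1 is True, else 0.
df['a3'] = (df.groupby(g).cumcount(ascending=False) + 1).where(df['a1'], 0)
print(df['a3'].tolist())  # → [0, 3, 2, 0, 0, 0, 0, 1, 0, 0]
```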

Broadcasting boolean operations in Pandas

Let's say I have an NxM boolean DataFrame X and an Nx1 boolean DataFrame Y. I would like to perform a boolean operation on each column, returning a new DataFrame that is NxM. For example:
x = pd.DataFrame([[True, True, True], [True, False, True], [False, False, True]])
y = pd.DataFrame([[False], [True], [True]])
I would like x & y to return:
0 1 2
0 False False False
1 True False True
2 False False True
But instead it returns:
0 1 2
0 False NaN NaN
1 True NaN NaN
2 False NaN NaN
Instead treating y as a series with
x & y[0]
gives:
0 1 2
0 False True True
1 False False True
2 False False True
Which appears to be broadcasting by row. Is there a correct way to do this other than transposing, applying the operation with the Series, and then transposing back?
(x.T & y[0]).T
0 1 2
0 False False False
1 True False True
2 False False True
That approach fails, though, when the row index is not the same as the column labels.
You could call apply and pass a lambda and call squeeze to flatten the Series into a 1-D array:
In [152]:
x.apply(lambda s: s & y.squeeze())
Out[152]:
0 1 2
0 False False False
1 True False True
2 False False True
I'm not sure if this is quicker, though. Here we're applying the mask column-wise by calling apply on the df, which is why transposing is unnecessary.
Actually you could use np.logical_and:
In [156]:
np.logical_and(x,y)
Out[156]:
0 1 2
0 False False False
1 True False True
2 False False True
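If you want to sidestep pandas' label alignment entirely, one version-robust option (a sketch) is to drop to NumPy arrays, where the (N, 1) shape broadcasts positionally against (N, M), and rewrap the result:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([[True, True, True], [True, False, True], [False, False, True]])
y = pd.DataFrame([[False], [True], [True]])

# Plain NumPy broadcasting: the (3, 1) array stretches across the columns
# of the (3, 3) array, regardless of index or column labels.
out = np.logical_and(x.to_numpy(), y.to_numpy())
result = pd.DataFrame(out, index=x.index, columns=x.columns)
print(result)
```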
