Mask DataFrame with list of values as condition - python

My DataFrame contains multiple time series, I want to flag whenever a point in each time series goes one standard deviation above the mean.
df = pd.DataFrame(np.random.rand(3, 10), index=['ts_A', 'ts_B','ts_C'])
std = df.std(axis=1)
mean = df.mean(axis=1)
And then I was hoping to be able to do:
df.mask(df > (std + mean), 'True', inplace=True)
This should return the original DataFrame with every value more than one standard deviation above its row's mean replaced by True.
However, it instead returns False for every element. If I use df.where instead, the entire DataFrame gets filled with True.
I could do this by iterating through the index and masking each row in turn but I'm sure there must be a better way.

Use gt with axis=0 so the std + mean Series aligns with the row index:
df.mask(df.gt(std + mean, axis=0), 'True', inplace=True)
df
0 1 2 3 4 5 6
ts_A 0.003797 0.060297 0.265496 0.442663 True 0.498443 0.436738
ts_B 0.127535 0.644332 True 0.079317 0.0411021 True 0.830672
ts_C 0.693698 0.429689 0.371802 0.312407 0.0555868 True True
7 8 9
ts_A 0.403529 0.392445 0.238355
ts_B 0.732539 0.030451 0.895976
ts_C 0.907143 0.912002 0.098821
If you need to return True/False instead:
TorF=df.gt(std + mean,axis=0)
TorF
Out[31]:
0 1 2 3 4 5 6 7 8 9
ts_A False False False False True False False False False False
ts_B False False True False False True False False False False
ts_C False False False False False True True False False False
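Putting the pieces together, here is a minimal runnable sketch of the approach (the data is random, so which cells get flagged will vary between runs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 10), index=['ts_A', 'ts_B', 'ts_C'])

# std + mean is a Series indexed by the row labels; gt(..., axis=0)
# aligns it against the rows of df rather than the columns.
threshold = df.mean(axis=1) + df.std(axis=1)
flags = df.gt(threshold, axis=0)

# flags is a boolean DataFrame of the same shape as df. A plain
# df > threshold would align threshold's labels against the column
# labels instead, which is why the question saw all-False output.
```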

Pandas dataframe: count consecutive True / False values

I have a boolean True/False-column "Mask" in a dataframe, e.g.:
Mask
True
True
True
False
False
True
False
False
Now I am trying to add a column with the length of each run of consecutive True/False values, where runs of True get a positive count (+1 each) and runs of False a negative count (-1 each), e.g.
Mask Count
True 3
True 3
True 3
False -2
False -2
True 1
False -2
False -2
I tried it with groupby and sum but now I got a knot in my head.
Tried something like
mask.groupby((~mask).cumsum()).cumsum().astype(int)
(mask is the condition for the True/False) but this only counts the Trues and does a count instead of showing the sum.
Would really appreciate any suggestions!
You can get a group number for each run of consecutive True/False values with .cumsum() and store it in g.
Then group by g and get the size of each group with .transform('size'). Set the sign by multiplying by the return value (1 or -1) of np.where(), as follows:
g = df['Mask'].ne(df['Mask'].shift()).cumsum()
df['Count'] = df.groupby(g)['Mask'].transform('size') * np.where(df['Mask'], 1, -1)
Result:
print(df)
Mask Count
0 True 3
1 True 3
2 True 3
3 False -2
4 False -2
5 True 1
6 False -2
7 False -2
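As a self-contained check, running those two lines on the question's column reproduces the expected counts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Mask': [True, True, True, False, False, True, False, False]})

# A new group starts whenever Mask differs from the previous row.
g = df['Mask'].ne(df['Mask'].shift()).cumsum()

# Size of each run, negated for the False runs.
df['Count'] = df.groupby(g)['Mask'].transform('size') * np.where(df['Mask'], 1, -1)

print(df['Count'].tolist())  # [3, 3, 3, -2, -2, 1, -2, -2]
```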

Is there a way to select interior True values for portions of a DataFrame?

I have a DataFrame that looks like the following:
df = pd.DataFrame({'a':[True]*5+[False]*5+[True]*5,'b':[False]+[True]*3+[False]+[True]*5+[False]*4+[True]})
a b
0 True False
1 True True
2 True True
3 True True
4 True False
5 False True
6 False True
7 False True
8 False True
9 False True
10 True False
11 True False
12 True False
13 True False
14 True True
How can I select blocks where column a is True only when the interior values over the same rows for column b are True?
I know that I could break the DataFrame apart into consecutive True regions and apply a function to each chunk, but this is for a much larger problem with 10 million+ rows, and I don't think such a solution would scale well.
My expected output would be the following:
a b c
0 True False True
1 True True True
2 True True True
3 True True True
4 True False True
5 False True False
6 False True False
7 False True False
8 False True False
9 False True False
10 True False False
11 True False False
12 True False False
13 True False False
14 True True False
You can do a groupby on the a values and then look at the b values in a function, like this:
groupby_consec_a = df.groupby(df.a.diff().ne(0).cumsum())
all_interior = lambda x: x.iloc[1:-1].all()
df['c'] = df.a & groupby_consec_a.b.transform(all_interior)
Try out whether it's fast enough on your data. If not, the lambda will have to be replaced by pandas functions, but that will be more code.
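For reference, the answer's snippet run end-to-end on the question's frame reproduces the expected c column:

```python
import pandas as pd

df = pd.DataFrame({'a': [True]*5 + [False]*5 + [True]*5,
                   'b': [False] + [True]*3 + [False] + [True]*5 + [False]*4 + [True]})

# Number each run of consecutive equal a values.
groupby_consec_a = df.groupby(df.a.diff().ne(0).cumsum())

# A row gets c=True when a is True and every b value strictly inside
# its run (first and last rows of the run excluded) is True.
all_interior = lambda x: x.iloc[1:-1].all()
df['c'] = df.a & groupby_consec_a.b.transform(all_interior)

print(df['c'].tolist())  # first five rows True, the rest False
```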

In python, how to shift and fill with a specific values for all the shifted rows in DataFrame?

I have a following dataframe.
y = pd.DataFrame(np.zeros((10,1), dtype = 'bool'), columns = ['A'])
y.iloc[[3,5], 0] = True
A
0 False
1 False
2 False
3 True
4 False
5 True
6 False
7 False
8 False
9 False
And I want each True in the dataframe above to also set the following two rows to True. The expected result is shown below.
A
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 False
I can do that in the following way, but I wonder if there is a smarter way to do so.
y['B'] = y['A'].shift()
y['C'] = y['B'].shift()
y['D'] = y.any(axis = 1)
y['A'] = y['D']
y = y['A']
Thank you for the help in advance.
Use forward fill with the limit parameter: first replace False with missing values, forward-fill with limit=2, and finally replace the remaining NaNs with False:
y.A = y.A.replace(False, np.nan).ffill(limit=2).fillna(False)
print (y)
A
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 False
Another idea with Rolling.apply and any, testing for at least one True per window:
y.A = y.A.rolling(3, min_periods=1).apply(lambda x: x.any()).astype(bool)
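The two recipes can be checked against each other. As a small assumption on my part, the sketch below casts to int before rolling (some pandas versions reject bool there) and uses max() in place of apply(lambda x: x.any()), which is equivalent for 0/1 data:

```python
import numpy as np
import pandas as pd

y = pd.DataFrame(np.zeros((10, 1), dtype='bool'), columns=['A'])
y.iloc[[3, 5], 0] = True

# ffill with limit=2 pushes each True forward by at most two rows.
filled = y.A.replace(False, np.nan).ffill(limit=2).fillna(False).astype(bool)

# Backward-looking window of 3: True if the current row or either of
# the two preceding rows is True.
rolled = y.A.astype(int).rolling(3, min_periods=1).max().astype(bool)

print(filled.tolist() == rolled.tolist())  # True
```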

Pandas: series of True or False in a column, select in another column the value when there is a True

My dataframe looks like this:
price High_cross
0.00224311 False
0.00224473 False
0.00224422 False
0.00224697 True
0.00224899 True
0.00224668 True
0.00224967 True
0.00224967 True
0.00224983 True
0.00225143 False
And what I need to do is loop over the column High_cross: when a run of True values starts, take the related price and compare it with the price at the final True of that run. If the first price is below the last price, flag it with True in a new column Movement. In this example it should look something like this:
price High_cross Movement
0.00224311 False
0.00224473 False
0.00224422 False
0.00224697 True True
0.00224899 True
0.00224668 True
0.00224967 True
0.00224967 True
0.00224983 True
0.00225143 False
(because 0.00224983 is bigger than 0.00224697)!
I tried to play with the index but I am relatively stuck... any solution/idea? Thanks
Consider the below df:
price High_cross
0 0.002243 False
1 0.002245 False
2 0.002244 False
3 0.002247 True
4 0.002249 True
5 0.002247 True
6 0.002250 True
7 0.002250 True
8 0.002250 True
9 0.002251 False
10 0.002251 True
11 0.002250 True
Use:
df['identifier'] = df.High_cross.ne(df.High_cross.shift()).cumsum()
df['Movement'] = df[df.High_cross].groupby('identifier')['price'].\
                 transform(lambda x: x.iloc[0] < x.iloc[-1])
print(df.drop(columns='identifier'))
price High_cross Movement
0 0.002243 False NaN
1 0.002245 False NaN
2 0.002244 False NaN
3 0.002247 True True
4 0.002249 True True
5 0.002247 True True
6 0.002250 True True
7 0.002250 True True
8 0.002250 True True
9 0.002251 False NaN
10 0.002251 True False
11 0.002250 True False
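To see what the identifier column does, here is the same recipe on a small hand-made frame (the values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'price': [1.0, 2.0, 3.0, 2.5, 4.0, 3.5],
                   'High_cross': [False, True, True, True, False, True]})

# Each change in High_cross starts a new run; cumsum() numbers the runs.
df['identifier'] = df.High_cross.ne(df.High_cross.shift()).cumsum()

# Within each run of True rows, compare the first price to the last;
# rows outside True runs stay NaN.
df['Movement'] = df[df.High_cross].groupby('identifier')['price'] \
                   .transform(lambda x: x.iloc[0] < x.iloc[-1])
```
Rows 1-3 form one True run (first price 2.0 < last price 2.5, so Movement is True); row 5 is a single-row run, so it compares with itself and gets False.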
I'm not sure if I understood completely what your goal is here. I assumed:
Select price if respective High_Cross == True
Compare the price to the last price where High_Cross == True
Set respective Movement = True if current price < last price[High_Cross == True]
but the following code achieves what I think you want:
import numpy as np

np.random.seed(5)
X = np.random.choice(range(10), size=10, replace=True).tolist()
Y = np.random.randint(2, size=10)
Y = [bool(y) for y in Y]
lst = []
movement = []
# Extract a list of all True values
for price, cross in zip(X, Y):
    # Create a list of tuples
    cross == True and lst.append((price, cross))  # one-liner avoiding the otherwise mandatory else statement
# Now do the work itself
for price, cross in zip(X, Y):
    movement.append(True) if cross == True and price > lst[-1][0] else movement.append(False)
    print("Price=" + str(price) + ", High_Cross=" + str(cross) + ", Movement=" + str(movement[-1]))
Produces:
Price=3, High_Cross=True, Movement=True
Price=6, High_Cross=False, Movement=False
Price=6, High_Cross=True, Movement=True
Price=0, High_Cross=True, Movement=False
Price=9, High_Cross=True, Movement=True
Price=8, High_Cross=True, Movement=True
Price=4, High_Cross=False, Movement=False
Price=7, High_Cross=True, Movement=True
Price=0, High_Cross=False, Movement=False
Price=0, High_Cross=True, Movement=False

Broadcasting boolean operations in Pandas

Let's say I have an NxM boolean dataframe X and an Nx1 boolean dataframe Y. I would like to perform a boolean operation on each column, returning a new dataframe that is NxM. For example:
x = pd.DataFrame([[True, True, True], [True, False, True], [False, False, True]])
y = pd.DataFrame([[False], [True], [True]])
I would like x & y to return:
0 1 2
0 False False False
1 True False True
2 False False True
But instead it returns:
0 1 2
0 False NaN NaN
1 True NaN NaN
2 False NaN NaN
Instead treating y as a series with
x & y[0]
gives:
0 1 2
0 False True True
1 False False True
2 False False True
Which appears to be broadcasting by row. Is there a correct way to do this other than transposing, applying the operation with the Series, and then transposing back?
(x.T & y[0]).T
0 1 2
0 False False False
1 True False True
2 False False True
It seems that this fails when the row index is not the same as the column labels, though.
You could call apply with a lambda and use squeeze to collapse the one-column DataFrame y into a Series:
In [152]:
x.apply(lambda s: s & y.squeeze())
Out[152]:
0 1 2
0 False False False
1 True False True
2 False False True
I'm not sure if this is quicker, though. Here we're applying the mask column-wise by calling apply on the df, which is why transposing is unnecessary.
Actually you could use np.logical_and:
In [156]:
np.logical_and(x,y)
Out[156]:
0 1 2
0 False False False
1 True False True
2 False False True
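A shape-based variant (my sketch, not from the answers above) sidesteps pandas label alignment entirely by broadcasting y's underlying array to x's shape before applying &; this assumes the rows of x and y are in the same positional order:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([[True, True, True], [True, False, True], [False, False, True]])
y = pd.DataFrame([[False], [True], [True]])

# Expand y's (3, 1) array to (3, 3) so & sees matching shapes and
# pandas never tries to align the column labels.
mask = np.broadcast_to(y.to_numpy(), x.shape)
result = x & mask

print(result)
```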
