My DataFrame contains multiple time series, and I want to flag whenever a point in any of them goes more than one standard deviation above that series' mean.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 10), index=['ts_A', 'ts_B', 'ts_C'])
std = df.std(axis=1)
mean = df.mean(axis=1)
And then I was hoping to be able to do:
df.mask(df > (std + mean), 'True', inplace=True)
Which should return the original DataFrame where any value which is more than one standard deviation above the mean for that row/time series is replaced by True.
However, this instead returns False for every element. If I use df.where instead, the entire DataFrame gets filled with True.
I could do this by iterating through the index and masking each row in turn but I'm sure there must be a better way.
Use gt with axis=0:
df.mask(df.gt(std + mean, axis=0), 'True', inplace=True)
df
0 1 2 3 4 5 6
ts_A 0.003797 0.060297 0.265496 0.442663 True 0.498443 0.436738
ts_B 0.127535 0.644332 True 0.079317 0.0411021 True 0.830672
ts_C 0.693698 0.429689 0.371802 0.312407 0.0555868 True True
7 8 9
ts_A 0.403529 0.392445 0.238355
ts_B 0.732539 0.030451 0.895976
ts_C 0.907143 0.912002 0.098821
If you need the True/False values themselves instead:
TorF=df.gt(std + mean,axis=0)
TorF
Out[31]:
0 1 2 3 4 5 6 7 8 9
ts_A False False False False True False False False False False
ts_B False False True False False True False False False False
ts_C False False False False False True True False False False
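A minimal, self-contained sketch of the same gt-with-axis=0 idea, using hand-picked values instead of the random frame above so the flagged cell is predictable:

```python
import numpy as np
import pandas as pd

# Each row is one time series; values chosen so only one point is an outlier.
df = pd.DataFrame(
    [[1.0, 1.0, 1.0, 10.0],
     [2.0, 2.0, 2.0, 2.0]],
    index=['ts_A', 'ts_B'],
)

mean = df.mean(axis=1)  # per-row mean
std = df.std(axis=1)    # per-row sample standard deviation (ddof=1)

# gt(..., axis=0) aligns the Series (mean + std) against the row index,
# so each row is compared to its own threshold.
flags = df.gt(mean + std, axis=0)
print(flags)
```

For ts_A the threshold is 3.25 + 4.5 = 7.75, so only the 10.0 is flagged; ts_B has zero deviation, so nothing is flagged.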
I have a boolean True/False column "Mask" in a DataFrame, e.g.:
Mask
True
True
True
False
False
True
False
False
Now I am trying to add a column with the count of consecutive True/False rows, where True runs count positively (+1 each) and False runs count negatively (-1 each), e.g.
Mask Count
True 3
True 3
True 3
False -2
False -2
True 1
False -2
False -2
I tried it with groupby and sum but now I got a knot in my head.
Tried something like
mask.groupby((~mask).cumsum()).cumsum().astype(int)
(mask is the True/False condition), but this only counts the Trues and produces a running count rather than the signed run length.
Would really appreciate any suggestions!
You can get a group number for each run of consecutive True/False values with .cumsum() and store it in g.
Then group by g and get the size of each group with .transform('size'). Set the sign by multiplying by the 1 or -1 returned by np.where(), as follows:
g = df['Mask'].ne(df['Mask'].shift()).cumsum()
df['Count'] = df.groupby(g)['Mask'].transform('size') * np.where(df['Mask'], 1, -1)
Result:
print(df)
Mask Count
0 True 3
1 True 3
2 True 3
3 False -2
4 False -2
5 True 1
6 False -2
7 False -2
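Putting the two steps together as a self-contained sketch on the sample column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Mask': [True, True, True, False, False, True, False, False]})

# A new group starts whenever Mask differs from the previous row.
g = df['Mask'].ne(df['Mask'].shift()).cumsum()

# Size of each run, with the sign taken from the run's boolean value.
df['Count'] = df.groupby(g)['Mask'].transform('size') * np.where(df['Mask'], 1, -1)
print(df['Count'].tolist())  # [3, 3, 3, -2, -2, 1, -2, -2]
```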
I have a DataFrame that looks like the following:
df = pd.DataFrame({'a':[True]*5+[False]*5+[True]*5,'b':[False]+[True]*3+[False]+[True]*5+[False]*4+[True]})
a b
0 True False
1 True True
2 True True
3 True True
4 True False
5 False True
6 False True
7 False True
8 False True
9 False True
10 True False
11 True False
12 True False
13 True False
14 True True
How can I select blocks where column a is True only when the interior values over the same rows for column b are True?
I know that I could break the DataFrame apart into consecutive True regions and apply a function to each chunk, but this is for a much larger problem with 10 million+ rows, and I don't think such a solution would scale well.
My expected output would be the following:
a b c
0 True False True
1 True True True
2 True True True
3 True True True
4 True False True
5 False True False
6 False True False
7 False True False
8 False True False
9 False True False
10 True False False
11 True False False
12 True False False
13 True False False
14 True True False
You can do a groupby on the a values and then look at the b values in a function, like this:
groupby_consec_a = df.groupby(df.a.diff().ne(0).cumsum())
all_interior = lambda x: x.iloc[1:-1].all()
df['c'] = df.a & groupby_consec_a.b.transform(all_interior)
Try out whether it's fast enough on your data. If not, the lambda will have to be replaced by pandas functions, but that will be more code.
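A self-contained sketch of this approach on the sample frame; here the runs are labeled with ne/shift, which is equivalent to the diff-based labeling and avoids subtracting booleans:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [True] * 5 + [False] * 5 + [True] * 5,
    'b': [False] + [True] * 3 + [False] + [True] * 5 + [False] * 4 + [True],
})

# Label consecutive runs of column a.
groups = df.a.ne(df.a.shift()).cumsum()

# True when every value of b strictly inside the run (first and last excluded) is True.
all_interior = lambda s: s.iloc[1:-1].all()

df['c'] = df.a & df.groupby(groups)['b'].transform(all_interior)
print(df['c'].tolist())
```

Only the first run of a (rows 0-4) has all-True interior b values, so only those rows get c == True.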
I have the following DataFrame.
y = pd.DataFrame(np.zeros((10,1), dtype = 'bool'), columns = ['A'])
y.iloc[[3,5], 0] = True
A
0 False
1 False
2 False
3 True
4 False
5 True
6 False
7 False
8 False
9 False
And I want to set True for the next two rows after wherever True is found in the above DataFrame. The expected result is shown below.
A
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 False
I can do that in the following way, but I wonder if there is a smarter way to do so.
y['B'] = y['A'].shift()
y['C'] = y['B'].shift()
y['D'] = y.any(axis = 1)
y['A'] = y['D']
y = y['A']
Thank you for the help in advance.
Use the limit parameter of forward fill: first replace False with missing values, forward-fill with limit=2, and finally replace the remaining NaNs with False:
y.A = y.A.replace(False, np.nan).ffill(limit=2).fillna(False)
print (y)
A
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 False
Another idea uses Rolling.apply with any to test for at least one True per window:
y.A = y.A.rolling(3, min_periods=1).apply(lambda x: x.any()).astype(bool)
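A self-contained check of the ffill-with-limit approach on the sample frame (a sketch; the trailing astype(bool) just makes the dtype explicit after the object upcast caused by the NaNs):

```python
import numpy as np
import pandas as pd

y = pd.DataFrame(np.zeros((10, 1), dtype=bool), columns=['A'])
y.iloc[[3, 5], 0] = True

# Replace False with NaN, forward-fill at most 2 rows past each True,
# then restore False for anything still missing.
out = y['A'].replace(False, np.nan).ffill(limit=2).fillna(False).astype(bool)
print(out.tolist())
```

The True at index 3 propagates to rows 4-5 and the True at index 5 to rows 6-7, so rows 3-7 end up True.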
My dataframe looks like this:
price High_cross
0.00224311 False
0.00224473 False
0.00224422 False
0.00224697 True
0.00224899 True
0.00224668 True
0.00224967 True
0.00224967 True
0.00224983 True
0.00225143 False
What I need to do is loop over the column High_cross; when there is a True, select the corresponding price and compare it with the price at the final True of that run. If the first price is below the second, flag it with True in a new column Movement. In this example it should look something like this:
price High_cross Movement
0.00224311 False
0.00224473 False
0.00224422 False
0.00224697 True True
0.00224899 True
0.00224668 True
0.00224967 True
0.00224967 True
0.00224983 True
0.00225143 False
(because 0.00224983 is bigger than 0.00224697)!
I tried to play with the index but I am relatively stuck... any solution/idea? Thanks
Consider the below df:
price High_cross
0 0.002243 False
1 0.002245 False
2 0.002244 False
3 0.002247 True
4 0.002249 True
5 0.002247 True
6 0.002250 True
7 0.002250 True
8 0.002250 True
9 0.002251 False
10 0.002251 True
11 0.002250 True
Use:
df['identifier']=(df.High_cross.ne(df.High_cross.shift())).cumsum()
df['Movement']=df[df.High_cross].groupby('identifier')['price'].\
transform(lambda x: x.iloc[0]<x.iloc[-1])
print(df.drop('identifier', axis=1))
price High_cross Movement
0 0.002243 False NaN
1 0.002245 False NaN
2 0.002244 False NaN
3 0.002247 True True
4 0.002249 True True
5 0.002247 True True
6 0.002250 True True
7 0.002250 True True
8 0.002250 True True
9 0.002251 False NaN
10 0.002251 True False
11 0.002250 True False
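As a self-contained sketch of the identifier/transform approach on the ten-row frame from the question (without the extra rows 10-11 above):

```python
import pandas as pd

df = pd.DataFrame({
    'price': [0.002243, 0.002245, 0.002244, 0.002247, 0.002249,
              0.002247, 0.002250, 0.002250, 0.002250, 0.002251],
    'High_cross': [False, False, False, True, True,
                   True, True, True, True, False],
})

# Label consecutive runs of High_cross.
df['identifier'] = df.High_cross.ne(df.High_cross.shift()).cumsum()

# For rows inside a True run, compare the run's first price with its last.
df['Movement'] = (
    df[df.High_cross]
    .groupby('identifier')['price']
    .transform(lambda s: s.iloc[0] < s.iloc[-1])
)
print(df.drop('identifier', axis=1))
```

Rows 3-8 form one True run whose first price (0.002247) is below its last (0.002250), so they all get Movement == True; the False rows stay NaN.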
I'm not sure I understood completely what your goal is here. I assumed:
Select price if respective High_Cross == True
Compare the price to the last price where High_Cross == True
Set respective Movement = True if current price < last price[High_Cross == True]
but the following code achieves what I think you want:
import numpy as np

np.random.seed(5)
X = np.random.choice(range(10), size=10, replace=True).tolist()
Y = np.random.randint(2, size=10)
Y = [bool(y) for y in Y]

lst = []
movement = []

# Extract a list of (price, cross) tuples for all True values
for price, cross in zip(X, Y):
    cross and lst.append((price, cross))  # one-liner that avoids an explicit if/else

# Now do the work itself
for price, cross in zip(X, Y):
    movement.append(True) if cross and price > lst[-1][0] else movement.append(False)
    print("Price=" + str(price) + ", High_Cross=" + str(cross) + ", Movement=" + str(movement[-1]))
Produces:
Price=3, High_Cross=True, Movement=True
Price=6, High_Cross=False, Movement=False
Price=6, High_Cross=True, Movement=True
Price=0, High_Cross=True, Movement=False
Price=9, High_Cross=True, Movement=True
Price=8, High_Cross=True, Movement=True
Price=4, High_Cross=False, Movement=False
Price=7, High_Cross=True, Movement=True
Price=0, High_Cross=False, Movement=False
Price=0, High_Cross=True, Movement=False
Let's say I have an NxM boolean DataFrame x and an Nx1 boolean DataFrame y. I would like to perform a boolean operation on each column, returning a new DataFrame that is NxM. For example:
x = pd.DataFrame([[True, True, True], [True, False, True], [False, False, True]])
y = pd.DataFrame([[False], [True], [True]])
I would like x & y to return:
0 1 2
0 False False False
1 True False True
2 False False True
But instead it returns:
0 1 2
0 False NaN NaN
1 True NaN NaN
2 False NaN NaN
Treating y as a Series instead, with
x & y[0]
gives:
0 1 2
0 False True True
1 False False True
2 False False True
Which appears to broadcast by row. Is there a correct way to do this other than transposing, applying the operation with the Series, and then transposing back?
(x.T & y[0]).T
0 1 2
0 False False False
1 True False True
2 False False True
It seems that this fails when the row index does not match the column labels.
You could call apply with a lambda and use squeeze to flatten the one-column DataFrame y into a Series:
In [152]:
x.apply(lambda s: s & y.squeeze())
Out[152]:
0 1 2
0 False False False
1 True False True
2 False False True
I'm not sure if this is quicker, though. Here we're applying the mask column-wise by calling apply on the DataFrame, which is why transposing is unnecessary.
Actually you could use np.logical_and:
In [156]:
np.logical_and(x,y)
Out[156]:
0 1 2
0 False False False
1 True False True
2 False False True
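As another option (a sketch that sidesteps pandas label alignment entirely by dropping to NumPy arrays), the (N, 1) array broadcasts across the columns of the (N, M) array under ordinary NumPy rules:

```python
import pandas as pd

x = pd.DataFrame([[True, True, True],
                  [True, False, True],
                  [False, False, True]])
y = pd.DataFrame([[False], [True], [True]])

# x.values has shape (3, 3) and y.values has shape (3, 1),
# so & broadcasts y down each column of x; rewrap to keep labels.
result = pd.DataFrame(x.values & y.values, index=x.index, columns=x.columns)
print(result)
```

This avoids both the label-based misalignment of x & y and the transpose round-trip.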