Python Pandas: Vectorized Way of Cleaning Buy and Sell Signals

I'm trying to simulate financial trades using a vectorized approach in python. Part of this includes removing duplicate signals.
To elaborate, I've developed a buy_signal column and a sell_signal column. These columns contain booleans in the form of 1s and 0s.
Looking at the signals from the top down, I don't want a second buy_signal to trigger before a sell_signal does, i.e. while a 'position' is open. The same goes for sell signals: I do not want duplicate sell signals while a 'position' is closed. If both buy_signal and sell_signal are 1 on the same row, set them both to 0.
What is the best way to remove these irrelevant signals?
Here's an example:
import pandas as pd
df = pd.DataFrame(
    {
        "buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
        "sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    }
)
print(df)
buy_signal sell_signal
0 1 0
1 1 0
2 1 1
3 1 1
4 0 1
5 0 0
6 1 0
7 1 0
8 1 1
9 0 0
Here's the result I want:
buy_signal sell_signal
0 1 0
1 0 0
2 0 1
3 0 0
4 0 0
5 0 0
6 1 0
7 0 0
8 0 1
9 0 0

As I said earlier (in a comment on a since-deleted answer), one must consider the interaction between buy and sell signals; one cannot simply operate on each independently.
The key idea is to consider a quantity q (the "position"): the amount currently held, which the OP would like bounded to [0, 1]. After cleaning, that quantity is cumsum(buy - sell).
The problem therefore reduces to a "cumulative sum with limits", which unfortunately cannot be expressed in a vectorized way with numpy or pandas, but which we can code quite efficiently with numba. The code below processes 1 million rows in about 37 ms.
import numpy as np
from numba import njit

@njit
def cumsum_clip(a, xmin=-np.inf, xmax=np.inf):
    res = np.empty_like(a)
    c = 0
    for i in range(len(a)):
        c = min(max(c + a[i], xmin), xmax)
        res[i] = c
    return res

def clean_buy_sell(df, xmin=0, xmax=1):
    # model the quantity held: cumulative sum of buy - sell, clipped to
    # [xmin, xmax]
    # note that, when buy and sell are equal, there is no change
    q = cumsum_clip(
        (df['buy_signal'] - df['sell_signal']).values,
        xmin=xmin, xmax=xmax)

    # derive actual transactions: positive for buy, negative for sell, 0 for hold
    trans = np.diff(np.r_[0, q])
    df = df.assign(
        buy_signal=np.clip(trans, 0, None),
        sell_signal=np.clip(-trans, 0, None),
    )
    return df
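As an aside (my addition, not part of the original answer): if numba is not installed, a plain-Python stand-in for cumsum_clip can be used with clean_buy_sell unchanged; it is functionally identical, just much slower. A minimal sketch:
def cumsum_clip_nopython(a, xmin=-np.inf, xmax=np.inf):
    # same clipped running sum as above, without numba
    res = np.empty(len(a), dtype=float)
    c = 0.0
    for i, x in enumerate(a):
        c = min(max(c + x, xmin), xmax)
        res[i] = c
    return res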
Now:
df = pd.DataFrame(
    {
        "buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
        "sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    }
)
new_df = clean_buy_sell(df)
>>> new_df
buy_signal sell_signal
0 1 0
1 0 0
2 0 0
3 0 0
4 0 1
5 0 0
6 1 0
7 0 0
8 0 0
9 0 0
Speed and correctness
n = 1_000_000
np.random.seed(0) # repeatable example
df = pd.DataFrame(np.random.choice([0, 1], (n, 2)),
columns=['buy_signal', 'sell_signal'])
%timeit clean_buy_sell(df)
37.3 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Correctness tests:
z = clean_buy_sell(df)
q = (z['buy_signal'] - z['sell_signal']).cumsum()
# q is quantity held through time; must be in {0, 1}
assert q.isin({0, 1}).all()
# we should not have introduced any new buy signal:
# check that any buy == 1 in z was also 1 in df
assert not (z['buy_signal'] & ~df['buy_signal']).any()
# same for sell signal:
assert not (z['sell_signal'] & ~df['sell_signal']).any()
# finally, buy and sell should never be 1 on the same row:
assert not (z['buy_signal'] & z['sell_signal']).any()
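Not part of the original answer, but once the signals are cleaned, pairing entries with exits is straightforward, because buys and sells now strictly alternate (starting with a buy). A minimal sketch using the cleaned frame z from above:
entries = np.flatnonzero(z['buy_signal'].to_numpy())   # row positions of buys
exits = np.flatnonzero(z['sell_signal'].to_numpy())    # row positions of sells
# zip drops a final unmatched entry if the last position is still open
trades = list(zip(entries, exits))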
Bonus: other limits, fractional buys and sells
For fun, we can consider the more general case where buy and sell values are fractional (or any float value), and the limits are not [0, 1]. Nothing needs to change in the current version of clean_buy_sell, which is general enough to handle these conditions.
np.random.seed(0)
df = pd.DataFrame(
    np.random.uniform(0, 1, (100, 2)),
    columns=['buy_signal', 'sell_signal'],
)
# set limits to -1, 2: we can sell short (borrow) up to 1 unit
# and own up to 2 units.
z = clean_buy_sell(df, -1, 2)
(z['buy_signal'] - z['sell_signal']).cumsum().plot()
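A quick check (my addition) that the position indeed stays inside the wider bounds; a tiny tolerance is used since the signals are now floats:
q = (z['buy_signal'] - z['sell_signal']).cumsum()
assert q.between(-1 - 1e-9, 2 + 1e-9).all()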

Related

To calculate the number of times the two dataframe columns are equal to -1 at the same time

I have two dataframe columns containing sequences of 0 and -1.
Using the count method I can calculate the number of times the 1st column equals -1 (3 times) and the number of times the 2nd column equals -1 (2 times). What I actually want is the number of times both columns x and y equal -1 simultaneously (which should be 1 in the given example). Something like count = df1['x'][df1['x'] == df1['y'] == -1].count() comes to mind, but I cannot put two conditions directly into count.
Is there a simple way to do it (using count or some other workaround)?
Thanks in advance!
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
df1 = pd.DataFrame({
    "x": [0, 0, 0, -1, 0, -1, 0, 0, 0, 0, 0, 0, 0, -1, 0],
    "y": [0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, -1, 0],
})
df1
x y
0 0 0
1 0 0
2 0 0
3 -1 0
4 0 0
5 -1 0
6 0 -1
7 0 0
8 0 0
9 0 0
10 0 0
11 0 0
12 0 0
13 -1 -1
14 0 0
count = df1['x'][df1['x'] == -1].count()
count
3
count = df1['y'][df1['y'] == -1].count()
count
2
You can use eq + all to get a boolean Series that returns True if both columns are equal to -1 at the same time. Then sum fetches the total:
out = df1[['x','y']].eq(-1).all(axis=1).sum()
Output:
1
Sum x and y and count the rows where they add up to -2, i.e. where both are -1:
(df1.x + df1.y).eq(-2).sum()
1
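For completeness, the two conditions the OP tried to combine can also be written explicitly with the element-wise & operator (equivalent to the answers above):
count = ((df1['x'] == -1) & (df1['y'] == -1)).sum()
count
1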

Alternately fill numpy array between non-zero values

I have a binary numpy array, mostly zero-valued, and I want to fill the gaps between non-zero values with a given value, but in an alternating way.
For example:
[0,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0]
should result in either
[0,0,1,1,1,1,1,1,0,0,1,1,0,0,0,0,0,1,1,1,0,0,0,0,1,1,1,1,0,0]
or
[1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,0,1,1,1]
The idea is: while scanning the array left to right, fill 0 values with 1 up to the next 1, if you didn't do so up to the previous 1.
I can do this iteratively in the following way:
A = np.array([0,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0])
ones_index = np.where(A == 1)[0]
begins = ones_index[::2]   # beginnings of filling sections
ends = ones_index[1::2]    # ends of filling sections
from itertools import zip_longest
# fill those sections
for begin, end in zip_longest(begins, ends, fillvalue=len(A)):
    A[begin:end] = 1
but I'm looking for a more efficient solution, maybe with numpy broadcasting. Any ideas?
One nice answer to this question is that we can produce the first result via np.logical_xor.accumulate(arr) | arr and the second via ~np.logical_xor.accumulate(arr) | arr. A quick demonstration:
A = np.array([0,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0])
print(np.logical_xor.accumulate(A) | A)
print(~np.logical_xor.accumulate(A) | A)
The resulting output:
[0 0 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0]
[1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1]
An alternative one-liner uses the parity of the cumulative sum (with arr the input array A); it reproduces the first result:
np.where(arr.cumsum() % 2 == 1, 1, arr)
# array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
#        0, 0, 1, 1, 1, 1, 0, 0])
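By the same parity logic, the second (complementary) result follows from filling where the cumulative sum is even instead of odd (my addition, checked against the second output shown above):
np.where(A.cumsum() % 2 == 0, 1, A)
# reproduces the second result:
# array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
#        1, 1, 1, 0, 0, 1, 1, 1])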

Pandas: create column based on first 'off' occurrence in column B after signal in column A

I have column A with a signal-on flag (== 1) and column B with a signal-off flag (== 1); the rest of the values are zero.
data = {'A': [1, 0, 0, 0, 0, 1, 0],
'B': [1, 0, 1, 1, 0, 0, 1]}
df = pd.DataFrame.from_dict(data)
I need to create a column C where:
C = 1 starting at each row where A == 1 (regardless of B on that row),
and C stays 1 until B == 1, after which C = 0.
Here is what the result should be:
df['C'] = [1, 1, 0, 0, 0, 1, 0]
I used
df.loc[df['A'] == 1, 'C'] = 1
to set C to 1 on the rows where A == 1, but I cannot find a way to locate the first non-zero value in column B after the on-signal in A, and to fill the rest with zeros until the next 1 in A.
You can use mask together with transform('idxmax'). The mask here sets B to 0 where A equals 1, since C will be 1 on those rows no matter what B is.
df['C']=(df.index<df.B.mask(df.A.eq(1),0).groupby(df.A.cumsum()).transform('idxmax')).astype(int)
df
A B C
0 1 1 1
1 0 0 1
2 0 1 0
3 0 1 0
4 0 0 0
5 1 0 1
6 0 1 0
Update
s=df.B.mask(df.A.eq(1),0)
s=(s==1)&(s.shift(-1)==0)
df['C']=(df.index<s.groupby(df.A.cumsum()).transform('idxmax')).astype(int)
df.loc[df.A==1,'C']=1
Hello and welcome to Stack Overflow.
This is a case you usually wouldn't use pandas for, as the value of C depends on previous rows, and pandas is more about applying "split-apply-combine" to independent measurements.
If it is not runtime-critical I would probably write a plain old loop for this:
In [4]: C = []
   ...: signal = 0
   ...: for _, row in df.iterrows():
   ...:     if (signal == 1) and (row.B == 1):
   ...:         signal = 0
   ...:     elif row.A == 1:
   ...:         signal = 1
   ...:     C.append(signal)
   ...:
In [5]: C
Out[5]: [1, 1, 0, 0, 0, 1, 0]
In [6]: df['C'] = C
In [7]: df
Out[7]:
A B C
0 1 1 1
1 0 0 1
2 0 1 0
3 0 1 0
4 0 0 0
5 1 0 1
6 0 1 0
This won't have great performance, but imho it is worth it to cleanly express the intent of your code, as long as it is still "fast enough".
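If the loop ever does become a bottleneck, one low-effort speedup (my addition, same logic as above) is to iterate over plain column values with zip instead of iterrows:
C = []
signal = 0
for a, b in zip(df['A'], df['B']):
    if signal == 1 and b == 1:
        signal = 0
    elif a == 1:
        signal = 1
    C.append(signal)
df['C'] = C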
A solution based on iterrows (as proposed in one of the other answers) may be too slow.
Define the following function, which computes the output signal for a group of input rows (each group starting at a row where A == 1):
def signal(grp):
    return pd.Series(
        np.equal(np.where(grp.A == 1, 0, grp.B).cumsum(), 0).astype(int),
        index=grp.index)
Then group df and apply this function:
df['C'] = df.groupby(df.A.cumsum()).apply(signal)\
.reset_index(level=0, drop=True)
Edit
An even faster solution, without grouping, is:
sig = df.A.replace(0, np.nan)                                 # 1 where the on-signal fires (A == 1)
sig.update(df.A.lt(df.B).astype(int).replace(0, np.nan) - 1)  # 0 where B fires without A (A == 0, B == 1)
df['C'] = sig.ffill().fillna(0, downcast='infer')             # carry the last state forward
For a sample of 7000 rows (your data repeated 1000 times), the execution time of this solution is 14 times shorter than that of the solution by YOBEN_S.

Vectorized cumulative sum based on value in array numpy [duplicate]

Let's say I have a Pandas DataFrame df:
Date Value
01/01/17 0
01/02/17 0
01/03/17 1
01/04/17 0
01/05/17 0
01/06/17 0
01/07/17 1
01/08/17 0
01/09/17 0
For each row, I want to efficiently calculate the days since the last occurrence of Value = 1.
So that df:
Date Value Last_Occurence
01/01/17 0 NaN
01/02/17 0 NaN
01/03/17 1 0
01/04/17 0 1
01/05/17 0 2
01/06/17 0 3
01/07/17 1 0
01/08/17 0 1
01/09/17 0 2
I could do a loop:
for i in range(0, len(df)):
    last = np.where(df.loc[0:i, 'Value'] == 1)
    df.loc[i, 'Last_Occurence'] = i - last
But it seems very inefficient for extremely large data sets and probably isn't right anyway.
Here's a NumPy approach -
def intervaled_cumsum(a, trigger_val=1, start_val=0, invalid_specifier=-1):
    # ramp that increases by 1 per element and restarts at each trigger
    out = np.ones(a.size, dtype=int)
    idx = np.flatnonzero(a == trigger_val)          # positions of the trigger value
    if len(idx) == 0:
        return np.full(a.size, invalid_specifier)   # no trigger at all
    else:
        # place offsets at the trigger positions so that the cumulative sum
        # equals start_val there and counts up again afterwards
        out[idx[0]] = -idx[0] + 1
        out[0] = start_val
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
        np.cumsum(out, out=out)
        out[:idx[0]] = invalid_specifier            # before the first trigger
        return out
A few sample runs on array data to showcase the usage, covering various scenarios of trigger and start values:
In [120]: a
Out[120]: array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0])
In [121]: p1 = intervaled_cumsum(a, trigger_val=1, start_val=0)
...: p2 = intervaled_cumsum(a, trigger_val=1, start_val=1)
...: p3 = intervaled_cumsum(a, trigger_val=0, start_val=0)
...: p4 = intervaled_cumsum(a, trigger_val=0, start_val=1)
...:
In [122]: np.vstack(( a, p1, p2, p3, p4 ))
Out[122]:
array([[ 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0],
[-1, 0, 0, 0, 1, 2, 0, 1, 2, 0, 0, 0, 0, 0, 1],
[-1, 1, 1, 1, 2, 3, 1, 2, 3, 1, 1, 1, 1, 1, 2],
[ 0, 1, 2, 3, 0, 0, 1, 0, 0, 1, 2, 3, 4, 5, 0],
[ 1, 2, 3, 4, 1, 1, 2, 1, 1, 2, 3, 4, 5, 6, 1]])
Using it to solve our case :
df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Sample output -
In [181]: df
Out[181]:
Date Value Last_Occurence
0 01/01/17 0 -1
1 01/02/17 0 -1
2 01/03/17 1 0
3 01/04/17 0 1
4 01/05/17 0 2
5 01/06/17 0 3
6 01/07/17 1 0
7 01/08/17 0 1
8 01/09/17 0 2
Runtime test
Approaches -
# #Scott Boston's soln
def pandas_groupby(df):
    mask = df.Value.cumsum().replace(0, False).astype(bool)
    return df.assign(Last_Occurance=df.groupby(df.Value.astype(bool)
                     .cumsum()).cumcount().where(mask))

# Proposed in this post
def numpy_based(df):
    df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Timings -
In [33]: df = pd.DataFrame((np.random.rand(10000000)>0.7).astype(int), columns=[['Value']])
In [34]: %timeit pandas_groupby(df)
1 loops, best of 3: 1.06 s per loop
In [35]: %timeit numpy_based(df)
10 loops, best of 3: 103 ms per loop
In [36]: df = pd.DataFrame((np.random.rand(100000000)>0.7).astype(int), columns=[['Value']])
In [37]: %timeit pandas_groupby(df)
1 loops, best of 3: 11.1 s per loop
In [38]: %timeit numpy_based(df)
1 loops, best of 3: 1.03 s per loop
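As a quick agreement check between the two approaches (my addition, not part of the original benchmark), note that pandas_groupby leaves NaN before the first 1 whereas numpy_based uses -1:
df = pd.DataFrame((np.random.rand(100000) > 0.7).astype(int), columns=['Value'])
a = pandas_groupby(df)['Last_Occurance'].fillna(-1).astype(int)
numpy_based(df)                      # adds the 'Last_Occurence' column in place
b = df['Last_Occurence']
assert (a.to_numpy() == b.to_numpy()).all()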
Let's try this using cumsum, cumcount, and groupby:
mask = df.Value.cumsum().replace(0,False).astype(bool) #Mask starting zeros as NaN
df_out = df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).cumsum()).cumcount().where(mask))
print(df_out)
output:
Date Value Last_Occurance
0 01/01/17 0 NaN
1 01/02/17 0 NaN
2 01/03/17 1 0.0
3 01/04/17 0 1.0
4 01/05/17 0 2.0
5 01/06/17 0 3.0
6 01/07/17 1 0.0
7 01/08/17 0 1.0
8 01/09/17 0 2.0
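If you would rather keep integers while preserving the missing values in the first rows, the column can be cast to pandas' nullable Int64 dtype (my addition):
df_out['Last_Occurance'] = df_out['Last_Occurance'].astype('Int64')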
You can use argmax:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()),axis=1)
Out[85]:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 2
dtype: int64
If you have to have nan for the first 2 rows, use:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()) \
if 1 in df.iloc[x.name::-1].Value.values \
else np.nan,axis=1)
Out[86]:
0 NaN
1 NaN
2 0.0
3 1.0
4 2.0
5 3.0
6 0.0
7 1.0
8 2.0
dtype: float64
You don't have to search the whole prefix for the last 1 at every step of the loop. Initialize a variable outside the loop
last = np.nan
for i in range(len(df)):
    if df.loc[i, 'Value'] == 1:
        last = i
    df.loc[i, 'Last_Occurence'] = i - last
and update it only when a 1 occurs in column Value.
Note that no matter what method you select, iterating the whole table once is inevitable.
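For completeness (my addition, not one of the posted answers), a compact pandas alternative is to forward-fill the row position of the last Value == 1 and subtract, assuming a default RangeIndex:
pos = pd.Series(df.index, index=df.index)
# NaN before the first 1, as in the desired output
df['Last_Occurence'] = pos - pos.where(df['Value'].eq(1)).ffill()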

Creating intervaled ramp array based on a threshold - Python / NumPy

I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So the resulting array should tell me how many consecutive values have fulfilled the condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in Python which returns the corresponding numpy array:
import numpy as np

def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)

StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since this counter should run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, but I cannot work out the reset part. Is anyone aware of a more elegant numpy-style solution to the above problem?
This solution uses pandas to perform a groupby:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0
>>> np.where(
        s > threshold,
        s
        .to_frame()          # Convert series to dataframe.
        .assign(_dummy_=1)   # Add column of ones.
        .groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_']  # shift-cumsum pattern
        .transform(lambda x: x.cumsum()),  # Cumsum the ones per group.
        0)                   # Fill value with zero where threshold not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
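A shorter pandas variant of the same grouping idea (my sketch, not from the answer above): start a new group at every value at or below the threshold, then cumulatively count the above-threshold flags within each group:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 1
grp = s.le(threshold).cumsum()   # a new group starts at every reset row
s.gt(threshold).astype(int).groupby(grp).cumsum().to_numpy()
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])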
Yes, we can use diff-style differentiation along with cumsum to create such intervaled ramps in a vectorized manner, and that should be pretty efficient, especially with large input arrays. The resetting is handled by assigning appropriate (negative) values at the end of each interval, so that the cumulative sum restarts the count after every interval.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a > thresh
    # Get start, stop indices
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]
    out = mask.astype(int)
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
Runtime test
One way to do fair benchmarking is to take the sample posted in the question, tile it a large number of times, and use that as the input array. With that setup, here are the timings -
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
In [842]: a = np.tile(a,10000)
# #Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop
# #Psidom 's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop
# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
Another numpy solution:
import numpy as np
a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))

stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
