Vectorized cumulative sum based on value in array numpy [duplicate] - python

Let's say I have a Pandas DataFrame df:
Date Value
01/01/17 0
01/02/17 0
01/03/17 1
01/04/17 0
01/05/17 0
01/06/17 0
01/07/17 1
01/08/17 0
01/09/17 0
For each row, I want to efficiently calculate the number of days since the last occurrence of Value=1.
So that df becomes:
Date Value Last_Occurence
01/01/17 0 NaN
01/02/17 0 NaN
01/03/17 1 0
01/04/17 0 1
01/05/17 0 2
01/06/17 0 3
01/07/17 1 0
01/08/17 0 1
01/09/17 0 2
I could do a loop:
for i in range(0, len(df)):
    last = np.where(df.loc[0:i,'Value']==1)
    df.loc[i, 'Last_Occurence'] = i-last
But it seems very inefficient for extremely large data sets and probably isn't right anyway.

Here's a NumPy approach -
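def intervaled_cumsum(a, trigger_val=1, start_val = 0, invalid_specifier=-1):
    # Every step adds 1 by default; resets are patched in at the trigger positions.
    out = np.ones(a.size,dtype=int)
    idx = np.flatnonzero(a==trigger_val)
    if len(idx)==0:
        # No trigger anywhere: every position is invalid.
        return np.full(a.size,invalid_specifier)
    else:
        out[idx[0]] = -idx[0] + 1
        out[0] = start_val
        # At each later trigger, subtract the distance to the previous trigger
        # so that the running sum drops back to start_val.
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
        np.cumsum(out, out=out)
        # Positions before the first trigger have no previous occurrence.
        out[:idx[0]] = invalid_specifier
        return out
A few sample runs on array data, covering various combinations of trigger and start values: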
In [120]: a
Out[120]: array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0])
In [121]: p1 = intervaled_cumsum(a, trigger_val=1, start_val=0)
...: p2 = intervaled_cumsum(a, trigger_val=1, start_val=1)
...: p3 = intervaled_cumsum(a, trigger_val=0, start_val=0)
...: p4 = intervaled_cumsum(a, trigger_val=0, start_val=1)
...:
In [122]: np.vstack(( a, p1, p2, p3, p4 ))
Out[122]:
array([[ 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0],
[-1, 0, 0, 0, 1, 2, 0, 1, 2, 0, 0, 0, 0, 0, 1],
[-1, 1, 1, 1, 2, 3, 1, 2, 3, 1, 1, 1, 1, 1, 2],
[ 0, 1, 2, 3, 0, 0, 1, 0, 0, 1, 2, 3, 4, 5, 0],
[ 1, 2, 3, 4, 1, 1, 2, 1, 1, 2, 3, 4, 5, 6, 1]])
Using it to solve our case :
df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Sample output -
In [181]: df
Out[181]:
Date Value Last_Occurence
0 01/01/17 0 -1
1 01/02/17 0 -1
2 01/03/17 1 0
3 01/04/17 0 1
4 01/05/17 0 2
5 01/06/17 0 3
6 01/07/17 1 0
7 01/08/17 0 1
8 01/09/17 0 2
Runtime test
Approaches -
# @Scott Boston's soln
def pandas_groupby(df):
    mask = df.Value.cumsum().replace(0,False).astype(bool)
    return df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).\
                     cumsum()).cumcount().where(mask))
# Proposed in this post
def numpy_based(df):
    df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Timings -
In [33]: df = pd.DataFrame((np.random.rand(10000000)>0.7).astype(int), columns=[['Value']])
In [34]: %timeit pandas_groupby(df)
1 loops, best of 3: 1.06 s per loop
In [35]: %timeit numpy_based(df)
10 loops, best of 3: 103 ms per loop
In [36]: df = pd.DataFrame((np.random.rand(100000000)>0.7).astype(int), columns=[['Value']])
In [37]: %timeit pandas_groupby(df)
1 loops, best of 3: 11.1 s per loop
In [38]: %timeit numpy_based(df)
1 loops, best of 3: 1.03 s per loop

Let's try this using cumsum, cumcount, and groupby:
mask = df.Value.cumsum().replace(0,False).astype(bool) #Mask starting zeros as NaN
df_out = df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).cumsum()).cumcount().where(mask))
print(df_out)
output:
Date Value Last_Occurance
0 01/01/17 0 NaN
1 01/02/17 0 NaN
2 01/03/17 1 0.0
3 01/04/17 0 1.0
4 01/05/17 0 2.0
5 01/06/17 0 3.0
6 01/07/17 1 0.0
7 01/08/17 0 1.0
8 01/09/17 0 2.0
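To see what each piece of that one-liner contributes, here is a minimal sketch that computes the intermediate series separately on the question's Value column. The names group_id, within and seen are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"Value": [0, 0, 1, 0, 0, 0, 1, 0, 0]})

# Each 1 starts a new group; rows before the first 1 fall in group 0.
group_id = df["Value"].astype(bool).cumsum()

# Position of each row inside its group, counted from 0.
within = df.groupby(group_id).cumcount()

# Rows in group 0 have not seen a 1 yet, so they become NaN.
seen = group_id > 0
result = within.where(seen)
print(result)
```

The where call is what turns the leading rows into NaN; without it they would be numbered 0, 1 like any other group.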

You can use argmax:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()),axis=1)
Out[85]:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 2
dtype: int64
If you have to have nan for the first 2 rows, use:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()) \
if 1 in df.iloc[x.name::-1].Value.values \
else np.nan,axis=1)
Out[86]:
0 NaN
1 NaN
2 0.0
3 1.0
4 2.0
5 3.0
6 0.0
7 1.0
8 2.0
dtype: float64

You don't have to recompute last at every step of the for loop. Initialize a variable outside the loop
last = np.nan
for i in range(len(df)):
    if df.loc[i, 'Value'] == 1:
        last = i
    df.loc[i, 'Last_Occurence'] = i - last
and update it only when a 1 occurs in column Value.
Note that no matter which method you choose, iterating over the whole table once is unavoidable.
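The same "remember the last index" idea can also be written without an explicit Python loop. The following is only a sketch under the assumption that the frame has a default 0..n-1 index; it swaps the loop for np.maximum.accumulate, and the names pos, last and since are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Value": [0, 0, 1, 0, 0, 0, 1, 0, 0]})

pos = np.arange(len(df))
# Row index where Value == 1, and -1 elsewhere; the running maximum then
# carries the most recent such index forward (-1 while none has been seen).
last = np.maximum.accumulate(np.where(df["Value"].to_numpy() == 1, pos, -1))

# Distance to the most recent 1, NaN before the first one.
since = np.where(last >= 0, pos - last, np.nan)
df["Last_Occurence"] = since
print(df)
```

This still passes over the data once; the pass just happens inside NumPy instead of in Python.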

Related

Python Pandas: Vectorized Way of Cleaning Buy and Sell Signals

I'm trying to simulate financial trades using a vectorized approach in python. Part of this includes removing duplicate signals.
To elaborate, I've developed a buy_signal column and a sell_signal column. These columns contain booleans in the form of 1s and 0s.
Looking at the signals from the top-down, I don't want to trigger a second buy_signal before a sell_signal triggers, AKA if a 'position' is open. Same thing with sell signals, I do not want duplicate sell signals if a 'position' is closed. If a sell_signal and buy_signal are 1, set them both to 0.
What is the best way to remove these irrelevant signals?
Here's an example:
import pandas as pd
df = pd.DataFrame(
    {
        "buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
        "sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    }
)
print(df)
buy_signal sell_signal
0 1 0
1 1 0
2 1 1
3 1 1
4 0 1
5 0 0
6 1 0
7 1 0
8 1 1
9 0 0
Here's the result I want:
buy_signal sell_signal
0 1 0
1 0 0
2 0 1
3 0 0
4 0 0
5 0 0
6 1 0
7 0 0
8 0 1
9 0 0
As I said earlier (in a comment on a since-deleted answer), one must consider the interaction between buy and sell signals; you cannot simply operate on each independently.
The key idea is to consider a quantity q (or "position"), the amount currently held, which the OP wants bounded to [0, 1]. After cleaning, that quantity is cumsum(buy - sell).
Therefore, the problem reduces to a "cumulative sum with limits", which unfortunately cannot be done in a vectorized way with numpy or pandas, but which we can code quite efficiently using numba. The code below processes 1 million rows in 37 ms.
import numpy as np
from numba import njit
@njit
def cumsum_clip(a, xmin=-np.inf, xmax=np.inf):
    res = np.empty_like(a)
    c = 0
    for i in range(len(a)):
        c = min(max(c + a[i], xmin), xmax)
        res[i] = c
    return res
def clean_buy_sell(df, xmin=0, xmax=1):
    # model the quantity held: cumulative sum of buy-sell clipped in
    # [xmin, xmax]
    # note that, when buy and sell are equal, there is no change
    q = cumsum_clip(
        (df['buy_signal'] - df['sell_signal']).values,
        xmin=xmin, xmax=xmax)
    # derive actual transactions: positive for buy, negative for sell, 0 for hold
    trans = np.diff(np.r_[0, q])
    df = df.assign(
        buy_signal=np.clip(trans, 0, None),
        sell_signal=np.clip(-trans, 0, None),
    )
    return df
Now:
df = pd.DataFrame(
    {
        "buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
        "sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    }
)
new_df = clean_buy_sell(df)
>>> new_df
buy_signal sell_signal
0 1 0
1 0 0
2 0 0
3 0 0
4 0 1
5 0 0
6 1 0
7 0 0
8 0 0
9 0 0
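To see where this output comes from, it may help to trace the intermediate position q on the same ten rows. The sketch below uses a plain-Python stand-in for the clipped cumulative sum so that it runs without numba; cumsum_clip_py is a hypothetical helper name, not part of the answer's code.

```python
import numpy as np

def cumsum_clip_py(a, xmin, xmax):
    # Plain-Python stand-in (hypothetical name) for a running sum
    # that is clipped to [xmin, xmax] after every step.
    res = np.empty_like(a)
    c = 0
    for i, x in enumerate(a):
        c = min(max(c + x, xmin), xmax)
        res[i] = c
    return res

buy = np.array([1, 1, 1, 1, 0, 0, 1, 1, 1, 0])
sell = np.array([0, 0, 1, 1, 1, 0, 0, 0, 1, 0])

q = cumsum_clip_py(buy - sell, 0, 1)   # position held after each row
trans = np.diff(np.r_[0, q])           # +1 = buy, -1 = sell, 0 = hold
print(q)
print(trans)
```

Rows 2, 3 and 8 have both signals set, so their net change is zero and the position stays open. The first effective sell is therefore row 4, which is why this output differs from the table posted in the question.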
Speed and correctness
n = 1_000_000
np.random.seed(0) # repeatable example
df = pd.DataFrame(np.random.choice([0, 1], (n, 2)),
columns=['buy_signal', 'sell_signal'])
%timeit clean_buy_sell(df)
37.3 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Correctness tests:
z = clean_buy_sell(df)
q = (z['buy_signal'] - z['sell_signal']).cumsum()
# q is quantity held through time; must be in {0, 1}
assert q.isin({0, 1}).all()
# we should not have introduced any new buy signal:
# check that any buy == 1 in z was also 1 in df
assert not (z['buy_signal'] & ~df['buy_signal']).any()
# same for sell signal:
assert not (z['sell_signal'] & ~df['sell_signal']).any()
# finally, buy and sell should never be 1 on the same row:
assert not (z['buy_signal'] & z['sell_signal']).any()
Bonus: other limits, fractional buys and sells
For fun, we can consider the more general case where buy and sell values are fractional (or any float value), and the limits are not [0, 1]. Nothing needs to change in the current version of clean_buy_sell, which is general enough to handle these conditions.
np.random.seed(0)
df = pd.DataFrame(
np.random.uniform(0, 1, (100, 2)),
columns=['buy_signal', 'sell_signal'],
)
# set limits to -1, 2: we can sell short (borrow) up to 1 unit
# and own up to 2 units.
z = clean_buy_sell(df, -1, 2)
(z['buy_signal'] - z['sell_signal']).cumsum().plot()

Add a state column when another column is increasing/decreasing

I would like to add a column to a data frame indicating whether another column is increasing, decreasing, or staying the same, with:
1 -> increasing, 0 -> same, -1 -> decreasing
So if df['battery'] = [1,2,3,4,7,9,3,3,3,]
I would like state to be df['state'] = [1,1,1,1,1,-1,0,0]
This should do the trick!
a = [1,2,3,4,7,9,3,3,3]
b = []
for x in range(len(a)-1):
    b.append((a[x+1] > a[x]) - (a[x+1] < a[x]))
print(b)
You could use pd.Series.diff method to get the difference between consecutive values, and then assign the necessary state values by using boolean indexing:
import pandas as pd
df = pd.DataFrame()
df['battery'] = [1,2,3,4,7,9,3,3,3]
diff = df['battery'].diff()
df.loc[diff > 0, 'state'] = 1
df.loc[diff == 0, 'state'] = 0
df.loc[diff < 0, 'state'] = -1
print(df)
# battery state
# 0 1 NaN
# 1 2 1.0
# 2 3 1.0
# 3 4 1.0
# 4 7 1.0
# 5 9 1.0
# 6 3 -1.0
# 7 3 0.0
# 8 3 0.0
Or, alternatively, one could use np.select:
import numpy as np
diff = df['battery'].diff()
df['state'] = np.select([diff < 0, diff > 0], [-1, 1], 0)
# Be careful, default 0 will replace the first NaN as well.
print(df)
# battery state
# 0 1 0
# 1 2 1
# 2 3 1
# 3 4 1
# 4 7 1
# 5 9 1
# 6 3 -1
# 7 3 0
# 8 3 0
So here's your dataframe:
>>> import pandas as pd
>>> data = [[[1,2,3,4,7,9,3,3,3]]]
>>> df = pd.DataFrame(data, columns = ['battery'])
>>> df
battery
0 [1, 2, 3, 4, 7, 9, 3, 3, 3]
And finally use apply and a lambda function in order to generate the required result:
>>> df['state'] = df.apply(lambda row: [1 if t - s > 0 else -1 if t-s < 0 else 0 for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 1, 1, -1, 0, 0]
Alternatively, if you want the exact difference between each element in the list, you can use the following:
>>> df['state'] = df.apply(lambda row: [t - s for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 3, 2, -6, 0, 0]
Try np.sign (the pd.np alias used in older answers was removed in pandas 2.0, so import numpy as np directly):
np.sign(df.battery.diff().fillna(1))
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 -1.0
7 0.0
8 0.0
Name: battery, dtype: float64

Creating intervaled ramp array based on a threshold - Python / NumPy

I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So the resulting array should tell me how many consecutive values have fulfilled the condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in Python which returns the corresponding numpy array:
def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)
StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since this counter should run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, but I cannot work out the reset part. Is someone aware of a more elegant numpy-style solution to the above problem?
This solution uses pandas to perform a groupby:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0
>>> np.where(
s > threshold,
s
.to_frame() # Convert series to dataframe.
.assign(_dummy_=1) # Add column of ones.
.groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_'] # shift-cumsum pattern
.transform(lambda x: x.cumsum()), # Cumsum the ones per group.
0) # Fill value with zero where threshold not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
Yes, we can use diff-style differentiation along with cumsum to create such intervaled ramps in a vectorized manner, which should be quite efficient, especially with large input arrays. The reset is handled by assigning an appropriate negative value at the end of each interval, so that the cumulative sum drops back to zero there.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a>thresh
    # Get start, stop indices
    mask_ext = np.concatenate(([False], mask, [False] ))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0,s1 = idx[::2], idx[1::2]
    out = mask.astype(int)
    valid_stop = s1[s1<len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
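To make the reset trick concrete, the sketch below repeats the same steps on the question's ten-element input and keeps the pre-cumsum array around so it can be inspected. The variable names (steps, starts, stops, ramp) are illustrative.

```python
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
mask = a > 1

# Pad with False on both sides so every run has a start and a stop edge.
ext = np.concatenate(([False], mask, [False]))
idx = np.flatnonzero(ext[1:] != ext[:-1])
starts, stops = idx[::2], idx[1::2]

# +1 inside each run, 0 elsewhere.
steps = mask.astype(int)
# At the first position after each run, write minus the run length,
# which cancels everything the run accumulated.
valid = stops[stops < len(a)]
steps[valid] = starts[:len(valid)] - valid

ramp = steps.cumsum()
print(steps)
print(ramp)
```

The two negative entries in steps are exactly the lengths of the two runs, which is what pulls the running sum back to zero.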
Runtime test
One fair way to benchmark is to take the sample posted in the question, tile it a large number of times, and use that as the input array. With that setup, here are the timings -
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
In [842]: a = np.tile(a,10000)
# @Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop
# @Psidom's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop
# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
Another numpy solution:
import numpy as np
a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))
stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])

Python: Partial sum of a set in a matrix by columns

I have two large matrices (1800L;1800C), epeq and triax, that have columns like:
epeq=
0
1
1
2
1
0
3
3
1
1
0
2
1
1
1
triax=
-1
1
3
1
-2
-3
-1
1
2
3
2
1
-1
-3
-1
1
As you can see, the triax columns have cycles of positive and negative elements. I want the cumulative sum of epeq taken at the beginning of each cycle in triax, with that value staying constant during the cycle, like this:
epeq_cr=
0
1
1
1
1
1
1
11
11
11
11
11
11
11
11
17
and apply this procedure to all columns of the epeq matrix. I have this code, but something is missing.
epeq_cr = np.copy(epeq)
for g in range(1,len(epeq_cr)):
    for h in range(len(epeq_cr[g])):
        if (triax[g-1][h]<0 and triax[g][h]>0):
            epeq_cr[g][h] = np.cumsum()...
I've run out of time to look at this now but I'd start by figuring out where the cycles start in the triax:
epeq = np.array([1, 1, 2, 1, 0, 3, 3, 1, 1, 0, 2, 1, 1, 1])
triax = np.array([-1, 1, 3, 1, -2, -3, -1, 1, 2, 3, 2, 1, -1, -3, -1, 1])
t_shift = np.roll(triax, 1)
t_shift[0] = 0
cycle_starts = np.argwhere((triax > 0) & (t_shift < 0)).flatten()
array([ 1, 7, 15])
So for any position, i, in epeq_cr you need to find the largest number less than i in cycle_starts and sum(epeq[:position]).
epeq_cr = np.copy(epeq)
for g in range(1,len(epeq_cr)):
    for h in range(len(epeq_cr[g])):
        if (triax[g-1][h]<=0 and triax[g][h]>=0):
            epeq_cr[g][h]=sum(epeq[v][h] for v in range(g+1))
        else:
            epeq_cr[g][h]=epeq_cr[g-1][h]
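For a single column, the lookup of "the most recent cycle start" can also be done without Python loops. This is only a sketch: it uses the first ten rows of the question's two columns, detects a cycle start with the strict negative-to-positive test, and assumes the sum at a cycle start includes that row (as in the loop above). The names prev, csum and pos are illustrative.

```python
import numpy as np

# First ten rows of the question's columns (one column each).
epeq = np.array([0, 1, 1, 2, 1, 0, 3, 3, 1, 1])
triax = np.array([-1, 1, 3, 1, -2, -3, -1, 1, 2, 3])

# A cycle starts where triax turns from negative to positive.
prev = np.r_[0, triax[:-1]]
cycle_starts = np.flatnonzero((triax > 0) & (prev < 0))

# Running total of epeq, including the current row.
csum = np.cumsum(epeq)

# For every row, the position (within cycle_starts) of the latest start
# at or before that row; -1 means no cycle has started yet.
pos = np.searchsorted(cycle_starts, np.arange(len(epeq)), side="right") - 1

# Take the running total at that start; keep epeq[0] before the first start.
epeq_cr = np.where(pos >= 0, csum[cycle_starts[np.maximum(pos, 0)]], epeq[0])
print(epeq_cr)
```

To cover a whole matrix, the same steps could be applied column by column.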

Encoding column labels in Pandas for machine learning

I am working on the car evaluation dataset for machine learning, and the dataset looks like this:
buying,maint,doors,persons,lug_boot,safety,class
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
I want to convert these strings to unique enumerated integers, column by column. I see that pandas.factorize() is the way to go, but it only works on one column. How do I factorize the whole dataframe in one go with a single command?
I tried a lambda function, but it is not working.
df.apply(lambda c:pd.factorize(c),axis=1)
Output:
0 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, low,...
1 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, med,...
2 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, high...
3 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, low, u...
4 ([0, 0, 1, 1, 2, 2, 3], [vhigh, 2, med, unacc])
5 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, high, ...
I see the encoded values but can't pull them out of the array above.
pd.factorize returns a tuple of (codes, uniques). You'll just want the codes, the first element, in the DataFrame.
In [26]: cols = ['buying', 'maint', 'lug_boot', 'safety', 'class']
In [27]: df[cols].apply(lambda x: pd.factorize(x)[0])
Out[27]:
buying maint lug_boot safety class
0 0 0 0 0 0
1 0 0 0 1 0
2 0 0 0 2 0
3 0 0 1 0 0
4 0 0 1 1 0
5 0 0 1 2 0
Then concat that to the numeric data.
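A minimal sketch of that concat step, assuming the six rows shown in the question; the name encoded is illustrative.

```python
import pandas as pd

# The six rows of the dataset shown in the question.
df = pd.DataFrame({
    "buying": ["vhigh"] * 6,
    "maint": ["vhigh"] * 6,
    "doors": [2] * 6,
    "persons": [2] * 6,
    "lug_boot": ["small", "small", "small", "med", "med", "med"],
    "safety": ["low", "med", "high", "low", "med", "high"],
    "class": ["unacc"] * 6,
})
cols = ["buying", "maint", "lug_boot", "safety", "class"]

# Integer codes for each string column, numbered in order of appearance.
codes = df[cols].apply(lambda x: pd.factorize(x)[0])

# Keep the numeric columns and place the integer codes next to them.
encoded = pd.concat([df.drop(columns=cols), codes], axis=1)
print(encoded)
```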
A word of warning though: this implies that "low" safety and "high" safety are the same distance from "med" safety. You might be better off using pd.get_dummies:
In [37]: dummies = []
In [38]: for col in cols:
   ....:     dummies.append(pd.get_dummies(df[col]))
   ....:
In [39]: pd.concat(dummies, axis=1)
Out[39]:
vhigh vhigh med small high low med unacc
0 1 1 0 1 0 1 0 1
1 1 1 0 1 0 0 1 1
2 1 1 0 1 1 0 0 1
3 1 1 1 0 0 1 0 1
4 1 1 1 0 0 0 1 1
5 1 1 1 0 1 0 0 1
get_dummies has some optional parameters to control the naming, which you'll probably want.
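As a sketch of those naming options: passing the whole frame together with columns= makes get_dummies prefix each indicator with its source column, and prefix_sep sets the separator. The small frame below is a cut-down, illustrative version of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "buying": ["vhigh", "vhigh", "vhigh"],
    "lug_boot": ["small", "med", "med"],
    "safety": ["low", "med", "high"],
})

# Encode only the listed columns; each indicator is named
# "<column>=<value>", so "med" from lug_boot and "med" from safety
# no longer collide.
encoded = pd.get_dummies(df, columns=["lug_boot", "safety"], prefix_sep="=")
print(encoded.columns.tolist())
```

This avoids the duplicate column labels (two "vhigh", two "med") seen in the concatenated output above.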
