My algorithm's runtime jumped from 35 seconds to 15 minutes when I implemented this feature over a daily timeframe. The algo retrieves daily history in bulk and iterates over a subset of the dataframe (from t0 to tX, where tX is the current row of the iteration). It does this to emulate what would happen during the real-time operation of the algo. I know there are ways of improving it by utilizing memory between frame calculations, but I was wondering if there was a more pandas-ish implementation that would see immediate benefit.
Assume that self.Step is something like 0.00001 and self.Precision is 5; they are used for binning the OHLC bar information into discrete steps for the sake of finding the POC (point of control). _frame is a subset of rows of the entire dataframe, and _low/_high are respective to that. The following block of code executes on the entire _frame, which could be upwards of ~250 rows every time a new row is added by the algo (when calculating a yearly timeframe on daily data). I believe it's the iterrows that's causing the major slowdown. The dataframe has columns such as high, low, open, close and volume. I am calculating time price opportunity and volume point of control.
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - self.Step, _high + self.Step, self.Step), decimals=self.Precision))
time_prices = volume_prices.copy()
for index, state in _frame.iterrows():
    _prices = np.around(np.arange(state.low, state.high, self.Step), decimals=self.Precision)
    # Evenly distribute the bar's volume over its range
    volume_prices[_prices] += state.volume / _prices.size
    # Increment time at price
    time_prices[_prices] += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax()) / 2
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax()) / 2
You can use this function as a base and adjust it:
def f(x):  # function to find the POC price and volume
    a = x['tradePrice'].value_counts().index[0]
    b = x.loc[x['tradePrice'] == a, 'tradeVolume'].sum()
    return pd.Series([a, b], ['POC_Price', 'POC_Volume'])
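For instance, assuming tick-level data with tradePrice/tradeVolume columns (a hypothetical frame and grouping key, just to show the shape of the call), you could apply it per day with groupby:
import pandas as pd

# Hypothetical tick data; grouping by 'date' is an assumption, not part of the answer above
ticks = pd.DataFrame({
    'date': ['2020-01-01'] * 3 + ['2020-01-02'] * 3,
    'tradePrice': [10.0, 10.0, 10.5, 11.0, 11.5, 11.5],
    'tradeVolume': [100, 200, 150, 120, 80, 90],
})
poc_per_day = ticks.groupby('date').apply(f)  # one POC_Price/POC_Volume row per day
print(poc_per_day)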
Here's what I worked out. I'm still not sure the answer your code is producing is correct; I think your line volume_prices[_prices] += state.Volume / _prices.size is not being applied to every record in volume_prices. But here it is with benchmarking, about a 9x improvement.
def vpOriginal():
    Step = 0.00001
    Precision = 5
    _frame = getData()
    _low = 85.0
    _high = 116.4
    # Set the complete index of prices +/- 1 step due to weird floating point precision issues
    volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
    time_prices = volume_prices.copy()
    time_prices2 = volume_prices.copy()
    for index, state in _frame.iterrows():
        _prices = np.around(np.arange(state.Low, state.High, Step), decimals=Precision)
        # Evenly distribute the bar's volume over its range
        volume_prices[_prices] += state.Volume / _prices.size
        # Increment time at price
        time_prices[_prices] += 1
        time_prices2 += 1
    # Pandas only returns the 1st row of the max value,
    # so we need to reverse the series to find the other side
    # and then find the average price between those two extremes
    # print(volume_prices.head(10))
    # note: as written, only the reversed idxmax is divided by 2;
    # kept as-is to match the benchmark output below
    volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
    time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
    return volume_poc, time_poc
def vpNoDF():
    Step = 0.00001
    Precision = 5
    _frame = getData()
    _low = 85.0
    _high = 116.4
    # Set the complete index of prices +/- 1 step due to weird floating point precision issues
    volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
    time_prices = volume_prices.copy()
    for index, state in _frame.iterrows():
        _prices = np.around((state.High - state.Low) / Step, 0)
        # Evenly distribute the bar's volume over its range
        volume_prices.loc[state.Low:state.High] += state.Volume / _prices
        # Increment time at price
        time_prices.loc[state.Low:state.High] += 1
    # Pandas only returns the 1st row of the max value,
    # so we need to reverse the series to find the other side
    # and then find the average price between those two extremes
    # (same parenthesis placement as vpOriginal, kept for a like-for-like benchmark)
    volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
    time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
    return volume_poc, time_poc
getData()
Out[8]:
Date Open High Low Close Volume Adj Close
0 2008-10-14 116.26 116.40 103.14 104.08 70749800 104.08
1 2008-10-13 104.55 110.53 101.02 110.26 54967000 110.26
2 2008-10-10 85.70 100.00 85.00 96.80 79260700 96.80
3 2008-10-09 93.35 95.80 86.60 88.74 57763700 88.74
4 2008-10-08 85.91 96.33 85.68 89.79 78847900 89.79
5 2008-10-07 100.48 101.50 88.95 89.16 67099000 89.16
6 2008-10-06 91.96 98.78 87.54 98.14 75264900 98.14
7 2008-10-03 104.00 106.50 94.65 97.07 81942800 97.07
8 2008-10-02 108.01 108.79 100.00 100.10 57477300 100.10
9 2008-10-01 111.92 112.36 107.39 109.12 46303000 109.12
vpOriginal()
Out[9]: (142.55000000000001, 142.55000000000001)
vpNoDF()
Out[10]: (142.55000000000001, 142.55000000000001)
%timeit vpOriginal()
2.79 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vpNoDF()
300 ms ± 8.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I've managed to get it down to 2 minutes instead of 15, on daily timeframes anyway. It's still slow on lower timeframes (10 minutes on Hourly over a 2 year period with a precision of 2 for equities). Working with DataFrames as opposed to Series was FAR slower. I'm hoping for more, but I don't know what I can do aside from the following solution:
# Upon class instantiation, I've created attributes for each timeframe
# related to `volume_at_price` and `time_at_price`. They serve as memory
# in between frame calculations
def _prices_at(self, frame, bars=0):
    # Include 1 step above high as np.arange does not
    # include the upper limit by default
    state = frame.iloc[-min(bars + 1, frame.index.size)]
    bins = np.around(np.arange(state.low, state.high + self.Step, self.Step), decimals=self.Precision)
    return pd.Series(state.volume / bins.size, index=bins)
# SetFeature/Feature implement timeframed attributes (i.e., 'volume_at_price_D')
_v = 'volume_at_price'
_t = 'time_at_price'
# Add to x_at_price histogram
_p = self._prices_at(frame)
self.SetFeature(_v, self.Feature(_v).add(_p, fill_value=0))
self.SetFeature(_t, self.Feature(_t).add(_p * 0 + 1, fill_value=0))
# Remove old data from histogram
_p = self._prices_at(frame, self.Bars)
v = self.SetFeature(_v, self.Feature(_v).subtract(_p, fill_value=0))
t = self.SetFeature(_t, self.Feature(_t).subtract(_p * 0 + 1, fill_value=0))
self.SetFeature('volume_poc', (v.idxmax() + v.iloc[::-1].idxmax()) / 2)
self.SetFeature('time_poc', (t.idxmax() + t.iloc[::-1].idxmax()) / 2)
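As a side note on the reversed-idxmax trick used throughout: here is a small equivalent sketch that takes the midpoint of all tied maximum bins directly, given a histogram Series like volume_prices above (an alternative phrasing, not part of the original code):
# All price levels whose histogram value ties the maximum, then the midpoint of
# the first and last of them; same result as averaging idxmax() of the series
# and of its reverse.
max_levels = volume_prices.index[volume_prices.values == volume_prices.values.max()]
volume_poc = (max_levels[0] + max_levels[-1]) / 2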
I'm building a time series, trying to get a more efficient way to do this - ideally vectorized.
The pandas apply with list comprehension step is very slow (on a big data set).
import datetime
import pandas as pd
# Dummy data:
todays_date = datetime.datetime.now().date()
xdates = pd.date_range(todays_date-datetime.timedelta(10), periods=4, freq='D')
categories = list(2*'A') + list(2*'B')
d = {'xdate': xdates, 'periods': [8]*2 + [2]*2, 'interval': [3]*2 + [12]*2}
df = pd.DataFrame(d,index=categories)
# This step is slow:
df['sdates'] = df.apply(lambda x: [x.xdate + pd.DateOffset(months=k*x.interval) for k in range(x.periods)], axis=1)
# This step is quite quick, but shown here for completeness
df = df.explode('sdates')
Maybe something like this:
df['sdates'] = [df.xdate + df.periods * [df.interval.astype('timedelta64[M]')]]
but the syntax isn't quite right.
This code
df = pd.DataFrame(d,index=categories)
df['m_offsets'] = df.interval.apply(lambda x: list(range(0, 72, x)))
df = df.explode('m_offsets')
df['sdate'] = df.xdate + df.m_offsets * pd.DateOffset(months=1)
I think is similar to one of the answers, but the last step, pd.DateOffset gives a warning:
PerformanceWarning: Adding/subtracting array of DateOffsets to DatetimeArray not vectorized
I tried building something along the lines of one answer, but as mentioned the modular arithmetic needs a lot of tweaking to deal with edge cases, and I haven't figured that out yet (calendar monthrange wasn't playing nicely).
This function doesn't run:
from calendar import monthrange
def add_months(df, date_col, n_col):
    """ Adds n_col months to date_col """
    z = df.copy()
    # calculate new year/month/day and convert to datetime
    z['year'] = (z[date_col].dt.year * 12 + (z[date_col].dt.month - 1) + z[n_col]) // 12
    z['month'] = ((z[date_col].dt.month + z[n_col] - 1) % 12) + 1
    x, x = monthrange(z.year, z.month)
    z['days_in_month'] = monthrange(z.year, z.month)
    z['target_day'] = z[date_col].dt.day
    # z['day'] = min(z.target_day, z.days_in_month)
    z['day'] = z.days_in_month
    z['sdates'] = pd.to_datetime(z[['year', 'month', 'day']])
    return z['sdates']
This works, for now, but the dateoffset is a really heavy step.
df = pd.DataFrame(d,index=categories)
df['m_offsets'] = df.interval.apply(lambda x: list(range(0, 72, x)))
df = df.explode('m_offsets')
df['sdates'] = df.apply(lambda x: x.xdate + pd.DateOffset(months=x.m_offsets), axis=1)
Here's one option. You're adding months, so we can actually calculate new year/month/day by only dealing with integers in a vectorized way, and then create datetime from these y/m/d combinations:
def f_proposed(df):
    z = df.copy()
    z = z.reset_index()
    # repeat xdate as many times as the number of periods
    z = z.loc[np.repeat(z.index, z['periods'])]
    # calculate k number of months to add
    z['k'] = z.groupby(level=0).cumcount() * z['interval']
    # calculate new year/month/day and convert to datetime
    z['year'] = (z['xdate'].dt.year * 12 + z['xdate'].dt.month - 1 + z['k']) // 12
    z['month'] = (z['xdate'].dt.month - 1 + z['k']) % 12 + 1
    # clip day to days_in_month
    z['days_in_month'] = pd.to_datetime(
        z['year'].astype(str) + '-' + z['month'].astype(str) + '-01').dt.days_in_month
    z['day'] = np.clip(z['xdate'].dt.day, 0, z['days_in_month'])
    z['sdates'] = pd.to_datetime(z[['year', 'month', 'day']])
    # drop temporary columns
    z = z.set_index('index').drop(columns=['k', 'year', 'month', 'day', 'days_in_month'])
    return z
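For reference, a minimal usage sketch with the dummy data from the question (assuming f_proposed as defined above); the returned frame carries the generated sdates column on the exploded rows:
import datetime
import numpy as np
import pandas as pd

# Dummy data from the question
todays_date = datetime.datetime.now().date()
xdates = pd.date_range(todays_date - datetime.timedelta(10), periods=4, freq='D')
categories = list(2 * 'A') + list(2 * 'B')
d = {'xdate': xdates, 'periods': [8] * 2 + [2] * 2, 'interval': [3] * 2 + [12] * 2}
df = pd.DataFrame(d, index=categories)

out = f_proposed(df)
print(out[['xdate', 'sdates']].head())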
To compare performance with the original, I've generated a test dataset with 10,000 rows.
Here's my timings (~23x speedup for 10K):
%timeit f_proposed(z)
82.7 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit f_original(z)
1.92 s ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
P.S. For 170K it takes about 1.39s with f_proposed and 33.6 s with f_original on my machine
Semi-vectorized way
As I say below, I don't think there is a pure vectorized way to add a variable and general DateOffset to a Series of Timestamps. #perl solution works in the case where the DateOffset is an exact multiple of 1 month.
Now, adding a single constant DateOffset is vectorized, so we can use the following. It capitalizes on the fact that there is a limited set of distinct values for the date offset. It is also relatively fast, and it is correct for any DateOffset and dates:
n = df['periods'].values
period_no = np.repeat(n - n.cumsum(), n) + np.arange(n.sum())
z = pd.DataFrame(
    np.repeat(df.reset_index().values, repeats=n, axis=0),
    columns=df.reset_index().columns,
).set_index('index')
z = z.assign(madd=period_no * z['interval'])
z['sdates'] = z['xdate']
for madd in set(z['madd'].unique()):
    z.loc[z['madd'] == madd, 'sdates'] += pd.DateOffset(months=madd)
Timing:
# modified large dummy data:
N = 170_000
todays_date = datetime.datetime.now().date()
xdates = pd.date_range(todays_date-datetime.timedelta(10), periods=N, freq='H')
categories = np.random.choice(list('ABCDE'), N)
d = {'xdate': xdates, 'periods': np.random.randint(1,10,N), 'interval': np.random.randint(1,12,N)}
df = pd.DataFrame(d,index=categories)
%%time (the above)
CPU times: user 3.49 s, sys: 13.5 ms, total: 3.51 s
Wall time: 3.51 s
(Note: for 10K rows using the generation above, I see times of ~240ms, but of course it is dependent on how many distinct month offsets you have in your data).
Example result (for one draw of 170K rows as per above):
>>> z.tail()
xdate periods interval madd sdates
index
B 2040-08-25 06:00:00 8 8 48 2044-08-25 06:00:00
B 2040-08-25 06:00:00 8 8 56 2045-04-25 06:00:00
D 2040-08-25 07:00:00 3 2 0 2040-08-25 07:00:00
D 2040-08-25 07:00:00 3 2 2 2040-10-25 07:00:00
D 2040-08-25 07:00:00 3 2 4 2040-12-25 07:00:00
Correction on the initial answer
I stand corrected: my original answer is not vectorized either. The first part, exploding the DataFrame and building the number of months to add, is vectorized and very fast. But the second part, adding a DateOffset of a variable number of months, is not.
I hope I am wrong, but I don't think there is currently a way to do that second part in a vectorized way.
Direct date-parts manipulation (e.g. month = (month - 1 + n_months) % 12 + 1, etc.) is bound to fail for corner cases (e.g. '2021-02-31'). Short of replicating the logic used in DateOffset, this is not going to work for certain cases.
Initial answer
Here is a vectorized way:
n = df.periods.values
period_no = np.repeat(n - n.cumsum(), n) + np.arange(n.sum())
z = pd.DataFrame(
    np.repeat(df.reset_index().values, repeats=n, axis=0),
    columns=df.reset_index().columns,
).set_index('index').assign(period_no=period_no)
z['sdates'] = z['period_no'] * z['interval'] * pd.DateOffset(months=1) + z['xdate']
Consider the df:
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df
I want to calculate the sum over a trailing 5 days, every 3 days.
I expect trailing 5-day sums, sampled every 3 days.
Edit: what I had here originally was incorrect. #ivan_pozdeev and #boud noticed it described a centered window, and that was not my intention. Apologies for the confusion.
Everyone's solutions capture much of what I was after.
criteria
I'm looking for smart efficient solutions that can be scaled to large data sets.
I'll be timing solutions and also considering elegance.
Solutions should also be generalizable for a variety of sample and look back frequencies.
from comments
I want a solution that generalizes to handle a look back of a specified frequency and grab anything that falls within that look back.
for the sample above, the look back is 5D and there may be 4 or 50 observations that fall within that look back.
I want the timestamp to be the last observed timestamp within the look back period.
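Before the answers, here is a hedged sketch of what a purely time-based look back can look like with an offset window (assuming a pandas version that supports '5D' rolling windows); the sampling offset is hand-picked for this particular frame:
import numpy as np
import pandas as pd

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), index=tidx)

trailing = df.rolling('5D').sum()   # anything observed in the trailing 5 days, any number of rows
result = trailing.iloc[4::3]        # keep every 3rd day; the offset 4 is hand-picked for this sample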
The df you gave us is:
A
2012-12-31 0
2013-01-01 1
2013-01-02 2
2013-01-03 3
2013-01-04 4
2013-01-05 5
2013-01-06 6
2013-01-07 7
2013-01-08 8
2013-01-09 9
2013-01-10 10
You could create your rolling 5-day sum series and then resample it. I can't think of a more efficient way than this; overall it should be relatively time efficient.
df.rolling(5,min_periods=5).sum().dropna().resample('3D').first()
Out[36]:
A
2013-01-04 10.0000
2013-01-07 25.0000
2013-01-10 40.0000
Listed here are a few NumPy based solutions using bin based summing, covering basically three scenarios.
Scenario #1 : Multiple entries per date, but no missing dates
Approach #1 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
    # Extract the index names and values
    vals = df.A.values
    indx = df.index.values
    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()
    search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)
    # Perform bin based summing and subtract the repeated ones
    IDsums = np.bincount(id_arr,vals)
    allsums = IDsums[:-1] + IDsums[1:]
    allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]
    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]]   # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])
Approach #2 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
    # Extract the index names and values
    indx = df.index.values
    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()
    # Generate IDs at which shifts are to happen for a (2,3,5,8..) pattern
    # Pad with 0 and length of array at either ends as we use diff later on
    shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()
    search_id = np.hstack((0,shiftIDs,date_id[-1]+1))
    # Find the start of those shifting indices
    # Generate ID based on shifts and do bin based summing of dataframe
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)
    IDsums = np.bincount(id_arr,df.A.values)
    # Sum each group of 3 elems with a stride of 2, make dataframe if needed
    allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]
    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]]   # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])
Approach #3 :
def vectorized_app3(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])
We could replace the convolution part with direct sliced summation for a modified version of it -
def vectorized_app3_v2(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    f = c.size+S-W
    out = c[:f:S].copy()
    for i in range(1,W):
        out += c[i:f+i:S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])
Scenario #2 : Multiple entries per date and missing dates
Approach #4 :
def vectorized_app4(df, S=3, W=5):
    dt = df.index.values
    indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
    WL = ((indx[-1]+1)//S)
    c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')
    freq_str = str(S)+'D'
    grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
    out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]
    return pd.DataFrame(out,index=out_index,columns=['A'])
Scenario #3 : Consecutive dates and exactly one entry per date
Approach #5 :
def vectorized_app5(df, S=3, W=5):
    vals = df.A.values
    N = (df.shape[0]-W+2*S-1)//S
    n = vals.strides[0]
    out = np.lib.stride_tricks.as_strided(vals, shape=(N,W),
                                          strides=(S*n,n)).sum(1)
    index_idx = (W-1)+S*np.arange(N)
    out_index = df.index[index_idx]
    return pd.DataFrame(out,index=out_index,columns=['A'])
Suggestions for creating test-data
Scenario #1 :
# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])
Scenario #2 :
To create setup for multiple dates and with missing dates, we could just edit the df_data creation step, like so -
df_data = np.random.randint(0,9,(len(idx0)))
Scenario #3 :
# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
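As a quick sanity check (a sketch using the question's original 11-day frame rather than the generators above), the one-entry-per-date approach reproduces the rolling/resample answer:
# Quick check on the question's original frame (one entry per date), S=3, W=5
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df_check = pd.DataFrame(dict(A=np.arange(len(tidx))), index=tidx)
print(vectorized_app5(df_check, S=3, W=5))
# trailing sums 10, 25, 40 at 2013-01-04, 2013-01-07, 2013-01-10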
If the dataframe is sorted by date, what we actually have is iterating over an array while calculating something.
Here's the algorithm that calculates sums all in one iteration over the array. To understand it, see a scan of my notes below. This is the base, unoptimized version intended to showcase the algorithm (optimized ones for Python and Cython follow), and list(<call>) takes ~500 ms for an array of 100k on my system (P4). Since Python ints and ranges are relatively slow, this should benefit tremendously from being transferred to C level.
from __future__ import division
import numpy as np
#The date column is unimportant for calculations.
# I leave extracting the numbers' column from the dataframe
# and adding a corresponding element from data column to each result
# as an exercise for the reader
data = np.random.randint(100,size=100000)
def calc_trailing_data_with_interval(data,n,k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    #type data: ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    lim_index=len(data)-k+1
    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums,dtype=data.dtype)
    M=n%k
    Mp=k-M
    index=0
    currentsum=0
    while index<lim_index:
        for _ in range(Mp):
            #np.take is awkward, requiring a full list of indices to take
            for i in range(currentsum,currentsum+nsums-1):
                sums[i%nsums]+=data[index]
            index+=1
        for _ in range(M):
            sums+=data[index]
            index+=1
        yield sums[currentsum]
        # reset the just-yielded accumulator so it starts collecting its next window from zero
        sums[currentsum]=0
        currentsum=(currentsum+1)%nsums
Note that it produces the first sum at kth element, not nth (this can be changed but by sacrificing elegance - a number of dummy iterations before the main loop - and is more elegantly done by prepending data with extra zeros and discarding a number of first sums)
It can easily be generalized to any operation by replacing sums[slice]+=data[index] with operation(sums[slice],data[index]) where operation is a parameter and should be a mutating operation (like ndarray.__iadd__).
Parallelizing between any number of workers by splitting the data is just as easy (if n>k, chunks after the first one should be fed extra elements at the start).
To deduce the algorithm, I wrote a sample for a case where a decent number of sums are calculated simultaneously in order to see patterns.
Optimized: pure Python
Caching range objects brings the time down to ~300ms. Surprisingly, numpy functionality is of no help: np.take is unusable, and replacing currentsum logic with static slices and np.roll is a regression. Even more surprisingly, the benefit of saving output to an np.empty as opposed to yield is nonexistent.
def calc_trailing_data_with_interval(data,n,k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    #type data: ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    lim_index=len(data)-k+1
    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums,dtype=data.dtype)
    M=n%k
    Mp=k-M
    RM=range(M)    #cache for efficiency
    RMp=range(Mp)  #cache for efficiency
    index=0
    currentsum=0
    currentsum_ranges=[range(currentsum,currentsum+nsums-1)
                       for currentsum in range(nsums)]  #cache for efficiency
    while index<lim_index:
        for _ in RMp:
            #np.take is unusable as it allocates another array rather than view
            for i in currentsum_ranges[currentsum]:
                sums[i%nsums]+=data[index]
            index+=1
        for _ in RM:
            sums+=data[index]
            index+=1
        yield sums[currentsum]
        # reset the just-yielded accumulator so it starts collecting its next window from zero
        sums[currentsum]=0
        currentsum=(currentsum+1)%nsums
Optimized: Cython
Statically typing everything in Cython instantly speeds things up to 150ms. And (optionally) assuming np.int as dtype to be able to work with data at C level brings the time down to as little as ~11ms. At this point, saving to an np.empty does make a difference, saving an unbelievable ~6.5ms, totalling ~5.5ms.
def calc_trailing_data_with_interval(np.ndarray data, int n, int k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    #type data: 1-d ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    if not data.ndim==1: raise TypeError("One-dimensional array required")
    cdef int lim_index=data.size-k+1
    cdef np.ndarray result = np.empty(data.size//k,dtype=data.dtype)
    cdef int rindex = 0
    cdef int nsums = int(np.ceil(float(n)/k))
    cdef np.ndarray sums = np.zeros(nsums,dtype=data.dtype)
    #optional speedup for dtype=np.int
    cdef bint use_int_buffer = data.dtype==np.int and data.flags.c_contiguous
    cdef int[:] cdata = data
    cdef int[:] csums = sums
    cdef int[:] cresult = result
    cdef int M=n%k
    cdef int Mp=k-M
    cdef int index=0
    cdef int currentsum=0
    cdef int _,i
    while index<lim_index:
        for _ in range(Mp):
            #np.take is unusable as it allocates another array rather than view
            for i in range(currentsum,currentsum+nsums-1):
                if use_int_buffer: csums[i%nsums]+=cdata[index]    #optional speedup
                else: sums[i%nsums]+=data[index]
            index+=1
        for _ in range(M):
            if use_int_buffer:
                for i in range(nsums): csums[i]+=cdata[index]      #optional speedup
            else: sums+=data[index]
            index+=1
        if use_int_buffer: cresult[rindex]=csums[currentsum]       #optional speedup
        else: result[rindex]=sums[currentsum]
        # reset the just-emitted accumulator so it starts collecting its next window from zero
        if use_int_buffer: csums[currentsum]=0
        else: sums[currentsum]=0
        currentsum=(currentsum+1)%nsums
        rindex+=1
    return result
For regularly-spaced dates only
Here are two methods, first a pandas way and second a numpy function.
>>> n=5 # trailing periods for rolling sum
>>> k=3 # frequency of rolling sum calc
>>> df.rolling(n).sum()[-1::-k][::-1]
A
2013-01-01 NaN
2013-01-04 10.0
2013-01-07 25.0
2013-01-10 40.0
And here's a numpy function (adapted from Jaime's numpy moving_average):
def rolling_sum(a, n=5, k=3):
    ret = np.cumsum(a.values)
    ret[n:] = ret[n:] - ret[:-n]
    return pd.DataFrame(ret[n-1:][-1::-k][::-1],
                        index=a[n-1:][-1::-k][::-1].index)
rolling_sum(df,n=6,k=4) # default n=5, k=3
For irregularly-spaced dates (or regularly-spaced)
Simply precede with:
df.resample('D').sum().fillna(0)
For example, the above methods become:
df.resample('D').sum().fillna(0).rolling(n).sum()[-1::-k][::-1]
and
rolling_sum( df.resample('D').sum().fillna(0) )
Note that dealing with irregularly-spaced dates can be done simply and elegantly in pandas as this is a strength of pandas over almost anything else out there. But you can likely find a numpy (or numba or cython) approach that will trade off some simplicity for an increase in speed. Whether this is a good tradeoff will depend on your data size and performance requirements, of course.
For the irregularly spaced dates, I tested on the following example data and it seemed to work correctly. This will produce a mix of missing, single, and multiple entries per date:
np.random.seed(12345)
per = 11
tidx = np.random.choice( pd.date_range('2012-12-31', periods=per, freq='D'), per )
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx).sort_index()
This isn't quite perfect yet, but I've gotta go make fake blood for a Halloween party tonight... You should be able to see what I was getting at through the comments. One of the biggest speedups is finding the window edges with np.searchsorted. It doesn't quite work yet, but I'd bet it's just some index offsets that need tweaking.
import pandas as pd
import numpy as np
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
sample_freq = 3 #days
sample_width = 5 #days
sample_freq *= 86400 #seconds per day
sample_width *= 86400 #seconds per day
times = df.index.astype(np.int64)//10**9 #array of timestamps (unix time)
cumsum = np.cumsum(df.A).as_matrix() #array of cumulative sums (could eliminate extra summation with large overlap)
mat = np.array([times, cumsum]) #could eliminate temporary times and cumsum vars
def yieldstep(mat, freq):
    normtime = ((mat[0] - mat[0,0]) / freq).astype(int)  # integer numbers indicating sample number
    for i in range(max(normtime)+1):
        yield np.searchsorted(normtime, i)  # yield beginning of window index

def sumwindow(mat, i, width):  # i is the start of the window returned by yieldstep
    normtime = ((mat[0,i:] - mat[0,i]) / width).astype(int)  # same as before, but we norm to window width
    j = np.searchsorted(normtime, i, side='right') - 1  # find the right side of the window
    # return rightmost timestamp of window in seconds from unix epoch and sum of window
    return mat[0,j], mat[1,j] - mat[1,i]  # sum of window is just end - start because we did a cumsum earlier
windowed_sums = np.array([sumwindow(mat, i, sample_width) for i in yieldstep(mat, sample_freq)])
Looks like a rolling centered window where you pick up data every n days:
def rolleach(df, ndays, window):
    return df.rolling(window, center=True).sum()[ndays-1::ndays]
rolleach(df, 3, 5)
Out[95]:
A
2013-01-02 10.0
2013-01-05 25.0
2013-01-08 40.0
I recently asked a question about calculating maximum drawdown where Alexander gave a very succinct and efficient way of calculating it with DataFrame methods in pandas.
I wanted to follow up by asking how others are calculating maximum active drawdown?
This calculates Max Drawdown, NOT Max Active Drawdown.
This is what I implemented for max drawdown based on Alexander's answer to question linked above:
def max_drawdown_absolute(returns):
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.argmin()
    start = r.loc[:end].argmax()
    return mdd, start, end
It takes a return series and gives back the max_drawdown along with the indices for which the drawdown occurred.
We start by generating a series of cumulative returns to act as a return index.
r = returns.add(1).cumprod()
At each point in time, the current drawdown is calculated by comparing the current level of the return index with the maximum return index for all prior periods.
dd = r.div(r.cummax()).sub(1)
The max drawdown is then just the minimum of all the calculated drawdowns.
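For context, a minimal usage sketch (hypothetical random returns with a default integer index, so the positional or label behaviour of argmin/argmax does not matter here):
import numpy as np
import pandas as pd

np.random.seed(42)
returns = pd.Series(np.random.randn(252) / 100)  # hypothetical daily returns
mdd, start, end = max_drawdown_absolute(returns)
print(mdd, start, end)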
My question:
I wanted to follow up by asking how others are calculating maximum
active drawdown?
Assume that the solution will extend the solution above.
Starting with a series of portfolio returns and benchmark returns, we build cumulative returns for both. The variables below are assumed to already be in cumulative return space.
The active return from period j to period i is:
active(j, i) = p_i / p_j - b_i / b_j = (p_i * b_j - b_i * p_j) / (p_j * b_j)
which is the quantity the drawdown formula below evaluates, with p_j and b_j taken at the running active peak.
Solution
This is how we can extend the absolute solution:
def max_draw_down_relative(p, b):
    p = p.add(1).cumprod()
    b = b.add(1).cumprod()
    pmb = p - b
    cam = pmb.expanding(min_periods=1).apply(lambda x: x.argmax())
    p0 = pd.Series(p.iloc[cam.values.astype(int)].values, index=p.index)
    b0 = pd.Series(b.iloc[cam.values.astype(int)].values, index=b.index)
    dd = (p * b0 - b * p0) / (p0 * b0)
    mdd = dd.min()
    end = dd.argmin()
    start = cam.iloc[end]
    return mdd, start, end
Explanation
Similar to the absolute case, at each point in time, we want to know what the maximum cumulative active return has been up to that point. We get this series of cumulative active returns with p - b. The difference is that we want to keep track of what the p and b were at this time and not the difference itself.
So, we generate a series of 'whens' captured in cam (cumulative argmax) and subsequent series of portfolio and benchmark values at those 'whens'.
p0 = pd.Series(p.iloc[cam.values.astype(int)].values, index=p.index)
b0 = pd.Series(b.iloc[cam.values.astype(int)].values, index=b.index)
The drawdown calculation can now be made analogously using the formula above:
dd = (p * b0 - b * p0) / (p0 * b0)
Demonstration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(314)
p = pd.Series(np.random.randn(200) / 100 + 0.001)
b = pd.Series(np.random.randn(200) / 100 + 0.001)
keys = ['Portfolio', 'Benchmark']
cum = pd.concat([p, b], axis=1, keys=keys).add(1).cumprod()
cum['Active'] = cum.Portfolio - cum.Benchmark
mdd, sd, ed = max_draw_down_relative(p, b)
f, a = plt.subplots(2, 1, figsize=[8, 10])
cum[['Portfolio', 'Benchmark']].plot(title='Cumulative Absolute', ax=a[0])
a[0].axvspan(sd, ed, alpha=0.1, color='r')
cum[['Active']].plot(title='Cumulative Active', ax=a[1])
a[1].axvspan(sd, ed, alpha=0.1, color='r')
You may have noticed that your individual components do not equal the whole, either in an additive or geometric manner:
>>> cum.tail(1)
Portfolio Benchmark Active
199 1.342179 1.280958 1.025144
This is always a troubling situation, as it indicates that some sort of leakage may be occurring in your model.
Mixing single-period and multi-period attribution is always a challenge. Part of the issue lies in the goal of the analysis, i.e. what you are trying to explain.
If you are looking at cumulative returns as is the case above, then one way you perform your analysis is as follows:
Ensure the portfolio returns and the benchmark returns are both excess returns, i.e. subtract the appropriate cash return for the respective period (e.g. daily, monthly, etc.).
Assume you have a rich uncle who lends you $100m to start your fund. Now you can think of your portfolio as three transactions, one cash and two derivative transactions:
a) Invest your $100m in a cash account, conveniently earning the offer rate.
b) Enter into an equity swap for $100m notional
c) Enter into a swap transaction with a zero beta hedge fund, again for $100m notional.
We will conveniently assume that both swap transactions are collateralized by the cash account, and that there are no transaction costs (if only...!).
On day one, the stock index is up just over 1% (an excess return of exactly 1.00% after deducting the cash expense for the day). The uncorrelated hedge fund, however, delivered an excess return of -5%. Our fund is now at $96m.
Day two, how do we rebalance? Your calculations imply that we never do. Each is a separate portfolio that drifts on forever... For the purpose of attribution, however, I believe it makes total sense to rebalance daily, i.e. 100% to each of the two strategies.
As these are just notional exposures with ample cash collateral, we can just adjust the amounts. So instead of having $101m exposure to the equity index on day two and $95m of exposure to the hedge fund, we will instead rebalance (at zero cost) so that we have $96m of exposure to each.
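To make that bookkeeping concrete, a tiny sketch of the rebalancing arithmetic (day one follows the numbers above; day two's returns are made up):
capital = 100.0                      # $m from the rich uncle, held as cash collateral
daily_excess = [(0.01, -0.05),       # day 1: equity swap +1%, zero-beta fund -5% (as in the text)
                (0.02, 0.01)]        # day 2: hypothetical returns
for equity_ret, fund_ret in daily_excess:
    capital += capital * (equity_ret + fund_ret)  # both notionals rebalanced to current capital
    print(capital)                   # ~96.0 after day 1, ~98.88 after day 2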
How does this work in Pandas, you might ask? You've already calculated cum['Portfolio'], which is the cumulative excess growth factor for the portfolio (i.e. after deducting cash returns). If we apply the current day's excess benchmark and active returns to the prior day's portfolio growth factor, we calculate the daily rebalanced returns.
import numpy as np
import pandas as pd
np.random.seed(314)
df_returns = pd.DataFrame({
'Portfolio': np.random.randn(200) / 100 + 0.001,
'Benchmark': np.random.randn(200) / 100 + 0.001})
df_returns['Active'] = df_returns.Portfolio - df_returns.Benchmark
# Build an empty frame to hold the cumulative series.
df_cum = pd.DataFrame()
# Calculate cumulative portfolio growth
df_cum['Portfolio'] = (1 + df_returns.Portfolio).cumprod()
# Calculate shifted portfolio growth factors.
portfolio_return_factors = pd.Series([1] + df_cum['Portfolio'].shift()[1:].tolist(), name='Portfolio_return_factor')
# Use portfolio return factors to calculate daily rebalanced returns.
df_cum['Benchmark'] = (df_returns.Benchmark * portfolio_return_factors).cumsum()
df_cum['Active'] = (df_returns.Active * portfolio_return_factors).cumsum()
Now we see that the active return plus the benchmark return plus the initial cash equals the current value of the portfolio.
>>> df_cum.tail(3)[['Benchmark', 'Active', 'Portfolio']]
Benchmark Active Portfolio
197 0.303995 0.024725 1.328720
198 0.287709 0.051606 1.339315
199 0.292082 0.050098 1.342179
By construction, df_cum['Portfolio'] = 1 + df_cum['Benchmark'] + df_cum['Active'].
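That identity can be checked directly on the frame built above (a sanity-check sketch, allowing for floating-point error):
import numpy as np
assert np.allclose(df_cum['Portfolio'], 1 + df_cum['Benchmark'] + df_cum['Active'])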
Because this method is difficult to calculate (without Pandas!) and understand (most people won't get the notional exposures), industry practice generally defines the active return as the cumulative difference in returns over a period of time. For example, if a fund was up 5.0% in a month and the market was down 1.0%, then the excess return for that month is generally defined as +6.0%. The problem with this simplistic approach, however, is that your results will drift apart over time due to compounding and rebalancing issues that aren't properly factored into the calculations.
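For comparison, a sketch of that simpler convention next to the rebalanced series computed above; the two drift apart over time:
simple_active = df_returns['Active'].cumsum()      # simple convention: cumulative sum of period differences
rebalanced_active = df_cum['Active']               # daily-rebalanced contribution from above
print((simple_active - rebalanced_active).tail())  # the divergence builds up over the sample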
So given our df_cum.Active column, we could define the drawdown as:
drawdown = pd.Series(1 - (1 + df_cum.Active)/(1 + df_cum.Active.cummax()), name='Active Drawdown')
>>> df_cum.Active.plot(legend=True);drawdown.plot(legend=True)
You can then determine the start and end points of the drawdown as you have previously done.
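A sketch of that last step, mirroring max_drawdown_absolute above (the drawdown series defined here is positive at troughs, so we look for its maximum):
end = drawdown.idxmax()                          # deepest point of the active drawdown
start = (1 + df_cum.Active).loc[:end].idxmax()   # preceding peak of the active line
max_active_dd = -drawdown.max()                  # expressed as a negative number, like mdd above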
Comparing my cumulative Active return contribution with the amounts you calculated, you will find them to be similar at first, and then drifting apart over time.
My cheap two pennies in pure Python:
def find_drawdown(lista):
    peak = 0
    trough = 0
    drawdown = 0
    for n in lista:
        if n > peak:
            peak = n
            trough = peak
        if n < trough:
            trough = n
            temp_dd = peak - trough
            if temp_dd > drawdown:
                drawdown = temp_dd
    return -drawdown
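A hedged usage note: this expects a list of return-index or price levels rather than period returns, and it measures the drawdown as a level difference, not a percentage:
# Hypothetical usage: pass return-index (or price) levels, not raw period returns
levels = (1 + pd.Series([0.01, -0.02, 0.005, -0.03, 0.04])).cumprod()
print(find_drawdown(levels.tolist()))  # largest peak-to-trough drop, in level units, as a negative number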
In piRSquared's answer, I would suggest amending
pmb = p - b
to
pmb = p / b
to find the rel. maxDD. df3 using pmb = p-b identifies a rel. MaxDD of US$851 (-48.9%). df2 using pmb = p/b identifies the rel. MaxDD as US$544.6 (-57.9%)
import pandas as pd
import numpy as np
import datetime
import pandas_datareader.data as pdr
import matplotlib.pyplot as plt
import yfinance as yfin
yfin.pdr_override()
stocks = ["AMZN", "SPY"]
df = pdr.get_data_yahoo(stocks, start="2020-01-01", end="2022-02-18")
df = df[['Adj Close']]
df.columns = df.columns.droplevel(0)
df.reset_index(inplace=True)
df.Date=df.Date.dt.date
df2 = df[df.Date.isin([datetime.date(2020,7,9), datetime.date(2022,2,3)])].copy()
df2['AMZN/SPY'] = df2.AMZN / df2.SPY
df2['AMZN-SPY'] = df2.AMZN - df2.SPY
df2['USDdiff'] = df2['AMZN-SPY'].diff().round(1)
df2[["p", "b"]] = df2[['AMZN','SPY']].pct_change(1).round(4)
df2['p-b'] = df2.p - df2.b
df2.replace(np.nan, '', regex=True, inplace=True)
df2 = df2.round(2)
print(df2)
Date AMZN SPY AMZN/SPY AMZN-SPY USDdiff p b p-b
2020-07-09 3182.63 307.7 10.34 2874.93
2022-02-03 2776.91 446.6 6.22 2330.31 -544.6 -0.1275 0.4514 -0.5789
df3 = df[df.Date.isin([datetime.date(2020,9,2), datetime.date(2022,2,3)])].copy()
df3['AMZN/SPY'] = df3.AMZN / df3.SPY
df3['AMZN-SPY'] = df3.AMZN - df3.SPY
df3['USDdiff'] = df3['AMZN-SPY'].diff().round(1)
df3[["p", "b"]] = df3[['AMZN','SPY']].pct_change(1).round(4)
df3['p-b'] = df3.p - df3.b
df3.replace(np.nan, '', regex=True, inplace=True)
df3 = df3.round(2)
print(df3)
Date AMZN SPY AMZN/SPY AMZN-SPY USDdiff p b p-b
2020-09-02 3531.45 350.09 10.09 3181.36
2022-02-03 2776.91 446.60 6.22 2330.31 -851.0 -0.2137 0.2757 -0.4894
PS: I don't have enough reputation to comment.
I'm trying to figure out how to get the interest and principal to display correctly over the years. Here is the part of my code I am having trouble with:
print ('Luke\n-----')
print ('Year\tPrincipal\tInterest\t Total')
LU_RATE = .05
YEAR = 1
Principal = 100
for YEAR in range (1,28):
    # Calculating Luke's total using formula for compounding interest
    Lu_Total = (Principal * ((1 + LU_RATE) ** YEAR))
    # I realize it's a logical error occurring somewhere here
    Lu_Interest = # I'm not sure what to code here
    Lu_Principal = # And here
    # Displaying the Principal, Interest, and Total over the 27 years
    print (YEAR,'\t%.02f\t\t %.02f\t\t %.02f' %(Lu_Principal, Lu_Interest, Lu_Total))
This is what gets displayed (minus the comment symbols of course):
Luke
-----
Year Principal Interest Total
1 # # 105.00
2 # # 110.25
3 # # 115.76
4 # # 121.55
5 # # 127.63
6 # # 134.01
#etc etc....
Every equation I've tried to code has the correct Interest for year one but ends up putting the Principal as the Total. Every year past that calculates out to the wrong numbers.
It should look like:
Luke
-----
Year Principal Interest Total
1 100.00 5.00 105.00
2 105.00 5.25 110.25
3 110.25 5.51 115.76
#etc etc....
I've been working at it on and off throughout the day and just can't seem to figure it out. Thank you in advance for any help or suggestions.
This sounds like homework, so I'll be a little vague:
You have a loop. Your program executes from the top of the loop to the bottom of the loop, and then goes back and starts over at the top of the loop again.
You can change things by setting values in the bottom of the loop that will be used in the top of the loop next time.
For example, you can compute the interest based on this year's principal. You're doing that in the top of the loop.
At the bottom of the loop, after you print everything out for this year, you could change the (next year's) principal by adding (this year's) interest to it. Then 100 would become 105, etc.
And another contestant ;-)
print ('Luke\n-----')
print ('Year\tPrincipal\tInterest\t Total')
rate = .05
principal = 100.
for year in range (1, 28):
    # calculate interest and total
    interest = principal * rate
    total = principal + interest
    # displaying this year's values
    print(year,'\t%.02f\t\t %.02f\t\t %.02f' %(principal, interest, total))
    # next year's principal == this year's total
    principal = total
produces
Luke
-----
Year Principal Interest Total
1 100.00 5.00 105.00
2 105.00 5.25 110.25
3 110.25 5.51 115.76
4 115.76 5.79 121.55
# ... etc ...
Here is what I did:
print ('Luke\n-----')
print ('Year\tPrincipal\tInterest\t Total')
LU_RATE = .05
YEAR = 1
Principal = 100
Prev_Principal = 100 #added to store previous year principal
for YEAR in range (1,28):
    # Calculating Luke's total using formula for compounding interest
    Lu_Total = (Principal * ((1 + LU_RATE) ** YEAR))
    Lu_Interest = Lu_Total - Prev_Principal
    Lu_Principal = Lu_Total - Lu_Interest
    Prev_Principal = Lu_Total
    # Displaying the Principal, Interest, and Total over the 27 years
    print (YEAR,'\t%.02f\t\t %.02f\t\t %.02f' %(Lu_Principal, Lu_Interest, Lu_Total))
There may be another way to do this, but I think you have a few issues. One is that you need to base your "total" calculation (where you're multiplying the principal by the 1+rate ** year) on the original principal value, and you need to keep this value separate from the rest of the calculations.
So you can work with two names like p0 and pN, where p0 represents the initial principal at year 0, and pN represents the original principal PLUS accrued interest at year N, then we reassign pN at the end of the loop.
r = .05
p0 = pN = 100   # p0: original principal; pN: principal plus accrued interest so far
for y in range(1,5):
    total = p0 * ((1+r)**y)
    i = total - pN
    print (y,'\t%.02f\t\t %.02f\t\t %.02f' %(pN, i, total))
    pN = total
The output is as you expect:
I'm trying to calculate how often a state is entered and how long it lasts. For example, I have the three possible states 1, 2 and 3; which state is active is logged in a pandas DataFrame:
test = pd.DataFrame([2,2,2,1,1,1,2,2,2,3,2,2,1,1], index=pd.date_range('00:00', freq='1h', periods=14))
For example the state 1 is entered two times (at index 3 and 12), the first time it lasts three hours, the second time two hours (so on average 2.5). State 2 is entered 3 times, on average for 2.66 hours.
I know that I can mask data I'm not interested in, for example to analyze state 1:
state1 = test.mask(test!=1)
but from there on I can't find a way to go on.
I hope the comments give enough explanation - the key point is you can use a custom rolling window function and then cumsum to group the rows into "clumps" of the same state.
# set things up
freq = "1h"
df = pd.DataFrame(
    [2, 2, 2, 1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1],
    index=pd.date_range('00:00', freq=freq, periods=14)
)
# add a column saying if a row belongs to the same state as the one before it
df["is_first"] = pd.rolling_apply(df, 2, lambda x: x[0] != x[1]).fillna(1)
# the cumulative sum - each "clump" gets its own integer id
df["value_group"] = df["is_first"].cumsum()
# get the rows corresponding to states beginning
start = df.groupby("value_group", as_index=False).nth(0)
# get the rows corresponding to states ending
end = df.groupby("value_group", as_index=False).nth(-1)
# put the timestamp indexes of the "first" and "last" state measurements into
# their own data frame
start_end = pd.DataFrame(
    {
        "start": start.index,
        # add freq to get when the state ended
        "end": end.index + pd.Timedelta(freq),
        "value": start[0]
    }
)
# convert timedeltas to seconds (float)
start_end["duration"] = (
(start_end["end"] - start_end["start"]).apply(float) / 1e9
)
# get average state length and counts
agg = start_end.groupby("value").agg(["mean", "count"])["duration"]
agg["mean"] = agg["mean"] / (60 * 60)
And the output:
mean count
value
1 2.500000 2
2 2.666667 3
3 1.000000 1