I recently asked a question about calculating maximum drawdown where Alexander gave a very succinct and efficient way of calculating it with DataFrame methods in pandas.
I wanted to follow up by asking how others are calculating maximum active drawdown?
Note: the code below calculates max drawdown, not max active drawdown.
This is what I implemented for max drawdown based on Alexander's answer to question linked above:
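def max_drawdown_absolute(returns):
    r = returns.add(1).cumprod()     # cumulative return index
    dd = r.div(r.cummax()).sub(1)    # drawdown from the running peak
    mdd = dd.min()
    end = dd.idxmin()                # index label of the trough (argmin returns a position in current pandas)
    start = r.loc[:end].idxmax()     # index label of the preceding peak
    return mdd, start, end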
It takes a return series and gives back the max drawdown along with the indices at which the drawdown occurred.
We start by generating a series of cumulative returns to act as a return index.
r = returns.add(1).cumprod()
At each point in time, the current drawdown is calculated by comparing the current level of the return index with the maximum return index over all prior periods.
dd = r.div(r.cummax()).sub(1)
The max drawdown is then just the minimum of all the calculated drawdowns.
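The three steps above can be traced by hand on a tiny series. The following is a dependency-free sketch rather than the pandas version, and the return values are made up purely for illustration:

```python
def max_drawdown(returns):
    """Return (max drawdown, peak position, trough position) for simple returns."""
    level = 1.0                      # cumulative return index
    peak, peak_pos = None, None      # running maximum of the index and where it occurred
    mdd, start, end = 0.0, None, None
    for i, r in enumerate(returns):
        level *= 1.0 + r
        if peak is None or level > peak:
            peak, peak_pos = level, i
        dd = level / peak - 1.0      # current drawdown relative to the running peak
        if dd < mdd:
            mdd, start, end = dd, peak_pos, i
    return mdd, start, end           # start/end stay None if the index never falls


# Hypothetical returns: the index peaks after period 0 and bottoms after period 2.
mdd, start, end = max_drawdown([0.10, -0.05, -0.10, 0.20, -0.02])
print(round(mdd, 4), start, end)
```

Here the index runs 1.1, 1.045, 0.9405, 1.1286, 1.1060, so the worst drawdown is 0.9405 / 1.1 - 1 = -14.5%, from position 0 to position 2.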
My question:
I wanted to follow up by asking how others are calculating maximum
active drawdown?
I assume that the solution will extend the one above.
Starting with a series of portfolio returns and benchmark returns, we build cumulative returns for both. The variables below are assumed to already be in cumulative return space.
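The active return from period j to period i, with p and b the cumulative portfolio and benchmark values, is:
$$\left(\frac{p_i}{p_j} - 1\right) - \left(\frac{b_i}{b_j} - 1\right) = \frac{p_i b_j - b_i p_j}{p_j b_j}$$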
Solution
This is how we can extend the absolute solution:
def max_draw_down_relative(p, b):
    p = p.add(1).cumprod()
    b = b.add(1).cumprod()
    pmb = p - b
    # position of the running maximum of p - b (a cumulative argmax)
    cam = pmb.expanding(min_periods=1).apply(lambda x: x.argmax())
    p0 = pd.Series(p.iloc[cam.values.astype(int)].values, index=p.index)
    b0 = pd.Series(b.iloc[cam.values.astype(int)].values, index=b.index)
    dd = (p * b0 - b * p0) / (p0 * b0)
    mdd = dd.min()
    end = dd.idxmin()                      # index label of the trough
    start = p.index[int(cam.loc[end])]     # .ix was removed from pandas; map the peak position back to a label
    return mdd, start, end
Explanation
Similar to the absolute case, at each point in time, we want to know what the maximum cumulative active return has been up to that point. We get this series of cumulative active returns with p - b. The difference is that we want to keep track of what the p and b were at this time and not the difference itself.
So, we generate a series of 'whens' captured in cam (cumulative argmax) and subsequent series of portfolio and benchmark values at those 'whens'.
p0 = pd.Series(p.iloc[cam.values.astype(int)].values, index=p.index)
b0 = pd.Series(b.iloc[cam.values.astype(int)].values, index=b.index)
The drawdown calculation can now be made analogously using the formula above:
dd = (p * b0 - b * p0) / (p0 * b0)
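To make the role of p0 and b0 concrete, here is a small pure-Python sketch of the same idea: track the portfolio and benchmark levels at the running peak of p - b, then apply the formula. It is an illustration with made-up returns, not a replacement for the pandas function:

```python
def max_active_drawdown(p_returns, b_returns):
    """Return (max active drawdown, peak position, trough position)."""
    p = b = 1.0                          # cumulative portfolio / benchmark levels
    best_gap = None                      # running maximum of p - b
    p0 = b0 = peak_pos = None            # levels and position at that maximum
    mdd, start, end = 0.0, None, None
    for i, (rp, rb) in enumerate(zip(p_returns, b_returns)):
        p *= 1.0 + rp
        b *= 1.0 + rb
        if best_gap is None or p - b > best_gap:
            best_gap, p0, b0, peak_pos = p - b, p, b, i
        dd = (p * b0 - b * p0) / (p0 * b0)   # equals p / p0 - b / b0
        if dd < mdd:
            mdd, start, end = dd, peak_pos, i
    return mdd, start, end


# Hypothetical returns: the portfolio leads after period 0, then lags the benchmark.
mdd, start, end = max_active_drawdown([0.05, -0.03, -0.02, 0.04],
                                      [0.01, 0.02, 0.01, 0.00])
print(round(mdd, 4), start, end)
```

With these inputs the peak of p - b is at position 0, and the worst active drawdown is 0.97 * 0.98 - 1.02 * 1.01 = -7.96% at position 2.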
Demonstration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(314)
p = pd.Series(np.random.randn(200) / 100 + 0.001)
b = pd.Series(np.random.randn(200) / 100 + 0.001)
keys = ['Portfolio', 'Benchmark']
cum = pd.concat([p, b], axis=1, keys=keys).add(1).cumprod()
cum['Active'] = cum.Portfolio - cum.Benchmark
mdd, sd, ed = max_draw_down_relative(p, b)
f, a = plt.subplots(2, 1, figsize=[8, 10])
cum[['Portfolio', 'Benchmark']].plot(title='Cumulative Absolute', ax=a[0])
a[0].axvspan(sd, ed, alpha=0.1, color='r')
cum[['Active']].plot(title='Cumulative Active', ax=a[1])
a[1].axvspan(sd, ed, alpha=0.1, color='r')
You may have noticed that your individual components do not equal the whole, either in an additive or geometric manner:
>>> cum.tail(1)
Portfolio Benchmark Active
199 1.342179 1.280958 1.025144
This is always a troubling situation, as it indicates that some sort of leakage may be occurring in your model.
Mixing single-period and multi-period attribution is always a challenge. Part of the issue lies in the goal of the analysis, i.e. what you are trying to explain.
If you are looking at cumulative returns, as is the case above, then one way to perform your analysis is as follows:
Ensure the portfolio returns and the benchmark returns are both excess returns, i.e. subtract the appropriate cash return for the respective period (e.g. daily, monthly, etc.).
Assume you have a rich uncle who lends you $100m to start your fund. Now you can think of your portfolio as three transactions, one cash and two derivative transactions:
a) Invest your $100m in a cash account, conveniently earning the offer rate.
b) Enter into an equity swap for $100m notional
c) Enter into a swap transaction with a zero beta hedge fund, again for $100m notional.
We will conveniently assume that both swap transactions are collateralized by the cash account, and that there are no transaction costs (if only...!).
On day one, the stock index is up just over 1% (an excess return of exactly 1.00% after deducting the cash expense for the day). The uncorrelated hedge fund, however, delivered an excess return of -5%. Our fund is now at $96m.
Day two, how do we rebalance? Your calculations imply that we never do. Each is a separate portfolio that drifts on forever... For the purpose of attribution, however, I believe it makes total sense to rebalance daily, i.e. 100% to each of the two strategies.
As these are just notional exposures with ample cash collateral, we can just adjust the amounts. So instead of having $101m exposure to the equity index on day two and $95m of exposure to the hedge fund, we will instead rebalance (at zero cost) so that we have $96m of exposure to each.
How does this work in Pandas, you might ask? You've already calculated cum['Portfolio'], which is the cumulative excess growth factor for the portfolio (i.e. after deducting cash returns). If we apply the current day's excess benchmark and active returns to the prior day's portfolio growth factor, we calculate the daily rebalanced returns.
import numpy as np
import pandas as pd
np.random.seed(314)
df_returns = pd.DataFrame({
'Portfolio': np.random.randn(200) / 100 + 0.001,
'Benchmark': np.random.randn(200) / 100 + 0.001})
df_returns['Active'] = df_returns.Portfolio - df_returns.Benchmark
# Start from an empty dataframe for the cumulative series.
df_cum = pd.DataFrame()
# Calculate cumulative portfolio growth
df_cum['Portfolio'] = (1 + df_returns.Portfolio).cumprod()
# Calculate shifted portfolio growth factors.
portfolio_return_factors = pd.Series([1] + df_cum['Portfolio'].shift()[1:].tolist(), name='Portfolio_return_factor')
# Use portfolio return factors to calculate daily rebalanced returns.
df_cum['Benchmark'] = (df_returns.Benchmark * portfolio_return_factors).cumsum()
df_cum['Active'] = (df_returns.Active * portfolio_return_factors).cumsum()
Now we see that the active return plus the benchmark return plus the initial cash equals the current value of the portfolio.
>>> df_cum.tail(3)[['Benchmark', 'Active', 'Portfolio']]
Benchmark Active Portfolio
197 0.303995 0.024725 1.328720
198 0.287709 0.051606 1.339315
199 0.292082 0.050098 1.342179
By construction, df_cum['Portfolio'] = 1 + df_cum['Benchmark'] + df_cum['Active'].
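The identity can be checked without pandas. The sketch below applies each day's benchmark and active return to the prior day's portfolio growth factor, as described above; the return values are hypothetical, with the first day chosen to match the +1% / -5% example:

```python
def rebalanced_attribution(portfolio_returns, benchmark_returns):
    """Return a list of (growth, benchmark contribution, active contribution) per period."""
    growth = 1.0                 # cumulative portfolio growth factor
    bench = active = 0.0         # cumulative daily-rebalanced contributions
    rows = []
    for rp, rb in zip(portfolio_returns, benchmark_returns):
        bench += growth * rb             # today's benchmark return on yesterday's capital
        active += growth * (rp - rb)     # today's active return on yesterday's capital
        growth *= 1.0 + rp
        rows.append((growth, bench, active))
    return rows


# Day one: benchmark +1%, active -5%, so the portfolio returns -4% and ends at 0.96.
rows = rebalanced_attribution([-0.04, 0.03, 0.01], [0.01, 0.02, -0.01])
for growth, bench, active in rows:
    print(round(growth, 6), round(1 + bench + active, 6))
```

The two printed columns agree in every period, because each period adds growth[t-1] * rp to both sides.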
Because this method is difficult to calculate (without Pandas!) and understand (most people won't get the notional exposures), industry practice generally defines the active return as the cumulative difference in returns over a period of time. For example, if a fund was up 5.0% in a month and the market was down 1.0%, then the excess return for that month is generally defined as +6.0%. The problem with this simplistic approach, however, is that your results will drift apart over time due to compounding and rebalancing issues that aren't properly factored into the calculations.
So given our df_cum.Active column, we could define the drawdown as:
drawdown = pd.Series(1 - (1 + df_cum.Active)/(1 + df_cum.Active.cummax()), name='Active Drawdown')
>>> df_cum.Active.plot(legend=True);drawdown.plot(legend=True)
You can then determine the start and end points of the drawdown as you have previously done.
Comparing my cumulative Active return contribution with the amounts you calculated, you will find them to be similar at first, and then drift apart over time (my return calcs are in green):
My cheap two pennies in pure Python:
def find_drawdown(lista):
    peak = 0
    trough = 0
    drawdown = 0
    for n in lista:
        if n > peak:
            peak = n
            trough = peak
        if n < trough:
            trough = n
            temp_dd = peak - trough
            if temp_dd > drawdown:
                drawdown = temp_dd
    return -drawdown
In piRSquared's answer, I would suggest amending
pmb = p - b
to
pmb = p / b
to find the relative max drawdown. In the example below, df3 (using pmb = p - b) identifies a relative max drawdown of US$851 (-48.9%), while df2 (using pmb = p / b) identifies it as US$544.6 (-57.9%).
import numpy as np
import pandas as pd
import datetime
import pandas_datareader.data as pdr
import matplotlib.pyplot as plt
import yfinance as yfin
yfin.pdr_override()
stocks = ["AMZN", "SPY"]
df = pdr.get_data_yahoo(stocks, start="2020-01-01", end="2022-02-18")
df = df[['Adj Close']]
df.columns = df.columns.droplevel(0)
df.reset_index(inplace=True)
df.Date=df.Date.dt.date
df2 = df[df.Date.isin([datetime.date(2020,7,9), datetime.date(2022,2,3)])].copy()
df2['AMZN/SPY'] = df2.AMZN / df2.SPY
df2['AMZN-SPY'] = df2.AMZN - df2.SPY
df2['USDdiff'] = df2['AMZN-SPY'].diff().round(1)
df2[["p", "b"]] = df2[['AMZN','SPY']].pct_change(1).round(4)
df2['p-b'] = df2.p - df2.b
df2.replace(np.nan, '', regex=True, inplace=True)
df2 = df2.round(2)
print(df2)
      Date     AMZN    SPY  AMZN/SPY  AMZN-SPY  USDdiff        p       b      p-b
2020-07-09  3182.63  307.7     10.34   2874.93
2022-02-03  2776.91  446.6      6.22   2330.31   -544.6  -0.1275  0.4514  -0.5789
df3 = df[df.Date.isin([datetime.date(2020,9,2), datetime.date(2022,2,3)])].copy()
df3['AMZN/SPY'] = df3.AMZN / df3.SPY
df3['AMZN-SPY'] = df3.AMZN - df3.SPY
df3['USDdiff'] = df3['AMZN-SPY'].diff().round(1)
df3[["p", "b"]] = df3[['AMZN','SPY']].pct_change(1).round(4)
df3['p-b'] = df3.p - df3.b
df3.replace(np.nan, '', regex=True, inplace=True)
df3 = df3.round(2)
print(df3)
      Date     AMZN     SPY  AMZN/SPY  AMZN-SPY  USDdiff        p       b      p-b
2020-09-02  3531.45  350.09     10.09   3181.36
2022-02-03  2776.91  446.60      6.22   2330.31   -851.0  -0.2137  0.2757  -0.4894
PS: I don't have enough reputation to comment.
Related
I am trying to hack together some code that looks like it should print out the risk and returns of a portfolio, but the first return is 0.00, and that can't be right. Here's the code that I'm testing.
import numpy as np
import pandas as pd
# initialize list of lists
data = [[130000, 150000, 190000, 200000], [100000, 200000, 300000, 900000], [350000, 450000, 890000, 20000], [400000, 10000, 500000, 600000]]
# Create the pandas DataFrame
data = pd.DataFrame(data, columns = ['HOSPITAL', 'HOTEL', 'STADIUM', 'SUBWAY'])
# print dataframe.
data
That gives me this data frame.
symbols = data.columns
# convert daily stock prices into daily returns
returns = data.pct_change()
r = np.asarray(np.mean(returns, axis=1))
r = np.nan_to_num(r)
C = np.asmatrix(np.cov(returns))
C = np.nan_to_num(C)
# print expected returns and risk
for j in range(len(symbols)):
    print ('%s: Exp ret = %f, Risk = %f' %(symbols[j],r[j], C[j,j]**0.5))
The result is this.
The hospital risk and return can't be zero. That doesn't make sense. Something is off here, but I'm not sure what.
Finally, I am trying to optimize the portfolio. So, I hacked together this code.
from cvxpy import Minimize, Problem, Variable, quad_form
# Number of variables
n = len(data)
# The variables vector
x = Variable(n)
# The minimum return
req_return = 0.02
# The return
ret = r.T*x
# The risk in xT.Q.x format
risk = quad_form(x, C)
# The core problem definition with the Problem class from CVXPY
prob = Problem(Minimize(risk), [sum(x)==1, ret >= req_return, x >= 0])
try:
    prob.solve()
    print ("Optimal portfolio")
    print ("----------------------")
    for s in range(len(symbols)):
        print (" Investment in {} : {}% of the portfolio".format(symbols[s],round(100*x.value[s],2)))
    print ("----------------------")
    print ("Exp ret = {}%".format(round(100*ret.value,2)))
    print ("Expected risk = {}%".format(round(100*risk.value**0.5,2)))
except:
    print ("Error")
It seems to run but I don't know how to add a constraint. I want to invest at least 5% in every asset and don't invest more than 40% in any one asset. How can I add a constraint to do that?
The idea comes from this link.
https://tirthajyoti.github.io/Notebooks/Portfolio_optimization.html
In the linked notebook, they drop the NaN row from the return dataframe and, after converting the returns to a matrix, transpose it. That transpose is the step you are missing, which is why you get 0 for the Hospital return and risk. You might want to use C = np.asmatrix(np.cov(returns.dropna().transpose())) to skip the first NaN row. This should give you the correct return and risk values.
As for your second question: CVXPY's Problem class takes its constraints as a list in the second argument, which is where your code already passes sum(x)==1, ret >= req_return and x >= 0. Adding x >= 0.05 and x <= 0.40 to that list should enforce the bounds you describe (x >= 0 then becomes redundant).
As a workaround, you might try taking the outputs, capping any investment at 40%, and distributing the remainder among the other ventures proportionally. For example, say it tells you to invest 5%, 80% and 15% of your assets in A, B and C. You could cap the investment in B at 40% and put (5/(5+15))*40 = 10% more into A and the remaining 30% more into C. If that pushes another asset above the cap (C would reach 45% here), repeat the step.
DISCLAIMER: I am not an expert in finance, I am just stating my opinion.
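The capping workaround described above can be sketched in a few lines of plain Python. The function name and tolerance are illustrative assumptions, the sketch only handles the upper cap (not the 5% floor), and a heuristic like this will generally not give the same weights as re-solving the optimization with the bounds as constraints:

```python
def cap_and_redistribute(weights, cap, tol=1e-12):
    """Cap each weight at `cap`, spreading the excess pro rata over uncapped weights."""
    if cap * len(weights) < sum(weights) - tol:
        raise ValueError("cap is too small for the weights to fit")
    w = list(weights)
    capped = set()
    while True:
        over = [i for i, x in enumerate(w) if x > cap + tol]
        if not over:
            return w
        excess = sum(w[i] - cap for i in over)
        for i in over:
            w[i] = cap
        capped.update(over)
        free = [i for i in range(len(w)) if i not in capped]
        free_total = sum(w[i] for i in free)
        for i in free:
            # pro rata share of the excess; equal split if all free weights are zero
            share = w[i] / free_total if free_total > 0 else 1.0 / len(free)
            w[i] += excess * share


# The 5% / 80% / 15% example with a 40% cap.
capped_weights = cap_and_redistribute([0.05, 0.80, 0.15], cap=0.40)
print([round(x, 4) for x in capped_weights])
```

On the example, the first pass gives 15% / 40% / 45%, and a second pass caps C and moves its 5% excess to A, ending at 20% / 40% / 40%.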
My goal is to write a function that returns a vector of portfolio returns for each period (i.e. day) from a pandas dataframe of security prices. For simplicity, let's assume that the initial weights are equally split between securities A and B. Prices are given by the following dataframe:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101', periods=20)
prices = pd.DataFrame({'A': np.linspace(20, 50, num=20),
'B': np.linspace(100, 200, num=20)},
index=dates)
Further, we assume that asset A is the asset where we initiate a short position and we go long asset B.
Calculating discrete returns from a "zero-investment position" such as a short position (i.e. in asset A) as a first step, and then overall portfolio returns from the weighted returns of the single assets in the portfolio as a second step, is not trivial. Before I post my attempt so far, which is not working correctly (the key problem being that the loss from the short position in asset A exceeds -100% on 2013-01-14), I would be grateful for any kind of help, be it theoretical or code.
You are forgetting asset “C”, as in collateral. No matter how generous your broker might be (not!), most exchanges and national regulatory organizations would require collateral. You might read about some wealthy hedge fund guy doing these trades with just long and short positions, but when it goes south you also read about the HF guy losing his art collection, which was the collateral.
Equity margin requirements in the USA would require 50% collateral to start, and at least 25% maintenance margin while the short trades were open. This is enforced by exchanges and regulatory authorities. Treasury bonds might have more favorable requirements, but even then the margin (collateral) is not zero.
Since a long/short doubles your risk (what happens if the long position goes down, and short position goes up?), your broker likely would require more margin than the minimum.
Add asset “C”, collateral, to your calculations and the portfolio returns become straightforward.
Thank you for your answer @Stripedbass. Based on your comment, can the return process of a portfolio consisting of the two stocks be described by the following equations?
The terms with + and - are the market values of the long and short position respectively such that the difference of them represents the net value of the two positions. If we assume that we want to be "market neutral" at the beginning of the trade t=0, the net value is zero.
For t > 0 these net positions represent the unrealised gains or losses of the long and short position that were opened and have not yet been closed. The term C denotes the money that we actually hold. It consists of the initial collateral and the cumulative gains and losses from the stock positions.
The overall return per period from trading the two securities is then calculated as the simple return of the account V.
Based on this, you could define the following function and, for short positions, choose the option type='shares':
def weighted_return(type, df, weights):
    capital = 100
    #given the input dataframe contains return series
    if type == "returns":
        # create price indices
        df.fillna(0, inplace=True)
        df_price_index = pd.DataFrame(index=df.index, columns=df.columns)
        df_price_index.iloc[0] = 100 + df.iloc[0]
        for i in np.arange(1, len(df_price_index)):
            for col in df_price_index.columns:
                df_price_index[col].iloc[i] = df_price_index[col].iloc[i - 1] * (1 + df[col].iloc[i])
        n = 0
        ind_acc = []
        for stock in df.columns:
            ind_capital = capital * weights[n]
            moves = (df_price_index[stock].diff()) * ind_capital / df_price_index[stock][0]
            ind_acc.append(moves)
            n += 1
        pair_ind_accounts = pd.concat(ind_acc, axis=1)
        portfolio_acc = pair_ind_accounts.sum(1).cumsum() + capital
        df_temp_returns_combined = portfolio_acc.pct_change()
        df_temp_returns_combined[0] = np.sum(weights * df.iloc[0].values)
        df_temp_returns_combined = pd.DataFrame(df_temp_returns_combined)
        df_temp_returns_combined.columns = ["combinedReturns"]
    #given the input dataframe contains price series
    if type == "prices":
        n = 0
        ind_acc = []
        for stock in df.columns:
            ind_capital = capital * weights[n]
            moves = (df[stock].diff()) * ind_capital / df[stock][0]
            ind_acc.append(moves)
            n += 1
        pair_ind_accounts = pd.concat(ind_acc, axis=1)
        portfolio_acc = pair_ind_accounts.sum(1).cumsum() + capital
        df_temp_returns_combined = portfolio_acc.pct_change()
        df_temp_returns_combined[0] = np.nan  # np.NaN was removed in NumPy 2.0
        df_temp_returns_combined = pd.DataFrame(df_temp_returns_combined)
        df_temp_returns_combined.columns = ["combinedReturns"]
    #given the input dataframe contains return series and the strategy is long/short
    if type == "shares":
        exposures = []
        for stock in df.columns:
            shares = 1/df[stock][0]
            exposure = df[stock] * shares
            exposures.append(exposure)
        df_temp = pd.concat(exposures, axis=1)
        index_long = np.where(np.array(weights) == 1)
        index_short = np.where(np.array(weights) == -1)
        df_temp_account = pd.DataFrame(df_temp.iloc[:,index_long[0]].values - df_temp.iloc[:,index_short[0]].values) + 1
        df_temp_returns_combined = df_temp_account.pct_change()
        df_temp_returns_combined.columns = ["combinedReturns"]
        df_temp_returns_combined.index = df.index
    return pd.DataFrame(df_temp_returns_combined)
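The 'shares' branch above can be traced by hand on a short price path. The sketch below is plain Python and rests on assumptions that are mine, not the poster's: one unit of capital long, one unit short, positions never closed, and collateral of 1 that earns nothing. The prices are hypothetical:

```python
def long_short_account(long_prices, short_prices, collateral=1.0):
    """Account value V = collateral + long market value - short market value."""
    long_shares = 1.0 / long_prices[0]      # one unit of capital long at t=0
    short_shares = 1.0 / short_prices[0]    # one unit of capital short at t=0
    values = [collateral + long_shares * pl - short_shares * ps
              for pl, ps in zip(long_prices, short_prices)]
    # simple return of the account; undefined (None) for the first period
    returns = [None] + [values[t] / values[t - 1] - 1.0 for t in range(1, len(values))]
    return values, returns


values, returns = long_short_account(long_prices=[100, 110, 121],
                                     short_prices=[20, 21, 25])
print([round(v, 4) for v in values])
print([None if r is None else round(r, 4) for r in returns])
```

The account runs 1.00, 1.05, 0.96. Because returns are measured on the account value rather than on the short leg alone, a loss on the short position larger than its initial notional no longer shows up as a return below -100% as long as the account stays positive.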
I have a piece of code which calculates the amortisation profile of a loan, and allows for defaults (cdr = constant default rate) and prepayments (cpr = constant prepayment rate).
I would also like to include recoveries, but these recoveries should be received in a future period. In the example below I am applying a 3% cdr, and I would like 60% of the defaulted loan balance to be recovered six months later.
I am struggling with this because in each loop it would need to refer back to a previous period.
One way to solve this is by first creating a table / dataframe without recoveries, and then, in a second stage, applying the recoveries by adding 60% of the defaults to the recoveries column, offset by 6 months.
However I am hoping there is a better / cleaner way of doing this inside the amortize function.
Any help would be appreciated.
import pandas as pd
import numpy as np
import numpy_financial as npf  # np.pmt was removed in NumPy 1.20; numpy_financial provides pmt
from datetime import date
from collections import OrderedDict
from dateutil.relativedelta import relativedelta
pd.options.display.float_format = '{:,.2f}'.format
def amortize(principal, int_rate, periods, cpr, cdr, date, recovery_rate, recovery_timing):
    p = 0
    beg_balance = principal
    end_balance = principal
    while end_balance > 1:
        # monthly default from the annual constant default rate
        default = round((1-(1-cdr/100)**(1/12)) * beg_balance,2)
        interest = round((int_rate/12)*max(beg_balance-default,0),2)
        if p < periods:
            pmt = -round(npf.pmt(int_rate/12, periods -p,
                                 beg_balance - default),2)
        else:
            pmt = 0
        principal = pmt - interest
        # monthly prepayment from the annual constant prepayment rate
        prepay = round((1-(1-cpr/100)**(1/12)) * (beg_balance - principal),2)
        end_balance = max(beg_balance - principal - prepay - default,0)
        recovery = default * recovery_rate/100
        total_cash = pmt + prepay + recovery #plus a recovery lag
        yield OrderedDict([('Period',p+1),
                           ('Month', date),
                           ('Begin_Bal', beg_balance),
                           ('Default',default),
                           ('Sched Princ',principal),
                           ('Prepay Princ',prepay),
                           ('Interest',interest),
                           ('Recovery',recovery),
                           ('Total CF',total_cash),
                           ('End Balance', end_balance)])
        p += 1
        date += relativedelta(months=1)
        beg_balance = end_balance
table = amortize(300000,0.03,360,10,3,date(2017,12,11),60,6)
pd.DataFrame(table).head()
Your solution sounds clean. You can also create a list of past defaults with defaults.append(..) then refer to defaults[-6]
If the loan term is very long, you can use collections.deque with a fixed capacity of 6 instead of a list to avoid keeping excessive old calculations.
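As a rough sketch of the deque idea, the helper below keeps only the last `lag` defaults and pays out the oldest one each period. The function name and the sample defaults are hypothetical; inside amortize the same deque would simply live alongside the loop variables:

```python
from collections import deque


def lagged_recoveries(defaults, recovery_rate, lag):
    """Recovery in period t is recovery_rate times the default from period t - lag (lag >= 1)."""
    pending = deque([0.0] * lag, maxlen=lag)   # defaults from the last `lag` periods
    recoveries = []
    for default in defaults:
        recoveries.append(pending[0] * recovery_rate)  # oldest entry is `lag` periods old
        pending.append(default)                        # maxlen drops the entry just used
    return recoveries


# 60% recovery, received 2 periods after the default (short lag to keep the example small).
recoveries = lagged_recoveries([100.0, 200.0, 0.0, 50.0], recovery_rate=0.6, lag=2)
print(recoveries)
```

One design point to keep in mind: defaults in the final `lag` periods are still pending when the loop ends, so the loan's cash flow table would need extra rows if those recoveries should be shown.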
I am trying to build an equity curve in Python using Pandas. For those not in the know, an equity curve is a cumulative tally of investing profits/losses day by day. The code below works but it is incredibly slow. I've tried to build an alternate using Pandas .iloc and such but nothing is working. I'm not sure if it is possible to do this outside of a loop given how I have to reference the prior row(s).
for today in range(len(f1)): #initiate a loop that runs the length of the "f1" dataframe
    if today == 0: #if the index value is zero (aka first row in the dataframe) then...
        f1.loc[today,'StartAUM'] = StartAUM #Set initial assets
        f1.loc[today,'Shares'] = 0 #dummy placeholder for shares; no trading on day 1
        f1.loc[today,'PnL'] = 0 #dummy placeholder for P&L; no trading day 1
        f1.loc[today,'EndAUM'] = StartAUM #set ending AUM; should be beginning AUM since no trades
        continue #and on to the second row in the dataframe
    yesterday = today - 1 #used to reference the rows (see below)
    f1.loc[today,'StartAUM'] = f1.loc[yesterday,'EndAUM'] #today's starting assets are yesterday's ending assets
    f1.loc[today,'Shares'] = f1.loc[yesterday,'EndAUM']//f1.loc[yesterday,'Shareprice'] #today's shares to trade = yesterday's assets/yesterday's share price
    f1.loc[today,'PnL'] = f1.loc[today,'Shares']*f1.loc[today,'Outcome1'] #Our P&L should be the shares traded (see prior line) multiplied by the outcome for 1 share
    #Note Outcome1 came from the dataframe before this loop >> for the purposes here its value is irrelevant
    f1.loc[today,'EndAUM'] = f1.loc[today,'StartAUM']+f1.loc[today,'PnL'] #ending assets are starting assets + today's P&L
There is a good example here: http://www.pythonforfinance.net/category/basic-data-analysis/ and I know that there is an example in Wes McKinney's book Python for Data Analysis. You might be able to find it here: http://wesmckinney.com/blog/python-for-financial-data-analysis-with-pandas/
Have you tried using iterrows() to construct the for loop?
prev = None #index label of the previous row
for index, row in f1.iterrows():
    if prev is None: #first row in the dataframe
        f1.at[index,'StartAUM'] = StartAUM #Set initial assets
        f1.at[index,'Shares'] = 0 #dummy placeholder for shares; no trading on day 1
        f1.at[index,'PnL'] = 0 #dummy placeholder for P&L; no trading day 1
        f1.at[index,'EndAUM'] = StartAUM #set ending AUM; should be beginning AUM since no trades
    else:
        #iterrows() may yield copies, so results are written back through f1.at rather than row[...]
        f1.at[index,'StartAUM'] = f1.at[prev,'EndAUM'] #today's starting assets are yesterday's ending assets
        f1.at[index,'Shares'] = f1.at[prev,'EndAUM']//f1.at[prev,'Shareprice'] #today's shares to trade = yesterday's assets/yesterday's share price
        f1.at[index,'PnL'] = f1.at[index,'Shares']*row['Outcome1'] #P&L is the shares traded multiplied by the outcome for 1 share
        f1.at[index,'EndAUM'] = f1.at[index,'StartAUM']+f1.at[index,'PnL'] #ending assets are starting assets + today's P&L
    prev = index
Probably the code is so slow as loc goes through f1 from beginning every time. iterrows() uses the same dataframe as it loops through it row by row.
See more details about iterrows() here.
You need to vectorize the operations (don't iterate with for but rather compute whole column at once)
# fill the initial values
f1['StartAUM'] = StartAUM # Set initial assets
f1['Shares'] = 0 # dummy placeholder for shares; no trading on day 1
f1['PnL'] = 0 # dummy placeholder for P&L; no trading day 1
f1['EndAUM'] = StartAUM # s
#do the computations (vectorized)
f1['StartAUM'].iloc[1:] = f1['EndAUM'].iloc[:-1]
f1['Shares'].iloc[1:] = f1['EndAUM'].iloc[:-1] // f1['Shareprice'].iloc[:-1]
f1['PnL'] = f1['Shares'] * f1['Outcome1']
f1['EndAUM'] = f1['StartAUM'] + f1 ['PnL']
EDIT: this will not work correctly since StartAUM, EndAUM, Shares depend on each other and cannot be computed one without another. I didn't notice that before.
Can you try the following:
#import relevant modules
import pandas as pd
import numpy as np
from pandas_datareader import data
import matplotlib.pyplot as plt
#download data into DataFrame and create moving averages columns
f1 = data.DataReader('AAPL', 'yahoo',start='1/1/2017')
StartAUM = 1000000
#populate DataFrame with starting values
f1['Shares'] = 0
f1['PnL'] = 0
f1['EndAUM'] = StartAUM
#Set shares held to be the previous day's EndAUM divided by the previous day's closing price
f1['Shares'] = f1['EndAUM'].shift(1) / f1['Adj Close'].shift(1)
#Set the day's PnL to be the number of shares held multiplied by the change in closing price from yesterday to today's close
f1['PnL'] = f1['Shares'] * (f1['Adj Close'] - f1['Adj Close'].shift(1))
#Set day's ending AUM to be previous days ending AUM plus daily PnL
f1['EndAUM'] = f1['EndAUM'].shift(1) + f1['PnL']
#Plot the equity curve
f1['EndAUM'].plot()
Does the above solve your issue?
The solution was to use the Numba package. It performs the loop task in a fraction of the time.
https://numba.pydata.org/
The arguments/dataframe can be passed to the numba module/function. I will try to write up a more detailed explanation with code when time permits.
Thanks to all
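Since no code was posted for this answer, here is a minimal sketch of the kind of loop it describes: the same recursion as the question, but run over plain sequences instead of per-cell .loc lookups. The function name and sample numbers are made up, and it is only an assumption that a loop of this shape over NumPy arrays is what was compiled with Numba (for example via numba.njit), which is not tested here:

```python
def equity_curve(start_aum, share_prices, outcomes):
    """Return (shares, pnl, end_aum) lists; each day depends on the previous day's EndAUM."""
    n = len(share_prices)
    shares = [0.0] * n
    pnl = [0.0] * n
    end_aum = [0.0] * n
    end_aum[0] = start_aum                                   # no trading on day 1
    for t in range(1, n):
        shares[t] = end_aum[t - 1] // share_prices[t - 1]    # whole shares from yesterday's AUM and price
        pnl[t] = shares[t] * outcomes[t]                     # outcome per share times shares traded
        end_aum[t] = end_aum[t - 1] + pnl[t]
    return shares, pnl, end_aum


shares, pnl, end_aum = equity_curve(1000.0, [10.0, 20.0, 25.0], [0.0, 1.5, -2.0])
print(shares, pnl, end_aum)
```

The results can be assigned back to the dataframe in one go, e.g. by passing f1['Shareprice'].to_numpy() and f1['Outcome1'].to_numpy() in and setting the three columns from the returned lists.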
In case others come across this, you can definitely make an equity curve without loops.
Dummy up some data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (13, 10)
# Some data to work with
np.random.seed(1)
stock = pd.DataFrame(
np.random.randn(100).cumsum() + 10,
index=pd.date_range('1/1/2020', periods=100, freq='D'),
columns=['Close']
)
stock['ma_5'] = stock['Close'].rolling(5).mean()
stock['ma_15'] = stock['Close'].rolling(15).mean()
Holdings: simple long/short based on moving average crossover signals
longs = stock['Close'].where(stock['ma_5'] > stock['ma_15'], np.nan)
shorts = stock['Close'].where(stock['ma_5'] < stock['ma_15'], np.nan)
# Quick plot
stock.plot()
longs.plot(lw=5, c='green')
shorts.plot(lw=5, c='red')
EQUITY CURVE:
Identify which side (long/short) has the first holding (i.e. the first trade; in this case, short), then keep the initial trade price and cumulatively sum the daily changes. (There would normally be more NaNs in the series if you also have exit rules for when you are out of the market.) Finally, forward fill over the NaN values and fill any remaining NaNs with zeros. It's basically the same for the opposite holdings (in this case, long), except that you don't keep the starting price. The other important thing is to invert the short daily changes (i.e. negative changes should be positive to the PnL).
lidx = np.where(longs > 0)[0][0]
sidx = np.where(shorts > 0)[0][0]
startdx = min(lidx, sidx)
# For first holding side, keep first trade price, then calc daily change fwd and ffill nan's
# For second holding side, get cumsum of daily changes, ffill and fillna(0) (make sure short changes are inverted)
if lidx == startdx:
    lcurve = longs.diff() # get daily changes
    lcurve.iloc[lidx] = longs.iloc[lidx] # put back initial starting price (lidx is a position, hence .iloc)
    lcurve = lcurve.cumsum().ffill() # add daily changes/ffill to build curve
    scurve = -shorts.diff().cumsum().ffill().fillna(0) # get daily changes (make declines positive changes)
else:
    scurve = -shorts.diff() # get daily changes (make declines positive changes)
    scurve.iloc[sidx] = shorts.iloc[sidx] # put back initial starting price (sidx is a position, hence .iloc)
    scurve = scurve.cumsum().ffill() # add daily changes/ffill to build curve
    lcurve = longs.diff().cumsum().ffill().fillna(0) # get daily changes
Add the 2 long/short curves together to get the final equity curve
eq_curve = lcurve + scurve
# quick plot
stock.iloc[:, :3].plot()
longs.plot(lw=5, c='green', label='Long')
shorts.plot(lw=5, c='red', label='Short')
eq_curve.plot(lw=2, ls='dotted', c='orange', label='Equity Curve')
plt.legend()
consider the df
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df
I want to calculate the sum over a trailing 5 days, every 3 days.
I expect something that looks like this
this was edited
What I had was incorrect. @ivan_pozdeev and @boud noticed this was a centered window, and that was not my intention. Apologies for the confusion.
everyone's solutions capture much of what I was after.
criteria
I'm looking for smart efficient solutions that can be scaled to large data sets.
I'll be timing solutions and also considering elegance.
Solutions should also be generalizable for a variety of sample and look back frequencies.
from comments
I want a solution that generalizes to handle a look back of a specified frequency and grab anything that falls within that look back.
for the sample above, the look back is 5D and there may be 4 or 50 observations that fall within that look back.
I want the timestamp to be the last observed timestamp within the look back period.
the df you gave us is :
A
2012-12-31 0
2013-01-01 1
2013-01-02 2
2013-01-03 3
2013-01-04 4
2013-01-05 5
2013-01-06 6
2013-01-07 7
2013-01-08 8
2013-01-09 9
2013-01-10 10
You could create your rolling 5-day sum series and then resample it. I can't think of a more efficient way than this; overall it should be relatively time-efficient.
df.rolling(5,min_periods=5).sum().dropna().resample('3D').first()
Out[36]:
A
2013-01-04 10.0000
2013-01-07 25.0000
2013-01-10 40.0000
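For the more general requirement in the question (a look-back defined in calendar days, any number of observations per window, labelled by the last observed timestamp), a dependency-free sketch is shown below. It uses prefix sums plus binary search; the function name, and the choice to anchor the first window at the first observation, are assumptions made for illustration:

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta


def trailing_sums(dates, values, lookback_days, step_days):
    """Sum values over a trailing calendar window, sampled every step_days.

    dates must be sorted ascending; duplicates and gaps are allowed.
    """
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)            # prefix[k] = sum of the first k values
    span = timedelta(days=lookback_days - 1)
    window_end = dates[0] + span                 # first full window
    out = []
    while window_end <= dates[-1]:
        lo = bisect_left(dates, window_end - span)   # first observation inside the window
        hi = bisect_right(dates, window_end)         # one past the last observation inside
        if hi > lo:
            out.append((dates[hi - 1], prefix[hi] - prefix[lo]))  # label: last observed date
        window_end += timedelta(days=step_days)
    return out


days = [date(2012, 12, 31) + timedelta(days=i) for i in range(11)]
result = trailing_sums(days, list(range(11)), lookback_days=5, step_days=3)
for stamp, total in result:
    print(stamp, total)
```

On the sample frame this reproduces the sums 10, 25 and 40 on 2013-01-04, 2013-01-07 and 2013-01-10.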
Listed here are a few NumPy-based solutions using bin-based summing, covering three scenarios.
Scenario #1 : Multiple entries per date, but no missing dates
Approach #1 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
    # Extract the index names and values
    vals = df.A.values
    indx = df.index.values
    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()
    search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)
    # Perform bin based summing and subtract the repeated ones
    IDsums = np.bincount(id_arr,vals)
    allsums = IDsums[:-1] + IDsums[1:]
    allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]
    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])
Approach #2 :
# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
    # Extract the index names and values
    indx = df.index.values
    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()
    # Generate IDs at which shifts are to happen for a (2,3,5,8..) pattern
    # Pad with 0 and length of array at either ends as we use diff later on
    shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()
    search_id = np.hstack((0,shiftIDs,date_id[-1]+1))
    # Find the start of those shifting indices
    # Generate ID based on shifts and do bin based summing of dataframe
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)
    IDsums = np.bincount(id_arr,df.A.values)
    # Sum each group of 3 elems with a stride of 2, make dataframe if needed
    allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]
    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])
Approach #3 :
def vectorized_app3(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])
We could replace the convolution part with direct sliced summation for a modified version of it -
def vectorized_app3_v2(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    f = c.size+S-W
    out = c[:f:S].copy()
    for i in range(1,W):
        out += c[i:f+i:S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])
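To see why the sliced summation can stand in for the convolution, here is a small check on made-up per-date sums (the array `c` below is hypothetical, and its length is chosen so that the windows tile the data exactly). Both routes should give the same strided window sums.

```python
import numpy as np

# Hypothetical per-date sums, as np.bincount would produce upstream
c = np.arange(1, 12)            # 11 consecutive dates, values 1..11
S, W = 3, 5                     # stride and window length

# Route 1: convolve with a window of ones, then keep every S-th window
via_convolve = np.convolve(c, np.ones(W, dtype=int), 'valid')[::S]

# Route 2: add up W shifted slices, each taken with stride S
f = c.size + S - W
via_slices = c[:f:S].copy()
for i in range(1, W):
    via_slices += c[i:f + i:S]

# Both hold the sums of c[0:5], c[3:8] and c[6:11]
same = np.array_equal(via_convolve, via_slices)
```

The sliced version only computes the windows that are kept, which is presumably where any speedup comes from when S is large.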
Scenario #2 : Multiple entries per date and missing dates
Approach #4 :
def vectorized_app4(df, S=3, W=5):
    dt = df.index.values
    indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
    WL = ((indx[-1]+1)//S)
    c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')
    freq_str = str(S)+'D'
    grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
    out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]
    return pd.DataFrame(out,index=out_index,columns=['A'])
Scenario #3 : Consecutive dates and exactly one entry per date
Approach #5 :
def vectorized_app5(df, S=3, W=5):
    vals = df.A.values
    # Number of complete windows; rounding down keeps as_strided inside the buffer
    N = (df.shape[0]-W)//S + 1
    n = vals.strides[0]
    out = np.lib.stride_tricks.as_strided(vals,shape=(N,W),
                                          strides=(S*n,n)).sum(1)
    index_idx = (W-1)+S*np.arange(N)
    out_index = df.index[index_idx]
    return pd.DataFrame(out,index=out_index,columns=['A'])
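For comparison, the same strided windows can be built with `sliding_window_view`, which does the bounds bookkeeping itself. This is not part of the approach above, just a sketch assuming NumPy 1.20 or newer; the values are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

vals = np.arange(11)            # hypothetical values, one per consecutive date
S, W = 3, 5                     # stride and window length

# All length-W windows as a read-only view, then keep every S-th one
windows = sliding_window_view(vals, W)[::S]
window_sums = windows.sum(axis=1)
# Position of the last element of each kept window, usable to pick index labels
last_pos = (W - 1) + S * np.arange(len(windows))
```

Only complete windows are produced, so no out-of-bounds memory is ever touched.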
Suggestions for creating test-data
Scenario #1 :
# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])
Scenario #2 :
To create setup for multiple dates and with missing dates, we could just edit the df_data creation step, like so -
df_data = np.random.randint(0,9,(len(idx0)))
Scenario #3 :
# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3 # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
If the dataframe is sorted by date, the task reduces to iterating over an array while accumulating sums.
Here's the algorithm that calculates sums all in one iteration over the array. To understand it, see a scan of my notes below. This is the base, unoptimized version intended to showcase the algorithm (optimized ones for Python and Cython follow), and list(<call>) takes ~500 ms for an array of 100k on my system (P4). Since Python ints and ranges are relatively slow, this should benefit tremendously from being transferred to C level.
from __future__ import division
import numpy as np
#The date column is unimportant for calculations.
# I leave extracting the numbers' column from the dataframe
# and adding a corresponding element from data column to each result
# as an exercise for the reader
data = np.random.randint(100,size=100000)
def calc_trailing_data_with_interval(data,n,k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    #type data: ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    lim_index=len(data)-k+1
    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums,dtype=data.dtype)
    M=n%k
    Mp=k-M
    index=0
    currentsum=0
    while index<lim_index:
        for _ in range(Mp):
            #np.take is awkward, requiring a full list of indices to take
            for i in range(currentsum,currentsum+nsums-1):
                sums[i%nsums]+=data[index]
            index+=1
        for _ in range(M):
            sums+=data[index]
            index+=1
        yield sums[currentsum]
        currentsum=(currentsum+1)%nsums
Note that it produces the first sum at the kth element, not the nth. This can be changed, but at the cost of elegance (a number of dummy iterations before the main loop); it is more elegantly done by prepending data with extra zeros and discarding a number of first sums.
It can easily be generalized to any operation by replacing sums[slice]+=data[index] with operation(sums[slice],data[index]) where operation is a parameter and should be a mutating operation (like ndarray.__iadd__).
Parallelizing between any number of workers by splitting the data is just as easy (if n>k, chunks after the first one should be fed extra elements at the start).
To deduce the algorithm, I wrote a sample for a case where a decent number of sums are calculated simultaneously in order to see patterns.
Optimized: pure Python
Caching range objects brings the time down to ~300ms. Surprisingly, numpy functionality is of no help: np.take is unusable, and replacing currentsum logic with static slices and np.roll is a regression. Even more surprisingly, the benefit of saving output to an np.empty as opposed to yield is nonexistent.
def calc_trailing_data_with_interval(data,n,k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    #type data: ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    lim_index=len(data)-k+1
    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums,dtype=data.dtype)
    M=n%k
    Mp=k-M
    RM=range(M) #cache for efficiency
    RMp=range(Mp) #cache for efficiency
    index=0
    currentsum=0
    currentsum_ranges=[range(currentsum,currentsum+nsums-1)
                       for currentsum in range(nsums)] #cache for efficiency
    while index<lim_index:
        for _ in RMp:
            #np.take is unusable as it allocates another array rather than view
            for i in currentsum_ranges[currentsum]:
                sums[i%nsums]+=data[index]
            index+=1
        for _ in RM:
            sums+=data[index]
            index+=1
        yield sums[currentsum]
        currentsum=(currentsum+1)%nsums
Optimized: Cython
Statically typing everything in Cython instantly speeds things up to 150ms. And (optionally) assuming np.int as dtype to be able to work with data at C level brings the time down to as little as ~11ms. At this point, saving to an np.empty does make a difference, saving an unbelievable ~6.5ms, totalling ~5.5ms.
def calc_trailing_data_with_interval(np.ndarray data,int n,int k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    #type data: 1-d ndarray
    #param n: number of trailing elements to sum up
    #param k: interval with which to calculate sums
    """
    if not data.ndim==1: raise TypeError("One-dimensional array required")
    cdef int lim_index=data.size-k+1
    cdef np.ndarray result = np.empty(data.size//k,dtype=data.dtype)
    cdef int rindex = 0
    cdef int nsums = int(np.ceil(float(n)/k))
    cdef np.ndarray sums = np.zeros(nsums,dtype=data.dtype)
    #optional speedup for dtype=np.int
    cdef bint use_int_buffer = data.dtype==np.int and data.flags.c_contiguous
    cdef int[:] cdata = data
    cdef int[:] csums = sums
    cdef int[:] cresult = result
    cdef int M=n%k
    cdef int Mp=k-M
    cdef int index=0
    cdef int currentsum=0
    cdef int _,i
    while index<lim_index:
        for _ in range(Mp):
            #np.take is unusable as it allocates another array rather than view
            for i in range(currentsum,currentsum+nsums-1):
                if use_int_buffer: csums[i%nsums]+=cdata[index] #optional speedup
                else: sums[i%nsums]+=data[index]
            index+=1
        for _ in range(M):
            if use_int_buffer:
                for i in range(nsums): csums[i]+=cdata[index] #optional speedup
            else: sums+=data[index]
            index+=1
        if use_int_buffer: cresult[rindex]=csums[currentsum] #optional speedup
        else: result[rindex]=sums[currentsum]
        currentsum=(currentsum+1)%nsums
        rindex+=1
    return result
For regularly-spaced dates only
Here are two methods, first a pandas way and second a numpy function.
>>> n=5 # trailing periods for rolling sum
>>> k=3 # frequency of rolling sum calc
>>> df.rolling(n).sum()[-1::-k][::-1]
A
2013-01-01 NaN
2013-01-04 10.0
2013-01-07 25.0
2013-01-10 40.0
And here's a numpy function (adapted from Jaime's numpy moving_average):
def rolling_sum(a, n=5, k=3):
    ret = np.cumsum(a.values)
    ret[n:] = ret[n:] - ret[:-n]
    return pd.DataFrame( ret[n-1:][-1::-k][::-1],
                         index=a[n-1:][-1::-k][::-1].index )
rolling_sum(df,n=6,k=4) # default n=5, k=3
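The cumulative-sum trick used in that function can be traced step by step on a plain array. This is only an illustrative sketch with made-up values; the names `cs`, `trailing`, `full` and `every_k` are mine.

```python
import numpy as np

a = np.arange(11)               # hypothetical values 0..10
n, k = 5, 3                     # trailing window length and sampling interval

cs = np.cumsum(a)
trailing = cs.copy()
# cs[i] - cs[i-n] is the sum of the n elements a[i-n+1] .. a[i]
trailing[n:] = cs[n:] - cs[:-n]
# Drop the first n-1 entries, which cover fewer than n elements
full = trailing[n - 1:]
# Keep every k-th sum counting back from the last one, then restore order
every_k = full[-1::-k][::-1]
```

Counting back from the end means the most recent window is always included, and any leftover offset falls at the start of the data.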
For irregularly-spaced dates (or regularly-spaced)
Simply precede with:
df.resample('D').sum().fillna(0)
For example, the above methods become:
df.resample('D').sum().fillna(0).rolling(n).sum()[-1::-k][::-1]
and
rolling_sum( df.resample('D').sum().fillna(0) )
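To make the effect of the resampling step concrete, here is a small sketch on made-up irregular data (the frame `irregular` is hypothetical): duplicate dates are summed into one row and missing dates appear as rows of zeros, so the regularly-spaced methods apply afterwards.

```python
import pandas as pd

# Hypothetical irregular data: 2013-01-01 appears twice, 2013-01-02 is missing
idx = pd.to_datetime(['2012-12-31', '2013-01-01', '2013-01-01', '2013-01-03'])
irregular = pd.DataFrame({'A': [1, 2, 3, 4]}, index=idx)

# One row per calendar day: duplicates are summed, gaps are filled with 0
daily = irregular.resample('D').sum().fillna(0)
```

Here `daily` has four consecutive days with values 1, 5, 0 and 4.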
Note that dealing with irregularly-spaced dates can be done simply and elegantly in pandas as this is a strength of pandas over almost anything else out there. But you can likely find a numpy (or numba or cython) approach that will trade off some simplicity for an increase in speed. Whether this is a good tradeoff will depend on your data size and performance requirements, of course.
For the irregularly spaced dates, I tested on the following example data and it seemed to work correctly. This will produce a mix of missing, single, and multiple entries per date:
np.random.seed(12345)
per = 11
tidx = np.random.choice( pd.date_range('2012-12-31', periods=per, freq='D'), per )
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx).sort_index()
This isn't quite perfect yet, but I've gotta go make fake blood for a Halloween party tonight... You should be able to see what I was getting at through the comments. One of the biggest speedups is finding the window edges with np.searchsorted. It doesn't quite work yet, but I'd bet it's just some index offsets that need tweaking.
import pandas as pd
import numpy as np
tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
sample_freq = 3 #days
sample_width = 5 #days
sample_freq *= 86400 #seconds per day
sample_width *= 86400 #seconds per day
times = df.index.astype(np.int64)//10**9 #array of timestamps (unix time)
cumsum = np.cumsum(df.A).values #array of cumulative sums (could eliminate extra summation with large overlap); .as_matrix() was removed in pandas 1.0
mat = np.array([times, cumsum]) #could eliminate temporary times and cumsum vars
def yieldstep(mat, freq):
    normtime = ((mat[0] - mat[0,0]) / freq).astype(int) #integer numbers indicating sample number
    for i in range(max(normtime)+1):
        yield np.searchsorted(normtime, i) #yield beginning of window index
def sumwindow(mat,i , width): #i is the start of the window returned by yieldstep
    normtime = ((mat[0,i:] - mat[0,i])/ width).astype(int) #same as before, but we norm to window width
    j = np.searchsorted(normtime, i, side='right')-1 #find the right side of the window
    #return rightmost timestamp of window in seconds from unix epoch and sum of window
    return mat[0,j], mat[1,j] - mat[1,i] #sum of window is just end - start because we did a cumsum earlier
windowed_sums = np.array([sumwindow(mat, i, sample_width) for i in yieldstep(mat, sample_freq)])
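For reference, here is one way the searchsorted idea can be made to work. It is a sketch rather than a repair of the code above: it works in whole days instead of seconds, prepends a zero to the cumulative sum so that a window sum is a plain difference, and assumes the index is sorted. The names `toy`, `step`, `width`, `ends`, `lo` and `hi` are mine.

```python
import numpy as np
import pandas as pd

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
toy = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)

step, width = 3, 5                        # sampling interval and window width, in days
ts = toy.index.values
# Whole days elapsed since the first row
days = (ts - ts[0]) // np.timedelta64(1, 'D')
# Prepend 0 so that csum[j] - csum[i] is the sum of rows i .. j-1
csum = np.concatenate(([0], np.cumsum(toy['A'].values)))

# Last day of each window: width-1, width-1+step, ...
ends = np.arange(width - 1, days[-1] + 1, step)
# Row range [lo, hi) holding the days end-width+1 .. end of each window
lo = np.searchsorted(days, ends - width + 1, side='left')
hi = np.searchsorted(days, ends, side='right')
window_sums = csum[hi] - csum[lo]
```

Because the edges are looked up by day number rather than by row position, repeated or missing dates should not need special handling, as long as the rows are sorted.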
Looks like a rolling centered window where you pick up data every n days:
def rolleach(df, ndays, window):
    return df.rolling(window, center=True).sum()[ndays-1::ndays]
rolleach(df, 3, 5)
Out[95]:
A
2013-01-02 10.0
2013-01-05 25.0
2013-01-08 40.0