I am trying to build a simulation of an investment portfolio with a few manually adjustable inputs. The returns simulation runs fine, but I can't figure out how to carry each period's ending value forward so that it becomes the beginning value of the next period.
def mcs(n_years = 10, n_scenarios=10, mu=0.07, sigma=0.15, steps_per_year=12, balance=100,
        expenses=10, personal_income=5):
    dt = 1/steps_per_year
    n_steps = int(n_years*steps_per_year) + 1
    beginning_value=balance
    ending_value=balance
    for i in range(n_steps):
        rets_mcs = np.random.normal(loc=(1+mu)**dt-1, scale=(sigma*np.sqrt(dt)), size=(n_steps, n_scenarios))
        ending_value = balance + balance*rets_mcs - expenses + personal_income
    df = pd.DataFrame(data=ending_value,index=range(n_steps))
    df.iloc[0]=balance
    return df
I keep ending up with results based on the original value. Any code help or resources would be appreciated.
I managed to solve this problem using the code below. This previous post was helpful (Dataframe with Monte Carlo Simulation calculation next row Problem). The key change is that s_0 is overwritten with ending_value at the end of every step.
import numpy as np
import pandas as pd
n_years = 10
n_scenarios=2
mu=0.07
sigma=0.15
steps_per_year=12
s_0=100
expenses=10
personal_income=5
inflation=0
wage_growth=0
output = []
dt = 1/steps_per_year
n_steps = int(n_years*steps_per_year) + 1
for i in range(n_steps):
    rets_ngbm = np.random.normal(loc=(1+mu)**dt-1, scale=(sigma*np.sqrt(dt)), size=n_scenarios)
    ending_value = s_0 + s_0*rets_ngbm - expenses + personal_income
    output.append(ending_value)
    s_0 = ending_value  # this period's ending value is the next period's beginning value
df = pd.DataFrame(data=output,index=range(n_steps),columns=range(n_scenarios))
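For reference, the same fix can be folded back into the original mcs signature. This is a sketch rather than the poster's code: the seed parameter and the use of numpy's default_rng are additions made here, and row 0 is assumed to hold the starting balance.

```python
import numpy as np
import pandas as pd

def mcs(n_years=10, n_scenarios=10, mu=0.07, sigma=0.15, steps_per_year=12,
        balance=100, expenses=10, personal_income=5, seed=None):
    """Simulate portfolio paths where each step starts from the previous ending value."""
    rng = np.random.default_rng(seed)  # seed is an added, optional parameter
    dt = 1 / steps_per_year
    n_steps = int(n_years * steps_per_year)
    # Draw all per-step returns up front: one row per step, one column per scenario.
    rets = rng.normal(loc=(1 + mu) ** dt - 1,
                      scale=sigma * np.sqrt(dt),
                      size=(n_steps, n_scenarios))
    values = np.empty((n_steps + 1, n_scenarios))
    values[0] = balance  # row 0 is the starting balance
    for t in range(n_steps):
        # The previous row is the beginning value of this step.
        values[t + 1] = values[t] * (1 + rets[t]) - expenses + personal_income
    return pd.DataFrame(values)

paths = mcs(n_years=1, n_scenarios=3, seed=0)
print(paths.tail())
```

Drawing the returns in one call keeps the loop body to a single array update per step.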
Related
My goal is to write a function that returns a vector of portfolio returns for each period (i.e. day) from a pandas dataframe of security prices. For simplicity, let's assume that the initial weights are equally split between securities A and B. Prices are given by the following dataframe:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101', periods=20)
prices = pd.DataFrame({'A': np.linspace(20, 50, num=20),
                       'B': np.linspace(100, 200, num=20)},
                      index=dates)
Further, we assume that asset A is the asset where we initiate a short position and we go long asset B.
Calculating discrete returns for a "zero-investment position" such as a short position (here in asset A), and then calculating overall portfolio returns from the weighted returns of the individual assets, is not trivial. My attempt so far does not work correctly: the key problem is that the loss from the short position in asset A exceeds -100% on 2013-01-14. Before I post it, I would be grateful for any kind of help, whether theoretical or code.
You are forgetting asset “C”, as in collateral. No matter how generous your broker might be (not!), most exchanges and national regulatory organizations require collateral. You might read about some wealthy hedge fund manager doing these trades with just long and short positions, but when it goes south you also read about that manager losing his art collection, which was the collateral.
Equity margin requirements in the USA would require 50% collateral to start, and at least 25% maintenance margin while the short trades were open. This is enforced by exchanges and regulatory authorities. Treasury bonds might have more favorable requirements, but even then the margin (collateral) is not zero.
Since a long/short doubles your risk (what happens if the long position goes down, and short position goes up?), your broker likely would require more margin than the minimum.
Add asset “C”, collateral, to your calculations and the portfolio returns become straightforward.
Thank you for your answer, @Stripedbass. Based on your comment, can the return process of a portfolio consisting of the two stocks be described by the following equations?
The terms with + and - are the market values of the long and short position respectively such that the difference of them represents the net value of the two positions. If we assume that we want to be "market neutral" at the beginning of the trade t=0, the net value is zero.
For t > 0 these net positions represent the unrealised gains or losses of the long and short position that were opened and have not yet been closed. The term C denotes the money that we actually hold. It consists of the initial collateral and the cumulative gains and losses from the stock positions.
The overall return per period from trading the two securities is then calculated as the simple return of the account V.
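As a rough illustration of the account described above, the sketch below uses the question's price data, goes long B and short A with one unit of notional each, and holds one unit of cash collateral. The collateral size, the zero interest on cash, and the name account_value are assumptions made for this example, not part of the original posts.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=20)
prices = pd.DataFrame({"A": np.linspace(20, 50, num=20),
                       "B": np.linspace(100, 200, num=20)},
                      index=dates)

collateral = 1.0  # assumed cash posted at t=0, earning no interest

# Market value of each leg per unit of notional opened at t=0.
long_value = prices["B"] / prices["B"].iloc[0]
short_value = prices["A"] / prices["A"].iloc[0]

# Account V = cash + long market value - short market value.
# At t=0 the two legs offset, so V starts at the collateral.
account_value = collateral + long_value - short_value

# Simple per-period return of the account.
portfolio_returns = account_value.pct_change()
print(account_value.iloc[[0, -1]])
```

Because the return is measured on the whole account, a loss on the short leg only reaches -100% if the collateral is exhausted.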
Based on this, you could define the following function and, for short positions, choose the option type='shares':
def weighted_return(type, df, weights):
    capital = 100
    # given the input dataframe contains return series
    if type == "returns":
        # create price indices
        df.fillna(0, inplace=True)
        df_price_index = pd.DataFrame(index=df.index, columns=df.columns)
        df_price_index.iloc[0] = 100 + df.iloc[0]
        for i in np.arange(1, len(df_price_index)):
            for j, col in enumerate(df_price_index.columns):
                df_price_index.iloc[i, j] = df_price_index.iloc[i - 1, j] * (1 + df[col].iloc[i])
        n = 0
        ind_acc = []
        for stock in df.columns:
            ind_capital = capital * weights[n]
            moves = (df_price_index[stock].diff()) * ind_capital / df_price_index[stock].iloc[0]
            ind_acc.append(moves)
            n += 1
        pair_ind_accounts = pd.concat(ind_acc, axis=1)
        portfolio_acc = pair_ind_accounts.sum(axis=1).cumsum() + capital
        df_temp_returns_combined = portfolio_acc.pct_change()
        df_temp_returns_combined.iloc[0] = np.sum(weights * df.iloc[0].values)
        df_temp_returns_combined = pd.DataFrame(df_temp_returns_combined)
        df_temp_returns_combined.columns = ["combinedReturns"]
    # given the input dataframe contains price series
    if type == "prices":
        n = 0
        ind_acc = []
        for stock in df.columns:
            ind_capital = capital * weights[n]
            moves = (df[stock].diff()) * ind_capital / df[stock].iloc[0]
            ind_acc.append(moves)
            n += 1
        pair_ind_accounts = pd.concat(ind_acc, axis=1)
        portfolio_acc = pair_ind_accounts.sum(axis=1).cumsum() + capital
        df_temp_returns_combined = portfolio_acc.pct_change()
        df_temp_returns_combined.iloc[0] = np.nan
        df_temp_returns_combined = pd.DataFrame(df_temp_returns_combined)
        df_temp_returns_combined.columns = ["combinedReturns"]
    # given the input dataframe contains price series and the strategy is long/short
    if type == "shares":
        exposures = []
        for stock in df.columns:
            shares = 1/df[stock].iloc[0]
            exposure = df[stock] * shares
            exposures.append(exposure)
        df_temp = pd.concat(exposures, axis=1)
        index_long = np.where(np.array(weights) == 1)
        index_short = np.where(np.array(weights) == -1)
        # long exposure minus short exposure, plus one unit of collateral
        df_temp_account = pd.DataFrame(df_temp.iloc[:,index_long[0]].values - df_temp.iloc[:,index_short[0]].values) + 1
        df_temp_returns_combined = df_temp_account.pct_change()
        df_temp_returns_combined.columns = ["combinedReturns"]
        df_temp_returns_combined.index = df.index
    return pd.DataFrame(df_temp_returns_combined)
I have a pd.DataFrame of return series corresponding to years, with a fixed spending rate of 5%. I am looking to find the ending portfolio value after spending for each year. Spending in year t equals the spending rate times the average of year t's val_before_spending and year t-1's val_after_spending, and val_after_spending in year t is val_before_spending minus that spending. For the first year, the val_after_spending in t-1 is assumed to be 1.
I currently have a working implementation (below), but it is incredibly slow. Is there a faster way to implement this?
import pandas as pd
import numpy as np
port_rets = pd.DataFrame({'port_ret': [.10,-.25,.15]})
spending_rate = .05
for index, row in port_rets.iterrows():
    if index != 0:
        port_rets.at[index, 'val_before_spending'] = port_rets['val_after_spending'][index - 1] * (1 + port_rets['port_ret'][index])
        port_rets.at[index, 'spending'] = np.mean([port_rets['val_after_spending'][index - 1], port_rets['val_before_spending'][index]]) * spending_rate
    else:
        port_rets.at[index, 'val_before_spending'] = 1 * (1 + port_rets['port_ret'][index])
        port_rets.at[index, 'spending'] = np.mean([1, port_rets['val_before_spending'][index]]) * spending_rate
    port_rets.at[index, 'val_after_spending'] = port_rets['val_before_spending'][index] - port_rets['spending'][index]
#    port_ret  val_before_spending  spending  val_after_spending
# 0  0.100000             1.100000  0.052500            1.047500
# 1 -0.250000             0.785625  0.045828            0.739797
# 2  0.150000             0.850766  0.039764            0.811002
Your code interfaces heavily with pandas, which is a bad idea as far as performance is concerned. To be as easy to use as it is, pandas has to do a lot of bookkeeping, which reduces performance.
Instead, we can do all the calculation in numpy and, once we have all the building blocks, build the dataframe at the end. The code then translates to:
def get_vals(rates, spending_rate):
    n = len(rates)
    vals_after_spending = np.zeros((n+1, ))
    vals_before_spending = np.zeros((n+1, ))
    vals_after_spending[0] = 1.0
    for i in range(n):
        vals_before_spending[i+1] = vals_after_spending[i] * (1 + rates[i])
        spending = np.mean(np.array([vals_after_spending[i], vals_before_spending[i+1]])) * spending_rate
        vals_after_spending[i+1] = vals_before_spending[i+1] - spending
    return vals_before_spending[1:], vals_after_spending[1:]
rates = np.array(port_rets["port_ret"].tolist())
vals_before_spending, vals_after_spending = get_vals(rates, spending_rate)
port_rets = pd.DataFrame({'port_ret': rates, "val_before_spending": vals_before_spending, "val_after_spending": vals_after_spending})
We can improve further by JIT compiling the code, as Python loops are slow.
Below I use numba:
import numba as nb
@nb.njit(cache=True)  # as easy as putting this decorator
def get_vals(rates, spending_rate):
    n = len(rates)
    vals_after_spending = np.zeros((n+1, ))
    vals_before_spending = np.zeros((n+1, ))
    # ... code remains same, we are just compiling the function
If we consider a random list of rates like this :
port_rets = pd.DataFrame({'port_ret': np.random.uniform(low=-1.0, high=1.0, size=(100000,))})
We get the performance comparisons:
Your code : 15.758s
get_vals : 1.407s
JITed get_vals : 0.093s (on second run to discount for compile time)
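The loop can also be removed altogether. Substituting the spending rule into the update gives val_after[t] = val_after[t-1] * ((1 + r[t]) * (1 - s/2) - s/2), where s is the spending rate, so the whole path is a cumulative product. The sketch below is not from the answer above, and the function name get_vals_vectorized is made up for illustration.

```python
import numpy as np
import pandas as pd

def get_vals_vectorized(rates, spending_rate):
    """Closed-form version of the spending recursion, starting from a value of 1."""
    rates = np.asarray(rates, dtype=float)
    half = spending_rate / 2
    # Per-period growth factor of val_after_spending.
    factors = (1 + rates) * (1 - half) - half
    vals_after = np.cumprod(factors)
    # val_after_spending of the previous period, with 1.0 before the first period.
    prev_after = np.concatenate(([1.0], vals_after[:-1]))
    vals_before = prev_after * (1 + rates)
    return vals_before, vals_after

before, after = get_vals_vectorized([0.10, -0.25, 0.15], 0.05)
result = pd.DataFrame({"val_before_spending": before, "val_after_spending": after})
print(result.round(6))
```

On the three-year example from the question this reproduces the val_before_spending and val_after_spending columns shown there.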
UPDATE: My question has been fully answered and I have applied jarmod's answer to my program. Although the code looks neater, it has not affected how quickly my graph appears (I plot this data using matplotlib). I am a little confused about why my program runs slowly and how I can speed it up: it takes about 30 seconds, and I know this portion of the code is what slows it down. I have shown my real code in the second block of code. The speed is also strongly determined by the Range I set; with a short range it is quite fast.
I have this sample code that shows the calculation I need for forecasting and extracting values. I use for loops to run through a specific range of CSV files that I labeled 1-100. I return a number for each month (1-12) to get the average forecast accuracy for a given number of months.
My full code includes 12 functions for a full-year forecast, but I feel it is inefficient: the functions are nearly identical except for one number, and reading the CSV files so many times slows the program down.
Is there a way to combine these functions, perhaps by adding another parameter? My biggest concern is that it would be hard to return separate numbers and categorize them. Ideally I would have only one function for all 12 monthly accuracy predictions. The only way I can see to do that is to add another parameter and another loop, but I have no idea how to go about it or whether it is possible. Essentially, I would like to store all the values of onemonthaccuracy (which goes into the file before the current file and compares the predicted value for the date associated with the current file), then all the values of twomonthaccuracy, and so on, so that I can later use these variables for graphing and other purposes.
import csv
import pandas as pd
import matplotlib.pyplot as plt
def onemonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    onemonthread = pd.read_csv(str(basefilenumber-1)+'.csv', encoding='latin-1')
    onemonthvalue = onemonthread.loc[onemonthread['Customer'].str.contains('Customer A', na=False),'Jun-16\nQty']
    onetotal = int(onemonthvalue)/int(basefilevalue)
    return onetotal
def twomonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twomonthread = pd.read_csv(str(basefilenumber-2)+'.csv', encoding = 'Latin-1')
    twomonthvalue = twomonthread.loc[twomonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twototal = int(twomonthvalue)/int(basefilevalue)
    return twototal
onetotal = 0
twototal = 0
onetotallist = []
twototallist = []
for basefilenumber in range(24,36):
    onetotal += onemonthaccuracy(basefilenumber)
    twototal += twomonthaccuracy(basefilenumber)
    onetotallist.append(onemonthaccuracy(basefilenumber))
    twototallist.append(twomonthaccuracy(basefilenumber))
onetotalpermonth = onetotal/12
twototalpermonth = twototal/12
x = [1,2]
y = [onetotalpermonth, twototalpermonth]
z = [1,2]
w = [(onetotallist),(twototallist)]
for ze, we in zip(z, w):
    plt.scatter([ze] * len(we), we, marker='D', s=5)
plt.scatter(x,y)
plt.show()
This is the real block of code I am using in my program, perhaps something is slowing it down that I am unaware of?
#other parts of code
#StartRange = yearvalue+Value
#EndRange = endValue + endyearvalue
#Range = EndRange - StartRange
# Department
#more code....
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    baseheader = getfileheader(basefilenumber)
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains(Department, na=False), baseheader]
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding = 'Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains(Department, na=False), baseheader]
    return (1-(int(basefilevalue)/int(nmonthvalue))+1) if int(nmonthvalue) > int(basefilevalue) else int(nmonthvalue)/int(basefilevalue)
N = 13
total = [0] * N
total_by_month_list = [[] for _ in range(N)]
for basefilenumber in range(int(StartRange),int(EndRange)):
    for n in range(N):
        total[n] += nmonthaccuracy(basefilenumber, n)
        total_by_month_list[n].append(nmonthaccuracy(basefilenumber,n))
onetotal=total[1]/ Range
twototal=total[2]/ Range
threetotal=total[3]/ Range
fourtotal=total[4]/ Range
fivetotal=total[5]/ Range #... all the way to 12
onetotallist=total_by_month_list[1]
twototallist=total_by_month_list[2]
threetotallist=total_by_month_list[3]
fourtotallist=total_by_month_list[4]
fivetotallist=total_by_month_list[5] #... all the way to 12
# a lot more code after this
Something like this:
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding = 'Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    return int(nmonthvalue)/int(basefilevalue)
N = 2
total_by_month = [0] * N
total_aggregate = 0
for basefilenumber in range(20,30):
    for n in range(N):
        a = nmonthaccuracy(basefilenumber, n)
        total_by_month[n] += a
        total_aggregate += a
In case you are wondering what the following code does:
N = 2
total_by_month = [0] * N
It sets N to the number of months desired (2, but you could make it 12 or another value) and it then creates a total_by_month array that can store N results, one per month. It then initializes total_by_month to all zeroes (N zeroes) so that each of the N monthly totals starts at zero.
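On the speed complaint in the update: every call re-reads two CSV files, and the real code calls nmonthaccuracy twice per (file, n) pair. One possible remedy, sketched below, is to cache each file after its first read and compute each ratio once. The helper names read_forecast, customer_value and accuracy_table, and the directory parameter, are hypothetical; the sketch also assumes exactly one matching customer row per file.

```python
import os
from functools import lru_cache

import pandas as pd

@lru_cache(maxsize=None)
def read_forecast(filenumber, directory="."):
    """Read <filenumber>.csv once; later calls return the cached frame."""
    path = os.path.join(directory, f"{filenumber}.csv")
    return pd.read_csv(path, encoding="latin-1")

def customer_value(filenumber, customer, column, directory="."):
    frame = read_forecast(filenumber, directory)
    mask = frame["Customer"].str.contains(customer, na=False)
    # Assumes a single matching row; take it explicitly as a scalar.
    return float(frame.loc[mask, column].iloc[0])

def nmonthaccuracy(basefilenumber, n, customer, column, directory="."):
    base = customer_value(basefilenumber, customer, column, directory)
    earlier = customer_value(basefilenumber - n, customer, column, directory)
    return earlier / base

def accuracy_table(file_numbers, max_months, customer, column, directory="."):
    """One row per base file, one column per look-back of 1..max_months."""
    months = range(1, max_months + 1)
    rows = {
        number: [nmonthaccuracy(number, n, customer, column, directory) for n in months]
        for number in file_numbers
    }
    return pd.DataFrame.from_dict(rows, orient="index", columns=list(months))
```

With the values in one table, the per-month averages are table.mean() and each column can be passed to the scatter plot directly.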
I am trying to write code that constructs a DataFrame of cointegrating pairs of portfolios (i.e. the portfolio prices are cointegrated). The stocks in each portfolio are selected from the S&P 500 and are equally weighted.
Also, for economic reasons, the two portfolios must cover the same sectors.
For example:
if the stocks in one portfolio are from the [IT] and [Financial] sectors, the second portfolio must also select stocks from the [IT] and [Financial] sectors.
There is no single correct number of stocks in a portfolio, so I am considering about 10 to 20 stocks for each. However, the number of combinations is then on the order of (500 choose 10), so computation time becomes an issue.
My code is below:
def adf(x, y, xName, yName, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    res=pd.DataFrame()
    regress1, regress2 = pd.ols(x=x, y=y), pd.ols(x=y, y=x)
    error1, error2 = regress1.resid, regress2.resid
    test1, test2 = ts.adfuller(error1, 1), ts.adfuller(error2, 1)
    if test1[1] < pvalue and test1[1] < test2[1] and\
            regress1.beta["x"] > beta_lower and regress1.beta["x"] < beta_upper:
        res[(tuple(xName), tuple(yName))] = pd.Series([regress1.beta["x"], test1[1]])
        res = res.T
        res.columns=["beta","pvalue"]
        return res
    elif test2[1] < pvalue and regress2.beta["x"] > beta_lower and\
            regress2.beta["x"] < beta_upper:
        res[(tuple(yName), tuple(xName))] = pd.Series([regress2.beta["x"], test2[1]])
        res = res.T
        res.columns=["beta","pvalue"]
        return res
    else:
        pass
def coint(dataFrame, nstocks = 2, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    # dataFrame = pandas_dataFrame, in this case, data['Adj Close'], row=time, col = tickers
    # pvalue = level of significance of adf test
    # nstocks = number of stocks considered for adf test (equal weight)
    # if nstocks > 2, coint return cointegration between portfolios
    # beta_lower = lower bound for slope of linear regression
    # beta_upper = upper bound for slope of linear regression
    a=time.time()
    tickers = dataFrame.columns
    tcomb = itertools.combinations(dataFrame.columns, nstocks)
    res = pd.DataFrame()
    sec = pd.DataFrame()
    for pair in tcomb:
        xName, yName = list(pair[:int(nstocks/2)]), list(pair[int(nstocks/2):])
        xind, yind = tickers.searchsorted(xName), tickers.searchsorted(yName)
        xSector = list(SNP.ix[xind]["Sector"])
        ySector = list(SNP.ix[yind]["Sector"])
        if set(xSector) == set(ySector):
            sector = [[(xSector, ySector)]]
            x, y = dataFrame[list(xName)].sum(axis=1), dataFrame[list(yName)].sum(axis=1)
            res1 = adf(x,y,xName,yName)
            if res1 is None:
                continue
            elif res.size==0:
                res=res1
                sec = pd.DataFrame(sector, index = res.index, columns = ["sector"])
                print("added : ", pair)
            else:
                res=res.append(res1)
                sec = sec.append(pd.DataFrame(sector, index = [res.index[-1]], columns = ["sector"]))
                print("added : ", pair)
    res = pd.concat([res,sec],axis=1)
    res=res.sort_values(by=["pvalue"],ascending=True)
    b=time.time()
    print("time taken : ", b-a, "sec")
    return res
When nstocks=2 this takes about 263 seconds, but as nstocks increases the loop takes a lot of time (more than a day).
I collected the 'Adj Close' data from Yahoo Finance using pandas_datareader.data;
the index is time and the columns are the different tickers.
Any suggestions or help will be appreciated.
I don't know what computer you have, but I would advise you to use some kind of multiprocessing for the loop. I haven't looked closely at your code, but as far as I can see, res and sec can be moved into shared-memory objects and the individual loop iterations run in parallel with multiprocessing.
If you have a decent CPU it can improve the performance 4-6 times. If you have access to some kind of HPC, it can do wonders.
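To put a 4-6x speed-up in perspective, here is a back-of-the-envelope estimate. It assumes a universe of 500 tickers and that the cost per combination stays close to what the question's 263-second timing for nstocks=2 implies; both are assumptions, and the real cost per combination may well differ.

```python
import math

n_tickers = 500          # assumed universe size (roughly the S&P 500)
seconds_for_pairs = 263  # timing reported in the question for nstocks=2

pairs = math.comb(n_tickers, 2)
seconds_per_combo = seconds_for_pairs / pairs

def estimated_days(nstocks, speedup=1.0):
    """Estimated wall-clock days to loop over every combination of nstocks tickers."""
    combos = math.comb(n_tickers, nstocks)
    return combos * seconds_per_combo / speedup / 86_400

for nstocks in (2, 4, 6):
    print(nstocks, math.comb(n_tickers, nstocks),
          f"{estimated_days(nstocks):.3g} days serial",
          f"{estimated_days(nstocks, speedup=6):.3g} days at 6x")
```

Under these assumptions the number of combinations grows far faster than any constant-factor speed-up can absorb, so reducing the set of candidate combinations probably matters more than parallelism alone.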
I'd recommend using a profiler to narrow down the most time consuming calls, and the number of loops (does your loop make the expected number of passes?). Python 3 has a profiler in the standard library:
https://docs.python.org/3.6/library/profile.html
You can either invoke it in your code:
import cProfile
cProfile.run('your_function(inputs)')
Or if a script is an easier entrypoint:
python -m cProfile [-o output_file] [-s sort_order] your-script.py
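If the raw report is too long to read, the standard-library pstats module can sort and truncate it. A small sketch follows, in which slow_sum is a stand-in for the function you actually want to profile:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Placeholder workload; replace with a call into your own code.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Sort by cumulative time and keep only the five most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time tends to surface the call that dominates the loop, which is usually the first thing worth optimizing.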
I recently asked a question about calculating maximum drawdown where Alexander gave a very succinct and efficient way of calculating it with DataFrame methods in pandas.
I wanted to follow up by asking how others are calculating maximum active drawdown?
Note: this calculates max drawdown, NOT max active drawdown.
This is what I implemented for max drawdown based on Alexander's answer to question linked above:
def max_drawdown_absolute(returns):
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.idxmin()  # index label of the trough
    start = r.loc[:end].idxmax()  # index label of the preceding peak
    return mdd, start, end
It takes a return series and gives back the max_drawdown along with the indices at which the drawdown occurred.
We start by generating a series of cumulative returns to act as a return index.
r = returns.add(1).cumprod()
At each point in time, the current drawdown is calculated by comparing the current level of the return index with the maximum return index for all periods prior.
dd = r.div(r.cummax()).sub(1)
The max drawdown is then just the minimum of all the calculated drawdowns.
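A small worked example may help. The five returns below are made up for illustration, and the function is restated with idxmin/idxmax so that it returns index labels on current pandas versions.

```python
import pandas as pd

def max_drawdown_absolute(returns):
    r = returns.add(1).cumprod()         # return index
    dd = r.div(r.cummax()).sub(1)        # drawdown relative to the running peak
    mdd = dd.min()
    end = dd.idxmin()                    # label of the trough
    start = r.loc[:end].idxmax()         # label of the peak before the trough
    return mdd, start, end

returns = pd.Series([0.10, -0.20, 0.05, -0.10, 0.30])
mdd, start, end = max_drawdown_absolute(returns)
print(mdd, start, end)
```

Here the return index peaks at 1.1 in period 0 and bottoms at 0.8316 in period 3, so the max drawdown is 0.8316 / 1.1 - 1 = -24.4%, running from period 0 to period 3.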
My question:
I wanted to follow up by asking how others are calculating maximum
active drawdown?
Assume that the solution will extend the solution above.
Starting with a series of portfolio returns and benchmark returns, we build cumulative returns for both. The variables below are assumed to already be in cumulative return space.
The active return from period j to period i is:
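Consistent with the dd line in the solution code below, and writing P and B for the cumulative portfolio and benchmark levels (symbols assumed here for illustration), the active return can be written as:

```latex
R_{j \to i} = \frac{P_i}{P_j} - \frac{B_i}{B_j} = \frac{P_i B_j - B_i P_j}{P_j B_j}
```

In the code, P_j and B_j correspond to p0 and b0, the levels at the time j at which the cumulative active return p - b peaked.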
Solution
This is how we can extend the absolute solution:
def max_draw_down_relative(p, b):
    p = p.add(1).cumprod()
    b = b.add(1).cumprod()
    pmb = p - b
    # position of the running maximum of the cumulative active return
    cam = pmb.expanding(min_periods=1).apply(lambda x: x.argmax())
    p0 = pd.Series(p.iloc[cam.values.astype(int)].values, index=p.index)
    b0 = pd.Series(b.iloc[cam.values.astype(int)].values, index=b.index)
    dd = (p * b0 - b * p0) / (p0 * b0)
    mdd = dd.min()
    end = dd.idxmin()
    start = cam.loc[end]
    return mdd, start, end
Explanation
Similar to the absolute case, at each point in time, we want to know what the maximum cumulative active return has been up to that point. We get this series of cumulative active returns with p - b. The difference is that we want to keep track of what the p and b were at this time and not the difference itself.
So, we generate a series of 'whens' captured in cam (cumulative argmax) and subsequent series of portfolio and benchmark values at those 'whens'.
p0 = pd.Series(p.iloc[cam.values.astype(int)].values, index=p.index)
b0 = pd.Series(b.iloc[cam.values.astype(int)].values, index=b.index)
The drawdown calculation can now be made analogously using the formula above:
dd = (p * b0 - b * p0) / (p0 * b0)
Demonstration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(314)
p = pd.Series(np.random.randn(200) / 100 + 0.001)
b = pd.Series(np.random.randn(200) / 100 + 0.001)
keys = ['Portfolio', 'Benchmark']
cum = pd.concat([p, b], axis=1, keys=keys).add(1).cumprod()
cum['Active'] = cum.Portfolio - cum.Benchmark
mdd, sd, ed = max_draw_down_relative(p, b)
f, a = plt.subplots(2, 1, figsize=[8, 10])
cum[['Portfolio', 'Benchmark']].plot(title='Cumulative Absolute', ax=a[0])
a[0].axvspan(sd, ed, alpha=0.1, color='r')
cum[['Active']].plot(title='Cumulative Active', ax=a[1])
a[1].axvspan(sd, ed, alpha=0.1, color='r')
You may have noticed that your individual components do not equal the whole, either in an additive or geometric manner:
>>> cum.tail(1)
Portfolio Benchmark Active
199 1.342179 1.280958 1.025144
This is always a troubling situation, as it indicates that some sort of leakage may be occurring in your model.
Mixing single-period and multi-period attribution is always a challenge. Part of the issue lies in the goal of the analysis, i.e. what you are trying to explain.
If you are looking at cumulative returns, as is the case above, then one way to perform your analysis is as follows:
Ensure the portfolio returns and the benchmark returns are both excess returns, i.e. subtract the appropriate cash return for the respective period (e.g. daily, monthly, etc.).
Assume you have a rich uncle who lends you $100m to start your fund. Now you can think of your portfolio as three transactions, one cash and two derivative transactions:
a) Invest your $100m in a cash account, conveniently earning the offer rate.
b) Enter into an equity swap for $100m notional
c) Enter into a swap transaction with a zero beta hedge fund, again for $100m notional.
We will conveniently assume that both swap transactions are collateralized by the cash account, and that there are no transaction costs (if only...!).
On day one, the stock index is up just over 1% (an excess return of exactly 1.00% after deducting the cash expense for the day). The uncorrelated hedge fund, however, delivered an excess return of -5%. Our fund is now at $96m.
Day two, how do we rebalance? Your calculations imply that we never do. Each is a separate portfolio that drifts on forever... For the purpose of attribution, however, I believe it makes total sense to rebalance daily, i.e. 100% to each of the two strategies.
As these are just notional exposures with ample cash collateral, we can just adjust the amounts. So instead of having $101m exposure to the equity index on day two and $95m of exposure to the hedge fund, we will instead rebalance (at zero cost) so that we have $96m of exposure to each.
How does this work in Pandas, you might ask? You've already calculated cum['Portfolio'], which is the cumulative excess growth factor for the portfolio (i.e. after deducting cash returns). If we apply the current day's excess benchmark and active returns to the prior day's portfolio growth factor, we calculate the daily rebalanced returns.
import numpy as np
import pandas as pd
np.random.seed(314)
df_returns = pd.DataFrame({
    'Portfolio': np.random.randn(200) / 100 + 0.001,
    'Benchmark': np.random.randn(200) / 100 + 0.001})
df_returns['Active'] = df_returns.Portfolio - df_returns.Benchmark
# Start from an empty dataframe that will hold the cumulative series.
df_cum = pd.DataFrame()
# Calculate cumulative portfolio growth
df_cum['Portfolio'] = (1 + df_returns.Portfolio).cumprod()
# Calculate shifted portfolio growth factors.
portfolio_return_factors = pd.Series([1] + df_cum['Portfolio'].shift()[1:].tolist(), name='Portfolio_return_factor')
# Use portfolio return factors to calculate daily rebalanced returns.
df_cum['Benchmark'] = (df_returns.Benchmark * portfolio_return_factors).cumsum()
df_cum['Active'] = (df_returns.Active * portfolio_return_factors).cumsum()
Now we see that the active return plus the benchmark return plus the initial cash equals the current value of the portfolio.
>>> df_cum.tail(3)[['Benchmark', 'Active', 'Portfolio']]
Benchmark Active Portfolio
197 0.303995 0.024725 1.328720
198 0.287709 0.051606 1.339315
199 0.292082 0.050098 1.342179
By construction, df_cum['Portfolio'] = 1 + df_cum['Benchmark'] + df_cum['Active'].
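The identity follows because the benchmark and active returns sum to the portfolio return each day, so the daily contributions telescope into the portfolio's growth. A quick numerical check on randomly generated returns (the seed and variable names are chosen here for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
returns = pd.DataFrame({
    "Portfolio": rng.normal(0.001, 0.01, 200),
    "Benchmark": rng.normal(0.001, 0.01, 200),
})
returns["Active"] = returns.Portfolio - returns.Benchmark

# Cumulative portfolio growth and the prior day's growth factor (1 on day one).
growth = (1 + returns.Portfolio).cumprod()
prior_growth = growth.shift(1, fill_value=1.0)

# Daily rebalanced contributions, accumulated over time.
benchmark_contrib = (returns.Benchmark * prior_growth).cumsum()
active_contrib = (returns.Active * prior_growth).cumsum()

identity_holds = np.allclose(growth, 1 + benchmark_contrib + active_contrib)
print(identity_holds)
```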
Because this method is difficult to calculate (without Pandas!) and understand (most people won't get the notional exposures), industry practice generally defines the active return as the cumulative difference in returns over a period of time. For example, if a fund was up 5.0% in a month and the market was down 1.0%, then the excess return for that month is generally defined as +6.0%. The problem with this simplistic approach, however, is that your results will drift apart over time due to compounding and rebalancing issues that aren't properly factored into the calculations.
So given our df_cum.Active column, we could define the drawdown as:
drawdown = pd.Series(1 - (1 + df_cum.Active)/(1 + df_cum.Active.cummax()), name='Active Drawdown')
>>> df_cum.Active.plot(legend=True);drawdown.plot(legend=True)
You can then determine the start and end points of the drawdown as you have previously done.
Comparing my cumulative Active return contribution with the amounts you calculated, you will find them to be similar at first, and then drift apart over time (my return calcs are in green):
My cheap two pennies in pure Python:
def find_drawdown(lista):
    peak = 0
    trough = 0
    drawdown = 0
    for n in lista:
        if n > peak:
            peak = n
            trough = peak
        if n < trough:
            trough = n
            temp_dd = peak - trough
            if temp_dd > drawdown:
                drawdown = temp_dd
    return -drawdown
In piRSquared's answer I would suggest amending
pmb = p - b
to
pmb = p / b
to find the relative max drawdown. df3, using pmb = p - b, identifies a relative max drawdown of US$851 (-48.9%). df2, using pmb = p / b, identifies the relative max drawdown as US$544.6 (-57.9%).
import numpy as np
import pandas as pd
import datetime
import pandas_datareader.data as pdr
import matplotlib.pyplot as plt
import yfinance as yfin
yfin.pdr_override()
stocks = ["AMZN", "SPY"]
df = pdr.get_data_yahoo(stocks, start="2020-01-01", end="2022-02-18")
df = df[['Adj Close']]
df.columns = df.columns.droplevel(0)
df.reset_index(inplace=True)
df.Date=df.Date.dt.date
df2 = df[df.Date.isin([datetime.date(2020,7,9), datetime.date(2022,2,3)])].copy()
df2['AMZN/SPY'] = df2.AMZN / df2.SPY
df2['AMZN-SPY'] = df2.AMZN - df2.SPY
df2['USDdiff'] = df2['AMZN-SPY'].diff().round(1)
df2[["p", "b"]] = df2[['AMZN','SPY']].pct_change(1).round(4)
df2['p-b'] = df2.p - df2.b
df2.replace(np.nan, '', regex=True, inplace=True)
df2 = df2.round(2)
print(df2)
      Date     AMZN    SPY  AMZN/SPY  AMZN-SPY  USDdiff        p       b      p-b
2020-07-09  3182.63  307.7     10.34   2874.93
2022-02-03  2776.91  446.6      6.22   2330.31   -544.6  -0.1275  0.4514  -0.5789
df3 = df[df.Date.isin([datetime.date(2020,9,2), datetime.date(2022,2,3)])].copy()
df3['AMZN/SPY'] = df3.AMZN / df3.SPY
df3['AMZN-SPY'] = df3.AMZN - df3.SPY
df3['USDdiff'] = df3['AMZN-SPY'].diff().round(1)
df3[["p", "b"]] = df3[['AMZN','SPY']].pct_change(1).round(4)
df3['p-b'] = df3.p - df3.b
df3.replace(np.nan, '', regex=True, inplace=True)
df3 = df3.round(2)
print(df3)
      Date     AMZN     SPY  AMZN/SPY  AMZN-SPY  USDdiff        p       b      p-b
2020-09-02  3531.45  350.09     10.09   3181.36
2022-02-03  2776.91  446.60      6.22   2330.31   -851.0  -0.2137  0.2757  -0.4894
PS: I don't have enough reputation to comment.