Nested ANOVA in statsmodels - python

The issue:
When trying to compute two-way nested ANOVA, the results do not equal the appropriate results from R (formulas and data are the same).
Sample:
We use "atherosclerosis" dataset from here: https://stepik.org/media/attachments/lesson/9250/atherosclerosis.csv.
To get nested data we replace dose values for age == 2:
import numpy as np

df['dose'] = np.where((df['age']==2) & (df['dose']=='D1'), 'D3', df.dose)
df['dose'] = np.where((df['age']==2) & (df['dose']=='D2'), 'D4', df.dose)
So we have the dose factor nested in age: values D1 and D2 occur only in the first age group, and values D3 and D4 only in the second.
After getting ANOVA table we have the results below:
import statsmodels.api as sm
from statsmodels.formula.api import ols

mod = ols('expr~age/C(dose)', data=df).fit()
anova_table = sm.stats.anova_lm(mod, typ=1); anova_table
[Screenshot of the statsmodels ANOVA table]
The total of the 'sum_sq' column = 1590.257424 + 47.039636 + 197.452754 = 1834.7498139999998, which is NOT equal to the correct total sum of squares (computed below) = 1805.5494956433238:
grand_mean = df['expr'].mean()
ssq_t = sum((df.expr - grand_mean)**2)
Expected Output:
Let's try to get ANOVA table in R:
df <- read.csv(file = "/mnt/storage/users/kseniya/platform-adpkd-mrwda-aim-imaging/mrwda_training/data_samples/athero_new.csv")
nest <- aov(df$expr ~ df$age / factor(df$dose))
print(summary(nest))
The results:
[Screenshot of the R ANOVA table]
Why are they not equal? The formulas are the same. Is there a mistake in computing the ANOVA through statsmodels?
The results from R seem to be right, because the total sum 197.5 + 17.8 + 1590.3 = 1805.6 is equal to the total sum computed manually.

The degrees of freedom aren't equal, so I suspect that the model definition is not really the same between OLS and R. Since lm(y ~ x/z, data) is just a shortcut for lm(y ~ x + x:z, data), I would use the extended formulation and recheck that your data is the same. Also use lm instead of aov; the behaviour of the Python and R implementations should then be more similar.
Note also that the behaviour of C() in Python does not seem to be the same as the factor() cast in R.
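For example, the expanded formulation on the Python side might look like this (a sketch; wrapping age in C() to force it categorical is my assumption, since the original formula leaves it numeric):

import statsmodels.api as sm
from statsmodels.formula.api import ols

# y ~ x + x:z, the explicit expansion of y ~ x/z
mod = ols('expr ~ C(age) + C(age):C(dose)', data=df).fit()
print(sm.stats.anova_lm(mod, typ=1))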


Trying to calculate correct returns and set constraints on the max and min invested in each asset using 'quad_form'

I am trying to hack together some code that should print out the risk and returns of a portfolio, but the first return is 0.00, and that can't be right. Here's the code that I'm testing.
import pandas as pd
# initialize list of lists
data = [[130000, 150000, 190000, 200000], [100000, 200000, 300000, 900000], [350000, 450000, 890000, 20000], [400000, 10000, 500000, 600000]]
# Create the pandas DataFrame
data = pd.DataFrame(data, columns = ['HOSPITAL', 'HOTEL', 'STADIUM', 'SUBWAY'])
# print dataframe.
data
That gives me this data frame.
import numpy as np

symbols = data.columns
# convert daily stock prices into daily returns
returns = data.pct_change()
r = np.asarray(np.mean(returns, axis=1))
r = np.nan_to_num(r)
C = np.asmatrix(np.cov(returns))
C = np.nan_to_num(C)
# print expected returns and risk
for j in range(len(symbols)):
    print('%s: Exp ret = %f, Risk = %f' % (symbols[j], r[j], C[j, j]**0.5))
The result is this.
The hospital risk and return can't be zero. That doesn't make sense. Something is off here, but I'm not sure what.
Finally, I am trying to optimize the portfolio. So, I hacked together this code.
from cvxpy import *  # brings in Variable, quad_form, Problem, Minimize (and cvxpy's sum)

# Number of variables
n = len(data)
# The variables vector
x = Variable(n)
# The minimum return
req_return = 0.02
# The return
ret = r.T*x
# The risk in xT.Q.x format
risk = quad_form(x, C)
# The core problem definition with the Problem class from CVXPY
prob = Problem(Minimize(risk), [sum(x)==1, ret >= req_return, x >= 0])
try:
    prob.solve()
    print("Optimal portfolio")
    print("----------------------")
    for s in range(len(symbols)):
        print(" Investment in {} : {}% of the portfolio".format(symbols[s], round(100*x.value[s], 2)))
    print("----------------------")
    print("Exp ret = {}%".format(round(100*ret.value, 2)))
    print("Expected risk = {}%".format(round(100*risk.value**0.5, 2)))
except:
    print("Error")
It seems to run, but I don't know how to add a constraint: I want to invest at least 5% in every asset and no more than 40% in any one asset. How can I add constraints to do that?
The idea comes from this link.
https://tirthajyoti.github.io/Notebooks/Portfolio_optimization.html
Based on the idea from the link, they skip the NaN row of the monthly-return dataframe, and after converting the returns to a matrix, the following step is to transpose it. That is the step you are missing, hence you are getting returns and risk of 0 for HOSPITAL. You might want to change the line to C = np.asmatrix(np.cov(returns.dropna().transpose())) to skip the first NaN row. This should give you the correct return and risk values.
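A sketch of how those lines might look after the fix (note: taking the mean over axis=0, so it is per asset rather than per day, is my addition, not part of the suggestion above):

import numpy as np

returns_clean = returns.dropna()  # drop the NaN row produced by pct_change()
r = np.nan_to_num(np.asarray(np.mean(returns_clean, axis=0)))  # mean return per asset
# transpose so that np.cov sees one row per asset
C = np.nan_to_num(np.asmatrix(np.cov(returns_clean.transpose())))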
As for your second question, I had a quick glance at the class definition of the cvxpy Problem class and there doesn't seem to be a provision for adding constraints. The function was programmed to solve equations based on the Minimize or Maximize objective given to it.
As a workaround, you might want to take the outputs and then cap the investment at 40%, distributing the remainder proportionally among the other ventures. For example, say it tells you to invest 5%, 80% and 15% of your assets in A, B and C. You could cap the investment in B at 40% and distribute the remainder proportionally: (5/(5+15))*40 = 10% more into A and 30% of your total investing assets more into C.
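A rough sketch of that redistribution in Python (hypothetical weights mirroring the example above; note that a single pass can push another asset above the cap, as happens with C here):

def cap_and_redistribute(weights, cap=0.40):
    # Hand the excess above `cap` to the uncapped assets,
    # in proportion to their current weights.
    excess = sum(w - cap for w in weights if w > cap)
    free = sum(w for w in weights if w <= cap)
    return [cap if w > cap else w + (w / free) * excess for w in weights]

print(cap_and_redistribute([0.05, 0.80, 0.15]))  # [0.15, 0.40, 0.45]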
DISCLAIMER: I am not an expert in finance, i am just stating my opinion.

cannot do slice indexing on <class 'pandas.core.indexes.datetimes.DatetimeIndex'> with these indexers [[2.]] of <class 'numpy.ndarray'>

So I have this equation I have to maximize: V**2 = (((k*(T-k))/T)**0.5 * (Y2-Y1))**2.
Y1 and Y2 are means of the data and T is the total number of data points. To be precise, k marks the breakpoint that splits the data in two, Y1 being the mean of the first part and Y2 the mean of the second. The data is given in a dataframe (cases) with a datetime index.
Here you can see my code so far:
from scipy.optimize import minimize

def obj(k):
    Y1 = cases[' New_cases'].iloc[:k+1].mean()
    Y2 = cases[' New_cases'].iloc[k+1:].mean()
    return ((k*(180-k))/180)*(Y2-Y1)

x0 = 0
sol = minimize(obj, x0)
And every time I run minimize I get:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.datetimes.DatetimeIndex'> with these indexers [[2.]] of <class 'numpy.ndarray'>.
Is it because scipy.optimize cannot handle dataframes? If so, how can I circumvent this?
Also this is my first time using scipy, so if you see something else wrong or subpar in my code, please let me know.
Edit: I boiled my code down to make it simpler and reproducible. The cases dataframe has a datetime index and one column " New_cases" containing integers. To be precise, the cases file contains the number of new Covid cases for each day.
Sample data:
Date_reported    New_cases
2020-01-03       0
2020-01-04       1
2020-01-05       0
2020-01-06       3
...
2020-07-01       4554
k is also supposed to be an integer, as it marks the position of the breakpoint in the data.
Your approach seems too complicated to me. Why are you using scipy optimisation at all? (minimize passes k into your objective as a numpy array of floats, which is what triggers the slicing TypeError.) This is a highly discrete problem of manageable size (memory workload and speed are negligible), so you can simply evaluate the objective at every k.
If you set T = 180 and construct a sample frame of length T (populated with random integers between 0 and 10)
import numpy as np
import pandas as pd

T = 180
cases = pd.DataFrame(np.random.randint(0, 10, T),
                     index=pd.date_range('2020-01-03', periods=T, freq='D'),
                     columns=['New_cases'])
define the objective as you did (note that it slightly differs from your original objective to maximise (((k*(T-k))/T)**0.5 * (Y2-Y1))**2)
def objective(k):
    y1 = cases['New_cases'].iloc[:k+1].mean()
    y2 = cases['New_cases'].iloc[k+1:].mean()
    return k * (T - k) / T * (y2 - y1)
then you get the maximum/minimum by
objectives = [objective(k) for k in range(len(cases))]
max_objective = max(objectives)
min_objective = min(objectives)
and the values of k at which the maximum/minimum is reached by
k_max = [k for k in range(len(cases)) if objectives[k] == max_objective]
k_min = [k for k in range(len(cases)) if objectives[k] == min_objective]
(k_max/k_min are lists because it's possible that the maximum/minimum is reached several times.)
Edit: Illustration of the list comprehensions:
objectives = []
for k in range(len(cases)):
    objectives.append(objective(k))

k_max = []
for k in range(len(cases)):
    if objectives[k] == max_objective:
        k_max.append(k)
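For what it's worth, the same result can be obtained more compactly with numpy (a sketch; note that argmax/argmin only return the first position if the extremum is reached several times):

import numpy as np

objectives = np.array([objective(k) for k in range(len(cases))])
k_max = int(objectives.argmax())   # first k at which the maximum is reached
k_min = int(objectives.argmin())   # first k at which the minimum is reached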

Python, how to compute an iterative formula?

I'm trying to port from Excel a formula that (as usual) iterates over two other values.
Let me try to explain better.
I have the following three variables:
iva_versamenti_totale,
utilizzato,
riporto
iva_versamenti_totale has a length of 13 and is given by the following formula:
iva_versamenti_totale={'Saldo IVA': [sum(t) for t in zip(*iva_versamenti.values())],}
utilizzato and riporto are obtained simultaneously, in an iterative manner. I have tried the following code, but it does not work:
utilizzato = dict()
riporto = dict()
for index, xi in enumerate(iva_versamenti_totale['Saldo IVA']):
    if xi > 0:
        riporto[index] = riporto[index] + xi
    else:
        riporto[index] = riporto[index-1] - utilizzato[index]
for index, xi in enumerate(iva_versamenti_totale['Saldo IVA']):
    if xi > 0:
        utilizzato[index] == 0
    elif riporto[index-1] >= xi:
        utilizzato[index] = -xi
    else:
        utilizzato[index] = riporto[index-1]
Python gives me KeyError: 0.
EDIT
Here is my Excel file:
https://drive.google.com/open?id=1SRp9uscUgYsV88991yTnZ8X4c8NElhuD
Inputs are in grey and the variables in yellow.

DataFrame: how to perform calculations (please see the attached photo)

As you can see, the calculations under column D follow a specific pattern,
i.e. prior value * (1 + rate%/365),
so cell D2 is 100*(1+8%/365)
and D3 will be 100.021918*(1+8.06%/365).
Is there an easy way to do that in Python? I don't want to use Excel for this purpose, and I have daily data going back 30 years.
import pandas as pd

cell_d = [100]
rates = [0.08, 0.0806, 0.0812, 0.0813, 0.08]
for i, rate in enumerate(rates):
    cell_d.append(cell_d[i] * (1 + rate/365))
pd.DataFrame({'rates': rates, 'cell_d': cell_d[1:]})
Probably should rename cell_d to something more meaningful.
I don't know of any "DataFrame friendly" way to do it, but you can simply iterate over the rows with an index using a for loop.
for i in range(1, num_rows):
    df.loc[i, "value"] = df.loc[i-1, "value"] * (1 + df.loc[i, "rate"]/365)

Maximum Active Drawdown in python

I recently asked a question about calculating maximum drawdown where Alexander gave a very succinct and efficient way of calculating it with DataFrame methods in pandas.
I wanted to follow up by asking how others are calculating maximum active drawdown?
Note: the following calculates max drawdown, NOT max active drawdown.
This is what I implemented for max drawdown based on Alexander's answer to question linked above:
def max_drawdown_absolute(returns):
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end
It takes a return series and gives back the max drawdown along with the indices at which the drawdown occurred.
We start by generating a series of cumulative returns to act as a return index.
r = returns.add(1).cumprod()
At each point in time, the current drawdown is calculated by comparing the current level of the return index with the maximum of the return index over all prior periods.
dd = r.div(r.cummax()).sub(1)
The max drawdown is then just the minimum of all the calculated drawdowns.
My question:
I wanted to follow up by asking how others are calculating maximum active drawdown?
I assume the solution will extend the solution above.
Starting with a series of portfolio returns and benchmark returns, we build cumulative returns for both. The variables below are assumed to already be in cumulative-return space.
The active return from period j to period i is (p_i / p_j) - (b_i / b_j) = (p_i * b_j - b_i * p_j) / (p_j * b_j).
Solution
This is how we can extend the absolute solution:
def max_draw_down_relative(p, b):
    p = p.add(1).cumprod()
    b = b.add(1).cumprod()
    pmb = p - b
    cam = pmb.expanding(min_periods=1).apply(lambda x: x.argmax())
    p0 = pd.Series(p.iloc[cam.values.astype(int)].values, index=p.index)
    b0 = pd.Series(b.iloc[cam.values.astype(int)].values, index=b.index)
    dd = (p * b0 - b * p0) / (p0 * b0)
    mdd = dd.min()
    end = dd.idxmin()
    start = cam.loc[end]
    return mdd, start, end
Explanation
Similar to the absolute case, at each point in time, we want to know what the maximum cumulative active return has been up to that point. We get this series of cumulative active returns with p - b. The difference is that we want to keep track of what the p and b were at this time and not the difference itself.
So, we generate a series of 'whens' captured in cam (cumulative argmax) and subsequent series of portfolio and benchmark values at those 'whens'.
p0 = pd.Series(p.iloc[cam.values.astype(int)].values, index=p.index)
b0 = pd.Series(b.iloc[cam.values.astype(int)].values, index=b.index)
The drawdown calculation can now be made analogously using the formula above:
dd = (p * b0 - b * p0) / (p0 * b0)
Demonstration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(314)
p = pd.Series(np.random.randn(200) / 100 + 0.001)
b = pd.Series(np.random.randn(200) / 100 + 0.001)
keys = ['Portfolio', 'Benchmark']
cum = pd.concat([p, b], axis=1, keys=keys).add(1).cumprod()
cum['Active'] = cum.Portfolio - cum.Benchmark
mdd, sd, ed = max_draw_down_relative(p, b)
f, a = plt.subplots(2, 1, figsize=[8, 10])
cum[['Portfolio', 'Benchmark']].plot(title='Cumulative Absolute', ax=a[0])
a[0].axvspan(sd, ed, alpha=0.1, color='r')
cum[['Active']].plot(title='Cumulative Active', ax=a[1])
a[1].axvspan(sd, ed, alpha=0.1, color='r')
You may have noticed that your individual components do not equal the whole, either in an additive or geometric manner:
>>> cum.tail(1)
     Portfolio  Benchmark    Active
199   1.342179   1.280958  1.025144
This is always a troubling situation, as it indicates that some sort of leakage may be occurring in your model.
Mixing single-period and multi-period attribution is always a challenge. Part of the issue lies in the goal of the analysis, i.e. what you are trying to explain.
If you are looking at cumulative returns as is the case above, then one way you perform your analysis is as follows:
Ensure the portfolio returns and the benchmark returns are both excess returns, i.e. subtract the appropriate cash return for the respective period (e.g. daily, monthly, etc.).
Assume you have a rich uncle who lends you $100m to start your fund. Now you can think of your portfolio as three transactions, one cash and two derivative transactions:
a) Invest your $100m in a cash account, conveniently earning the offer rate.
b) Enter into an equity swap for $100m notional
c) Enter into a swap transaction with a zero beta hedge fund, again for $100m notional.
We will conveniently assume that both swap transactions are collateralized by the cash account, and that there are no transaction costs (if only...!).
On day one, the stock index is up just over 1% (an excess return of exactly 1.00% after deducting the cash expense for the day). The uncorrelated hedge fund, however, delivered an excess return of -5%. Our fund is now at $96m.
Day two, how do we rebalance? Your calculations imply that we never do. Each is a separate portfolio that drifts on forever... For the purpose of attribution, however, I believe it makes total sense to rebalance daily, i.e. 100% to each of the two strategies.
As these are just notional exposures with ample cash collateral, we can just adjust the amounts. So instead of having $101m exposure to the equity index on day two and $95m of exposure to the hedge fund, we will instead rebalance (at zero cost) so that we have $96m of exposure to each.
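A toy sketch of that two-day arithmetic (hypothetical numbers from the example above):

fund = 100.0                            # $100m from the rich uncle
# Day one: +1% excess on the equity swap, -5% on the hedge fund swap,
# both on $100m notional.
fund += 100.0 * 0.01 + 100.0 * (-0.05)  # fund is now 96.0
# Day two: rebalance both notionals (at zero cost) to the fund value,
# i.e. $96m of exposure to each strategy.
equity_notional = hedge_fund_notional = fund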
How does this work in Pandas, you might ask? You've already calculated cum['Portfolio'], which is the cumulative excess growth factor for the portfolio (i.e. after deducting cash returns). If we apply the current day's excess benchmark and active returns to the prior day's portfolio growth factor, we calculate the daily rebalanced returns.
import numpy as np
import pandas as pd
np.random.seed(314)
df_returns = pd.DataFrame({
'Portfolio': np.random.randn(200) / 100 + 0.001,
'Benchmark': np.random.randn(200) / 100 + 0.001})
df_returns['Active'] = df_returns.Portfolio - df_returns.Benchmark
# Create an empty dataframe to hold the cumulative results.
df_cum = pd.DataFrame()
# Calculate cumulative portfolio growth
df_cum['Portfolio'] = (1 + df_returns.Portfolio).cumprod()
# Calculate shifted portfolio growth factors.
portfolio_return_factors = pd.Series([1] + df_cum['Portfolio'].shift()[1:].tolist(), name='Portfolio_return_factor')
# Use portfolio return factors to calculate daily rebalanced returns.
df_cum['Benchmark'] = (df_returns.Benchmark * portfolio_return_factors).cumsum()
df_cum['Active'] = (df_returns.Active * portfolio_return_factors).cumsum()
Now we see that the active return plus the benchmark return plus the initial cash equals the current value of the portfolio.
>>> df_cum.tail(3)[['Benchmark', 'Active', 'Portfolio']]
     Benchmark    Active  Portfolio
197   0.303995  0.024725   1.328720
198   0.287709  0.051606   1.339315
199   0.292082  0.050098   1.342179
By construction, df_cum['Portfolio'] = 1 + df_cum['Benchmark'] + df_cum['Active'].
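A one-line sanity check of that identity (a sketch, assuming df_cum from the code above):

import numpy as np

assert np.allclose(df_cum['Portfolio'], 1 + df_cum['Benchmark'] + df_cum['Active'])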
Because this method is difficult to calculate (without Pandas!) and understand (most people won't get the notional exposures), industry practice generally defines the active return as the cumulative difference in returns over a period of time. For example, if a fund was up 5.0% in a month and the market was down 1.0%, then the excess return for that month is generally defined as +6.0%. The problem with this simplistic approach, however, is that your results will drift apart over time due to compounding and rebalancing issues that aren't properly factored into the calculations.
So given our df_cum.Active column, we could define the drawdown as:
drawdown = pd.Series(1 - (1 + df_cum.Active)/(1 + df_cum.Active.cummax()), name='Active Drawdown')
>>> df_cum.Active.plot(legend=True);drawdown.plot(legend=True)
You can then determine the start and end points of the drawdown as you have previously done.
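A minimal sketch of that start/end extraction, mirroring the absolute case (drawdown as defined above is positive at the trough, hence idxmax):

mdd = drawdown.max()                            # depth of the active drawdown
end = drawdown.idxmax()                         # trough
start = (1 + df_cum.Active).loc[:end].idxmax()  # preceding peak of the active curve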
Comparing my cumulative Active return contribution with the amounts you calculated, you will find them to be similar at first, and then drift apart over time (my return calcs are in green):
My cheap two pennies in pure Python:
def find_drawdown(lista):
    peak = 0
    trough = 0
    drawdown = 0
    for n in lista:
        if n > peak:
            peak = n
            trough = peak
        if n < trough:
            trough = n
            temp_dd = peak - trough
            if temp_dd > drawdown:
                drawdown = temp_dd
    return -drawdown
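For example, on a toy price series (hypothetical numbers), the peak is 120 and the lowest subsequent trough is 80:

print(find_drawdown([100, 120, 90, 110, 80]))  # -40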
In piRSquared's answer I would suggest amending
pmb = p - b
to
pmb = p / b
to find the relative maxDD. df3, using pmb = p - b, identifies a relative MaxDD of US$851 (-48.9%); df2, using pmb = p / b, identifies the relative MaxDD as US$544.6 (-57.9%).
import numpy as np
import pandas as pd
import datetime
import pandas_datareader.data as pdr
import matplotlib.pyplot as plt
import yfinance as yfin
yfin.pdr_override()
stocks = ["AMZN", "SPY"]
df = pdr.get_data_yahoo(stocks, start="2020-01-01", end="2022-02-18")
df = df[['Adj Close']]
df.columns = df.columns.droplevel(0)
df.reset_index(inplace=True)
df.Date=df.Date.dt.date
df2 = df[df.Date.isin([datetime.date(2020,7,9), datetime.date(2022,2,3)])].copy()
df2['AMZN/SPY'] = df2.AMZN / df2.SPY
df2['AMZN-SPY'] = df2.AMZN - df2.SPY
df2['USDdiff'] = df2['AMZN-SPY'].diff().round(1)
df2[["p", "b"]] = df2[['AMZN','SPY']].pct_change(1).round(4)
df2['p-b'] = df2.p - df2.b
df2.replace(np.nan, '', regex=True, inplace=True)
df2 = df2.round(2)
print(df2)
        Date     AMZN    SPY  AMZN/SPY  AMZN-SPY  USDdiff        p       b      p-b
  2020-07-09  3182.63  307.7     10.34   2874.93
  2022-02-03  2776.91  446.6      6.22   2330.31   -544.6  -0.1275  0.4514  -0.5789
df3 = df[df.Date.isin([datetime.date(2020,9,2), datetime.date(2022,2,3)])].copy()
df3['AMZN/SPY'] = df3.AMZN / df3.SPY
df3['AMZN-SPY'] = df3.AMZN - df3.SPY
df3['USDdiff'] = df3['AMZN-SPY'].diff().round(1)
df3[["p", "b"]] = df3[['AMZN','SPY']].pct_change(1).round(4)
df3['p-b'] = df3.p - df3.b
df3.replace(np.nan, '', regex=True, inplace=True)
df3 = df3.round(2)
print(df3)
        Date     AMZN     SPY  AMZN/SPY  AMZN-SPY  USDdiff        p       b      p-b
  2020-09-02  3531.45  350.09     10.09   3181.36
  2022-02-03  2776.91  446.60      6.22   2330.31   -851.0  -0.2137  0.2757  -0.4894
PS: I don't have enough reputation to comment.
