Python: Linear Regression of Subsets of Columns

I have a dataset of monthly stock returns that I am trying to regress on market returns by quarter and by company ID (PERMNO). Here's what the data looks like:
date PERMNO MCAP FIRMRF MKTRF qtr
0 2018-01-02 10026 2.784892e+06 -0.017514 0.0085 1
7339 2018-01-03 10026 2.757077e+06 -0.010048 0.0059 1
14671 2018-01-04 10026 2.795160e+06 0.013753 0.0042 1
22003 2018-01-05 10026 2.768464e+06 -0.009610 0.0066 1
29334 2018-01-08 10026 2.770518e+06 0.000682 0.0019 1
... ... ... ... ... ... ...
8455011 2022-03-25 93436 1.044531e+09 -0.003235 0.0027 1
8464495 2022-03-28 93436 1.128454e+09 0.080345 0.0073 1
8473980 2022-03-29 93436 1.136443e+09 0.007080 0.0145 1
8483469 2022-03-30 93436 1.130676e+09 -0.005075 -0.0083 1
8492959 2022-03-31 93436 1.113736e+09 -0.014982 -0.0155 1
The goal is to have the slope (beta) and standard error of each firm, each quarter, stored as values in the same dataframe (the quarterly regression values would repeat for each line in a given quarter).
I've been scouring stackoverflow the past few days and have tried to repurpose a bunch of different answers here, but to no avail. So far, I'm assuming that it will need to look something like:
for i in daily['qtr']:
    for x in daily['PERMNO']:
        reg = sm.OLS(daily['FIRMRF'], sm.add_constant(daily['MKTRF']))
        results = reg.fit()
Any help, hints or advice is appreciated!

After a few days of experimenting I've found a solution to the problem. I was able to define a function that contains the OLS regression and then apply it via a groupby function:
# Pull beta estimates
# Define regression function
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = sm.add_constant(data[xvars])
    result = sm.OLS(Y, X).fit()
    return result.params

# Regression within each group
params1 = data.groupby(['PERMNO', 'year', 'qtr']).apply(regress, 'FIRMRF', 'MKTRF')
params_raw = pd.DataFrame(params1)
params_raw.head(10)
Hopefully this will be useful to someone else!
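To also get the standard error that the goal mentions, and to broadcast the quarterly values back onto every daily row, the same groupby pattern can return result.bse alongside the slope. This is just a sketch (untested), assuming the same data frame with year and qtr columns and the usual import statsmodels.api as sm / import pandas as pd:
def regress_stats(data, yvar, xvar):
    X = sm.add_constant(data[xvar])
    result = sm.OLS(data[yvar], X).fit()
    # slope on the market factor and its standard error
    return pd.Series({'beta': result.params[xvar], 'se': result.bse[xvar]})

stats = (data.groupby(['PERMNO', 'year', 'qtr'])
             .apply(regress_stats, 'FIRMRF', 'MKTRF')
             .reset_index())
# merge back so the quarterly beta and standard error repeat on every row of that quarter
data = data.merge(stats, on=['PERMNO', 'year', 'qtr'], how='left')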

Related

Finding Pivot Points for stock price, after grouping by symbol. Pivot Point is high for 10 values before and after point

Date Symbol Close Volume
1259 2021-10-29 AA 45.950 6350815.000
1260 2021-10-28 AA 46.450 10265029.000
1261 2021-10-27 AA 45.790 12864700.000
1262 2021-10-26 AA 49.442 6153100.000
1263 2021-10-25 AA 51.058 11070100.000
1264 2021-10-22 AA 49.143 7453300.000
1265 2021-10-21 AA 49.881 9066900.000
1266 2021-10-20 AA 52.396 7330400.000
1267 2021-10-19 AA 53.563 10860800.000
1268 2021-10-18 AA 57.115 9883800.000
Looking for a result similar to...
Date Symbol Close Volume High Points Pivot Point
1379 2021-05-11 AA 41.230 9042100.000 41.230 True
1568 2020-08-10 AA 15.536 8087800.000 15.536 True
1760 2019-11-04 AA 22.860 3741000.000 22.860 True
1934 2019-02-27 AA 30.912 2880100.000 30.912 True
2149 2018-04-19 AA 60.099 11779200.00 60.099 True
2213 2018-01-17 AA 56.866 8189700.000 56.866 True
2445 2017-02-14 AA 38.476 3818600.000 38.476 True
5406 2021-06-02 AAL 25.820 58094598.00 25.820 True
5461 2021-03-15 AAL 25.170 93746800.00 25.170 True
5654 2020-06-08 AAL 20.310 175418900.0 20.310 True
5734 2020-02-12 AAL 30.470 9315400.000 30.470 True
5807 2019-10-28 AAL 31.144 10298500.00 31.144 True
5874 2019-07-24 AAL 34.231 7315300.000 34.231 True
6083 2018-09-21 AAL 42.788 10743100.00 42.788 True
6257 2018-01-12 AAL 56.989 7505800.000 56.989 True
6322 2017-10-10 AAL 51.574 9387100.000 51.574 True
6383 2017-07-14 AAL 52.624 4537900.000 52.624 True
I'm new to programming and have been struggling with this one. I'm trying to find points that are a local max, which must be higher than the 10 closes before and after. The data frame has about 320 stocks in it and needs to be grouped by symbol. I have tried a few different approaches to solving this but haven't been able to find something that will work. Any insight would be greatly appreciated.
# read in data; vol_list is an existing screen to remove stocks that didn't meet the volume criteria
df_prices = pd.read_csv('/Users/kylemerrick/Desktop/Stock Screener/price_data.csv')
include_pivot_points = df_prices[df_prices['Symbol'].isin(vol_list)]
n = 10
pivot_points = include_pivot_points.groupby('Symbol')['Close'].apply(lambda x: x.iloc[argrelextrema(x.values, np.greater_equal, order=n)])
I have also tried writing my own function to do this but can't figure out how to compare the current value to the 10 values before and after it:
include_pivot_points.groupby('Symbol').iloc['Close'] + 10:['Close'] -10]
I was able to solve this eventually with the following code and wanted to share, as I didn't receive a reply. Many other solutions existed for pivot points (support/resistance points), but they appended each price point to a list or handled only one symbol; I wanted to keep a data frame with multiple symbols.
First I used a rolling window of 21, then a shift so that there are an equal number of values on each side of the current row:
include_pivot_points['High Points'] = include_pivot_points.groupby('Symbol').rolling(21)['Close'].max().shift(-11).reset_index(level = 'Symbol', drop = True)
If the high point equaled the current close, I knew this was a pivot point, and I added a True/False column flagging it:
include_pivot_points['Pivot Point'] = include_pivot_points['High Points'] == include_pivot_points['Close']
Then I removed the False values to get all past pivot points for all stocks:
pivot_points = include_pivot_points[include_pivot_points['Pivot Point'] == True]
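For what it's worth, the "higher than the 10 closes before and after" condition can also be expressed directly as a centered rolling window per symbol. A sketch (untested, assuming each symbol's rows are contiguous and sorted by date):
n = 10
# max over a centered 21-row window: 10 rows before, the row itself, 10 rows after
centered_max = (include_pivot_points
                .groupby('Symbol')['Close']
                .transform(lambda s: s.rolling(2 * n + 1, center=True).max()))
include_pivot_points['Pivot Point'] = include_pivot_points['Close'] == centered_max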

Exponential Smoothing with alpha and beta greater than one

I have the following time series
year value
2001-01-01 433.0
2002-01-01 445.0
2003-01-01 406.0
2004-01-01 416.0
2005-01-01 432.0
2006-01-01 458.0
2007-01-01 418.0
2008-01-01 392.0
2009-01-01 464.0
2010-01-01 434.0
2012-01-01 435.0
2013-01-01 437.0
2014-01-01 465.0
2015-01-01 442.0
2016-01-01 456.0
2017-01-01 448.0
2018-01-01 433.0
2019-01-01 399.0
that I want to fit with an Exponential Smoothing model. I define my model the following way:
model = ExponentialSmoothing(dataframe, missing='drop', trend='mul', seasonal_periods=5,
                             seasonal='add', initialization_method="heuristic")
model = model.fit(optimized=True, method="basinhopping")
where I let the algorithm optimize the values of smoothing_level=$\alpha$, smoothing_trend=$\beta$, smoothing_seasonal=$\gamma$ and damping_trend=$\phi$.
However, when I print the results for this specific case, I get: $\alpha=1.49$, $\beta=1.41$, $\gamma=0.0$ and $\phi=0.0$.
Could someone explain to me what's happening here?
Are these values of $\alpha$ and $\beta$ greater than 1 acceptable?
I think you're misinterpreting the results. We can run your model as follows:
data = [
    433.0, 445.0, 406.0, 416.0, 432.0, 458.0,
    418.0, 392.0, 464.0, 434.0, 435.0, 437.0,
    465.0, 442.0, 456.0, 448.0, 433.0, 399.0]
model = sm.tsa.ExponentialSmoothing(data, missing='drop', trend='mul', seasonal_periods=5,
                                    seasonal='add', initialization_method="heuristic")
res = model.fit(optimized=True, method="basinhopping")
print(res.params['smoothing_level'])
print(res.params['smoothing_trend'])
which gives me:
1.4901161193847656e-08
1.4873988732462211e-08
Notice the e-08 part - the first parameter isn't equal to 1.49, it's equal to 0.0000000149.
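If the scientific notation is easy to miss, printing the same parameters in fixed-point form makes it obvious that both values are effectively zero (a small illustrative snippet):
print(f"smoothing_level: {res.params['smoothing_level']:.10f}")   # 0.0000000149
print(f"smoothing_trend: {res.params['smoothing_trend']:.10f}")   # 0.0000000149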

calculating moving average in pandas

So, this is a fairly new topic for me and I don't quite understand it yet. I wanted to make a new column in a dataset that contains the moving average of the volume column. The window size is 5, and the moving average of row x is calculated from rows x-2, x-1, x, x+1, and x+2. For x=1 and x=2, the moving average is calculated using three and four rows, respectively.
I did this:
df['Volume_moving'] = df.iloc[:,5].rolling(window=5).mean()
df
Date Open High Low Close Volume Adj Close Volume_moving
0 2012-10-15 632.35 635.13 623.85 634.76 15446500 631.87 NaN
1 2012-10-16 635.37 650.30 631.00 649.79 19634700 646.84 NaN
2 2012-10-17 648.87 652.79 644.00 644.61 13894200 641.68 NaN
3 2012-10-18 639.59 642.06 630.00 632.64 17022300 629.76 NaN
4 2012-10-19 631.05 631.77 609.62 609.84 26574500 607.07 18514440.0
... ... ... ... ... ... ... ... ...
85 2013-01-08 529.21 531.89 521.25 525.31 16382400 525.31 17504860.0
86 2013-01-09 522.50 525.01 515.99 517.10 14557300 517.10 16412620.0
87 2013-01-10 528.55 528.72 515.52 523.51 21469500 523.51 18185340.0
88 2013-01-11 521.00 525.32 519.02 520.30 12518100 520.30 16443720.0
91 2013-01-14 502.68 507.50 498.51 501.75 26179000 501.75 18221260.0
However, I think that the result is not accurate, as I tried it with a different dataframe and got the exact same result.
Can anyone please help me with this?
Try with this:
df['Volume_moving'] = df['Volume'].rolling(window=5).mean()
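Note that rolling(window=5) on its own is a trailing window; for the centered window described in the question (two rows on each side, with a smaller window at the edges), something like the following should be closer. A sketch, untested, still assuming a 'Volume' column:
# centered 5-row window; min_periods=3 lets the first and last rows
# be averaged over the 3-4 rows that are actually available
df['Volume_moving'] = df['Volume'].rolling(window=5, center=True, min_periods=3).mean()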

Getting a simple predict from OLS: something different from statsmodels 0.6 to 0.8

Sorry for cross-posting this, but I can't get past it: I cannot get output from the predict function.
I have an OLS model that used to work with statsmodels 0.6 and is now not working in 0.8; pandas also went from 0.19.2 to 0.20.3, so that could be the issue.
I just don't understand what I need to feed to the predict method.
So my model creation looks like:
def fit_line2(x, y):
    """Return slope, intercept of best fit line."""
    X = sm.add_constant(x, prepend=True)  # add a column of ones to allow calculation of the intercept
    ols_test = sm.OLS(y, X, missing='drop').fit()
    return ols_test
And that works fine and I get a model out and can see the summary fine.
I used to do the following to get the prediction one period ahead, using my latest value (on which I want to project forward); this worked in statsmodels 0.6.
The predict is called as follows:
yrahead=ols_test.predict(ols_input)
ols_input is created from a pandas DataFrame:
ols_input=(sm.add_constant(merged2.lastqu[-1:], prepend=True))
lastqu
2018-12-31 13209.0
type:
<class 'pandas.core.frame.DataFrame'>
calling predict as:
yrahead=ols_test.predict(ols_input)
This gives me an error:
ValueError: shapes (1,1) and (2,) not aligned: 1 (dim 1) != 2 (dim 0)
I tried simply feeding the number by changing ols_input to:
13209.0
Type:
<class 'numpy.float64'>
That gave me a similar error:
ValueError: shapes (1,1) and (2,) not aligned: 1 (dim 1) != 2 (dim 0)
Not sure where to go here?
The base DataFrame (merged2) from the above looks like the following; the last line's lastqu column contains the value I want to predict Units for:
Units lastqu Uperchg lqperchg
2000-12-31 19391.000000 NaN NaN NaN
2001-12-31 35068.000000 5925.0 80.85 NaN
2002-12-31 39279.000000 8063.0 12.01 36.08
2003-12-31 47517.000000 9473.0 20.97 17.49
2004-12-31 51439.000000 11226.0 8.25 18.51
2005-12-31 59674.000000 11667.0 16.01 3.93
2006-12-31 58664.000000 14016.0 -1.69 20.13
2007-12-31 55698.000000 13186.0 -5.06 -5.92
2008-12-31 42235.000000 11343.0 -24.17 -13.98
2009-12-31 40478.333333 7867.0 -4.16 -30.64
2010-12-31 38721.666667 8114.0 -4.34 3.14
2011-12-31 36965.000000 8361.0 -4.54 3.04
2012-12-31 39132.000000 8608.0 5.86 2.95
2013-12-31 43160.000000 9016.0 10.29 4.74
2014-12-31 44520.000000 9785.0 3.15 8.53
2015-12-31 49966.000000 10351.0 12.23 5.78
2016-12-31 53752.000000 10884.0 7.58 5.15
2017-12-31 57571.000000 12109.0 7.10 11.26
2018-12-31 NaN 13209.0 NaN 9.08
So I'm using OLS against lastqu to project Units for 2018.
I freely confess to not really understanding why statsmodels 0.6 worked the way it did, but it did!
After some discussion with the statsmodels library author, it seems there is a bug; see the discussion here: https://groups.google.com/d/topic/pystatsmodels/a0XsXIiP5ro/discussion
Note my final solution for my specific issue was:
ols_input=np.array([1,merged2.lastqu[-1:].values])
yrahead=ols_test.predict(ols_input)
which yields the Units for the next period.
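For reference, the likely culprit in cases like this is that add_constant skips adding the intercept column when the single prediction row already looks constant; forcing it with has_constant='add' avoids building the array by hand. A sketch based on the variables above (untested, and has_constant requires a reasonably recent statsmodels):
ols_input = sm.add_constant(merged2.lastqu[-1:], prepend=True, has_constant='add')
yrahead = ols_test.predict(ols_input)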

Calculate max draw down with a vectorized solution in python

Maximum Drawdown is a common risk metric used in quantitative finance to assess the largest negative return that has been experienced.
Recently, I became impatient with the time to calculate max drawdown using my looped approach.
def max_dd_loop(returns):
    """returns is assumed to be a pandas series"""
    max_so_far = None
    start, end = None, None
    r = returns.add(1).cumprod()
    for r_start in r.index:
        for r_end in r.index:
            if r_start < r_end:
                current = r.ix[r_end] / r.ix[r_start] - 1
                if (max_so_far is None) or (current < max_so_far):
                    max_so_far = current
                    start, end = r_start, r_end
    return max_so_far, start, end
I'm familiar with the common perception that a vectorized solution would be better.
The questions are:
can I vectorize this problem?
What does this solution look like?
How beneficial is it?
Edit
I modified Alexander's answer into the following function:
def max_dd(returns):
    """Assumes returns is a pandas Series"""
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.argmin()
    start = r.loc[:end].argmax()
    return mdd, start, end
df_returns is assumed to be a dataframe of returns, where each column is a separate strategy/manager/security, and each row is a new date (e.g. monthly or daily).
cum_returns = (1 + df_returns).cumprod()
drawdown = 1 - cum_returns.div(cum_returns.cummax())
I had first suggested using an .expanding() window, but that's obviously not necessary with the .cumprod() and .cummax() built-ins to calculate max drawdown up to any given point:
import numpy as np
import pandas as pd
from datetime import date

df = pd.DataFrame(data={'returns': np.random.normal(0.001, 0.05, 1000)},
                  index=pd.date_range(start=date(2016, 1, 1), periods=1000, freq='D'))
df['cumulative_return'] = df.returns.add(1).cumprod().subtract(1)
df['max_drawdown'] = df.cumulative_return.add(1).div(df.cumulative_return.cummax().add(1)).subtract(1)
returns cumulative_return max_drawdown
2016-01-01 -0.014522 -0.014522 0.000000
2016-01-02 -0.022769 -0.036960 -0.022769
2016-01-03 0.026735 -0.011214 0.000000
2016-01-04 0.054129 0.042308 0.000000
2016-01-05 -0.017562 0.024004 -0.017562
2016-01-06 0.055254 0.080584 0.000000
2016-01-07 0.023135 0.105583 0.000000
2016-01-08 -0.072624 0.025291 -0.072624
2016-01-09 -0.055799 -0.031919 -0.124371
2016-01-10 0.129059 0.093020 -0.011363
2016-01-11 0.056123 0.154364 0.000000
2016-01-12 0.028213 0.186932 0.000000
2016-01-13 0.026914 0.218878 0.000000
2016-01-14 -0.009160 0.207713 -0.009160
2016-01-15 -0.017245 0.186886 -0.026247
2016-01-16 0.003357 0.190869 -0.022979
2016-01-17 -0.009284 0.179813 -0.032050
2016-01-18 -0.027361 0.147533 -0.058533
2016-01-19 -0.058118 0.080841 -0.113250
2016-01-20 -0.049893 0.026914 -0.157492
2016-01-21 -0.013382 0.013173 -0.168766
2016-01-22 -0.020350 -0.007445 -0.185681
2016-01-23 -0.085842 -0.092648 -0.255584
2016-01-24 0.022406 -0.072318 -0.238905
2016-01-25 0.044079 -0.031426 -0.205356
2016-01-26 0.045782 0.012917 -0.168976
2016-01-27 -0.018443 -0.005764 -0.184302
2016-01-28 0.021461 0.015573 -0.166797
2016-01-29 -0.062436 -0.047836 -0.218819
2016-01-30 -0.013274 -0.060475 -0.229189
... ... ... ...
2018-08-28 0.002124 0.559122 -0.478738
2018-08-29 -0.080303 0.433921 -0.520597
2018-08-30 -0.009798 0.419871 -0.525294
2018-08-31 -0.050365 0.348359 -0.549203
2018-09-01 0.080299 0.456631 -0.513004
2018-09-02 0.013601 0.476443 -0.506381
2018-09-03 -0.009678 0.462153 -0.511158
2018-09-04 -0.026805 0.422960 -0.524262
2018-09-05 0.040832 0.481062 -0.504836
2018-09-06 -0.035492 0.428496 -0.522411
2018-09-07 -0.011206 0.412489 -0.527762
2018-09-08 0.069765 0.511031 -0.494817
2018-09-09 0.049546 0.585896 -0.469787
2018-09-10 -0.060201 0.490423 -0.501707
2018-09-11 -0.018913 0.462235 -0.511131
2018-09-12 -0.094803 0.323611 -0.557477
2018-09-13 0.025736 0.357675 -0.546088
2018-09-14 -0.049468 0.290514 -0.568542
2018-09-15 0.018146 0.313932 -0.560713
2018-09-16 -0.034118 0.269104 -0.575700
2018-09-17 0.012191 0.284576 -0.570527
2018-09-18 -0.014888 0.265451 -0.576921
2018-09-19 0.041180 0.317562 -0.559499
2018-09-20 0.001988 0.320182 -0.558623
2018-09-21 -0.092268 0.198372 -0.599348
2018-09-22 -0.015386 0.179933 -0.605513
2018-09-23 -0.021231 0.154883 -0.613888
2018-09-24 -0.023536 0.127701 -0.622976
2018-09-25 0.030160 0.161712 -0.611605
2018-09-26 0.025528 0.191368 -0.601690
Given a time series of returns, we need to evaluate the aggregate return for every combination of starting point to ending point.
The first trick is to convert a time series of returns into a series of return indices. Given a series of return indices, I can calculate the return over any sub-period with the return index at the beginning ri_0 and at the end ri_1. The calculation is: ri_1 / ri_0 - 1.
The second trick is to produce a second series of inverses of return indices. If r is my series of return indices then 1 / r is my series of inverses.
The third trick is to take the matrix product of r * (1 / r).Transpose.
r is an n x 1 matrix. (1 / r).Transpose is a 1 x n matrix. The resulting product contains every combination of ri_j / ri_k. Just subtract 1 and I've actually got returns.
The fourth trick is to ensure that I'm constraining my denominator to represent periods prior to those being represented by the numerator.
Below is my vectorized function.
import numpy as np
import pandas as pd

def max_dd(returns):
    # make returns into a DataFrame so that it is a 2-dimensional
    # matrix such that I can perform an nx1 by 1xn matrix
    # multiplication and end up with an nxn matrix
    r = pd.DataFrame(returns).add(1).cumprod()
    # I copy r.T to ensure r's index is not the same
    # object as 1 / r.T's columns object
    x = r.dot(1 / r.T.copy()) - 1
    x.columns.name, x.index.name = 'start', 'end'
    # let's make sure we only calculate a return when start
    # is less than end
    y = x.stack().reset_index()
    y = y[y.start < y.end]
    # my choice is to return the periods and the actual max
    # draw down
    z = y.set_index(['start', 'end']).iloc[:, 0]
    return z.min(), z.argmin()[0], z.argmin()[1]
How does this perform?
For the vectorized solution I ran 10 iterations over time series of lengths [10, 50, 100, 150, 200]. The times are below:
10: 0.032 seconds
50: 0.044 seconds
100: 0.055 seconds
150: 0.082 seconds
200: 0.047 seconds
The same test for the looped solution is below:
10: 0.153 seconds
50: 3.169 seconds
100: 12.355 seconds
150: 27.756 seconds
200: 49.726 seconds
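For anyone wanting to reproduce this kind of comparison, a rough benchmark harness might look like the following (hypothetical; not the code used for the numbers above):
import time
import numpy as np
import pandas as pd

def benchmark(func, lengths=(10, 50, 100, 150, 200), n_iter=10):
    # time each implementation over n_iter runs on random return series
    for n in lengths:
        returns = pd.Series(np.random.normal(0.001, 0.05, n),
                            index=pd.date_range('2016-01-01', periods=n, freq='D'))
        t0 = time.perf_counter()
        for _ in range(n_iter):
            func(returns)
        print(f'{n}: {time.perf_counter() - t0:.3f} seconds')

# benchmark(max_dd)       # vectorized version
# benchmark(max_dd_loop)  # looped version (relies on .ix, so needs an older pandas)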
Edit
Alexander's answer provides superior results. Here is the same test using the modified code:
10: 0.000 seconds
50: 0.000 seconds
100: 0.004 seconds
150: 0.007 seconds
200: 0.008 seconds
I modified his code into the following function:
def max_dd(returns):
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.argmin()
    start = r.loc[:end].argmax()
    return mdd, start, end
I recently had a similar issue, but instead of a global MDD, I was required to find the MDD for the interval after each peak. Also, in my case, I was supposed to take the MDD of each strategy alone and thus wasn't required to apply the cumprod. My vectorized implementation is also based on Investopedia.
def calc_MDD(networth):
    df = pd.Series(networth, name="nw").to_frame()
    max_peaks_idx = df.nw.expanding(min_periods=1).apply(lambda x: x.argmax()).fillna(0).astype(int)
    df['max_peaks_idx'] = pd.Series(max_peaks_idx).to_frame()
    nw_peaks = pd.Series(df.nw.iloc[max_peaks_idx.values].values, index=df.nw.index)
    df['dd'] = ((df.nw - nw_peaks) / nw_peaks)
    df['mdd'] = df.groupby('max_peaks_idx').dd.apply(lambda x: x.expanding(min_periods=1).apply(lambda y: y.min())).fillna(0)
    return df
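Calling it only requires a sequence of net-worth values; for example (hypothetical values, matching the first rows of the sample below):
# illustrative equity curve just to show the call shape
networth = [10000.0, 9696.948, 9538.576, 9303.953, 9247.259, 9421.519]
result = calc_MDD(networth)
print(result)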
Here is a sample after running this code:
nw max_peaks_idx dd mdd
0 10000.000 0 0.000000 0.000000
1 9696.948 0 -0.030305 -0.030305
2 9538.576 0 -0.046142 -0.046142
3 9303.953 0 -0.069605 -0.069605
4 9247.259 0 -0.075274 -0.075274
5 9421.519 0 -0.057848 -0.075274
6 9315.938 0 -0.068406 -0.075274
7 9235.775 0 -0.076423 -0.076423
8 9091.121 0 -0.090888 -0.090888
9 9033.532 0 -0.096647 -0.096647
10 8947.504 0 -0.105250 -0.105250
11 8841.551 0 -0.115845 -0.115845
(The original answer included a plot of these drawdowns applied to the complete dataset.)
Although vectorized, this code is probably slower than the other one, because each time series will usually have many peaks and each of them requires its own calculation, so the cost is O(n_peaks * n_intervals).
PS: I could have eliminated the zero values in the dd and mdd columns, but I find it useful that these values help indicate when a new peak was observed in the time-series.
