Dual beta in python - multiple linear regression with dummy variable in statsmodel

Dual beta in python - multiple linear regression with dummy variable in statsmodel - python

I am trying to calculate the dual beta in python using a statsmodel regression. Unfortunately I am prompting an error message.
The regression equation for dual betas is given here
Dual Beta Formula
I am neglecting the risk free rate (rf) for now, but implementation should be similiar once I add it. For now my code looks as follows, where my 'spx.xlsx' file simple has two columns with returns, called 'SPXrets' and 'AAPLrets' (+ one column with dates):
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
df = pd.read_excel('spx.xlsx')
print(df.columns)
mod = smf.ols(formula='AAPLrets ~ SPXrets', data=df)
res = mod.fit()
print(res.summary())
Prompting an patsy error:
PatsyError: intercept term cannot interact with anything else
AAPLrets ~ SPXrets:c(D) + SPXrets:(1-c(D))
Grateful for any help - many thanks!

Edit:
After my initial suggestions, the OP has changed both the title and the provided code snippet. My suggestions have since been edited accordingly.
New suggestion:
I suspect you're experiencing some problems with your dataset.
I suggest that you tell us a little more about the data source, how you've loaded the data, what it looks like (structure) and what type your columns have (string, float etc).
What I can tell you right now, is that your snippet runs fine with some sample data like this:
Code:
CONret DAXret:c(D) DAXret:(1-c(D)) AAPLrets SPXrets dummy
2017-01-08 109 107 122 101 100 0
2017-01-09 117 108 124 113 147 0
2017-01-10 142 108 130 107 103 1
2017-01-11 106 121 149 103 104 1
2017-01-12 124 149 143 112 126 0
Output:
OLS Regression Results
==============================================================================
Dep. Variable: AAPLrets R-squared: 0.095
Model: OLS Adj. R-squared: 0.004
Method: Least Squares F-statistic: 1.044
Date: Thu, 14 Feb 2019 Prob (F-statistic): 0.331
Time: 16:00:01 Log-Likelihood: -48.388
No. Observations: 12 AIC: 100.8
Df Residuals: 10 BIC: 101.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 84.3198 31.143 2.708 0.022 14.929 153.711
SPXrets 0.2635 0.258 1.022 0.331 -0.311 0.838
==============================================================================
Omnibus: 5.649 Durbin-Watson: 1.882
Prob(Omnibus): 0.059 Jarque-Bera (JB): 2.933
Skew: 1.202 Prob(JB): 0.231
Kurtosis: 3.290 Cond. No. 872.
==============================================================================
Here's the whole thing:
# imports
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
import statsmodels.api as sm
# sample data
np.random.seed(1)
rows = 12
listVars= ['CONret','DAXret:c(D)', 'DAXret:(1-c(D))', 'AAPLrets', 'SPXrets']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df = df.set_index(rng)
df['dummy'] = np.random.randint(2, size=df.shape[0])
mod = smf.ols(formula='AAPLrets ~ SPXrets', data=df)
res = mod.fit()
res.summary()
Another suggestion:
Personally, I'd feel much more comfortable without patsy.
The snippet below will let you run a linear regression and select whether to return the model summary, or a dataframe with other details like coefficient p-values and r-squared.
# Imports
import pandas as pd
import numpy as np
import statsmodels.api as sm
# sample data
np.random.seed(1)
rows = 12
listVars= ['CONret','DAXret:c(D)', 'DAXret:(1-c(D))', 'AAPLrets', 'SPXrets']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df = df.set_index(rng)
df['dummy'] = np.random.randint(2, size=df.shape[0])
def LinReg(df, y, x, const, results):
betas = x.copy()
# Model with out without a constant
if const == True:
x = sm.add_constant(df[x])
model = sm.OLS(df[y], x).fit()
else:
model = sm.OLS(df[y], df[x]).fit()
# Estimates of R2 and p
res1 = {'Y': [y],
'R2': [format(model.rsquared, '.4f')],
'p': [model.pvalues.tolist()],
'start': [df.index[0]],
'stop': [df.index[-1]],
'obs' : [df.shape[0]],
'X': [betas]}
df_res1 = pd.DataFrame(data = res1)
# Regression Coefficients
theParams = model.params[0:]
coefs = theParams.to_frame()
df_coefs = pd.DataFrame(coefs.T)
xNames = list(df_coefs)
xValues = list(df_coefs.loc[0].values)
xValues2 = [ '%.2f' % elem for elem in xValues ]
res2 = {'Independent': [xNames],
'beta': [xValues2]}
df_res2 = pd.DataFrame(data = res2)
# All results
df_res = pd.concat([df_res1, df_res2], axis = 1)
df_res = df_res.T
df_res.columns = ['results']
if results == 'summary':
return(model.summary())
print(model.summary())
else:
return(df_res)
df_regression = LinReg(df = df, y = 'CONret', x = ['DAXret:c(D)', 'DAXret:(1-c(D))', 'dummy'], const = True, results = 'summary')
print(df_regression)
Test run 1:
df_regression = LinReg(df = df, y = 'CONret', x = ['DAXret:c(D)', 'DAXret:(1-c(D))'], const = True, results = '')
Output 1:
results
Y CONret
R2 0.0813
p [0.13194822614949883, 0.45726622261432304, 0.9...
start 2017-01-01 00:00:00
stop 2017-01-12 00:00:00
obs 12
X [DAXret:c(D), DAXret:(1-c(D)), dummy]
Independent [const, DAXret:c(D), DAXret:(1-c(D)), dummy]
beta [88.94, 0.24, -0.01, 2.20]
Test run 2:
df_regression = LinReg(df = df, y = 'CONret', x = ['DAXret:c(D)', 'DAXret:(1-c(D))', 'dummy'], const = True, results = 'summary')
Output 2:
OLS Regression Results
==============================================================================
Dep. Variable: CONret R-squared: 0.081
Model: OLS Adj. R-squared: -0.263
Method: Least Squares F-statistic: 0.2361
Date: Thu, 14 Feb 2019 Prob (F-statistic): 0.869
Time: 16:04:02 Log-Likelihood: -47.138
No. Observations: 12 AIC: 102.3
Df Residuals: 8 BIC: 104.2
Df Model: 3
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
const 88.9438 53.019 1.678 0.132 -33.318 211.205
DAXret:c(D) 0.2350 0.301 0.781 0.457 -0.459 0.929
DAXret:(1-c(D)) -0.0060 0.391 -0.015 0.988 -0.908 0.896
dummy 2.2005 8.973 0.245 0.812 -18.490 22.891
==============================================================================
Omnibus: 1.025 Durbin-Watson: 2.354
Prob(Omnibus): 0.599 Jarque-Bera (JB): 0.720
Skew: 0.540 Prob(JB): 0.698
Kurtosis: 2.477 Cond. No. 2.15e+03
==============================================================================

Related

Using GLM to reproduce built-in regression models in statsmodels

I am currently trying to reproduce a regression model eq. (3) (edit: fixed link) in python using statsmodels. As this model is no part of the standard models provided by statsmodels I clearly have to write it myself using the provided formula api.
Since I have never worked with the formula api (or patsy for that matter), I wanted to start and verify my approach by reproducing standard models with the formula api and a generalized linear model. My code and the results for a poisson regression are given below at the end of my question.
You will see that it predicts the parameters beta = (2, -3, 1) for all three models with good accuracy. However, I have a couple of questions:
How do I explicitly add covariates to the glm model with a
regression coefficient equal to 1?
From what I understand, a poisson regression in general has the shape ln(counts) = exp(intercept + beta * x + log(exposure)), i.e. the exposure is added through a fixed constant of value 1. I would like to reproduce this behaviour in my glm model, i.e. I want something like ln(counts) = exp(intercept + beta * x + k * log(exposure)) where k is a fixed constant as a formula.
Simply using formula = "1 + x1 + x2 + x3 + np.log(exposure)" returns a perfect seperation error (why?). I can bypass that by adding some random noise to y, but in that case np.log(exposure) has a non-unity regression coefficient, i.e. it is treated as a normal regression covariate.
Apparently both built-in models 1 and 2 have no intercept, eventhough I tried to explicitly add one in model 2. Or is there a hidden intercept that is simply not reported? In either case, how do I fix that?
Any help would be greatly appreciated, so thanks in advance!
import numpy as np
import pandas as pd
np.random.seed(1+8+2022)
# Number of random samples
n = 4000
# Intercept
alpha = 1
# Regression Coefficients
beta = np.asarray([2.0,-3.0,1.0])
# Random Data
data = {
"x1" : np.random.normal(1.00,0.10, size = n),
"x2" : np.random.normal(1.50,0.15, size = n),
"x3" : np.random.normal(-2.0,0.20, size = n),
"exposure": np.random.poisson(14, size = n),
}
# Calculate the response
x = np.asarray([data["x1"], data["x2"] , data["x3"]]).T
t = np.asarray(data["exposure"])
# Add response to random data
data["y"] = np.exp(alpha + np.dot(x,beta) + np.log(t))
# Convert dict to df
data = pd.DataFrame(data)
print(data)
#-----------------------------------------------------
# My Model
#-----------------------------------------------------
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "y ~ x1 + x2 + x3"
model = smf.glm(formula=formula, data=data, family=sm.families.Poisson()).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 1
#-----------------------------------------------------
import statsmodels.api as sm
data["offset"] = np.ones(n)
model = sm.Poisson( endog = data["y"],
exog = data[["x1", "x2", "x3"]],
exposure = data["exposure"],
offset = data["offset"]).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 2
#-----------------------------------------------------
import statsmodels.api as sm
data["x1"] = sm.add_constant(data["x1"])
model = sm.Poisson( endog = data["y"],
exog = data[["x1", "x2", "x3"]],
exposure = data["exposure"]).fit()
print(model.summary())
RESULTS:
x1 x2 x3 exposure y
0 1.151771 1.577677 -1.811903 13 0.508422
1 0.897012 1.678311 -2.327583 22 0.228219
2 1.040250 1.471962 -1.705458 13 0.621328
3 0.866195 1.512472 -1.766108 17 0.478107
4 0.925470 1.399320 -1.886349 13 0.512518
... ... ... ... ... ...
3995 1.073945 1.365260 -1.755071 12 0.804081
3996 0.855000 1.251951 -2.173843 11 0.439639
3997 0.892066 1.710856 -2.183085 10 0.107643
3998 0.763777 1.538938 -2.013619 22 0.363551
3999 1.056958 1.413922 -1.722252 19 1.098932
[4000 rows x 5 columns]
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: GLM Df Residuals: 3996
Model Family: Poisson Df Model: 3
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -2743.7
Date: Sat, 08 Jan 2022 Deviance: 141.11
Time: 09:32:32 Pearson chi2: 140.
No. Iterations: 4
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.6857 0.378 9.755 0.000 2.945 4.426
x1 2.0020 0.227 8.800 0.000 1.556 2.448
x2 -3.0393 0.148 -20.604 0.000 -3.328 -2.750
x3 0.9937 0.114 8.719 0.000 0.770 1.217
==============================================================================
Optimization terminated successfully.
Current function value: 0.668293
Iterations 10
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.09462
Time: 09:32:32 Log-Likelihood: -2673.2
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 4.619e-122
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.0000 0.184 10.875 0.000 1.640 2.360
x2 -3.0000 0.124 -24.160 0.000 -3.243 -2.757
x3 1.0000 0.094 10.667 0.000 0.816 1.184
==============================================================================
Optimization terminated successfully.
Current function value: 0.677893
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.08162
Time: 09:32:32 Log-Likelihood: -2711.6
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 2.196e-105
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.9516 0.304 9.711 0.000 2.356 3.547
x2 -2.9801 0.147 -20.275 0.000 -3.268 -2.692
x3 0.9807 0.113 8.655 0.000 0.759 1.203
==============================================================================
Process return 0 (0x0)
Press Enter to continue...

How to Incorporate and Forecast Lagged Time-Series Variables in a Python Regression Model

I'm trying to figure out how to incorporate lagged dependent variables into statsmodel or scikitlearn to forecast time series with AR terms but cannot seem to find a solution.
The general linear equation looks something like this:
y = B1*y(t-1) + B2*x1(t) + B3*x2(t-3) + e
I know I can use pd.Series.shift(t) to create lagged variables and then add it to be included in the model and generate parameters, but how can I get a prediction when the code does not know which variable is a lagged dependent variable?
In SAS's Proc Autoreg, you can designate which variable is a lagged dependent variable and will forecast accordingly, but it seems like there are no options like that in Python.
Any help would be greatly appreciated and thank you in advance.

Since you're already mentioned statsmodels in your tags you may want to take a look at statsmodels - ARIMA, i.e.:
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(endog=t, order=(2, 0, 0)) # p=2, d=0, q=0 for AR(2)
fit = model.fit()
fit.summary()
But like you mentioned, you could create new variables manually the way you described (I used some random data):
import numpy as np
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
df['random_variable'] = np.random.randint(0, 10, len(df))
df['y'] = np.random.rand(len(df))
df.index = df['date']
df = df[['y', 'value', 'random_variable']]
df.columns = ['y', 'x1', 'x2']
shifts = 3
for variable in df.columns.values:
for t in range(1, shifts + 1):
df[f'{variable} AR({t})'] = df.shift(t)[variable]
df = df.dropna()
>>> df.head()
y x1 x2 ... x2 AR(1) x2 AR(2) x2 AR(3)
date ...
1991-10-01 0.715115 3.611003 7 ... 5.0 7.0 7.0
1991-11-01 0.202662 3.565869 3 ... 7.0 5.0 7.0
1991-12-01 0.121624 4.306371 7 ... 3.0 7.0 5.0
1992-01-01 0.043412 5.088335 6 ... 7.0 3.0 7.0
1992-02-01 0.853334 2.814520 2 ... 6.0 7.0 3.0
[5 rows x 12 columns]
I'm using the model you describe in your post:
model = sm.OLS(df['y'], df[['y AR(1)', 'x1', 'x2 AR(3)']])
fit = model.fit()
>>> fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.696
Model: OLS Adj. R-squared: 0.691
Method: Least Squares F-statistic: 150.8
Date: Tue, 08 Oct 2019 Prob (F-statistic): 6.93e-51
Time: 17:51:20 Log-Likelihood: -53.357
No. Observations: 201 AIC: 112.7
Df Residuals: 198 BIC: 122.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
y AR(1) 0.2972 0.072 4.142 0.000 0.156 0.439
x1 0.0211 0.003 6.261 0.000 0.014 0.028
x2 AR(3) 0.0161 0.007 2.264 0.025 0.002 0.030
==============================================================================
Omnibus: 2.115 Durbin-Watson: 2.277
Prob(Omnibus): 0.347 Jarque-Bera (JB): 1.712
Skew: 0.064 Prob(JB): 0.425
Kurtosis: 2.567 Cond. No. 41.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
Hope this helps you getting started.

Python: Super Dictionary of Simple OLS

I am trying to build a super dictionary which holds within a number of lower level libraries
Concept
I have interest rates for my retail bank for the last 12 years and I am trying to model the interest rates by using a portfolio of different bonds.
Regression formula
Y_i - Y_i-1 = A + B(X_i - X_i-1) + E
In words, Y_Lag = alpha + beta(X_Lag) + Error term
Data
Note: Y = Historic Rate
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100,17)),
columns=['Historic Rate', 'Overnight', '1M', '3M', '6M','1Y','2Y','3Y','4Y','5Y','6Y','7Y','8Y','9Y','10Y','12Y','15Y'])
Code thus far
#Import packages required for the analysis
import pandas as pd
import numpy as np
import statsmodels.api as sm
def Simulation(TotalSim,j):
#super dictionary to hold all iterations of the loop
Super_fit_d = {}
for i in range(1,TotalSim):
#Create a introductory loop to run the first set of regressions
#Each loop produces a univariate regression
#Each loop has a fixed lag of i
fit_d = {} # This will hold all of the fit results and summaries
for col in [x for x in df.columns if x != 'Historic Rate']:
Y = df['Historic Rate'] - df['Historic Rate'].shift(1)
# Need to remove the NaN for fit
Y = Y[Y.notnull()]
X = df[col] - df[col].shift(i)
X = X[X.notnull()]
#Y now has more observations than X due to lag, drop rows to match
Y = Y.drop(Y.index[0:i-1])
if j = 1:
X = sm.add_constant(X) # Add a constant to the fit
fit_d[col] = sm.OLS(Y,X).fit()
#append the dictionary for each lag onto the super dictionary
Super_fit_d[lag_i] = fit_d
#Check the output for one column
fit_d['Overnight'].summary()
#Check the output for one column in one segment of the super dictionary
Super_fit_d['lag_5'].fit_d['Overnight'].summary()
Simulation(11,1)
Question
I seem to be overwriting my dictionary with every loop and I'm not evaluating the i properly to index the iteration as lag_1, lag_2, lag_3 etc. How do I fix this?
Thanks in advance

There are a couple of issues here:
you sometimes use i and sometimes lag_i, but only i is defined. I changed all to lag_i for consistency
if j = 1 is incorrect syntax. You need if j == 1
You need to return fit_d so that it persists after your loop
I got it done by applying those changes
import pandas as pd
import numpy as np
import statsmodels.api as sm
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100,17)),
columns=['Historic Rate', 'Overnight', '1M', '3M', '6M','1Y','2Y','3Y','4Y','5Y','6Y','7Y','8Y','9Y','10Y','12Y','15Y'])
def Simulation(TotalSim,j):
Super_fit_d = {}
for lag_i in range(1,TotalSim):
#Create a introductory loop to run the first set of regressions
#Each loop produces a univariate regression
#Each loop has a fixed lag of i
fit_d = {} # This will hold all of the fit results and summaries
for col in [x for x in df.columns if x != 'Historic Rate']:
Y = df['Historic Rate'] - df['Historic Rate'].shift(1)
# Need to remove the NaN for fit
Y = Y[Y.notnull()]
X = df[col] - df[col].shift(lag_i)
X = X[X.notnull()]
#Y now has more observations than X due to lag, drop rows to match
Y = Y.drop(Y.index[0:lag_i-1])
if j == 1:
X = sm.add_constant(X) # Add a constant to the fit
fit_d[col] = sm.OLS(Y,X).fit()
#append the dictionary for each lag onto the super dictionary
# return fit_d
Super_fit_d[lag_i] = fit_d
return Super_fit_d
test_dict = Simulation(11,1)
First lag
test_dict[1]['Overnight'].summary()
Out[76]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Historic Rate R-squared: 0.042
Model: OLS Adj. R-squared: 0.033
Method: Least Squares F-statistic: 4.303
Date: Fri, 28 Sep 2018 Prob (F-statistic): 0.0407
Time: 11:15:13 Log-Likelihood: -280.39
No. Observations: 99 AIC: 564.8
Df Residuals: 97 BIC: 570.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0048 0.417 -0.012 0.991 -0.833 0.823
Overnight 0.2176 0.105 2.074 0.041 0.009 0.426
==============================================================================
Omnibus: 1.449 Durbin-Watson: 2.756
Prob(Omnibus): 0.485 Jarque-Bera (JB): 1.180
Skew: 0.005 Prob(JB): 0.554
Kurtosis: 2.465 Cond. No. 3.98
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
Second Lag
test_dict[2]['Overnight'].summary()
Out[77]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Historic Rate R-squared: 0.001
Model: OLS Adj. R-squared: -0.010
Method: Least Squares F-statistic: 0.06845
Date: Fri, 28 Sep 2018 Prob (F-statistic): 0.794
Time: 11:15:15 Log-Likelihood: -279.44
No. Observations: 98 AIC: 562.9
Df Residuals: 96 BIC: 568.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0315 0.428 0.074 0.941 -0.817 0.880
Overnight 0.0291 0.111 0.262 0.794 -0.192 0.250
==============================================================================
Omnibus: 2.457 Durbin-Watson: 2.798
Prob(Omnibus): 0.293 Jarque-Bera (JB): 1.735
Skew: 0.115 Prob(JB): 0.420
Kurtosis: 2.391 Cond. No. 3.84
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""

Scikit-Learn: Std.Error, p-Value from LinearRegression

I've been trying to get the standard error & p-Values by using LR from scikit-learn. But no success.
I've end up finding up this article: but the std error & p-value does not match that from the statsmodel.api OLS method
import numpy as np
from sklearn import datasets
from sklearn import linear_model
import regressor
import statsmodels.api as sm
boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False
X = boston.data[:,which_betas]
y = boston.target
#scikit + regressor stats
ols = linear_model.LinearRegression()
ols.fit(X,y)
xlables = boston.feature_names[which_betas]
regressor.summary(ols, X, y, xlables)
# statsmodel
x2 = sm.add_constant(X)
models = sm.OLS(y,x2)
result = models.fit()
print result.summary()
Output as follows:
Residuals:
Min 1Q Median 3Q Max
-26.3743 -1.9207 0.6648 2.8112 13.3794
Coefficients:
Estimate Std. Error t value p value
_intercept 36.925033 4.915647 7.5117 0.000000
CRIM -0.112227 0.031583 -3.5534 0.000416
ZN 0.047025 0.010705 4.3927 0.000014
INDUS 0.040644 0.055844 0.7278 0.467065
NOX -17.396989 3.591927 -4.8434 0.000002
RM 3.845179 0.272990 14.0854 0.000000
AGE 0.002847 0.009629 0.2957 0.767610
DIS -1.485557 0.180530 -8.2289 0.000000
RAD 0.327895 0.061569 5.3257 0.000000
TAX -0.013751 0.001055 -13.0395 0.000000
PTRATIO -0.991733 0.088994 -11.1438 0.000000
B 0.009827 0.001126 8.7256 0.000000
LSTAT -0.534914 0.042128 -12.6973 0.000000
---
R-squared: 0.73547, Adjusted R-squared: 0.72904
F-statistic: 114.23 on 12 features
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.735
Model: OLS Adj. R-squared: 0.729
Method: Least Squares F-statistic: 114.2
Date: Sun, 21 Aug 2016 Prob (F-statistic): 7.59e-134
Time: 21:56:26 Log-Likelihood: -1503.8
No. Observations: 506 AIC: 3034.
Df Residuals: 493 BIC: 3089.
Df Model: 12
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 36.9250 5.148 7.173 0.000 26.811 47.039
x1 -0.1122 0.033 -3.405 0.001 -0.177 -0.047
x2 0.0470 0.014 3.396 0.001 0.020 0.074
x3 0.0406 0.062 0.659 0.510 -0.081 0.162
x4 -17.3970 3.852 -4.516 0.000 -24.966 -9.828
x5 3.8452 0.421 9.123 0.000 3.017 4.673
x6 0.0028 0.013 0.214 0.831 -0.023 0.029
x7 -1.4856 0.201 -7.383 0.000 -1.881 -1.090
x8 0.3279 0.067 4.928 0.000 0.197 0.459
x9 -0.0138 0.004 -3.651 0.000 -0.021 -0.006
x10 -0.9917 0.131 -7.547 0.000 -1.250 -0.734
x11 0.0098 0.003 3.635 0.000 0.005 0.015
x12 -0.5349 0.051 -10.479 0.000 -0.635 -0.435
==============================================================================
Omnibus: 190.837 Durbin-Watson: 1.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 897.143
Skew: 1.619 Prob(JB): 1.54e-195
Kurtosis: 8.663 Cond. No. 1.51e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
I've also found the following articles
Find p-value (significance) in scikit-learn LinearRegression
http://connor-johnson.com/2014/02/18/linear-regression-with-python/
Both the codes in the SO link doesn't compile
Here is my code & data that I'm working on - but not being able to find the std error & p-values
import pandas as pd
import statsmodels.api as sm
import numpy as np
import scipy
from sklearn.linear_model import LinearRegression
from sklearn import metrics
def readFile(filename, sheetname):
xlsx = pd.ExcelFile(filename)
data = xlsx.parse(sheetname, skiprows=1)
return data
def lr_statsmodel(X,y):
X = sm.add_constant(X)
model = sm.OLS(y,X)
results = model.fit()
print (results.summary())
def lr_scikit(X,y,featureCols):
model = LinearRegression()
results = model.fit(X,y)
predictions = results.predict(X)
print 'Coefficients'
print 'Intercept\t' , results.intercept_
df = pd.DataFrame(zip(featureCols, results.coef_))
print df.to_string(index=False, header=False)
# Query:: The numbers matches with Excel OLS but skeptical about relating score as rsquared
rSquare = results.score(X,y)
print '\nR-Square::', rSquare
# This looks like a better option
# source: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
r2 = metrics.r2_score(y,results.predict(X))
print 'r2', r2
# Query: No clue at all! http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
print 'Rsquared?!' , metrics.explained_variance_score(y, results.predict(X))
# INFO:: All three of them are providing the same figures!
# Adj-Rsquare formula # https://www.easycalculation.com/statistics/learn-adjustedr2.php
# In ML, we don't use all of the data for training, and hence its highly unusual to find AdjRsquared. Thus the need for manual calculation
N = X.shape[0]
p = X.shape[1]
adjRsquare = 1 - ((1 - rSquare ) * (N - 1) / (N - p - 1))
print "Adjusted R-Square::", adjRsquare
# calculate standard errors
# mean_absolute_error
# mean_squared_error
# median_absolute_error
# r2_score
# explained_variance_score
mse = metrics.mean_squared_error(y,results.predict(X))
print mse
print 'Residual Standard Error:', np.sqrt(mse)
# OLS in Matrix : https://github.com/nsh87/regressors/blob/master/regressors/stats.py
n = X.shape[0]
X1 = np.hstack((np.ones((n, 1)), np.matrix(X)))
se_matrix = scipy.linalg.sqrtm(
metrics.mean_squared_error(y, results.predict(X)) *
np.linalg.inv(X1.T * X1)
)
print 'se',np.diagonal(se_matrix)
# https://github.com/nsh87/regressors/blob/master/regressors/stats.py
# http://regressors.readthedocs.io/en/latest/usage.html
y_hat = results.predict(X)
sse = np.sum((y_hat - y) ** 2)
print 'Standard Square Error of the Model:', sse
if __name__ == '__main__':
# read file
fileData = readFile('Linear_regression.xlsx','Input Data')
# list of independent variables
feature_cols = ['Price per week','Population of city','Monthly income of riders','Average parking rates per month']
# build dependent & independent data set
X = fileData[feature_cols]
y = fileData['Number of weekly riders']
# Statsmodel - OLS
# lr_statsmodel(X,y)
# ScikitLearn - OLS
lr_scikit(X,y,feature_cols)
My data-set
Y X1 X2 X3 X4
City Number of weekly riders Price per week Population of city Monthly income of riders Average parking rates per month
1 1,92,000 $15 18,00,000 $5,800 $50
2 1,90,400 $15 17,90,000 $6,200 $50
3 1,91,200 $15 17,80,000 $6,400 $60
4 1,77,600 $25 17,78,000 $6,500 $60
5 1,76,800 $25 17,50,000 $6,550 $60
6 1,78,400 $25 17,40,000 $6,580 $70
7 1,80,800 $25 17,25,000 $8,200 $75
8 1,75,200 $30 17,25,000 $8,600 $75
9 1,74,400 $30 17,20,000 $8,800 $75
10 1,73,920 $30 17,05,000 $9,200 $80
11 1,72,800 $30 17,10,000 $9,630 $80
12 1,63,200 $40 17,00,000 $10,570 $80
13 1,61,600 $40 16,95,000 $11,330 $85
14 1,61,600 $40 16,95,000 $11,600 $100
15 1,60,800 $40 16,90,000 $11,800 $105
16 1,59,200 $40 16,30,000 $11,830 $105
17 1,48,800 $65 16,40,000 $12,650 $105
18 1,15,696 $102 16,35,000 $13,000 $110
19 1,47,200 $75 16,30,000 $13,224 $125
20 1,50,400 $75 16,20,000 $13,766 $130
21 1,52,000 $75 16,15,000 $14,010 $150
22 1,36,000 $80 16,05,000 $14,468 $155
23 1,26,240 $86 15,90,000 $15,000 $165
24 1,23,888 $98 15,95,000 $15,200 $175
25 1,26,080 $87 15,90,000 $15,600 $175
26 1,51,680 $77 16,00,000 $16,000 $190
27 1,52,800 $63 16,10,000 $16,200 $200
I've exhausted all my options and whatever I could make sense of. So any guidance on how to compute std error & p-values that is the same as per the statsmodel.api is appreciated.
EDIT: I'm trying to find the std error & p-values for intercept and all the independent variables

Here is reg is output of lin regression fit method of sklearn
to calculate adjusted r2
def adjustedR2(x,y reg):
r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
return adjusted_r2
and for p values
from sklearn.feature_selection import f_regression
freg=f_regression(x,y)
p=freg[1]
print(p.round(3))

how to get equation for nonlinear mutivariate regression in which one variable is dependent on other two independent variables in python

I have set of 5000 data points of like_so_ (x,y,z) for eg (0,1,50)
where x=1,y=2,z=120.with help of these 5000 enteries,i have to get an equation in
which given x and y ,equation should be able to get value of z

You can use statsmodels.ols. Some sample data - assuming you can create a pd.DataFrame from your (x, y, z) data:
import pandas as pd
df = pd.DataFrame(np.random.randint(100, size=(150, 3)), columns=list('XYZ'))
df.info()
RangeIndex: 150 entries, 0 to 149
Data columns (total 3 columns):
X 150 non-null int64
Y 150 non-null int64
Z 150 non-null int64
Now estimate linear regression parameters:
import numpy as np
import statsmodels.api as sm
model = sm.OLS(df['Z'], df[['X', 'Y']])
results = model.fit()
to get:
results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Z R-squared: 0.652
Model: OLS Adj. R-squared: 0.647
Method: Least Squares F-statistic: 138.6
Date: Fri, 17 Jun 2016 Prob (F-statistic): 1.21e-34
Time: 13:48:38 Log-Likelihood: -741.94
No. Observations: 150 AIC: 1488.
Df Residuals: 148 BIC: 1494.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
X 0.5224 0.076 6.874 0.000 0.372 0.673
Y 0.3531 0.076 4.667 0.000 0.204 0.503
==============================================================================
Omnibus: 5.869 Durbin-Watson: 1.921
Prob(Omnibus): 0.053 Jarque-Bera (JB): 2.990
Skew: -0.000 Prob(JB): 0.224
Kurtosis: 2.308 Cond. No. 2.70
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
to predict, use:
params = results.params
params = results.params
df['predictions'] = model.predict(params)
which yields:
X Y Z predictions
0 31 85 75 54.701830
1 36 46 43 34.828605
2 77 42 8 43.795386
3 78 84 65 66.932761
4 27 54 50 36.737606

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Dual beta in python - multiple linear regression with dummy variable in statsmodel - python

Related

Using GLM to reproduce built-in regression models in statsmodels

How to Incorporate and Forecast Lagged Time-Series Variables in a Python Regression Model

Python: Super Dictionary of Simple OLS

Scikit-Learn: Std.Error, p-Value from LinearRegression

how to get equation for nonlinear mutivariate regression in which one variable is dependent on other two independent variables in python

Categories

Resources