Error using the 'predict' function for a logistic regression - python

I am trying to fit a multinomial logistic regression and then predict the result for new samples.
### RZS_TC is my dataframe
RZS_TC.loc[RZS_TC['Mean_Treecover'] <= 50, 'Mean_Treecover' ] = 0
RZS_TC.loc[RZS_TC['Mean_Treecover'] > 50, 'Mean_Treecover' ] = 1
RZS_TC[['MAP']+['Sr']+['delTC']+['Mean_Treecover']].head()
[Output]:
MAP Sr delTC Mean_Treecover
302993741 2159.297363 452.975647 2.666672 1.0
217364332 3242.351807 65.615341 8.000000 1.0
390863334 1617.215454 493.124054 5.666666 0.0
446559668 1095.183105 498.373383 -8.000000 0.0
246078364 2804.615234 98.981110 -4.000000 1.0
1000000 rows × 7 columns
#Fitting a logistic regression
from statsmodels.formula.api import mnlogit
model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()
print(model.summary2())
[Output]:
Results: MNLogit
====================================================================
Model: MNLogit Pseudo R-squared: 0.364
Dependent Variable: Mean_Treecover AIC: 831092.4595
Date: 2021-04-02 13:51 BIC: 831139.7215
No. Observations: 1000000 Log-Likelihood: -4.1554e+05
Df Model: 3 LL-Null: -6.5347e+05
Df Residuals: 999996 LLR p-value: 0.0000
Converged: 1.0000 Scale: 1.0000
No. Iterations: 7.0000
--------------------------------------------------------------------
Mean_Treecover = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
--------------------------------------------------------------------
Intercept -5.2200 0.0119 -438.4468 0.0000 -5.2434 -5.1967
MAP 0.0023 0.0000 491.0859 0.0000 0.0023 0.0023
Sr 0.0016 0.0000 90.6805 0.0000 0.0015 0.0016
delTC -0.0093 0.0002 -39.9022 0.0000 -0.0098 -0.0089
However, whenever I try to predict using the model.predict() function, I get the following error.
prediction = model.predict(np.array(RZS_TC[['MAP']+['Sr']+['delTC']]))
[Output]: ERROR! Session/line number was not unique in database. History logging moved to new session 2627
Does anyone know how to troubleshoot this? Is there something that I might be doing wrong?

The model adds an intercept, so you need to include that when predicting. Using some example data:
from statsmodels.formula.api import mnlogit
import pandas as pd
import numpy as np
RZS_TC = pd.DataFrame(np.random.uniform(0,1,(20,4)),
columns=['MAP','Sr','delTC','Mean_Treecover'])
RZS_TC['Mean_Treecover'] = round(RZS_TC['Mean_Treecover'])
model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()
You can see the dimensions of your fitted data:
model.model.exog[:5,]
Out[16]:
array([[1. , 0.33914763, 0.79358056, 0.3103758 ],
[1. , 0.45915785, 0.94991271, 0.27203524],
[1. , 0.55527662, 0.15122108, 0.80675951],
[1. , 0.18493681, 0.89854583, 0.66760684],
[1. , 0.38300074, 0.6945397 , 0.28128137]])
Which is the same as if you add a constant:
import statsmodels.api as sm
sm.add_constant(RZS_TC[['MAP','Sr','delTC']])
const MAP Sr delTC
0 1.0 0.339148 0.793581 0.310376
1 1.0 0.459158 0.949913 0.272035
2 1.0 0.555277 0.151221 0.806760
3 1.0 0.184937 0.898546 0.667607
If you have a DataFrame with the same column names, it will just be:
prediction = model.predict(RZS_TC[['MAP','Sr','delTC']])
Or if you just need the fitted values, do:
model.fittedvalues
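Note that for a multinomial model the predicted values are per-class probabilities, one column per outcome level, so if you want an actual class label you still have to pick the most likely column yourself. A minimal sketch, assuming model is the formula-fitted MNLogit result from above:
import numpy as np
# predict() expects a DataFrame carrying the columns named in the formula;
# patsy rebuilds the design matrix (including the intercept) from it
probs = model.predict(RZS_TC[['MAP', 'Sr', 'delTC']])
# one probability column per outcome level; take the most likely class per row
predicted_class = np.asarray(probs).argmax(axis=1)
print(predicted_class[:10])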

Related

Using GLM to reproduce built-in regression models in statsmodels

I am currently trying to reproduce a regression model, eq. (3) (edit: fixed link), in Python using statsmodels. As this model is not part of the standard models provided by statsmodels, I clearly have to write it myself using the provided formula api.
Since I have never worked with the formula api (or patsy for that matter), I wanted to start and verify my approach by reproducing standard models with the formula api and a generalized linear model. My code and the results for a poisson regression are given below at the end of my question.
You will see that it predicts the parameters beta = (2, -3, 1) for all three models with good accuracy. However, I have a couple of questions:
How do I explicitly add covariates to the GLM model with a regression coefficient fixed at 1?
From what I understand, a Poisson regression in general has the shape counts = exp(intercept + beta * x + log(exposure)), i.e. the log-exposure enters with a coefficient fixed at 1. I would like to reproduce this behaviour in my GLM model, i.e. I want something like counts = exp(intercept + beta * x + k * log(exposure)) where k is a fixed constant specified through the formula. (A sketch of the usual offset mechanism appears after the output below.)
Simply using formula = "1 + x1 + x2 + x3 + np.log(exposure)" returns a perfect separation error (why?). I can bypass that by adding some random noise to y, but in that case np.log(exposure) has a non-unity regression coefficient, i.e. it is treated as a normal regression covariate.
Apparently both built-in models 1 and 2 have no intercept, even though I tried to explicitly add one in model 2. Or is there a hidden intercept that is simply not reported? In either case, how do I fix that?
Any help would be greatly appreciated, so thanks in advance!
import numpy as np
import pandas as pd
np.random.seed(1+8+2022)
# Number of random samples
n = 4000
# Intercept
alpha = 1
# Regression Coefficients
beta = np.asarray([2.0,-3.0,1.0])
# Random Data
data = {
    "x1" : np.random.normal(1.00, 0.10, size = n),
    "x2" : np.random.normal(1.50, 0.15, size = n),
    "x3" : np.random.normal(-2.0, 0.20, size = n),
    "exposure": np.random.poisson(14, size = n),
}
# Calculate the response
x = np.asarray([data["x1"], data["x2"] , data["x3"]]).T
t = np.asarray(data["exposure"])
# Add response to random data
data["y"] = np.exp(alpha + np.dot(x,beta) + np.log(t))
# Convert dict to df
data = pd.DataFrame(data)
print(data)
#-----------------------------------------------------
# My Model
#-----------------------------------------------------
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "y ~ x1 + x2 + x3"
model = smf.glm(formula=formula, data=data, family=sm.families.Poisson()).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 1
#-----------------------------------------------------
import statsmodels.api as sm
data["offset"] = np.ones(n)
model = sm.Poisson( endog = data["y"],
                    exog = data[["x1", "x2", "x3"]],
                    exposure = data["exposure"],
                    offset = data["offset"]).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 2
#-----------------------------------------------------
import statsmodels.api as sm
data["x1"] = sm.add_constant(data["x1"])
model = sm.Poisson( endog = data["y"],
                    exog = data[["x1", "x2", "x3"]],
                    exposure = data["exposure"]).fit()
print(model.summary())
RESULTS:
x1 x2 x3 exposure y
0 1.151771 1.577677 -1.811903 13 0.508422
1 0.897012 1.678311 -2.327583 22 0.228219
2 1.040250 1.471962 -1.705458 13 0.621328
3 0.866195 1.512472 -1.766108 17 0.478107
4 0.925470 1.399320 -1.886349 13 0.512518
... ... ... ... ... ...
3995 1.073945 1.365260 -1.755071 12 0.804081
3996 0.855000 1.251951 -2.173843 11 0.439639
3997 0.892066 1.710856 -2.183085 10 0.107643
3998 0.763777 1.538938 -2.013619 22 0.363551
3999 1.056958 1.413922 -1.722252 19 1.098932
[4000 rows x 5 columns]
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: GLM Df Residuals: 3996
Model Family: Poisson Df Model: 3
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -2743.7
Date: Sat, 08 Jan 2022 Deviance: 141.11
Time: 09:32:32 Pearson chi2: 140.
No. Iterations: 4
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.6857 0.378 9.755 0.000 2.945 4.426
x1 2.0020 0.227 8.800 0.000 1.556 2.448
x2 -3.0393 0.148 -20.604 0.000 -3.328 -2.750
x3 0.9937 0.114 8.719 0.000 0.770 1.217
==============================================================================
Optimization terminated successfully.
Current function value: 0.668293
Iterations 10
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.09462
Time: 09:32:32 Log-Likelihood: -2673.2
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 4.619e-122
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.0000 0.184 10.875 0.000 1.640 2.360
x2 -3.0000 0.124 -24.160 0.000 -3.243 -2.757
x3 1.0000 0.094 10.667 0.000 0.816 1.184
==============================================================================
Optimization terminated successfully.
Current function value: 0.677893
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.08162
Time: 09:32:32 Log-Likelihood: -2711.6
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 2.196e-105
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.9516 0.304 9.711 0.000 2.356 3.547
x2 -2.9801 0.147 -20.275 0.000 -3.268 -2.692
x3 0.9807 0.113 8.655 0.000 0.759 1.203
==============================================================================
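On the question of holding the log(exposure) coefficient at exactly 1: the usual mechanism in statsmodels is the offset (or exposure) argument rather than a formula term. A minimal sketch reusing the data frame built above (illustrative only; since y was generated without noise, a perfect fit can still trip statsmodels' perfect-separation check, so in practice draw y from a Poisson distribution):
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
# log(exposure) enters the linear predictor with a coefficient fixed at 1
# and is not estimated
model_offset = smf.glm(formula="y ~ x1 + x2 + x3",
                       data=data,
                       family=sm.families.Poisson(),
                       offset=np.log(data["exposure"])).fit()
print(model_offset.summary())
# Equivalently, GLM can take the raw exposure and log it internally:
# smf.glm(..., exposure=data["exposure"])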

How to Incorporate and Forecast Lagged Time-Series Variables in a Python Regression Model

I'm trying to figure out how to incorporate lagged dependent variables into statsmodels or scikit-learn to forecast time series with AR terms, but cannot seem to find a solution.
The general linear equation looks something like this:
y = B1*y(t-1) + B2*x1(t) + B3*x2(t-3) + e
I know I can use pd.Series.shift(t) to create lagged variables and include them in the model to estimate parameters, but how can I get a prediction when the code does not know which variable is a lagged dependent variable?
In SAS's Proc Autoreg, you can designate which variable is a lagged dependent variable and it will forecast accordingly, but it seems like there are no options like that in Python.
Any help would be greatly appreciated and thank you in advance.
Since you've already mentioned statsmodels in your tags, you may want to take a look at statsmodels' ARIMA, i.e.:
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(endog=t, order=(2, 0, 0)) # p=2, d=0, q=0 for AR(2)
fit = model.fit()
fit.summary()
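As a side note, statsmodels.tsa.arima_model was deprecated and later removed in newer statsmodels releases; the replacement class lives in statsmodels.tsa.arima.model. A minimal sketch of the same AR(2) fit with the newer API (assuming t is the series being modelled):
from statsmodels.tsa.arima.model import ARIMA
# order=(2, 0, 0), i.e. p=2, d=0, q=0, is a plain AR(2) model
model = ARIMA(endog=t, order=(2, 0, 0))
fit = model.fit()
print(fit.summary())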
But like you mentioned, you could create new variables manually the way you described (I used some random data):
import numpy as np
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
df['random_variable'] = np.random.randint(0, 10, len(df))
df['y'] = np.random.rand(len(df))
df.index = df['date']
df = df[['y', 'value', 'random_variable']]
df.columns = ['y', 'x1', 'x2']
shifts = 3
for variable in df.columns.values:
    for t in range(1, shifts + 1):
        df[f'{variable} AR({t})'] = df.shift(t)[variable]
df = df.dropna()
>>> df.head()
y x1 x2 ... x2 AR(1) x2 AR(2) x2 AR(3)
date ...
1991-10-01 0.715115 3.611003 7 ... 5.0 7.0 7.0
1991-11-01 0.202662 3.565869 3 ... 7.0 5.0 7.0
1991-12-01 0.121624 4.306371 7 ... 3.0 7.0 5.0
1992-01-01 0.043412 5.088335 6 ... 7.0 3.0 7.0
1992-02-01 0.853334 2.814520 2 ... 6.0 7.0 3.0
[5 rows x 12 columns]
I'm using the model you describe in your post:
model = sm.OLS(df['y'], df[['y AR(1)', 'x1', 'x2 AR(3)']])
fit = model.fit()
>>> fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.696
Model: OLS Adj. R-squared: 0.691
Method: Least Squares F-statistic: 150.8
Date: Tue, 08 Oct 2019 Prob (F-statistic): 6.93e-51
Time: 17:51:20 Log-Likelihood: -53.357
No. Observations: 201 AIC: 112.7
Df Residuals: 198 BIC: 122.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
y AR(1) 0.2972 0.072 4.142 0.000 0.156 0.439
x1 0.0211 0.003 6.261 0.000 0.014 0.028
x2 AR(3) 0.0161 0.007 2.264 0.025 0.002 0.030
==============================================================================
Omnibus: 2.115 Durbin-Watson: 2.277
Prob(Omnibus): 0.347 Jarque-Bera (JB): 1.712
Skew: 0.064 Prob(JB): 0.425
Kurtosis: 2.567 Cond. No. 41.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
Hope this helps you get started.
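One thing the OLS approach above does not give you for free is the multi-step forecast the question asks about: once a lagged dependent variable is a regressor, you have to roll the forecast forward yourself, feeding each prediction back in as next period's y AR(1). A minimal sketch under the same column names as the example, assuming the future exogenous values are available in a DataFrame future_X (the helper name is mine):
import numpy as np
import pandas as pd
def forecast_with_lagged_y(fit, last_y, future_X):
    # fit      : fitted sm.OLS result with regressors ['y AR(1)', 'x1', 'x2 AR(3)']
    # last_y   : last observed y, used as the first 'y AR(1)' value
    # future_X : DataFrame with the known future columns 'x1' and 'x2 AR(3)'
    preds = []
    lag_y = last_y
    for _, row in future_X.iterrows():
        X_t = pd.DataFrame({'y AR(1)': [lag_y],
                            'x1': [row['x1']],
                            'x2 AR(3)': [row['x2 AR(3)']]})
        y_hat = float(np.asarray(fit.predict(X_t))[0])
        preds.append(y_hat)
        lag_y = y_hat  # the forecast becomes next period's lagged y
    return pd.Series(preds, index=future_X.index)
# e.g. forecast_with_lagged_y(fit, last_y=df['y'].iloc[-1], future_X=future_X)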

Dual beta in python - multiple linear regression with dummy variable in statsmodel

I am trying to calculate the dual beta in Python using a statsmodels regression. Unfortunately, I am getting an error message.
The regression equation for dual betas is given here
Dual Beta Formula
I am neglecting the risk free rate (rf) for now, but implementation should be similar once I add it. For now my code looks as follows, where my 'spx.xlsx' file simply has two columns with returns, called 'SPXrets' and 'AAPLrets' (+ one column with dates):
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
df = pd.read_excel('spx.xlsx')
print(df.columns)
mod = smf.ols(formula='AAPLrets ~ SPXrets', data=df)
res = mod.fit()
print(res.summary())
This prompts a patsy error:
PatsyError: intercept term cannot interact with anything else
AAPLrets ~ SPXrets:c(D) + SPXrets:(1-c(D))
Grateful for any help - many thanks!
Edit:
After my initial suggestions, the OP has changed both the title and the provided code snippet. My suggestions have since been edited accordingly.
New suggestion:
I suspect you're experiencing some problems with your dataset.
I suggest that you tell us a little more about the data source, how you've loaded the data, what it looks like (structure) and what type your columns have (string, float etc).
What I can tell you right now is that your snippet runs fine with some sample data like this:
Code:
CONret DAXret:c(D) DAXret:(1-c(D)) AAPLrets SPXrets dummy
2017-01-08 109 107 122 101 100 0
2017-01-09 117 108 124 113 147 0
2017-01-10 142 108 130 107 103 1
2017-01-11 106 121 149 103 104 1
2017-01-12 124 149 143 112 126 0
Output:
OLS Regression Results
==============================================================================
Dep. Variable: AAPLrets R-squared: 0.095
Model: OLS Adj. R-squared: 0.004
Method: Least Squares F-statistic: 1.044
Date: Thu, 14 Feb 2019 Prob (F-statistic): 0.331
Time: 16:00:01 Log-Likelihood: -48.388
No. Observations: 12 AIC: 100.8
Df Residuals: 10 BIC: 101.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 84.3198 31.143 2.708 0.022 14.929 153.711
SPXrets 0.2635 0.258 1.022 0.331 -0.311 0.838
==============================================================================
Omnibus: 5.649 Durbin-Watson: 1.882
Prob(Omnibus): 0.059 Jarque-Bera (JB): 2.933
Skew: 1.202 Prob(JB): 0.231
Kurtosis: 3.290 Cond. No. 872.
==============================================================================
Here's the whole thing:
# imports
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
import statsmodels.api as sm
# sample data
np.random.seed(1)
rows = 12
listVars= ['CONret','DAXret:c(D)', 'DAXret:(1-c(D))', 'AAPLrets', 'SPXrets']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df = df.set_index(rng)
df['dummy'] = np.random.randint(2, size=df.shape[0])
mod = smf.ols(formula='AAPLrets ~ SPXrets', data=df)
res = mod.fit()
res.summary()
Another suggestion:
Personally, I'd feel much more comfortable without patsy.
The snippet below will let you run a linear regression and select whether to return the model summary, or a dataframe with other details like coefficient p-values and r-squared.
# Imports
import pandas as pd
import numpy as np
import statsmodels.api as sm
# sample data
np.random.seed(1)
rows = 12
listVars= ['CONret','DAXret:c(D)', 'DAXret:(1-c(D))', 'AAPLrets', 'SPXrets']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df = df.set_index(rng)
df['dummy'] = np.random.randint(2, size=df.shape[0])
def LinReg(df, y, x, const, results):
    betas = x.copy()
    # Model with or without a constant
    if const == True:
        x = sm.add_constant(df[x])
        model = sm.OLS(df[y], x).fit()
    else:
        model = sm.OLS(df[y], df[x]).fit()
    # Estimates of R2 and p
    res1 = {'Y': [y],
            'R2': [format(model.rsquared, '.4f')],
            'p': [model.pvalues.tolist()],
            'start': [df.index[0]],
            'stop': [df.index[-1]],
            'obs' : [df.shape[0]],
            'X': [betas]}
    df_res1 = pd.DataFrame(data = res1)
    # Regression Coefficients
    theParams = model.params[0:]
    coefs = theParams.to_frame()
    df_coefs = pd.DataFrame(coefs.T)
    xNames = list(df_coefs)
    xValues = list(df_coefs.loc[0].values)
    xValues2 = [ '%.2f' % elem for elem in xValues ]
    res2 = {'Independent': [xNames],
            'beta': [xValues2]}
    df_res2 = pd.DataFrame(data = res2)
    # All results
    df_res = pd.concat([df_res1, df_res2], axis = 1)
    df_res = df_res.T
    df_res.columns = ['results']
    if results == 'summary':
        return(model.summary())
    else:
        return(df_res)
df_regression = LinReg(df = df, y = 'CONret', x = ['DAXret:c(D)', 'DAXret:(1-c(D))', 'dummy'], const = True, results = 'summary')
print(df_regression)
Test run 1:
df_regression = LinReg(df = df, y = 'CONret', x = ['DAXret:c(D)', 'DAXret:(1-c(D))'], const = True, results = '')
Output 1:
results
Y CONret
R2 0.0813
p [0.13194822614949883, 0.45726622261432304, 0.9...
start 2017-01-01 00:00:00
stop 2017-01-12 00:00:00
obs 12
X [DAXret:c(D), DAXret:(1-c(D)), dummy]
Independent [const, DAXret:c(D), DAXret:(1-c(D)), dummy]
beta [88.94, 0.24, -0.01, 2.20]
Test run 2:
df_regression = LinReg(df = df, y = 'CONret', x = ['DAXret:c(D)', 'DAXret:(1-c(D))', 'dummy'], const = True, results = 'summary')
Output 2:
OLS Regression Results
==============================================================================
Dep. Variable: CONret R-squared: 0.081
Model: OLS Adj. R-squared: -0.263
Method: Least Squares F-statistic: 0.2361
Date: Thu, 14 Feb 2019 Prob (F-statistic): 0.869
Time: 16:04:02 Log-Likelihood: -47.138
No. Observations: 12 AIC: 102.3
Df Residuals: 8 BIC: 104.2
Df Model: 3
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
const 88.9438 53.019 1.678 0.132 -33.318 211.205
DAXret:c(D) 0.2350 0.301 0.781 0.457 -0.459 0.929
DAXret:(1-c(D)) -0.0060 0.391 -0.015 0.988 -0.908 0.896
dummy 2.2005 8.973 0.245 0.812 -18.490 22.891
==============================================================================
Omnibus: 1.025 Durbin-Watson: 2.354
Prob(Omnibus): 0.599 Jarque-Bera (JB): 0.720
Skew: 0.540 Prob(JB): 0.698
Kurtosis: 2.477 Cond. No. 2.15e+03
==============================================================================
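Coming back to the original dual-beta formula: the patsy error is triggered by the literal 1 inside SPXrets:(1-c(D)), which patsy parses as the intercept term and refuses to interact. One way to sidestep the formula parser is to build the up-market and down-market interaction columns explicitly; a minimal sketch reusing the sample df above and treating its dummy column as the down-market indicator purely for illustration:
import statsmodels.formula.api as smf
# explicit interaction columns instead of "1 - ..." arithmetic inside the formula
df['SPX_down'] = df['SPXrets'] * df['dummy']        # market return in down periods
df['SPX_up'] = df['SPXrets'] * (1 - df['dummy'])    # market return in up periods
# dual-beta regression: separate slopes for up and down markets, common intercept
dual = smf.ols('AAPLrets ~ SPX_down + SPX_up', data=df).fit()
print(dual.summary())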

Scikit-Learn: Std.Error, p-Value from LinearRegression

I've been trying to get the standard errors & p-values using LinearRegression from scikit-learn, but with no success.
I ended up finding this article: but the std errors & p-values do not match those from the statsmodels.api OLS method
import numpy as np
from sklearn import datasets
from sklearn import linear_model
import regressor
import statsmodels.api as sm
boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False
X = boston.data[:,which_betas]
y = boston.target
#scikit + regressor stats
ols = linear_model.LinearRegression()
ols.fit(X,y)
xlables = boston.feature_names[which_betas]
regressor.summary(ols, X, y, xlables)
# statsmodel
x2 = sm.add_constant(X)
models = sm.OLS(y,x2)
result = models.fit()
print result.summary()
Output as follows:
Residuals:
Min 1Q Median 3Q Max
-26.3743 -1.9207 0.6648 2.8112 13.3794
Coefficients:
Estimate Std. Error t value p value
_intercept 36.925033 4.915647 7.5117 0.000000
CRIM -0.112227 0.031583 -3.5534 0.000416
ZN 0.047025 0.010705 4.3927 0.000014
INDUS 0.040644 0.055844 0.7278 0.467065
NOX -17.396989 3.591927 -4.8434 0.000002
RM 3.845179 0.272990 14.0854 0.000000
AGE 0.002847 0.009629 0.2957 0.767610
DIS -1.485557 0.180530 -8.2289 0.000000
RAD 0.327895 0.061569 5.3257 0.000000
TAX -0.013751 0.001055 -13.0395 0.000000
PTRATIO -0.991733 0.088994 -11.1438 0.000000
B 0.009827 0.001126 8.7256 0.000000
LSTAT -0.534914 0.042128 -12.6973 0.000000
---
R-squared: 0.73547, Adjusted R-squared: 0.72904
F-statistic: 114.23 on 12 features
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.735
Model: OLS Adj. R-squared: 0.729
Method: Least Squares F-statistic: 114.2
Date: Sun, 21 Aug 2016 Prob (F-statistic): 7.59e-134
Time: 21:56:26 Log-Likelihood: -1503.8
No. Observations: 506 AIC: 3034.
Df Residuals: 493 BIC: 3089.
Df Model: 12
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 36.9250 5.148 7.173 0.000 26.811 47.039
x1 -0.1122 0.033 -3.405 0.001 -0.177 -0.047
x2 0.0470 0.014 3.396 0.001 0.020 0.074
x3 0.0406 0.062 0.659 0.510 -0.081 0.162
x4 -17.3970 3.852 -4.516 0.000 -24.966 -9.828
x5 3.8452 0.421 9.123 0.000 3.017 4.673
x6 0.0028 0.013 0.214 0.831 -0.023 0.029
x7 -1.4856 0.201 -7.383 0.000 -1.881 -1.090
x8 0.3279 0.067 4.928 0.000 0.197 0.459
x9 -0.0138 0.004 -3.651 0.000 -0.021 -0.006
x10 -0.9917 0.131 -7.547 0.000 -1.250 -0.734
x11 0.0098 0.003 3.635 0.000 0.005 0.015
x12 -0.5349 0.051 -10.479 0.000 -0.635 -0.435
==============================================================================
Omnibus: 190.837 Durbin-Watson: 1.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 897.143
Skew: 1.619 Prob(JB): 1.54e-195
Kurtosis: 8.663 Cond. No. 1.51e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
I've also found the following articles
Find p-value (significance) in scikit-learn LinearRegression
http://connor-johnson.com/2014/02/18/linear-regression-with-python/
Both the codes in the SO link doesn't compile
Here is my code & data that I'm working on - but not being able to find the std error & p-values
import pandas as pd
import statsmodels.api as sm
import numpy as np
import scipy.linalg
from sklearn.linear_model import LinearRegression
from sklearn import metrics
def readFile(filename, sheetname):
    xlsx = pd.ExcelFile(filename)
    data = xlsx.parse(sheetname, skiprows=1)
    return data
def lr_statsmodel(X,y):
    X = sm.add_constant(X)
    model = sm.OLS(y,X)
    results = model.fit()
    print (results.summary())
def lr_scikit(X, y, featureCols):
    model = LinearRegression()
    results = model.fit(X, y)
    predictions = results.predict(X)
    print('Coefficients')
    print('Intercept\t', results.intercept_)
    df = pd.DataFrame(list(zip(featureCols, results.coef_)))
    print(df.to_string(index=False, header=False))
    # Query:: The numbers match Excel OLS, but I'm skeptical about treating score as rsquared
    rSquare = results.score(X, y)
    print('\nR-Square::', rSquare)
    # This looks like a better option
    # source: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
    r2 = metrics.r2_score(y, results.predict(X))
    print('r2', r2)
    # Query: No clue at all! http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
    print('Rsquared?!', metrics.explained_variance_score(y, results.predict(X)))
    # INFO:: All three of them are providing the same figures!
    # Adj-Rsquare formula # https://www.easycalculation.com/statistics/learn-adjustedr2.php
    # In ML, we don't use all of the data for training, hence it's highly unusual to find adjusted R-squared. Thus the need for manual calculation
    N = X.shape[0]
    p = X.shape[1]
    adjRsquare = 1 - ((1 - rSquare) * (N - 1) / (N - p - 1))
    print("Adjusted R-Square::", adjRsquare)
    # calculate standard errors
    # mean_absolute_error
    # mean_squared_error
    # median_absolute_error
    # r2_score
    # explained_variance_score
    mse = metrics.mean_squared_error(y, results.predict(X))
    print(mse)
    print('Residual Standard Error:', np.sqrt(mse))
    # OLS in matrix form: https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    n = X.shape[0]
    X1 = np.hstack((np.ones((n, 1)), np.matrix(X)))
    se_matrix = scipy.linalg.sqrtm(
        metrics.mean_squared_error(y, results.predict(X)) *
        np.linalg.inv(X1.T * X1)
    )
    print('se', np.diagonal(se_matrix))
    # https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    # http://regressors.readthedocs.io/en/latest/usage.html
    y_hat = results.predict(X)
    sse = np.sum((y_hat - y) ** 2)
    print('Standard Square Error of the Model:', sse)
if __name__ == '__main__':
    # read file
    fileData = readFile('Linear_regression.xlsx','Input Data')
    # list of independent variables
    feature_cols = ['Price per week','Population of city','Monthly income of riders','Average parking rates per month']
    # build dependent & independent data set
    X = fileData[feature_cols]
    y = fileData['Number of weekly riders']
    # Statsmodel - OLS
    # lr_statsmodel(X,y)
    # ScikitLearn - OLS
    lr_scikit(X,y,feature_cols)
My data-set
Y X1 X2 X3 X4
City Number of weekly riders Price per week Population of city Monthly income of riders Average parking rates per month
1 1,92,000 $15 18,00,000 $5,800 $50
2 1,90,400 $15 17,90,000 $6,200 $50
3 1,91,200 $15 17,80,000 $6,400 $60
4 1,77,600 $25 17,78,000 $6,500 $60
5 1,76,800 $25 17,50,000 $6,550 $60
6 1,78,400 $25 17,40,000 $6,580 $70
7 1,80,800 $25 17,25,000 $8,200 $75
8 1,75,200 $30 17,25,000 $8,600 $75
9 1,74,400 $30 17,20,000 $8,800 $75
10 1,73,920 $30 17,05,000 $9,200 $80
11 1,72,800 $30 17,10,000 $9,630 $80
12 1,63,200 $40 17,00,000 $10,570 $80
13 1,61,600 $40 16,95,000 $11,330 $85
14 1,61,600 $40 16,95,000 $11,600 $100
15 1,60,800 $40 16,90,000 $11,800 $105
16 1,59,200 $40 16,30,000 $11,830 $105
17 1,48,800 $65 16,40,000 $12,650 $105
18 1,15,696 $102 16,35,000 $13,000 $110
19 1,47,200 $75 16,30,000 $13,224 $125
20 1,50,400 $75 16,20,000 $13,766 $130
21 1,52,000 $75 16,15,000 $14,010 $150
22 1,36,000 $80 16,05,000 $14,468 $155
23 1,26,240 $86 15,90,000 $15,000 $165
24 1,23,888 $98 15,95,000 $15,200 $175
25 1,26,080 $87 15,90,000 $15,600 $175
26 1,51,680 $77 16,00,000 $16,000 $190
27 1,52,800 $63 16,10,000 $16,200 $200
I've exhausted all my options and whatever I could make sense of. Any guidance on how to compute std errors & p-values that match those from statsmodels.api would be appreciated.
EDIT: I'm trying to find the std error & p-values for intercept and all the independent variables
Here, reg is the output of sklearn's linear regression fit method.
To calculate adjusted R-squared:
def adjustedR2(x, y, reg):
    r2 = reg.score(x, y)
    n = x.shape[0]
    p = x.shape[1]
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return adjusted_r2
And for the p-values:
from sklearn.feature_selection import f_regression
freg=f_regression(x,y)
p=freg[1]
print(p.round(3))
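For completeness, standard errors and two-sided p-values that match statsmodels' OLS can also be rebuilt directly from a fitted LinearRegression using the classical OLS formulas; a minimal sketch (the helper name is mine):
import numpy as np
from scipy import stats
def ols_inference(model, X, y):
    # classical OLS std errors and p-values for a fitted sklearn LinearRegression
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    X1 = np.column_stack([np.ones(n), X])      # design matrix with intercept column
    resid = y - model.predict(X)
    dof = n - k - 1                            # residual degrees of freedom
    sigma2 = resid @ resid / dof               # unbiased estimate of the error variance
    cov = sigma2 * np.linalg.inv(X1.T @ X1)    # parameter covariance matrix
    se = np.sqrt(np.diag(cov))
    coefs = np.concatenate([[model.intercept_], model.coef_])
    t_vals = coefs / se
    p_vals = 2 * stats.t.sf(np.abs(t_vals), dof)
    return coefs, se, t_vals, p_vals
# e.g. with the Boston example above: ols_inference(ols, X, y)  (intercept comes first)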

Pandas: Implementing Breusch-Pagan with Panel data

I am currently using the following code to estimate PanelOLS:
Y = df['billsum']
X = df[['years_exp', 'leg_totalbills', 'amtsum', 'amtsumlag.1', 'cfcontrol', 'sen',\
        'Republican']]
from statsmodels.api import add_constant
X = add_constant(X)
from pandas.stats.plm import PanelOLS
reg = PanelOLS(Y, X, time_effects=True)
print('MODEL 1: OLS Regression Results', reg)
MODEL 1: OLS Regression Results
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <const> + <years_exp> + <leg_totalbills> + <amtsum> + <amtsumlag.1>
+ <cfcontrol> + <sen> + <Republican>
Number of Observations: 6930
Number of Degrees of Freedom: 17
R-squared: 0.7081
Adj R-squared: 0.7074
Rmse: 0.2423
F-stat (8, 6913): 1048.1396, p-value: 0.0000
Degrees of Freedom: model 16, resid 6913
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
const 0.0000 nan nan nan nan nan
years_exp 0.0205 0.0005 43.71 0.0000 0.0196 0.0214
leg_totalbills 0.0148 0.0005 32.94 0.0000 0.0139 0.0157
amtsum -0.0003 0.0001 -3.17 0.0015 -0.0005 -0.0001
amtsumlag.1 0.0005 0.0001 5.03 0.0000 0.0003 0.0007
--------------------------------------------------------------------------------
cfcontrol 0.3629 0.0168 21.61 0.0000 0.3299 0.3958
sen 0.0598 0.0177 3.38 0.0007 0.0251 0.0944
Republican 0.6540 0.0114 57.38 0.0000 0.6317 0.6764
---------------------------------End of Summary---------------------------------
I want to do the Breusch-Pagan test for Heteroskedasticity:
statsmodels.stats.diagnostic.het_breushpagan(resid, exog_het)
I know that I am supposed to input the residuals (probably in array format) and exog_het, which in my case would be X. The problem is that I do not know how to get PanelOLS to output the residuals. Actually, I'm not sure whether the residuals are the Std Err reported in the PanelOLS output. So, the question: where do the residuals show up in the regression output, and how can I get pandas to output them independently so that I can input them into the Breusch-Pagan test?
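The residuals are not the Std Err column; they are the per-observation errors (observed minus fitted values). A minimal sketch of feeding them into the Breusch-Pagan test, assuming reg is the fitted object from the snippet above and that it exposes its residuals as reg.resid (the old pandas PanelOLS did; pandas.stats.plm has since been removed, and with the linearmodels package's PanelOLS the fitted result exposes them as .resids instead). het_breuschpagan is the current spelling of the function in statsmodels.stats.diagnostic:
import numpy as np
from statsmodels.stats.diagnostic import het_breuschpagan
# residuals = observed y minus the model's fitted values, one per observation
resid = np.asarray(reg.resid)
# exog_het should include a constant column; X was built with add_constant above
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, np.asarray(X))
print('Breusch-Pagan LM p-value:', lm_pvalue)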
