For a fixed-effects model I was planning to switch from Stata's areg to Python's linearmodels.panel.PanelOLS.
But the results differ: in Stata I get R-squared = 0.6047, while in Python I get R-squared = 0.1454.
Why do I get such different R-squared values from the commands below?
Stata command and results:
use ./linearmodels_datasets_wage_panel.dta, clear
areg lwage expersq union married hours, vce(cluster nr) absorb(nr)
Linear regression, absorbing indicators Number of obs = 4,360
Absorbed variable: nr No. of categories = 545
F(4, 544) = 84.67
Prob > F = 0.0000
R-squared = 0.6047
Adj R-squared = 0.5478
Root MSE = 0.3582
(Std. err. adjusted for 545 clusters in nr)
------------------------------------------------------------------------------
| Robust
lwage | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
expersq | .0039509 .0002554 15.47 0.000 .0034492 .0044526
union | .0784442 .0252621 3.11 0.002 .028821 .1280674
married | .1146543 .0234954 4.88 0.000 .0685014 .1608072
hours | -.0000846 .0000238 -3.56 0.000 -.0001313 -.0000379
_cons | 1.565825 .0531868 29.44 0.000 1.461348 1.670302
------------------------------------------------------------------------------
Python command and results:
from linearmodels.datasets import wage_panel
from linearmodels.panel import PanelOLS
data = wage_panel.load()
mod_entity = PanelOLS.from_formula(
    "lwage ~ 1 + expersq + union + married + hours + EntityEffects",
    data=data.set_index(["nr", "year"]),
)
result_entity = mod_entity.fit(
    cov_type='clustered',
    cluster_entity=True,
)
print(result_entity)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.1454
Estimator: PanelOLS R-squared (Between): -0.0844
No. Observations: 4360 R-squared (Within): 0.1454
Date: Wed, Feb 02 2022 R-squared (Overall): 0.0219
Time: 12:23:24 Log-likelihood -1416.4
Cov. Estimator: Clustered
F-statistic: 162.14
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(4,3811)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 96.915
P-value 0.0000
Time periods: 8 Distribution: F(4,3811)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Intercept 1.5658 0.0497 31.497 0.0000 1.4684 1.6633
expersq 0.0040 0.0002 16.550 0.0000 0.0035 0.0044
hours -8.46e-05 2.22e-05 -3.8101 0.0001 -0.0001 -4.107e-05
married 0.1147 0.0220 5.2207 0.0000 0.0716 0.1577
union 0.0784 0.0236 3.3221 0.0009 0.0321 0.1247
==============================================================================
F-test for Poolability: 9.4833
P-value: 0.0000
Distribution: F(544,3811)
Included effects: Entity
You are trying to run an absorbing regression (areg), i.e. a linear regression absorbing one categorical factor. To do this, you can run the model linearmodels.iv.absorbing.AbsorbingLS(endog_variable, exog_variables, absorb=categorical_variable_to_absorb).
See the example below:
import pandas as pd
import statsmodels.api as sm
from linearmodels.iv import absorbing

dta = pd.read_csv('http://www.math.smith.edu/~bbaumer/mth247/labs/airline.csv')
dta.rename(columns={'I': 'airline',
                    'T': 'year',
                    'Q': 'output',
                    'C': 'cost',
                    'PF': 'fuel',
                    'LF ': 'load'}, inplace=True)
Next, transform the absorbing variable into a categorical variable (in this case, I will use the airline variable):
cats = pd.DataFrame({'airline': pd.Categorical(dta['airline'])})
Then, just run the model:
exog_variables = ['output', 'fuel', 'load']
endog_variable = ['cost']
exog = sm.add_constant(dta[exog_variables])
endog = dta[endog_variable]
model = absorbing.AbsorbingLS(endog, exog, absorb=cats, drop_absorbed=True)
model_res = model.fit(cov_type='unadjusted', debiased=True)
print(model_res.summary)
Below are the results of this same model in both Python and Stata (using the command areg cost output fuel load, absorb(airline)):
Python:
Absorbing LS Estimation Summary
==================================================================================
Dep. Variable: cost R-squared: 0.9974
Estimator: Absorbing LS Adj. R-squared: 0.9972
No. Observations: 90 F-statistic: 3827.4
Date: Thu, Oct 27 2022 P-value (F-stat): 0.0000
Time: 20:58:04 Distribution: F(3,81)
Cov. Estimator: unadjusted R-squared (No Effects): 0.9926
Varaibles Absorbed: 5.0000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 9.7135 0.2229 43.585 0.0000 9.2701 10.157
output 0.9193 0.0290 31.691 0.0000 0.8616 0.9770
fuel 0.4175 0.0148 28.303 0.0000 0.3881 0.4468
load -1.0704 0.1957 -5.4685 0.0000 -1.4599 -0.6809
==============================================================================
Stata:
Linear regression, absorbing indicators Number of obs = 90
F( 3, 81) = 3604.80
Prob > F = 0.0000
R-squared = 0.9974
Adj R-squared = 0.9972
Root MSE = .06011
------------------------------------------------------------------------------
cost | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
output | .9192846 .0298901 30.76 0.000 .8598126 .9787565
fuel | .4174918 .0151991 27.47 0.000 .3872503 .4477333
load | -1.070396 .20169 -5.31 0.000 -1.471696 -.6690963
_cons | 9.713528 .229641 42.30 0.000 9.256614 10.17044
-------------+----------------------------------------------------------------
airline | F(5, 81) = 57.732 0.000 (6 categories)
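Applied to the wage panel from the original question, the same recipe would look roughly like this (a minimal sketch; it mirrors the example above with the plain covariance estimator, while the clustered standard errors from the areg call would additionally require the clustered covariance option of fit):
import pandas as pd
import statsmodels.api as sm
from linearmodels.datasets import wage_panel
from linearmodels.iv import absorbing

data = wage_panel.load()

# absorb the individual identifier, mirroring absorb(nr) in areg
cats = pd.DataFrame({'nr': pd.Categorical(data['nr'])})

exog = sm.add_constant(data[['expersq', 'union', 'married', 'hours']])
endog = data['lwage']

model = absorbing.AbsorbingLS(endog, exog, absorb=cats, drop_absorbed=True)
model_res = model.fit(cov_type='unadjusted', debiased=True)
print(model_res.summary)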
I am currently trying to reproduce a regression model, eq. (3) (edit: fixed link), in Python using statsmodels. As this model is not part of the standard models provided by statsmodels, I clearly have to write it myself using the provided formula API.
Since I have never worked with the formula API (or patsy, for that matter), I wanted to start by verifying my approach, reproducing standard models with the formula API and a generalized linear model. My code and the results for a Poisson regression are given below at the end of my question.
You will see that it recovers the parameters beta = (2, -3, 1) for all three models with good accuracy. However, I have a couple of questions:
How do I explicitly add covariates to the GLM model with a regression coefficient fixed at 1?
From what I understand, a Poisson regression in general has the shape counts = exp(intercept + beta * x + log(exposure)), i.e. the exposure enters with a fixed coefficient of 1. I would like to reproduce this behaviour in my GLM model, i.e. I want something like counts = exp(intercept + beta * x + k * log(exposure)), where k is a fixed constant, expressed as a formula.
Simply using formula = "1 + x1 + x2 + x3 + np.log(exposure)" returns a perfect separation error (why?). I can bypass that by adding some random noise to y, but in that case np.log(exposure) gets a non-unity regression coefficient, i.e. it is treated as an ordinary regression covariate.
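For reference, the offset keyword of smf.glm is one way to impose a fixed unit coefficient on log(exposure); a minimal sketch, assuming the data frame constructed in the code below:
# sketch: keep log(exposure) out of the formula and pass it as an offset,
# so its coefficient is fixed at 1 (scale it by k for a fixed constant k)
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.glm(
    formula="y ~ x1 + x2 + x3",
    data=data,
    family=sm.families.Poisson(),
    offset=np.log(data["exposure"]),
).fit()
print(model.summary())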
Apparently both built-in models 1 and 2 have no intercept, even though I tried to explicitly add one in model 2. Or is there a hidden intercept that is simply not reported? In either case, how do I fix that?
Any help would be greatly appreciated, so thanks in advance!
import numpy as np
import pandas as pd
np.random.seed(1+8+2022)
# Number of random samples
n = 4000
# Intercept
alpha = 1
# Regression Coefficients
beta = np.asarray([2.0,-3.0,1.0])
# Random Data
data = {
    "x1": np.random.normal(1.00, 0.10, size=n),
    "x2": np.random.normal(1.50, 0.15, size=n),
    "x3": np.random.normal(-2.0, 0.20, size=n),
    "exposure": np.random.poisson(14, size=n),
}
# Calculate the response
x = np.asarray([data["x1"], data["x2"] , data["x3"]]).T
t = np.asarray(data["exposure"])
# Add response to random data
data["y"] = np.exp(alpha + np.dot(x,beta) + np.log(t))
# Convert dict to df
data = pd.DataFrame(data)
print(data)
#-----------------------------------------------------
# My Model
#-----------------------------------------------------
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "y ~ x1 + x2 + x3"
model = smf.glm(formula=formula, data=data, family=sm.families.Poisson()).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 1
#-----------------------------------------------------
import statsmodels.api as sm
data["offset"] = np.ones(n)
model = sm.Poisson(endog=data["y"],
                   exog=data[["x1", "x2", "x3"]],
                   exposure=data["exposure"],
                   offset=data["offset"]).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 2
#-----------------------------------------------------
import statsmodels.api as sm
data["x1"] = sm.add_constant(data["x1"])
model = sm.Poisson(endog=data["y"],
                   exog=data[["x1", "x2", "x3"]],
                   exposure=data["exposure"]).fit()
print(model.summary())
RESULTS:
x1 x2 x3 exposure y
0 1.151771 1.577677 -1.811903 13 0.508422
1 0.897012 1.678311 -2.327583 22 0.228219
2 1.040250 1.471962 -1.705458 13 0.621328
3 0.866195 1.512472 -1.766108 17 0.478107
4 0.925470 1.399320 -1.886349 13 0.512518
... ... ... ... ... ...
3995 1.073945 1.365260 -1.755071 12 0.804081
3996 0.855000 1.251951 -2.173843 11 0.439639
3997 0.892066 1.710856 -2.183085 10 0.107643
3998 0.763777 1.538938 -2.013619 22 0.363551
3999 1.056958 1.413922 -1.722252 19 1.098932
[4000 rows x 5 columns]
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: GLM Df Residuals: 3996
Model Family: Poisson Df Model: 3
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -2743.7
Date: Sat, 08 Jan 2022 Deviance: 141.11
Time: 09:32:32 Pearson chi2: 140.
No. Iterations: 4
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.6857 0.378 9.755 0.000 2.945 4.426
x1 2.0020 0.227 8.800 0.000 1.556 2.448
x2 -3.0393 0.148 -20.604 0.000 -3.328 -2.750
x3 0.9937 0.114 8.719 0.000 0.770 1.217
==============================================================================
Optimization terminated successfully.
Current function value: 0.668293
Iterations 10
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.09462
Time: 09:32:32 Log-Likelihood: -2673.2
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 4.619e-122
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.0000 0.184 10.875 0.000 1.640 2.360
x2 -3.0000 0.124 -24.160 0.000 -3.243 -2.757
x3 1.0000 0.094 10.667 0.000 0.816 1.184
==============================================================================
Optimization terminated successfully.
Current function value: 0.677893
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.08162
Time: 09:32:32 Log-Likelihood: -2711.6
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 2.196e-105
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.9516 0.304 9.711 0.000 2.356 3.547
x2 -2.9801 0.147 -20.275 0.000 -3.268 -2.692
x3 0.9807 0.113 8.655 0.000 0.759 1.203
==============================================================================
I am trying to fit a multinomial logistic regression and then predict the result for new samples.
### RZS_TC is my dataframe
RZS_TC.loc[RZS_TC['Mean_Treecover'] <= 50, 'Mean_Treecover' ] = 0
RZS_TC.loc[RZS_TC['Mean_Treecover'] > 50, 'Mean_Treecover' ] = 1
RZS_TC[['MAP']+['Sr']+['delTC']+['Mean_Treecover']].head()
[Output]:
MAP Sr delTC Mean_Treecover
302993741 2159.297363 452.975647 2.666672 1.0
217364332 3242.351807 65.615341 8.000000 1.0
390863334 1617.215454 493.124054 5.666666 0.0
446559668 1095.183105 498.373383 -8.000000 0.0
246078364 2804.615234 98.981110 -4.000000 1.0
1000000 rows × 7 columns
#Fitting a logistic regression
from statsmodels.formula.api import mnlogit
model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()
print(model.summary2())
[Output]:
Results: MNLogit
====================================================================
Model: MNLogit Pseudo R-squared: 0.364
Dependent Variable: Mean_Treecover AIC: 831092.4595
Date: 2021-04-02 13:51 BIC: 831139.7215
No. Observations: 1000000 Log-Likelihood: -4.1554e+05
Df Model: 3 LL-Null: -6.5347e+05
Df Residuals: 999996 LLR p-value: 0.0000
Converged: 1.0000 Scale: 1.0000
No. Iterations: 7.0000
--------------------------------------------------------------------
Mean_Treecover = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
--------------------------------------------------------------------
Intercept -5.2200 0.0119 -438.4468 0.0000 -5.2434 -5.1967
MAP 0.0023 0.0000 491.0859 0.0000 0.0023 0.0023
Sr 0.0016 0.0000 90.6805 0.0000 0.0015 0.0016
delTC -0.0093 0.0002 -39.9022 0.0000 -0.0098 -0.0089
However, whenever I try to predict using the model.predict() function, I get the following error.
prediction = model.predict(np.array(RZS_TC[['MAP']+['Sr']+['delTC']]))
[Output]: ERROR! Session/line number was not unique in database. History logging moved to new session 2627
Does anyone know how to troubleshoot this? Is there something that I might be doing wrong?
The model adds an intercept, so you need to include one when predicting. Using some example data:
from statsmodels.formula.api import mnlogit
import pandas as pd
import numpy as np
RZS_TC = pd.DataFrame(np.random.uniform(0, 1, (20, 4)),
                      columns=['MAP', 'Sr', 'delTC', 'Mean_Treecover'])
RZS_TC['Mean_Treecover'] = round(RZS_TC['Mean_Treecover'])
model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()
You can see the dimensions of your fitted data:
model.model.exog[:5,]
Out[16]:
array([[1. , 0.33914763, 0.79358056, 0.3103758 ],
[1. , 0.45915785, 0.94991271, 0.27203524],
[1. , 0.55527662, 0.15122108, 0.80675951],
[1. , 0.18493681, 0.89854583, 0.66760684],
[1. , 0.38300074, 0.6945397 , 0.28128137]])
Which is the same as if you add a constant:
import statsmodels.api as sm
sm.add_constant(RZS_TC[['MAP','Sr','delTC']])
const MAP Sr delTC
0 1.0 0.339148 0.793581 0.310376
1 1.0 0.459158 0.949913 0.272035
2 1.0 0.555277 0.151221 0.806760
3 1.0 0.184937 0.898546 0.667607
If you have a DataFrame with the same column names, it will just be:
prediction = model.predict(RZS_TC[['MAP','Sr','delTC']])
Or if you just need the fitted values, do:
model.fittedvalues
I have percentages and need to calculate a regression. According to basic statistics, logistic regression is better suited than OLS here, since percentages violate the requirement of a continuous and unconstrained value space.
So far, so good.
However, I get different results in R, Python, and Matlab. In fact, Matlab even reports significant values where Python does not.
My models look like:
R:
summary(glm(foo ~ 1 + bar + baz , family = "binomial", data = <<data>>))
Python via statsmodels:
smf.logit('foo ~ 1 + bar + baz', <<data>>).fit().summary()
Matlab:
fitglm(<<data>>,'foo ~ 1 + bar + baz','Link','logit')
where Matlab currently produces the best results.
Could there be different initialization values? Different solvers? Different settings for alphas when computing p-values?
How can I get the same results at least in similar numeric ranges or same features detected as significant? I do not require exact equal numeric output.
Edit: the summary statistics.
Python:
Dep. Variable: foo No. Observations: 104
Model: Logit Df Residuals: 98
Method: MLE Df Model: 5
Date: Wed, 28 Aug 2019 Pseudo R-squ.: inf
Time: 06:48:12 Log-Likelihood: -0.25057
converged: True LL-Null: 0.0000
LLR p-value: 1.000
coef std err z P>|z| [0.025 0.975]
Intercept -16.9863 154.602 -0.110 0.913 -320.001 286.028
bar -0.0278 0.945 -0.029 0.977 -1.880 1.824
baz 18.5550 280.722 0.066 0.947 -531.650 568.760
a 9.9996 153.668 0.065 0.948 -291.184 311.183
b 0.6757 132.542 0.005 0.996 -259.102 260.454
d 0.0005 0.039 0.011 0.991 -0.076 0.077
R:
glm(formula = myformula, family = "binomial", data = r_x)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.046466 -0.013282 -0.001017 0.006217 0.104467
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.699e+01 1.546e+02 -0.110 0.913
bar -2.777e-02 9.449e-01 -0.029 0.977
baz 1.855e+01 2.807e+02 0.066 0.947
a 1.000e+01 1.537e+02 0.065 0.948
b 6.757e-01 1.325e+02 0.005 0.996
d 4.507e-04 3.921e-02 0.011 0.991
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.049633 on 103 degrees of freedom
Residual deviance: 0.035684 on 98 degrees of freedom
AIC: 12.486
Matlab:
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ ________ __________
(Intercept) -21.044 3.315 -6.3483 6.8027e-09
bar -0.033507 0.022165 -1.5117 0.13383
d 0.0016149 0.00083173 1.9416 0.055053
baz 21.427 6.0132 3.5632 0.00056774
a 14.875 3.7828 3.9322 0.00015712
b -1.2126 2.7535 -0.44038 0.66063
104 observations, 98 error degrees of freedom
Estimated Dispersion: 1.25e-06
F-statistic vs. constant model: 7.4, p-value = 6.37e-06
You are not actually using the binomial distribution in the MATLAB case. You are specifying the link function, but the distribution stays at its default, the normal distribution, which will not give you the expected logistic fit, at least if the sample sizes behind the percentages are small. It also gives you lower p-values, because the normal distribution is less constrained in its variance than the binomial distribution is.
You need to set the Distribution argument to 'binomial':
fitglm(<<data>>, 'foo ~ 1 + bar + baz', 'Distribution', 'binomial', 'Link', 'logit')
The R and Python code seem to match rather well.
I've been trying to get the standard errors & p-values by using LinearRegression from scikit-learn, but with no success.
I ended up finding this article, but the standard errors & p-values do not match those from the statsmodels.api OLS method.
import numpy as np
from sklearn import datasets
from sklearn import linear_model
import regressor
import statsmodels.api as sm
boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False
X = boston.data[:,which_betas]
y = boston.target
#scikit + regressor stats
ols = linear_model.LinearRegression()
ols.fit(X,y)
xlabels = boston.feature_names[which_betas]
regressor.summary(ols, X, y, xlabels)
# statsmodel
x2 = sm.add_constant(X)
models = sm.OLS(y,x2)
result = models.fit()
print result.summary()
Output as follows:
Residuals:
Min 1Q Median 3Q Max
-26.3743 -1.9207 0.6648 2.8112 13.3794
Coefficients:
Estimate Std. Error t value p value
_intercept 36.925033 4.915647 7.5117 0.000000
CRIM -0.112227 0.031583 -3.5534 0.000416
ZN 0.047025 0.010705 4.3927 0.000014
INDUS 0.040644 0.055844 0.7278 0.467065
NOX -17.396989 3.591927 -4.8434 0.000002
RM 3.845179 0.272990 14.0854 0.000000
AGE 0.002847 0.009629 0.2957 0.767610
DIS -1.485557 0.180530 -8.2289 0.000000
RAD 0.327895 0.061569 5.3257 0.000000
TAX -0.013751 0.001055 -13.0395 0.000000
PTRATIO -0.991733 0.088994 -11.1438 0.000000
B 0.009827 0.001126 8.7256 0.000000
LSTAT -0.534914 0.042128 -12.6973 0.000000
---
R-squared: 0.73547, Adjusted R-squared: 0.72904
F-statistic: 114.23 on 12 features
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.735
Model: OLS Adj. R-squared: 0.729
Method: Least Squares F-statistic: 114.2
Date: Sun, 21 Aug 2016 Prob (F-statistic): 7.59e-134
Time: 21:56:26 Log-Likelihood: -1503.8
No. Observations: 506 AIC: 3034.
Df Residuals: 493 BIC: 3089.
Df Model: 12
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 36.9250 5.148 7.173 0.000 26.811 47.039
x1 -0.1122 0.033 -3.405 0.001 -0.177 -0.047
x2 0.0470 0.014 3.396 0.001 0.020 0.074
x3 0.0406 0.062 0.659 0.510 -0.081 0.162
x4 -17.3970 3.852 -4.516 0.000 -24.966 -9.828
x5 3.8452 0.421 9.123 0.000 3.017 4.673
x6 0.0028 0.013 0.214 0.831 -0.023 0.029
x7 -1.4856 0.201 -7.383 0.000 -1.881 -1.090
x8 0.3279 0.067 4.928 0.000 0.197 0.459
x9 -0.0138 0.004 -3.651 0.000 -0.021 -0.006
x10 -0.9917 0.131 -7.547 0.000 -1.250 -0.734
x11 0.0098 0.003 3.635 0.000 0.005 0.015
x12 -0.5349 0.051 -10.479 0.000 -0.635 -0.435
==============================================================================
Omnibus: 190.837 Durbin-Watson: 1.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 897.143
Skew: 1.619 Prob(JB): 1.54e-195
Kurtosis: 8.663 Cond. No. 1.51e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
I've also found the following articles:
Find p-value (significance) in scikit-learn LinearRegression
http://connor-johnson.com/2014/02/18/linear-regression-with-python/
Neither of the code snippets in the SO link compiles.
Here is the code & data I'm working on, but I'm not able to get the standard errors & p-values:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import scipy
from sklearn.linear_model import LinearRegression
from sklearn import metrics
def readFile(filename, sheetname):
    xlsx = pd.ExcelFile(filename)
    data = xlsx.parse(sheetname, skiprows=1)
    return data

def lr_statsmodel(X, y):
    X = sm.add_constant(X)
    model = sm.OLS(y, X)
    results = model.fit()
    print (results.summary())

def lr_scikit(X, y, featureCols):
    model = LinearRegression()
    results = model.fit(X, y)
    predictions = results.predict(X)
    print 'Coefficients'
    print 'Intercept\t', results.intercept_
    df = pd.DataFrame(zip(featureCols, results.coef_))
    print df.to_string(index=False, header=False)
    # Query:: The numbers matches with Excel OLS but skeptical about relating score as rsquared
    rSquare = results.score(X, y)
    print '\nR-Square::', rSquare
    # This looks like a better option
    # source: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
    r2 = metrics.r2_score(y, results.predict(X))
    print 'r2', r2
    # Query: No clue at all! http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
    print 'Rsquared?!', metrics.explained_variance_score(y, results.predict(X))
    # INFO:: All three of them are providing the same figures!
    # Adj-Rsquare formula # https://www.easycalculation.com/statistics/learn-adjustedr2.php
    # In ML, we don't use all of the data for training, and hence its highly unusual to find AdjRsquared. Thus the need for manual calculation
    N = X.shape[0]
    p = X.shape[1]
    adjRsquare = 1 - ((1 - rSquare) * (N - 1) / (N - p - 1))
    print "Adjusted R-Square::", adjRsquare
    # calculate standard errors
    # mean_absolute_error
    # mean_squared_error
    # median_absolute_error
    # r2_score
    # explained_variance_score
    mse = metrics.mean_squared_error(y, results.predict(X))
    print mse
    print 'Residual Standard Error:', np.sqrt(mse)
    # OLS in matrix form: https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    n = X.shape[0]
    X1 = np.hstack((np.ones((n, 1)), np.matrix(X)))
    se_matrix = scipy.linalg.sqrtm(
        metrics.mean_squared_error(y, results.predict(X)) *
        np.linalg.inv(X1.T * X1)
    )
    print 'se', np.diagonal(se_matrix)
    # https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    # http://regressors.readthedocs.io/en/latest/usage.html
    y_hat = results.predict(X)
    sse = np.sum((y_hat - y) ** 2)
    print 'Standard Square Error of the Model:', sse

if __name__ == '__main__':
    # read file
    fileData = readFile('Linear_regression.xlsx', 'Input Data')
    # list of independent variables
    feature_cols = ['Price per week', 'Population of city', 'Monthly income of riders', 'Average parking rates per month']
    # build dependent & independent data set
    X = fileData[feature_cols]
    y = fileData['Number of weekly riders']
    # Statsmodel - OLS
    # lr_statsmodel(X,y)
    # ScikitLearn - OLS
    lr_scikit(X, y, feature_cols)
My data-set
Y X1 X2 X3 X4
City Number of weekly riders Price per week Population of city Monthly income of riders Average parking rates per month
1 1,92,000 $15 18,00,000 $5,800 $50
2 1,90,400 $15 17,90,000 $6,200 $50
3 1,91,200 $15 17,80,000 $6,400 $60
4 1,77,600 $25 17,78,000 $6,500 $60
5 1,76,800 $25 17,50,000 $6,550 $60
6 1,78,400 $25 17,40,000 $6,580 $70
7 1,80,800 $25 17,25,000 $8,200 $75
8 1,75,200 $30 17,25,000 $8,600 $75
9 1,74,400 $30 17,20,000 $8,800 $75
10 1,73,920 $30 17,05,000 $9,200 $80
11 1,72,800 $30 17,10,000 $9,630 $80
12 1,63,200 $40 17,00,000 $10,570 $80
13 1,61,600 $40 16,95,000 $11,330 $85
14 1,61,600 $40 16,95,000 $11,600 $100
15 1,60,800 $40 16,90,000 $11,800 $105
16 1,59,200 $40 16,30,000 $11,830 $105
17 1,48,800 $65 16,40,000 $12,650 $105
18 1,15,696 $102 16,35,000 $13,000 $110
19 1,47,200 $75 16,30,000 $13,224 $125
20 1,50,400 $75 16,20,000 $13,766 $130
21 1,52,000 $75 16,15,000 $14,010 $150
22 1,36,000 $80 16,05,000 $14,468 $155
23 1,26,240 $86 15,90,000 $15,000 $165
24 1,23,888 $98 15,95,000 $15,200 $175
25 1,26,080 $87 15,90,000 $15,600 $175
26 1,51,680 $77 16,00,000 $16,000 $190
27 1,52,800 $63 16,10,000 $16,200 $200
I've exhausted all my options and everything I could make sense of, so any guidance on how to compute standard errors & p-values that match statsmodels.api would be appreciated.
EDIT: I'm trying to find the standard errors & p-values for the intercept and all the independent variables.
Here, reg is the output of sklearn's linear regression fit method.
To calculate adjusted R²:
def adjustedR2(x, y, reg):
    r2 = reg.score(x, y)
    n = x.shape[0]
    p = x.shape[1]
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return adjusted_r2
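A hypothetical usage, assuming x and y are the feature matrix and target from the question:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(x, y)
print(adjustedR2(x, y, reg))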
And for p-values:
from sklearn.feature_selection import f_regression
freg=f_regression(x,y)
p=freg[1]
print(p.round(3))
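If the goal is to reproduce the statsmodels OLS standard errors and p-values exactly, they can also be computed by hand from the design matrix; a minimal sketch, assuming x (features without a constant column), y and a fitted scikit-learn LinearRegression reg as above:
import numpy as np
from scipy import stats

# hypothetical helper: classical (nonrobust) OLS standard errors and
# two-sided t-test p-values for [intercept, coefficients]
def ols_inference(x, y, reg):
    n, p = x.shape
    X1 = np.column_stack([np.ones(n), x])     # design matrix with intercept
    params = np.r_[reg.intercept_, reg.coef_]
    resid = np.asarray(y) - reg.predict(x)
    sigma2 = resid @ resid / (n - p - 1)      # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X1.T @ X1)   # parameter covariance matrix
    se = np.sqrt(np.diag(cov))
    tvals = params / se
    pvals = 2 * (1 - stats.t.cdf(np.abs(tvals), df=n - p - 1))
    return se, pvals

se, pvals = ols_inference(x, y, reg)
print(np.round(se, 4))
print(np.round(pvals, 4))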