Python statsmodels.tsa.MarkovAutoregression using current real GNP/GDP data

I have been using statsmodels.tsa.MarkovAutoregression to replicate Hamilton's Markov switching model published in 1989. Using the Hamilton data (real GNP in 1982 dollars), I get the same results as the code example and the paper. However, when I use currently available real GNP or GDP data (in 2009 dollars) and take its quarterly log difference as input, the model doesn't give satisfactory results.
I plotted the log difference of the Hamilton GNP series and that of the currently available real GNP. They are quite close, with only slight differences.
Can anyone enlighten me as to why this is the case? Does it have anything to do with the seasonal adjustment of the current GNP data? If so, is there any way to counter it?
Result using currently available GNP
Result using the GNP data provided in the paper

You write:
the model doesn't give satisfactory results.
But what you mean is that the model isn't giving you the results you expect / want; i.e., you want the model to pick out periods the NBER has labeled as "recessions", but the Markov switching model is simply finding the parameters that maximize the likelihood function for the data.
(The rest of the post shows results that are taken from this Jupyter notebook: http://nbviewer.jupyter.org/gist/ChadFulton/a5d24d32ba3b7b2e381e43a232342f1f)
(I'll also note that I double-checked these results using E-views, and it agrees with Statsmodels' output almost exactly).
The raw dataset is the growth rate (log difference * 100) of real GNP; the Hamilton dataset versus one found in the Federal Reserve Economic Data (FRED) database are shown here, with grey bars indicating NBER-dated recessions:
In this case, the model is an AR(4) on the growth rate of real GNP, with a regime-specific intercept; the model allows two regimes. The idea is that "recessions" should correspond to a low (or negative) average growth rate and expansions should correspond to a higher average growth rate.
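(For reference, a minimal sketch of this kind of specification in statsmodels, assuming gnp is a date-indexed pandas Series of quarterly real GNP levels; the variable names are illustrative:)
import numpy as np
import statsmodels.api as sm

# growth rate = log difference * 100, computed from the quarterly real GNP level series
growth = np.log(gnp).diff().dropna() * 100

# Two regimes, AR(4), switching intercept; the AR coefficients and variance are
# common across regimes, as in Hamilton (1989).
mod_hamilton = sm.tsa.MarkovAutoregression(growth, k_regimes=2, order=4, switching_ar=False)
res_hamilton = mod_hamilton.fit()
print(res_hamilton.summary())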
Model 1: Hamilton's dataset: Maximum likelihood estimation of parameters
From the model applied to Hamilton's (1989) dataset, we get the following estimated parameters:
Markov Switching Model Results
================================================================================
Dep. Variable: Hamilton No. Observations: 131
Model: MarkovAutoregression Log Likelihood -181.263
Date: Sun, 02 Apr 2017 AIC 380.527
Time: 19:52:31 BIC 406.404
Sample: 04-01-1951 HQIC 391.042
- 10-01-1984
Covariance Type: approx
Regime 0 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.3588 0.265 -1.356 0.175 -0.877 0.160
Regime 1 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 1.1635 0.075 15.614 0.000 1.017 1.310
Non-switching parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
sigma2 0.5914 0.103 5.761 0.000 0.390 0.793
ar.L1 0.0135 0.120 0.112 0.911 -0.222 0.249
ar.L2 -0.0575 0.138 -0.418 0.676 -0.327 0.212
ar.L3 -0.2470 0.107 -2.310 0.021 -0.457 -0.037
ar.L4 -0.2129 0.111 -1.926 0.054 -0.430 0.004
Regime transition parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
p[0->0] 0.7547 0.097 7.819 0.000 0.565 0.944
p[1->0] 0.0959 0.038 2.542 0.011 0.022 0.170
==============================================================================
and the time series of the probability of operating in regime 0 (which here corresponds to a negative growth rate, i.e. a recession) looks like:
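(That probability series can be pulled directly from the fitted results; a sketch, assuming res_hamilton from the sketch above:)
import matplotlib.pyplot as plt

# Smoothed probability of being in regime 0 (the low/negative growth regime) at each date
recession_prob = res_hamilton.smoothed_marginal_probabilities[0]
recession_prob.plot(title='Smoothed probability of the low-growth regime')
plt.show()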
Model 2: Updated dataset: Maximum likelihood estimation of parameters
Now, as you saw, we can instead fit the model using the "updated" dataset (which looks pretty much like the original dataset), to get the following parameters and regime probabilities:
Markov Switching Model Results
================================================================================
Dep. Variable: GNPC96 No. Observations: 131
Model: MarkovAutoregression Log Likelihood -188.002
Date: Sun, 02 Apr 2017 AIC 394.005
Time: 20:00:58 BIC 419.882
Sample: 04-01-1951 HQIC 404.520
- 10-01-1984
Covariance Type: approx
Regime 0 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -1.2475 3.470 -0.359 0.719 -8.049 5.554
Regime 1 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 0.9364 0.453 2.066 0.039 0.048 1.825
Non-switching parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
sigma2 0.8509 0.561 1.516 0.130 -0.249 1.951
ar.L1 0.3437 0.189 1.821 0.069 -0.026 0.714
ar.L2 0.0919 0.143 0.645 0.519 -0.187 0.371
ar.L3 -0.0846 0.251 -0.337 0.736 -0.577 0.408
ar.L4 -0.1727 0.258 -0.669 0.503 -0.678 0.333
Regime transition parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
p[0->0] 0.0002 1.705 0.000 1.000 -3.341 3.341
p[1->0] 0.0397 0.186 0.213 0.831 -0.326 0.405
==============================================================================
To understand what the model is doing, look at the intercepts in the two regimes. In Hamilton's model, the "low" regime has an intercept of -0.35, whereas with the updated data, the "low" regime has an intercept of -1.25.
What that tells us is that with the updated dataset, the model is doing a "better job" fitting the data (in terms of a higher likelihood) by choosing the "low" regime to be much deeper recessions. In particular, looking back at the GNP data series, it's apparent that it's using the "low" regime to fit the very low growth in the late 1950's and early 1980's.
In contrast, the fitted parameters from Hamilton's model allow the "low" regime to fit "moderately low" growth rates that cover a wider range of recessions.
We can't compare these two models' outcomes using e.g. the log-likelihood values, because they're using different datasets. One thing we could try, though, is to use the fitted parameters from Hamilton's dataset on the updated GNP data (a sketch of one way to do this follows). Doing that, we get the following result:
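(A sketch of one way to do this, assuming growth_updated is the FRED-based growth-rate series over the same sample and res_hamilton holds the Model 1 estimates: build the same model on the new data and evaluate it at the fixed parameter vector with smooth, which does not re-estimate anything.)
mod_updated = sm.tsa.MarkovAutoregression(growth_updated, k_regimes=2,
                                          order=4, switching_ar=False)
res_fixed = mod_updated.smooth(res_hamilton.params)   # reuse the Hamilton-dataset parameters
print(res_fixed.summary())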
Model 3: Updated dataset using parameters estimated on Hamilton's dataset
Markov Switching Model Results
================================================================================
Dep. Variable: GNPC96 No. Observations: 131
Model: MarkovAutoregression Log Likelihood -191.807
Date: Sun, 02 Apr 2017 AIC 401.614
Time: 19:52:52 BIC 427.491
Sample: 04-01-1951 HQIC 412.129
- 10-01-1984
Covariance Type: opg
Regime 0 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.3588 0.185 -1.939 0.053 -0.722 0.004
Regime 1 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 1.1635 0.083 13.967 0.000 1.000 1.327
Non-switching parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
sigma2 0.5914 0.090 6.604 0.000 0.416 0.767
ar.L1 0.0135 0.100 0.134 0.893 -0.183 0.210
ar.L2 -0.0575 0.088 -0.651 0.515 -0.231 0.116
ar.L3 -0.2470 0.104 -2.384 0.017 -0.450 -0.044
ar.L4 -0.2129 0.084 -2.524 0.012 -0.378 -0.048
Regime transition parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
p[0->0] 0.7547 0.100 7.563 0.000 0.559 0.950
p[1->0] 0.0959 0.051 1.872 0.061 -0.005 0.196
==============================================================================
This looks more like what you expected / wanted, and that's because, as I mentioned above, the "low" regime intercept of -0.35 makes the "low" regime a good fit for more time periods in the sample. But notice that the log-likelihood here is -191.8, whereas in Model 2 the log-likelihood was -188.0.
Thus even though this model looks more like what you wanted, it does not fit the data as well.
(Note again that you can't compare these log-likelihoods to the -181.3 from Model 1, because that is using a different dataset).

Related

Singular Matrix error for the statistical analysis of logistic regression

I am working on the breast cancer dataset, which I downloaded from here. I encoded all categorical variables using a label encoder. Then I went through Logistic Regression Part III StatsModel and tried to see my model's performance with logistic regression. Although I label-encoded the dataset, there were still a few columns, like Age, Tumor Size, Regional Node Examined, Regional Node Positive and Survival Months, that are numerical and do not need any encoding, so I left them as they are. Therefore, I need to normalize the dataset, otherwise those large numerical values bias the model.
So I tried to modify the video tutorial described above in my own way:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(r"D:\Breast_Cancer_labelencoder.csv")
y = df['Status']
X = df.drop(['Status'], axis=1)
X = sm.add_constant(X)
X_train1, X_test1, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
# Normalize using StandardScaler
scaler = StandardScaler()
X_train_scale3 = scaler.fit_transform(X_train1)
X_test_scale3 = scaler.transform(X_test1)
Logit_fun = sm.Logit(y_train, X_train_scale3)
result_fun = Logit_fun.fit()
print(result_fun.summary())
I got a "Singular matrix" error.
If I delete the normalization portion:
Logit_fun = sm.Logit(y_train, X_train1)
result_fun = Logit_fun.fit()
print(result_fun.summary())
I got the result
Optimization terminated successfully.
Current function value: 0.281394
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: Status No. Observations: 2816
Model: Logit Df Residuals: 2800
Method: MLE Df Model: 15
Date: Sun, 30 Oct 2022 Pseudo R-squ.: 0.3425
Time: 23:48:34 Log-Likelihood: -792.40
converged: True LL-Null: -1205.2
Covariance Type: nonrobust LLR p-value: 3.005e-166
==========================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------
const 0.0322 0.713 0.045 0.964 -1.366 1.430
Age 0.0273 0.008 3.585 0.000 0.012 0.042
Race -0.1095 0.110 -0.997 0.319 -0.325 0.106
Marital Status 0.0550 0.060 0.918 0.359 -0.062 0.172
T Stage 0.4264 0.170 2.505 0.012 0.093 0.760
N Stage 0.7746 0.265 2.921 0.003 0.255 1.294
6th Stage -0.1708 0.169 -1.009 0.313 -0.503 0.161
differentiate -0.0092 0.075 -0.124 0.902 -0.156 0.137
Grade 0.3412 0.109 3.125 0.002 0.127 0.555
A Stage 0.3260 0.375 0.868 0.385 -0.410 1.062
Tumor Size 0.0007 0.005 0.146 0.884 -0.009 0.010
Estrogen Status -0.6247 0.264 -2.368 0.018 -1.142 -0.108
Progesterone Status -0.5077 0.183 -2.771 0.006 -0.867 -0.149
Regional Node Examined -0.0260 0.009 -2.811 0.005 -0.044 -0.008
Reginol Node Positive 0.0565 0.021 2.693 0.007 0.015 0.098
Survival Months -0.0595 0.003 -18.675 0.000 -0.066 -0.053
==========================================================================================
My 1st question: Why did I get the singular matrix error, when normalization is supposed to be an essential step?
2nd question: What purpose does the line X = sm.add_constant(X) serve?
3rd question: If I just delete the columns whose coef values are negative, will that help improve the result?
I have already watched the P-Value Method For Hypothesis Testing and Logistic Regression Details Pt 3: R-squared and p-value videos but am still unable to analyse the result.
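A minimal sketch (toy data, not the breast cancer file; illustration only) of what sm.add_constant does and why standardizing its output can make the design matrix singular:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0]})
X = sm.add_constant(X)                    # prepends a 'const' column of ones (the intercept term)
print(X['const'].values)                  # [1. 1. 1. 1.]

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled[:, 0])                     # [0. 0. 0. 0.] -- the constant column is zeroed out
print(np.linalg.matrix_rank(X_scaled))    # 1 instead of 2: a rank-deficient (singular) design matrix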

Is the categorical variable relevant if one of the dummy variables has a t score of 0.95?

If a variable has more than a 0.05 t score, it is deemed not relevant and should be excluded from the model. However, what if the categorical variable has 4 dummy variables and only one of them exceeds 0.05? Do I exclude the entire categorical variable?
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.803
Model: OLS Adj. R-squared: 0.801
Method: Least Squares F-statistic: 368.4
Date: Mon, 15 Jul 2019 Prob (F-statistic): 0.00
Time: 12:00:26 Log-Likelihood: -17357.
No. Observations: 1460 AIC: 3.475e+04
Df Residuals: 1443 BIC: 3.484e+04
Df Model: 16
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
const -1.366e+05 9432.229 -14.482 0.000 -1.55e+05 -1.18e+05
OverallQual 1.327e+04 1249.192 10.622 0.000 1.08e+04 1.57e+04
ExterQual 1.168e+04 2763.188 4.228 0.000 6262.969 1.71e+04
TotalBsmtSF 13.7198 5.182 2.648 0.008 3.554 23.885
GrLivArea 45.4098 2.521 18.012 0.000 40.465 50.355
1stFlrSF 9.4573 5.543 1.706 0.088 -1.416 20.330
GarageArea 22.4791 9.748 2.306 0.021 3.358 41.600
KitchenQual 1.309e+04 2142.662 6.111 0.000 8891.243 1.73e+04
GarageCars 8875.8202 2961.291 2.997 0.003 3066.923 1.47e+04
BsmtQual 1.097e+04 2094.395 5.235 0.000 6856.671 1.51e+04
GarageFinish_No 2689.1356 5847.186 0.460 0.646 -8780.759 1.42e+04
GarageFinish_RFn -8223.4503 2639.360 -3.116 0.002 -1.34e+04 -3046.057
GarageFinish_Unf -8416.9443 2928.002 -2.875 0.004 -1.42e+04 -2673.349
BsmtExposure_Gd 2.298e+04 3970.691 5.788 0.000 1.52e+04 3.08e+04
BsmtExposure_Mn -262.8498 4160.294 -0.063 0.950 -8423.721 7898.021
BsmtExposure_No -7690.0994 2800.731 -2.746 0.006 -1.32e+04 -2196.159
BsmtExposure_No Basement 2.598e+04 9879.662 2.630 0.009 6598.642 4.54e+04
==============================================================================
Omnibus: 614.604 Durbin-Watson: 1.972
Prob(Omnibus): 0.000 Jarque-Bera (JB): 76480.899
Skew: -0.928 Prob(JB): 0.00
Kurtosis: 38.409 Cond. No. 2.85e+04
==============================================================================
when you say "0.05 t score" I assume you mean "0.05 p value". the t-value is just coef / stderr, which goes into the p-value calculation (abs(t_value) > 2 is approximately p-value < 0.05)
when you say "categorical variable has 4 dummy variable", I presume you mean it has 4 "levels" / distinct values and you're referring to BsmtExposure_Mn. I'd leave that in as the other categories/levels are helping the model. if you had several categories that were less predictive you could think about combining them into one "other" category
as a general point, you shouldn't just automatically exclude variables because their p-value is > 0.05 (or whatever your cutoff/"alpha value" is). they can be useful for understanding what's going on within the model, and explaining results to other people
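A small sketch of that t-value / p-value relationship, assuming res is a fitted statsmodels OLS results object like the one summarized above:
import scipy.stats as stats

# t = coef / std err; the two-sided p-value comes from the t distribution.
# res.tvalues and res.pvalues give the same numbers directly.
t_values = res.params / res.bse
p_values = 2 * stats.t.sf(abs(t_values), res.df_resid)
print(p_values)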

Multiple linear regression get best fit

I am doing some multiple linear regression with the following code:
import pandas as pd
import statsmodels.formula.api as sm

df = pd.DataFrame({"A": Output['10'],
                   "B": Input['Var1'],
                   "G": Input['Var2'],
                   "I": Input['Var3'],
                   "J": Input['Var4']})
res = sm.ols(formula="A ~ B + G + I + J", data=df).fit()
print(res.summary())
With the following result:
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.562
Model: OLS Adj. R-squared: 0.562
Method: Least Squares F-statistic: 2235.
Date: Tue, 06 Nov 2018 Prob (F-statistic): 0.00
Time: 09:48:20 Log-Likelihood: -21233.
No. Observations: 6961 AIC: 4.248e+04
Df Residuals: 6956 BIC: 4.251e+04
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 21.8504 0.448 48.760 0.000 20.972 22.729
B 1.8353 0.022 84.172 0.000 1.793 1.878
G 0.0032 0.004 0.742 0.458 -0.005 0.012
I -0.0210 0.009 -2.224 0.026 -0.039 -0.002
J 0.6677 0.061 10.868 0.000 0.547 0.788
==============================================================================
Omnibus: 2152.474 Durbin-Watson: 0.308
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5077.082
Skew: -1.773 Prob(JB): 0.00
Kurtosis: 5.221 Cond. No. 555.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
However, my Output dataframe consists of multiple columns, from 1 to 149. Is there a way to loop over all 149 columns in the Output dataframe and, at the end, show the best and worst fits by, for example, R-squared? Or get the largest coef for variable B?
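One way to do this (a sketch, reusing the Output and Input frames from the snippet above and assuming Output's columns are named '1' to '149'):
import pandas as pd
import statsmodels.formula.api as smf

fits = []
for col in Output.columns:
    data = pd.DataFrame({"A": Output[col],
                         "B": Input['Var1'],
                         "G": Input['Var2'],
                         "I": Input['Var3'],
                         "J": Input['Var4']})
    res = smf.ols(formula="A ~ B + G + I + J", data=data).fit()
    fits.append({"column": col, "rsquared": res.rsquared, "coef_B": res.params["B"]})

summary = pd.DataFrame(fits).sort_values("rsquared", ascending=False)
print(summary.head())                            # best fits by R-squared
print(summary.tail())                            # worst fits
print(summary.loc[summary["coef_B"].idxmax()])   # largest coefficient on B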

Robustness issue of statsmodel Linear regression (ols) - Python

I was testing some basic category regression using Stats model:
I build up a deterministic model
Y = X + Z
where X can take 3 values (a, b or c) and Z only 2 (d or e).
At that stage the model is purely deterministic; I set up the weights for each variable as follows:
a's weight=1
b's weight=2
c's weight=3
d's weight=1
e's weight=2
Therefore, with 1(X=a) being 1 if X=a and 0 otherwise, the model is simply:
Y = 1(X=a) + 2*1(X=b) + 3*1(X=c) + 1(Z=d) + 2*1(Z=e)
Using the following code to generate the different variables and run the regression:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

nbData = 1000
rand1 = np.random.uniform(size=nbData)
rand2 = np.random.uniform(size=nbData)
a = 1 * (rand1 <= (1.0/3.0))
b = 1 * (((1.0/3.0) < rand1) & (rand1 < (4/5.0)))
c = 1 - b - a
d = 1 * (rand2 <= (3.0/5.0))
e = 1 - d
weights = [1, 2, 3, 1, 2]
y = a + 2*b + 3*c + 4*d + 5*e
df = pd.DataFrame({'y': y, 'a': a, 'b': b, 'c': c, 'd': d, 'e': e})
mod = ols(formula='y ~ a + b + c + d + e - 1', data=df)
res = mod.fit()
print(res.summary())
I end up with the right results (one has to look at the differences between the coefficients rather than the coefficients themselves):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.006e+30
Date: Wed, 16 Sep 2015 Prob (F-statistic): 0.00
Time: 03:05:40 Log-Likelihood: 3156.8
No. Observations: 100 AIC: -6306.
Df Residuals: 96 BIC: -6295.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
a 1.6000 7.47e-16 2.14e+15 0.000 1.600 1.600
b 2.6000 6.11e-16 4.25e+15 0.000 2.600 2.600
c 3.6000 9.61e-16 3.74e+15 0.000 3.600 3.600
d 3.4000 5.21e-16 6.52e+15 0.000 3.400 3.400
e 4.4000 6.85e-16 6.42e+15 0.000 4.400 4.400
==============================================================================
Omnibus: 11.299 Durbin-Watson: 0.833
Prob(Omnibus): 0.004 Jarque-Bera (JB): 5.720
Skew: -0.381 Prob(JB): 0.0573
Kurtosis: 2.110 Cond. No. 2.46e+15
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.67e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
But when I increase the number of data points to (say) 600, the regression produces really bad results. I have tried similar regressions in Excel and in R and they produce consistent results whatever the number of data points. Does anyone know if there is some restriction on statsmodels ols explaining such behaviour, or am I missing something?
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.167
Model: OLS Adj. R-squared: 0.161
Method: Least Squares F-statistic: 29.83
Date: Wed, 16 Sep 2015 Prob (F-statistic): 1.23e-22
Time: 03:08:04 Log-Likelihood: -701.02
No. Observations: 600 AIC: 1412.
Df Residuals: 595 BIC: 1434.
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
a 5.8070 1.15e+13 5.05e-13 1.000 -2.26e+13 2.26e+13
b 6.4951 1.15e+13 5.65e-13 1.000 -2.26e+13 2.26e+13
c 6.9033 1.15e+13 6.01e-13 1.000 -2.26e+13 2.26e+13
d -1.1927 1.15e+13 -1.04e-13 1.000 -2.26e+13 2.26e+13
e -0.1685 1.15e+13 -1.47e-14 1.000 -2.26e+13 2.26e+13
==============================================================================
Omnibus: 67.153 Durbin-Watson: 0.328
Prob(Omnibus): 0.000 Jarque-Bera (JB): 70.964
Skew: 0.791 Prob(JB): 3.89e-16
Kurtosis: 2.419 Cond. No. 7.70e+14
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.25e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
It appears that, as mentioned by Mr. F, the main problem is that the statsmodels OLS does not seem to handle the collinearity problem as well as Excel/R in this case. But if, instead of defining one variable for each of a, b, c, d and e, one defines a variable X (which can be equal to a, b or c) and a variable Z (equal to d or e), then the regression works fine, i.e. updating the code with:
df['X'] = ['c'] * len(df)
df.loc[df.b != 0, 'X'] = 'b'
df.loc[df.a != 0, 'X'] = 'a'
df['Z'] = ['e'] * len(df)
df.loc[df.d != 0, 'Z'] = 'd'
mod = ols(formula='y ~ X + Z - 1', data=df)
leads to the expected results
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.684e+27
Date: Thu, 17 Sep 2015 Prob (F-statistic): 0.00
Time: 06:22:43 Log-Likelihood: 2.5096e+06
No. Observations: 100000 AIC: -5.019e+06
Df Residuals: 99996 BIC: -5.019e+06
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
X[a] 5.0000 1.85e-14 2.7e+14 0.000 5.000 5.000
X[b] 6.0000 1.62e-14 3.71e+14 0.000 6.000 6.000
X[c] 7.0000 2.31e-14 3.04e+14 0.000 7.000 7.000
Z[T.e] 1.0000 1.97e-14 5.08e+13 0.000 1.000 1.000
==============================================================================
Omnibus: 145.367 Durbin-Watson: 1.353
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9729.487
Skew: -0.094 Prob(JB): 0.00
Kurtosis: 1.483 Cond. No. 2.29
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
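The collinearity behind the first, unstable fit can also be checked directly: with all five dummies and no intercept, a + b + c and d + e are both columns of ones, so the design matrix is rank-deficient. A quick sketch, reusing the arrays defined above:
import numpy as np

X = np.column_stack([a, b, c, d, e])
print(np.linalg.matrix_rank(X))   # 4, not 5: the five dummy columns are linearly dependent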

How to get R-squared for robust regression (RLM) in Statsmodels?

When it comes to measuring goodness of fit, R-squared seems to be a commonly understood (and accepted) measure for "simple" linear models.
But in the case of statsmodels (as well as other statistical software), RLM does not include R-squared together with its regression results.
Is there a way to get it calculated "manually", perhaps in a way similar to how it is done in Stata?
Or is there another measure that can be used / calculated from the results produced by sm.RLM?
This is what Statsmodels is producing:
import numpy as np
import statsmodels.api as sm
# Sample Data with outliers
nsample = 50
x = np.linspace(0, 20, nsample)
x = sm.add_constant(x)
sig = 0.3
beta = [5, 0.5]
y_true = np.dot(x, beta)
y = y_true + sig * 1. * np.random.normal(size=nsample)
y[[39,41,43,45,48]] -= 5 # add some outliers (10% of nsample)
# Regression with Robust Linear Model
res = sm.RLM(y, x).fit()
print(res.summary())
Which outputs:
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: RLM Df Residuals: 48
Method: IRLS Df Model: 1
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Mo, 27 Jul 2015
Time: 10:00:00
No. Iterations: 17
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 5.0254 0.091 55.017 0.000 4.846 5.204
x1 0.4845 0.008 61.555 0.000 0.469 0.500
==============================================================================
Since an OLS returns the R2, I would suggest regressing the actual values against the fitted values using simple linear regression. Regardless of where the fitted values come from, such an approach would give you an indication of the corresponding R2.
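For example (a sketch, assuming res and y are the fitted RLM results and the response from the code above):
import statsmodels.api as sm

# Regress observed y on the RLM fitted values and read off the R2 of that auxiliary OLS.
aux = sm.OLS(y, sm.add_constant(res.fittedvalues)).fit()
print(aux.rsquared)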
R2 is not a good measure of goodness of fit for RLM models. The problem is that the outliers have a huge effect on the R2 value, to the point where it is completely determined by outliers. Using weighted regression afterwards is an attractive alternative, but it is better to look at the p-values, standard errors and confidence intervals of the estimated coefficients.
Comparing the OLS summary to RLM (results are slightly different to yours due to different randomization):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.726
Model: OLS Adj. R-squared: 0.721
Method: Least Squares F-statistic: 127.4
Date: Wed, 03 Nov 2021 Prob (F-statistic): 4.15e-15
Time: 09:33:40 Log-Likelihood: -87.455
No. Observations: 50 AIC: 178.9
Df Residuals: 48 BIC: 182.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.7071 0.396 14.425 0.000 4.912 6.503
x1 0.3848 0.034 11.288 0.000 0.316 0.453
==============================================================================
Omnibus: 23.499 Durbin-Watson: 2.752
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33.906
Skew: -1.649 Prob(JB): 4.34e-08
Kurtosis: 5.324 Cond. No. 23.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: RLM Df Residuals: 48
Method: IRLS Df Model: 1
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Wed, 03 Nov 2021
Time: 09:34:24
No. Iterations: 17
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 5.1857 0.111 46.590 0.000 4.968 5.404
x1 0.4790 0.010 49.947 0.000 0.460 0.498
==============================================================================
You can see that the standard errors and size of the confidence interval decreases in going from OLS to RLM for both the intercept and the slope term. This suggests that the estimates are closer to the real values.
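Those quantities can also be compared programmatically (a sketch, assuming res_ols and res_rlm hold the two fitted results summarized above):
import numpy as np

# Compare parameter standard errors and confidence-interval widths between the two fits.
ci_ols = np.asarray(res_ols.conf_int())
ci_rlm = np.asarray(res_rlm.conf_int())
print('OLS std err:', res_ols.bse, 'CI widths:', ci_ols[:, 1] - ci_ols[:, 0])
print('RLM std err:', res_rlm.bse, 'CI widths:', ci_rlm[:, 1] - ci_rlm[:, 0])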
Why not use model.predict to obtain the r2? For Example:
r2=1. - np.sum(np.abs(model.predict(X) - y) **2) / np.sum(np.abs(y - np.mean(y)) ** 2)
