How to get R-squared for robust regression (RLM) in Statsmodels?

How to get R-squared for robust regression (RLM) in Statsmodels? - python

When it comes to measuring goodness of fit - R-Squared seems to be a commonly understood (and accepted) measure for "simple" linear models.
But in case of statsmodels (as well as other statistical software) RLM does not include R-squared together with regression results.
Is there a way to get it calculated "manually", perhaps in a way similar to how it is done in Stata?
Or is there another measure that can be used / calculated from the results produced by sm.RLS?
This is what Statsmodels is producing:
import numpy as np
import statsmodels.api as sm
# Sample Data with outliers
nsample = 50
x = np.linspace(0, 20, nsample)
x = sm.add_constant(x)
sig = 0.3
beta = [5, 0.5]
y_true = np.dot(x, beta)
y = y_true + sig * 1. * np.random.normal(size=nsample)
y[[39,41,43,45,48]] -= 5 # add some outliers (10% of nsample)
# Regression with Robust Linear Model
res = sm.RLM(y, x).fit()
print(res.summary())
Which outputs:
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: RLM Df Residuals: 48
Method: IRLS Df Model: 1
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Mo, 27 Jul 2015
Time: 10:00:00
No. Iterations: 17
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 5.0254 0.091 55.017 0.000 4.846 5.204
x1 0.4845 0.008 61.555 0.000 0.469 0.500
==============================================================================

Since an OLS return the R2, I would suggest regressing the actual values against the fitted values using simple linear regression. Regardless where the fitted values come from, such an approach would provide you an indication of the corresponding R2.

R2 is not a good measure of goodness of fit for RLM models. The problem is that the outliers have a huge effect on the R2 value, to the point where it is completely determined by outliers. Using weighted regression afterwards is an attractive alternative, but it is better to look at the p-values, standard errors and confidence intervals of the estimated coefficients.
Comparing the OLS summary to RLM (results are slightly different to yours due to different randomization):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.726
Model: OLS Adj. R-squared: 0.721
Method: Least Squares F-statistic: 127.4
Date: Wed, 03 Nov 2021 Prob (F-statistic): 4.15e-15
Time: 09:33:40 Log-Likelihood: -87.455
No. Observations: 50 AIC: 178.9
Df Residuals: 48 BIC: 182.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.7071 0.396 14.425 0.000 4.912 6.503
x1 0.3848 0.034 11.288 0.000 0.316 0.453
==============================================================================
Omnibus: 23.499 Durbin-Watson: 2.752
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33.906
Skew: -1.649 Prob(JB): 4.34e-08
Kurtosis: 5.324 Cond. No. 23.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: RLM Df Residuals: 48
Method: IRLS Df Model: 1
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Wed, 03 Nov 2021
Time: 09:34:24
No. Iterations: 17
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 5.1857 0.111 46.590 0.000 4.968 5.404
x1 0.4790 0.010 49.947 0.000 0.460 0.498
==============================================================================
If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .
You can see that the standard errors and size of the confidence interval decreases in going from OLS to RLM for both the intercept and the slope term. This suggests that the estimates are closer to the real values.

Why not use model.predict to obtain the r2? For Example:
r2=1. - np.sum(np.abs(model.predict(X) - y) **2) / np.sum(np.abs(y - np.mean(y)) ** 2)

Related

Repeated columns of a single variable when using statsmodels.formula.api package ols function in python

I am trying to perform multiple linear regression using the statsmodels.formula.api package in python and have listed the code that i have used to perform this regression below.
auto_1= pd.read_csv("Auto.csv")
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())
The data consists the following variables - mpg, cylinders, displacement, horsepower, weight , acceleration, year, origin and name. When the print result comes up, it shows multiple rows of the horsepower column and the regression results are also not correct. Im not sure why?
screenshot of repeated rows

It's likely because of the data type of the horsepower column. If its values are categories or just strings, the model will use treatment (dummy) coding for them by default, producing the results you are seeing. Check the data type (run auto_1.dtypes) and cast the column to a numeric type (it's best to do it when you are first reading the csv file with the dtype= parameter of the read_csv() method.
Here is an example where a column with numeric values is cast (i.e. converted) to strings (or categories):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
df = pd.DataFrame(
{
'mpg': np.random.randint(20, 40, 50),
'horsepower': np.random.randint(100, 200, 50)
}
)
# convert integers to strings (or categories)
df['horsepower'] = (
df['horsepower'].astype('str') # same result with .astype('category')
)
formula = 'mpg ~ horsepower'
results = smf.ols(formula, df).fit()
print(results.summary())
Output (dummy coding):
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.778
Model: OLS Adj. R-squared: -0.207
Method: Least Squares F-statistic: 0.7901
Date: Sun, 18 Sep 2022 Prob (F-statistic): 0.715
Time: 20:17:51 Log-Likelihood: -110.27
No. Observations: 50 AIC: 302.5
Df Residuals: 9 BIC: 380.9
Df Model: 40
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 32.0000 5.175 6.184 0.000 20.294 43.706
horsepower[T.103] -4.0000 7.318 -0.547 0.598 -20.555 12.555
horsepower[T.112] -1.0000 7.318 -0.137 0.894 -17.555 15.555
horsepower[T.116] -9.0000 7.318 -1.230 0.250 -25.555 7.555
horsepower[T.117] 6.0000 7.318 0.820 0.433 -10.555 22.555
horsepower[T.118] 2.0000 7.318 0.273 0.791 -14.555 18.555
horsepower[T.120] -4.0000 6.338 -0.631 0.544 -18.337 10.337
etc.
Now, converting the strings back to integers:
df['horsepower'] = pd.to_numeric(df.horsepower)
# or df['horsepower'] = df['horsepower'].astype('int')
results = smf.ols(formula, df).fit()
print(results.summary())
Output (as expected):
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.011
Model: OLS Adj. R-squared: -0.010
Method: Least Squares F-statistic: 0.5388
Date: Sun, 18 Sep 2022 Prob (F-statistic): 0.466
Time: 20:24:54 Log-Likelihood: -147.65
No. Observations: 50 AIC: 299.3
Df Residuals: 48 BIC: 303.1
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 31.7638 3.663 8.671 0.000 24.398 39.129
horsepower -0.0176 0.024 -0.734 0.466 -0.066 0.031
==============================================================================
Omnibus: 3.529 Durbin-Watson: 1.859
Prob(Omnibus): 0.171 Jarque-Bera (JB): 1.725
Skew: 0.068 Prob(JB): 0.422
Kurtosis: 2.100 Cond. No. 834.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Using GLM to reproduce built-in regression models in statsmodels

I am currently trying to reproduce a regression model eq. (3) (edit: fixed link) in python using statsmodels. As this model is no part of the standard models provided by statsmodels I clearly have to write it myself using the provided formula api.
Since I have never worked with the formula api (or patsy for that matter), I wanted to start and verify my approach by reproducing standard models with the formula api and a generalized linear model. My code and the results for a poisson regression are given below at the end of my question.
You will see that it predicts the parameters beta = (2, -3, 1) for all three models with good accuracy. However, I have a couple of questions:
How do I explicitly add covariates to the glm model with a
regression coefficient equal to 1?
From what I understand, a poisson regression in general has the shape ln(counts) = exp(intercept + beta * x + log(exposure)), i.e. the exposure is added through a fixed constant of value 1. I would like to reproduce this behaviour in my glm model, i.e. I want something like ln(counts) = exp(intercept + beta * x + k * log(exposure)) where k is a fixed constant as a formula.
Simply using formula = "1 + x1 + x2 + x3 + np.log(exposure)" returns a perfect seperation error (why?). I can bypass that by adding some random noise to y, but in that case np.log(exposure) has a non-unity regression coefficient, i.e. it is treated as a normal regression covariate.
Apparently both built-in models 1 and 2 have no intercept, eventhough I tried to explicitly add one in model 2. Or is there a hidden intercept that is simply not reported? In either case, how do I fix that?
Any help would be greatly appreciated, so thanks in advance!
import numpy as np
import pandas as pd
np.random.seed(1+8+2022)
# Number of random samples
n = 4000
# Intercept
alpha = 1
# Regression Coefficients
beta = np.asarray([2.0,-3.0,1.0])
# Random Data
data = {
"x1" : np.random.normal(1.00,0.10, size = n),
"x2" : np.random.normal(1.50,0.15, size = n),
"x3" : np.random.normal(-2.0,0.20, size = n),
"exposure": np.random.poisson(14, size = n),
}
# Calculate the response
x = np.asarray([data["x1"], data["x2"] , data["x3"]]).T
t = np.asarray(data["exposure"])
# Add response to random data
data["y"] = np.exp(alpha + np.dot(x,beta) + np.log(t))
# Convert dict to df
data = pd.DataFrame(data)
print(data)
#-----------------------------------------------------
# My Model
#-----------------------------------------------------
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "y ~ x1 + x2 + x3"
model = smf.glm(formula=formula, data=data, family=sm.families.Poisson()).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 1
#-----------------------------------------------------
import statsmodels.api as sm
data["offset"] = np.ones(n)
model = sm.Poisson( endog = data["y"],
exog = data[["x1", "x2", "x3"]],
exposure = data["exposure"],
offset = data["offset"]).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 2
#-----------------------------------------------------
import statsmodels.api as sm
data["x1"] = sm.add_constant(data["x1"])
model = sm.Poisson( endog = data["y"],
exog = data[["x1", "x2", "x3"]],
exposure = data["exposure"]).fit()
print(model.summary())
RESULTS:
x1 x2 x3 exposure y
0 1.151771 1.577677 -1.811903 13 0.508422
1 0.897012 1.678311 -2.327583 22 0.228219
2 1.040250 1.471962 -1.705458 13 0.621328
3 0.866195 1.512472 -1.766108 17 0.478107
4 0.925470 1.399320 -1.886349 13 0.512518
... ... ... ... ... ...
3995 1.073945 1.365260 -1.755071 12 0.804081
3996 0.855000 1.251951 -2.173843 11 0.439639
3997 0.892066 1.710856 -2.183085 10 0.107643
3998 0.763777 1.538938 -2.013619 22 0.363551
3999 1.056958 1.413922 -1.722252 19 1.098932
[4000 rows x 5 columns]
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: GLM Df Residuals: 3996
Model Family: Poisson Df Model: 3
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -2743.7
Date: Sat, 08 Jan 2022 Deviance: 141.11
Time: 09:32:32 Pearson chi2: 140.
No. Iterations: 4
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.6857 0.378 9.755 0.000 2.945 4.426
x1 2.0020 0.227 8.800 0.000 1.556 2.448
x2 -3.0393 0.148 -20.604 0.000 -3.328 -2.750
x3 0.9937 0.114 8.719 0.000 0.770 1.217
==============================================================================
Optimization terminated successfully.
Current function value: 0.668293
Iterations 10
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.09462
Time: 09:32:32 Log-Likelihood: -2673.2
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 4.619e-122
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.0000 0.184 10.875 0.000 1.640 2.360
x2 -3.0000 0.124 -24.160 0.000 -3.243 -2.757
x3 1.0000 0.094 10.667 0.000 0.816 1.184
==============================================================================
Optimization terminated successfully.
Current function value: 0.677893
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.08162
Time: 09:32:32 Log-Likelihood: -2711.6
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 2.196e-105
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.9516 0.304 9.711 0.000 2.356 3.547
x2 -2.9801 0.147 -20.275 0.000 -3.268 -2.692
x3 0.9807 0.113 8.655 0.000 0.759 1.203
==============================================================================
Process return 0 (0x0)
Press Enter to continue...

Dataframe count in Statsmodels OLS formula

I'm trying to do a linear regression to predict the count of a dataframe based on the count of another dataframe. I am using statsmodels. I have tried the following:
X = df1.count()
Y = df2.count()
from statsmodels.formula.api import ols
fit = ols(Y ~ X, data=kak).fit()
fit.summary()
Using the X and Y variables in the OLS formula is not allowed and I have no idea what to fill in at the data= keyword argument. How would I go about doing this?

Let's assume you have to 1D arrays X and Y:
import numpy as np
X = np.arange(100)
Y = 2*X + 5
Then you can run a linear regression using the line below. There are two important things:
The first argument to ols is a str containing a formula. "Y ~ X" and not Y ~ X (note the double quotes).
The second argument data can be any Python object as long as it has keys data["X"] and data["Y"]. I wrote a dict here, but it would also work with a DataFrame. It basically allows statsmodels to understand who are X and Y in the formula you gave to it.
ols("Y ~ X", {"X": X, "Y": Y}).fit().summary()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.916e+33
Date: Fri, 21 May 2021 Prob (F-statistic): 0.00
Time: 14:01:48 Log-Likelihood: 3055.1
No. Observations: 100 AIC: -6106.
Df Residuals: 98 BIC: -6101.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.0000 2.62e-15 1.91e+15 0.000 5.000 5.000
X 2.0000 4.57e-17 4.38e+16 0.000 2.000 2.000
==============================================================================
Omnibus: 220.067 Durbin-Watson: 0.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9.816
Skew: 0.159 Prob(JB): 0.00739
Kurtosis: 1.498 Cond. No. 114.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Logistic Regression different results with R and Python?

I used a logistic regression approach in both programs, and was wondering why I am getting different results, especially with the coefficients. The outcome, Infection, is (1, 0) and Flushed is a continuous variable.
Python:
import statsmodels.api as sm
logit_model=sm.Logit(data['INFECTION'], data['Flushed'])
result=logit_model.fit()
print(result.summary())
Results:
Logit Regression Results
==============================================================================
Dep. Variable: INFECTION No. Observations: 414
Model: Logit Df Residuals: 413
Method: MLE Df Model: 0
Date: Fri, 24 Aug 2018 Pseudo R-squ.: -1.388
Time: 15:47:42 Log-Likelihood: -184.09
converged: True LL-Null: -77.104
LLR p-value: nan
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Flushed -0.6467 0.070 -9.271 0.000 -0.783 -0.510
==============================================================================
R:
mylogit <- glm(INFECTION ~ Flushed, data = cvc, family = "binomial")
summary(mylogit)
Results:
Call:
glm(formula = INFECTION ~ Flushed, family = "binomial", data = cvc)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0598 -0.3107 -0.2487 -0.2224 2.8051
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.91441 0.38639 -10.131 < 2e-16 ***
Flushed 0.22696 0.06049 3.752 0.000175 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You seem to be missing the constant (offset) parameter in the Python logistic model.
To use R's formula syntax you're fitting two different models:
Python model: INFECTION ~ 0 + Flushed
R model : INFECTION ~ Flushed
To add a constant to the Python model use sm.add_constant(...).

Python statsmodel.tsa.MarkovAutoregression using current real GNP/GDP data

I have been using statsmodel.tsa.MarkovAutoregressio to replicate Hamilton's markov switching model published in 1989. If using the Hamilton data (real GNP in 1982 dollar) I could have the same result as the code example / the paper showed. However, when I used current available real GNP or GDP data (in 2009 dollar) and took their log difference (quarterly) as input, the model doesn't give satisfactory results.
I plotted the log difference of Hamilton gnp and that's from the current available real GNP. They are quite close with slight differences.
Can anyone enlighten me why it is the case? Does it have anything to do with the seasonality adjustment of current GNP data? If so, is there is any way to counter it?
Result using current available GNP
Result using paper provided GNP

You write:
the model doesn't give satisfactory results.
But what you mean is that the model isn't giving you the results you expect / want. i.e., you want the model to pick out periods the NBER has labeled as "Recessions", but the Markov switching model is simply finding the parameters which maximize the likelihood function for the data.
(The rest of the post shows results that are taken from this Jupyter notebook: http://nbviewer.jupyter.org/gist/ChadFulton/a5d24d32ba3b7b2e381e43a232342f1f)
(I'll also note that I double-checked these results using E-views, and it agrees with Statsmodels' output almost exactly).
The raw dataset is the growth rate (log difference * 100) of real GNP; the Hamilton dataset versus one found on the Federal Reserve Economic Database are shown here, with grey bars indicating NBER-dated recessions:
In this case, the model is an AR(4) on the growth rate of real GNP, with a regime-specific intercept; the model allows two regimes. The idea is that "recessions" should correspond to a low (or negative) average growth rate and expansions should correspond to a higher average growth rate.
Model 1: Hamilton's dataset: Maximum likelihood estimation of parameters
From the model applied to Hamilton's (1989) dataset, we get the following estimated parameters:
Markov Switching Model Results
================================================================================
Dep. Variable: Hamilton No. Observations: 131
Model: MarkovAutoregression Log Likelihood -181.263
Date: Sun, 02 Apr 2017 AIC 380.527
Time: 19:52:31 BIC 406.404
Sample: 04-01-1951 HQIC 391.042
- 10-01-1984
Covariance Type: approx
Regime 0 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.3588 0.265 -1.356 0.175 -0.877 0.160
Regime 1 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 1.1635 0.075 15.614 0.000 1.017 1.310
Non-switching parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
sigma2 0.5914 0.103 5.761 0.000 0.390 0.793
ar.L1 0.0135 0.120 0.112 0.911 -0.222 0.249
ar.L2 -0.0575 0.138 -0.418 0.676 -0.327 0.212
ar.L3 -0.2470 0.107 -2.310 0.021 -0.457 -0.037
ar.L4 -0.2129 0.111 -1.926 0.054 -0.430 0.004
Regime transition parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
p[0->0] 0.7547 0.097 7.819 0.000 0.565 0.944
p[1->0] 0.0959 0.038 2.542 0.011 0.022 0.170
==============================================================================
and the time series of the probability of operating in regime 0 (which here corresponds to a negative growth rate, i.e. a recession) looks like:
Model 2: Updated dataset: Maximum likelihood estimation of parameters
Now, as you saw, we can instead fit the model using the "updated" dataset (which looks pretty much like the original dataset), to get the following parameters and regime probabilities:
Markov Switching Model Results
================================================================================
Dep. Variable: GNPC96 No. Observations: 131
Model: MarkovAutoregression Log Likelihood -188.002
Date: Sun, 02 Apr 2017 AIC 394.005
Time: 20:00:58 BIC 419.882
Sample: 04-01-1951 HQIC 404.520
- 10-01-1984
Covariance Type: approx
Regime 0 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -1.2475 3.470 -0.359 0.719 -8.049 5.554
Regime 1 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 0.9364 0.453 2.066 0.039 0.048 1.825
Non-switching parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
sigma2 0.8509 0.561 1.516 0.130 -0.249 1.951
ar.L1 0.3437 0.189 1.821 0.069 -0.026 0.714
ar.L2 0.0919 0.143 0.645 0.519 -0.187 0.371
ar.L3 -0.0846 0.251 -0.337 0.736 -0.577 0.408
ar.L4 -0.1727 0.258 -0.669 0.503 -0.678 0.333
Regime transition parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
p[0->0] 0.0002 1.705 0.000 1.000 -3.341 3.341
p[1->0] 0.0397 0.186 0.213 0.831 -0.326 0.405
==============================================================================
To understand what the model is doing, look at the intercepts in the two regimes. In Hamilton's model, the "low" regime has an intercept of -0.35, whereas with the updated data, the "low" regime has an intercept of -1.25.
What that tells us is that with the updated dataset, the model is doing a "better job" fitting the data (in terms of a higher likelihood) by choosing the "low" regime to be much deeper recessions. In particular, looking back at the GNP data series, it's apparent that it's using the "low" regime to fit the very low growth in the late 1950's and early 1980's.
In contrast, the fitted parameters from Hamilton's model allow the "low" regime to fit "moderately low" growth rates that cover a wider range of recessions.
We can't compare these two models' outcomes using e.g. the log-likelihood values because they're using different datasets. One thing we could try, though is to use the fitted parameters from Hamilton's dataset on the updated GNP data. Doing that, we get the following result:
Model 3: Updated dataset using parameters estimated on Hamilton's dataset
Markov Switching Model Results
================================================================================
Dep. Variable: GNPC96 No. Observations: 131
Model: MarkovAutoregression Log Likelihood -191.807
Date: Sun, 02 Apr 2017 AIC 401.614
Time: 19:52:52 BIC 427.491
Sample: 04-01-1951 HQIC 412.129
- 10-01-1984
Covariance Type: opg
Regime 0 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.3588 0.185 -1.939 0.053 -0.722 0.004
Regime 1 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 1.1635 0.083 13.967 0.000 1.000 1.327
Non-switching parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
sigma2 0.5914 0.090 6.604 0.000 0.416 0.767
ar.L1 0.0135 0.100 0.134 0.893 -0.183 0.210
ar.L2 -0.0575 0.088 -0.651 0.515 -0.231 0.116
ar.L3 -0.2470 0.104 -2.384 0.017 -0.450 -0.044
ar.L4 -0.2129 0.084 -2.524 0.012 -0.378 -0.048
Regime transition parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
p[0->0] 0.7547 0.100 7.563 0.000 0.559 0.950
p[1->0] 0.0959 0.051 1.872 0.061 -0.005 0.196
==============================================================================
This looks more like what you expected / wanted, and that's because as I mentioned above, the "low" regime intercept of 0.35 makes the "low" regime a good fit for more time periods in the sample. But notice that the log-likelihood here is -191.8, whereas in Model 2 the log-likelihood was -188.0.
Thus even though this model looks more like what you wanted, it does not fit the data as well.
(Note again that you can't compare these log-likelihoods to the -181.3 from Model 1, because that is using a different dataset).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to get R-squared for robust regression (RLM) in Statsmodels? - python

Since an OLS return the R2, I would suggest regressing the actual values against the fitted values using simple linear regression. Regardless where the fitted values come from, such an approach would provide you an indication of the corresponding R2.

Why not use model.predict to obtain the r2? For Example: r2=1. - np.sum(np.abs(model.predict(X) - y) 2) / np.sum(np.abs(y - np.mean(y)) 2)

Related

Repeated columns of a single variable when using statsmodels.formula.api package ols function in python

Using GLM to reproduce built-in regression models in statsmodels

Dataframe count in Statsmodels OLS formula

Logistic Regression different results with R and Python?

Python statsmodel.tsa.MarkovAutoregression using current real GNP/GDP data

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to get R-squared for robust regression (RLM) in Statsmodels? - python

Since an OLS return the R2, I would suggest regressing the actual values against the fitted values using simple linear regression. Regardless where the fitted values come from, such an approach would provide you an indication of the corresponding R2.

Why not use model.predict to obtain the r2? For Example: r2=1. - np.sum(np.abs(model.predict(X) - y) **2) / np.sum(np.abs(y - np.mean(y)) ** 2)

Related

Repeated columns of a single variable when using statsmodels.formula.api package ols function in python

Using GLM to reproduce built-in regression models in statsmodels

Dataframe count in Statsmodels OLS formula

Logistic Regression different results with R and Python?

Python statsmodel.tsa.MarkovAutoregression using current real GNP/GDP data

Categories

Resources

Why not use model.predict to obtain the r2? For Example: r2=1. - np.sum(np.abs(model.predict(X) - y) 2) / np.sum(np.abs(y - np.mean(y)) 2)