I used a logistic regression approach in both programs and am wondering why I get different results, especially for the coefficients. The outcome, INFECTION, is binary (1, 0) and Flushed is a continuous variable.
Python:
import statsmodels.api as sm
logit_model=sm.Logit(data['INFECTION'], data['Flushed'])
result=logit_model.fit()
print(result.summary())
Results:
Logit Regression Results
==============================================================================
Dep. Variable: INFECTION No. Observations: 414
Model: Logit Df Residuals: 413
Method: MLE Df Model: 0
Date: Fri, 24 Aug 2018 Pseudo R-squ.: -1.388
Time: 15:47:42 Log-Likelihood: -184.09
converged: True LL-Null: -77.104
LLR p-value: nan
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Flushed -0.6467 0.070 -9.271 0.000 -0.783 -0.510
==============================================================================
R:
mylogit <- glm(INFECTION ~ Flushed, data = cvc, family = "binomial")
summary(mylogit)
Results:
Call:
glm(formula = INFECTION ~ Flushed, family = "binomial", data = cvc)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0598 -0.3107 -0.2487 -0.2224 2.8051
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.91441 0.38639 -10.131 < 2e-16 ***
Flushed 0.22696 0.06049 3.752 0.000175 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You seem to be missing the constant (intercept) term in the Python logistic model.
In R's formula syntax, you are fitting two different models:
Python model: INFECTION ~ 0 + Flushed
R model     : INFECTION ~ Flushed
To add a constant to the Python model use sm.add_constant(...).
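For example, a minimal sketch (assuming data is the same DataFrame as above) that adds the intercept and should make the Python fit comparable to the R one:
import statsmodels.api as sm
# add_constant prepends an intercept column to the design matrix
X = sm.add_constant(data['Flushed'])
logit_model = sm.Logit(data['INFECTION'], X)
result = logit_model.fit()
print(result.summary())
Equivalently, the formula interface (smf.logit('INFECTION ~ Flushed', data=data)) adds the intercept automatically, matching R's default behaviour.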
For a fixed-effects model I was planning to switch from Stata's areg to Python's linearmodels.panel.PanelOLS.
But the results are different: in Stata I get R-squared = 0.6047 and in Python I get R-squared = 0.1454.
Why do I get such different R-squared values from the commands below?
Stata command and results:
use ./linearmodels_datasets_wage_panel.dta, clear
areg lwage expersq union married hours, vce(cluster nr) absorb(nr)
Linear regression, absorbing indicators Number of obs = 4,360
Absorbed variable: nr No. of categories = 545
F(4, 544) = 84.67
Prob > F = 0.0000
R-squared = 0.6047
Adj R-squared = 0.5478
Root MSE = 0.3582
(Std. err. adjusted for 545 clusters in nr)
------------------------------------------------------------------------------
| Robust
lwage | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
expersq | .0039509 .0002554 15.47 0.000 .0034492 .0044526
union | .0784442 .0252621 3.11 0.002 .028821 .1280674
married | .1146543 .0234954 4.88 0.000 .0685014 .1608072
hours | -.0000846 .0000238 -3.56 0.000 -.0001313 -.0000379
_cons | 1.565825 .0531868 29.44 0.000 1.461348 1.670302
------------------------------------------------------------------------------
Python command and results:
from linearmodels.datasets import wage_panel
from linearmodels.panel import PanelOLS
data = wage_panel.load()
mod_entity = PanelOLS.from_formula(
"lwage ~ 1 + expersq + union + married + hours + EntityEffects",
data=data.set_index(["nr", "year"]),
)
result_entity = mod_entity.fit(
cov_type='clustered',
cluster_entity=True,
)
print(result_entity)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.1454
Estimator: PanelOLS R-squared (Between): -0.0844
No. Observations: 4360 R-squared (Within): 0.1454
Date: Wed, Feb 02 2022 R-squared (Overall): 0.0219
Time: 12:23:24 Log-likelihood -1416.4
Cov. Estimator: Clustered
F-statistic: 162.14
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(4,3811)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 96.915
P-value 0.0000
Time periods: 8 Distribution: F(4,3811)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Intercept 1.5658 0.0497 31.497 0.0000 1.4684 1.6633
expersq 0.0040 0.0002 16.550 0.0000 0.0035 0.0044
hours -8.46e-05 2.22e-05 -3.8101 0.0001 -0.0001 -4.107e-05
married 0.1147 0.0220 5.2207 0.0000 0.0716 0.1577
union 0.0784 0.0236 3.3221 0.0009 0.0321 0.1247
==============================================================================
F-test for Poolability: 9.4833
P-value: 0.0000
Distribution: F(544,3811)
Included effects: Entity
You are trying to run an absorbing regression (areg), i.e. a linear regression absorbing one categorical factor. To do this you can use linearmodels.iv.absorbing.AbsorbingLS(dependent, exog, absorb=categorical_variable).
See the example below:
import pandas as pd
import statsmodels.api as sm
from linearmodels.iv import absorbing
dta = pd.read_csv('http://www.math.smith.edu/~bbaumer/mth247/labs/airline.csv')
dta.rename(columns={'I': 'airline',
'T': 'year',
'Q': 'output',
'C': 'cost',
'PF': 'fuel',
'LF ': 'load'}, inplace=True)
Next, transform the absorbing variable into a categorical variable (in this case, I will use the airline variable):
cats = pd.DataFrame({'airline': pd.Categorical(dta['airline'])})
Then, just run the model:
exog_variables = ['output', 'fuel', 'load']
endog_variable = ['cost']
exog = sm.add_constant(dta[exog_variables])
endog = dta[endog_variable]
model = absorbing.AbsorbingLS(endog, exog, absorb=cats, drop_absorbed=True)
model_res = model.fit(cov_type='unadjusted', debiased=True)
print(model_res.summary)
Below are the results of the same model in both Python and Stata (using the command areg cost output fuel load, absorb(airline)):
Python:
Absorbing LS Estimation Summary
==================================================================================
Dep. Variable: cost R-squared: 0.9974
Estimator: Absorbing LS Adj. R-squared: 0.9972
No. Observations: 90 F-statistic: 3827.4
Date: Thu, Oct 27 2022 P-value (F-stat): 0.0000
Time: 20:58:04 Distribution: F(3,81)
Cov. Estimator: unadjusted R-squared (No Effects): 0.9926
Varaibles Absorbed: 5.0000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
const 9.7135 0.2229 43.585 0.0000 9.2701 10.157
output 0.9193 0.0290 31.691 0.0000 0.8616 0.9770
fuel 0.4175 0.0148 28.303 0.0000 0.3881 0.4468
load -1.0704 0.1957 -5.4685 0.0000 -1.4599 -0.6809
==============================================================================
Stata:
Linear regression, absorbing indicators Number of obs = 90
F( 3, 81) = 3604.80
Prob > F = 0.0000
R-squared = 0.9974
Adj R-squared = 0.9972
Root MSE = .06011
------------------------------------------------------------------------------
cost | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
output | .9192846 .0298901 30.76 0.000 .8598126 .9787565
fuel | .4174918 .0151991 27.47 0.000 .3872503 .4477333
load | -1.070396 .20169 -5.31 0.000 -1.471696 -.6690963
_cons | 9.713528 .229641 42.30 0.000 9.256614 10.17044
-------------+----------------------------------------------------------------
airline | F(5, 81) = 57.732 0.000 (6 categories)
I am currently trying to reproduce a regression model, eq. (3) (edit: fixed link), in Python using statsmodels. As this model is not part of the standard models provided by statsmodels, I clearly have to write it myself using the provided formula API.
Since I have never worked with the formula API (or patsy, for that matter), I wanted to start by verifying my approach, reproducing standard models with the formula API and a generalized linear model. My code and the results for a Poisson regression are given below at the end of my question.
You will see that it predicts the parameters beta = (2, -3, 1) for all three models with good accuracy. However, I have a couple of questions:
How do I explicitly add covariates to the GLM model with a regression coefficient fixed at 1?
From what I understand, a Poisson regression in general has the form ln(counts) = intercept + beta * x + log(exposure), i.e. the exposure enters the linear predictor with a coefficient fixed at 1. I would like to reproduce this behaviour in my GLM model, i.e. I want something like ln(counts) = intercept + beta * x + k * log(exposure), where k is a fixed constant, expressed as a formula.
Simply using formula = "1 + x1 + x2 + x3 + np.log(exposure)" returns a perfect separation error (why?). I can bypass that by adding some random noise to y, but in that case np.log(exposure) gets a non-unity regression coefficient, i.e. it is treated as a normal regression covariate.
Apparently both built-in models 1 and 2 have no intercept, even though I tried to explicitly add one in model 2. Or is there a hidden intercept that is simply not reported? In either case, how do I fix that?
Any help would be greatly appreciated, so thanks in advance!
import numpy as np
import pandas as pd
np.random.seed(1+8+2022)
# Number of random samples
n = 4000
# Intercept
alpha = 1
# Regression Coefficients
beta = np.asarray([2.0,-3.0,1.0])
# Random Data
data = {
"x1" : np.random.normal(1.00,0.10, size = n),
"x2" : np.random.normal(1.50,0.15, size = n),
"x3" : np.random.normal(-2.0,0.20, size = n),
"exposure": np.random.poisson(14, size = n),
}
# Calculate the response
x = np.asarray([data["x1"], data["x2"] , data["x3"]]).T
t = np.asarray(data["exposure"])
# Add response to random data
data["y"] = np.exp(alpha + np.dot(x,beta) + np.log(t))
# Convert dict to df
data = pd.DataFrame(data)
print(data)
#-----------------------------------------------------
# My Model
#-----------------------------------------------------
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "y ~ x1 + x2 + x3"
model = smf.glm(formula=formula, data=data, family=sm.families.Poisson()).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 1
#-----------------------------------------------------
import statsmodels.api as sm
data["offset"] = np.ones(n)
model = sm.Poisson( endog = data["y"],
exog = data[["x1", "x2", "x3"]],
exposure = data["exposure"],
offset = data["offset"]).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 2
#-----------------------------------------------------
import statsmodels.api as sm
data["x1"] = sm.add_constant(data["x1"])
model = sm.Poisson( endog = data["y"],
exog = data[["x1", "x2", "x3"]],
exposure = data["exposure"]).fit()
print(model.summary())
RESULTS:
x1 x2 x3 exposure y
0 1.151771 1.577677 -1.811903 13 0.508422
1 0.897012 1.678311 -2.327583 22 0.228219
2 1.040250 1.471962 -1.705458 13 0.621328
3 0.866195 1.512472 -1.766108 17 0.478107
4 0.925470 1.399320 -1.886349 13 0.512518
... ... ... ... ... ...
3995 1.073945 1.365260 -1.755071 12 0.804081
3996 0.855000 1.251951 -2.173843 11 0.439639
3997 0.892066 1.710856 -2.183085 10 0.107643
3998 0.763777 1.538938 -2.013619 22 0.363551
3999 1.056958 1.413922 -1.722252 19 1.098932
[4000 rows x 5 columns]
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: GLM Df Residuals: 3996
Model Family: Poisson Df Model: 3
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -2743.7
Date: Sat, 08 Jan 2022 Deviance: 141.11
Time: 09:32:32 Pearson chi2: 140.
No. Iterations: 4
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.6857 0.378 9.755 0.000 2.945 4.426
x1 2.0020 0.227 8.800 0.000 1.556 2.448
x2 -3.0393 0.148 -20.604 0.000 -3.328 -2.750
x3 0.9937 0.114 8.719 0.000 0.770 1.217
==============================================================================
Optimization terminated successfully.
Current function value: 0.668293
Iterations 10
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.09462
Time: 09:32:32 Log-Likelihood: -2673.2
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 4.619e-122
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.0000 0.184 10.875 0.000 1.640 2.360
x2 -3.0000 0.124 -24.160 0.000 -3.243 -2.757
x3 1.0000 0.094 10.667 0.000 0.816 1.184
==============================================================================
Optimization terminated successfully.
Current function value: 0.677893
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.08162
Time: 09:32:32 Log-Likelihood: -2711.6
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 2.196e-105
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.9516 0.304 9.711 0.000 2.356 3.547
x2 -2.9801 0.147 -20.275 0.000 -3.268 -2.692
x3 0.9807 0.113 8.655 0.000 0.759 1.203
==============================================================================
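For what it's worth, statsmodels' GLM formula interface accepts an offset argument that is added to the linear predictor with its coefficient fixed at 1, which is one way to express the exposure term asked about above. A minimal, hedged sketch, assuming the data frame constructed in the question (it draws a Poisson-distributed response, here called y_pois, from the same linear predictor, because a perfectly deterministic y can trip statsmodels' perfect-prediction check):
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Poisson-distributed response with the same mean as the deterministic y above
data["y_pois"] = np.random.poisson(np.exp(alpha + np.dot(x, beta) + np.log(t)))
# log(exposure) is passed as an offset, so it is not estimated as a covariate
model = smf.glm(formula="y_pois ~ x1 + x2 + x3",
                data=data,
                family=sm.families.Poisson(),
                offset=np.log(data["exposure"])).fit()
print(model.summary())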
I am working on a logistic regression model and am using statsmodels' logit API. I am unable to figure out how to feed interaction terms to the model.
You can use the formula interface and use the colon, :, inside the formula, for example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
np.random.seed(111)
df = pd.DataFrame(np.random.binomial(1,0.5,(50,3)),columns=['x1','x2','y'])
res1 = smf.logit(formula='y ~ x1 + x2 + x1:x2', data=df).fit()
res1.summary()
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: Logit Df Residuals: 46
Method: MLE Df Model: 3
Date: Thu, 04 Feb 2021 Pseudo R-squ.: 0.02229
Time: 10:03:59 Log-Likelihood: -32.463
converged: True LL-Null: -33.203
Covariance Type: nonrobust LLR p-value: 0.6869
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.9808 0.677 -1.449 0.147 -2.308 0.346
x1 0.4700 0.851 0.552 0.581 -1.199 2.139
x2 0.9808 0.863 1.137 0.256 -0.710 2.671
x1:x2 -1.1632 1.229 -0.946 0.344 -3.572 1.246
==============================================================================
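For reference, patsy's * operator expands to the main effects plus their interaction, so an equivalent way to write the same model is:
# y ~ x1*x2 expands to y ~ x1 + x2 + x1:x2
res2 = smf.logit(formula='y ~ x1*x2', data=df).fit()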
I have percentages and need to calculate a regression. According to basic statistics, logistic regression is better suited than OLS because percentages violate the requirement of a continuous and unconstrained value space.
So far, so good.
However, I get different results in R, Python, and Matlab. In fact, Matlab even reports significant values where Python does not.
My models look like:
R:
summary(glm(foo ~ 1 + bar + baz , family = "binomial", data = <<data>>))
Python via statsmodels:
smf.logit('foo ~ 1 + bar + baz', <<data>>).fit().summary()
Matlab:
fitglm(<<data>>,'foo ~ 1 + bar + baz','Link','logit')
where Matlab currently produces the best results.
Could there be different initialization values? Different solvers? Different settings for alphas when computing p-values?
How can I get the same results at least in similar numeric ranges or same features detected as significant? I do not require exact equal numeric output.
Edit: the summary statistics
Python:
Dep. Variable: foo No. Observations: 104
Model: Logit Df Residuals: 98
Method: MLE Df Model: 5
Date: Wed, 28 Aug 2019 Pseudo R-squ.: inf
Time: 06:48:12 Log-Likelihood: -0.25057
converged: True LL-Null: 0.0000
LLR p-value: 1.000
coef std err z P>|z| [0.025 0.975]
Intercept -16.9863 154.602 -0.110 0.913 -320.001 286.028
bar -0.0278 0.945 -0.029 0.977 -1.880 1.824
baz 18.5550 280.722 0.066 0.947 -531.650 568.760
a 9.9996 153.668 0.065 0.948 -291.184 311.183
b 0.6757 132.542 0.005 0.996 -259.102 260.454
d 0.0005 0.039 0.011 0.991 -0.076 0.077
R:
glm(formula = myformula, family = "binomial", data = r_x)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.046466 -0.013282 -0.001017 0.006217 0.104467
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.699e+01 1.546e+02 -0.110 0.913
bar -2.777e-02 9.449e-01 -0.029 0.977
baz 1.855e+01 2.807e+02 0.066 0.947
a 1.000e+01 1.537e+02 0.065 0.948
b 6.757e-01 1.325e+02 0.005 0.996
d 4.507e-04 3.921e-02 0.011 0.991
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.049633 on 103 degrees of freedom
Residual deviance: 0.035684 on 98 degrees of freedom
AIC: 12.486
Matlab:
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ ________ __________
(Intercept) -21.044 3.315 -6.3483 6.8027e-09
bar -0.033507 0.022165 -1.5117 0.13383
d 0.0016149 0.00083173 1.9416 0.055053
baz 21.427 6.0132 3.5632 0.00056774
a 14.875 3.7828 3.9322 0.00015712
b -1.2126 2.7535 -0.44038 0.66063
104 observations, 98 error degrees of freedom
Estimated Dispersion: 1.25e-06
F-statistic vs. constant model: 7.4, p-value = 6.37e-06
You are not actually using the binomial distribution in the MATLAB case. You are specifying the link function, but the distribution stays at its default, the normal distribution, which will not give you the expected logistic fit, at least if the sample sizes behind the percentages are small. It also gives you lower p-values, because the normal distribution is less constrained in its variance than the binomial distribution.
You need to specify the Distribution argument as 'binomial':
fitglm(<<data>>, 'foo ~ 1 + bar + baz', 'Distribution', 'binomial', 'Link', 'logit')
The R and Python code seem to match rather well.
When it comes to measuring goodness of fit, R-squared seems to be a commonly understood (and accepted) measure for "simple" linear models.
But in the case of statsmodels (as well as other statistical software), RLM does not include R-squared in its regression results.
Is there a way to calculate it "manually", perhaps in a way similar to how it is done in Stata?
Or is there another measure that can be used / calculated from the results produced by sm.RLM?
This is what Statsmodels is producing:
import numpy as np
import statsmodels.api as sm
# Sample Data with outliers
nsample = 50
x = np.linspace(0, 20, nsample)
x = sm.add_constant(x)
sig = 0.3
beta = [5, 0.5]
y_true = np.dot(x, beta)
y = y_true + sig * 1. * np.random.normal(size=nsample)
y[[39,41,43,45,48]] -= 5 # add some outliers (10% of nsample)
# Regression with Robust Linear Model
res = sm.RLM(y, x).fit()
print(res.summary())
Which outputs:
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: RLM Df Residuals: 48
Method: IRLS Df Model: 1
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Mo, 27 Jul 2015
Time: 10:00:00
No. Iterations: 17
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 5.0254 0.091 55.017 0.000 4.846 5.204
x1 0.4845 0.008 61.555 0.000 0.469 0.500
==============================================================================
Since an OLS fit returns the R2, I would suggest regressing the actual values against the fitted values using simple linear regression. Regardless of where the fitted values come from, such an approach would give you an indication of the corresponding R2.
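A minimal sketch of that idea, assuming res, x and y from the snippet above:
import statsmodels.api as sm
# Regress the observed values on the RLM fitted values; the R2 of this OLS fit
# indicates how much of the variation the robust fit captures
ols_on_fitted = sm.OLS(y, sm.add_constant(res.fittedvalues)).fit()
print(ols_on_fitted.rsquared)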
R2 is not a good measure of goodness of fit for RLM models. The problem is that the outliers have a huge effect on the R2 value, to the point where it is completely determined by outliers. Using weighted regression afterwards is an attractive alternative, but it is better to look at the p-values, standard errors and confidence intervals of the estimated coefficients.
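As a rough sketch of that weighted-regression idea, assuming res, x and y from the snippet above (res.weights holds the final IRLS weights from the RLM fit):
import statsmodels.api as sm
# Re-fit by WLS with the robust weights; its R-squared down-weights the outliers
# instead of letting them dominate the measure
wls_res = sm.WLS(y, x, weights=res.weights).fit()
print(wls_res.rsquared)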
Comparing the OLS summary to the RLM summary (results are slightly different from yours due to different randomization):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.726
Model: OLS Adj. R-squared: 0.721
Method: Least Squares F-statistic: 127.4
Date: Wed, 03 Nov 2021 Prob (F-statistic): 4.15e-15
Time: 09:33:40 Log-Likelihood: -87.455
No. Observations: 50 AIC: 178.9
Df Residuals: 48 BIC: 182.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.7071 0.396 14.425 0.000 4.912 6.503
x1 0.3848 0.034 11.288 0.000 0.316 0.453
==============================================================================
Omnibus: 23.499 Durbin-Watson: 2.752
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33.906
Skew: -1.649 Prob(JB): 4.34e-08
Kurtosis: 5.324 Cond. No. 23.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: RLM Df Residuals: 48
Method: IRLS Df Model: 1
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Wed, 03 Nov 2021
Time: 09:34:24
No. Iterations: 17
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 5.1857 0.111 46.590 0.000 4.968 5.404
x1 0.4790 0.010 49.947 0.000 0.460 0.498
==============================================================================
You can see that the standard errors and the size of the confidence intervals decrease going from OLS to RLM for both the intercept and the slope term. This suggests that the estimates are closer to the real values.
Why not use model.predict to obtain the R2? For example:
r2 = 1. - np.sum(np.abs(model.predict(X) - y) ** 2) / np.sum(np.abs(y - np.mean(y)) ** 2)