I am working on Logistic regression model and I am using statsmodels api's logit. I am unable to figure out how to feed interaction terms to the model.
You can use the formula interface, and use the colon,: , inside the formula, for example :
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas
np.random.seed(111)
df = pd.DataFrame(np.random.binomial(1,0.5,(50,3)),columns=['x1','x2','y'])
res1 = smf.logit(formula='y ~ x1 + x2 + x1:x2', data=df).fit()
res1.summary()
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: Logit Df Residuals: 46
Method: MLE Df Model: 3
Date: Thu, 04 Feb 2021 Pseudo R-squ.: 0.02229
Time: 10:03:59 Log-Likelihood: -32.463
converged: True LL-Null: -33.203
Covariance Type: nonrobust LLR p-value: 0.6869
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.9808 0.677 -1.449 0.147 -2.308 0.346
x1 0.4700 0.851 0.552 0.581 -1.199 2.139
x2 0.9808 0.863 1.137 0.256 -0.710 2.671
x1:x2 -1.1632 1.229 -0.946 0.344 -3.572 1.246
==============================================================================
Related
I am trying to perform multiple linear regression using the statsmodels.formula.api package in python and have listed the code that i have used to perform this regression below.
auto_1= pd.read_csv("Auto.csv")
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())
The data consists the following variables - mpg, cylinders, displacement, horsepower, weight , acceleration, year, origin and name. When the print result comes up, it shows multiple rows of the horsepower column and the regression results are also not correct. Im not sure why?
screenshot of repeated rows
It's likely because of the data type of the horsepower column. If its values are categories or just strings, the model will use treatment (dummy) coding for them by default, producing the results you are seeing. Check the data type (run auto_1.dtypes) and cast the column to a numeric type (it's best to do it when you are first reading the csv file with the dtype= parameter of the read_csv() method.
Here is an example where a column with numeric values is cast (i.e. converted) to strings (or categories):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
df = pd.DataFrame(
{
'mpg': np.random.randint(20, 40, 50),
'horsepower': np.random.randint(100, 200, 50)
}
)
# convert integers to strings (or categories)
df['horsepower'] = (
df['horsepower'].astype('str') # same result with .astype('category')
)
formula = 'mpg ~ horsepower'
results = smf.ols(formula, df).fit()
print(results.summary())
Output (dummy coding):
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.778
Model: OLS Adj. R-squared: -0.207
Method: Least Squares F-statistic: 0.7901
Date: Sun, 18 Sep 2022 Prob (F-statistic): 0.715
Time: 20:17:51 Log-Likelihood: -110.27
No. Observations: 50 AIC: 302.5
Df Residuals: 9 BIC: 380.9
Df Model: 40
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 32.0000 5.175 6.184 0.000 20.294 43.706
horsepower[T.103] -4.0000 7.318 -0.547 0.598 -20.555 12.555
horsepower[T.112] -1.0000 7.318 -0.137 0.894 -17.555 15.555
horsepower[T.116] -9.0000 7.318 -1.230 0.250 -25.555 7.555
horsepower[T.117] 6.0000 7.318 0.820 0.433 -10.555 22.555
horsepower[T.118] 2.0000 7.318 0.273 0.791 -14.555 18.555
horsepower[T.120] -4.0000 6.338 -0.631 0.544 -18.337 10.337
etc.
Now, converting the strings back to integers:
df['horsepower'] = pd.to_numeric(df.horsepower)
# or df['horsepower'] = df['horsepower'].astype('int')
results = smf.ols(formula, df).fit()
print(results.summary())
Output (as expected):
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.011
Model: OLS Adj. R-squared: -0.010
Method: Least Squares F-statistic: 0.5388
Date: Sun, 18 Sep 2022 Prob (F-statistic): 0.466
Time: 20:24:54 Log-Likelihood: -147.65
No. Observations: 50 AIC: 299.3
Df Residuals: 48 BIC: 303.1
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 31.7638 3.663 8.671 0.000 24.398 39.129
horsepower -0.0176 0.024 -0.734 0.466 -0.066 0.031
==============================================================================
Omnibus: 3.529 Durbin-Watson: 1.859
Prob(Omnibus): 0.171 Jarque-Bera (JB): 1.725
Skew: 0.068 Prob(JB): 0.422
Kurtosis: 2.100 Cond. No. 834.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
I am currently trying to reproduce a regression model eq. (3) (edit: fixed link) in python using statsmodels. As this model is no part of the standard models provided by statsmodels I clearly have to write it myself using the provided formula api.
Since I have never worked with the formula api (or patsy for that matter), I wanted to start and verify my approach by reproducing standard models with the formula api and a generalized linear model. My code and the results for a poisson regression are given below at the end of my question.
You will see that it predicts the parameters beta = (2, -3, 1) for all three models with good accuracy. However, I have a couple of questions:
How do I explicitly add covariates to the glm model with a
regression coefficient equal to 1?
From what I understand, a poisson regression in general has the shape ln(counts) = exp(intercept + beta * x + log(exposure)), i.e. the exposure is added through a fixed constant of value 1. I would like to reproduce this behaviour in my glm model, i.e. I want something like ln(counts) = exp(intercept + beta * x + k * log(exposure)) where k is a fixed constant as a formula.
Simply using formula = "1 + x1 + x2 + x3 + np.log(exposure)" returns a perfect seperation error (why?). I can bypass that by adding some random noise to y, but in that case np.log(exposure) has a non-unity regression coefficient, i.e. it is treated as a normal regression covariate.
Apparently both built-in models 1 and 2 have no intercept, eventhough I tried to explicitly add one in model 2. Or is there a hidden intercept that is simply not reported? In either case, how do I fix that?
Any help would be greatly appreciated, so thanks in advance!
import numpy as np
import pandas as pd
np.random.seed(1+8+2022)
# Number of random samples
n = 4000
# Intercept
alpha = 1
# Regression Coefficients
beta = np.asarray([2.0,-3.0,1.0])
# Random Data
data = {
"x1" : np.random.normal(1.00,0.10, size = n),
"x2" : np.random.normal(1.50,0.15, size = n),
"x3" : np.random.normal(-2.0,0.20, size = n),
"exposure": np.random.poisson(14, size = n),
}
# Calculate the response
x = np.asarray([data["x1"], data["x2"] , data["x3"]]).T
t = np.asarray(data["exposure"])
# Add response to random data
data["y"] = np.exp(alpha + np.dot(x,beta) + np.log(t))
# Convert dict to df
data = pd.DataFrame(data)
print(data)
#-----------------------------------------------------
# My Model
#-----------------------------------------------------
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "y ~ x1 + x2 + x3"
model = smf.glm(formula=formula, data=data, family=sm.families.Poisson()).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 1
#-----------------------------------------------------
import statsmodels.api as sm
data["offset"] = np.ones(n)
model = sm.Poisson( endog = data["y"],
exog = data[["x1", "x2", "x3"]],
exposure = data["exposure"],
offset = data["offset"]).fit()
print(model.summary())
#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 2
#-----------------------------------------------------
import statsmodels.api as sm
data["x1"] = sm.add_constant(data["x1"])
model = sm.Poisson( endog = data["y"],
exog = data[["x1", "x2", "x3"]],
exposure = data["exposure"]).fit()
print(model.summary())
RESULTS:
x1 x2 x3 exposure y
0 1.151771 1.577677 -1.811903 13 0.508422
1 0.897012 1.678311 -2.327583 22 0.228219
2 1.040250 1.471962 -1.705458 13 0.621328
3 0.866195 1.512472 -1.766108 17 0.478107
4 0.925470 1.399320 -1.886349 13 0.512518
... ... ... ... ... ...
3995 1.073945 1.365260 -1.755071 12 0.804081
3996 0.855000 1.251951 -2.173843 11 0.439639
3997 0.892066 1.710856 -2.183085 10 0.107643
3998 0.763777 1.538938 -2.013619 22 0.363551
3999 1.056958 1.413922 -1.722252 19 1.098932
[4000 rows x 5 columns]
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: GLM Df Residuals: 3996
Model Family: Poisson Df Model: 3
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -2743.7
Date: Sat, 08 Jan 2022 Deviance: 141.11
Time: 09:32:32 Pearson chi2: 140.
No. Iterations: 4
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.6857 0.378 9.755 0.000 2.945 4.426
x1 2.0020 0.227 8.800 0.000 1.556 2.448
x2 -3.0393 0.148 -20.604 0.000 -3.328 -2.750
x3 0.9937 0.114 8.719 0.000 0.770 1.217
==============================================================================
Optimization terminated successfully.
Current function value: 0.668293
Iterations 10
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.09462
Time: 09:32:32 Log-Likelihood: -2673.2
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 4.619e-122
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.0000 0.184 10.875 0.000 1.640 2.360
x2 -3.0000 0.124 -24.160 0.000 -3.243 -2.757
x3 1.0000 0.094 10.667 0.000 0.816 1.184
==============================================================================
Optimization terminated successfully.
Current function value: 0.677893
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Poisson Df Residuals: 3997
Method: MLE Df Model: 2
Date: Sat, 08 Jan 2022 Pseudo R-squ.: 0.08162
Time: 09:32:32 Log-Likelihood: -2711.6
converged: True LL-Null: -2952.6
Covariance Type: nonrobust LLR p-value: 2.196e-105
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 2.9516 0.304 9.711 0.000 2.356 3.547
x2 -2.9801 0.147 -20.275 0.000 -3.268 -2.692
x3 0.9807 0.113 8.655 0.000 0.759 1.203
==============================================================================
Process return 0 (0x0)
Press Enter to continue...
I'm trying to do a linear regression to predict the count of a dataframe based on the count of another dataframe. I am using statsmodels. I have tried the following:
X = df1.count()
Y = df2.count()
from statsmodels.formula.api import ols
fit = ols(Y ~ X, data=kak).fit()
fit.summary()
Using the X and Y variables in the OLS formula is not allowed and I have no idea what to fill in at the data= keyword argument. How would I go about doing this?
Let's assume you have to 1D arrays X and Y:
import numpy as np
X = np.arange(100)
Y = 2*X + 5
Then you can run a linear regression using the line below. There are two important things:
The first argument to ols is a str containing a formula. "Y ~ X" and not Y ~ X (note the double quotes).
The second argument data can be any Python object as long as it has keys data["X"] and data["Y"]. I wrote a dict here, but it would also work with a DataFrame. It basically allows statsmodels to understand who are X and Y in the formula you gave to it.
ols("Y ~ X", {"X": X, "Y": Y}).fit().summary()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.916e+33
Date: Fri, 21 May 2021 Prob (F-statistic): 0.00
Time: 14:01:48 Log-Likelihood: 3055.1
No. Observations: 100 AIC: -6106.
Df Residuals: 98 BIC: -6101.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.0000 2.62e-15 1.91e+15 0.000 5.000 5.000
X 2.0000 4.57e-17 4.38e+16 0.000 2.000 2.000
==============================================================================
Omnibus: 220.067 Durbin-Watson: 0.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9.816
Skew: 0.159 Prob(JB): 0.00739
Kurtosis: 1.498 Cond. No. 114.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
When it comes to measuring goodness of fit - R-Squared seems to be a commonly understood (and accepted) measure for "simple" linear models.
But in case of statsmodels (as well as other statistical software) RLM does not include R-squared together with regression results.
Is there a way to get it calculated "manually", perhaps in a way similar to how it is done in Stata?
Or is there another measure that can be used / calculated from the results produced by sm.RLS?
This is what Statsmodels is producing:
import numpy as np
import statsmodels.api as sm
# Sample Data with outliers
nsample = 50
x = np.linspace(0, 20, nsample)
x = sm.add_constant(x)
sig = 0.3
beta = [5, 0.5]
y_true = np.dot(x, beta)
y = y_true + sig * 1. * np.random.normal(size=nsample)
y[[39,41,43,45,48]] -= 5 # add some outliers (10% of nsample)
# Regression with Robust Linear Model
res = sm.RLM(y, x).fit()
print(res.summary())
Which outputs:
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: RLM Df Residuals: 48
Method: IRLS Df Model: 1
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Mo, 27 Jul 2015
Time: 10:00:00
No. Iterations: 17
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 5.0254 0.091 55.017 0.000 4.846 5.204
x1 0.4845 0.008 61.555 0.000 0.469 0.500
==============================================================================
Since an OLS return the R2, I would suggest regressing the actual values against the fitted values using simple linear regression. Regardless where the fitted values come from, such an approach would provide you an indication of the corresponding R2.
R2 is not a good measure of goodness of fit for RLM models. The problem is that the outliers have a huge effect on the R2 value, to the point where it is completely determined by outliers. Using weighted regression afterwards is an attractive alternative, but it is better to look at the p-values, standard errors and confidence intervals of the estimated coefficients.
Comparing the OLS summary to RLM (results are slightly different to yours due to different randomization):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.726
Model: OLS Adj. R-squared: 0.721
Method: Least Squares F-statistic: 127.4
Date: Wed, 03 Nov 2021 Prob (F-statistic): 4.15e-15
Time: 09:33:40 Log-Likelihood: -87.455
No. Observations: 50 AIC: 178.9
Df Residuals: 48 BIC: 182.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.7071 0.396 14.425 0.000 4.912 6.503
x1 0.3848 0.034 11.288 0.000 0.316 0.453
==============================================================================
Omnibus: 23.499 Durbin-Watson: 2.752
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33.906
Skew: -1.649 Prob(JB): 4.34e-08
Kurtosis: 5.324 Cond. No. 23.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: RLM Df Residuals: 48
Method: IRLS Df Model: 1
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Wed, 03 Nov 2021
Time: 09:34:24
No. Iterations: 17
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 5.1857 0.111 46.590 0.000 4.968 5.404
x1 0.4790 0.010 49.947 0.000 0.460 0.498
==============================================================================
If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .
You can see that the standard errors and size of the confidence interval decreases in going from OLS to RLM for both the intercept and the slope term. This suggests that the estimates are closer to the real values.
Why not use model.predict to obtain the r2? For Example:
r2=1. - np.sum(np.abs(model.predict(X) - y) **2) / np.sum(np.abs(y - np.mean(y)) ** 2)
Trying to do logistic regression through pandas and statsmodels. Don't know why I'm getting an error or how to fix it.
import pandas as pd
import statsmodels.api as sm
x = [1, 3, 5, 6, 8]
y = [0, 1, 0, 1, 1]
d = { "x": pd.Series(x), "y": pd.Series(y)}
df = pd.DataFrame(d)
model = "y ~ x"
glm = sm.Logit(model, df=df).fit()
ERROR:
Traceback (most recent call last):
File "regress.py", line 45, in <module>
glm = sm.Logit(model, df=df).fit()
TypeError: __init__() takes exactly 3 arguments (2 given)
You can't pass a formula to Logit. Do:
In [82]: import patsy
In [83]: f = 'y ~ x'
In [84]: y, X = patsy.dmatrices(f, df, return_type='dataframe')
In [85]: sm.Logit(y, X).fit().summary()
Optimization terminated successfully.
Current function value: 0.511631
Iterations 6
Out[85]:
<class 'statsmodels.iolib.summary.Summary'>
"""
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 5
Model: Logit Df Residuals: 3
Method: MLE Df Model: 1
Date: Fri, 30 Aug 2013 Pseudo R-squ.: 0.2398
Time: 16:56:38 Log-Likelihood: -2.5582
converged: True LL-Null: -3.3651
LLR p-value: 0.2040
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -2.0544 2.452 -0.838 0.402 -6.861 2.752
x 0.5672 0.528 1.073 0.283 -0.468 1.603
==============================================================================
"""
This is pretty much straight from the docs on how to do exactly what you're asking.
EDIT: You can also use the formula API, as suggested by #user333700:
In [22]: print sm.formula.logit(model, data=df).fit().summary()
Optimization terminated successfully.
Current function value: 0.511631
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 5
Model: Logit Df Residuals: 3
Method: MLE Df Model: 1
Date: Fri, 30 Aug 2013 Pseudo R-squ.: 0.2398
Time: 18:14:26 Log-Likelihood: -2.5582
converged: True LL-Null: -3.3651
LLR p-value: 0.2040
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -2.0544 2.452 -0.838 0.402 -6.861 2.752
x 0.5672 0.528 1.073 0.283 -0.468 1.603
==============================================================================
You can pass a formula directly in Logit too.
Logit.from_formula('y ~ x',data=data).fit()