I am working on a logistic regression model using statsmodels' Logit, and I cannot figure out how to feed interaction terms to the model.
You can use the formula interface and place a colon (:) between the interacting variables inside the formula, for example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
np.random.seed(111)
df = pd.DataFrame(np.random.binomial(1,0.5,(50,3)),columns=['x1','x2','y'])
res1 = smf.logit(formula='y ~ x1 + x2 + x1:x2', data=df).fit()
res1.summary()
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: Logit Df Residuals: 46
Method: MLE Df Model: 3
Date: Thu, 04 Feb 2021 Pseudo R-squ.: 0.02229
Time: 10:03:59 Log-Likelihood: -32.463
converged: True LL-Null: -33.203
Covariance Type: nonrobust LLR p-value: 0.6869
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.9808 0.677 -1.449 0.147 -2.308 0.346
x1 0.4700 0.851 0.552 0.581 -1.199 2.139
x2 0.9808 0.863 1.137 0.256 -0.710 2.671
x1:x2 -1.1632 1.229 -0.946 0.344 -3.572 1.246
==============================================================================
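As a side note, in patsy-style formulas (which the statsmodels formula API uses) writing x1*x2 is shorthand for the main effects plus their interaction, so the same model could also be specified as, for example:
res2 = smf.logit(formula='y ~ x1*x2', data=df).fit()  # expands to x1 + x2 + x1:x2
print(res2.summary())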
I am trying to build a super dictionary which holds a number of lower-level dictionaries within it.
Concept
I have interest rates for my retail bank for the last 12 years and I am trying to model the interest rates by using a portfolio of different bonds.
Regression formula
Y_i - Y_(i-1) = A + B*(X_i - X_(i-1)) + E
In words: the lagged change in Y = alpha + beta * (lagged change in X) + error term
Data
Note: Y = Historic Rate
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100,17)),
                  columns=['Historic Rate', 'Overnight', '1M', '3M', '6M','1Y','2Y','3Y','4Y','5Y','6Y','7Y','8Y','9Y','10Y','12Y','15Y'])
Code thus far
#Import packages required for the analysis
import pandas as pd
import numpy as np
import statsmodels.api as sm
def Simulation(TotalSim,j):
    #super dictionary to hold all iterations of the loop
    Super_fit_d = {}
    for i in range(1,TotalSim):
        #Create a introductory loop to run the first set of regressions
        #Each loop produces a univariate regression
        #Each loop has a fixed lag of i
        fit_d = {} # This will hold all of the fit results and summaries
        for col in [x for x in df.columns if x != 'Historic Rate']:
            Y = df['Historic Rate'] - df['Historic Rate'].shift(1)
            # Need to remove the NaN for fit
            Y = Y[Y.notnull()]
            X = df[col] - df[col].shift(i)
            X = X[X.notnull()]
            #Y now has more observations than X due to lag, drop rows to match
            Y = Y.drop(Y.index[0:i-1])
            if j = 1:
                X = sm.add_constant(X) # Add a constant to the fit
            fit_d[col] = sm.OLS(Y,X).fit()
        #append the dictionary for each lag onto the super dictionary
        Super_fit_d[lag_i] = fit_d

#Check the output for one column
fit_d['Overnight'].summary()
#Check the output for one column in one segment of the super dictionary
Super_fit_d['lag_5'].fit_d['Overnight'].summary()

Simulation(11,1)
Question
I seem to be overwriting my dictionary with every loop and I'm not evaluating the i properly to index the iteration as lag_1, lag_2, lag_3 etc. How do I fix this?
Thanks in advance
There are a couple of issues here:
You sometimes use i and sometimes lag_i, but only i is defined. I changed all of them to lag_i for consistency.
if j = 1 is incorrect syntax. You need if j == 1
You need to return fit_d so that it persists after your loop
Applying those changes, I got it working:
import pandas as pd
import numpy as np
import statsmodels.api as sm
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100,17)),
                  columns=['Historic Rate', 'Overnight', '1M', '3M', '6M','1Y','2Y','3Y','4Y','5Y','6Y','7Y','8Y','9Y','10Y','12Y','15Y'])

def Simulation(TotalSim,j):
    Super_fit_d = {}
    for lag_i in range(1,TotalSim):
        #Create a introductory loop to run the first set of regressions
        #Each loop produces a univariate regression
        #Each loop has a fixed lag of lag_i
        fit_d = {} # This will hold all of the fit results and summaries
        for col in [x for x in df.columns if x != 'Historic Rate']:
            Y = df['Historic Rate'] - df['Historic Rate'].shift(1)
            # Need to remove the NaN for fit
            Y = Y[Y.notnull()]
            X = df[col] - df[col].shift(lag_i)
            X = X[X.notnull()]
            #Y now has more observations than X due to lag, drop rows to match
            Y = Y.drop(Y.index[0:lag_i-1])
            if j == 1:
                X = sm.add_constant(X) # Add a constant to the fit
            fit_d[col] = sm.OLS(Y,X).fit()
        #append the dictionary for each lag onto the super dictionary
        Super_fit_d[lag_i] = fit_d
    return Super_fit_d
test_dict = Simulation(11,1)
First lag
test_dict[1]['Overnight'].summary()
Out[76]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Historic Rate R-squared: 0.042
Model: OLS Adj. R-squared: 0.033
Method: Least Squares F-statistic: 4.303
Date: Fri, 28 Sep 2018 Prob (F-statistic): 0.0407
Time: 11:15:13 Log-Likelihood: -280.39
No. Observations: 99 AIC: 564.8
Df Residuals: 97 BIC: 570.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0048 0.417 -0.012 0.991 -0.833 0.823
Overnight 0.2176 0.105 2.074 0.041 0.009 0.426
==============================================================================
Omnibus: 1.449 Durbin-Watson: 2.756
Prob(Omnibus): 0.485 Jarque-Bera (JB): 1.180
Skew: 0.005 Prob(JB): 0.554
Kurtosis: 2.465 Cond. No. 3.98
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
Second Lag
test_dict[2]['Overnight'].summary()
Out[77]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Historic Rate R-squared: 0.001
Model: OLS Adj. R-squared: -0.010
Method: Least Squares F-statistic: 0.06845
Date: Fri, 28 Sep 2018 Prob (F-statistic): 0.794
Time: 11:15:15 Log-Likelihood: -279.44
No. Observations: 98 AIC: 562.9
Df Residuals: 96 BIC: 568.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0315 0.428 0.074 0.941 -0.817 0.880
Overnight 0.0291 0.111 0.262 0.794 -0.192 0.250
==============================================================================
Omnibus: 2.457 Durbin-Watson: 2.798
Prob(Omnibus): 0.293 Jarque-Bera (JB): 1.735
Skew: 0.115 Prob(JB): 0.420
Kurtosis: 2.391 Cond. No. 3.84
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
I used a logistic regression approach in both programs and am wondering why I am getting different results, especially for the coefficients. The outcome, INFECTION, is binary (1, 0), and Flushed is a continuous variable.
Python:
import statsmodels.api as sm
logit_model=sm.Logit(data['INFECTION'], data['Flushed'])
result=logit_model.fit()
print(result.summary())
Results:
Logit Regression Results
==============================================================================
Dep. Variable: INFECTION No. Observations: 414
Model: Logit Df Residuals: 413
Method: MLE Df Model: 0
Date: Fri, 24 Aug 2018 Pseudo R-squ.: -1.388
Time: 15:47:42 Log-Likelihood: -184.09
converged: True LL-Null: -77.104
LLR p-value: nan
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Flushed -0.6467 0.070 -9.271 0.000 -0.783 -0.510
==============================================================================
R:
mylogit <- glm(INFECTION ~ Flushed, data = cvc, family = "binomial")
summary(mylogit)
Results:
Call:
glm(formula = INFECTION ~ Flushed, family = "binomial", data = cvc)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0598 -0.3107 -0.2487 -0.2224 2.8051
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.91441 0.38639 -10.131 < 2e-16 ***
Flushed 0.22696 0.06049 3.752 0.000175 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You seem to be missing the constant (offset) parameter in the Python logistic model.
Expressed in R's formula syntax, you're fitting two different models:
Python model: INFECTION ~ 0 + Flushed
R model : INFECTION ~ Flushed
To add a constant to the Python model use sm.add_constant(...).
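As a minimal sketch of the fix (assuming data is a pandas DataFrame containing the INFECTION and Flushed columns), either of these variants should reproduce the R coefficients:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Variant 1: explicitly add an intercept column before fitting
X = sm.add_constant(data['Flushed'])
result = sm.Logit(data['INFECTION'], X).fit()
print(result.summary())

# Variant 2: the formula interface adds an intercept by default, like R's glm
result2 = smf.logit('INFECTION ~ Flushed', data=data).fit()
print(result2.summary())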
I've been trying to get the standard errors & p-values using LinearRegression from scikit-learn, but with no success.
I ended up finding this article, but the standard errors & p-values do not match those from the statsmodels.api OLS method.
import numpy as np
from sklearn import datasets
from sklearn import linear_model
import regressor
import statsmodels.api as sm
boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False
X = boston.data[:,which_betas]
y = boston.target
#scikit + regressor stats
ols = linear_model.LinearRegression()
ols.fit(X,y)
xlables = boston.feature_names[which_betas]
regressor.summary(ols, X, y, xlables)
# statsmodel
x2 = sm.add_constant(X)
models = sm.OLS(y,x2)
result = models.fit()
print result.summary()
Output as follows:
Residuals:
Min 1Q Median 3Q Max
-26.3743 -1.9207 0.6648 2.8112 13.3794
Coefficients:
Estimate Std. Error t value p value
_intercept 36.925033 4.915647 7.5117 0.000000
CRIM -0.112227 0.031583 -3.5534 0.000416
ZN 0.047025 0.010705 4.3927 0.000014
INDUS 0.040644 0.055844 0.7278 0.467065
NOX -17.396989 3.591927 -4.8434 0.000002
RM 3.845179 0.272990 14.0854 0.000000
AGE 0.002847 0.009629 0.2957 0.767610
DIS -1.485557 0.180530 -8.2289 0.000000
RAD 0.327895 0.061569 5.3257 0.000000
TAX -0.013751 0.001055 -13.0395 0.000000
PTRATIO -0.991733 0.088994 -11.1438 0.000000
B 0.009827 0.001126 8.7256 0.000000
LSTAT -0.534914 0.042128 -12.6973 0.000000
---
R-squared: 0.73547, Adjusted R-squared: 0.72904
F-statistic: 114.23 on 12 features
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.735
Model: OLS Adj. R-squared: 0.729
Method: Least Squares F-statistic: 114.2
Date: Sun, 21 Aug 2016 Prob (F-statistic): 7.59e-134
Time: 21:56:26 Log-Likelihood: -1503.8
No. Observations: 506 AIC: 3034.
Df Residuals: 493 BIC: 3089.
Df Model: 12
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 36.9250 5.148 7.173 0.000 26.811 47.039
x1 -0.1122 0.033 -3.405 0.001 -0.177 -0.047
x2 0.0470 0.014 3.396 0.001 0.020 0.074
x3 0.0406 0.062 0.659 0.510 -0.081 0.162
x4 -17.3970 3.852 -4.516 0.000 -24.966 -9.828
x5 3.8452 0.421 9.123 0.000 3.017 4.673
x6 0.0028 0.013 0.214 0.831 -0.023 0.029
x7 -1.4856 0.201 -7.383 0.000 -1.881 -1.090
x8 0.3279 0.067 4.928 0.000 0.197 0.459
x9 -0.0138 0.004 -3.651 0.000 -0.021 -0.006
x10 -0.9917 0.131 -7.547 0.000 -1.250 -0.734
x11 0.0098 0.003 3.635 0.000 0.005 0.015
x12 -0.5349 0.051 -10.479 0.000 -0.635 -0.435
==============================================================================
Omnibus: 190.837 Durbin-Watson: 1.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 897.143
Skew: 1.619 Prob(JB): 1.54e-195
Kurtosis: 8.663 Cond. No. 1.51e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
I've also found the following articles
Find p-value (significance) in scikit-learn LinearRegression
http://connor-johnson.com/2014/02/18/linear-regression-with-python/
Neither of the code samples in the SO link compiles.
Here is the code & data I'm working on, but I have not been able to get the standard errors & p-values:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import scipy.linalg
from sklearn.linear_model import LinearRegression
from sklearn import metrics
def readFile(filename, sheetname):
    xlsx = pd.ExcelFile(filename)
    data = xlsx.parse(sheetname, skiprows=1)
    return data

def lr_statsmodel(X,y):
    X = sm.add_constant(X)
    model = sm.OLS(y,X)
    results = model.fit()
    print (results.summary())

def lr_scikit(X,y,featureCols):
    model = LinearRegression()
    results = model.fit(X,y)
    predictions = results.predict(X)
    print 'Coefficients'
    print 'Intercept\t' , results.intercept_
    df = pd.DataFrame(zip(featureCols, results.coef_))
    print df.to_string(index=False, header=False)
    # Query:: The numbers matches with Excel OLS but skeptical about relating score as rsquared
    rSquare = results.score(X,y)
    print '\nR-Square::', rSquare
    # This looks like a better option
    # source: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
    r2 = metrics.r2_score(y,results.predict(X))
    print 'r2', r2
    # Query: No clue at all! http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
    print 'Rsquared?!' , metrics.explained_variance_score(y, results.predict(X))
    # INFO:: All three of them are providing the same figures!
    # Adj-Rsquare formula # https://www.easycalculation.com/statistics/learn-adjustedr2.php
    # In ML, we don't use all of the data for training, and hence its highly unusual to find AdjRsquared. Thus the need for manual calculation
    N = X.shape[0]
    p = X.shape[1]
    adjRsquare = 1 - ((1 - rSquare ) * (N - 1) / (N - p - 1))
    print "Adjusted R-Square::", adjRsquare
    # calculate standard errors
    # mean_absolute_error
    # mean_squared_error
    # median_absolute_error
    # r2_score
    # explained_variance_score
    mse = metrics.mean_squared_error(y,results.predict(X))
    print mse
    print 'Residual Standard Error:', np.sqrt(mse)
    # OLS in Matrix : https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    n = X.shape[0]
    X1 = np.hstack((np.ones((n, 1)), np.matrix(X)))
    se_matrix = scipy.linalg.sqrtm(
        metrics.mean_squared_error(y, results.predict(X)) *
        np.linalg.inv(X1.T * X1)
    )
    print 'se',np.diagonal(se_matrix)
    # https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    # http://regressors.readthedocs.io/en/latest/usage.html
    y_hat = results.predict(X)
    sse = np.sum((y_hat - y) ** 2)
    print 'Standard Square Error of the Model:', sse

if __name__ == '__main__':
    # read file
    fileData = readFile('Linear_regression.xlsx','Input Data')
    # list of independent variables
    feature_cols = ['Price per week','Population of city','Monthly income of riders','Average parking rates per month']
    # build dependent & independent data set
    X = fileData[feature_cols]
    y = fileData['Number of weekly riders']
    # Statsmodel - OLS
    # lr_statsmodel(X,y)
    # ScikitLearn - OLS
    lr_scikit(X,y,feature_cols)
My data-set
Columns: City, Number of weekly riders (Y), Price per week (X1), Population of city (X2), Monthly income of riders (X3), Average parking rates per month (X4)
1 1,92,000 $15 18,00,000 $5,800 $50
2 1,90,400 $15 17,90,000 $6,200 $50
3 1,91,200 $15 17,80,000 $6,400 $60
4 1,77,600 $25 17,78,000 $6,500 $60
5 1,76,800 $25 17,50,000 $6,550 $60
6 1,78,400 $25 17,40,000 $6,580 $70
7 1,80,800 $25 17,25,000 $8,200 $75
8 1,75,200 $30 17,25,000 $8,600 $75
9 1,74,400 $30 17,20,000 $8,800 $75
10 1,73,920 $30 17,05,000 $9,200 $80
11 1,72,800 $30 17,10,000 $9,630 $80
12 1,63,200 $40 17,00,000 $10,570 $80
13 1,61,600 $40 16,95,000 $11,330 $85
14 1,61,600 $40 16,95,000 $11,600 $100
15 1,60,800 $40 16,90,000 $11,800 $105
16 1,59,200 $40 16,30,000 $11,830 $105
17 1,48,800 $65 16,40,000 $12,650 $105
18 1,15,696 $102 16,35,000 $13,000 $110
19 1,47,200 $75 16,30,000 $13,224 $125
20 1,50,400 $75 16,20,000 $13,766 $130
21 1,52,000 $75 16,15,000 $14,010 $150
22 1,36,000 $80 16,05,000 $14,468 $155
23 1,26,240 $86 15,90,000 $15,000 $165
24 1,23,888 $98 15,95,000 $15,200 $175
25 1,26,080 $87 15,90,000 $15,600 $175
26 1,51,680 $77 16,00,000 $16,000 $190
27 1,52,800 $63 16,10,000 $16,200 $200
I've exhausted all the options I could make sense of, so any guidance on how to compute standard errors & p-values that match those from statsmodels.api would be appreciated.
EDIT: I'm trying to find the standard errors & p-values for the intercept and all of the independent variables.
Here reg is the output of sklearn's linear regression fit method.
To calculate adjusted R2:
def adjustedR2(x, y, reg):
    r2 = reg.score(x,y)
    n = x.shape[0]
    p = x.shape[1]
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    return adjusted_r2
And for p-values:
from sklearn.feature_selection import f_regression
freg=f_regression(x,y)
p=freg[1]
print(p.round(3))
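Note that f_regression tests each feature in a separate univariate regression, so its p-values will generally not match the multivariate OLS t-tests from statsmodels. If you specifically want standard errors and p-values that agree with the statsmodels summary, one option is to compute them from the classical OLS formulas yourself; the key detail is that the residual variance must use n - p - 1 degrees of freedom, whereas metrics.mean_squared_error divides by n. A minimal sketch (assuming X is an n x p array or DataFrame of predictors and y the response; ols_inference is a hypothetical helper name):
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression

def ols_inference(X, y):
    # Classical (nonrobust) OLS standard errors, t-stats and p-values, incl. intercept
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    reg = LinearRegression().fit(X, y)
    X1 = np.column_stack([np.ones(n), X])        # design matrix with intercept column
    beta = np.r_[reg.intercept_, reg.coef_]
    resid = y - reg.predict(X)
    sigma2 = resid @ resid / (n - p - 1)         # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X1.T @ X1)
    se = np.sqrt(np.diag(cov))
    t = beta / se
    pvals = 2 * stats.t.sf(np.abs(t), df=n - p - 1)
    return pd.DataFrame({'coef': beta, 'std err': se, 't': t, 'P>|t|': pvals})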
When it comes to measuring goodness of fit, R-squared seems to be a commonly understood (and accepted) measure for "simple" linear models.
But in the case of statsmodels (as well as other statistical software), RLM does not include R-squared together with the regression results.
Is there a way to get it calculated "manually", perhaps in a way similar to how it is done in Stata?
Or is there another measure that can be used / calculated from the results produced by sm.RLM?
This is what Statsmodels is producing:
import numpy as np
import statsmodels.api as sm
# Sample Data with outliers
nsample = 50
x = np.linspace(0, 20, nsample)
x = sm.add_constant(x)
sig = 0.3
beta = [5, 0.5]
y_true = np.dot(x, beta)
y = y_true + sig * 1. * np.random.normal(size=nsample)
y[[39,41,43,45,48]] -= 5 # add some outliers (10% of nsample)
# Regression with Robust Linear Model
res = sm.RLM(y, x).fit()
print(res.summary())
Which outputs:
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: RLM Df Residuals: 48
Method: IRLS Df Model: 1
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Mo, 27 Jul 2015
Time: 10:00:00
No. Iterations: 17
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 5.0254 0.091 55.017 0.000 4.846 5.204
x1 0.4845 0.008 61.555 0.000 0.469 0.500
==============================================================================
Since OLS returns the R2, I would suggest regressing the actual values against the fitted values using simple linear regression. Regardless of where the fitted values come from, such an approach would give you an indication of the corresponding R2.
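A minimal sketch of that idea, reusing res and y from the question's snippet: regress the observed values on the robust fitted values with a plain OLS and read off its R-squared as a rough goodness-of-fit indication.
aux = sm.OLS(y, sm.add_constant(res.fittedvalues)).fit()
print(aux.rsquared)  # R2 of actual vs. RLM-fitted values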
R2 is not a good measure of goodness of fit for RLM models. The problem is that the outliers have a huge effect on the R2 value, to the point where it is completely determined by outliers. Using weighted regression afterwards is an attractive alternative, but it is better to look at the p-values, standard errors and confidence intervals of the estimated coefficients.
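Those coefficient-level quantities are available directly on the fitted results object; a small sketch, assuming res is the RLM fit from the question:
print(res.params)      # estimated coefficients
print(res.bse)         # standard errors
print(res.pvalues)     # two-sided p-values (normal-based for RLM)
print(res.conf_int())  # 95% confidence intervals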
Comparing the OLS summary to RLM (results are slightly different to yours due to different randomization):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.726
Model: OLS Adj. R-squared: 0.721
Method: Least Squares F-statistic: 127.4
Date: Wed, 03 Nov 2021 Prob (F-statistic): 4.15e-15
Time: 09:33:40 Log-Likelihood: -87.455
No. Observations: 50 AIC: 178.9
Df Residuals: 48 BIC: 182.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.7071 0.396 14.425 0.000 4.912 6.503
x1 0.3848 0.034 11.288 0.000 0.316 0.453
==============================================================================
Omnibus: 23.499 Durbin-Watson: 2.752
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33.906
Skew: -1.649 Prob(JB): 4.34e-08
Kurtosis: 5.324 Cond. No. 23.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 50
Model: RLM Df Residuals: 48
Method: IRLS Df Model: 1
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Wed, 03 Nov 2021
Time: 09:34:24
No. Iterations: 17
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 5.1857 0.111 46.590 0.000 4.968 5.404
x1 0.4790 0.010 49.947 0.000 0.460 0.498
==============================================================================
You can see that the standard errors and size of the confidence interval decreases in going from OLS to RLM for both the intercept and the slope term. This suggests that the estimates are closer to the real values.
Why not use model.predict to obtain the R2? For example:
r2=1. - np.sum(np.abs(model.predict(X) - y) **2) / np.sum(np.abs(y - np.mean(y)) ** 2)
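Applied to the RLM fit from the question (a sketch reusing res, x and y from the snippet above):
y_hat = res.predict(x)  # robust-fit predictions
r2 = 1. - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(r2)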