VIF by coef in OLS Regression Results Python - python

I am trying to print the VIF (variance inflation factor) for each coefficient. However, I can't find any statsmodels documentation showing how to do this. I have a model with n variables to process, and a single multicollinearity value for the whole model doesn't help me remove the variables with the highest collinearity.
This looks like an answer:
https://stats.stackexchange.com/questions/155028/how-to-systematically-remove-collinear-variables-in-python
But how would I run it against this dataset?
http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv
Below is the code and the summary output, which is where I am now.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# read data into a DataFrame
data = pd.read_csv('somepath', index_col=0)
print(data.head())
#multiregression
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
print(lm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 570.3
Date: Wed, 15 Feb 2017 Prob (F-statistic): 1.58e-96
Time: 13:28:29 Log-Likelihood: -386.18
No. Observations: 200 AIC: 780.4
Df Residuals: 196 BIC: 793.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
Radio 0.1885 0.009 21.893 0.000 0.172 0.206
Newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
==============================================================================
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 454.
==============================================================================

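To get a list of VIFs, one per column of the design matrix (the intercept included):
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
# design matrix of the fitted model; its columns are in the same order as the coef rows in the summary
variables = lm.model.exog
vif = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
vif
To get their mean:
np.array(vif).mean()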
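To see each VIF next to the coefficient it belongs to, the values can be labeled with the model's column names. This is a minimal sketch, assuming synthetic data in place of Advertising.csv (the generated numbers are made up for illustration only). It relies on lm.model.exog_names listing the names in the same order as the columns of lm.model.exog.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in for the Advertising data (assumption, not the real file)
rng = np.random.default_rng(0)
n = 200
tv = rng.normal(150, 80, n)
radio = rng.normal(23, 15, n)
newspaper = 0.5 * radio + rng.normal(20, 15, n)  # deliberately correlated with Radio
sales = 3 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.7, n)
data = pd.DataFrame({"Sales": sales, "TV": tv, "Radio": radio, "Newspaper": newspaper})

lm = smf.ols(formula="Sales ~ TV + Radio + Newspaper", data=data).fit()

# One VIF per design-matrix column, indexed by the name shown in the summary
exog = lm.model.exog
vif_by_coef = pd.Series(
    [variance_inflation_factor(exog, i) for i in range(exog.shape[1])],
    index=lm.model.exog_names,
    name="VIF",
)

# The Intercept entry is usually not of interest, so drop it before ranking
print(vif_by_coef.drop("Intercept").sort_values(ascending=False))
```

The variable at the top of the sorted output is the first candidate to drop. After removing it, refit the model and recompute the VIFs.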

Related

Estimating multiple parameters of a model in python

Wondering what's the most efficient/accurate way to estimate these parameters (a0, a1, a2, a3) with Python in the model:
col_4 = a0 + a1*col_1 + a2*col_2 + a3*col_3
The sample dataset would be:
inputs = {
'col_1': np.random.normal(15,2,100),
'col_2': np.random.normal(15,1,100),
'col_3': np.random.normal(0.9,1,100),
'col_4': np.random.normal(-0.05,0.5,100),
}
_idx = pd.date_range('2021-01-01','2021-04-10',freq='D').to_series()
data = pd.DataFrame(inputs, index = _idx)
statsmodels provides a pretty simple way to estimate linear models like that:
import statsmodels.formula.api as smf
results = smf.ols('col_4 ~ col_1 + col_2 + col_3', data=data).fit()
print(results.summary())
The coef column shows your aX parameters:
OLS Regression Results
==============================================================================
Dep. Variable: col_4 R-squared: 0.049
Model: OLS Adj. R-squared: 0.019
Method: Least Squares F-statistic: 1.637
Date: Wed, 29 Dec 2021 Prob (F-statistic): 0.186
Time: 17:25:00 Log-Likelihood: -68.490
No. Observations: 100 AIC: 145.0
Df Residuals: 96 BIC: 155.4
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.2191 0.846 0.259 0.796 -1.461 1.899
col_1 -0.0198 0.023 -0.854 0.395 -0.066 0.026
col_2 -0.0048 0.051 -0.093 0.926 -0.107 0.097
col_3 0.1155 0.056 2.066 0.042 0.005 0.226
==============================================================================
Omnibus: 2.292 Durbin-Watson: 2.291
Prob(Omnibus): 0.318 Jarque-Bera (JB): 2.296
Skew: -0.351 Prob(JB): 0.317
Kurtosis: 2.757 Cond. No. 370.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
That includes the intercept (a0) by default. If you want to remove it, just add a -1 to the formula, e.g. 'col_4 ~ col_1 + col_2 + col_3 - 1'.
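For comparison, here is a small sketch that fits the same kind of model with and without the intercept. The data is randomly generated as in the question (with a fixed seed added here as an assumption, for reproducibility), so the estimates themselves carry no meaning.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Random data shaped like the question's sample (the seed is an added assumption)
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "col_1": rng.normal(15, 2, 100),
    "col_2": rng.normal(15, 1, 100),
    "col_3": rng.normal(0.9, 1, 100),
    "col_4": rng.normal(-0.05, 0.5, 100),
})

# Default formula: an Intercept term (a0) is added automatically
with_intercept = smf.ols("col_4 ~ col_1 + col_2 + col_3", data=data).fit()

# "- 1" removes the intercept, leaving only a1, a2, a3
no_intercept = smf.ols("col_4 ~ col_1 + col_2 + col_3 - 1", data=data).fit()

print(with_intercept.params)  # pandas Series indexed by term name
print(no_intercept.params)
```

Both params attributes are pandas Series, so a single estimate can be read by name, for example with_intercept.params["col_1"].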

lm in R vs statsmodels.api OLS in Python

I get completely different results with the same datasets in R and Python. I cannot understand why it happens.
R:
library(RcppCNPy)
d <- npyLoad("/home/vvkovalchuk/bin/src/python/asks1.npy")
datas = npyLoad('/home/vvkovalchuk/bin/src/python/bids2.npy')
m <- lm(d ~ datas)
summary(m)
Python:
import time
import numpy
import statsmodels.api as sm
from math import log
Y = numpy.load('./asks1.npy', allow_pickle=True)
X = numpy.load('./bids2.npy', allow_pickle=True)
X3 = sm.add_constant(X)
res_ols = sm.OLS(Y, X3).fit()
print(res_ols.params)
What am I doing wrong?
Results:
R:
Call:
lm(formula = d ~ datas)
Residuals:
Min 1Q Median 3Q Max
-6.089e+06 8.797e+07 2.163e+08 2.179e+08 1.122e+10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.561e+00 2.253e+06 0 1
datas 3.809e+03 2.164e+09 0 1
Residual standard error: 208100000 on 14639 degrees of freedom
Multiple R-squared: 0.2735, Adjusted R-squared: 0.2735
F-statistic: 5512 on 1 and 14639 DF, p-value: < 2.2e-16
Python:
OLS Regression Results
Dep. Variable: y R-squared: 0.112
Model: OLS Adj. R-squared: 0.112
Method: Least Squares F-statistic: 1846.
Date: Thu, 25 Mar 2021 Prob (F-statistic): 0.00
Time: 13:08:43 Log-Likelihood: 1.6948e+05
No. Observations: 14641 AIC: -3.390e+05
Df Residuals: 14639 BIC: -3.389e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0004 3.07e-06 126.136 0.000 0.000 0.000
x1 0.1478 0.003 42.969 0.000 0.141 0.155
Omnibus: 3251.130 Durbin-Watson: 0.004
Prob(Omnibus): 0.000 Jarque-Bera (JB): 14606.605
Skew: 1.019 Prob(JB): 0.00
Kurtosis: 7.449 Cond. No. 1.83e+05
I also tried swapping the arguments in the OLS function, but I still get incorrect results. Could this be related to NAs?

Multiple linear regression get best fit

I am doing some multiple linear regression with the following code:
import pandas as pd
import statsmodels.formula.api as sm
# Output and Input are existing DataFrames
df = pd.DataFrame({"A":Output['10'],
"B":Input['Var1'],
"G":Input['Var2'],
"I":Input['Var3'],
"J":Input['Var4']})
res = sm.ols(formula="A ~ B + G + I + J", data=df).fit()
print(res.summary())
With the following result:
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.562
Model: OLS Adj. R-squared: 0.562
Method: Least Squares F-statistic: 2235.
Date: Tue, 06 Nov 2018 Prob (F-statistic): 0.00
Time: 09:48:20 Log-Likelihood: -21233.
No. Observations: 6961 AIC: 4.248e+04
Df Residuals: 6956 BIC: 4.251e+04
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 21.8504 0.448 48.760 0.000 20.972 22.729
B 1.8353 0.022 84.172 0.000 1.793 1.878
G 0.0032 0.004 0.742 0.458 -0.005 0.012
I -0.0210 0.009 -2.224 0.026 -0.039 -0.002
J 0.6677 0.061 10.868 0.000 0.547 0.788
==============================================================================
Omnibus: 2152.474 Durbin-Watson: 0.308
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5077.082
Skew: -1.773 Prob(JB): 0.00
Kurtosis: 5.221 Cond. No. 555.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
However, my Output dataframe consists of multiple columns, from 1 to 149. Is there a way to loop over all 149 columns in the Output dataframe and, at the end, show the best and worst fits by, for example, R-squared? Or get the largest coef for variable B?
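One possible approach is to fit one model per Output column and collect R-squared and the coefficient of B in a small table. This is only a sketch under assumptions: Input and Output below are synthetic frames with three output columns, standing in for the real data, and fits is a hypothetical name for the results table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-ins for the Input and Output DataFrames (assumption)
rng = np.random.default_rng(0)
n = 300
Input = pd.DataFrame(rng.normal(size=(n, 4)), columns=["Var1", "Var2", "Var3", "Var4"])
slopes = {"1": 0.5, "2": 2.0, "3": 8.0}
Output = pd.DataFrame({c: s * Input["Var1"] + rng.normal(size=n) for c, s in slopes.items()})

rows = []
for col in Output.columns:
    df = pd.DataFrame({
        "A": Output[col],
        "B": Input["Var1"],
        "G": Input["Var2"],
        "I": Input["Var3"],
        "J": Input["Var4"],
    })
    res = smf.ols(formula="A ~ B + G + I + J", data=df).fit()
    # Keep only the numbers of interest instead of the full summary
    rows.append({"column": col, "rsquared": res.rsquared, "coef_B": res.params["B"]})

fits = pd.DataFrame(rows).set_index("column")
print(fits.sort_values("rsquared", ascending=False))
print("best fit:", fits["rsquared"].idxmax(), "| worst fit:", fits["rsquared"].idxmin())
print("largest coef for B:", fits["coef_B"].idxmax())
```

With the real data, the loop would run over all 149 columns of Output unchanged. Only the synthetic setup at the top would be removed.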

Preserve variable names in summary from statsmodels

I am using OLS from statsmodels, following the example at https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html
#USD
X = sm.add_constant(USD)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.265
Model: OLS Adj. R-squared: 0.265
Method: Least Squares F-statistic: 352.4
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.35e-67
Time: 17:30:24 Log-Likelihood: -8018.8
No. Observations: 977 AIC: 1.604e+04
Df Residuals: 975 BIC: 1.605e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1843.1414 149.675 12.314 0.000 1549.418 2136.864
USD 3512.5040 187.111 18.772 0.000 3145.318 3879.690
==============================================================================
Omnibus: 276.458 Durbin-Watson: 0.009
Prob(Omnibus): 0.000 Jarque-Bera (JB): 74.633
Skew: 0.438 Prob(JB): 6.22e-17
Kurtosis: 1.967 Cond. No. 10.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified
You can see that X is shown as USD in the summary, which is what I want.
However, after adding a new variable:
#JPY + USD
X = sm.add_constant(JPY)
X = np.column_stack((X, USD))
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.641
Model: OLS Adj. R-squared: 0.640
Method: Least Squares F-statistic: 868.8
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.80e-217
Time: 17:39:19 Log-Likelihood: -7669.4
No. Observations: 977 AIC: 1.534e+04
Df Residuals: 974 BIC: 1.536e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1559.5880 149.478 -10.434 0.000 -1852.923 -1266.253
x1 78.6589 2.466 31.902 0.000 73.820 83.497
x2 -366.5850 178.672 -2.052 0.040 -717.211 -15.958
==============================================================================
Omnibus: 24.957 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.278
Skew: 0.353 Prob(JB): 1.19e-06
Kurtosis: 3.415 Cond. No. 743.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
It is not showing USD and JPY, but x1 and x2. Is there a way to fix it? I searched on Google but found nothing.
Since my question is only about how the names are displayed, keeping the column headers solves the problem. I am posting my solution in case someone else has the same problem.
#JPY + USD
X = JPY.join(USD)
X = sm.add_constant(X)
#X = np.column_stack((X, USD))
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.641
Model: OLS Adj. R-squared: 0.640
Method: Least Squares F-statistic: 868.8
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.80e-217
Time: 22:51:43 Log-Likelihood: -7669.4
No. Observations: 977 AIC: 1.534e+04
Df Residuals: 974 BIC: 1.536e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1559.5880 149.478 -10.434 0.000 -1852.923 -1266.253
JPY 78.6589 2.466 31.902 0.000 73.820 83.497
USD -366.5850 178.672 -2.052 0.040 -717.211 -15.958
==============================================================================
Omnibus: 24.957 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.278
Skew: 0.353 Prob(JB): 1.19e-06
Kurtosis: 3.415 Cond. No. 743.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Here's an easy fix using pandas. You only need to pass a list of feature names to summary() through the xname argument.
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
# list of features (names)
features = list(df.iloc[:, 0:-1].columns) # exclude last column (label)
# scale features (the scaler returns NumPy arrays, so the column names are lost)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# train MLR model
regressor = sm.OLS(y_train, X_train).fit()
regressor.summary(xname=features)

Linear regression to extract only coefficients and constant

I have written code for a multiple linear regression model, but when I use results.summary(), Python prints this whole thing:
if i >1:
xxx = sm.add_constant(xxx)
results = sm.OLS(y_variable_holder, xxx).fit()
print (results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.001
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 1.051
Date: Wed, 14 Jun 2017 Prob (F-statistic): 0.369
Time: 20:01:26 Log-Likelihood: 6062.6
No. Observations: 2262 AIC: -1.212e+04
Df Residuals: 2258 BIC: -1.209e+04
Df Model: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -0.0002 0.000 -0.476 0.634 -0.001 0.001
x1 -0.0001 0.001 -0.218 0.828 -0.001 0.001
x2 8.445e-06 2.31e-05 0.366 0.714 -3.68e-05 5.37e-05
x3 -0.0026 0.003 -0.941 0.347 -0.008 0.003
==============================================================================
Omnibus: 322.021 Durbin-Watson: 2.255
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4334.191
Skew: -0.097 Prob(JB): 0.00
Kurtosis: 9.779 Cond. No. 127.
==============================================================================
I want Python to only spit out constant and coefficients. For example, desired output:
python output:
[-0.0002]
[-0.0001]
[8.445e-06]
[ -0.0026]
How can I achieve this? I don't need the whole summary, just the constant and coefficients.
I figured it out. The answer is results_bucket.append(results.params).
