Linear regression to extract only coefficients and constant - python

I have written code for a multiple linear regression model, but when I use results.summary() Python prints this whole thing out:
if i > 1:
    xxx = sm.add_constant(xxx)
    results = sm.OLS(y_variable_holder, xxx).fit()
    print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.001
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 1.051
Date: Wed, 14 Jun 2017 Prob (F-statistic): 0.369
Time: 20:01:26 Log-Likelihood: 6062.6
No. Observations: 2262 AIC: -1.212e+04
Df Residuals: 2258 BIC: -1.209e+04
Df Model: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -0.0002 0.000 -0.476 0.634 -0.001 0.001
x1 -0.0001 0.001 -0.218 0.828 -0.001 0.001
x2 8.445e-06 2.31e-05 0.366 0.714 -3.68e-05 5.37e-05
x3 -0.0026 0.003 -0.941 0.347 -0.008 0.003
==============================================================================
Omnibus: 322.021 Durbin-Watson: 2.255
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4334.191
Skew: -0.097 Prob(JB): 0.00
Kurtosis: 9.779 Cond. No. 127.
==============================================================================
I want Python to print only the constant and the coefficients. For example, desired output:
python output:
[-0.0002]
[-0.0001]
[8.445e-06]
[ -0.0026]
How can I achieve this? I don't need the whole summary, just the constant and coefficients.

I figured it out. The answer is results_bucket.append(results.params)
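For anyone landing here, a minimal self-contained sketch (toy data, purely illustrative; the variable names mirror the question) of pulling just the intercept and coefficients out of a fitted statsmodels OLS results object:
import numpy as np
import statsmodels.api as sm

# toy data, purely illustrative
rng = np.random.default_rng(0)
xxx = rng.normal(size=(2262, 3))
y_variable_holder = rng.normal(size=2262)

xxx = sm.add_constant(xxx)                      # prepend the intercept column
results = sm.OLS(y_variable_holder, xxx).fit()

print(results.params)                           # array: [const, x1, x2, x3]
results_bucket = []
results_bucket.append(results.params)           # collect across loop iterations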

Related

Is the categorical variable relevant if one of the dummy variables has a t score of 0.95?

If a variable has a t score above 0.05, it is deemed not relevant and should be excluded from the model. However, what if the categorical variable has 4 dummy variables and only one of them exceeds 0.05? Do I exclude the entire categorical variable?
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.803
Model: OLS Adj. R-squared: 0.801
Method: Least Squares F-statistic: 368.4
Date: Mon, 15 Jul 2019 Prob (F-statistic): 0.00
Time: 12:00:26 Log-Likelihood: -17357.
No. Observations: 1460 AIC: 3.475e+04
Df Residuals: 1443 BIC: 3.484e+04
Df Model: 16
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
const -1.366e+05 9432.229 -14.482 0.000 -1.55e+05 -1.18e+05
OverallQual 1.327e+04 1249.192 10.622 0.000 1.08e+04 1.57e+04
ExterQual 1.168e+04 2763.188 4.228 0.000 6262.969 1.71e+04
TotalBsmtSF 13.7198 5.182 2.648 0.008 3.554 23.885
GrLivArea 45.4098 2.521 18.012 0.000 40.465 50.355
1stFlrSF 9.4573 5.543 1.706 0.088 -1.416 20.330
GarageArea 22.4791 9.748 2.306 0.021 3.358 41.600
KitchenQual 1.309e+04 2142.662 6.111 0.000 8891.243 1.73e+04
GarageCars 8875.8202 2961.291 2.997 0.003 3066.923 1.47e+04
BsmtQual 1.097e+04 2094.395 5.235 0.000 6856.671 1.51e+04
GarageFinish_No 2689.1356 5847.186 0.460 0.646 -8780.759 1.42e+04
GarageFinish_RFn -8223.4503 2639.360 -3.116 0.002 -1.34e+04 -3046.057
GarageFinish_Unf -8416.9443 2928.002 -2.875 0.004 -1.42e+04 -2673.349
BsmtExposure_Gd 2.298e+04 3970.691 5.788 0.000 1.52e+04 3.08e+04
BsmtExposure_Mn -262.8498 4160.294 -0.063 0.950 -8423.721 7898.021
BsmtExposure_No -7690.0994 2800.731 -2.746 0.006 -1.32e+04 -2196.159
BsmtExposure_No Basement 2.598e+04 9879.662 2.630 0.009 6598.642 4.54e+04
==============================================================================
Omnibus: 614.604 Durbin-Watson: 1.972
Prob(Omnibus): 0.000 Jarque-Bera (JB): 76480.899
Skew: -0.928 Prob(JB): 0.00
Kurtosis: 38.409 Cond. No. 2.85e+04
==============================================================================
When you say "0.05 t score" I assume you mean "0.05 p-value". The t-value is just coef / std err, which goes into the p-value calculation (roughly, abs(t-value) > 2 corresponds to p-value < 0.05).
When you say the "categorical variable has 4 dummy variables", I presume you mean it has 4 "levels" / distinct values and you're referring to BsmtExposure_Mn. I'd leave that in, as the other categories/levels are helping the model. If you had several categories that were less predictive, you could think about combining them into one "other" category.
As a general point, you shouldn't automatically exclude variables just because their p-value is > 0.05 (or whatever your cutoff / "alpha" is). They can be useful for understanding what's going on within the model and for explaining the results to other people.
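To make the coef / std err relationship concrete, here is a minimal self-contained sketch (toy data, purely illustrative, not the SalePrice model above) showing that the t column is just coef divided by std err, and that each dummy level of a categorical gets its own p-value:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# toy data with one numeric and one 4-level categorical predictor
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=200),
    "cat": rng.choice(["a", "b", "c", "d"], size=200),
})
df["y"] = 2.0 * df["x"] + rng.normal(size=200)

res = smf.ols("y ~ x + C(cat)", data=df).fit()

print(np.allclose(res.params / res.bse, res.tvalues))  # t-value = coef / std err
print(res.pvalues)                                      # one p-value per term, dummy levels included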

Multiple linear regression get best fit

I am doing some multiple linear regression with the following code:
import pandas as pd
import statsmodels.formula.api as sm

df = pd.DataFrame({"A": Output['10'],
                   "B": Input['Var1'],
                   "G": Input['Var2'],
                   "I": Input['Var3'],
                   "J": Input['Var4']})
res = sm.ols(formula="A ~ B + G + I + J", data=df).fit()
print(res.summary())
With the following result:
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.562
Model: OLS Adj. R-squared: 0.562
Method: Least Squares F-statistic: 2235.
Date: Tue, 06 Nov 2018 Prob (F-statistic): 0.00
Time: 09:48:20 Log-Likelihood: -21233.
No. Observations: 6961 AIC: 4.248e+04
Df Residuals: 6956 BIC: 4.251e+04
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 21.8504 0.448 48.760 0.000 20.972 22.729
B 1.8353 0.022 84.172 0.000 1.793 1.878
G 0.0032 0.004 0.742 0.458 -0.005 0.012
I -0.0210 0.009 -2.224 0.026 -0.039 -0.002
J 0.6677 0.061 10.868 0.000 0.547 0.788
==============================================================================
Omnibus: 2152.474 Durbin-Watson: 0.308
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5077.082
Skew: -1.773 Prob(JB): 0.00
Kurtosis: 5.221 Cond. No. 555.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
However, my Output dataframe consists of multiple columns, numbered 1 to 149. Is there a way to loop over all 149 columns in the Output dataframe and, at the end, show the best and worst fits by, for example, R-squared? Or get the largest coefficient for variable B?
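A minimal sketch of one way to do it: fit a model per output column and collect the fit statistics (Output, Input and the Var names come from the question; everything else, including the ranking helpers, is just illustrative):
import pandas as pd
import statsmodels.formula.api as smf

rows = []
for col in Output.columns:                     # e.g. '1' ... '149'
    df = pd.DataFrame({"A": Output[col],
                       "B": Input['Var1'],
                       "G": Input['Var2'],
                       "I": Input['Var3'],
                       "J": Input['Var4']})
    res = smf.ols(formula="A ~ B + G + I + J", data=df).fit()
    rows.append({"column": col,
                 "rsquared": res.rsquared,
                 "coef_B": res.params["B"]})

fits = pd.DataFrame(rows).set_index("column")
print(fits.nlargest(1, "rsquared"))            # best fit by R-squared
print(fits.nsmallest(1, "rsquared"))           # worst fit
print(fits.nlargest(1, "coef_B"))              # largest coefficient for B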

Preserve variable names in summary from statsmodels

I am using OLS from statsmodels; the documentation link is https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html
#USD
X = sm.add_constant(USD)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.265
Model: OLS Adj. R-squared: 0.265
Method: Least Squares F-statistic: 352.4
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.35e-67
Time: 17:30:24 Log-Likelihood: -8018.8
No. Observations: 977 AIC: 1.604e+04
Df Residuals: 975 BIC: 1.605e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1843.1414 149.675 12.314 0.000 1549.418 2136.864
USD 3512.5040 187.111 18.772 0.000 3145.318 3879.690
==============================================================================
Omnibus: 276.458 Durbin-Watson: 0.009
Prob(Omnibus): 0.000 Jarque-Bera (JB): 74.633
Skew: 0.438 Prob(JB): 6.22e-17
Kurtosis: 1.967 Cond. No. 10.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified
You can see that X shows up as USD in the summary, which is what I want.
However, after adding a new variable:
#JPY + USD
X = sm.add_constant(JPY)
X = np.column_stack((X, USD))
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.641
Model: OLS Adj. R-squared: 0.640
Method: Least Squares F-statistic: 868.8
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.80e-217
Time: 17:39:19 Log-Likelihood: -7669.4
No. Observations: 977 AIC: 1.534e+04
Df Residuals: 974 BIC: 1.536e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1559.5880 149.478 -10.434 0.000 -1852.923 -1266.253
x1 78.6589 2.466 31.902 0.000 73.820 83.497
x2 -366.5850 178.672 -2.052 0.040 -717.211 -15.958
==============================================================================
Omnibus: 24.957 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.278
Skew: 0.353 Prob(JB): 1.19e-06
Kurtosis: 3.415 Cond. No. 743.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
It is not showing USD and JPY, but x1 and x2. Is there a way to fix it? I tried Google but found nothing.
Since my question is only about how the names are displayed, keeping the column headers solves the problem, so I'm posting my solution in case someone has the same issue.
#JPY + USD
X = JPY.join(USD)
X = sm.add_constant(X)
#X = np.column_stack((X, USD))
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.641
Model: OLS Adj. R-squared: 0.640
Method: Least Squares F-statistic: 868.8
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.80e-217
Time: 22:51:43 Log-Likelihood: -7669.4
No. Observations: 977 AIC: 1.534e+04
Df Residuals: 974 BIC: 1.536e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1559.5880 149.478 -10.434 0.000 -1852.923 -1266.253
JPY 78.6589 2.466 31.902 0.000 73.820 83.497
USD -366.5850 178.672 -2.052 0.040 -717.211 -15.958
==============================================================================
Omnibus: 24.957 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.278
Skew: 0.353 Prob(JB): 1.19e-06
Kurtosis: 3.415 Cond. No. 743.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Here's another easy fix: pass a list of feature names to summary() via its xname argument (df, X_train, X_test and y_train below are whatever you already have).
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm

# list of feature names: every column except the last (the label)
features = list(df.iloc[:, 0:-1].columns)
# scaling returns plain arrays, so column names are lost here
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# train the multiple linear regression model and label the regressors in the summary
regressor = sm.OLS(y_train, X_train).fit()
print(regressor.summary(xname=features))
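One caveat, as far as I know about the statsmodels API: xname has to supply one name per fitted parameter, so if you add a constant you also need to include a name for it, e.g.:
X_train = sm.add_constant(X_train)
regressor = sm.OLS(y_train, X_train).fit()
print(regressor.summary(xname=['const'] + features))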

VIF by coef in OLS Regression Results Python

I am attempting to print the VIF (variance inflation factor) for each coefficient, but I can't find any documentation from statsmodels showing how. I have a model of n variables to process, and a single multicollinearity value for all the variables doesn't help me remove the ones with the highest collinearity.
This looks like an answer:
https://stats.stackexchange.com/questions/155028/how-to-systematically-remove-collinear-variables-in-python
but how would I run it against this dataset?
http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv
Below is the code and the summary output, which is where I am now.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# read data into a DataFrame
data = pd.read_csv('somepath', index_col=0)
print(data.head())
#multiregression
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
print(lm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 570.3
Date: Wed, 15 Feb 2017 Prob (F-statistic): 1.58e-96
Time: 13:28:29 Log-Likelihood: -386.18
No. Observations: 200 AIC: 780.4
Df Residuals: 196 BIC: 793.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
Radio 0.1885 0.009 21.893 0.000 0.172 0.206
Newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
==============================================================================
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 454.
==============================================================================
To get a list of VIFs (one per column of the design matrix, including the intercept):
from statsmodels.stats.outliers_influence import variance_inflation_factor

variables = lm.model.exog
vif = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
vif
To get their mean:
import numpy as np
np.array(vif).mean()
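To go further and actually drop the most collinear predictors one at a time, here is a hedged sketch of the iterative approach from the linked stats.stackexchange answer applied to the Advertising data; the cutoff of 5 is a common rule of thumb, not something from the question:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# read the Advertising data as in the question
data = pd.read_csv('somepath', index_col=0)
X = sm.add_constant(data[['TV', 'Radio', 'Newspaper']])

# repeatedly drop the predictor (never the constant) with the largest VIF
# until every VIF is below the rule-of-thumb cutoff of 5
while True:
    vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                     index=X.columns).drop('const')
    if vifs.max() < 5:
        break
    X = X.drop(columns=vifs.idxmax())

print(vifs)                                        # VIF per remaining predictor
print([c for c in X.columns if c != 'const'])      # predictors kept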

How do you pull values out of statsmodels.WLS.fit.summary?

This is the code I have so far. I am performing a weighted least squares operation, and am printing the results out. I want to use the results from the summary, but the summary is apparently not iterable. Is there a way to pull the values from the summary?
self.b = np.linalg.lstsq(self.G, self.d)
w = np.asarray(self.dw)
mod_wls = sm.WLS(self.d, self.G, weights=1./w)
res_wls = mod_wls.fit()
report = res_wls.summary()
print(report)
Here is the summary as it prints out.
WLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.955
Model: WLS Adj. R-squared: 0.944
Method: Least Squares F-statistic: 92.82
Date: Mon, 24 Oct 2016 Prob (F-statistic): 4.94e-14
Time: 11:38:16 Log-Likelihood: 138.19
No. Observations: 28 AIC: -264.4
Df Residuals: 22 BIC: -256.4
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 -0.0066 0.001 -12.389 0.000 -0.008 -0.006
x2 0.0072 0.000 15.805 0.000 0.006 0.008
x3 1.853e-08 2.45e-08 0.756 0.457 -3.23e-08 6.93e-08
x4 -4.402e-09 6.58e-09 -0.669 0.511 -1.81e-08 9.25e-09
x5 -3.595e-08 1.42e-08 -2.528 0.019 -6.55e-08 -6.45e-09
x6 4.402e-09 6.58e-09 0.669 0.511 -9.25e-09 1.81e-08
x7 -6.759e-05 4.17e-05 -1.620 0.120 -0.000 1.9e-05
==============================================================================
Omnibus: 4.421 Durbin-Watson: 1.564
Prob(Omnibus): 0.110 Jarque-Bera (JB): 2.846
Skew: 0.729 Prob(JB): 0.241
Kurtosis: 3.560 Cond. No. 2.22e+16
==============================================================================
Edit: To clarify, I want to extract the 'std err' values from each of the x1, x2, ..., x7 rows. I can't seem to find the attribute that represents them or the rows they are in. Does anyone know how to do this?
After your operations, res_wls is of type statsmodels.regression.linear_model.RegressionResults, which contains individual attributes for each of the values that you might be interested in. See the documentation for the names of those. For instance, res_wls.rsquared should give you your $R^2$.
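For the 'std err' column specifically (which the edit asks about), the standard errors of the estimates are exposed as the bse attribute, alongside params for the coefficients. A small sketch using res_wls from the question:
import pandas as pd

print(res_wls.bse)       # standard errors, one per x1..x7
print(res_wls.params)    # the corresponding coefficients
# or collect the whole coefficient table in one DataFrame
coef_table = pd.DataFrame({"coef": res_wls.params, "std err": res_wls.bse})
print(coef_table)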
