Linear regression to extract only coefficients and constant - python

I have written code for a multiple linear regression model, but when I use results.summary() Python prints this whole thing out:
if i > 1:
    xxx = sm.add_constant(xxx)
    results = sm.OLS(y_variable_holder, xxx).fit()
    print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.001
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 1.051
Date: Wed, 14 Jun 2017 Prob (F-statistic): 0.369
Time: 20:01:26 Log-Likelihood: 6062.6
No. Observations: 2262 AIC: -1.212e+04
Df Residuals: 2258 BIC: -1.209e+04
Df Model: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -0.0002 0.000 -0.476 0.634 -0.001 0.001
x1 -0.0001 0.001 -0.218 0.828 -0.001 0.001
x2 8.445e-06 2.31e-05 0.366 0.714 -3.68e-05 5.37e-05
x3 -0.0026 0.003 -0.941 0.347 -0.008 0.003
==============================================================================
Omnibus: 322.021 Durbin-Watson: 2.255
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4334.191
Skew: -0.097 Prob(JB): 0.00
Kurtosis: 9.779 Cond. No. 127.
==============================================================================
I want Python to print only the constant and the coefficients. For example, desired output:
python output:
[-0.0002]
[-0.0001]
[8.445e-06]
[ -0.0026]
How can I achieve this? I don't need the whole summary, just the constant and coefficients.

I figured it out. The answer is results_bucket.append(results.params)
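For anyone landing here, a minimal self-contained sketch (toy data, purely illustrative; the variable names mirror the question) of pulling just the intercept and coefficients out of a fitted statsmodels OLS results object:
import numpy as np
import statsmodels.api as sm

# toy data, purely illustrative
rng = np.random.default_rng(0)
xxx = rng.normal(size=(2262, 3))
y_variable_holder = rng.normal(size=2262)

xxx = sm.add_constant(xxx)                      # prepend the intercept column
results = sm.OLS(y_variable_holder, xxx).fit()

print(results.params)                           # array: [const, x1, x2, x3]
results_bucket = []
results_bucket.append(results.params)           # collect across loop iterations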

Related

Is the categorical variable relevant if one of the dummy variables has a t score of 0.95?

If a variable has a t score above 0.05, it is deemed not relevant and should be excluded from the model. However, what if the categorical variable has 4 dummy variables and only one of them exceeds 0.05? Do I exclude the entire categorical variable?
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.803
Model: OLS Adj. R-squared: 0.801
Method: Least Squares F-statistic: 368.4
Date: Mon, 15 Jul 2019 Prob (F-statistic): 0.00
Time: 12:00:26 Log-Likelihood: -17357.
No. Observations: 1460 AIC: 3.475e+04
Df Residuals: 1443 BIC: 3.484e+04
Df Model: 16
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
const -1.366e+05 9432.229 -14.482 0.000 -1.55e+05 -1.18e+05
OverallQual 1.327e+04 1249.192 10.622 0.000 1.08e+04 1.57e+04
ExterQual 1.168e+04 2763.188 4.228 0.000 6262.969 1.71e+04
TotalBsmtSF 13.7198 5.182 2.648 0.008 3.554 23.885
GrLivArea 45.4098 2.521 18.012 0.000 40.465 50.355
1stFlrSF 9.4573 5.543 1.706 0.088 -1.416 20.330
GarageArea 22.4791 9.748 2.306 0.021 3.358 41.600
KitchenQual 1.309e+04 2142.662 6.111 0.000 8891.243 1.73e+04
GarageCars 8875.8202 2961.291 2.997 0.003 3066.923 1.47e+04
BsmtQual 1.097e+04 2094.395 5.235 0.000 6856.671 1.51e+04
GarageFinish_No 2689.1356 5847.186 0.460 0.646 -8780.759 1.42e+04
GarageFinish_RFn -8223.4503 2639.360 -3.116 0.002 -1.34e+04 -3046.057
GarageFinish_Unf -8416.9443 2928.002 -2.875 0.004 -1.42e+04 -2673.349
BsmtExposure_Gd 2.298e+04 3970.691 5.788 0.000 1.52e+04 3.08e+04
BsmtExposure_Mn -262.8498 4160.294 -0.063 0.950 -8423.721 7898.021
BsmtExposure_No -7690.0994 2800.731 -2.746 0.006 -1.32e+04 -2196.159
BsmtExposure_No Basement 2.598e+04 9879.662 2.630 0.009 6598.642 4.54e+04
==============================================================================
Omnibus: 614.604 Durbin-Watson: 1.972
Prob(Omnibus): 0.000 Jarque-Bera (JB): 76480.899
Skew: -0.928 Prob(JB): 0.00
Kurtosis: 38.409 Cond. No. 2.85e+04
==============================================================================
When you say "0.05 t score" I assume you mean "0.05 p-value". The t-value is just coef / std err, which goes into the p-value calculation (roughly, abs(t-value) > 2 corresponds to p-value < 0.05).
When you say the "categorical variable has 4 dummy variables", I presume you mean it has 4 "levels" / distinct values and you're referring to BsmtExposure_Mn. I'd leave that in, as the other categories/levels are helping the model. If you had several categories that were less predictive, you could think about combining them into one "other" category.
As a general point, you shouldn't automatically exclude variables just because their p-value is > 0.05 (or whatever your cutoff / "alpha" is). They can be useful for understanding what's going on within the model and for explaining the results to other people.
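To make the coef / std err relationship concrete, here is a minimal self-contained sketch (toy data, purely illustrative, not the SalePrice model above) showing that the t column is just coef divided by std err, and that each dummy level of a categorical gets its own p-value:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# toy data with one numeric and one 4-level categorical predictor
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=200),
    "cat": rng.choice(["a", "b", "c", "d"], size=200),
})
df["y"] = 2.0 * df["x"] + rng.normal(size=200)

res = smf.ols("y ~ x + C(cat)", data=df).fit()

print(np.allclose(res.params / res.bse, res.tvalues))  # t-value = coef / std err
print(res.pvalues)                                      # one p-value per term, dummy levels included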

Multiple linear regression get best fit

I am doing some multiple linear regression with the following code:
import pandas as pd
import statsmodels.formula.api as sm

df = pd.DataFrame({"A": Output['10'],
                   "B": Input['Var1'],
                   "G": Input['Var2'],
                   "I": Input['Var3'],
                   "J": Input['Var4']})
res = sm.ols(formula="A ~ B + G + I + J", data=df).fit()
print(res.summary())
With the following result:
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.562
Model: OLS Adj. R-squared: 0.562
Method: Least Squares F-statistic: 2235.
Date: Tue, 06 Nov 2018 Prob (F-statistic): 0.00
Time: 09:48:20 Log-Likelihood: -21233.
No. Observations: 6961 AIC: 4.248e+04
Df Residuals: 6956 BIC: 4.251e+04
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 21.8504 0.448 48.760 0.000 20.972 22.729
B 1.8353 0.022 84.172 0.000 1.793 1.878
G 0.0032 0.004 0.742 0.458 -0.005 0.012
I -0.0210 0.009 -2.224 0.026 -0.039 -0.002
J 0.6677 0.061 10.868 0.000 0.547 0.788
==============================================================================
Omnibus: 2152.474 Durbin-Watson: 0.308
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5077.082
Skew: -1.773 Prob(JB): 0.00
Kurtosis: 5.221 Cond. No. 555.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
However, my Output dataframe consists of multiple columns, numbered 1 to 149. Is there a way to loop over all 149 columns in the Output dataframe and, at the end, show the best and worst fits by, for example, R-squared? Or get the largest coefficient for variable B?
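A minimal sketch of one way to do it: fit a model per output column and collect the fit statistics (Output, Input and the Var names come from the question; everything else, including the ranking helpers, is just illustrative):
import pandas as pd
import statsmodels.formula.api as smf

rows = []
for col in Output.columns:                     # e.g. '1' ... '149'
    df = pd.DataFrame({"A": Output[col],
                       "B": Input['Var1'],
                       "G": Input['Var2'],
                       "I": Input['Var3'],
                       "J": Input['Var4']})
    res = smf.ols(formula="A ~ B + G + I + J", data=df).fit()
    rows.append({"column": col,
                 "rsquared": res.rsquared,
                 "coef_B": res.params["B"]})

fits = pd.DataFrame(rows).set_index("column")
print(fits.nlargest(1, "rsquared"))            # best fit by R-squared
print(fits.nsmallest(1, "rsquared"))           # worst fit
print(fits.nlargest(1, "coef_B"))              # largest coefficient for B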

Preserve variable names in summary from statsmodels

I am using OLS from statsmodels; the documentation link is https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html
#USD
X = sm.add_constant(USD)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.265
Model: OLS Adj. R-squared: 0.265
Method: Least Squares F-statistic: 352.4
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.35e-67
Time: 17:30:24 Log-Likelihood: -8018.8
No. Observations: 977 AIC: 1.604e+04
Df Residuals: 975 BIC: 1.605e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1843.1414 149.675 12.314 0.000 1549.418 2136.864
USD 3512.5040 187.111 18.772 0.000 3145.318 3879.690
==============================================================================
Omnibus: 276.458 Durbin-Watson: 0.009
Prob(Omnibus): 0.000 Jarque-Bera (JB): 74.633
Skew: 0.438 Prob(JB): 6.22e-17
Kurtosis: 1.967 Cond. No. 10.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified
You can see that X shows up as USD in the summary, which is what I want.
However, after adding a new variable:
#JPY + USD
X = sm.add_constant(JPY)
X = np.column_stack((X, USD))
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.641
Model: OLS Adj. R-squared: 0.640
Method: Least Squares F-statistic: 868.8
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.80e-217
Time: 17:39:19 Log-Likelihood: -7669.4
No. Observations: 977 AIC: 1.534e+04
Df Residuals: 974 BIC: 1.536e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1559.5880 149.478 -10.434 0.000 -1852.923 -1266.253
x1 78.6589 2.466 31.902 0.000 73.820 83.497
x2 -366.5850 178.672 -2.052 0.040 -717.211 -15.958
==============================================================================
Omnibus: 24.957 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.278
Skew: 0.353 Prob(JB): 1.19e-06
Kurtosis: 3.415 Cond. No. 743.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
It is not showing USD and JPY, but x1 and x2. Is there a way to fix it? I tried Google but found nothing.
Since my question is only about how the names are displayed, keeping the column headers solves the problem, so I'm posting my solution in case someone has the same issue.
#JPY + USD
X = JPY.join(USD)
X = sm.add_constant(X)
#X = np.column_stack((X, USD))
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.641
Model: OLS Adj. R-squared: 0.640
Method: Least Squares F-statistic: 868.8
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.80e-217
Time: 22:51:43 Log-Likelihood: -7669.4
No. Observations: 977 AIC: 1.534e+04
Df Residuals: 974 BIC: 1.536e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1559.5880 149.478 -10.434 0.000 -1852.923 -1266.253
JPY 78.6589 2.466 31.902 0.000 73.820 83.497
USD -366.5850 178.672 -2.052 0.040 -717.211 -15.958
==============================================================================
Omnibus: 24.957 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.278
Skew: 0.353 Prob(JB): 1.19e-06
Kurtosis: 3.415 Cond. No. 743.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Here's another easy fix: pass a list of feature names to summary() via its xname argument (df, X_train, X_test and y_train below are whatever you already have).
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm

# list of feature names: every column except the last (the label)
features = list(df.iloc[:, 0:-1].columns)
# scaling returns plain arrays, so column names are lost here
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# train the multiple linear regression model and label the regressors in the summary
regressor = sm.OLS(y_train, X_train).fit()
print(regressor.summary(xname=features))
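One caveat, as far as I know about the statsmodels API: xname has to supply one name per fitted parameter, so if you add a constant you also need to include a name for it, e.g.:
X_train = sm.add_constant(X_train)
regressor = sm.OLS(y_train, X_train).fit()
print(regressor.summary(xname=['const'] + features))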

VIF by coef in OLS Regression Results Python

I am attempting to print the VIF (variance inflation factor) for each coefficient, but I can't find any documentation from statsmodels showing how. I have a model of n variables to process, and a single multicollinearity value for all the variables doesn't help me remove the ones with the highest collinearity.
This looks like an answer:
https://stats.stackexchange.com/questions/155028/how-to-systematically-remove-collinear-variables-in-python
but how would I run it against this dataset?
http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv
Below is the code and the summary output, which is where I am now.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# read data into a DataFrame
data = pd.read_csv('somepath', index_col=0)
print(data.head())
#multiregression
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
print(lm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 570.3
Date: Wed, 15 Feb 2017 Prob (F-statistic): 1.58e-96
Time: 13:28:29 Log-Likelihood: -386.18
No. Observations: 200 AIC: 780.4
Df Residuals: 196 BIC: 793.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
Radio 0.1885 0.009 21.893 0.000 0.172 0.206
Newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
==============================================================================
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 454.
==============================================================================
To get a list of VIFs (one per column of the design matrix, including the intercept):
from statsmodels.stats.outliers_influence import variance_inflation_factor

variables = lm.model.exog
vif = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
vif
To get their mean:
import numpy as np
np.array(vif).mean()
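To go further and actually drop the most collinear predictors one at a time, here is a hedged sketch of the iterative approach from the linked stats.stackexchange answer applied to the Advertising data; the cutoff of 5 is a common rule of thumb, not something from the question:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# read the Advertising data as in the question
data = pd.read_csv('somepath', index_col=0)
X = sm.add_constant(data[['TV', 'Radio', 'Newspaper']])

# repeatedly drop the predictor (never the constant) with the largest VIF
# until every VIF is below the rule-of-thumb cutoff of 5
while True:
    vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                     index=X.columns).drop('const')
    if vifs.max() < 5:
        break
    X = X.drop(columns=vifs.idxmax())

print(vifs)                                        # VIF per remaining predictor
print([c for c in X.columns if c != 'const'])      # predictors kept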

How do you pull values out of statsmodels.WLS.fit.summary?

This is the code I have so far. I am performing a weighted least squares operation, and am printing the results out. I want to use the results from the summary, but the summary is apparently not iterable. Is there a way to pull the values from the summary?
self.b = np.linalg.lstsq(self.G, self.d)
w = np.asarray(self.dw)
mod_wls = sm.WLS(self.d, self.G, weights=1./w)
res_wls = mod_wls.fit()
report = res_wls.summary()
print(report)
Here is the summary as it prints out.
WLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.955
Model: WLS Adj. R-squared: 0.944
Method: Least Squares F-statistic: 92.82
Date: Mon, 24 Oct 2016 Prob (F-statistic): 4.94e-14
Time: 11:38:16 Log-Likelihood: 138.19
No. Observations: 28 AIC: -264.4
Df Residuals: 22 BIC: -256.4
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 -0.0066 0.001 -12.389 0.000 -0.008 -0.006
x2 0.0072 0.000 15.805 0.000 0.006 0.008
x3 1.853e-08 2.45e-08 0.756 0.457 -3.23e-08 6.93e-08
x4 -4.402e-09 6.58e-09 -0.669 0.511 -1.81e-08 9.25e-09
x5 -3.595e-08 1.42e-08 -2.528 0.019 -6.55e-08 -6.45e-09
x6 4.402e-09 6.58e-09 0.669 0.511 -9.25e-09 1.81e-08
x7 -6.759e-05 4.17e-05 -1.620 0.120 -0.000 1.9e-05
==============================================================================
Omnibus: 4.421 Durbin-Watson: 1.564
Prob(Omnibus): 0.110 Jarque-Bera (JB): 2.846
Skew: 0.729 Prob(JB): 0.241
Kurtosis: 3.560 Cond. No. 2.22e+16
==============================================================================
Edit: To clarify, I want to extract the 'std err' values from each of the x1, x2, ..., x7 rows. I can't seem to find the attribute that represents them or the rows they are in. Does anyone know how to do this?
After your operations, res_wls is of type statsmodels.regression.linear_model.RegressionResults, which contains individual attributes for each of the values that you might be interested in. See the documentation for the names of those. For instance, res_wls.rsquared should give you your $R^2$.
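For the 'std err' column specifically (which the edit asks about), the standard errors of the estimates are exposed as the bse attribute, alongside params for the coefficients. A small sketch using res_wls from the question:
import pandas as pd

print(res_wls.bse)       # standard errors, one per x1..x7
print(res_wls.params)    # the corresponding coefficients
# or collect the whole coefficient table in one DataFrame
coef_table = pd.DataFrame({"coef": res_wls.params, "std err": res_wls.bse})
print(coef_table)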
