This is the code I have so far. I am performing a weighted least squares operation, and am printing the results out. I want to use the results from the summary, but the summary is apparently not iterable. Is there a way to pull the values from the summary?
self.b = np.linalg.lstsq(self.G,self.d)
w = np.asarray(self.dw)
mod_wls = sm.WLS(self.d,self.G,weights=1./np.asarray(w))
res_wls = mod_wls.fit()
report = res_wls.summary()
print report
Here is the summary as it prints out.
WLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.955
Model: WLS Adj. R-squared: 0.944
Method: Least Squares F-statistic: 92.82
Date: Mon, 24 Oct 2016 Prob (F-statistic): 4.94e-14
Time: 11:38:16 Log-Likelihood: 138.19
No. Observations: 28 AIC: -264.4
Df Residuals: 22 BIC: -256.4
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 -0.0066 0.001 -12.389 0.000 -0.008 -0.006
x2 0.0072 0.000 15.805 0.000 0.006 0.008
x3 1.853e-08 2.45e-08 0.756 0.457 -3.23e-08 6.93e-08
x4 -4.402e-09 6.58e-09 -0.669 0.511 -1.81e-08 9.25e-09
x5 -3.595e-08 1.42e-08 -2.528 0.019 -6.55e-08 -6.45e-09
x6 4.402e-09 6.58e-09 0.669 0.511 -9.25e-09 1.81e-08
x7 -6.759e-05 4.17e-05 -1.620 0.120 -0.000 1.9e-05
==============================================================================
Omnibus: 4.421 Durbin-Watson: 1.564
Prob(Omnibus): 0.110 Jarque-Bera (JB): 2.846
Skew: 0.729 Prob(JB): 0.241
Kurtosis: 3.560 Cond. No. 2.22e+16
==============================================================================
edit: To clarify, I want to extract the 'std err' values from each of the x1,x2...x7 rows. I can't seem to find the attribute that represents them or the rows they are in. Anyone know how to do this?
After your operations, res_wls is of type statsmodels.regression.linear_model.RegressionResults, which contains individual attributes for each of the values that you might be interested in. See the documentation for the names of those. For instance, res_wls.rsquared should give you your R-squared.
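In particular, the 'std err' column of the coefficient table is exposed directly as the bse attribute, and the coefficients as params. A minimal sketch, reusing your res_wls from above (bse, params and model.exog_names are standard RegressionResults attributes):
for name, coef, se in zip(res_wls.model.exog_names, res_wls.params, res_wls.bse):
    print(name, coef, se)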
Related
Wondering what's the most efficient/accurate way to estimate these parameters (a0, a1, a2, a3) with Python in the model:
col_4 = a0 + a1*col_1 + a2*col_2 + a3*col_3
The sample dataset would be:
import numpy as np
import pandas as pd

inputs = {
    'col_1': np.random.normal(15, 2, 100),
    'col_2': np.random.normal(15, 1, 100),
    'col_3': np.random.normal(0.9, 1, 100),
    'col_4': np.random.normal(-0.05, 0.5, 100),
}
_idx = pd.date_range('2021-01-01', '2021-04-10', freq='D').to_series()
data = pd.DataFrame(inputs, index=_idx)
statsmodels provides a pretty simple way to estimate linear models like that:
import statsmodels.formula.api as smf
results = smf.ols('col_4 ~ col_1 + col_2 + col_3', data=data).fit()
print(results.summary())
The coef column shows your aX parameters:
OLS Regression Results
==============================================================================
Dep. Variable: col_4 R-squared: 0.049
Model: OLS Adj. R-squared: 0.019
Method: Least Squares F-statistic: 1.637
Date: Wed, 29 Dec 2021 Prob (F-statistic): 0.186
Time: 17:25:00 Log-Likelihood: -68.490
No. Observations: 100 AIC: 145.0
Df Residuals: 96 BIC: 155.4
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.2191 0.846 0.259 0.796 -1.461 1.899
col_1 -0.0198 0.023 -0.854 0.395 -0.066 0.026
col_2 -0.0048 0.051 -0.093 0.926 -0.107 0.097
col_3 0.1155 0.056 2.066 0.042 0.005 0.226
==============================================================================
Omnibus: 2.292 Durbin-Watson: 2.291
Prob(Omnibus): 0.318 Jarque-Bera (JB): 2.296
Skew: -0.351 Prob(JB): 0.317
Kurtosis: 2.757 Cond. No. 370.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
That includes the intercept (a0) by default. If you want to remove it, just add a -1 to the formula, as in the sketch below.
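For instance, a sketch of the same fit without the intercept (this keeps only a1, a2, a3; note that R-squared is computed differently for models without an intercept):
results_no_intercept = smf.ols('col_4 ~ col_1 + col_2 + col_3 - 1', data=data).fit()
print(results_no_intercept.params)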
If a variable has more than 0.05 t score, it is deemed not relevant and should be excluded from the model. However, what if a categorical variable has 4 dummy variables and only one of them exceeds 0.05? Do I exclude the entire categorical variable?
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.803
Model: OLS Adj. R-squared: 0.801
Method: Least Squares F-statistic: 368.4
Date: Mon, 15 Jul 2019 Prob (F-statistic): 0.00
Time: 12:00:26 Log-Likelihood: -17357.
No. Observations: 1460 AIC: 3.475e+04
Df Residuals: 1443 BIC: 3.484e+04
Df Model: 16
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
const -1.366e+05 9432.229 -14.482 0.000 -1.55e+05 -1.18e+05
OverallQual 1.327e+04 1249.192 10.622 0.000 1.08e+04 1.57e+04
ExterQual 1.168e+04 2763.188 4.228 0.000 6262.969 1.71e+04
TotalBsmtSF 13.7198 5.182 2.648 0.008 3.554 23.885
GrLivArea 45.4098 2.521 18.012 0.000 40.465 50.355
1stFlrSF 9.4573 5.543 1.706 0.088 -1.416 20.330
GarageArea 22.4791 9.748 2.306 0.021 3.358 41.600
KitchenQual 1.309e+04 2142.662 6.111 0.000 8891.243 1.73e+04
GarageCars 8875.8202 2961.291 2.997 0.003 3066.923 1.47e+04
BsmtQual 1.097e+04 2094.395 5.235 0.000 6856.671 1.51e+04
GarageFinish_No 2689.1356 5847.186 0.460 0.646 -8780.759 1.42e+04
GarageFinish_RFn -8223.4503 2639.360 -3.116 0.002 -1.34e+04 -3046.057
GarageFinish_Unf -8416.9443 2928.002 -2.875 0.004 -1.42e+04 -2673.349
BsmtExposure_Gd 2.298e+04 3970.691 5.788 0.000 1.52e+04 3.08e+04
BsmtExposure_Mn -262.8498 4160.294 -0.063 0.950 -8423.721 7898.021
BsmtExposure_No -7690.0994 2800.731 -2.746 0.006 -1.32e+04 -2196.159
BsmtExposure_No Basement 2.598e+04 9879.662 2.630 0.009 6598.642 4.54e+04
==============================================================================
Omnibus: 614.604 Durbin-Watson: 1.972
Prob(Omnibus): 0.000 Jarque-Bera (JB): 76480.899
Skew: -0.928 Prob(JB): 0.00
Kurtosis: 38.409 Cond. No. 2.85e+04
==============================================================================
When you say "0.05 t score" I assume you mean "0.05 p-value". The t-value is just coef / std err, which goes into the p-value calculation (abs(t_value) > 2 is approximately p-value < 0.05).
When you say the "categorical variable has 4 dummy variables", I presume you mean it has 4 "levels" / distinct values and you're referring to BsmtExposure_Mn. I'd leave that in, as the other categories/levels are helping the model. If you had several categories that were less predictive, you could think about combining them into one "other" category.
As a general point, you shouldn't just automatically exclude variables because their p-value is > 0.05 (or whatever your cutoff / "alpha value" is). They can be useful for understanding what's going on within the model, and for explaining results to other people.
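If you want to verify that relationship on your own fitted results object (here assumed to be called results), the pieces are all available as attributes:
import numpy as np
t_manual = results.params / results.bse        # t-value = coef / std err
print(np.allclose(t_manual, results.tvalues))  # True
print(results.pvalues)                         # the values behind the P>|t| column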
I am doing some multiple linear regression with the following code:
import statsmodels.formula.api as sm
df = pd.DataFrame({"A": Output['10'],
                   "B": Input['Var1'],
                   "G": Input['Var2'],
                   "I": Input['Var3'],
                   "J": Input['Var4']})
res = sm.ols(formula="A ~ B + G + I + J", data=df).fit()
print(res.summary())
With the following result:
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.562
Model: OLS Adj. R-squared: 0.562
Method: Least Squares F-statistic: 2235.
Date: Tue, 06 Nov 2018 Prob (F-statistic): 0.00
Time: 09:48:20 Log-Likelihood: -21233.
No. Observations: 6961 AIC: 4.248e+04
Df Residuals: 6956 BIC: 4.251e+04
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 21.8504 0.448 48.760 0.000 20.972 22.729
B 1.8353 0.022 84.172 0.000 1.793 1.878
G 0.0032 0.004 0.742 0.458 -0.005 0.012
I -0.0210 0.009 -2.224 0.026 -0.039 -0.002
J 0.6677 0.061 10.868 0.000 0.547 0.788
==============================================================================
Omnibus: 2152.474 Durbin-Watson: 0.308
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5077.082
Skew: -1.773 Prob(JB): 0.00
Kurtosis: 5.221 Cond. No. 555.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
However, my Output dataframe consists of multiple columns, numbered 1 to 149. Is there a way to loop over all 149 columns in the Output dataframe and at the end show, for example, the best and worst fits by R-squared? Or get the largest coefficient for variable B?
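One approach (a sketch, assuming Output has columns named '1' through '149' and Input holds Var1..Var4 as in your snippet) is to fit each column in a loop, collect the statistics into a DataFrame, and sort:
import pandas as pd
import statsmodels.formula.api as sm

rows = []
for col in Output.columns:
    df = pd.DataFrame({"A": Output[col],
                       "B": Input['Var1'],
                       "G": Input['Var2'],
                       "I": Input['Var3'],
                       "J": Input['Var4']})
    res = sm.ols(formula="A ~ B + G + I + J", data=df).fit()
    rows.append({"column": col, "rsquared": res.rsquared, "coef_B": res.params["B"]})

fits = pd.DataFrame(rows).set_index("column")
print(fits.sort_values("rsquared").iloc[[0, -1]])  # worst and best fit by R-squared
print(fits["coef_B"].idxmax())                     # column with the largest coefficient on B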
I am using OLS from statsmodels; the link is https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html
#USD
X = sm.add_constant(USD)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.265
Model: OLS Adj. R-squared: 0.265
Method: Least Squares F-statistic: 352.4
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.35e-67
Time: 17:30:24 Log-Likelihood: -8018.8
No. Observations: 977 AIC: 1.604e+04
Df Residuals: 975 BIC: 1.605e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1843.1414 149.675 12.314 0.000 1549.418 2136.864
USD 3512.5040 187.111 18.772 0.000 3145.318 3879.690
==============================================================================
Omnibus: 276.458 Durbin-Watson: 0.009
Prob(Omnibus): 0.000 Jarque-Bera (JB): 74.633
Skew: 0.438 Prob(JB): 6.22e-17
Kurtosis: 1.967 Cond. No. 10.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified
You can see that X is shown as USD in the summary, which is what I want.
However, after adding a new variable
#JPY + USD
X = sm.add_constant(JPY)
X = np.column_stack((X, USD))
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.641
Model: OLS Adj. R-squared: 0.640
Method: Least Squares F-statistic: 868.8
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.80e-217
Time: 17:39:19 Log-Likelihood: -7669.4
No. Observations: 977 AIC: 1.534e+04
Df Residuals: 974 BIC: 1.536e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1559.5880 149.478 -10.434 0.000 -1852.923 -1266.253
x1 78.6589 2.466 31.902 0.000 73.820 83.497
x2 -366.5850 178.672 -2.052 0.040 -717.211 -15.958
==============================================================================
Omnibus: 24.957 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.278
Skew: 0.353 Prob(JB): 1.19e-06
Kurtosis: 3.415 Cond. No. 743.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
It is not showing USD and JPY, but x1 and x2. Is there a way to fix it? I tried Google but found nothing.
Since my question is really only about how the variables are displayed, keeping the column headers solves the problem, so I am posting my solution in case someone has the same issue.
#JPY + USD
X = JPY.join(USD)
X = sm.add_constant(X)
#X = np.column_stack((X, USD))
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.641
Model: OLS Adj. R-squared: 0.640
Method: Least Squares F-statistic: 868.8
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.80e-217
Time: 22:51:43 Log-Likelihood: -7669.4
No. Observations: 977 AIC: 1.534e+04
Df Residuals: 974 BIC: 1.536e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1559.5880 149.478 -10.434 0.000 -1852.923 -1266.253
JPY 78.6589 2.466 31.902 0.000 73.820 83.497
USD -366.5850 178.672 -2.052 0.040 -717.211 -15.958
==============================================================================
Omnibus: 24.957 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.278
Skew: 0.353 Prob(JB): 1.19e-06
Kurtosis: 3.415 Cond. No. 743.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Here's an easy fix using pandas: you only need to pass a list of feature names to summary() via its xname argument.
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler

# list of feature names (exclude the last column, which is the label)
features = list(df.iloc[:, 0:-1].columns)

# scale features
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# train the multiple linear regression model
regressor = sm.OLS(y_train, X_train).fit()
regressor.summary(xname=features)
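Note that xname needs one name per column of the design matrix, so if you add a constant with sm.add_constant you would also need to include a name for it (e.g. 'const') in the list.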
I have written code for a multiple linear regression model, but when I use results.summary() Python spits this whole thing out:
if i > 1:
    xxx = sm.add_constant(xxx)
results = sm.OLS(y_variable_holder, xxx).fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.001
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 1.051
Date: Wed, 14 Jun 2017 Prob (F-statistic): 0.369
Time: 20:01:26 Log-Likelihood: 6062.6
No. Observations: 2262 AIC: -1.212e+04
Df Residuals: 2258 BIC: -1.209e+04
Df Model: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -0.0002 0.000 -0.476 0.634 -0.001 0.001
x1 -0.0001 0.001 -0.218 0.828 -0.001 0.001
x2 8.445e-06 2.31e-05 0.366 0.714 -3.68e-05 5.37e-05
x3 -0.0026 0.003 -0.941 0.347 -0.008 0.003
==============================================================================
Omnibus: 322.021 Durbin-Watson: 2.255
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4334.191
Skew: -0.097 Prob(JB): 0.00
Kurtosis: 9.779 Cond. No. 127.
==============================================================================
I want Python to only spit out constant and coefficients. For example, desired output:
python output:
[-0.0002]
[-0.0001]
[8.445e-06]
[ -0.0026]
How can I achieve this? I don't need the whole summary, just the constant/coefficients.
I figured it out. The answer is results_bucket.append(results.params)
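results.params holds just the fitted constant and coefficients, in the same order as the summary table (const, x1, x2, x3). A minimal sketch to print them one per line like the desired output:
for value in results.params:
    print([value])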