Singular Matrix error for the statistical analysis of logistic regression - python

I am working on the breast cancer dataset, downloaded from here. I encoded all categorical variables with a label encoder, then followed Logistic Regression Part III StatsModel and tried to evaluate my model's performance with logistic regression. Even after label encoding, a few columns (Age, Tumor Size, Regional Node Examined, Regional Node Positive, Survival Months) are numerical and need no encoding, so I left them as they are. I therefore need to normalize the dataset; otherwise those large numerical values bias the model.
So I tried to modify the video tutorial described above in my own way:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(r"D:\Breast_Cancer_labelencoder.csv")
y = df['Status']
X = df.drop(['Status'], axis=1)
X = sm.add_constant(X)
X_train1, X_test1, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
# Normalize using StandardScaler
scaler = StandardScaler()
X_train_scale3 = scaler.fit_transform(X_train1)
X_test_scale3 = scaler.transform(X_test1)
Logit_fun = sm.Logit(y_train, X_train_scale3)
result_fun = Logit_fun.fit()
print(result_fun.summary())
I got the error: Singular matrix.
If I delete the normalization portion:
Logit_fun = sm.Logit(y_train, X_train1)
result_fun = Logit_fun.fit()
print(result_fun.summary())
I got the result
Optimization terminated successfully.
Current function value: 0.281394
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: Status No. Observations: 2816
Model: Logit Df Residuals: 2800
Method: MLE Df Model: 15
Date: Sun, 30 Oct 2022 Pseudo R-squ.: 0.3425
Time: 23:48:34 Log-Likelihood: -792.40
converged: True LL-Null: -1205.2
Covariance Type: nonrobust LLR p-value: 3.005e-166
==========================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------
const 0.0322 0.713 0.045 0.964 -1.366 1.430
Age 0.0273 0.008 3.585 0.000 0.012 0.042
Race -0.1095 0.110 -0.997 0.319 -0.325 0.106
Marital Status 0.0550 0.060 0.918 0.359 -0.062 0.172
T Stage 0.4264 0.170 2.505 0.012 0.093 0.760
N Stage 0.7746 0.265 2.921 0.003 0.255 1.294
6th Stage -0.1708 0.169 -1.009 0.313 -0.503 0.161
differentiate -0.0092 0.075 -0.124 0.902 -0.156 0.137
Grade 0.3412 0.109 3.125 0.002 0.127 0.555
A Stage 0.3260 0.375 0.868 0.385 -0.410 1.062
Tumor Size 0.0007 0.005 0.146 0.884 -0.009 0.010
Estrogen Status -0.6247 0.264 -2.368 0.018 -1.142 -0.108
Progesterone Status -0.5077 0.183 -2.771 0.006 -0.867 -0.149
Regional Node Examined -0.0260 0.009 -2.811 0.005 -0.044 -0.008
Reginol Node Positive 0.0565 0.021 2.693 0.007 0.015 0.098
Survival Months -0.0595 0.003 -18.675 0.000 -0.066 -0.053
==========================================================================================
My 1st question: Why did I get the Singular matrix error, when normalization is supposed to be an essential step?
2nd question: What purpose does the line X = sm.add_constant(X) serve?
3rd question: If I simply delete the columns whose coef values are negative, will that improve the result?
I have already watched the P-Value Method For Hypothesis Testing and Logistic Regression Details Pt 3: R-squared and p-value videos but am still unable to analyze the result.

Related

python Statsmodels SARIMAX KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'

This is my first Stack Overflow post. I am studying part time for a data science qualification and I'm stuck with Statsmodels SARIMAX prediction.
My time series data looks as follows:
ts_log.head()
Calendar Week
2016-02-22 8.168486
2016-02-29 8.252707
2016-03-07 8.324821
2016-03-14 8.371474
2016-03-21 8.766238
Name: Sales Quantity, dtype: float64
ts_log.tail()
Calendar Week
2020-07-20 8.326759
2020-07-27 8.273847
2020-08-03 8.286521
2020-08-10 8.222822
2020-08-17 8.011687
Name: Sales Quantity, dtype: float64
I run the following
from statsmodels.tsa.statespace.sarimax import SARIMAX

train = ts_log[:'2019-07-01'].dropna()
test = ts_log['2020-08-24':].dropna()
model = SARIMAX(train, order=(2,1,2), seasonal_order=(0,1,0,52),
                enforce_stationarity=False, enforce_invertibility=False)
results = model.fit()
The summary shows:
results.summary()
Dep. Variable: Sales Quantity No. Observations: 175
Model: SARIMAX(2, 1, 2)x(0, 1, 0, 52) Log Likelihood 16.441
Date: Mon, 21 Sep 2020 AIC -22.883
Time: 22:32:28 BIC -8.987
Sample: 0 HQIC -17.240
- 175
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 1.3171 0.288 4.578 0.000 0.753 1.881
ar.L2 -0.5158 0.252 -2.045 0.041 -1.010 -0.022
ma.L1 -1.5829 0.519 -3.048 0.002 -2.601 -0.565
ma.L2 0.5093 0.502 1.016 0.310 -0.474 1.492
sigma2 0.0345 0.011 3.195 0.001 0.013 0.056
Ljung-Box (Q): 30.08 Jarque-Bera (JB): 2.55
Prob(Q): 0.87 Prob(JB): 0.28
Heteroskedasticity (H): 0.54 Skew: -0.02
Prob(H) (two-sided): 0.05 Kurtosis: 3.72
However, when I try to predict, I get a KeyError suggesting my start date is incorrect, but I can't see what is wrong with it:
pred = results.predict(start='2019-06-10',end='2020-08-17')[1:]
KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'
I can see both of these dates are valid:
ts_log['2019-06-10']
8.95686647085414
ts_log['2020-08-17']
8.011686729127847
If, instead, I run it with integer positions, it works fine:
pred = results.predict(start=175,end=200)[1:]
I'd like to use dates so I can put the predictions in my time series graph alongside other dates.
EmmaT,
you seem to have the same date for start and end:
start='2019-06-10', end='2019-06-10'
Please double-check that this is what you want. Also check that '2019-06-10' is present in the dataset.

Is the categorical variable relevant if one of the dummy variables has a t score of 0.95?

If a variable has a t score above 0.05, it is deemed not relevant and should be excluded from the model. However, what if the categorical variable has 4 dummy variables and only one of them exceeds 0.05? Do I exclude the entire categorical variable?
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.803
Model: OLS Adj. R-squared: 0.801
Method: Least Squares F-statistic: 368.4
Date: Mon, 15 Jul 2019 Prob (F-statistic): 0.00
Time: 12:00:26 Log-Likelihood: -17357.
No. Observations: 1460 AIC: 3.475e+04
Df Residuals: 1443 BIC: 3.484e+04
Df Model: 16
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
const -1.366e+05 9432.229 -14.482 0.000 -1.55e+05 -1.18e+05
OverallQual 1.327e+04 1249.192 10.622 0.000 1.08e+04 1.57e+04
ExterQual 1.168e+04 2763.188 4.228 0.000 6262.969 1.71e+04
TotalBsmtSF 13.7198 5.182 2.648 0.008 3.554 23.885
GrLivArea 45.4098 2.521 18.012 0.000 40.465 50.355
1stFlrSF 9.4573 5.543 1.706 0.088 -1.416 20.330
GarageArea 22.4791 9.748 2.306 0.021 3.358 41.600
KitchenQual 1.309e+04 2142.662 6.111 0.000 8891.243 1.73e+04
GarageCars 8875.8202 2961.291 2.997 0.003 3066.923 1.47e+04
BsmtQual 1.097e+04 2094.395 5.235 0.000 6856.671 1.51e+04
GarageFinish_No 2689.1356 5847.186 0.460 0.646 -8780.759 1.42e+04
GarageFinish_RFn -8223.4503 2639.360 -3.116 0.002 -1.34e+04 -3046.057
GarageFinish_Unf -8416.9443 2928.002 -2.875 0.004 -1.42e+04 -2673.349
BsmtExposure_Gd 2.298e+04 3970.691 5.788 0.000 1.52e+04 3.08e+04
BsmtExposure_Mn -262.8498 4160.294 -0.063 0.950 -8423.721 7898.021
BsmtExposure_No -7690.0994 2800.731 -2.746 0.006 -1.32e+04 -2196.159
BsmtExposure_No Basement 2.598e+04 9879.662 2.630 0.009 6598.642 4.54e+04
==============================================================================
Omnibus: 614.604 Durbin-Watson: 1.972
Prob(Omnibus): 0.000 Jarque-Bera (JB): 76480.899
Skew: -0.928 Prob(JB): 0.00
Kurtosis: 38.409 Cond. No. 2.85e+04
==============================================================================
When you say "0.05 t score" I assume you mean "0.05 p-value". The t-value is just coef / std err, which goes into the p-value calculation (abs(t) > 2 corresponds roughly to p-value < 0.05).
When you say "categorical variable has 4 dummy variables", I presume you mean it has 4 "levels" / distinct values and you are referring to BsmtExposure_Mn. I would leave that in, as the other categories/levels are helping the model. If you had several categories that were less predictive, you could think about combining them into one "other" category.
As a general point, you shouldn't just automatically exclude variables because their p-value is > 0.05 (or whatever your cutoff / "alpha value" is). They can be useful for understanding what's going on within the model, and for explaining results to other people.
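The coef / std err relationship can be checked by hand. A sketch using the BsmtExposure_Mn row and the Df Residuals (1443) taken from the summary above:

```python
from scipy import stats

coef, std_err = -262.8498, 4160.294   # BsmtExposure_Mn row of the summary
df_resid = 1443                       # Df Residuals from the summary header

t = coef / std_err                    # the 't' column
p = 2 * stats.t.sf(abs(t), df_resid)  # two-sided p-value, the 'P>|t|' column

print(round(t, 3), round(p, 3))
```

This reproduces the t = -0.063 and P>|t| = 0.950 shown in the table.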

Multiple linear regression get best fit

I am doing some multiple linear regression with the following code:
import pandas as pd
import statsmodels.formula.api as sm

df = pd.DataFrame({"A": Output['10'],
                   "B": Input['Var1'],
                   "G": Input['Var2'],
                   "I": Input['Var3'],
                   "J": Input['Var4']})
res = sm.ols(formula="A ~ B + G + I + J", data=df).fit()
print(res.summary())
With the following result:
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.562
Model: OLS Adj. R-squared: 0.562
Method: Least Squares F-statistic: 2235.
Date: Tue, 06 Nov 2018 Prob (F-statistic): 0.00
Time: 09:48:20 Log-Likelihood: -21233.
No. Observations: 6961 AIC: 4.248e+04
Df Residuals: 6956 BIC: 4.251e+04
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 21.8504 0.448 48.760 0.000 20.972 22.729
B 1.8353 0.022 84.172 0.000 1.793 1.878
G 0.0032 0.004 0.742 0.458 -0.005 0.012
I -0.0210 0.009 -2.224 0.026 -0.039 -0.002
J 0.6677 0.061 10.868 0.000 0.547 0.788
==============================================================================
Omnibus: 2152.474 Durbin-Watson: 0.308
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5077.082
Skew: -1.773 Prob(JB): 0.00
Kurtosis: 5.221 Cond. No. 555.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
However, my Output dataframe consists of multiple columns, numbered 1 to 149. Is there a way to loop over all 149 columns of the Output dataframe and, at the end, show the best and worst fits by, for example, R-squared? Or get the largest coef for variable B?
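One way to sketch this (with synthetic stand-ins for Input and Output, since the real frames aren't shown, and only 5 output columns instead of 149): fit one OLS per Output column, keep the fitted results, then rank them by R-squared or by the B coefficient:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
Input = pd.DataFrame(rng.normal(size=(200, 4)),
                     columns=["Var1", "Var2", "Var3", "Var4"])
# Stand-in for the real Output frame: columns "1".."5" with varying signal
Output = pd.DataFrame({str(i): i * Input["Var1"] + rng.normal(size=200)
                       for i in range(1, 6)})

fits = {}
for col in Output.columns:
    df = pd.DataFrame({"A": Output[col], "B": Input["Var1"],
                       "G": Input["Var2"], "I": Input["Var3"],
                       "J": Input["Var4"]})
    fits[col] = smf.ols("A ~ B + G + I + J", data=df).fit()

best = max(fits, key=lambda c: fits[c].rsquared)      # highest R-squared
worst = min(fits, key=lambda c: fits[c].rsquared)     # lowest R-squared
largest_B = max(fits, key=lambda c: fits[c].params["B"])
print(best, worst, largest_B)
```

Each fitted result exposes `rsquared` and `params` directly, so ranking 149 fits is just a matter of keeping the results in a dict and sorting by the attribute you care about.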

Python statsmodel.tsa.MarkovAutoregression using current real GNP/GDP data

I have been using statsmodels.tsa.MarkovAutoregression to replicate Hamilton's Markov switching model published in 1989. Using the Hamilton data (real GNP in 1982 dollars), I get the same result as the code example and the paper. However, when I use currently available real GNP or GDP data (in 2009 dollars) and take its quarterly log difference as input, the model doesn't give satisfactory results.
I plotted the log difference of the Hamilton GNP against that of the currently available real GNP. They are quite close, with slight differences.
Can anyone enlighten me as to why this is the case? Does it have anything to do with the seasonal adjustment of current GNP data? If so, is there any way to counter it?
Result using current available GNP
Result using paper provided GNP
You write:
the model doesn't give satisfactory results.
But what you mean is that the model isn't giving you the results you expect / want. i.e., you want the model to pick out periods the NBER has labeled as "Recessions", but the Markov switching model is simply finding the parameters which maximize the likelihood function for the data.
(The rest of the post shows results that are taken from this Jupyter notebook: http://nbviewer.jupyter.org/gist/ChadFulton/a5d24d32ba3b7b2e381e43a232342f1f)
(I'll also note that I double-checked these results using E-views, and it agrees with Statsmodels' output almost exactly).
The raw dataset is the growth rate (log difference * 100) of real GNP; the Hamilton dataset versus one found on the Federal Reserve Economic Database are shown here, with grey bars indicating NBER-dated recessions:
In this case, the model is an AR(4) on the growth rate of real GNP, with a regime-specific intercept; the model allows two regimes. The idea is that "recessions" should correspond to a low (or negative) average growth rate and expansions should correspond to a higher average growth rate.
Model 1: Hamilton's dataset: Maximum likelihood estimation of parameters
From the model applied to Hamilton's (1989) dataset, we get the following estimated parameters:
Markov Switching Model Results
================================================================================
Dep. Variable: Hamilton No. Observations: 131
Model: MarkovAutoregression Log Likelihood -181.263
Date: Sun, 02 Apr 2017 AIC 380.527
Time: 19:52:31 BIC 406.404
Sample: 04-01-1951 HQIC 391.042
- 10-01-1984
Covariance Type: approx
Regime 0 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.3588 0.265 -1.356 0.175 -0.877 0.160
Regime 1 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 1.1635 0.075 15.614 0.000 1.017 1.310
Non-switching parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
sigma2 0.5914 0.103 5.761 0.000 0.390 0.793
ar.L1 0.0135 0.120 0.112 0.911 -0.222 0.249
ar.L2 -0.0575 0.138 -0.418 0.676 -0.327 0.212
ar.L3 -0.2470 0.107 -2.310 0.021 -0.457 -0.037
ar.L4 -0.2129 0.111 -1.926 0.054 -0.430 0.004
Regime transition parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
p[0->0] 0.7547 0.097 7.819 0.000 0.565 0.944
p[1->0] 0.0959 0.038 2.542 0.011 0.022 0.170
==============================================================================
and the time series of the probability of operating in regime 0 (which here corresponds to a negative growth rate, i.e. a recession) looks like:
Model 2: Updated dataset: Maximum likelihood estimation of parameters
Now, as you saw, we can instead fit the model using the "updated" dataset (which looks pretty much like the original dataset), to get the following parameters and regime probabilities:
Markov Switching Model Results
================================================================================
Dep. Variable: GNPC96 No. Observations: 131
Model: MarkovAutoregression Log Likelihood -188.002
Date: Sun, 02 Apr 2017 AIC 394.005
Time: 20:00:58 BIC 419.882
Sample: 04-01-1951 HQIC 404.520
- 10-01-1984
Covariance Type: approx
Regime 0 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -1.2475 3.470 -0.359 0.719 -8.049 5.554
Regime 1 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 0.9364 0.453 2.066 0.039 0.048 1.825
Non-switching parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
sigma2 0.8509 0.561 1.516 0.130 -0.249 1.951
ar.L1 0.3437 0.189 1.821 0.069 -0.026 0.714
ar.L2 0.0919 0.143 0.645 0.519 -0.187 0.371
ar.L3 -0.0846 0.251 -0.337 0.736 -0.577 0.408
ar.L4 -0.1727 0.258 -0.669 0.503 -0.678 0.333
Regime transition parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
p[0->0] 0.0002 1.705 0.000 1.000 -3.341 3.341
p[1->0] 0.0397 0.186 0.213 0.831 -0.326 0.405
==============================================================================
To understand what the model is doing, look at the intercepts in the two regimes. In Hamilton's model, the "low" regime has an intercept of -0.35, whereas with the updated data, the "low" regime has an intercept of -1.25.
What that tells us is that with the updated dataset, the model is doing a "better job" fitting the data (in terms of a higher likelihood) by choosing the "low" regime to be much deeper recessions. In particular, looking back at the GNP data series, it's apparent that it's using the "low" regime to fit the very low growth in the late 1950's and early 1980's.
In contrast, the fitted parameters from Hamilton's model allow the "low" regime to fit "moderately low" growth rates that cover a wider range of recessions.
We can't compare these two models' outcomes using e.g. the log-likelihood values, because they're using different datasets. One thing we could try, though, is to use the fitted parameters from Hamilton's dataset on the updated GNP data. Doing that, we get the following result:
Model 3: Updated dataset using parameters estimated on Hamilton's dataset
Markov Switching Model Results
================================================================================
Dep. Variable: GNPC96 No. Observations: 131
Model: MarkovAutoregression Log Likelihood -191.807
Date: Sun, 02 Apr 2017 AIC 401.614
Time: 19:52:52 BIC 427.491
Sample: 04-01-1951 HQIC 412.129
- 10-01-1984
Covariance Type: opg
Regime 0 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.3588 0.185 -1.939 0.053 -0.722 0.004
Regime 1 parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 1.1635 0.083 13.967 0.000 1.000 1.327
Non-switching parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
sigma2 0.5914 0.090 6.604 0.000 0.416 0.767
ar.L1 0.0135 0.100 0.134 0.893 -0.183 0.210
ar.L2 -0.0575 0.088 -0.651 0.515 -0.231 0.116
ar.L3 -0.2470 0.104 -2.384 0.017 -0.450 -0.044
ar.L4 -0.2129 0.084 -2.524 0.012 -0.378 -0.048
Regime transition parameters
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
p[0->0] 0.7547 0.100 7.563 0.000 0.559 0.950
p[1->0] 0.0959 0.051 1.872 0.061 -0.005 0.196
==============================================================================
This looks more like what you expected/wanted, and that's because, as I mentioned above, the "low" regime intercept of -0.35 makes the "low" regime a good fit for more time periods in the sample. But notice that the log-likelihood here is -191.8, whereas in Model 2 the log-likelihood was -188.0.
Thus even though this model looks more like what you wanted, it does not fit the data as well.
(Note again that you can't compare these log-likelihoods to the -181.3 from Model 1, because that is using a different dataset).

How do you pull values out of statsmodels.WLS.fit.summary?

This is the code I have so far. I am performing a weighted least squares operation, and am printing the results out. I want to use the results from the summary, but the summary is apparently not iterable. Is there a way to pull the values from the summary?
self.b = np.linalg.lstsq(self.G, self.d, rcond=None)
w = np.asarray(self.dw)
mod_wls = sm.WLS(self.d, self.G, weights=1./np.asarray(w))
res_wls = mod_wls.fit()
report = res_wls.summary()
print(report)
Here is the summary as it prints out.
WLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.955
Model: WLS Adj. R-squared: 0.944
Method: Least Squares F-statistic: 92.82
Date: Mon, 24 Oct 2016 Prob (F-statistic): 4.94e-14
Time: 11:38:16 Log-Likelihood: 138.19
No. Observations: 28 AIC: -264.4
Df Residuals: 22 BIC: -256.4
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 -0.0066 0.001 -12.389 0.000 -0.008 -0.006
x2 0.0072 0.000 15.805 0.000 0.006 0.008
x3 1.853e-08 2.45e-08 0.756 0.457 -3.23e-08 6.93e-08
x4 -4.402e-09 6.58e-09 -0.669 0.511 -1.81e-08 9.25e-09
x5 -3.595e-08 1.42e-08 -2.528 0.019 -6.55e-08 -6.45e-09
x6 4.402e-09 6.58e-09 0.669 0.511 -9.25e-09 1.81e-08
x7 -6.759e-05 4.17e-05 -1.620 0.120 -0.000 1.9e-05
==============================================================================
Omnibus: 4.421 Durbin-Watson: 1.564
Prob(Omnibus): 0.110 Jarque-Bera (JB): 2.846
Skew: 0.729 Prob(JB): 0.241
Kurtosis: 3.560 Cond. No. 2.22e+16
==============================================================================
Edit: To clarify, I want to extract the 'std err' values from each of the x1, x2, ..., x7 rows. I can't seem to find the attribute that represents them or the rows they are in. Does anyone know how to do this?
After your operations, res_wls is of type statsmodels.regression.linear_model.RegressionResults, which contains individual attributes for each of the values that you might be interested in. See the documentation for the names of those. For instance, res_wls.rsquared should give you your $R^2$.
