Printing out effect sizes from an anova_lm model - python

I have the following code. I am running a linear model on the dataframe 'x', with gender and highest education level achieved as categorical variables.
The aim is to assess how well age, gender and highest level of education achieved can predict 'weighteddistance'.
import statsmodels.api as sm
import statsmodels.formula.api as smf

resultmodeldistancevariation2sleep = smf.ols(formula='weighteddistance ~ age + C(gender) + C(highest_education_level_acheived)', data=x).fit()
summarymodel = resultmodeldistancevariation2sleep.summary()
print(summarymodel)
This gives me the output:
                                             coef   std err       t   P>|t|   [0.025   0.975]
Intercept                                  6.3693     1.391   4.580   0.000    3.638    9.100
C(gender)[T.2.0]                           0.2301     0.155   1.489   0.137   -0.073    0.534
C(gender)[T.3.0]                           0.0302     0.429   0.070   0.944   -0.812    0.872
C(highest_education_level_acheived)[T.3]   1.1292     0.501   2.252   0.025    0.145    2.114
C(highest_education_level_acheived)[T.4]   1.0876     0.513   2.118   0.035    0.079    2.096
C(highest_education_level_acheived)[T.5]   1.0692     0.498   2.145   0.032    0.090    2.048
C(highest_education_level_acheived)[T.6]   1.2995     0.525   2.476   0.014    0.269    2.330
C(highest_education_level_acheived)[T.7]   1.7391     0.605   2.873   0.004    0.550    2.928
However, I want to calculate the main effect of each categorical variable on distance, which is not shown in the model above, so I passed the fitted model to an ANOVA using 'anova_lm'.
anovaoutput = sm.stats.anova_lm(resultmodeldistancevariation2sleep)
anovaoutput['PR(>F)'] = anovaoutput['PR(>F)'].round(4)
This gives me the following output below and, as I wanted, shows me the main effect of each categorical variable - gender and highest education level achieved - rather than the different groups within that variable (meaning there is no gender[2.0] and gender[3.0] in the output below).
df sum_sq mean_sq F PR(>F)
C(gender) 2.0 4.227966 2.113983 5.681874 0.0036
C(highest_education_level_acheived) 5.0 11.425706 2.285141 6.141906 0.0000
age 1.0 8.274317 8.274317 22.239357 0.0000
Residual 647.0 240.721120 0.372057 NaN NaN
However, this output no longer shows me the confidence intervals or the coefficients for each variable.
In other words, I would like the bottom ANOVA table to have columns for 'coef' and '[0.025 0.975]', like in the first table.
How can I achieve this?
I would be so grateful for a helping hand!
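As a side note, the per-level coefficients and 95% intervals shown in the first table can also be pulled out of the fitted model programmatically. Here is a minimal sketch using the standard statsmodels result attributes `params` and `conf_int()` (variable name reused from the question):

import pandas as pd

res = resultmodeldistancevariation2sleep  # fitted OLS result from above

# coefficients plus lower/upper 95% confidence bounds in one frame
coef_table = pd.concat([res.params, res.conf_int()], axis=1)
coef_table.columns = ['coef', '[0.025', '0.975]']
print(coef_table.round(4))

Keep in mind that these are still per-level coefficients, whereas the ANOVA rows summarise whole factors, so the two tables do not line up row for row.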

Related

Singular Matrix error for the statistical analysis of logistic regression

I am working on the breast cancer dataset, which was downloaded from here. I encoded all categorical variables using a label encoder. Then I followed Logistic Regression Part III StatsModel and tried to check my model performance with logistic regression. Although I label-encoded the dataset, there were still a few columns, like Age, Tumor Size, Regional Node Examined, Regional Node Positive, and Survival Months, that were numerical and do not need any encoding, so I left them as they are. Therefore, I need to normalize the dataset, otherwise those large numerical values bias the model.
So I tried to adapt the approach from the video tutorial described above in my own way:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(r"D:\Breast_Cancer_labelencoder.csv")
y = df['Status']
X = df.drop(['Status'], axis=1)
X = sm.add_constant(X)
X_train1, X_test1, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
# Normalize using StandardScaler
scaler = StandardScaler()
X_train_scale3 = scaler.fit_transform(X_train1)
X_test_scale3 = scaler.transform(X_test1)
Logit_fun = sm.Logit(y_train, X_train_scale3)
result_fun = Logit_fun.fit()
print(result_fun.summary())
I got the error Singular Matrix
If I delete the normalization portion
Logit_fun = sm.Logit(y_train, X_train1)
result_fun = Logit_fun.fit()
print(result_fun.summary())
I got the result
Optimization terminated successfully.
Current function value: 0.281394
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: Status No. Observations: 2816
Model: Logit Df Residuals: 2800
Method: MLE Df Model: 15
Date: Sun, 30 Oct 2022 Pseudo R-squ.: 0.3425
Time: 23:48:34 Log-Likelihood: -792.40
converged: True LL-Null: -1205.2
Covariance Type: nonrobust LLR p-value: 3.005e-166
==========================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------
const 0.0322 0.713 0.045 0.964 -1.366 1.430
Age 0.0273 0.008 3.585 0.000 0.012 0.042
Race -0.1095 0.110 -0.997 0.319 -0.325 0.106
Marital Status 0.0550 0.060 0.918 0.359 -0.062 0.172
T Stage 0.4264 0.170 2.505 0.012 0.093 0.760
N Stage 0.7746 0.265 2.921 0.003 0.255 1.294
6th Stage -0.1708 0.169 -1.009 0.313 -0.503 0.161
differentiate -0.0092 0.075 -0.124 0.902 -0.156 0.137
Grade 0.3412 0.109 3.125 0.002 0.127 0.555
A Stage 0.3260 0.375 0.868 0.385 -0.410 1.062
Tumor Size 0.0007 0.005 0.146 0.884 -0.009 0.010
Estrogen Status -0.6247 0.264 -2.368 0.018 -1.142 -0.108
Progesterone Status -0.5077 0.183 -2.771 0.006 -0.867 -0.149
Regional Node Examined -0.0260 0.009 -2.811 0.005 -0.044 -0.008
Reginol Node Positive 0.0565 0.021 2.693 0.007 0.015 0.098
Survival Months -0.0595 0.003 -18.675 0.000 -0.066 -0.053
==========================================================================================
My 1st question: Why did I get the singular matrix error, when normalization is an essential step?
2nd question: X = sm.add_constant(X) - What purpose does this line serve?
3rd question: If I just delete the columns whose coef values are negative, will it help to improve the result?
I already watched the P-Value Method For Hypothesis Testing and Logistic Regression Details Pt 3: R-squared and p-value videos but am still unable to analyse the result.
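For context on the 2nd question: `sm.add_constant(X)` prepends a column of ones (named `const`) so that the model fits an intercept. Below is a toy sketch (made-up numbers, not from the post) that also illustrates one plausible cause of the singular-matrix error, assuming the scaler was applied to the full matrix including that constant column:

import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

Xc = sm.add_constant(X)          # first column is now all ones (the intercept)
print(Xc)

scaled = StandardScaler().fit_transform(Xc)
print(scaled[:, 0])              # the zero-variance constant column is mapped
                                 # to all zeros, which makes the design matrix
                                 # singular when the model is fitted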

Convert day numbers into dates in python

How do you convert day numbers (1,2,3...728,729,730) to dates in python? I can assign an arbitrary year to start the date count as the year doesn't matter to me.
I am working on learning time series analysis (ARIMA, SARIMA, etc.) using Python. I have a CSV dataset with two columns, 'Day' and 'Revenue'. The Day column contains numbers 1-731, and Revenue contains numbers 0-18.154... I have had success building the model, running statistical tests, building visualizations, etc. But when it comes to forecasting using Prophet, I am hitting a wall.
Here are what I feel are the relevant parts of the code related to the question:
# Loading the CSV with pandas. This code converts the "Day" column into the index.
from pandas import read_csv

df = read_csv("telco_time_series.csv", index_col=0, parse_dates=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 731 entries, 1 to 731
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Revenue 731 non-null float64
dtypes: float64(1)
memory usage: 11.4 KB
df.head()
Revenue
Day
1 0.000000
2 0.000793
3 0.825542
4 0.320332
5 1.082554
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt

# Instantiate the model
model = ARIMA(df, order=(4,1,0))
# Fit the model
results = model.fit()
# Print summary
print(results.summary())
# line plot of residuals
residuals = (results.resid)
residuals.plot()
plt.show()
# density plot of residuals
residuals.plot(kind='kde')
plt.show()
# summary stats of residuals
print(residuals.describe())
SARIMAX Results
==============================================================================
Dep. Variable: Revenue No. Observations: 731
Model: ARIMA(4, 1, 0) Log Likelihood -489.105
Date: Tue, 03 Aug 2021 AIC 988.210
Time: 07:29:55 BIC 1011.175
Sample: 0 HQIC 997.070
- 731
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.4642 0.037 -12.460 0.000 -0.537 -0.391
ar.L2 0.0295 0.040 0.746 0.456 -0.048 0.107
ar.L3 0.0618 0.041 1.509 0.131 -0.018 0.142
ar.L4 0.0366 0.039 0.946 0.344 -0.039 0.112
sigma2 0.2235 0.013 17.629 0.000 0.199 0.248
===================================================================================
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 2.52
Prob(Q): 0.90 Prob(JB): 0.28
Heteroskedasticity (H): 1.01 Skew: -0.05
Prob(H) (two-sided): 0.91 Kurtosis: 2.73
===================================================================================
df.columns=['ds','y']
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
m = Prophet()
m.fit(df)
ValueError: Dataframe must have columns "ds" and "y" with the dates and values
respectively.
I've had success with the forecast using prophet if I fill the values in the CSV with dates, but I would like to convert the Day numbers within the code using pandas.
Any ideas?
I can assign an arbitrary year to start the date count as the year doesn't matter to me (...) Any ideas?
You might harness datetime.timedelta for this task. Select any date you wish as day 0 and then add datetime.timedelta(days=x) where x is your day number, for example:
import datetime
day0 = datetime.date(2000,1,1)
day120 = day0 + datetime.timedelta(days=120)
print(day120)
output
2000-04-30
Wrap it in a function and use .apply if you have a pandas.DataFrame, like so:
import datetime
import pandas as pd
def convert_to_date(x):
    return datetime.date(2000, 1, 1) + datetime.timedelta(days=x)
df = pd.DataFrame({'day_n':[1,2,3,4,5]})
df['day_date'] = df['day_n'].apply(convert_to_date)
print(df)
output
day_n day_date
0 1 2000-01-02
1 2 2000-01-03
2 3 2000-01-04
3 4 2000-01-05
4 5 2000-01-06
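Applied to the data described in the question (a 'Day' index and a 'Revenue' column), the same idea could be used to build the ds/y frame that Prophet expects. This is only a sketch; the base date is arbitrary, and the Prophet import path depends on the installed package version:

import datetime
import pandas as pd
from prophet import Prophet   # or: from fbprophet import Prophet (older versions)

def convert_to_date(x):
    # day 1 -> 2000-01-01, day 2 -> 2000-01-02, ... (base year is arbitrary)
    return datetime.date(1999, 12, 31) + datetime.timedelta(days=x)

df = df.reset_index()                      # turn the 'Day' index back into a column
prophet_df = pd.DataFrame({
    'ds': df['Day'].apply(convert_to_date),
    'y': df['Revenue'],
})
# equivalent one-liner for ds: pd.to_datetime(df['Day'], unit='D', origin='1999-12-31')

m = Prophet()
m.fit(prophet_df)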

python Statsmodels SARIMAX KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'

This is my first Stack Overflow post. I am studying part-time for a data science qualification, and I'm stuck with Statsmodels SARIMAX prediction.
My time series data looks as follows:
ts_log.head()
Calendar Week
2016-02-22 8.168486
2016-02-29 8.252707
2016-03-07 8.324821
2016-03-14 8.371474
2016-03-21 8.766238
Name: Sales Quantity, dtype: float64
ts_log.tail()
Calendar Week
2020-07-20 8.326759
2020-07-27 8.273847
2020-08-03 8.286521
2020-08-10 8.222822
2020-08-17 8.011687
Name: Sales Quantity, dtype: float64
I run the following
train = ts_log[:'2019-07-01'].dropna()
test = ts_log['2020-08-24':].dropna()
model = SARIMAX(train, order=(2,1,2), seasonal_order=(0,1,0,52)
,enforce_stationarity=False, enforce_invertibility=False)
results = model.fit()
summary shows
results.summary()
Dep. Variable: Sales Quantity No. Observations: 175
Model: SARIMAX(2, 1, 2)x(0, 1, 0, 52) Log Likelihood 16.441
Date: Mon, 21 Sep 2020 AIC -22.883
Time: 22:32:28 BIC -8.987
Sample: 0 HQIC -17.240
- 175
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 1.3171 0.288 4.578 0.000 0.753 1.881
ar.L2 -0.5158 0.252 -2.045 0.041 -1.010 -0.022
ma.L1 -1.5829 0.519 -3.048 0.002 -2.601 -0.565
ma.L2 0.5093 0.502 1.016 0.310 -0.474 1.492
sigma2 0.0345 0.011 3.195 0.001 0.013 0.056
Ljung-Box (Q): 30.08 Jarque-Bera (JB): 2.55
Prob(Q): 0.87 Prob(JB): 0.28
Heteroskedasticity (H): 0.54 Skew: -0.02
Prob(H) (two-sided): 0.05 Kurtosis: 3.72
However, when I try to predict, I get a KeyError suggesting my start date is incorrect, but I can't see what is wrong with it:
pred = results.predict(start='2019-06-10',end='2020-08-17')[1:]
KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'
I can see both of these dates are valid:
ts_log['2019-06-10']
8.95686647085414
ts_log['2020-08-17']
8.011686729127847
If, instead I run with numbers, it works fine
pred = results.predict(start=175,end=200)[1:]
I'd like to use dates so I can use the predictions in my time series graph with other dates.
EmmaT,
you seem to have the same date for start and end:
start='2019-06-10', end='2019-06-10'
Please double-check that this is what you want. Also check that '2019-06-10' is present in the dataset.
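One common cause of this particular KeyError (an assumption here, not something stated in the post) is that the series index has no frequency set, so statsmodels falls back to a plain integer index; the "Sample: 0 - 175" line in the summary hints at this. A minimal sketch of fitting with an explicit weekly frequency so that date strings work for start/end:

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# give the index an explicit weekly (Monday) frequency -- assumes the data
# really is a regular weekly series, as the head()/tail() output suggests
ts_log.index = pd.DatetimeIndex(ts_log.index)
ts_log = ts_log.asfreq('W-MON')

train = ts_log[:'2019-07-01']
model = SARIMAX(train, order=(2, 1, 2), seasonal_order=(0, 1, 0, 52),
                enforce_stationarity=False, enforce_invertibility=False)
results = model.fit()

# with a date-aware, frequency-aware index, dates can be used directly
pred = results.predict(start='2019-06-10', end='2020-08-17')[1:]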

Is use of ols (python statsmodel) correct for longitudinal data and multiple dependent variables?

I am still a noob when it comes to statistics.
I am using the Python package statsmodels, with the patsy formula functionality.
My pandas dataframe looks as such:
index sed label c_g lvl1 lvl2
0 5.0 SP_A c b c
1 10.0 SP_B g b c
2 0.0 SP_C c b c
3 -10.0 SP_H c b c
4 0.0 SP_J g b c
5 -20.0 SP_K g b c
6 30.0 SP_W g a a
7 40.0 SP_X g a a
8 -10.0 SP_Y c a a
9 45.0 SP_BB g a a
10 45.0 SP_CC g a a
11 10.0 SP_A c b c
12 10.0 SP_B g b c
13 10.0 SP_C c b c
14 6.0 SP_D g b c
15 10.0 SP_E c b c
16 29.0 SP_F c b c
17 3.0 SP_G g b c
18 23.0 SP_H c b c
19 34.0 SP_J g b c
Dependent variable: Sedimentation (longitudinal data)
Independent variables: Label (categorical), control_grid (categorical), lvl1 (categorical), lvl2 (categorical).
I am interested in two things.
Which Independent variables have significant effect on Dependent variable?
Which Independent variables have significant interaction?
After searching and reading multiple documents, I do this as follows:
import statsmodels.formula.api as smf
import pandas as pd
df = pd.read_csv('some.csv')
model = smf.ols(formula = 'sedimentation ~ lvl1*lvl2',data=df)
results = model.fit()
results.summary()
With results showing:
OLS Regression Results
==============================================================================
Dep. Variable: sedimentation R-squared: 0.129
Model: OLS Adj. R-squared: 0.124
Method: Least Squares F-statistic: 24.91
Date: Tue, 17 Jul 2018 Prob (F-statistic): 4.80e-15
Time: 11:15:28 Log-Likelihood: -2353.6
No. Observations: 510 AIC: 4715.
Df Residuals: 506 BIC: 4732.
Df Model: 3
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 6.9871 1.611 4.338 0.000 3.823 10.151
lvl1[T.b] -3.7990 1.173 -3.239 0.001 -6.103 -1.495
lvl1[T.d] -3.5124 1.400 -2.509 0.012 -6.263 -0.762
lvl2[T.b] -8.9427 1.155 -7.744 0.000 -11.212 -6.674
lvl2[T.c] 5.1436 0.899 5.722 0.000 3.377 6.910
lvl2[T.f] -3.5124 1.400 -2.509 0.012 -6.263 -0.762
lvl1[T.b]:lvl2[T.b] -8.9427 1.155 -7.744 0.000 -11.212 -6.674
lvl1[T.d]:lvl2[T.b] 0 0 nan nan 0 0
lvl1[T.b]:lvl2[T.c] 5.1436 0.899 5.722 0.000 3.377 6.910
lvl1[T.d]:lvl2[T.c] 0 0 nan nan 0 0
lvl1[T.b]:lvl2[T.f] 0 0 nan nan 0 0
lvl1[T.d]:lvl2[T.f] -3.5124 1.400 -2.509 0.012 -6.263 -0.762
==============================================================================
Omnibus: 13.069 Durbin-Watson: 1.118
Prob(Omnibus): 0.001 Jarque-Bera (JB): 18.495
Skew: -0.224 Prob(JB): 9.63e-05
Kurtosis: 3.818 Cond. No. inf
==============================================================================
Am I using the correct model in Python to get my desired results?
I think I am, but I would like to verify. The way I read the table is that the categorical variables lvl1 and lvl2 have a significant effect on the dependent variable AND show significant interaction (for some of the levels). However, I don't understand why not all of my variables are showing: as you can see in my data, the lvl1 column also contains "a", but this value is not shown in the results summary.
I am not an expert, and I'm afraid I can't tell you what the correct test to apply to longitudinal data is, but I think that the numbers you got can't really be trusted that much.
First, the easy part of the answer, regarding your "why not all of my variables are showing": in lvl1, for example, "a" is not showing because you have to fix a "base" value of some kind. So you should read every entry as "effect of having 'b' instead of 'a'", "effect of having 'd' instead of 'a'", etc. In more mathematical terms, if you have a categorical variable that takes three values (a, b, d here), then when you implicitly one-hot encode it you get three columns that always take values 0 or 1 and whose sum is always 1. This means that your design matrix A in the regression y = A·x + b will always be degenerate, and you have to delete one column to have a chance of it not being so (and thus to give any interpretability at all to the regression coefficients).
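To make the base-level point concrete, here is a small illustration (toy data, not from the post) using pandas dummy coding, which mirrors what the patsy treatment coding does:

import pandas as pd

lvl1 = pd.Series(['a', 'b', 'd', 'b', 'a'])

full = pd.get_dummies(lvl1)                      # columns a, b, d
print(full.sum(axis=1).unique())                 # [1] -- every row sums to 1,
                                                 # so the columns are collinear
                                                 # with an intercept column

reduced = pd.get_dummies(lvl1, drop_first=True)  # drops 'a' as the base level
print(reduced.columns.tolist())                  # ['b', 'd'] -- matches the
                                                 # lvl1[T.b] and lvl1[T.d] rows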
Concerning why I think the numbers you got cannot be trusted: among the assumptions of linear regression is the independence of consecutive observations (rows). In the case of longitudinal data, this is exactly what fails. Pushing the example to the limit, if you observe a group of people (e.g. 11, as in your set) every second for one day, you get a huge data frame of nearly a million rows, and every single person has virtually the same data repeated over and over again. In this setting, any spurious correlation between the independent and dependent variables will be seen by your model as hugely significant (to it, you've run 86,400 independent tests and they all confirmed exactly the same conclusion!), while of course this is not the case.
Summing up, I can't say for sure that the regression coefficients you get are not the best guess you can hope for, but certainly the t statistics, the p-values and everything else that looks like a statistic there don't make much sense.

How to interpret the summary table for Python OLS Statsmodel?

I have a continuous dependent variable y and an independent categorical variable x named control_grid. x contains two values: c and g.
Using the Python package statsmodels, I am trying to see whether the independent variable has a significant effect on the y variable, as such:
model = smf.ols('y ~ c(x)', data=df)
results = model.fit()
table = sm.stats.anova_lm(results, typ=2)
Printing the table gives this as output:
OLS Regression Results
==============================================================================
Dep. Variable: sedimentation R-squared: 0.167
Model: OLS Adj. R-squared: 0.165
Method: Least Squares F-statistic: 86.84
Date: Fri, 13 Jul 2018 Prob (F-statistic): 5.99e-19
Time: 16:15:51 Log-Likelihood: -2019.2
No. Observations: 436 AIC: 4042.
Df Residuals: 434 BIC: 4050.
Df Model: 1
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept -6.0243 1.734 -3.474 0.001 -9.433 -2.616
control_grid[T.g] 22.2504 2.388 9.319 0.000 17.558 26.943
==============================================================================
Omnibus: 30.623 Durbin-Watson: 1.064
Prob(Omnibus): 0.000 Jarque-Bera (JB): 45.853
Skew: -0.510 Prob(JB): 1.10e-10
Kurtosis: 4.218 Cond. No. 2.69
==============================================================================
In the table where the coefficients are shown, I don't understand the depiction of my independent variable.
It says:
control_grid[T.g]
What is the "T"?
And is it only looking at one of the two variables? Only at the effect of "g" and not at "c"?
If you go here, you see that in the summary the categorical variable Region is also shown for all four values "N", "S", "E" and "W".
P.S. my data looks as such:
index sedimentation control_grid
0 5.0 c
1 10.0 g
2 0.0 c
3 -10.0 c
4 0.0 g
5 -20.0 g
6 30.0 g
7 40.0 g
8 -10.0 c
9 45.0 g
10 45.0 g
11 10.0 c
12 10.0 g
13 10.0 c
14 6.0 g
15 10.0 c
16 29.0 c
17 3.0 g
18 23.0 c
19 34.0 g
I am not an expert, but I'll try to explain it. First, you should know that ANOVA is a regression analysis, so you are building a model Y ~ X, but in ANOVA X is a categorical variable. In your case Y = sedimentation and X = control_grid (which is categorical), so the model is "sedimentation ~ control_grid".
OLS performs a regression analysis, so it calculates the parameters of a linear model: Y = B0 + B1·X. But given that your X is categorical, it is dummy coded, which means X can only be 0 or 1, which is consistent with categorical data. Be aware that in ANOVA the number of parameters estimated equals the number of categories minus 1; in your data you have only 2 categories (g and c), therefore only one parameter is shown in your OLS report. "T.g" means this parameter corresponds to the "g" category. So your model is Y = B0 + T.g·X.
Now, the parameter for T.c is absorbed into B0 (the intercept), so in effect your model is:
Y = T.c·X_c + T.g·X_g, where X_c and X_g are 0 or 1 depending on whether the observation is "c" or "g".
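Plugging in the numbers from the summary table above, this is just arithmetic on the coefficients shown: the fitted sedimentation for "c" is the Intercept, -6.0243, and the fitted sedimentation for "g" is Intercept + control_grid[T.g] = -6.0243 + 22.2504 = 16.2261.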
So, you are asking:
1) What is the "T"?
T (as in T.g) indicates that the parameter estimated and shown corresponds to the category "g" (in patsy notation, T stands for treatment coding, i.e. dummy coding against a base level).
2) And is it only looking at one of the two variables?
No, the analysis estimates parameters for both categories (c and g), but the intercept B0 represents the coefficient for the other level of the category, in your data "c".
3) Only at the effect of "g" and not at "c"?
No; in fact, the analysis looks at the effect of both "g" and "c". If you look at the values of the coefficient T.g and the Intercept (T.c), you can see whether they are significant or not (p-values), and you can say whether they have an effect on "sedimentation".
Cheers,
