I was testing some basic categorical regression using statsmodels:
I built a deterministic model
Y = X + Z
where X can take 3 values (a, b or c) and Z only 2 (d or e).
At this stage the model is purely deterministic; I set up the weights for each variable as follows:
a's weight=1
b's weight=2
c's weight=3
d's weight=4
e's weight=5
Therefore with 1(X=a) being 1 if X=a, 0 otherwise, the model is simply:
Y = 1(X=a) + 2*1(X=b) + 3*1(X=c) + 4*1(Z=d) + 5*1(Z=e)
I use the following code to generate the different variables and run the regression:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

nbData = 100
rand1 = np.random.uniform(size=nbData)
rand2 = np.random.uniform(size=nbData)
a = 1 * (rand1 <= (1.0/3.0))
b = 1 * (((1.0/3.0) < rand1) & (rand1 < (4.0/5.0)))
c = 1 - b - a
d = 1 * (rand2 <= (3.0/5.0))
e = 1 - d
weights = [1, 2, 3, 4, 5]
y = a + 2*b + 3*c + 4*d + 5*e
df = pd.DataFrame({'y':y, 'a':a, 'b':b, 'c':c, 'd':d, 'e':e})
mod = ols(formula='y ~ a + b + c + d + e - 1', data=df)
res = mod.fit()
print(res.summary())
I end up with the right results (one has to look at the differences between the coefficients rather than the coefficients themselves):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.006e+30
Date: Wed, 16 Sep 2015 Prob (F-statistic): 0.00
Time: 03:05:40 Log-Likelihood: 3156.8
No. Observations: 100 AIC: -6306.
Df Residuals: 96 BIC: -6295.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
a 1.6000 7.47e-16 2.14e+15 0.000 1.600 1.600
b 2.6000 6.11e-16 4.25e+15 0.000 2.600 2.600
c 3.6000 9.61e-16 3.74e+15 0.000 3.600 3.600
d 3.4000 5.21e-16 6.52e+15 0.000 3.400 3.400
e 4.4000 6.85e-16 6.42e+15 0.000 4.400 4.400
==============================================================================
Omnibus: 11.299 Durbin-Watson: 0.833
Prob(Omnibus): 0.004 Jarque-Bera (JB): 5.720
Skew: -0.381 Prob(JB): 0.0573
Kurtosis: 2.110 Cond. No. 2.46e+15
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.67e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
But when I increase the number of data points to (say) 600, the regression produces really bad results. I have tried similar regressions in Excel and in R and they produce consistent results whatever the number of data points. Does anyone know if there is some restriction in statsmodels OLS that explains this behaviour, or am I missing something?
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.167
Model: OLS Adj. R-squared: 0.161
Method: Least Squares F-statistic: 29.83
Date: Wed, 16 Sep 2015 Prob (F-statistic): 1.23e-22
Time: 03:08:04 Log-Likelihood: -701.02
No. Observations: 600 AIC: 1412.
Df Residuals: 595 BIC: 1434.
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
a 5.8070 1.15e+13 5.05e-13 1.000 -2.26e+13 2.26e+13
b 6.4951 1.15e+13 5.65e-13 1.000 -2.26e+13 2.26e+13
c 6.9033 1.15e+13 6.01e-13 1.000 -2.26e+13 2.26e+13
d -1.1927 1.15e+13 -1.04e-13 1.000 -2.26e+13 2.26e+13
e -0.1685 1.15e+13 -1.47e-14 1.000 -2.26e+13 2.26e+13
==============================================================================
Omnibus: 67.153 Durbin-Watson: 0.328
Prob(Omnibus): 0.000 Jarque-Bera (JB): 70.964
Skew: 0.791 Prob(JB): 3.89e-16
Kurtosis: 2.419 Cond. No. 7.70e+14
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.25e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
It appears that, as mentioned by Mr. F, the main problem is that statsmodels OLS does not seem to handle the collinearity problem as well as Excel/R in this case. But if, instead of defining one variable for each of a, b, c, d and e, one defines a variable X and a variable Z which can take the values a, b or c and d or e respectively, then the regression works fine, i.e. updating the code with:
df['X'] = ['c'] * len(df)
df.loc[df.b != 0, 'X'] = 'b'
df.loc[df.a != 0, 'X'] = 'a'
df['Z'] = ['e'] * len(df)
df.loc[df.d != 0, 'Z'] = 'd'
mod = ols(formula='y ~ X + Z - 1', data=df)
leads to the expected results
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.684e+27
Date: Thu, 17 Sep 2015 Prob (F-statistic): 0.00
Time: 06:22:43 Log-Likelihood: 2.5096e+06
No. Observations: 100000 AIC: -5.019e+06
Df Residuals: 99996 BIC: -5.019e+06
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
X[a] 5.0000 1.85e-14 2.7e+14 0.000 5.000 5.000
X[b] 6.0000 1.62e-14 3.71e+14 0.000 6.000 6.000
X[c] 7.0000 2.31e-14 3.04e+14 0.000 7.000 7.000
Z[T.e] 1.0000 1.97e-14 5.08e+13 0.000 1.000 1.000
==============================================================================
Omnibus: 145.367 Durbin-Watson: 1.353
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9729.487
Skew: -0.094 Prob(JB): 0.00
Kurtosis: 1.483 Cond. No. 2.29
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
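For reference, the perfect collinearity comes from the dummy columns summing to one within each group (a + b + c = 1 and d + e = 1), so a no-intercept design containing all five dummies is singular. A minimal sketch of an alternative fix, assuming the df built above, is to keep the intercept and drop one dummy per group (here b and e act as reference levels):
from statsmodels.formula.api import ols
# Full-rank parameterisation: the intercept absorbs the X=b / Z=e baseline
# and the remaining dummies measure offsets from that baseline.
mod_ref = ols(formula='y ~ a + c + d', data=df)
res_ref = mod_ref.fit()
print(res_ref.summary())
With the deterministic data above this should recover an intercept of about 7 (= 2 + 5) and offsets of roughly -1 for a, +1 for c and -1 for d.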
I'm wondering what's the most efficient/accurate way to estimate the parameters (a0, a1, a2, a3) in Python for the model:
col_4 = a0 + a1*col_1 + a2*col_2 + a3*col_3
The sample dataset would be:
import numpy as np
import pandas as pd

inputs = {
    'col_1': np.random.normal(15, 2, 100),
    'col_2': np.random.normal(15, 1, 100),
    'col_3': np.random.normal(0.9, 1, 100),
    'col_4': np.random.normal(-0.05, 0.5, 100),
}
_idx = pd.date_range('2021-01-01', '2021-04-10', freq='D').to_series()
data = pd.DataFrame(inputs, index=_idx)
statsmodels provides a pretty simple way to estimate linear models like that:
import statsmodels.formula.api as smf
results = smf.ols('col_4 ~ col_1 + col_2 + col_3', data=data).fit()
print(results.summary())
The coef column shows your aX parameters:
OLS Regression Results
==============================================================================
Dep. Variable: col_4 R-squared: 0.049
Model: OLS Adj. R-squared: 0.019
Method: Least Squares F-statistic: 1.637
Date: Wed, 29 Dec 2021 Prob (F-statistic): 0.186
Time: 17:25:00 Log-Likelihood: -68.490
No. Observations: 100 AIC: 145.0
Df Residuals: 96 BIC: 155.4
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.2191 0.846 0.259 0.796 -1.461 1.899
col_1 -0.0198 0.023 -0.854 0.395 -0.066 0.026
col_2 -0.0048 0.051 -0.093 0.926 -0.107 0.097
col_3 0.1155 0.056 2.066 0.042 0.005 0.226
==============================================================================
Omnibus: 2.292 Durbin-Watson: 2.291
Prob(Omnibus): 0.318 Jarque-Bera (JB): 2.296
Skew: -0.351 Prob(JB): 0.317
Kurtosis: 2.757 Cond. No. 370.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
That includes the intercept (a0) by default. If you want to remove it, just add - 1 to the formula.
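For example, reusing the data frame defined above, a no-intercept fit would look like this (a sketch; the coefficients then absorb the baseline instead of reporting an explicit a0):
# Same model without an intercept: the '- 1' term drops a0 from the design matrix
results_no_intercept = smf.ols('col_4 ~ col_1 + col_2 + col_3 - 1', data=data).fit()
print(results_no_intercept.summary())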
I'm trying to do a linear regression to predict the count of a dataframe based on the count of another dataframe. I am using statsmodels. I have tried the following:
X = df1.count()
Y = df2.count()
from statsmodels.formula.api import ols
fit = ols(Y ~ X, data=kak).fit()
fit.summary()
Using the X and Y variables in the OLS formula is not allowed and I have no idea what to fill in at the data= keyword argument. How would I go about doing this?
Let's assume you have two 1D arrays X and Y:
import numpy as np
X = np.arange(100)
Y = 2*X + 5
Then you can run a linear regression using the line below. There are two important things:
The first argument to ols is a str containing a formula: "Y ~ X", not Y ~ X (note the quotes).
The second argument, data, can be any Python object as long as it has the keys data["X"] and data["Y"]. I used a dict here, but it would also work with a DataFrame. It basically allows statsmodels to understand what X and Y refer to in the formula you gave it.
ols("Y ~ X", {"X": X, "Y": Y}).fit().summary()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.916e+33
Date: Fri, 21 May 2021 Prob (F-statistic): 0.00
Time: 14:01:48 Log-Likelihood: 3055.1
No. Observations: 100 AIC: -6106.
Df Residuals: 98 BIC: -6101.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.0000 2.62e-15 1.91e+15 0.000 5.000 5.000
X 2.0000 4.57e-17 4.38e+16 0.000 2.000 2.000
==============================================================================
Omnibus: 220.067 Durbin-Watson: 0.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9.816
Skew: 0.159 Prob(JB): 0.00739
Kurtosis: 1.498 Cond. No. 114.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
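As a quick illustration of the DataFrame variant mentioned above, this sketch (assuming pandas is available and ols is the same import as in the question) is equivalent to the dict call:
import pandas as pd
# The DataFrame column names play the role of the variable names in the formula.
df = pd.DataFrame({"X": X, "Y": Y})
print(ols("Y ~ X", data=df).fit().summary())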
I get completely different results with the same datasets in R and Python. I cannot understand why it happens.
R:
library(RcppCNPy)
d <- npyLoad("/home/vvkovalchuk/bin/src/python/asks1.npy")
datas = npyLoad('/home/vvkovalchuk/bin/src/python/bids2.npy')
m <- lm(d ~ datas)
summary(m)
Python:
import time
import numpy
import statsmodels.api as sm
from math import log
Y = numpy.load('./asks1.npy', allow_pickle=True)
X = numpy.load('./bids2.npy', allow_pickle=True)
X3 = sm.add_constant(X)
res_ols = sm.OLS(Y, X3).fit()
print(res_ols.params)
What am I doing wrong?
Results:
R:
Call:
lm(formula = d ~ datas)
Residuals:
Min 1Q Median 3Q Max
-6.089e+06 8.797e+07 2.163e+08 2.179e+08 1.122e+10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.561e+00 2.253e+06 0 1
datas 3.809e+03 2.164e+09 0 1
Residual standard error: 208100000 on 14639 degrees of freedom
Multiple R-squared: 0.2735, Adjusted R-squared: 0.2735
F-statistic: 5512 on 1 and 14639 DF, p-value: < 2.2e-16
Python:
OLS Regression Results
Dep. Variable: y R-squared: 0.112
Model: OLS Adj. R-squared: 0.112
Method: Least Squares F-statistic: 1846.
Date: Thu, 25 Mar 2021 Prob (F-statistic): 0.00
Time: 13:08:43 Log-Likelihood: 1.6948e+05
No. Observations: 14641 AIC: -3.390e+05
Df Residuals: 14639 BIC: -3.389e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0004 3.07e-06 126.136 0.000 0.000 0.000
x1 0.1478 0.003 42.969 0.000 0.141 0.155
Omnibus: 3251.130 Durbin-Watson: 0.004
Prob(Omnibus): 0.000 Jarque-Bera (JB): 14606.605
Skew: 1.019 Prob(JB): 0.00
Kurtosis: 7.449 Cond. No. 1.83e+05
I also tried swapping the arguments in the OLS function and still got inconsistent results. Could this be related to NAs?
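It could well be NAs, or the arrays coming back with an object dtype (allow_pickle=True often hints at that): R's lm() silently drops NA rows by default, whereas statsmodels OLS does not. A minimal diagnostic sketch, assuming the same asks1.npy and bids2.npy files, would be:
import numpy as np
import statsmodels.api as sm

Y = numpy_Y = np.load('./asks1.npy', allow_pickle=True).astype(float)
X = np.load('./bids2.npy', allow_pickle=True).astype(float)

# Check dtypes and missing values before fitting
print(Y.dtype, X.dtype)
print(np.isnan(Y).sum(), np.isnan(X).sum())

# missing='drop' mimics R's default NA handling
res = sm.OLS(Y, sm.add_constant(X), missing='drop').fit()
print(res.params)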
I am doing some multiple linear regression with the following code:
import pandas as pd
import statsmodels.formula.api as sm

df = pd.DataFrame({"A": Output['10'],
                   "B": Input['Var1'],
                   "G": Input['Var2'],
                   "I": Input['Var3'],
                   "J": Input['Var4']})

res = sm.ols(formula="A ~ B + G + I + J", data=df).fit()
print(res.summary())
With the following result:
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.562
Model: OLS Adj. R-squared: 0.562
Method: Least Squares F-statistic: 2235.
Date: Tue, 06 Nov 2018 Prob (F-statistic): 0.00
Time: 09:48:20 Log-Likelihood: -21233.
No. Observations: 6961 AIC: 4.248e+04
Df Residuals: 6956 BIC: 4.251e+04
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 21.8504 0.448 48.760 0.000 20.972 22.729
B 1.8353 0.022 84.172 0.000 1.793 1.878
G 0.0032 0.004 0.742 0.458 -0.005 0.012
I -0.0210 0.009 -2.224 0.026 -0.039 -0.002
J 0.6677 0.061 10.868 0.000 0.547 0.788
==============================================================================
Omnibus: 2152.474 Durbin-Watson: 0.308
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5077.082
Skew: -1.773 Prob(JB): 0.00
Kurtosis: 5.221 Cond. No. 555.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
However, my Output dataframe consists of multiple columns, numbered from 1 to 149. Is there a way to loop over all 149 columns in the Output dataframe and, at the end, show the best and worst fits based on, for example, R-squared? Or get the largest coefficient for variable B?
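A minimal sketch of such a loop, assuming Output and Input are the DataFrames from the snippet above, could look like this:
import pandas as pd
import statsmodels.formula.api as sm

rows = []
for col in Output.columns:  # loop over all 149 output columns
    df = pd.DataFrame({"A": Output[col],
                       "B": Input['Var1'],
                       "G": Input['Var2'],
                       "I": Input['Var3'],
                       "J": Input['Var4']})
    res = sm.ols(formula="A ~ B + G + I + J", data=df).fit()
    rows.append({"column": col, "rsquared": res.rsquared, "coef_B": res.params['B']})

summary = pd.DataFrame(rows).set_index("column")
print(summary.sort_values("rsquared").iloc[[0, -1]])  # worst and best fit by R-squared
print(summary["coef_B"].idxmax())  # column with the largest coefficient for B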
I am using OLS from statsmodels; the example I am following is https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html
import statsmodels.api as sm

# USD
X = sm.add_constant(USD)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.265
Model: OLS Adj. R-squared: 0.265
Method: Least Squares F-statistic: 352.4
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.35e-67
Time: 17:30:24 Log-Likelihood: -8018.8
No. Observations: 977 AIC: 1.604e+04
Df Residuals: 975 BIC: 1.605e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1843.1414 149.675 12.314 0.000 1549.418 2136.864
USD 3512.5040 187.111 18.772 0.000 3145.318 3879.690
==============================================================================
Omnibus: 276.458 Durbin-Watson: 0.009
Prob(Omnibus): 0.000 Jarque-Bera (JB): 74.633
Skew: 0.438 Prob(JB): 6.22e-17
Kurtosis: 1.967 Cond. No. 10.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified
You can see that X is shown as USD in the summary, which is what I want.
However, after adding a new variable
#JPY + USD
X = sm.add_constant(JPY)
X = np.column_stack((X, USD))
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.641
Model: OLS Adj. R-squared: 0.640
Method: Least Squares F-statistic: 868.8
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.80e-217
Time: 17:39:19 Log-Likelihood: -7669.4
No. Observations: 977 AIC: 1.534e+04
Df Residuals: 974 BIC: 1.536e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1559.5880 149.478 -10.434 0.000 -1852.923 -1266.253
x1 78.6589 2.466 31.902 0.000 73.820 83.497
x2 -366.5850 178.672 -2.052 0.040 -717.211 -15.958
==============================================================================
Omnibus: 24.957 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.278
Skew: 0.353 Prob(JB): 1.19e-06
Kurtosis: 3.415 Cond. No. 743.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
It is not showing USD and JPY, but x1 and x2. Is there a way to fix this? I tried Google but found nothing.
Since my question is only about how the variables are displayed, keeping the pandas column headers solves the problem (np.column_stack converts everything to a plain NumPy array, which discards the names), so I'm posting my solution in case someone runs into the same issue.
#JPY + USD
X = JPY.join(USD)
X = sm.add_constant(X)
#X = np.column_stack((X, USD))
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
========================================================================================
Dep. Variable: All Ordinaries closing price R-squared: 0.641
Model: OLS Adj. R-squared: 0.640
Method: Least Squares F-statistic: 868.8
Date: Tue, 23 Oct 2018 Prob (F-statistic): 2.80e-217
Time: 22:51:43 Log-Likelihood: -7669.4
No. Observations: 977 AIC: 1.534e+04
Df Residuals: 974 BIC: 1.536e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1559.5880 149.478 -10.434 0.000 -1852.923 -1266.253
JPY 78.6589 2.466 31.902 0.000 73.820 83.497
USD -366.5850 178.672 -2.052 0.040 -717.211 -15.958
==============================================================================
Omnibus: 24.957 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.278
Skew: 0.353 Prob(JB): 1.19e-06
Kurtosis: 3.415 Cond. No. 743.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Here's an easy fix using pandas: you only need to pass a list of feature names to summary() via its xname argument.
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm

# list of feature names (exclude the last column, which is the label)
features = list(df.iloc[:, 0:-1].columns)

# scale the features
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# train the multiple linear regression model and label the coefficients
regressor = sm.OLS(y_train, X_train).fit()
regressor.summary(xname=features)
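One caveat: xname must contain exactly one name per column of the design matrix, so if a constant is added the list needs an extra entry, e.g. (a sketch using the same assumed variables):
import statsmodels.api as sm

X_train_const = sm.add_constant(X_train)  # adds an intercept column
regressor = sm.OLS(y_train, X_train_const).fit()
regressor.summary(xname=['const'] + features)  # one name per regressor, constant first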