How can I find the p-value (significance) of each coefficient?
lm = sklearn.linear_model.LinearRegression()
lm.fit(x,y)
This is kind of overkill, but let's give it a go. First, let's use statsmodels to find out what the p-values should be:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
and we get
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.518
Model: OLS Adj. R-squared: 0.507
Method: Least Squares F-statistic: 46.27
Date: Wed, 08 Mar 2017 Prob (F-statistic): 3.83e-62
Time: 10:08:24 Log-Likelihood: -2386.0
No. Observations: 442 AIC: 4794.
Df Residuals: 431 BIC: 4839.
Df Model: 10
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 152.1335 2.576 59.061 0.000 147.071 157.196
x1 -10.0122 59.749 -0.168 0.867 -127.448 107.424
x2 -239.8191 61.222 -3.917 0.000 -360.151 -119.488
x3 519.8398 66.534 7.813 0.000 389.069 650.610
x4 324.3904 65.422 4.958 0.000 195.805 452.976
x5 -792.1842 416.684 -1.901 0.058 -1611.169 26.801
x6 476.7458 339.035 1.406 0.160 -189.621 1143.113
x7 101.0446 212.533 0.475 0.635 -316.685 518.774
x8 177.0642 161.476 1.097 0.273 -140.313 494.442
x9 751.2793 171.902 4.370 0.000 413.409 1089.150
x10 67.6254 65.984 1.025 0.306 -62.065 197.316
==============================================================================
Omnibus: 1.506 Durbin-Watson: 2.029
Prob(Omnibus): 0.471 Jarque-Bera (JB): 1.404
Skew: 0.017 Prob(JB): 0.496
Kurtosis: 2.726 Cond. No. 227.
==============================================================================
OK, let's reproduce this. It is kind of overkill, as we are almost reproducing a linear regression analysis using matrix algebra. But what the heck.
lm = LinearRegression()
lm.fit(X,y)
params = np.append(lm.intercept_,lm.coef_)
predictions = lm.predict(X)
newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))
# Note if you don't want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))
var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params / sd_b
p_values = [2*(1-stats.t.cdf(np.abs(i), (len(newX)-len(newX.columns)))) for i in ts_b]
sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)
myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)
And this gives us:
Coefficients Standard Errors t values Probabilities
0 152.1335 2.576 59.061 0.000
1 -10.0122 59.749 -0.168 0.867
2 -239.8191 61.222 -3.917 0.000
3 519.8398 66.534 7.813 0.000
4 324.3904 65.422 4.958 0.000
5 -792.1842 416.684 -1.901 0.058
6 476.7458 339.035 1.406 0.160
7 101.0446 212.533 0.475 0.635
8 177.0642 161.476 1.097 0.273
9 751.2793 171.902 4.370 0.000
10 67.6254 65.984 1.025 0.306
So we can reproduce the values from statsmodels.
scikit-learn's LinearRegression doesn't calculate this information but you can easily extend the class to do it:
from sklearn import linear_model
from scipy import stats
import numpy as np
class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculates t-statistics
    and p-values for model coefficients (betas).
    Additional attributes available after .fit()
    are `t` and `p`, which are of shape (y.shape[1], X.shape[1]),
    i.e. (n_targets, n_features).
    This class sets the intercept to 0 by default, since usually we include it
    in X.
    """

    def __init__(self, *args, **kwargs):
        if "fit_intercept" not in kwargs:
            kwargs["fit_intercept"] = False
        super(LinearRegression, self).__init__(*args, **kwargs)

    def fit(self, X, y, n_jobs=1):
        self = super(LinearRegression, self).fit(X, y, n_jobs)

        sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
        se = np.array([
            np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X))))
            for i in range(sse.shape[0])
        ])

        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1]))
        return self
Stolen from here.
You should take a look at statsmodels for this kind of statistical analysis in Python.
An easy way to pull out the p-values is to use a statsmodels regression:
import statsmodels.api as sm
mod = sm.OLS(Y,X)
fii = mod.fit()
p_values = fii.summary2().tables[1]['P>|t|']
You get a Series of p-values that you can manipulate (for example, to choose which predictors to keep by evaluating each p-value):
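For instance, a minimal sketch (the 0.05 cutoff is only an illustrative choice) that keeps the predictors whose p-value is below 0.05 and sorts them:
# p_values is a pandas Series indexed by predictor name
significant = p_values[p_values < 0.05]
print(significant.sort_values())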
The code in elyase's answer https://stackoverflow.com/a/27928411/4240413 does not actually work. Notice that sse is a scalar, and then it tries to iterate through it. The following code is a modified version. Not amazingly clean, but I think it works more or less.
from sklearn import linear_model
import numpy as np
import scipy as sc
import scipy.linalg  # needed below for sc.linalg.sqrtm
from scipy.stats import t


class LinearRegression(linear_model.LinearRegression):

    def __init__(self, *args, **kwargs):
        # *args is the list of arguments that might go into the LinearRegression object
        # that we don't know about and don't want to have to deal with. Similarly, **kwargs
        # is a dictionary of keywords and values that might also need to go into the original
        # LinearRegression object. We put *args and **kwargs so that we don't have to look
        # these up and write them down explicitly here. Nice and easy.
        if "fit_intercept" not in kwargs:
            kwargs["fit_intercept"] = False
        super(LinearRegression, self).__init__(*args, **kwargs)

    # Adding in t-statistics for the coefficients.
    def fit(self, x, y):
        # This takes in numpy arrays (not matrices). Also assumes you are leaving out the column
        # of constants.
        # Not totally sure what 'super' does here and why you redefine self...
        self = super(LinearRegression, self).fit(x, y)
        n, k = x.shape
        yHat = np.matrix(self.predict(x)).T

        # Change x and y into numpy matrices. x also has a column of ones added to it.
        x = np.hstack((np.ones((n, 1)), np.matrix(x)))
        y = np.matrix(y).T

        # Degrees of freedom.
        df = float(n - k - 1)

        # Sample variance.
        sse = np.sum(np.square(yHat - y), axis=0)
        self.sampleVariance = sse / df

        # Sample variance for x.
        self.sampleVarianceX = x.T * x

        # Covariance Matrix = [(s^2)(X'X)^-1]^0.5 (sqrtm = matrix square root, ugly).
        self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0, 0] * self.sampleVarianceX.I)

        # Standard errors for the coefficients: the diagonal elements of the covariance matrix.
        self.se = self.covarianceMatrix.diagonal()[1:]

        # T statistic for each beta.
        self.betasTStat = np.zeros(len(self.se))
        for i in range(len(self.se)):
            # np.ravel handles coef_ being 1-D (single target) or 2-D
            self.betasTStat[i] = np.ravel(self.coef_)[i] / self.se[i]

        # P-value for each beta. This is a two-sided t-test, since the betas can be
        # positive or negative.
        self.betasPValue = 2 * (1 - t.cdf(abs(self.betasTStat), df))
        return self
There could be a mistake in @JARH's answer in the case of a multivariable regression.
(I do not have enough reputation to comment.)
In the following line:
p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-1))) for i in ts_b],
the t-values follow a t-distribution with len(newX)-1 degrees of freedom, instead of a t-distribution with len(newX)-len(newX.columns)-1 degrees of freedom.
So this should be:
p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX.columns)-1))) for i in ts_b]
(See t-values for OLS regression for more details)
You can use scipy for p-value. This code is from scipy documentation.
>>> from scipy import stats
>>> import numpy as np
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
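Note that linregress handles only a single predictor; the returned p_value is the two-sided p-value for the null hypothesis that the slope is zero, so you can inspect it directly, for example:
>>> print(slope, p_value)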
For a one-liner you can use the pingouin.linear_regression function (disclaimer: I am the creator of Pingouin), which works with uni/multi-variate regression using NumPy arrays or Pandas DataFrame, e.g:
import pingouin as pg
# Using a Pandas DataFrame `df`:
lm = pg.linear_regression(df[['x', 'z']], df['y'])
# Using a NumPy array:
lm = pg.linear_regression(X, y)
The output is a dataframe with the beta coefficients, standard errors, T-values, p-values and confidence intervals for each predictor, as well as the R^2 and adjusted R^2 of the fit.
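For example, assuming recent Pingouin versions (where the relevant output columns are named 'names', 'coef' and 'pval'), the p-values can be pulled straight out of that dataframe:
# lm is a regular pandas DataFrame
print(lm[['names', 'coef', 'pval']])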
The overall p-value is reported with the F-statistic. If you want to get that value, simply use these few lines of code:
import statsmodels.api as sm
from scipy import stats
from sklearn import datasets
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
print(est.fit().f_pvalue)
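If you want the per-coefficient p-values rather than the overall F-test p-value, the fitted statsmodels results object exposes them directly; a small addition to the snippet above:
results = est.fit()
print(results.f_pvalue)  # overall F-test p-value (same as above)
print(results.pvalues)   # per-coefficient p-values, constant first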
Getting a little bit into the theory of linear regression, here is a summary of what we need in order to compute the p-values for the coefficient estimators (random variables), to check whether they are significant (by rejecting the corresponding null hypothesis):
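As a sketch, the standard OLS quantities involved (and implemented by the code below) are, in LaTeX notation:
\hat{\beta} = (X^T X)^{-1} X^T y, \qquad \hat{\sigma}^2 = \frac{\lVert y - X\hat{\beta} \rVert^2}{n - p}, \qquad t_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{[(X^T X)^{-1}]_{jj}}}, \qquad p_j = 2\,\Pr\!\left(T_{n-p} > |t_j|\right)
where X includes the column of ones, n is the number of observations, p is the number of columns of X (constant included), and T_{n-p} is a Student's t variable with n - p degrees of freedom.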
Now, let's compute the p-values using the following code snippets:
import numpy as np
# generate some data
np.random.seed(1)
n = 100
X = np.random.random((n,2))
beta = np.array([-1, 2])
noise = np.random.normal(loc=0, scale=2, size=n)
y = X @ beta + noise
Compute p-values from the above formulae with scikit-learn:
# use scikit-learn's linear regression model to obtain the coefficient estimates
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X, y)
beta_hat = [reg.intercept_] + reg.coef_.tolist()
beta_hat
# [0.18444290873001834, -1.5879784718284842, 2.5252138207251904]
# compute the p-values
from scipy.stats import t
# add ones column
X1 = np.column_stack((np.ones(n), X))
# standard deviation of the noise.
sigma_hat = np.sqrt(np.sum(np.square(y - X1 @ beta_hat)) / (n - X1.shape[1]))
# estimate the covariance matrix for beta
beta_cov = np.linalg.inv(X1.T @ X1)
# the t-test statistic for each variable, from the formulas above
t_vals = beta_hat / (sigma_hat * np.sqrt(np.diagonal(beta_cov)))
# compute 2-sided p-values.
p_vals = t.sf(np.abs(t_vals), n-X1.shape[1])*2
t_vals
# array([ 0.37424023, -2.36373529, 3.57930174])
p_vals
# array([7.09042437e-01, 2.00854025e-02, 5.40073114e-04])
Compute p-values with statsmodels:
import statsmodels.api as sm
X1 = sm.add_constant(X)
model = sm.OLS(y, X1)
model = model.fit()
model.tvalues
# array([ 0.37424023, -2.36373529, 3.57930174])
# compute p-values
t.sf(np.abs(model.tvalues), n-X1.shape[1])*2
# array([7.09042437e-01, 2.00854025e-02, 5.40073114e-04])
model.summary()
As can be seen above, the p-values computed in both cases are exactly the same.
Another option to those already proposed would be to use permutation testing. Fit the model N times with the values of y shuffled, and compute the proportion of the fitted models' coefficients that have larger values (one-sided test) or larger absolute values (two-sided test) than those given by the original model. These proportions are the p-values.
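A minimal sketch of that idea, assuming NumPy arrays X and y and scikit-learn's LinearRegression (the function name, the number of shuffles, and the add-one correction that avoids p = 0 are illustrative choices, not part of the original answer):
import numpy as np
from sklearn.linear_model import LinearRegression

def permutation_pvalues(X, y, n_permutations=1000, seed=0):
    """Two-sided permutation p-values for each coefficient of an OLS fit."""
    rng = np.random.default_rng(seed)
    observed = LinearRegression().fit(X, y).coef_
    counts = np.zeros_like(observed)
    for _ in range(n_permutations):
        y_shuffled = rng.permutation(y)              # break the X-y association
        permuted = LinearRegression().fit(X, y_shuffled).coef_
        counts += np.abs(permuted) >= np.abs(observed)
    return (counts + 1) / (n_permutations + 1)       # add-one correction avoids p = 0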
I am trying to build an ARDL model in python, where I have a model given as:
y_t = b0 + b1*y_{t-1} + b2*y_{t-2} + ... + b5*y_{t-5} + a1*x_{t-1}
In other words, a time series model with 5 autoregressive lagged terms, and 1 exogenous lag.
I tried using statsmodels.tsa.arima_model.ARMA, which allows for exogenous variables, and I get an output with the following variables:
const, x1, ar.L1.y, ar.L2.y, ar.L3.y, ar.L4.y, ar.L5.y
The code I used to get these variables is:
model = ARMA(endog=returns, order=(ar_order, 0, exog_order), exog=exog_returns)
model_fit = model.fit()
print(model_fit.summary())
Based on the coefficient value of x1, I assume this isn't the effect of the exogenous variable at 1 time lag prior, but rather its influence as a whole.
Is there any way to specify that I just want the model to use 1 time lag for the exogenous variable?
Don't use ARMA, as it is deprecated. Either use SARIMAX or AutoReg.
The key to using exog variables is to make sure they are aligned to the y data they affect. Here I shift x by 1 so that it is as if the lag of x is driving y.
If using AutoReg, you can do
from statsmodels.tsa.api import AutoReg
import numpy as np
import pandas as pd
rg = np.random.default_rng(0)
idx = pd.date_range("1999-12-31",freq="M",periods=100)
x = pd.DataFrame({"x":rg.standard_normal(100)}, index=idx)
x = x.shift(1) # lag x
y = x @ np.ones((1,)) + rg.standard_normal(100)
# Drop first obs which is NaN
y = y.iloc[1:]
x = x.iloc[1:]
# 5 lags and exogenous regressors
res = AutoReg(y,5, exog=x, old_names=False).fit()
# Summary
res.summary()
This produces
AutoReg Model Results
==============================================================================
Dep. Variable: y No. Observations: 99
Model: AutoReg-X(5) Log Likelihood -127.520
Method: Conditional MLE S.D. of innovations 0.940
Date: Mon, 01 Mar 2021 AIC 0.046
Time: 14:57:44 BIC 0.262
Sample: 06-30-2000 HQIC 0.133
- 03-31-2008
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0500 0.098 -0.512 0.609 -0.241 0.141
y.L1 -0.0342 0.070 -0.486 0.627 -0.172 0.104
y.L2 0.0621 0.070 0.884 0.377 -0.076 0.200
y.L3 -0.0715 0.071 -1.009 0.313 -0.210 0.067
y.L4 -0.0941 0.070 -1.347 0.178 -0.231 0.043
y.L5 -0.0323 0.071 -0.453 0.651 -0.172 0.107
x 1.0650 0.103 10.304 0.000 0.862 1.268
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 1.1584 -1.0801j 1.5838 -0.1194
AR.2 1.1584 +1.0801j 1.5838 0.1194
AR.3 -1.3814 -1.7595j 2.2370 -0.3559
AR.4 -1.3814 +1.7595j 2.2370 0.3559
AR.5 -2.4649 -0.0000j 2.4649 -0.5000
-----------------------------------------------------------------------------
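If using SARIMAX instead (the other option mentioned above), a minimal sketch with the same y and x built above would be the following; the order=(5, 0, 0) spec mirrors the 5 AR lags, and the estimates will be close to, though not identical with, the AutoReg ones because SARIMAX uses full state-space maximum likelihood:
from statsmodels.tsa.api import SARIMAX

# AR(5) with a constant and the already-lagged exogenous regressor x
sarimax_res = SARIMAX(y, exog=x, order=(5, 0, 0), trend="c").fit(disp=False)
print(sarimax_res.summary())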
I'm trying to build a model to predict the number of daily orders for a delivery cafe. Here is the data,
Here you can see two major peaks: they are holidays -- Feb 14 and Mar 8, respectively. Also, you can see an obvious seasonality with the period of 7: people order more at the weekend and less at working days.
The Dickey-Fuller test shows that the series is not stationary, with a p-value of 0.152.
Then I decided to apply the Box-Cox transformation because the variance looks uneven. After that, the Dickey-Fuller test's p-value is 0.222, but the transformed series now looks like a sinusoid,
Then I applied a seasonal difference like this:
data["counts_diff"] = data.counts_box - data.counts_box.shift(7)
plt.figure(figsize=(15,10))
sm.tsa.seasonal_decompose(data.counts_diff[7:]).plot()
p_value = sm.tsa.stattools.adfuller(data.counts_diff[7:])[1]
Now p-value is about 10e-5 and the series is stationary
Then I plotted ACF and PACF to choose initial parameters to grid search,
Common sense and the rules I know told me to choose,
Q = 1
q = 4
P = 3
p = 6
d = 0
D = 1
The code for model finding:
ps = range(0, p+1)
d=0
qs = range(0, q+1)
Ps = range(0, P+1)
D=1
Qs = range(0, Q+1)
parameters_list = []
for p in ps:
    for q in qs:
        for P in Ps:
            for Q in Qs:
                parameters_list.append((p, q, P, Q))
len(parameters_list)
%%time
from IPython.display import clear_output
results = []
best_aic = float("inf")
i = 1
for param in parameters_list:
    print("counting {}/{}...".format(i, len(parameters_list)))
    i += 1
    try:
        model = sm.tsa.statespace.SARIMAX(data.counts_diff[7:], order=(param[0], d, param[1]),
                                          seasonal_order=(param[2], D, param[3], 7)).fit(disp=-1)
    except ValueError:
        print('wrong parameters:', param)
        continue
    except LinAlgError:
        print('LU decomposition error:', param)
        continue
    finally:
        clear_output()
    aic = model.aic
    if aic < best_aic:
        best_model = model
        best_aic = aic
        best_param = param
    results.append([param, model.aic])
After grid search, I get this,
But when I plot it, it shows a constant line at zero,
Residuals are unbiased, have no trend, and are not autocorrelated,
The code for plotting is here:
# inverse Box-Cox transformation
def invboxcox(y, lmbda):
    if lmbda == 0:
        return np.exp(y)
    else:
        return np.exp(np.log(lmbda*y+1) / lmbda)
data["model"] = invboxcox(best_model.fittedvalues, lmbda)
plt.figure(figsize=(15,5))
plt.ylabel('Orders count')
data.counts.plot()
#pylab.show()
data.model.plot(color='r')
plt.ylabel('Model explanation')
pylab.show()
If I uncomment the line, the plot looks as follows,
What am I missing? Should I consider the sinusoid shape of the transformed series? And why is the scale so different?
Also, the code,
data["model"] = invboxcox(best_model.fittedvalues, lmbda)
plt.figure(figsize=(15,5))
(data.counts_diff + 1).plot()
#pylab.show()
data.model.plot(color='r')
pylab.show()
plots two fairly similar plots,
So, AFAIK, the problem is somewhere in the reverse transformations.
Let's first import the essentials, load the data and transform the series,
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.base.transform import BoxCox
# Load data
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Transformation
box_cox = BoxCox()
y, lmbda = box_cox.transform_boxcox(df['counts'])
After Box-Cox transforming your series, I test it for a unit root,
>>>print(sm.tsa.kpss(y)[1])
0.0808334102754407
And,
>>>print(sm.tsa.adfuller(y)[1])
0.18415817136548102
While not entirely stationary according to the ADF test, the KPSS test is not in agreement. Visual inspection seems to suggest it may be stationary 'enough'. Let's consider a model,
df['counts'] = y
model = sm.tsa.SARIMAX(df['counts'], None, (1, 0, 1), (2, 1, 1, 7))
result = model.fit(method='bfgs')
And,
>>>print(result.summary())
Optimization terminated successfully.
Current function value: -3.505128
Iterations: 33
Function evaluations: 41
Gradient evaluations: 41
Statespace Model Results
=========================================================================================
Dep. Variable: counts No. Observations: 346
Model: SARIMAX(1, 0, 1)x(2, 1, 1, 7) Log Likelihood 1212.774
Date: Wed, 24 Jul 2019 AIC -2413.549
Time: 13:37:19 BIC -2390.593
Sample: 07-01-2018 HQIC -2404.401
- 06-11-2019
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.8699 0.052 16.691 0.000 0.768 0.972
ma.L1 -0.5811 0.076 -7.621 0.000 -0.731 -0.432
ar.S.L7 0.0544 0.056 0.963 0.335 -0.056 0.165
ar.S.L14 0.0987 0.060 1.654 0.098 -0.018 0.216
ma.S.L7 -0.9520 0.036 -26.637 0.000 -1.022 -0.882
sigma2 4.385e-05 2.44e-06 17.975 0.000 3.91e-05 4.86e-05
===================================================================================
Ljung-Box (Q): 34.12 Jarque-Bera (JB): 68.51
Prob(Q): 0.73 Prob(JB): 0.00
Heteroskedasticity (H): 1.57 Skew: 0.08
Prob(H) (two-sided): 0.02 Kurtosis: 5.20
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The Ljung-Box result seems to suggest the residuals do not exhibit auto-correlation - which is good! Let's inverse transform the data and the results, and plot the fit,
# Since the first 7 days were back-filled
y_hat = result.fittedvalues[7:]
# Inverse transformations
y_hat_inv = pd.DataFrame(box_cox.untransform_boxcox(y_hat, lmbda),
index=y_hat.index)
y_inv = pd.DataFrame(box_cox.untransform_boxcox(y, lmbda),
index=df.index)
# Plot fitted values with data
_, ax = plt.subplots()
y_inv.plot(ax=ax)
y_hat_inv.plot(ax=ax)
plt.legend(['Data', 'Fitted values'])
plt.show()
Where I get,
Which does not look bad at all!
I would like to calculate the standard error of linear regression coefficient using bootstrap technique (100 resamples) but the result I got is zero, which is not normal. I think something is wrong with the bootstrap part of the code. Do you know how to fix my code?
x, y = np.genfromtxt("input.txt", unpack=True)
#regression part
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print std_err
#bootstrap part of the code
A = np.random.choice(x, size=100, replace=True)
B = np.random.choice(y, size=100, replace=True)
slope2, intercept2, r_value2, p_value2, std_err2 = stats.linregress(A,B)
print std_err2
input.txt:
-1.08 -1.07
-2.62 -2.56
-2.84 -2.79
-2.22 -2.16
-3.47 -3.55
-2.81 -2.79
-2.86 -2.71
-3.41 -3.42
-4.18 -4.21
-3.50 -3.48
-5.67 -5.55
-6.83 -6.95
-6.13 -6.13
-8.34 -8.19
-7.82 -7.83
-9.86 -9.58
-8.67 -8.62
-9.81 -9.81
-8.39 -8.30
I had no issues with your above code running in Python 3.6.1. Maybe check that your scipy version is current?
from scipy import stats
import numpy as np
x, y = np.genfromtxt("./input.txt", unpack=True)
slope_1, intercept_1, r_val_1, p_val_1, stderr_1 = stats.linregress(x, y)
print(slope_1) # 0.9913080927081567
print(stderr_1) # 0.007414734102169809
A = np.random.choice(x, size=100, replace=True)
B = np.random.choice(y, size=100, replace=True)
slope_2, incercept_2, r_val_2, p_val_2, stderr_2 = stats.linregress(A, B)
print(slope_2) # 0.11429903085322253
print(stderr_2) # 0.10158283281966374
Correctly Bootstrapping the Data
The correct way to do this would be to use the resample method from sklearn.utils. This method handles the data in a consistent array format. Since your data is an x, y pair, the y value is dependent on your x value. If you randomly sample x and y independently you lose that dependency and your resampled data will not accurately represent your population.
from scipy import stats
from sklearn.utils import resample
import numpy as np
x, y = np.genfromtxt("./input.txt", unpack=True)
slope_1, intercept_1, r_val_1, p_val_1, stderr_1 = stats.linregress(x, y)
print(slope_1) # 0.9913080927081567
print(stderr_1) # 0.007414734102169809
A, B = resample(x, y, n_samples=100) # defaults to w/ replacement
slope_2, incercept_2, r_val_2, p_val_2, stderr_2 = stats.linregress(A, B)
print(slope_2) # 0.9864339054638176
print(stderr_2) # 0.002669659193615103
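To get the bootstrap standard error the question asks about (100 resamples), you would repeat that resampling step and take the standard deviation of the resulting slope estimates. A minimal sketch along those lines (the loop count and random_state handling are illustrative):
from scipy import stats
from sklearn.utils import resample
import numpy as np

x, y = np.genfromtxt("./input.txt", unpack=True)

n_boot = 100
slopes = []
for i in range(n_boot):
    A, B = resample(x, y, n_samples=len(x), random_state=i)  # resample (x, y) pairs together
    slopes.append(stats.linregress(A, B).slope)

# bootstrap estimate of the standard error of the slope
print(np.std(slopes, ddof=1))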
I am trying to fit a lognormal distribution using Scipy. I've already done it using Matlab before but because of the need to extend the application beyond statistical analysis, I am in the process of trying to reproduce the fitted values in Scipy.
Below is the Matlab code I used to fit my data:
% Read input data (one value per line)
x = [];
fid = fopen(file_path, 'r'); % reading is default action for fopen
disp('Reading network degree data...');
if fid == -1
    disp('[ERROR] Unable to open data file.')
else
    while ~feof(fid)
        [x] = [x fscanf(fid, '%f', [1])];
    end
    c = fclose(fid);
    if c == 0
        disp('File closed successfully.');
    else
        disp('[ERROR] There was a problem with closing the file.');
    end
end
[f,xx] = ecdf(x);
y = 1-f;
parmhat = lognfit(x); % MLE estimate
mu = parmhat(1);
sigma = parmhat(2);
And here's the fitted plot:
Now here's my Python code with the aim of achieving the same:
import math
from scipy import stats
from statsmodels.distributions.empirical_distribution import ECDF
# The same input is read as a list in Python
ecdf_func = ECDF(degrees)
x = ecdf_func.x
ccdf = 1-ecdf_func.y
# Fit data
shape, loc, scale = stats.lognorm.fit(degrees, floc=0)
# Parameters
sigma = shape # standard deviation
mu = math.log(scale) # meanlog of the distribution
fit_ccdf = stats.lognorm.sf(x, [sigma], floc=1, scale=scale)
Here's the fit using the Python code.
As you can see, both sets of code are capable of producing good fits, at least visually speaking.
Problem is that there is a huge difference in the estimated parameters mu and sigma.
From Matlab: mu = 1.62 sigma = 1.29.
From Python: mu = 2.78 sigma = 1.74.
Why is there such a difference?
Note: I have double checked that both sets of data fitted are exactly the same. Same number of points, same distribution.
Your help is much appreciated! Thanks in advance.
Other info:
import scipy
import numpy
import statsmodels
scipy.__version__
'0.9.0'
numpy.__version__
'1.6.1'
statsmodels.__version__
'0.5.0.dev-1bbd4ca'
Version of Matlab is R2011b.
Edit:
As demonstrated in the answer below, the fault lies with SciPy 0.9. I am able to reproduce the mu and sigma results from Matlab using SciPy 0.11.0.
An easy way to update your SciPy is:
pip install --upgrade scipy
If you don't have pip (you should!):
sudo apt-get install pip
There is a bug in the fit method in scipy 0.9.0 that has been fixed in later versions of scipy.
The output of the script below should be:
Explicit formula: mu = 4.99203450, sig = 0.81691086
Fit log(x) to norm: mu = 4.99203450, sig = 0.81691086
Fit x to lognorm: mu = 4.99203468, sig = 0.81691081
but with scipy 0.9.0, it is
Explicit formula: mu = 4.99203450, sig = 0.81691086
Fit log(x) to norm: mu = 4.99203450, sig = 0.81691086
Fit x to lognorm: mu = 4.23197270, sig = 1.11581240
The following test script shows three ways to get the same results:
import numpy as np
from scipy import stats
def lognfit(x, ddof=0):
    x = np.asarray(x)
    logx = np.log(x)
    mu = logx.mean()
    sig = logx.std(ddof=ddof)
    return mu, sig
# A simple data set for easy reproducibility
x = np.array([50., 50, 100, 200, 200, 300, 500])
# Explicit formula
my_mu, my_sig = lognfit(x)
# Fit a normal distribution to log(x)
norm_mu, norm_sig = stats.norm.fit(np.log(x))
# Fit the lognormal distribution
lognorm_sig, _, lognorm_expmu = stats.lognorm.fit(x, floc=0)
print "Explicit formula: mu = %10.8f, sig = %10.8f" % (my_mu, my_sig)
print "Fit log(x) to norm: mu = %10.8f, sig = %10.8f" % (norm_mu, norm_sig)
print "Fit x to lognorm: mu = %10.8f, sig = %10.8f" % (np.log(lognorm_expmu), lognorm_sig)
With the option ddof=1 in the std. dev. calculation to use the unbiased variance estimation:
In [104]: x
Out[104]: array([ 50., 50., 100., 200., 200., 300., 500.])
In [105]: lognfit(x, ddof=1)
Out[105]: (4.9920345004312647, 0.88236457185021866)
There is a note in matlab's lognfit documentation that says when censoring is not used, lognfit computes sigma using the square root of the unbiased estimator of the variance. This corresponds to using ddof=1 in the above code.
I'm trying to generate a linear regression on a scatter plot I have generated, however my data is in list format, and all of the examples I can find of using polyfit require using arange. arange doesn't accept lists though. I have searched high and low about how to convert a list to an array and nothing seems clear. Am I missing something?
Following on, how best can I use my list of integers as inputs to the polyfit?
Here is the polyfit example I am following:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(data)
y = np.arange(data)
m, b = np.polyfit(x, y, 1)
plt.plot(x, y, 'yo', x, m*x+b, '--k')
plt.show()
arange generates lists (well, numpy arrays); type help(np.arange) for the details. You don't need to call it on existing lists.
>>> x = [1,2,3,4]
>>> y = [3,5,7,9]
>>>
>>> m,b = np.polyfit(x, y, 1)
>>> m
2.0000000000000009
>>> b
0.99999999999999833
I should add that I tend to use poly1d here rather than write out "m*x+b" and the higher-order equivalents, so my version of your code would look something like this:
import numpy as np
import matplotlib.pyplot as plt
x = [1,2,3,4]
y = [3,5,7,10] # 10, not 9, so the fit isn't perfect
coef = np.polyfit(x,y,1)
poly1d_fn = np.poly1d(coef)
# poly1d_fn is now a function which takes in x and returns an estimate for y
plt.plot(x,y, 'yo', x, poly1d_fn(x), '--k') #'--k'=black dashed line, 'yo' = yellow circle marker
plt.xlim(0, 5)
plt.ylim(0, 12)
This code:
from scipy.stats import linregress
linregress(x,y) #x and y are arrays or lists.
returns a result object with the following values:
slope : float
    slope of the regression line
intercept : float
    intercept of the regression line
r-value : float
    correlation coefficient
p-value : float
    two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
stderr : float
    standard error of the estimate
Source
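As a small usage sketch (the example data are taken from the lists used earlier on this page), the returned values can also be accessed as attributes of the result object:
from scipy.stats import linregress

result = linregress([1, 2, 3, 4], [3, 5, 7, 10])
print(result.slope, result.intercept, result.pvalue, result.stderr)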
Use statsmodels.api.OLS to get a detailed breakdown of the fit/coefficients/residuals:
import statsmodels.api as sm
df = sm.datasets.get_rdataset('Duncan', 'carData').data
y = df['income']
x = df['education']
model = sm.OLS(y, sm.add_constant(x))
results = model.fit()
print(results.params)
# const 10.603498 <- intercept
# education 0.594859 <- slope
# dtype: float64
print(results.summary())
# OLS Regression Results
# ==============================================================================
# Dep. Variable: income R-squared: 0.525
# Model: OLS Adj. R-squared: 0.514
# Method: Least Squares F-statistic: 47.51
# Date: Thu, 28 Apr 2022 Prob (F-statistic): 1.84e-08
# Time: 00:02:43 Log-Likelihood: -190.42
# No. Observations: 45 AIC: 384.8
# Df Residuals: 43 BIC: 388.5
# Df Model: 1
# Covariance Type: nonrobust
# ==============================================================================
# coef std err t P>|t| [0.025 0.975]
# ------------------------------------------------------------------------------
# const 10.6035 5.198 2.040 0.048 0.120 21.087
# education 0.5949 0.086 6.893 0.000 0.421 0.769
# ==============================================================================
# Omnibus: 9.841 Durbin-Watson: 1.736
# Prob(Omnibus): 0.007 Jarque-Bera (JB): 10.609
# Skew: 0.776 Prob(JB): 0.00497
# Kurtosis: 4.802 Cond. No. 123.
# ==============================================================================
New in matplotlib 3.5.0
To plot the best-fit line, just pass the slope m and intercept b into the new plt.axline:
import matplotlib.pyplot as plt
# extract intercept b and slope m
b, m = results.params
# plot y = m*x + b
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
Note that the slope m and intercept b can be easily extracted from any of the common regression methods:
numpy.polyfit
import numpy as np
m, b = np.polyfit(x, y, deg=1)
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
scipy.stats.linregress
from scipy import stats
m, b, *_ = stats.linregress(x, y)
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
statsmodels.api.OLS
import statsmodels.api as sm
b, m = sm.OLS(y, sm.add_constant(x)).fit().params
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
sklearn.linear_model.LinearRegression
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(x[:, None], y)
b = reg.intercept_
m = reg.coef_[0]
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
x = np.array([1.5,2,2.5,3,3.5,4,4.5,5,5.5,6])
y = np.array([10.35,12.3,13,14.0,16,17,18.2,20,20.7,22.5])
gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
mn=np.min(x)
mx=np.max(x)
x1=np.linspace(mn,mx,500)
y1=gradient*x1+intercept
plt.plot(x,y,'ob')
plt.plot(x1,y1,'-r')
plt.show()
Use this.
George's answer goes together quite nicely with matplotlib's axline which plots an infinite line.
from scipy.stats import linregress
import matplotlib.pyplot as plt
reg = linregress(x, y)
plt.axline(xy1=(0, reg.intercept), slope=reg.slope, linestyle="--", color="k")
from pylab import *
import numpy as np
x1 = [1, 2, 3, 4]   # for example this is a list
y1 = [3, 5, 7, 9]   # for example this is a list
x = np.array(x1)    # this will convert a list into an array
y = np.array(y1)
m, b = polyfit(x, y, 1)
plot(x, y, 'yo', x, m*x+b, '--k')
show()
Another quick and dirty answer is that you can just convert your list to an array using:
import numpy as np
arr = np.asarray(listname)
Linear regression is a good example to start with in artificial intelligence.
Here is a good example of a multiple linear regression machine learning algorithm using Python:
##### Predicting House Prices Using Multiple Linear Regression - #Y_T_Akademi
#### In this project we are going to see how machine learning algorithms help us predict house prices. Linear regression is a model that predicts new future data by using the existing correlation between the old data. Here, machine learning helps us identify this relationship between feature data and output, so we can predict future values.
import pandas as pd
##### we use sklearn library in many machine learning calculations..
from sklearn import linear_model
##### we import our dataset: housepricesdataset.csv
df = pd.read_csv("housepricesdataset.csv",sep = ";")
##### The following is our feature set:
##### The following is the output(result) data:
##### we define a linear regression model here:
reg = linear_model.LinearRegression()
reg.fit(df[['area', 'roomcount', 'buildingage']], df['price'])
# Since our model is ready, we can make predictions now:
# lets predict a house with 230 square meters, 4 rooms and 10 years old building..
reg.predict([[230,4,10]])
# Now lets predict a house with 230 square meters, 6 rooms and 0 years old building - its new building..
reg.predict([[230,6,0]])
# Now lets predict a house with 355 square meters, 3 rooms and 20 years old building
reg.predict([[355,3,20]])
# You can make as many predictions as you want..
reg.predict([[230,4,10], [230,6,0], [355,3,20], [275, 5, 17]])
And my dataset is below: