Why do I get such strange results for time series analysis? - python

I'm trying to build a model to predict the number of daily orders for a delivery cafe. Here is the data,
Here you can see two major peaks: they are holidays -- Feb 14 and Mar 8, respectively. Also, you can see an obvious seasonality with the period of 7: people order more at the weekend and less at working days.
Dickey-Fuller test shows that the series is not stationary with the p-value = 0.152
Then I decided to apply the Box-Cox transformation because the variance looks uneven. After that, the Dickey-Fuller test's p-value is 0.222, but the transformed series now looks like a sinusoid,
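(The counts_box column and the lmbda value used below are not defined in the snippets shown; a minimal sketch of how they could have been produced, assuming scipy's Box-Cox was used:)
from scipy import stats
# hypothetical reconstruction: Box-Cox transform the counts and keep the fitted lambda
data["counts_box"], lmbda = stats.boxcox(data.counts)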
Then I applied a seasonal difference like this:
data["counts_diff"] = data.counts_box - data.counts_box.shift(7)
plt.figure(figsize=(15,10))
sm.tsa.seasonal_decompose(data.counts_diff[7:]).plot()
p_value = sm.tsa.stattools.adfuller(data.counts_diff[7:])[1]
Now p-value is about 10e-5 and the series is stationary
Then I plotted ACF and PACF to choose initial parameters to grid search,
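(The ACF/PACF figures are not reproduced here; for reference, a minimal way such plots can be produced with statsmodels, assuming the differenced series above:)
import matplotlib.pyplot as plt
import statsmodels.api as sm
fig, axes = plt.subplots(2, 1, figsize=(15, 8))
# autocorrelation and partial autocorrelation of the stationary, differenced series
sm.graphics.tsa.plot_acf(data.counts_diff[7:].values, lags=50, ax=axes[0])
sm.graphics.tsa.plot_pacf(data.counts_diff[7:].values, lags=50, ax=axes[1])
plt.show()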
Common sense and the rules I know told me to choose,
Q = 1
q = 4
P = 3
p = 6
d = 0
D = 1
The code for model finding:
ps = range(0, p+1)
d = 0
qs = range(0, q+1)
Ps = range(0, P+1)
D = 1
Qs = range(0, Q+1)

parameters_list = []
for p in ps:
    for q in qs:
        for P in Ps:
            for Q in Qs:
                parameters_list.append((p, q, P, Q))

len(parameters_list)
%%time
from IPython.display import clear_output
from numpy.linalg import LinAlgError  # needed for the except clause below

results = []
best_aic = float("inf")
i = 1
for param in parameters_list:
    print("counting {}/{}...".format(i, len(parameters_list)))
    i += 1
    try:
        model = sm.tsa.statespace.SARIMAX(data.counts_diff[7:], order=(param[0], d, param[1]),
                                          seasonal_order=(param[2], D, param[3], 7)).fit(disp=-1)
    except ValueError:
        print('wrong parameters:', param)
        continue
    except LinAlgError:
        print('LU decomposition error:', param)
        continue
    finally:
        clear_output()
    aic = model.aic
    if aic < best_aic:
        best_model = model
        best_aic = aic
        best_param = param
    results.append([param, model.aic])
After grid search, I get this,
But when I plot it, it shows a constant line at zero,
The residuals are unbiased, have no trend, and are not autocorrelated,
The code for plotting is here:
# inverse Box-Cox transformation
def invboxcox(y, lmbda):
    if lmbda == 0:
        return np.exp(y)
    else:
        return np.exp(np.log(lmbda*y + 1) / lmbda)

data["model"] = invboxcox(best_model.fittedvalues, lmbda)
plt.figure(figsize=(15,5))
plt.ylabel('Orders count')
data.counts.plot()
#pylab.show()
data.model.plot(color='r')
plt.ylabel('Model explanation')
pylab.show()
If I uncomment the line, the plot looks as follows,
What am I missing? Should I consider the sinusoid shape of the transformed series? And why is the scale so different?
Also, the code,
data["model"] = invboxcox(best_model.fittedvalues, lmbda)
plt.figure(figsize(15,5))
(data.counts_diff + 1).plot()
#pylab.show()
data.model.plot(color='r')
pylab.show()
plots two fairly similar plots,
So, AFAIK, the problem is somewhere in the reverse transformations.
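One guess (a rough sketch, not verified): since the model was fit on the manually differenced series, the seasonal difference may need to be undone before the inverse Box-Cox, something like,
# hypothetical back-transformation, assuming fittedvalues are on the counts_diff scale:
# undo the manual seasonal difference, then undo the Box-Cox transform
fitted_box = best_model.fittedvalues + data.counts_box.shift(7)
data["model"] = invboxcox(fitted_box, lmbda)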

Let's first import the essentials, load the data and transform the series,
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.base.transform import BoxCox
# Load data
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Transformation
box_cox = BoxCox()
y, lmbda = box_cox.transform_boxcox(df['counts'])
After Box-Cox transforming your series, I test it for a unit root,
>>>print(sm.tsa.kpss(y)[1])
0.0808334102754407
And,
>>>print(sm.tsa.adfuller(y)[1])
0.18415817136548102
The ADF test fails to reject a unit root, while the KPSS test does not reject stationarity, so the two tests disagree. Visual inspection suggests the series may be stationary 'enough'. Let's consider a model,
df['counts'] = y
model = sm.tsa.SARIMAX(df['counts'], None, (1, 0, 1), (2, 1, 1, 7))
result = model.fit(method='bfgs')
And,
>>>print(result.summary())
Optimization terminated successfully.
Current function value: -3.505128
Iterations: 33
Function evaluations: 41
Gradient evaluations: 41
Statespace Model Results
=========================================================================================
Dep. Variable: counts No. Observations: 346
Model: SARIMAX(1, 0, 1)x(2, 1, 1, 7) Log Likelihood 1212.774
Date: Wed, 24 Jul 2019 AIC -2413.549
Time: 13:37:19 BIC -2390.593
Sample: 07-01-2018 HQIC -2404.401
- 06-11-2019
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.8699 0.052 16.691 0.000 0.768 0.972
ma.L1 -0.5811 0.076 -7.621 0.000 -0.731 -0.432
ar.S.L7 0.0544 0.056 0.963 0.335 -0.056 0.165
ar.S.L14 0.0987 0.060 1.654 0.098 -0.018 0.216
ma.S.L7 -0.9520 0.036 -26.637 0.000 -1.022 -0.882
sigma2 4.385e-05 2.44e-06 17.975 0.000 3.91e-05 4.86e-05
===================================================================================
Ljung-Box (Q): 34.12 Jarque-Bera (JB): 68.51
Prob(Q): 0.73 Prob(JB): 0.00
Heteroskedasticity (H): 1.57 Skew: 0.08
Prob(H) (two-sided): 0.02 Kurtosis: 5.20
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The Ljung-Box result seems to suggest the residuals do not exhibit auto-correlation - which is good! Let's inverse transform the data and the results, and plot the fit,
# Since the first 7 days were back-filled
y_hat = result.fittedvalues[7:]
# Inverse transformations
y_hat_inv = pd.DataFrame(box_cox.untransform_boxcox(y_hat, lmbda),
index=y_hat.index)
y_inv = pd.DataFrame(box_cox.untransform_boxcox(y, lmbda),
index=df.index)
# Plot fitted values with data
_, ax = plt.subplots()
y_inv.plot(ax=ax)
y_hat_inv.plot(ax=ax)
plt.legend(['Data', 'Fitted values'])
plt.show()
Where I get,
Which does not look bad at all!
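As a possible follow-up (a minimal sketch, not part of the original answer): out-of-sample forecasts come from the same results object and can be inverse transformed the same way,
# forecast the next 14 days on the Box-Cox scale, then map back to counts
forecast = result.get_forecast(steps=14)
forecast_inv = box_cox.untransform_boxcox(forecast.predicted_mean, lmbda)
print(forecast_inv)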

Related

ordering by multiple columns pandas - 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument

I have the dataframe 'rankedvariableslist', with the index 'Sleepvariables' being the sleep variable of interest, and the two columns being the R-squared and P-value of that model and variable respectively.
I am trying to sort the data in ascending order by 'P-value', then by 'R-squared value', but I keep getting the error: ''values' is not ordered, please explicitly specify the categories order by passing in a categories argument' and am not sure why.
I would be so grateful for a helping hand!
correspondantsleepvariable = []
correspondantpvalue = []
rsquaredvalues = []
newerresults = resultmodeldistancevariation2sleepsummary.tables[0]
newerdata = pd.DataFrame(newerresults)
rsquaredvalue = newerdata.iloc[0,3]
rsquaredvalues.append(rsquaredvalue)
modelpvalues = resultmodeldistancevariation2sleepsummary.tables[1]
newerdatavalues = pd.DataFrame(modelpvalues)
pvalue = newerdatavalues.iloc[12,4]
correspondantpvalue.append(pvalue)
correspondantsleepvariable.append(sleepvariable[i])
rankedvariableslist.sort_values(['P-value','R-squared value'], ascending=[True, False])
print(rankedvariableslist.head(3))
Sleepvariables R-squared value P-value
0 hours_of_sleep 0.026 0.491
1 frequency_of_alarm_usage 0.026 0.681
2 sleepiness_bed 0.026 0.413
As an example of the dataframe 'newerresults':
OLS Regression Results
==============================================================================
Dep. Variable: distance R-squared: 0.028
Model: OLS Adj. R-squared: 0.016
Method: Least Squares F-statistic: 2.338
Date: Fri, 18 Nov 2022 Prob (F-statistic): 0.00773
Time: 12:39:29 Log-Likelihood: -1274.1
No. Observations: 907 AIC: 2572.
Df Residuals: 895 BIC: 2630.
Df Model: 11
Covariance Type: nonrobust
==============================================================================
The following code worked - instead of converting the model summary output to a dataframe directly, I converted it to HTML and read it back with pandas.
correspondantsleepvariable = []
correspondantpvalue = []
rsquaredvalues = []
results_as_html = resultmodeldistancevariation2sleepsummary.tables[0].as_html()
datehere = pd.read_html(results_as_html, header=None, index_col=None)[0]
rsquaredvalue = datehere.iloc[0,3]
rsquaredvalue.astype(float)
rsquaredvalues.append(rsquaredvalue)
results_as_html = resultmodeldistancevariation2sleepsummary.tables[1].as_html()
datehere = pd.read_html(results_as_html, header=0, index_col=0)[0]
pvalue = datehere.iloc[11,3]
pvalue.astype(float)
correspondantpvalue.append(pvalue)
correspondantsleepvariable.append(sleepvariable[i])
rankedvariableslist = pd.DataFrame({'Sleepvariables': correspondantsleepvariable,
                                    'R-squared value': rsquaredvalues,
                                    'P-value': correspondantpvalue})
rankedvariableslist.sort_values(by=['P-value','R-squared value'], ascending=[True, False], inplace=True)
print(rankedvariableslist)
Sleepvariables R-squared value P-value
9 time_spent_awake_during_night_mins 0.034 0.005
4 sleep_quality 0.030 0.041
20 sleepiness_resolution_index 0.028 0.129
Thanks so much for all your help - I am so grateful!

statsmodel time series analysis - using AR model with differing lags between endogenous and exogenous variables

I am trying to build an ARDL model in python, where I have a model given as:
y_t = b0 + b1*y_{t-1} + b2*y_{t-2} + ... + b5*y_{t-5} + a1*x_{t-1}
In other words, a time series model with 5 autoregressive lagged terms, and 1 exogenous lag.
I tried using statsmodels.tsa.arima_model.ARMA, which allows for exogenous variables, and I get an output with the following variables:
const, x1, ar.L1.y, ar.L2.y, ar.L3.y, ar.L4.y, ar.L5.y
the code i used to get these variables is
model = ARMA(endog=returns, order=(ar_order, 0, exog_order), exog=exog_returns)
model_fit = model.fit()
print(model_fit.summary())
Based on the coefficient value of x1, I assume this isn't one time lag prior, but rather the influence of the exogenous variable as a whole.
Is there any way to specify that I just want the model to use one time lag for the exogenous variable?
Don't use ARMA; it is deprecated. Either use SARIMAX or AutoReg.
The key to using exog variables is to make sure they are aligned to the y data they affect. Here I shift x by 1 so that it is as if the lag of x is driving y.
If using AutoReg, you can do
from statsmodels.tsa.api import AutoReg
import numpy as np
import pandas as pd
rg = np.random.default_rng(0)
idx = pd.date_range("1999-12-31",freq="M",periods=100)
x = pd.DataFrame({"x":rg.standard_normal(100)}, index=idx)
x = x.shift(1) # lag x
y = x @ np.ones((1,)) + rg.standard_normal(100)
# Drop first obs which is NaN
y = y.iloc[1:]
x = x.iloc[1:]
# 5 lags and exogenous regressors
res = AutoReg(y,5, exog=x, old_names=False).fit()
# Summary
res.summary()
This produces
AutoReg Model Results
==============================================================================
Dep. Variable: y No. Observations: 99
Model: AutoReg-X(5) Log Likelihood -127.520
Method: Conditional MLE S.D. of innovations 0.940
Date: Mon, 01 Mar 2021 AIC 0.046
Time: 14:57:44 BIC 0.262
Sample: 06-30-2000 HQIC 0.133
- 03-31-2008
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0500 0.098 -0.512 0.609 -0.241 0.141
y.L1 -0.0342 0.070 -0.486 0.627 -0.172 0.104
y.L2 0.0621 0.070 0.884 0.377 -0.076 0.200
y.L3 -0.0715 0.071 -1.009 0.313 -0.210 0.067
y.L4 -0.0941 0.070 -1.347 0.178 -0.231 0.043
y.L5 -0.0323 0.071 -0.453 0.651 -0.172 0.107
x 1.0650 0.103 10.304 0.000 0.862 1.268
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 1.1584 -1.0801j 1.5838 -0.1194
AR.2 1.1584 +1.0801j 1.5838 0.1194
AR.3 -1.3814 -1.7595j 2.2370 -0.3559
AR.4 -1.3814 +1.7595j 2.2370 0.3559
AR.5 -2.4649 -0.0000j 2.4649 -0.5000
-----------------------------------------------------------------------------
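If using SARIMAX instead (a hedged sketch of the alternative mentioned above, reusing the same shifted x so that only its first lag enters the model):
from statsmodels.tsa.api import SARIMAX
# exog carries the already-lagged x; order=(5, 0, 0) mirrors the five AR lags, trend="c" adds the constant
res_sx = SARIMAX(y, exog=x, order=(5, 0, 0), trend="c").fit(disp=False)
print(res_sx.summary())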

Python Statsmodels ARIMA: How to add/combine a feature(variable) in the model

Hello I am modeling an ARIMA in Python using statsmodels.
The code is below:
p = d = q = range(0, 2)
# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))
# Generate all different combinations of seasonal p, d and q triplets
seasonal_pdq = [(x[0], x[1], x[2], 1) for x in list(itertools.product(p, d, q))]
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
warnings.filterwarnings("ignore") # specify to ignore warning messages

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue

mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
As the result of the model I get this diagnostic table:
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ma.L1 -0.6255 0.077 -8.165 0.830 -0.376 -0.475
ar.S.L12 -0.0010 0.001 1.732 0.003 -0.000 0.002
ma.S.L12 -0.8769 0.026 -33.811 0.000 -0.928 -0.826
sigma2 30.0972 0.004 22.634 0.000 0.089 0.106
==============================================================================
To improve the model, I want to get rid of ma.L1 and add a feature (a new variable) of my own. But I do not know how to add a variable to this model. Specifically, I have a list of numbers like [2.34, 3.22, 2.11, ...] with the same length as the dataset I used for this ARIMA. I want to add this list as a variable to the ARIMA model, or in other words combine this new variable with the ARIMA model. Any idea how I should do this?
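A hedged sketch of one common approach (not a definitive answer): SARIMAX accepts an exog argument for external regressors, so the extra list can be passed alongside y, assuming it is aligned with and the same length as y. The names below are illustrative:
import numpy as np
my_feature = np.asarray(my_list)  # my_list is the hypothetical list [2.34, 3.22, 2.11, ...], same length as y
mod = sm.tsa.statespace.SARIMAX(y,
                                exog=my_feature,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])  # the exog coefficient appears alongside the ARIMA terms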

Find p-value (significance) in scikit-learn LinearRegression

How can I find the p-value (significance) of each coefficient?
lm = sklearn.linear_model.LinearRegression()
lm.fit(x,y)
This is kind of overkill but let's give it a go. First let's use statsmodels to find out what the p-values should be
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
and we get
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.518
Model: OLS Adj. R-squared: 0.507
Method: Least Squares F-statistic: 46.27
Date: Wed, 08 Mar 2017 Prob (F-statistic): 3.83e-62
Time: 10:08:24 Log-Likelihood: -2386.0
No. Observations: 442 AIC: 4794.
Df Residuals: 431 BIC: 4839.
Df Model: 10
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 152.1335 2.576 59.061 0.000 147.071 157.196
x1 -10.0122 59.749 -0.168 0.867 -127.448 107.424
x2 -239.8191 61.222 -3.917 0.000 -360.151 -119.488
x3 519.8398 66.534 7.813 0.000 389.069 650.610
x4 324.3904 65.422 4.958 0.000 195.805 452.976
x5 -792.1842 416.684 -1.901 0.058 -1611.169 26.801
x6 476.7458 339.035 1.406 0.160 -189.621 1143.113
x7 101.0446 212.533 0.475 0.635 -316.685 518.774
x8 177.0642 161.476 1.097 0.273 -140.313 494.442
x9 751.2793 171.902 4.370 0.000 413.409 1089.150
x10 67.6254 65.984 1.025 0.306 -62.065 197.316
==============================================================================
Omnibus: 1.506 Durbin-Watson: 2.029
Prob(Omnibus): 0.471 Jarque-Bera (JB): 1.404
Skew: 0.017 Prob(JB): 0.496
Kurtosis: 2.726 Cond. No. 227.
==============================================================================
Ok, let's reproduce this. It is kind of overkill as we are almost reproducing a linear regression analysis using Matrix Algebra. But what the heck.
lm = LinearRegression()
lm.fit(X,y)
params = np.append(lm.intercept_,lm.coef_)
predictions = lm.predict(X)
newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))
# Note if you don't want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))
var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params/ sd_b
p_values = [2*(1-stats.t.cdf(np.abs(i), (len(newX)-len(newX.columns)))) for i in ts_b]
sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)
myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)
And this gives us.
Coefficients Standard Errors t values Probabilities
0 152.1335 2.576 59.061 0.000
1 -10.0122 59.749 -0.168 0.867
2 -239.8191 61.222 -3.917 0.000
3 519.8398 66.534 7.813 0.000
4 324.3904 65.422 4.958 0.000
5 -792.1842 416.684 -1.901 0.058
6 476.7458 339.035 1.406 0.160
7 101.0446 212.533 0.475 0.635
8 177.0642 161.476 1.097 0.273
9 751.2793 171.902 4.370 0.000
10 67.6254 65.984 1.025 0.306
So we can reproduce the values from statsmodels.
scikit-learn's LinearRegression doesn't calculate this information but you can easily extend the class to do it:
from sklearn import linear_model
from scipy import stats
import numpy as np


class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculate t-statistics
    and p-values for model coefficients (betas).
    Additional attributes available after .fit()
    are `t` and `p` which are of the shape (y.shape[1], X.shape[1])
    which is (n_features, n_coefs)
    This class sets the intercept to 0 by default, since usually we include it
    in X.
    """

    def __init__(self, *args, **kwargs):
        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False
        super(LinearRegression, self)\
            .__init__(*args, **kwargs)

    def fit(self, X, y, n_jobs=1):
        self = super(LinearRegression, self).fit(X, y, n_jobs)
        sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
        se = np.array([
            np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X))))
            for i in range(sse.shape[0])
        ])
        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1]))
        return self
Stolen from here.
You should take a look at statsmodels for this kind of statistical analysis in Python.
An easy way to pull of the p-values is to use statsmodels regression:
import statsmodels.api as sm
mod = sm.OLS(Y,X)
fii = mod.fit()
p_values = fii.summary2().tables[1]['P>|t|']
You get a series of p-values that you can manipulate (for example choose the order you want to keep by evaluating each p-value):
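For example (a small sketch, assuming the fit above), you could keep only the predictors whose p-value is below 0.05:
significant = p_values[p_values < 0.05]
print(significant.sort_values())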
The code in elyase's answer https://stackoverflow.com/a/27928411/4240413 does not actually work. Notice that sse is a scalar, and then it tries to iterate through it. The following code is a modified version. Not amazingly clean, but I think it works more or less.
import numpy as np
import scipy as sc
import scipy.linalg  # makes sc.linalg available for sqrtm below
from scipy.stats import t
from sklearn import linear_model


class LinearRegression(linear_model.LinearRegression):

    def __init__(self, *args, **kwargs):
        # *args is the list of arguments that might go into the LinearRegression object
        # that we don't know about and don't want to have to deal with. Similarly, **kwargs
        # is a dictionary of key words and values that might also need to go into the original
        # LinearRegression object. We put *args and **kwargs so that we don't have to look
        # these up and write them down explicitly here. Nice and easy.
        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False
        super(LinearRegression, self).__init__(*args, **kwargs)

    # Adding in t-statistics for the coefficients.
    def fit(self, x, y):
        # This takes in numpy arrays (not matrices). Also assumes you are leaving out the column
        # of constants.
        # Not totally sure what 'super' does here and why you redefine self...
        self = super(LinearRegression, self).fit(x, y)
        n, k = x.shape
        yHat = np.matrix(self.predict(x)).T
        # Change X and Y into numpy matrices. x also has a column of ones added to it.
        x = np.hstack((np.ones((n, 1)), np.matrix(x)))
        y = np.matrix(y).T
        # Degrees of freedom.
        df = float(n - k - 1)
        # Sample variance.
        sse = np.sum(np.square(yHat - y), axis=0)
        self.sampleVariance = sse / df
        # Sample variance for x.
        self.sampleVarianceX = x.T * x
        # Covariance Matrix = [(s^2)(X'X)^-1]^0.5. (sqrtm = matrix square root. ugly)
        self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0, 0] * self.sampleVarianceX.I)
        # Standard errors for the coefficients: the diagonal elements of the covariance matrix.
        self.se = self.covarianceMatrix.diagonal()[1:]
        # T statistic for each beta.
        self.betasTStat = np.zeros(len(self.se))
        for i in range(len(self.se)):
            self.betasTStat[i] = self.coef_[0, i] / self.se[i]
        # P-value for each beta. This is a two sided t-test, since the betas can be
        # positive or negative.
        self.betasPValue = 1 - t.cdf(abs(self.betasTStat), df)
There could be a mistake in @JARH's answer in the case of a multivariable regression.
(I do not have enough reputation to comment.)
In the following line:
p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-1))) for i in ts_b],
the t-values are treated as following a t-distribution with len(newX)-1 degrees of freedom, instead of a t-distribution with len(newX)-len(newX.columns)-1 degrees of freedom.
So this should be:
p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX.columns)-1))) for i in ts_b]
(See t-values for OLS regression for more details)
You can use scipy for p-value. This code is from scipy documentation.
>>> from scipy import stats
>>> import numpy as np
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
For a one-liner you can use the pingouin.linear_regression function (disclaimer: I am the creator of Pingouin), which works with uni/multi-variate regression using NumPy arrays or Pandas DataFrame, e.g:
import pingouin as pg
# Using a Pandas DataFrame `df`:
lm = pg.linear_regression(df[['x', 'z']], df['y'])
# Using a NumPy array:
lm = pg.linear_regression(X, y)
The output is a dataframe with the beta coefficients, standard errors, T-values, p-values and confidence intervals for each predictor, as well as the R^2 and adjusted R^2 of the fit.
The model's overall p-value comes from the F-statistic. If you want to get that value, simply use these few lines of code:
import statsmodels.api as sm
from scipy import stats
from sklearn import datasets
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
print(est.fit().f_pvalue)
Getting a little bit into the theory of linear regression, here is the summary of what we need in order to compute the p-values for the coefficient estimators (random variables), to check if they are significant (by rejecting the corresponding null hypothesis):
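(The referenced summary figure is not reproduced here; in short, with X1 including the constant column, n observations and p = X1.shape[1] parameters, the quantities computed below are)
\hat{\sigma} = \sqrt{\frac{\lVert y - X_1\hat{\beta} \rVert^2}{n - p}}, \qquad
t_j = \frac{\hat{\beta}_j}{\hat{\sigma}\,\sqrt{\left[(X_1^\top X_1)^{-1}\right]_{jj}}}, \qquad
p_j = 2\,\Pr\left(T_{n-p} > |t_j|\right)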
Now, let's compute the p-values using the following code snippets:
import numpy as np
# generate some data
np.random.seed(1)
n = 100
X = np.random.random((n,2))
beta = np.array([-1, 2])
noise = np.random.normal(loc=0, scale=2, size=n)
y = X @ beta + noise
Compute p-values from the above formulae with scikit-learn:
# use scikit-learn's linear regression model to obtain the coefficient estimates
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X, y)
beta_hat = [reg.intercept_] + reg.coef_.tolist()
beta_hat
# [0.18444290873001834, -1.5879784718284842, 2.5252138207251904]
# compute the p-values
from scipy.stats import t
# add ones column
X1 = np.column_stack((np.ones(n), X))
# standard deviation of the noise.
sigma_hat = np.sqrt(np.sum(np.square(y - X1 @ beta_hat)) / (n - X1.shape[1]))
# estimate the covariance matrix for beta
beta_cov = np.linalg.inv(X1.T @ X1)
# the t-test statistic for each variable from the formula from above figure
t_vals = beta_hat / (sigma_hat * np.sqrt(np.diagonal(beta_cov)))
# compute 2-sided p-values.
p_vals = t.sf(np.abs(t_vals), n-X1.shape[1])*2
t_vals
# array([ 0.37424023, -2.36373529, 3.57930174])
p_vals
# array([7.09042437e-01, 2.00854025e-02, 5.40073114e-04])
Compute p-values with statsmodels:
import statsmodels.api as sm
X1 = sm.add_constant(X)
model = sm.OLS(y, X1)
model = model.fit()
model.tvalues
# array([ 0.37424023, -2.36373529, 3.57930174])
# compute p-values
t.sf(np.abs(model.tvalues), n-X1.shape[1])*2
# array([7.09042437e-01, 2.00854025e-02, 5.40073114e-04])
model.summary()
As can be seen above, the p-values computed in both cases are exactly the same.
Another option besides those already proposed would be to use permutation testing. Fit the model N times with the values of y shuffled and compute the proportion of the coefficients of the fitted models that have larger values (one-sided test) or larger absolute values (two-sided test) than those given by the original model. These proportions are the p-values.
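A minimal sketch of that idea (assuming numpy arrays X and y and a two-sided test; the names here are illustrative):
import numpy as np
from sklearn.linear_model import LinearRegression

def permutation_pvalues(X, y, n_permutations=1000, seed=0):
    # two-sided permutation p-values for LinearRegression coefficients
    rng = np.random.default_rng(seed)
    observed = np.abs(LinearRegression().fit(X, y).coef_)
    exceed = np.zeros_like(observed)
    for _ in range(n_permutations):
        y_perm = rng.permutation(y)  # shuffle the response
        perm_coef = np.abs(LinearRegression().fit(X, y_perm).coef_)
        exceed += (perm_coef >= observed)
    return exceed / n_permutations  # proportion of shuffles beating the observed coefficient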

Linear regression with matplotlib / numpy

I'm trying to generate a linear regression on a scatter plot I have generated; however, my data is in list format, and all of the examples I can find of using polyfit require using arange. arange doesn't accept lists though. I have searched high and low about how to convert a list to an array and nothing seems clear. Am I missing something?
Following on, how best can I use my list of integers as inputs to the polyfit?
Here is the polyfit example I am following:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(data)
y = np.arange(data)
m, b = np.polyfit(x, y, 1)
plt.plot(x, y, 'yo', x, m*x+b, '--k')
plt.show()
arange generates lists (well, numpy arrays); type help(np.arange) for the details. You don't need to call it on existing lists.
>>> x = [1,2,3,4]
>>> y = [3,5,7,9]
>>>
>>> m,b = np.polyfit(x, y, 1)
>>> m
2.0000000000000009
>>> b
0.99999999999999833
I should add that I tend to use poly1d here rather than write out "m*x+b" and the higher-order equivalents, so my version of your code would look something like this:
import numpy as np
import matplotlib.pyplot as plt
x = [1,2,3,4]
y = [3,5,7,10] # 10, not 9, so the fit isn't perfect
coef = np.polyfit(x,y,1)
poly1d_fn = np.poly1d(coef)
# poly1d_fn is now a function which takes in x and returns an estimate for y
plt.plot(x,y, 'yo', x, poly1d_fn(x), '--k') #'--k'=black dashed line, 'yo' = yellow circle marker
plt.xlim(0, 5)
plt.ylim(0, 12)
This code:
from scipy.stats import linregress
linregress(x,y) #x and y are arrays or lists.
gives out a list with the following:
slope : float
slope of the regression line
intercept : float
intercept of the regression line
r-value : float
correlation coefficient
p-value : float
two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
stderr : float
Standard error of the estimate
Source
Use statsmodels.api.OLS to get a detailed breakdown of the fit/coefficients/residuals:
import statsmodels.api as sm
df = sm.datasets.get_rdataset('Duncan', 'carData').data
y = df['income']
x = df['education']
model = sm.OLS(y, sm.add_constant(x))
results = model.fit()
print(results.params)
# const 10.603498 <- intercept
# education 0.594859 <- slope
# dtype: float64
print(results.summary())
# OLS Regression Results
# ==============================================================================
# Dep. Variable: income R-squared: 0.525
# Model: OLS Adj. R-squared: 0.514
# Method: Least Squares F-statistic: 47.51
# Date: Thu, 28 Apr 2022 Prob (F-statistic): 1.84e-08
# Time: 00:02:43 Log-Likelihood: -190.42
# No. Observations: 45 AIC: 384.8
# Df Residuals: 43 BIC: 388.5
# Df Model: 1
# Covariance Type: nonrobust
# ==============================================================================
# coef std err t P>|t| [0.025 0.975]
# ------------------------------------------------------------------------------
# const 10.6035 5.198 2.040 0.048 0.120 21.087
# education 0.5949 0.086 6.893 0.000 0.421 0.769
# ==============================================================================
# Omnibus: 9.841 Durbin-Watson: 1.736
# Prob(Omnibus): 0.007 Jarque-Bera (JB): 10.609
# Skew: 0.776 Prob(JB): 0.00497
# Kurtosis: 4.802 Cond. No. 123.
# ==============================================================================
New in matplotlib 3.5.0
To plot the best-fit line, just pass the slope m and intercept b into the new plt.axline:
import matplotlib.pyplot as plt
# extract intercept b and slope m
b, m = results.params
# plot y = m*x + b
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
Note that the slope m and intercept b can be easily extracted from any of the common regression methods:
numpy.polyfit
import numpy as np
m, b = np.polyfit(x, y, deg=1)
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
scipy.stats.linregress
from scipy import stats
m, b, *_ = stats.linregress(x, y)
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
statsmodels.api.OLS
import statsmodels.api as sm
b, m = sm.OLS(y, sm.add_constant(x)).fit().params
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
sklearn.linear_model.LinearRegression
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(x[:, None], y)
b = reg.intercept_
m = reg.coef_[0]
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
x = np.array([1.5,2,2.5,3,3.5,4,4.5,5,5.5,6])
y = np.array([10.35,12.3,13,14.0,16,17,18.2,20,20.7,22.5])
gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
mn=np.min(x)
mx=np.max(x)
x1=np.linspace(mn,mx,500)
y1=gradient*x1+intercept
plt.plot(x,y,'ob')
plt.plot(x1,y1,'-r')
plt.show()
Use this.
George's answer goes together quite nicely with matplotlib's axline which plots an infinite line.
from scipy.stats import linregress
import matplotlib.pyplot as plt
reg = linregress(x, y)
plt.axline(xy1=(0, reg.intercept), slope=reg.slope, linestyle="--", color="k")
from pylab import *
import numpy as np
x1 = [1, 2, 3, 4]   # for example this is a list
y1 = [3, 5, 7, 10]  # for example this is a list
x = np.array(x1)    # this will convert a list into an array
y = np.array(y1)
m, b = polyfit(x, y, 1)
plot(x, y, 'yo', x, m*x + b, '--k')
show()
Another quick and dirty answer is that you can just convert your list to an array using:
import numpy as np
arr = np.asarray(listname)
Linear Regression is a good example to start with for machine learning.
Here is a good example of a machine learning algorithm, Multiple Linear Regression, using Python:
##### Predicting House Prices Using Multiple Linear Regression - #Y_T_Akademi
#### In this project we are going to see how machine learning algorithms help us predict house prices. Linear Regression is a model for predicting new future data by using the existing correlation in the old data. Here, machine learning helps us identify the relationship between feature data and output, so we can predict future values.
import pandas as pd
##### we use sklearn library in many machine learning calculations..
from sklearn import linear_model
##### we import our dataset: housepricesdataset.csv
df = pd.read_csv("housepricesdataset.csv",sep = ";")
##### The following is our feature set:
##### The following is the output(result) data:
##### we define a linear regression model here:
reg = linear_model.LinearRegression()
reg.fit(df[['area', 'roomcount', 'buildingage']], df['price'])
# Since our model is ready, we can make predictions now:
# lets predict a house with 230 square meters, 4 rooms and 10 years old building..
reg.predict([[230,4,10]])
# Now lets predict a house with 230 square meters, 6 rooms and 0 years old building - its new building..
reg.predict([[230,6,0]])
# Now lets predict a house with 355 square meters, 3 rooms and 20 years old building
reg.predict([[355,3,20]])
# You can make as many prediction as you want..
reg.predict([[230,4,10], [230,6,0], [355,3,20], [275, 5, 17]])
And my dataset is below:
