I have two 1D np.arrays, X and Y, that are the values of a time series. I am trying to do time series forecasting. However, when I apply the following code:
from statsmodels.regression.rolling import RollingOLS

mod = RollingOLS(endog=Y, exog=X, window=75, min_nobs=None, expanding=True)
fit = mod.fit()
print("Akaike")
print(fit.aic)
I get an array of length equal to that of X and Y, which leads me to think that the modeling doesn't work as I would like, because I should get only one value.
Thus, I think the format of X and Y is inadequate. How can I solve this?
According to the documentation, exog is supposed to be a 2D array of shape [nobs, k].
exog: array_like
A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
You might want to add a column filled with ones.
import statsmodels.api as sm
X = sm.add_constant(X)
# if your statsmodels version is new:
# X = statsmodels.tools.tools.add_constant(X)
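As a quick illustration (a sketch with a made-up 1D X standing in for your data), add_constant turns a 1D regressor into the (nobs, k) shape the docs describe:
import numpy as np
import statsmodels.api as sm

X = np.random.randn(100)    # hypothetical 1D regressor
exog = sm.add_constant(X)   # prepends a column of ones
print(exog.shape)           # (100, 2), i.e. nobs x k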
https://www.statsmodels.org/stable/generated/statsmodels.regression.rolling.RollingOLS.html
https://www.statsmodels.org/stable/examples/notebooks/generated/rolling_ls.html
Possibly related question:
Why do I get only one parameter from a statsmodels OLS fit
Related
I have some code that fits a linear regression to some data. In particular, I have only one feature (a one-dimensional x) but several variables to fit on the same x-values (a two-dimensional y). I can then take advantage of parallelisation by fitting the whole y matrix at the same time.
The issue is that I would then like to store the predictors independently, so that I can predict only one selected variable and not all the variables I fitted.
Here is a code example:
import numpy as np
from sklearn import linear_model

# Generate x and y
x = np.linspace(0, 10, num=11).reshape(-1, 1)
y = []
for i in range(5):
    coef = np.random.rand(2) * 10
    y.append(x * coef[0] + coef[1])
y = np.concatenate(y, axis=1)

# Fit a linear regression to all targets at once
lin_reg = linear_model.LinearRegression(n_jobs=-1)
lin_reg.fit(X=x, y=y)
If I run lin_reg.predict(x), I get all the variables predicted (a 2D matrix). I would like to be able to save somewhere (in a DataFrame, but that's not the issue) the sub-prediction function, as if I could store lin_reg[i] or lin_reg.predict[i], which I could call to predict only the 1D array corresponding to the selected variable.
Is that possible ?
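One sketch of a workaround, assuming scikit-learn's usual fitted attributes coef_ (shape (n_targets, n_features)) and intercept_ (shape (n_targets,)), is to rebuild each sub-predictor from its own row of coefficients:
def predict_one(lin_reg, x, i):
    # Rebuild the i-th sub-model from its row of coefficients
    return x @ lin_reg.coef_[i] + lin_reg.intercept_[i]

# e.g. predict only variable 2:
y2_pred = predict_one(lin_reg, x, 2)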
I have a function that gives me probability distributions for each class, in terms of a matrix corresponding to mean values and another matrix corresponding to variance values. For example, if I had four classes then I would have the following outputs:
y_means = [1,2,3,4]
y_variance = [0.01,0.02,0.03,0.04]
I need to do the following calculation to the mean values to continue with the rest of my program:
import numpy as np

y_means = np.array(y_means)
y_means = np.reshape(y_means, (y_means.size, 1))
A = np.random.randn(10, y_means.size)
y_means = np.matmul(A, y_means)
Here, I have used the numpy.random.randn function to generate random samples from a standard normal distribution, and then multiplied them with the matrix of mean values to obtain a new output matrix. The dimension of the output matrix is then (10 x 1).
I need to do a similar calculation such that my output variances will also be a (10 x 1) matrix. But it is not meaningful to multiply the variances in the same way with random samples from a standard normal distribution, because this would also produce negative values. That is undesirable because my ultimate aim is to create a normal distribution with these mean values and their corresponding variance values using:
torch.distributions.normal.Normal(loc=y_means, scale=y_variance)
So my question is whether there is any method by which I can get a variance value for each random sample generated by numpy.random.randn, because then the multiplication of such a matrix with output_variance would make more sense.
Or if there is any other strategy for this that I might be unaware of, please let me know.
The problem mentioned in the question required another matrix of the same dimension as A that corresponded to a variance measure for the random samples present in A.
Taking a row-wise or column-wise variance of the matrix A using numpy.var() didn't give a comparable 10 x 4 matrix to multiply with y_variance.
I had solved the above problem by using the following approach:
First create a matrix with the same dimensions as A with zero entries, using the following line of code:
A_var = np.zeros_like(A)
then, using torch.distributions, create normal distributions with the values in A as the mean and zeroes as variance:
dist_A = torch.distributions.normal.Normal(loc=torch.Tensor(A), scale=torch.Tensor(A_var))
https://pytorch.org/docs/stable/distributions.html lists all the operations possible on Normal distributions in PyTorch. The sample() method can generate samples from a given distribution of any size. This property was exploited to first generate a sample matrix of size 10 x 10 x 4 and then calculate the variance along axis 0.
np.var(np.array(dist_A.sample((10,))), axis=0)
This would result in a variance matrix of size 10 x 4, which can be used for calculations with y_variance.
Say I have a data set of 100 data points. The interesting part about this data set is that each data point is a 4x3 matrix. My question is how I should calculate the standard deviation of this data set. I tried the following code, but I don't know if the result is correct. If it is correct, I want to know how it works. I know the standard deviation equation for 1D data, but I don't know the definition of std for a collection of m x n data points. The docstring of np.std only explains the 1D case.
import numpy as np

datalist = []
for _ in range(100):
    data = np.random.random((4, 3))
    datalist.append(data)

std = np.std(np.asarray(datalist))
print(std)
It seems you have some unnecessary steps there. To begin with, you can get 100 matrices of 4x3 like this:
x = np.random.rand(100, 4, 3)
Then just call np.std on it:
np.std(x)
0.2827262559096299
That's if you want the standard deviation of all values. If you want it per matrix cell, specify the axis argument:
np.std(x, axis=0)
array([[0.27863211, 0.2670126 , 0.28752064],
[0.28540484, 0.25365294, 0.28905531],
[0.28848584, 0.27695767, 0.26886147],
[0.27138472, 0.3135065 , 0.29361115]])
axis=0 means that it's going to collapse axis 0 (the one with size 100), which returns a 4x3 matrix.
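For what it's worth, the same axis logic also gives a per-matrix spread, if you wanted one value for each 4x3 matrix:
np.std(x, axis=(1, 2))   # collapses the 4x3 axes, returns shape (100,)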
I am having trouble understanding the output of my function implementing multiple ridge regression. I am doing this from scratch in Python, using the closed form of the method:
w = (X^T X + lambda*I)^(-1) X^T y
I have a training set X that is 100 rows x 10 columns and a vector y that is 100x1.
My attempt is as follows:
import numpy as np

def ridgeRegression(xMatrix, yVector, lambdaRange):
    wList = []
    for i in range(1, lambdaRange + 1):
        lambVal = i
        # compute the inner term (X.T X + lambda * I)
        xTranspose = np.transpose(xMatrix)
        xTx = xTranspose @ xMatrix
        lamb_I = lambVal * np.eye(xTx.shape[0])
        # invert the inner term, i.e. (inner)**(-1)
        inner_matInv = np.linalg.inv(xTx + lamb_I)
        # compute the outer term (X.T y)
        outer_xTy = np.dot(xTranspose, yVector)
        # multiply together
        w = inner_matInv @ outer_xTy
        wList.append(w)
    print(wList)
For testing, I am running it with the first 5 lambda values.
wList becomes a list of 5 numpy arrays, each of length 10 (I'm assuming one entry per coefficient).
Here is the first of those 5 arrays:
array([ 0.29686755, 1.48420319, 0.36388528, 0.70324668, -0.51604451,
2.39045735, 1.45295857, 2.21437745, 0.98222546, 0.86124358])
My questions, for clarification:
Shouldn't there be 11 coefficients (1 for the y-intercept + 10 slopes)?
How do I get the mean squared error (MSE) from this computation?
What comes next if I wanted to plot this line?
I think I am just really confused as to what I'm looking at, since I'm still working on my linear algebra.
Thanks!
First, I would modify your ridge regression to look like the following:
import numpy as np

def ridgeRegression(X, y, lambdaRange):
    wList = []
    # Get normal form of `X`
    A = X.T @ X
    # Get identity matrix
    I = np.eye(A.shape[0])
    # Get right-hand side
    c = X.T @ y
    for lambVal in range(1, lambdaRange + 1):
        # Set up the equations B w = c
        lamb_I = lambVal * I
        B = A + lamb_I
        # Solve for w
        w = np.linalg.solve(B, c)
        wList.append(w)
    return wList
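As a quick sanity check, a hedged usage sketch with stand-in data shaped like the question's (100 rows, 10 columns):
X = np.random.randn(100, 10)
y = np.random.randn(100)

wList = ridgeRegression(X, y, 5)
print(len(wList))       # 5 weight vectors, one per lambda value
print(wList[0].shape)   # (10,), one coefficient per column of X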
Notice that I replaced your inv call, which computes the matrix inverse, with an implicit solve. This is much more numerically stable, which is an important consideration for problems of this type.
I've also moved the A = X.T @ X computation, the identity matrix I generation, and the right-hand-side vector c = X.T @ y computation out of the loop: these don't change within the loop and are relatively expensive to compute.
As was pointed out by @qwr, the number of columns of X determines the number of coefficients you get. You have not described your model, so it's not clear how the underlying domain, x, is structured into X.
Traditionally, one might use polynomial regression, in which case X is the Vandermonde matrix. In that case, the first coefficient is associated with the y-intercept. However, based on the context of your question, you seem to be interested in multivariate linear regression. In any case, the model needs to be clearly defined. Once it is, the returned weights may be used to further analyze your data.
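For instance, a cubic Vandermonde design matrix can be sketched with np.vander (an illustrative aside, with made-up x):
import numpy as np

x = np.linspace(0, 1, 100)
X = np.vander(x, N=4, increasing=True)   # columns: 1, x, x**2, x**3
# the leading column of ones makes the first coefficient the y-intercept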
Typically, to make the notation more compact, the matrix X contains a column of ones for the intercept, so if you have p predictors, the matrix has dimensions n by (p+1). See the Wikipedia article on linear regression for an example.
To compute in-sample MSE, use the definition for MSE: the average of squared residuals. To compute generalization error, you need cross-validation.
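A minimal sketch of that computation, assuming X, y, and a fitted weight vector w from the function above:
residuals = y - X @ w
mse = np.mean(residuals**2)   # average of squared residuals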
Also, you shouldn't take lambVal to be an integer. It can be small (close to 0) if the aim is just to avoid numerical error when xTx is ill-conditioned.
I would advise you to use a logarithmic range instead of a linear one, starting from 0.001 and going up to 100 or more if you want. For instance, you can change your code to this:
powerMin = -3
powerMax = 3

for i in range(powerMin, powerMax):
    lambVal = 10**i
    print(lambVal)
Then you can try a smaller range or a linear range once you have figured out, via cross-validation, the correct order of magnitude of lambVal for your data.
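Equivalently, numpy can generate that grid directly:
import numpy as np

lambdas = np.logspace(-3, 2, num=6)   # [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]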
I'm experimenting to decide whether a time series (as in, one list of floats) is correlated with itself. I've already had a play with the acf function in statsmodels (http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.stattools.acf.html), and now I'm looking at whether the Durbin-Watson statistic has any worth.
It seems like this kind of thing should work:
from statsmodels.regression.linear_model import OLS
import numpy as np
data = np.arange(100) # this should be highly correlated
ols_res = OLS(data)
dw_res = np.sum(np.diff(ols_res.resid.values))
If you were to run this, you would get:
Traceback (most recent call last):
...
File "/usr/lib/pymodules/python2.7/statsmodels/regression/linear_model.py", line 165, in initialize
self.nobs = float(self.wexog.shape[0])
AttributeError: 'NoneType' object has no attribute 'shape'
It seems that D/W is usually used to compare two time series (e.g. http://connor-johnson.com/2014/02/18/linear-regression-with-python/) for correlation, so I think the problem is that I've not passed another time series to compare to. Perhaps this is supposed to be passed in the exog parameter to OLS?
exog : array-like
A nobs x k array where nobs is the number of observations and k is
the number of regressors.
(from http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLS.html)
Side note: I'm not sure what a "nobs x k" array means. Maybe an array that is nobs by k?
So what should I be doing here? Am I expected to pass the data twice, or to lag it manually myself, or something else?
Thanks!
I've accepted user333700's answer, but I wanted to post a follow-up code snippet.
This small program computes the Durbin-Watson statistic for a linear range (which should be highly correlated, thus giving a value close to 0) and then for random values (which should not be correlated, thus giving a value close to 2):
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.stattools import durbin_watson
import numpy as np

def dw(data):
    ols_res = OLS(data, np.ones(len(data))).fit()
    return durbin_watson(ols_res.resid)

print("dw of range=%f" % dw(np.arange(2000)))
print("dw of rand=%f" % dw(np.random.randn(2000)))
When run:
dw of range=0.000003
dw of rand=2.036162
So I think that looks good :)
OLS is a regression that needs both y and x (or endog and exog). In your case, x needs to be at least a constant, i.e. np.ones((len(endog), 1)).
Also, you need to fit the model, i.e. ols_res = OLS(y, x).fit().
"nobs x k" means 2-dimensional, with nobs observations in rows and k variables in columns, i.e. exog.shape is (nobs, k).
Durbin-Watson is a test statistic for serial correlation. It is included in the OLS summary output. There are other tests for no autocorrelation included in statsmodels.
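For instance, assuming a reasonably recent statsmodels, the Ljung-Box test can be run on the residuals directly:
from statsmodels.stats.diagnostic import acorr_ljungbox

# test the OLS residuals for no autocorrelation up to lag 10
print(acorr_ljungbox(ols_res.resid, lags=[10]))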
(I would recommend working through some example or tutorial notebooks.)