I'm experimenting to decide if a time-series (as in, one list of floats) is correlated with itself. I've already had a play with the acf function in statsmodels (http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.stattools.acf.html), now I'm looking at whether the Durbin–Watson statistic has any worth.
It seems like this kind of thing should work:
from statsmodels.regression.linear_model import OLS
import numpy as np
data = np.arange(100) # this should be highly correlated
ols_res = OLS(data)
dw_res = np.sum(np.diff(ols_res.resid.values))
If you were to run this, you would get:
Traceback (most recent call last):
...
File "/usr/lib/pymodules/python2.7/statsmodels/regression/linear_model.py", line 165, in initialize
self.nobs = float(self.wexog.shape[0])
AttributeError: 'NoneType' object has no attribute 'shape'
It seems that D/W is usually used to compare two time-series (e.g. http://connor-johnson.com/2014/02/18/linear-regression-with-python/) for correlation, so I think the problem is that I've not passed another time-series to compare to. Perhaps this is supposed to be passed in the exog parameter to OLS?
exog : array-like
A nobs x k array where nobs is the number of observations and k is
the number of regressors.
(from http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLS.html)
Side-note: I'm not sure what a "nobs x k" array means. Maybe an array that is nobs by k?
So what should I be doing here? Am I expected to pass the data twice,
or to lag it manually myself, or?
Thanks!
I've accepted user333700's answer, but I wanted to post a code snippet follow up.
This small program computes the Durbin–Watson statistic for a linear range (which should be highly autocorrelated, thus giving a value close to 0) and then for random values (which should not be correlated, thus giving a value close to 2):
from statsmodels.regression.linear_model import OLS
import numpy as np
from statsmodels.stats.stattools import durbin_watson
def dw(data):
    ols_res = OLS(data, np.ones(len(data))).fit()
    return durbin_watson(ols_res.resid)
print("dw of range=%f" % dw(np.arange(2000)))
print("dw of rand=%f" % dw(np.random.randn(2000)))
When run:
dw of range=0.000003
dw of rand=2.036162
So I think that looks good :)
OLS is a regression that needs y and x (or endog and exog). x needs to be at least a constant in your case, i.e. np.ones((len(endog), 1)).
Also, you need to fit the model, i.e. ols_res = OLS(y, x).fit().
nobs x k means 2-dimensional, with nobs observations in rows and k variables in columns, i.e. exog.shape is (nobs, k).
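Concretely, a small sketch with made-up data (sm.add_constant just prepends an intercept column):
import numpy as np
import statsmodels.api as sm

x = np.arange(100, dtype=float)   # one regressor
exog = sm.add_constant(x)         # prepend a column of ones
print(exog.shape)                 # (100, 2): nobs = 100 observations, k = 2 regressors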
Durbin–Watson is a test statistic for serial correlation. It is included in the OLS summary output. There are other tests for no autocorrelation included in statsmodels.
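For example, a minimal sketch (the residuals come from a constant-only fit as above; acorr_ljungbox is one of those other tests, and in recent statsmodels versions it returns a DataFrame):
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_ljungbox

y = np.arange(100, dtype=float)
res = sm.OLS(y, np.ones(len(y))).fit()

print(res.summary())                          # the summary table has a "Durbin-Watson" entry
print(durbin_watson(res.resid))               # the statistic on its own
print(acorr_ljungbox(res.resid, lags=[10]))   # Ljung-Box test for no autocorrelation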
(I would recommend working through some example or tutorial notebooks.)
Related
I have two 1-D np.arrays, X and Y, that are the values of a time series. I am trying to do time series forecasting. However, when I apply the following code:
from statsmodels.regression.rolling import RollingOLS
mod = RollingOLS(endog=Y, exog=X, window=75, min_nobs=None, expanding=True)
fit=mod.fit()
print("Akaike")
print(fit.aic)
I get an array whose length equals that of X and Y, which leads me to think that the modelling doesn't work as I would like, because I expected to get only one value.
Thus, I think that the format of X and Y is inadequate. How can I solve this?
According to the documentation, exog is supposed to be a 2-D array of shape (nobs, k).
exog: array_like
A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
You might want to add a column filled with ones.
import statsmodels.api as sm
X = sm.add_constant(X)
# if your statsmodels version is new
# X = statsmodels.tools.tools.add_constant(X)
https://www.statsmodels.org/stable/generated/statsmodels.regression.rolling.RollingOLS.html
https://www.statsmodels.org/stable/examples/notebooks/generated/rolling_ls.html
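Putting it together, a minimal sketch (X and Y here are made-up stand-ins for your arrays; RollingOLS fits one regression per window, so attributes like fit.params and, as you observed, fit.aic have one entry per window rather than a single value):
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

rng = np.random.default_rng(0)
X = rng.normal(size=500)
Y = 2.0 * X + rng.normal(size=500)

exog = sm.add_constant(X)          # shape (nobs, 2): intercept column + regressor
mod = RollingOLS(endog=Y, exog=exog, window=75, min_nobs=None, expanding=True)
fit = mod.fit()

print(fit.params.shape)            # (nobs, 2): one estimate per window
print(np.asarray(fit.aic)[-5:])    # AIC for the last few windows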
A possibly related question:
Why do I get only one parameter from a statsmodels OLS fit
There is a very useful function in R called findfrequency on the forecast package that returns the period of the dominant frequency of a time series. More info on the function from the author can be found here: https://robjhyndman.com/hyndsight/tscharacteristics/
I want to implement something equivalent in Python, and I am having trouble with the functions that should be equivalent to the spec.ar R function used inside findfrequency.
The function starts by detrending the series, which is easily done with x = statsmodels.tsa.tsatools.detrend(myTs, order=1, axis=0). Now that I have the residuals, I would like to do in Python the equivalent of the spec.ar function in R, which first fits an AR model to x (or uses the existing fit) and computes (and by default plots) the spectral density of the fitted model.
I have not found anything similar, so I am doing each step at a time: first the AR fit and then the spectral estimation.
I am using the Airpassengers time series and I am not able to get the same results on R and Python for the AR order or coefficients.
My R code:
x <- AirPassengers
x <- residuals(tslm(x ~ trend))
ARmodel <- ar(x)
ARmodel
I get that 15 is the selected order for my autoregressive model.
My Python Code:
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.tsatools import detrend

dataPeriodic = pd.read_csv('AirPassengers.csv')
tsPeriodic = dataPeriodic.iloc[:, 1]
x = detrend(tsPeriodic, order=1, axis=0)
n = x.shape[0]
est_order = sm.tsa.AR(x).select_order(maxlag=20, ic='aic', trend='nc')
print(est_order)
Here I get a very different result, with a selected order of 10 instead of 15, and I have to specify the upper limit of the lag search with the maxlag parameter.
I have tried tsa.AutoReg as well, without success; I get yet another result.
So, is there a way to fit an AR model in the same way that R does? Something similar to spec.ar, or even something similar to the findfrequency function? I am quite confused by the big differences the 'same' methods can produce in the two languages.
The closest I could find in Python to findfrequency from the R forecast package was pandas.infer_freq, used like this:
>>> import pandas as pd
>>> ts_data = pd.read_csv("ts_data.csv")
>>> pd.infer_freq(ts_data.index.values)
4
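If you need something closer to findfrequency itself (detrend, fit an AR model, take the period from the peak of the AR spectral density), a rough sketch of that approach is below. This is not a built-in; find_frequency is a hypothetical helper, the order selection will not necessarily match R's ar(), and the peak-significance check here is a simplification of what Hyndman's R code does:
import numpy as np
from statsmodels.tsa.tsatools import detrend
from statsmodels.tsa.ar_model import AutoReg, ar_select_order

def find_frequency(ts, max_period=100):
    """Very rough analogue of forecast::findfrequency (a sketch, not a port)."""
    x = detrend(np.asarray(ts, dtype=float), order=1)       # remove a linear trend
    sel = ar_select_order(x, maxlag=20, ic='aic', trend='n')
    lags = sel.ar_lags if sel.ar_lags else [1]
    res = AutoReg(x, lags=lags, trend='n').fit()

    # Spectral density of an AR(p) model:
    #   S(f) = sigma^2 / |1 - sum_k a_k * exp(-2*pi*i*f*k)|^2
    freqs = np.linspace(1.0 / max_period, 0.5, 500)
    a = res.params
    k = np.arange(1, len(a) + 1)
    denom = np.abs(1.0 - (a[None, :] * np.exp(-2j * np.pi * freqs[:, None] * k)).sum(axis=1)) ** 2
    spec = res.sigma2 / denom

    # Call the series periodic only if the spectral peak clearly dominates (simplified heuristic)
    peak = freqs[np.argmax(spec)]
    return int(round(1.0 / peak)) if spec.max() > 10 * np.median(spec) else 1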
A friend of mine asked me about this linear regression code and I also couldn't solve it, so now it's my question as well.
Error we are getting:
ValueError: endog and exog matrices are different sizes
When I remove "Tech" from ind_names then it works fine. This might be pointless but for the sake of eliminating syntax error possibilities I tried doing it.
Tech and Financial industry labels are not equally distributed in the DataFrame so maybe this is causing a size mismatch? But I couldn't debug any further so decided to ask you guys.
It'd be really nice to get some confirmation on the error and solution ideas. Please find code below.
#We have a portfolio constructed of 3 randomly generated factors (fac1, fac2, fac3).
#Python code provides the following message
#ValueError: The indices for endog and exog are not aligned
import pandas as pd
from numpy.random import rand
import numpy as np
import statsmodels.api as sm
fac1, fac2, fac3 = np.random.rand(3, 1000) #Generate random factors
#Consider a collection of hypothetical stock portfolios
#Generate randomly 1000 tickers
import random; random.seed(0)
import string
N = 1000
def rands(n):
    choices = string.ascii_uppercase
    return ''.join([random.choice(choices) for _ in range(n)])
tickers = np.array([rands(5) for _ in range(N)])
ticker_subset = tickers.take(np.random.permutation(N)[:1000])
#Weighted sum of factors plus noise
port = pd.Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3 + rand(1000), index=ticker_subset)
factors = pd.DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3}, index=ticker_subset)
#Correlations between each factor and the portfolio
#print(factors.corrwith(port))
factors1=sm.add_constant(factors)
#Calculate factor exposures using a regression estimated by OLS
#print(sm.OLS(np.asarray(port), np.asarray(factors1)).fit().params)
#Calculate the exposure on each industry
def beta_exposure(chunk, factors=None):
    return sm.OLS(np.asarray(chunk), np.asarray(factors)).fit().params
#Assume that we have only two industries – financial and tech
ind_names = np.array(['Financial', 'Tech'])
#Create a random industry classification
sampler = np.random.randint(0, len(ind_names), N)
industries = pd.Series(ind_names[sampler], index=tickers, name='industry')
by_ind = port.groupby(industries)
exposures=by_ind.apply(beta_exposure, factors=factors1)
print(exposures)
#exposures.unstack()
#Determinate the exposures on each industry
Understanding the error message:
ValueError: endog and exog matrices are different sizes
Okay, not too bad. The endogenous matrix and exogenous matrix are of different sizes. And the statsmodels docs provide a page (linked at the end of this answer) explaining that endogenous variables are the ones within the system and exogenous variables are the ones outside it.
Some debugging
Check what shapes we are getting for our arrays. To do that, take apart that one-liner, print the .shape of each argument (or print the first handful of values of each), and comment out the line throwing the error.
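Something along these lines (a sketch of the debugging prints; the exact output formatting differs slightly from what's shown below):
def beta_exposure(chunk, factors=None):
    print('chunk', np.asarray(chunk).shape)
    print('factor', np.asarray(factors).shape)
    # return sm.OLS(np.asarray(chunk), np.asarray(factors)).fit().params  # commented out while debugging
With prints like those in place, we discover that we get: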
chunk [490]
factor [1000 4]
chunk [510]
factor [1000 4]
Oh! There it is. We were expecting factor to be chunked too. It should be [490 4] the first time and [510 4] the second time. Note: since the categories are assigned randomly, this will differ each time.
So basically we have just too much info in that function. We can use the chunk to see which factors to choose, filter the factors down to just those, and then everything will work.
Looking over the function definitions in the docs:
class statsmodels.regression.linear_model.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)
We are just passing two arguments and the rest are optional. Let's look at the two we are passing.
endog (array-like) – 1-d endogenous response variable. The dependent variable.
exog (array-like) – A nobs x k array where nobs is the number of observations and k is the number of regressors...
Ah, endog and exog again. endog is 1-d array-like. So far so good: shape 490 works. exog nobs? Oh, it's the number of observations. So it's a 2-d array, and in this case we need shape 490 by 4.
This specific issue:
beta_exposure should be:
def beta_exposure(chunk, factors=None):
    factors = factors.loc[factors.index.isin(chunk.index)]
    return sm.OLS(np.asarray(chunk), np.asarray(factors)).fit().params
The issue is that you are applying beta_exposure to each part of the list (it is randomized, so let's say 490 elements for Financial and 510 for Tech), but factors=factors1 always gives you 1000 values (the groupby code doesn't touch that).
See http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html and http://www.statsmodels.org/dev/endog_exog.html for the references I used researching this.
To get the correlation between two arrays in python, I am using:
from scipy.stats import pearsonr
x, y = [1,2,3], [1,5,7]
cor, p = pearsonr(x, y)
However, as stated in the docs, the p-value returned from pearsonr() is only meaningful with datasets larger than 500. So how can I get a p-value that is reasonable for small datasets?
My temporary solution:
After reading up on linear regression, I have come up with my own small script, which basically uses the Fisher transformation to get the z-score, from which the p-value is calculated:
import numpy as np
from scipy.stats import zprob

n = len(x)
z = 0.5 * np.log((1 + cor) / (1 - cor)) * np.sqrt(n - 3)
p = zprob(-z)
It works. However, I am not sure if it is more reasonable than the p-value given by pearsonr(). Is there a Python module which already has this functionality? I have not been able to find it in SciPy or Statsmodels.
Edit to clarify:
The dataset in my example is simplified. My real dataset is two arrays of 10-50 values.
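For reference, a sketch of both variants with current SciPy (pearson_pvalues is just a hypothetical helper name; scipy.stats.norm.sf plays the role of the old zprob here, and the t-based version is the textbook small-sample test under bivariate normality):
import numpy as np
from scipy.stats import norm, t

def pearson_pvalues(x, y):
    """Two-sided p-values for Pearson's r via the Fisher z-transform and via a t statistic."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]

    # Fisher z-transformation (approximate, as in the snippet above)
    z = np.arctanh(r) * np.sqrt(n - 3)
    p_fisher = 2 * norm.sf(abs(z))

    # t statistic with n - 2 degrees of freedom (exact under bivariate normality)
    t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
    p_t = 2 * t.sf(abs(t_stat), n - 2)
    return r, p_fisher, p_t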
I've read through existing posts about this module (and the Scipy docs), but it's still not clear to me how to use Scipy's kstest module to do a goodness-of-fit test when you have a data set and a callable function.
The PDF I want to test my data against isn't one of the standard scipy.stats distributions, so I can't just call it using something like:
kstest(mydata,'norm')
where mydata is a Numpy array. Instead, I want to do something like:
kstest(mydata,myfunc)
where 'myfunc' is the callable function. This doesn't work, which is unsurprising: there's no way for kstest to know what the abscissa for the 'mydata' array is in order to generate the corresponding theoretical frequencies using 'myfunc'. Suppose the frequencies in 'mydata' correspond to the values of the random variable in the array 'abscissa'. Then I thought maybe I could use stats.ks_2samp:
ks_2samp(mydata,myfunc(abscissa))
but I don't know if that's statistically valid. (Sidenote: do kstest and ks_2samp expect frequency arrays to be normalized to one, or do they want the absolute frequencies?)
In any case, since the one-sample KS test is supposed to be used for goodness-of-fit testing, I have to assume there's some way to do it with kstest directly. How do you do this?
Some examples may shed some light on how to use scipy.stats.kstest. Let's first set up some test data, e.g. normally distributed with mean 5 and standard deviation 10:
>>> data = scipy.stats.norm.rvs(loc=5, scale=10, size=(1000,))
To run kstest on these data we need a function f(x) that takes an array of quantiles, and returns the corresponding value of the cumulative distribution function. If we reuse the cdf function of scipy.stats.norm we could do:
>>> scipy.stats.kstest(data, lambda x: scipy.stats.norm.cdf(x, loc=5, scale=10))
(0.019340993719575206, 0.84853828416694665)
The above would normally be run with the more convenient form:
>>> scipy.stats.kstest(data, 'norm', args=(5, 10))
(0.019340993719575206, 0.84853828416694665)
If we have uniformly distributed data, it is easy to build the cdf by hand:
>>> data = np.random.rand(1000)
>>> scipy.stats.kstest(data, lambda x: x)
(0.019145675289412523, 0.85699937276355065)
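If the distribution really is only available as a custom callable, the same pattern applies: kstest wants a CDF, so one option is to integrate the PDF numerically and pass that in. A sketch, assuming myfunc is a normalized PDF with a known lower bound of support (make_cdf is a hypothetical helper):
import numpy as np
from scipy import stats
from scipy.integrate import quad

def make_cdf(pdf, lower):
    """Turn a normalized PDF into a CDF by numerical integration."""
    def cdf(x):
        return np.array([quad(pdf, lower, xi)[0] for xi in np.atleast_1d(x)])
    return cdf

def myfunc(v):
    # hypothetical custom PDF: a triangular density on [0, 2]
    return max(0.0, 1.0 - abs(v - 1.0))

mydata = stats.triang.rvs(0.5, loc=0, scale=2, size=500)   # data that actually follows that PDF
print(stats.kstest(mydata, make_cdf(myfunc, lower=0.0)))
Note that this also bears on the side-question: kstest works with the raw sample and a CDF, not with (normalized) frequency arrays.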
As for ks_2samp, it tests the null hypothesis that both samples are drawn from the same probability distribution.
You can do, for example:
>>> from scipy.stats import ks_2samp
>>> import numpy as np
>>>
where x, y are two instances of numpy.array:
>>> ks_2samp(x, y)
(0.022999999999999909, 0.95189016804849658)
The first value is the test statistic and the second value is the p-value. If the p-value is greater than the chosen significance level (e.g. 0.05 for a 5% level), you cannot reject the null hypothesis that the two samples are drawn from the same distribution.