There is a very useful function in R called findfrequency in the forecast package that returns the period of the dominant frequency of a time series. More information on the function from its author can be found here: https://robjhyndman.com/hyndsight/tscharacteristics/
I want to implement something equivalent in Python, and I am having trouble with the part that should be equivalent to the spec.ar R function used inside findfrequency.
The function starts by detrending the series, which is easily done with x = statsmodels.tsa.tsatools.detrend(myTs, order=1, axis=0). Now that I have the residuals, I would like to do in Python the equivalent of R's spec.ar, which first fits an AR model to x (or uses an existing fit) and then computes (and by default plots) the spectral density of the fitted model.
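For reference, the spectral density of a fitted AR(p) model has a closed form, so the second step can be done by hand once the coefficients are known. A minimal sketch (my own helper, not an existing statsmodels function; it matches spec.ar only up to a normalization convention):

import numpy as np

def ar_spectrum(phi, sigma2, freqs):
    # Spectral density of an AR(p) model with coefficients phi and
    # innovation variance sigma2, at frequencies freqs (cycles per sample):
    # sigma2 / |1 - sum_k phi_k * exp(-2*pi*i*f*k)|^2
    phi = np.asarray(phi)
    k = np.arange(1, len(phi) + 1)
    denom = np.abs(1.0 - np.exp(-2j * np.pi * np.outer(freqs, k)) @ phi) ** 2
    return sigma2 / denom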
I have not found anything similar, so I am doing each step at a time: first the AR fit and then the spectral estimation.
I am using the AirPassengers time series, and I am not able to get the same results in R and Python for the AR order or the coefficients.
My R code:
x <- AirPassengers
x <- residuals(tslm(x ~ trend))
ARmodel <- ar(x)
ARmodel
I get that 15 is the selected order for my autoregressive model.
My Python Code:
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.tsatools import detrend

dataPeriodic = pd.read_csv('AirPassengers.csv')
tsPeriodic = dataPeriodic.iloc[:,1]
x = detrend(tsPeriodic, order=1, axis=0)
n = x.shape[0]
est_order = sm.tsa.AR(x).select_order(maxlag=20, ic='aic', trend='nc')
print(est_order)
Here I get a very different result: the selected order is 10 instead of 15, and I have to specify the upper limit of the lag search with the maxlag parameter.
I have tried tsa.AutoReg without success; I get yet another different result.
So, is there a way to fit an AR model in the same way R does? Something similar to spec.ar, or even to the findfrequency function itself? I am quite confused by how different the output of the 'same' methods can be in the two languages.
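One way to get closer to R is to mimic ar()'s defaults by hand: Yule-Walker estimation with AIC order selection up to order.max = min(n - 1, floor(10 * log10(n))). A rough sketch, not a drop-in replacement for ar(); details such as the AIC constant and the variance denominator may still cause small differences:

import numpy as np
from statsmodels.regression.linear_model import yule_walker

def ar_order_like_R(x, order_max=None):
    # Mimic R's ar() defaults: Yule-Walker fits, AIC order selection,
    # order.max = min(n - 1, floor(10 * log10(n))).
    x = np.asarray(x, dtype=float)
    n = len(x)
    if order_max is None:
        order_max = min(n - 1, int(np.floor(10.0 * np.log10(n))))
    best_order, best_aic = 0, np.inf
    for p in range(order_max + 1):
        if p == 0:
            sigma2 = np.var(x)
        else:
            rho, sigma = yule_walker(x, order=p, method='mle')
            sigma2 = sigma ** 2
        aic = n * np.log(sigma2) + 2.0 * p  # AIC up to an additive constant
        if aic < best_aic:
            best_order, best_aic = p, aic
    return best_order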
The closest I could find in Python to findfrequency from the R forecast package was pandas.infer_freq, used like this:
>>> import pandas as pd
>>> ts_data = pd.read_csv("ts_data.csv")
>>> pd.infer_freq(ts_data.index.values)
4
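For completeness, here is a rough sketch of what findfrequency does, using a plain periodogram in place of R's AR-based spectrum (spec.ar), so the result can differ from the R function; the peak-strength threshold below is a guess, not R's rule:

import numpy as np
from scipy.signal import periodogram
from statsmodels.tsa.tsatools import detrend

def find_frequency(ts):
    # Detrend, estimate the spectral density, and return the period
    # of the dominant frequency (1 if no convincing peak is found).
    x = detrend(np.asarray(ts, dtype=float), order=1)
    freqs, spectrum = periodogram(x)
    peak = np.argmax(spectrum[1:]) + 1  # skip the zero-frequency term
    if spectrum[peak] < 10 * np.median(spectrum):
        return 1  # no strong seasonality
    return int(round(1.0 / freqs[peak]))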
Recently I wanted to demonstrate generating a continuous random variable using the universality of the Uniform. For that, I wanted to use a combination of NumPy and matplotlib. However, the generated random variable seems a little bit off to me, and I don't know whether this is caused by the way np.random.uniform and np.vectorize work or whether I am doing something fundamentally wrong here.
Let U ~ Unif(0, 1) and X = F^-1(U). Then X is a random variable with CDF F (note that F^-1 here denotes the quantile function; I also omit the second part of universality because it will not be needed).
Let's assume that the CDF of interest to me is the standard logistic CDF:
F(x) = 1 / (1 + exp(-x))
then the quantile function is:
F^-1(u) = log(u / (1 - u))
According to the universality of the uniform, to generate a random variable, it is enough to plug U ~ Unif(0, 1) into F^-1. Therefore, I've written a very simple code snippet for that:
import numpy as np

U = np.random.uniform(0, 1, 1000000)

def logistic(u):
    x = np.log(u / (1 - u))
    return x

logistic_transform = np.vectorize(logistic)
X = logistic_transform(U)
However, the result seems a little bit off to me. Although the histogram of the generated random variable X resembles a logistic distribution (whose simplified CDF I used), the r.v. seems to be distributed in a very uneven way, and I can't wrap my head around exactly why. I would be grateful for any suggestions. Below are the histograms of U and X.
You have a large sample size, so you can increase the number of bins in your histogram and still get a good number of samples per bin. If you are using matplotlib's hist function, try (for example) bins=400. I get this plot, which has the symmetry that I think you expected:
Also (this is not relevant to the question): your function logistic will handle a NumPy array without wrapping it with vectorize, so you can save a few CPU cycles by writing X = logistic(U). And you can save a few lines of code by using scipy.special.logit instead of implementing it yourself.
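Putting those suggestions together, a minimal sketch (logit from scipy.special plus more bins):

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit  # the same transform as log(u / (1 - u))

U = np.random.uniform(0, 1, 1000000)
X = logit(U)  # works directly on arrays; np.vectorize is unnecessary

plt.hist(X, bins=400, density=True)  # more bins reveal the expected symmetry
plt.show()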
Let's say I have a set of data points called signal that I want to integrate twice with respect to time (i.e., if signal were acceleration, I'd like to integrate it twice w.r.t. time to get the position). I can integrate it once using simps, but the output there is a scalar. How can you numerically integrate a (random) data set twice? I'd imagine it would look something like this, but obviously the inputs are not compatible after the first integration.
import numpy as np
from scipy.integrate import simps

n_samples = 5000
t_range = np.arange(float(n_samples))
signal = np.random.normal(0., 1., n_samples)
signal_integration = simps(signal, t_range)  # a single scalar
signal_integration_double = simps(simps(signal, t_range), t_range)  # fails: the inner call returns a scalar
Any help would be appreciated.
Sorry, I answered too fast. scipy.integrate.simps gives the value of the integration over the whole range you give it, similar to np.sum(signal).
What you want is the integral between the start and each data point, which is what cumsum approximates. A better method is scipy.integrate.cumtrapz. You can apply either method twice to get the result you want, as sketched after the links below.
See:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.cumtrapz.html
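A minimal sketch of the cumtrapz approach applied twice (assuming zero initial velocity and position; in newer SciPy the function is named cumulative_trapezoid):

import numpy as np
from scipy.integrate import cumtrapz

n_samples = 5000
t_range = np.arange(float(n_samples))
signal = np.random.normal(0., 1., n_samples)

velocity = cumtrapz(signal, t_range, initial=0)    # first integral (acceleration -> velocity)
position = cumtrapz(velocity, t_range, initial=0)  # second integral (velocity -> position)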
Original answer:
I think you want np.cumsum. Integration of discrete data is just a sum. You have to multiply the result by the step value to get the correct scale.
See https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.cumsum.html
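A minimal sketch of that original suggestion, with unit step size (dt = 1) as in the question's t_range:

import numpy as np

n_samples = 5000
dt = 1.0
signal = np.random.normal(0., 1., n_samples)

velocity = np.cumsum(signal) * dt   # first integral, rectangle rule
position = np.cumsum(velocity) * dt # second integral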
By partial integration you get from y''=f to
y(t) = y(0) + y'(0)*t + integral from 0 to t of (t-s)*f(s) ds
As you seem to assume that y(0)=0 and also y'(0)=0, you can thus get the desired integral value in one integration as
simps((t-t_range)*signal, t_range)
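A quick sanity check of that formula on a smooth test case with a known answer (assuming zero initial conditions, as above): with f(s) = cos(s), the exact value is y(t) = 1 - cos(t).

import numpy as np
from scipy.integrate import simps

t_range = np.linspace(0, 10, 5000)
signal = np.cos(t_range)
t_end = t_range[-1]

approx = simps((t_end - t_range) * signal, t_range)
print(approx, 1 - np.cos(t_end))  # the two values agree closely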
I'm currently making the switch from MATLAB to Python for a project that involves solving differential equations.
In MATLAB, if the t array that's passed to the solver contains only two elements, the solver outputs all the intermediate steps of the simulation. In Python, however, you just get the start and end point. To get time points in between, you have to specify them explicitly.
from scipy import integrate as sp_int
import numpy as np

def odeFun(t, y):
    k = np.ones(2)
    dy_dt = np.zeros(y.shape)
    dy_dt[0] = k[1]*y[1] - k[0]*y[0]
    dy_dt[1] = -dy_dt[0]
    return dy_dt

t = np.linspace(0, 10, 1000)
# odeint expects f(y, t) by default; tfirst=True (SciPy 1.1+) lets it accept f(t, y)
yOut = sp_int.odeint(odeFun, [1, 0], t, tfirst=True)
I've also looked into the following method:
solver = sp_int.ode(odeFun).set_integrator('vode', method='bdf')
solver.set_initial_value([1, 0], 0)
dt = 0.01
solver.integrate(solver.t + dt)
However, it still requires an explicit dt. From reading around, I understand that Python's solvers (e.g. 'vode') calculate intermediate steps internally and then interpolate to the requested time points. What I'd like is to get all those intermediate steps directly, without the interpolation, because they represent the minimum number of points required to fully describe the time series within the integration tolerances.
Is there an option available to do that?
I'm working in Python 3.
scipy.integrate.odeint
odeint has an option full_output that allows you to obtain a dictionary with information on the integration, including tcur which is:
vector with the value of t reached for each time step. (will always be at least as large as the input times).
(Note the second sentence: the actual steps are always at least as fine as your desired output. If you want to use the minimum number of necessary steps, you must ask for a coarse sampling.)
Now, this does not give you the values, but we can obtain those by integrating a second time using these very steps:
from scipy.integrate import odeint
import numpy as np
def f(y,t):
    return np.array([y[1]-y[0],y[0]-y[1]])
start,end = 0,10 # time range we want to integrate
y0 = [1,0] # initial conditions
# Function to add the initial time and the target time if needed:
def ensure_start_and_end(times):
    times = np.insert(times,0,start)
    if times[-1] < end:
        times = np.append(times,end)
    return times
# First run to establish the steps
first_times = np.linspace(start,end,100)
first_run = odeint(f,y0,first_times,full_output=True)
first_steps = np.unique(first_run[1]["tcur"])
# Second run to obtain the results at the steps
second_times = ensure_start_and_end(first_steps)
second_run = odeint(f,y0,second_times,full_output=True,h0=second_times[0])
second_steps = np.unique(second_run[1]["tcur"])
# ensuring that the second run actually uses (almost) the same steps.
np.testing.assert_allclose(first_steps,second_steps,rtol=1e-5)
# Your desired output
actual_steps = np.vstack((second_times, second_run[0].T)).T
scipy.integrate.ode
Having some experience with this module, I am not aware of any way to obtain the step size without digging deeply into the internals.
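That said, if a newer SciPy (1.0+) is an option, scipy.integrate.solve_ivp sidesteps the problem: called without t_eval, it returns exactly the steps the solver chose. A minimal sketch with the same system:

import numpy as np
from scipy.integrate import solve_ivp

def odeFun(t, y):
    return [y[1] - y[0], y[0] - y[1]]

# Without t_eval, solve_ivp reports the solver's own accepted steps.
sol = solve_ivp(odeFun, (0, 10), [1, 0], method='BDF', rtol=1e-6, atol=1e-9)
print(sol.t)  # the adaptive time steps actually taken
print(sol.y)  # solution values at those steps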
I'm experimenting to decide whether a time series (as in, one list of floats) is correlated with itself. I've already had a play with the acf function in statsmodels (http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.stattools.acf.html); now I'm looking at whether the Durbin–Watson statistic has any worth.
It seems like this kind of thing should work:
from statsmodels.regression.linear_model import OLS
import numpy as np
data = np.arange(100) # this should be highly correlated
ols_res = OLS(data)
dw_res = np.sum(np.diff(ols_res.resid.values))
If you were to run this, you would get:
Traceback (most recent call last):
...
File "/usr/lib/pymodules/python2.7/statsmodels/regression/linear_model.py", line 165, in initialize
self.nobs = float(self.wexog.shape[0])
AttributeError: 'NoneType' object has no attribute 'shape'
It seems that D/W is usually used to compare two time series (e.g. http://connor-johnson.com/2014/02/18/linear-regression-with-python/) for correlation, so I think the problem is that I've not passed another time series to compare to. Perhaps this is supposed to be passed in the exog parameter to OLS?
exog : array-like
A nobs x k array where nobs is the number of observations and k is
the number of regressors.
(from http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLS.html)
Side note: I'm not sure what a "nobs x k" array means. Maybe an array that is nobs by k?
So what should I be doing here? Am I expected to pass the data twice, or to lag it manually myself, or something else?
Thanks!
I've accepted user333700's answer, but I wanted to post a code-snippet follow-up.
This small program computes the Durbin–Watson statistic for a linear range (which should be highly correlated, thus giving a value close to 0) and then for random values (which should not be correlated, thus giving a value close to 2):
from statsmodels.regression.linear_model import OLS
import numpy as np
from statsmodels.stats.stattools import durbin_watson
def dw(data):
    ols_res = OLS(data, np.ones(len(data))).fit()
    return durbin_watson(ols_res.resid)

print("dw of range=%f" % dw(np.arange(2000)))
print("dw of rand=%f" % dw(np.random.randn(2000)))
When run:
dw of range=0.000003
dw of rand=2.036162
So I think that looks good :)
OLS is a regression that needs y and x (or endog and exog). x needs to be at least a constant in your case, i.e. np.ones((len(endog), 1)).
Also, you need to fit the model, i.e. ols_res = OLS(y, x).fit().
nobs x k means two-dimensional, with nobs observations in rows and k variables in columns, i.e. exog.shape is (nobs, k).
Durbin–Watson is a test statistic for serial correlation. It is included in the OLS summary output, as shown in the sketch below. There are other tests for no autocorrelation available in statsmodels.
(I would recommend working through some example or tutorial notebooks.)
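For illustration, a minimal sketch showing the Durbin–Watson statistic appearing in the summary output (a constant-only regression on random data):

import numpy as np
from statsmodels.regression.linear_model import OLS

y = np.random.randn(100)
x = np.ones((len(y), 1))  # nobs x k design matrix: a single constant column
res = OLS(y, x).fit()
print(res.summary())  # the summary table reports the Durbin-Watson statistic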
I've read through existing posts about this module (and the SciPy docs), but it's still not clear to me how to use SciPy's kstest to do a goodness-of-fit test when you have a data set and a callable function.
The PDF I want to test my data against isn't one of the standard scipy.stats distributions, so I can't just call it using something like:
kstest(mydata,'norm')
where mydata is a Numpy array. Instead, I want to do something like:
kstest(mydata,myfunc)
where 'myfunc' is the callable function. This doesn't work, which is unsurprising, since there's no way for kstest to know what the abscissa for the 'mydata' array is in order to generate the corresponding theoretical frequencies using 'myfunc'. Suppose the frequencies in 'mydata' correspond to the values of the random variable in the array 'abscissa'. Then I thought maybe I could use stats.ks_2samp:
ks_2samp(mydata,myfunc(abscissa))
but I don't know if that's statistically valid. (Side note: do kstest and ks_2samp expect frequency arrays to be normalized to one, or do they want absolute frequencies?)
In any case, since the one-sample KS test is supposed to be used for goodness-of-fit testing, I have to assume there's some way to do it with kstest directly. How do you do this?
Some examples may shed some light on how to use scipy.stats.kstest. Let's first set up some test data, e.g. normally distributed with mean 5 and standard deviation 10:
>>> import numpy as np
>>> import scipy.stats
>>> data = scipy.stats.norm.rvs(loc=5, scale=10, size=(1000,))
To run kstest on these data, we need a function f(x) that takes an array of quantiles and returns the corresponding values of the cumulative distribution function. If we reuse the cdf function of scipy.stats.norm, we could do:
>>> scipy.stats.kstest(data, lambda x: scipy.stats.norm.cdf(x, loc=5, scale=10))
(0.019340993719575206, 0.84853828416694665)
The above would normally be run with the more convenient form:
>>> scipy.stats.kstest(data, 'norm', args=(5, 10))
(0.019340993719575206, 0.84853828416694665)
If we have uniformly distributed data, it is easy to build the cdf by hand:
>>> data = np.random.rand(1000)
>>> scipy.stats.kstest(data, lambda x: x)
(0.019145675289412523, 0.85699937276355065)
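To connect this back to the original question: the second argument to kstest can be any callable, but it must return CDF values (not PDF values or frequencies). A sketch with a hypothetical custom CDF myfunc (here the standard logistic; the statistic and p-value will vary with the sample):

>>> import numpy as np
>>> import scipy.stats
>>> def myfunc(x):  # hypothetical custom CDF: standard logistic
...     return 1.0 / (1.0 + np.exp(-x))
...
>>> data = np.random.logistic(size=1000)
>>> scipy.stats.kstest(data, myfunc)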
As for ks_2samp, it tests the null hypothesis that both samples are drawn from the same probability distribution.
You can do, for example:
>>> from scipy.stats import ks_2samp
>>> import numpy as np
where x, y are two instances of numpy.array:
>>> ks_2samp(x, y)
(0.022999999999999909, 0.95189016804849658)
The first value is the test statistic, and the second value is the p-value. If the p-value is large (e.g. greater than 0.05 for a significance level of 5%), you cannot reject the null hypothesis that the two samples are drawn from the same distribution.