Augmented Dickey Fuller test in Python with statsmodels

I'm performing an ADF test on several (~500) time series to check for stationarity, so I need a quantitative way of choosing the correct number of lags for each of them. One possible approach is to use, say, 80% of each sample for the test, take the regression parameters estimated there, compute the ssr (sum of squared residuals) and search for the minimum. However, this may lead to overfitting; to avoid it, the same regression can then be applied to the remaining 20% and the ssr of this sub-sample calculated. The number of lags that minimizes this second ssr should be the correct one.
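To make the idea concrete, here is a minimal sketch of that 80/20 selection, written with a plain OLS fit of an ADF-style 'ct' regression (the helper holdout_ssr and the exact design-matrix construction are my own illustration, not part of adfuller):

import numpy as np
import statsmodels.api as sm

def holdout_ssr(series, p, split=0.8):
    """Fit the ADF-style 'ct' regression (lagged level, p lagged differences,
    constant, trend) on the first `split` fraction of the sample and return the
    sum of squared residuals on the remaining observations."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    rows, target = [], []
    for t in range(p, len(dy)):
        # regressors: y_t, dy_{t-1}..dy_{t-p}, constant, linear trend
        rows.append([y[t]] + [dy[t - j] for j in range(1, p + 1)] + [1.0, float(t)])
        target.append(dy[t])
    X, target = np.asarray(rows), np.asarray(target)
    cut = int(split * len(target))
    fit = sm.OLS(target[:cut], X[:cut]).fit()
    resid = target[cut:] - X[cut:] @ fit.params
    return float(np.sum(resid ** 2))

# pick the lag with the smallest out-of-sample ssr, e.g.:
# best_p = min(range(1, max_lag_ + 1), key=lambda p: holdout_ssr(dUs, p))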
The problem is that the statsmodels documentation is far from complete (at least for a newbie like me!). For example, given the line
res = ts.adfuller(dUs, maxlag=max_lag_, autolag=None, regression='ct', store=True, regresults=True)
the regression coefficients are stored in res[3].resols.params, but their order is not documented. I had to ask someone to run the test on one of my time series in R (which gives you the formula used and the corresponding coefficients, as in the R output I was sent).
The Python order of the parameters seems to be (for a 'ct' regression): lag 1, lag diff 1, lag diff 2, ..., lag diff N, intercept, time trend. I then reconstruct the fitted series with the following code:
xFit[0:max_lag_ + 1] = dUs[0:max_lag_ + 1]
for i in range(max_lag_ + 1, xFit.size):
    xFit[i] = xFit[i-1] + res[3].resols.params[0] * xFit[i-1] + res[3].resols.params[res[3].resols.params.size - 2] + res[3].resols.params[res[3].resols.params.size - 1] * t[i]
    for j in range(1, max_lag_ + 1):
        xFit[i] = xFit[i] + res[3].resols.params[j] * lag[i-1-j]
Note that the lag variable is constructed from my dUs variable like this
lag = dUs[1:]-dUs[:-1]
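For reference, the regression I believe adfuller is fitting for regression='ct' (writing d(y)_t for the first difference y_t - y_{t-1}) is
d(y)_t = gamma * y_{t-1} + phi_1 * d(y)_{t-1} + ... + phi_N * d(y)_{t-N} + c + beta * t + e_t
which matches the parameter order above: [gamma, phi_1, ..., phi_N, c, beta].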
The thing is that the xFit series and res[3].resols.fittedvalues are different! I think it might have something to do with my initialization of the first max_lag_ + 1 data points (in fact, note that res[3].resols.fittedvalues is max_lag_ + 1 shorter than the original series): I chose them to be equal to the original series, but I can't figure out what exactly is going on. The difference between xFit and res[3].resols.fittedvalues is HUGE (see the time series comparison plot I made). Note also that increasing the lag number makes my fit better up to some value, and then the series explodes. This does not happen with fittedvalues!
As a final test, I ran the ADF test on xFit; I understand this should reproduce the res[3].resols.params I already got.
Given the line
res2 = ts.adfuller(xFit, maxlag=max_lag_, autolag=None, regression='ct', store=True, regresults=True)
the output of res2[3].resols.params is
[ -1.60231256e+00 4.23814175e-02 -4.15837300e-02 4.99642618e-02
-6.92483339e+02 3.89141878e+00]
while res[3].resols.params is
[ -1.29269094e+00 2.11857016e-02 -5.82679110e-02 -2.09614163e-02
-5.44413351e+02 2.69502722e+00]
I know that many of you would suggest moving to R, but a) I've never used it (although I could learn) and b) getting software installed at work is not that easy and could take me a lot of precious time.
Any ideas? Any mistake I'm missing?
Thanks in advance,
C

I solved the issue (although I didn't have time to post it before!).
The thing is that R fits the difference series while Python fits the time series itself. That, in combination with an error (xFit should be replaced with dUs in the reconstruction of the time series!), made everything look weird as explained above.
The right code is
for i in range(max_lag_ + 1, xFit.size):
    xFit[i] = res[3].resols.params[0] * dUs[i-1] + res[3].resols.params[res[3].resols.params.size - 1]
    for j in range(1, max_lag_ + 1):
        xFit[i] = xFit[i] + res[3].resols.params[j] * lag[i-1-j]

Related

Pyomo constraints and using pandas

I am using Pyomo to optimize a cashflow matching problem using bonds.
I also want to add a detailed constraint that looks at the cashflows I am expecting to get from my portfolio versus fixed requirements, and performs a number of calculations on the differences:
Calculate difference (wanted minus expected to receive or "in - out" in the picture)
Calculate the accumulation of these differences to the last end point using accumulation factors (multiply difference with accumulation factors - which are stored as a model.AccumFactors)
Sum these year-on-year accumulated differences (cumsum(axis=1))
Find the minimum
[Excel description of the process]
Now pandas commands don't work in this situation. Is there anything I can do to fix this? Alternative approaches?
Thanks gmavrom.
Trying to think of a different formulation. The code below has Multipliers as the model variable and everything else as parameters.
Unfortunately the code below doesn't work and just prints out strings of:
54993.219033692505*Multipliers[Bond1] + 63662.18895851663*Multipliers[Bond2] + 64451.10079031628*Multipliers[Bond3] + … etc
def Test1_Constraint(model, TimeIndex):
    SumAccumulatedShortfall = 0
    for TimeCount in range(0, TimeIndex + 1):
        AccumulatedShortfall = ((model.Liabilities[TimeCount] -
                                 sum(model.BondPayment[BondIndex, TimeCount] * model.Multipliers[BondIndex]
                                     for BondIndex in model.Bonds)) *
                                model.AccumulationFactor[TimeCount])
        SumAccumulatedShortfall = SumAccumulatedShortfall + AccumulatedShortfall
        print('SumAccum', SumAccumulatedShortfall)
    return (SumAccumulatedShortfall / model.TotalLiabilityValue <= 0.03)
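For completeness, this is a minimal sketch of how I attach the rule to the model (model.Times is a placeholder name for whatever set TimeIndex ranges over). As far as I understand, the long strings printed above are just unevaluated Pyomo expressions, which is expected before the model is solved:

from pyomo.environ import Constraint, value

# attach the rule over the time index set (model.Times is a placeholder name)
model.Test1 = Constraint(model.Times, rule=Test1_Constraint)

# only after a solver has run do the expressions evaluate to numbers, e.g.:
# print(value(model.Test1[some_time_index].body))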

Python optimizing a calculation with scipy.integrate.quad (takes very long)

I'm currently writing a program in Python for calculating the total spectral emissivity (infrared waves) of a given material at different temperatures (200 K - 500 K), based on measurement data obtained by measuring the directional-hemispherical emissivity of the material at many different wavelengths using an IR spectroscope. The calculation is done by integrating the measured intensity over all wavelengths, using Planck's law as a weighting function (all of this doesn't really matter for my question itself; I just want to explain the background so that the code is easier to understand). This is my code:
from scipy import integrate
from scipy.interpolate import CubicSpline, interp1d  # interp1d is what part_1 uses
import numpy as np
import math as m


def planck_blackbody(lambda_, T):  # wavelength, temperature
    h = float(6.6260755e-34)
    c = float(2.99792458e+8)
    k = float(1.380658e-23)
    try:
        a = 2.0 * h * (c ** 2)
        b = h * c / (lambda_ * k * T)
        intensity = a / ((lambda_ ** 5) * (m.exp(b) - 1.0))
        return float(intensity)
    except OverflowError:  # for lower temperatures the exponent overflows;
        return 0.0         # the intensity is effectively zero there


def spectral_emissivity(emifilename, t, lambda_1, lambda_2):
    results = []
    with open(emifilename, 'r') as emifile:
        emilines = emifile.readlines()
    try:
        w = [float(x.split('\t')[0].strip('\n')) * 1e-6 for x in emilines]
        e = [float(x.split('\t')[1].strip('\n')) for x in emilines]
    except ValueError:
        pass
    w = np.asarray(w)  # wavelength
    e = np.asarray(e)  # measured intensity

    def part_1(lambda_, T):
        E = interp1d(w, e, fill_value='extrapolate')(lambda_)
        return E * planck_blackbody(lambda_, T)

    def E_complete(T):
        E_complete_part_1 = integrate.quad(part_1, lambda_1, lambda_2, args=T, limit=50)
        E_complete_part_2 = integrate.quad(planck_blackbody, lambda_1, lambda_2, args=T, limit=50)
        return E_complete_part_1[0] / E_complete_part_2[0]

    for T in t:
        results.append([T, E_complete(T)])

    with open("{}.plk".format(emifilename[:-4]), 'w') as resultfile:
        for item in results:
            resultfile.write("{}\t{}\n".format(item[0], item[1]))


t = np.arange(200, 501, 1)
spectral_emissivity(r'C:\test.dat', t, 1.4e-6, 35e-6)  # raw string so \t is not read as a tab
The measured intensity is stored in a text file with two columns, the first being the wavelength of the infrared waves and the second being the directional-hemispherical emissivity of the measured material at that wavelength.
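For reference, the quantity E_complete computes is the band-limited total emissivity (my notation):
epsilon_total(T) = ∫ E(λ) · B(λ, T) dλ / ∫ B(λ, T) dλ, with both integrals taken over [lambda_1, lambda_2],
where E(λ) is the measured emissivity curve and B(λ, T) is Planck's law.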
When I run this code, while it produces the right results, I still encounter two problems:
First, I get a warning from scipy.integrate.quad:
IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
warnings.warn(msg, IntegrationWarning)
Can someone explain to me what exactly this means? I understand that integrate.quad is an adaptive numerical integration method and that my integrands somehow seem to require more than 50 subdivisions, but is there a way around this? I tried increasing the limit, but even with 200 I still get this warning... it's especially weird given that the integrands are pretty straightforward functions...
The second problem is closely connected to the first: this program takes ages (about 5 minutes!) to finish one single file, but I need to process many files every hour. cProfile reveals that 98% of this time is spent inside the integration function. A MathCad program doing the exact same thing and producing the same outputs only takes a few seconds to finish. Even though I spent the last week searching for a solution, I simply can't manage to speed this program up, and no one else on Stack Overflow or elsewhere seems to have comparable timing problems with integrate.quad.
So, finally, my question: is there any obvious way to optimize this code so it runs faster (apart from compiling it to C or anything like that)? I tried reducing all floats to 6 digits (I can't go any lower in accuracy) but that didn't change anything.
Update: looking into it some more, I figured out that most of the time wasn't actually consumed by the integration itself, but by the CubicSpline operation I originally used to interpolate my data. I tried out different methods, and CubicSpline seemed to be the only one that worked for some reason (even though my data is monotonically increasing, I got errors from every other method I tried, saying that some values were either above or below the interpolation range). That is, until I found out about extrapolation with scipy.interpolate.interp1d and fill_value='extrapolate'. This did the trick for me, enabling me to use the much cheaper interp1d method and effectively reducing the runtime of my program from 280 to 49 seconds (I also added list comprehensions for w and e). While this is a big improvement, I still wonder why my program needs nearly a minute to calculate some integrals... and I still get the above-mentioned IntegrationWarning. So any advice is highly appreciated!
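One further idea I'm considering (a sketch, not yet benchmarked on my real data): part_1 currently rebuilds the interp1d object on every single quad evaluation, so constructing it once per file and reusing it should remove most of the remaining interpolation cost (planck_blackbody is the function defined above):

from scipy.interpolate import interp1d

def make_part_1(w, e):
    """Build the interpolant once (w, e are the wavelength/emissivity arrays)
    and return an integrand that reuses it on every call."""
    interp = interp1d(w, e, fill_value='extrapolate')

    def part_1(lambda_, T):
        return interp(lambda_) * planck_blackbody(lambda_, T)

    return part_1

# inside spectral_emissivity, after w and e are built:
# part_1 = make_part_1(w, e)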
(Btw, since I am pretty new to Python, I'm happy about any tips or critique I can get!)

scipy.optimize.minimize chi squared python

So I am doing this assignment, where I am supposed to minimize the chi-squared function. I saw someone doing this on the internet so I just copied it:
Multiple variables in SciPy's optimize.minimize
I made a chi-squared function which is a function of 3 variables (x, y, sigma), where sigma is a random Gaussian fluctuation random.gauss(0, sigma). I did not print that code here because at first sight it might be confusing (I used a lot of recursion), but I can assure you that this function is correct.
Now this code just makes a list of the calculated minima (which are different every time because of the random Gaussian fluctuation). But here comes the main problem: if I did my calculation correctly, we should get a list with a mean of 2 (since I have 2 degrees of freedom, as you can see in this link: https://en.wikipedia.org/wiki/Chi-squared_test).
def Chi2(pos):
    return Chi(pos[0], pos[1], 1)

x_list = []
y_list = []
chi_list = []
for i in range(1000):
    result = scipy.optimize.minimize(Chi2, [5, 5]).x
    x_list.append(result[0])
    y_list.append(result[1])
    chi_list.append(Chi2(result))
But when I use this code I get a list with a mean of 4; however, if I add the method "Powell" I get a mean of 9!!
So my main question is: how is it possible that these means are so different, and how do I know which method to use to get the best optimization?
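(For clarity, this is what I mean by adding the method; it is the same call as above with the method specified explicitly:)

result = scipy.optimize.minimize(Chi2, [5, 5], method='Powell').x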
Because I think the error might be in my chi-square function, I will show it as well. The story behind this assignment is that we need to find the position of a mobile device, and we have routers at the positions (0,0), (20,0), (0,20) and (20,20). We used a lot of recursion, and the graph of the chi-squared looked fine (it has a minimum at (5,5)).
def perfectsignal(x_m, y_m, x_r, y_r):
    # c (speed of light) and f (frequency) are defined elsewhere in the script
    return 20 * np.log10(c / (4 * np.pi * f)) - 10 * np.log((x_m - x_r)**2 + (y_m - y_r)**2 + 2**2)

def signal(x_m, y_m, x_r, y_r, sigma):
    return perfectsignal(x_m, y_m, x_r, y_r) + random.gauss(0, sigma)

def res(x_m, y_m, x_r, y_r, sigma, sigma2):
    x = (signal(x_m, y_m, x_r, y_r, sigma) - perfectsignal(x_m, y_m, x_r, y_r)) / float(sigma2)
    return x

def Chi(x, y, sigma):
    return (res(x, y, 0, 0, sigma, 1)**2 + res(x, y, 20, 0, sigma, 1)**2 +
            res(x, y, 0, 20, sigma, 1)**2 + res(x, y, 20, 20, sigma, 1)**2)
Kees

Python: sliding window of variable width

I'm writing a program in Python that processes some data generated during experiments, and it needs to estimate the slope of the data. I've written a piece of code that does this quite nicely, but it's horribly slow (and I'm not very patient). Let me explain how this code works:
1) It grabs a small piece of data of size dx (starting with 3 datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope using OLS regression. If the difference is too small, it will increase dx and redo the loop with this new dx
4) This continues for all the datapoints
[See updated code further down]
For a data size of about 100k measurements, this takes about 40 minutes, whereas the rest of the program (it does more processing than just this bit) takes about 10 seconds. I am certain there is a much more efficient way of doing these operations; could you guys please help me out?
Thanks
EDIT:
Ok, so I've got the problem solved by using only binary searches, limiting the number of allowed steps to 200. I thank everyone for their input and I selected the answer that helped me most.
FINAL UPDATED CODE:
def slope(self, data, time):
    # wt is the wavelet module (presumably pywt), imported elsewhere
    (wave1, wave2) = wt.dwt(data, "db3")
    std = 2 * np.std(wave2)
    e = std / 0.05
    de = 5 * std
    N = len(data)
    slopes = np.ones(shape=(N,))
    # reflect the data about its end points so the window never runs off the array
    data2 = np.concatenate((-data[::-1] + 2 * data[0], data, -data[::-1] + 2 * data[N - 1]))
    time2 = np.concatenate((-time[::-1] + 2 * time[0], time, -time[::-1] + 2 * time[N - 1]))
    for n in xrange(N + 1, 2 * N):
        left = N + 1
        right = 2 * N
        for i in xrange(200):  # binary search on the window size, at most 200 steps
            mid = int(0.5 * (left + right))
            diff = np.abs(data2[n - mid + N] - data2[n + mid - N])
            if diff >= e:
                if diff < e + de:
                    break
                right = mid - 1
                continue
            left = mid + 1
        leftlim = n - mid + N
        rightlim = n + mid - N
        y = data2[leftlim:rightlim:int(0.05 * (rightlim - leftlim) + 1)]
        x = time2[leftlim:rightlim:int(0.05 * (rightlim - leftlim) + 1)]
        xavg = np.average(x)
        yavg = np.average(y)
        xlen = len(x)
        slopes[n - N] = (np.dot(x, y) - xavg * yavg * xlen) / (np.dot(x, x) - xavg * xavg * xlen)
    return np.array(slopes)
Your comments suggest that you need to find a better method of estimating i_{k+1} given i_k. With no knowledge of the values in data, the naive algorithm would be:
At each iteration for n, leave i at its previous value, and see whether abs(data[start]-data[end]) is less than e. If it is, leave i at its previous value, and find your new one by incrementing it by 1, as you do now. If it is greater or equal, do a binary search on i to find the appropriate value. You could also do the binary search forwards, but finding a good candidate upper limit without knowledge of the data can prove difficult. This algorithm won't perform worse than your current estimation method.
If you know that data is reasonably smooth (no sudden jumps, hence a smooth plot for all i values) and monotonically increasing, you can replace the binary search with a backwards search, decrementing i by 1 instead.
How to optimize this will depend on some properties of your data, but here are some ideas:
Have you tried profiling the code? Using one of the Python profilers can give you some useful information about what's taking the most time. Often, a piece of code you've just written will have one biggest bottleneck, and it's not always obvious which piece it is; profiling lets you figure that out and attack the main bottleneck first.
Do you know what typical values of i are? If you have some idea, you can speed things up by starting with i greater than 0 (as @vhallac noted), or by increasing i in larger steps: if you often see big values for i, increase i by 2 or 3 at a time; if the distribution of i values has a long tail, try doubling it each time; and so on.
Do you need all the data when doing the least squares regression? If that function call is the bottleneck, you may be able to speed it up by using only some of the data in the range. Suppose, for instance, that at a particular point, you need i to be 200 to see a large enough (above-noise) change in the data. But you may not need all 400 points to get a good estimate of the slope — just using 10 or 20 points, evenly spaced in the start:end range, may be sufficient, and might speed up the code a lot.
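For the profiling suggestion above, a minimal sketch (the function name estimate_slopes is just a placeholder for wherever your slope loop lives):

import cProfile
import pstats

cProfile.run('estimate_slopes(data, time)', 'slope_profile')  # placeholder call
stats = pstats.Stats('slope_profile')
stats.sort_stats('cumulative').print_stats(10)                # show the top 10 offenders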
I work with Python for similar analyses and have a few suggestions. I didn't look at the details of your code, just at your problem statement:
1) It grabs a small piece of data of size dx (starting with 3 datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)|) is larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope using OLS regression. If the difference is too small, it will increase dx and redo the loop with this new dx
4) This continues for all the datapoints
I think the most obvious reason for the slow execution is the LOOPING nature of your code, whereas you could use the VECTORIZED (array-based operations) nature of Numpy.
For step 1, instead of taking pairs of points, you can compute data[3:] - data[:-3] directly and get all the differences in a single array operation;
For step 2, you can use the result of array-based tests like numpy.argwhere(data > threshold) instead of testing every element inside some loop;
Step 3 sounds conceptually wrong to me. You say that if the difference is too small, you will increase dx. But if the difference is small, the resulting slope would be small because it IS actually small. Then getting a small value is the right result, and artificially increasing dx to get a "better" result might not be what you want. Well, it might actually be what you want, but you should consider this. I would suggest that you calculate the slope for a fixed dx across the whole data, and then take the resulting array of slopes to select your regions of interest (for example, using data_slope[numpy.argwhere(data_slope > minimum_slope)]); a sketch follows below.
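A minimal sketch of that fixed-dx idea (the half-window k defaults to an example value, and data, time and noise_std are assumed to be your arrays and noise estimate):

import numpy as np

def fixed_dx_slopes(data, time, noise_std, k=3):
    """Slope over a fixed half-window k for every centre point, plus a mask of
    the points whose change is clearly above the noise floor (sketch only)."""
    dy = data[2 * k:] - data[:-2 * k]           # y(x+dx) - y(x-dx) for all centres at once
    dt = time[2 * k:] - time[:-2 * k]
    slopes = dy / dt
    significant = np.abs(dy) > 40 * noise_std   # the "40x std. dev. of noise" test
    return slopes, significant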
Hope this helps!

how do I detect zero-vectors that make k-means cosine crash Matlab?

I'm running kmeans on a large dataset and I'm always getting the error below:
Error using kmeans (line 145)
Some points have small relative magnitudes, making them effectively zero.
Either remove those points, or choose a distance other than 'cosine'.
Error in runkmeans (line 7)
[L, C]=kmeans(data, 10, 'Distance', 'cosine', 'EmptyAction', 'drop')
My problem is that even when I add 1 to all the vectors, I still get this error. I would expect it to pass then, but apparently there are still too many zeros (that is what is causing it, right?).
My question is this: what is the condition that makes Matlab decide that a point has "a small relative magnitude" and "is effectively zero"?
I want to remove all these points from my dataset using Python before I hand the data over to Matlab, because I need to compare my results with a gold standard that I process in Python.
Thanks in advance!
EDIT-ANSWER
The correct answer was given below, but in case someone finds this question through Google, here's how you remove the "effectively zero" vectors from your matrix in Python. Every row (!) is a data point, so you will want to transpose in Python or Matlab if you're running kmeans:
def getxnorm(data):
    return np.sqrt(np.sum(data ** 2, axis=1))

def remove_zero_vector(data, startxnorm, excluded=[]):
    eps = 2.2204e-016
    xnorm = getxnorm(data)
    if np.min(xnorm) <= (eps * np.max(xnorm)):
        local_index = np.transpose(np.where(xnorm == np.min(xnorm)))[0][0]
        global_index = np.transpose(np.where(startxnorm == np.min(xnorm)))[0][0]
        data = np.delete(data, local_index, 0)  # data with zero vector removed
        excluded.append(global_index)  # add global index to list of excluded vectors
        return remove_zero_vector(data, startxnorm, excluded)
    else:
        return (data, excluded)
I'm sure there's a much more scipythonic way for doing this, but it'll do :-)
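(A vectorized sketch of the same idea, for anyone who wants to avoid the recursion; it applies the eps * max(norm) test from the Matlab source quoted below in a single pass. Untested against my actual data:)

import numpy as np

def remove_small_vectors(data, eps=2.2204e-16):
    """Drop rows whose relative magnitude Matlab would treat as effectively zero."""
    xnorm = np.sqrt(np.sum(data ** 2, axis=1))   # row norms
    keep = xnorm > eps * np.max(xnorm)           # Matlab's relative-magnitude test
    return data[keep], np.where(~keep)[0]        # kept rows, excluded row indices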
If you're using this kmeans, then the relevant code that is throwing the error is:
case 'cosine'
    Xnorm = sqrt(sum(X.^2, 2));
    if any(min(Xnorm) <= eps * max(Xnorm))
        error(['Some points have small relative magnitudes, making them ', ...
               'effectively zero.\nEither remove those points, or choose a ', ...
               'distance other than ''cosine''.'], []);
    end
So there's your test.
As you can see, what's important is relative size, so adding one to everything only makes things worse (max(Xnorm) is getting larger too). A good fix might be to scale all the data by a constant.
In your other question it looked like your data was scalar. If your input vectors only have one feature/dimension, the cosine distance between them will always be undefined (or zero), because by definition they point in the same direction (along the single axis). The cosine measure gives the angle between two vectors, which can only be non-zero if the vectors can point in different directions (i.e. dimension > 1).
