I have 10k data points like this:
0.010222
0.010345
0.010465
0.010611
0.010768
0.010890
0.011049
0.011206
0.011329
0.011465
0.011613
0.11763
0.011888
0.012015
0.012154
0.012282
0.012408
0.012524
....
I want to calculate the Lyapunov exponent of this series. This is what I've done so far:
lyapunovs = []
eps = 0.0001
for i in range(N):
    for j in range(i + 1, N):
        if np.abs(data[i] - data[j]) < eps:
            for k in range(1, min(N - i, N - j)):
                d0 = np.abs(data[i] - data[j])
                dn = np.abs(data[i + k] - data[j + k])
                lyapunovs.append(math.log(dn) - math.log(d0)) # problem
My problem is that I don't know whether the first Lyapunov exponent is the average of all the lyapunovs values at k = 1, or the average taken the first time that |data[i] - data[j]| < eps.
Is this the right implementation of the Lyapunov exponent?
And this is the Numerical Calculation of Lyapunov Exponent
I would calculate the Lyapunov exponent in the following way and then write the results to a file as tuples (see the blog post https://blog.abhranil.net/2014/07/22/calculating-the-lyapunov-exponent-of-a-time-series-with-python-code/):
from math import log
import numpy as np
with open('data.txt', 'r') as f:
    data = [float(i) for i in f.read().split()]
N = len(data)
eps = 0.001
lyapunovs = [[] for i in range(N)]
for i in range(N):
    for j in range(i + 1, N):
        if np.abs(data[i] - data[j]) < eps:
            for k in range(min(N - i, N - j)):
                lyapunovs[k].append(log(np.abs(data[i+k] - data[j+k])))
with open('lyapunov.txt', 'w') as f:
    for i in range(len(lyapunovs)):
        if len(lyapunovs[i]):
            string = str((i, sum(lyapunovs[i]) / len(lyapunovs[i])))
            f.write(string + '\n')
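On the question of which average to use: one common reading of the per-k averages written above is that the exponent is the slope of the mean log-distance against k over the first few steps, before the distances saturate. The following is a minimal sketch of that idea, not the blog's exact code. The function name mean_log_divergence, the parameter k_max, and the logistic-map test series are assumptions made for illustration.

```python
import numpy as np

def mean_log_divergence(data, eps, k_max):
    """Mean log-distance after k steps for all pairs that start closer than eps."""
    data = np.asarray(data, dtype=float)
    n = data.size - k_max                  # start points that can be followed k_max steps
    i, j = np.triu_indices(n, k=1)         # all index pairs with i < j
    close = np.abs(data[i] - data[j]) < eps
    i, j = i[close], j[close]
    curve = np.empty(k_max + 1)
    for k in range(k_max + 1):
        d = np.abs(data[i + k] - data[j + k])
        d = d[d > 0]                       # skip exact coincidences, since log(0) is -inf
        curve[k] = np.mean(np.log(d))
    return curve

# Hypothetical test series: logistic map with r = 3.92 (not the asker's data).
x = [0.2]
for _ in range(1000):
    x.append(3.92 * x[-1] * (1 - x[-1]))

k_max = 5
curve = mean_log_divergence(x, eps=1e-3, k_max=k_max)
# Slope of a straight-line fit: estimated exponent per time step.
slope = np.polyfit(np.arange(k_max + 1), curve, 1)[0]
print(slope)
```

The choice of k_max matters: if it is too large, the distances reach the size of the attractor and the curve flattens, which pulls the fitted slope down.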
I see from the chosen loop structure in the question that a triangle of the Cartesian product of the points is being used. This might improve the estimate of the derivatives, which are susceptible to noise, but it is not part of the Lyapunov exponent explicitly. See this example of the calculations on a known function in the absence of measurement error. Feel free to look into that aspect more, but below I will assume the comparison of signal points adjacent in time.
Your original question uses NumPy, so I will also make use of it. One of the rules of thumb to using NumPy well is to avoid loops, although it is possible to vectorize functions that contain loops. With no explicit time measurements, and no repeated values, you could simply do:
import numpy as np
x = np.random.normal(0,1,size=10**4) # Mock signal data
np.mean(np.log(np.abs(np.diff(x))))
Or if the signal is paired with an array of timepoints, then the numerical derivative can involve time:
import numpy as np
x = np.random.normal(0,1,size=10**4) # Mock signal data
t = np.arange(10**4) # Mock time data
np.mean(np.log(np.abs(np.diff(x) / np.diff(t))))
However, in some datasets it is possible for adjacent values to repeat! This can occur when you've measured the signal only to a few decimal places, and it is a problem because it leads to np.log(0) (=-np.inf) which will blow up your calculation. A simple solution is to remove duplicated values, but this will only be suitable if duplicates are relatively rare and you have a large sample size. It is possible to estimate an upper bound on the estimate of the L-exponent by considering the precision of your measurements, but that is not the estimate of the L-exponent itself.
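A minimal sketch of the "drop the duplicates" workaround described above, assuming it is acceptable to discard zero differences between adjacent samples. The helper name mean_log_abs_diff is hypothetical.

```python
import numpy as np

def mean_log_abs_diff(x):
    """Mean of log|x[i+1] - x[i]|, ignoring zero differences (repeated values)."""
    d = np.abs(np.diff(np.asarray(x, dtype=float)))
    d = d[d > 0]                      # a zero difference would give log(0) = -inf
    if d.size == 0:
        return float("nan")           # nothing left to average
    return float(np.mean(np.log(d)))

# The repeated value 1.0 is skipped; the remaining differences are 1 and e.
print(mean_log_abs_diff([0.0, 1.0, 1.0, 1.0 + np.e]))
```

As noted above, this is only reasonable when repeats are rare; if many samples are dropped, the average no longer describes the whole signal.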
I just want to mention that knowing the analytical expression of the underlying map is the best case.
I will take the logistic map as an example:
import numpy as np
import matplotlib.pyplot as plt
def logisticmap(x_init, r, length):
    x = [x_init]
    for t in range(length):
        x.append(r*x[-1]*(1-x[-1]))
    return np.array(x)
Now let's generate the data:
r = 3.92
x = logisticmap(0.2, r, 1000)
plt.plot(x)
plt.show()
[Figure: plot of the logistic map series]
Here is the solution proposed by Galan:
np.mean(np.log(abs(np.diff(x))))
which gives -1.0379.
When you instead derive the Lyapunov exponent from the logistic map equation (with r = 3.92 as above):
np.mean(np.log(abs(r*(1-2*x))))
it gives 0.538296.
This is the true value of the Lyapunov exponent: since the system is in its chaotic regime, the exponent must be positive. So I guess the estimate from the data points alone does not work in this example. You can try with more data points, but it will still give you a negative LE.
Unfortunately I don't know enough to guide you towards a better estimate of the Lyapunov exponent when you can't derive a mathematical expression, but I would be interested to know!
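For reference, the expression np.mean(np.log(abs(r*(1-2*x)))) used in this answer is the finite-sample version of the usual definition of the Lyapunov exponent of a one-dimensional map x_{n+1} = f(x_n), written here for the logistic map:

```latex
\lambda = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \ln \left| f'(x_i) \right|,
\qquad f(x) = r x (1 - x), \qquad f'(x) = r (1 - 2x).
```

By contrast, np.diff(x) measures x_{i+1} - x_i, the distance between successive points of a single orbit, rather than how f stretches a small perturbation. That may be why the two numbers differ in sign.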
I tried to reduce computational complexity with numpy vectorization.
def lyapunov_exponent(series: np.ndarray, threshold: float) -> np.ndarray:
    N = len(series)
    eps = threshold
    L = [np.array([0]*N)]
    for i in range(1, N):
        diff = np.abs(series[i:]-series[:-i])
        dist = np.log(diff)
        L.append(np.concatenate([[0]*i, dist]))
    L = np.array(L)
    tf_L = np.where(L<eps, 1, 0)
    count_L = np.zeros_like(tf_L)
    for i in range(N):
        indices = ( np.array(range(0,N-i)), np.array(range(i,N)) )
        count_L[indices] = np.cumsum(tf_L[indices])
    avg = np.sum(count_L * L, axis=0) / np.sum(count_L, axis=0)
    return avg
If there is room for improvement or you get some different result than already answered, please reply.
I am trying to speed up the following code that computes:
where I only need to compute this function for x > y from 0 to 1 (but I need a very fine discretization, such as dt = 0.001). I have vectorized my solution, but it is still not fast enough (I really need something like a 10x improvement). Any ideas? (I tried Cython, but it was still slow because of the nature of the vectorization.)
import time
import numpy as np
def solveF(x, f, lam):
    nx = len(x)
    res = np.zeros((nx, nx))
    for i in range(0, nx):
        for j in range(0, nx):
            if i > j:
                res[i][j] = f*np.exp(lam*(x[i]-x[j]))
    return res
def fastKernelCalc(f, x, dx):
    nx = len(x)
    kappa = np.zeros((nx, nx))
    f2 = f.transpose()
    for i in range(nx):
        t1 = time.time()
        for j, xj in enumerate(x):
            kernel = 0
            if i-j>0 and j!=0:
                kernel -= sum(np.diagonal(f, offset=j-i)[0:j])*dx
                for k in range(0, j):
                    kernel += sum(f2[k][k:k+i-j]*kappa[i-j+k][k:k+i-j])*dx*dx
            kappa[i][j] = kernel
    return kappa
X = 1
dx = 0.001
nx = int(round(X/dx))+1
spatial = np.linspace(0, X, nx)
f = solveF(spatial, 5, 5)
kernel= fastKernelCalc(f, spatial, dx)
My first thought was that if speed is paramount, you should probably use C or Fortran for numerical stuff. Python is great, but not fast.
Things that do pop out:
In the double loop in solveF, you could do
for j in range(0, i):
since for j >= i you do nothing. That won't save you much time because no calculations were done there, but it is something that could be improved.
Could you rewrite your equation so that you don't calculate the transpose of f? That could be computationally intensive if f is big.
I'm not a Python expert, so this might be stupid, but I would avoid using "sum" and "diagonal". Sometimes (take this with a grain of salt) these generic functions have to do a lot of checks to ensure the operation can be done.
If this is of the utmost importance, and worth the effort, I would add timers at different parts of the code to time which part is the bottleneck. If there is a bottleneck.
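As an illustration of the first point, the double loop in solveF can be replaced by NumPy broadcasting. This is a sketch under the assumption that only the strictly lower triangle (i > j) is wanted, as in the question. The name solveF_vectorized is hypothetical, and this does not address the cost of fastKernelCalc, which is where most of the time is likely spent.

```python
import numpy as np

def solveF_vectorized(x, f, lam):
    """Same result as solveF: f*exp(lam*(x[i]-x[j])) for i > j, zero elsewhere."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]                 # diff[i, j] = x[i] - x[j]
    return np.tril(f * np.exp(lam * diff), k=-1)   # keep only entries with i > j

grid = np.linspace(0, 1, 5)
print(solveF_vectorized(grid, 5, 5))
```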
Hope this helps.
So, I need help minimizing the time it takes to run this code on large amounts of data, using only NumPy. I think the for loops make my code inefficient, but I do not know how to turn them into a list comprehension, which might help it run faster.
import numpy as np
def lagrange(p,node,n,x):
    m=[]
    #base lagrange polynomial
    for i in range(n):
        for j in range(p+1):
            L=1
            for k in range(p+1):
                if k!=j:
                    L= L*(x[i] - node[k])/(node[j] - node[k])
            m.append(L)
    lagrange= np.array(m).reshape(n,p+1)
    return lagrange
def interpolant(a,b,p,n,x,f):
    m=[]
    node=np.linspace(a,b,p+1)
    for j in range(n):
        polynomial=0
        for i in range(p+1):
            #entry [j,i] is the i-th basis polynomial evaluated at x[j]
            polynomial += f(node[i]) * lagrange(p,node,n,x)[j,i]
        m.append(polynomial)
    interpolant = np.array(m)
    return interpolant
It appears the value of lagrange_poly(...) is recomputed n*(p+1) times for no reason which is very very expensive! You can compute it once before the loop, store it in a variable and reuse the variable later.
Here is the fixed code:
def uniform_poly_interpolation(a,b,p,n,x,f,produce_fig):
    inter=[]
    xhat=np.linspace(a,b,p+1)
    #compute the Lagrange matrix once, outside the loops
    mat = lagrange_poly(p,xhat,n,x,1e-10)[0]
    #use for loop to iterate interpolant.
    for j in range(n):
        po=0
        for i in range(p+1):
            #mat has shape (n, p+1): row j is the point x[j], column i the node xhat[i]
            po += f(xhat[i]) * mat[j,i]
        inter.append(po)
    interpolant = np.array(inter)
    return interpolant
This should be much much faster.
Moreover, the execution is slow because accessing scalar values of NumPy arrays from CPython is very slow. NumPy is designed to work on whole arrays, not to extract scalar values in loops. Additionally, loops run by the CPython interpreter are relatively slow. You can solve this problem efficiently with Numba, which compiles your code to very fast native code using a JIT compiler.
Here is the Numba code:
import numba as nb
import numpy as np
@nb.njit
def lagrange_poly(p, xhat, n, x, tol):
    error_flag = 0
    er = 1
    lagrange_matrix = np.empty((n, p+1), dtype=np.float64)
    # Flag adjacent nodes that are closer than tol
    for l in range(p):
        if abs(xhat[l] - xhat[l+1]) < tol:
            error_flag = er
    # Base lagrange polynomial
    for i in range(n):
        for j in range(p+1):
            L = 1.0
            for k in range(p+1):
                if k!=j:
                    L = L * (x[i] - xhat[k]) / (xhat[j] - xhat[k])
            lagrange_matrix[i, j] = L
    return lagrange_matrix, error_flag
Overall, this should be several orders of magnitude faster.
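Since the question asks for a NumPy-only solution, here is a sketch of the same basis matrix built with broadcasting instead of explicit loops. The names lagrange_matrix_numpy and interpolate are hypothetical, f is assumed to accept a NumPy array, and the intermediate array has shape (n, p+1, p+1), so this trades memory for speed.

```python
import numpy as np

def lagrange_matrix_numpy(nodes, x):
    """Return B with B[i, j] = j-th Lagrange basis polynomial evaluated at x[i]."""
    nodes = np.asarray(nodes, dtype=float)
    x = np.asarray(x, dtype=float)
    m = nodes.size
    # numer[i, j, k] = x[i] - nodes[k], repeated along j; copy() makes it writable.
    numer = np.broadcast_to(x[:, None, None] - nodes[None, None, :],
                            (x.size, m, m)).copy()
    denom = nodes[:, None] - nodes[None, :]        # denom[j, k] = nodes[j] - nodes[k]
    idx = np.arange(m)
    numer[:, idx, idx] = 1.0                       # neutralise the excluded k == j factors
    denom[idx, idx] = 1.0
    return np.prod(numer / denom, axis=2)

def interpolate(f, nodes, x):
    """Evaluate the interpolating polynomial of f through nodes at the points x."""
    return lagrange_matrix_numpy(nodes, x) @ f(np.asarray(nodes, dtype=float))

nodes = np.linspace(0.0, 1.0, 3)
points = np.array([0.25, 0.5, 0.9])
print(interpolate(lambda t: t**2, nodes, points))
```

With three nodes the interpolant reproduces a quadratic exactly, which gives a quick sanity check of the basis matrix.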
I am currently trying to write some python code to solve an arbitrary system of first order ODEs, using a general explicit Runge-Kutta method defined by the values alpha, gamma (both vectors of dimension m) and beta (lower triangular matrix of dimension m x m) of the Butcher table which are passed in by the user. My code appears to work for single ODEs, having tested it on a few different examples, but I'm struggling to generalise my code to vector valued ODEs (i.e. systems).
In particular, I try to solve a Van der Pol oscillator ODE (reduced to a first order system) using Heun's method defined by the Butcher Tableau values given in my code, but I receive the errors
"RuntimeWarning: overflow encountered in double_scalars f = lambda t,u: np.array(... etc)" and
"RuntimeWarning: invalid value encountered in add kvec[i] = f(t+alpha[i]*h,y+h*sum)"
followed by my solution vector that is clearly blowing up. Note that the commented out code below is one of the examples of single ODEs that I tried and is solved correctly. Could anyone please help? Here is my code:
import numpy as np
def rk(t,y,h,f,alpha,beta,gamma):
    '''Runge-Kutta iteration'''
    return y + h*phi(t,y,h,f,alpha,beta,gamma)
def phi(t,y,h,f,alpha,beta,gamma):
    '''Phi function for the Runge-Kutta iteration'''
    m = len(alpha)
    count = np.zeros(len(f(t,y)))
    kvec = k(t,y,h,f,alpha,beta,gamma)
    for i in range(1,m+1):
        count = count + gamma[i-1]*kvec[i-1]
    return count
def k(t,y,h,f,alpha,beta,gamma):
    '''returning a vector containing each step k_{i} in the m step Runge-Kutta method'''
    m = len(alpha)
    kvec = np.zeros((m,len(f(t,y))))
    kvec[0] = f(t,y)
    for i in range(1,m):
        sum = np.zeros(len(f(t,y)))
        for l in range(1,i+1):
            sum = sum + beta[i][l-1]*kvec[l-1]
        kvec[i] = f(t+alpha[i]*h,y+h*sum)
    return kvec
def timeLoop(y0,N,f,alpha,beta,gamma,h,rk):
    '''function that loops through time using the RK method'''
    t = np.zeros([N+1])
    y = np.zeros([N+1,len(y0)])
    y[0] = y0
    t[0] = 0
    for i in range(1,N+1):
        y[i] = rk(t[i-1],y[i-1], h, f,alpha,beta,gamma)
        t[i] = t[i-1]+h
    return t,y
#################################################################
'''f = lambda t,y: (c-y)**2
Y = lambda t: np.array([(1+t*c*(c-1))/(1+t*(c-1))])
h0 = 1
c = 1.5
T = 10
alpha = np.array([0,1])
gamma = np.array([0.5,0.5])
beta = np.array([[0,0],[1,0]])
eff_rk = compute(h0,Y(0),T,f,alpha,beta,gamma,rk, Y,11)'''
#constants
mu = 100
T = 1000
h = 0.01
N = int(T/h)
#initial conditions
y0 = 0.02
d0 = 0
init = np.array([y0,d0])
#Butcher Tableau for Heun's method
alpha = np.array([0,1])
gamma = np.array([0.5,0.5])
beta = np.array([[0,0],[1,0]])
#rhs of the ode system
f = lambda t,u: np.array([u[1],mu*(1-u[0]**2)*u[1]-u[0]])
#solving the system
time, sol = timeLoop(init,N,f,alpha,beta,gamma,h,rk)
print(sol)
Your step size is not small enough. The Van der Pol oscillator with mu=100 is a fast-slow system with very sharp turns where the modes switch, so it is rather stiff. With explicit methods this requires small step sizes; the smallest sensible step size is 1e-5 to 1e-6. You already get a solution on the limit cycle for h=0.001, with resulting velocities up to 150.
You can reduce some of that stiffness by using a different velocity/impulse variable. In the equation
x'' - mu*(1-x^2)*x' + x = 0
you can combine the first two terms into a derivative,
mu*v = x' - mu*(1-x^2/3)*x
so that
x' = mu*(v+(1-x^2/3)*x)
v' = -x/mu
The second equation is now uniformly slow close to the limit cycle, while the first has long relatively straight jumps when v leaves the cubic v=x^3/3-x.
This integrates nicely with the original h=0.01, keeping the solution inside the box [-3,3]x[-2,2], even if it shows some strange oscillations that are not present for smaller step sizes and the exact solution.
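The transformed system above (often called Liénard coordinates) can be tried with a few lines of Heun's method. This is a self-contained sketch rather than a drop-in for the question's rk/timeLoop functions. The function names are hypothetical, and the step h=0.001 and the short integration time are conservative choices of mine; the answer reports that h=0.01 also stays bounded, which can be checked by changing the argument.

```python
import numpy as np

def lienard_rhs(t, u, mu):
    """Van der Pol in the variables u = (x, v): x' = mu*(v+(1-x^2/3)*x), v' = -x/mu."""
    x, v = u
    return np.array([mu * (v + (1.0 - x**2 / 3.0) * x), -x / mu])

def heun_step(f, t, u, h, mu):
    """One step of Heun's method (explicit trapezoidal rule)."""
    k1 = f(t, u, mu)
    k2 = f(t + h, u + h * k1, mu)
    return u + 0.5 * h * (k1 + k2)

def integrate_lienard(x0, dx0, mu, h, n_steps):
    # Convert the initial velocity x'(0) to v(0) via mu*v = x' - mu*(1 - x^2/3)*x.
    v0 = dx0 / mu - (1.0 - x0**2 / 3.0) * x0
    sol = np.empty((n_steps + 1, 2))
    sol[0] = (x0, v0)
    t = 0.0
    for i in range(n_steps):
        sol[i + 1] = heun_step(lienard_rhs, t, sol[i], h, mu)
        t += h
    return sol

sol = integrate_lienard(x0=0.02, dx0=0.0, mu=100.0, h=0.001, n_steps=20000)
print(np.abs(sol).max(axis=0))   # largest |x| and |v| reached
```

To recover the original velocity from the result, use x' = mu*(v+(1-x^2/3)*x) on each row of sol.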
I'm implementing the PC algorithm in Python. This algorithm constructs the graphical model of an n-variate Gaussian distribution. The graphical model is basically the skeleton of a directed acyclic graph, which means that if a structure like:
(x1)---(x2)---(x3)
is in the graph, then x1 is independent of x3 given x2. More generally, if A is the adjacency matrix of the graph and A(i,j)=A(j,i) = 0 (there is a missing edge between i and j), then i and j are conditionally independent given all the variables that appear in any path from i to j. For statistical and machine learning purposes, it is possible to "learn" the underlying graphical model.
If we have enough observations of a jointly Gaussian n-variate random variable, we can use the PC algorithm, which works as follows:
given n as the number of variables observed, initialize the graph as G=K(n)
for each pair i,j of nodes:
    if exists an edge e from i to j:
        look for the neighbours of i
        if j is in neighbours of i then remove j from the set of neighbours
        call the set of neighbours k
        TEST if i and j are independent given the set k, if TRUE:
            remove the edge e from i to j
This algorithm also computes the separating sets of the graph, which are used by another algorithm that constructs the DAG starting from the skeleton and the separating sets returned by the PC algorithm. This is what I've done so far:
import itertools
def _core_pc_algorithm(a,sigma_inverse):
    l = 0
    N = len(sigma_inverse[0])
    n = range(N)
    sep_set = [ [set() for i in n] for j in n]
    act_g = complete(N)
    z = lambda m,i,j : -m[i][j]/((m[i][i]*m[j][j])**0.5)
    while l<N:
        for (i,j) in itertools.permutations(n,2):
            adjacents_of_i = adj(i,act_g)
            if j not in adjacents_of_i:
                continue
            else:
                adjacents_of_i.remove(j)
            if len(adjacents_of_i) >=l:
                for k in itertools.combinations(adjacents_of_i,l):
                    if N-len(k)-3 < 0:
                        return (act_g,sep_set)
                    if test(sigma_inverse,z,i,j,l,a,k):
                        act_g[i][j] = 0
                        act_g[j][i] = 0
                        sep_set[i][j] |= set(k)
                        sep_set[j][i] |= set(k)
        l = l + 1
    return (act_g,sep_set)
a is the tuning parameter alpha with which I test for conditional independence, and sigma_inverse is the inverse of the covariance matrix of the sampled observations. Moreover, my test is:
import numpy
def test(sigma_inverse,z,i,j,l,a,k):
    def erfinv(x): #used to approximate the inverse of a gaussian cumulative density function
        sgn = 1
        a = 0.147
        PI = numpy.pi
        if x<0:
            sgn = -1
        temp = 2/(PI*a) + numpy.log(1-x**2)/2
        add_1 = temp**2
        add_2 = numpy.log(1-x**2)/a
        add_3 = temp
        rt1 = (add_1-add_2)**0.5
        rtarg = rt1 - add_3
        return sgn*(rtarg**0.5)
    def indep_test_ijK(K): #compute partial correlation of i and j given ONE conditioning variable K
        part_corr_coeff_ij = z(sigma_inverse,i,j) #this gives the partial correlation coefficient of i and j
        part_corr_coeff_iK = z(sigma_inverse,i,K) #this gives the partial correlation coefficient of i and k
        part_corr_coeff_jK = z(sigma_inverse,j,K) #this gives the partial correlation coefficient of j and k
        part_corr_coeff_ijK = (part_corr_coeff_ij - part_corr_coeff_iK*part_corr_coeff_jK)/((((1-part_corr_coeff_iK**2))**0.5) * (((1-part_corr_coeff_jK**2))**0.5)) #this gives the partial correlation coefficient of i and j given K
        return part_corr_coeff_ijK == 0 #i independent from j given K if partial_correlation(i,k)|K == 0 (under jointly gaussian assumption) [could check if abs is < alpha?]
    def indep_test():
        n = len(sigma_inverse[0])
        phi = lambda p : (2**0.5)*erfinv(2*p-1)
        root = (n-len(k)-3)**0.5
        return root*abs(z(sigma_inverse,i,j)) <= phi(1-a/2)
    if l == 0:
        return z(sigma_inverse,i,j) == 0 #i independent from j <=> partial_correlation(i,j) == 0 (under jointly gaussian assumption) [could check if abs is < alpha?]
    elif l == 1:
        return indep_test_ijK(k[0])
    elif l == 2:
        return indep_test_ijK(k[0]) and indep_test_ijK(k[1]) #ASSUMING THAT IJ ARE INDEPENDENT GIVEN Y,Z <=> IJ INDEPENDENT GIVEN Y AND IJ INDEPENDENT GIVEN Z
    else: #i have to use the independent test with the z-fisher function
        return indep_test()
Where z is a lambda that receives a matrix (the inverse of the covariance matrix), an integer i, an integer j and it computes the partial correlation of i and j given all the rest of variables with the following rule (which I read in my teacher's slides):
corr(i,j)|REST = -var^-1(i,j)/sqrt(var^-1(i,i)*var^-1(j,j))
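The rule above can be written for all pairs at once with NumPy. This is a minimal sketch, assuming a well-conditioned covariance matrix; the function name partial_corr_given_rest is hypothetical.

```python
import numpy as np

def partial_corr_given_rest(cov):
    """Partial correlation of every pair (i, j) given all remaining variables."""
    prec = np.linalg.inv(np.asarray(cov, dtype=float))   # precision matrix
    d = np.sqrt(np.diag(prec))
    pc = -prec / np.outer(d, d)       # -P[i, j] / sqrt(P[i, i] * P[j, j])
    np.fill_diagonal(pc, 1.0)         # by convention, each variable with itself
    return pc

# Chain x1 - x2 - x3: the precision matrix has a zero in position (0, 2).
chain_precision = np.array([[2.0, -1.0, 0.0],
                            [-1.0, 2.0, -1.0],
                            [0.0, -1.0, 2.0]])
print(partial_corr_given_rest(np.linalg.inv(chain_precision)))
```

In the chain example the (0, 2) entry comes out as zero up to rounding, matching the statement that x1 and x3 are independent given x2.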
The main core of this application is the indep_test() function:
def indep_test():
    n = len(sigma_inverse[0])
    phi = lambda p : (2**0.5)*erfinv(2*p-1)
    root = (n-len(k)-3)**0.5
    return root*abs(z(sigma_inverse,i,j)) <= phi(1-a/2)
This function implements a statistical test which uses Fisher's z-transform of the estimated partial correlations. I am using this algorithm in two ways:
Generate data from a linear regression model and compare the learned DAG with the expected one
Read a dataset and learn the underlying DAG
In both cases I do not always get correct results: either I know the DAG underlying a certain dataset, or I know the generative model, and it does not coincide with the one my algorithm learns. I know perfectly well that this is a non-trivial task, and I may have misunderstood theoretical concepts as well as made errors in parts of the code I have omitted here. But first I'd like to know (from someone more experienced than me) whether the test I wrote is right, and whether there are library functions that perform this kind of test. I tried searching but couldn't find any suitable function.
I'll get to the point. The most critical issue in the above code concerns the following expression:
sqrt(n-len(k)-3)*abs(z(sigma_inverse[i][j])) <= phi(1-alpha/2)
I was mistaken about the meaning of n: it is not the size of the precision matrix but the total number of multivariate observations (in my case, 10000 instead of 5). Another wrong assumption was that z(sigma_inverse[i][j]) has to provide the partial correlation of i and j given all the rest. That's not correct: z is Fisher's transform of the partial correlation of i and j given K, which is estimated from a proper subset of the precision matrix. The correct test is the following:
# assumes: import math, import numpy as np, and norm taken from scipy.stats (presumably)
if len(K) == 0: #CM is the correlation matrix; there are no conditioning variables (K has length 0)
    r = CM[i, j] #with an empty conditioning set, r is the plain correlation of i and j
elif len(K) == 1: #one conditioning variable; not very different from the previous version, except that I do not have to compute the correlation matrix since I start from it, and pandas provides such a feature on a DataFrame
    r = (CM[i, j] - CM[i, K] * CM[j, K]) / math.sqrt((1 - math.pow(CM[j, K], 2)) * (1 - math.pow(CM[i, K], 2))) #r is the partial correlation of i and j given K
else: #more than one conditioning variable
    CM_SUBSET = CM[np.ix_([i]+[j]+K, [i]+[j]+K)] #subset of the correlation matrix I'm looking for
    PM_SUBSET = np.linalg.pinv(CM_SUBSET) #constructing the precision matrix of the given subset
    r = -1 * PM_SUBSET[0, 1] / math.sqrt(abs(PM_SUBSET[0, 0] * PM_SUBSET[1, 1]))
r = min(0.999999, max(-0.999999,r))
res = math.sqrt(n - len(K) - 3) * 0.5 * math.log1p((2*r)/(1-r)) #Fisher's transformation of the estimated partial correlation
return 2 * (1 - norm.cdf(abs(res))) #obtaining p-value
I hope someone could find this helpful
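For readers who want to try the corrected test in isolation, here is a self-contained sketch of the same computation. It is not the author's exact code: the function name fisher_z_pvalue and its argument names are hypothetical, the sub-matrix route is used for every size of conditioning set (for an empty set it reduces to the plain correlation), and math.erfc replaces norm.cdf, using 2*(1 - Phi(s)) = erfc(s/sqrt(2)).

```python
import math
import numpy as np

def fisher_z_pvalue(corr, i, j, cond, n_samples):
    """Two-sided p-value for 'i independent of j given cond' (Gaussian data).

    corr is the correlation matrix, cond a list of conditioning indices,
    n_samples the number of observations (not the number of variables).
    """
    idx = [i, j] + list(cond)
    sub = np.asarray(corr, dtype=float)[np.ix_(idx, idx)]
    prec = np.linalg.pinv(sub)                         # precision of the sub-matrix
    r = -prec[0, 1] / math.sqrt(abs(prec[0, 0] * prec[1, 1]))
    r = min(0.999999, max(-0.999999, r))               # keep the log finite
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))          # Fisher's z-transform
    stat = math.sqrt(n_samples - len(cond) - 3) * abs(z)
    return math.erfc(stat / math.sqrt(2.0))

# Chain 0 - 1 - 2 with corr(0,1) = corr(1,2) = 0.5 and corr(0,2) = 0.25.
CM = np.array([[1.0, 0.5, 0.25],
               [0.5, 1.0, 0.5],
               [0.25, 0.5, 1.0]])
print(fisher_z_pvalue(CM, 0, 2, [1], n_samples=10000))   # large: no evidence of dependence
print(fisher_z_pvalue(CM, 0, 1, [], n_samples=10000))    # tiny: clearly dependent
```

An edge would be removed when the p-value exceeds the chosen alpha.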
I want to find the closest representation of a floating point number in the form N/2**M in Python, where N and M are integers. I attempted to use the minimisation functions from scipy.optimize, but they cannot be restricted to the case where N and M are integers.
I ended up using a simple implementation that iterates through values of M and N and finds the minimum, but this is computationally expensive and time-consuming for arrays of many numbers. What might be a better way of doing this?
My simple implementation is shown below:
import numpy as np
def ValueRepresentation(X):
    M, Dp = X
    return M/(2**Dp)
def Diff(X, value):
    return abs(ValueRepresentation(X) - value)
def BestApprox(value):
    mindiff = 1000000000
    for i in np.arange(0, 1000, 1):
        for j in np.arange(0, 60, 1):
            diff = Diff([i, j], value)
            if diff < mindiff:
                mindiff = diff
                M = i
                Dp = j
    return M, Dp
Just use the built-in functionality:
In [10]: 2.5.as_integer_ratio() # get representation as fraction
Out[10]: (5, 2)
In [11]: (2).bit_length() - 1 # convert 2**M to M
Out[11]: 1
Note that all non-infinite, non-NaN floats are dyadic rationals, so we can rely on the denominator being an exact power of 2.
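Put together, the two built-in calls above give the exact pair directly. This is a small sketch; the function name as_dyadic is hypothetical. Note that it returns the exact representation, so N can be far larger than the bound of 1000 used in the question's search.

```python
def as_dyadic(value):
    """Return integers (N, M) such that value == N / 2**M exactly."""
    numerator, denominator = float(value).as_integer_ratio()
    # The denominator is a power of two, so its bit length minus one is the exponent.
    return numerator, denominator.bit_length() - 1

print(as_dyadic(2.5))     # (5, 1)
print(as_dyadic(0.375))   # (3, 3)
```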
Thanks to jasonharper I realised my implementation is ridiculously inefficient and could be much simpler.
The implementation of his method is shown below:
def BestApprox_fast(value):
    mindiff = 1000000000
    for Dp in np.arange(0, 32, 1):
        M = round(value*2**Dp)
        if abs(M) < 1000:
            diff = Diff([M, Dp], value)
            if diff < mindiff:
                mindiff = diff
                M_best = M
                Dp_best = Dp
    return M_best, Dp_best
It is approximately 200 times quicker.
With the limits on M and N given, the range of N/2**M is a well-defined discrete number scale:
[0-1000/2^26, 501-1000/2^25, 501-1000/2^24, ... 501-1000/2^1, 501-1000/2^0].
In this discrete set, different subsets have different accuracy/resolution. The first subset [0-1000/2^26] has an accuracy of 2^-26, or 26 binary bits of resolution. So whenever the given number falls in the corresponding continuous domain [0,1000/2^26], the best accuracy achievable is 2^-26. Successively, the best accuracy is 2^-25 when the given number is beyond the first domain but falls in the domain [500/2^25,1000/2^25], which corresponds to the second subset [501-1000/2^25]. (Note the difference between the discrete set and the continuous domain.)
With the above logic, we know that the best accuracy, defined by M, depends on where the given number falls on the scale. Thus we can implement it as the following Python code:
import numpy as np
limits = 1000.0/2**np.arange(0,61)
a = 103.23 # test value
for i in range(60,-1,-1):
    if a <= limits[i]:
        N = i
        M = round(a * 2**N)
        r = [M, N]
        break
if a > 1000:
    r = [round(a), 0]
This solution has O(1) (constant) execution time, so it is ideal for multiple invocations.