Python vs MATLAB performance on algorithm

I have a performance question about two bits of code. One is implemented in Python and one in MATLAB. The code calculates the sample entropy of a time series (which sounds complicated but is basically a bunch of for loops).
I am running both implementations on relatively large time series (~95k+ samples, depending on the time series). The MATLAB implementation finishes the calculation in ~45 sec to 1 min. The Python one basically never finishes. I threw tqdm over the Python for loops, and the outer loop was only moving at about ~1.85 s/it, which gives an estimated completion time of 50+ hours (I've let it run for 15+ mins and the iteration rate was pretty consistent).
Example inputs and runtimes:
MATLAB (~ 52 sec):
a = rand(1, 95000)
sampenc(a, 4, 0.1 * std(a))
Python (currently 5 mins in with 49 hours estimated):
import numpy as np
a = np.random.rand(1, 95000)[0]
sample_entropy(a, 4, 0.1 * np.std(a))
Python Implementation:
# https://github.com/nikdon/pyEntropy
import numpy as np

def sample_entropy(time_series, sample_length, tolerance=None):
    """Calculate and return Sample Entropy of the given time series.

    Distance between two vectors defined as Euclidean distance and can
    be changed in future releases

    Args:
        time_series: Vector or string of the sample data
        sample_length: Number of sequential points of the time series
        tolerance: Tolerance (default = 0.1...0.2 * std(time_series))

    Returns:
        Vector containing Sample Entropy (float)

    References:
        [1] http://en.wikipedia.org/wiki/Sample_Entropy
        [2] http://physionet.incor.usp.br/physiotools/sampen/
        [3] Madalena Costa, Ary Goldberger, CK Peng. Multiscale entropy
            analysis of biological signals
    """
    if tolerance is None:
        tolerance = 0.1 * np.std(time_series)
    n = len(time_series)
    prev = np.zeros(n)
    curr = np.zeros(n)
    A = np.zeros((sample_length, 1))  # number of matches for m = [1,...,template_length - 1]
    B = np.zeros((sample_length, 1))  # number of matches for m = [1,...,template_length]

    for i in range(n - 1):
        nj = n - i - 1
        ts1 = time_series[i]
        for jj in range(nj):
            j = jj + i + 1
            if abs(time_series[j] - ts1) < tolerance:  # distance between two vectors
                curr[jj] = prev[jj] + 1
                temp_ts_length = min(sample_length, curr[jj])
                for m in range(int(temp_ts_length)):
                    A[m] += 1
                    if j < n - 1:
                        B[m] += 1
            else:
                curr[jj] = 0
        for j in range(nj):
            prev[j] = curr[j]

    N = n * (n - 1) / 2
    B = np.vstack(([N], B[:sample_length - 1]))
    similarity_ratio = A / B
    se = -np.log(similarity_ratio)
    se = np.reshape(se, -1)
    return se
MATLAB Implementation:
function [e,A,B]=sampenc(y,M,r);
%function [e,A,B]=sampenc(y,M,r);
%
%Input
%
%y input data
%M maximum template length
%r matching tolerance
%
%Output
%
%e sample entropy estimates for m=0,1,...,M-1
%A number of matches for m=1,...,M
%B number of matches for m=0,...,M-1 excluding last point

n=length(y);
lastrun=zeros(1,n);
run=zeros(1,n);
A=zeros(M,1);
B=zeros(M,1);
p=zeros(M,1);
e=zeros(M,1);
for i=1:(n-1)
   nj=n-i;
   y1=y(i);
   for jj=1:nj
      j=jj+i;
      if abs(y(j)-y1)<r
         run(jj)=lastrun(jj)+1;
         M1=min(M,run(jj));
         for m=1:M1
            A(m)=A(m)+1;
            if j<n
               B(m)=B(m)+1;
            end
         end
      else
         run(jj)=0;
      end
   end
   for j=1:nj
      lastrun(j)=run(j);
   end
end
N=n*(n-1)/2;
B=[N;B(1:(M-1))];
p=A./B;
e=-log(p);
I've also tried a few other Python implementations, and all of them give the same slow result:
vectorized-sample-entropy
sampen
sampen2.py
Wikipedia sample entropy implementation
I don't think it's a computer issue, as the calculation runs relatively fast in MATLAB.
As far as I can tell, the two implementations are algorithmically identical. I have no idea why the Python implementations are so slow. I would understand a difference of a few seconds, but not such a large discrepancy. Let me know your thoughts on why this is, or suggestions on how to improve the Python versions.
BTW: I'm using Python 3.6.5 with numpy 1.14.5 and MATLAB R2018a.

As said in the comments, MATLAB uses a JIT compiler by default; Python doesn't. In Python you can use Numba to do much the same.
Your code, with slight modifications:
import numba as nb
import numpy as np
import time

@nb.jit(fastmath=True, error_model='numpy')
def sample_entropy(time_series, sample_length, tolerance=None):
    """Calculate and return Sample Entropy of the given time series.

    Distance between two vectors defined as Euclidean distance and can
    be changed in future releases

    Args:
        time_series: Vector or string of the sample data
        sample_length: Number of sequential points of the time series
        tolerance: Tolerance (default = 0.1...0.2 * std(time_series))

    Returns:
        Vector containing Sample Entropy (float)

    References:
        [1] http://en.wikipedia.org/wiki/Sample_Entropy
        [2] http://physionet.incor.usp.br/physiotools/sampen/
        [3] Madalena Costa, Ary Goldberger, CK Peng. Multiscale entropy
            analysis of biological signals
    """
    if tolerance is None:
        tolerance = 0.1 * np.std(time_series)
    n = len(time_series)
    prev = np.zeros(n)
    curr = np.zeros(n)
    A = np.zeros(sample_length)  # number of matches for m = [1,...,template_length - 1]
    B = np.zeros(sample_length)  # number of matches for m = [1,...,template_length]

    for i in range(n - 1):
        nj = n - i - 1
        ts1 = time_series[i]
        for jj in range(nj):
            j = jj + i + 1
            if abs(time_series[j] - ts1) < tolerance:  # distance between two vectors
                curr[jj] = prev[jj] + 1
                temp_ts_length = min(sample_length, curr[jj])
                for m in range(int(temp_ts_length)):
                    A[m] += 1
                    if j < n - 1:
                        B[m] += 1
            else:
                curr[jj] = 0
        for j in range(nj):
            prev[j] = curr[j]

    N = n * (n - 1) // 2
    B2 = np.empty(sample_length)
    B2[0] = N
    B2[1:] = B[:sample_length - 1]
    similarity_ratio = A / B2
    se = -np.log(similarity_ratio)
    return se
Timings
a = np.random.rand(1, 95000)[0] #Python
a = rand(1, 95000) #Matlab
Python 3.6, Numba 0.40dev, Matlab 2016b, Core i5-3210M
Python: 487s
Python+Numba: 12.2s
Matlab: 71.1s
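For reference, the timings above can be reproduced with a small harness like the following (the exact timing code isn't shown in the answer, so this is an assumed sketch; note that the first call also pays Numba's compilation cost):

import time

a = np.random.rand(1, 95000)[0]
t0 = time.time()
se = sample_entropy(a, 4, 0.1 * np.std(a))  # first call includes JIT compile time
print(se)
print('Elapsed:', time.time() - t0, 's')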

Related

Need help speeding up NumPy code that finds the number of 'coincidences' between two NumPy arrays

I am looking for some help speeding up some code that I have written in Numpy. Here is the code:
def TimeChunks(timevalues, num):
    avg = len(timevalues) / float(num)
    out = []
    last = 0.0
    while last < len(timevalues):
        out.append(timevalues[int(last):int(last + avg)])
        last += avg
    return out

### chunk i can be called by out[i] ###
NumChunks = 100000
t1chunks = TimeChunks(t1, NumChunks)
t2chunks = TimeChunks(t2, NumChunks)

NumofBins = 2000
CoincAllChunks = 0
for i in range(NumChunks):
    CoincOneChunk = 0
    Hist1, something1 = np.histogram(t1chunks[i], NumofBins)
    Hist2, something2 = np.histogram(t2chunks[i], NumofBins)
    Mask1 = (Hist1 > 0)
    Mask2 = (Hist2 > 0)
    MaskCoinc = Mask1 * Mask2
    CoincOneChunk = np.sum(MaskCoinc)
    CoincAllChunks = CoincAllChunks + CoincOneChunk
Is there anything that can be done to improve this to make it more efficient for large arrays?
In a nutshell, the purpose of the code is simply to find the average number of "coincidences" between two NumPy arrays representing time values from two channels (divided by some normalisation constant). A "coincidence" occurs when there is at least one time value from each of the two channels in a certain time interval.
For example:
t1 = [.4, .7, 1.1]
t2 = [0.8, .9, 1.5]
There is a coincidence in the window [0,1] and one coincidence in the interval [1, 2].
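A minimal numeric check of this definition (a sketch, using shared bin edges over [0, 2]):

import numpy as np
t1 = np.array([.4, .7, 1.1])
t2 = np.array([0.8, .9, 1.5])
h1, edges = np.histogram(t1, bins=2, range=(0, 2))
h2, _ = np.histogram(t2, bins=edges)
print(np.sum((h1 > 0) & (h2 > 0)))  # 2 -> one coincidence in [0,1] and one in [1,2]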
I want to find the average number of these "coincidences" when I break down my time array into a number of equally distributed bins. So for example if:
t1 = [.4, .7, 1.1, 2.1, 3, 3.3]
t2 = [0.8, .9, 1.5, 2.2, 3.1, 4]
And if I want 4 bins, the intervals I'll consider are ([0,1], [1,2], [2,3], [3,4]). The total number of coincidences will therefore be 4 (because there is a coincidence in each bin), and so the average number of coincidences will be 4.
This code is an attempt to do this for large time arrays with very small bin sizes; as a result, to make it work I had to break my time arrays down into smaller chunks and then for-loop through each of these chunks.
I've tried making this as vectorized as I can, but it is still very slow...
Any ideas what can be done to speed it up further?
Any suggestions/hints will be appreciated. Thanks!
This is 17x faster, and more correct, using a custom numba_histogram function that beats the generic np.histogram. Note that you are computing and comparing the histograms of the two series separately, each over its own range, which is not accurate for your purpose. So in my numba_histogram function I use the same bin edges to compute the histograms of both series simultaneously.
We could optimize it even further if you provide more precise details about the algorithm, namely the parameters and the criteria by which you decide that two intervals coincide.
import numpy as np
from numba import njit

@njit
def numba_histogram(a, b, n):
    hista, histb = np.zeros(n, dtype=np.intp), np.zeros(n, dtype=np.intp)
    a_min, a_max = min(a[0], b[0]), max(a[-1], b[-1])
    for x, y in zip(a, b):
        bin = n * (x - a_min) / (a_max - a_min)
        if x == a_max:
            hista[n - 1] += 1
        elif bin >= 0 and bin < n:
            hista[int(bin)] += 1
        bin = n * (y - a_min) / (a_max - a_min)
        if y == a_max:
            histb[n - 1] += 1
        elif bin >= 0 and bin < n:
            histb[int(bin)] += 1
    return np.sum((hista > 0) * (histb > 0))

@njit
def calc_coincidence(t1, t2):
    NumofBins = 2000
    NumChunks = 100000
    avg = len(t1) / NumChunks
    CoincAllChunks = 0
    last = 0.0
    while last < len(t1):
        t1chunks = t1[int(last):int(last + avg)]
        t2chunks = t2[int(last):int(last + avg)]
        CoincAllChunks += numba_histogram(t1chunks, t2chunks, NumofBins)
        last += avg
    return CoincAllChunks
Test with 10**8 arrays:
t1 = np.arange(10**8) + np.random.rand(10**8)
t2 = np.arange(10**8) + np.random.rand(10**8)
CoincAllChunks = calc_coincidence(t1, t2)
print( CoincAllChunks )
# 34793890 Time: 24.96140170097351 sec. (Original)
# 34734897 Time: 1.499996423721313 sec. (Optimized)

Preventing overflow of large integers in (GPU) optimized methods such as gmpy2 and numba

I am trying to check whether a large integer is a perfect square, using gmpy2 in a JIT-decorated (optimized) numba routine. The example here is for illustrative purposes only (from a theoretical point of view, such equations or elliptic curves can be treated differently/better). My code seems to overflow, since it yields solutions that aren't really solutions:
import numpy as np
from numba import jit
import gmpy2
from gmpy2 import mpz, xmpz
import time
import sys

@jit('void(uint64)')
def findIntegerSolutionsGmpy2(limit: np.uint64):
    for x in np.arange(0, limit + 1, dtype=np.uint64):
        y = mpz(x**6 - 4*x**2 + 4)
        if gmpy2.is_square(y):
            print([x, gmpy2.sqrt(y), y])

def main() -> int:
    limit = 100000000
    start = time.time()
    findIntegerSolutionsGmpy2(limit)
    end = time.time()
    print("Time elapsed: {0}".format(end - start))
    return 0

if __name__ == '__main__':
    sys.exit(main())
Using limit = 1000000000, the routine finishes within approx. 4 seconds. The limit, which I hand over to the decorated function, will not exceed an unsigned 64-bit integer (which seems not to be an issue here).
I read that big integers do not work in combination with numba's JIT optimization (see for example here).
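A minimal illustration of the silent wrap-around (a sketch with an assumed value, not taken from the question):

import numpy as np
x = np.uint64(100000)
print(x ** 6)       # wraps around modulo 2**64 with no warning
print(int(x) ** 6)  # Python ints are arbitrary precision, so this is exact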
My Question:
Is there any possibility to use large integers in (GPU) optimized code?
I have now managed to avoid the loss of precision with the following code:
@jit('void(uint64)')
def findIntegerSolutionsGmpy2(limit: np.uint64):
    for x in np.arange(0, limit + 1, dtype=np.uint64):
        x_ = mpz(int(x))**2
        y = x_**3 - mpz(4)*x_ + mpz(4)
        if gmpy2.is_square(y):
            print([x, gmpy2.sqrt(y), y])
But with limit = 100000000 this amended/fixed routine no longer finishes within 4 seconds; it now takes 912 seconds. Very likely there is an insurmountable gap between precision and speed.
Using CUDA it becomes faster, namely 5 minutes (on a machine with 128 GB RAM, an Intel Xeon E5-2630 v4 CPU at 2.20 GHz, and two Tesla V100 graphics cards with 16 GB RAM each), but alongside the correct results I again obtain wrong results.
%%time
from numba import jit, cuda
import numpy as np
from math import sqrt

@cuda.jit
def findIntegerSolutionsCuda(arr):
    i = 0
    for x in range(0, 1000000000 + 1):
        y = float(x**6 - 4*x**2 + 4)
        sqr = int(sqrt(y))
        if sqr*sqr == int(y):
            arr[i][0] = x
            arr[i][1] = sqr
            arr[i][2] = y
            i += 1

arr = np.zeros((10, 3))
findIntegerSolutionsCuda[128, 255](arr)
print(arr)
The real reason for the wrong results is simple: you forgot to convert x to mpz, so the expression x ** 6 - 4 * x ** 2 + 4 is promoted to the np.uint64 type and computed with overflow (because x in that expression is np.uint64). The fix is trivial, just add x = mpz(x):
@jit('void(uint64)', forceobj=True)
def findIntegerSolutionsGmpy2(limit: np.uint64):
    for x in np.arange(0, limit + 1, dtype=np.uint64):
        x = mpz(x)
        y = mpz(x**6 - 4*x**2 + 4)
        if gmpy2.is_square(y):
            print([x, gmpy2.sqrt(y), y])
You may also notice that I added forceobj = True; this is to suppress Numba compilation warnings at startup.
After this fix everything works fine and you don't see wrong results.
If your task is to check whether the expression gives a strict square, then I have invented and implemented another solution for you, shown below.
It works as follows. Notice that if a number is a square, then it is also a square modulo any number (taking the modulus is the x % N operation).
We can take some number, for example a product of primes, K = 2 * 2 * 3 * 5 * 7 * 11 * 13 * 17 * 19. Now we can build a simple filter: compute all squares modulo K, mark those squares in a bit vector, and then check which numbers modulo K hit ones in this filter bit vector.
The filter K (product of primes) mentioned above keeps only 1% of candidates as possible squares. We can also run a second stage, applying the same kind of filter with other primes, e.g. K2 = 23 * 29 * 31 * 37 * 41, which keeps about 3% of the remaining candidates. In total we are left with 1% * 3% = 0.03% of the initial candidates.
After the two filtering stages only a few numbers remain to be checked. They can easily be fast-checked with gmpy2.is_square().
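To see the idea on a tiny modulus (a sketch, separate from the code below): squares can only leave certain remainders.

squares_mod_16 = sorted({(x * x) % 16 for x in range(16)})
print(squares_mod_16)  # [0, 1, 4, 9] -> any y with y % 16 outside this set cannot be a square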
The filtering stage can easily be wrapped into a Numba function, as I did below. This function can take the extra Numba parameter parallel = True, which tells Numba to run all NumPy operations in parallel on all CPU cores.
In the code I use limit = 1 << 30, which is the limit on the x values to be checked, and block = 1 << 26, which is how many numbers are checked at a time in the parallel Numba function. If you have enough memory, you may set block larger to occupy all CPU cores more efficiently; a block of size 1 << 26 uses around 1 GB of memory.
With this filtering idea and a multi-core CPU, my code solves the same task as yours a hundred times faster.
Try it online!
import numpy as np, numba

@numba.njit('u8[:](u8[:], u8, u8, u1[:])', cache=True, parallel=True)
def do_filt(x, i, K, filt):
    x += i; x %= K
    x2 = x
    x2 *= x2; x2 %= K
    x6 = x2 * x2; x6 %= K
    x6 *= x2; x6 %= K
    x6 += np.uint64(4 * K + 4)
    x2 <<= np.uint64(2)
    x6 -= x2; x6 %= K
    y = x6
    #del x2
    filt_y = filt[y]
    filt_y_i = np.flatnonzero(filt_y).astype(np.uint64)
    return filt_y_i

def main():
    import math
    gmpy2 = None
    import gmpy2
    Int = lambda x: (int(x) if gmpy2 is None else gmpy2.mpz(x))
    IsSquare = lambda x: gmpy2.is_square(x)
    Sqrt = lambda x: Int(gmpy2.sqrt(x))

    Ks = [2 * 2 * 3 * 5 * 7 * 11 * 13 * 17 * 19, 23 * 29 * 31 * 37 * 41]
    filts = []
    for i, K in enumerate(Ks):
        a = np.arange(K, dtype=np.uint64)
        a *= a
        a %= K
        filts.append((K, np.zeros((K,), dtype=np.uint8)))
        filts[-1][1][a] = 1
        print(f'filter {i} ratio', round(len(np.flatnonzero(filts[-1][1])) / K, 4))

    limit = 1 << 30
    block = 1 << 26
    for i in range(0, limit, block):
        print(f'i block {i // block:>3} (2^{math.log2(i + 1):>6.03f})')
        x = np.arange(0, min(block, limit - i), dtype=np.uint64)
        for ifilt, (K, filt) in enumerate(filts):
            len_before = len(x)
            x = do_filt(x, i, K, filt)
            print(f'squares filtered by filter {ifilt}:', round(len(x) / len_before, 4))
        x_to_check = x
        print(f'remain to check {len(x_to_check)}')

        sq_x = []
        for x0 in x_to_check:
            x = Int(i + x0)
            y = x ** 6 - 4 * x ** 2 + 4
            if not IsSquare(y):
                continue
            yr = Sqrt(y)
            assert yr * yr == y
            sq_x.append((int(x), int(yr)))
        print('squares found', len(sq_x))
        print(sq_x)
        del x

if __name__ == '__main__':
    main()
Output:
filter 0 ratio 0.0094
filter 1 ratio 0.0366
i block 0 (2^ 0.000)
squares filtered by filter 0: 0.0211
squares filtered by filter 1: 0.039
remain to check 13803
squares found 2
[(0, 2), (1, 1)]
i block 1 (2^24.000)
squares filtered by filter 0: 0.0211
squares filtered by filter 1: 0.0392
remain to check 13880
squares found 0
[]
i block 2 (2^25.000)
squares filtered by filter 0: 0.0211
squares filtered by filter 1: 0.0391
remain to check 13835
squares found 0
[]
i block 3 (2^25.585)
squares filtered by filter 0: 0.0211
squares filtered by filter 1: 0.0393
remain to check 13907
squares found 0
[]
...............................

Is there a way to increase the line length for an equation in Gekko after receiving "APM model error: string > 15000 characters"?

I'm using Gekko for an optimization problem with constraints that require summations over array variables. Because these arrays are long, I keep getting the error: APM model error: string > 15000 characters
The sum needs to be taken over three indices: i in range(1, years), n in range(1, i), and j in range(1, receptors). As the model compiles, the number of variables included in each summation grows. I want to leave the code as a summation, with the following line:
m.Equation(emissions[:,3] == sum(sum(sum(f[n,j]*-r[j,2]*unit *(.001*(i-n)**2 + 0.062*(i-n)) for i in range(years)) for n in range(i))for j in range(rec)))
However, these constraints cause the error of more than 15,000 characters for a line.
I have previously solved the problem using for loops and intermediates to solve all of these variables outside of the "constraint" environment. It has given me the right answer, but takes a long time to compile the model (upwards of 4 hours for model building, and less than 3 minutes to solve it). The code looked like this:
for i in range(years):
    emissions[i,0] = s[i,1]
    emissions[i,1] = s[i,3]
    emissions[i,2] = s[i,5]
    emissions[i,3] = 0
    emissions[i,4] = 0
    emissions[i,5] = 0
    for n in range(i):
        for j in range(rec):
            # update + binary * flux * conversion * growth
            emissions[i,3] = m.Intermediate(emissions[i,3] + f[n,j] * -rankedcopy[j,2] * unit * (.001*(i-n)**2 + 0.062*(i-n)))
            emissions[i,4] = m.Intermediate(emissions[i,4] + f[n,j] * -rankedcopy[j,3] * unit * (.001*(i-n)**2 + 0.062*(i-n)))
            emissions[i,5] = m.Intermediate(emissions[i,5] + f[n,j] * -rankedcopy[j,4] * unit * (.001*(i-n)**2 + 0.062*(i-n)))
I'm hoping that avoiding the for loops will increase the efficiency which enables me to expand the model, but I'm unsure of a way to increase the APM model string limit.
I am also open to other suggestions of how to embed intermediates into the summation.
Try using the m.sum() function as a built-in GEKKO object. If you use the Python sum function, it creates a large symbolic summation equation that needs to be interpreted at run-time and may exceed the equation size limit. The m.sum() creates the summation in byte-code instead.
m.Equation(emissions[:,3] == \
    m.sum(m.sum(m.sum(f[n,j]*-r[j,2]*unit * (.001*(i-n)**2 + 0.062*(i-n)) \
        for i in range(years)) for n in range(i)) for j in range(rec)))
Here is a simple example that shows the difference in performance.
from gekko import GEKKO
import numpy as np
import time

n = 5000
v = np.linspace(0, n-1, n)

# summation method 1 - Python sum
m = GEKKO()
t = time.time()
s = sum(v)
y = m.Var()
m.Equation(y == s)
m.solve(disp=False)
print(y.value[0])
print('Elapsed time: ' + str(time.time() - t))
m.cleanup()

# summation method 2 - Intermediates
m = GEKKO()
t = time.time()
s = 0
for i in range(n):
    s = m.Intermediate(s + v[i])
y = m.Var()
m.Equation(y == s)
m.solve(disp=False)
print(y.value[0])
print('Elapsed time: ' + str(time.time() - t))
m.cleanup()

# summation method 3 - Gekko sum
m = GEKKO()
t = time.time()
s = m.sum(v)
y = m.Var()
m.Equation(y == s)
m.solve(disp=False)
print(y.value[0])
print('Elapsed time: ' + str(time.time() - t))
m.cleanup()
Results
12497500.0
Elapsed time: 0.17874956130981445
12497500.0
Elapsed time: 5.171698570251465
12497500.0
Elapsed time: 0.1246955394744873
The 15,000 character limit for a single equation is a hard limit. We considered making it adjustable with m.options.MAX_MEMORY, but very large equations can also produce very dense matrix factorizations for the solver. It is often better to break up the equation or use other methods to reduce the equation size.
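For example, one way to keep each equation string short is to accumulate the inner sums as Intermediates and equate only the final totals. A sketch reusing the names from the question (untested against the full model, and guarding the empty inner range when i = 0):

for i in range(years):
    partial = []
    for j in range(rec):
        terms = [f[n, j] * (-r[j, 2]) * unit * (.001 * (i - n)**2 + 0.062 * (i - n))
                 for n in range(i)]
        if terms:  # skip empty inner sums
            partial.append(m.Intermediate(m.sum(terms)))
    m.Equation(emissions[i, 3] == (m.sum(partial) if partial else 0))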

Test for conditional independence in python as part of the PC algorithm

I'm implementing the PC algorithm in Python. The algorithm constructs the graphical model of an n-variate Gaussian distribution. This graphical model is basically the skeleton of a directed acyclic graph, which means that if a structure like:
(x1)---(x2)---(x3)
is in the graph, then x1 is independent of x3 given x2. More generally, if A is the adjacency matrix of the graph and A(i,j) = A(j,i) = 0 (there is a missing edge between i and j), then i and j are conditionally independent given all the variables that appear in any path from i to j. For statistical and machine learning purposes, it is possible to "learn" the underlying graphical model.
If we have enough observations of a jointly Gaussian n-variate random variable, we can use the PC algorithm, which works as follows:
given n as the number of variables observed, initialize the graph as G=K(n)
for each pair i,j of nodes:
    if exists an edge e from i to j:
        look for the neighbours of i
        if j is in neighbours of i then remove j from the set of neighbours
        call the set of neighbours k
        TEST if i and j are independent given the set k, if TRUE:
            remove the edge e from i to j
This algorithm also computes the separating sets of the graph, which are used by another algorithm that constructs the DAG starting from the skeleton and the separation sets returned by the PC algorithm. This is what I've done so far:
def _core_pc_algorithm(a, sigma_inverse):
    l = 0
    N = len(sigma_inverse[0])
    n = range(N)
    sep_set = [[set() for i in n] for j in n]
    act_g = complete(N)
    z = lambda m, i, j: -m[i][j] / ((m[i][i] * m[j][j]) ** 0.5)
    while l < N:
        for (i, j) in itertools.permutations(n, 2):
            adjacents_of_i = adj(i, act_g)
            if j not in adjacents_of_i:
                continue
            else:
                adjacents_of_i.remove(j)
            if len(adjacents_of_i) >= l:
                for k in itertools.combinations(adjacents_of_i, l):
                    if N - len(k) - 3 < 0:
                        return (act_g, sep_set)
                    if test(sigma_inverse, z, i, j, l, a, k):
                        act_g[i][j] = 0
                        act_g[j][i] = 0
                        sep_set[i][j] |= set(k)
                        sep_set[j][i] |= set(k)
        l = l + 1
    return (act_g, sep_set)
a is the tuning parameter alpha with which I test for conditional independence, and sigma_inverse is the inverse of the covariance matrix of the sampled observations. Moreover, my test is:
def test(sigma_inverse, z, i, j, l, a, k):
    def erfinv(x):  # used to approximate the inverse of a gaussian cumulative density function
        sgn = 1
        a = 0.147
        PI = numpy.pi
        if x < 0:
            sgn = -1
        temp = 2 / (PI * a) + numpy.log(1 - x**2) / 2
        add_1 = temp**2
        add_2 = numpy.log(1 - x**2) / a
        add_3 = temp
        rt1 = (add_1 - add_2)**0.5
        rtarg = rt1 - add_3
        return sgn * (rtarg**0.5)

    def indep_test_ijK(K):  # compute partial correlation of i and j given ONE conditioning variable K
        part_corr_coeff_ij = z(sigma_inverse, i, j)  # partial correlation coefficient of i and j
        part_corr_coeff_iK = z(sigma_inverse, i, K)  # partial correlation coefficient of i and K
        part_corr_coeff_jK = z(sigma_inverse, j, K)  # partial correlation coefficient of j and K
        part_corr_coeff_ijK = (part_corr_coeff_ij - part_corr_coeff_iK * part_corr_coeff_jK) / \
            (((1 - part_corr_coeff_iK**2))**0.5 * ((1 - part_corr_coeff_jK**2))**0.5)  # partial correlation coefficient of i and j given K
        return part_corr_coeff_ijK == 0  # i independent from j given K iff partial_correlation(i,j)|K == 0 (under jointly gaussian assumption) [could check if abs is < alpha?]

    def indep_test():
        n = len(sigma_inverse[0])
        phi = lambda p: (2**0.5) * erfinv(2 * p - 1)
        root = (n - len(k) - 3)**0.5
        return root * abs(z(sigma_inverse, i, j)) <= phi(1 - a / 2)

    if l == 0:
        return z(sigma_inverse, i, j) == 0  # i independent from j <=> partial_correlation(i,j) == 0 (under jointly gaussian assumption) [could check if abs is < alpha?]
    elif l == 1:
        return indep_test_ijK(k[0])
    elif l == 2:
        return indep_test_ijK(k[0]) and indep_test_ijK(k[1])  # ASSUMING i,j INDEPENDENT GIVEN Y,Z <=> i,j INDEPENDENT GIVEN Y AND i,j INDEPENDENT GIVEN Z
    else:  # use the independence test with the Fisher z-transform
        return indep_test()
Here z is a lambda that receives a matrix (the inverse of the covariance matrix), an integer i and an integer j, and computes the partial correlation of i and j given all the remaining variables with the following rule (which I read in my teacher's slides):
corr(i,j)|REST = -var^-1(i,j)/sqrt(var^-1(i,i)*var^-1(j,j))
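As a quick numeric illustration of this rule (a sketch, not part of the original code):

import numpy as np

def partial_corr(prec, i, j):
    # partial correlation of i and j given all remaining variables,
    # read directly off the precision matrix (inverse covariance)
    return -prec[i, j] / np.sqrt(prec[i, i] * prec[j, j])

cov = np.array([[1.0, 0.6, 0.3],
                [0.6, 1.0, 0.5],
                [0.3, 0.5, 1.0]])
print(partial_corr(np.linalg.inv(cov), 0, 1))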
The main core of this application is the indep_test() function:
def indep_test():
    n = len(sigma_inverse[0])
    phi = lambda p: (2**0.5) * erfinv(2 * p - 1)
    root = (n - len(k) - 3)**0.5
    return root * abs(z(sigma_inverse, i, j)) <= phi(1 - a / 2)
This function implements a statistical test that uses Fisher's z-transform of the estimated partial correlations. I am using this algorithm in two ways:
Generate data from a linear regression model and compare the learned DAG with the expected one
Read a dataset and learn the underlying DAG
In both cases I do not always get correct results, either because I know the DAG underlying a certain dataset, or because I know the generative model but it does not coincide with the one my algorithm learns. I know perfectly well that this is a non-trivial task, and I may have misunderstood theoretical concepts or made errors even in parts of the code I have omitted here. But first I'd like to know (from someone more experienced than me) whether the test I wrote is right, and whether there are library functions that perform this kind of test; I tried searching but couldn't find any suitable function.
I'll get to the point. The most critical issue in the above code is the following error:
sqrt(n-len(k)-3)*abs(z(sigma_inverse[i][j])) <= phi(1-alpha/2)
I was mistaken about the meaning of n: it is not the size of the precision matrix but the total number of multivariate observations (in my case, 10000 instead of 5). Another wrong assumption is that z(sigma_inverse[i][j]) has to provide the partial correlation of i and j given all the rest. That's not correct: z is Fisher's transform on a proper subset of the precision matrix, which estimates the partial correlation of i and j given K. The correct test is the following:
import math
import numpy as np
from scipy.stats import norm

def ci_test(CM, n, i, j, K):
    # Illustrative wrapper: CM is the correlation matrix, n the number of
    # observations, K the list of conditioning variables (the original
    # fragment relied on these names from the enclosing scope)
    if len(K) == 0:  # no conditioning variables (K has 0 length)
        r = CM[i, j]  # r is the correlation of i and j
    elif len(K) == 1:  # one conditioning variable; not very different, except that we start from the correlation matrix (pandas provides it on a DataFrame)
        k = K[0]
        r = (CM[i, j] - CM[i, k] * CM[j, k]) / math.sqrt((1 - CM[j, k] ** 2) * (1 - CM[i, k] ** 2))  # partial correlation of i and j given k
    else:  # more than one conditioning variable
        CM_SUBSET = CM[np.ix_([i] + [j] + K, [i] + [j] + K)]  # subset of the correlation matrix I'm looking for
        PM_SUBSET = np.linalg.pinv(CM_SUBSET)  # precision matrix of the given subset
        r = -1 * PM_SUBSET[0, 1] / math.sqrt(abs(PM_SUBSET[0, 0] * PM_SUBSET[1, 1]))
    r = min(0.999999, max(-0.999999, r))
    res = math.sqrt(n - len(K) - 3) * 0.5 * math.log1p((2 * r) / (1 - r))  # estimating partial correlation with Fisher's transformation
    return 2 * (1 - norm.cdf(abs(res)))  # obtaining p-value
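A quick usage sketch with synthetic data (hypothetical names, matching the illustrative wrapper above):

import numpy as np
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 5))  # 10000 observations of 5 independent variables
CM = np.corrcoef(X, rowvar=False)
p = ci_test(CM, n=10000, i=0, j=1, K=[2])
print(p)  # independence holds by construction, so the p-value is typically large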
I hope someone finds this helpful.

Is it possible to optimize this dynamic programming code?

This code is taking more than half an hour for a data set of 200000 floats.
import numpy as np

try:
    import progressbar
    pbar = progressbar.ProgressBar(widgets=[progressbar.Percentage(),
        progressbar.Counter('%5d'), progressbar.Bar(), progressbar.ETA()])
except:
    pbar = list

block_length = np.loadtxt('bb.txt.gz')  # get data file from http://filebin.ca/29LbYfKnsKqJ/bb.txt.gz (2MB, 200000 float numbers)
N = len(block_length) - 1

# arrays to store the best configuration
best = np.zeros(N, dtype=float)
last = np.zeros(N, dtype=int)
log = np.log

# Start with first data cell; add one cell at each iteration
for R in pbar(range(N)):
    # Compute fit_vec : fitness of putative last block (end at R)
    #fit_vec = fitfunc.fitness(
    T_k = block_length[:R + 1] - block_length[R + 1]
    #N_k = np.cumsum(x[:R + 1][::-1])[::-1]
    N_k = np.arange(R + 1, 0, -1)
    fit_vec = N_k * (log(N_k) - log(T_k))
    prior = 4 - log(73.53 * 0.05 * ((R + 1) ** -0.478))
    A_R = fit_vec - prior  # fitfunc.prior(R + 1, N)
    A_R[1:] += best[:R]
    i_max = np.argmax(A_R)
    last[R] = i_max
    best[R] = A_R[i_max]

# Now find changepoints by iteratively peeling off the last block
change_points = np.zeros(N, dtype=int)
i_cp = N
ind = N
while True:
    i_cp -= 1
    change_points[i_cp] = ind
    if ind == 0:
        break
    ind = last[ind - 1]
change_points = change_points[i_cp:]

print(edges[change_points])  # show result ('edges' is defined elsewhere in the original script)
The first loop is very slow because the length of arrays is R at every iteration, i.e. increasing, leading to N^2 complexity.
Is there any way to optimize this code further, e.g. through pre-computation? I am also happy with solutions using other programming languages.
I can replicate A_R (up to the fit - prior step) as an upper triangular NxN matrix with:
def trilog(n):
    nn = n[:-1, None] - n[None, 1:]
    nn[np.tril_indices_from(nn, -1)] = 1
    return nn

T_k = trilog(block_length)
N_k = trilog(-np.arange(N + 1))
fit_vec = N_k * (np.log(N_k) - np.log(T_k))

R = np.arange(N) + 1
prior = 4 - np.log(73.53 * 0.05 * (R ** -0.478))
A_R = fit_vec - prior
A_R = np.triu(A_R, 0)
print(A_R)
I haven't worked through the logic of calculating and applying best.
I've only done this with small arrays. For your full problem, the corresponding matrix is too large for my memory:
B = np.ones((200000, 200000), float)  # 200000**2 doubles is about 320 GB
So just from memory considerations, you might be stuck with the for R in range(N) iteration.
