I want to vectorise the triple sum
\sum_{i=1}^I\sum_{j=1}^J\sum_{m=1}^J a_{ijm}
such that I end up with a matrix
A \in \mathbb{R}^{I \times J}
where A_{kl} = \sum_{i=1}^k\sum_{j=1}^l\sum_{m=1}^l a_{ijm} for k = 1,...,I and l = 1, ...,J
carrying forward the sums to avoid pointless recomputation.
I currently use this code:
np.cumsum(np.cumsum(np.cumsum(a, axis = 0), axis = 1), axis = 2).diagonal(axis1 = 1, axis2 = 2)
but it is inefficient as it does lots of extra work and extracts the correct matrix at the end with the diagonal method. I can't think of how to make this faster.
The main challenge here is to compute the inner two sums, i.e. the sums over the square slices of each matrix anchored at its top-left corner. The final sum is then just a cumsum on top of that along the 0th axis.
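To make that decomposition concrete, here is a small self-contained check (sizes chosen arbitrarily) that entry (i, l) of the diagonal of the double cumsum over axes 1 and 2 is exactly the sum of the top-left (l+1)x(l+1) slice of a[i]:
import numpy as np

a = np.random.rand(4, 5, 5)
inner = np.cumsum(np.cumsum(a, axis=1), axis=2).diagonal(axis1=1, axis2=2)
ref = np.array([[a[i, :l+1, :l+1].sum() for l in range(5)] for i in range(4)])
assert np.allclose(inner, ref)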
Setup:
import numpy as np
I, J = 100, 100
arr = np.random.rand(I, J, J)
Your implementation:
%%timeit
out = np.cumsum(np.cumsum(np.cumsum(arr, axis = 0), axis = 1), axis = 2).diagonal(axis1 = 1, axis2 = 2)
# 10.9 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Your implementation improved by taking the diagonal before cumsumming over the 0th axis:
%%timeit
out = arr.cumsum(axis=1).cumsum(axis=2).diagonal(axis1=1, axis2=2).cumsum(axis=0)
# 6.25 ms ± 34.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Finally, some tril/triu trickery:
%%timeit
out = np.cumsum(np.cumsum(np.tril(arr, k=-1).sum(axis=2) + np.triu(arr).sum(axis=1), axis=1), axis=0)
# 3.15 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This works because np.tril(arr, k=-1).sum(axis=2) collects, for each (i, j), the entries with m < j, while np.triu(arr).sum(axis=1) collects, for each (i, m), the entries with j <= m; the cumsum along axis 1 therefore counts every pair (j, m) with j, m <= l exactly once. This is already better, but admittedly still not ideal. I don't see a better way to compute the inner two sums noted above with pure numpy.
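For reference, a quick agreement check of the three variants on a small, arbitrarily-sized array:
small = np.random.rand(7, 6, 6)
v1 = np.cumsum(np.cumsum(np.cumsum(small, axis=0), axis=1), axis=2).diagonal(axis1=1, axis2=2)
v2 = small.cumsum(axis=1).cumsum(axis=2).diagonal(axis1=1, axis2=2).cumsum(axis=0)
v3 = np.cumsum(np.cumsum(np.tril(small, k=-1).sum(axis=2) + np.triu(small).sum(axis=1), axis=1), axis=0)
assert np.allclose(v1, v2) and np.allclose(v1, v3)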
You can use Numba to produce a very fast implementation. Here is the code:
import numba as nb
import numpy as np

@nb.njit('(float64[:,:,::1],)', parallel=True)
def compute(arr):
    ni, nj, nk = arr.shape
    assert nj == nk
    result = np.empty((ni, nj))
    # Parallel cumsum along axes 1 and 2 + extraction of the diagonal
    for i in nb.prange(ni):
        tmp = np.zeros(nk)
        for j in range(nj):
            for k in range(nk):
                tmp[k] += arr[i, j, k]
            result[i, j] = np.sum(tmp[:j+1])
    # Cumsum along axis 0
    for i in range(1, ni):
        for k in range(nk):
            result[i, k] += result[i-1, k]
    return result

result = compute(arr)
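A quick sanity check against the original NumPy one-liner (using the arr from the setup above):
expected = np.cumsum(np.cumsum(np.cumsum(arr, axis=0), axis=1), axis=2).diagonal(axis1=1, axis2=2)
assert np.allclose(compute(arr), expected)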
Here are performance results on my 6-core i5-9600KF with a 100x100x100 float64 input array:
Initial code: 12.7 ms
Chryophylaxs v1: 7.1 ms
Chryophylaxs v2: 5.5 ms
Numba: 0.2 ms
This implementation is significantly faster than all the others. It is about 64 times faster than the initial implementation. It is also essentially optimal on my machine, since it completely saturates the bandwidth of my RAM just reading the input array (which is unavoidable). Note that it is better not to use multiple threads for very small arrays.
Note that this code also uses far less memory, as it only needs 8 * nk * num_threads bytes of temporary storage, as opposed to 16 * ni * nj * nk bytes for the initial solution.
I have a program for a simulation, and inside the program I have a function. I have realized that this function consumes most of the simulation time, so I am trying to optimize the function first. The function is as follows.
Julia version 1.1:
function fun_jul(M, ksi, xi, x)
    F(n, x) = sin(n*pi*(x+1)/2) * cos(n*pi*(x+1)/2)
    K = length(ksi)
    Z = zeros(length(x), K)
    for n in 1:M
        for k in 1:K
            for l in 1:length(x)
                Z[l, k] += (1 - (n/(M+1))^2)^xi * F(n, ksi[k]) * F(n, x[l])
            end
        end
    end
    return Z
end
I also rewrote the above function in Python + Numba for comparison, as follows.
Python+numba
import numpy as np
from numba import prange, jit
@jit(nopython=True, parallel=True)
def fun_py(M, ksi, xi, x):
    K = len(ksi)
    F = lambda nn, xx: np.sin(nn*np.pi*(xx+1)/2) * np.cos(nn*np.pi*(xx+1)/2)
    Z = np.zeros((len(x), K))
    for n in range(1, M+1):
        for k in prange(0, K):
            Z[:, k] += (1 - (n/(M+1))**2)**xi * F(n, ksi[k]) * F(n, x)
    return Z
But the Julia code is very slow. Here are my results:
Julia results:
using BenchmarkTools
N=400; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul(M,cc,xi,x)
BenchmarkTools.Trial:
memory estimate: 1.22 MiB
allocs estimate: 2
--------------
minimum time: 25.039 s (0.00% GC)
median time: 25.039 s (0.00% GC)
mean time: 25.039 s (0.00% GC)
maximum time: 25.039 s (0.00% GC)
--------------
samples: 1
evals/sample: 1
Python results:
N=400;a = -0.5;b = 0.5;x = np.linspace(a,b,N);cc = x;M = 2*N + 100;xi = M/40;
%timeit fun_py(M,cc,xi,x);
1.2 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Any help on improving the codes both for julia and python+numba would be appreciated.
Updated
Based on @Przemyslaw Szufel's answer and the other posts, I have improved the Numba and Julia codes. Now both are parallelized. Here are the timings.
Python+Numba times:
@jit(nopython=True, parallel=True)
def fun_py(M, ksi, xi, x):
    K = len(ksi)
    F = lambda nn, xx: np.sin(nn*np.pi*(xx+1)/2) * np.cos(nn*np.pi*(xx+1)/2)
    Z = np.zeros((K, len(x)))
    for n in range(1, M+1):
        pw = (1 - (n/(M+1))**2)**xi
        f = F(n, x)
        for k in prange(0, K):
            Z[k, :] = Z[k, :] + pw * F(n, ksi[k]) * f
    return Z
N=1000; a=-0.5; b=0.5; x=np.linspace(a,b,N); cc=x; M = 2*N+100; xi = M/40;
%timeit fun_py(M,cc,xi,x);
733 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Julia times
N=1000; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul2(M,cc,xi,x)
BenchmarkTools.Trial:
memory estimate: 40.31 MiB
allocs estimate: 6302
--------------
minimum time: 705.470 ms (0.17% GC)
median time: 726.403 ms (0.17% GC)
mean time: 729.032 ms (1.68% GC)
maximum time: 765.426 ms (5.27% GC)
--------------
samples: 7
evals/sample: 1
I got down to 300 ms on a single thread (instead of 28 s on my machine) with the following code.
You are using multi-threading for Numba. In Julia you should use parallel processing (multi-threading support is experimental for Julia). It seems that your code is doing some kind of parameter sweep; such codes are very easy to parallelize, but this usually requires some adjustments to your computational process.
Here is the code:
function fun_jul2(M, ksi, xi, x)
    F(n, x) = sin(n*pi*(x+1))/2  # sin(a)*cos(a) == sin(2a)/2
    K = length(ksi)
    L = length(x)
    Z = zeros(length(x), K)
    for n in 1:M
        F_im1 = [F(n, ksi[k]) for k in 1:K]
        F_im2 = [F(n, x[l]) for l in 1:L]
        pow = (1 - (n/(M+1))^2)^xi
        for k in 1:K
            for l in 1:L
                Z[l, k] += pow * F_im1[k] * F_im2[l]
            end
        end
    end
    Z
end
julia> fun_jul2(M,cc,xi,x) ≈ fun_jul(M,cc,xi,x)
true
julia> @time fun_jul2(M,cc,xi,x);
0.305269 seconds (1.81 k allocations: 6.934 MiB, 1.60% gc time)
EDIT: with multithreading and @inbounds, as suggested by DNF:
function fun_jul3(M, ksi, xi, x)
    F(n, x) = sin(n*pi*(x+1))/2
    K = length(ksi)
    L = length(x)
    Z = zeros(length(x), K)
    for n in 1:M
        F_im1 = [F(n, ksi[k]) for k in 1:K]
        F_im2 = [F(n, x[l]) for l in 1:L]
        pow = (1 - (n/(M+1))^2)^xi
        Threads.@threads for k in 1:K
            for l in 1:L
                @inbounds Z[l, k] += pow * F_im1[k] * F_im2[l]
            end
        end
    end
    Z
end
And now the running time (remember to run set JULIA_NUM_THREADS=4, or the Linux equivalent, before launching Julia):
julia> fun_jul2(M,cc,xi,x) ≈ fun_jul3(M,cc,xi,x)
true
julia> @time fun_jul3(M,cc,xi,x);
0.051470 seconds (2.71 k allocations: 6.989 MiB)
You could also try to experiment further with parallelizing the computation of F_im1 and F_im2.
You can do, or fail to do, loop optimization in any language that has loops. The major difference here is that the Numba code is vectorized for the inner loop but the Julia code is not. To vectorize the Julia version, it is sometimes necessary to change operators to their vectorized versions with the dot, so that + becomes .+, for example.
Since I cannot get Numba to install properly on my older Windows 10 machine, I ran the code versions below on free Linux versions on the Web. This means I had to use the Python interface for timeit(), not the command line.
Run in Jupyter at mybinder, probably with 1 thread since it is not specified:
import timeit
timeit.timeit("""
@jit(nopython=True, parallel=True)
def fun_py(M, ksi, xi, x):
    K = len(ksi)
    F = lambda nn, xx: np.sin(nn*np.pi*(xx+1)/2) * np.cos(nn*np.pi*(xx+1)/2)
    Z = np.zeros((len(x), K))
    for n in range(1, M+1):
        for k in prange(0, K):
            Z[:, k] += (1 - (n/(M+1))**2)**xi * F(n, ksi[k]) * F(n, x)
    return Z

N = 400; a = -0.5; b = 0.5; x = np.linspace(a, b, N); cc = x; M = 2*N + 100; xi = M/40
fun_py(M, cc, xi, x)
""", setup="import numpy as np; from numba import prange, jit", number=5)
Out[1]: 61.07768889795989
Your machine must be a lot faster than Jupyter, ForBonder.
I ran the optimized Julia code version below in Jupyter on JuliaBox, with a 1-thread kernel specified:
using BenchmarkTools
F(n, x) = sinpi.(n * (x .+ 1) / 2) .* cospi.(n * (x .+ 1) / 2)
function fun_jul2(M, ksi, xi, x)
    K = length(ksi)
    Z = zeros(length(x), K)
    for n in 1:M, k in 1:K
        Z[:, k] .+= (1 - (n / (M + 1))^2)^xi * F(n, ksi[k]) * F(n, x)
    end
    return Z
end
const N=400; const a=-0.5; const b=0.5; const x=range(a,b,length=N);
const cc=x; const M = 2*N+100; const xi = M/40;
@btime fun_jul2(M, cc, xi, x)
8.076 s (1080002 allocations: 3.35 GiB)
For performance, just precompute the trigonometric part.
Indeed, sin is a costly operation:
%timeit np.sin(1.)
712 ns ± 2.22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit 1.2*3.4
5.88 ns ± 0.016 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
In Python:
@jit
def fun_py2(M, ksi, xi, x):
    NN = np.arange(1, M+1)
    Fksi = np.sin(np.pi*np.outer(NN, ksi+1))/2  # sin(a)cos(a) is sin(2a)/2
    Fx = np.sin(np.pi*np.outer(NN, x+1))/2
    U = (1 - (NN/(M+1))**2)**xi
    Z = np.zeros((len(x), len(ksi)))
    for n in range(len(NN)):
        for k in range(len(ksi)):
            for l in range(len(x)):
                Z[l, k] += U[n] * Fksi[n, k] * Fx[n, l]  # Z has shape (len(x), len(ksi)), so index as [l, k]
    return Z
For a 30x improvement:
np.allclose(fun_py(M,cc,xi,x),fun_py2(M,cc,xi,x))
True
%timeit fun_py(M,cc,xi,x)
1.14 s ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit fun_py2(M,cc,xi,x)
29.5 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This doesn't trigger any parallelism. I suppose the same would apply to Julia.
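If parallelism is still wanted on top of the precomputation, one option is to move the prange to the outer output dimension, whose rows are independent. This is an untested sketch (fun_py3 is a hypothetical name), written with explicit loops so it stays in nopython mode:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def fun_py3(M, ksi, xi, x):
    K, L = len(ksi), len(x)
    U = np.empty(M)
    Fksi = np.empty((M, K))
    Fx = np.empty((M, L))
    for n in range(M):
        # Precompute the trig part once per n, as in fun_py2.
        U[n] = (1.0 - ((n + 1) / (M + 1)) ** 2) ** xi
        for k in range(K):
            Fksi[n, k] = np.sin(np.pi * (n + 1) * (ksi[k] + 1)) / 2
        for l in range(L):
            Fx[n, l] = np.sin(np.pi * (n + 1) * (x[l] + 1)) / 2
    Z = np.zeros((L, K))
    for l in prange(L):  # each row of Z is independent, so this is race-free
        for n in range(M):
            for k in range(K):
                Z[l, k] += U[n] * Fksi[n, k] * Fx[n, l]
    return Z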
I have an array of ~13GB. I call numpy.var on it to compute the variance. However, it allocates another ~13GB to do this. Why does it need O(N) space? Or am I calling numpy.var in a wrong way?
import numpy as np
# data = ...
print('Variance: ', np.var(data))
NumPy will create an intermediate array to compute abs(data - data.mean()) ** 2 in order to compute the variance. You can write your own variance function with a loop and make it fast with Numba:
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def var_nb(a, ddof=0):
    n = len(a)
    m = a.sum() / n  # the mean always divides by n; ddof only affects the variance denominator
    v = 0.0
    for i in nb.prange(n):
        v += abs(a[i] - m) ** 2
    return v / (n - ddof)
np.random.seed(100)
a = np.random.rand(100_000)
print(np.var(a))
# 0.08349747560941487
print(var_nb(a))
# 0.08349747560941487
%timeit np.var(a)
# 143 µs ± 414 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit var_nb(a)
# 40.2 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This is faster without parallelization:
import numpy as np

def var(a: np.ndarray, axis: int = 0):
    # keepdims makes the subtraction broadcast correctly for any axis,
    # and a.shape[axis] is the correct element count along the reduced axis
    m = a.sum(axis=axis, keepdims=True) / a.shape[axis]
    return np.sum(abs(a - m) ** 2, axis=axis) / a.shape[axis]
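If the main concern is the ~13 GB temporary rather than speed, another option is a chunked two-pass variance in plain NumPy; a minimal sketch (the chunk size is an arbitrary choice):
import numpy as np

def var_chunked(a, chunk=1_000_000, ddof=0):
    m = a.sum() / len(a)  # first pass: the mean, with no full-size temporary
    ssq = 0.0
    for start in range(0, len(a), chunk):
        d = a[start:start + chunk] - m  # temporary of at most `chunk` elements
        ssq += np.dot(d, d)             # sum of squared deviations for this chunk
    return ssq / (len(a) - ddof)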
I want to multiply B = A @ A.T in numpy. Obviously, the answer would be a symmetric matrix (i.e. B[i, j] == B[j, i]).
However, it is not clear to me how to leverage this easily to cut the computation time down in half (by only computing the lower triangle of B and then using that to get the upper triangle for free).
Is there a way to perform this optimally?
As noted in @PaulPanzer's link, dot can detect this case. Here's the timing proof:
In [355]: A = np.random.rand(1000,1000)
In [356]: timeit A.dot(A.T)
57.4 ms ± 960 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [357]: B = A.T.copy()
In [358]: timeit A.dot(B)
98.6 ms ± 805 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy dot too clever about symmetric multiplications
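If you would rather exploit the symmetry explicitly than rely on dot detecting it, one possibility (a sketch, assuming SciPy is available) is the BLAS routine syrk, which computes only one triangle of A @ A.T:
import numpy as np
from scipy.linalg.blas import dsyrk

A = np.random.rand(1000, 1000)
C = dsyrk(1.0, A)                    # upper triangle of A @ A.T; the rest stays zero
B = np.triu(C) + np.triu(C, k=1).T   # mirror to obtain the full symmetric matrix
assert np.allclose(B, A @ A.T)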
You can always use sklearn's pairwise_distances.
Usage:
from sklearn.metrics.pairwise import pairwise_distances
gram = pairwise_distances(x, metric=metric)
Where metric is a callable or a string defining one of their implemented metrics (full list in the link above)
But, I wrote this for myself a while back so I can share what I did:
import numpy as np

def computeGram(elements, dist):
    n = len(elements)
    gram = np.zeros([n, n])
    for i in range(n):
        for j in range(i + 1):
            gram[i, j] = dist(elements[i], elements[j])
    # mirror the lower triangle into the upper triangle
    upTriIdxs = np.triu_indices(n)
    gram[upTriIdxs] = gram.T[upTriIdxs]
    return gram
Where dist is a callable, in your case np.inner
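A hypothetical usage example, treating the rows of a matrix as the elements:
X = np.random.rand(100, 32)
gram = computeGram(X, np.inner)  # np.inner on two rows is their dot product
assert np.allclose(gram, X @ X.T)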
In my project (about clustering algorithms, specifically k-medoids), it is crucial to be able to compute pairwise distances efficiently. I have a dataset of ~60,000 objects. The problem is that distances must be computed between inhomogeneous vectors, i.e. vectors which may differ in length (in that case, missing items are treated as if they were 0).
Here is a minimal working example:
# %%
MAX_LEN = 11
N = 100

import random

def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist

def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])

data = []
for i in range(N):
    data.append([])
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))

%timeit compute_distances()
import numpy as np

def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()

def compute_distances_np():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])

data_np = [np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data]

%timeit compute_distances_np()
I was testing my Python lists implementation versus a numpy implementation.
Here are the results (computation times):
Python lists: 79.6 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
numpy arrays: 226 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Why is there such a huge difference? I thought numpy arrays were supposed to be really fast.
Is there a way to improve my code? Am I misunderstanding the inner workings of numpy?
Edit: I may need, in the future, to be able to use a custom distance function for pairwise distance computations. The method should also work for data sets of length 60'000 without running out of memory.
I believe you can just make your arrays dense and set the unused last elements to 0s.
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

def batch_pdist(x, metric, batchsize=1000):
    # compute the full distance matrix in batchsize x batchsize blocks
    dists = np.zeros((len(x), len(x)))
    for i in range(0, len(x), batchsize):
        for j in range(0, len(x), batchsize):
            dist_batch = cdist(x[i:i+batchsize], x[j:j+batchsize], metric=metric)
            dists[i:i+batchsize, j:j+batchsize] = dist_batch
    return dists

MIN_LEN = 5
MAX_LEN = 11
N = 10000
M = 10

data = np.zeros((N, MAX_LEN))
for i in range(N):
    num_nonzero = np.random.randint(MIN_LEN, MAX_LEN)
    data[i, :num_nonzero] = np.random.randint(0, M, num_nonzero)

dists = squareform(pdist(data, metric='cityblock'))
dists2 = batch_pdist(data, metric='cityblock', batchsize=500)
print((dists == dists2).all())
Timing Output:
%timeit squareform(pdist(data, metric='cityblock'))
43.8 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Edit:
For a custom distance function see the very bottom of this documentation.
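As a sketch of that callable form (illustrative only; string metrics like 'cityblock' are much faster), using the data and dists from above:
# pdist accepts any Python callable as the metric:
custom = squareform(pdist(data, metric=lambda u, v: np.abs(u - v).sum()))
assert np.allclose(custom, dists)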
I finally found probably the most straightforward way to solve this problem without changing the code too much, relying solely on computation rather than on memory (since that could be unfeasible for very large datasets).
Based on juanpa.arrivillaga's suggestion, I tried Numba, a library that speeds up array-oriented and math-heavy Python code and is targeted mainly at numpy. You can read a good guide on optimizing Python code here: https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/.
MAX_LEN = 11
N = 100

# Pure Python lists implementation.
import random

def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist

def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])

data = []
for i in range(N):
    data.append([])
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))

%timeit compute_distances()

# numpy + numba implementation.
import numpy as np
from numba import jit

@jit
def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()

@jit
def compute_distances_np():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])

data_np = np.array([np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data])

%timeit compute_distances_np()
Timing output:
%timeit compute_distances()
78.4 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit compute_distances_np()
4.1 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, the numpy + numba implementation is about 19 times faster (with no other code optimization involved).
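A possible further step, an untested sketch with a hypothetical compute_distances_nb, is to let Numba parallelize the outer loop in nopython mode and actually store the result matrix:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def compute_distances_nb(data_np):
    n = len(data_np)
    out = np.empty((n, n))
    for i in prange(n):  # rows are independent, so they can be computed in parallel
        for j in range(n):
            out[i, j] = np.abs(data_np[i] - data_np[j]).sum()
    return out

dist_matrix = compute_distances_nb(data_np)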