I have a simulation program, and I have found that one function inside it accounts for most of the run time. So I am trying to optimize that function first. The function is as follows.
Julia version 1.1:
function fun_jul(M,ksi,xi,x)
    F(n,x) = sin(n*pi*(x+1)/2)*cos(n*pi*(x+1)/2);
    K = length(ksi);
    Z = zeros(length(x),K);
    for n in 1:M
        for k in 1:K
            for l in 1:length(x)
                Z[l,k] += (1-(n/(M+1))^2)^xi*F(n,ksi[k])*F(n,x[l]);
            end
        end
    end
    return Z
end
I also rewrote the function in Python with Numba for comparison:
import numpy as np
from numba import prange, jit
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):
    K = len(ksi);
    F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);
    Z = np.zeros((len(x),K));
    for n in range(1,M+1):
        for k in prange(0,K):
            Z[:,k] += (1-(n/(M+1))**2)**xi*F(n,ksi[k])*F(n,x);
    return Z
But the Julia code is very slow. Here are my results:
Julia results:
using BenchmarkTools
N=400; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul(M,cc,xi,x)
memory estimate: 1.22 MiB
allocs estimate: 2
minimum time: 25.039 s (0.00% GC)
median time: 25.039 s (0.00% GC)
mean time: 25.039 s (0.00% GC)
maximum time: 25.039 s (0.00% GC)
samples: 1
evals/sample: 1
Python results:
N=400;a = -0.5;b = 0.5;x = np.linspace(a,b,N);cc = x;M = 2*N + 100;xi = M/40;
%timeit fun_py(M,cc,xi,x);
1.2 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Any help improving both the Julia and the Python+Numba code would be appreciated.
Based on @Przemyslaw Szufel's answer and the other posts, I have improved both the Numba and the Julia code. Both are now parallelized. Here are the timings:
Python+Numba times:
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):
    K = len(ksi);
    F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);
    Z = np.zeros((K,len(x)));
    for n in range(1,M+1):
        pw = (1-(n/(M+1))**2)**xi; f=F(n,x)
        for k in prange(0,K):
            Z[k,:] = Z[k,:] + pw*F(n,ksi[k])*f;
    return Z
N=1000; a=-0.5; b=0.5; x=np.linspace(a,b,N); cc=x; M = 2*N+100; xi = M/40;
%timeit fun_py(M,cc,xi,x);
733 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Julia times
N=1000; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul2(M,cc,xi,x)
memory estimate: 40.31 MiB
allocs estimate: 6302
minimum time: 705.470 ms (0.17% GC)
median time: 726.403 ms (0.17% GC)
mean time: 729.032 ms (1.68% GC)
maximum time: 765.426 ms (5.27% GC)
samples: 7
evals/sample: 1
I got down to 300ms on a single thread (instead of 28s on my machine) with the following code.
You are using multi-threading for Numba. In Julia you should use parallel processing (multi-threading support is experimental for Julia). Your code appears to be doing some kind of parameter sweep. Such code is very easy to parallelize, but it usually requires some adjustments to your computational process.
Here is the code:
function fun_jul2(M,ksi,xi,x)
    F(n,x) = sin(n*pi*(x+1))/2;
    K = length(ksi);
    L = length(x);
    Z = zeros(length(x),K);
    for n in 1:M
        F_im1= [F(n,ksi[k]) for k in 1:K]
        F_im2 = [F(n,x[l]) for l in 1:L]
        pow = (1-(n/(M+1))^2)^xi
        for k in 1:K
            for l in 1:L
                Z[l,k] += pow*F_im1[k]*F_im2[l];
            end
        end
    end
    Z
end
julia> fun_jul2(M,cc,xi,x) ≈ fun_jul(M,cc,xi,x)
julia> @time fun_jul2(M,cc,xi,x);
0.305269 seconds (1.81 k allocations: 6.934 MiB, 1.60% gc time)
** EDIT: with multithreading and inbounds suggested by DNF:
function fun_jul3(M,ksi,xi,x)
    F(n,x) = sin(n*pi*(x+1))/2;
    K = length(ksi);
    L = length(x);
    Z = zeros(length(x),K);
    for n in 1:M
        F_im1= [F(n,ksi[k]) for k in 1:K]
        F_im2 = [F(n,x[l]) for l in 1:L]
        pow = (1-(n/(M+1))^2)^xi
        Threads.@threads for k in 1:K
            for l in 1:L
                @inbounds Z[l,k] += pow*F_im1[k]*F_im2[l];
            end
        end
    end
    Z
end
And now the running time (remember to run set JULIA_NUM_THREADS=4 or Linux equivalent before launching Julia):
julia> fun_jul2(M,cc,xi,x) ≈ fun_jul3(M,cc,xi,x)
julia> @time fun_jul3(M,cc,xi,x);
0.051470 seconds (2.71 k allocations: 6.989 MiB)
You could also experiment with parallelizing the computation of F_im1 and F_im2.
You can do, or fail to do, loop optimization in any language that has loops. The major difference here is that the numba code is vectorized for the inner loop but the Julia code is not. To vectorize the Julia version, it is sometimes necessary to change operators to their vectorized versions with the ., so that + becomes .+ for example.
Since I cannot get Numba to install properly on my older Windows 10 machine, I ran the code versions below on free Linux versions on the Web. This means I had to use the Python interface for timeit(), not the command line.
Run in Jupyter at mybinder, probably with 1 thread since it is not specified:
import timeit
timeit.timeit("""
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):
    K = len(ksi);
    F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);
    Z = np.zeros((len(x),K));
    for n in range(1,M+1):
        for k in prange(0,K):
            Z[:,k] += (1-(n/(M+1))**2)**xi*F(n,ksi[k])*F(n,x);
    return Z
N=400; a = -0.5; b = 0.5; x = np.linspace(a,b,N); cc = x;M = 2*N + 100; xi = M/40;
fun_py(M,cc,xi,x)
""", setup ="import numpy as np; from numba import prange, jit", number=5)
Out[1]: 61.07768889795989
Your machine must be a lot faster than Jupyter, ForBonder.
I ran the optimized Julia version below in Jupyter on JuliaBox, with a 1-thread kernel specified:
using BenchmarkTools
F(n, x) = sinpi.(n * (x .+ 1) / 2) .* cospi.(n * (x .+ 1) / 2)
function fun_jul2(M, ksi, xi, x)
    K = length(ksi)
    Z = zeros(length(x), K)
    for n in 1:M, k in 1:K
        Z[:, k] .+= (1 - (n / (M + 1))^2)^xi * F(n, ksi[k]) * F(n, x)
    end
    return Z
end
const N=400; const a=-0.5; const b=0.5; const x=range(a,b,length=N);
const cc=x; const M = 2*N+100; const xi = M/40;
@btime fun_jul2(M, cc, xi, x)
8.076 s (1080002 allocations: 3.35 GiB)
For performance, just precompute the trigonometric part.
Indeed, sin is a costly operation:
%timeit np.sin(1.)
712 ns ± 2.22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit 1.2*3.4
5.88 ns ± 0.016 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
In Python:
def fun_py2(M,ksi,xi,x):
    NN = np.arange(1,M+1)
    Fksi = np.sin(np.pi*np.outer(NN,ksi+1))/2 # sin(a)cos(a) is sin(2a)/2
    Fx = np.sin(np.pi*np.outer(NN,x+1))/2
    U = (1-(NN/(M+1))**2)**xi
    Z = np.zeros((len(x),len(ksi)))
    for n in range(len(NN)):
        for k in range(len(ksi)):
            for l in range(len(x)):
                Z[l,k] += U[n] * Fksi[n,k] * Fx[n,l]; # rows follow x, columns follow ksi, matching the shape of Z
    return Z
For a 30x improvement:
%timeit fun_py(M,cc,xi,x)
1.14 s ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit fun_py2(M,cc,xi,x)
29.5 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This doesn't trigger any parallelism. I suppose the same will hold for Julia.
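Once the trigonometric factors are precomputed, the sum over n has the form of a matrix product, so it can be handed to NumPy's matmul instead of an explicit triple loop. The following is only a sketch of that idea, not code from the answer above: the names fun_matmul and fun_reference are made up for illustration, and the check uses deliberately small inputs.

```python
import numpy as np

def fun_matmul(M, ksi, xi, x):
    """Hypothetical NumPy-only variant: the sum over n written as a matrix product."""
    n = np.arange(1, M + 1)
    # sin(a)cos(a) = sin(2a)/2 with a = n*pi*(x+1)/2
    F_ksi = np.sin(np.pi * np.outer(n, ksi + 1)) / 2   # shape (M, K)
    F_x = np.sin(np.pi * np.outer(n, x + 1)) / 2       # shape (M, L)
    U = (1 - (n / (M + 1)) ** 2) ** xi                 # shape (M,)
    # Z[l, k] = sum_n U[n] * F_x[n, l] * F_ksi[n, k]
    return (F_x * U[:, None]).T @ F_ksi

def fun_reference(M, ksi, xi, x):
    """Direct triple loop from the question, used only to check small inputs."""
    F = lambda n, t: np.sin(n * np.pi * (t + 1) / 2) * np.cos(n * np.pi * (t + 1) / 2)
    Z = np.zeros((len(x), len(ksi)))
    for n in range(1, M + 1):
        for k in range(len(ksi)):
            for l in range(len(x)):
                Z[l, k] += (1 - (n / (M + 1)) ** 2) ** xi * F(n, ksi[k]) * F(n, x[l])
    return Z

x_small = np.linspace(-0.5, 0.5, 7)
ksi_small = np.linspace(-0.4, 0.3, 5)
Z_fast = fun_matmul(20, ksi_small, 0.5, x_small)
Z_ref = fun_reference(20, ksi_small, 0.5, x_small)
print(np.allclose(Z_fast, Z_ref))
```

Whether this beats the jitted loop depends on the BLAS build and the problem size, so it is worth timing on the real inputs.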
I want to vectorise the triple sum
\sum_{i=1}^I\sum_{j=1}^J\sum_{m=1}^J a_{ijm}
such that I end up with a matrix
A \in \mathbb{R}^{I \times J}
where A_{kl} = \sum_{i=1}^k\sum_{j=1}^l\sum_{m=1}^l a_{ijm} for k = 1,...,I and l = 1, ...,J
carrying forward the sums to avoid pointless recomputation.
I currently use this code:
np.cumsum(np.cumsum(np.cumsum(a, axis = 0), axis = 1), axis = 2).diagonal(axis1 = 1, axis2 = 2)
but it is inefficient as it does lots of extra work and extracts the correct matrix at the end with the diagonal method. I can't think of how to make this faster.
The main challenge here is to compute the inner two sums, i.e. the sum of the square slices of a matrix originating from the top left. The final sum is just a cumsum on top of that along the 0th axis.
import numpy as np
I, J = 100, 100
arr = np.random.rand(I, J, J)
Your implementation:
out = np.cumsum(np.cumsum(np.cumsum(arr, axis = 0), axis = 1), axis = 2).diagonal(axis1 = 1, axis2 = 2)
# 10.9 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Your implementation improved by taking the diagonal before cumsumming over the 0th axis:
out = arr.cumsum(axis=1).cumsum(axis=2).diagonal(axis1=1, axis2=2).cumsum(axis=0)
# 6.25 ms ± 34.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Finally, some tril/triu trickery:
out = np.cumsum(np.cumsum(np.tril(arr, k=-1).sum(axis=2) + np.triu(arr).sum(axis=1), axis=1), axis=0)
# 3.15 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
which is already better, but admittedly still not ideal. I don't see a better way to compute the inner two sums noted above with pure numpy.
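As a sanity check, the three expressions above can be compared against a direct evaluation of the definition on a small array. This is a sketch only: brute_force is a hypothetical helper written for the check, and it is far too slow for real inputs.

```python
import numpy as np

def brute_force(arr):
    """Direct evaluation of A[k, l] = sum of arr[:k+1, :l+1, :l+1] (0-based indices)."""
    I, J, _ = arr.shape
    out = np.empty((I, J))
    for k in range(I):
        for l in range(J):
            out[k, l] = arr[:k + 1, :l + 1, :l + 1].sum()
    return out

rng = np.random.default_rng(0)
small = rng.random((4, 5, 5))

# Original triple cumsum followed by the diagonal
v0 = np.cumsum(np.cumsum(np.cumsum(small, axis=0), axis=1), axis=2).diagonal(axis1=1, axis2=2)
# Diagonal taken before the cumsum along axis 0
v1 = small.cumsum(axis=1).cumsum(axis=2).diagonal(axis1=1, axis2=2).cumsum(axis=0)
# tril/triu variant: each square grows by one L-shaped shell per step
v2 = np.cumsum(np.cumsum(np.tril(small, k=-1).sum(axis=2) + np.triu(small).sum(axis=1), axis=1), axis=0)

ref = brute_force(small)
print(np.allclose(v0, ref), np.allclose(v1, ref), np.allclose(v2, ref))
```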
You can use Numba to produce a very fast implementation. Here is the code:
import numba as nb
import numpy as np
@nb.njit('(float64[:,:,::1],)', parallel=True)
def compute(arr):
    ni, nj, nk = arr.shape
    assert nj == nk
    result = np.empty((ni, nj))
    # Parallel cumsum along the axis 1 and 2 + extraction of the diagonal
    for i in nb.prange(ni):
        tmp = np.zeros(nk)
        for j in range(nj):
            for k in range(nk):
                tmp[k] += arr[i, j, k]
            result[i, j] = np.sum(tmp[:j+1])
    # Cumsum along the axis 0
    for i in range(1, ni):
        for k in range(nk):
            result[i, k] += result[i-1, k]
    return result
result = compute(a)
Here are performance results on my 6-core i5-9600KF with a 100x100x100 float64 input array:
Initial code: 12.7 ms
Chryophylaxs v1: 7.1 ms
Chryophylaxs v2: 5.5 ms
Numba: 0.2 ms
This implementation is significantly faster than all the others: about 64 times faster than the initial implementation. It is also effectively optimal on my machine, since it completely saturates the bandwidth of my RAM just reading the input array (which is mandatory). Note that it is better not to use multiple threads for very small arrays.
Note that this code also uses far less memory, as it only needs 8 * nk * num_threads bytes of temporary storage as opposed to 16 * ni * nj * nk bytes for the initial solution.
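To put numbers on the two memory formulas for the 100x100x100 case, here is the arithmetic written out. The thread count of 6 is an assumption (one thread per core of the CPU mentioned above).

```python
ni = nj = nk = 100
num_threads = 6  # assumption: one thread per core of the 6-core CPU

numba_tmp_bytes = 8 * nk * num_threads    # one float64 row buffer per thread
initial_tmp_bytes = 16 * ni * nj * nk     # two full-size float64 temporaries

print(numba_tmp_bytes)                        # 4800
print(initial_tmp_bytes)                      # 16000000
print(initial_tmp_bytes // numba_tmp_bytes)   # 3333
```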
In https://numpy.org/doc/stable/reference/generated/numpy.einsum.html
optimize{False, True, ‘greedy’, ‘optimal’}, optional
Controls if intermediate optimization should occur. No optimization will occur if False and True will default to the ‘greedy’ algorithm. Also accepts an explicit contraction list from the np.einsum_path function. See np.einsum_path for more details. Defaults to False.
It seems to me the optimize flag is to choose the order in multiple contractions. E.g.,
A B C -> D
for (AB)C or A(BC) or (AC)B which is faster, not for a binary contraction, e.g., AB->C.
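To illustrate what I mean by choosing the order, np.einsum_path reports the pairwise contraction order for a three-operand product. This is a small sketch with arbitrary shapes chosen only for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((30, 4))
B = rng.random((4, 50))
C = rng.random((50, 3))

# path is a list: the marker 'einsum_path' followed by one index pair per pairwise contraction
path, report = np.einsum_path('ij,jk,kl->il', A, B, C, optimize='greedy')
print(path)
print(report)

# The explicit path can be passed back to einsum through the optimize argument
out_opt = np.einsum('ij,jk,kl->il', A, B, C, optimize=path)
out_plain = np.einsum('ij,jk,kl->il', A, B, C, optimize=False)
print(np.allclose(out_opt, out_plain))
```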
For the following code for A[a,b] * B[b,c,d] = C[a,c,d]
import numpy as np
import time
import scipy.stats
# from https://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data
def mean_confidence_interval(data, var_name, unit, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    print(var_name, round(m, 5), "\u00B1", round(h, 5), unit )
def einsum_greedy(A, B, na, nb, nc, nd):
    #res = np.zeros((na,nc,nd))
    res = np.einsum('ab,bcd->acd', A, B, optimize="greedy")
    return res
def einsum_standard(A, B, na, nb, nc, nd):
    # res = np.zeros((na,nc,nd))
    res = np.einsum('ab,bcd->acd', A, B)
    return res
def btime_ABC(name_def, name_out, A, B, C, na, nb, nc, nd, n_times):
    global opt_path
    list_time = []
    for i in range(n_times):
        start_time = time.time()
        C = name_def(A, B, na, nb, nc, nd)
        finish_time = time.time()
        list_time.append(finish_time - start_time)
    mean_confidence_interval(list_time, name_out, 's' )
# A[a,b] * B[b,c,d] = C[a,c,d]
na = nb = nc = nd = dim_comm = 90
n_times = 60
print('number of common dimension', dim_comm)
print('number of averaged time', n_times)
A = np.random.random((na,nb))
B = np.random.random((nb,nc,nd))
C1 = np.zeros((na,nc,nd))
C2 = np.zeros((na,nc,nd))
btime_ABC(einsum_standard, 'einsum_standard', A, B, C1, na, nb, nc, nd, n_times)
btime_ABC(einsum_greedy, 'einsum_greedy', A, B, C2, na, nb, nc, nd, n_times)
I got
number of common dimension 90
number of averaged time 60
einsum_standard 0.04799 ± 0.00312 s
einsum_greedy 0.00805 ± 0.00137 s
So the optimize flag helps even for the binary contraction A[a,b] * B[b,c,d] = C[a,c,d]. Why?
My timings:
In [26]: A = np.random.random((90,80))
In [27]: B = np.random.random((80,81,82))
In [28]: timeit np.einsum('ab,bcd->acd',A,B,optimize=False)
39.2 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [29]: timeit np.einsum('ab,bcd->acd',A,B,optimize=True)
9.06 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Looking at the einsum code, I see this early on:
# If no optimization, run pure einsum
if optimize is False:
    return c_einsum(*operands, **kwargs)
If not it checks various parameters, and does
operands, contraction_list = einsum_path(*operands, optimize=optimize,
# Call tensordot if still possible
if blas:
    new_view = tensordot(*tmp_operands, axes=(tuple(left_pos), tuple(right_pos)))
Since we have only 2 arguments and path is straightforward, I think the True case is just:
In [30]: timeit np.tensordot(A,B,(1,0))
7.62 ms ± 609 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
which is, from past study of tensordot:
In [31]: timeit (A@B.reshape(80,-1)).reshape(90,81,82)
6.44 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So basically the time difference is between running a compiled "pure einsum", and an alternative that casts this as a matmul problem, which can use the optimized BLAS routines. [29] time appears to be the [31] time plus some overhead.
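The equivalence behind that comparison can be checked directly. This is a small sketch with arbitrary shapes chosen for the example, showing that the plain einsum, tensordot, and the reshape-plus-matmul form give the same array:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((9, 8))
B = rng.random((8, 7, 6))

via_einsum = np.einsum('ab,bcd->acd', A, B, optimize=False)
# Contract axis 1 of A with axis 0 of B
via_tensordot = np.tensordot(A, B, (1, 0))
# Flatten the (c, d) axes of B into one, do a 2-D matmul, then restore the shape
via_matmul = (A @ B.reshape(8, -1)).reshape(9, 7, 6)

print(np.allclose(via_einsum, via_tensordot), np.allclose(via_einsum, via_matmul))
```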
I have an array of ~13GB. I call numpy.var on it to compute the variance. However, it allocates another ~13GB to do this. Why does it need O(N) space? Or am I calling numpy.var in a wrong way?
import numpy as np
# data = ...
print('Variance: ', np.var(data))
NumPy will create an intermediate array to compute abs(data - data.mean()) ** 2 in order to compute the variance. You can write your own variance function with a loop and make it fast with Numba:
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def var_nb(a, ddof=0):
    n = len(a)
    s = a.sum()
    m = s / n  # the mean always divides by n; ddof only affects the final normalization
    v = 0
    for i in nb.prange(n):
        v += abs(a[i] - m) ** 2
    return v / (n - ddof)
a = np.random.rand(100_000)
print(np.var(a))
# 0.08349747560941487
print(var_nb(a))
# 0.08349747560941487
%timeit np.var(a)
# 143 µs ± 414 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit var_nb(a)
# 40.2 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This is faster without parallelization:
import numpy as np
def var(a: np.ndarray, axis: int = 0):
    return np.sum(abs(a - (a.sum(axis=axis) / len(a))) ** 2, axis=axis) / len(a)
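If Numba is not available, the extra memory can also be bounded in plain NumPy by processing the array in blocks. The following is a minimal sketch assuming a 1-D real-valued array; chunked_var and its chunk_size parameter are hypothetical names, not part of NumPy.

```python
import numpy as np

def chunked_var(a, chunk_size=1_000_000, ddof=0):
    """Two-pass variance of a 1-D array using O(chunk_size) extra memory."""
    n = a.shape[0]
    mean = a.mean()  # first pass: a reduction, no full-size temporary
    total = 0.0
    for start in range(0, n, chunk_size):
        block = a[start:start + chunk_size]   # slicing gives a view, not a copy
        diff = block - mean                   # temporary of at most chunk_size elements
        total += np.dot(diff, diff)           # sum of squared deviations for this block
    return total / (n - ddof)

rng = np.random.default_rng(0)
data = rng.random(250_000)
print(np.isclose(chunked_var(data, chunk_size=10_000), np.var(data)))
```

The chunk size trades temporary memory against Python loop overhead, so larger chunks are usually faster as long as they fit comfortably in memory.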
I have written code that uses sympy to set up a matrix and a vector whose elements are sympy symbols. I then invert the matrix and multiply the inverse by the vector. This should be a generic solver for systems of linear equations in n variables. I am interested in the symbolic solution of these equations.
The problem is that my code is too slow.
For instance, n=4 takes roughly 30 s, but I have not been able to solve n=7 so far: the code ran all night (8 h) and had not finished in the morning.
This is my code.
from sympy import *
import pprint
MM = Matrix(niso,1, lambda i,j:var('MM_%s' % (i+1) ))
MA = Matrix (niso,1, lambda i,j:var('m_%s%s' % ('A', chr(66+i)) ) )
MX = Matrix (niso,1, lambda i,j:var('m_%s%s'% (chr(66+i), 'A')))
RB = Matrix(niso,1, lambda i,j:var('R_%s%s' % ('A'+chr(66+i),i+2)))
R = Matrix (niso, niso-1, lambda i,j: var('R_%s%d' % (chr(65+i) , j+2 )))
K= Matrix(niso-1,1, lambda i,j:var('K_%d' % (i+2) ) )
C= Matrix(niso-1,1, lambda i,j:var('A_%d' % i))
A = Matrix(niso-1,niso-1, lambda i,j:var('A_%d' % i))
b = Matrix(niso-1,1, lambda i,j:var('A_%d' % i))
for i in range(0,niso-1):
for j in range(0,niso-1):
for i in range(0,niso-1):
A_in = Inverse(A)
if niso <= 4:
X =simplify(A_in*b)
if niso > 4:
X = A_in*b
Is there a way to speed it up?
Don't invert! With n=4
%timeit soln = A.LUsolve(b)
697 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
With n=10
%timeit soln = A.LUsolve(b)
431 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
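For reference, here is a minimal self-contained sketch of the same idea on a fully symbolic system. The size n = 3 and the symbol names a_ij and b_i are placeholders chosen for the example, not the matrices from the question.

```python
import sympy as sp

n = 3  # small illustrative size
A = sp.Matrix(n, n, lambda i, j: sp.Symbol(f"a_{i}{j}"))
b = sp.Matrix(n, 1, lambda i, j: sp.Symbol(f"b_{i}"))

# Solve A*x = b symbolically without ever forming the inverse of A
x = A.LUsolve(b)

# Substitute back and cancel common factors; every entry should reduce to 0
residual = (A * x - b).applyfunc(sp.cancel)
print(residual == sp.zeros(n, 1))
```

The entries of x come out as nested fractions. Simplifying them is a separate and potentially expensive step, so it is best applied only where a compact closed form is really needed.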
In my project (about clustering algorithms, specifically k-medoids) it is crucial to be able to compute pairwise distances efficiently. I have a dataset of ~60,000 objects. The problem is that distances must be computed between inhomogeneous vectors, i.e. vectors which may differ in length (in that case, missing items are treated as if they were 0).
Here is a minimal working example:
# %%
MAX_LEN = 11
N = 100
import random
def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    # The tail of the longer vector is compared against implicit zeros
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist
def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])
data = []
for i in range(N):
    data.append([])  # start an empty vector before filling it
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))
%timeit compute_distances()
import numpy as np
def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()
def compute_distances_np():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])
data_np = [np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data]
%timeit compute_distances_np()
I was testing my Python lists implementation versus a numpy implementation.
Here are the results (computation times):
Python lists: 79.6 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
numpy arrays: 226 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Why is there such a huge difference? I supposed numpy arrays were really fast.
Is there a way to improve my code? Am I misunderstanding the inner workings of numpy?
Edit: I may need, in the future, to be able to use a custom distance function for pairwise distances computations. The method should work also for data sets of length 60'000 without running out of memory.
I believe you can just make your arrays dense and set the unused last elements to 0s.
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform
def batch_pdist(x, metric, batchsize=1000):
    dists = np.zeros((len(x), len(x)))
    for i in range(0, len(x), batchsize):
        for j in range(0, len(x), batchsize):
            dist_batch = cdist(x[i:i+batchsize], x[j:j+batchsize], metric=metric)
            dists[i:i+batchsize, j:j+batchsize] = dist_batch
    return dists
MIN_LEN = 5  # assumption: the lower bound used in the question's example
MAX_LEN = 11
N = 10000
M = 10
data = np.zeros((N,MAX_LEN))
for i in range(N):
    num_nonzero = np.random.randint(MIN_LEN, MAX_LEN)
    data[i, :num_nonzero] = np.random.randint(0, M, num_nonzero)
dists = squareform(pdist(data, metric='cityblock'))
dists2 = batch_pdist(data, metric='cityblock', batchsize=500)
print((dists == dists2).all())
Timing Output:
%timeit squareform(pdist(data, metric='cityblock'))
43.8 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For a custom distance function see the very bottom of this documentation.
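As a rough illustration of both points (zero padding and a custom metric), cdist accepts a Python callable that takes two 1-D rows and returns a scalar. The helper names pad_ragged and my_manhattan below are made up for this sketch, and a Python callable will be much slower than a built-in metric name.

```python
import numpy as np
from scipy.spatial.distance import cdist

def pad_ragged(vectors, fill=0.0):
    """Hypothetical helper: right-pad variable-length vectors into a dense 2-D array."""
    width = max(len(v) for v in vectors)
    dense = np.full((len(vectors), width), fill, dtype=float)
    for row, v in enumerate(vectors):
        dense[row, :len(v)] = v
    return dense

def my_manhattan(u, v):
    """Example custom metric: receives two 1-D rows and returns a scalar."""
    return np.abs(u - v).sum()

ragged = [[1, 2, 3], [4, 0], [2, 2, 2, 2, 5]]
X = pad_ragged(ragged)

d_builtin = cdist(X, X, metric='cityblock')
d_custom = cdist(X, X, metric=my_manhattan)
print(d_builtin[0, 1])   # 8.0
print(np.allclose(d_builtin, d_custom))
```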
I finally found what is probably the most straightforward way to solve this problem without changing the code too much, relying solely on computation rather than memory (which could be infeasible for very large datasets).
Based on juanpa.arrivillaga's suggestion, I tried numba, a library that speeds up array-oriented and math-heavy Python code and is targeted mainly at numpy. You can read a good guide on optimizing Python code here: https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/.
MAX_LEN = 11
N = 100
# Pure Python lists implementation.
import random
def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    # The tail of the longer vector is compared against implicit zeros
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist
def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])
data = []
for i in range(N):
    data.append([])  # start an empty vector before filling it
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))
%timeit compute_distances()
# numpy+numba implementation.
import numpy as np
from numba import jit
@jit
def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()
@jit
def compute_distances_np():
    n = len(data_np)  # use the padded array so the jitted function only touches numpy data
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])
data_np = np.array([np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data])
%timeit compute_distances_np()
Timing output:
%timeit compute_distances()
78.4 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit compute_distances_np()
4.1 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, the numpy with numba optimizations is about 19 times faster (with no other code optimization involved).