In https://numpy.org/doc/stable/reference/generated/numpy.einsum.html
optimize{False, True, ‘greedy’, ‘optimal’}, optional
Controls if intermediate optimization should occur. No optimization will occur if False and True will default to the ‘greedy’ algorithm. Also accepts an explicit contraction list from the np.einsum_path function. See np.einsum_path for more details. Defaults to False.
It seems to me that the optimize flag is meant to choose the contraction order when there are several contractions, e.g. for
A B C -> D
deciding whether (AB)C, A(BC) or (AC)B is faster, not to speed up a single binary contraction such as AB -> C.
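For context, that order-choosing role can be seen with np.einsum_path on a three-operand product. A minimal sketch (shapes are my own choice, purely for illustration):
import numpy as np
# Hypothetical shapes, chosen only so that the contraction order matters.
A = np.random.random((10, 30))
B = np.random.random((30, 50))
C = np.random.random((50, 5))
# einsum_path reports the pairwise order in which operands are contracted.
path, info = np.einsum_path('ij,jk,kl->il', A, B, C, optimize='greedy')
print(path)   # e.g. ['einsum_path', (1, 2), (0, 1)]
print(info)   # human-readable breakdown of the chosen intermediates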
Yet for the following code, which computes A[a,b] * B[b,c,d] = C[a,c,d]:
import numpy as np
import time
import scipy.stats

# from https://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data
def mean_confidence_interval(data, var_name, unit, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    print(var_name, round(m, 5), "\u00B1", round(h, 5), unit)

def einsum_greedy(A, B, na, nb, nc, nd):
    # res = np.zeros((na,nc,nd))
    res = np.einsum('ab,bcd->acd', A, B, optimize="greedy")
    return res

def einsum_standard(A, B, na, nb, nc, nd):
    # res = np.zeros((na,nc,nd))
    res = np.einsum('ab,bcd->acd', A, B)
    return res

def btime_ABC(name_def, name_out, A, B, C, na, nb, nc, nd, n_times):
    list_time = []
    for i in range(n_times):
        start_time = time.time()
        C = name_def(A, B, na, nb, nc, nd)
        finish_time = time.time()
        list_time.append(finish_time - start_time)
    mean_confidence_interval(list_time, name_out, 's')

# A[a,b] * B[b,c,d] = C[a,c,d]
na = nb = nc = nd = dim_comm = 90
n_times = 60
print('number of common dimension', dim_comm)
print('number of averaged time', n_times)
A = np.random.random((na,nb))
B = np.random.random((nb,nc,nd))
C1 = np.zeros((na,nc,nd))
C2 = np.zeros((na,nc,nd))
btime_ABC(einsum_standard, 'einsum_standard', A, B, C1, na, nb, nc, nd, n_times)
btime_ABC(einsum_greedy, 'einsum_greedy', A, B, C2, na, nb, nc, nd, n_times)
I got
number of common dimension 90
number of averaged time 60
einsum_standard 0.04799 ± 0.00312 s
einsum_greedy 0.00805 ± 0.00137 s
So the optimize flag does help in the binary contraction A[a,b] * B[b,c,d] = C[a,c,d]. Why?
My timings:
In [26]: A = np.random.random((90,80))
In [27]: B = np.random.random((80,81,82))
In [28]: timeit np.einsum('ab,bcd->acd',A,B,optimize=False)
39.2 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [29]: timeit np.einsum('ab,bcd->acd',A,B,optimize=True)
9.06 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Looking at the einsum code, I see early on:
# If no optimization, run pure einsum
if optimize is False:
    return c_einsum(*operands, **kwargs)
Otherwise it checks various parameters and does:
operands, contraction_list = einsum_path(*operands, optimize=optimize,
                                         einsum_call=True)
...
# Call tensordot if still possible
if blas:
    ...
    new_view = tensordot(*tmp_operands, axes=(tuple(left_pos), tuple(right_pos)))
Since we have only two arguments and the path is straightforward, I think the optimize=True case is effectively just:
In [30]: timeit np.tensordot(A,B,(1,0))
7.62 ms ± 609 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
which is, from past study of tensordot:
In [31]: timeit (A@B.reshape(80,-1)).reshape(90,81,82)
6.44 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So basically the time difference is between running the compiled "pure einsum" and an alternative that casts this as a matmul problem, which can use the optimized BLAS routines. The [29] time appears to be the [31] time plus some overhead.
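As a quick check of this reading (my own snippet, reusing the shapes from the timings above), all four routes produce the same values:
import numpy as np
A = np.random.random((90, 80))
B = np.random.random((80, 81, 82))
ref = np.einsum('ab,bcd->acd', A, B, optimize=False)  # compiled "pure einsum"
opt = np.einsum('ab,bcd->acd', A, B, optimize=True)   # dispatches through tensordot/BLAS
td  = np.tensordot(A, B, axes=(1, 0))
mm  = (A @ B.reshape(80, -1)).reshape(90, 81, 82)
print(np.allclose(ref, opt), np.allclose(ref, td), np.allclose(ref, mm))  # True True True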
Related
# Generate test data
test = list(range(150))
groups = []
for _ in range(75_000):
    groups.append(random.sample(test, 6))
Set up the variables as NumPy arrays:
# Best version
import numpy as np
import random
from numba import jit # Kind of optional see below
# Generate test data
test = list(range(150))
groups = np.array([random.sample(test, 6) for _ in range(75_000)])
# This will change every time but just leaving the same for example
scores_dict = {i: random.uniform(0, 120) for i in range(150)}
scores = np.array(list(scores_dict.items()))
Here's the vectorized version using numpy's sum and take:
def fun1(scores, groups):
    for _ in range(6250):
        c = np.sum(np.take(scores[:, 1], groups), axis=1)
    return c
%timeit fun1(scores, groups) # Takes ~2.5 mins to run
18.6 s ± 625 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
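In case take is unfamiliar, here is a toy sketch (made-up numbers, not the benchmark data) of what the inner expression computes:
import numpy as np
scores_col = np.array([10., 20., 30., 40.])   # stands in for scores[:, 1]
groups_toy = np.array([[0, 2], [1, 3]])       # two groups of two indices each
picked = np.take(scores_col, groups_toy)      # [[10., 30.], [20., 40.]]
print(np.sum(picked, axis=1))                 # [40., 60.], one total per group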
If you really want to go all out you can try using numba on top of numpy:
@jit(nopython=True)
def fun2(scores, groups):
    for _ in range(6250):
        c = np.sum(np.take(scores[:, 1], groups), axis=1)
    return c
%timeit fun2(scores, groups) # Takes ~1.2 mins to run
10.1 s ± 1.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have a program for a simulation, and inside the program I have a function. I have realized that the function consumes most of the simulation time, so I am trying to optimize the function first. The function is as follows.
Julia version 1.1:
function fun_jul(M,ksi,xi,x)
    F(n,x) = sin(n*pi*(x+1)/2)*cos(n*pi*(x+1)/2);
    K = length(ksi);
    Z = zeros(length(x),K);
    for n in 1:M
        for k in 1:K
            for l in 1:length(x)
                Z[l,k] += (1-(n/(M+1))^2)^xi*F(n,ksi[k])*F(n,x[l]);
            end
        end
    end
    return Z
end
I also rewrote the above function in Python+Numba for comparison, as follows.
Python+numba
import numpy as np
from numba import prange, jit

@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):
    K = len(ksi);
    F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);
    Z = np.zeros((len(x),K));
    for n in range(1,M+1):
        for k in prange(0,K):
            Z[:,k] += (1-(n/(M+1))**2)**xi*F(n,ksi[k])*F(n,x);
    return Z
But the Julia code is very slow. Here are my results:
Julia results:
using BenchmarkTools
N=400; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul(M,cc,xi,x)
BenchmarkTools.Trial:
memory estimate: 1.22 MiB
allocs estimate: 2
--------------
minimum time: 25.039 s (0.00% GC)
median time: 25.039 s (0.00% GC)
mean time: 25.039 s (0.00% GC)
maximum time: 25.039 s (0.00% GC)
--------------
samples: 1
evals/sample: 1
Python results:
N=400;a = -0.5;b = 0.5;x = np.linspace(a,b,N);cc = x;M = 2*N + 100;xi = M/40;
%timeit fun_py(M,cc,xi,x);
1.2 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Any help on improving the code, both for Julia and for Python+Numba, would be appreciated.
Updated
Based on @Przemyslaw Szufel's answer and the other posts, I have improved the Numba and Julia code. Now both are parallelized. Here are the timings.
Python+Numba times:
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):
    K = len(ksi);
    F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);
    Z = np.zeros((K,len(x)));
    for n in range(1,M+1):
        pw = (1-(n/(M+1))**2)**xi; f = F(n,x)
        for k in prange(0,K):
            Z[k,:] = Z[k,:] + pw*F(n,ksi[k])*f;
    return Z
N=1000; a=-0.5; b=0.5; x=np.linspace(a,b,N); cc=x; M = 2*N+100; xi = M/40;
%timeit fun_py(M,cc,xi,x);
733 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Julia times
N=1000; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul2(M,cc,xi,x)
BenchmarkTools.Trial:
memory estimate: 40.31 MiB
allocs estimate: 6302
--------------
minimum time: 705.470 ms (0.17% GC)
median time: 726.403 ms (0.17% GC)
mean time: 729.032 ms (1.68% GC)
maximum time: 765.426 ms (5.27% GC)
--------------
samples: 7
evals/sample: 1
I got down to 300ms on a single thread (instead of 28s on my machine) with the following code.
You are using multi-threading for Numba. In Julia you should use parallel processing (multi-threading support is experimental for Julia). It seems that your code is doing some kind of parameter sweep; such codes are very easy to parallelize, but it usually requires some adjustments to your computational process.
Here is the code:
function fun_jul2(M,ksi,xi,x)
    F(n,x) = sin(n*pi*(x+1))/2;
    K = length(ksi);
    L = length(x);
    Z = zeros(length(x),K);
    for n in 1:M
        F_im1 = [F(n,ksi[k]) for k in 1:K]
        F_im2 = [F(n,x[l]) for l in 1:L]
        pow = (1-(n/(M+1))^2)^xi
        for k in 1:K
            for l in 1:L
                Z[l,k] += pow*F_im1[k]*F_im2[l];
            end
        end
    end
    Z
end
julia> fun_jul2(M,cc,xi,x) ≈ fun_jul(M,cc,xi,x)
true
julia> @time fun_jul2(M,cc,xi,x);
0.305269 seconds (1.81 k allocations: 6.934 MiB, 1.60% gc time)
EDIT: with multithreading and @inbounds as suggested by DNF:
function fun_jul3(M,ksi,xi,x)
    F(n,x) = sin(n*pi*(x+1))/2;
    K = length(ksi);
    L = length(x);
    Z = zeros(length(x),K);
    for n in 1:M
        F_im1 = [F(n,ksi[k]) for k in 1:K]
        F_im2 = [F(n,x[l]) for l in 1:L]
        pow = (1-(n/(M+1))^2)^xi
        Threads.@threads for k in 1:K
            for l in 1:L
                @inbounds Z[l,k] += pow*F_im1[k]*F_im2[l];
            end
        end
    end
    Z
end
And now the running time (remember to run set JULIA_NUM_THREADS=4, or export JULIA_NUM_THREADS=4 on Linux, before launching Julia):
julia> fun_jul2(M,cc,xi,x) ≈ fun_jul3(M,cc,xi,x)
true
julia> @time fun_jul3(M,cc,xi,x);
0.051470 seconds (2.71 k allocations: 6.989 MiB)
You could also experiment further with parallelizing the computation of F_im1 and F_im2.
You can do, or fail to do, loop optimization in any language that has loops. The major difference here is that the numba code is vectorized for the inner loop but the Julia code is not. To vectorize the Julia version, it is sometimes necessary to change operators to their vectorized versions with the ., so that + becomes .+ for example.
Since I cannot get Numba to install properly on my older Windows 10 machine, I ran the code versions below on free Linux environments on the Web. This means I had to use the Python interface for timeit(), not the command line.
Run in Jupyter at mybinder, probably with 1 thread since it is not specified:
import timeit

timeit.timeit("""
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):
    K = len(ksi);
    F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);
    Z = np.zeros((len(x),K));
    for n in range(1,M+1):
        for k in prange(0,K):
            Z[:,k] += (1-(n/(M+1))**2)**xi*F(n,ksi[k])*F(n,x);
    return Z

N=400; a = -0.5; b = 0.5; x = np.linspace(a,b,N); cc = x; M = 2*N + 100; xi = M/40;
fun_py(M,cc,xi,x)
""", setup="import numpy as np; from numba import prange, jit", number=5)
Out[1]: 61.07768889795989
Your machine must be a lot faster than Jupyter, ForBonder.
I ran the optimized Julia code version below in Jupyter on JuliaBox, with a 1-thread kernel specified:
using BenchmarkTools
F(n, x) = sinpi.(n * (x .+ 1) / 2) .* cospi.(n * (x .+ 1) / 2)
function fun_jul2(M, ksi, xi, x)
K = length(ksi)
Z = zeros(length(x), K)
for n in 1:M, k in 1:K
Z[:, k] .+= (1 - (n / (M + 1))^2)^xi * F(n, ksi[k]) * F(n, x)
end
return Z
end
const N=400; const a=-0.5; const b=0.5; const x=range(a,b,length=N);
const cc=x; const M = 2*N+100; const xi = M/40;
#btime fun_jul2(M, cc, xi, x)
8.076 s (1080002 allocations: 3.35 GiB)
For performance, just precompute the trigonometric part.
Indeed, sin is a costly operation:
%timeit np.sin(1.)
712 ns ± 2.22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit 1.2*3.4
5.88 ns ± 0.016 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
In Python:
@jit
def fun_py2(M,ksi,xi,x):
    NN = np.arange(1,M+1)
    Fksi = np.sin(np.pi*np.outer(NN,ksi+1))/2  # sin(a)cos(a) is sin(2a)/2
    Fx = np.sin(np.pi*np.outer(NN,x+1))/2
    U = (1-(NN/(M+1))**2)**xi
    Z = np.zeros((len(x),len(ksi)))
    for n in range(len(NN)):
        for k in range(len(ksi)):
            for l in range(len(x)):
                Z[l,k] += U[n] * Fksi[n,k] * Fx[n,l]  # rows follow x, columns follow ksi
    return Z
For a 30x improvement:
np.allclose(fun_py(M,cc,xi,x),fun_py2(M,cc,xi,x))
True
%timeit fun_py(M,cc,xi,x)
1.14 s ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit fun_py2(M,cc,xi,x)
29.5 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This doesn't trigger any parallelism. I suppose the same will occur for Julia.
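If you do want parallelism on top of the precomputation, one possibility (my own sketch, not part of the answer above) is to keep the trigonometric tables in plain NumPy and parallelize the k loop with prange, so that each element of Z is written by exactly one thread:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def accumulate(U, Fksi, Fx):
    # U: (N,), Fksi: (N, K), Fx: (N, L); returns Z of shape (L, K)
    N, K = Fksi.shape
    L = Fx.shape[1]
    Z = np.zeros((L, K))
    for k in prange(K):      # parallel over columns of Z, so there are no write conflicts
        for n in range(N):
            for l in range(L):
                Z[l, k] += U[n] * Fksi[n, k] * Fx[n, l]
    return Z

def fun_py3(M, ksi, xi, x):
    NN = np.arange(1, M + 1)
    Fksi = np.sin(np.pi * np.outer(NN, ksi + 1)) / 2   # sin(a)cos(a) is sin(2a)/2
    Fx = np.sin(np.pi * np.outer(NN, x + 1)) / 2
    U = (1 - (NN / (M + 1)) ** 2) ** xi
    return accumulate(U, Fksi, Fx)
Whether this beats the serial triple loop will depend on the problem size and the number of threads.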
I have an array of ~13 GB. I call numpy.var on it to compute the variance. However, it allocates another ~13 GB to do this. Why does it need O(N) space? Or am I calling numpy.var the wrong way?
import numpy as np
# data = ...
print('Variance: ', np.var(data))
NumPy will create an intermediate array to compute abs(data - data.mean()) ** 2 in order to compute the variance. You can write your own variance function with a loop and make it fast with Numba:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def var_nb(a, ddof=0):
    n = len(a)
    s = a.sum()
    m = s / n
    v = 0
    for i in nb.prange(n):
        v += abs(a[i] - m) ** 2
    return v / (n - ddof)
np.random.seed(100)
a = np.random.rand(100_000)
print(np.var(a))
# 0.08349747560941487
print(var_nb(a))
# 0.08349747560941487
%timeit np.var(a)
# 143 µs ± 414 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit var_nb(a)
# 40.2 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
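If Numba is not an option, a chunked two-pass computation in plain NumPy also keeps the extra memory bounded by the chunk size rather than the array size. A sketch for a 1-D array (the chunk size is an arbitrary choice of mine):
import numpy as np

def var_chunked(a, chunk=1_000_000, ddof=0):
    n = a.size
    # First pass: the mean, accumulated chunk by chunk.
    total = 0.0
    for start in range(0, n, chunk):
        total += a[start:start + chunk].sum()
    mean = total / n
    # Second pass: squared deviations, with temporaries of at most `chunk` elements.
    sq = 0.0
    for start in range(0, n, chunk):
        d = a[start:start + chunk] - mean
        sq += np.dot(d, d)
    return sq / (n - ddof)

np.random.seed(100)
a = np.random.rand(100_000)
print(np.isclose(var_chunked(a), np.var(a)))  # True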
This is faster without parallelization:
import numpy as np
def var(a: np.ndarray, axis: int = 0):
    return np.sum(abs(a - (a.sum(axis=axis) / len(a))) ** 2, axis=axis) / len(a)
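A quick sanity check (my own) that this matches NumPy on the same kind of test array as above:
import numpy as np

np.random.seed(100)
a = np.random.rand(100_000)
print(np.allclose(var(a), np.var(a)))  # True, assuming var() as defined above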
In my code I need to calculate, many times, the values of a vector whose entries are the mean values of different patches of another array.
Here is an example of my code showing how I do it, but I have found it too inefficient...
import numpy as np
vector_a = np.zeros(10)
array_a = np.random.random((100,100))
for i in range(len(vector_a)):
    vector_a[i] = np.mean(array_a[:,i+20:i+40])
Is there any way to make it more efficient? Any comments or suggestions are very welcome! Many thanks!
Yes, the 20 and 40 are fixed.
EDIT:
Actually you can do this much faster. The previous function can be improved by operating on summed columns like this:
def rolling_means_faster1(array_a, n, first, size):
    # Sum the relevant columns
    sum_a = np.sum(array_a[:, first:(first + size + n - 1)], axis=0)
    # Reshape as before
    strides_b = (sum_a.strides[0], sum_a.strides[0])
    array_b = np.lib.stride_tricks.as_strided(sum_a, (n, size), (strides_b))
    # Average
    v = np.sum(array_b, axis=1)
    v /= (len(array_a) * size)
    return v
Another way is to work with accumulated sums, adding and removing as necessary for each output element.
def rolling_means_faster2(array_a, n, first, size):
    # Sum the relevant columns
    sum_a = np.sum(array_a[:, first:(first + size + n - 1)], axis=0)
    # Add a zero at the beginning so the next operation works fine
    sum_a = np.insert(sum_a, 0, 0)
    # Sum the initial `size` elements and add and remove partial sums as necessary
    v = np.sum(sum_a[:size]) - np.cumsum(sum_a[:n]) + np.cumsum(sum_a[-n:])
    # Average
    v /= (size * len(array_a))
    return v
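The add-and-remove trick is easiest to see on a tiny 1-D example (toy numbers of my own, a window of 3 over 6 column sums):
import numpy as np

s = np.array([1., 2., 3., 4., 5., 6.])   # pretend these are the column sums
size = 3                                  # window length
n = len(s) - size + 1                     # number of windows (4 here)

s0 = np.insert(s, 0, 0)                   # prepend a zero, as in rolling_means_faster2
v = np.sum(s0[:size]) - np.cumsum(s0[:n]) + np.cumsum(s0[-n:])
print(v)                                            # [ 6.  9. 12. 15.]
print(np.convolve(s, np.ones(size), 'valid'))       # the same window sums, computed directly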
Benchmarking against the previous solutions:
import numpy as np
np.random.seed(100)
array_a = np.random.random((1000, 1000))
n = 100
first = 100
size = 200
%timeit rolling_means_orig(array_a, n, first, size)
# 12.7 ms ± 55.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means(array_a, n, first, size)
# 5.49 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means_faster1(array_a, n, first, size)
# 166 µs ± 874 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit rolling_means_faster2(array_a, n, first, size)
# 182 µs ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So these last two seem to be very close in performance. It may depend on the relative sizes of the inputs.
This is a possible vectorized solution:
import numpy as np
# Data
np.random.seed(100)
array_a = np.random.random((100, 100))
# Take all the relevant columns
slice_a = array_a[:, 20:40 + 10]
# Make a "rolling window" with stride tricks
strides_b = (slice_a.strides[1], slice_a.strides[0], slice_a.strides[1])
array_b = np.lib.stride_tricks.as_strided(slice_a, (10, 100, 20), (strides_b))
# Take mean
result = np.mean(array_b, axis=(1, 2))
# Original method for testing correctness
vector_a = np.zeros(10)
idv1 = np.arange(10) + 20
idv2 = np.arange(10) + 40
for i in range(len(vector_a)):
    vector_a[i] = np.mean(array_a[:,idv1[i]:idv2[i]])
print(np.allclose(vector_a, result))
# True
Here is a quick benchmark in IPython (sizes increased to make the differences more visible):
import numpy as np
def rolling_means(array_a, n, first, size):
    slice_a = array_a[:, first:(first + size + n)]
    strides_b = (slice_a.strides[1], slice_a.strides[0], slice_a.strides[1])
    array_b = np.lib.stride_tricks.as_strided(slice_a, (n, len(array_a), size), (strides_b))
    return np.mean(array_b, axis=(1, 2))

def rolling_means_orig(array_a, n, first, size):
    vector_a = np.zeros(n)
    idv1 = np.arange(n) + first
    idv2 = np.arange(n) + (first + size)
    for i in range(len(vector_a)):
        vector_a[i] = np.mean(array_a[:,idv1[i]:idv2[i]])
    return vector_a
np.random.seed(100)
array_a = np.random.random((1000, 1000))
n = 100
first = 100
size = 200
%timeit rolling_means(array_a, n, first, size)
# 5.48 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means_orig(array_a, n, first, size)
# 32.8 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This solution works on the assumption that you are trying to compute a rolling average over a subset of windows of columns.
As an example, ignoring rows: given [0, 1, 2, 3, 4] and a window of 2, the averages are [0.5, 1.5, 2.5, 3.5], and you might only want the second and third of them.
Your current solution is inefficient, as it recomputes the mean of each column for every output in vector_a. Given that (a / n) + (b / n) == (a + b) / n, we can get away with computing the mean of each column only once, and then combine the column means as needed to produce the final output.
window_first_start = idv1.min() # or idv1[0]
window_last_end = idv2.max() # or idv2[-1]
window_size = idv2[0] - idv1[0]
assert ((idv2 - idv1) == window_size).all(), "sanity check, not needed if assumption holds true"
# a view of the columns we are interested in, no copying is done here
view = array_a[:,window_first_start:window_last_end]
# calculate the means for each column
col_means = view.mean(axis=0)
# cumsum is used to find the rolling sum of means and so the rolling average
# We use an out variable to make sure we have a 0 in the first element of cum_sum.
# This makes life a little easier in the next step.
cum_sum = np.empty(len(col_means) + 1, dtype=col_means.dtype)
cum_sum[0] = 0
np.cumsum(col_means, out=cum_sum[1:])
result = (cum_sum[window_size:] - cum_sum[:-window_size]) / window_size
Having tested this against your own code, the above is significantly faster (increasing with the size of the input array), and slightly faster than the solution provided by jdehesa. With an input array of 1000x1000, it is two orders of magnitude faster than your solution and one order of magnitude faster than jdehesa's.
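For completeness, here is how I would check that the recombined column means reproduce the loop from the question (my own snippet, using the 100x100 sizes and the idv1/idv2 arrays defined earlier):
import numpy as np

np.random.seed(100)
array_a = np.random.random((100, 100))
idv1 = np.arange(10) + 20
idv2 = np.arange(10) + 40

# Loop from the question
vector_a = np.array([np.mean(array_a[:, idv1[i]:idv2[i]]) for i in range(10)])

# Recombined column means, as described above
window_size = idv2[0] - idv1[0]
view = array_a[:, idv1.min():idv2.max()]
col_means = view.mean(axis=0)
cum_sum = np.empty(len(col_means) + 1, dtype=col_means.dtype)
cum_sum[0] = 0
np.cumsum(col_means, out=cum_sum[1:])
result = (cum_sum[window_size:] - cum_sum[:-window_size]) / window_size

print(np.allclose(vector_a, result))  # True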
Try this:
import numpy as np
array_a = np.random.random((100,100))
vector_a = [np.mean(array_a[:,i+20:i+40]) for i in range(10)]
I have written some code which uses SymPy to set up a matrix and a vector whose elements are SymPy symbols. I then invert the matrix and multiply the inverted matrix by the vector. This is meant to be a generic solver for linear equation systems with n variables; I am interested in the symbolic solution of these linear equations.
The problem is that my code is too slow.
For instance, for n=4 it takes roughly 30 seconds, but for n=7 I haven't been able to get a result so far: the code ran all night (8 hours) and had not finished by morning.
This is my code.
from sympy import *
import pprint

niso = 4  # number of variables n; set to 4 here, as in the n=4 example above

MM = Matrix(niso,1, lambda i,j:var('MM_%s' % (i+1)))
MA = Matrix(niso,1, lambda i,j:var('m_%s%s' % ('A', chr(66+i))))
MX = Matrix(niso,1, lambda i,j:var('m_%s%s' % (chr(66+i), 'A')))
RB = Matrix(niso,1, lambda i,j:var('R_%s%s' % ('A'+chr(66+i),i+2)))
R = Matrix(niso, niso-1, lambda i,j: var('R_%s%d' % (chr(65+i), j+2)))
K = Matrix(niso-1,1, lambda i,j:var('K_%d' % (i+2)))
C = Matrix(niso-1,1, lambda i,j:var('A_%d' % i))
A = Matrix(niso-1,niso-1, lambda i,j:var('A_%d' % i))
b = Matrix(niso-1,1, lambda i,j:var('A_%d' % i))

for i in range(0,niso-1):
    for j in range(0,niso-1):
        A[i,j] = MM[j+1,0]*(Add(Mul(R[0,j],1/MA[i,0]/(RB[i,0]-R[0,i])))+R[i+1,j]/MX[i,0]/(-RB[i,0]+R[0,i]))
for i in range(0,niso-1):
    b[i,0] = MM[0,0]*(Add(Mul(1,1/MA[i,0]/(RB[i,0]-R[0,i])))+1/MX[i,0]/(-RB[i,0]+R[0,i]))

A_in = Inverse(A)
if niso <= 4:
    X = simplify(A_in*b)
if niso > 4:
    X = A_in*b
pprint(X)
Is there a way to speed it up?
Don't invert! With n=4
%timeit soln = A.LUsolve(b)
697 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
With n=10
%timeit soln = A.LUsolve(b)
431 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
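For reference, a minimal self-contained sketch of the LUsolve route on a fully symbolic system (generic symbols of my own, not the matrices from the question):
from sympy import Matrix, symbols

n = 3
A = Matrix(n, n, lambda i, j: symbols('a_%d%d' % (i, j)))
b = Matrix(n, 1, lambda i, j: symbols('b_%d' % i))

x = A.LUsolve(b)   # solves A*x = b symbolically, without ever forming A**-1
print(x[0])        # symbolic expression for the first unknown
Solving via LU decomposition avoids building the full symbolic inverse, which is usually the expensive part.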