Suppose I have two vectors and wish to take their dot product; this is simple,
import numpy as np
a = np.random.rand(3)
b = np.random.rand(3)
result = np.dot(a,b)
If I have stacks of vectors and I want each one dotted, the most naive code is
# 5 = number of vectors
a = np.random.rand(5,3)
b = np.random.rand(5,3)
result = [np.dot(aa,bb) for aa, bb in zip(a,b)]
Two ways to batch this computation are using a multiply and sum, and einsum,
result = np.sum(a*b, axis=1)
# or
result = np.einsum('ij,ij->i', a, b)
However, neither of these dispatch to the BLAS backend, and so use only a single core. This is not super great when N is very large, say 1 million.
tensordot does dispatch to the BLAS backend. A terrible way to do this computation with tensordot is
np.diag(np.tensordot(a, b, axes=[1, 1]))
This is terrible because it allocates an N*N matrix of all pairwise dot products and then throws away everything off the diagonal; for N = 1,000,000 that intermediate alone would be roughly 8 TB of float64 values, so almost all of the work is wasted.
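A related formulation is a stacked matmul, which treats each pair as a (1,3) @ (3,1) matrix product and at least avoids the N*N intermediate; as far as I can tell, though, NumPy still just loops over the stack in compiled code for such tiny per-item matrices rather than spreading it across cores:
# Each row becomes a (1,3) @ (3,1) product; matmul broadcasts over the leading axis.
result = np.matmul(a[:, None, :], b[:, :, None])[:, 0, 0]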
Another (brilliantly fast) approach is the hidden inner1d function
from numpy.core.umath_tests import inner1d
result = inner1d(a,b)
but it seems this isn't going to be viable, since the issue that might export it publicly has gone stale. And this still boils down to writing the loop in C, instead of using multiple cores.
Is there a way to get dot, matmul, or tensordot to do all these dot products at once, on multiple cores?
First of all, there is no direct BLAS function to do that. Using many level-1 BLAS calls is not very efficient either: spawning multiple threads for a very short computation tends to introduce a fairly large overhead, while not using multiple threads may be sub-optimal. Still, such a computation is mainly memory-bound, so it scales poorly on platforms with many cores (a few cores are often enough to saturate the memory bandwidth).
One simple solution is to use the Numexpr package which should do that quite efficiently (it should avoid the creation of temporary arrays and should also use multiple threads). However, the performance is somewhat disappointing for big arrays in this case.
The best solution appears to be Numba (or Cython). Numba can generate fast code for both small and big input arrays, and it is easy to parallelize the code. However, please note that managing threads introduces an overhead which can be quite big for small arrays (up to a few ms on some many-core platforms).
Here is a Numexpr implementation:
import numexpr as ne
expr = ne.NumExpr('sum(a * b, axis=1)')
result = expr.run(a, b)
Here is a (sequential) Numba implementation:
import numba as nb

# Use `parallel=True` for a parallel implementation
@nb.njit('float64[:](float64[:,::1], float64[:,::1])')
def multiDots(a, b):
    assert a.shape == b.shape
    n, m = a.shape
    res = np.empty(n, dtype=np.float64)
    # Use `nb.prange` instead of `range` to run the loop in parallel
    for i in range(n):
        s = 0.0
        for j in range(m):
            s += a[i, j] * b[i, j]
        res[i] = s
    return res
result = multiDots(a, b)
Here are some benchmarks on an (old) 2-core machine:
On small 5x3 arrays:
np.einsum('ij,ij->i', a, b, optimize=True): 45.2 us
Numba (parallel): 12.1 us
np.sum(a*b, axis=1): 9.5 us
np.einsum('ij,ij->i', a, b): 6.5 us
Numexpr: 3.2 us
inner1d(a, b): 1.8 us
Numba (sequential): 1.3 us
On big 1000000x3 arrays:
np.sum(a*b, axis=1): 27.8 ms
Numexpr: 15.3 ms
np.einsum('ij,ij->i', a, b, optimize=True): 9.0 ms
np.einsum('ij,ij->i', a, b): 8.8 ms
Numba (sequential): 6.8 ms
inner1d(a, b): 6.5 ms
Numba (parallel): 5.3 ms
The sequential Numba implementation gives a good trade-off. You can switch between the sequential and parallel versions based on the array size if you really want the best performance, though choosing the best n threshold in a platform-independent way is not easy.
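For example, here is a minimal sketch of such a switch, reusing the sequential multiDots above next to a parallel variant; the threshold value is a placeholder and has to be tuned for each machine:
import numba as nb
import numpy as np

# Parallel variant of the kernel above: same loops, parallel outer loop.
@nb.njit('float64[:](float64[:,::1], float64[:,::1])', parallel=True)
def multiDotsPar(a, b):
    n, m = a.shape
    res = np.empty(n, dtype=np.float64)
    for i in nb.prange(n):
        s = 0.0
        for j in range(m):
            s += a[i, j] * b[i, j]
        res[i] = s
    return res

# Placeholder threshold: below it, thread-management overhead dominates.
N_SWITCH = 200_000

def multi_dots(a, b):
    return multiDots(a, b) if a.shape[0] < N_SWITCH else multiDotsPar(a, b)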
I need to create a boolean mask by thresholding a 3D data array: the mask must be True at locations where the data are smaller than the lower acceptable limit or larger than the upper acceptable limit, and False otherwise. Succinctly:
mask = (data < low) | (data > high)
I have two versions of the code for performing this operation: one works directly with entire 3D arrays in numpy while the other method loops over slices of the array. Contrary to my expectations, the second method seems to be faster than the first one. Why???
In [1]: import numpy as np
In [2]: import sys
In [3]: print(sys.version)
3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
In [4]: print(np.__version__)
1.14.0
In [5]: arr = np.random.random((10, 1000, 1000))
In [6]: def method1(arr, low, high):
...: """ Fully vectorized computations """
...: out = np.empty(arr.shape, dtype=np.bool)
...: np.greater_equal(arr, high, out)
...: np.logical_or(out, arr < low, out)
...: return out
...:
In [7]: def method2(arr, low, high):
...: """ Partially vectorized computations """
...: out = np.empty(arr.shape, dtype=np.bool)
...: for k in range(arr.shape[0]):
...: a = arr[k]
...: o = out[k]
...: np.greater_equal(a, high, o)
...: np.logical_or(o, a < low, o)
...: return out
...:
First of all, let's make sure that both methods produce identical results:
In [8]: np.all(method1(arr, 0.2, 0.8) == method2(arr, 0.2, 0.8))
Out[8]: True
And now some timing tests:
In [9]: %timeit method1(arr, 0.2, 0.8)
14.4 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %timeit method2(arr, 0.2, 0.8)
11.5 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
What is going on here?
EDIT 1: A similar behavior is observed in an older environment:
In [3]: print(sys.version)
2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
In [4]: print(np.__version__)
1.11.3
In [9]: %timeit method1(arr, 0.2, 0.8)
100 loops, best of 3: 14.3 ms per loop
In [10]: %timeit method2(arr, 0.2, 0.8)
100 loops, best of 3: 13 ms per loop
Outperforming both methods
In method one you are accessing the array twice. If it doesn't fit in the cache, the data will be read from RAM twice, which lowers the performance. Additionally, it is possible that temporary arrays are created, as mentioned in the comments.
Method two is more cache friendly, since you are accessing a smaller part of the array twice, which is likely to fit in cache. The downsides are slow looping and more function calls, which are also quite slow.
To get good performance here it is recommended to compile the code, which can be done using Cython or Numba. Since the Cython version is a bit more work (type annotations, the need for a separate compiler), I will show how to do this using Numba.
import numba as nb

@nb.njit(fastmath=True, cache=True)
def method3(arr, low, high):
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out
Using arr = np.random.random((10, 1000, 1000)), this outperforms your method1 by a factor of two and your method2 by 50 percent on my PC (Core i7-4771, Python 3.5, Windows).
This is only a simple example. In more complex code, where you can make use of SIMD and parallel processing (which is also very easy to use), the performance gain can be a lot bigger. With non-compiled code, vectorization is usually, but not always (as shown), the best you can do; however, it leads to bad cache behaviour, and thus suboptimal performance, if the chunks of data you are accessing don't fit at least in the L3 cache. On some other problems there will also be a performance hit if the data can't fit in the much smaller L1 or L2 cache. Another advantage is the automatic inlining of small njit-ed functions into an njit-ed function which calls them.
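As a sketch of the parallel-processing point (the actual gain depends on the machine and on whether the computation is already memory-bound), the outer loop of method3 can be distributed over threads:
import numba as nb
import numpy as np

@nb.njit(fastmath=True, cache=True, parallel=True)
def method3_parallel(arr, low, high):
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in nb.prange(arr.shape[0]):  # outer loop runs on multiple threads
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out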
In my own tests, the difference in performance was even more noticeable than in your question. The differences continued to be clearly observable after increasing the second and third dimensions of the arr data. They also continued to be observable after commenting out one of the two comparison functions (greater_equal or logical_or), which means we can rule out some kind of strange interaction between the two.
By changing the implementation of the two methods to the following, I could significantly reduce the observable difference in performance (but not completely eliminate it):
def method1(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    high = np.ones_like(arr) * high
    low = np.ones_like(arr) * low
    np.greater_equal(arr, high, out)
    np.logical_or(out, arr < low, out)
    return out

def method2(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    high = np.ones_like(arr) * high
    low = np.ones_like(arr) * low
    for k in range(arr.shape[0]):
        a = arr[k]
        o = out[k]
        h = high[k]
        l = low[k]
        np.greater_equal(a, h, o)
        np.logical_or(o, a < l, o)
    return out
I suppose that, when supplying high or low as a scalar to those numpy functions, they may internally first create a numpy array of the correct shape filled with that scalar. When we do this manually outside the functions, in both cases only once for the full shape, the performance difference becomes much less noticeable. This implies that, for whatever reason (maybe cache?), creating such a large array filled with the same constant once may be less efficient than creating k smaller arrays with the same constant (as done automatically by the implementation of method2 in the original question).
Note: in addition to reducing the performance gap, it also makes the performance of both methods much worse (affecting the second method more severely than the first). So, while this may give some indication of where the issue may be, it doesn't appear to explain everything.
EDIT
Here is a new version of method2, where we now manually pre-create smaller arrays within the loop every time, like I suspect is happening internally in numpy in the original implementation in the question:
def method2(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    for k in range(arr.shape[0]):
        a = arr[k]
        o = out[k]
        h = np.full_like(a, high)
        l = np.full_like(a, low)
        np.greater_equal(a, h, o)
        np.logical_or(o, a < l, o)
    return out
This version is indeed much faster again than the one I have above (confirming that creating many smaller arrays inside the loop is more efficient than one big one outside the loop), but still slower than the original implementation in the question.
Under the hypothesis that these numpy functions are indeed converting the scalar bounds into these kinds of arrays first, the difference in performance between this last function and the one in the question could be due to the creation of arrays in Python (my implementation) vs. doing so natively (original implementation).
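If one wants to probe this hypothesis further, a variant that broadcasts the scalars without allocating any data (np.broadcast_to returns a view) could be timed against the versions above; this is only a sketch of the experiment, not a result:
def method1_broadcast(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    h = np.broadcast_to(high, arr.shape)  # zero-copy view of the scalar
    l = np.broadcast_to(low, arr.shape)
    np.greater_equal(arr, h, out)
    np.logical_or(out, arr < l, out)
    return out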
I'm developing code that integrates an ODE using scipy's complex_ode, where the integrand includes a Fourier transform and exponential operator acting on a large array of complex values.
To optimize performance, I've profiled this and found the main bottleneck is (after optimizing FFTs using PyFFTW etc) in the line:
val = np.exp(float_value * arr)
I'm currently using numpy which I understand calls C code - and thus should be quick. But is there any way to further improve performance please?
I've looked into using Numba but since my main loop includes FFTs too, I don't think it can be compiled (nopython=True flag leads to errors) and thus, I suspect it offers no gain.
Here is a test example for the code I'd like to optimize:
arr = np.random.rand(2**14) + 1j *np.random.rand(2**14)
float_value = 0.5
%timeit np.exp(float_value * arr)
Any suggestions welcomed thanks.
We could leverage the numexpr module, which works really efficiently on large data involving transcendental operations -
In [91]: arr = np.random.rand(2**14) + 1j *np.random.rand(2**14)
...: float_value = 0.5
...:
In [92]: %timeit np.exp(float_value * arr)
1000 loops, best of 3: 739 µs per loop
In [94]: import numexpr as ne
In [95]: %timeit ne.evaluate('exp(float_value*arr)')
1000 loops, best of 3: 241 µs per loop
This seems to be consistent with the expected performance as stated in the docs.
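Numexpr is multi-threaded by default; if needed, the thread count can also be set explicitly (the value below is just an example and should be tuned to the machine):
import numpy as np
import numexpr as ne

arr = np.random.rand(2**14) + 1j * np.random.rand(2**14)
float_value = 0.5

ne.set_num_threads(4)  # example value; tune per machine
val = ne.evaluate('exp(float_value * arr)')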
Can a single numpy einsum statement replicate gemm functionality? Scalar and matrix multiplication seem straightforward, but I haven't found how to get the "+" working. In case it's simpler, D = alpha * A * B + beta * C would be acceptable (preferable, actually)
alpha = 2
beta = 3
A = np.arange(9).reshape(3, 3)
B = A + 1
C = B + 1
left_part = alpha*np.dot(A, B)
print(left_part)
left_part = np.einsum(',ij,jk->ik', alpha, A, B)
print(left_part)
There seems to be some confusion here: np.einsum handles operations that can be cast in the following form: broadcast–multiply–reduce. Element-wise summation is not part of its scope.
The reason why you need this sort of thing for the multiplication is that writing these operations out "naively" may exceed memory or computing resources quickly. Consider, for example, matrix multiplication:
import numpy as np
x, y = np.ones((2, 2000, 2000))
# explicit loop - ridiculously slow
a = sum(x[:,j,np.newaxis] * y[j,:] for j in range(2000))
# explicit broadcast-multiply-reduce: throws MemoryError
a = (x[:,:,np.newaxis] * y[:,np.newaxis,:]).sum(1)
# einsum or dot: fast and memory-saving
a = np.einsum('ij,jk->ik', x, y)
The Einstein convention, however, distributes over addition, so you can write your BLAS-like problem simply as:
d = np.einsum(',ij,jk->ik', alpha, a, b) + np.einsum(',ik', beta, c)
with minimal memory overhead (you can rewrite most of it as in-place operations if you are really concerned about memory) and constant runtime overhead (the cost of two python-to-C calls).
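For instance, a sketch of that in-place rewriting, reusing the first einsum's output as the accumulator (one temporary for beta * c still remains):
import numpy as np

alpha, beta = 2.0, 3.0
a = np.arange(9.0).reshape(3, 3)
b = a + 1
c = b + 1

d = np.einsum(',ij,jk->ik', alpha, a, b)  # alpha * (a @ b)
d += beta * c                             # accumulate beta * c into d in place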
So regarding performance, this seems, respectfully, like a case of premature optimization to me: have you actually verified that the split of GEMM-like operations into two separate numpy calls is a bottleneck in your code? If it indeed is, then I suggest the following (in order of increasing involvedness):
1. Try, carefully!, scipy.linalg.blas.dgemm, which wraps the BLAS GEMM call directly (see the sketch after this list). I would be surprised if you get significantly better performance, since dgemms are usually only building blocks themselves.
2. Try an expression compiler (essentially you are proposing such a thing) like Theano.
3. Write your own generalised ufunc using Cython or C.
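For reference, here is a sketch of the first suggestion: dgemm computes alpha*a@b + beta*c in a single BLAS call, and passing float64, Fortran-ordered arrays avoids hidden copies, which is part of the "carefully!" caveat.
import numpy as np
from scipy.linalg.blas import dgemm

alpha, beta = 2.0, 3.0
a = np.arange(9, dtype=np.float64).reshape(3, 3)
b = a + 1
c = b + 1

d = dgemm(alpha, np.asfortranarray(a), np.asfortranarray(b),
          beta=beta, c=np.asfortranarray(c))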
I have created a 3D median filter which does work and is the following:
def Median_Filter_3D(image, kernel):
    window = np.zeros(shape=(kernel, kernel, kernel), dtype=np.uint8)
    n = (kernel - 1) // 2  # Deals with the image border
    imgout = np.empty_like(image)
    w, h, l = image.shape
    # Start loop over each pixel
    for y in np.arange(0, (w - n*2), 1):
        for x in np.arange(0, (h - n*2), 1):
            for z in np.arange(0, (l - n*2), 1):
                window[:, :, :] = image[x:x+kernel, y:y+kernel, z:z+kernel]
                med = np.median(window)
                imgout[x+n, y+n, z+n] = med
    return imgout
So at every pixel, it creates a window of size kernel x kernel x kernel, finds the median value of the pixels in the window, and replaces the value of that pixel with the new median value.
My problem is that it's very slow, and I have thousands of big images to process. There must be a faster way to iterate through all these pixels and still get the same result.
Thanks in advance!!
First, looping over a 3D matrix in python is a very, very bad idea. In order to loop over a large 3D matrix you are better off going down to Cython or C/C++/Fortran and creating a python extension. However, for this particular case, scipy already contains an implementation of the median filter for n-dimensional arrays:
>>> from scipy.ndimage import median_filter
>>> median_filter(my_large_3d_array, size=kernel)
In short, there is no faster way of iterating through voxels in python (maybe numpy iterators would help a bit, but won't increase the performance considerably). If you need to perform more complicated 3D stuff in python, you should consider programming the loopy interface in Cython or, alternatively, using a chunking library such as Dask, which implements parallel operations for chunks of arrays.
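As a rough sketch of the chunking idea (the array shape, chunk size, and kernel size below are placeholders; the overlap depth must be at least half the kernel so chunk borders are filtered correctly):
import numpy as np
import dask.array as da
from scipy.ndimage import median_filter

image = np.random.randint(0, 255, (128, 512, 512), dtype=np.uint8)
kernel = 3

# Split the volume into chunks and median-filter each chunk in parallel,
# sharing `depth` voxels of overlap between neighbouring chunks.
d = da.from_array(image, chunks=(32, 512, 512))
filtered = d.map_overlap(median_filter, depth=kernel // 2, size=kernel).compute()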
The problem with Python is that for loops are extremely slow, especially if they are nested and operate on large arrays. Thus, there is no standard pythonic method for obtaining efficient iteration over arrays. Usually, the way to get speed-ups is through vectorized operations and numpy tricks, but those are very problem-specific and there is no generic trick; you will learn a lot of numpy tricks here on SO.
As a generic approach, if you really need to iterate over arrays, you can write your code in Cython. Cython is a C-like extension for Python. You write code in Python syntax, but specify variable types (like in C, with int or float). That code is then compiled automatically to C and can be called from Python. A quick example:
Example Python loopy function:
import numpy as np
def iter_A(A):
    B = np.empty(A.shape, dtype=np.float64)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            B[i, j] = A[i, j] * 2
    return B
I know that the above code is kinda redundant and could be written as B = A * 2, but its purpose is just to illustrate that python loops are extremely slow.
Cython version of the function:
import numpy as np
cimport numpy as np
def iter_A_cy(double[:, ::1] A):
    cdef Py_ssize_t H = A.shape[0], W = A.shape[1]
    cdef double[:, ::1] B = np.empty((H, W), dtype=np.float64)
    cdef Py_ssize_t i, j
    for i in range(H):
        for j in range(W):
            B[i, j] = A[i, j] * 2
    return np.asarray(B)
Test speeds of both implementations:
>>> import numpy as np
>>> A = np.random.randn(1000, 1000)
>>> %timeit iter_A(A)
1 loop, best of 3: 399 ms per loop
>>> %timeit iter_A_cy(A)
100 loops, best of 3: 2.11 ms per loop
NOTE: you cannot run the Cython function as it is. You need to put it in a separate file and compile it first (or use %%cython magic in IPython Notebook).
It shows that the raw Python version took about 400 ms to iterate over the whole array, while the Cython version took only about 2 ms (a ~200x speedup).
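As noted above, the .pyx code has to be compiled before it can be imported; a minimal build script might look like this (the file name iter_cy.pyx is hypothetical):
# setup.py -- assumes the Cython function above is saved as iter_cy.pyx
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("iter_cy.pyx"),
    include_dirs=[np.get_include()],
)
Build it with python setup.py build_ext --inplace, after which from iter_cy import iter_A_cy works as in the timing session above.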
I've got a performance bottleneck. I'm computing the column-wise mean of large arrays (250 rows & 1.3 million columns), and I do so more than a million times in my application.
My test case in Python:
import numpy as np
big_array = np.random.random((250, 1300000))
%timeit mean = big_array.mean(axis = 0) # ~400 milliseconds
Numpy takes around 400 milliseconds on my machine, running on a single core. I've tried several other matrix libraries across different languages (Cython, R, Julia, Torch), but found only Julia to beat Numpy, by taking around 250 milliseconds.
Can anyone provide evidence of substantial improvements in performance in this task? Perhaps this is a task suited for the GPU?
Edit: My application is evidently memory-constrained, and its performance is dramatically improved by accessing elements of a large array only once, rather than repeatedly. (See comment below.)
Julia, if I'm not mistaken, uses Fortran ordering in memory, as opposed to numpy, which uses C memory layout by default. So if you rearrange things to the same layout, so that the mean is taken along contiguous memory, you get better performance:
In [1]: import numpy as np
In [2]: big_array = np.random.random((250, 1300000))
In [4]: big_array_f = np.asfortranarray(big_array)
In [5]: %timeit mean = big_array.mean(axis = 0)
1 loop, best of 3: 319 ms per loop
In [6]: %timeit mean = big_array_f.mean(axis = 0)
1 loop, best of 3: 205 ms per loop
Or you can just change your dimensions and take the mean over the other axis:
In [10]: big_array = np.random.random((1300000, 250))
In [11]: %timeit mean = big_array.mean(axis = 1)
1 loop, best of 3: 205 ms per loop
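To see why the Fortran-ordered copy helps, it is instructive to look at the strides: the axis-0 reduction steps between adjacent doubles in the Fortran layout, but has to jump 1.3 million doubles at a time in the default C layout.
import numpy as np

big_array = np.random.random((250, 1300000))
big_array_f = np.asfortranarray(big_array)

print(big_array.strides)    # (10400000, 8): axis 0 jumps 10.4 MB per step
print(big_array_f.strides)  # (8, 2000): axis 0 steps to the adjacent double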