Consider a numpy array A of dimensionality NxM. The goal is to compute the Euclidean distance matrix D, where each element D[i,j] is the Euclidean distance between rows i and j. What is the fastest way of doing it? This is not exactly the problem I need to solve, but it's a good example of what I'm trying to do (in general, other distance metrics could be used).
This is the fastest I could come up with so far:
n = A.shape[0]
D = np.empty((n,n))
for i in range(n):
    D[i] = np.sqrt(np.square(A-A[i]).sum(1))
But is it the fastest way? I'm mainly concerned about the for loop. Can we beat this with, say, Cython?
To avoid looping, I tried to use broadcasting, and do something like this:
D = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
But it turned out to be a bad idea, because there's some overhead in constructing an intermediate 3D array of dimensionality NxNxM, so the performance is worse.
I also tried Cython, but I am a newbie in Cython, so I don't know how good my attempt is:
def dist(np.ndarray[np.int32_t, ndim=2] A):
    cdef int n = A.shape[0]
    cdef np.ndarray[np.float64_t, ndim=2] dm = np.empty((n,n), dtype=np.float64)
    cdef int i = 0
    for i in range(n):
        dm[i] = np.sqrt(np.square(A-A[i]).sum(1)).astype(np.float64)
    return dm
The above code was a bit slower than the plain Python for loop. I don't know much about Cython, but I assume I should at least be able to match the performance of the for loop + numpy. So I am wondering whether it is possible to achieve a noticeable performance improvement when this is done the right way, or whether there is some other way to speed it up (not involving parallel computation)?
The key thing with Cython is to avoid using Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.
There's a very useful tutorial here that covers the process of converting numpy code to Cython and optimizing it.
Here's a quick stab at a more optimized Cython version of your distance function:
import numpy as np
cimport numpy as np
cimport cython
# don't use np.sqrt - the sqrt function from the C standard library is much
# faster
from libc.math cimport sqrt
# disable checks that ensure that array indices don't go out of bounds. this is
# faster, but you'll get a segfault if you mess up your indexing.
@cython.boundscheck(False)
# this disables 'wraparound' indexing from the end of the array using negative
# indices.
@cython.wraparound(False)
def dist(double [:, :] A):
    # declare C types for as many of our variables as possible. note that we
    # don't necessarily need to assign a value to them at declaration time.
    cdef:
        # Py_ssize_t is just a special platform-specific type for indices
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk
        # this line is particularly expensive, since creating a numpy array
        # involves unavoidable Python API overhead
        np.ndarray[np.float64_t, ndim=2] D = np.zeros((nrow, nrow), np.double)
        double tmpss, diff
    # another advantage of using Cython rather than broadcasting is that we can
    # exploit the symmetry of D by only looping over its upper triangle
    for ii in range(nrow):
        for jj in range(ii + 1, nrow):
            # we use tmpss to accumulate the SSD over each pair of rows
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss += diff * diff
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss  # because D is symmetric
    return D
I saved this in a file called fastdist.pyx. We can use pyximport to simplify the build process:
import pyximport
pyximport.install()
import fastdist
import numpy as np
A = np.random.randn(100, 200)
D1 = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
D2 = fastdist.dist(A)
print(np.allclose(D1, D2))
# True
So it works, at least. Let's do some benchmarking using the %timeit magic:
%timeit np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
# 100 loops, best of 3: 10.6 ms per loop
%timeit fastdist.dist(A)
# 100 loops, best of 3: 1.21 ms per loop
A ~9x speed-up is nice, but not really a game-changer. As you said, though, the big problem with the broadcasting approach is the memory requirements of constructing the intermediate array.
A2 = np.random.randn(1000, 2000)
%timeit fastdist.dist(A2)
# 1 loops, best of 3: 1.36 s per loop
I wouldn't recommend trying that using broadcasting...
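For a sense of scale, here is a rough estimate of the size of the broadcasting temporary for A2 (a quick sketch, assuming float64 entries):
# The (N, N, M) intermediate for A2 would hold 1000 * 1000 * 2000 float64 values:
N, M = 1000, 2000
print(N * N * M * 8 / 1024.**3)  # ~14.9 GiB for a single temporary array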
Another thing we could do is parallelize this over the outermost loop, using the prange function:
from cython.parallel cimport prange
...
for ii in prange(nrow, nogil=True, schedule='guided'):
...
In order to compile the parallel version you'll need to tell the compiler to enable OpenMP. I haven't figured out how to do this using pyximport, but if you're using gcc you could compile it manually like this:
$ cython fastdist.pyx
$ gcc -shared -pthread -fPIC -fwrapv -fopenmp -O3 \
-Wall -fno-strict-aliasing -I/usr/include/python2.7 -o fastdist.so fastdist.c
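If you'd rather not invoke gcc by hand, a small setup.py can pass the OpenMP flags through setuptools instead. This is only a sketch (I haven't tried wiring it into pyximport, and the flags below assume a gcc-compatible compiler; MSVC uses /openmp instead):
# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "fastdist",
    sources=["fastdist.pyx"],
    include_dirs=[np.get_include()],       # needed for `cimport numpy`
    extra_compile_args=["-fopenmp", "-O3"],
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize([ext]))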
With parallelism, using 8 threads:
%timeit D2 = fastdist.dist_parallel(A2)
1 loops, best of 3: 509 ms per loop
Suppose I have two vectors and wish to take their dot product; this is simple,
import numpy as np
a = np.random.rand(3)
b = np.random.rand(3)
result = np.dot(a,b)
If I have stacks of vectors and I want each one dotted, the most naive code is
# 5 = number of vectors
a = np.random.rand(5,3)
b = np.random.rand(5,3)
result = [np.dot(aa,bb) for aa, bb in zip(a,b)]
Two ways to batch this computation are using a multiply and sum, and einsum,
result = np.sum(a*b, axis=1)
# or
result = np.einsum('ij,ij->i', a, b)
However, neither of these dispatches to the BLAS backend, and so they use only a single core. This is not great when N is very large, say 1 million.
tensordot does dispatch to the BLAS backend. A terrible way to do this computation with tensordot is
np.diag(np.tensordot(a, b, axes=[1,1]))
This is terrible because it allocates an N*N matrix, and the majority of its elements are wasted work.
Another (brilliantly fast) approach is the hidden inner1d function
from numpy.core.umath_tests import inner1d
result = inner1d(a,b)
but it seems this isn't going to be viable, since the issue that might export it publicly has gone stale. And this still boils down to writing the loop in C, instead of using multiple cores.
Is there a way to get dot, matmul, or tensordot to do all these dot products at once, on multiple cores?
First of all, there is no direct BLAS function to do that. Using many level-1 BLAS function calls is not very efficient, since using multiple threads for a very short computation tends to introduce a fairly big overhead, while not using multiple threads may be sub-optimal. Still, such a computation is mainly memory-bound, so it scales poorly on platforms with many cores (a few cores are often enough to saturate the memory bandwidth).
One simple solution is to use the Numexpr package which should do that quite efficiently (it should avoid the creation of temporary arrays and should also use multiple threads). However, the performance is somewhat disappointing for big arrays in this case.
The best solution appears to be Numba (or Cython). Numba can generate fast code for both small and big input arrays, and it is easy to parallelize the code. However, please note that managing threads introduces an overhead which can be quite big for small arrays (up to a few ms on some many-core platforms).
Here is a Numexpr implementation:
import numexpr as ne
expr = ne.NumExpr('sum(a * b, axis=1)')
result = expr.run(a, b)
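If you don't need to pre-compile the expression, the same computation can be written with the higher-level ne.evaluate convenience function (a sketch; the reduction has to be the outermost operation in the expression string):
import numexpr as ne
import numpy as np

a = np.random.rand(1000000, 3)
b = np.random.rand(1000000, 3)

# evaluate() compiles and runs the expression in one call, picking up `a` and
# `b` from the calling frame.
result = ne.evaluate('sum(a * b, axis=1)')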
Here is a (sequential) Numba implementation:
import numba as nb
import numpy as np

# Use `parallel=True` for a parallel implementation
@nb.njit('float64[:](float64[:,::1], float64[:,::1])')
def multiDots(a, b):
    assert a.shape == b.shape
    n, m = a.shape
    res = np.empty(n, dtype=np.float64)
    # Use `nb.prange` instead of `range` to run the loop in parallel
    for i in range(n):
        s = 0.0
        for j in range(m):
            s += a[i,j] * b[i,j]
        res[i] = s
    return res

result = multiDots(a, b)
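And here is what the parallel variant hinted at in the comments would look like (a sketch: the same kernel with parallel=True and nb.prange; remember that the thread-management overhead mentioned above makes this counter-productive for small arrays):
import numba as nb
import numpy as np

@nb.njit('float64[:](float64[:,::1], float64[:,::1])', parallel=True)
def multiDotsParallel(a, b):
    assert a.shape == b.shape
    n, m = a.shape
    res = np.empty(n, dtype=np.float64)
    # Rows are independent, so the outer loop can be split across threads.
    for i in nb.prange(n):
        s = 0.0
        for j in range(m):
            s += a[i,j] * b[i,j]
        res[i] = s
    return res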
Here are some benchmarks on a (old) 2-core machine:
On small 5x3 arrays:
np.einsum('ij,ij->i', a, b, optimize=True): 45.2 us
Numba (parallel): 12.1 us
np.sum(a*b, axis=1): 9.5 us
np.einsum('ij,ij->i', a, b): 6.5 us
Numexpr: 3.2 us
inner1d(a, b): 1.8 us
Numba (sequential): 1.3 us
On big 1000000x3 arrays:
np.sum(a*b, axis=1): 27.8 ms
Numexpr: 15.3 ms
np.einsum('ij,ij->i', a, b, optimize=True): 9.0 ms
np.einsum('ij,ij->i', a, b): 8.8 ms
Numba (sequential): 6.8 ms
inner1d(a, b): 6.5 ms
Numba (parallel): 5.3 ms
The sequential Numba implementation gives a good trade-off. You can use a switch between the sequential and parallel versions if you really want the best performance (a sketch follows below). Choosing the best n threshold in a platform-independent way is not so easy, though.
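Such a switch could look like the following sketch; the cutoff value here is made up and would have to be tuned per machine:
# Hypothetical threshold: below it the thread-management overhead outweighs
# the gain from parallelism. Tune this on the target machine.
PARALLEL_CUTOFF = 50000

def multi_dots_auto(a, b):
    if a.shape[0] < PARALLEL_CUTOFF:
        return multiDots(a, b)           # sequential version above
    return multiDotsParallel(a, b)       # parallel variant sketched above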
I have a segment of code that is based on a large numpy array and then operates on another array. Because this is a very large array, could you please let me know whether there is an efficient way to achieve my goal? (I think the efficient way would operate on the array directly rather than through a for-loop.)
Thanks in advance; please find my code below:
N = 1000000000
rand = np.random.rand(N)
beta = np.zeros(N)
for i in range(0, N):
    if rand[i] < 0.5:
        beta[i] = 2.0*rand[i]
    else:
        beta[i] = 1.0/(2.0*(1.0-rand[i]))
You are basically losing the efficiency of numpy here by performing the processing in Python. The idea of numpy is to process the items in bulk, since it has efficient algorithms implemented in C behind the curtains that do the actual processing. You can see the Python end of numpy more as an "interface".
Now to answer your question: we can first construct an array of random numbers between 0 and 2, by multiplying by 2 right away:
rand = 2.0 * np.random.rand(N)
Next we can use np.where(..) [numpy-doc], which acts like a conditional selector: we pass it three "arrays". The first is an array of booleans that encodes the truthiness of the "condition", the second is the array of values to fill in where the condition is true, and the third is the array of values to plug in where the condition is false. We can then write it like:
N = 1000000000
rand = 2 * np.random.rand(N)
beta = np.where(rand < 1.0, rand, 1.0 / (2.0 - rand))
N = 1000000000 caused a MemoryError for me. Reducing to 100 for a minimal example.
You can use the np.where routine.
In both cases, you are fundamentally iterating over your array and applying a function. However, np.where uses a much faster loop (it's basically compiled code), while your "python" loop is interpreted and thus really slow for a big N.
Here's an example of implementation.
N = 100
rand = np.random.rand(N)
beta = np.where(rand < 0.5, 2.0 * rand, 1.0/(2.0*(1.0-rand)))
As other answers have pointed out, iterating over the elements of a numpy array in a Python loop should (and can) almost always be avoided. In most cases going from a Python loop to an array operation gives a speedup of ~100x.
However, if performance is absolutely critical, you can often squeeze out another factor of between 2x and 10x (in my experience) by using Cython.
Here's an example:
%%cython
cimport numpy as np
import numpy as np
cimport cython
from cython cimport floating
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef np.ndarray[floating, ndim=1] beta(np.ndarray[floating, ndim=1] arr):
    cdef:
        Py_ssize_t i
        Py_ssize_t N = arr.shape[0]
        # match the input dtype so the result buffer agrees with the fused type
        np.ndarray[floating, ndim=1] result = np.zeros(N, dtype=arr.dtype)
    for i in range(N):
        if arr[i] < 0.5:
            result[i] = 2.0*arr[i]
        else:
            result[i] = 1.0/(2.0*(1.0-arr[i]))
    return result
You would then call it as beta(rand).
As you can see, this allows you to use your original loop structure, but now using efficient typed native code. I get a speedup of ~2.5x compared to np.where.
It should be noted that in many cases this is not worth the extra effort compared to the one-liner in numpy -- but it may well be worth it where performance is critical.
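For reference, this is roughly how the comparison above can be reproduced (a sketch; the timings will vary by machine and by N, so no numbers are claimed here):
import numpy as np
rand = np.random.rand(10000000)  # a smaller N than in the question, just for timing

%timeit np.where(rand < 0.5, 2.0 * rand, 1.0 / (2.0 * (1.0 - rand)))
%timeit beta(rand)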
In my project I need to compute the Euclidean distance between each pair of points stored in an array.
The input array is a 2D numpy array with 3 columns, which are the coordinates (x, y, z), and each row defines a new point.
I'm usually working with 5000 - 6000 points in my test cases.
My first algorithm uses Cython and my second numpy. I find that my numpy algorithm is faster than the Cython one.
edit: with 6000 points :
numpy 1.76 s / cython 4.36 s
Here's my cython code:
cimport cython
from libc.math cimport sqrt
@cython.boundscheck(False)
@cython.wraparound(False)
cdef void calcul1(double[::1] M,double[::1] R):
    cdef int i=0
    cdef int max = M.shape[0]
    cdef int x,y
    cdef int start = 1
    for x in range(0,max,3):
        for y in range(start,max,3):
            R[i]= sqrt((M[y] - M[x])**2 + (M[y+1] - M[x+1])**2 + (M[y+2] - M[x+2])**2)
            i+=1
        start += 1
M is a memory view of the initial input array, flattened with numpy's flatten() before the call to calcul1(); R is a memory view of a 1D output array that stores all the results.
Here's my Numpy code :
def calcul2(M):
    return np.sqrt(((M[:,:,np.newaxis] - M[:,np.newaxis,:])**2).sum(axis=0))
Here M is the initial input array, transposed with numpy's transpose() before the function call so that the coordinates (x, y, z) are rows and the points are columns.
Moreover, this numpy function is quite convenient because the array it returns is well organized. It's an n-by-n array, with n the number of points, and each point has a row and a column. So, for example, the distance AB is stored at the intersection of row A and column B.
Here's how I call them (cython function):
cpdef test():
    cdef double[::1] Mf
    cdef double[::1] out = np.empty(17998000,dtype=np.float64) # (6000² - 6000) / 2
    M = np.arange(6000*3,dtype=np.float64).reshape(6000,3) # Example array with 6000 points
    Mf = M.flatten() #because my cython algorithm need a 1D array
    Mt = M.transpose() # because my numpy algorithm need coordinates as rows
    calcul2(Mt)
    calcul1(Mf,out)
Am I doing something wrong here ? For my project both are not fast enough.
1: Is there a way to improve my cython code in order to beat numpy's speed ?
2: Is there a way to improve my numpy code to compute even faster ?
3: Or any other solutions, but it must be a python/cython (like parallel computing) ?
Thank you.
Not sure where you are getting your timings, but you can use scipy.spatial.distance:
import numpy as np
import scipy.spatial.distance as sd

M = np.arange(6000*3, dtype=np.float64).reshape(6000,3)
np_result = calcul2(M)
sp_result = sd.cdist(M.T, M.T) #Scipy usage
np.allclose(np_result, sp_result)
>>> True
Timings:
%timeit calcul2(M)
1000 loops, best of 3: 313 µs per loop
%timeit sd.cdist(M.T, M.T)
10000 loops, best of 3: 86.4 µs per loop
Importantly, its also useful to realize that your output is symmetric:
np.allclose(sp_result, sp_result.T)
>>> True
An alternative is to only compute the upper triangular of this array:
%timeit sd.pdist(M.T)
10000 loops, best of 3: 39.1 µs per loop
Edit: Not sure which index you want to zip, looks like you may be doing it both ways? Zipping the other index for comparison:
%timeit sd.pdist(M)
10 loops, best of 3: 135 ms per loop
Still about 10x faster than your current NumPy implementation.
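One practical note on pdist: it returns the distances in condensed (1-D) form. If downstream code needs the full square matrix like calcul2 produces, scipy can expand it, at the cost of the full n-by-n memory:
from scipy.spatial.distance import pdist, squareform

condensed = pdist(M)            # 1-D array of length n*(n-1)/2
square = squareform(condensed)  # full symmetric (n, n) distance matrix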
I have created a 3D median filter which does work and is the following:
def Median_Filter_3D(image, kernel):
    window = np.zeros(shape=(kernel, kernel, kernel), dtype=np.uint8)
    n = (kernel-1)//2  # Deals with image border
    imgout = np.empty_like(image)
    w, h, l = image.shape
    # Start loop over each pixel
    for y in np.arange(0, (w-n*2), 1):
        for x in np.arange(0, (h-n*2), 1):
            for z in np.arange(0, (l-n*2), 1):
                window[:,:,:] = image[x:x+kernel, y:y+kernel, z:z+kernel]
                med = np.median(window)
                imgout[x+n, y+n, z+n] = med
    return imgout
So at every pixel, it creates a window of size kernel x kernel x kernel, finds the median value of the pixels in the window, and replaces the value of that pixel with that median value.
My problem is that it's very slow; I have thousands of big images to process. There must be a faster way to iterate through all these pixels and still be able to get the same result.
Thanks in advance!!
First, looping over a 3D matrix in Python is a very, very bad idea. In order to loop over a large 3D matrix you are better off going down to Cython or C/C++/Fortran and creating a Python extension. However, for this particular case, scipy already contains an implementation of the median filter for n-dimensional arrays:
>>> from scipy.ndimage import median_filter
>>> median_filter(my_large_3d_array, radius)
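For completeness, a self-contained usage sketch (the array shape and window size here are made up):
import numpy as np
from scipy.ndimage import median_filter

image = np.random.randint(0, 256, size=(64, 64, 64)).astype(np.uint8)
filtered = median_filter(image, size=3)   # 3x3x3 window, output has the same shape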
In short, there is no faster way of iterating through voxels in Python (numpy iterators might help a bit, but won't increase the performance considerably). If you need to perform more complicated 3D operations in Python, you should consider writing the loops in Cython or, alternatively, using a chunking library such as Dask, which implements parallel operations on chunks of arrays (see the sketch below).
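As a taste of the chunking approach, here is a sketch using dask.array's map_overlap, which gives each chunk a one-voxel halo so the windows at chunk edges stay correct (the chunk sizes here are arbitrary, and it assumes dask and scipy are installed):
import numpy as np
import dask.array as da
from scipy.ndimage import median_filter

image = np.random.randint(0, 256, size=(256, 256, 256)).astype(np.uint8)
darr = da.from_array(image, chunks=(128, 128, 128))

# depth=1 matches a 3x3x3 window (kernel // 2); `size` is forwarded to median_filter.
filtered = darr.map_overlap(median_filter, depth=1, boundary='reflect', size=3).compute()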
The problem with Python is that for loops are extremely slow, especially if they are nested and operate on large arrays. Thus, there is no standard pythonic method for obtaining efficient iteration over arrays. Usually, the way to get speed-ups is through vectorized operations and numpy tricks, but those are very problem-specific and there is no generic trick; you will learn a lot of numpy tricks here on SO.
As a generic approach, if you really need to iterate over arrays, you can write your code in Cython. Cython is a C-like extension for Python. You write code in Python syntax, but specify variable types (like in C, with int or float). That code is then compiled automatically to C and can be called from Python. A quick example:
Example Python loopy function:
import numpy as np
def iter_A(A):
    B = np.empty(A.shape, dtype=np.float64)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            B[i, j] = A[i, j] * 2
    return B
I know that the above code is kinda redundant and could be written as B = A * 2, but its purpose is just to illustrate that python loops are extremely slow.
Cython version of the function:
import numpy as np
cimport numpy as np
def iter_A_cy(double[:, ::1] A):
    cdef Py_ssize_t H = A.shape[0], W = A.shape[1]
    cdef double[:, ::1] B = np.empty((H, W), dtype=np.float64)
    cdef Py_ssize_t i, j
    for i in range(H):
        for j in range(W):
            B[i, j] = A[i, j] * 2
    return np.asarray(B)
Test speeds of both implementations:
>>> import numpy as np
>>> A = np.random.randn(1000, 1000)
>>> %timeit iter_A(A)
1 loop, best of 3: 399 ms per loop
>>> %timeit iter_A_cy(A)
100 loops, best of 3: 2.11 ms per loop
NOTE: you cannot run the Cython function as it is. You need to put it in a separate file and compile it first (or use %%cython magic in IPython Notebook).
It shows that the raw python version took 400ms to iterate the whole array, while it was only 2ms for the Cython version (x200 speedup).
I have a 1D array of numbers and want to calculate all pairwise Euclidean distances. I have a method (thanks to SO) of doing this with broadcasting, but it's inefficient because it calculates each distance twice. And it doesn't scale well.
Here's an example that gives me what I want with an array of 1000 numbers.
import numpy as np
import random
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
dists = np.abs(r - r[:, None])
What's the fastest implementation in scipy/numpy/scikit-learn that I can use to do this, given that it has to scale to situations where the 1D array has >10k values?
Note: the matrix is symmetric, so I'm guessing that it's possible to get at least a 2x speedup by addressing that, I just don't know how.
Neither of the other answers quite answered the question: one was in Cython, the other was slower. But both provided very useful hints. Following up on them suggests that scipy.spatial.distance.pdist is the way to go.
Here's some code:
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
def option1(r):
    dists = np.abs(r - r[:, None])

def option2(r):
    dists = scipy.spatial.distance.pdist(r, 'cityblock')

def option3(r):
    dists = sklearn.metrics.pairwise.manhattan_distances(r)
Timing with IPython:
In [36]: timeit option1(r)
100 loops, best of 3: 5.31 ms per loop
In [37]: timeit option2(c)
1000 loops, best of 3: 1.84 ms per loop
In [38]: timeit option3(c)
100 loops, best of 3: 11.5 ms per loop
I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like scipy.spatial.distance.pdist is roughly a third slower than the Cython implementation (taking into account the different machines by benchmarking on the np.abs solution).
Here is a Cython implementation that gives a more than 3X speed improvement for this example on my computer. This timing should be reviewed for bigger arrays, though, because the BLAS routines can probably scale much better than this rather naive code.
I know you asked for something inside scipy/numpy/scikit-learn, but maybe this will open new possibilities for you:
File my_cython.pyx:
import numpy as np
cimport numpy as np
import cython
cdef extern from "math.h":
double abs(double t)
#cython.wraparound(False)
#cython.boundscheck(False)
def pairwise_distance(np.ndarray[np.double_t, ndim=1] r):
cdef int i, j, c, size
cdef np.ndarray[np.double_t, ndim=1] ans
size = sum(range(1, r.shape[0]+1))
ans = np.empty(size, dtype=r.dtype)
c = -1
for i in range(r.shape[0]):
for j in range(i, r.shape[0]):
c += 1
ans[c] = abs(r[i] - r[j])
return ans
The answer is a 1-D array containing all non-repeated evaluations.
To import into Python:
import numpy as np
import random
import pyximport; pyximport.install()
from my_cython import pairwise_distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)], dtype=float)
def solOP(r):
    return np.abs(r - r[:, None])
Timing with IPython:
In [2]: timeit solOP(r)
100 loops, best of 3: 7.38 ms per loop
In [3]: timeit pairwise_distance(r)
1000 loops, best of 3: 1.77 ms per loop
Using half the memory, but 6 times slower than np.abs(r - r[:, None]):
triu = np.triu_indices(r.shape[0],1)
dists2 = abs(r[triu[1]]-r[triu[0]])
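If the full symmetric matrix is needed later, the condensed values can be scattered back into a square array (a small sketch; this of course costs the full n-by-n memory again):
import numpy as np

n = r.shape[0]
D = np.zeros((n, n))
D[triu] = dists2   # fill the upper triangle
D += D.T           # mirror it into the lower triangle (diagonal stays 0)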