Optimization and speedup of a mathematical function in python

The purpose of this mathematical function is to compute a distance between two (or more) protein structures using dihedral angles:
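(The formula image has not survived here; reconstructed from the numpy implementation below, the distance between two vectors of $m$ dihedral angles is $d(a, b) = \sqrt{\frac{1}{m}\sum_{j=1}^{m}\frac{1-\cos(a_j-b_j)}{2}}$.)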
It is very useful in structural biology, for example. I have already coded this function in Python using NumPy, but the goal is a faster implementation. As a computation-time reference, I use the euclidean distance function available in the scikit-learn package.
Here is the code I have for the moment:
import numpy as np
import numexpr as ne
from sklearn.metrics.pairwise import euclidean_distances
# We have 10000 structures with 100 dihedral angles
n = 10000
m = 100
# Generate some random data
c = np.random.rand(n,m)
# Generate random int number
x = np.random.randint(c.shape[0])
print c.shape, x
# First version with numpy of the dihedral_distances function
def dihedral_distances(a, b):
    l = 1./a.shape[0]
    return np.sqrt(l * np.sum(0.5 * (1. - np.cos(a - b)), axis=1))
# Accelerated version with numexpr
def dihedral_distances_ne(a, b):
    l = 1./a.shape[0]
    tmp = ne.evaluate('sum((0.5)*(1. - cos(a-b)), axis=1)')
    return ne.evaluate('sqrt(l * tmp)')
# The reference function whose computation time
# I try to get as close to as possible
%timeit euclidean_distances(c[x,:], c)[0]
1000 loops, best of 3: 1.07 ms per loop
# Computation time of the first version of the dihedral_distances function
# We randomly choose 1 structure among the 10000 structures
# and compute the dihedral distance between it and all the others
%timeit dihedral_distances(c[x,:], c)
10 loops, best of 3: 21.5 ms per loop
# Computation time of the accelerated function with numexpr
%timeit dihedral_distances_ne(c[x,:], c)
100 loops, best of 3: 9.44 ms per loop
9.44 ms is very fast, but it is still too slow if you need to run it a million times. Now the question is: how do I do that? What is the next step? Cython? PyOpenCL? I have some experience with PyOpenCL, but I have never coded anything as elaborate as this. I don't know whether it's possible to compute the dihedral distances in one step on the GPU, as I do with numpy, or how to proceed.
Thank you for helping me!
EDIT:
Thank you guys! I am currently working on the full solution and once it's finished I will put the code here.
CYTHON VERSION:
%load_ext cython
import numpy as np
np.random.seed(1234)
n = 10000
m = 100
c = np.random.rand(n,m)
x = np.random.randint(c.shape[0])
print c.shape, x
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
import numpy as np
cimport numpy as np
# cos and sqrt from libc.math are already declared nogil, so a duplicate
# `cdef extern from "math.h"` block is not needed.
from libc.math cimport sqrt, cos
cimport cython
from cython.parallel cimport parallel, prange

# Define a function pointer to a metric
ctypedef double (*metric)(double[:, ::1], np.intp_t, np.intp_t)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef double dihedral_distances(double[:, ::1] a, np.intp_t i1, np.intp_t i2):
    cdef double res
    cdef int m
    cdef int j
    res = 0.
    m = a.shape[1]
    for j in range(m):
        res += 1. - cos(a[i1, j] - a[i2, j])
    res /= 2.*m
    return sqrt(res)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef double dihedral_distances_p(double[:, ::1] a, np.intp_t i1, np.intp_t i2):
    cdef double res
    cdef int m
    cdef int j
    res = 0.
    m = a.shape[1]
    with nogil, parallel(num_threads=2):
        for j in prange(m, schedule='dynamic'):
            res += 1. - cos(a[i1, j] - a[i2, j])
    res /= 2.*m
    return sqrt(res)

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise(double[:, ::1] c not None, np.intp_t x, p=True):
    cdef metric dist_func
    if p:
        dist_func = &dihedral_distances_p
    else:
        dist_func = &dihedral_distances
    cdef np.intp_t i, n_samples
    n_samples = c.shape[0]
    cdef double[::1] res = np.empty(n_samples)
    for i in range(n_samples):
        res[i] = dist_func(c, x, i)
    return res
%timeit pairwise(c, x, False)
100 loops, best of 3: 17 ms per loop
# Parallel version
%timeit pairwise(c, x, True)
10 loops, best of 3: 37.1 ms per loop
So I followed your link to create the cython version of the dihedral_distances function. We gain some speed, though not much, and it is still slower than the numexpr version (17 ms vs 9.44 ms). So I tried to parallelize the function using prange, and it is worse (37.1 ms vs 17 ms vs 9.44 ms)!
Am I missing something?

If you're willing to use http://pythran.readthedocs.io/, you can leverage the numpy implementation and get better performance than cython for this case:
#pythran export np_cos_norm(float[], float[])
import numpy as np

def np_cos_norm(a, b):
    val = np.sum(1. - np.cos(a - b))
    return np.sqrt(val / 2. / a.shape[0])
And compile it with:
pythran fast.py
to get an average 2x speedup over the cython version.
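A minimal usage sketch (assuming the snippet above is saved as fast.py; pythran names the compiled extension after the source file):
import numpy as np
from fast import np_cos_norm  # the pythran-compiled extension

a = np.random.rand(100)
b = np.random.rand(100)
print(np_cos_norm(a, b))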
If using:
pythran fast.py -march=native -DUSE_BOOST_SIMD -fopenmp
You'll get a vectorized, parallel version that runs slightly faster:
100000 loops, best of 3: 2.54 µs per loop
1000000 loops, best of 3: 674 ns per loop
100000 loops, best of 3: 16.9 µs per loop
100000 loops, best of 3: 4.31 µs per loop
10000 loops, best of 3: 176 µs per loop
10000 loops, best of 3: 42.9 µs per loop
(using the same testbed as in ev-br's answer below)

Here's a quick-and-dirty try with cython, for just a pair of 1D arrays:
(in an IPython notebook)
%%cython
cimport cython
cimport numpy as np

cdef extern from "math.h":
    double cos(double x) nogil
    double sqrt(double x) nogil

def cos_norm(a, b):
    return cos_norm_impl(a, b)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef double cos_norm_impl(double[::1] a, double[::1] b) nogil:
    cdef double res = 0., val
    cdef int m = a.shape[0]
    # XXX: shape of b not checked
    cdef int j
    for j in range(m):
        val = a[j] - b[j]
        res += 1. - cos(val)
    res /= 2.*m
    return sqrt(res)
Comparing with a straightforward numpy implementation,
def np_cos_norm(a, b):
    val = np.add.reduce(1. - np.cos(a - b))
    return np.sqrt(val / 2. / a.shape[0])
I get
np.random.seed(1234)
for n in [100, 1000, 10000]:
    x = np.random.random(n)
    y = np.random.random(n)
    %timeit cos_norm(x, y)
    %timeit np_cos_norm(x, y)
    print '\n'
100000 loops, best of 3: 3.04 µs per loop
100000 loops, best of 3: 12.4 µs per loop
100000 loops, best of 3: 18.8 µs per loop
10000 loops, best of 3: 30.8 µs per loop
1000 loops, best of 3: 196 µs per loop
1000 loops, best of 3: 223 µs per loop
So, depending on the dimensionality of your vectors, you can get anywhere from a 4x speedup to none at all.
For computing pairwise distances, you can probably do much better, as shown in this blog post, but of course YMMV.
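For instance, a numpy-only sketch of the full pairwise matrix via broadcasting (my own illustration; memory-hungry, since it materializes an (n, n, m) intermediate):
def pairwise_dihedral(c):
    # c has shape (n, m); result[i, j] is the distance between rows i and j
    diff = c[:, None, :] - c[None, :, :]  # shape (n, n, m)
    return np.sqrt(np.mean(0.5 * (1. - np.cos(diff)), axis=-1))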


Can I do better on filtering numpy array

I have a somewhat contrived example to cythonize, where I want a function to:
accept a 1D numpy array of arbitrary length (~100,000 to 1,000,000 np.float64 values)
do some filtering on it
return the results as a new [numpy?] array of the same length
The code and profiling are as follows:
%%cython -a
from libc.stdlib cimport malloc, free
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview(double[:] arr):
    cdef:
        int N = arr.shape[0], i
        double *out_ptr = <double *> malloc(N * sizeof(double))
        double[:] out = <double[:N]>out_ptr
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.
    free(out_ptr)  # NB: frees out's buffer before np.asarray(out) reads it (see the answer below)
    return np.asarray(out)
My question is: can I do any better than this?
As DavidW has pointed out, your code has some issues with memory management, and it would be better to use a numpy array directly:
%%cython
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_correct(double[:] arr):
    cdef:
        int N = arr.shape[0], i
        double[:] out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)
It is about as fast as the faulty original version:
import numpy as np
np.random.seed(0)
k= np.random.rand(5*10**7)
%timeit func_memview(k) # 413 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_memview_correct(k) # 412 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The question is: how can this code be made faster? The most obvious options are
Parallelization.
Using vectorization/SIMD instructions.
It is notoriously hard to ensure that the C code generated by Cython gets vectorized, see for example this SO-post. For many compilers it is necessary to use a contiguous memory view to improve the situation, i.e.:
%%cython -c=/O3
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_correct_cont(double[::1] arr):  # <---- HERE: contiguous memoryview
    cdef:
        int N = arr.shape[0], i
        double[::1] out = np.empty(N)  # <---- HERE
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)
On my machine it is not really much faster:
%timeit func_memview_correct_cont(k) # 402 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Other compilers might do better. However, I've often seen gcc and msvc struggle to produce optimal assembly for the kind of code typical for filtering (see for example this SO-question). Clang is much better at this, so the easiest solution would probably be to use numba:
import numba as nb
import numpy as np

@nb.njit
def nb_func(arr):
    N = arr.shape[0]
    out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return out
which outperforms the cython code by almost a factor of 3:
%timeit nb_func(k) # 151 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is easy to parallelize the numba version using prange, but the win is modest: the parallelized version runs in 116 ms on my machine (see the sketch below).
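For reference, a sketch of such a parallel variant (an assumption on my part; not necessarily the exact code behind the 116 ms figure):
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def nb_func_parallel(arr):
    N = arr.shape[0]
    out = np.empty(N)
    out[0] = 0.0  # element 0 is never written by the loop below
    for i in nb.prange(1, N):
        out[i] = arr[i] if arr[i] > arr[i - 1] else 0.0
    return out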
To summarize: for this type of task my advice is to use numba. Using cython is trickier, and the final performance depends on the compiler used in the background.

Cython Numpy Array Manipulation Slower than Python

I would like to optimize this Python code with Cython:
def updated_centers(point, start, center):
    return np.array([__cluster_mean(point[start[c]:start[c + 1]], center[c]) for c in range(center.shape[0])])

def __cluster_mean(point, center):
    return (np.sum(point, axis=0) + center) / (point.shape[0] + 1)
My Cython code:
cimport cython
cimport numpy as np
import numpy as np

# C-compatible Numpy integer type.
DTYPE = np.intc

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
@cython.cdivision(True)     # Deactivate division-by-zero checking.
def updated_centers(double[:, :] point, int[:] label, double[:, :] center):
    if (point.shape[0] != label.size) or (point.shape[1] != center.shape[1]) or (center.shape[0] > point.shape[0]):
        raise ValueError("Incompatible dimensions")
    cdef Py_ssize_t i, c, j
    cdef Py_ssize_t n = point.shape[0]
    cdef Py_ssize_t m = point.shape[1]
    cdef Py_ssize_t nc = center.shape[0]
    # Updated centers. We accumulate point and center contributions into this array.
    # Start by adding the (unscaled) center contributions.
    new_center = np.zeros([nc, m])
    new_center[:] = center
    # Counter array. Will contain cluster sizes (including center, whose contribution
    # is again added here) at the end of the point loop.
    cluster_size = np.ones([nc], dtype=DTYPE)
    # Add point contributions.
    for i in range(n):
        c = label[i]
        cluster_size[c] += 1
        for j in range(m):
            new_center[c, j] += point[i, j]
    # Scale center+point summation to be a mean.
    for c in range(nc):
        for j in range(m):
            new_center[c, j] /= cluster_size[c]
    return new_center
However, the Cython version is slower than the Python one:
Python: %timeit f.updated_centers(point, start, center)
331 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Cython: %timeit fx.updated_centers(point, label, center)
433 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The HTML reveals that almost all lines are yellow: allocating the array, +=, /=. I expected Cython to be an order of magnitude faster. What am I doing wrong?
You need to tell Cython that new_center and cluster_size are arrays:
cdef double[:, :] new_center = np.zeros((nc, m))
...
cdef int[:] cluster_size = np.ones((nc,), dtype=DTYPE)
...
Without these type annotations Cython cannot generate efficient C code, and has to call into the Python interpreter when you access those arrays. This is why the lines where you access these arrays were yellow in the HTML output of cython -a.
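For concreteness, here is a sketch of the question's function with just those two declarations added (and np.asarray on return, since new_center is now a memoryview):
%%cython
cimport cython
import numpy as np

DTYPE = np.intc

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def updated_centers(double[:, :] point, int[:] label, double[:, :] center):
    cdef Py_ssize_t i, c, j
    cdef Py_ssize_t n = point.shape[0]
    cdef Py_ssize_t m = point.shape[1]
    cdef Py_ssize_t nc = center.shape[0]
    # The two added declarations: typed memoryviews instead of untyped arrays.
    cdef double[:, :] new_center = np.zeros((nc, m))
    cdef int[:] cluster_size = np.ones((nc,), dtype=DTYPE)
    new_center[:, :] = center
    for i in range(n):
        c = label[i]
        cluster_size[c] += 1
        for j in range(m):
            new_center[c, j] += point[i, j]
    for c in range(nc):
        for j in range(m):
            new_center[c, j] /= cluster_size[c]
    return np.asarray(new_center)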
With just these two small modifications we immediately see the speedup we want:
%timeit python_updated_centers(point, start, center)
392 ms ± 41.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit cython_updated_centers(point, start, center)
1.18 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For such simple kernels, you can also use pythran to get nice speedups:
#pythran export updated_centers(float64[:, :], int32[:], float64[:, :])
import numpy as np

def updated_centers(point, start, center):
    return np.array([__cluster_mean(point[start[c]:start[c + 1]], center[c]) for c in range(center.shape[0])])

def __cluster_mean(point, center):
    return (np.sum(point, axis=0) + center) / (point.shape[0] + 1)
Compile it with pythran updated_centers.py and you get the following timings:
Numpy code (same code, not compiled):
$ python -m perf timeit -s 'import numpy as np; n, m = 100000, 5; k = n//2; point = np.random.rand(n, m); start = 2*np.arange(k+1, dtype=np.int32); center=np.random.rand(k, m); from updated_centers import updated_centers' 'updated_centers(point, start, center)'
.....................
Mean +- std dev: 271 ms +- 12 ms
Pythran (after compilation):
$ python -m perf timeit -s 'import numpy as np; n, m = 100000, 5; k = n//2; point = np.random.rand(n, m); start = 2*np.arange(k+1, dtype=np.int32); center=np.random.rand(k, m); from updated_centers import updated_centers' 'updated_centers(point, start, center)'
.....................
Mean +- std dev: 12.8 ms +- 0.3 ms
The key is to write the Cython code to access the (untyped) output array only when necessary, accumulating into C scalars instead.
cimport cython
cimport numpy as np
import numpy as np

# C-compatible Numpy integer type.
DTYPE = np.intc

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
@cython.cdivision(True)     # Deactivate division-by-zero checking.
def updated_centers(double[:, :] point, int[:] start, double[:, :] center):
    """Returns the updated list of cluster centers (damped center of mass Pahkira scheme). Cluster c
    (and center[c]) corresponds to the point range point[start[c]:start[c+1]]."""
    if (point.shape[1] != center.shape[1]) or (center.shape[0] > point.shape[0]) or (start.size != center.shape[0] + 1):
        raise ValueError("Incompatible dimensions")
    # Py_ssize_t is the proper C type for Python array indices.
    cdef Py_ssize_t i, c, j, cluster_start, cluster_stop, cluster_size
    cdef Py_ssize_t n = point.shape[0]
    cdef Py_ssize_t m = point.shape[1]
    cdef Py_ssize_t nc = center.shape[0]
    cdef double center_of_mass
    # Updated centers. Point and center contributions are accumulated in a C
    # scalar, so the untyped output array is touched only once per (c, j).
    new_center = np.zeros([nc, m])
    cluster_start = start[0]
    for c in range(nc):
        cluster_stop = start[c + 1]
        cluster_size = cluster_stop - cluster_start + 1  # +1 accounts for the center itself
        for j in range(m):
            center_of_mass = center[c, j]
            for i in range(cluster_start, cluster_stop):
                center_of_mass += point[i, j]
            new_center[c, j] = center_of_mass / cluster_size
        cluster_start = cluster_stop
    return np.asarray(new_center)
With the same API we get
n, m = 100000, 5; k = n//2; point = np.random.rand(n, m); start = 2*np.arange(k+1, dtype=np.intc); center=np.random.rand(k, m);
%timeit fx.updated_centers(point, start, center)
31 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit f.updated_centers(point, start, center)
734 ms ± 17.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comparing scipy.stats.t.sf vs GSL using Cython

I would like to calculate the p-values of a large 2D numpy array of t-values. However, this takes a long time and I would like to improve its speed. I tried using GSL.
Although a single gsl_cdf_tdist_P call is much, much faster than scipy.stats.t.sf, when iterating over the ndarray the process is very slow. I would like help improving this.
See the code below.
GSL_Test.pyx
import cython
cimport cython
import numpy
cimport numpy
from cython_gsl cimport *

DTYPE = numpy.float32
ctypedef numpy.float32_t DTYPE_t

cdef get_gsl_p(double t, double nu):
    return (1 - gsl_cdf_tdist_P(t, nu)) * 2

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef get_gsl_p_for_2D_matrix(numpy.ndarray[DTYPE_t, ndim=2] t_matrix, int n):
    cdef unsigned int rows = t_matrix.shape[0]
    cdef numpy.ndarray[DTYPE_t, ndim=2] out = numpy.zeros((rows, rows), dtype='float32')
    cdef unsigned int row, col
    for row in range(rows):
        for col in range(rows):
            out[row, col] = get_gsl_p(t_matrix[row, col], n-2)
    return out

def get_gsl_p_for_2D_matrix_def(numpy.ndarray[DTYPE_t, ndim=2] t_matrix, int n):
    return get_gsl_p_for_2D_matrix(t_matrix, n)
ipython
import GSL_Test
import numpy
import scipy.stats
a = numpy.random.rand(3544, 3544).astype('float32')
%timeit -n 1 GSL_Test.get_gsl_p_for_2D_matrix(a, 25)
1 loop, best of 3: 7.87 s per loop
%timeit -n 1 scipy.stats.t.sf(a, 25)*2
1 loop, best of 3: 4.66 s per loop
UPDATE: By adding cdef declarations I was able to reduce the computation time, but still not below scipy's. I modified the code above to include the cdef declarations.
%timeit -n 1 GSL_Test.get_gsl_p_for_2D_matrix_def(a, 25)
1 loop, best of 3: 6.73 s per loop
You can get some small gain in raw performance by using the raw special function instead of stats.t.sf. Looking at the source (https://github.com/scipy/scipy/blob/master/scipy/stats/_continuous_distns.py#L3849), you find
def _sf(self, x, df):
    return sc.stdtr(df, -x)
So that you can use stdtr directly (imports added for completeness):
import numpy as np
from scipy import stats
from scipy.special import stdtr

np.random.seed(1234)
x = np.random.random((3740, 374))
t1 = stats.t.sf(x, 25)
t2 = stdtr(25, -x)
1 loop, best of 3: 653 ms per loop
1 loop, best of 3: 562 ms per loop
If you do reach for cython, the typed memoryview syntax often gives you faster code than the old ndarray syntax:
from scipy.special.cython_special cimport stdtr
from numpy cimport npy_intp
import numpy as np

def tsf(double[:, ::1] x, int df=25):
    cdef double[:, ::1] out = np.empty_like(x)
    cdef npy_intp i, j
    cdef double tmp, xx
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            xx = x[i, j]
            out[i, j] = stdtr(df, -xx)
    return np.asarray(out)
Here I'm also using the cython_special interface, which is only available in the dev version of scipy (http://scipy.github.io/devdocs/special.cython_special.html#module-scipy.special.cython_special), but you can use GSL if you want.
Finally, if you suspect a bottleneck in iterations, don't forget to inspect the output of cython -a to see if there's some python overhead in the hot loops.
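For example (a generic invocation, not specific to this project):
cython -a GSL_Test.pyx  # writes GSL_Test.html; deeper yellow means more Python API calls on that line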

Speed up Python/Cython loops.

I've been trying to get a loop in python to run as fast as possible, so I've dived into NumPy and Cython.
Here's the original Python code:
def calculate_bsf_u_loop(uvel, dy, dz):
    """
    Calculate barotropic stream function from zonal velocity
    uvel (t,z,y,x)
    dy   (y,x)
    dz   (t,z,y,x)
    bsf  (t,y,x)
    """
    nt = uvel.shape[0]
    nz = uvel.shape[1]
    ny = uvel.shape[2]
    nx = uvel.shape[3]
    bsf = np.zeros((nt, ny, nx))
    for jn in range(0, nt):
        for jk in range(0, nz):
            for jj in range(0, ny):
                for ji in range(0, nx):
                    bsf[jn, jj, ji] = bsf[jn, jj, ji] + uvel[jn, jk, jj, ji] * dz[jn, jk, jj, ji] * dy[jj, ji]
    return bsf
It's just a sum over the k index. Array sizes are nt=12, nz=75, ny=559, nx=1442, i.e. ~725 million elements.
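In formula form, the loop computes (my transcription): $\mathrm{bsf}_{t,y,x} = \sum_{z} \mathrm{uvel}_{t,z,y,x}\,\mathrm{dz}_{t,z,y,x}\,\mathrm{dy}_{y,x}$.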
That took 68 seconds. Now, I've done it in cython as
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)  # turn off bounds-checking for entire function
@cython.wraparound(False)   # turn off negative index wrapping for entire function
## Use cpdef instead of def
## Define types for arrays
cpdef calculate_bsf_u_loop(np.ndarray[np.float64_t, ndim=4] uvel, np.ndarray[np.float64_t, ndim=2] dy, np.ndarray[np.float64_t, ndim=4] dz):
    """
    Calculate barotropic stream function from zonal velocity
    uvel (t,z,y,x)
    dy   (y,x)
    dz   (t,z,y,x)
    bsf  (t,y,x)
    """
    ## cdef the constants
    cdef int nt = uvel.shape[0]
    cdef int nz = uvel.shape[1]
    cdef int ny = uvel.shape[2]
    cdef int nx = uvel.shape[3]
    ## cdef loop indices
    cdef ji, jj, jk, jn
    ## cdef. Note that the cdef is followed by the cython type
    ## but the np.zeros function uses the python (numpy) type
    cdef np.ndarray[np.float64_t, ndim=3] bsf = np.zeros([nt, ny, nx], dtype=np.float64)
    for jn in xrange(0, nt):
        for jk in xrange(0, nz):
            for jj in xrange(0, ny):
                for ji in xrange(0, nx):
                    bsf[jn, jj, ji] += uvel[jn, jk, jj, ji] * dz[jn, jk, jj, ji] * dy[jj, ji]
    return bsf
and that took 49 seconds.
However, swapping the loop for
for jn in range(0, nt):
    for jk in range(0, nz):
        bsf[jn, :, :] = bsf[jn, :, :] + uvel[jn, jk, :, :] * dz[jn, jk, :, :] * dy[:, :]
only takes 0.29 seconds! Unfortunately, I can't do this in my full code.
Why is NumPy slicing so much faster than the Cython loop?
I thought NumPy was fast because it is Cython under the hood. So shouldn't they be of similar speed?
As you can see, I've disabled boundary checks in cython, and I've also compiled using "fast math". However, this only gives a tiny speedup.
Is there anyway to get a loop to be of similar speed as NumPy slicing, or is looping always slower than slicing?
Any help is greatly appreciated!
/Joakim
That code is screaming for numpy.einsum's intervention, given that you are doing elementwise multiplication and then a sum-reduction over the second axis of the 4D product array, which is essentially what numpy.einsum does, in a highly efficient manner. To solve your case, you can use numpy.einsum in two ways:
bsf = np.einsum('ijkl,ijkl,kl->ikl', uvel, dz, dy)
bsf = np.einsum('ijkl,ijkl->ikl', uvel, dz) * dy
Runtime tests and output verification:
In [100]: # Take a (1/5)th of original input shapes
...: original_shape = [12,75, 559,1442]
...: m,n,p,q = (np.array(original_shape)/5).astype(int)
...:
...: # Generate random arrays with given shapes
...: uvel = np.random.rand(m,n,p,q)
...: dy = np.random.rand(p,q)
...: dz = np.random.rand(m,n,p,q)
...:
In [101]: bsf = calculate_bsf_u_loop(uvel,dy,dz)
In [102]: print(np.allclose(bsf,np.einsum('ijkl,ijkl,kl->ikl',uvel,dz,dy)))
True
In [103]: print(np.allclose(bsf,np.einsum('ijkl,ijkl->ikl',uvel,dz)*dy))
True
In [104]: %timeit calculate_bsf_u_loop(uvel,dy,dz)
1 loops, best of 3: 2.16 s per loop
In [105]: %timeit np.einsum('ijkl,ijkl,kl->ikl',uvel,dz,dy)
100 loops, best of 3: 3.94 ms per loop
In [106]: %timeit np.einsum('ijkl,ijkl->ikl',uvel,dz)*dy
100 loops, best of 3: 3.96 ms per loop

Numba code slower than pure python

I've been working on speeding up a resampling calculation for a particle filter. As python has many ways to speed it up, I thought I'd try them all. Unfortunately, the numba version is incredibly slow. As Numba should result in a speed-up, I assume this is an error on my part.
I tried 4 different versions:
Numba
Python
Numpy
Cython
The code for each is below:
import numpy as np
import scipy as sp
import numba as nb
from cython_resample import cython_resample

@nb.autojit
def numba_resample(qs, xs, rands):
    n = qs.shape[0]
    lookup = np.cumsum(qs)
    results = np.empty(n)
    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results
def python_resample(qs, xs, rands):
    n = qs.shape[0]
    lookup = np.cumsum(qs)
    results = np.empty(n)
    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results
def numpy_resample(qs, xs, rands):
    results = np.empty_like(qs)
    lookup = sp.cumsum(qs)
    for j, key in enumerate(rands):
        i = sp.argmax(lookup > key)
        results[j] = xs[i]
    return results
#The following is the code for the cython module. It was compiled in a
#separate file, but is included here to aid in the question.
"""
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
def cython_resample(np.ndarray[DTYPE_t, ndim=1] qs,
                    np.ndarray[DTYPE_t, ndim=1] xs,
                    np.ndarray[DTYPE_t, ndim=1] rands):
    if qs.shape[0] != xs.shape[0] or qs.shape[0] != rands.shape[0]:
        raise ValueError("Arrays must have same shape")
    assert qs.dtype == xs.dtype == rands.dtype == DTYPE

    cdef unsigned int n = qs.shape[0]
    cdef unsigned int i, j
    cdef np.ndarray[DTYPE_t, ndim=1] lookup = np.cumsum(qs)
    cdef np.ndarray[DTYPE_t, ndim=1] results = np.zeros(n, dtype=DTYPE)

    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results
"""
if __name__ == '__main__':
    n = 100
    xs = np.arange(n, dtype=np.float64)
    qs = np.array([1.0/n, ]*n)
    rands = np.random.rand(n)

    print "Timing Numba Function:"
    %timeit numba_resample(qs, xs, rands)
    print "Timing Python Function:"
    %timeit python_resample(qs, xs, rands)
    print "Timing Numpy Function:"
    %timeit numpy_resample(qs, xs, rands)
    print "Timing Cython Function:"
    %timeit cython_resample(qs, xs, rands)
This results in the following output:
Timing Numba Function:
1 loops, best of 3: 8.23 ms per loop
Timing Python Function:
100 loops, best of 3: 2.48 ms per loop
Timing Numpy Function:
1000 loops, best of 3: 793 µs per loop
Timing Cython Function:
10000 loops, best of 3: 25 µs per loop
Any idea why the numba code is so slow? I assumed it would be at least comparable to Numpy.
Note: if anyone has any ideas on how to speed up either the Numpy or Cython code samples, that would be nice too. :) My main question is about Numba, though.
The problem is that numba can't intuit the type of lookup. If you put a print nb.typeof(lookup) in your method, you'll see that numba is treating it as an object, which is slow. Normally I would just define the type of lookup in a locals dict, but I was getting a strange error. Instead I just created a little wrapper, so that I could explicitly define the input and output types.
@nb.jit(nb.f8[:](nb.f8[:]))
def numba_cumsum(x):
    return np.cumsum(x)

@nb.autojit
def numba_resample2(qs, xs, rands):
    n = qs.shape[0]
    # lookup = np.cumsum(qs)
    lookup = numba_cumsum(qs)
    results = np.empty(n)
    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results
Then my timings are:
print "Timing Numba Function:"
%timeit numba_resample(qs, xs, rands)
print "Timing Revised Numba Function:"
%timeit numba_resample2(qs, xs, rands)
Timing Numba Function:
100 loops, best of 3: 8.1 ms per loop
Timing Revised Numba Function:
100000 loops, best of 3: 15.3 µs per loop
You can go even a little faster still if you use jit instead of autojit:
@nb.jit(nb.f8[:](nb.f8[:], nb.f8[:], nb.f8[:]))
For me that lowers it from 15.3 microseconds to 12.5 microseconds, but it's still impressive how well autojit does.
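Spelled out, that variant would look like this sketch (numba_resample2 with an explicit signature, still using the numba_cumsum wrapper from above):
@nb.jit(nb.f8[:](nb.f8[:], nb.f8[:], nb.f8[:]))
def numba_resample3(qs, xs, rands):
    n = qs.shape[0]
    lookup = numba_cumsum(qs)  # typed wrapper defined above
    results = np.empty(n)
    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results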
Faster numpy version (10x speedup compared to numpy_resample)
def numpy_faster(qs, xs, rands):
    lookup = np.cumsum(qs)
    mm = lookup[None, :] > rands[:, None]
    I = np.argmax(mm, 1)
    return xs[I]
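Since lookup is a cumulative sum and therefore sorted, a further improvement (my suggestion, untimed here) is to replace the argmax scan with a binary search via np.searchsorted:
def numpy_searchsorted(qs, xs, rands):
    lookup = np.cumsum(qs)
    # side='right' finds the first index i with lookup[i] > rands[j],
    # matching the `rands[j] < lookup[i]` test in the loops above
    return xs[np.searchsorted(lookup, rands, side='right')]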
