I've been trying to get a loop in Python to run as fast as possible, so I've dived into NumPy and Cython.
Here's the original Python code:
def calculate_bsf_u_loop(uvel,dy,dz):
    """
    Calculate barotropic stream function from zonal velocity
    uvel (t,z,y,x)
    dy (y,x)
    dz (t,z,y,x)
    bsf (t,y,x)
    """
    nt = uvel.shape[0]
    nz = uvel.shape[1]
    ny = uvel.shape[2]
    nx = uvel.shape[3]
    bsf = np.zeros((nt,ny,nx))
    for jn in range(0,nt):
        for jk in range(0,nz):
            for jj in range(0,ny):
                for ji in range(0,nx):
                    bsf[jn,jj,ji] = bsf[jn,jj,ji] + uvel[jn,jk,jj,ji] * dz[jn,jk,jj,ji] * dy[jj,ji]
    return bsf
It's just a sum over k indices. Array sizes are nt=12, nz=75, ny=559, nx=1442, so ~725 million elements.
That took 68 seconds. Now, I've done it in cython as
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False) # turn off negative index wrapping for entire function
## Use cpdef instead of def
## Define types for arrays
cpdef calculate_bsf_u_loop(np.ndarray[np.float64_t, ndim=4] uvel, np.ndarray[np.float64_t, ndim=2] dy, np.ndarray[np.float64_t, ndim=4] dz):
    """
    Calculate barotropic stream function from zonal velocity
    uvel (t,z,y,x)
    dy (y,x)
    dz (t,z,y,x)
    bsf (t,y,x)
    """
    ## cdef the constants
    cdef int nt = uvel.shape[0]
    cdef int nz = uvel.shape[1]
    cdef int ny = uvel.shape[2]
    cdef int nx = uvel.shape[3]
    ## cdef loop indices
    cdef ji,jj,jk,jn
    ## cdef. Note that the cdef is followed by cython type
    ## but the np.zeros function as python (numpy) type
    cdef np.ndarray[np.float64_t, ndim=3] bsf = np.zeros([nt,ny,nx], dtype=np.float64)
    for jn in xrange(0,nt):
        for jk in xrange(0,nz):
            for jj in xrange(0,ny):
                for ji in xrange(0,nx):
                    bsf[jn,jj,ji] += uvel[jn,jk,jj,ji] * dz[jn,jk,jj,ji] * dy[jj,ji]
    return bsf
and that took 49 seconds.
However, swapping the loop for
for jn in range(0,nt):
    for jk in range(0,nz):
        bsf[jn,:,:] = bsf[jn,:,:] + uvel[jn,jk,:,:] * dz[jn,jk,:,:] * dy[:,:]
only takes 0.29 seconds! Unfortunately, I can't do this in my full code.
Why is NumPy slicing so much faster than the Cython loop?
I thought NumPy was fast because it is Cython under the hood. So shouldn't they be of similar speed?
As you can see, I've disabled boundary checks in cython, and I've also compiled using "fast math". However, this only gives a tiny speedup.
Is there any way to get a loop to run at a similar speed to NumPy slicing, or is looping always slower than slicing?
Any help is greatly appreciated!
/Joakim
That code is screaming for numpy.einsum's intervention, given that you are doing elementwise multiplication followed by a sum-reduction over the second axis of the 4D product array, which is essentially what numpy.einsum does in a highly efficient manner. To solve your case, you can use numpy.einsum in two ways -
bsf = np.einsum('ijkl,ijkl,kl->ikl',uvel,dz,dy)
bsf = np.einsum('ijkl,ijkl->ikl',uvel,dz)*dy
Runtime tests & Verify outputs -
In [100]: # Take a (1/5)th of original input shapes
...: original_shape = [12,75, 559,1442]
...: m,n,p,q = (np.array(original_shape)/5).astype(int)
...:
...: # Generate random arrays with given shapes
...: uvel = np.random.rand(m,n,p,q)
...: dy = np.random.rand(p,q)
...: dz = np.random.rand(m,n,p,q)
...:
In [101]: bsf = calculate_bsf_u_loop(uvel,dy,dz)
In [102]: print(np.allclose(bsf,np.einsum('ijkl,ijkl,kl->ikl',uvel,dz,dy)))
True
In [103]: print(np.allclose(bsf,np.einsum('ijkl,ijkl->ikl',uvel,dz)*dy))
True
In [104]: %timeit calculate_bsf_u_loop(uvel,dy,dz)
1 loops, best of 3: 2.16 s per loop
In [105]: %timeit np.einsum('ijkl,ijkl,kl->ikl',uvel,dz,dy)
100 loops, best of 3: 3.94 ms per loop
In [106]: %timeit np.einsum('ijkl,ijkl->ikl',uvel,dz)*dy
100 loops, best of 3: 3.96 ms per loop
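To make the index notation concrete, here is a minimal sketch that compares the einsum call with an explicit multiply-then-sum over the z axis. The tiny shapes and the random generator seed are assumptions chosen only for illustration; they are not the sizes from the question.

```python
import numpy as np

# Small hypothetical shapes (t, z, y, x); the real arrays are far larger.
rng = np.random.default_rng(0)
nt, nz, ny, nx = 2, 3, 4, 5
uvel = rng.random((nt, nz, ny, nx))
dz = rng.random((nt, nz, ny, nx))
dy = rng.random((ny, nx))

# Reference: multiply elementwise, then sum over the z axis (axis 1).
# dy has shape (y, x) and broadcasts against the trailing two axes.
reference = (uvel * dz * dy).sum(axis=1)

# einsum: the index j (z) is absent from the output, so it is summed over.
via_einsum = np.einsum('ijkl,ijkl,kl->ikl', uvel, dz, dy)

same = np.allclose(reference, via_einsum)
print(same, via_einsum.shape)
```

The reference version builds the full 4D product before reducing it, which is why it is only suitable as a check on small inputs.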
I want to apply a "black box" Python function f to a large array arr. Additional assumptions are:
Function f is "pure", i.e. it is deterministic with no side effects.
Array arr has a small number of unique elements.
I can achieve this with a decorator that computes f for each unique element of arr as follows:
import numpy as np
from time import sleep
from functools import wraps
N = 1000
np.random.seed(0)
arr = np.random.randint(0, 10, size=(N, 2))
def vectorize_pure(f):
    @wraps(f)
    def f_vec(arr):
        uniques, ix = np.unique(arr, return_inverse=True)
        f_range = np.array([f(x) for x in uniques])
        return f_range[ix].reshape(arr.shape)
    return f_vec
@np.vectorize
def usual_vectorize(x):
    sleep(0.001)
    return x
@vectorize_pure
def pure_vectorize(x):
    sleep(0.001)
    return x
# In [47]: %timeit usual_vectorize(arr)
# 1.33 s ± 6.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# In [48]: %timeit pure_vectorize(arr)
# 13.6 ms ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
My concern is that np.unique sorts arr under the hood, which seems inefficient given the assumptions. I am looking for a practical way of implementing a similar decorator that
Takes advantage of fast numpy vectorized operations.
Does not sort the input array.
I suspect that the answer is "yes" using numba, but I would be especially interested in a numpy solution.
Also, it seems that depending on the arr datatype, numpy may use radix sort, so performance of unique may be good in some cases.
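For reference, a minimal sketch of what np.unique returns with return_inverse=True, on a small made-up array (the values are assumptions for illustration only). It shows that the unique values come back sorted and that the inverse indices are enough to rebuild the input.

```python
import numpy as np

arr = np.array([[3, 1], [2, 3], [1, 1]])

# np.unique flattens the input and returns the distinct values in sorted order.
# return_inverse=True also gives, for each element, its index into `uniques`.
uniques, inverse = np.unique(arr, return_inverse=True)

# The shape of `inverse` has differed between NumPy versions, so flatten it
# before use and restore the original shape explicitly.
rebuilt = uniques[inverse.ravel()].reshape(arr.shape)

print(uniques)
print(np.array_equal(rebuilt, arr))
```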
I found a workaround below, using pandas.unique; however, it still requires two passes over the original array, and pandas.unique does some extra work. I wonder if a better solution exists with pandas._libs.hashtable and cython, or anything else.
You actually can do this in one-pass over the array, however it requires that you know the dtype of the result beforehand. Otherwise you need a second-pass over the elements to determine it.
Neglecting the performance (and the functools.wraps) for a moment an implementation could look like this:
def vectorize_cached(output_dtype):
    def vectorize_cached_factory(f):
        def f_vec(arr):
            flattened = arr.ravel()
            if output_dtype is None:
                result = np.empty_like(flattened)
            else:
                result = np.empty(arr.size, output_dtype)
            cache = {}
            for idx, item in enumerate(flattened):
                res = cache.get(item)
                if res is None:
                    res = f(item)
                    cache[item] = res
                result[idx] = res
            return result.reshape(arr.shape)
        return f_vec
    return vectorize_cached_factory
It first creates the result array, then it iterates over the input array. The function is called (and the result stored) once an element is encountered that's not already in the dictionary - otherwise it simply uses the value stored in the dictionary.
@vectorize_cached(np.float64)
def t(x):
    print(x)
    return x + 2.5
>>> t(np.array([1,1,1,2,2,2,3,3,1,1,1]))
1
2
3
array([3.5, 3.5, 3.5, 4.5, 4.5, 4.5, 5.5, 5.5, 3.5, 3.5, 3.5])
However this isn't particularly fast because we're doing a Python loop over a NumPy array.
A Cython solution
To make it faster we can actually port this implementation to Cython (currently only supporting float32, float64, int32, int64, uint32, and uint64 but almost trivial to extend because it uses fused-types):
%%cython
cimport numpy as cnp
ctypedef fused input_type:
    cnp.float32_t
    cnp.float64_t
    cnp.uint32_t
    cnp.uint64_t
    cnp.int32_t
    cnp.int64_t
ctypedef fused result_type:
    cnp.float32_t
    cnp.float64_t
    cnp.uint32_t
    cnp.uint64_t
    cnp.int32_t
    cnp.int64_t
cpdef void vectorized_cached_impl(input_type[:] array, result_type[:] result, object func):
    cdef dict cache = {}
    cdef Py_ssize_t idx
    cdef input_type item
    for idx in range(array.size):
        item = array[idx]
        res = cache.get(item)
        if res is None:
            res = func(item)
            cache[item] = res
        result[idx] = res
With a Python decorator (the following code is not compiled with Cython):
def vectorize_cached_cython(output_dtype):
    def vectorize_cached_factory(f):
        def f_vec(arr):
            flattened = arr.ravel()
            if output_dtype is None:
                result = np.empty_like(flattened)
            else:
                result = np.empty(arr.size, output_dtype)
            vectorized_cached_impl(flattened, result, f)
            return result.reshape(arr.shape)
        return f_vec
    return vectorize_cached_factory
Again this only does one-pass and only applies the function once per unique value:
@vectorize_cached_cython(np.float64)
def t(x):
    print(x)
    return x + 2.5
>>> t(np.array([1,1,1,2,2,2,3,3,1,1,1]))
1
2
3
array([3.5, 3.5, 3.5, 4.5, 4.5, 4.5, 5.5, 5.5, 3.5, 3.5, 3.5])
Benchmark: Fast function, lots of duplicates
But the question is: Does it make sense to use Cython here?
I did a quick benchmark (without sleep) to get an idea how different the performance is (using my library simple_benchmark):
def func_to_vectorize(x):
    return x
usual_vectorize = np.vectorize(func_to_vectorize)
pure_vectorize = vectorize_pure(func_to_vectorize)
pandas_vectorize = vectorize_with_pandas(func_to_vectorize)
cached_vectorize = vectorize_cached(None)(func_to_vectorize)
cython_vectorize = vectorize_cached_cython(None)(func_to_vectorize)
from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()
b.add_function(alias='usual_vectorize')(usual_vectorize)
b.add_function(alias='pure_vectorize')(pure_vectorize)
b.add_function(alias='pandas_vectorize')(pandas_vectorize)
b.add_function(alias='cached_vectorize')(cached_vectorize)
b.add_function(alias='cython_vectorize')(cython_vectorize)
@b.add_arguments('array size')
def argument_provider():
    np.random.seed(0)
    for exponent in range(6, 20):
        size = 2**exponent
        yield size, np.random.randint(0, 10, size=(size, 2))
r = b.run()
r.plot()
According to these times the ranking would be (fastest to slowest):
Cython version
Pandas solution (from another answer)
Pure solution (original post)
NumPy's vectorize
The non-Cython version using a cache
The plain NumPy solution is only a factor 5-10 slower if the function call is very inexpensive. The pandas solution also has a much bigger constant factor, making it the slowest for very small arrays.
Benchmark: expensive function (time.sleep(0.001)), lots of duplicates
In case the function call is actually expensive (like with time.sleep) the np.vectorize solution will be a lot slower, however there is much less difference between the other solutions:
# This shows only the difference compared to the previous benchmark
def func_to_vectorize(x):
    sleep(0.001)
    return x
@b.add_arguments('array size')
def argument_provider():
    np.random.seed(0)
    for exponent in range(5, 10):
        size = 2**exponent
        yield size, np.random.randint(0, 10, size=(size, 2))
Benchmark: Fast function, few duplicates
However if you don't have that many duplicates the plain np.vectorize is almost as fast as the pure and pandas solution and only a bit slower than the Cython version:
# Again, only the difference to the original benchmark is shown
@b.add_arguments('array size')
def argument_provider():
    np.random.seed(0)
    for exponent in range(6, 20):
        size = 2**exponent
        # The maximum value now depends on the size, to ensure there
        # are fewer duplicates in the array
        yield size, np.random.randint(0, size // 10, size=(size, 2))
This problem is actually quite interesting as it is a perfect example of a trade off between computation time and memory consumption.
From an algorithmic perspective finding the unique elements, and eventually computing only unique elements, can be achieved in two ways:
two-(or more) passes approach:
find out all unique elements
find out where the unique elements are
compute the function on the unique elements
put all computed unique elements into the right place
single-pass approach:
compute elements on the go and cache results
if an element is in the cache get it from there
The algorithmic complexity depends on the size of the input N and on the number of unique elements U. The latter can be formalized also using the r = U / N ratio of unique elements.
The more-passes approaches are theoretically slower. However, they are quite competitive for small N and U.
The single-pass approaches are theoretically faster, but this also depends strongly on the caching approach and how it performs for a given U.
Of course, however important the asymptotic behavior is, the actual timings also depend on the constant computation-time factors.
The most relevant of these factors in this problem is the computation time of func().
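As a small illustration of the quantities N, U and r used above, the following sketch measures them for a random test array. The array shape and value range are assumptions that mirror the kind of input used in the question; they are not taken from the benchmark code.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 10, size=(1000, 2))  # values drawn from 0..9

N = arr.size              # number of input elements
U = np.unique(arr).size   # number of distinct values
r = U / N                 # ratio of unique elements

print(N, U, r)
```

With only ten possible values, r is at most 10 / 2000 = 0.5%, which is the regime where computing func() once per unique value pays off most.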
Approaches
A number of approaches can be compared:
not cached
pure() this would be the base function and could be already vectorized
np.vectorized() this would be the NumPy standard vectorization decorator
more-passes approaches
np_unique(): the unique values are found using np.unique() and uses indexing (from np.unique() output) for constructing the result (essentially equivalent to vectorize_pure() from here)
pd_unique(): the unique values are found using pd.unique() and uses indexing (via np.searchsorted()) for constructing the result(essentially equivalent to vectorize_with_pandas() from here)
set_unique(): the unique values are found using simply set() and uses indexing (via np.searchsorted()) for constructing the result
set_unique_msk(): the unique values are found using simply set() (like set_unique()) and uses looping and masking for constructing the result (instead of indexing)
nb_unique(): the unique values and their indexes are found using explicit looping with numba JIT acceleration
cy_unique(): the unique values and their indexes are found using explicit looping with cython
single-pass approaches
cached_dict(): uses a Python dict for the caching (O(1) look-up)
cached_dict_cy(): same as above but with Cython (essentially equivalent to vectorized_cached_impl() from here)
cached_arr_cy(): uses an array for the caching (O(U) look-up)
pure()
def pure(x):
    return 2 * x
np.vectorized()
import numpy as np
vectorized = np.vectorize(pure)
vectorized.__name__ = 'vectorized'
np_unique()
import functools
import numpy as np
def vectorize_np_unique(func):
    @functools.wraps(func)
    def func_vect(arr):
        uniques, ix = np.unique(arr, return_inverse=True)
        result = np.array([func(x) for x in uniques])
        return result[ix].reshape(arr.shape)
    return func_vect
np_unique = vectorize_np_unique(pure)
np_unique.__name__ = 'np_unique'
pd_unique()
import functools
import numpy as np
import pandas as pd
def vectorize_pd_unique(func):
    @functools.wraps(func)
    def func_vect(arr):
        shape = arr.shape
        arr = arr.ravel()
        uniques = np.sort(pd.unique(arr))
        f_range = np.array([func(x) for x in uniques])
        return f_range[np.searchsorted(uniques, arr)].reshape(shape)
    return func_vect
pd_unique = vectorize_pd_unique(pure)
pd_unique.__name__ = 'pd_unique'
set_unique()
import functools
import numpy as np
def vectorize_set_unique(func):
    @functools.wraps(func)
    def func_vect(arr):
        shape = arr.shape
        arr = arr.ravel()
        uniques = sorted(set(arr))
        result = np.array([func(x) for x in uniques])
        return result[np.searchsorted(uniques, arr)].reshape(shape)
    return func_vect
set_unique = vectorize_set_unique(pure)
set_unique.__name__ = 'set_unique'
set_unique_msk()
import functools
import numpy as np
def vectorize_set_unique_msk(func):
    @functools.wraps(func)
    def func_vect(arr):
        result = np.empty_like(arr)
        for x in set(arr.ravel()):
            result[arr == x] = func(x)
        return result
    return func_vect
set_unique_msk = vectorize_set_unique_msk(pure)
set_unique_msk.__name__ = 'set_unique_msk'
nb_unique()
import functools
import numpy as np
import numba as nb
import flyingcircus as fc
@nb.jit(forceobj=False, nopython=True, nogil=True, parallel=True)
def numba_unique(arr, max_uniques):
    ix = np.empty(arr.size, dtype=np.int64)
    uniques = np.empty(max_uniques, dtype=arr.dtype)
    j = 0
    for i in range(arr.size):
        found = False
        for k in nb.prange(j):
            if arr[i] == uniques[k]:
                found = True
                break
        if not found:
            uniques[j] = arr[i]
            j += 1
    uniques = np.sort(uniques[:j])
    # : get indices
    num_uniques = j
    for j in nb.prange(num_uniques):
        x = uniques[j]
        for i in nb.prange(arr.size):
            if arr[i] == x:
                ix[i] = j
    return uniques, ix
@fc.base.parametric
def vectorize_nb_unique(func, max_uniques=-1):
    @functools.wraps(func)
    def func_vect(arr):
        nonlocal max_uniques
        shape = arr.shape
        arr = arr.ravel()
        if max_uniques <= 0:
            m = arr.size
        elif isinstance(max_uniques, int):
            m = min(max_uniques, arr.size)
        elif isinstance(max_uniques, float):
            m = int(arr.size * min(max_uniques, 1.0))
        uniques, ix = numba_unique(arr, m)
        result = np.array([func(x) for x in uniques])
        return result[ix].reshape(shape)
    return func_vect
nb_unique = vectorize_nb_unique()(pure)
nb_unique.__name__ = 'nb_unique'
cy_unique()
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np
import cython as cy
cimport cython as ccy
cimport numpy as cnp
ctypedef fused arr_t:
    cnp.uint16_t
    cnp.uint32_t
    cnp.uint64_t
    cnp.int16_t
    cnp.int32_t
    cnp.int64_t
    cnp.float32_t
    cnp.float64_t
    cnp.complex64_t
    cnp.complex128_t
def sort_numpy(arr_t[:] a):
    np.asarray(a).sort()
cpdef cnp.int64_t cython_unique(
        arr_t[:] arr,
        arr_t[::1] uniques,
        cnp.int64_t[:] ix):
    cdef size_t size = arr.size
    cdef arr_t x
    cdef cnp.int64_t i, j, k, num_uniques
    j = 0
    for i in range(size):
        found = False
        for k in range(j):
            if arr[i] == uniques[k]:
                found = True
                break
        if not found:
            uniques[j] = arr[i]
            j += 1
    sort_numpy(uniques[:j])
    num_uniques = j
    for j in range(num_uniques):
        x = uniques[j]
        for i in range(size):
            if arr[i] == x:
                ix[i] = j
    return num_uniques
import functools
import numpy as np
import flyingcircus as fc
@fc.base.parametric
def vectorize_cy_unique(func, max_uniques=0):
    @functools.wraps(func)
    def func_vect(arr):
        shape = arr.shape
        arr = arr.ravel()
        if max_uniques <= 0:
            m = arr.size
        elif isinstance(max_uniques, int):
            m = min(max_uniques, arr.size)
        elif isinstance(max_uniques, float):
            m = int(arr.size * min(max_uniques, 1.0))
        ix = np.empty(arr.size, dtype=np.int64)
        uniques = np.empty(m, dtype=arr.dtype)
        # Call the compiled helper defined in the Cython cell above
        num_uniques = cython_unique(arr, uniques, ix)
        uniques = uniques[:num_uniques]
        result = np.array([func(x) for x in uniques])
        return result[ix].reshape(shape)
    return func_vect
cy_unique = vectorize_cy_unique()(pure)
cy_unique.__name__ = 'cy_unique'
cached_dict()
import functools
import numpy as np
def vectorize_cached_dict(func):
    @functools.wraps(func)
    def func_vect(arr):
        result = np.empty_like(arr.ravel())
        cache = {}
        for i, x in enumerate(arr.ravel()):
            if x not in cache:
                cache[x] = func(x)
            result[i] = cache[x]
        return result.reshape(arr.shape)
    return func_vect
cached_dict = vectorize_cached_dict(pure)
cached_dict.__name__ = 'cached_dict'
cached_dict_cy()
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np
import cython as cy
cimport cython as ccy
cimport numpy as cnp
ctypedef fused arr_t:
    cnp.uint16_t
    cnp.uint32_t
    cnp.uint64_t
    cnp.int16_t
    cnp.int32_t
    cnp.int64_t
    cnp.float32_t
    cnp.float64_t
    cnp.complex64_t
    cnp.complex128_t
ctypedef fused result_t:
    cnp.uint16_t
    cnp.uint32_t
    cnp.uint64_t
    cnp.int16_t
    cnp.int32_t
    cnp.int64_t
    cnp.float32_t
    cnp.float64_t
    cnp.complex64_t
    cnp.complex128_t
cpdef void apply_cached_dict_cy(arr_t[:] arr, result_t[:] result, object func):
    cdef size_t size = arr.size
    cdef size_t i
    cdef dict cache = {}
    cdef arr_t x
    cdef result_t y
    for i in range(size):
        x = arr[i]
        if x not in cache:
            y = func(x)
            cache[x] = y
        else:
            y = cache[x]
        result[i] = y
import functools
import flyingcircus as fc
@fc.base.parametric
def vectorize_cached_dict_cy(func, dtype=None):
    @functools.wraps(func)
    def func_vect(arr):
        nonlocal dtype
        shape = arr.shape
        arr = arr.ravel()
        result = np.empty_like(arr) if dtype is None else np.empty(arr.shape, dtype=dtype)
        apply_cached_dict_cy(arr, result, func)
        return np.reshape(result, shape)
    return func_vect
cached_dict_cy = vectorize_cached_dict_cy()(pure)
cached_dict_cy.__name__ = 'cached_dict_cy'
cached_arr_cy()
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np
import cython as cy
cimport cython as ccy
cimport numpy as cnp
ctypedef fused arr_t:
    cnp.uint16_t
    cnp.uint32_t
    cnp.uint64_t
    cnp.int16_t
    cnp.int32_t
    cnp.int64_t
    cnp.float32_t
    cnp.float64_t
    cnp.complex64_t
    cnp.complex128_t
ctypedef fused result_t:
    cnp.uint16_t
    cnp.uint32_t
    cnp.uint64_t
    cnp.int16_t
    cnp.int32_t
    cnp.int64_t
    cnp.float32_t
    cnp.float64_t
    cnp.complex64_t
    cnp.complex128_t
cpdef void apply_cached_arr_cy(
        arr_t[:] arr,
        result_t[:] result,
        object func,
        arr_t[:] uniques,
        result_t[:] func_uniques):
    cdef size_t i
    cdef size_t j
    cdef size_t k
    cdef size_t size = arr.size
    j = 0
    for i in range(size):
        found = False
        for k in range(j):
            if arr[i] == uniques[k]:
                found = True
                break
        if not found:
            uniques[j] = arr[i]
            func_uniques[j] = func(arr[i])
            result[i] = func_uniques[j]
            j += 1
        else:
            result[i] = func_uniques[k]
import functools
import numpy as np
import flyingcircus as fc
@fc.base.parametric
def vectorize_cached_arr_cy(func, dtype=None, max_uniques=None):
    @functools.wraps(func)
    def func_vect(arr):
        nonlocal dtype, max_uniques
        shape = arr.shape
        arr = arr.ravel()
        result = np.empty_like(arr) if dtype is None else np.empty(arr.shape, dtype=dtype)
        if max_uniques is None or max_uniques <= 0:
            max_uniques = arr.size
        elif isinstance(max_uniques, int):
            max_uniques = min(max_uniques, arr.size)
        elif isinstance(max_uniques, float):
            max_uniques = int(arr.size * min(max_uniques, 1.0))
        uniques = np.empty(max_uniques, dtype=arr.dtype)
        func_uniques = np.empty_like(arr) if dtype is None else np.empty(max_uniques, dtype=dtype)
        apply_cached_arr_cy(arr, result, func, uniques, func_uniques)
        return np.reshape(result, shape)
    return func_vect
cached_arr_cy = vectorize_cached_arr_cy()(pure)
cached_arr_cy.__name__ = 'cached_arr_cy'
Notes
The meta-decorator @parametric (inspired from here and available in FlyingCircus as flyingcircus.base.parametric) is defined as below:
def parametric(decorator):
    @functools.wraps(decorator)
    def _decorator(*_args, **_kws):
        def _wrapper(func):
            return decorator(func, *_args, **_kws)
        return _wrapper
    return _decorator
Numba would not be able to handle single-pass methods more efficiently than regular Python code because passing an arbitrary callable would require Python object support enabled, thereby excluding fast JIT looping.
Cython has some limitation in that you would need to specify the expected result data type. You could also tentatively guess it from the input data type, but that is not really ideal.
Some implementations that require temporary storage use a static NumPy array for simplicity. It would be possible to improve them with dynamic arrays in C++, for example, without much loss in speed but with a much smaller memory footprint.
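The parametric meta-decorator described in the notes above can be exercised with a short sketch. The scaled decorator and the two decorated functions are hypothetical names made up for this illustration; only parametric itself follows the definition given in the notes.

```python
import functools

def parametric(decorator):
    """Turn decorator(func, *args, **kws) into a decorator factory."""
    @functools.wraps(decorator)
    def _decorator(*_args, **_kws):
        def _wrapper(func):
            return decorator(func, *_args, **_kws)
        return _wrapper
    return _decorator

# Hypothetical example decorator, not part of the benchmarked code.
@parametric
def scaled(func, factor=1):
    @functools.wraps(func)
    def wrapper(x):
        return factor * func(x)
    return wrapper

@scaled(factor=3)
def double(x):
    return 2 * x

@scaled()  # the call is required even when relying on the defaults
def identity(x):
    return x

print(double(5), identity(7))
```

This is why the approaches above are instantiated as, for example, vectorize_nb_unique()(pure): the first call supplies the optional parameters and the second call wraps the function.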
Benchmarks
Slow function with only 10 unique values (less than ~0.05%)
(This is essentially the use-case of the original post).
Fast function with ~0.05% unique values
Fast function with ~10% unique values
Fast function with ~20% unique values
The full benchmark code (based on this template) is available here.
Discussion and Conclusion
The fastest approach will depend on both N and U.
For slow functions, all cached approaches are faster than just vectorized(). This result should be taken with a grain of salt of course, because the slow function tested here is ~4 orders of magnitude slower than the fast function, and such slow analytical functions are not really too common.
If the function can be written in vectorized form right away, that is by far the fastest approach.
In general, cached_dict_cy() is quite memory efficient and faster than vectorized() (even for fast functions) as long as U / N is ~20% or less.
Its major drawback is that it requires Cython, which is a somewhat complex dependency, and that the result data type must be specified.
The np_unique() approach is faster than vectorized() (even for fast functions) as long as U / N is ~10% or less.
The pd_unique() approach is competitive only for very small U and slow func.
For very small U, hashing is marginally less beneficial and cached_arr_cy() is the fastest approach.
After poking around a bit, here is one approach that uses pandas.unique (based on hashing) instead of numpy.unique (based on sorting).
import pandas as pd
def vectorize_with_pandas(f):
    @wraps(f)
    def f_vec(arr):
        uniques = np.sort(pd.unique(arr.ravel()))
        f_range = np.array([f(x) for x in uniques])
        return f_range[
            np.searchsorted(uniques, arr.ravel())
        ].reshape(arr.shape)
    return f_vec
Giving the following performance boost:
N = 1_000_000
np.random.seed(0)
arr = np.random.randint(0, 10, size=(N, 2)).astype(float)
@vectorize_with_pandas
def pandas_vectorize(x):
    sleep(0.001)
    return x
In [33]: %timeit pure_vectorize(arr)
152 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [34]: %timeit pandas_vectorize(arr)
76.8 ms ± 582 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Also, based on a suggestion by Warren Weckesser, you could go even faster if arr is an array of small integers, e.g. uint8. For example,
def unique_uint8(arr):
    q = np.zeros(256, dtype=int)
    q[arr.ravel()] = 1
    return np.nonzero(q)[0]
def vectorize_uint8(f):
    @wraps(f)
    def f_vec(arr):
        uniques = unique_uint8(arr)
        f_range = np.array([f(x) for x in uniques])
        return f_range[
            np.searchsorted(uniques, arr.ravel())
        ].reshape(arr.shape)
    return f_vec
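A minimal sketch of how the lookup-table idea behaves on a small uint8 array (the sample values are assumptions for illustration). It checks that the result matches np.unique and that np.searchsorted maps every element back to its position among the unique values.

```python
import numpy as np

def unique_uint8(arr):
    # One flag per possible uint8 value; mark the values that occur.
    q = np.zeros(256, dtype=int)
    q[arr.ravel()] = 1
    # Indices of the set flags are the distinct values, in ascending order.
    return np.nonzero(q)[0]

arr = np.array([[7, 3], [3, 250], [7, 0]], dtype=np.uint8)
uniques = unique_uint8(arr)

# Position of each element within the (sorted) unique values.
positions = np.searchsorted(uniques, arr.ravel())

print(uniques)
print(positions)
```

No sort is needed here because np.nonzero already returns the flagged indices in ascending order.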
The following decorator is:
10x faster than your usual_vectorize
10x slower than your vectorize_pure
not doing any sorting (to the best of my knowledge)
using numpy vectorized operations
Code:
def vectorize_pure2(f):
    @wraps(f)
    def f_vec(arr):
        tups = [tuple(x) for x in arr]
        tups_rows = dict(zip(tups, arr))
        new_arr = np.ndarray(arr.shape)
        for row in tups_rows.values():
            row_ixs = (arr == row).all(axis=1)
            new_arr[row_ixs] = f(row)
        return new_arr
    return f_vec
Performance:
@vectorize_pure2
def pure_vectorize2(x):
    sleep(0.001)
    return x
In [49]: %timeit pure_vectorize2(arr)
135 ms ± 879 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Some credit is due to this answer: https://stackoverflow.com/a/16992881/4960855
After intensive use of numba, I am coming back to cython to parallelize some time-consuming functions. Here is a basic example:
import numpy as np
cimport numpy as np
from cython import boundscheck, wraparound
from cython.parallel import parallel, prange
@boundscheck(False)
@wraparound(False)
def cytest1(double[:,::1] a, double[:,::1] b, int ix1, int ix2, int iz1, int iz2):
    cdef int ix
    cdef int iz
    for ix in range(ix1, ix2):
        for iz in range(iz1, iz2):
            b[ix, iz] = 0.5*(a[ix+1, iz] - a[ix-1, iz])
    return b
@boundscheck(False)
@wraparound(False)
def cytest2(double[:,::1] a, double[:,::1] b, int ix1, int ix2, int iz1, int iz2):
    cdef int ix
    cdef int iz
    with nogil, parallel():
        for ix in prange(ix1, ix2):
            for iz in range(iz1, iz2):
                b[ix, iz] = 0.5*(a[ix+1, iz] - a[ix-1, iz])
    return b
When compiling these two functions (with openmp flag), and calling them as follows :
import time
nx, nz = 1024, 1024
a = np.random.rand(nx, nz)
b = np.zeros_like(a)
Nit = 1000
ti = time.time()
for i in range(Nit):
    cytest1(a, b, 5, nx-5, 0, nz)
print('cytest1 : {:.3f} s.'.format(time.time() - ti))
ti = time.time()
for i in range(Nit):
    cytest2(a, b, 5, nx-5, 0, nz)
print('cytest2 : {:.3f} s.'.format(time.time() - ti))
I obtain these execution times :
cytest1 : 1.757 s.
cytest2 : 1.861 s.
When the parallel function is executed, I can see my 4 CPUs in action, but the execution time is nearly the same as the one obtained with the serial function. I tried moving prange to the inner loop, but the results were worse. I also tried different schedule options, without success.
I am clearly missing something, but what ? Is prange unable to chunk the loop with a code trying to access n+X/n-X elements ?
EDIT :
My setup :
model name : Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
MemTotal : 8052556 kB
Python : 3.5.2
cython : 0.28.2
Numpy : 1.14.2
Numba : 0.37.0
The setup.py :
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
ext_modules = [
    Extension("stencil",
              ["stencil.pyx"],
              libraries=["m"],
              extra_compile_args=["-O3", "-ffast-math", "-march=native", "-fopenmp"],
              extra_link_args=['-fopenmp'],
              )
]
setup(
    name="stencil",
    cmdclass={"build_ext": build_ext},
    ext_modules=ext_modules
)
This answer will be a lot of guesswork, but as we will see: a lot depends on the hardware, so it is not easy to explain without having the same hardware at hand.
The first question is: What is the bottle-neck? By looking at the code I would assume, that this is a memory-bound task.
To make it more clear-cut, let's do only the following operation in the loop:
b[ix, iz] = (a[ix+1, iz])
So there is no calculation, only memory accesses.
I use an Intel Xeon E5-2620 @ 2.1 GHz with 2 processors, and the %timeit magic reports:
>>> %timeit cytest1(a,b,5, nx-5, 0, nz)
100 loops, best of 3: 1.99 ms per loop
>>> %timeit cytest2(a,b,5, nx-5, 0, nz)
The slowest run took 234.48 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 324 µs per loop
As we can see, some caching is going on. We have 2 arrays, each 8 MB, which means 16 MB of data has to be "touched". Every processor on my machine has a 15 MB cache, so for a single thread the data is evicted from cache before it can be reused, but if both processors are used there are 30 MB of fast cache, which is big enough to keep all of the data.
That means the speed-up we see is due to larger amount of fast-memory (cache) which can be utilized by the parallelized version.
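The working-set arithmetic behind this argument can be written out as a short sketch. The 15 MB per-processor cache figure is the one stated above; treating MB as MiB and ignoring everything else that competes for cache are simplifying assumptions.

```python
nx, nz = 1024, 1024
bytes_per_double = 8

one_array_mib = nx * nz * bytes_per_double / 2**20   # size of a (or b)
working_set_mib = 2 * one_array_mib                  # a and b together

cache_per_cpu_mib = 15   # per-processor cache size stated above

# Does the whole working set fit in the cache of one or of two processors?
fits_one = working_set_mib <= cache_per_cpu_mib
fits_two = working_set_mib <= 2 * cache_per_cpu_mib

print(one_array_mib, working_set_mib, fits_one, fits_two)
```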
Let's increase the size of the arrays, so the cache isn't big enough even for the parallelized version:
....
>>> nx, nz = 10240, 10240 #100 times bigger
....
>>> %timeit cytest1(a,b,5, nx-5, 0, nz)
1 loop, best of 3: 238 ms per loop
>>> %timeit cytest2(a,b,5, nx-5, 0, nz)
10 loops, best of 3: 99.3 ms per loop
Now it is about 2 times faster, which is easy to explain: two processors have twice the memory-bandwidth compared to one processor and both are utilized by the parallel version.
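As a rough cross-check of the bandwidth argument, the timings quoted above can be turned into an effective throughput. This is only a back-of-the-envelope sketch: it assumes each element of a is read once and each element of b is written once, and it ignores the skipped boundary rows and any extra traffic the hardware generates for writes.

```python
nx = nz = 10240
bytes_moved = 2 * nx * nz * 8   # read a once, write b once (boundary rows ignored)

t_serial = 0.238      # seconds, cytest1 timing quoted above
t_parallel = 0.0993   # seconds, cytest2 timing quoted above

gb_per_s_serial = bytes_moved / t_serial / 1e9
gb_per_s_parallel = bytes_moved / t_parallel / 1e9
speedup = t_serial / t_parallel

print(round(gb_per_s_serial, 1), round(gb_per_s_parallel, 1), round(speedup, 2))
```

Both figures are in the range of a few to a few tens of GB/s, which is consistent with the loop being limited by memory traffic rather than by arithmetic.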
We get very similar results for your formula
b[ix, iz] = 0.5*(a[ix+1, iz] - a[ix-1, iz])
which is not surprising: there are not enough calculations to make it CPU-bound.
sin and cos are pretty CPU-intensive operations, so using them will make the calculation CPU-bound (see appendix for the whole code):
...
b[ix, iz] = sin(a[ix+1, iz])
...
>>> %timeit cytest1(a,b,5, nx-5, 0, nz)
1 loop, best of 3: 1.6 s per loop
>>> %timeit cytest2(a,b,5, nx-5, 0, nz)
1 loop, best of 3: 217 ms per loop
This yields a speed-up of about 7, which is quite reasonable for my machine.
Obviously, for other machines/architectures different behaviors can be observed. But in a nutshell:
I would not expect much speed-up for your formula - the task is memory-bound, so the question is, whether you can achieve a higher bandwidth of memory-accesses or not.
For more CPU-intensive calculation you should be able to see at least some speed-up, which yet depends on your hardware.
Listing (the compile flags shown are for Windows; use -fopenmp on Linux):
%%cython --compile-args=/openmp --link-args=/openmp
from cython.parallel import parallel, prange
from cython import boundscheck, wraparound
from libc.math cimport sin
@boundscheck(False)
@wraparound(False)
def cytest1(double[:,::1] a, double[:,::1] b, int ix1, int ix2, int iz1, int iz2):
    cdef int ix
    cdef int iz
    for ix in range(ix1, ix2):
        for iz in range(iz1, iz2):
            b[ix, iz] = sin(a[ix+1, iz])
    return b
@boundscheck(False)
@wraparound(False)
def cytest2(double[:,::1] a, double[:,::1] b, int ix1, int ix2, int iz1, int iz2):
    cdef int ix
    cdef int iz
    with nogil, parallel():
        for ix in prange(ix1, ix2):
            for iz in range(iz1, iz2):
                b[ix, iz] = sin(a[ix+1, iz])
    return b
I would like to calculate the p values of a large 2D numpy t values array. However, this takes long time and I would like to improve its speed. I tried using GSL.
Although a single gsl_cdf_tdist_P is much much faster than scipy.stats.t.sf, when iterating over the ndarray, the process is very slow. I would like help to improve this.
See the code below.
GSL_Test.pyx
import cython
cimport cython
import numpy
cimport numpy
from cython_gsl cimport *
DTYPE = numpy.float32
ctypedef numpy.float32_t DTYPE_t
cdef get_gsl_p(double t, double nu):
    return (1 - gsl_cdf_tdist_P(t, nu)) * 2
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef get_gsl_p_for_2D_matrix(numpy.ndarray[DTYPE_t, ndim=2] t_matrix, int n):
    cdef unsigned int rows = t_matrix.shape[0]
    cdef numpy.ndarray[DTYPE_t, ndim=2] out = numpy.zeros((rows, rows), dtype='float32')
    cdef unsigned int row, col
    for row in range(rows):
        for col in range(rows):
            out[row, col] = get_gsl_p(t_matrix[row, col], n-2)
    return out
def get_gsl_p_for_2D_matrix_def(numpy.ndarray[DTYPE_t, ndim=2] t_matrix, int n):
    return get_gsl_p_for_2D_matrix(t_matrix, n)
ipython
import GSL_Test
import numpy
import scipy.stats
a = numpy.random.rand(3544, 3544).astype('float32')
%timeit -n 1 GSL_Test.get_gsl_p_for_2D_matrix(a, 25)
1 loop, best of 3: 7.87 s per loop
%timeit -n 1 scipy.stats.t.sf(a, 25)*2
1 loop, best of 3: 4.66 s per loop
UPDATE: After adding cdef declarations I was able to reduce the computation time, but it is still slower than scipy. The code above has been modified to include the cdef declarations.
%timeit -n 1 GSL_Test.get_gsl_p_for_2D_matrix_def(a, 25)
1 loop, best of 3: 6.73 s per loop
You can get some small gain in raw performance by using a raw special function instead of stats.t.sf. Looking at the source, you find (https://github.com/scipy/scipy/blob/master/scipy/stats/_continuous_distns.py#L3849)
def _sf(self, x, df):
    return sc.stdtr(df, -x)
So you can use stdtr directly:
import numpy as np
from scipy import stats
from scipy.special import stdtr

np.random.seed(1234)
x = np.random.random((3740, 374))
t1 = stats.t.sf(x, 25)
t2 = stdtr(25, -x)
1 loop, best of 3: 653 ms per loop
1 loop, best of 3: 562 ms per loop
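To tie this back to the question's formula, here is a small sketch (assuming SciPy is installed) that checks that 2 * (1 - CDF(t)), as computed by get_gsl_p, matches 2 * stdtr(df, -t). The array size and the names t_values, p_cdf and p_stdtr are made up for illustration.

```python
import numpy as np
from scipy import stats
from scipy.special import stdtr

rng = np.random.RandomState(1234)
t_values = rng.rand(500, 500)      # stand-in for the question's t matrix
n = 25                             # sample size, as passed in the question
df = n - 2                         # degrees of freedom used by get_gsl_p

# Two-sided p-value as written in the question: 2 * (1 - CDF(t)).
p_cdf = 2.0 * (1.0 - stats.t.cdf(t_values, df))

# Same quantity through the raw special function: SF(t) = stdtr(df, -t).
p_stdtr = 2.0 * stdtr(df, -t_values)

print(np.allclose(p_cdf, p_stdtr))
```

Note that both expressions are only a valid two-sided p-value for non-negative t; for signed t values you would pass -np.abs(t_values) to stdtr instead.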
If you do reach out for cython, the typed memoryview syntax often gives you faster code than the old ndarray syntax:
from scipy.special.cython_special cimport stdtr
from numpy cimport npy_intp
import numpy as np

def tsf(double [:, ::1] x, int df=25):
    cdef double[:, ::1] out = np.empty_like(x)
    cdef npy_intp i, j
    cdef double tmp, xx
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            xx = x[i, j]
            out[i, j] = stdtr(df, -xx)
    return np.asarray(out)
Here I'm also using the cython_special interface, which at the time of writing is only available in the dev version of scipy (http://scipy.github.io/devdocs/special.cython_special.html#module-scipy.special.cython_special), but you can use GSL if you want.
Finally, if you suspect a bottleneck in iterations, don't forget to inspect the output of cython -a to see if there's some python overhead in the hot loops.
The purpose of this mathematical function is to compute a distance between two (or more) protein structures using dihedral angles:
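Written out to match the numpy code below, and assuming $m$ is the number of dihedral angles per structure and $a_j$, $b_j$ are the $j$-th angles of the two structures, the distance is:

```latex
d(a, b) = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \frac{1}{2}\left(1 - \cos(a_j - b_j)\right)}
```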
It is very useful in structural biology, for example. I have already coded this function in Python using numpy, but the goal is a faster implementation. As a reference for computation time, I use the Euclidean distance function available in the scikit-learn package.
Here is the code I have at the moment:
import numpy as np
import numexpr as ne
from sklearn.metrics.pairwise import euclidean_distances
# We have 10000 structures with 100 dihedral angles
n = 10000
m = 100
# Generate some random data
c = np.random.rand(n,m)
# Generate random int number
x = np.random.randint(c.shape[0])
print c.shape, x
# First version of the dihedral_distances function, with numpy
def dihedral_distances(a, b):
    l = 1./a.shape[0]
    return np.sqrt(l* np.sum((0.5)*(1. - np.cos(a-b)), axis=1))
# Accelerated version with numexpr
def dihedral_distances_ne(a, b):
    l = 1./a.shape[0]
    tmp = ne.evaluate('sum((0.5)*(1. - cos(a-b)), axis=1)')
    return ne.evaluate('sqrt(l* tmp)')
# The reference function whose computation time
# I am trying to get as close to as possible
%timeit euclidean_distances(c[x,:], c)[0]
1000 loops, best of 3: 1.07 ms per loop
# Computation time of the first version of the dihedral_distances function
# We choose randomly 1 structure among the 10000 structures.
# And we compute the dihedral distance between this one and the others
%timeit dihedral_distances(c[x,:], c)
10 loops, best of 3: 21.5 ms per loop
# Computation time of the accelerated function with numexpr
%timeit dihedral_distances_ne(c[x,:], c)
100 loops, best of 3: 9.44 ms per loop
9.44 ms is very fast, but it is still too slow if you need to run it a million times. So what is the next step? Cython? PyOpenCL? I have some experience with PyOpenCL, but I have never coded anything as elaborate as this. I don't know whether the dihedral distances can be computed in one step on the GPU, as I do with numpy, or how to proceed.
Thank you for helping me!
EDIT:
Thank you guys! I am currently working on the full solution and once it's finished I will put the code here.
CYTHON VERSION:
%load_ext cython
import numpy as np
np.random.seed(1234)
n = 10000
m = 100
c = np.random.rand(n,m)
x = np.random.randint(c.shape[0])
print c.shape, x
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
import numpy as np
cimport numpy as np
from libc.math cimport sqrt, cos
cimport cython
from cython.parallel cimport parallel, prange

# Define a function pointer to a metric
ctypedef double (*metric)(double[: ,::1], np.intp_t, np.intp_t)

cdef extern from "math.h" nogil:
    double cos(double x)
    double sqrt(double x)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef double dihedral_distances(double[:, ::1] a, np.intp_t i1, np.intp_t i2):
    cdef double res
    cdef int m
    cdef int j
    res = 0.
    m = a.shape[1]
    for j in range(m):
        res += 1. - cos(a[i1, j] - a[i2, j])
    res /= 2.*m
    return sqrt(res)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef double dihedral_distances_p(double[:, ::1] a, np.intp_t i1, np.intp_t i2):
    cdef double res
    cdef int m
    cdef int j
    res = 0.
    m = a.shape[1]
    with nogil, parallel(num_threads=2):
        for j in prange(m, schedule='dynamic'):
            res += 1. - cos(a[i1, j] - a[i2, j])
    res /= 2.*m
    return sqrt(res)

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise(double[: ,::1] c not None, np.intp_t x, p = True):
    cdef metric dist_func
    if p:
        dist_func = &dihedral_distances_p
    else:
        dist_func = &dihedral_distances
    cdef np.intp_t i, n_structures
    n_samples = c.shape[0]
    cdef double[::1] res = np.empty(n_samples)
    for i in range(n_samples):
        res[i] = dist_func(c, x, i)
    return res
%timeit pairwise(c, x, False)
100 loops, best of 3: 17 ms per loop
# Parallel version
%timeit pairwise(c, x, True)
10 loops, best of 3: 37.1 ms per loop
So I followed your link to create a Cython version of the dihedral distance function. We gain some speed over numpy, though not much, and it is still slower than the numexpr version (17 ms vs. 9.44 ms). I then tried to parallelize the function using prange, and it got worse (37.1 ms vs. 17 ms vs. 9.44 ms)!
Am I missing something?
If you're willing to use Pythran (http://pythran.readthedocs.io/), you can build on the numpy implementation and get better performance than Cython for this case:
#pythran export np_cos_norm(float[], float[])
import numpy as np
def np_cos_norm(a, b):
    val = np.sum(1. - np.cos(a-b))
    return np.sqrt(val / 2. / a.shape[0])
And compile it with:
pythran fast.py
This gives about a 2x speed-up over the Cython version on average.
If you instead use:
pythran fast.py -march=native -DUSE_BOOST_SIMD -fopenmp
you'll get a vectorized, parallel version that runs slightly faster:
100000 loops, best of 3: 2.54 µs per loop
1000000 loops, best of 3: 674 ns per loop
100000 loops, best of 3: 16.9 µs per loop
100000 loops, best of 3: 4.31 µs per loop
10000 loops, best of 3: 176 µs per loop
10000 loops, best of 3: 42.9 µs per loop
(using the same testbed as ev-br)
Here's a quick-and-dirty try with cython, for just a pair of 1D arrays:
(in an IPython notebook)
%%cython
cimport cython
cimport numpy as np

cdef extern from "math.h":
    double cos(double x) nogil
    double sqrt(double x) nogil

def cos_norm(a, b):
    return cos_norm_impl(a, b)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef double cos_norm_impl(double[::1] a, double[::1] b) nogil:
    cdef double res = 0., val
    cdef int m = a.shape[0]
    # XXX: shape of b not checked
    cdef int j
    for j in range(m):
        val = a[j] - b[j]
        res += 1. - cos(val)
    res /= 2.*m
    return sqrt(res)
Comparing with a straightforward numpy implementation,
def np_cos_norm(a, b):
    val = np.add.reduce(1. - np.cos(a-b))
    return np.sqrt(val / 2. / a.shape[0])
I get
np.random.seed(1234)
for n in [100, 1000, 10000]:
    x = np.random.random(n)
    y = np.random.random(n)
    %timeit cos_norm(x, y)
    %timeit np_cos_norm(x, y)
    print '\n'
100000 loops, best of 3: 3.04 µs per loop
100000 loops, best of 3: 12.4 µs per loop
100000 loops, best of 3: 18.8 µs per loop
10000 loops, best of 3: 30.8 µs per loop
1000 loops, best of 3: 196 µs per loop
1000 loops, best of 3: 223 µs per loop
So, depending on the length of your vectors, the speedup ranges from a factor of 4 down to essentially nothing.
For computing pairwise distances, you can probably do much better, as shown in this blog post, but of course YMMV.
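One way to get there in plain NumPy is the identity cos(a - b) = cos(a)cos(b) + sin(a)sin(b), which turns the sum over angles into dot products so that cos and sin of the data set are computed only once. This is a sketch of a swapped-in technique, not necessarily what the blog post does; the function names and the smaller array size are assumptions for illustration.

```python
import numpy as np

def dihedral_distances_direct(a, b):
    """Reference: distance from one structure a (m,) to every row of b (n, m)."""
    m = a.shape[0]
    return np.sqrt(np.sum(0.5 * (1.0 - np.cos(a - b)), axis=1) / m)

def dihedral_distances_matmul(a, cos_b, sin_b):
    """Same distances via cos(a - b) = cos(a)cos(b) + sin(a)sin(b).

    cos_b and sin_b are np.cos(b) and np.sin(b), precomputed once.
    """
    m = a.shape[0]
    sum_cos = cos_b.dot(np.cos(a)) + sin_b.dot(np.sin(a))    # shape (n,)
    # Rounding can push the argument slightly below zero when a is a row of b.
    return np.sqrt(np.maximum(0.5 * (1.0 - sum_cos / m), 0.0))

rng = np.random.RandomState(1234)
c = rng.rand(1000, 100)                 # smaller than the 10000 x 100 in the question
x = 42                                  # arbitrary row index, for illustration only
cos_c, sin_c = np.cos(c), np.sin(c)     # computed once, reused for every query

d_direct = dihedral_distances_direct(c[x], c)
d_matmul = dihedral_distances_matmul(c[x], cos_c, sin_c)
print("max abs difference:", np.abs(d_direct - d_matmul).max())
```

The same identity gives the full n x n matrix in one go as cos_c.dot(cos_c.T) + sin_c.dot(sin_c.T), at the cost of holding an n x n array in memory.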
I've been working on speeding up a resampling calculation for a particle filter. As Python has many ways to speed it up, I thought I'd try them all. Unfortunately, the Numba version is incredibly slow. As Numba should result in a speed-up, I assume this is an error on my part.
I tried 4 different versions:
Numba
Python
Numpy
Cython
The code for each is below:
import numpy as np
import scipy as sp
import numba as nb
from cython_resample import cython_resample
@nb.autojit
def numba_resample(qs, xs, rands):
    n = qs.shape[0]
    lookup = np.cumsum(qs)
    results = np.empty(n)
    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results

def python_resample(qs, xs, rands):
    n = qs.shape[0]
    lookup = np.cumsum(qs)
    results = np.empty(n)
    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results

def numpy_resample(qs, xs, rands):
    results = np.empty_like(qs)
    lookup = sp.cumsum(qs)
    for j, key in enumerate(rands):
        i = sp.argmax(lookup>key)
        results[j] = xs[i]
    return results
#The following is the code for the cython module. It was compiled in a
#separate file, but is included here to aid in the question.
"""
import numpy as np
cimport numpy as np
cimport cython
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
@cython.boundscheck(False)
def cython_resample(np.ndarray[DTYPE_t, ndim=1] qs,
                    np.ndarray[DTYPE_t, ndim=1] xs,
                    np.ndarray[DTYPE_t, ndim=1] rands):
    if qs.shape[0] != xs.shape[0] or qs.shape[0] != rands.shape[0]:
        raise ValueError("Arrays must have same shape")
    assert qs.dtype == xs.dtype == rands.dtype == DTYPE
    cdef unsigned int n = qs.shape[0]
    cdef unsigned int i, j
    cdef np.ndarray[DTYPE_t, ndim=1] lookup = np.cumsum(qs)
    cdef np.ndarray[DTYPE_t, ndim=1] results = np.zeros(n, dtype=DTYPE)
    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results
"""
if __name__ == '__main__':
    n = 100
    xs = np.arange(n, dtype=np.float64)
    qs = np.array([1.0/n,]*n)
    rands = np.random.rand(n)
    print "Timing Numba Function:"
    %timeit numba_resample(qs, xs, rands)
    print "Timing Python Function:"
    %timeit python_resample(qs, xs, rands)
    print "Timing Numpy Function:"
    %timeit numpy_resample(qs, xs, rands)
    print "Timing Cython Function:"
    %timeit cython_resample(qs, xs, rands)
This results in the following output:
Timing Numba Function:
1 loops, best of 3: 8.23 ms per loop
Timing Python Function:
100 loops, best of 3: 2.48 ms per loop
Timing Numpy Function:
1000 loops, best of 3: 793 µs per loop
Timing Cython Function:
10000 loops, best of 3: 25 µs per loop
Any idea why the numba code is so slow? I assumed it would be at least comparable to Numpy.
Note: if anyone has any ideas on how to speed up either the Numpy or Cython code samples, that would be nice too:) My main question is about Numba though.
The problem is that numba can't intuit the type of lookup. If you put a print nb.typeof(lookup) in your method, you'll see that numba is treating it as an object, which is slow. Normally I would just define the type of lookup in a locals dict, but I was getting a strange error. Instead I just created a little wrapper, so that I could explicitly define the input and output types.
@nb.jit(nb.f8[:](nb.f8[:]))
def numba_cumsum(x):
    return np.cumsum(x)

@nb.autojit
def numba_resample2(qs, xs, rands):
    n = qs.shape[0]
    #lookup = np.cumsum(qs)
    lookup = numba_cumsum(qs)
    results = np.empty(n)
    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results
Then my timings are:
print "Timing Numba Function:"
%timeit numba_resample(qs, xs, rands)
print "Timing Revised Numba Function:"
%timeit numba_resample2(qs, xs, rands)
Timing Numba Function:
100 loops, best of 3: 8.1 ms per loop
Timing Revised Numba Function:
100000 loops, best of 3: 15.3 µs per loop
You can go even a little faster still if you use jit instead of autojit:
@nb.jit(nb.f8[:](nb.f8[:], nb.f8[:], nb.f8[:]))
For me that lowers it from 15.3 microseconds to 12.5 microseconds, but it's still impressive how well autojit does.
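For readers on a recent Numba release, where (as far as I know) autojit is no longer available, a comparable sketch uses nb.njit, which compiles in nopython mode and raises an error rather than silently falling back to slow object code. The function name resample_nopython and the fallback decorator are assumptions added for illustration; the fallback only exists so the snippet still runs without Numba installed.

```python
import numpy as np

try:
    import numba as nb
    njit = nb.njit             # nopython-mode JIT; fails loudly if typing fails
except ImportError:            # fallback so the sketch still runs without Numba
    def njit(func):
        return func

@njit
def resample_nopython(qs, xs, rands):
    n = qs.shape[0]
    lookup = np.cumsum(qs)     # supported inside nopython mode
    results = np.empty(n)
    for j in range(n):
        # Default to the last particle in case rounding leaves lookup[-1] < rands[j].
        results[j] = xs[n - 1]
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results

n = 100
xs = np.arange(n, dtype=np.float64)
qs = np.full(n, 1.0 / n)
rands = np.random.RandomState(0).rand(n)
out = resample_nopython(qs, xs, rands)
print(out[:5])
```

The first call includes compilation time, so time a second call when benchmarking.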
Faster numpy version (10x speedup compared to numpy_resample)
def numpy_faster(qs, xs, rands):
    lookup = np.cumsum(qs)
    mm = lookup[None,:]>rands[:,None]
    I = np.argmax(mm,1)
    return xs[I]
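The broadcasting version above builds an n x n boolean mask, which gets expensive for large n. Since lookup is sorted, a binary search finds the same indices; the sketch below uses np.searchsorted for that. The name searchsorted_resample and the clipping of the index are my own additions, not part of the original answers.

```python
import numpy as np

def numpy_faster(qs, xs, rands):
    """Broadcasting version from the answer above: builds an n x n boolean mask."""
    lookup = np.cumsum(qs)
    mm = lookup[None, :] > rands[:, None]
    return xs[np.argmax(mm, 1)]

def searchsorted_resample(qs, xs, rands):
    """Binary-search version: O(n log n) time and O(n) extra memory."""
    lookup = np.cumsum(qs)
    # side='right' gives the first index i with lookup[i] > rands[j].
    idx = np.searchsorted(lookup, rands, side='right')
    # Guard against rounding leaving lookup[-1] slightly below a random draw.
    idx = np.minimum(idx, len(xs) - 1)
    return xs[idx]

n = 1000
xs = np.arange(n, dtype=np.float64)
qs = np.full(n, 1.0 / n)
rands = np.random.RandomState(0).rand(n)

same = np.array_equal(numpy_faster(qs, xs, rands),
                      searchsorted_resample(qs, xs, rands))
print(same)
```

The two versions differ only in the edge case where a draw is not below any cumulative weight: argmax then returns index 0, while the clipped search returns the last index.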