I've been attempting to optimize a piece of python code that involves large multi-dimensional array calculations. I am getting counterintuitive results with numba. I am running on an MBP, mid 2015, 2.5 GHz i7 quadcore, OS 10.10.5, python 2.7.11. Consider the following:
import numpy as np
from numba import jit, vectorize, guvectorize
import numexpr as ne
import timeit
def add_two_2ds_naive(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]
@jit
def add_two_2ds_jit(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]
@guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
             '(n,m),(n,m)->(n,m)',target='cpu')
def add_two_2ds_cpu(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]
@guvectorize(['(float64[:,:],float64[:,:],float64[:,:])'],
             '(n,m),(n,m)->(n,m)',target='parallel')
def add_two_2ds_parallel(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]
def add_two_2ds_numexpr(A,B,res):
    # note: this rebinds the local name only; the caller's res is not filled
    res = ne.evaluate('A+B')
if __name__=="__main__":
    np.random.seed(69)
    A = np.random.rand(10000,100)
    B = np.random.rand(10000,100)
    res = np.zeros((10000,100))
I can now run timeit on the various functions:
%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.16 ms per loop
%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.19 ms per loop
%timeit add_two_2ds_parallel(A,B,res)
100 loops, best of 3: 6.9 ms per loop
%timeit add_two_2ds_numexpr(A,B,res)
1000 loops, best of 3: 1.62 ms per loop
It seems that 'parallel' is not even using the majority of a single core: its usage in top shows python hitting ~40% CPU for 'parallel' and ~100% for 'cpu', while numexpr hits ~300%.
There are two issues with your @guvectorize implementations. The first is that you are doing all the looping inside your @guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. Both @vectorize and @guvectorize parallelize on the broadcast dimensions in a ufunc/gufunc. Since the signature of your gufunc is 2D, and your inputs are 2D, there is only a single call to the inner function, which explains why you never saw more than 100% CPU usage.
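To illustrate what "broadcast dimensions" means here, the following is a small sketch that uses NumPy's np.vectorize with a gufunc-style signature as a stand-in for Numba (an analogy only, it does not run anything in parallel). The helper count_kernel_calls is a made-up name for this example; it simply counts how often the inner kernel gets called for a given signature.

```python
import numpy as np

def count_kernel_calls(signature, A, B):
    """Return how many times the kernel runs for a gufunc-style signature."""
    calls = []

    def kernel(a, b):
        calls.append(a.shape)  # record the shape of the chunk this call received
        return a + b

    # With a signature, np.vectorize loops over the broadcast (non-core) dimensions.
    wrapped = np.vectorize(kernel, signature=signature)
    result = wrapped(A, B)
    assert np.array_equal(result, A + B)
    return len(calls)

A = np.ones((4, 3))
B = np.ones((4, 3))

# 2D core signature with 2D inputs: the whole array is one core call.
calls_2d = count_kernel_calls('(n,m),(n,m)->(n,m)', A, B)
# 1D core signature with 2D inputs: the leading axis becomes a broadcast dimension.
calls_1d = count_kernel_calls('(m),(m)->(m)', A, B)
print(calls_2d, calls_1d)
```

With the 2D signature there is a single kernel call, so a parallel target has nothing to split up; with the 1D signature there is one call per row, and those calls are what could be distributed.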
The best way to write the function you have above is to use a regular ufunc:
@vectorize('(float64, float64)', target='parallel')
def add_ufunc(a, b):
    return a + b
Then on my system, I see these speeds:
%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.87 ms per loop
%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.81 ms per loop
%timeit add_two_2ds_parallel(A,B,res)
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 2.43 ms per loop
%timeit add_two_2ds_numexpr(A,B,res)
100 loops, best of 3: 2.79 ms per loop
%timeit add_ufunc(A, B, res)
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 2.03 ms per loop
(This is a very similar OS X system to yours, but with OS X 10.11.)
Although Numba's parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it doesn't beat the simple single-threaded CPU case. I suspect that the bottleneck is due to memory (or cache) bandwidth, but I haven't done the measurements to check that.
Generally speaking, you will see much more benefit from the parallel ufunc target if you are doing more math operations per memory element (like, say, a cosine).
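As a rough plausibility check of the memory-bandwidth suspicion, here is a back-of-the-envelope sketch using the timings quoted above. The helper name effective_bandwidth_gb_s is made up for this example, and it assumes each of the three float64 arrays crosses the memory bus exactly once (caches and write-allocate traffic are ignored), so treat the numbers as an estimate, not a measurement.

```python
def effective_bandwidth_gb_s(shape, n_arrays, seconds, itemsize=8):
    """Bytes moved per second (in GB/s) if every array is touched exactly once."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    total_bytes = n_elements * itemsize * n_arrays
    return total_bytes / seconds / 1e9

# Two inputs plus one output of shape (10000, 100), float64 (8 bytes each).
bw_question = effective_bandwidth_gb_s((10000, 100), 3, 1.16e-3)  # jit timing in the question
bw_answer = effective_bandwidth_gb_s((10000, 100), 3, 1.87e-3)    # jit timing in the answer

print("question machine: %.1f GB/s" % bw_question)
print("answer machine:   %.1f GB/s" % bw_answer)
```

If these figures are already close to what your memory system can deliver to a single core, extra threads have little left to gain for a plain addition, which would be consistent with the observation above.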
I was just revisiting some of my code to improve the performance and stumbled over something strange:
a = np.linspace(10,1000,1000000).reshape(1000,1000)
%timeit np.square(a)
100 loops, best of 3: 8.07 ms per loop
%timeit a*a
100 loops, best of 3: 8.18 ms per loop
%timeit a**2
100 loops, best of 3: 8.32 ms per loop
Ok it seems to have some overhead when using the power-operator (**) but otherwise they seem identical (I guess NumPy is doing that) but then it got strange:
%timeit np.power(a, 2)
10 loops, best of 3: 121 ms per loop
So there is no problem but it seems a bit inconsistent to have a fallback for the magic pow but not for the UFUNC. But then I got interested since I am using third powers a lot:
%timeit a*a*a
100 loops, best of 3: 18.1 ms per loop
%timeit a**3
10 loops, best of 3: 121 ms per loop
%timeit np.power(a, 3)
10 loops, best of 3: 121 ms per loop
There seems to be no "shortcut" in the third power and UFUNC and 'magic-pow' work the same (at least in regard to performance).
But that's not that good since I want a consistent method of using powers in my code and I'm not quite sure how to wrap the __pow__ of numpy.
So to get to the point, my question is :
Is there a way I can wrap NumPy's __pow__ method? I want a consistent way of writing powers in my script, not a**2 in one place and power(a, 3) in another. Simply writing a**3 and having it redirected to my power function would be preferred (but for that I would need to somehow wrap the ndarray's __pow__, wouldn't I?).
Currently I am using a shortcut, but it's not that beautiful (I even have to special-case exponent == 2, since np.power is not optimal there):
def power(array, exponent):
    if exponent == 2: #catch this, or it calls the slow np.power(array, exponent)
        return np.square(array)
    if exponent == 3:
        return array * array * array
    #As soon as np.cbrt is available catch the exponent 4/3 here too
    return np.power(array, exponent)
%timeit power(a, 3)
100 loops, best of 3: 17.8 ms per loop
%timeit a**3
10 loops, best of 3: 121 ms per loop
I am using NumPy v1.9.3 and I do not want to subclass np.ndarray just for wrapping the __pow__ method. :-)
EDIT: I rewrote the part where I get to my question. To clarify it: I am not asking about why NumPy does it the way it does - that is just to explain why I ask the question.
This is a good catch. I too wonder why it behaves that way. But to keep the answer short and concise, I would just do:
from functools import reduce  # built-in on Python 2, needs this import on Python 3
def mypower(array, exponent):
    return reduce(lambda x,y: x*y, [array for _ in range(exponent)])
%timeit mypower(a,2)
100 loops, best of 3: 3.68 ms per loop
%timeit mypower(a,3)
100 loops, best of 3: 8.09 ms per loop
%timeit mypower(a,4)
100 loops, best of 3: 12.6 ms per loop
Obviously the overhead increases with the exponent, but for low exponents this is roughly 10x faster (or better) than the 121 ms that np.power needs.
Note that this is different from the original numpy implementation, which is not specific to a scalar numeric exponent and also supports an array of exponents as the second argument (check it out here).
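Since the cost of the reduce-based version grows linearly with the exponent, one possible refinement is exponentiation by squaring, which needs only about log2(exponent) multiplications. The sketch below is not what NumPy does internally; int_power is a hypothetical helper name, and it assumes a non-negative integer exponent.

```python
import numpy as np

def int_power(array, exponent):
    """Raise array to a non-negative integer power using only multiplications."""
    if exponent < 0 or int(exponent) != exponent:
        raise ValueError("exponent must be a non-negative integer")
    exponent = int(exponent)
    result = None  # accumulated product; stays None until the first factor is used
    base = array
    while exponent:
        if exponent & 1:
            # copy on first use so the caller's array is never returned as-is
            result = base.copy() if result is None else result * base
        exponent >>= 1
        if exponent:
            base = base * base  # square for the next binary digit
    return np.ones_like(array) if result is None else result

a = np.linspace(1, 2, 12).reshape(3, 4)
print(np.allclose(int_power(a, 5), np.power(a, 5)))
```

For exponent 8, for instance, this does three squarings instead of the seven multiplications of the reduce version; whether that matters in practice depends on how large your exponents get.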
Overloading the operator
The way to do what you want is to subclass ndarray and use views. See the following example:
import numexpr
import numpy as np
from functools import reduce  # built-in on Python 2, needs this import on Python 3
class MyArray(np.ndarray):
    def __pow__(self, other):
        return reduce(lambda x,y: x*y, [self for _ in range(other)])
class NumExprArray(np.ndarray):
    def __pow__(self, other):
        return numexpr.evaluate("self**%f" % other)
        #This implies extra overhead, is as much as 4x slower:
        #return numexpr.evaluate("self**other")
a = np.linspace(10,1000,1000000).reshape(1000,1000).view(MyArray)
na = np.linspace(10,1000,1000000).reshape(1000,1000).view(NumExprArray)
%timeit a**2
1000 loops, best of 3: 1.2 ms per loop
%timeit na**2
1000 loops, best of 3: 1.14 ms per loop
%timeit a**3
100 loops, best of 3: 4.69 ms per loop
%timeit na**3
100 loops, best of 3: 2.36 ms per loop
%timeit a**4
100 loops, best of 3: 6.59 ms per loop
%timeit na**4
100 loops, best of 3: 2.4 ms per loop
For more information on this method please follow this link. Another way would be to use a custom infix operator, but that hurts readability. As one can see, numexpr should be the way to go.
If I read the source correctly, when numpy performs power, it checks whether the numerical value of the exponent is one of the special cases (-0.5, 0, 0.5, 1, and 2). If so, the operation is done using special routines. All other numerical values of the exponent are considered "general", and will be fed into the generic power function, which may be slow (especially if the exponent is promoted to floating-point type, but I'm not sure if this is the case with a ** 3).
How would you vectorize the evaluation of arrays of lambda functions?
Here's an example to understand what I'm talking about. (And even though I'm using numpy arrays, I'm not limiting myself to only using numpy.)
Let's say I have the following numpy arrays.
array1 = np.array(["hello", 9])
array2 = np.array([lambda s: s == "hello", lambda num: num < 10])
(You could store these kinds of objects in numpy without throwing an error, believe it or not.) What I want is something akin to the following.
array2 * array1
# Return np.array([True, True]). PS: An explanation of how to `AND` all of
# booleans together quickly would be nice too.
Of course, this seems impractical for arrays of size 2, but for arrays of arbitrary sizes, I'll assume this would yield a performance boost because of all of the low level optimizations.
So, anyone know how to write this weird kind of python code?
The simple answer, of course, is that you can't easily do this with numpy (or with standard Python, for that matter). Numpy doesn't actually vectorize most operations itself, to my knowledge: it uses libraries like BLAS/ATLAS/etc that do for certain situations. Even if it did, it would do so in C for specific situations: it certainly can't vectorize Python function execution.
If you want to involve multiprocessing in this, it is possible, but it depends on your situation. Are your individual function applications time-consuming, making them feasible to send out one-by-one, or do you need a very large number of fast function executions, in which case you'd probably want to send batches of them to each process?
In general, because of what could be argued as poor fundamental design (e.g., the Global Interpreter Lock), it's very difficult with standard Python to have lightweight parallelization as you're hoping for here. There are significantly heavier methods, like the multiprocessing module or IPython.parallel, but these require some work to use.
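For completeness, here is a plain-Python sketch of the pairwise evaluation and of the "AND everything together" part of the question. The function names evaluate_pairs and all_pass are made up for this example. It uses lists on purpose: np.array(["hello", 9]) without dtype=object will typically coerce 9 to the string '9', which would break the numeric comparison.

```python
values = ["hello", 9]
predicates = [lambda s: s == "hello", lambda num: num < 10]

def evaluate_pairs(predicates, values):
    """Apply the i-th predicate to the i-th value and collect the results."""
    return [pred(val) for pred, val in zip(predicates, values)]

def all_pass(predicates, values):
    """AND all results together; all() stops at the first False."""
    return all(pred(val) for pred, val in zip(predicates, values))

results = evaluate_pairs(predicates, values)
print(results)                          # [True, True]
print(all_pass(predicates, values))     # True
print(all_pass(predicates, ["bye", 9])) # False
```

Because all() consumes a generator here, later predicates are not even called once one of them fails, which is often a bigger win than any vectorization of the calls themselves.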
Alright guys, I have an answer: numpy's vectorize.
Please read the edited section though. You'll discover that python actually optimizes code for you, which actually defeats the purpose of using numpy arrays in this case. (But using numpy arrays does not decrease the performance.)
What the last test really shows is that python lists are as efficient as they could be, and so this vectorization procedure is unnecessary. This is why I didn't mark this answer as the "best answer".
Setup code:
def factory(i): return lambda num: num==i
array1 = list()
for i in range(10000): array1.append(factory(i))
array1 = np.array(array1)
array2 = np.array(xrange(10000))
The "unvectorized" version:
def evaluate(array1, array2):
    return [func(val) for func, val in zip(array1, array2)]
%timeit evaluate(array1, array2)
# 100 loops, best of 3: 10 ms per loop
The vectorized version
def evaluate2(func, b): return func(b)
vec_evaluate = np.vectorize(evaluate2)
%timeit vec_evaluate(array1, array2)
# 100 loops, best of 3: 2.65 ms per loop
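Keep in mind that NumPy's documentation describes np.vectorize as a convenience whose implementation is essentially a for loop, so every function is still called from Python. The sketch below (outside IPython, using the timeit module) compares the list comprehension, np.vectorize and the lower-level np.frompyfunc on the same data. The names apply_one, ufunc_evaluate, func_arr and val_arr are made up for this example, and I am not quoting any timings since they depend on the machine.

```python
import timeit
import numpy as np

def factory(i):
    return lambda num: num == i

n = 1000
funcs = [factory(i) for i in range(n)]
vals = list(range(n))

def evaluate(funcs, vals):
    return [f(v) for f, v in zip(funcs, vals)]

def apply_one(func, val):
    return func(val)

# otypes is given explicitly so np.vectorize does not need a trial call
vec_evaluate = np.vectorize(apply_one, otypes=[bool])
# frompyfunc builds an object-dtype ufunc with 2 inputs and 1 output
ufunc_evaluate = np.frompyfunc(apply_one, 2, 1)

func_arr = np.array(funcs, dtype=object)
val_arr = np.arange(n)

out_list = evaluate(funcs, vals)
out_vec = vec_evaluate(func_arr, val_arr)
out_ufunc = ufunc_evaluate(func_arr, val_arr).astype(bool)

for label, stmt in [
    ("list comprehension", lambda: evaluate(funcs, vals)),
    ("np.vectorize", lambda: vec_evaluate(func_arr, val_arr)),
    ("np.frompyfunc", lambda: ufunc_evaluate(func_arr, val_arr)),
]:
    print(label, timeit.timeit(stmt, number=20))
```

All three produce the same results; which one is fastest depends mostly on whether the inputs are already lists or already arrays, since converting between the two is part of what gets timed.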
EDIT
Okay, I just wanted to paste more benchmarks that I received using the above tests, except with different test cases.
I made a third edit, showing what happens if you simply use python lists. Long story short, you won't lose much by doing so. This test case is at the very bottom.
Test cases only involving integers
In summary, if n is small, then the unvectorized version is better. Otherwise, vectorized is the way to go.
With n = 30
%timeit evaluate(array1, array2)
# 10000 loops, best of 3: 35.7 µs per loop
%timeit vec_evaluate(array1, array2)
# 10000 loops, best of 3: 27.6 µs per loop
With n = 7
%timeit evaluate(array1, array2)
100000 loops, best of 3: 9.93 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 21.6 µs per loop
Test cases involving strings
Vectorization wins.
Setup code:
def factory(i): return lambda num: str(num)==str(i)
array1 = list()
for i in range(7):
    array1.append(factory(i))
array1 = np.array(array1)
array2 = np.array(xrange(7))
With n = 10000
%timeit evaluate(array1, array2)
10 loops, best of 3: 36.7 ms per loop
%timeit vec_evaluate(array1, array2)
100 loops, best of 3: 6.57 ms per loop
With n = 7
%timeit evaluate(array1, array2)
10000 loops, best of 3: 28.3 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 27.5 µs per loop
Random tests
Just to see how branch prediction played a role. From what I'm seeing, it didn't really change much. Vectorization still usually wins.
Setup code.
from random import random
def factory(i):
    if random() < 0.5:
        return lambda num: str(num) == str(i)
    return lambda num: num == i
When n = 10000
%timeit evaluate(array1, array2)
10 loops, best of 3: 25.7 ms per loop
%timeit vec_evaluate(array1, array2)
100 loops, best of 3: 4.67 ms per loop
When n = 7
%timeit evaluate(array1, array2)
10000 loops, best of 3: 23.1 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 23.1 µs per loop
Using python lists instead of numpy arrays
I ran this test to see what happened when I chose not to use the "optimized" numpy arrays, and I received some very surprising results.
The setup code is almost the same, except I'm choosing not to use numpy arrays. I'm also doing this test for only the "random" case.
def factory(i):
    if random() < 0.5:
        return lambda num: str(num) == str(i)
    return lambda num: num == i
array1 = list()
for i in range(10000): array1.append(factory(i))
array2 = range(10000)
And the "unvectorized" version:
%timeit evaluate(array1, array2)
100 loops, best of 3: 4.93 ms per loop
This is actually pretty surprising, because it is almost the same timing I was getting in my random test case with the vectorized evaluate.
%timeit vec_evaluate(array1, array2)
10 loops, best of 3: 19.8 ms per loop
Likewise, if you change these into numpy arrays before using vec_evaluate, you get the same 4.5 ms benchmark.
I am new to numba's jit. For a personal project, I need to speed up functions that are similar to what will be shown below, though different for the purpose of writing standalone examples.
import numpy as np
from numba import jit, autojit, double, float64, float32, int32, void
def f(n):
    k=0.
    for i in range(n):
        for j in range(n):
            k+= i+j
def f_with_return(n):
    k=0.
    for i in range(n):
        for j in range(n):
            k+= i+j
    return k
def f_with_arange(n):
    k=0.
    for i in np.arange(n):
        for j in np.arange(n):
            k+= i+j
def f_with_arange_and_return(n):
    k=0.
    for i in np.arange(n):
        for j in np.arange(n):
            k+= i+j
    return k
#jit decorators
jit_f = jit(void(int32))(f)
jit_f_with_return = jit(int32(int32))(f_with_return)
jit_f_with_arange = jit(void(double))(f_with_arange)
jit_f_with_arange_and_return = jit(double(double))(f_with_arange_and_return)
And the benchmarks:
%timeit f(1000)
%timeit jit_f(1000)
10 loops, best of 3: 73.9 ms per loop / 1000000 loops, best of 3: 212 ns per loop
%timeit f_with_return(1000)
%timeit jit_f_with_return(1000)
10 loops, best of 3: 74.9 ms per loop / 1000000 loops, best of 3: 220 ns per loop
I don't understand these two:
%timeit f_with_arange(1000.0)
%timeit jit_f_with_arange(1000.0)
10 loops, best of 3: 175 ms per loop / 1 loops, best of 3: 167 ms per loop
%timeit f_with_arange_and_return(1000.0)
%timeit jit_f_with_arange_and_return(1000.0)
10 loops, best of 3: 174 ms per loop / 1 loops, best of 3: 172 ms per loop
I think I'm not giving the jit function the correct types for the output and input? Just because the for loop is now running over a numpy.arange, and not a simple range anymore, I cannot get jit to make it faster. What is the issue here?
Simply, numba doesn't know how to convert np.arange into a low level native loop, so it defaults to the object layer which is much slower and usually the same speed as pure python.
A nice trick is to pass the nopython=True keyword argument to jit to see if it can compile everything without resorting to the object mode:
import numpy as np
import numba as nb
def f_with_return(n):
    k=0.
    for i in range(n):
        for j in range(n):
            k+= i+j
    return k
jit_f_with_return = nb.jit()(f_with_return)
jit_f_with_return_nopython = nb.jit(nopython=True)(f_with_return)
%timeit f_with_return(1000)
%timeit jit_f_with_return(1000)
%timeit jit_f_with_return_nopython(1000)
The last two are the same speed on my machine and much faster than the un-jitted code. The two examples that you had questions about will raise an error with nopython=True since it can't compile np.arange at this point.
See the following for more details:
http://numba.pydata.org/numba-doc/0.17.0/user/troubleshoot.html#the-compiled-code-is-too-slow
and for a list of supported numpy features with indications of what is and is not supported in nopython mode:
http://numba.pydata.org/numba-doc/0.17.0/reference/numpysupported.html
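One way to apply this to the functions from the question is to keep the loops on a plain range and convert a float argument to an integer once, outside the compiled function. The sketch below is only an illustration under that assumption: pair_sum, pair_sum_from_float and compile_nopython are made-up names, and the try/except fallback to plain Python is there only so the snippet still runs where Numba is not installed.

```python
try:
    from numba import jit
    compile_nopython = jit(nopython=True)  # fail loudly instead of using object mode
except ImportError:
    # Assumption for this sketch: without Numba, run the plain Python function.
    def compile_nopython(func):
        return func

@compile_nopython
def pair_sum(n):
    k = 0.0
    for i in range(n):      # range over an integer compiles to a native loop
        for j in range(n):
            k += i + j
    return k

def pair_sum_from_float(n):
    # Callers holding a float (e.g. 1000.0) convert once, before the compiled loop.
    return pair_sum(int(n))

print(pair_sum_from_float(1000.0))
```

The result can be checked against the closed form n*n*(n-1), which is a handy sanity test when comparing the jitted and un-jitted versions.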
I have a problem where I have to do the following calculation.
I wanted to avoid the loop version, so I vectorized it.
Why is the loop version actually faster than the vectorized version?
Does anybody have an explanation for this?
thx
import numpy as np
from numpy.core.umath_tests import inner1d
num_vertices = 40000
num_pca_dims = 1000
num_vert_coords = 3
a = np.arange(num_vert_coords * num_vertices * num_pca_dims).reshape((num_pca_dims, num_vertices*num_vert_coords)).T
#n-by-3
norms = np.arange(num_vertices * num_vert_coords).reshape(num_vertices,-1)
#Loop version
def slowversion(a,norms):
    res_list = []
    for c_idx in range(a.shape[1]):
        curr_col = a[:,c_idx].reshape(-1,3)
        res = inner1d(curr_col, norms)
        res_list.append(res)
    res_list_conc = np.column_stack(res_list)
    return res_list_conc
#Fast version
def fastversion(a,norms):
    a_3 = a.reshape(num_vertices, 3, num_pca_dims)
    fast_res = np.sum(a_3 * norms[:,:,None], axis=1)
    return fast_res
res_list_conc = slowversion(a,norms)
fast_res = fastversion(a,norms)
assert np.all(res_list_conc == fast_res)
Your "slow code" is likely doing better because inner1d is a single optimized C function that can* make use of your BLAS implementation. Let's look at comparable timings for this operation:
np.allclose(inner1d(a[:,0].reshape(-1,3), norms),
np.sum(a[:,0].reshape(-1,3)*norms,axis=1))
True
%timeit inner1d(a[:,0].reshape(-1,3), norms)
10000 loops, best of 3: 200 µs per loop
%timeit np.sum(a[:,0].reshape(-1,3)*norms,axis=1)
1000 loops, best of 3: 625 µs per loop
%timeit np.einsum('ij,ij->i',a[:,0].reshape(-1,3), norms)
1000 loops, best of 3: 325 µs per loop
Using inner1d is quite a bit faster than the pure numpy operations. Note that einsum is almost twice as fast as the pure numpy expression, and for good reason. As your loop is not that large and most of the FLOPS are in the inner computations, the savings from the faster inner operation outweigh the cost of looping.
%timeit slowversion(a,norms)
1 loops, best of 3: 991 ms per loop
%timeit fastversion(a,norms)
1 loops, best of 3: 1.28 s per loop
#Thanks to DSM for writing this out
%timeit np.einsum('ijk,ij->ik',a.reshape(num_vertices, num_vert_coords, num_pca_dims), norms)
1 loops, best of 3: 488 ms per loop
Putting this back together we can see the overall advantage of the "slow version" wins out; however, using an einsum implementation, which is fairly optimized for this sort of thing, gives us a further speed increase.
*I don't see it right off in the code, but it is clearly threaded.
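To check that the three formulations really compute the same thing, here is a small self-contained sketch on toy sizes. It is a sketch under one assumption: inner1d lives in an internal NumPy test module and may not be importable, so the column loop uses np.einsum('ij,ij->i', ...) in its place (the timings above were taken with inner1d itself). The function names column_loop, broadcast_sum and single_einsum are made up for this example.

```python
import numpy as np

num_vertices, num_vert_coords, num_pca_dims = 5, 3, 4
a = np.arange(num_vert_coords * num_vertices * num_pca_dims).reshape(
    (num_pca_dims, num_vertices * num_vert_coords)).T
norms = np.arange(num_vertices * num_vert_coords).reshape(num_vertices, -1)

def column_loop(a, norms):
    # one row-wise inner product per PCA column, stacked side by side
    cols = [np.einsum('ij,ij->i', a[:, c].reshape(-1, norms.shape[1]), norms)
            for c in range(a.shape[1])]
    return np.column_stack(cols)

def broadcast_sum(a, norms):
    # builds the full (vertices, coords, pca_dims) product before summing
    a_3 = a.reshape(norms.shape[0], norms.shape[1], -1)
    return np.sum(a_3 * norms[:, :, None], axis=1)

def single_einsum(a, norms):
    # contracts the coordinate axis directly, without the large temporary
    a_3 = a.reshape(norms.shape[0], norms.shape[1], -1)
    return np.einsum('ijk,ij->ik', a_3, norms)

res_loop = column_loop(a, norms)
res_sum = broadcast_sum(a, norms)
res_einsum = single_einsum(a, norms)
print(res_loop.shape, np.array_equal(res_loop, res_sum), np.array_equal(res_sum, res_einsum))
```

The broadcast version materializes a temporary as large as a itself before reducing it, which is one plausible reason it loses to both the loop and the single einsum call.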
TLDR: in cython, why (or when?) is iterating over a numpy array faster than iterating over a python list?
Generally:
I've used Cython before and was able to get tremendous speed-ups over naive Python implementations.
However, figuring out what exactly needs to be done seems non-trivial.
Consider the following 3 implementations of a sum() function.
They reside in a cython file called 'cy' (obviously, there's np.sum(), but that's beside the point).
Naive python:
def sum_naive(A):
    s = 0
    for a in A:
        s += a
    return s
Cython with a function that expects a python list:
def sum_list(A):
    cdef unsigned long s = 0
    for a in A:
        s += a
    return s
Cython with a function that expects a numpy array.
def sum_np(np.ndarray[np.int64_t, ndim=1] A):
    cdef unsigned long s = 0
    for a in A:
        s += a
    return s
I would expect that in terms of running time, sum_np < sum_list < sum_naive, however, the following script demonstrates to the contrary (for completeness, I added np.sum() )
N = 1000000
v_np = np.array(range(N))
v_list = range(N)
%timeit cy.sum_naive(v_list)
%timeit cy.sum_naive(v_np)
%timeit cy.sum_list(v_list)
%timeit cy.sum_np(v_np)
%timeit v_np.sum()
with results:
In [18]: %timeit cyMatching.sum_naive(v_list)
100 loops, best of 3: 18.7 ms per loop
In [19]: %timeit cyMatching.sum_naive(v_np)
1 loops, best of 3: 389 ms per loop
In [20]: %timeit cyMatching.sum_list(v_list)
10 loops, best of 3: 82.9 ms per loop
In [21]: %timeit cyMatching.sum_np(v_np)
1 loops, best of 3: 1.14 s per loop
In [22]: %timeit v_np.sum()
1000 loops, best of 3: 659 us per loop
What's going on?
Why is cython+numpy slow?
P.S.
I do use
#cython: boundscheck=False
#cython: wraparound=False
There is a better way to implement this in cython that at least on my machine beats np.sum because it avoids type checking and other things that numpy normally has to do when dealing with an arbitrary array:
# cython: wraparound=False
# cython: boundscheck=False
cimport numpy as np
def sum_np(np.ndarray[np.int64_t, ndim=1] A):
    cdef unsigned long s = 0
    for a in A:
        s += a
    return s
def sum_np2(np.int64_t[::1] A):
    cdef:
        unsigned long s = 0
        size_t k
    for k in range(A.shape[0]):
        s += A[k]
    return s
And then the timings:
N = 1000000
v_np = np.array(range(N))
v_list = range(N)
%timeit sum(v_list)
%timeit sum_naive(v_list)
%timeit np.sum(v_np)
%timeit sum_np(v_np)
%timeit sum_np2(v_np)
10 loops, best of 3: 19.5 ms per loop
10 loops, best of 3: 64.9 ms per loop
1000 loops, best of 3: 1.62 ms per loop
1 loops, best of 3: 1.7 s per loop
1000 loops, best of 3: 1.42 ms per loop
You don't want to iterate over the numpy array in the Python style, but rather access elements by indexing, since that can be translated into pure C rather than relying on the Python API.
a is untyped and thus there will be lots of conversions from Python to C types and back. These can be slow.
JoshAdel correctly pointed out that instead of iterating over the array itself, you should iterate over a range and index into it. Cython will convert the indexing to C, which is fast.
Using cython -a myfile.pyx will highlight these sorts of things for you; you want all of your loop logic to be white for maximum speed.
PS: Note that np.ndarray[np.int64_t, ndim=1] is outdated and has been deprecated in favour of the faster and more general long[:].
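The conversion cost of the untyped loop variable can also be seen from plain Python, without Cython. The sketch below only shows what kind of object each iteration produces; the variable names are made up for this example. Iterating over a NumPy array hands out a freshly created NumPy scalar per element, whereas a list simply hands out the Python ints it already holds.

```python
import numpy as np

v_np = np.arange(5)
v_list = v_np.tolist()  # one bulk conversion to plain Python ints

first_np = next(iter(v_np))
first_list = next(iter(v_list))
print(type(first_np), type(first_list))

def sum_iter(seq):
    # Python-style iteration: one object per element goes through the Python API
    s = 0
    for x in seq:
        s += x
    return s

total_np = int(sum_iter(v_np))
total_list = sum_iter(v_list)
print(total_np, total_list, int(v_np.sum()))
```

This is consistent with sum_naive being slower on v_np than on v_list in the timings above, and it is the same per-element object traffic that the indexed, fully typed loop avoids.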