Why is vectorized version slower?

Why is vectorized version slower? - python

I have a problem where I have to do the following calculation.
I wanted to avoid the loop version, so I vectorized it.
Why is the loop version actually fast than the vectorized version?
Does anybody have an explanation for this.
thx
import numpy as np
from numpy.core.umath_tests import inner1d
num_vertices = 40000
num_pca_dims = 1000
num_vert_coords = 3
a = np.arange(num_vert_coords * num_vertices * num_pca_dims).reshape((num_pca_dims, num_vertices*num_vert_coords)).T
#n-by-3
norms = np.arange(num_vertices * num_vert_coords).reshape(num_vertices,-1)
#Loop version
def slowversion(a,norms):
res_list = []
for c_idx in range(a.shape[1]):
curr_col = a[:,c_idx].reshape(-1,3)
res = inner1d(curr_col, norms)
res_list.append(res)
res_list_conc = np.column_stack(res_list)
return res_list_conc
#Fast version
def fastversion(a,norms):
a_3 = a.reshape(num_vertices, 3, num_pca_dims)
fast_res = np.sum(a_3 * norms[:,:,None], axis=1)
return fast_res
res_list_conc = slowversion(a,norms)
fast_res = fastversion(a,norms)
assert np.all(res_list_conc == fast_res)

Your "slow code" is likely doing better because inner1d is a single optimized C++ function that can* make use of your BLAS implementation. Lets look at comparable timings for this operation:
np.allclose(inner1d(a[:,0].reshape(-1,3), norms),
np.sum(a[:,0].reshape(-1,3)*norms,axis=1))
True
%timeit inner1d(a[:,0].reshape(-1,3), norms)
10000 loops, best of 3: 200 µs per loop
%timeit np.sum(a[:,0].reshape(-1,3)*norms,axis=1)
1000 loops, best of 3: 625 µs per loop
%timeit np.einsum('ij,ij->i',a[:,0].reshape(-1,3), norms)
1000 loops, best of 3: 325 µs per loop
Using inner is quite a bit faster then the pure numpy operations. Note that einsum is almost twice as fast as pure numpy expressions and for good reason. As your loop is not that large and most of the FLOPS are in the inner computations the saving for the inner operation outweigh the cost of looping.
%timeit slowversion(a,norms)
1 loops, best of 3: 991 ms per loop
%timeit fastversion(a,norms)
1 loops, best of 3: 1.28 s per loop
#Thanks to DSM for writing this out
%timeit np.einsum('ijk,ij->ik',a.reshape(num_vertices, num_vert_coords, num_pca_dims), norms)
1 loops, best of 3: 488 ms per loop
Putting this back together we can see the overall advantage of the "slow version" wins out; however, using an einsum implementation, which is fairly optimized for this sort of thing, gives us a further speed increase.
*I don't see it right off in the code, but it is clearly threaded.

Related

Code optimization python

I wrote the below function to estimate the orientation from a 3 axes accelerometer signal (X,Y,Z)
X.shape
Out[4]: (180000L,)
Y.shape
Out[4]: (180000L,)
Z.shape
Out[4]: (180000L,)
def estimate_orientation(self,X,Y,Z):
sigIn=np.array([X,Y,Z]).T
N=len(sigIn)
sigOut=np.empty(shape=(N,3))
sigOut[sigOut==0]=None
i=0
while i<N:
sigOut[i,:] = np.arccos(sigIn[i,:]/np.linalg.norm(sigIn[i,:]))*180/math.pi
i=i+1
return sigOut
Executing this function with a signal of 180000 samples takes quite a while (~2.2 seconds)... I know that it is not written in a "pythonic way"... Could you help me to optimize the execution time?
Thanks!

Starting approach
One approach following an usage of broadcasting, would be like so -
np.arccos(sigIn/np.linalg.norm(sigIn,axis=1,keepdims=1))*180/np.pi
Further optimization - I
We could use np.einsum to replace np.linalg.norm part. Thus :
np.linalg.norm(sigIn,axis=1,keepdims=1)
could be replaced by :
np.sqrt(np.einsum('ij,ij->i',sigIn,sigIn))[:,None]
Further optimization - II
Further boost could be brought in with numexpr module, which works really well with huge arrays and with operations involving trigonometrical functions. In our case that would be arcccos. So, we will use the einsum part as used in the previous optimization section and then use arccos from numexpr on it.
Thus, the implementation would look something like this -
import numexpr as ne
pi_val = np.pi
s = np.sqrt(np.einsum('ij,ij->i',signIn,signIn))[:,None]
out = ne.evaluate('arccos(signIn/s)*180/pi_val')
Runtime test
Approaches -
def original_app(sigIn):
N=len(sigIn)
sigOut=np.empty(shape=(N,3))
sigOut[sigOut==0]=None
i=0
while i<N:
sigOut[i,:] = np.arccos(sigIn[i,:]/np.linalg.norm(sigIn[i,:]))*180/math.pi
i=i+1
return sigOut
def broadcasting_app(signIn):
s = np.linalg.norm(signIn,axis=1,keepdims=1)
return np.arccos(signIn/s)*180/np.pi
def einsum_app(signIn):
s = np.sqrt(np.einsum('ij,ij->i',signIn,signIn))[:,None]
return np.arccos(signIn/s)*180/np.pi
def numexpr_app(signIn):
pi_val = np.pi
s = np.sqrt(np.einsum('ij,ij->i',signIn,signIn))[:,None]
return ne.evaluate('arccos(signIn/s)*180/pi_val')
Timings -
In [115]: a = np.random.rand(180000,3)
In [116]: %timeit original_app(a)
...: %timeit broadcasting_app(a)
...: %timeit einsum_app(a)
...: %timeit numexpr_app(a)
...:
1 loops, best of 3: 1.38 s per loop
100 loops, best of 3: 15.4 ms per loop
100 loops, best of 3: 13.3 ms per loop
100 loops, best of 3: 4.85 ms per loop
In [117]: 1380/4.85 # Speedup number
Out[117]: 284.5360824742268
280x speedup there!

numba guvectorize target='parallel' slower than target='cpu'

I've been attempting to optimize a piece of python code that involves large multi-dimensional array calculations. I am getting counterintuitive results with numba. I am running on an MBP, mid 2015, 2.5 GHz i7 quadcore, OS 10.10.5, python 2.7.11. Consider the following:
import numpy as np
from numba import jit, vectorize, guvectorize
import numexpr as ne
import timeit
def add_two_2ds_naive(A,B,res):
for i in range(A.shape[0]):
for j in range(B.shape[1]):
res[i,j] = A[i,j]+B[i,j]
#jit
def add_two_2ds_jit(A,B,res):
for i in range(A.shape[0]):
for j in range(B.shape[1]):
res[i,j] = A[i,j]+B[i,j]
#guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
'(n,m),(n,m)->(n,m)',target='cpu')
def add_two_2ds_cpu(A,B,res):
for i in range(A.shape[0]):
for j in range(B.shape[1]):
res[i,j] = A[i,j]+B[i,j]
#guvectorize(['(float64[:,:],float64[:,:],float64[:,:])'],
'(n,m),(n,m)->(n,m)',target='parallel')
def add_two_2ds_parallel(A,B,res):
for i in range(A.shape[0]):
for j in range(B.shape[1]):
res[i,j] = A[i,j]+B[i,j]
def add_two_2ds_numexpr(A,B,res):
res = ne.evaluate('A+B')
if __name__=="__main__":
np.random.seed(69)
A = np.random.rand(10000,100)
B = np.random.rand(10000,100)
res = np.zeros((10000,100))
I can now run timeit on the various functions:
%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.16 ms per loop
%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.19 ms per loop
%timeit add_two_2ds_parallel(A,B,res)
100 loops, best of 3: 6.9 ms per loop
%timeit add_two_2ds_numexpr(A,B,res)
1000 loops, best of 3: 1.62 ms per loop
It seems that 'parallel' is not taking even using the majority of a single core, as it's usage in top shows that python is hitting ~40% cpu for 'parallel', ~100% for 'cpu', and numexpr hits ~300%.

There are two issues with your #guvectorize implementations. The first is that you are are doing all the looping inside your #guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. Both #vectorize and #guvectorize parallelize on the broadcast dimensions in a ufunc/gufunc. Since the signature of your gufunc is 2D, and your inputs are 2D, there is only a single call to the inner function, which explains the only 100% CPU usage you saw.
The best way to write the function you have above is to use a regular ufunc:
#vectorize('(float64, float64)', target='parallel')
def add_ufunc(a, b):
return a + b
Then on my system, I see these speeds:
%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.87 ms per loop
%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.81 ms per loop
%timeit add_two_2ds_parallel(A,B,res)
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 2.43 ms per loop
%timeit add_two_2ds_numexpr(A,B,res)
100 loops, best of 3: 2.79 ms per loop
%timeit add_ufunc(A, B, res)
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 2.03 ms per loop
(This is a very similar OS X system to yours, but with OS X 10.11.)
Although Numba's parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it doesn't beat the simple single-threaded CPU case. I suspect that the bottleneck is due to memory (or cache) bandwidth, but I haven't done the measurements to check that.
Generally speaking, you will see much more benefit from the parallel ufunc target if you are doing more math operations per memory element (like, say, a cosine).

Wrapping np.arrays pow method

I was just revisiting some of my code to improve the performance and stumpled over something strange:
a = np.linspace(10,1000,1000000).reshape(1000,1000)
%timeit np.square(a)
100 loops, best of 3: 8.07 ms per loop
%timeit a*a
100 loops, best of 3: 8.18 ms per loop
%timeit a**2
100 loops, best of 3: 8.32 ms per loop
Ok it seems to have some overhead when using the power-operator (**) but otherwise they seem identical (I guess NumPy is doing that) but then it got strange:
In [46]: %timeit np.power(a, 2)
10 loops, best of 3: 121 ms per loop
So there is no problem but it seems a bit inconsistent to have a fallback for the magic pow but not for the UFUNC. But then I got interested since I am using third powers a lot:
%timeit a*a*a
100 loops, best of 3: 18.1 ms per loop
%timeit a**3
10 loops, best of 3: 121 ms per loop
%timeit np.power(a, 3)
10 loops, best of 3: 121 ms per loop
There seems to be no "shortcut" in the third power and UFUNC and 'magic-pow' work the same (at least in regard to performance).
But that's not that good since I want a consistent method of using powers in my code and I'm not quite sure how to wrap the __pow__ of numpy.
So to get to the point, my question is :
Is there a way I can wrap the numpys __pow__ method? Because I want a consistent way of writing powers in my script not writing a**2 and at another place power(a, 3). Simply writing a**3, and redirecting this to my power function, would be preferred (but for that I would need to somehow wrap the ndarrays __pow__ or?).
Currently I am using a shortcut but that's not that beautiful (I even have to declare the exponent==2 case since np.power performs not optimal there):
def power(array, exponent):
if exponent == 2: #catch this, or it calls the slow np.power(array, exponent)
return np.square(array)
if exponent == 3:
return array * array * array
#As soon as np.cbrt is avaiable catch the exponent 4/3 here too
return np.power(array, exponent)
%timeit power(a, 3)
100 loops, best of 3: 17.8 ms per loop
%timeit a**3
10 loops, best of 3: 121 ms per loop
I am using NumPy v1.9.3 and I do not want to subclass np.ndarray just for wrapping the __pow__ method. :-)
EDIT: I rewrote the part where I get to my question. To clarify it: I am not asking about why NumPy does it the way it does - that is just to explain why I ask the question.

This is a good catch. I too wonder why is that behavior. But to be short and concise answering the question, I would just do:
def mypower(array, exponent):
return reduce(lambda x,y: x*y, [array for _ in range(exponent)])
%timeit mypower(a,2)
100 loops, best of 3: 3.68 ms per loop
%timeit mypower(a,3)
100 loops, best of 3: 8.09 ms per loop
%timeit mypower(a,4)
100 loops, best of 3: 12.6 ms per loop
Obsviouly the overhead increases with the exponent but for low ones is better than 10x the time.
Note that this is different from the original numpy implementation which is not specific for a numeric exponent and supports an array of exponents as the second argument (check it out here).
Overloading the operator
The way to do what you want is to subclass ndarray and use views. See the following example:
import numexpr
import numpy as np

class MyArray(np.ndarray):
def __pow__(self, other):
return reduce(lambda x,y: x*y, [self for _ in range(other)])

class NumExprArray(np.ndarray):
def __pow__(self, other):
return numexpr.evaluate("self**%f" % other)
#This implies extra overhead, is as much as 4x slower:
#return numexpr.evaluate("self**other")
a = np.linspace(10,1000,1000000).reshape(1000,1000).view(MyArray)
na = np.linspace(10,1000,1000000).reshape(1000,1000).view(NumExprArray)

%timeit a**2
1000 loops, best of 3: 1.2 ms per loop
%timeit na**2
1000 loops, best of 3: 1.14 ms per loop
%timeit a**3
100 loops, best of 3: 4.69 ms per loop
%timeit na**3
100 loops, best of 3: 2.36 ms per loop
%timeit a**4
100 loops, best of 3: 6.59 ms per loop
%timeit na**4
100 loops, best of 3: 2.4 ms per loop
For more information on this method please follow this link. Another way would be to use a custom infix operator but for readability purposes is not so good. As one can see, numexpr should be the way to go.

If I read the source correctly, when numpy performs power, it checks whether the numerical value of the exponent is one of the special cases (-0.5, 0, 0.5, 1, and 2). If so, the operation is done using special routines. All other numerical values of the exponent are considered "general", and will be fed into the generic power function, which may be slow (especially if the exponent is promoted to floating-point type, but I'm not sure if this is the case with a ** 3).

Python: Vectorizing evaluations of arrays of lambda functions

How would you vectorize the evaluation of arrays of lambda functions?
Here's an example to understand what I'm talking about. (And even though I'm using numpy arrays, I'm not limiting myself to only using numpy.)
Let's say I have the following numpy arrays.
array1 = np.array(["hello", 9])
array2 = np.array([lambda s: s == "hello", lambda num: num < 10])
(You could store these kinds of objects in numpy without throwing an error, believe it or not.) What I want is something akin to the following.
array2 * array1
# Return np.array([True, True]). PS: An explanation of how to `AND` all of
# booleans together quickly would be nice too.
Of course, this seems impractical for arrays of size 2, but for arrays of arbitrary sizes, I'll assume this would yield a performance boost because of all of the low level optimizations.
So, anyone know how to write this weird kind of python code?

The simple answer, of course, is that you can't easily do this with numpy (or with standard Python, for that matter). Numpy doesn't actually vectorize most operations itself, to my knowledge: it uses libraries like BLAS/ATLAS/etc that do for certain situations. Even if it did, it would do so in C for specific situations: it certainly can't vectorize Python function execution.
If you want to involve multiprocessing in this, it is possible, but it depends on your situation. Are your individual function applications time-consuming, making them feasible to send out one-by-one, or do you need a very large number of fast function executions, in which case you'd probably want to send batches of them to each process?
In general, because of what could be argued as poor fundamental design (eg, the Global Interpreter Lock), it's very difficult with standard Python to have lightweight parallelization as you're hoping for here. There are significantly heavier methods, like the multiprocessing module or Ipython.parallel, but these require some work to use.

Alright guys, I have an answer: numpy's vectorize.
Please read the edited section though. You'll discover that python actually optimizes code for you, which actually defeats the purpose of using numpy arrays in this case. (But using numpy arrays does not decrease the performance.)
The last test really shows is that python lists are as efficient as they could be, and so this vectorization procedure is unnecessary. This is why I didn't mark this question as the "best answer".
Setup code:
def factory(i): return lambda num: num==i
array1 = list()
for i in range(10000): array1.append(factory(i))
array1 = np.array(array1)
array2 = np.array(xrange(10000))
The "unvectorized" version:
def evaluate(array1, array2):
return [func(val) for func, val in zip(array1, array2)]
%timeit evaluate(array1, array2)
# 100 loops, best of 3: 10 ms per loop
The vectorized version
def evaluate2(func, b): return func(b)
vec_evaluate = np.vectorize(evaluate2)
vec_evaluate(array1, array2)
# 100 loops, best of 3: 2.65 ms per loop
EDIT
Okay, I just wanted to paste more benchmarks that I received using the above tests, except with different test cases.
I made a third edit, showing what happens if you simply use python lists. The long story short, you actually won't regret much. This test case is on the very bottom.
Test cases only involving integers
In summary, if n is small, then the unvectorized version is better. Otherwise, vectorized is the way to go.
With n = 30
%timeit evaluate(array1, array2)
# 10000 loops, best of 3: 35.7 µs per loop
%timeit vec_evaluate(array1, array2)
# 10000 loops, best of 3: 27.6 µs per loop
With n = 7
%timeit evaluate(array1, array2)
100000 loops, best of 3: 9.93 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 21.6 µs per loop
Test cases involving strings
Vectorization wins.
Setup code:
def factory(i): return lambda num: str(num)==str(i)
array1 = list()
for i in range(7):
array1.append(factory(i))
array1 = np.array(array1)
array2 = np.array(xrange(7))
With n = 10000
%timeit evaluate(array1, array2)
10 loops, best of 3: 36.7 ms per loop
%timeit vec_evaluate(array1, array2)
100 loops, best of 3: 6.57 ms per loop
With n = 7
%timeit evaluate(array1, array2)
10000 loops, best of 3: 28.3 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 27.5 µs per loop
Random tests
Just to see how branch prediction played a role. From what I'm seeing, it didn't really change much. Vectorization still usually wins.
Setup code.
def factory(i):
if random() < 0.5:
return lambda num: str(num) == str(i)
return lambda num: num == i
When n = 10000
%timeit evaluate(array1, array2)
10 loops, best of 3: 25.7 ms per loop
%timeit vec_evaluate(array1, array2)
100 loops, best of 3: 4.67 ms per loop
When n = 7
%timeit evaluate(array1, array2)
10000 loops, best of 3: 23.1 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 23.1 µs per loop
Using python lists instead of numpy arrays
I ran this test to see what happened when I chose not to use the "optimized" numpy arrays, and I received some very surprising results.
The setup code is almost the same, except I'm choosing not to use numpy arrays. I'm also doing this test for only the "random" case.
def factory(i):
if random() < 0.5:
return lambda num: str(num) == str(i)
return lambda num: num == i
array1 = list()
for i in range(10000): array1.append(factory(i))
array2 = range(10000)
And the "unvectorized" version:
%timeit evaluate(array1, array2)
100 loops, best of 3: 4.93 ms per loop
You could see this is actually pretty surprising, because this is almost the same benchmark I was receiving with my random test case involving the vectorized evaluate.
%timeit vec_evaluate(array1, array2)
10 loops, best of 3: 19.8 ms per loop
Likewise, if you change these into numpy arrays before using vec_evaluate, you get the same 4.5 ms benchmark.

How are are NumPy's in-place operators implemented to explain the significant performance gain

I know that in Python, the in-place operators use the __iadd__ method for in-place operators. For immutable types, the __iadd__ is a workaround using the __add__, e.g., like tmp = a + b; a = tmp, but mutable types (like lists) are modified in-place, which causes a slight speed boost.
However, if I have a NumPy array where I modify its contained immutable types, e.g., integers or floats, there is also an even more significant speed boost. How does this work? I did some example benchmarks below:
import numpy as np
def inplace(a, b):
a += b
return a
def assignment(a, b):
a = a + b
return a
int1 = 1
int2 = 1
list1 = [1]
list2 = [1]
npary1 = np.ones((1000,1000))
npary2 = np.ones((1000,1000))
print('Python integers')
%timeit inplace(int1, 1)
%timeit assignment(int2, 1)
print('\nPython lists')
%timeit inplace(list1, [1])
%timeit assignment(list2, [1])
print('\nNumPy Arrays')
%timeit inplace(npary1, 1)
%timeit assignment(npary2, 1)
What I would expect is a similar difference as for the Python integers when I used the in-place operators on NumPy arrays, however the results are completely different:
Python integers
1000000 loops, best of 3: 265 ns per loop
1000000 loops, best of 3: 249 ns per loop
Python lists
1000000 loops, best of 3: 449 ns per loop
1000000 loops, best of 3: 638 ns per loop
NumPy Arrays
100 loops, best of 3: 3.76 ms per loop
100 loops, best of 3: 6.6 ms per loop

Each call to assignment(npary2, 1) requires creating a new one million element array. Consider how much time it takes just to allocate a (1000, 1000)-shaped array of ones:
In [21]: %timeit np.ones((1000, 1000))
100 loops, best of 3: 3.84 ms per loop
This allocation of a new temporary array requires on my machine about 3.84 ms, and is on the right order of magnitude to explain the entire difference between inplace(npary1, 1) and assignment(nparay2, 1):
In [12]: %timeit inplace(npary1, 1)
1000 loops, best of 3: 1.8 ms per loop
In [13]: %timeit assignment(npary2, 1)
100 loops, best of 3: 4.04 ms per loop
So, given that allocation is a relatively slow process, it makes sense that in-place addition is significantly faster than assignment to a new array.
NumPy operations on NumPy arrays may be fast, but creation of NumPy arrays is relatively slow. Consider, for example, how much more time it takes to create a NumPy array than a Python list:
In [14]: %timeit list()
10000000 loops, best of 3: 106 ns per loop
In [15]: %timeit np.array([])
1000000 loops, best of 3: 563 ns per loop
This is one reason why it is generally better to use one large NumPy array (allocated once) rather than thousands of small NumPy arrays.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why is vectorized version slower? - python

Related

Code optimization python

numba guvectorize target='parallel' slower than target='cpu'

Wrapping np.arrays pow method

Python: Vectorizing evaluations of arrays of lambda functions

How are are NumPy's in-place operators implemented to explain the significant performance gain

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why is vectorized version slower? - python

Related

Code optimization python

numba guvectorize target='parallel' slower than target='cpu'

Wrapping np.arrays __pow__ method

Python: Vectorizing evaluations of arrays of lambda functions

How are are NumPy's in-place operators implemented to explain the significant performance gain

Categories

Resources

Wrapping np.arrays pow method