I would like to split an n-d numpy array along an internal axis.
I have an array of shape (6,150,29,29,29,1).
I would like a list of arrays: 150 arrays of shape (6,29,29,29,1).
I have used list(a), but that gives me a list over axis 0.
arr.transpose(1,0,2,3,4,5) or np.swapaxes(arr,0,1) put the 150 dimension first. Then you can use list.
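For example, a minimal sketch with the shapes from the question (the array contents here are placeholders):

import numpy as np

arr = np.zeros((6, 150, 29, 29, 29, 1))
chunks = list(arr.transpose(1, 0, 2, 3, 4, 5))  # or: list(np.swapaxes(arr, 0, 1))
print(len(chunks), chunks[0].shape)  # 150 (6, 29, 29, 29, 1)

Each list element is a view into the original array, so no data is copied.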
Or you could use a list comprehension
[a[:,i] for i in range(150)]
The transpose approach is somewhat faster:
In [28]: timeit list(arr.transpose(1,0,2,3,4,5))
47.7 µs ± 47.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [29]: timeit [arr[:,i] for i in range(150)]
88.7 µs ± 22.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [32]: timeit list(np.swapaxes(arr,0,1))
49.2 µs ± 51.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I have these two pieces of code that do the same thing but for different data structures:
res = np.array([np.array([2.0, 4.0, 6.0]), np.array([8.0, 10.0, 12.0])], dtype=int)
%timeit np.sum(res, axis=1)
4.08 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
list_obj_array = np.ndarray((2,), dtype=object)
list_obj_array[0] = [2.0, 4.0, 6.0]
list_obj_array[1] = [8.0, 10.0, 12.0]
v_func = np.vectorize(np.sum, otypes=[int])
%timeit v_func(list_obj_array)
20.6 µs ± 486 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The second one is 5 times slower. Is there a better way to optimize this?
@nb.jit()
def nb_np_sum(arry_list):
    return [np.sum(row) for row in arry_list]
%timeit nb_np_sum(list_obj_array)
30.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
@nb.jit()
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
%timeit nb_sum(list_obj_array)
13.6 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Best so far (thanks @hpaulj)
%timeit [sum(l) for l in list_obj_array]
850 ns ± 115 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
@nb.njit()
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'sum': cannot determine Numba type of <class 'builtin_function_or_method'>
File "<ipython-input-54-3bb48c5273bb>", line 3:
def nb_sum(arry_list):
return [sum(row) for row in arry_list]
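In nopython mode Numba can't type the builtin sum applied to the elements of an object array. A sketch of a version Numba can compile, assuming the rows are first packed into a regular 2-D float array (the names here are my own, not from the question):

import numba as nb
import numpy as np

@nb.njit
def nb_row_sums(arr2d):
    # explicit loops over a typed 2-D array compile cleanly in nopython mode
    out = np.empty(arr2d.shape[0])
    for i in range(arr2d.shape[0]):
        s = 0.0
        for j in range(arr2d.shape[1]):
            s += arr2d[i, j]
        out[i] = s
    return out

rows = np.array([[2.0, 4.0, 6.0], [8.0, 10.0, 12.0]])
print(nb_row_sums(rows))  # [12. 30.]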
For a longer array:
list_obj_array = np.ndarray((n,), dtype=object)
for i in range(n):
    list_obj_array[i] = list(range(7))
the vectorized version comes closer to the best option (the list comprehension):
%timeit [sum(l) for l in list_obj_array]
23.4 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit v_func(list_obj_array)
29.6 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numba is still slower:
%timeit nb_sum(list_obj_array)
74.4 µs ± 6.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Since you used otypes, you've read enough of the vectorize docs to know that it is not a performance tool.
In [430]: timeit v_func(list_obj_array)
38.3 µs ± 894 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
A list comprehension is faster:
In [431]: timeit [sum(l) for l in list_obj_array]
2.08 µs ± 62.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Even better if you start with a list of lists instead of an object-dtype array:
In [432]: alist = list_obj_array.tolist()
In [433]: timeit [sum(l) for l in alist]
542 ns ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Edit:
np.frompyfunc is faster than np.vectorize, especially when working with object dtype arrays:
In [459]: np.frompyfunc(sum,1,1)(list_obj_array)
Out[459]: array([12.0, 30.0], dtype=object)
In [460]: timeit np.frompyfunc(sum,1,1)(list_obj_array)
2.22 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As I've seen elsewhere, frompyfunc is competitive with the list comprehension.
Interestingly, using np.sum instead of sum slows it down. I think that's because np.sum applied to a list has the overhead of converting the list to an array, while sum applied to a list of numbers is quite fast, running in Python's own compiled code.
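A rough way to see that conversion cost (commands only; run them rather than trusting any numbers here):

%timeit np.array([2.0, 4.0, 6.0])  # the per-row list -> array conversion np.sum pays
%timeit sum([2.0, 4.0, 6.0])       # Python's sum consumes the list directly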
In [461]: timeit np.frompyfunc(np.sum,1,1)(list_obj_array)
30.3 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So let's try sum in your vectorize:
In [462]: v_func = np.vectorize(sum, otypes=[int])
In [463]: timeit v_func(list_obj_array)
8.7 µs ± 331 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Much better.
I have to do a lot of dot products in my data processing pipeline, so I was experimenting with the following two pieces of code, where one is about 3 times as efficient (in terms of runtime) as the other.
Slower method (with arrays created on the fly):
In [33]: %timeit np.dot(np.arange(200000), np.arange(200000, 400000))
352 µs ± 958 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Faster method (with static arrays):
In [34]: vec1_arr = np.arange(200000)
In [35]: vec2_arr = np.arange(200000, 400000)
In [36]: %timeit np.dot(vec1_arr, vec2_arr)
121 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Why is the first method, which generates the arrays on the fly, 3x slower than the second? Is it because much of the extra time is spent allocating memory for the elements, or are other factors contributing to the slowdown?
To gain a little more understanding, I also replicated the setup in pure Python. Surprisingly, there is no performance difference between doing it one way or the other, although it is slower than the numpy implementation, which is expected.
In [42]: %timeit sum(map(operator.mul, range(200000), range(200000, 400000)))
12.5 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [38]: vec1 = range(200000)
In [39]: vec2 = range(200000, 400000)
In [40]: %timeit sum(map(operator.mul, vec1, vec2))
12.5 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The behaviour in the pure Python case is clear, because the range function doesn't actually create all those elements up front; it evaluates lazily (i.e. the values are generated on the fly).
Note: the pure Python implementation is just to convince myself that array allocation might be the factor causing the drag. It's not meant as a comparison with the numpy implementation.
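A quick way to see that laziness (a minimal sketch; exact byte counts vary by Python version):

import sys

r = range(200000)
print(sys.getsizeof(r))        # small and constant: values are computed on demand
print(sys.getsizeof(list(r)))  # large: all 200000 element references are materialized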
The pure Python test is not fair, because np.arange(200000) really returns an array while range(200000) only returns a lazy range object. So the two pure-Python variants both generate their elements on the fly.
import operator
%timeit sum(map(operator.mul, range(200000), range(200000, 400000)))
# 15.1 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vec1 = range(200000)
vec2 = range(200000, 400000)
%timeit sum(map(operator.mul, vec1, vec2))
# 15.2 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vec1 = list(range(200000))
vec2 = list(range(200000, 400000))
%timeit sum(map(operator.mul, vec1, vec2))
# 12.4 ms ± 716 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And we can see the time cost of allocation:
import numpy as np
%timeit np.arange(200000), np.arange(200000, 400000)
# 632 µs ± 9.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.dot(np.arange(200000), np.arange(200000, 400000))
# 703 µs ± 5.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vec1_arr = np.arange(200000)
vec2_arr = np.arange(200000, 400000)
%timeit np.dot(vec1_arr, vec2_arr)
# 77.7 µs ± 427 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Makes sense.
The difference in speed is due to allocating the arrays in the slower case. I pasted %timeit output that takes the allocation of the arrays into account in both cases. The OP's timeit commands only included allocation in the slower case, not in the faster one.
%timeit np.dot(np.arange(200000), np.arange(200000, 400000))
# 524 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vec1_arr = np.arange(200000); vec2_arr = np.arange(200000, 400000); np.dot(vec1_arr, vec2_arr)
# 523 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The allocation of both arrays takes about 360 microseconds on my machine, and the np.dot operation takes 169 microseconds. The sum of those two durations is 529 microseconds, consistent with the %timeit output above.
%timeit vec1_arr = np.arange(200000); vec2_arr = np.arange(200000, 400000)
# 360 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vec1_arr = np.arange(200000)
vec2_arr = np.arange(200000, 400000)
%timeit np.dot(vec1_arr, vec2_arr)
# 169 µs ± 5.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I'm trying to find the best way to compute the minimum element-wise products between two sets of vectors. The usual matrix multiplication C = A @ B computes C_ij as the sum of the pairwise products of the elements of the i-th row of A and the j-th column of B. I would like to take the minimum of the pairwise products instead of the sum, and I can't find an efficient way to do this between two matrices with numpy.
One way to achieve this would be to generate the 3D array of the pairwise products between A and B (before the reduction) and then take the minimum over the third dimension, but this would have a huge memory footprint (and I actually don't know how to do it).
Do you have any idea how I could achieve this operation?
Example:
A = [[1,1],[1,1]]
B = [[0,2],[2,1]]
matrix matmul:
C = [[1*0+1*2, 1*2+1*1], [1*0+1*2, 1*2+1*1]] = [[2,3],[2,3]]
minimum matmul:
C = [[min(1*0,1*2), min(1*2,1*1)], [min(1*0,1*2), min(1*2,1*1)]] = [[0,1],[0,1]]
Use broadcasting after extending A to 3D -
A = np.asarray(A)
B = np.asarray(B)
C_out = np.min(A[:,None]*B,axis=2)
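As a quick sanity check on the example from the question:

A = np.array([[1, 1], [1, 1]])
B = np.array([[0, 2], [2, 1]])
print(np.min(A[:, None] * B, axis=2))  # [[0 1]
                                       #  [0 1]]

Here A[:, None] has shape (2, 1, 2), so the product broadcasts to the full (2, 2, 2) array of pairwise products before min reduces away the last axis.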
If you care about memory footprint, use numexpr module to be efficient about it -
import numexpr as ne
C_out = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
Timings on large arrays -
In [12]: A = np.random.rand(200,200)
In [13]: B = np.random.rand(200,200)
In [14]: %timeit np.min(A[:,None]*B,axis=2)
34.4 ms ± 614 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
29.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: A = np.random.rand(300,300)
In [17]: B = np.random.rand(300,300)
In [18]: %timeit np.min(A[:,None]*B,axis=2)
113 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
102 ms ± 691 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So there's some improvement with numexpr, but maybe not as much as I was expecting.
Numba can also be an option.
I was a bit surprised by the not particularly good numexpr timings, so I tried a Numba version. For large arrays this can be optimized further (much the same principles as for a dgemm apply).
import numpy as np
import numba as nb
import numexpr as ne
@nb.njit(fastmath=True, parallel=True)
def min_pairwise_prod(A, B):
    assert A.shape[1] == B.shape[1]
    res = np.empty((A.shape[0], B.shape[0]))
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            min_prod = A[i, 0] * B[j, 0]
            for k in range(B.shape[1]):
                prod = A[i, k] * B[j, k]
                if prod < min_prod:
                    min_prod = prod
            res[i, j] = min_prod
    return res
Timings
A=np.random.rand(300,300)
B=np.random.rand(300,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
5.56 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
26 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
87.7 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
110 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A=np.random.rand(1000,300)
B=np.random.rand(1000,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
50.6 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
296 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
992 ms ± 7.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
1.27 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I was comparing the performance of counting how many letters 'C' there are in a very long string, using a numpy array of characters versus the string method count.
genome is a very long string.
g1 = genome
g2 = np.array([i for i in genome])
%timeit np.sum(g2=='C')
4.43 s ± 230 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit g1.count('C')
955 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I expected that a numpy array would compute it faster, but I was wrong.
Can someone explain to me how the count method works and why it is faster than using a numpy array?
Thank you!
Let's explore some variations on the problem. I won't try to make as large a string as yours.
In [393]: astr = 'ABCDEF'*10000
First the string count:
In [394]: astr.count('C')
Out[394]: 10000
In [395]: timeit astr.count('C')
70.2 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Now try a 1 element array with that string:
In [396]: arr = np.array(astr)
In [397]: arr.shape
Out[397]: ()
In [398]: np.char.count(arr, 'C')
Out[398]: array(10000)
In [399]: timeit np.char.count(arr, 'C')
200 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [400]: arr.dtype
Out[400]: dtype('<U60000')
My experience with other uses of np.char is that it iterates over the array elements and applies the string method, so it can't be faster than applying the string method directly. I suppose the rest of the time is some sort of numpy overhead.
Make a list from the string - one character per list element:
In [402]: alist = list(astr)
In [403]: alist.count('C')
Out[403]: 10000
In [404]: timeit alist.count('C')
955 µs ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The list count has to loop through the elements and test against 'C' each time. Still, it is faster than sum(i=='C' for i in alist) (and variants).
Now make an array from that list - single character elements:
In [405]: arr1 = np.array(alist)
In [406]: arr1.shape
Out[406]: (60000,)
In [407]: timeit arr1=='C'
634 µs ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [408]: timeit np.sum(arr1=='C')
740 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The np.sum is relatively fast. It's the check against 'C' that takes the most time.
If I construct a numeric array of the same size, the count time is quite a bit faster: the equality test against a number is faster than the equivalent string test.
In [431]: arr2 = np.resize(np.array([1,2,3,4,5,6]),arr1.shape[0])
In [432]: np.sum(arr2==3)
Out[432]: 10000
In [433]: timeit np.sum(arr2==3)
155 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
numpy does not promise to be faster for all Python operations. For the most part, when working with string elements, it depends heavily on Python's own string code.
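If you really want a numpy-side count that avoids the slow unicode comparison, one option (my own addition, assuming the genome is plain ASCII) is to view the string as raw bytes, which lands in the fast numeric case shown above:

g_bytes = np.frombuffer(genome.encode('ascii'), dtype=np.uint8)
count = np.count_nonzero(g_bytes == ord('C'))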
I have a numpy array of shape (64,64,3). How can I get three arrays of shape (64,64) each?
You can use moveaxis to move the axis to split on all the way to the front and then use sequence unpacking:
x,y,z = np.moveaxis(arr, -1, 0)
This is another case where iteration, for a few steps, is not bad.
In [145]: arr = np.ones((64,64,3))
Unpacking a list comprehension:
In [146]: a,b,c = [arr[:,:,i] for i in range(3)]
In [147]: a.shape
Out[147]: (64, 64)
Unpacking a transposed array:
In [148]: a,b,c = np.moveaxis(arr,-1,0)
In [149]: a.shape
Out[149]: (64, 64)
Timings for such a small example aren't ground breaking, but they hint at the relative advantages of the two approaches:
In [150]: timeit a,b,c = [arr[:,:,i] for i in range(3)]
3.02 µs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [151]: timeit a,b,c = np.moveaxis(arr,-1,0)
18.4 µs ± 946 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The array unpacking (on the 1st dimension) requires converting the array into a list of 3 subarrays, as can be seen by these similar timings:
In [154]: timeit a,b,c = list(np.moveaxis(arr,-1,0))
17.8 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [155]: timeit a,b,c = [i for i in np.moveaxis(arr,-1,0)]
17.9 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
It's not the unpacking or iteration that's taking time; it's the moveaxis:
In [156]: timeit np.moveaxis(arr,-1,0)
14 µs ± 4.41 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Looking at its code, we see that it uses transpose, after first constructing an order from the parameters:
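A simplified sketch of that logic (not the exact library source, which also validates and normalizes its axis arguments):

def moveaxis_sketch(a, source, destination):
    # build the permutation: remove `source` from the axis order,
    # then reinsert it at `destination`
    src = source % a.ndim
    dst = destination % a.ndim
    order = [n for n in range(a.ndim) if n != src]
    order.insert(dst, src)
    return a.transpose(order)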
Calling transpose directly is fast (since it just involves changing shape and strides):
In [157]: timeit arr.transpose(2,0,1)
688 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
And unpacking that transpose is a bit faster than my original list comprehension.
In [158]: timeit a,b,c = arr.transpose(2,0,1)
2.78 µs ± 9.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So moveaxis has significant overhead relative to the rest of the task. That said, the overhead probably does not increase with the size of the array; it's fixed.
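To check that claim, one could rerun the timings on a much larger array (commands only; results will vary by machine):

big = np.ones((1024, 1024, 3))
%timeit np.moveaxis(big, -1, 0)   # the moveaxis overhead should stay roughly constant
%timeit big.transpose(2, 0, 1)    # still just a view: only shape and strides change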