Slice a 3D array into multiple 2D arrays - python

I have a numpy array of shape (64, 64, 3). How can we get three arrays of shape (64, 64) each?

You can use moveaxis to move the axis you want to split along all the way to the front and then use sequence unpacking:
x,y,z = np.moveaxis(arr, -1, 0)

This is another case where iteration, for a few steps, is not bad.
In [145]: arr = np.ones((64,64,3))
Unpacking a list comprehension:
In [146]: a,b,c = [arr[:,:,i] for i in range(3)]
In [147]: a.shape
Out[147]: (64, 64)
Unpacking a transposed array:
In [148]: a,b,c = np.moveaxis(arr,-1,0)
In [149]: a.shape
Out[149]: (64, 64)
Timings for such a small example aren't groundbreaking, but they hint at the relative advantages of the two approaches:
In [150]: timeit a,b,c = [arr[:,:,i] for i in range(3)]
3.02 µs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [151]: timeit a,b,c = np.moveaxis(arr,-1,0)
18.4 µs ± 946 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The array unpacking (on the 1st dimension) requires converting the array into a list of 3 subarrays, as can be seen by these similar timings:
In [154]: timeit a,b,c = list(np.moveaxis(arr,-1,0))
17.8 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [155]: timeit a,b,c = [i for i in np.moveaxis(arr,-1,0)]
17.9 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
It's not the unpacking or iteration that's taking time. It's the moveaxis:
In [156]: timeit np.moveaxis(arr,-1,0)
14 µs ± 4.41 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Looking at its code, we see that it uses transpose, after first constructing an axis order from the parameters.
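Roughly, the logic looks like this (a hedged paraphrase of the idea, not the exact numpy source, which also validates its arguments):
def moveaxis_sketch(a, source, destination):
    source = [ax % a.ndim for ax in np.atleast_1d(source)]          # normalize negative axes
    destination = [ax % a.ndim for ax in np.atleast_1d(destination)]
    order = [n for n in range(a.ndim) if n not in source]           # axes that stay in place
    for dest, src in sorted(zip(destination, source)):
        order.insert(dest, src)                                      # splice the moved axes back in
    return a.transpose(order)                                        # the real work is a transpose, i.e. a view
For the (64,64,3) example, moveaxis_sketch(arr, -1, 0) builds the order [2, 0, 1] and returns a (3, 64, 64) view, the same as np.moveaxis(arr, -1, 0).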
Calling transpose directly is fast (since it just involves changing shape and strides):
In [157]: timeit arr.transpose(2,0,1)
688 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
And unpacking that transpose is a bit faster than my original list comprehension.
In [158]: timeit a,b,c = arr.transpose(2,0,1)
2.78 µs ± 9.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So moveaxis has a significant overhead relative to the rest of the task. That said, it probably does not increase with the size of the array. It's a fixed overhead.
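One way to sanity-check that claim (a sketch; exact numbers will vary by machine, but since both calls only rearrange shape and strides and return views, the cost should not grow with the array):
big = np.ones((1024, 1024, 3))
%timeit np.moveaxis(big, -1, 0)    # expected to stay in the same ~15 µs range as for (64, 64, 3)
%timeit big.transpose(2, 0, 1)     # expected to stay well under 1 µs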

Related

Why is performance of in-place modification to a numpy array related to the order of dimension being modified?

import numpy as np
a = np.random.random((500, 500, 500))
b = np.random.random((500, 500))
%timeit a[250, :, :] = b
%timeit a[:, 250, :] = b
%timeit a[:, :, 250] = b
107 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52 µs ± 88.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.59 ms ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Observations:
Performance differs across the three runs: modifying along the 1st and 3rd dimensions (indexing the 2nd, a[:, 250, :]) is the fastest of the three, while modifying along the 1st and 2nd dimensions (indexing the 3rd, a[:, :, 250]) is the slowest.
There seems to be no monotonic relationship between speed and which dimension is indexed.
My questions are:
What mechanism in numpy explains my observations?
With the answer to the 1st question in mind, how can I speed up my code by arranging dimensions properly, given that some dimensions are modified in bulk and the rest are just sliced?
As several comments have indicated, it's all about locality of reference. Think about what numpy has to do at the low level, and how far away from each other in memory the consecutive lvalues are in the 3rd case.
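To make that concrete, here is a hedged sketch using the arrays defined above: the byte strides of the C-contiguous a show how far apart the consecutively written elements land in memory.
print(a.strides)    # (2000000, 4000, 8) for a C-contiguous (500, 500, 500) float64 array
# a[250, :, :] = b  writes one contiguous 2 MB block (500*500 elements)
# a[:, 250, :] = b  writes 500 contiguous 4000-byte rows, spaced 2 MB apart
# a[:, :, 250] = b  writes 250000 individual 8-byte elements, mostly 4000 bytes apart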
Note also how the timing results change when the arrays are not C-contiguous, but F-contiguous instead:
a = np.asfortranarray(a)
b = np.asfortranarray(b)
%timeit a[250, :, :] = b
%timeit a[:, 250, :] = b
%timeit a[:, :, 250] = b
892 µs ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
169 µs ± 66.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(Very small side note: for the same reason, it is sometimes advantageous to sort a DataFrame before doing a groupby and a bunch of repetitive operations on the groups, somewhat counter-intuitively, since the sort itself takes O(n log n).)

numpy sum slower than string count

I was comparing the performance of counting how many letters 'C' are in a very long string, using a numpy array of characters and the string method count.
genome is a very long string.
g1 = genome
g2 = np.array([i for i in genome])
%timeit np.sum(g2=='C')
4.43 s ± 230 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit g1.count('C')
955 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I expected that a numpy array would compute it faster, but I was wrong.
Can someone explain to me how the count method works and why it is faster than using a numpy array?
Thank you!
Let's explore some variations on the problem. I won't try to make as large a string as yours.
In [393]: astr = 'ABCDEF'*10000
First the string count:
In [394]: astr.count('C')
Out[394]: 10000
In [395]: timeit astr.count('C')
70.2 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Now try a 1 element array with that string:
In [396]: arr = np.array(astr)
In [397]: arr.shape
Out[397]: ()
In [398]: np.char.count(arr, 'C')
Out[398]: array(10000)
In [399]: timeit np.char.count(arr, 'C')
200 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [400]: arr.dtype
Out[400]: dtype('<U60000')
My experience with other np.char functions is that they iterate on the array elements and apply the corresponding string method. So this can't be faster than applying the string method directly. I suppose the rest of the time is some sort of numpy overhead.
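Here is my mental model of np.char.count as a hedged sketch (not the actual implementation): loop over the elements and call the Python-level str.count on each one.
def char_count_sketch(arr, sub):
    arr = np.asarray(arr)
    counts = [s.count(sub) for s in arr.ravel().tolist()]    # str.count on every element
    return np.array(counts).reshape(arr.shape)
char_count_sketch(arr, 'C') returns array(10000), matching np.char.count(arr, 'C') above.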
Make a list from the string - one character per list element:
In [402]: alist = list(astr)
In [403]: alist.count('C')
Out[403]: 10000
In [404]: timeit alist.count('C')
955 µs ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The list count has to loop through the elements, and test each one against 'C'. Still it is faster than sum(i=='C' for i in alist) (and variants).
Now make an array from that list - single character elements:
In [405]: arr1 = np.array(alist)
In [406]: arr1.shape
Out[406]: (60000,)
In [407]: timeit arr1=='C'
634 µs ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [408]: timeit np.sum(arr1=='C')
740 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The np.sum is relatively fast. It's the check against 'C' that takes the most time.
If I construct a numeric array of the same size, the count time is quite a bit faster. The equality test against a number is faster than the equivalent string test.
In [431]: arr2 = np.resize(np.array([1,2,3,4,5,6]),arr1.shape[0])
In [432]: np.sum(arr2==3)
Out[432]: 10000
In [433]: timeit np.sum(arr2==3)
155 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
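Building on that, a hedged sketch for the original genome problem is to view the string as raw bytes, so the comparison is numeric rather than string-based (str.count is already highly optimized, so this may or may not beat it, but it avoids the slow string comparison of the array approach):
g3 = np.frombuffer(astr.encode('ascii'), dtype=np.uint8)    # one byte per character, no per-element strings
np.count_nonzero(g3 == ord('C'))                            # 10000, the same answer as astr.count('C')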
numpy does not promise to be faster for all Python operations. For the most part, when working with string elements, it is heavily dependent on Python's own string code.

Numpy to list over 2nd axis

I would like to split an n-d numpy array along an internal axis.
I have an array of shape (6,150,29,29,29,1).
I would like a list of 150 arrays, each of shape (6,29,29,29,1).
I have used list(a), but that gives me a list over axis 0.
arr.transpose(1,0,2,3,4,5) or np.swapaxes(arr,0,1) puts the 150 dimension first. Then you can use list.
Or you could use a list comprehension
[a[:,i] for i in range(150)]
The transpose is somewhat faster:
In [28]: timeit list(arr.transpose(1,0,2,3,4,5))
47.7 µs ± 47.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [29]: timeit [arr[:,i] for i in range(150)]
88.7 µs ± 22.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [32]: timeit list(np.swapaxes(arr,0,1))
49.2 µs ± 51.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
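A minimal end-to-end sketch with the shapes from the question (any of the three routes above gives the same list):
import numpy as np
arr = np.ones((6, 150, 29, 29, 29, 1))
pieces = list(np.swapaxes(arr, 0, 1))
len(pieces), pieces[0].shape    # (150, (6, 29, 29, 29, 1))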

Performance bottleneck in Tensordot

While I was trying to understand numpy.tensordot(), I tried out the examples from the documentation and was convinced that we can get exactly the same result from tensordot with different permutations of the axes argument. For example, the two permutations of axes below are equivalent (i.e. they both yield the same result):
In [28]: a = np.arange(60.).reshape(3,4,5)
In [29]: b = np.arange(24.).reshape(4,3,2)
In [30]: perm1 = np.tensordot(a, b, axes=[(1, 0), (0, 1)])
In [31]: perm2 = np.tensordot(a, b, axes=[(0, 1), (1, 0)])
In [32]: np.all(perm1 == perm2)
Out[32]: True
However, while measuring the performance, I found that one permutation is a little over 2x faster than the other, and that puzzles me.
# setting up input arrays
In [19]: a = np.arange(30*40*50).reshape(30,40,50)
In [20]: b = np.arange(40*30*20).reshape(40,30,20)
# contracting the first two axes from the input tensors
In [21]: %timeit np.tensordot(a, b, axes=[(0, 1), (1, 0)])
3.23 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# an equivalent way of contraction of the first two
# axes from the input tensors as in the above case
In [22]: %timeit np.tensordot(a, b, axes=[(1, 0), (0, 1)])
1.62 ms ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So, what is the reason for the 2x speedup in the latter case? Does it have to do with how NumPy ndarrays are structured internally in memory? Or something else? Thanks in advance for your insights!
Without going into the details, these two calculations recreate the actions taken by tensordot, and produce the same perm values.
They show the same sort of 2x speed difference:
In [24]: timeit np.dot(a.transpose(2,0,1).reshape(50,-1), b.transpose(1,0,2).reshape(-1,20))
4.39 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [25]: timeit np.dot(a.transpose(2,1,0).reshape(50,-1), b.reshape(-1,20))
2.99 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
My guess is that the 2nd is faster because the b.reshape(-1,20) does not require a copy, whereas the transpose followed by reshape in the 1st does.
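One way to check that guess (a small sketch; np.shares_memory tells us whether the reshape returned a view of b or had to copy):
np.shares_memory(b, b.reshape(-1, 20))                       # True: reshaping the C-contiguous b is a view
np.shares_memory(b, b.transpose(1, 0, 2).reshape(-1, 20))    # False: reshaping the transposed, non-contiguous b copies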
And timing the different reshapes:
In [28]: timeit a.transpose(2,1,0).reshape(50,-1)
128 µs ± 978 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [29]: timeit a.transpose(2,0,1).reshape(50,-1)
1.04 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [30]: timeit b.reshape(-1,20)
501 ns ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [31]: timeit b.transpose(1,0,2).reshape(-1,20)
27.5 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
There are significant differences in speed. [30] is just a view, so that explains why it is so fast. I'm guessing [28] is so much slower because it involves a full reversal of elements, whereas [29] copies (40,50) blocks.
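As a closing sanity check, a hedged sketch (variable names are just for illustration) confirming that the dot reconstructions above really reproduce the tensordot results:
t_slow = np.tensordot(a, b, axes=[(0, 1), (1, 0)])    # the ~3.2 ms case
t_fast = np.tensordot(a, b, axes=[(1, 0), (0, 1)])    # the ~1.6 ms case
rec_slow = np.dot(a.transpose(2, 0, 1).reshape(50, -1), b.transpose(1, 0, 2).reshape(-1, 20))
rec_fast = np.dot(a.transpose(2, 1, 0).reshape(50, -1), b.reshape(-1, 20))
np.allclose(t_slow, rec_slow) and np.allclose(t_fast, rec_fast)    # True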

Performance of various numpy fancy indexing methods, also with numba

Since fast indexing of Numpy arrays is quite necessary for my program and fancy indexing doesn't have a good reputation performance-wise, I decided to run a few tests. Especially since Numba is developing quite fast, I tested which methods work well with numba.
As inputs I've been using the following arrays for my small-arrays-test:
import numpy as np
import numba as nb
x = np.arange(0, 100, dtype=np.float64) # array to be indexed
idx = np.array((0, 4, 55, -1), dtype=np.int32) # fancy indexing array
bool_mask = np.zeros(x.shape, dtype=np.bool) # boolean indexing mask
bool_mask[idx] = True # set same elements as in idx True
y = np.zeros(idx.shape, dtype=np.float64) # output array
y_bool = np.zeros(bool_mask[bool_mask == True].shape, dtype=np.float64) #bool output array (only for convenience)
And the following arrays for my large-arrays-test (y_bool needed here to cope with dupe numbers from randint):
x = np.arange(0, 1000000, dtype=np.float64)
idx = np.random.randint(0, 1000000, size=int(1000000/50))
bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[idx] = True
y = np.zeros(idx.shape, dtype=np.float64)
y_bool = np.zeros(bool_mask[bool_mask == True].shape, dtype=np.float64)
This yields the following timings without using numba:
%timeit x[idx]
#1.08 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#large arrays: 129 µs ± 3.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x[bool_mask]
#482 ns ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#large arrays: 621 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.take(x, idx)
#2.27 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 112 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.take(x, idx, out=y)
#2.65 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 134 µs ± 4.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x.take(idx)
#919 ns ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 108 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x.take(idx, out=y)
#1.79 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 131 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.compress(bool_mask, x)
#1.93 µs ± 95.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 618 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.compress(bool_mask, x, out=y_bool)
#2.58 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 637 µs ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x.compress(bool_mask)
#900 ns ± 82.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 628 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x.compress(bool_mask, out=y_bool)
#1.78 µs ± 59.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 628 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.extract(bool_mask, x)
#5.29 µs ± 194 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 641 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And with numba, using jitting in nopython mode with caching and nogil, I decorated the indexing methods that numba supports:
@nb.jit(nopython=True, cache=True, nogil=True)
def fancy(x, idx):
    x[idx]

@nb.jit(nopython=True, cache=True, nogil=True)
def fancy_bool(x, bool_mask):
    x[bool_mask]

@nb.jit(nopython=True, cache=True, nogil=True)
def taker(x, idx):
    np.take(x, idx)

@nb.jit(nopython=True, cache=True, nogil=True)
def ndtaker(x, idx):
    x.take(idx)
This yields the following results for small and large arrays:
%timeit fancy(x, idx)
#686 ns ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 84.7 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit fancy_bool(x, bool_mask)
#845 ns ± 31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 843 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit taker(x, idx)
#814 ns ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 87 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit ndtaker(x, idx)
#831 ns ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 85.4 µs ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Summary
While for numpy without numba it is clear that small arrays are by far best indexed with boolean masks (about a factor of 2 compared to ndarray.take(idx)), for larger arrays ndarray.take(idx) will perform best, in this case around 6 times faster than boolean indexing. The breakeven point is at an array size of around 1000 cells with an index-array size of around 20 cells.
For arrays with 1e5 elements and a 5e3-element index array, ndarray.take(idx) will be around 10 times faster than boolean mask indexing. So boolean indexing seems to slow down considerably with array size, but catches up a little after some array-size threshold is reached.
For the numba jitted functions there is a small speedup for all indexing functions except for boolean mask indexing. Simple fancy indexing works best here, but is still slower than boolean masking without jitting.
For larger arrays boolean mask indexing is a lot slower than the other methods, and even slower than the non-jitted version. The three other methods all perform quite well, around 15% faster than the non-jitted version.
For my case with many arrays of different sizes, fancy indexing with numba is the best way to go. Perhaps some other people can also find some useful information in this quite lengthy post.
Edit:
I'm sorry that I forgot to ask my question, which I in fact have. I was just rapidly typing this at the end of my workday and completely forgot it...
Well, do you know of any better and faster method than those I tested? With Cython my timings were between Numba and pure Python.
As the index array is predefined once and used without alteration in long iterations, any way of pre-defining the indexing process would be great. For this I thought about using strides. But I wasn't able to pre-define a custom set of strides. Is it possible to get a predefined view into the memory using strides?
Edit 2:
I guess I'll move my question about predefined constant index arrays, which will be used on the same value array (where only the values change but not the shape) a few million times in iterations, to a new and more specific question. This question was too general, and perhaps I also formulated it a bit misleadingly. I'll post the link here as soon as I've opened the new question!
Here is the link to the followup question.
Your summary isn't completely correct: you already ran tests with differently sized arrays, but one thing you didn't do was change the number of elements indexed.
I restricted it to pure indexing and omitted take (which is effectively integer array indexing) and compress and extract (because these are effectively boolean array indexing). The only difference for these is the constant factor. The constant factor for the methods ndarray.take and ndarray.compress will be smaller than the overhead of the numpy functions np.take and np.compress, but otherwise the effects will be negligible for reasonably sized arrays.
Just let me present it with different numbers:
# ~ every 500th element
x = np.arange(0, 1000000, dtype=np.float64)
idx = np.random.randint(0, 1000000, size=int(1000000/500)) # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[idx] = True
%timeit x[idx]
# 51.6 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x[bool_mask]
# 1.03 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# ~ every 50th element
idx = np.random.randint(0, 1000000, size=int(1000000/50)) # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[idx] = True
%timeit x[idx]
# 1.46 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x[bool_mask]
# 2.69 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ~ every 5th element
idx = np.random.randint(0, 1000000, size=int(1000000/5)) # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[idx] = True
%timeit x[idx]
# 14.9 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit x[bool_mask]
# 8.31 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So what happened here? It's simple: integer array indexing only needs to access as many elements as there are values in the index array. That means it will be quite fast if there are few indices, but slow if there are many. Boolean array indexing, however, always needs to walk through the whole boolean array and check for True values, so it should take roughly "constant" time for a given array.
But wait: it's not really constant for boolean arrays, and why does integer array indexing take longer (last case) than boolean array indexing, even though it has to process ~5 times fewer elements?
That's where it gets more complicated. In this case the boolean array had True at random places, which means it is subject to branch prediction failures. These are more likely if True and False occur equally often but at random places. That's why the boolean array indexing got slower: the ratio of True to False became more equal and thus more "random". Also, the result array will be larger if there are more Trues, which consumes more time.
As an example of this branch prediction effect, use this (results could differ with different systems/compilers):
bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[:1000000//2] = True # first half True, second half False
%timeit x[bool_mask]
# 5.92 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[::2] = True # True and False alternating
%timeit x[bool_mask]
# 16.6 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[::2] = True
np.random.shuffle(bool_mask) # shuffled
%timeit x[bool_mask]
# 18.2 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the distribution of True and False will critically affect the runtime with boolean masks, even if they contain the same number of Trues! The same effect will be visible for the compress functions.
For integer array indexing (and likewise np.take) another effect will be visible: cache locality. The indices in your case are randomly distributed, so your computer has to do a lot of "RAM" to "processor cache" loads, because it is very unlikely that two consecutive indices are near each other.
Compare this:
idx = np.random.randint(0, 1000000, size=int(1000000/5))
%timeit x[idx]
# 15.6 ms ± 703 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
idx = np.random.randint(0, 1000000, size=int(1000000/5))
idx = np.sort(idx) # sort them
%timeit x[idx]
# 4.33 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
By sorting the indices you immensely increase the chance that the next value will already be in the cache, and this can lead to huge speedups. That's a very important factor if you know that the indices will be sorted (for example, if they were created by np.where they are sorted, which makes the result of np.where especially efficient for indexing).
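A small hedged illustration of the np.where point, reusing x and the sorted idx from just above: the indices np.where returns already come out in increasing order, so they get the cache-friendly access pattern without an explicit sort.
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True
where_idx = np.where(bool_mask)[0]    # already sorted
np.all(np.diff(where_idx) >= 0)       # True
%timeit x[where_idx]                  # should be close to the sorted-idx timing above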
So it's not that integer array indexing is slower for small arrays and faster for large ones; it depends on many more factors. Both have their use cases, and depending on the circumstances one might be (considerably) faster than the other.
Let me also talk a bit about the numba functions. First some general statements:
cache won't make a difference; it just avoids recompiling the function. In interactive environments this is essentially useless. It does save time if you package the functions in a module, though.
nogil by itself won't provide any speed boost. It only helps if the function is called from different threads, because each execution can release the GIL and multiple calls can then run in parallel (see the sketch at the end of this answer).
Otherwise I don't know how numba effectively implements these functions. However, when you use NumPy features in numba it could be slower or faster, but even if it's faster it won't be much faster (except maybe for small arrays), because if it could be made faster the NumPy developers would implement it as well. My rule of thumb is: if you can do it (vectorized) with NumPy, don't bother with numba. Only if you can't do it with vectorized NumPy functions, or NumPy would use too many temporary arrays, will numba shine!
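To illustrate the nogil point above, here is a minimal sketch reusing x, idx, np and nb from the question setup (take_chunk is a hypothetical helper, not from the question; any speedup depends on array size and core count):
from concurrent.futures import ThreadPoolExecutor

@nb.jit(nopython=True, nogil=True)
def take_chunk(x, idx):    # hypothetical helper that returns its result
    return x.take(idx)

take_chunk(x, idx[:1])                # warm-up call so compilation isn't timed
chunks = np.array_split(idx, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(lambda c: take_chunk(x, c), chunks))
result = np.concatenate(parts)        # same values as x.take(idx), gathered by 4 threads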
