what is the quickest way to iterate through a numpy array - python

I noticed a meaningful difference between iterating through a numpy array "directly" versus iterating through via the tolist method. See timing below:
directly
[i for i in np.arange(10000000)]
via tolist
[i for i in np.arange(10000000).tolist()]
Considering I've already discovered one way to go faster, I wanted to ask what else might speed this up.
What is the fastest way to iterate through a numpy array?

This is actually not surprising. Let's examine the methods one at a time, starting with the slowest.
[i for i in np.arange(10000000)]
This method asks Python to reach into the numpy array (stored in C-managed memory), one element at a time, allocate a Python object in memory, and create a pointer to that object in the list. Each time you cross between the numpy array stored in the C backend and pure Python, there is an overhead cost. This method pays that cost 10,000,000 times.
Next:
[i for i in np.arange(10000000).tolist()]
In this case, using .tolist() makes a single call to the numpy C backend and allocates all of the elements in one shot to a list. You then are using python to iterate over that list.
Finally:
list(np.arange(10000000))
This basically does the same thing as above, but it creates a list of numpy's native type objects (e.g. np.int64). Using list(np.arange(10000000)) and np.arange(10000000).tolist() should be about the same time.
So, in terms of iteration, the primary advantage of using numpy is that you don't need to iterate. Operations are applied in a vectorized fashion over the array; iteration just slows it down. If you find yourself iterating over array elements, you should look into restructuring the algorithm so that it uses only numpy operations (it has so many built in!), or, if really necessary, you can use np.apply_along_axis, np.apply_over_axes, or np.vectorize.
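For example, here is a minimal sketch of the difference (my own illustration, not from the original post): doubling every element with a Python-level loop versus the vectorized equivalent.
import numpy as np

a = np.arange(10000000)

# Python-level loop: every element is boxed into a Python int and summed by the interpreter
slow = sum(i * 2 for i in a.tolist())

# Vectorized: the multiply and the sum both run inside numpy's compiled loops
fast = int((a * 2).sum())

assert slow == fast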

These are my timings on a slower machine
In [1034]: timeit [i for i in np.arange(10000000)]
1 loop, best of 3: 2.16 s per loop
If I generate the range directly (Py3, so this is a generator) times are much better. Take this as a baseline for a list comprehension of this size.
In [1035]: timeit [i for i in range(10000000)]
1 loop, best of 3: 1.26 s per loop
tolist converts the arange to a list first; takes a bit longer, but the iteration is still on a list
In [1036]: timeit [i for i in np.arange(10000000).tolist()]
1 loop, best of 3: 1.6 s per loop
Using list() - same time as direct iteration on the array; that suggests that the direct iteration first does this.
In [1037]: timeit [i for i in list(np.arange(10000000))]
1 loop, best of 3: 2.18 s per loop
In [1038]: timeit np.arange(10000000).tolist()
1 loop, best of 3: 927 ms per loop
Same time as iterating on the .tolist:
In [1039]: timeit list(np.arange(10000000))
1 loop, best of 3: 1.55 s per loop
In general if you must loop, working on a list is faster. Access to elements of a list is simpler.
Look at the elements returned by indexing.
a[0] is another numpy object; it is constructed from the values in a, but not simply a fetched value
list(a)[0] is the same type; the list is just [a[0], a[1], a[2]]
In [1043]: a = np.arange(3)
In [1044]: type(a[0])
Out[1044]: numpy.int32
In [1045]: ll=list(a)
In [1046]: type(ll[0])
Out[1046]: numpy.int32
but tolist converts the array into a pure list, in this case a list of ints. It does more work than list(), but does it in compiled code.
In [1047]: ll=a.tolist()
In [1048]: type(ll[0])
Out[1048]: int
In general don't use list(anarray). It rarely does anything useful, and is not as powerful as tolist().
What's the fastest way to iterate through an array? There isn't one, at least not in Python; in C code there are fast ways.
a.tolist() is the fastest, vectorized way of creating a list of integers from an array. It iterates, but does so in compiled code.
But what is your real goal?

The speedup via tolist only holds for 1D arrays. Once you add a second axis, the performance gain disappears:
1D
import numpy as np
import timeit
num_repeats = 10
x = np.arange(10000000)
via_tolist = timeit.timeit("[i for i in x.tolist()]", number=num_repeats, globals={"x": x})
direct = timeit.timeit("[i for i in x]",number=num_repeats, globals={"x": x})
print(f"tolist: {via_tolist / num_repeats}")
print(f"direct: {direct / num_repeats}")
tolist: 0.430838281600154
direct: 0.49088368080047073
2D
import numpy as np
import timeit
num_repeats = 10
x = np.arange(10000000*10).reshape(-1, 10)
via_tolist = timeit.timeit("[i for i in x.tolist()]", number=num_repeats, globals={"x": x})
direct = timeit.timeit("[i for i in x]", number=num_repeats, globals={"x": x})
print(f"tolist: {via_tolist / num_repeats}")
print(f"direct: {direct / num_repeats}")
tolist: 2.5606724178003786
direct: 1.2158976945000177
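A plausible explanation (my note, not part of the original answer): iterating a 2D array directly yields one row view per step, so only N Python objects are created, while .tolist() has to box every one of the N*M scalars up front. A quick check of the element types:
import numpy as np

x = np.arange(20).reshape(-1, 2)

print(type(next(iter(x))))     # <class 'numpy.ndarray'> -- direct iteration yields row views
print(type(x.tolist()[0][0]))  # <class 'int'> -- tolist boxes every scalar up front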

My test case has a numpy array
[[ 34 107]
[ 963 144]
[ 921 1187]
[ 0 1149]]
I'm going through this only once using range and enumerate
USING range
from timeit import default_timer

loopTimer1 = default_timer()
for l1 in range(0, 4):
    print(box[l1])
print("Time taken by range: ", default_timer() - loopTimer1)
Result
[ 34 107]
[963 144]
[ 921 1187]
[ 0 1149]
Time taken by range: 0.0005405639985838206
USING enumerate
loopTimer2 = default_timer()
for l2, v2 in enumerate(box):
    print(box[l2])
print("Time taken by enumerate: ", default_timer() - loopTimer2)
Result
[ 34 107]
[963 144]
[ 921 1187]
[ 0 1149]
Time taken by enumerate: 0.00025605700102460105
In the test case I picked, enumerate works faster.

Related

Indexing numpy array as slow as python append

I have always been told that Python's native append is a slow function and should be avoided in for loops. However, after a couple of small tests I have found the list actually performs slightly better than a numpy array when iterating over it with a for loop:
First Test - array/list construction
python native list append
def pythonAppend(n):
    x = []
    for i in range(n):
        x.append(i)
    return x
%timeit pythonAppend(1000000)
Numpy allocate array then access
import numpy as np

def numpyConstruct(n):
    x = np.zeros(n)
    for i in range(n):
        x[i] = i
    return x
%timeit numpyConstruct(1000000)
Results:
Python Time: 179 ms
Numpy Time: 189 ms
Second Test - Accessing Elements
n = 1000000
x = pythonAppend(n)
arr = numpyConstruct(n)
order = arr[:]; np.random.shuffle(order); order = list(order.astype(int))
def listAccess(x):
    for i in range(len(x)):
        x[i] = i/3.
    return x

def listAccessOrder(x, order):
    for i in order:
        x[i] = i/3.
    return x
%timeit listAccess(x)
%timeit listAccess(arr)
%timeit listAccessOrder(x, order)
%timeit listAccessOrder(arr, order)
%timeit arr / 3.
Results
python -sequential: 175 ms
numpy -sequential: 198 ms
python -shuffled access: 2.08 s
numpy -shuffled access: 2.15 s
numpy -vectorised access: 1.92ms
These results are quite surprising to me, as I would have thought that, Python lists being linked lists, accessing elements would be slower than numpy because of having to follow a chain of pointers. Or am I misunderstanding the Python list implementation? Also, why do the Python lists perform slightly better than the numpy equivalents? I guess most of the inefficiency comes from using a Python for loop, but Python's append still out-competes numpy's accessing and allocating.
Python lists are not linked lists. The data buffer of the list object contains links/pointers to objects (elsewhere in memory). So fetching the ith element is easy. The buffer also has growth headroom, so append is a simple matter of inserting the new element's link. The buffer has to be reallocated periodically, but Python manages that seamlessly.
A numpy array also has a 1d data buffer, but it contains numeric values (more generally, the bytes required by the dtype). Fetching is easy, but it has to create a new Python object to 'contain' the value ('boxing'). Assignment also requires conversion from the Python object to the bytes that will be stored.
Generally we've found that making a new array by appending to a list (with one np.array call at the end) is competitive with assignment to a preallocated array.
Iteration through a numpy array is normally slower than iterating through a list.
What we strongly discourage is using np.append (or some variant) to grow an array iteratively. That makes a new array each time, with a full copy.
numpy arrays are fast when the iteration is done in the compiled code. Normally that's with the whole-array methods. The C code can iterate over the data buffer, operating on the values directly, rather than invoking Python object methods at each step.
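As a rough sketch of the two growth patterns contrasted above (my own illustration, with an arbitrary size): appending to a list and converting once, versus repeatedly calling np.append.
import numpy as np

n = 10000

# Competitive pattern: grow a Python list, convert to an array once at the end
buf = []
for i in range(n):
    buf.append(i * 0.5)
arr = np.array(buf)

# Discouraged pattern: np.append builds a brand-new array (full copy) on every call
bad = np.array([])
for i in range(n):
    bad = np.append(bad, i * 0.5)

assert np.array_equal(arr, bad)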

Fast way to select from numpy array without intermediate index array

Given the following 2-column array, I want to select items from the second column that correspond to "edges" in the first column. This is just an example, as in reality my a has potentially millions of rows. So, ideally I'd like to do this as fast as possible, and without creating intermediate results.
import numpy as np
a = np.array([[1,4],[1,2],[1,3],[2,6],[2,1],[2,8],[2,3],[2,1],
[3,6],[3,7],[5,4],[5,9],[5,1],[5,3],[5,2],[8,2],
[8,6],[8,8]])
i.e. I want to find the result,
desired = np.array([4,6,6,4,2])
which is entries in a[:,1] corresponding to where a[:,0] changes.
One solution is,
b = a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1, 1]
which gives np.array([6,6,4,2]), I could simply prepend the first item, no problem. However, this creates an intermediate array of the indexes of the first items. I could avoid the intermediate by using a list comprehension:
c = [a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y]
This also gives [6,6,4,2]. Assuming a generator-based zip (true in Python 3), this doesn't need to create an intermediate representation and should be very memory efficient. However, the inner loop is not numpy, and it necessitates generating a list which must be subsequently turned back into a numpy array.
Can you come up with a numpy-only version with the memory efficiency of c but the speed efficiency of b? Ideally only one pass over a is needed.
(Note that measuring the speed won't help much here, unless a is very big, so I wouldn't bother with benchmarking this, I just want something that is theoretically fast and memory efficient. For example, you can assume rows in a are streamed from a file and are slow to access -- another reason to avoid the b solution, as it requires a second random-access pass over a.)
Edit: a way to generate a large a matrix for testing:
from itertools import repeat
N, M = 100000, 100
a = np.array(list(zip([x for y in zip(*repeat(np.arange(N),M)) for x in y], np.random.random(N*M))))  # list() needed so this works under Python 3
I am afraid if you are looking to do this in a vectorized way, you can't avoid an intermediate array, as there's no built-in for it.
Now, let's look for vectorized approaches other than nonzero(), which might be more performant. Going by the same idea of performing differentiation as with the original code of (a[1:,0]-a[:-1,0]), we can use boolean indexing after looking for non-zero differentiations that correspond to "edges" or shifts.
Thus, we would have a vectorized approach like so -
a[np.append(True,np.diff(a[:,0])!=0),1]
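Checking this one-liner against the sample a from the question reproduces the desired output, first row included:
out = a[np.append(True, np.diff(a[:, 0]) != 0), 1]
print(out)   # [4 6 6 4 2] -- matches `desired`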
Runtime test
The original solution a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1,1] would skip the first row. But let's just say, for the sake of timing purposes, that it's a valid result. Here are the runtimes for it against the solution proposed in this post -
In [118]: from itertools import repeat
...: N, M = 100000, 2
...: a = np.array(zip([x for y in zip(*repeat(np.arange(N),M))\
for x in y ], np.random.random(N*M)))
...:
In [119]: %timeit a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1,1]
100 loops, best of 3: 6.31 ms per loop
In [120]: %timeit a[1:][np.diff(a[:,0])!=0,1]
100 loops, best of 3: 4.51 ms per loop
Now, let's say you want to include the first row too. The updated runtimes would look something like this -
In [123]: from itertools import repeat
...: N, M = 100000, 2
...: a = np.array(zip([x for y in zip(*repeat(np.arange(N),M))\
for x in y ], np.random.random(N*M)))
...:
In [124]: %timeit a[np.append(0,(a[1:,0]-a[:-1,0]).nonzero()[0]+1),1]
100 loops, best of 3: 6.8 ms per loop
In [125]: %timeit a[np.append(True,np.diff(a[:,0])!=0),1]
100 loops, best of 3: 5 ms per loop
OK, actually I found a solution: I just learned about np.fromiter, which can build a numpy array from a generator:
d = np.fromiter((a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y), int)
I think this does it, generates a numpy array without any intermediate arrays. However, the caveat is that it does not seem to be all that efficient! Forgetting what I said in the question about testing:
t = [lambda a: a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1, 1],
lambda a: np.array([a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y]),
lambda a: np.fromiter((a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y), int)]
from timeit import Timer
[Timer(lambda: x(a)).timeit(number=10) for x in t]
[0.16596235800034265, 1.811289312000099, 2.1662971739997374]
It seems the first solution is drastically faster! I assume this is because even though it generates intermediate data, it is able to perform the inner loop completely in numpy, while in the other it runs Python code for each item in the array.
Like I said, this is why I'm not sure this kind of benchmarking makes sense here -- if accesses to a were much slower, the benchmark wouldn't be CPU-loaded. Thoughts?
Not "accepting" this answer since I am hoping someone can come up with something faster.
If memory efficiency is your concern, that can be solved as such: the only intermediate of the same size-order as the input data can be made of type bool (a[1:,0] != a[:-1, 0]); and if your input data is int32, that is 8 times smaller than 'a' itself. You can count the nonzeros of that binary array to preallocate the output array as well; though that should not be very significant either if the output of the != is as sparse as your example suggests.
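A sketch of that bool-mask variant (my own code, with a hypothetical function name): the only sizeable intermediate is the boolean mask, and count_nonzero sizes the output up front.
import numpy as np

def edges_memory_lean(a):
    # bool mask: 1 byte per row, vs. 8 bytes per row for the two-column int32 input
    mask = a[1:, 0] != a[:-1, 0]
    out = np.empty(np.count_nonzero(mask) + 1, dtype=a.dtype)
    out[0] = a[0, 1]            # keep the first row
    out[1:] = a[1:, 1][mask]    # second-column values where the first column changes
    return out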

Python Typed Array of a Certain Size

This will create an empty array of type signed int:
import array
a = array.array('i')
What is an efficient (performance-wise) way to specify the array length (as well as the array's rank - number of dimensions)?
I understand that NumPy allows to specify array size at creation, but can it be done in standard Python?
Initialising an array of fixed size in python
That deals mostly with lists, and no consideration is given to performance. The main reason to use an array instead of a list is performance.
The array constructor accepts an iterable as its 2nd argument. So, the following works to efficiently create and initialize the array to 0..N-1:
x = array.array('i', range(N))
This does not create a separate N element vector or list.
(If using Python 2, use xrange instead.) Of course, if you need different initialization you may use a generator object instead of range. For example, you can use a generator expression to fill the array with zeros:
a=array.array('i',(0 for i in range(N)))
Python has no 2D (or higher) array. You have to construct one from a list of 1D arrays.
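For example, a minimal sketch (my own illustration) of the list-of-1D-arrays workaround:
from array import array

rows, cols = 3, 4
grid = [array('i', [0]) * cols for _ in range(rows)]  # one typed 1D array per row
grid[1][2] = 7
print(grid[1][2])   # 7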
The truth is, if you are looking for a high performance implementation, you should probably use Numpy.
It's simple and fast to just use:
array.array('i', [0]) * n
Timing of different ways to initialize an array on my machine:
from array import array

n = 10 ** 7
array('i', [0]) * n               # 21.9 ms
array('i', [0]*n)                 # 395.2 ms
array('i', range(n))              # 810.6 ms
array('i', (0 for _ in range(n))) # 1238.6 ms
You said
The main reason to use an array instead of a list is performance.
Surely arrays use less memory than lists.
But in my experiments I found no evidence that an array is always faster than a normal list.

Scipy.sparse.csr_matrix: How to get top ten values and indices?

I have a large csr_matrix and I am interested in the top ten values and their indices in each row. But I did not find a decent way to manipulate the matrix.
Here is my current solution and the main idea is to process them row by row:
row = csr_matrix.getrow(row_number).toarray()[0].ravel()
top_ten_indicies = row.argsort()[-10:]
top_ten_values = row[row.argsort()[-10:]]
By doing this, the advantages of csr_matrix are not fully used. It's more like a brute force solution.
I don't see what the advantages of csr format are in this case. Sure, all the nonzero values are collected in one .data array, with the corresponding column indexes in .indices. But they are in blocks of varying length. And that means they can't be processed in parallel or with numpy array strides.
One solution is to pad those blocks into common-length blocks. That's what .toarray() does. Then you can find the maximum values with argsort(axis=1) or with argpartition.
Another is to break them into row-sized blocks, and process each of those. That's what you are doing with .getrow. Another way of breaking them up is to convert to lil format, and process the sublists of the .data and .rows arrays.
A possible third option is to use the ufunc reduceat method. This lets you apply ufunc reduction methods to sequential blocks of an array. There are established ufuncs like np.add that take advantage of this. argsort is not such a function. But there is a way of constructing a ufunc from a Python function, and gaining some modest speed over regular Python iteration. [I need to look up a recent SO question that illustrates this.]
I'll illustrate some of this with a simpler function, sum over rows.
If A2 is a csr matrix.
A2.sum(axis=1) # the fastest compiled csr method
A2.A.sum(axis=1) # same, but with a dense intermediary
[np.sum(l.data) for l in A2] # iterate over the rows of A2
[np.sum(A2.getrow(i).data) for i in range(A2.shape[0])] # iterate with index
[np.sum(l) for l in A2.tolil().data] # sum the sublists of lil format
np.add.reduceat(A2.data, A2.indptr[:-1]) # with reduceat
A2.sum(axis=1) is implemented as a matrix multiplication. That's not relevant to the sort problem, but still an interesting way of looking at the summation problem. Remember csr format was developed for efficient multiplication.
For my current sample matrix (created for another SO sparse question)
<8x47752 sparse matrix of type '<class 'numpy.float32'>'
with 32 stored elements in Compressed Sparse Row format>
some comparative times are
In [694]: timeit np.add.reduceat(A2.data, A2.indptr[:-1])
100000 loops, best of 3: 7.41 µs per loop
In [695]: timeit A2.sum(axis=1)
10000 loops, best of 3: 71.6 µs per loop
In [696]: timeit [np.sum(l) for l in A2.tolil().data]
1000 loops, best of 3: 280 µs per loop
Everything else is 1ms or more.
I suggest focusing on developing your one-row function, something like:
def max_n(row_data, row_indices, n):
    i = row_data.argsort()[-n:]
    # i = row_data.argpartition(-n)[-n:]
    top_values = row_data[i]
    top_indices = row_indices[i]  # do the sparse indices matter?
    return top_values, top_indices, i
Then see how it fits in one of these iteration methods. tolil() looks most promising.
I haven't addressed the question of how to collect these results. Should they be lists of lists, array with 10 columns, another sparse matrix with 10 values per row, etc.?
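A sketch of how max_n might be plugged into the lil iteration (my own assembly, with max_n trimmed to return just values and indices; the random matrix is only a stand-in for real data):
import numpy as np
from scipy import sparse

def max_n(row_data, row_indices, n):
    i = row_data.argsort()[-n:]
    return row_data[i], row_indices[i]

A2 = sparse.random(8, 47752, density=0.0001, format='csr', dtype=np.float32)
Al = A2.tolil()
# one (values, column indices) pair per row; short rows just return what they have
top10 = [max_n(np.asarray(d), np.asarray(r), 10) for d, r in zip(Al.data, Al.rows)]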
sorting each row of a large sparse & saving top K values & column index - Similar question from several years back, but unanswered.
Argmax of each row or column in scipy sparse matrix - Recent question seeking argmax for rows of csr. I discuss some of the same issues.
how to speed up loop in numpy? - example of how to use np.frompyfunc to create a ufunc. I don't know if the resulting function has the .reduceat method.
Increasing value of top k elements in sparse matrix - get the top k elements of csr (not by row). Case for argpartition.
The row summation implemented with np.frompyfunc:
In [741]: def foo(a,b):
     ...:     return a+b
In [742]: vfoo=np.frompyfunc(foo,2,1)
In [743]: timeit vfoo.reduceat(A2.data,A2.indptr[:-1],dtype=object).astype(float)
10000 loops, best of 3: 26.2 µs per loop
That's respectable speed. But I can't think of a way of writing a binary function (one that takes 2 arguments) that would implement argsort via reduction. So this is probably a dead end for this problem.
Just to answer the original question (for people like me who found this question looking for copy-pasta), here's a solution using multiprocessing based on @hpaulj's suggestion of converting to lil_matrix and iterating over rows:
from multiprocessing import Pool
from itertools import repeat

def _top_k(args):
    """
    Helper function to process a single row of top_k
    """
    data, row, k = args
    data, row = zip(*sorted(zip(data, row), reverse=True)[:k])
    return data, row

def top_k(m, k):
    """
    Keep only the top k elements of each row in a csr_matrix
    """
    ml = m.tolil()
    with Pool() as p:
        # pass k along with each row so the workers don't rely on a global
        ms = p.map(_top_k, zip(ml.data, ml.rows, repeat(k)))
    ml.data, ml.rows = zip(*ms)
    return ml.tocsr()
One would need to iterate over the rows and get the top indices for each row separately. But this loop can be jitted (and parallelized) to get an extremely fast function.
import numba as nb
import numpy as np

@nb.njit(cache=True)
def row_topk_csr(data, indices, indptr, K):
    m = indptr.shape[0] - 1
    max_indices = np.zeros((m, K), dtype=indices.dtype)
    max_values = np.zeros((m, K), dtype=data.dtype)
    for i in nb.prange(m):  # prange needs parallel=True in njit to actually run in parallel
        top_inds = np.argsort(data[indptr[i] : indptr[i + 1]])[::-1][:K]
        max_indices[i] = indices[indptr[i] : indptr[i + 1]][top_inds]
        max_values[i] = data[indptr[i] : indptr[i + 1]][top_inds]
    return max_indices, max_values
Call it like this:
top_pred_indices, _ = row_topk_csr(csr_mat.data, csr_mat.indices, csr_mat.indptr, K)
I need to perform this operation frequently, and this function is fast enough for me; it executes in <1s on a 1mil x 400k sparse matrix.
HTH.

Fastest way of finding the index of the closest element in a non-sorted Python list of floats

Given as input a list of floats that is not sorted, what would be the most efficient way of finding the index of the closest element to a certain value? Some potential solutions come to mind:
For:
import random
x = random.sample([float(i) for i in range(1000000)], 1000000)
1) Own function:
def min_val(lst, val):
    min_i = None
    min_dist = 1000000.0
    for i, v in enumerate(lst):
        d = abs(v - val)
        if d < min_dist:
            min_dist = d
            min_i = i
    return min_i
Result:
%timeit min_val(x, 5000.56)
100 loops, best of 3: 11.5 ms per loop
2) Min
%timeit min(range(len(x)), key=lambda i: abs(x[i]-5000.56))
100 loops, best of 3: 16.8 ms per loop
3) Numpy (including conversion)
%timeit np.abs(np.array(x)-5000.56).argmin()
100 loops, best of 3: 3.88 ms per loop
From that test, it seems that converting the list to numpy array is the best solution. However two questions come to mind:
Was that indeed a realistic comparison?
Is the numpy solution the fastest way to achieve this in Python?
Consider the partition algorithm from QuickSort. The partition algorithm rearranges a list such that the pivot element is in its final location after invocation. Based on the value of the pivot, you could then partition the portion of the array that is likely to contain the element closest to your target. Once you've either found the element you're after or have a partition of length 1 that contains your element, you're done.
The general problem you're addressing is a selection problem.
In your question you were wondering about what sort of array/list implementation to use, and that will have an impact on performance. A bigger factor will be the search algorithm as opposed to the list/array representation.
Edit in light of comment from @Andrzej
Ah, then I misunderstood your question. Strictly speaking, linear search is always O(n), so efficiency within the bounds of Big-Oh analysis is the same regardless of underlying data structure. The gotcha here is that for linear search you want a nice simple data structure to make the run-time performance as good as possible.
A Python list is an array of references to objects, while (to my understanding) a Numpy array is a contiguous block of values. The Numpy array will perform better since it doesn't have to dereference objects to get to the values.
Your comparison technique seems reasonable for Python list vs. Numpy array. I'd be reluctant to say that a Numpy array is the fastest way to solve the problem, but it should perform better than a Python list.
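If the data can live as an array up front, the conversion cost in option 3 is paid only once; here is a minimal sketch for repeated queries (my own example, with a stand-in array):
import numpy as np

x = np.random.permutation(1000000).astype(float)   # stand-in for the unsorted list

def closest_index(arr, val):
    # one pass in compiled code, no Python-level loop
    return int(np.abs(arr - val).argmin())

idx = closest_index(x, 5000.56)
print(idx, x[idx])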
