Looping and searching in a NumPy array

I need to loop over a NumPy array and then do the following search. The loop below takes almost 60 seconds for arrays (npArray1 and npArray2 in the example) with around 300K values each.
In other words, I am looking for the index of the first occurrence in npArray2
of every value in npArray1.
for id in np.nditer(npArray1):
    newId = (np.where(npArray2 == id))[0][0]
Is there any way I can make the above faster using NumPy? I need to run the script above on much bigger arrays (50M values). Please note that my two NumPy arrays in the lines above, npArray1 and npArray2, are not necessarily the same size, but they are both 1D.
Thanks a lot for your help,

The function np.unique will do much of the work for you:
npArray2 = np.random.randint(100, None, (1000,))  # 1000-long vector of ints in [0, 100), so lots of repeats
vals, idxs = np.unique(npArray2, return_index=True)  # each unique value AND the index of its first appearance
for val in npArray1:
    newId = idxs[vals == val][0]
vals is an array containing the unique values in npArray2, while idxs gives the index of the first appearance of each value in npArray2. Searching in vals should be much faster than in npArray2 because it is smaller.
You can speed up the search further by taking advantage of the fact that vals is sorted:
import bisect  # we can use binary search since vals is sorted
for val in npArray1:
    newId = idxs[bisect.bisect_left(vals, val)]
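Since vals is sorted, the per-value lookup can also be vectorized with np.searchsorted instead of a Python-level loop. A minimal sketch (my addition, not part of the original answer), assuming every value of npArray1 actually occurs in npArray2:
import numpy as np
# vals, idxs as returned by np.unique(npArray2, return_index=True)
pos = np.searchsorted(vals, npArray1)  # position of each npArray1 value within the sorted vals
newIds = idxs[pos]                     # index of its first occurrence in npArray2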

Assuming the input arrays contain unique values, you can use np.searchsorted with its optional sorter argument for a vectorized solution, like so -
arr2_sortidx = npArray2.argsort()
idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
out1 = arr2_sortidx[idx]
Sample run to verify output -
In [154]: npArray1
Out[154]: array([77, 19, 0, 69])
In [155]: npArray2
Out[155]: array([ 8, 33, 12, 19, 77, 30, 81, 69, 20, 0])
In [156]: out = np.empty(npArray1.size,dtype=int)
...: for i,id in np.ndenumerate(npArray1):
...:     out[i] = (np.where(npArray2==id))[0][0]
...:
In [157]: arr2_sortidx = npArray2.argsort()
...: idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
...: out1 = arr2_sortidx[idx]
...:
In [158]: out
Out[158]: array([4, 3, 9, 7])
In [159]: out1
Out[159]: array([4, 3, 9, 7])
Runtime test -
In [175]: def original_app(npArray1,npArray2):
...:     out = np.empty(npArray1.size,dtype=int)
...:     for i,id in np.ndenumerate(npArray1):
...:         out[i] = (np.where(npArray2==id))[0][0]
...:     return out
...:
...: def searchsorted_app(npArray1,npArray2):
...:     arr2_sortidx = npArray2.argsort()
...:     idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
...:     return arr2_sortidx[idx]
...:
In [176]: # Setup inputs
...: M,N = 50000,40000 # npArray2 and npArray1 sizes respectively
...: maxn = 200000
...: npArray2 = np.unique(np.random.randint(0,maxn,(M)))
...: npArray2 = npArray2[np.random.permutation(npArray2.size)]
...: npArray1 = npArray2[np.random.permutation(npArray2.size)[:N]]
...:
In [177]: out1 = original_app(npArray1,npArray2)
In [178]: out2 = searchsorted_app(npArray1,npArray2)
In [179]: np.allclose(out1,out2)
Out[179]: True
In [180]: %timeit original_app(npArray1,npArray2)
1 loops, best of 3: 3.14 s per loop
In [181]: %timeit searchsorted_app(npArray1,npArray2)
100 loops, best of 3: 17.4 ms per loop
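One caveat worth noting: np.searchsorted returns an insertion point even for values of npArray1 that never occur in npArray2, so out1 would silently contain a wrong index in that case. A small optional sanity check (my addition, not part of the original answer):
idx_clipped = np.minimum(idx, npArray2.size - 1)          # positions equal to npArray2.size would be out of bounds
found = npArray2[arr2_sortidx[idx_clipped]] == npArray1   # True where the looked-up value really matches
assert found.all(), "some values of npArray1 are missing from npArray2"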

In the task you specified you have to iterate over the array one way or another, so you can still get a considerable performance improvement without changing your algorithm too much. This is where numba might be of great help:
import numpy as np
from numba import jit

@jit(nopython=True)
def numba_iter(npa1, npa2):
    out = np.empty(npa1.size, dtype=np.int64)
    for i in range(npa1.size):
        # explicit loops (rather than np.nditer/np.where) so the function compiles in nopython mode
        for j in range(npa2.size):
            if npa2[j] == npa1[i]:
                out[i] = j
                break
    return out
This simple approach might make your program much faster. Look at the numba documentation for further examples and benchmarks.
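A usage sketch (my addition, assuming numba_iter returns the collected indices as written above); the first call pays the JIT compilation cost, later calls run the compiled machine code:
out = numba_iter(npArray1, npArray2)   # first call compiles, subsequent calls are fast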

Related

numpy: ndarray index with repeated values [duplicate]

I have an unsorted array of indexes:
i = np.array([1,5,2,6,4,3,6,7,4,3,2])
I also have an array of values of the same length:
v = np.array([2,5,2,3,4,1,2,1,6,4,2])
And I have an array of zeros of the desired size:
d = np.zeros(10)
Now I want to add the values of v to the elements of d at the indices given by i.
If I did it in plain Python I would do it like this:
for index, value in enumerate(v):
    idx = i[index]
    d[idx] += v[index]
It is ugly and inefficient. How can I change it?
np.add.at(d, i, v)
You'd think d[i] += v would work, but if you try to do multiple additions to the same cell that way, one of them overrides the others. The ufunc.at method avoids those problems.
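A small illustrative sketch (my addition), using the arrays from the question, showing the difference between buffered fancy-index addition and np.add.at:
import numpy as np

i = np.array([1,5,2,6,4,3,6,7,4,3,2])
v = np.array([2,5,2,3,4,1,2,1,6,4,2])

d_naive = np.zeros(10)
d_naive[i] += v              # repeated indices: only one addition per index survives

d_correct = np.zeros(10)
np.add.at(d_correct, i, v)   # repeated indices are accumulated
# d_correct -> [ 0.  2.  4.  5. 10.  5.  5.  1.  0.  0.]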
We can use np.bincount which is supposedly pretty efficient for such accumulative weighted counting, so here's one with that -
counts = np.bincount(i,v)
d[:counts.size] = counts
Alternatively, using the minlength input argument, for the generic case where d could be any existing array that we want to add into -
d += np.bincount(i,v,minlength=d.size).astype(d.dtype, copy=False)
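As a quick check (my addition), the minlength variant reproduces the loop's result on the question's data (i and v as in the question):
d = np.zeros(10)
d += np.bincount(i, v, minlength=d.size).astype(d.dtype, copy=False)
# d -> [ 0.  2.  4.  5. 10.  5.  5.  1.  0.  0.]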
Runtime tests
This section compares the np.add.at based approach listed in the other post with the np.bincount based one listed earlier in this post.
In [61]: def bincount_based(d,i,v):
...:     counts = np.bincount(i,v)
...:     d[:counts.size] = counts
...:
...: def add_at_based(d,i,v):
...:     np.add.at(d, i, v)
...:
In [62]: # Inputs (random numbers)
...: N = 10000
...: i = np.random.randint(0,1000,(N))
...: v = np.random.randint(0,1000,(N))
...:
...: # Setup output arrays for two approaches
...: M = 12000
...: d1 = np.zeros(M)
...: d2 = np.zeros(M)
...:
In [63]: bincount_based(d1,i,v) # Run approaches
...: add_at_based(d2,i,v)
...:
In [64]: np.allclose(d1,d2) # Verify outputs
Out[64]: True
In [67]: # Setup output arrays for two approaches again for timing
...: M = 12000
...: d1 = np.zeros(M)
...: d2 = np.zeros(M)
...:
In [68]: %timeit add_at_based(d2,i,v)
1000 loops, best of 3: 1.83 ms per loop
In [69]: %timeit bincount_based(d1,i,v)
10000 loops, best of 3: 52.7 µs per loop


Cannot understand numpy argpartition output

I am trying to use argpartition from NumPy, but it seems something is going wrong and I cannot figure it out. Here is what's happening:
These are the first 5 elements of the sorted array norms:
np.sort(norms)[:5]
array([ 53.64759445, 54.91434479, 60.11617279, 64.09630585, 64.75318909], dtype=float32)
But when I use indices_sorted = np.argpartition(norms, 5)[:5]
norms[indices_sorted]
array([ 60.11617279, 64.09630585, 53.64759445, 54.91434479, 64.75318909], dtype=float32)
But I would expect to get the same result as the sorted array, wouldn't I?
It works just fine when I use 3 as the parameter indices_sorted = np.argpartition(norms, 3)[:3]
norms[indices_sorted]
array([ 53.64759445, 54.91434479, 60.11617279], dtype=float32)
This isn't making much sense to me; I'm hoping someone can offer some insight.
EDIT: Rephrasing this question as whether argpartition preserves order of the k partitioned elements makes more sense.
We need to use a list (or range) of indices that are to be kept in sorted order, instead of feeding the kth param as a scalar. Thus, to maintain the sorted order across the first 5 elements, instead of np.argpartition(a,5)[:5], simply do -
np.argpartition(a,range(5))[:5]
Here's a sample run to make things clear -
In [84]: a = np.random.rand(10)
In [85]: a
Out[85]:
array([ 0.85017222, 0.19406266, 0.7879974 , 0.40444978, 0.46057793,
0.51428578, 0.03419694, 0.47708 , 0.73924536, 0.14437159])
In [86]: a[np.argpartition(a,5)[:5]]
Out[86]: array([ 0.19406266, 0.14437159, 0.03419694, 0.40444978, 0.46057793])
In [87]: a[np.argpartition(a,range(5))[:5]]
Out[87]: array([ 0.03419694, 0.14437159, 0.19406266, 0.40444978, 0.46057793])
Please note that argpartition makes sense performance-wise only if we are looking to get sorted indices for a small subset of elements, say k elements, where k is a small fraction of the total number of elements.
Let's use a bigger dataset and try to get sorted indices for all elements to make the above-mentioned point clear -
In [51]: a = np.random.rand(10000)*100
In [52]: %timeit np.argpartition(a,range(a.size-1))[:5]
10 loops, best of 3: 105 ms per loop
In [53]: %timeit a.argsort()
1000 loops, best of 3: 893 µs per loop
Thus, to sort all elements, np.argpartition isn't the way to go.
Now, let's say I want to get sorted indices for only the 5 smallest elements of that big dataset and also keep them in order -
In [68]: a = np.random.rand(10000)*100
In [69]: np.argpartition(a,range(5))[:5]
Out[69]: array([1647, 942, 2167, 1371, 2571])
In [70]: a.argsort()[:5]
Out[70]: array([1647, 942, 2167, 1371, 2571])
In [71]: %timeit np.argpartition(a,range(5))[:5]
10000 loops, best of 3: 112 µs per loop
In [72]: %timeit a.argsort()[:5]
1000 loops, best of 3: 888 µs per loop
Very useful here!
Given the task of indirectly sorting a subset (the top k, top meaning first in sort order), there are two built-in solutions, argsort and argpartition, cf. @Divakar's answer.
If, however, performance is a consideration then it may (depending on the sizes of the data and the subset of interest) be well worth resisting the "lure of the one-liner", investing one more line and applying argsort on the output of argpartition:
>>> import numpy as np
>>> import timeit
>>> def top_k_sort(a, k):
...     return np.argsort(a)[:k]
...
>>> def top_k_argp(a, k):
...     return np.argpartition(a, range(k))[:k]
...
>>> def top_k_hybrid(a, k):
...     b = np.argpartition(a, k)[:k]
...     return b[np.argsort(a[b])]
...
>>> k = 100
>>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_sort, 'rng': np.random.random, 'k': k})
8.348663672804832
>>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_argp, 'rng': np.random.random, 'k': k})
9.869098862167448
>>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_hybrid, 'rng': np.random.random, 'k': k})
1.2305558240041137
argsort is O(n log n), argpartition with a range argument appears to be O(nk), and argpartition + argsort is O(n + k log k).
Therefore, in the interesting regime n >> k >> 1, the hybrid method is expected to be fastest.
UPDATE: ND version:
import numpy as np
from timeit import timeit

def top_k_sort(A, k, axis=-1):
    return A.argsort(axis=axis)[(*axis%A.ndim*(slice(None),), slice(k))]

def top_k_partition(A, k, axis=-1):
    return A.argpartition(range(k), axis=axis)[(*axis%A.ndim*(slice(None),), slice(k))]

def top_k_hybrid(A, k, axis=-1):
    B = A.argpartition(k, axis=axis)[(*axis%A.ndim*(slice(None),), slice(k))]
    return np.take_along_axis(B, np.take_along_axis(A, B, axis).argsort(axis), axis)

A = np.random.random((100, 10000))
k = 100

for f in globals().copy():
    if f.startswith("top_"):
        print(f, timeit(f"{f}(A,k)", globals=globals(), number=10)*100)
Sample run:
top_k_sort 63.72379460372031
top_k_partition 99.30561298970133
top_k_hybrid 10.714635509066284
Let's describe the partition method in a simplified way, which helps a lot in understanding argpartition.
Suppose B = numpy.partition(A, 3) is the partitioned version of A: the element at index 3 is in its final sorted position, every smaller element is to its left and every larger element to its right. If we execute C = numpy.argpartition(A, 3), C will be the array giving the position of every element of B within the original array A. That is, with
Idx(z) = index of element z in array A
C would be
C = [ Idx(B[0]), Idx(B[1]), Idx(B[2]), Idx(B[3]), ..., Idx(B[N]) ]
As previously mentioned, this method is very helpful and comes in very handy when you have a huge array and you are only interested in a selected group of ordered elements, not the whole array.
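A small self-contained sketch (my addition) of that relationship, on illustrative data:
import numpy as np

A = np.array([9, 1, 7, 3, 8, 2, 5])
B = np.partition(A, 3)        # B[3] is the 4th-smallest value; smaller values sit before it, larger after
C = np.argpartition(A, 3)     # C gives, for each slot of that layout, the position of the element in A
assert A[C[3]] == B[3] == np.sort(A)[3]
assert (A[C[:3]] <= A[C[3]]).all() and (A[C[4:]] >= A[C[3]]).all()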

Python numpy nonzero cumsum

I want to do a nonzero cumsum with a NumPy array: simply skip the zeros in the array and apply cumsum to the rest. Suppose I have the array
a = np.array([1,2,1,2,5,0,9,6,0,2,3,0])
my result should be
[1,3,4,6,11,0,20,26,0,28,31,0]
I have tried this
a = np.cumsum(a[a!=0])
but the result is
[1,3,4,6,11,20,26,28,31]
Any ideas?
You need to mask the original array so only the non-zero elements are overwritten:
In [9]:
a = np.array([1,2,1,2,5,0,9,6,0,2,3,0])
a[a!=0] = np.cumsum(a[a!=0])
a
Out[9]:
array([ 1, 3, 4, 6, 11, 0, 20, 26, 0, 28, 31, 0])
Another method is to use np.where:
In [93]:
a = np.array([1,2,1,2,5,0,9,6,0,2,3,0])
a = np.where(a!=0,np.cumsum(a),a)
a
Out[93]:
array([ 1, 3, 4, 6, 11, 0, 20, 26, 0, 28, 31, 0])
timings
In [91]:
%%timeit
a = np.array([1,2,1,2,5,0,9,6,0,2,3,0])
a[a!=0] = np.cumsum(a[a!=0])
a
The slowest run took 4.93 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 12.6 µs per loop
In [94]:
%%timeit
a = np.array([1,2,1,2,5,0,9,6,0,2,3,0])
a = np.where(a!=0,np.cumsum(a),a)
a
The slowest run took 6.00 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 10.5 µs per loop
The above shows that np.where is marginally quicker than the first method.
To my mind, jotasi's suggestion in a comment to the OP is the most idiomatic. Here are some timings, though note that Shawn. L's answer returns a Python list, not a NumPy array, so they are not strictly comparable.
import numpy as np
def jotasi(a):
    b = np.cumsum(a)
    b[a==0] = 0
    return b

def EdChum(a):
    a[a!=0] = np.cumsum(a[a!=0])
    return a

def ShawnL(a):
    b = np.cumsum(a)
    b = [b[i] if ((i > 0 and b[i] != b[i-1]) or i==0) else 0 for i in range(len(b))]
    return b

def Ed2(a):
    return np.where(a!=0,np.cumsum(a),a)
To test, I generated a NumPy array of 1E5 integers in [0,100]. Therefore about 1% are 0. These results are from NumPy 1.9.2, Python 2.7.12, and are presented from slowest to fastest:
import timeit
a = np.random.random_integers(0,100,100000)
len(a[a==0]) #verify there are some 0's
1003
timeit.timeit("ShawnL(a)", "from __main__ import a,EdChum,ShawnL,jotasi,Ed2", number=250)
11.743098020553589
timeit.timeit("EdChum(a)", "from __main__ import a,EdChum,ShawnL,jotasi,Ed2", number=250)
0.1794271469116211
timeit.timeit("Ed2(a)", "from __main__ import a,EdChum,ShawnL,jotasi,Ed2", number=250)
0.1282949447631836
timeit.timeit("jotasi(a)", "from __main__ import a,EdChum,ShawnL,jotasi,Ed2", number=250)
0.09286999702453613
I'm a little surprised there's such a big difference between jotasi's and Ed Chum's answers - minimizing boolean operations is noticeable I guess. No surprise that a list comprehension is slow.
Just trying to simplify it:)
b=np.cumsum(a)
[b[i] if ((i > 0 and b[i] != b[i-1]) or i==0) else 0 for i in range(len(b))]

numpy: efficiently summing with index arrays

Suppose I have 2 matrices M and N (both with more than 1 column). I also have an index matrix w with 2 columns -- one for M and one for N. The indices for N are unique, but the indices for M may appear more than once. The operation I would like to perform is
for i, j in w:
    M[i] += N[j]
Is there a more efficient way to do this other than a for loop?
For completeness, in numpy >= 1.8 you can also use np.add's at method:
In [8]: m, n = np.random.rand(2, 10)
In [9]: m_idx, n_idx = np.random.randint(10, size=(2, 20))
In [10]: m0 = m.copy()
In [11]: np.add.at(m, m_idx, n[n_idx])
In [13]: m0 += np.bincount(m_idx, weights=n[n_idx], minlength=len(m))
In [14]: np.allclose(m, m0)
Out[14]: True
In [15]: %timeit np.add.at(m, m_idx, n[n_idx])
100000 loops, best of 3: 9.49 us per loop
In [16]: %timeit np.bincount(m_idx, weights=n[n_idx], minlength=len(m))
1000000 loops, best of 3: 1.54 us per loop
Aside from the obvious performance disadvantage, it has a couple of advantages:
np.bincount converts its weights to double-precision floats, while .at will operate with your array's native type. This makes it the simplest option for dealing e.g. with complex numbers.
np.bincount only adds weights together; the at method exists for all ufuncs, so you can repeatedly multiply, or logical_and, or whatever you feel like.
But for your use case, np.bincount is probably the way to go.
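Since M and N in the question have more than one column, here is a hedged sketch (my addition, with hypothetical shapes) of the row-wise case, where np.add.at accumulates whole rows of N into rows of M:
import numpy as np

M = np.zeros((5, 3))                                # 5 rows to accumulate into
N = np.arange(24, dtype=float).reshape(8, 3)        # 8 source rows
w = np.array([[0, 2], [0, 3], [1, 0], [4, 1]])      # column 0 indexes M's rows, column 1 indexes N's rows

m_ind, n_ind = w.T
np.add.at(M, m_ind, N[n_ind])                       # duplicate m_ind rows are accumulated, not overwritten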
Using also m_ind, n_ind = w.T, just do M += np.bincount(m_ind, weights=N[n_ind], minlength=len(M))
For clarity, let's define
>>> m_ind, n_ind = w.T
Then the for loop
for i, j in zip(m_ind, n_ind):
    M[i] += N[j]
updates the entries M[np.unique(m_ind)]. The values that get written to it are N[n_ind], which must be grouped by m_ind. (The fact that there's an n_ind in addition to m_ind is actually tangential to the question; you could just set N = N[n_ind].) There happens to be a SciPy class that does exactly this: scipy.sparse.csr_matrix.
Example data:
>>> m_ind, n_ind = np.array([[0, 0, 1, 1], [2, 3, 0, 1]])
>>> M = np.arange(2, 6)
>>> N = np.logspace(2, 5, 4)
The result of the for loop is that M becomes [110002 1103 4 5]. We get the same result with a csr_matrix as follows. As I said earlier, n_ind isn't relevant, so we get rid of that first.
>>> N = N[n_ind]
>>> from scipy.sparse import csr_matrix
>>> update = csr_matrix((N, m_ind, [0, len(N)])).toarray()
The CSR constructor builds a matrix with the required values at the required indices; the third part of its argument is the compressed row pointer (indptr), meaning that the values N[0:len(N)] all belong to row 0 and have column indices m_ind[0:len(N)]. Duplicates are summed:
>>> update
array([[ 110000., 1100.]])
This has shape (1, max(m_ind) + 1) and, since every index in that range occurs in m_ind here, it can be added in directly:
>>> M[np.unique(m_ind)] += update.ravel()
>>> M
array([110002, 1103, 4, 5])
