Filter numpy array of strings - python

I have a very large data set collected from Twitter. I am trying to figure out how to do the equivalent of the Python filtering below in NumPy. The environment is the Python interpreter:
>>> tweets = [['buhari si good'], ['atiku is great'], ['buhari nfd sdfa atiku'],
...           ['is nice man that buhari']]
>>> filter(lambda x: 'buhari' in x[0].lower(), tweets)
[['buhari si good'], ['buhari nfd sdfa atiku'], ['is nice man that buhari']]
I tried boolean indexing as shown below, but the resulting array came up empty:
>>> tweet_arr = np.array([['buhari si good'], ['atiku is great'], ['buhari nfd sdfa atiku'], ['is nice man that buhari']])
>>> flat_tweets = tweet_arr[:, 0]
>>> flat_tweets
array(['buhari si good', 'atiku is great', 'buhari nfd sdfa atiku',
       'is nice man that buhari'], dtype='|S23')
>>> flat_tweets['buhari' in flat_tweets]
array([], shape=(0, 4), dtype='|S23')
I would like to know how to filter strings in a NumPy array the way I was easily able to filter even numbers here:
>>> arr = np.arange(15).reshape((15, 1))
>>> arr
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14]])
>>> arr[:][arr % 2 == 0]
array([ 0,  2,  4,  6,  8, 10, 12, 14])
Thanks

If you want to stick to a solution based entirely on NumPy, you could do
from numpy.core.defchararray import find, lower
tweet_arr[find(lower(tweet_arr), 'buhari') != -1]
You mention in a comment that what you're looking for here is performance, so it should be noted that this appears to be a good deal slower than the solution you came up with yourself:
In [33]: large_arr = np.repeat(tweet_arr, 10000)
In [36]: %timeit large_arr[find(lower(large_arr), 'buhari') != -1]
54.6 ms ± 765 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [43]: %timeit list(filter(lambda x: 'buhari' in x.lower(), large_arr))
21.2 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In fact, an ordinary list comprehension beats both approaches:
In [44]: %timeit [x for x in large_arr if 'buhari' in x.lower()]
18.5 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
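As an aside, modern NumPy exposes the same string routines through the np.char namespace, so a roughly equivalent spelling (a sketch using documented functions, not re-benchmarked here) would be:
import numpy as np

# case-insensitive substring filter via np.char; same semantics as the defchararray call above
mask = np.char.find(np.char.lower(flat_tweets), 'buhari') != -1
flat_tweets[mask]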

NumPy maxima of groups defined by a label array

I have two arrays: one is a list of values and the other is a list of IDs corresponding to each value. Some IDs have multiple values. I want to create a new array containing the maximum value recorded for each ID; it will have a length equal to the number of unique IDs.
Example using a for loop:
import numpy as np

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

uniq_ids = np.unique(ids)
maximums = np.ones_like(uniq_ids) * np.nan
for i, id in enumerate(uniq_ids):
    maximums[i] = np.max(values[np.where(ids == id)])

print(uniq_ids)
print(maximums)

[0 1 3 5 6]
[5. 3. 6. 8. 8.]
Is it possible to vectorize this so it runs fast? I'm imagining a one-liner that can create the "maximums" array using only NumPy functions, but I haven't been able to come up with anything that works.
Use np.lexsort to sort both lists simultaneously:
idx = np.lexsort([values, ids])
The indices of the last occurrence of each ID are then given by
last = np.r_[np.flatnonzero(np.diff(ids[idx])), len(ids) - 1]
You can use this to get the maxima of each group:
values[idx[last]]
This is the same as values[idx][last], but faster because you only need to extract len(last) elements this way, instead of rearranging the whole array and then extracting.
Keep in mind that np.unique basically does the sort and flatnonzero steps internally when you pass return_index=True.
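Putting those steps together, here is a minimal self-contained sketch using the sample arrays from the question (the group IDs come back alongside the maxima):
import numpy as np

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

idx = np.lexsort([values, ids])                                 # sort by id, then by value within each id
last = np.r_[np.flatnonzero(np.diff(ids[idx])), len(ids) - 1]   # index of the last element of each group
print(ids[idx[last]])     # [0 1 3 5 6]
print(values[idx[last]])  # [5 3 6 8 8]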
Here's a collection of all the solutions so far, along with a benchmark. I also include a modified version of @bb1's pandas recipe, implemented below.
import numpy as np
from pandas import DataFrame, Series

def Shannon(values, ids):
    uniq_ids = np.unique(ids)
    maxima = np.ones_like(uniq_ids) * np.nan
    for i, id in enumerate(uniq_ids):
        maxima[i] = np.max(values[np.where(ids == id)])
    return maxima

def richardec(values, ids):
    return [a.max() for a in np.split(values, np.arange(1, ids.shape[0])[np.diff(ids) != 0])]

def MadPhysicist(values, ids):
    idx = np.lexsort([values, ids])
    return values[idx[np.r_[np.flatnonzero(np.diff(ids[idx])), len(ids) - 1]]]

def PeptideWitch(values, ids):
    return np.vectorize(lambda id: np.max(values[np.where(ids == id)]))(np.unique(ids))

def mathfux(values, ids):
    idx = np.argsort(ids)
    return np.maximum.reduceat(values[idx], np.r_[0, np.flatnonzero(np.diff(ids[idx])) + 1])

def bb1(values, ids):
    return DataFrame({'ids': ids, 'vals': values}).groupby('ids')['vals'].max().to_numpy()

def bb1_modified(values, ids):
    return Series(values).groupby(ids).max().to_numpy()
values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])
%timeit Shannon(values, ids)
42.1 µs ± 561 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit richardec(values, ids)
27.7 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit MadPhysicist(values, ids)
19.3 µs ± 268 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit PeptideWitch(values, ids)
55.9 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit mathfux(values, ids)
20.9 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bb1(values, ids)
865 µs ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bb1_modified(values, ids)
537 µs ± 3.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
values = np.random.randint(10000, size=10000)
ids = np.random.randint(100, size=10000)
%timeit Shannon(values, ids)
1.76 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit richardec(values, ids)
29.1 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit MadPhysicist(values, ids)
904 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit PeptideWitch(values, ids)
1.74 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit mathfux(values, ids)
372 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bb1(values, ids)
964 µs ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bb1_modified(values, ids)
679 µs ± 7.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
@mathfux's suggestion of using np.maximum.reduceat is by far the fastest all around, followed by the modified pandas solution and then lexsorting. The pandas solution is likely doing something very similar to the reduceat one internally, but with a lot more overhead; you can see that the incremental change for the modified pandas solution is small for that reason.
For large arrays (which are the only ones that matter here), finding the maxima individually with np.vectorize is comparable to the loop in the question, while richardec's comprehension is much slower, despite the fact that it does not sort the input values by ID first.
np.lexsort sorts by multiple keys, but that is not actually required here. You can sort ids first and then take the maximum of each resulting group using np.maximum.reduceat:
def mathfux(values, ids, return_groups=False):
    argidx = np.argsort(ids)  # ~70% of the time
    ids_sort, values_sort = ids[argidx], values[argidx]  # ~4% of the time
    div_points = np.r_[0, np.flatnonzero(np.diff(ids_sort)) + 1]  # ~11% of the time (mostly np.flatnonzero)
    if return_groups:
        return ids_sort[div_points], np.maximum.reduceat(values_sort, div_points)
    else:
        return np.maximum.reduceat(values_sort, div_points)
>>> mathfux(values, ids, return_groups=True)
(array([0, 1, 3, 5, 6]), array([5, 3, 6, 8, 8]))
>>> mathfux(values, ids)
array([5, 3, 6, 8, 8])
Usually, some parts of NumPy code can be optimised further with numba. Note that np.argsort is the bottleneck in the majority of groupby problems and can't be replaced by any other method; it is unlikely to be improved soon in either numba or NumPy. So you are close to optimal performance here and there isn't much room for further optimisation.
Here's a solution which, although not 100% vectorized, takes (per my benchmarks) about half the time that yours does on your sample data. The performance improvement probably becomes more drastic with more data:
maximums = [a.max() for a in np.split(values, np.arange(1, ids.shape[0])[(np.diff(ids) != 0)])]
Output:
>>> maximums
[5, 3, 6, 8, 8]
In trying to visualize the problem:
In [82]: [np.where(ids==id) for id in uniq_ids]
Out[82]:
[(array([0]),),
 (array([1]),),
 (array([2, 3, 4, 5]),),
 (array([6]),),
 (array([7, 8, 9]),)]
unique can also return:
In [83]: np.unique(ids, return_inverse=True)
Out[83]: (array([0, 1, 3, 5, 6]), array([0, 1, 2, 2, 2, 2, 3, 4, 4, 4]))
which is a variant of what richardec produced:
In [88]: [a for a in np.split(ids, np.arange(1, ids.shape[0])[(np.diff(ids) != 0)])]
Out[88]: [array([0]), array([1]), array([3, 3, 3, 3]), array([5]), array([6, 6, 6])]
That inverse is also what nonzero produces when you do all of the == comparisons at once:
In [90]: ids[:,None] == uniq_ids
Out[90]:
array([[ True, False, False, False, False],
       [False,  True, False, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False, False,  True, False],
       [False, False, False, False,  True],
       [False, False, False, False,  True],
       [False, False, False, False,  True]])
In [91]: np.nonzero(ids[:,None] == uniq_ids)
Out[91]: (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 2, 2, 2, 3, 4, 4, 4]))
I'm still thinking through this ...
EDIT: I'll leave this up as an example of why we can't always use np.vectorize() to make everything magically faster:
One solution is to use numpy's vectorize function:
import numpy as np

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

def my_func(id):
    return np.max(values[np.where(ids == id)])

vector_func = np.vectorize(my_func)
maximums = vector_func(np.unique(ids))
which returns
array([5, 3, 6, 8, 8])
But as for speed, your version has about the same performance when we use
import random

values = np.array([random.randint(1, 100) for i in range(1000000)])
ids = []
for i in range(100000):
    r = random.randint(1, 4)
    if r == 3:
        for x in range(3):
            ids.append(i)
    elif r == 2:
        for x in range(4):
            ids.append(i)
    else:
        ids.append(i)
ids = np.array(ids)
It's about 12 seconds per execution.
With pandas:
import pandas as pd
def with_pandas(ids, vals):
    df = pd.DataFrame({'ids': ids, 'vals': vals})
    return df.groupby('ids')['vals'].max().to_numpy()
Timing:
import numpy as np
values = np.random.randint(10000, size=10000)
ids = np.random.randint(100, size=10000)
%timeit with_pandas(ids, values)
692 µs ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Jumping Multi-element slices in Numpy Arrays

So say I have an array:
arr = np.arange(12)
And at the end I want this array:
arr2 = [0,1,2,6,7,8]
So I want a jumping multiple-element slice, something like:
arr2 = arr[(0:2):-1:6]
where arr2 is built from slices of three elements that jump by 6 every time.
Is this possible in numpy?
My actual use case is more complex: one piece of math is applied to the (0:2) slice that jumps by 6, and different math is applied to the (3:5) slice, with the goal of writing it in one line, i.e. without a for-loop.
Sorry if this question has been asked before. I'm having trouble finding documentation on this and I think I might just be googling the wrong thing. Thanks!
You can't do this with slice notation, at least not directly.
But with some reshaping:
In [74]: arr = np.arange(12)
In [75]: arr.reshape(-1,3)
Out[75]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [76]: arr.reshape(-1,3)[::2,:]
Out[76]:
array([[0, 1, 2],
       [6, 7, 8]])
In [77]: _.reshape(-1)
Out[77]: array([0, 1, 2, 6, 7, 8])
Slicing and reshaping individually produce views, but at some point in this chain a copy has to be made. So the advantage over the advanced indexing that Divakar suggests is, at best, modest:
In [86]: timeit arr.reshape(-1,3)[::2,:].reshape(-1)
3.99 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [87]: timeit arr[(np.arange(len(arr))%6)<3]
8.91 µs ± 89.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
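A minimal sketch of a general helper built on the same reshape idea (jumping_slice is a name I'm introducing here, not from the original answer; it assumes len(arr) is a multiple of period):
import numpy as np

def jumping_slice(arr, keep, period):
    """Keep the first `keep` elements out of every `period` elements."""
    return arr.reshape(-1, period)[:, :keep].reshape(-1)

arr = np.arange(12)
print(jumping_slice(arr, keep=3, period=6))  # [0 1 2 6 7 8]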

Get indexes of chosen array elements in the order of these elements from a different array [duplicate]

I have two numpy arrays, A and B. A contains unique values and B is a sub-array of A.
Now I am looking for a way to get the index of B's values within A.
For example:
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
# I need a function fun() that:
fun(A,B)
>> 0,6,9
You can use np.in1d with np.nonzero -
np.nonzero(np.in1d(A,B))[0]
You can also use np.searchsorted, if you care about maintaining the order -
np.searchsorted(A,B)
For a generic case, when A & B are unsorted arrays, you can bring in the sorter option in np.searchsorted, like so -
sort_idx = A.argsort()
out = sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
I would also add my favorite broadcasting approach to the mix to solve a generic case -
np.nonzero(B[:,None] == A)[1]
Sample run -
In [125]: A
Out[125]: array([ 7, 5, 1, 6, 10, 9, 8])
In [126]: B
Out[126]: array([ 1, 10, 7])
In [127]: sort_idx = A.argsort()
In [128]: sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
Out[128]: array([2, 4, 0])
In [129]: np.nonzero(B[:,None] == A)[1]
Out[129]: array([2, 4, 0])
Have you tried searchsorted?
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
A.searchsorted(B)
# array([0, 6, 9])
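One caveat worth adding here (my note, not part of the original answer): searchsorted assumes A is sorted, and for values of B that are absent from A it silently returns an insertion point rather than signalling a miss:
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
B = np.array([1, 7, 10])

print(A.searchsorted(B))    # [0 6 9] -- works because A is sorted and every B value is in A
print(A.searchsorted([0]))  # [0] -- 0 is not in A, but you still get an index back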
Just for completeness: if the values in A are non-negative and reasonably small:
lookup = np.empty((np.max(A) + 1), dtype=int)
lookup[A] = np.arange(len(A))
indices = lookup[B]
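For the sample arrays above, this lookup-table approach gives the expected result; a quick sketch:
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
B = np.array([1, 7, 10])

lookup = np.empty(np.max(A) + 1, dtype=int)  # one slot per possible value in A
lookup[A] = np.arange(len(A))                # value -> position in A
print(lookup[B])                             # [0 6 9]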
I had the same question recently. However, timing performance is very critical for me, so a timing comparison of the different solutions may be useful for others.
As Divakar mentioned, you can use np.in1d(A, B) with np.where or np.nonzero. Moreover, you can combine np.in1d(A, B) with np.intersect1d (based on this page). Also, np.searchsorted is another useful approach for sorted arrays.
I want to add another simple solution: a list comprehension. It may take longer than the previous ones; however, if you take advantage of the Numba package, it is much less time-consuming.
In [1]: import numpy as np
In [2]: from numba import njit
In [3]: a = np.array([1,2,3,4,5,6,7,8,9,10])
In [4]: b = np.array([1,7,10])
In [5]: np.where(np.in1d(a, b))[0]
Out[5]: array([0, 6, 9])
In [6]: np.nonzero(np.in1d(a, b))[0]
Out[6]: array([0, 6, 9])
In [7]: np.searchsorted(a, b)
Out[7]: array([0, 6, 9])
In [8]: np.searchsorted(a, np.intersect1d(a, b))
Out[8]: array([0, 6, 9])
In [9]: [i for i, x in enumerate(a) if x in b]
Out[9]: [0, 6, 9]
In [10]: @njit
    ...: def func(a, b):
    ...:     return [i for i, x in enumerate(a) if x in b]
In [11]: func(a, b)
Out[11]: [0, 6, 9]
Now, let's compare the timing performance of these solutions.
In [12]: %timeit np.where(np.in1d(a, b))[0]
4.26 µs ± 6.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit np.nonzero(np.in1d(a, b))[0]
4.39 µs ± 14.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit np.searchsorted(a, b)
800 ns ± 6.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [15]: %timeit np.searchsorted(a, np.intersect1d(a, b))
8.8 µs ± 73.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [16]: %timeit [i for i, x in enumerate(a) if x in b]
15.4 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [17]: %timeit func(a, b)
336 ns ± 0.579 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

How do I sort a 2D numpy array in this specific way

I realize there are quite a number of 'how to sort numpy array'-questions on here already. But I could not find how to do it in this specific way.
I have an array similar to this:
array([[1, 0, 1],
       [0, 0, 1],
       [1, 1, 1],
       [1, 1, 0]])
I want to sort the rows, keeping the order within the rows the same. So I expect the following output:
array([[0, 0, 1],
       [1, 0, 1],
       [1, 1, 0],
       [1, 1, 1]])
You can use dot and argsort:
a[a.dot(2**np.arange(a.shape[1])[::-1]).argsort()]
# array([[0, 0, 1],
#        [1, 0, 1],
#        [1, 1, 0],
#        [1, 1, 1]])
The idea is to convert the rows into integers.
a.dot(2**np.arange(a.shape[1])[::-1])
# array([5, 1, 7, 6])
Then, find the sorted indices and use that to reorder a:
a.dot(2**np.arange(a.shape[1])[::-1]).argsort()
# array([1, 0, 3, 2])
My tests show this is slightly faster than lexsort.
a = a.repeat(1000, axis=0)
%timeit a[np.lexsort(a.T[::-1])]
%timeit a[a.dot(2**np.arange(a.shape[1])[::-1]).argsort()]
230 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
192 µs ± 4.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Verify correctness:
np.array_equal(a[a.dot(2**np.arange(a.shape[1])[::-1]).argsort()],
               a[np.lexsort(a.T[::-1])])
# True
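One caveat (my addition, not part of the original answer): the dot trick packs each row of 0s and 1s into a single binary number, so it only applies to binary rows and needs 2**a.shape[1] to fit in the integer dtype; for general rows, np.lexsort remains the safe choice. A small sketch with the array from the question:
import numpy as np

a = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 1, 1],
              [1, 1, 0]])

keys = a.dot(2 ** np.arange(a.shape[1])[::-1])  # rows packed as integers: [5 1 7 6]
print(a[keys.argsort()])                        # rows sorted: [[0 0 1] [1 0 1] [1 1 0] [1 1 1]]
print(a[np.lexsort(a.T[::-1])])                 # general-purpose alternative, same result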

Index a NumPy array row-wise [duplicate]

This question already has answers here:
Indexing one array by another in numpy
(4 answers)
Closed 4 years ago.
Say I have a NumPy array:
>>> X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> X
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
and an array of indexes that I want to select for each row:
>>> ixs = np.array([[1, 3], [0, 1], [1, 2]])
>>> ixs
array([[1, 3],
       [0, 1],
       [1, 2]])
How do I index the array X so that for every row in X I select the two indices specified in ixs?
So for this case, I want to select element 1 and 3 for the first row, element 0 and 1 for the second row, and so on. The output should be:
array([[ 2,  4],
       [ 5,  6],
       [10, 11]])
A slow solution would be something like this:
output = np.array([row[ix] for row, ix in zip(X, ixs)])
however this can get kinda slow for extremely long arrays. Is there a faster way to do this without a loop using NumPy?
EDIT: Some very approximate speed tests on a 2.5K * 1M array with 2K wide ixs (10GB):
np.array([row[ix] for row, ix in zip(X, ixs)]) 0.16s
X[np.arange(len(ixs)), ixs.T].T 0.175s
X.take(idx+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None]) 33s
np.fromiter((X[i, j] for i, row in enumerate(ixs) for j in row), dtype=X.dtype).reshape(ixs.shape) 2.4s
You can use this:
X[np.arange(len(ixs)), ixs.T].T
Here is the reference for complex indexing.
I believe you can use .take thusly:
In [185]: X
Out[185]:
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
In [186]: idx
Out[186]:
array([[1, 3],
       [0, 1],
       [1, 2]])
In [187]: X.take(idx + (np.arange(X.shape[0]) * X.shape[1]).reshape(-1, 1))
Out[187]:
array([[ 2,  4],
       [ 5,  6],
       [10, 11]])
If your array dimensions are massive, it might be faster, albeit uglier, to do:
idx+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None]
Just for fun, see how the following performs:
np.fromiter((X[i, j] for i, row in enumerate(ixs) for j in row), dtype=X.dtype, count=ixs.size).reshape(ixs.shape)
Edit to add timings
In [15]: X = np.arange(1000*10000, dtype=np.int32).reshape(1000,-1)
In [16]: ixs = np.random.randint(0, 10000, (1000, 2))
In [17]: ixs.sort(axis=1)
In [18]: ixs
Out[18]:
array([[2738, 3511],
       [3600, 7414],
       [7426, 9851],
       ...,
       [1654, 8252],
       [2194, 8200],
       [5497, 8900]])
In [19]: %timeit np.array([row[ix] for row, ix in zip(X, ixs)])
928 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [20]: %timeit X[np.arange(len(ixs)), ixs.T].T
23.6 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [21]: %timeit X.take(idx+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None])
20.6 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [22]: %timeit np.fromiter((X[i, j] for i, row in enumerate(ixs) for j in row), dtype=X.dtype, count=ixs.size).reshape(ixs.shape)
1.42 ms ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
@mxbi I've added some timings and my results aren't really consistent with yours; you should check it out.
Here's a larger array:
In [33]: X = np.arange(10000*100000, dtype=np.int32).reshape(10000,-1)
In [34]: ixs = np.random.randint(0, 100000, (10000, 2))
In [35]: ixs.sort(axis=1)
In [36]: X.shape
Out[36]: (10000, 100000)
In [37]: ixs.shape
Out[37]: (10000, 2)
With some results:
In [42]: %timeit np.array([row[ix] for row, ix in zip(X, ixs)])
11.4 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [43]: %timeit X[np.arange(len(ixs)), ixs.T].T
596 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [44]: %timeit X.take(ixs+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None])
540 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Now, using 500 column indices instead of two, we see the list comprehension start to win out:
In [45]: ixs = np.random.randint(0, 100000, (10000, 500))
In [46]: ixs.sort(axis=1)
In [47]: %timeit np.array([row[ix] for row, ix in zip(X, ixs)])
93 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [48]: %timeit X[np.arange(len(ixs)), ixs.T].T
133 ms ± 638 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [49]: %timeit X.take(ixs+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None])
87.5 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The usual suggestion for indexing items from rows is:
X[np.arange(X.shape[0])[:,None], ixs]
That is, make a row index of shape (n,1) (column vector), which will broadcast with the (n,m) shape of ixs to give a (n,m) solution.
This is basically the same as:
X[np.arange(len(ixs)), ixs.T].T
which broadcasts a (n,) index against a (m,n), and transposes.
Timings are essentially the same:
In [299]: X = np.ones((1000,2000))
In [300]: ixs = np.random.randint(0,2000,(1000,200))
In [301]: timeit X[np.arange(len(ixs)), ixs.T].T
6.58 ms ± 71.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [302]: timeit X[np.arange(X.shape[0])[:,None], ixs]
6.57 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and for comparison:
In [307]: timeit np.array([row[ix] for row, ix in zip(X, ixs)])
6.63 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I'm a little surprised that this list comprehension does so well. I wonder how the relative advantages compare when the dimensions change, particularly in the relative shape of X and ixs (long, wide etc).
The first solution is the style of indexing produced by ix_:
In [303]: np.ix_(np.arange(3), np.arange(2))
Out[303]:
(array([[0],
        [1],
        [2]]),
 array([[0, 1]]))
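For completeness (my addition): newer NumPy versions (1.15+) also provide np.take_along_axis, which expresses this row-wise gather directly; a quick sketch with the arrays from the question:
import numpy as np

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
ixs = np.array([[1, 3], [0, 1], [1, 2]])

print(np.take_along_axis(X, ixs, axis=1))
# [[ 2  4]
#  [ 5  6]
#  [10 11]]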
This should work
[X[i][y] for i, y in enumerate(ixs)]
EDIT: I just noticed you wanted a no-loop solution.
