Getting the indices of several elements in a NumPy array at once - python

Is there any way to get the indices of several elements in a NumPy array at once?
E.g.
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
I would like to find the index of each element of a in b, namely: [0,1,4].
I find the solution I am using a bit verbose:
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
c = np.zeros_like(a)
for i, aa in np.ndenumerate(a):
c[i] = np.where(b == aa)[0]
print('c: {0}'.format(c))
Output:
c: [0 1 4]

You could use in1d and nonzero (or where for that matter):
>>> np.in1d(b, a).nonzero()[0]
array([0, 1, 4])
This works fine for your example arrays, but in general the array of returned indices does not honour the order of the values in a. This may be a problem depending on what you want to do next.
In that case, a much better answer is the one #Jaime gives here, using searchsorted:
>>> sorter = np.argsort(b)
>>> sorter[np.searchsorted(b, a, sorter=sorter)]
array([0, 1, 4])
This returns the indices for values as they appear in a. For instance:
a = np.array([1, 2, 4])
b = np.array([4, 2, 3, 1])
>>> sorter = np.argsort(b)
>>> sorter[np.searchsorted(b, a, sorter=sorter)]
array([3, 1, 0]) # the other method would return [0, 1, 3]

This is a simple one-liner using the numpy-indexed package (disclaimer: I am its author):
import numpy_indexed as npi
idx = npi.indices(b, a)
The implementation is fully vectorized, and it gives you control over the handling of missing values. Moreover, it works for nd-arrays as well (for instance, finding the indices of rows of a in b).

All of the solutions here recommend using a linear search. You can use np.argsort and np.searchsorted to speed things up dramatically for large arrays:
sorter = b.argsort()
i = sorter[np.searchsorted(b, a, sorter=sorter)]

For an order-agnostic solution, you can use np.flatnonzero with np.isin (v 1.13+).
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
res = np.flatnonzero(np.isin(a, b)) # NumPy v1.13+
res = np.flatnonzero(np.in1d(a, b)) # earlier versions
# array([0, 1, 2], dtype=int64)

There are a bunch of approaches for getting the index of multiple items at once mentioned in passing in answers to this related question: Is there a NumPy function to return the first index of something in an array?. The wide variety and creativity of the answers suggests there is no single best practice, so if your code above works and is easy to understand, I'd say keep it.
I personally found this approach to be both performant and easy to read: https://stackoverflow.com/a/23994923/3823857
Adapting it for your example:
import numpy as np
a = np.array([1, 2, 4])
b_list = [1, 2, 3, 10, 4]
b_array = np.array(b_list)
indices = [b_list.index(x) for x in a]
vals_at_indices = b_array[indices]
I personally like adding a little bit of error handling in case a value in a does not exist in b.
import numpy as np
a = np.array([1, 2, 4])
b_list = [1, 2, 3, 10, 4]
b_array = np.array(b_list)
b_set = set(b_list)
indices = [b_list.index(x) if x in b_set else np.nan for x in a]
vals_at_indices = b_array[indices]
For my use case, it's pretty fast, since it relies on parts of Python that are fast (list comprehensions, .index(), sets, numpy indexing). Would still love to see something that's a NumPy equivalent to VLOOKUP, or even a Pandas merge. But this seems to work for now.

Related

How to calculate the difference b/w each element in an numpy array with a shape[0]>2

Please feel free to let me know whether it is a duplicate question.
From
in_arr1 = np.array([[2,0], [-6,0], [3,0]])
How can I get:
diffInElements = [[5,0]] ?
I tried np.diff(in_arr1, axis=0) but it does not generate what I want.
is there a NumPy function I can use ?
Cheers,
You can negate and then sum all but the first value, and then add the first value:
diff = (-a[1:]).sum(axis=0) + a[0]
Output:
>>> diff
array([5, 0])
You want to subtract the remaining rows from the first row. The straightforward answer does just that:
>>> arr = np.array([[2, 1], [-6, 3], [3, -4]])
>>> arr[0, :] - arr[1:, :].sum(0)
array([5, 2])
There is also, however, a more advanced option making use of the somewhat obscure reduce() method of numpy ufuncs:
>>> np.subtract.reduce(arr, axis=0)
array([5, 2])

Appending a new row to a numpy array

I am trying to append a new row to an existing numpy array in a loop. I have tried the methods involving append, concatenate and also vstack none of them end up giving me the result I want.
I have tried the following:
for _ in col_change:
if (item + 2 < len(col_change)):
arr=[col_change[item], col_change[item + 1], col_change[item + 2]]
array=np.concatenate((array,arr),axis=0)
item+=1
I have also tried it in the most basic format and it still gives me an empty array.
array=np.array([])
newrow = [1, 2, 3]
newrow1 = [4, 5, 6]
np.concatenate((array,newrow), axis=0)
np.concatenate((array,newrow1), axis=0)
print(array)
I want the output to be [[1,2,3][4,5,6]...]
The correct way to build an array incrementally is to not start with an array:
alist = []
alist.append([1, 2, 3])
alist.append([4, 5, 6])
arr = np.array(alist)
This is essentially the same as
arr = np.array([ [1,2,3], [4,5,6] ])
the most common way of making a small (or large) sample array.
Even if you have good reason to use some version of concatenate (hstack, vstack, etc), it is better to collect the components in a list, and perform the concatante once.
If you want [[1,2,3],[4,5,6]] I could present you an alternative without append: np.arange and then reshape it:
>>> import numpy as np
>>> np.arange(1,7).reshape(2, 3)
array([[1, 2, 3],
[4, 5, 6]])
Or create a big array and fill it manually (or in a loop):
>>> array = np.empty((2, 3), int)
>>> array[0] = [1,2,3]
>>> array[1] = [4,5,6]
>>> array
array([[1, 2, 3],
[4, 5, 6]])
A note on your examples:
In the second one you forgot to save the result, make it array = np.concatenate((array,newrow1), axis=0) and it works (not exactly like you want it but the array is not empty anymore). The first example seems badly indented and without know the variables and/or the problem there it's hard to debug.

Numpy - the best way to remove the last element from 1 dimensional array?

What is the most efficient way to remove the last element from a numpy 1 dimensional array? (like pop for list)
NumPy arrays have a fixed size, so you cannot remove an element in-place. For example using del doesn't work:
>>> import numpy as np
>>> arr = np.arange(5)
>>> del arr[-1]
ValueError: cannot delete array elements
Note that the index -1 represents the last element. That's because negative indices in Python (and NumPy) are counted from the end, so -1 is the last, -2 is the one before last and -len is actually the first element. That's just for your information in case you didn't know.
Python lists are variable sized so it's easy to add or remove elements.
So if you want to remove an element you need to create a new array or view.
Creating a new view
You can create a new view containing all elements except the last one using the slice notation:
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> arr[:-1] # all but the last element
array([0, 1, 2, 3])
>>> arr[:-2] # all but the last two elements
array([0, 1, 2])
>>> arr[1:] # all but the first element
array([1, 2, 3, 4])
>>> arr[1:-1] # all but the first and last element
array([1, 2, 3])
However a view shares the data with the original array, so if one is modified so is the other:
>>> sub = arr[:-1]
>>> sub
array([0, 1, 2, 3])
>>> sub[0] = 100
>>> sub
array([100, 1, 2, 3])
>>> arr
array([100, 1, 2, 3, 4])
Creating a new array
1. Copy the view
If you don't like this memory sharing you have to create a new array, in this case it's probably simplest to create a view and then copy (for example using the copy() method of arrays) it:
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> sub_arr = arr[:-1].copy()
>>> sub_arr
array([0, 1, 2, 3])
>>> sub_arr[0] = 100
>>> sub_arr
array([100, 1, 2, 3])
>>> arr
array([0, 1, 2, 3, 4])
2. Using integer array indexing [docs]
However, you can also use integer array indexing to remove the last element and get a new array. This integer array indexing will always (not 100% sure there) create a copy and not a view:
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> indices_to_keep = [0, 1, 2, 3]
>>> sub_arr = arr[indices_to_keep]
>>> sub_arr
array([0, 1, 2, 3])
>>> sub_arr[0] = 100
>>> sub_arr
array([100, 1, 2, 3])
>>> arr
array([0, 1, 2, 3, 4])
This integer array indexing can be useful to remove arbitrary elements from an array (which can be tricky or impossible when you want a view):
>>> arr = np.arange(5, 10)
>>> arr
array([5, 6, 7, 8, 9])
>>> arr[[0, 1, 3, 4]] # keep first, second, fourth and fifth element
array([5, 6, 8, 9])
If you want a generalized function that removes the last element using integer array indexing:
def remove_last_element(arr):
return arr[np.arange(arr.size - 1)]
3. Using boolean array indexing [docs]
There is also boolean indexing that could be used, for example:
>>> arr = np.arange(5, 10)
>>> arr
array([5, 6, 7, 8, 9])
>>> keep = [True, True, True, True, False]
>>> arr[keep]
array([5, 6, 7, 8])
This also creates a copy! And a generalized approach could look like this:
def remove_last_element(arr):
if not arr.size:
raise IndexError('cannot remove last element of empty array')
keep = np.ones(arr.shape, dtype=bool)
keep[-1] = False
return arr[keep]
If you would like more information on NumPys indexing the documentation on "Indexing" is quite good and covers a lot of cases.
4. Using np.delete()
Normally I wouldn't recommend the NumPy functions that "seem" like they are modifying the array in-place (like np.append and np.insert) but do return copies because these are generally needlessly slow and misleading. You should avoid them whenever possible, that's why it's the last point in my answer. However in this case it's actually a perfect fit so I have to mention it:
>>> arr = np.arange(10, 20)
>>> arr
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>> np.delete(arr, -1)
array([10, 11, 12, 13, 14, 15, 16, 17, 18])
5.) Using np.resize()
NumPy has another method that sounds like it does an in-place operation but it really returns a new array:
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> np.resize(arr, arr.size - 1)
array([0, 1, 2, 3])
To remove the last element I simply provided a new shape that is 1 smaller than before, which effectively removes the last element.
Modifying the array inplace
Yes, I've written previously that you cannot modify an array in place. But I said that because in most cases it's not possible or only by disabling some (completely useful) safety checks. I'm not sure about the internals but depending on the old size and the new size it could be possible that this includes an (internal-only) copy operation so it might be slower than creating a view.
Using np.ndarray.resize()
If the array doesn't share its memory with any other array, then it's possible to resize the array in place:
>>> arr = np.arange(5, 10)
>>> arr.resize(4)
>>> arr
array([5, 6, 7, 8])
However that will throw ValueErrors in case it's actually referenced by another array as well:
>>> arr = np.arange(5)
>>> view = arr[1:]
>>> arr.resize(4)
ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function
You can disable that safety-check by setting refcheck=False but that shouldn't be done lightly because you make yourself vulnerable for segmentation faults and memory corruption in case the other reference tries to access the removed elements! This refcheck argument should be treated as an expert-only option!
Summary
Creating a view is really fast and doesn't take much additional memory, so whenever possible you should try to work as much with views as possible. However depending on the use-cases it's not so easy to remove arbitrary elements using basic slicing. While it's easy to remove the first n elements and/or last n elements or remove every x element (the step argument for slicing) this is all you can do with it.
But in your case of removing the last element of a one-dimensional array I would recommend:
arr[:-1] # if you want a view
arr[:-1].copy() # if you want a new array
because these most clearly express the intent and everyone with Python/NumPy experience will recognize that.
Timings
Based on the timing framework from this answer:
# Setup
import numpy as np
def view(arr):
return arr[:-1]
def array_copy_view(arr):
return arr[:-1].copy()
def array_int_index(arr):
return arr[np.arange(arr.size - 1)]
def array_bool_index(arr):
if not arr.size:
raise IndexError('cannot remove last element of empty array')
keep = np.ones(arr.shape, dtype=bool)
keep[-1] = False
return arr[keep]
def array_delete(arr):
return np.delete(arr, -1)
def array_resize(arr):
return np.resize(arr, arr.size - 1)
# Timing setup
timings = {view: [],
array_copy_view: [], array_int_index: [], array_bool_index: [],
array_delete: [], array_resize: []}
sizes = [2**i for i in range(1, 20, 2)]
# Timing
for size in sizes:
print(size)
func_input = np.random.random(size=size)
for func in timings:
print(func.__name__.ljust(20), ' ', end='')
res = %timeit -o func(func_input) # if you use IPython, otherwise use the "timeit" module
timings[func].append(res)
# Plotting
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(1)
ax = plt.subplot(111)
for func in timings:
ax.plot(sizes,
[time.best for time in timings[func]],
label=func.__name__)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
I get the following timings as log-log plot to cover all the details, lower time still means faster, but the range between two ticks represents one order of magnitude instead of a fixed amount. In case you're interested in the specific values, I copied them into this gist:
According to these timings those two approaches are also the fastest. (Python 3.6 and NumPy 1.14.0)
If you want to quickly get array without last element (not removing explicit), use slicing:
array[:-1]
To delete the last element from a 1-dimensional NumPy array, use the numpy.delete method, like so:
import numpy as np
# Create a 1-dimensional NumPy array that holds 5 values
values = np.array([1, 2, 3, 4, 5])
# Remove the last element of the array using the numpy.delete method
values = np.delete(values, -1)
print(values)
Output:
[1 2 3 4]
The last value of the NumPy array, which was 5, is now removed.

How to sort a list based on the output of numpy's argsort function

I have a list like this:
myList = [10,30,40,20,50]
Now I use numpy's argsort function to get the indices for the sorted list:
import numpy as np
so = np.argsort(myList)
which gives me the output:
array([0, 3, 1, 2, 4])
When I want to sort an array using so it works fine:
myArray = np.array([1,2,3,4,5])
myArray[so]
array([1, 4, 2, 3, 5])
But when I apply it to another list, it does not work but throws an error
myList2 = [1,2,3,4,5]
myList2[so]
TypeError: only integer arrays with one element can be converted to an
index
How can I now use so to sort another list without using a for-loop and without converting this list to an array first?
myList2 is a normal python list, and it does not support that kind of indexing.
You would either need to convert that to a numpy.array , Example -
In [8]: np.array(myList2)[so]
Out[8]: array([1, 4, 2, 3, 5])
Or you can use list comprehension -
In [7]: [myList2[i] for i in so]
Out[7]: [1, 4, 2, 3, 5]
You can't. You have to convert it to an array then back.
myListSorted = list(np.array(myList)[so])
Edit: I ran some benchmarks comparing the NumPy way to the list comprehension. NumPy is ~27x faster
>>> from timeit import timeit
>>> import numpy as np
>>> myList = list(np.random.rand(100))
>>> so = np.argsort(myList) #converts list to NumPy internally
>>> timeit(lambda: [myList[i] for i in so])
12.29590070003178
>>> myArray = np.random.rand(100)
>>> so = np.argsort(myArray)
>>> timeit(lambda: myArray[so])
0.42915570305194706

Determining duplicate values in an array

Suppose I have an array
a = np.array([1, 2, 1, 3, 3, 3, 0])
How can I (efficiently, Pythonically) find which elements of a are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3]) or possibly array([1, 3]) if efficient.
I've come up with a few methods that appear to work:
Masking
m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]
Set operations
a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]
This one is cute but probably illegal (as a isn't actually unique):
np.setxor1d(a, np.unique(a), assume_unique=True)
Histograms
u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]
Sorting
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
Pandas
s = pd.Series(a)
s[s.duplicated()]
Is there anything I've missed? I'm not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).
Conclusions
Testing with a 10 million size data set (on a 2.8GHz Xeon):
a = np.random.randint(10**7, size=10**7)
The fastest is sorting, at 1.1s. The dubious xor1d is second at 2.6s, followed by masking and Pandas Series.duplicated at 3.1s, bincount at 5.6s, and in1d and senderle's setdiff1d both at 7.3s. Steven's Counter is only a little slower, at 10.5s; trailing behind are Burhan's Counter.most_common at 110s and DSM's Counter subtraction at 360s.
I'm going to use sorting for performance, but I'm accepting Steven's answer because the performance is acceptable and it feels clearer and more Pythonic.
Edit: discovered the Pandas solution. If Pandas is available it's clear and performs well.
As of numpy version 1.9.0, np.unique has an argument return_counts which greatly simplifies your task:
u, c = np.unique(a, return_counts=True)
dup = u[c > 1]
This is similar to using Counter, except you get a pair of arrays instead of a mapping. I'd be curious to see how they perform relative to each other.
It's probably worth mentioning that even though np.unique is quite fast in practice due to its numpyness, it has worse algorithmic complexity than the Counter solution. np.unique is sort-based, so runs asymptotically in O(n log n) time. Counter is hash-based, so has O(n) complexity. This will not matter much for anything but the largest datasets.
I think this is most clear done outside of numpy. You'll have to time it against your numpy solutions if you are concerned with speed.
>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]
note: This is similar to Burhan Khalid's answer, but the use of items without subscripting in the condition should be faster.
People have already suggested Counter variants, but here's one which doesn't use a listcomp:
>>> from collections import Counter
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> (Counter(a) - Counter(set(a))).keys()
[1, 3]
[Posted not because it's efficient -- it's not -- but because I think it's cute that you can subtract Counter instances.]
For Python 2.7+
>>> import numpy
>>> from collections import Counter
>>> n = numpy.array([1,1,2,3,3,3,0])
>>> [x[1] for x in Counter(n).most_common() if x[0] > 1]
[3, 1]
Here's another approach using set operations that I think is a bit more straightforward than the ones you offer:
>>> indices = np.setdiff1d(np.arange(len(a)), np.unique(a, return_index=True)[1])
>>> a[indices]
array([1, 3, 3])
I suppose you're asking for numpy-only solutions, since if that's not the case, it's very difficult to argue with just using a Counter instead. I think you should make that requirement explicit though.
If a is made up of small integers you can use numpy.bincount directly:
import numpy as np
a = np.array([3, 2, 2, 0, 4, 3])
counts = np.bincount(a)
print np.where(counts > 1)[0]
# array([2, 3])
This is very similar your "histogram" method, which is the one I would use if a was not made up of small integers.
If the array is a sorted numpy array, then just do:
a = np.array([1, 2, 2, 3, 4, 5, 5, 6])
rep_el = a[np.diff(a) == 0]
I'm adding my solution to the pile for this 3 year old question because none of the solutions fit what I wanted or used libs besides numpy. This method finds both the indices of duplicates and values for distinct sets of duplicates.
import numpy as np
A = np.array([1,2,3,4,4,4,5,6,6,7,8])
# Record the indices where each unique element occurs.
list_of_dup_inds = [np.where(a == A)[0] for a in np.unique(A)]
# Filter out non-duplicates.
list_of_dup_inds = filter(lambda inds: len(inds) > 1, list_of_dup_inds)
for inds in list_of_dup_inds: print inds, A[inds]
# >> [3 4 5] [4 4 4]
# >> [7 8] [6 6]
>>> import numpy as np
>>> a=np.array([1,2,2,2,2,3])
>>> uniques, uniq_idx, counts = np.unique(a,return_index=True,return_counts=True)
>>> duplicates = a[ uniq_idx[counts>=2] ] # <--- Get duplicates
If you also want to get the orphans:
>>> orphans = a[ uniq_idx[counts==1] ]
Combination of Pandas and Numpy (Utilizing value_counts():
import pandas as pd
import numpy as np
arr=np.array(('a','b','b','c','a'))
pd.Series(arr).value_counts()
OUTPUT:
a 2
b 2
c 1

Categories

Resources