Determining duplicate values in an array - python

Suppose I have an array
a = np.array([1, 2, 1, 3, 3, 3, 0])
How can I (efficiently, Pythonically) find which elements of a are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3]), or possibly array([1, 3]) if that is more efficient.
I've come up with a few methods that appear to work:
Masking
m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]
Set operations
a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]
This one is cute but probably illegal (as a isn't actually unique):
np.setxor1d(a, np.unique(a), assume_unique=True)
Histograms
u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]
Sorting
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
Pandas
s = pd.Series(a)
s[s.duplicated()]
Is there anything I've missed? I'm not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).
Conclusions
Testing with a 10 million size data set (on a 2.8GHz Xeon):
a = np.random.randint(10**7, size=10**7)
The fastest is sorting, at 1.1s. The dubious xor1d is second at 2.6s, followed by masking and Pandas Series.duplicated at 3.1s, bincount at 5.6s, and in1d and senderle's setdiff1d both at 7.3s. Steven's Counter is only a little slower, at 10.5s; trailing behind are Burhan's Counter.most_common at 110s and DSM's Counter subtraction at 360s.
I'm going to use sorting for performance, but I'm accepting Steven's answer because the performance is acceptable and it feels clearer and more Pythonic.
Edit: discovered the Pandas solution. If Pandas is available it's clear and performs well.
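For reference, a minimal sketch of how one of these timings might be reproduced with timeit (the setup mirrors the benchmark above; absolute numbers will vary with hardware and numpy version):
import numpy as np
import timeit

a = np.random.randint(10**7, size=10**7)

def dup_sort(arr):
    # The "Sorting" method from above: adjacent equal elements
    # in the sorted copy are the duplicates.
    s = np.sort(arr, axis=None)
    return s[:-1][s[1:] == s[:-1]]

print(timeit.timeit(lambda: dup_sort(a), number=3) / 3)  # seconds per run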

As of numpy version 1.9.0, np.unique has an argument return_counts which greatly simplifies your task:
u, c = np.unique(a, return_counts=True)
dup = u[c > 1]
This is similar to using Counter, except you get a pair of arrays instead of a mapping. I'd be curious to see how they perform relative to each other.
It's probably worth mentioning that even though np.unique is quite fast in practice due to its numpyness, it has worse algorithmic complexity than the Counter solution. np.unique is sort-based, so runs asymptotically in O(n log n) time. Counter is hash-based, so has O(n) complexity. This will not matter much for anything but the largest datasets.
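For a rough sense of that trade-off, a minimal timing sketch might look like this (array size and value range are arbitrary illustrative choices):
import numpy as np
import timeit
from collections import Counter

a = np.random.randint(10**5, size=10**6)

def dup_unique(arr):
    # Sort-based: O(n log n), but all in compiled numpy code.
    u, c = np.unique(arr, return_counts=True)
    return u[c > 1]

def dup_counter(arr):
    # Hash-based: O(n), but iterating a numpy array from Python
    # is the dominant overhead here.
    return [x for x, n in Counter(arr).items() if n > 1]

print(timeit.timeit(lambda: dup_unique(a), number=5))
print(timeit.timeit(lambda: dup_counter(a), number=5))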

I think this is done most clearly outside of numpy. You'll have to time it against your numpy solutions if you are concerned with speed.
>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]
Note: this is similar to Burhan Khalid's answer, but unpacking the (item, count) pairs in the condition instead of subscripting should be faster.

People have already suggested Counter variants, but here's one which doesn't use a listcomp:
>>> from collections import Counter
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> (Counter(a) - Counter(set(a))).keys()
[1, 3]
[Posted not because it's efficient -- it's not -- but because I think it's cute that you can subtract Counter instances.]
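(One caveat: on Python 3, .keys() returns a view rather than a list, so to get the output shown above, wrap it in list():)
from collections import Counter
a = [1, 2, 1, 3, 3, 3, 0]
print(list((Counter(a) - Counter(set(a))).keys()))  # [1, 3]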

For Python 2.7+
>>> import numpy
>>> from collections import Counter
>>> n = numpy.array([1,1,2,3,3,3,0])
>>> [x[0] for x in Counter(n).most_common() if x[1] > 1]
[3, 1]

Here's another approach using set operations that I think is a bit more straightforward than the ones you offer:
>>> indices = np.setdiff1d(np.arange(len(a)), np.unique(a, return_index=True)[1])
>>> a[indices]
array([1, 3, 3])
I suppose you're asking for numpy-only solutions, since if that's not the case, it's very difficult to argue with just using a Counter instead. I think you should make that requirement explicit though.

If a is made up of small integers you can use numpy.bincount directly:
import numpy as np
a = np.array([3, 2, 2, 0, 4, 3])
counts = np.bincount(a)
print(np.where(counts > 1)[0])
# [2 3]
This is very similar to your "histogram" method, which is the one I would use if a were not made up of small integers.

If the array is a sorted numpy array, then just do:
a = np.array([1, 2, 2, 3, 4, 5, 5, 6])
rep_el = a[np.diff(a) == 0]
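If the array isn't already sorted, the same idea works after an np.sort - essentially the "Sorting" approach from the question, written with np.diff:
import numpy as np
a = np.array([1, 2, 1, 3, 3, 3, 0])
s = np.sort(a)
rep_el = s[:-1][np.diff(s) == 0]  # array([1, 3, 3])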

I'm adding my solution to the pile for this 3-year-old question because none of the existing solutions fit what I wanted, or they used libraries besides numpy. This method finds both the indices of duplicates and the values for each distinct set of duplicates.
import numpy as np
A = np.array([1,2,3,4,4,4,5,6,6,7,8])
# Record the indices where each unique element occurs.
list_of_dup_inds = [np.where(A == u)[0] for u in np.unique(A)]
# Filter out non-duplicates.
list_of_dup_inds = [inds for inds in list_of_dup_inds if len(inds) > 1]
for inds in list_of_dup_inds:
    print(inds, A[inds])
# >> [3 4 5] [4 4 4]
# >> [7 8] [6 6]

>>> import numpy as np
>>> a = np.array([1, 2, 2, 2, 2, 3])
>>> uniques, uniq_idx, counts = np.unique(a, return_index=True, return_counts=True)
>>> duplicates = a[uniq_idx[counts >= 2]]  # <--- get duplicates
If you also want to get the orphans:
>>> orphans = a[uniq_idx[counts == 1]]

Combination of Pandas and NumPy, utilizing value_counts():
import pandas as pd
import numpy as np
arr=np.array(('a','b','b','c','a'))
pd.Series(arr).value_counts()
OUTPUT:
a 2
b 2
c 1
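To reduce that to just the duplicated values, a small extension of the above might be:
import numpy as np
import pandas as pd

arr = np.array(('a', 'b', 'b', 'c', 'a'))
vc = pd.Series(arr).value_counts()
print(vc[vc > 1].index.to_numpy())  # ['a' 'b'] (order of ties may vary)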

Related

Quick way to iterate through two arrays python

I've had multiple scenarios where I had to find a huge array's items in another huge array.
I usually solved it like this:
for i in range(0, len(arr1)):
    for k in range(0, len(arr2)):
        print(arr1[i], arr2[k])
which works fine, but it's kind of slow.
Can someone help me make the iteration faster?
arr1 = [1,2,3,4,5]
arr2 = [4,5,6,7]
same_items = set(arr1).intersection(arr2)
print(same_items)
{4, 5}
Sets hash their items, so lookup is O(1) per element instead of the O(n) scan a list needs. Items need to be hashable for this to work, and if they are not, I highly suggest you find a way to make them hashable.
If you need to handle huge arrays, you may want to use Python's numpy library, which provides high-efficiency array operations and lets you avoid explicit loops in most cases.
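For example, a minimal numpy equivalent of the set intersection above could be:
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([4, 5, 6, 7])
print(np.intersect1d(arr1, arr2))  # [4 5]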
If your array has duplicates and you want to keep them all:
arr1 = [1, 2, 3, 4, 5, 7, 5, 4]
arr2 = [4, 5, 6, 7]
res = [i for i in arr1 if i in arr2]
>>> res
[4, 5, 7, 5, 4]
or using numpy:
import numpy as np
res = np.array(arr1)[np.isin(arr1, arr2)].tolist()
>>> res
[4, 5, 7, 5, 4]

In numpy, how to efficiently build a mapping from each unique value to its indices, without using a for loop

I considered the following alternatives, but they are not efficient enough for my use case because I work with large arrays.
The first alternative requires traversing the array with a Python for loop, which may be slow for large numpy arrays.
import numpy as np
from collections import defaultdict
a = np.array([1, 2, 6, 4, 2, 3, 2])
inv = defaultdict(list)
for i, x in enumerate(a):
    inv[x].append(i)
The second alternative is inefficient because it requires traversing the array multiple times:
import numpy as np
a = np.array([1, 2, 6, 4, 2, 3, 2])
inv = {}
for x in np.unique(a):
    inv[x] = np.flatnonzero(a == x)
EDIT: My numpy array consists of integers and the usage is for image segmentation. I was also looking for a method in skimage, but did not find any.
This should work.
a = np.array((1, 2, 6, 2, 4, 7, 25, 6))
fwd = np.argsort(a)
asorted = a[fwd]
keys = np.unique(asorted)
lower = np.searchsorted(asorted, keys)
# or higher = itertools.chain(lower[1:], (len(asorted),))
higher = np.append(lower[1:], len(asorted))
inv = {key: fwd[lower_i:higher_i]
       for key, lower_i, higher_i in zip(keys, lower, higher)}
assert all(np.all(a[indices] == key) for key, indices in inv.items())
It runs in something like O(n log(n)). The only loop that remains is the one to build a dictionary. That step is optional, of course.
From a purely algorithmic standpoint, your first approach (defaultdict(list)) would be better since it runs in aggregated linear time but of course the python overhead may be significant.
I advise you to check out numba, which can speed up numpy code significantly - it supports numpy.invert() and numpy.unique() (see its documentation).
Here is a good video explaining how to use numba from youtube - Make Python code 1000x Faster with Numba
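As a sketch of what a numba-accelerated version might look like - this is a counting-sort construction, not code from the answer above, and it assumes non-negative integer labels (as in image segmentation) and that numba is installed:
import numpy as np
from numba import njit

@njit
def value_to_indices(a):
    # One pass to count occurrences, one pass to scatter each position
    # into its value's slot; O(n) overall.
    counts = np.bincount(a)
    starts = np.zeros(counts.size + 1, dtype=np.int64)
    starts[1:] = np.cumsum(counts)
    out = np.empty(a.size, dtype=np.int64)
    cursor = starts[:-1].copy()
    for i in range(a.size):
        v = a[i]
        out[cursor[v]] = i
        cursor[v] += 1
    return out, starts

a = np.array([1, 2, 6, 4, 2, 3, 2])
out, starts = value_to_indices(a)
# Indices of value v are out[starts[v]:starts[v+1]], e.g. for v = 2:
print(out[starts[2]:starts[3]])  # [1 4 6]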
Maybe my answer comes a bit late, but if you are OK with using pandas, I think you can use groupby, which should run in linear time:
import pandas as pd
a = np.array([1, 2, 6, 4, 2, 3, 2])
df = pd.DataFrame({'values': a})
inv = df.groupby(by='values').apply(lambda group: np.array(group.index))
inv = dict(inv)

NumPy - descending stable arg-sort of arrays of any dtype

NumPy's np.argsort can do stable sorting by passing the kind='stable' argument, but it doesn't support reverse (descending) order.
If non-stable behavior is acceptable, descending order can easily be modeled through desc_ix = np.argsort(a)[::-1].
I'm looking for an efficient/easy solution for descending-stable-sorting a NumPy array a of any comparable dtype. See my meaning of "stability" in the last paragraph.
When the dtype is numerical, stable descending arg-sorting can easily be done by sorting a negated version of the array:
print(np.argsort(-np.array([1, 2, 2, 3, 3, 3]), kind = 'stable'))
# prints: array([3, 4, 5, 1, 2, 0], dtype=int64)
But I need to support any comparable dtype including np.str_ and np.object_.
Just for clarification: for descending order, the classical meaning of stable might be that equal elements are enumerated right to left. If so, the meaning of stable + descending in my question is different: runs of equal elements should be enumerated left to right, while the runs themselves are ordered in descending order, i.e. the same behavior as in the last code above. In other words, I want stability in the same sense that Python achieves in the following code:
print([e[0] for e in sorted(enumerate([1,2,2,3,3,3]), key = lambda e: e[1], reverse = True)])
# prints: [3, 4, 5, 1, 2, 0]
I think this formula should work:
import numpy as np
a = np.array([1, 2, 2, 3, 3, 3])
s = len(a) - 1 - np.argsort(a[::-1], kind='stable')[::-1]
print(s)
# [3 4 5 1 2 0]
We can make use of np.unique(..., return_inverse=True) -
u,tags = np.unique(a, return_inverse=True)
out = np.argsort(-tags, kind='stable')
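For instance, on a string array this gives the same ordering as the numerical example:
import numpy as np

a = np.array(['a', 'b', 'b', 'c', 'c', 'c'])
u, tags = np.unique(a, return_inverse=True)
out = np.argsort(-tags, kind='stable')
print(out)  # [3 4 5 1 2 0]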
One simple solution is to map the sorted unique elements of any dtype to ascending integers, and then do a stable ascending arg-sort of the negated integers:
import numpy as np
a = np.array(['a', 'b', 'b', 'c', 'c', 'c'])
u = np.unique(a)
i = np.searchsorted(u, a)
desc_ix = np.argsort(-i, kind = 'stable')
print(desc_ix)
# prints [3 4 5 1 2 0]
Similar to jdehesa's clean solution, this solution allows specifying an axis:
import numpy as np

def descending_stable_argsort(x, axis=-1):
    # Flip along the axis, stable-argsort ascending, flip back, then
    # map the flipped positions back to indices in the original array.
    indices = np.flip(
        np.argsort(np.flip(x, axis=axis), axis=axis, kind="stable"), axis=axis
    )
    normalised_axis = axis if axis >= 0 else x.ndim + axis
    max_i = x.shape[normalised_axis] - 1
    return max_i - indices

Getting the indices of several elements in a NumPy array at once

Is there any way to get the indices of several elements in a NumPy array at once?
E.g.
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
I would like to find the index of each element of a in b, namely: [0,1,4].
I find the solution I am using a bit verbose:
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
c = np.zeros_like(a)
for i, aa in np.ndenumerate(a):
    c[i] = np.where(b == aa)[0]
print('c: {0}'.format(c))
Output:
c: [0 1 4]
You could use in1d and nonzero (or where for that matter):
>>> np.in1d(b, a).nonzero()[0]
array([0, 1, 4])
This works fine for your example arrays, but in general the array of returned indices does not honour the order of the values in a. This may be a problem depending on what you want to do next.
In that case, a much better answer is the one Jaime gives here, using searchsorted:
>>> sorter = np.argsort(b)
>>> sorter[np.searchsorted(b, a, sorter=sorter)]
array([0, 1, 4])
This returns the indices for values as they appear in a. For instance:
a = np.array([1, 2, 4])
b = np.array([4, 2, 3, 1])
>>> sorter = np.argsort(b)
>>> sorter[np.searchsorted(b, a, sorter=sorter)]
array([3, 1, 0]) # the other method would return [0, 1, 3]
This is a simple one-liner using the numpy-indexed package (disclaimer: I am its author):
import numpy_indexed as npi
idx = npi.indices(b, a)
The implementation is fully vectorized, and it gives you control over the handling of missing values. Moreover, it works for nd-arrays as well (for instance, finding the indices of rows of a in b).
All of the solutions here recommend using a linear search. You can use np.argsort and np.searchsorted to speed things up dramatically for large arrays:
sorter = b.argsort()
i = sorter[np.searchsorted(b, a, sorter=sorter)]
For an order-agnostic solution, you can use np.flatnonzero with np.isin (v 1.13+).
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
res = np.flatnonzero(np.isin(b, a))  # NumPy v1.13+
res = np.flatnonzero(np.in1d(b, a))  # earlier versions
# array([0, 1, 4], dtype=int64)
There are a bunch of approaches for getting the index of multiple items at once mentioned in passing in answers to this related question: Is there a NumPy function to return the first index of something in an array?. The wide variety and creativity of the answers suggests there is no single best practice, so if your code above works and is easy to understand, I'd say keep it.
I personally found this approach to be both performant and easy to read: https://stackoverflow.com/a/23994923/3823857
Adapting it for your example:
import numpy as np
a = np.array([1, 2, 4])
b_list = [1, 2, 3, 10, 4]
b_array = np.array(b_list)
indices = [b_list.index(x) for x in a]
vals_at_indices = b_array[indices]
I personally like adding a little bit of error handling in case a value in a does not exist in b.
import numpy as np
a = np.array([1, 2, 4])
b_list = [1, 2, 3, 10, 4]
b_array = np.array(b_list)
b_set = set(b_list)
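# Note: if any np.nan placeholder ends up in indices, the b_array[indices]
# lookup below will raise, since numpy rejects non-integer indices;
# handle missing values before indexing.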
indices = [b_list.index(x) if x in b_set else np.nan for x in a]
vals_at_indices = b_array[indices]
For my use case, it's pretty fast, since it relies on parts of Python that are fast (list comprehensions, .index(), sets, numpy indexing). Would still love to see something that's a NumPy equivalent to VLOOKUP, or even a Pandas merge. But this seems to work for now.
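For what it's worth, a VLOOKUP-style lookup can be sketched with a Pandas merge; the column names here are made up for illustration:
import numpy as np
import pandas as pd

a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
lookup = pd.DataFrame({'value': b, 'index_in_b': np.arange(len(b))})
result = pd.DataFrame({'value': a}).merge(lookup, on='value', how='left')
print(result['index_in_b'].to_numpy())  # [0 1 4]; unmatched values become NaN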

compare two following values in numpy array

What is the best way to compare two consecutive values in a numpy array?
example:
npdata = np.array([13, 15, 20, 25])
for i in range(len(npdata)):
    print(npdata[i] - npdata[i+1])
This looks really messed up, and it additionally needs exception handling for the last iteration of the loop.
Any ideas?
Thanks!
numpy provides a function diff for this basic use case
>>> import numpy
>>> x = numpy.array([1, 2, 4, 7, 0])
>>> numpy.diff(x)
array([ 1, 2, 3, -7])
Your snippet computes something closer to -numpy.diff(x).
How about range(len(npdata) - 1) ?
Here's code (using a simple array, but it doesn't matter):
>>> ar = [1, 2, 3, 4, 5]
>>> for i in range(len(ar) - 1):
...     print(ar[i] + ar[i + 1])
...
3
5
7
9
As you can see it successfully prints the sums of all consecutive pairs in the array, without any exceptions for the last iteration.
You can use ediff1d to get differences of consecutive elements. More generally, a[1:] - a[:-1] will give the differences of consecutive elements and can be used with other operators as well.
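A quick illustration of both forms - ediff1d and the slicing pattern, which also works with other operators:
import numpy as np

x = np.array([13, 15, 20, 25])
print(np.ediff1d(x))   # [2 5 5]
print(x[1:] - x[:-1])  # [2 5 5], the same differences via slicing
print(x[1:] + x[:-1])  # [28 35 45], the same pattern with addition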
