How do you find and save duplicated rows in a NumPy array? - Python

I have an array, e.g.
Array = [[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5],[1,1,1],[2,2,2]]
and I would like something that would output the following:
Repeated = [[1,1,1],[2,2,2]]
Preserving the number of repeated rows would work too, e.g.
Repeated = [[1,1,1],[1,1,1],[2,2,2],[2,2,2]]
I thought the solution might involve numpy.unique, but I can't get it to work. Is there a native Python/NumPy function for this?

Using the axis argument of np.unique along with return_counts=True gives us the unique rows and the corresponding count for each of those rows. We can then select the rows with counts > 1 to get the desired output, like so -
In [688]: a = np.array([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5],[1,1,1],[2,2,2]])
In [689]: unq, count = np.unique(a, axis=0, return_counts=True)
In [690]: unq[count>1]
Out[690]:
array([[1, 1, 1],
       [2, 2, 2]])
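If you also want to preserve the number of repeated rows (the second output format from the question), a minimal follow-up sketch, reusing unq and count from above, repeats each qualifying row by its count:
In [691]: np.repeat(unq[count > 1], count[count > 1], axis=0)
Out[691]:
array([[1, 1, 1],
       [1, 1, 1],
       [2, 2, 2],
       [2, 2, 2]])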

If you need to get the indices of the repeated rows:
import numpy as np

a = np.array([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5],[1,1,1],[2,2,2]])
unq, count = np.unique(a, axis=0, return_counts=True)
repeated_groups = unq[count > 1]
for repeated_group in repeated_groups:
    repeated_idx = np.argwhere(np.all(a == repeated_group, axis=1))
    print(repeated_idx.ravel())
# [0 5]
# [1 6]

You could use something like Repeated = list(set(map(tuple, Array))) if you didn't necessarily need order preserved; note that this gives you the de-duplicated rows rather than only the rows that repeat (see the sketch below for the latter). The advantage is that you don't need an additional dependency like numpy. Depending on what you're doing next, you could probably get away with Repeated = set(map(tuple, Array)) and avoid the conversion back to a list.
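A pure-Python sketch that keeps only the rows actually occurring more than once, using collections.Counter (assuming Python 3.7+, where dicts preserve insertion order):
from collections import Counter

Array = [[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5],[1,1,1],[2,2,2]]

counts = Counter(map(tuple, Array))  # count occurrences of each row
Repeated = [list(row) for row, c in counts.items() if c > 1]
print(Repeated)  # [[1, 1, 1], [2, 2, 2]]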

Related

Numpy unique with distinct members

I'm trying to find the unique elements of an N x 2 array, irrespective of the order of the elements within each row. For example, given the array
a = [[1,0],
     [2,5],
     [0,1],
     [1,0]]
it would give me back
a = [[1,0],
     [2,5]]
Currently I'm using an approach with numpy.sort()
a = np.unique(np.sort(a, axis=1), axis=0)
which gets the job done, but I feel this is an overly complicated way of achieving my goal, and possibly slow, especially for larger arrays. Are there better, "sort-avoiding" methods?
Following the comment, I don't think numpy can do this natively.
What you can do is use Python sets to generate order-insensitive unique objects.
numpy.unique doesn't seem to handle sets very well, so an extra conversion step to tuple is needed:
np.unique(list(map(tuple, map(set, a))), axis=0)
output:
array([[0, 1],
       [2, 5]])
Note that converting a row to a set collapses repeated values (e.g. [1, 1] becomes (1,)), so this assumes the elements within each row are distinct.
If you want to recover the indices:
_, idx = np.unique(list(map(tuple, map(set, a))), axis=0, return_index=True)
a[idx]
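If you would rather keep the surviving rows in their original element order and in order of first occurrence, a small variation on the question's own np.sort approach, using return_index, is a possible sketch:
import numpy as np

a = np.array([[1, 0], [2, 5], [0, 1], [1, 0]])

# unique up to within-row order, computed on a row-sorted copy,
# keeping the index of each row's first occurrence
_, idx = np.unique(np.sort(a, axis=1), axis=0, return_index=True)
print(a[np.sort(idx)])
# [[1 0]
#  [2 5]]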

Stable conversion of a multi-column (2D) numpy array to an indicator vector

I often need to convert a multi-column (or 2D) numpy array into an indicator vector in a stable (i.e., order-preserving) manner.
For example, I have the following numpy array:
import numpy as np

arr = np.array([
    [2, 20, 1],
    [1, 10, 3],
    [2, 20, 2],
    [2, 20, 1],
    [1, 20, 3],
    [2, 20, 2],
])
The output I would like to have is:
indicator = [0, 1, 2, 0, 3, 2]
How can I do this (preferably using numpy only)?
Notes:
I am looking for a high performance (vectorized) approach as the arr (see the example above) has millions of rows in a real application.
I am aware of the following auxiliary solutions, but none is ideal. It would be nice to hear an expert's opinion.
My thoughts so far:
1. Numpy's unique: This would not work, as it is not stable:
arr_unq, indicator = np.unique(arr, axis=0, return_inverse=True)
print(arr_unq)
# output 1:
# [[ 1 10  3]
#  [ 1 20  3]
#  [ 2 20  1]
#  [ 2 20  2]]
print(indicator)
# output 2:
# [2 0 3 2 1 3]
Notice how the indicator starts from 2. This is because the unique function returns a "sorted" array (see output 1). However, I would like it to start from 0.
Of course I can use LabelEncoder from sklearn to convert the items so that they start from 0, but I feel there is a simple numpy trick I can use and therefore avoid adding a sklearn dependency to my program.
Or I can resolve this by a dictionary mapping like below, but I can imagine that there is a better or more elegant solution:
dct = {}
for idx, item in enumerate(indicator):
    if item not in dct:
        dct[item] = len(dct)
    indicator[idx] = dct[item]
print(indicator)
# outputs:
# [0 1 2 0 3 2]
2. Stabilizing numpy's unique output: This solution is already posted on stackoverflow and correctly returns a stable unique array. But I do not know how to convert the returned indicator vector (returned when return_inverse=True) so that it represents the values in a stable order starting from 0.
3. Pandas's get_dummies function: But it returns a one-hot encoding (a matrix of indicator values), whereas I would like an indicator vector. It is indeed possible to convert the one-hot encoding to the indicator vector with a few lines of code and data manipulation, but again that approach is not going to be highly efficient.
In addition to return_inverse, you can add the return_index option. This will tell you the first occurrence of each sorted item:
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
Now you can use the fact that np.argsort is its own inverse to fix the order. Note that idx.argsort() reorders unq into the order of first occurrence. The corrected result is therefore
indicator = idx.argsort().argsort()[inv]
And of course the byproduct
unq = unq[idx.argsort()]
Of course, nothing about these operations is specific to 2D arrays.
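Putting the pieces together, here is a minimal self-contained sketch on the question's arr (same names as above):
import numpy as np

arr = np.array([[2, 20, 1], [1, 10, 3], [2, 20, 2],
                [2, 20, 1], [1, 20, 3], [2, 20, 2]])

unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
order = idx.argsort()              # unique rows, ranked by first occurrence
indicator = order.argsort()[inv]   # relabel the inverse accordingly

print(indicator)   # [0 1 2 0 3 2]
print(unq[order])  # unique rows in order of first appearance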
A Note on the Intuition
Let's say you have an array x:
x = np.array([7, 3, 0, 1, 4])
x.argsort() is the array of indices that tells you which element of x lands at each location in the sorted array. So
i = x.argsort() # 2, 3, 1, 4, 0
But how would you get from np.sort(x) back to x (which is the problem you express in #2)?
Well, it happens that i tells you the original position of each element in the sorted array: the first (smallest) element was originally at index 2, the second at index 3, ..., and the last (largest) element was at index 0. This means that to put np.sort(x) back into its original order, you need the index that puts i into sorted order. That means you can write x as
np.sort(x)[i.argsort()]
Which is equivalent to
x[i][i.argsort()]
or
x[x.argsort()][x.argsort().argsort()]
So, as you can see, np.argsort is effectively its own inverse: argsorting something twice gives you the index to put it back in the original order.
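As a quick check of that identity, a small sketch:
import numpy as np

x = np.array([7, 3, 0, 1, 4])
i = x.argsort()  # [2 3 1 4 0]
print(np.array_equal(np.sort(x)[i.argsort()], x))  # True
print(np.array_equal(x[i][i.argsort()], x))        # True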

NumPy - descending stable arg-sort of arrays of any dtype

NumPy's np.argsort is able to do stable sorting by passing the kind='stable' argument.
However, np.argsort doesn't support reverse (descending) order.
If stability is not needed, descending order can be easily modeled through desc_ix = np.argsort(a)[::-1].
I'm looking for an efficient/easy solution for descending-stable arg-sorting of a NumPy array a of any comparable dtype. See my meaning of "stability" in the last paragraph.
When the dtype is numerical, stable descending arg-sorting can easily be done by sorting a negated version of the array:
print(np.argsort(-np.array([1, 2, 2, 3, 3, 3]), kind = 'stable'))
# prints: array([3, 4, 5, 1, 2, 0], dtype=int64)
But I need to support any comparable dtype including np.str_ and np.object_.
Just for clarification: for descending order, the classical meaning of stable is perhaps that equal elements are enumerated right to left. If so, the meaning of stable + descending in my question is different: runs of equal elements should be enumerated left to right, while the runs themselves are ordered in descending order, i.e., the same behavior as the last code above achieves. In other words, I want the kind of stability that Python achieves in the following code:
print([e[0] for e in sorted(enumerate([1,2,2,3,3,3]), key = lambda e: e[1], reverse = True)])
# prints: [3, 4, 5, 1, 2, 0]
I think this formula should work:
import numpy as np
a = np.array([1, 2, 2, 3, 3, 3])
s = len(a) - 1 - np.argsort(a[::-1], kind='stable')[::-1]
print(s)
# [3 4 5 1 2 0]
We can make use of np.unique(..., return_inverse=True): the inverse maps each element to its ascending rank among the unique values, and a stable ascending arg-sort of the negated ranks gives the desired descending-stable order -
u, tags = np.unique(a, return_inverse=True)
out = np.argsort(-tags, kind='stable')
# out: [3 4 5 1 2 0] for a = [1, 2, 2, 3, 3, 3]
The simplest solution would be to map the sorted unique elements of any dtype to ascending integers, and then do a stable ascending arg-sort of the negated integers:
import numpy as np
a = np.array(['a', 'b', 'b', 'c', 'c', 'c'])
u = np.unique(a)
i = np.searchsorted(u, a)
desc_ix = np.argsort(-i, kind = 'stable')
print(desc_ix)
# prints [3 4 5 1 2 0]
Similar to #jdehesa's clean solution, this solution allows specifying an axis.
indices = np.flip(
np.argsort(np.flip(x, axis=axis), axis=axis, kind="stable"), axis=axis
)
normalised_axis = axis if axis >= 0 else x.ndim + axis
max_i = x.shape[normalised_axis] - 1
indices = max_i - indices
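Wrapped into a self-contained sketch (the function name argsort_desc_stable is mine, not from the answer):
import numpy as np

def argsort_desc_stable(x, axis=-1):
    # flip, stable-argsort ascending, flip back, then mirror the indices
    indices = np.flip(
        np.argsort(np.flip(x, axis=axis), axis=axis, kind="stable"), axis=axis
    )
    normalised_axis = axis if axis >= 0 else x.ndim + axis
    return x.shape[normalised_axis] - 1 - indices

print(argsort_desc_stable(np.array(['a', 'b', 'b', 'c', 'c', 'c'])))
# [3 4 5 1 2 0]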

delete all columns of a dimension except for a specific column

I want to make a function which takes an n-dimensional array, the dimension (axis), and the column index along that dimension, and returns the (n-1)-dimensional array left after removing all the other columns of that dimension.
Here is the code I am using now:
a = np.arange(6).reshape((2, 3)) # the n-dimensional array
axisApplied = 1
colToKeep = 0
colsToDelete = np.delete(np.arange(a.shape[axisApplied]), colToKeep)
a = np.squeeze(np.delete(a, colsToDelete, axisApplied), axis=axisApplied)
print(a)
# [0 3]
Note that I have to manually compute the n-1 indices (the complement of the specific column index) to use np.delete(), and I am wondering whether there is a more convenient way to achieve my goal, e.g. specifying directly which column to keep.
Thank you for reading; any suggestions are welcome.
In [1]: arr = np.arange(6).reshape(2,3)
In [2]: arr
Out[2]:
array([[0, 1, 2],
       [3, 4, 5]])
Simple indexing:
In [3]: arr[:,0]
Out[3]: array([0, 3])
Or if you need to use the general axis parameter, try take:
In [4]: np.take(arr,0,axis=1)
Out[4]: array([0, 3])
Picking one element, or a list of elements, along an axis is a lot easier than deleting some. Look at the code for np.delete.
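As a rough sketch of the requested function (the name keep_only is mine): np.take with a scalar index already drops the axis, so no squeeze is needed.
import numpy as np

def keep_only(arr, axis, index):
    # keep a single column `index` along `axis`; the scalar index
    # removes that axis from the result automatically
    return np.take(arr, index, axis=axis)

a = np.arange(6).reshape(2, 3)
print(keep_only(a, axis=1, index=0))  # [0 3]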

Fast algorithm to find indices where multiple arrays have the same value

I'm looking for ways to speed up (or replace) my algorithm for grouping data.
I have a list of numpy arrays. I want to generate a new numpy array, such that each element of this array is the same for each index where the original arrays are the same as well. (And different where this is not the case.)
This sounds kind of awkward, so here's an example:
# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
#                             *           *
Note that the elements I marked (indices 0 and 4) of the expected outcome have the same value (0) because the original two arrays were also the same there (namely 10 and 21). Similarly for the elements at indices 3 and 5 (value 3).
The algorithm has to deal with an arbitrary number of (equally-sized) input arrays, and also return, for each resulting number, which values of the original arrays it corresponds to. (So for this example, "3" refers to (11, 22).)
Here is my current algorithm:
import numpy as np

def groupify(values):
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1  # Magic number: -1 means ungrouped.
    group_meanings = {}
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        this_combo = {}
        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]
        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
            this_combo[curr_id] = needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        group_meanings[next_hash] = this_combo
        next_hash += 1
    return group, group_meanings
Note that the expression value_array[matching] == needed_value is evaluated many times for each individual index, which is where the slowness comes from.
I'm not sure if my algorithm can be sped up much more, but I'm also not sure if it's the optimal algorithm to begin with. Is there a better way of doing this?
Finally cracked it for a vectorized solution! It was an interesting problem. The problem is that we have to tag each pair of values taken from the corresponding array elements of the list, and then tag each such pair based on its uniqueness among the other pairs. So, we can use np.unique, making heavy use of its optional arguments, and finally do some additional work to keep the order for the final output. Here's the implementation, done basically in three stages -
# Stack as a 2D array with each pair from values as a column each.
# Convert to linear index equivalent considering each column as indexing tuple
arr = np.vstack(values)
idx = np.ravel_multi_index(arr,arr.max(1)+1)
# Do the heavy work with np.unique to give us :
# 1. Starting indices of unique elems,
# 2. Array that has unique IDs for each element in idx, and
# 3. Group ID counts
_,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
return_inverse=True,return_counts=True)
# Best part happens here : Use mask to ignore the repeated elems and re-tag
# each unqID using argsort() of masked elements from idx
mask = ~np.in1d(unqID,np.where(count>1)[0])
mask[unq_start_idx] = 1
out = idx[mask].argsort()[unqID]
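As a quick self-contained sanity check on the sample values from the question:
import numpy as np

values = [np.array([10, 11, 10, 11, 10, 11, 10]),
          np.array([21, 21, 22, 22, 21, 22, 23])]

arr = np.vstack(values)
idx = np.ravel_multi_index(arr, arr.max(1) + 1)
_, unq_start_idx, unqID, count = np.unique(idx, return_index=True,
                                           return_inverse=True, return_counts=True)
mask = ~np.in1d(unqID, np.where(count > 1)[0])  # np.isin on newer NumPy
mask[unq_start_idx] = 1
print(idx[mask].argsort()[unqID])  # [0 1 2 3 0 3 4]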
Runtime test
Let's compare the proposed vectorized approach against the original code. Since the proposed code gets us the group IDs only, for a fair benchmark let's trim off the parts of the original code that aren't needed for that. So, here are the function definitions -
def groupify(values): # Original code
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]
        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        next_hash += 1
    return group

def groupify_vectorized(values): # Proposed code
    arr = np.vstack(values)
    idx = np.ravel_multi_index(arr, arr.max(1)+1)
    _, unq_start_idx, unqID, count = np.unique(idx, return_index=True,
                                               return_inverse=True, return_counts=True)
    mask = ~np.in1d(unqID, np.where(count>1)[0])
    mask[unq_start_idx] = 1
    return idx[mask].argsort()[unqID]
Runtime results on a list with large arrays -
In [345]: # Input list with random elements
...: values = [item for item in np.random.randint(10,40,(10,10000))]
In [346]: np.allclose(groupify(values),groupify_vectorized(values))
Out[346]: True
In [347]: %timeit groupify(values)
1 loops, best of 3: 4.02 s per loop
In [348]: %timeit groupify_vectorized(values)
100 loops, best of 3: 3.74 ms per loop
This should work, and should be considerably faster, since we're using broadcasting and numpy's inherently fast boolean comparisons:
import numpy as np
# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
# for every value in values, check where duplicate values occur
same_mask = [val[:,np.newaxis] == val[np.newaxis,:] for val in values]
# get the conjunction of all those tests
conjunction = np.logical_and.reduce(same_mask)
# ignore the diagonal
conjunction[np.diag_indices_from(conjunction)] = False
# initialize the labelled array with nans (used as flag)
labelled = np.empty(values[0].shape)
labelled.fill(np.nan)
# keep track of labelled value
val = 0
for k, row in enumerate(conjunction):
    if np.isnan(labelled[k]):  # this element has not been labelled yet
        labelled[k] = val      # so label it
        labelled[row] = val    # and label every element satisfying the test
        val += 1
print(labelled)
# outputs [ 0. 1. 2. 3. 0. 3. 4.]
It is about 1.5x faster than your version when dealing with the two arrays, but I suspect the speedup will be better with more arrays.
The numpy_indexed package (disclaimer: I am its author) contains generalized variants of the numpy array set operations, which can be used to solve your problem in an elegant and efficient (fully vectorized) manner:
import numpy_indexed as npi
unique_values, labels = npi.unique(tuple(values), return_inverse=True)
The above will work for arbitrary type combinations, but alternatively, the below will be even more efficient if values is a list of many arrays of the same dtype:
unique_values, labels = npi.unique(np.asarray(values), axis=1, return_inverse=True)
If I understand correctly, you are trying to hash values according to columns. It's better to convert the columns into hashable values by themselves, and then find the hashes from them.
So you actually want to hash on list(np.array(values).T).
This functionality is already built into Pandas; you don't need to write it yourself. The only problem is that it takes a list of values without further lists within it. In this case, you can just convert the inner lists to strings with map(str, list(np.array(values).T)) and factorize that!
>>> import pandas as pd
>>> pd.factorize(map(str, list(np.array(values).T)))
(array([0, 1, 2, 3, 0, 3, 4]),
 array(['[10 21]', '[11 21]', '[10 22]', '[11 22]', '[10 23]'], dtype=object))
I have converted your list of arrays into an array, and then each row into a string ...
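A note: the snippet above is Python 2 style; under Python 3, map returns an iterator, so a minimal adaptation (a sketch, assuming a recent pandas) materialises it into a list first:
import numpy as np
import pandas as pd

values = [np.array([10, 11, 10, 11, 10, 11, 10]),
          np.array([21, 21, 22, 22, 21, 22, 23])]

labels, uniques = pd.factorize(list(map(str, np.array(values).T)))
print(labels)  # [0 1 2 3 0 3 4]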
