Removing duplicate columns and rows from a NumPy 2D array

Removing duplicate columns and rows from a NumPy 2D array - python

I'm using a 2D shape array to store pairs of longitudes+latitudes. At one point, I have to merge two of these 2D arrays, and then remove any duplicated entry. I've been searching for a function similar to numpy.unique, but I've had no luck. Any implementation I've been
thinking on looks very "unoptimizied". For example, I'm trying with converting the array to a list of tuples, removing duplicates with set, and then converting to an array again:
coordskeys = np.array(list(set([tuple(x) for x in coordskeys])))
Are there any existing solutions, so I do not reinvent the wheel?
To make it clear, I'm looking for:
>>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>> unique_rows(a)
array([[1, 1], [2, 3],[5, 4]])
BTW, I wanted to use just a list of tuples for it, but the lists were so big that they consumed my 4Gb RAM + 4Gb swap (numpy arrays are more memory efficient).

This should do the trick:
def unique_rows(a):
a = np.ascontiguousarray(a)
unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))
Example:
>>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>> unique_rows(a)
array([[1, 1],
[2, 3],
[5, 4]])

Here's one idea, it'll take a little bit of work but could be quite fast. I'll give you the 1d case and let you figure out how to extend it to 2d. The following function finds the unique elements of of a 1d array:
import numpy as np
def unique(a):
a = np.sort(a)
b = np.diff(a)
b = np.r_[1, b]
return a[b != 0]
Now to extend it to 2d you need to change two things. You will need to figure out how to do the sort yourself, the important thing about the sort will be that two identical entries end up next to each other. Second, you'll need to do something like (b != 0).all(axis) because you want to compare the whole row/column. Let me know if that's enough to get you started.
updated: With some help with doug, I think this should work for the 2d case.
import numpy as np
def unique(a):
order = np.lexsort(a.T)
a = a[order]
diff = np.diff(a, axis=0)
ui = np.ones(len(a), 'bool')
ui[1:] = (diff != 0).any(axis=1)
return a[ui]

My method is by turning a 2d array into 1d complex array, where the real part is 1st column, imaginary part is the 2nd column. Then use np.unique. Though this will only work with 2 columns.
import numpy as np
def unique2d(a):
x, y = a.T
b = x + y*1.0j
idx = np.unique(b,return_index=True)[1]
return a[idx]
Example -
a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
unique2d(a)
array([[1, 1],
[2, 3],
[5, 4]])

>>> import numpy as NP
>>> # create a 2D NumPy array with some duplicate rows
>>> A
array([[1, 1, 1, 5, 7],
[5, 4, 5, 4, 7],
[7, 9, 4, 7, 8],
[5, 4, 5, 4, 7],
[1, 1, 1, 5, 7],
[5, 4, 5, 4, 7],
[7, 9, 4, 7, 8],
[5, 4, 5, 4, 7],
[7, 9, 4, 7, 8]])
>>> # first, sort the 2D NumPy array row-wise so dups will be contiguous
>>> # and rows are preserved
>>> a, b, c, d, e = A.T # create the keys for to pass to lexsort
>>> ndx = NP.lexsort((a, b, c, d, e))
>>> ndx
array([1, 3, 5, 7, 0, 4, 2, 6, 8])
>>> A = A[ndx,]
>>> # now diff by row
>>> A1 = NP.diff(A, axis=0)
>>> A1
array([[0, 0, 0, 0, 0],
[4, 3, 3, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 1, 0],
[0, 0, 1, 0, 0],
[2, 5, 0, 2, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
>>> # the index array holding the location of each duplicate row
>>> ndx = NP.any(A1, axis=1)
>>> ndx
array([False, True, False, True, True, True, False, False], dtype=bool)
>>> # retrieve the duplicate rows:
>>> A[1:,:][ndx,]
array([[7, 9, 4, 7, 8],
[1, 1, 1, 5, 7],
[5, 4, 5, 4, 7],
[7, 9, 4, 7, 8]])

The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by user545424 in a nice and tested interface, plus many related features:
import numpy_indexed as npi
npi.unique(coordskeys)

since you refer to numpy.unique, you dont care to maintain the original order, correct? converting into set, which removes duplicate, and then back to list is often used idiom:
>>> x = [(1, 1), (2, 3), (1, 1), (5, 4), (2, 3)]
>>> y = list(set(x))
>>> y
[(5, 4), (2, 3), (1, 1)]
>>>

Related

Using Numpy to write a function that returns all rows of A that have completely distinct entries

For example:
For
A = np.array([
[1, 2, 3],
[4, 4, 4],
[5, 6, 6]])
I want to get the output
array([[1, 2, 3]]).
For A = np.arange(9).reshape(3, 3),
I want to get
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
WITHOUT using loops

You can use:
B = np.sort(A, axis=1)
out = A[(B[:,:-1] != B[:, 1:]).all(1)]
Or using pandas:
import pandas as pd
out = A[pd.DataFrame(A).nunique(axis=1).eq(A.shape[1])]
Output:
array([[1, 2, 3]])

Concatenation of 2 lists using numpy

a = np.array([[0, 1, 2, 3], [4, 5, 6, 7]], dtype=int)
b = np.array([[8], [9]], dtype=int)
result wanted:
alist = [[0, 1, 2, 3, 8], [4, 5, 6, 7, 9]] # as np.array
I tried:
np.concatenate(alist,blist)
np.concatenate((alist,blist))
np.concatenate(alist, blist[0])
for a,b in zip(alist,blist): np.concatenate(a,b)
alist = [*map(np.concatenate, alist, blist)])
This got me various error messages I tried to fix by using the next trial. Nothing worked so far.

You are just missing the axis=1 keyword argument.
np.concatenate((a, b), axis=1)
Normally np.concatenate works on axis 0 (going down the array). But in this case you want to concatenate along axis 1 (going across the array). See the glossary for more information.

You can achieve this by using np.hstack, this will concatenate the two arrays, but at the second axis.
a = np.array([[0, 1, 2, 3], [4, 5, 6, 7]], dtype=int)
b = np.array([[8], [9]], dtype=int)
>>> np.hstack((a,b))
array([[0, 1, 2, 3, 8],
[4, 5, 6, 7, 9]])

How can I isolate rows in a 2d numpy matrix that match a specific criteria?

How can I isolate rows in a 2d numpy matrix that match a specific criteria? For example if I have some data and I only want to look at the rows where the 0 index has a value of 5 or less how would I retrieve those values?
I tried this approach:
import numpy as np
data = np.matrix([
[10, 8, 2],
[1, 4, 5],
[6, 5, 7],
[2, 2, 10]])
#My attempt to retrieve all rows where index 0 is less than 5
small_data = (data[:, 0] < 5)
The output is:
matrix([
[False],
[ True],
[False],
[ True]], dtype=bool)
However I'd like the output to be:
[[1, 4, 5],
[2, 2, 10]]
Another approach may be for me to loop through the matrix rows and if the 0 index is smaller than 5 append the row to a list but I am hoping there is a better way than that.
Note: I'm using Python 2.7.

First: Don't use np.matrix, use normal np.arrays.
import numpy as np
data = np.array([[10, 8, 2],
[1, 4, 5],
[6, 5, 7],
[2, 2, 10]])
Then you can always use boolean indexing (based on the boolean array you get when you do comparisons) to get the desired rows:
>>> data[data[:, 0] < 5]
array([[ 1, 4, 5],
[ 2, 2, 10]])
or integer array indexing:
>>> data[np.where(data[:, 0] < 5)]
array([[ 1, 4, 5],
[ 2, 2, 10]])

That way you got a logical array that you can use to select the desired rows.
>>> data = np.matrix([
[10, 8, 2],
[1, 4, 5],
[6, 5, 7],
[2, 2, 10]])
>>> data = np.array(data)
>>> data[(data[:, 0] < 5), :]
array([[ 1, 4, 5],
[ 2, 2, 10]])
You can also use np.squeeze to filter the rows.
>>> ind = np.squeeze(np.asarray(data [:,0]))<5
>>> data[ind,:]
array([[ 1, 4, 5],
[ 2, 2, 10]])

use the following code.
data[data[small_data,:]]
That would work

Replicating elements in numpy array

I have a numpy array say
a = array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
I have an array 'replication' of the same size where replication[i,j](>=0) denotes how many times a[i][j] should be repeated along the row. Obiviously, replication array follows the invariant that np.sum(replication[i]) have the same value for all i.
For example, if
replication = array([[1, 2, 1],
[1, 1, 2],
[2, 1, 1]])
then the final array after replicating is:
new_a = array([[1, 2, 2, 3],
[4, 5, 6, 6],
[7, 7, 8, 9]])
Presently, I am doing this to create new_a:
##allocate new_a
h = a.shape[0]
w = a.shape[1]
for row in range(h):
ll = [[a[row][j]]*replicate[row][j] for j in range(w)]
new_a[row] = np.array([item for sublist in ll for item in sublist])
However, this seems to be too slow as it involves using lists. Can I do the intended entirely in numpy, without the use of python lists?

You can flatten out your replication array, then use the .repeat() method of a:
import numpy as np
a = array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
replication = array([[1, 2, 1],
[1, 1, 2],
[2, 1, 1]])
new_a = a.repeat(replication.ravel()).reshape(a.shape[0], -1)
print(repr(new_a))
# array([[1, 2, 2, 3],
# [4, 5, 6, 6],
# [7, 7, 8, 9]])

Sort array by other array sort indexing in Python

I'm having a bit of a difficulty. I'm trying to vectorize some code in python in order to make it faster. I have an array which I sort (A) and get the index list (Ind). I have another array (B) which I would like to sort by the index list, without using loops which I think bottlenecks the computation.
A = array([[2, 1, 9],
[1, 1, 5],
[7, 4, 1]])
Ind = np.argsort(A)
This is the result of Ind:
Ind = array([[1, 0, 2],
[0, 1, 2],
[2, 1, 0]], dtype=int64)
B is the array i would like to sort by Ind:
B = array([[ 6, 3, 9],
[ 1, 5, 3],
[ 2, 7, 13]])
I would like to use Ind to rearrange my elements in B as such (B rows sorted by A rows indexes):
B = array([[ 3, 6, 9],
[ 1, 5, 3],
[13, 7, 2]])
Any Ideas? I would be glad to get any good suggestion. I want to mention I am using millions of values, I mean arrays of 30000*5000.
Cheers,
Robert

I would do something like this:
import numpy as np
from numpy import array
A = array([[2, 1, 9],
[1, 1, 5],
[7, 4, 1]])
Ind = np.argsort(A)
B = array([[ 3, 6, 9],
[ 1, 5, 3],
[13, 7, 2]])
# an array of the same shape as A and B with row numbers for each element
rownums = np.tile(np.arange(3), (3, 1)).T
new_B = np.take(B, rownums * 3 + Ind)
print(new_B)
# [[ 6 3 9]
# [ 1 5 3]
# [ 2 7 13]]
You can replace the magic number 3 with the array shape.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Removing duplicate columns and rows from a NumPy 2D array - python

The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by user545424 in a nice and tested interface, plus many related features: import numpy_indexed as npi npi.unique(coordskeys)

since you refer to numpy.unique, you dont care to maintain the original order, correct? converting into set, which removes duplicate, and then back to list is often used idiom: >>> x = [(1, 1), (2, 3), (1, 1), (5, 4), (2, 3)] >>> y = list(set(x)) >>> y [(5, 4), (2, 3), (1, 1)] >>>

Related

Using Numpy to write a function that returns all rows of A that have completely distinct entries

Concatenation of 2 lists using numpy

How can I isolate rows in a 2d numpy matrix that match a specific criteria?

Replicating elements in numpy array

Sort array by other array sort indexing in Python

Categories

Resources