Count occurrences of unique arrays in array - python

I have a numpy array of various one-hot encoded numpy arrays, e.g.:
x = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0]])
I would like to count the occurrences of each unique one-hot vector:
{[1, 0, 0]: 2, [0, 0, 1]: 1}

Approach #1
Seems like a perfect setup to use the new functionality of numpy.unique (v1.13 and newer) that lets us work along an axis of a NumPy array -
unq_rows, count = np.unique(x,axis=0, return_counts=1)
out = {tuple(i):j for i,j in zip(unq_rows,count)}
Sample outputs -
In [289]: unq_rows
Out[289]:
array([[0, 0, 1],
[1, 0, 0]])
In [290]: count
Out[290]: array([1, 2])
In [291]: {tuple(i):j for i,j in zip(unq_rows,count)}
Out[291]: {(0, 0, 1): 1, (1, 0, 0): 2}
Approach #2
For NumPy versions older than v1.13, we can make use of the fact that the input array is a one-hot encoded array, like so -
_, idx, count = np.unique(x.argmax(1), return_counts=1, return_index=1)
out = {tuple(i):j for i,j in zip(x[idx],count)} # x[idx] is unq_rows
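For reference, a self-contained sketch of Approach #1, assuming NumPy 1.13 or newer. The only addition is converting keys and counts to plain Python types so the dictionary prints cleanly; the variable names are illustrative.

```python
import numpy as np

x = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0]])

# Unique rows (returned in sorted order) and how often each one occurs
unq_rows, counts = np.unique(x, axis=0, return_counts=True)

# Arrays are unhashable, so each row becomes a tuple of plain ints
out = {tuple(row.tolist()): int(n) for row, n in zip(unq_rows, counts)}
print(out)
```

Rows that never occur are simply absent from the result, unlike the column-sum approach further down.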

You could convert your arrays to tuples and use a Counter:
import numpy as np
from collections import Counter
x = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0]])
Counter([tuple(a) for a in x])
# Counter({(1, 0, 0): 2, (0, 0, 1): 1})

The fastest way given your data format is:
x.sum(axis=0)
which gives:
array([2, 0, 1])
Where the 1st result is the count of arrays where the 1st is hot:
[1, 0, 0] [2
[0, 1, 0] 0
[0, 0, 1] 1]
This exploits the fact that only one can be on at a time, so we can decompose the direct sum.
If you absolutely need it expanded to the same format, it can be converted via:
sums = x.sum(axis=0)
{tuple(int(k == i) for k in range(len(sums))): e for i, e in enumerate(sums)}
or, similarly to tarashypka:
{tuple(row): count for row, count in zip(np.eye(len(sums), dtype=np.int64), sums)}
yields:
{(1, 0, 0): 2, (0, 1, 0): 0, (0, 0, 1): 1}

Here is another interesting solution with sum
>> {tuple(v): n for v, n in zip(np.eye(x.shape[1], dtype=int), np.sum(x, axis=0))
if n > 0}
{(0, 0, 1): 1, (1, 0, 0): 2}

Lists and numpy arrays are unhashable, i.e. they can't be keys of a dictionary. So your precise desired output, a dictionary with keys that look like [1, 0, 0], is never possible in Python. To deal with this you need to map your vectors to tuples.
from collections import Counter
import numpy as np
x = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0]])
counts = Counter(map(tuple, x))
That will get you:
In [12]: counts
Out[12]: Counter({(0, 0, 1): 1, (1, 0, 0): 2})
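The hashability point can be checked directly. This is a small sketch; `can_be_key` is a hypothetical helper written for illustration, not part of any library.

```python
import numpy as np

row = np.array([1, 0, 0])

def can_be_key(obj):
    """Return True if obj can be used as a dictionary key."""
    try:
        {obj: 1}
        return True
    except TypeError:  # raised for unhashable types
        return False

print(can_be_key([1, 0, 0]))            # a list
print(can_be_key(row))                  # a numpy array
print(can_be_key(tuple(row.tolist())))  # a tuple of ints
```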

Related

numpy make sub-arrays based of unique column

I have an example array that looks like array = np.array([[1,1,0,1], [0,1,0,0], [1,1,1,0], [0,0,1,2], [0,1,3,2], [1,1,0,1], [0,1,0,0]]) ...
array([[1, 1, 0, 1],
[0, 1, 0, 0],
[1, 1, 1, 0],
[0, 0, 1, 2],
[0, 1, 3, 2],
[1, 1, 0, 1],
[0, 1, 0, 0]])
With this in mind I want reformat this array into subarrays based off of the first two columns. Using How to split a numpy array based on a column? as a reference, I made this array into a list of arrays with ...
df = pd.DataFrame(array)
df['4'] = df[0].astype(str) + df[1].astype(str)
df['4'] = df['4'].astype(int)
arr = df.to_numpy()
y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])]
where y is ...
[array([[0, 0, 1, 2, 0]]),
array([[0, 1, 0, 0, 1],
[0, 1, 3, 2, 1],
[0, 1, 0, 0, 1]]),
array([[ 1, 1, 0, 1, 11],
[ 1, 1, 1, 0, 11],
[ 1, 1, 0, 1, 11]])]
This works fine but it takes far too long for y to run. The amount of time it takes increases exponentially with every row. I am playing around with hundreds of millions of rows and y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])] is not practical from a time standpoint.
Any ideas on how to speed this up?
What about using the numpy_indexed library:
import numpy as np
import numpy_indexed as npi
a = np.array([[1, 1, 0, 1],
[0, 1, 0, 0],
[1, 1, 1, 0],
[0, 0, 1, 2],
[0, 1, 3, 2],
[1, 1, 0, 1],
[0, 1, 0, 0]])
key = np.dot(a[:,:2], [1, 10])
y = npi.group_by(key).split_array_as_list(a)
Output
y
[array([[0, 0, 1, 2]]),
array([[0, 1, 0, 0],
[0, 1, 3, 2],
[0, 1, 0, 0]]),
array([[ 1, 1, 0, 1],
[ 1, 1, 1, 0],
[ 1, 1, 0, 1]])]
You can easily install the library with:
> pip install numpy-indexed
Let me know if this performs better,
from collections import defaultdict
import numpy as np
outgen = defaultdict(list)
# arr: The input numpy array, :type: np.ndarray.
c = map(lambda x: ((x[0], x[1]), x), arr)
for key, val in c:
    outgen[key].append(val)
# outgen: The required output, :type: list[np.ndarray].
outgen = [np.array(x) for x in outgen.values()]
You can use np.unique directly here.
unique, indexer = np.unique(arr[:, :2], axis=0, return_inverse=True)
{tuple(k): arr[indexer == i, :] for i, k in enumerate(unique)}
This is probably about as good as it gets for your desired output. However, instead of splitting it into a list of subarrays you could sort it by the unique key and then work with slices. This might be helpful if there are many unique values leading to a long list.
arr[:] = arr[np.argsort(indexer, kind='stable'), :] # a stable sort preserves the original order within each group
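The sort-then-slice idea can be sketched as follows, using the array from the question. This is one possible arrangement rather than a benchmarked solution; `np.split` and the variable names are choices made here for illustration.

```python
import numpy as np

arr = np.array([[1, 1, 0, 1],
                [0, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 0, 1, 2],
                [0, 1, 3, 2],
                [1, 1, 0, 1],
                [0, 1, 0, 0]])

# Label every row with the index of its (col0, col1) key
keys, inverse, counts = np.unique(arr[:, :2], axis=0,
                                  return_inverse=True, return_counts=True)
inverse = inverse.ravel()  # keep the labels 1-D regardless of NumPy version

# A stable sort keeps rows in their original order inside each group
order = np.argsort(inverse, kind="stable")
sorted_arr = arr[order]

# Group boundaries are the running totals of the group sizes
groups = np.split(sorted_arr, np.cumsum(counts)[:-1])
```

Each entry of `groups` is a view into `sorted_arr`, so the rows are copied once by the sort instead of once per boolean mask.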
EDIT:
Here is a powerful solution which I have been using for a sort of 2-D factorization. It takes 8ms for 1 million rows of single digit integers (vs > 100ms for np.unique).
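import pandas as pd
from pandas.core.sorting import get_group_index  # pandas-internal helper, not public API
columns = x[:, 0], x[:, 1]
factored = map(pd.factorize, columns)
codes, unique_values = map(list, zip(*factored))
group_index = get_group_index(codes, tuple(map(len, unique_values)), sort=False, xnull=False)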
It uses the internal algorithm of Dataframe.drop_duplicates.
Note that the ordering of the keys is not the sort order of the unique tuples.
There is also a new open source library, riptable, which emulates numpy and pandas in some ways but can be a lot more powerful. The computation below takes around 4ms.
import riptable as rt
columns = [x[:, 0], x[:, 1]]
unique_values, key = rt.unique(columns, return_inverse=True)
Here, unique_values is a tuple containing two arrays which can be zipped to get the unique tuples.

Changing items in numpy array

I want to change all items in array A (in axis=1) into 0, according to the following criteria (toy code):
import numpy as np
A = np.array([[1,3], [2,5], [6,2]] )
B = np.array([[1,1,0,0,0],[1,0,0,2,0],[0,0,2,2,2],[0,0,0,2,0],[6,6,0,0,0]])
for i in A:
    if i[1]<=2:
        B[B==i[0]]=0
# result
>>> B
array([[1, 1, 0, 0, 0],
[1, 0, 0, 2, 0],
[0, 0, 2, 2, 2],
[0, 0, 0, 2, 0],
[0, 0, 0, 0, 0]])
But, in numpy way, that is NO 'for' loops :) Thanks!
You can use a conditional list comprehension to create a list of the first value in each pair where the second value is less than or equal to two (in the example for A, only the last pair qualifies, which gives a value of 6).
Then use boolean indexing with np.isin to find the elements in B that are contained within the values from the previous condition, and then set those values to zero.
target_val = 2
B[np.isin(B, [a[0] for a in A if a[1] <= target_val])] = 0
>>> B
array([[1, 1, 0, 0, 0],
[1, 0, 0, 2, 0],
[0, 0, 2, 2, 2],
[0, 0, 0, 2, 0],
[0, 0, 0, 0, 0]])
Alternatively, you could also use np.where instead of boolean indexing; it returns a new array and leaves B unchanged.
np.where(np.isin(B, [a[0] for a in A if a[1] <= target_val]), 0, B)
In one line: B[np.isin(B, A[A[:, 1] <= 2][:, 0])] = 0
Explanation:
c = A[:, 1] <= 2 # vectorize the original `if i[1]<=2:` check over all rows of A
# i.e., mask A according to where the second values of the pairs are <= 2
d = A[c][:, 0] # index with the mask, and select the old `i[0]` values, here just `6`
e = np.isin(B, d) # mask B according to where the values are in the above
B[e] = 0 # and zero out those positions, i.e. where the old B value is 6
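Putting those steps together on the question's data gives the following sketch (assuming NumPy 1.13 or newer for `np.isin`; the names `row_mask` and `targets` are illustrative).

```python
import numpy as np

A = np.array([[1, 3], [2, 5], [6, 2]])
B = np.array([[1, 1, 0, 0, 0],
              [1, 0, 0, 2, 0],
              [0, 0, 2, 2, 2],
              [0, 0, 0, 2, 0],
              [6, 6, 0, 0, 0]])

row_mask = A[:, 1] <= 2      # rows of A whose second value is <= 2
targets = A[row_mask, 0]     # their first values, here just 6
B[np.isin(B, targets)] = 0   # zero every matching entry of B in place
print(B)
```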

Updating numpy 2-dimensional array according to conditions across different 2-D arrays

In the code that I am writing, I have three 2D numpy arrays with the same dimensions (m x n), with each 2D array containing info about a specific trait, but each corresponding cell (with a specific row/col value) across all three 2D arrays corresponding to a specific person. The three 2D arrays are trait1, trait2, and trait3. As an example, person (0, 0) will have traits 1, 2, but not three, if only trait1 and trait2 have a value of 1 at location (0,0), but trait3 does not.
What would be an efficient method of updating a 2D array at a specific location based on the values of other corresponding 2D arrays of the same dimension at the same location? That is, how can I efficiently update a 2D array at a specific location such that the other 2D arrays at this same location fulfill specific conditions?
I am currently trying to update the values of the 2D array trait1 and trait2 according to the current values of trait1 and trait2 (such that the corresponding trait1 value == 1, and the corresponding trait2 value == 0); I am also trying to update the values of trait3 according to the current values of trait1, and trait2 (under the same conditions as the previous). However, I am having trouble doing this without using nested for loops, which greatly slows down my program.
Below is my current approach, which works, but is much too slow for my purposes:
for i in range(0, m):
    for j in range(0, n):
        if trait1[i][j] == 1:
            if trait2[i][j] == 0:
                trait1[i][j] = 0
                trait2[i][j] = 1
                new_color(i, j, 1) #updates the color of the specific person on a grid
                trait3[i][j] = 0
        elif trait1[i][j] == 0:
            if trait2[i][j] <= 0:
                trait1[i][j] = 1
                trait2[i][j] = 0
                new_color(i, j, 0)
Looping over NumPy arrays element by element in Python is indeed really slow. If you can use matrix operations / NumPy functions for everything, it will go much faster.
In your case, you could first extract the indices you're interested in, and then update your matrices like this:
import numpy as np
np.random.seed(1)
# Generate some sample data
trait1, trait2, trait3 = ( np.random.randint(0,2, [4,4]) for _ in range(3) )
In [4]: trait1
Out[4]:
array([[1, 1, 0, 0],
[1, 1, 1, 1],
[1, 0, 0, 1],
[0, 1, 1, 0]])
In [5]: trait2
Out[5]:
array([[0, 1, 0, 0],
[0, 1, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0]])
In [6]: trait3
Out[6]:
array([[1, 1, 1, 1],
[1, 0, 0, 0],
[1, 1, 1, 1],
[1, 1, 0, 1]])
And then:
cond1_idx = np.where((trait1 == 1) & (trait2==0))
cond2_idx = np.where((trait1 == 0) & (trait2<=0))
trait1[cond1_idx] = 0
trait2[cond1_idx] = 1
trait3[cond1_idx] = 0
[ new_color(i, j, 1) for i,j in zip(*cond1_idx) ]
trait1[cond2_idx] = 1
trait2[cond2_idx] = 0
[ new_color(i, j, 0) for i,j in zip(*cond2_idx) ]
Result:
In [2]: trait1
Out[2]:
array([[0, 1, 1, 1],
[0, 1, 0, 0],
[1, 1, 1, 0],
[0, 0, 0, 1]])
In [3]: trait2
Out[3]:
array([[1, 1, 0, 0],
[1, 1, 1, 1],
[1, 0, 0, 1],
[1, 1, 1, 0]])
In [4]: trait3
Out[4]:
array([[0, 1, 1, 1],
[0, 0, 0, 0],
[1, 1, 1, 0],
[1, 0, 0, 1]])
I cannot really test new_color, though, since I don't have the function.
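A compact variant of the same idea uses boolean masks directly instead of np.where index tuples. This is only a sketch on a tiny hand-made example; the `recolored` list is a hypothetical stand-in for the asker's `new_color(i, j, value)` callback, which is not available here.

```python
import numpy as np

trait1 = np.array([[1, 0], [1, 0]])
trait2 = np.array([[0, 0], [1, 1]])
trait3 = np.ones((2, 2), dtype=int)

recolored = []  # hypothetical stand-in for new_color(i, j, value)

# Evaluate both conditions on the original values before changing anything
cond1 = (trait1 == 1) & (trait2 == 0)
cond2 = (trait1 == 0) & (trait2 <= 0)

trait1[cond1], trait2[cond1], trait3[cond1] = 0, 1, 0
trait1[cond2], trait2[cond2] = 1, 0

# Only the per-cell callback still needs a Python loop
for i, j in np.argwhere(cond1):
    recolored.append((int(i), int(j), 1))
for i, j in np.argwhere(cond2):
    recolored.append((int(i), int(j), 0))
```

Computing both masks first matters: a cell changed by the first rule must not be picked up again by the second.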

2-D Matrix: Finding and deleting columns that are subsets of other columns

I have a problem where I want to identify and remove columns in a logic matrix that are subsets of other columns. i.e. [1, 0, 1] is a subset of [1, 1, 1]; but neither of [1, 1, 0] and [0, 1, 1] are subsets of each other. I wrote out a quick piece of code that identifies the columns that are subsets, which does (n^2-n)/2 checks using a couple nested for loops.
import numpy as np
A = np.array([[1, 0, 0, 0, 0, 1],
[0, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 0, 1, 0, 1, 0]])
rows,cols = A.shape
columns = [True]*cols
for i in range(cols):
    for j in range(i+1,cols):
        diff = A[:,i]-A[:,j]
        if all(diff >= 0):
            print("%d is a subset of %d" % (j, i))
            columns[j] = False
        elif all(diff <= 0):
            print("%d is a subset of %d" % (i, j))
            columns[i] = False
B = A[:,columns]
The solution should be
>>> print(B)
[[1 0 0]
[0 1 1]
[1 1 0]
[1 0 1]
[1 0 1]
[1 0 0]
[0 1 1]
[0 1 0]]
For massive matrices though, I'm sure there's a way that I could do this faster. One thought is to eliminate subset columns as I go so I'm not checking columns already known to be a subset. Another thought is to vectorize this so I don't have O(n^2) operations. Thank you.
Since the A matrices I'm actually dealing with are 5000x5000 and sparse with about 4% density, I decided to try a sparse matrix approach combined with Python's "set" objects. Overall it's much faster than my original solution, but I feel like my process of going from matrix A to list of sets D is not as fast as it could be. Any ideas on how to do this better are appreciated.
Solution
import numpy as np
A = np.array([[1, 0, 0, 0, 0, 1],
[0, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 0, 1, 0, 1, 0]])
rows,cols = A.shape
drops = np.zeros(cols).astype(bool)
# sparse nonzero elements
C = np.nonzero(A)
# create a list of sets containing the indices of non-zero elements of each column
D = [set() for j in range(cols)]
for i in range(len(C[0])):
    D[C[1][i]].add(C[0][i])
# find subsets, ignoring columns that are known to already be subsets
for i in range(cols):
    if drops[i]==True:
        continue
    col1 = D[i]
    for j in range(i+1,cols):
        col2 = D[j]
        if col2.issubset(col1):
            # I tried `if drops[j]==True: continue` here, but that was slower
            print("%d is a subset of %d" % (j, i))
            drops[j] = True
        elif col1.issubset(col2):
            print("%d is a subset of %d" % (i, j))
            drops[i] = True
            break
B = A[:, ~drops]
print(B)
Here's another approach using NumPy broadcasting -
A[:,~((np.triu(((A[:,:,None] - A[:,None,:])>=0).all(0),1)).any(0))]
A detailed commented explanation is listed below -
# Perform elementwise subtractions keeping the alignment along the columns
sub = A[:,:,None] - A[:,None,:]
# Look for >=0 subtractions as they indicate non-subset criteria
mask3D = sub>=0
# Check if all elements along each column satisfy that criteria giving us a 2D
# mask which represent the relationship between all columns against each other
# for the non subset criteria
mask2D = mask3D.all(0)
# Finally get the valid column mask by checking for all columns in the 2D mask
# that have at least one element in a column sans the diagonal elements.
# Index into input array with it for the final output.
colmask = ~(np.triu(mask2D,1).any(0))
out = A[:,colmask]
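One caveat: because of np.triu, the mask above only compares each column against earlier columns, so on the question's A it keeps column 1 even though it is a subset of the later column 3. The sketch below is a possible variant that checks both directions and keeps the first copy of duplicated columns. It assumes a 0/1 matrix, and `keep_mask` is a name made up for this example. Like the original, it builds a rows x cols x cols intermediate, so it is only practical for moderately sized matrices.

```python
import numpy as np

def keep_mask(A):
    """Boolean mask of the columns of a 0/1 matrix that should be kept."""
    A = np.asarray(A, dtype=bool)
    # contains[i, j] is True when column j is a subset of column i
    contains = (A[:, :, None] >= A[:, None, :]).all(axis=0)
    strict = contains & ~contains.T                 # j is a proper subset of i
    later_copy = np.triu(contains & contains.T, 1)  # j duplicates an earlier i
    return ~(strict | later_copy).any(axis=0)

A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])

mask = keep_mask(A)
B = A[:, mask]
print(B)
```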
Definition: for 0/1 columns, col1 is a subset of col2 if and only if col1.dot(col1) == col1.dot(col2).
Definition: col1 and col2 are the same if and only if col1 is a subset of col2 and vice versa.
I split the work into two steps. First, get rid of all but one of each set of equivalent columns. Then remove subsets.
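The dot-product definition can be checked on the small vectors from the question. A minimal sketch, assuming 0/1 integer vectors; `is_subset` is a helper named here for illustration.

```python
import numpy as np

def is_subset(col1, col2):
    """True if every 1 in col1 is also a 1 in col2 (0/1 vectors only)."""
    return int(col1.dot(col1)) == int(col1.dot(col2))

a = np.array([1, 0, 1])
b = np.array([1, 1, 1])
c = np.array([1, 1, 0])
d = np.array([0, 1, 1])

print(is_subset(a, b))  # a against b
print(is_subset(c, d))  # c against d
```

col1.dot(col1) counts the ones in col1, while col1.dot(col2) counts the positions where both are one, so the two agree exactly when no one of col1 falls outside col2.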
Solution
import numpy as np
def drop_duplicates(A):
    N = A.T.dot(A)
    D = np.diag(N)[:, None]
    drops = np.tril((N == D) & (N == D.T), -1).any(axis=1)
    return A[:, ~drops], drops
def drop_subsets(A):
    N = A.T.dot(A)
    drops = ((N == np.diag(N)).sum(axis=0) > 1)
    return A[:, ~drops], drops
def drop_strict(A):
    A1, d1 = drop_duplicates(A)
    A2, d2 = drop_subsets(A1)
    d1[~d1] = d2
    return A2, d1
A = np.array([[1, 0, 0, 0, 0, 1],
[0, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 0, 1, 0, 1, 0]])
B, drops = drop_strict(A)
Demonstration
print(B)
print()
print(drops)
[[1 0 0]
[0 1 1]
[1 1 0]
[1 0 1]
[1 0 1]
[1 0 0]
[0 1 1]
[0 1 0]]
[False True False False True True]
Explanation
N = A.T.dot(A) is a matrix of every combination of dot product. Per the definition of subset at the top, this will come in handy.
def drop_duplicates(A):
    N = A.T.dot(A)
    D = np.diag(N)[:, None]
    # (N == D)[i, j] being True identifies A[:, i] as a subset
    # of A[:, j]. (N == D.T)[i, j] is the reversed relationship.
    # If A[:, j] is subset of A[:, i] and vice versa, then we have
    # equivalent columns. Taking the lower triangle ensures we
    # leave one.
    drops = np.tril((N == D) & (N == D.T), -1).any(axis=1)
    return A[:, ~drops], drops
def drop_subsets(A):
    N = A.T.dot(A)
    # without concern for removing equivalent columns, this
    # removes any column that has an off diagonal equal to the diagonal
    drops = ((N == np.diag(N)).sum(axis=0) > 1)
    return A[:, ~drops], drops

Fastest way to remove identical sub-arrays in a nd-array? [duplicate]

I need to find unique rows in a numpy.array.
For example:
>>> a # I have
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0]])
I know that I can create a set and loop over the array, but I am looking for an efficient pure numpy solution. I believe that there is a way to set data type to void and then I could just use numpy.unique, but I couldn't figure out how to make it work.
As of NumPy 1.13, one can simply choose the axis for selection of unique values in any N-dim array. To get unique rows, one can do:
unique_rows = np.unique(original_array, axis=0)
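Note that np.unique returns the rows in sorted order, while the question's new_a keeps them in order of first appearance. If that order matters, one option (a sketch, assuming NumPy 1.13 or newer) is to ask for the first-occurrence indices and sort those:

```python
import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# idx holds, for each unique row, the position of its first occurrence
_, idx = np.unique(a, axis=0, return_index=True)

# Sorting the indices restores the original order of appearance
new_a = a[np.sort(idx)]
print(new_a)
```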
Yet another possible solution
np.vstack(list({tuple(row) for row in a})) # current NumPy requires a sequence here, not a set
Another option to the use of structured arrays is using a view of a void type that joins the whole row into a single item:
a = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)
unique_a = a[idx]
>>> unique_a
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
EDIT
Added np.ascontiguousarray following @seberg's recommendation. This will slow the method down if the array is not already contiguous.
EDIT
The above can be slightly sped up, perhaps at the cost of clarity, by doing:
unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])
Also, at least on my system, performance wise it is on par, or even better, than the lexsort method:
a = np.random.randint(2, size=(10000, 6))
%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop
%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop
a = np.random.randint(2, size=(10000, 100))
%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop
%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop
If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.
The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.
As a quick example:
import numpy as np
data = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
ncols = data.shape[1]
dtype = data.dtype.descr * ncols
struct = data.view(dtype)
uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)
print(uniq)
To understand what's going on, have a look at the intermediary results.
Once we view things as a structured array, each element in the array is a row in your original array. (Basically, it's a similar data structure to a list of tuples.)
In [71]: struct
Out[71]:
array([[(1, 1, 1, 0, 0, 0)],
[(0, 1, 1, 1, 0, 0)],
[(0, 1, 1, 1, 0, 0)],
[(1, 1, 1, 0, 0, 0)],
[(1, 1, 1, 1, 1, 0)]],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
In [72]: struct[0]
Out[72]:
array([(1, 1, 1, 0, 0, 0)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
Once we run numpy.unique, we'll get a structured array back:
In [73]: np.unique(struct)
Out[73]:
array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
That we then need to view as a "normal" array (_ stores the result of the last calculation in ipython, which is why you're seeing _.view...):
In [74]: _.view(data.dtype)
Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])
And then reshape back into a 2D array (-1 is a placeholder that tells numpy to calculate the correct number of rows, given the number of columns):
In [75]: _.reshape(-1, ncols)
Out[75]:
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
Obviously, if you wanted to be more concise, you could write it as:
import numpy as np
def unique_rows(data):
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])
data = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
print(unique_rows(data))
Which results in:
[[0 1 1 1 0 0]
[1 1 1 0 0 0]
[1 1 1 1 1 0]]
np.unique, when I run it on np.random.random(100).reshape(10,10), returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:
array = np.random.random(100).reshape(10,10) # your 2-D numpy array
new_array = [tuple(row) for row in array]
uniques = np.unique(new_array, axis=0) # current NumPy flattens the tuples again unless axis=0 is given
That is the only way I see of changing the types to do what you want, and I am not sure if the list iteration to change to tuples is okay with your "not looping through" requirement.
np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous. This can be done manually without flattening:
ind = np.lexsort(a.T)
a[ind[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]]
This method does not use tuples, and should be much faster and simpler than other methods given here.
NOTE: A previous version of this did not have the ind right after a[, which meant that the wrong indices were used. Also, Joe Kington makes a good point that this does make a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it:
b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]
This is faster and uses less memory.
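Wrapped up as a function, the sort-then-compare idea looks roughly like this. It is a sketch rather than the answer's exact code: the keys are reversed so that the first column becomes the primary sort key, which makes the output order match np.unique(a, axis=0). The function name is illustrative.

```python
import numpy as np

def unique_rows_lexsort(a):
    """Unique rows of a 2-D array via one lexicographic sort."""
    # np.lexsort treats the LAST key as primary, so reverse the columns
    order = np.lexsort(a.T[::-1])
    b = a[order]
    # Keep the first row, then every row that differs from its predecessor
    keep = np.concatenate(([True], np.any(b[1:] != b[:-1], axis=1)))
    return b[keep]

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])
result = unique_rows_lexsort(a)
print(result)
```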
Also, if you want to find unique rows in an ndarray regardless of how many dimensions are in the array, the following will work:
b = a[np.lexsort(a.reshape((a.shape[0],-1)).T)]
b[np.concatenate(([True], np.any(b[1:]!=b[:-1],axis=tuple(range(1,a.ndim)))))]
An interesting remaining issue would be if you wanted to sort/unique along an arbitrary axis of an arbitrary-dimension array, something that would be more difficult.
Edit:
To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn't too much of a difference, though this version is a bit faster:
In [87]: %timeit unique(a.view(dtype)).view('<i8')
10000 loops, best of 3: 48.4 us per loop
In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
10000 loops, best of 3: 37.6 us per loop
In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10000 loops, best of 3: 41.6 us per loop
With a larger a, however, this version ends up being much, much faster:
In [96]: a = np.random.randint(0,2,size=(10000,6))
In [97]: %timeit unique(a.view(dtype)).view('<i8')
10 loops, best of 3: 24.4 ms per loop
In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10 loops, best of 3: 28.2 ms per loop
In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
100 loops, best of 3: 3.25 ms per loop
I've compared the suggested alternative for speed and found that, surprisingly, the void view unique solution is even a bit faster than numpy's native unique with the axis argument. If you're looking for speed, you'll want
numpy.unique(
a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
).view(a.dtype).reshape(-1, a.shape[1])
I've implemented that fastest variant in npx.unique_rows.
There is a bug report on GitHub for this, too.
Code to reproduce the plot:
import numpy
import perfplot
def unique_void_view(a):
    return (
        numpy.unique(a.view(numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))))
        .view(a.dtype)
        .reshape(-1, a.shape[1])
    )
def lexsort(a):
    ind = numpy.lexsort(a.T)
    return a[
        ind[numpy.concatenate(([True], numpy.any(a[ind[1:]] != a[ind[:-1]], axis=1)))]
    ]
def vstack(a):
    # build a set of row tuples to drop duplicates, then stack them back
    return numpy.vstack(list({tuple(row) for row in a}))
def unique_axis(a):
    return numpy.unique(a, axis=0)
perfplot.show(
    setup=lambda n: numpy.random.randint(2, size=(n, 20)),
    kernels=[unique_void_view, lexsort, vstack, unique_axis],
    n_range=[2 ** k for k in range(15)],
    xlabel="len(a)",
    equality_check=None,
)
Here is another variation of @Greg's Pythonic answer
np.vstack(list(set(map(tuple, a))))
I didn’t like any of these answers because none handle floating-point arrays in a linear algebra or vector space sense, where two rows being “equal” means “within some 𝜀”. The one answer that has a tolerance threshold, https://stackoverflow.com/a/26867764/500207, took the threshold to be both element-wise and decimal precision, which works for some cases but isn’t as mathematically general as a true vector distance.
Here’s my version:
from scipy.spatial.distance import squareform, pdist
import numpy as np
def uniqueRows(arr, thresh=0.0, metric='euclidean'):
    "Returns subset of rows that are unique, in terms of Euclidean distance"
    distances = squareform(pdist(arr, metric=metric))
    idxset = {tuple(np.nonzero(v)[0]) for v in distances <= thresh}
    return arr[[x[0] for x in idxset]]
# With this, unique columns are super-easy:
def uniqueColumns(arr, *args, **kwargs):
    return uniqueRows(arr.T, *args, **kwargs)
The public-domain function above uses scipy.spatial.distance.pdist to find the Euclidean (customizable) distance between each pair of rows. Then it compares each distance to a threshold to find the rows that are within thresh of each other, and returns just one row from each thresh-cluster.
As hinted, the distance metric needn’t be Euclidean—pdist can compute sundry distances including cityblock (Manhattan-norm) and cosine (the angle between vectors).
If thresh=0 (the default), then rows have to be bit-exact to be considered “unique”. Other good values for thresh use scaled machine-precision, i.e., thresh=np.spacing(1)*1e3.
Why not use drop_duplicates from pandas:
>>> timeit pd.DataFrame(image.reshape(-1,3)).drop_duplicates().values
1 loops, best of 3: 3.08 s per loop
>>> timeit np.vstack({tuple(r) for r in image.reshape(-1,3)})
1 loops, best of 3: 51 s per loop
The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by Jaime in a nice and tested interface, plus many more features:
import numpy_indexed as npi
new_a = npi.unique(a) # unique elements over axis=0 (rows) by default
Based on the answer in this page I have written a function that replicates the capability of MATLAB's unique(input,'rows') function, with the additional feature to accept tolerance for checking the uniqueness. It also returns the indices such that c = data[ia,:] and data = c[ic,:]. Please report if you see any discrepancies or errors.
def unique_rows(data, prec=5):
    import numpy as np
    d_r = np.fix(data * 10 ** prec) / 10 ** prec + 0.0
    b = np.ascontiguousarray(d_r).view(np.dtype((np.void, d_r.dtype.itemsize * d_r.shape[1])))
    _, ia = np.unique(b, return_index=True)
    _, ic = np.unique(b, return_inverse=True)
    return np.unique(b).view(d_r.dtype).reshape(-1, d_r.shape[1]), ia, ic
Beyond @Jaime's excellent answer, another way to collapse a row is to use a.strides[0] (assuming a is C-contiguous), which is equal to a.dtype.itemsize*a.shape[1]. Furthermore, void(n) is a shortcut for dtype((void,n)). We finally arrive at this shortest version:
a[unique(a.view(void(a.strides[0])),1)[1]]
For the example array, this gives:
[[0 1 1 1 0 0]
[1 1 1 0 0 0]
[1 1 1 1 1 0]]
In old NumPy versions, np.unique worked given a list of tuples (current versions flatten the input unless axis=0 is passed):
>>> np.unique([(1, 1), (2, 2), (3, 3), (4, 4), (2, 2)])
Out[9]:
array([[1, 1],
[2, 2],
[3, 3],
[4, 4]])
With a list of lists it raises a TypeError: unhashable type: 'list'
For general purpose like 3D or higher multidimensional nested arrays, try this:
import numpy as np
def unique_nested_arrays(ar):
    origin_shape = ar.shape
    origin_dtype = ar.dtype
    ar = ar.reshape(origin_shape[0], np.prod(origin_shape[1:]))
    ar = np.ascontiguousarray(ar)
    unique_ar = np.unique(ar.view([('', origin_dtype)]*np.prod(origin_shape[1:])))
    return unique_ar.view(origin_dtype).reshape((unique_ar.shape[0], ) + origin_shape[1:])
which satisfies your 2D dataset:
a = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
unique_nested_arrays(a)
gives:
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
But also 3D arrays like:
b = np.array([[[1, 1, 1], [0, 1, 1]],
[[0, 1, 1], [1, 1, 1]],
[[1, 1, 1], [0, 1, 1]],
[[1, 1, 1], [1, 1, 1]]])
unique_nested_arrays(b)
gives:
array([[[0, 1, 1], [1, 1, 1]],
[[1, 1, 1], [0, 1, 1]],
[[1, 1, 1], [1, 1, 1]]])
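On NumPy 1.13 or newer, the axis argument mentioned in other answers should cover the 3D case as well, since it compares whole sub-arrays along the chosen axis. A minimal sketch on the same data:

```python
import numpy as np

b = np.array([[[1, 1, 1], [0, 1, 1]],
              [[0, 1, 1], [1, 1, 1]],
              [[1, 1, 1], [0, 1, 1]],
              [[1, 1, 1], [1, 1, 1]]])

# Each b[i] is a 2 x 3 block; axis=0 removes duplicate blocks
unique_b = np.unique(b, axis=0)
print(unique_b)
```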
None of these answers worked for me. I'm assuming as my unique rows contained strings and not numbers. However this answer from another thread did work:
Source: https://stackoverflow.com/a/38461043/5402386
You can use .count() and .index() list's methods
coor = np.array([[10, 10], [12, 9], [10, 5], [12, 9]])
coor_tuple = [tuple(x) for x in coor]
unique_coor = sorted(set(coor_tuple), key=lambda x: coor_tuple.index(x))
unique_count = [coor_tuple.count(x) for x in unique_coor]
unique_index = [coor_tuple.index(x) for x in unique_coor]
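Since .count() and .index() each rescan the whole list, the approach above does quadratic work when there are many rows. A possible single-pass alternative, swapping in collections.Counter and a dictionary (this is a different technique from the quoted answer, shown only as a sketch):

```python
from collections import Counter
import numpy as np

coor = np.array([[10, 10], [12, 9], [10, 5], [12, 9]])
coor_tuple = [tuple(row) for row in coor.tolist()]

counts = Counter(coor_tuple)       # occurrences of every row
first_index = {}
for i, t in enumerate(coor_tuple):
    first_index.setdefault(t, i)   # remember the first position only

# Dictionaries keep insertion order (Python 3.7+), i.e. order of first appearance
unique_coor = list(first_index)
unique_count = [counts[t] for t in unique_coor]
unique_index = [first_index[t] for t in unique_coor]
```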
We can turn an m x n numeric numpy array into a length-m numpy string array. The following function provides count, inverse_idx, etc., just like numpy.unique:
import numpy as np
def uniqueRow(a):
    # This function turns an m x n numpy array into a length-m numpy array
    # of strings, so that np.unique can be used
    # Input: an m x n numpy array (a)
    # Output: unique m' x n numpy array (unique), inverse_indx, and counts
    b = a.astype(str) # np.str was removed in NumPy 1.24; the builtin str works
    s2 = np.array(['-'.join(row) for row in b]) # join each row as "x-y-z"
    s3, idx, inv_, c = np.unique(s2, return_index=True, return_inverse=True, return_counts=True)
    return a[idx], inv_, c
Example:
A = np.array([[3.17, 9.502, 3.291],
[9.984, 2.773, 6.852],
[1.172, 8.885, 4.258],
[9.73, 7.518, 3.227],
[8.113, 9.563, 9.117],
[9.984, 2.773, 6.852],
[9.73, 7.518, 3.227]])
B, inv_, c = uniqueRow(A)
Results:
B:
[[ 1.172 8.885 4.258]
[ 3.17 9.502 3.291]
[ 8.113 9.563 9.117]
[ 9.73 7.518 3.227]
[ 9.984 2.773 6.852]]
inv_:
[1 4 0 3 2 4 3]
c:
[1 1 1 2 2]
Let's get the entire numpy matrix as a list, then drop duplicates from this list, and finally turn our unique list back into a numpy matrix:
matrix_as_list=data.tolist()
matrix_as_list:
[[1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0], [0, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]
uniq_list=list()
uniq_list.append(matrix_as_list[0])
[uniq_list.append(item) for item in matrix_as_list if item not in uniq_list]
unique_matrix=np.array(uniq_list)
unique_matrix:
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0]])
The most straightforward solution is to make each row a single item by turning it into a string. Each row can then be compared as a whole for uniqueness using numpy. This solution is generalizable: you just need to reshape and transpose your array for other combinations. Here is the solution for the problem provided.
import numpy as np
original = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
uniques, index = np.unique([str(i) for i in original], return_index=True)
cleaned = original[index]
print(cleaned)
Will Give:
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
Send my nobel prize in the mail
import numpy as np
original = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
# create a view that treats each row as a single tuple-like item and return the unique indices.
_, unique_index = np.unique(original.view(original.dtype.descr * original.shape[1]),
return_index=True)
# get unique set
print(original[unique_index])
