Find unique columns and column membership


I went through these threads:
Find unique rows in numpy.array
Removing duplicates in each row of a numpy array
Pandas: unique dataframe
and they all discuss several methods for computing the matrix with unique rows and columns.
However, the solutions look a bit convoluted, at least to the untrained eye. Here, for example, is the top solution from the first thread, which (correct me if I am wrong) I believe is the safest and fastest:
np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
Either way, the above solution only returns the matrix of unique rows. What I am looking for is something along the lines of the original functionality of np.unique:
u, indices = np.unique(a, return_inverse=True)
which returns not only the list of unique entries, but also, for each item, the index of the unique entry it belongs to. How can I do this for columns?
Here is an example of what I am looking for:
array([[0, 2, 0, 2, 2, 0, 2, 1, 1, 2],
[0, 1, 0, 1, 1, 1, 2, 2, 2, 2]])
We would have:
u = array([0,1,2,3,4])
indices = array([0,1,0,1,1,2,3,4,4,3])
Where the different values in u represent the set of unique columns in the original array:
0 -> [0,0]
1 -> [2,1]
2 -> [0,1]
3 -> [2,2]
4 -> [1,2]

First, let's get the unique indices. To do so, we need to start by transposing your array:
>>> a=a.T
Then use a modified version of the above to get the unique rows and the inverse indices:
>>> ua, uind = np.unique(np.ascontiguousarray(a).view(np.dtype((np.void,a.dtype.itemsize * a.shape[1]))),return_inverse=True)
>>> uind
array([0, 3, 0, 3, 3, 1, 4, 2, 2, 4])
# Thanks to @Jamie
>>> ua = ua.view(a.dtype).reshape(ua.shape + (-1,))
>>> ua
array([[0, 0],
[0, 1],
[1, 2],
[2, 1],
[2, 2]])
For sanity:
>>> np.all(a==ua[uind])
True
To reproduce your chart:
>>> for x in range(ua.shape[0]):
...     print(x, '->', ua[x])
...
0 -> [0 0]
1 -> [0 1]
2 -> [1 2]
3 -> [2 1]
4 -> [2 2]
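To do exactly what you ask, working on the columns directly (this will be a bit slower if it has to convert the array; judging by the output shapes, a here is the original, untransposed array):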
>>> b=np.asfortranarray(a).view(np.dtype((np.void,a.dtype.itemsize * a.shape[0])))
>>> ua,uind=np.unique(b,return_inverse=True)
>>> uind
array([0, 3, 0, 3, 3, 1, 4, 2, 2, 4])
>>> ua.view(a.dtype).reshape(ua.shape+(-1,),order='F')
array([[0, 0, 1, 2, 2],
[0, 1, 2, 1, 2]])
#To return this in the previous order.
>>> ua.view(a.dtype).reshape(ua.shape + (-1,))

Essentially, you want np.unique to return the indexes of the unique columns, and the indices of where they're used? This is easy enough to do by transposing the matrix and then using the code from the other question, with the addition of return_inverse=True.
at = a.T
b = np.ascontiguousarray(at).view(np.dtype((np.void, at.dtype.itemsize * at.shape[1])))
_, u, indices = np.unique(b, return_index=True, return_inverse=True)
With your a, this gives:
In [35]: u
Out[35]: array([0, 5, 7, 1, 6])
In [36]: indices
Out[36]: array([0, 3, 0, 3, 3, 1, 4, 2, 2, 4])
It's not entirely clear to me what you want u to be, however. If you want it to be the unique columns, then you could use the following instead:
at = a.T
b = np.ascontiguousarray(at).view(np.dtype((np.void, at.dtype.itemsize * at.shape[1])))
_, idx, indices = np.unique(b, return_index=True, return_inverse=True)
u = a[:,idx]
This would give
In [41]: u
Out[41]:
array([[0, 0, 1, 2, 2],
[0, 1, 2, 1, 2]])
In [42]: indices
Out[42]: array([0, 3, 0, 3, 3, 1, 4, 2, 2, 4])
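For readers on NumPy 1.13 or later, the same result can probably be obtained without the void-view trick, because np.unique accepts an axis argument. The following is only a minimal sketch under that version assumption; the np.ravel call is a defensive step I added, since some NumPy releases return the inverse with an extra dimension when axis is given.

```python
import numpy as np

a = np.array([[0, 2, 0, 2, 2, 0, 2, 1, 1, 2],
              [0, 1, 0, 1, 1, 1, 2, 2, 2, 2]])

# axis=1 treats every column as one item (assumes NumPy >= 1.13).
u, indices = np.unique(a, axis=1, return_inverse=True)

# Flatten defensively: the inverse is 1-D in most releases, but not all.
indices = np.ravel(indices)

print(u)        # unique columns, sorted lexicographically
print(indices)  # for each column of a, its position in u

# Sanity check: the unique columns rebuild the original array.
assert np.array_equal(u[:, indices], a)
```

The values agree with the u and indices shown in the answers above.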

Not entirely sure what you are after, but have a look at the numpy_indexed package (disclaimer: I am its author); it is sure to make problems of this kind easier:
import numpy_indexed as npi
unique_columns = npi.unique(A, axis=1)
# or perhaps this is what you want?
unique_columns, indices = npi.group_by(A.T, np.arange(A.shape[1]))

Related

Assigning to slices of 2D NumPy array

I want to assign 0 to different length slices of a 2d array.
Example:
import numpy as np
arr = np.array([[1,2,3,4],
[1,2,3,4],
[1,2,3,4],
[1,2,3,4]])
idxs = np.array([0,1,2,0])
Given the above array arr and indices idxs how can you assign to different length slices. Such that the result is:
arr = np.array([[0,2,3,4],
[0,0,3,4],
[0,0,0,4],
[0,2,3,4]])
These don't work
slices = np.array([np.arange(i) for i in idxs])
arr[slices] = 0
arr[:, :idxs] = 0

You can use broadcasted comparison to generate a mask, and index into arr accordingly:
arr[np.arange(arr.shape[1]) <= idxs[:, None]] = 0
print(arr)
array([[0, 2, 3, 4],
[0, 0, 3, 4],
[0, 0, 0, 4],
[0, 2, 3, 4]])
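To make the broadcasting step easier to follow, here is a small sketch that builds the mask as a separate variable. The names cols, mask and result are mine, and np.where is used instead of in-place assignment purely so the original arr stays untouched for comparison.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4]] * 4)
idxs = np.array([0, 1, 2, 0])

cols = np.arange(arr.shape[1])   # shape (4,): column numbers 0..3
# idxs[:, None] has shape (4, 1), so the comparison broadcasts to (4, 4):
# mask[i, j] is True when column j lies within the first idxs[i] + 1 columns.
mask = cols <= idxs[:, None]

# Non-destructive variant of arr[mask] = 0.
result = np.where(mask, 0, arr)

print(mask)
print(result)
```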

This does the trick:
import numpy as np
arr = np.array([[1,2,3,4],
[1,2,3,4],
[1,2,3,4],
[1,2,3,4]])
idxs = [0,1,2,0]
for i,j in zip(range(arr.shape[0]),idxs):
    arr[i,:j+1]=0

import numpy as np
arr = np.array([[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4]])
idxs = np.array([0, 1, 2, 0])
for i, idx in enumerate(idxs):
    arr[i,:idx+1] = 0

Here is a sparse solution that may be useful in cases where only a small fraction of places should be zeroed out:
>>> idx = idxs+1
>>> I = idx.cumsum()
>>> cidx = np.ones((I[-1],), int)
>>> cidx[0] = 0
>>> cidx[I[:-1]]-=idx[:-1]
>>> cidx=np.cumsum(cidx)
>>> ridx = np.repeat(np.arange(idx.size), idx)
>>> arr[ridx, cidx]=0
>>> arr
array([[0, 2, 3, 4],
[0, 0, 3, 4],
[0, 0, 0, 4],
[0, 2, 3, 4]])
Explanation: We need to construct the coordinates of the positions we want to put zeros in.
The row indices are easy: we just need to go from 0 to 3 repeating each number to fill the corresponding slice.
The column indices start at zero and most of the time are incremented by 1. So to construct them we use cumsum on mostly ones. Only at the start of each new row we have to reset. We do that by subtracting the length of the corresponding slice such as to cancel the ones we have summed in that row.
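The coordinate construction described above can be wrapped in a small helper so that the intermediate row and column indices can be inspected. This is just a sketch: the function name prefix_coordinates and its variable names are hypothetical, and it assumes every entry of idxs is non-negative.

```python
import numpy as np

def prefix_coordinates(idxs):
    """Return (rows, cols) addressing the first idxs[i] + 1 entries of each row i."""
    lengths = np.asarray(idxs) + 1           # slice length per row
    ends = lengths.cumsum()                  # where each row's run ends
    steps = np.ones(ends[-1], dtype=int)     # column index mostly grows by 1
    steps[0] = 0                             # first column index is 0
    steps[ends[:-1]] -= lengths[:-1]         # reset to 0 at the start of each new row
    cols = steps.cumsum()
    rows = np.repeat(np.arange(lengths.size), lengths)
    return rows, cols

arr = np.array([[1, 2, 3, 4]] * 4)
rows, cols = prefix_coordinates(np.array([0, 1, 2, 0]))
print(rows)  # row index of every position to be zeroed
print(cols)  # matching column index
arr[rows, cols] = 0
print(arr)
```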

What is a faster way to get the location of unique rows in numpy

I have a list of unique rows and another larger array of data (called test_rows in example). I was wondering if there was a faster way to get the location of each unique row in the data. The fastest way that I could come up with is...
import numpy
uniq_rows = numpy.array([[0, 1, 0],
[1, 1, 0],
[1, 1, 1],
[0, 1, 1]])
test_rows = numpy.array([[0, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0],
[0, 1, 0],
[0, 1, 1],
[0, 1, 1],
[1, 1, 1],
[1, 1, 0],
[1, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0]])
# this gives me the indexes of each group of unique rows
for row in uniq_rows.tolist():
    print(row, numpy.where((test_rows == row).all(axis=1))[0])
This prints...
[0, 1, 0] [ 1 4 10]
[1, 1, 0] [ 3 8 12]
[1, 1, 1] [7 9]
[0, 1, 1] [0 5 6]
Is there a better or more numpythonic (not sure if that word exists) way to do this? I was searching for a numpy group function but could not find it. Basically for any incoming dataset I need the fastest way to get the locations of each unique row in that data set. The incoming dataset will not always have every unique row or the same number.
EDIT:
This is just a simple example. In my application the numbers would not be just zeros and ones, they could be anywhere from 0 to 32000. The size of uniq rows could be between 4 to 128 rows and the size of test_rows could be in the hundreds of thousands.

Numpy
From version 1.13 of numpy you can use numpy.unique with the axis argument (axis=0 treats each row as one item), like np.unique(test_rows, return_counts=True, return_index=True, axis=0)
Pandas
df = pd.DataFrame(test_rows)
uniq = pd.DataFrame(uniq_rows)
uniq
   0  1  2
0  0  1  0
1  1  1  0
2  1  1  1
3  0  1  1
Or you could generate the unique rows automatically from the incoming DataFrame
uniq_generated = df.drop_duplicates().reset_index(drop=True)
yields
   0  1  2
0  0  1  1
1  0  1  0
2  0  0  0
3  1  1  0
4  1  1  1
and then look for it
d = dict()
for idx, row in uniq.iterrows():
    d[idx] = df.index[(df == row).all(axis=1)].values
This is about the same as your where method
d
{0: array([ 1, 4, 10], dtype=int64),
1: array([ 3, 8, 12], dtype=int64),
2: array([7, 9], dtype=int64),
3: array([0, 5, 6], dtype=int64)}

There are a lot of solutions here, but I'm adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although the array broadcasting may cause memory to be an issue if large arrays are used.
np.where((uniq_rows[:, None, :] == test_rows).all(2))
Wonderfully simple, eh? This returns a tuple of two arrays: the indices into uniq_rows and the indices of the matching rows in test_rows.
(array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
array([ 1, 4, 10, 3, 8, 12, 7, 9, 0, 5, 6]))
How it works:
(uniq_rows[:, None, :] == test_rows)
Uses array broadcasting to compare each row of test_rows with each row in uniq_rows, element by element. This results in a 4x13x3 boolean array. all(2) then reduces over the last axis to determine which pairs of rows are equal (all comparisons returned True). Finally, where returns the indices of these pairs.
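If one array of locations per unique row is wanted, as in the question's printout, the boolean match matrix can be split row by row. A minimal sketch follows; match and locations are names I introduced, and the memory caveat about the broadcasted comparison still applies.

```python
import numpy as np

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

# match[i, j] is True when uniq_rows[i] equals test_rows[j]; shape (4, 13).
match = (uniq_rows[:, None, :] == test_rows).all(axis=2)

# One index array per unique row, in the order of uniq_rows.
locations = [np.flatnonzero(row_matches) for row_matches in match]

for row, loc in zip(uniq_rows.tolist(), locations):
    print(row, loc)
```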

With the np.unique from v1.13 (downloaded from the source link on the latest documentation, https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L112-L247)
In [157]: aset.unique(test_rows, axis=0,return_inverse=True,return_index=True)
Out[157]:
(array([[0, 0, 0],
[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]]),
array([2, 1, 0, 3, 7], dtype=int32),
array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32))
In [158]: a,b,c=_
In [159]: c
Out[159]: array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32)
In [164]: from collections import defaultdict
In [165]: dd = defaultdict(list)
In [166]: for i,v in enumerate(c):
     ...:     dd[v].append(i)
     ...:
In [167]: dd
Out[167]:
defaultdict(list,
{0: [2, 11],
1: [1, 4, 10],
2: [0, 5, 6],
3: [3, 8, 12],
4: [7, 9]})
or indexing the dictionary with the unique rows (as hashable tuple):
In [170]: dd = defaultdict(list)
In [171]: for i,v in enumerate(c):
     ...:     dd[tuple(a[v])].append(i)
     ...:
In [172]: dd
Out[172]:
defaultdict(list,
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]})
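The Python-level loop over the inverse array could probably be avoided by sorting the inverse and splitting it at the group boundaries. This is a sketch of that idea rather than part of the answer above; it assumes NumPy 1.13 or later for the axis argument, and the names order, counts and groups are mine.

```python
import numpy as np

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

u, inverse = np.unique(test_rows, axis=0, return_inverse=True)
inverse = np.ravel(inverse)  # defensive: some releases return a 2-D inverse

# A stable sort keeps the row numbers ascending inside each group.
order = np.argsort(inverse, kind="stable")
counts = np.bincount(inverse)                     # size of each group
groups = np.split(order, np.cumsum(counts)[:-1])  # one index array per unique row

for row, group in zip(u.tolist(), groups):
    print(row, group)
```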

This will do the job:
import numpy as np
uniq_rows = np.array([[0, 1, 0],
[1, 1, 0],
[1, 1, 1],
[0, 1, 1]])
test_rows = np.array([[0, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0],
[0, 1, 0],
[0, 1, 1],
[0, 1, 1],
[1, 1, 1],
[1, 1, 0],
[1, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0]])
indices=np.where(np.sum(np.abs(np.repeat(uniq_rows,len(test_rows),axis=0)-np.tile(test_rows,(len(uniq_rows),1))),axis=1)==0)[0]
loc=indices//len(test_rows)
indices=indices-loc*len(test_rows)
res=[[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
print(res)
[[1, 4, 10], [3, 8, 12], [7, 9], [0, 5, 6]]
This will work for all the cases including the cases in which not all the rows in uniq_rows are present in test_rows. However, if somehow you know ahead that all of them are present, you could replace the part
res=[[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
with just the row:
res=np.split(indices,np.where(np.diff(loc)>0)[0]+1)
Thus avoiding loops entirely.

Not very 'numpythonic', but for a bit of an upfront cost, we can make a dict with the keys as a tuple of your row, and a list of indices:
test_rowsdict = {}
for i,j in enumerate(test_rows):
    test_rowsdict.setdefault(tuple(j),[]).append(i)
test_rowsdict
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]}
Then you can filter based on your uniq_rows, with a fast dict lookup: test_rowsdict[tuple(row)]:
out = []
for i in uniq_rows:
    out.append((i, test_rowsdict.get(tuple(i),[])))
For your data, I get 16us for just the lookup, and 66us for building and looking up, versus 95us for your np.where solution.

Approach #1
Here's one approach, not sure about the level of "NumPythonic-ness" though to such a tricky problem -
def get1Ds(a, b): # Get 1D views of each row from the two inputs
    # check that casting to void will create equal size elements
    assert a.shape[1:] == b.shape[1:]
    assert a.dtype == b.dtype
    # compute dtypes
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    # convert to 1d void arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    a_void = a.reshape(a.shape[0], -1).view(void_dt).ravel()
    b_void = b.reshape(b.shape[0], -1).view(void_dt).ravel()
    return a_void, b_void

def matching_row_indices(uniq_rows, test_rows):
    A, B = get1Ds(uniq_rows, test_rows)
    validA_mask = np.in1d(A,B)
    sidx_A = A.argsort()
    validA_mask = validA_mask[sidx_A]
    sidx = B.argsort()
    sortedB = B[sidx]
    split_idx = np.flatnonzero(sortedB[1:] != sortedB[:-1])+1
    all_split_indx = np.split(sidx, split_idx)
    match_mask = np.in1d(B,A)[sidx]
    valid_mask = np.logical_or.reduceat(match_mask, np.r_[0, split_idx])
    locations = [e for i,e in enumerate(all_split_indx) if valid_mask[i]]
    return uniq_rows[sidx_A[validA_mask]], locations
Scope(s) of improvement (on performance) :
np.split could be replaced by a for-loop for splitting using slicing.
np.r_ could be replaced by np.concatenate.
Sample run -
In [331]: unq_rows, idx = matching_row_indices(uniq_rows, test_rows)
In [332]: unq_rows
Out[332]:
array([[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]])
In [333]: idx
Out[333]: [array([ 1, 4, 10]),array([0, 5, 6]),array([ 3, 8, 12]),array([7, 9])]
Approach #2
Another approach to beat the setup overhead from the previous one and making use of get1Ds from it, would be -
A, B = get1Ds(uniq_rows, test_rows)
idx_group = []
for row in A:
    idx_group.append(np.flatnonzero(B == row))

The numpy_indexed package (disclaimer: I am its author) was created to solve problems of this kind in an elegant and efficient manner:
import numpy_indexed as npi
indices = np.arange(len(test_rows))
unique_test_rows, index_groups = npi.group_by(test_rows, indices)
If you don't care about the indices of all rows, but only those present in test_rows, npi has a bunch of simple ways of tackling that problem too; for instance:
subset_indices = npi.indices(unique_test_rows, uniq_rows)
As a side note, it might be useful to take a look at the examples in the npi library; in my experience, most of the time people ask a question of this kind, these grouped indices are just a means to an end, and not the end goal of the computation. Chances are that using the functionality in npi you can reach that end goal more efficiently, without ever explicitly computing those indices. Do you care to give some more background to your problem?
EDIT: if your arrays are indeed this big, and always consist of a small number of columns with binary values, wrapping them with the following encoding might boost efficiency a lot further still:
def encode(rows):
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)
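To illustrate what the encoding buys you, here is a small sketch that turns each binary row into a single integer code and then compares 1-D arrays instead of whole rows. The encode function is copied from above; the shortened test_rows sample and the names codes_uniq, codes_test and locations are mine. Note the assumption baked into the uint8 dtype: values must be 0 or 1 and there can be at most 8 columns.

```python
import numpy as np

def encode(rows):
    # Weight column i by 2**i, so every distinct binary row gets a distinct code.
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
# Shortened sample of the question's data (first 8 rows only).
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1]])

codes_uniq = encode(uniq_rows)
codes_test = encode(test_rows)

# Comparing scalar codes is cheaper than comparing full rows.
locations = [np.flatnonzero(codes_test == code) for code in codes_uniq]

print(codes_uniq)
print(codes_test)
print(locations)
```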

Resize matrix by repeating copies of it, in python

Say you have two matrices, A is 2x2 and B is 2x7 (2 rows, 7 columns). I want to create a matrix C of shape 2x7, out of copies of A. The problem is that np.hstack only handles the case where the number of columns of A divides the target width (say 2 and 8, so you can easily stack 4 copies of A to get C), but what about when it does not? Any ideas?
A = [[0,1],
     [2,3]]
B = [[1,2,3,4,5,6,7],
     [1,2,3,4,5,6,7]]
C = [[0,1,0,1,0,1,0],
     [2,3,2,3,2,3,2]]

Here's an approach with modulus -
In [23]: ncols = 7 # No. of cols in output array
In [24]: A[:,np.mod(np.arange(ncols),A.shape[1])]
Out[24]:
array([[0, 1, 0, 1, 0, 1, 0],
[2, 3, 2, 3, 2, 3, 2]])
Or with % operator -
In [27]: A[:,np.arange(ncols)%A.shape[1]]
Out[27]:
array([[0, 1, 0, 1, 0, 1, 0],
[2, 3, 2, 3, 2, 3, 2]])
For such repeated indices, using np.take would be more performant -
In [29]: np.take(A, np.arange(ncols)%A.shape[1], axis=1)
Out[29]:
array([[0, 1, 0, 1, 0, 1, 0],
[2, 3, 2, 3, 2, 3, 2]])
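The modulus idea can be packaged as a small helper that works for any target width, whether or not it is a multiple of A's width. This is only a sketch; the function name repeat_columns is hypothetical.

```python
import numpy as np

def repeat_columns(A, ncols):
    """Cycle through the columns of A until ncols columns have been produced."""
    A = np.asarray(A)
    col_idx = np.arange(ncols) % A.shape[1]   # e.g. [0, 1, 0, 1, 0, 1, 0] for ncols=7
    return np.take(A, col_idx, axis=1)

A = np.array([[0, 1],
              [2, 3]])

C = repeat_columns(A, 7)        # width not divisible by 2
C_wide = repeat_columns(A, 8)   # width divisible by 2
print(C)
print(C_wide)
```

For widths that are an exact multiple of A's width, this gives the same result as np.tile(A, (1, reps)).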

A solution without numpy (although the np solution posted above is a lot nicer):
A = [[0,1],
[2,3]]
B = [[1,2,3,4,5,6,7],
[1,2,3,4,5,6,7]]
i_max, j_max = len(A), len(A[0])
C = []
for i, line_b in enumerate(B):
    line_c = [A[i % i_max][j % j_max] for j, _ in enumerate(line_b)]
    C.append(line_c)
print(C)

First solution is very nice. Another possible way would be to still use hstack, but if you don't want the pattern repeated fully you can use array slicing to get the values you need:
a.shape  # (2, 2)
b.shape  # (2, 7)
repeats = int(np.ceil(b.shape[1] / a.shape[1]))  # copies of a needed to cover b's columns
c = np.hstack([a] * repeats)[:, :b.shape[1]]  # drop the surplus columns
c
array([[0, 1, 0, 1, 0, 1, 0],
       [2, 3, 2, 3, 2, 3, 2]])

Extracting required indices from an array of tuples

import numpy as np
from scipy import signal
y = np.array([[2, 1, 2, 3, 2, 0, 1, 0],
[2, 1, 2, 3, 2, 0, 1, 0]])
maximas = signal.argrelmax(y, axis=1)
print(maximas)
(array([0, 0, 1, 1], dtype=int64), array([3, 6, 3, 6], dtype=int64))
maximas holds the indices as a tuple of arrays: (0,3) and (0,6) are for the first row [2, 1, 2, 3, 2, 0, 1, 0]; and (1,3) and (1,6) are for the second row [2, 1, 2, 3, 2, 0, 1, 0].
The following prints all the results, but I want to extract only the first maxima of both rows, i.e., [3,3] using the tuples. So, the tuples I need are (0,3) and (1,3).
How can I extract them from the array of tuples, i.e., 'maximas'?
>>> print(y[maximas])
[3 1 3 1]

Given the tuple maximas, here's one possible NumPy way:
>>> a = np.column_stack(maximas)
>>> a[np.unique(a[:,0], return_index=True)[1]]
array([[0, 3],
[1, 3]], dtype=int64)
This stacks the coordinate lists returned by signal.argrelmax into an array a. The return_index parameter of np.unique is used to find the first index of each row number. We can then retrieve the relevant rows from a using these first indexes.
This returns an array, but you could turn it into a list of lists with tolist().
To return the first column index of the maximum in each row, you just need to take the indices returned by np.unique from maximas[0] and use them to index maximas[1]. In one line, it's this:
>>> maximas[1][np.unique(maximas[0], return_index=True)[1]]
array([3, 3], dtype=int64)
To retrieve the corresponding values from each row of y, you can use np.choose:
>>> cols = maximas[1][np.unique(maximas[0], return_index=True)[1]]
>>> np.choose(cols, y.T)
array([3, 3])
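The steps above can also be written with plain fancy indexing instead of np.choose. The sketch below avoids the SciPy dependency by hard-coding maximas as the tuple printed in the question, which is an assumption about what signal.argrelmax returns for this y; the names row_ids, first, first_cols and first_vals are mine.

```python
import numpy as np

y = np.array([[2, 1, 2, 3, 2, 0, 1, 0],
              [2, 1, 2, 3, 2, 0, 1, 0]])

# Hard-coded copy of the signal.argrelmax(y, axis=1) output shown in the question.
maximas = (np.array([0, 0, 1, 1]), np.array([3, 6, 3, 6]))
rows, cols = maximas

# return_index gives the position of the first occurrence of each row number.
row_ids, first = np.unique(rows, return_index=True)

first_cols = cols[first]             # column of the first maximum in each row
first_vals = y[row_ids, first_cols]  # the corresponding values of y

print(list(zip(row_ids.tolist(), first_cols.tolist())))
print(first_vals)
```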

Well, a pure Python approach would be to use itertools.groupby (grouping on the row index) and a list comprehension:
>>> from itertools import groupby
>>> from operator import itemgetter
>>> [max(g, key=lambda x: y[x])
...  for k, g in groupby(zip(*maximas), itemgetter(0))]
[(0, 3), (1, 3)]

Getting the indexes to the duplicate columns of a numpy array [duplicate]

This question already has answers here:
Find unique columns and column membership
(3 answers)
Closed 8 years ago.
I have a numpy array with duplicate columns:
import numpy as np
A = np.array([[1, 1, 1, 0, 1, 1],
[1, 2, 2, 0, 1, 2],
[1, 3, 3, 0, 1, 3]])
I need to find the indexes to those duplicates or something like that:
[0, 4]
[1, 2, 5]
I have a hard time dealing with indexes in Python. I really don't know how to approach it.
Thanks
I tried identifying the unique columns first with this function:
def unique_columns(data):
    ind = np.lexsort(data)
    return data.T[ind[np.concatenate(([True], np.any(data.T[ind[1:]]!=data.T[ind[:-1]], axis=1)))]].T
But I can't figure out the indexes from there.

Unfortunately, there is no simple way to do this. One option is to adapt an np.unique-based answer. This method requires that the axis you want to make unique is contiguous in memory, and numpy's typical memory layout is C contiguous, i.e. contiguous in rows. Fortunately numpy makes this conversion simple:
A = np.array([[1, 1, 1, 0, 1, 1],
[1, 2, 2, 0, 1, 2],
[1, 3, 3, 0, 1, 3]])
def unique_columns2(data):
    dt = np.dtype((np.void, data.dtype.itemsize * data.shape[0]))
    dataf = np.asfortranarray(data).view(dt)
    u,uind = np.unique(dataf, return_inverse=True)
    u = u.view(data.dtype).reshape(-1,data.shape[0]).T
    return (u,uind)
Our result:
u,uind = unique_columns2(A)
u
array([[0, 1, 1],
[0, 1, 2],
[0, 1, 3]])
uind
array([1, 2, 2, 0, 1, 2])
I am not really sure what you want to do from here, for example you can do something like this:
>>> [np.where(uind==x)[0] for x in range(u.shape[0])]
[array([3]), array([0, 4]), array([1, 2, 5])]
Some timings:
tmp = np.random.randint(0,4,(30000,500))
#BiRico and OP's answer
%timeit unique_columns(tmp)
1 loops, best of 3: 2.91 s per loop
%timeit unique_columns2(tmp)
1 loops, best of 3: 208 ms per loop
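On NumPy 1.13 or later, the groups of duplicated columns can probably be read off directly from np.unique with axis=1, without the void view. This is a sketch under that version assumption, not a benchmarked replacement for the functions timed above; the names inverse, counts and duplicates are mine.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])

# Treat each column as one item; counts tells how often each unique column occurs.
_, inverse, counts = np.unique(A, axis=1, return_inverse=True, return_counts=True)
inverse = np.ravel(inverse)  # defensive: some releases return a 2-D inverse

# Keep only the unique columns that occur more than once.
duplicates = [np.flatnonzero(inverse == k) for k in np.flatnonzero(counts > 1)]

print(duplicates)
```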

Here is an outline of how to approach it. Use numpy.lexsort to sort the columns, that way all the duplicates will be grouped together. Once the duplicates are all together, you can easily tell which columns are duplicates and the indices that correspond with those columns.
Here's an implementation of the method described above.
import numpy as np
def duplicate_columns(data, minoccur=2):
    ind = np.lexsort(data)
    diff = np.any(data.T[ind[1:]] != data.T[ind[:-1]], axis=1)
    edges = np.where(diff)[0] + 1
    result = np.split(ind, edges)
    result = [group for group in result if len(group) >= minoccur]
    return result
A = np.array([[1, 1, 1, 0, 1, 1],
[1, 2, 2, 0, 1, 2],
[1, 3, 3, 0, 1, 3]])
print(duplicate_columns(A))
# [array([0, 4]), array([1, 2, 5])]
