I'm trying to rank the rows of an array lexicographically. The code below works fine, but I'd like to assign equal ranks to equal rows.
import numpy as np
values = np.asarray([
    [1, 2, 3],
    [1, 1, 1],
    [2, 2, 3],
    [1, 2, 3],
    [1, 1, 2]
])
# flip, because `np.lexsort` gives the last
# key the highest priority.
values_reversed = np.fliplr(values)
# this returns the order, i.e. the positions
# the elements should occupy in a sorted
# array (not the rank by index).
order = np.lexsort(values_reversed.T)
# convert order to ranks.
n = values.shape[0]
ranks = np.empty(n, dtype=int)
# use order to assign ranks.
ranks[order] = np.arange(n)
The ranks variable contains [2, 0, 4, 3, 1], but a rank array of [2, 0, 4, 2, 1] is required, because the rows [1, 2, 3] (indices 0 and 3) should share the same rank. Consecutive rank numbers are also fine, so [2, 0, 3, 2, 1] is an acceptable rank array as well.
Here's one approach -
# Get lexsorted indices and hence sorted values by those indices
lexsort_idx = np.lexsort(values.T[::-1])
lexsort_vals = values[lexsort_idx]
# Mask of steps where rows shift (there are no duplicates in subsequent rows)
mask = np.r_[True,(lexsort_vals[1:] != lexsort_vals[:-1]).any(1)]
# Get the stepped indices (indices shift at non duplicate rows) and
# the index values are scaled corresponding to row numbers
stepped_idx = np.maximum.accumulate(mask*np.arange(mask.size))
# Re-arrange the stepped indices based on the original order of rows
# This is basically same as the original code does in last 4 steps,
# just in a concise manner
out_idx = stepped_idx[lexsort_idx.argsort()]
Sample step-by-step intermediate outputs -
In [55]: values
Out[55]:
array([[1, 2, 3],
       [1, 1, 1],
       [2, 2, 3],
       [1, 2, 3],
       [1, 1, 2]])
In [56]: lexsort_idx
Out[56]: array([1, 4, 0, 3, 2])
In [57]: lexsort_vals
Out[57]:
array([[1, 1, 1],
       [1, 1, 2],
       [1, 2, 3],
       [1, 2, 3],
       [2, 2, 3]])
In [58]: mask
Out[58]: array([ True, True, True, False, True], dtype=bool)
In [59]: stepped_idx
Out[59]: array([0, 1, 2, 2, 4])
In [60]: lexsort_idx.argsort()
Out[60]: array([2, 0, 4, 3, 1])
In [61]: stepped_idx[lexsort_idx.argsort()]
Out[61]: array([2, 0, 4, 2, 1])
Performance boost
To compute lexsort_idx.argsort() more efficiently, we could use the following helper, which does exactly what the last four lines of the original code do -
def argsort_unique(idx):
    # Original idea : http://stackoverflow.com/a/41242285/3293881 by @Andras
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx
Thus, lexsort_idx.argsort() could alternatively be computed with argsort_unique(lexsort_idx).
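As a quick sanity check (a minimal sketch reusing values from the question):
lexsort_idx = np.lexsort(values.T[::-1])
# both expressions compute the inverse of the same permutation
assert np.array_equal(argsort_unique(lexsort_idx), lexsort_idx.argsort())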
Runtime test
Applying a few more optimization tricks, we would have a version like so -
def numpy_app(values):
    lexsort_idx = np.lexsort(values.T[::-1])
    lexsort_v = values[lexsort_idx]
    mask = np.concatenate(( [False], (lexsort_v[1:] == lexsort_v[:-1]).all(1) ))
    stepped_idx = np.arange(mask.size)
    stepped_idx[mask] = 0
    np.maximum.accumulate(stepped_idx, out=stepped_idx)
    return stepped_idx[argsort_unique(lexsort_idx)]
# @Warren Weckesser's rankdata-based method as a function for the timings -
from scipy.stats import rankdata

def scipy_app(values):
    v = values.view(np.dtype(','.join([values.dtype.str] * values.shape[1])))
    return rankdata(v, method='min') - 1
Timings -
In [97]: a = np.random.randint(0,9,(10000,3))
In [98]: out1 = numpy_app(a)
In [99]: out2 = scipy_app(a)
In [100]: np.allclose(out1, out2)
Out[100]: True
In [101]: %timeit scipy_app(a)
100 loops, best of 3: 5.32 ms per loop
In [102]: %timeit numpy_app(a)
100 loops, best of 3: 1.96 ms per loop
Here's a way to do it using scipy.stats.rankdata (with method='min'), by viewing the 2-d array as a 1-d structured array:
In [15]: values
Out[15]:
array([[1, 2, 3],
       [1, 1, 1],
       [2, 2, 3],
       [1, 2, 3],
       [1, 1, 2]])
In [16]: v = values.view(np.dtype(','.join([values.dtype.str]*values.shape[1])))
In [17]: rankdata(v, method='min') - 1
Out[17]: array([2, 0, 4, 2, 1])
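For reference, if consecutive (dense) ranks such as [2, 0, 3, 2, 1] are acceptable and your NumPy supports the axis keyword of np.unique (1.13+), the inverse indices of the unique rows are exactly those dense ranks. A minimal sketch:
u, dense_ranks = np.unique(values, axis=0, return_inverse=True)
# the unique rows come back lexicographically sorted, so the inverse
# mapping assigns each row the index of its unique row, i.e. its dense rank
dense_ranks   # array([2, 0, 3, 2, 1])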
Related
I came up with this question while trying to apply a Caesar cipher to a matrix, with a different shift value for each row, i.e. given a matrix X
array([[1, 0, 8],
       [5, 1, 4],
       [2, 1, 1]])
with shift values of S = array([0, 1, 1]), the output needs to be
array([[1, 0, 8],
       [1, 4, 5],
       [1, 1, 2]])
This is easy to implement with the following code:
Y = []
for i in range(X.shape[0]):
    if S[i] > 0:
        Y.append(X[i, S[i]:].tolist() + X[i, :S[i]].tolist())
    else:
        Y.append(X[i, :].tolist())
Y = np.array(Y)
This is a left cyclic shift. I wonder how to do this more efficiently using numpy arrays?
Update: This example applies the shift to the columns of a matrix. Suppose that we have a 3D array
array([[[8, 1, 8],
        [8, 6, 2],
        [5, 3, 7]],

       [[4, 1, 0],
        [5, 9, 5],
        [5, 1, 7]],

       [[9, 8, 6],
        [5, 1, 0],
        [5, 5, 4]]])
Then, the cyclic right shift of S = array([0, 0, 1]) over the columns leads to
array([[[8, 1, 7],
        [8, 6, 8],
        [5, 3, 2]],

       [[4, 1, 7],
        [5, 9, 0],
        [5, 1, 5]],

       [[9, 8, 4],
        [5, 1, 6],
        [5, 5, 0]]])
Approach #1 : Use modulus to implement the cyclic pattern and get the new column indices and then simply use advanced-indexing to extract the elements, giving us a vectorized solution, like so -
def cyclic_slice(X, S):
    m, n = X.shape
    idx = np.mod(np.arange(n) + S[:, None], n)
    return X[np.arange(m)[:, None], idx]
Approach #2 : We can also leverage the power of strides for further speedup. The idea is to append all but the last column of X at the end, create sliding windows whose length equals the number of columns, and finally index into the appropriate window number per row to get the same rolled-over effect. The implementation would be like so -
def cyclic_slice_strided(X, S):
    X2 = np.column_stack((X, X[:, :-1]))
    s0, s1 = X2.strides
    strided = np.lib.stride_tricks.as_strided
    m, n1 = X.shape
    n2 = X2.shape[1]
    X2_3D = strided(X2, shape=(m, n2 - n1 + 1, n1), strides=(s0, s1, s1))
    return X2_3D[np.arange(len(S)), S]
Sample run -
In [34]: X
Out[34]:
array([[1, 0, 8],
       [5, 1, 4],
       [2, 1, 1]])
In [35]: S
Out[35]: array([0, 1, 1])
In [36]: cyclic_slice(X, S)
Out[36]:
array([[1, 0, 8],
       [1, 4, 5],
       [1, 1, 2]])
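Both approaches should produce identical outputs; a quick equivalence check (minimal sketch, reusing the X and S above):
assert np.array_equal(cyclic_slice(X, S), cyclic_slice_strided(X, S))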
Runtime test -
In [75]: X = np.random.rand(10000,100)
...: S = np.random.randint(0,100,(10000))
# @Moses Koledoye's soln
In [76]: %%timeit
...: Y = []
...: for i, x in zip(S, X):
...: Y.append(np.roll(x, -i))
10 loops, best of 3: 108 ms per loop
In [77]: %timeit cyclic_slice(X, S)
100 loops, best of 3: 14.1 ms per loop
In [78]: %timeit cyclic_slice_strided(X, S)
100 loops, best of 3: 4.3 ms per loop
Adaptation for the 3D case
Adapting approach #1 for the 3D case, we would have -
shift = 'left'
axis = 1 # axis along which S is to be used (axis=1 for rows)
n = X.shape[axis]
if shift == 'left':
    Sa = S
else:
    Sa = -S
# For rows
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[:,idx, np.arange(len(S))]
# For columns
idx = np.mod(Sa[:,None] + np.arange(n),n)
out = X[:,np.arange(len(S))[:,None], idx]
# For axis=0
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[idx, np.arange(len(S))]
There could be a way to have a generic solution for a generic axis, but I will stop at this point; a rough sketch of what it might look like follows.
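This is only a hedged sketch built on np.take_along_axis (NumPy 1.15+), not part of the timed solutions above; it assumes S is indexed by the first axis and that axis != 0:

def cyclic_shift(X, S, axis=-1, shift='left'):
    # Hypothetical generic per-slice cyclic shift along `axis`.
    # Assumes len(S) == X.shape[0] and axis != 0.
    n = X.shape[axis]
    Sa = S if shift == 'left' else -S
    idx = np.mod(np.arange(n) + Sa[:, None], n)   # shape (len(S), n)
    shape = [1] * X.ndim
    shape[0], shape[axis] = len(S), n
    # reshape so the index array broadcasts against X along `axis`
    return np.take_along_axis(X, idx.reshape(shape), axis=axis)

For the 2D sample above, cyclic_shift(X, S) matches cyclic_slice(X, S).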
You could shift each row using np.roll and use the new rows to build the output array:
Y = []
for i, x in zip(S, X):
    Y.append(np.roll(x, -i))
print(np.array(Y))
array([[1, 0, 8],
       [1, 4, 5],
       [1, 1, 2]])
Given an array:
arr = np.array([[1, 3, 7], [4, 9, 8]]); arr
array([[1, 3, 7],
       [4, 9, 8]])
And given its indices:
np.indices(arr.shape)
array([[[0, 0, 0],
        [1, 1, 1]],

       [[0, 1, 2],
        [0, 1, 2]]])
How would I be able to stack them neatly one against the other to form a new 2D array? This is what I'd like:
array([[0, 0, 1],
       [0, 1, 3],
       [0, 2, 7],
       [1, 0, 4],
       [1, 1, 9],
       [1, 2, 8]])
This is my current solution:
def foo(arr):
    return np.hstack((np.indices(arr.shape).reshape(2, arr.size).T,
                      arr.reshape(-1, 1)))
It works, but is there something shorter/more elegant to carry this operation out?
Use array initialization and then broadcasted assignment to fill in the indices and the array values in subsequent steps -
def indices_merged_arr(arr):
    m, n = arr.shape
    I, J = np.ogrid[:m, :n]
    out = np.empty((m, n, 3), dtype=arr.dtype)
    out[..., 0] = I
    out[..., 1] = J
    out[..., 2] = arr
    out.shape = (-1, 3)
    return out
Note that we are avoiding the use of np.indices(arr.shape), which could have slowed things down.
Sample run -
In [10]: arr = np.array([[1, 3, 7], [4, 9, 8]])
In [11]: indices_merged_arr(arr)
Out[11]:
array([[0, 0, 1],
       [0, 1, 3],
       [0, 2, 7],
       [1, 0, 4],
       [1, 1, 9],
       [1, 2, 8]])
Performance
arr = np.random.randn(100000, 2)
%timeit df = pd.DataFrame(np.hstack((np.indices(arr.shape).reshape(2, arr.size).T,\
arr.reshape(-1, 1))), columns=['x', 'y', 'value'])
100 loops, best of 3: 4.97 ms per loop
%timeit pd.DataFrame(indices_merged_arr_divakar(arr), columns=['x', 'y', 'value'])
100 loops, best of 3: 3.82 ms per loop
%timeit pd.DataFrame(indices_merged_arr_eric(arr), columns=['x', 'y', 'value'], dtype=np.float32)
100 loops, best of 3: 5.59 ms per loop
Note: Timings include the conversion to a pandas dataframe, which is the eventual use case for this solution.
A more generic answer for nd arrays, which handles other dtypes correctly:
def indices_merged_arr(arr):
    out = np.empty(arr.shape, dtype=[
        ('index', np.intp, arr.ndim),
        ('value', arr.dtype)
    ])
    out['value'] = arr
    for i, l in enumerate(arr.shape):
        shape = (1,)*i + (-1,) + (1,)*(arr.ndim - 1 - i)
        out['index'][..., i] = np.arange(l).reshape(shape)
    return out.ravel()
This returns a structured array with an index column and a value column, which can be of different types.
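A quick usage sketch on the 2-D example from the question, using standard structured-array field access:
arr = np.array([[1, 3, 7], [4, 9, 8]])
out = indices_merged_arr(arr)
out['index']   # array([[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]])
out['value']   # array([1, 3, 7, 4, 9, 8])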
Is there a more efficient way to compare every column in every row in one DF to every column in every row of another DF? This feels sloppy to me, but my loop / apply attempts have been much slower.
df1 = pd.DataFrame({'a': np.random.randn(1000),
                    'b': [1, 2] * 500,
                    'c': np.random.randn(1000)},
                   index=pd.date_range('1/1/2000', periods=1000))
df2 = pd.DataFrame({'a': np.random.randn(100),
                    'b': [2, 1] * 50,
                    'c': np.random.randn(100)},
                   index=pd.date_range('1/1/2000', periods=100))
df1 = df1.reset_index()
df1['embarrassingHackInd'] = 0
df1.set_index('embarrassingHackInd', inplace=True)
df1.rename(columns={'index':'origIndex'}, inplace=True)
df1['df1Date'] = df1.origIndex.astype(np.int64) // 10**9
df1['df2Date'] = 0
df2 = df2.reset_index()
df2['embarrassingHackInd'] = 0
df2.set_index('embarrassingHackInd', inplace=True)
df2.rename(columns={'index':'origIndex'}, inplace=True)
df2['df2Date'] = df2.origIndex.astype(np.int64) // 10**9
df2['df1Date'] = 0
timeit df3 = abs(df1-df2)
10 loops, best of 3: 60.6 ms per loop
I need to know which comparison was made, thus the ugly addition of each opposing index to the comparison DF so that it will end up in the final DF.
Thanks in advance for any assistance.
The code you posted shows a clever way to produce a subtraction table. However, it doesn't play to Pandas' strengths. Pandas DataFrames store the underlying data in column-based blocks, so retrieval of the data is fastest when done by column, not by row. Since all the rows share the same index, the subtractions are performed row by row (pairing each row with every other row), which means there is a lot of row-based data retrieval going on in df1-df2. That's not ideal for Pandas, particularly when not all the columns have the same dtype.
Subtraction tables are something NumPy is good at:
In [5]: x = np.arange(10)
In [6]: y = np.arange(5)
In [7]: x[:, np.newaxis] - y
Out[7]:
array([[ 0, -1, -2, -3, -4],
       [ 1,  0, -1, -2, -3],
       [ 2,  1,  0, -1, -2],
       [ 3,  2,  1,  0, -1],
       [ 4,  3,  2,  1,  0],
       [ 5,  4,  3,  2,  1],
       [ 6,  5,  4,  3,  2],
       [ 7,  6,  5,  4,  3],
       [ 8,  7,  6,  5,  4],
       [ 9,  8,  7,  6,  5]])
You can think of x as one column of df1, and y as one column of df2. You'll see below that NumPy can handle all the columns of df1 and all the columns of df2 in basically the same way, using basically the same syntax.
The code below defines orig and using_numpy. orig is the code you posted, using_numpy is an alternative method which performs the subtraction using NumPy arrays:
In [2]: %timeit orig(df1.copy(), df2.copy())
10 loops, best of 3: 96.1 ms per loop
In [3]: %timeit using_numpy(df1.copy(), df2.copy())
10 loops, best of 3: 19.9 ms per loop
import numpy as np
import pandas as pd
N = 100
df1 = pd.DataFrame({'a': np.random.randn(10*N),
                    'b': [1, 2] * 5*N,
                    'c': np.random.randn(10*N)},
                   index=pd.date_range('1/1/2000', periods=10*N))
df2 = pd.DataFrame({'a': np.random.randn(N),
                    'b': [2, 1] * (N//2),
                    'c': np.random.randn(N)},
                   index=pd.date_range('1/1/2000', periods=N))
def orig(df1, df2):
    df1 = df1.reset_index()                                    # 312 µs per loop
    df1['embarrassingHackInd'] = 0                             # 75.2 µs per loop
    df1.set_index('embarrassingHackInd', inplace=True)         # 526 µs per loop
    df1.rename(columns={'index':'origIndex'}, inplace=True)    # 209 µs per loop
    df1['df1Date'] = df1.origIndex.astype(np.int64) // 10**9   # 23.1 µs per loop
    df1['df2Date'] = 0
    df2 = df2.reset_index()
    df2['embarrassingHackInd'] = 0
    df2.set_index('embarrassingHackInd', inplace=True)
    df2.rename(columns={'index':'origIndex'}, inplace=True)
    df2['df2Date'] = df2.origIndex.astype(np.int64) // 10**9
    df2['df1Date'] = 0
    df3 = abs(df1 - df2)   # 88.7 ms per loop <-- this is the bottleneck
    return df3
def using_numpy(df1, df2):
    df1.index.name = 'origIndex'
    df2.index.name = 'origIndex'
    df1.reset_index(inplace=True)
    df2.reset_index(inplace=True)
    df1_date = df1['origIndex']
    df2_date = df2['origIndex']
    df1['origIndex'] = df1_date.astype(np.int64)
    df2['origIndex'] = df2_date.astype(np.int64)
    arr1 = df1.values
    arr2 = df2.values
    arr3 = np.abs(arr1[:, np.newaxis, :] - arr2)   # 3.32 ms per loop vs 88.7 ms
    arr3 = arr3.reshape(-1, 4)
    index = pd.MultiIndex.from_product(
        [df1_date, df2_date], names=['df1Date', 'df2Date'])
    result = pd.DataFrame(arr3, index=index, columns=df1.columns)
    # You could stop here, but the rest makes the result more similar to orig
    result.reset_index(inplace=True, drop=False)
    result['df1Date'] = result['df1Date'].astype(np.int64) // 10**9
    result['df2Date'] = result['df2Date'].astype(np.int64) // 10**9
    return result
def is_equal(expected, result):
    expected.reset_index(inplace=True, drop=True)
    result.reset_index(inplace=True, drop=True)
    # expected has dtypes 'O', while result has some float and int dtypes.
    # Make all the dtypes float for a quick and dirty comparison check
    expected = expected.astype('float')
    result = result.astype('float')
    columns = ['a', 'b', 'c', 'origIndex', 'df1Date', 'df2Date']
    return expected[columns].equals(result[columns])
expected = orig(df1.copy(), df2.copy())
result = using_numpy(df1.copy(), df2.copy())
assert is_equal(expected, result)
How x[:, np.newaxis] - y works:
This expression takes advantage of NumPy broadcasting.
To understand broadcasting -- and in general with NumPy -- it pays to know the shape of the arrays:
In [6]: x.shape
Out[6]: (10,)
In [7]: x[:, np.newaxis].shape
Out[7]: (10, 1)
In [8]: y.shape
Out[8]: (5,)
The [:, np.newaxis] adds a new axis to x on the right, so the shape is (10, 1). So x[:, np.newaxis] - y is the subtraction of an array of shape (10, 1) with an array of shape (5,).
On the face of it, that doesn't make sense, but NumPy arrays broadcast their shape according to certain rules to try to make their shapes compatible.
The first rule is that new axes can be added on the left, so an array of shape (5,) can be broadcast to shape (1, 5).
The next rule is that an axis of length 1 can be broadcast to any length. The values in the array are simply repeated as often as needed along the extra dimension(s).
So when arrays of shape (10, 1) and (1, 5) are put together in a NumPy arithmetic operation, they are both broadcasted up to arrays of shape (10, 5):
In [14]: broadcasted_x, broadcasted_y = np.broadcast_arrays(x[:, np.newaxis], y)
In [15]: broadcasted_x
Out[15]:
array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5],
       [6, 6, 6, 6, 6],
       [7, 7, 7, 7, 7],
       [8, 8, 8, 8, 8],
       [9, 9, 9, 9, 9]])
In [16]: broadcasted_y
Out[16]:
array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])
So x[:, np.newaxis] - y is equivalent to broadcasted_x - broadcasted_y.
Now, with this simpler example under our belt, we can look at
arr1[:,np.newaxis,:]-arr2.
arr1 has shape (1000, 4) and arr2 has shape (100, 4). We want to subtract the items in the axis of length 4, for each row along the 1000-length axis, and each row along the 100-length axis. In other words, we want the subtraction to form an array of shape (1000, 100, 4).
Importantly, we don't want the 1000-axis to interact with the 100-axis. We want them to be in separate axes.
So if we add an axis to arr1 like this: arr1[:,np.newaxis,:], then its shape becomes
In [22]: arr1[:, np.newaxis, :].shape
Out[22]: (1000, 1, 4)
And now, NumPy broadcasting pumps up both arrays to the common shape of (1000, 100, 4). Voila, a subtraction table.
To massage the values into a 2D DataFrame of shape (1000*100, 4), we can use reshape:
arr3 = arr3.reshape(-1, 4)
The -1 tells NumPy to replace -1 with whatever positive integer is needed for the reshape to make sense. Since arr3 has 1000*100*4 values, the -1 is replaced with 1000*100. Using -1 is nicer than hard-coding 1000*100, however, since it allows the code to keep working even if we change the number of rows in df1 and df2.
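A minimal illustration of the -1 placeholder:
a = np.arange(24).reshape(2, 3, 4)   # 24 values in a (2, 3, 4) array
b = a.reshape(-1, 4)                 # NumPy infers 24 // 4 = 6 rows
b.shape                              # (6, 4)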
I have a numpy array with duplicate columns:
import numpy as np
A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])
I need to find the indices of those duplicates, something like:
[0, 4]
[1, 2, 5]
I have a hard time dealing with indices in Python, and I really don't know how to approach this.
Thanks
I tried identifying the unique columns first with this function:
def unique_columns(data):
    ind = np.lexsort(data)
    return data.T[ind[np.concatenate(([True], np.any(data.T[ind[1:]] != data.T[ind[:-1]], axis=1)))]].T
But I can't figure out the indexes from there.
Unfortunately, there is not a simple way to do this. Here is an approach based on np.unique. It requires that the axis you want to treat as the unique key be contiguous in memory, and NumPy's typical memory layout is C-contiguous, i.e. contiguous in rows. Fortunately, NumPy makes this conversion simple:
A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])
def unique_columns2(data):
    dt = np.dtype((np.void, data.dtype.itemsize * data.shape[0]))
    dataf = np.asfortranarray(data).view(dt)
    u, uind = np.unique(dataf, return_inverse=True)
    u = u.view(data.dtype).reshape(-1, data.shape[0]).T
    return (u, uind)
Our result:
u,uind = unique_columns2(A)
u
array([[0, 1, 1],
       [0, 1, 2],
       [0, 1, 3]])
uind
array([1, 2, 2, 0, 1, 2])
I am not really sure what you want to do from here; for example, you can do something like this:
>>> [np.where(uind==x)[0] for x in range(u.shape[0])]
[array([3]), array([0, 4]), array([1, 2, 5])]
Some timings:
tmp = np.random.randint(0,4,(30000,500))
# @BiRico's and OP's answer
%timeit unique_columns(tmp)
1 loops, best of 3: 2.91 s per loop
%timeit unique_columns2(tmp)
1 loops, best of 3: 208 ms per loop
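For reference, on NumPy 1.13+ the axis keyword of np.unique makes the void-view machinery unnecessary. A sketch of the equivalent call (the column ordering of the unique set may differ from unique_columns2 in general, but the grouping is the same; the output shown is for this example):
u, uind = np.unique(A, axis=1, return_inverse=True)
[np.where(uind == x)[0] for x in range(u.shape[1])]
# [array([3]), array([0, 4]), array([1, 2, 5])]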
Here is an outline of how to approach it. Use numpy.lexsort to sort the columns; that way all the duplicates are grouped together. Once the duplicates are all together, you can easily tell which columns are duplicates and which indices correspond to those columns.
Here's an implementation of the method described above.
import numpy as np
def duplicate_columns(data, minoccur=2):
    ind = np.lexsort(data)
    diff = np.any(data.T[ind[1:]] != data.T[ind[:-1]], axis=1)
    edges = np.where(diff)[0] + 1
    result = np.split(ind, edges)
    result = [group for group in result if len(group) >= minoccur]
    return result
A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])
print(duplicate_columns(A))
# [array([0, 4]), array([1, 2, 5])]
I went through these threads:
Find unique rows in numpy.array
Removing duplicates in each row of a numpy array
Pandas: unique dataframe
and they all discuss several methods for computing the matrix with unique rows and columns.
However, the solutions look a bit convoluted, at least to the untrained eye. Here is, for example, the top solution from the first thread, which (correct me if I am wrong) I believe is the safest and fastest:
np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
Either way, the above solution only returns the matrix of unique rows. What I am looking for is something along the lines of the original functionality of np.unique,
u, indices = np.unique(a, return_inverse=True)
which returns not only the list of unique entries, but also the membership of each item to each unique entry found. How can I do this for columns?
Here is an example of what I am looking for:
array([[0, 2, 0, 2, 2, 0, 2, 1, 1, 2],
       [0, 1, 0, 1, 1, 1, 2, 2, 2, 2]])
We would have:
u = array([0, 1, 2, 3, 4])
indices = array([0, 1, 0, 1, 1, 2, 3, 4, 4, 3])
Where the different values in u represent the set of unique columns in the original array:
0 -> [0,0]
1 -> [2,1]
2 -> [0,1]
3 -> [2,2]
4 -> [1,2]
First let's get the unique indices. To do so, we need to start by transposing your array:
>>> a=a.T
Using a modified version of the above to get unique indices.
>>> ua, uind = np.unique(np.ascontiguousarray(a).view(np.dtype((np.void,a.dtype.itemsize * a.shape[1]))),return_inverse=True)
>>> uind
array([0, 3, 0, 3, 3, 1, 4, 2, 2, 4])
# Thanks to @Jamie
>>> ua = ua.view(a.dtype).reshape(ua.shape + (-1,))
>>> ua
array([[0, 0],
       [0, 1],
       [1, 2],
       [2, 1],
       [2, 2]])
For sanity:
>>> np.all(a==ua[uind])
True
To reproduce your chart:
>>> for x in range(ua.shape[0]):
... print x,'->',ua[x]
...
0 -> [0 0]
1 -> [0 1]
2 -> [1 2]
3 -> [2 1]
4 -> [2 2]
To do exactly what you ask (operating on the columns directly), which will be a bit slower since it has to convert the array:
>>> b=np.asfortranarray(a).view(np.dtype((np.void,a.dtype.itemsize * a.shape[0])))
>>> ua,uind=np.unique(b,return_inverse=True)
>>> uind
array([0, 3, 0, 3, 3, 1, 4, 2, 2, 4])
>>> ua.view(a.dtype).reshape(ua.shape+(-1,),order='F')
array([[0, 0, 1, 2, 2],
       [0, 1, 2, 1, 2]])
#To return this in the previous order.
>>> ua.view(a.dtype).reshape(ua.shape + (-1,))
Essentially, you want np.unique to return the indexes of the unique columns, and the indices of where they're used? This is easy enough to do by transposing the matrix and then using the code from the other question, with the addition of return_inverse=True.
at = a.T
b = np.ascontiguousarray(at).view(np.dtype((np.void, at.dtype.itemsize * at.shape[1])))
_, u, indices = np.unique(b, return_index=True, return_inverse=True)
With your a, this gives:
In [35]: u
Out[35]: array([0, 5, 7, 1, 6])
In [36]: indices
Out[36]: array([0, 3, 0, 3, 3, 1, 4, 2, 2, 4])
It's not entirely clear to me what you want u to be, however. If you want it to be the unique columns, then you could use the following instead:
at = a.T
b = np.ascontiguousarray(at).view(np.dtype((np.void, at.dtype.itemsize * at.shape[1])))
_, idx, indices = np.unique(b, return_index=True, return_inverse=True)
u = a[:,idx]
This would give
In [41]: u
Out[41]:
array([[0, 0, 1, 2, 2],
       [0, 1, 2, 1, 2]])
In [42]: indices
Out[42]: array([0, 3, 0, 3, 3, 1, 4, 2, 2, 4])
Not entirely sure what you are after, but have a look at the numpy_indexed package (disclaimer: I am its author); it is sure to make problems of this kind easier:
import numpy_indexed as npi
unique_columns = npi.unique(A, axis=1)
# or perhaps this is what you want?
unique_columns, indices = npi.group_by(A.T, np.arange(A.shape[1]))