Let's say I have a symmetric n-by-n array A and a 1D array x of length n, where the rows/columns of A correspond to the entries of x, and x is ordered. Now assume both A and x are randomly rearranged, so that the rows/columns still correspond but they're no longer in order. How can I manipulate A to recover the correct order?
As an example: x = array([1, 3, 2, 0]) and
A = array([[1, 3, 2, 0],
[3, 9, 6, 0],
[2, 6, 4, 0],
[0, 0, 0, 0]])
so the mapping from x to A in this example is A[i][j] = x[i]*x[j]. x should be sorted like array([0, 1, 2, 3]) and I want to arrive at
A = array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
I guess that the OP is looking for a flexible way to use indices that sort both the rows and the columns of the mapping at once. The OP might also be interested in going in reverse, i.e. recovering the initial view of the mapping if it is lost.
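def mapping(x, my_map, return_index=True, return_inverse=True):
    # Note: return_index and return_inverse are accepted but not used;
    # both index arrays are always returned.
    idx = np.argsort(x)               # permutation that sorts x
    out = my_map(x[idx], x[idx])      # mapping evaluated on the sorted x
    inv = np.empty_like(idx)
    inv[idx] = np.arange(len(idx))    # inverse permutation: sorted -> original
    return out, idx, inv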
x = np.array([1, 3, 2, 0])
a, idx, inv = mapping(x, np.multiply.outer) #sorted mapping
b = np.multiply.outer(x, x) #straight mapping
print(b)
>>> [[1 3 2 0]
[3 9 6 0]
[2 6 4 0]
[0 0 0 0]]
print(a)
>>> [[0 0 0 0]
[0 1 2 3]
[0 2 4 6]
[0 3 6 9]]
np.array_equal(b, a[np.ix_(inv, inv)]) #sorted to straight
>>> True
np.array_equal(a, b[np.ix_(idx, idx)]) #straight to sorted
>>> True
A simple implementation would be
idx = np.argsort(x)
A = A[idx, :]
A = A[:, idx]
Another possibility would be (all credit to @mathfux):
A[np.ix_(idx, idx)]
You can use argsort and fancy indexing:
idx = np.argsort(x)
A2 = A[idx[None], idx[:,None]]
output:
array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
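As a hedged check (not part of the original answers), the sketch below rebuilds A from the example x and confirms that the two-step slicing, np.ix_ and broadcast-index forms agree. The non-symmetric matrix M is an assumption added only to show that the idx[None], idx[:, None] ordering returns the transpose in general, which is harmless here because A is symmetric.

```python
import numpy as np

x = np.array([1, 3, 2, 0])
A = np.multiply.outer(x, x)          # symmetric, A[i, j] = x[i] * x[j]

idx = np.argsort(x)                  # permutation that sorts x

by_steps = A[idx, :][:, idx]         # reorder rows, then columns
by_ix = A[np.ix_(idx, idx)]          # open mesh built from idx
by_broadcast = A[idx[:, None], idx]  # row indices as a column, column indices as a row

expected = np.multiply.outer(np.sort(x), np.sort(x))

# Hypothetical non-symmetric matrix, used only to show the transposition.
M = np.arange(16).reshape(4, 4)
swapped = M[idx[None], idx[:, None]]
```

For a symmetric A all three forms give the same array; for a general matrix, swapped equals M[np.ix_(idx, idx)].T, so the row index should be the one carrying the trailing axis (idx[:, None]).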
I want to replace all the items of sequence with ids that tell which list of labeller they are in. Assume that all the values are distinct in both sequence and labeller, and that the union of the lists in labeller contains the same items as sequence. lsizes holds the sizes of the lists in labeller; it is redundant for a pure-Python solution but might be required for a fully vectorised one.
sequence = [1, 2, 10, 5, 6, 4, 3, 8, 7, 9]
labeller = [[1, 2, 10], [3, 4, 5, 6, 7], [8, 9]]
lsizes = [3, 5, 2]
I know how to solve it in a simple way:
idx = {u:i for i, label in enumerate(labeller) for u in label}
tags = [idx[u] for u in sequence]
And the output is:
tags = [0, 0, 0, 1, 1, 1, 1, 2, 1, 2]
After that I put all my effort into doing it in a vectorised way. It's quite complicated for me. This is my attempt, made rather by guesswork, but unfortunately it doesn't pass all my tests. I hope I'm close:
sequence = np.array(sequence)
cl = np.concatenate(labeller)
_, cl_idx = np.unique(cl, return_index=True)
_, idx = np.unique(sequence[cl_idx], return_index=True)
tags = np.repeat(np.arange(len(lsizes)), lsizes)[idx]
#output: [0 0 1 1 0 1 1 1 2 2]
How can I finish it? I would also like to see a rigorous explanation of what it does and how to understand it better. Any sources are also welcome.
Approach #1
For such trace-back problems, searchsorted seems to be the way to go and works here too, re-using your cl -
cl = np.concatenate(labeller)
sidx = cl.argsort()
idx = np.searchsorted(cl, sequence, sorter=sidx)
idx0 = sidx[idx]
l = list(map(len, labeller))
r = np.repeat(np.arange(len(l)), l)
out = r[idx0]
Using lsizes for l makes it fully vectorized. But, I suspect the concatenation step might be heavy. Whether this is worth it or not would depend a lot on the lengths of the subarrays.
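To make the steps of Approach #1 concrete, here is a minimal self-contained sketch on the question's data. The intermediate names (pos_sorted, pos_in_cl, labels) are mine, chosen for illustration; the logic follows the answer above.

```python
import numpy as np

sequence = np.array([1, 2, 10, 5, 6, 4, 3, 8, 7, 9])
labeller = [[1, 2, 10], [3, 4, 5, 6, 7], [8, 9]]
lsizes = [3, 5, 2]

cl = np.concatenate(labeller)        # all labelled values, grouped by list
sidx = cl.argsort()                  # order that would sort cl

# Position of each sequence item within the *sorted* view of cl ...
pos_sorted = np.searchsorted(cl, sequence, sorter=sidx)
# ... mapped back to its position in the unsorted cl.
pos_in_cl = sidx[pos_sorted]

# labels[k] is the id of the list that cl[k] came from.
labels = np.repeat(np.arange(len(lsizes)), lsizes)
tags = labels[pos_in_cl]
print(tags)  # [0 0 0 1 1 1 1 2 1 2]
```

The key point is that searchsorted needs sorted input, so sorter=sidx lets it search the unsorted cl, and sidx[...] translates the result back to positions where labels is valid.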
Approach #2
For positive numbers, here's one with array-indexing as a hashing mechanism -
N = max(map(max, labeller))+1
id_ar = np.zeros(N, dtype=int) # use np.empty for perf. boost
for i,l in enumerate(labeller):
    id_ar[l] = i
out = id_ar[sequence]
sequence = [1, 2, 10, 5, 6, 4, 3, 8, 7, 9]
labeller = [[1, 2, 10], [3, 4, 5, 6, 7], [8, 9]]
lsizes = [3, 5, 2]
sequence_array = np.array(sequence)
labeller_array = np.concatenate(labeller)  # flatten the ragged lists into one 1D array
index_array = np.repeat(list(range(len(lsizes))), lsizes)
np.apply_along_axis(lambda num : index_array[np.where(labeller_array == num)[0]], 0, sequence_array[None, :])
# output: array([[0, 0, 0, 1, 1, 1, 1, 2, 1, 2]])
Alternative:
label_df = pd.DataFrame({'label':labeller_array, 'index':index_array})
seq_df = pd.DataFrame({'seq':sequence_array})
seq_df.merge(label_df, left_on = 'seq', right_on = 'label')['index'].tolist()
#output: [0, 0, 0, 1, 1, 1, 1, 2, 1, 2]
I am learning NumPy and I want to understand the following data-shuffling code:
# x is a m*n np.array
# return a shuffled-rows array
def shuffle_col_vals(x):
    rand_x = np.array([np.random.choice(x.shape[0], size=x.shape[0], replace=False) for i in range(x.shape[1])]).T
    grid = np.indices(x.shape)
    rand_y = grid[1]
    return x[(rand_x, rand_y)]
So I pass in an np.array object as follows:
x1 = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]], dtype=int)
And I get output from shuffle_col_vals(x1) like the following:
array([[ 1, 5, 11, 15],
[ 3, 8, 9, 14],
[ 4, 6, 12, 16],
[ 2, 7, 10, 13]], dtype=int64)
I am confused by the way rand_x is built; I have not seen this used with numpy.array before.
I have been thinking about it for a long time, but I still don't understand why return x[(rand_x, rand_y)] gives a shuffled-rows array.
If you don't mind, could anyone explain the code to me?
Thanks in advance.
In indexing Numpy arrays, you can take single elements. Let's use a 3x4 array to be able to differentiate between the axes:
In [1]: x1 = np.array([[1, 2, 3, 4],
...: [5, 6, 7, 8],
...: [9, 10, 11, 12]], dtype=int)
In [2]: x1[0, 0]
Out[2]: 1
If you review Numpy Advanced indexing, you will find that you can do more in indexing, by providing lists for each dimension. Consider indexing with x1[rows..., cols...], let's take two elements.
Pick from the first and second row, but always from the first column:
In [3]: x1[[0, 1], [0, 0]]
Out[3]: array([1, 5])
You can even index with arrays:
In [4]: x1[[[0, 0], [1, 1]], [[0, 1], [0, 1]]]
Out[4]:
array([[1, 2],
[5, 6]])
np.indices creates a row and col array, that if used for indexing, give back the original array:
In [5]: grid = np.indices(x1.shape)
In [6]: np.all(x1[grid[0], grid[1]] == x1)
Out[6]: True
Now if you shuffle the values of grid[0] column-wise, but keep grid[1] as-is, and then use these for indexing, you get an array with the values of each column shuffled.
For every column, the vector of row indices is [0, 1, 2]. The code shuffles this row-index vector for each column independently and stacks the results into rand_x, which has the same shape as x1.
Create a single shuffled row-index vector:
In [7]: np.random.seed(0)
In [8]: np.random.choice(x1.shape[0], size=x1.shape[0], replace=False)
Out[8]: array([2, 1, 0])
The stacking works by (pseudo-code) stacking with [random-index-col-vec for cols in range(x1.shape[1])] and then transposing (.T).
To make it a little clearer we can rewrite i as col and use column_stack instead of np.array([... for col]).T:
In [9]: np.random.seed(0)
In [10]: col_list = [np.random.choice(x1.shape[0], size=x1.shape[0], replace=False)
for col in range(x1.shape[1])]
In [11]: col_list
Out[11]: [array([2, 1, 0]), array([2, 0, 1]), array([0, 2, 1]), array([2, 0, 1])]
In [12]: rand_x = np.column_stack(col_list)
In [13]: rand_x
Out[13]:
array([[2, 2, 0, 2],
[1, 0, 2, 0],
[0, 1, 1, 1]])
In [14]: x1[rand_x, grid[1]]
Out[14]:
array([[ 9, 10, 3, 12],
[ 5, 2, 11, 4],
[ 1, 6, 7, 8]])
Details to note:
The example output you give differs from what the function you provide produces; it seems to be transposed.
The names rand_x and rand_y in the sample code can be confusing if you are used to the convention x = column index, y = row index.
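As a cross-check on the explanation above, the following sketch builds the per-column row permutations and compares the advanced-indexing result with np.take_along_axis, which expresses the same "pick a row index per column" idea. This is an illustration under my own assumptions: the seed and the use of np.random.default_rng are not from the question, and rand_rows is just a clearer name for rand_x.

```python
import numpy as np

x1 = np.array([[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12]])

rng = np.random.default_rng(0)  # seed chosen arbitrarily

# One independent permutation of the row indices per column,
# stacked as columns so rand_rows has the same shape as x1.
rand_rows = np.column_stack([rng.permutation(x1.shape[0])
                             for _ in range(x1.shape[1])])
cols = np.indices(x1.shape)[1]  # cols[i, j] == j

# result[i, j] = x1[rand_rows[i, j], cols[i, j]] = x1[rand_rows[i, j], j]
shuffled = x1[rand_rows, cols]
same = np.take_along_axis(x1, rand_rows, axis=0)
```

Every value stays in its original column; only its row changes, which is why sorting each column of shuffled gives x1 back.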
See output:
import numpy as np
def shuffle_col_val(x):
    print("----------------------------\n A rand_x\n")
    f = np.random.choice(x.shape[0], size=x.shape[0], replace=False)
    print(f, "\nNow I transpose an array.")
    rand_x = np.array([f]).T
    print(rand_x)
    print("----------------------------\n B rand_y\n")
    print("Grid gives you two possibilities\n you choose second:")
    grid = np.indices(x.shape)
    print(format(grid))
    rand_y = grid[1]
    print("\n----------------------------\n C Our rand_x, rand_y:")
    print("\nThe order of values in the column CHANGE:\n has random order\n{}".format(rand_x))
    print("\nThe order of values in the row NO CHANGE:\n has normal order 0, 1, 2, 3\n{}".format(rand_y))
    return x[(rand_x, rand_y)]
x1 = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]], dtype=int)
print("\n----------------------------\n D Our shuffled-rows: \n{}\n".format(shuffle_col_val(x1)))
Output:
A rand_x
[2 3 0 1]
Now I transpose an array.
[[2]
[3]
[0]
[1]]
----------------------------
B rand_y
Grid gives you two possibilities, you choose second:
[[[0 0 0 0]
[1 1 1 1]
[2 2 2 2]
[3 3 3 3]]
[[0 1 2 3]
[0 1 2 3]
[0 1 2 3]
[0 1 2 3]]]
----------------------------
C Our rand_x, rand_y:
The order of values in the column CHANGE: has random order
[[2]
[3]
[0]
[1]]
The order of values in the row NO CHANGE: has normal order 0, 1, 2, 3
[[0 1 2 3]
[0 1 2 3]
[0 1 2 3]
[0 1 2 3]]
----------------------------
D Our shuffled-rows:
[[ 9 10 11 12]
[13 14 15 16]
[ 1 2 3 4]
[ 5 6 7 8]]
taxi_modified is a two-dimensional ndarray.
Code below works, but seems un-pythonic:
taxi_modified[taxi_modified[:, 5] == 2, 15] = 1
taxi_modified[taxi_modified[:, 5] == 3, 15] = 1
taxi_modified[taxi_modified[:, 5] == 5, 15] = 1
Need to assign 1 to col at index 15 if col at index 5 is 2, 3, or 5.
The below didn't work:
taxi_modified[taxi_modified[:, 5] == 2 | 3 | 5, 15] = 1
You can use fancy indexing with np.isin (NumPy v1.13+), or np.in1d for older versions.
Here's a demo:
# example input array
A = np.arange(16).reshape((4, 4))
# calculate Boolean mask for rows
mask = np.isin(A[:, 1], [1, 5, 13])
# assign values, converting mask to integers
A[np.where(mask), 2] = -1
print(A)
array([[ 0, 1, -1, 3],
[ 4, 5, -1, 7],
[ 8, 9, 10, 11],
[12, 13, -1, 15]])
In one line, this can be written:
A[np.where(np.isin(A[:, 1], [1, 5, 13])), 2] = -1
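For completeness, here is a small sketch of why the `== 2 | 3 | 5` attempt fails. In Python, `|` binds more tightly than `==`, so the expression compares against the single integer `2 | 3 | 5`, which is 7. The array col below is made-up example data, not the taxi array from the question.

```python
import numpy as np

col = np.array([1, 2, 3, 5, 7])   # hypothetical column values

# Parsed as col == (2 | 3 | 5), i.e. col == 7.
naive = col == 2 | 3 | 5

# Two equivalent ways to test membership in {2, 3, 5}.
mask_isin = np.isin(col, [2, 3, 5])
mask_or = (col == 2) | (col == 3) | (col == 5)

print(naive)      # [False False False False  True]
print(mask_isin)  # [False  True  True  True False]
```

A Boolean mask like mask_isin can also be used directly as the row index, e.g. A[mask, 2] = -1, without going through np.where.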
I have a list of unique rows and another larger array of data (called test_rows in example). I was wondering if there was a faster way to get the location of each unique row in the data. The fastest way that I could come up with is...
import numpy
uniq_rows = numpy.array([[0, 1, 0],
[1, 1, 0],
[1, 1, 1],
[0, 1, 1]])
test_rows = numpy.array([[0, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0],
[0, 1, 0],
[0, 1, 1],
[0, 1, 1],
[1, 1, 1],
[1, 1, 0],
[1, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0]])
# this gives me the indexes of each group of unique rows
for row in uniq_rows.tolist():
    print(row, numpy.where((test_rows == row).all(axis=1))[0])
This prints...
[0, 1, 0] [ 1 4 10]
[1, 1, 0] [ 3 8 12]
[1, 1, 1] [7 9]
[0, 1, 1] [0 5 6]
Is there a better or more numpythonic (not sure if that word exists) way to do this? I was searching for a numpy group function but could not find it. Basically for any incoming dataset I need the fastest way to get the locations of each unique row in that data set. The incoming dataset will not always have every unique row or the same number.
EDIT:
This is just a simple example. In my application the numbers would not be just zeros and ones, they could be anywhere from 0 to 32000. The size of uniq rows could be between 4 to 128 rows and the size of test_rows could be in the hundreds of thousands.
Numpy
From NumPy 1.13 onwards, numpy.unique accepts an axis argument; use axis=0 to find unique rows, e.g. np.unique(test_rows, return_counts=True, return_index=True, axis=0)
Pandas
df = pd.DataFrame(test_rows)
uniq = pd.DataFrame(uniq_rows)
uniq
0 1 2
0 0 1 0
1 1 1 0
2 1 1 1
3 0 1 1
Or you could generate the unique rows automatically from the incoming DataFrame
uniq_generated = df.drop_duplicates().reset_index(drop=True)
yields
0 1 2
0 0 1 1
1 0 1 0
2 0 0 0
3 1 1 0
4 1 1 1
and then look for it
d = dict()
for idx, row in uniq.iterrows():
    d[idx] = df.index[(df == row).all(axis=1)].values
This is about the same as your where method
d
{0: array([ 1, 4, 10], dtype=int64),
1: array([ 3, 8, 12], dtype=int64),
2: array([7, 9], dtype=int64),
3: array([0, 5, 6], dtype=int64)}
There are a lot of solutions here, but I'm adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although the array broadcasting may cause memory to be an issue if large arrays are used.
np.where((uniq_rows[:, None, :] == test_rows).all(2))
Wonderfully simple, eh? This returns a tuple of two arrays: the indices into uniq_rows and the indices of the matching rows in test_rows.
(array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
array([ 1, 4, 10, 3, 8, 12, 7, 9, 0, 5, 6]))
How it works:
(uniq_rows[:, None, :] == test_rows)
Uses array broadcasting to compare each row of test_rows with each row of uniq_rows, element by element. This results in a 4x13x3 Boolean array. all(2) reduces over the last axis to determine which pairs of rows are equal (all element comparisons returned True). Finally, where returns the indices of these pairs.
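The shapes involved, and one way to turn the flat where output into one index array per unique row, can be sketched as follows. The names eq, match and groups are mine, added for illustration.

```python
import numpy as np

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

# (4, 1, 3) compared with (13, 3) broadcasts to (4, 13, 3).
eq = uniq_rows[:, None, :] == test_rows
match = eq.all(axis=2)            # (4, 13): match[u, t] is True if rows are equal

u_idx, t_idx = np.where(match)    # paired indices, ordered by u_idx

# Regroup the test-row indices per unique row.
groups = [t_idx[u_idx == k] for k in range(len(uniq_rows))]
```

The intermediate array has len(uniq_rows) * len(test_rows) * n_columns elements, which is the memory cost mentioned above.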
With the np.unique from v1.13 (downloaded from the source link on the latest documentation, https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L112-L247)
In [157]: aset.unique(test_rows, axis=0,return_inverse=True,return_index=True)
Out[157]:
(array([[0, 0, 0],
[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]]),
array([2, 1, 0, 3, 7], dtype=int32),
array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32))
In [158]: a,b,c=_
In [159]: c
Out[159]: array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32)
In [164]: from collections import defaultdict
In [165]: dd = defaultdict(list)
In [166]: for i,v in enumerate(c):
     ...:     dd[v].append(i)
...:
In [167]: dd
Out[167]:
defaultdict(list,
{0: [2, 11],
1: [1, 4, 10],
2: [0, 5, 6],
3: [3, 8, 12],
4: [7, 9]})
or indexing the dictionary with the unique rows (as hashable tuple):
In [170]: dd = defaultdict(list)
In [171]: for i,v in enumerate(c):
     ...:     dd[tuple(a[v])].append(i)
...:
In [172]: dd
Out[172]:
defaultdict(list,
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]})
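If the Python loop over the inverse array becomes a bottleneck, the same grouping can be sketched without a loop by sorting the inverse indices and splitting at the group boundaries. This is an alternative I am suggesting, not part of the answer above; np.ravel is applied defensively because the shape of the inverse returned with axis=0 has varied between NumPy releases.

```python
import numpy as np

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

uniq, inverse = np.unique(test_rows, axis=0, return_inverse=True)
inverse = np.ravel(inverse)                   # ensure a flat 1D array

# Stable sort keeps the original row order inside each group.
order = np.argsort(inverse, kind="stable")
counts = np.bincount(inverse)                 # group sizes, in uniq order
groups = np.split(order, np.cumsum(counts)[:-1])

for row, g in zip(uniq, groups):
    print(tuple(row), g)
```

groups[k] holds the positions in test_rows of the row uniq[k], matching the dictionary values shown above.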
This will do the job:
import numpy as np
uniq_rows = np.array([[0, 1, 0],
[1, 1, 0],
[1, 1, 1],
[0, 1, 1]])
test_rows = np.array([[0, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0],
[0, 1, 0],
[0, 1, 1],
[0, 1, 1],
[1, 1, 1],
[1, 1, 0],
[1, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0]])
indices=np.where(np.sum(np.abs(np.repeat(uniq_rows,len(test_rows),axis=0)-np.tile(test_rows,(len(uniq_rows),1))),axis=1)==0)[0]
loc=indices//len(test_rows)
indices=indices-loc*len(test_rows)
res=[[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
print(res)
[[1, 4, 10], [3, 8, 12], [7, 9], [0, 5, 6]]
This will work for all the cases including the cases in which not all the rows in uniq_rows are present in test_rows. However, if somehow you know ahead that all of them are present, you could replace the part
res=[[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
with just this line:
res=np.split(indices,np.where(np.diff(loc)>0)[0]+1)
Thus avoiding loops entirely.
Not very 'numpythonic', but for a bit of an upfront cost, we can make a dict with the keys as a tuple of your row, and a list of indices:
test_rowsdict = {}
for i,j in enumerate(test_rows):
    test_rowsdict.setdefault(tuple(j),[]).append(i)
test_rowsdict
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]}
Then you can filter based on your uniq_rows, with a fast dict lookup: test_rowsdict[tuple(row)]:
out = []
for i in uniq_rows:
    out.append((i, test_rowsdict.get(tuple(i),[])))
For your data, I get 16us for just the lookup, and 66us for building and looking up, versus 95us for your np.where solution.
Approach #1
Here's one approach, not sure about the level of "NumPythonic-ness" though to such a tricky problem -
def get1Ds(a, b): # Get 1D views of each row from the two inputs
    # check that casting to void will create equal size elements
    assert a.shape[1:] == b.shape[1:]
    assert a.dtype == b.dtype
    # compute dtypes
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    # convert to 1d void arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    a_void = a.reshape(a.shape[0], -1).view(void_dt).ravel()
    b_void = b.reshape(b.shape[0], -1).view(void_dt).ravel()
    return a_void, b_void

def matching_row_indices(uniq_rows, test_rows):
    A, B = get1Ds(uniq_rows, test_rows)
    validA_mask = np.in1d(A,B)
    sidx_A = A.argsort()
    validA_mask = validA_mask[sidx_A]
    sidx = B.argsort()
    sortedB = B[sidx]
    split_idx = np.flatnonzero(sortedB[1:] != sortedB[:-1])+1
    all_split_indx = np.split(sidx, split_idx)
    match_mask = np.in1d(B,A)[sidx]
    valid_mask = np.logical_or.reduceat(match_mask, np.r_[0, split_idx])
    locations = [e for i,e in enumerate(all_split_indx) if valid_mask[i]]
    return uniq_rows[sidx_A[validA_mask]], locations
Scope(s) of improvement (on performance) :
np.split could be replaced by a for-loop for splitting using slicing.
np.r_ could be replaced by np.concatenate.
Sample run -
In [331]: unq_rows, idx = matching_row_indices(uniq_rows, test_rows)
In [332]: unq_rows
Out[332]:
array([[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]])
In [333]: idx
Out[333]: [array([ 1, 4, 10]),array([0, 5, 6]),array([ 3, 8, 12]),array([7, 9])]
Approach #2
Another approach to beat the setup overhead from the previous one and making use of get1Ds from it, would be -
A, B = get1Ds(uniq_rows, test_rows)
idx_group = []
for row in A:
    idx_group.append(np.flatnonzero(B == row))
The numpy_indexed package (disclaimer: I am its author) was created to solve problems of this kind in an elegant and efficient manner:
import numpy_indexed as npi
indices = np.arange(len(test_rows))
unique_test_rows, index_groups = npi.group_by(test_rows, indices)
If you don't care about the indices of all rows, but only those present in test_rows, npi has a bunch of simple ways of tackling that problem too; for instance:
subset_indices = npi.indices(unique_test_rows, uniq_rows)
As a sidenote; it might be useful to take a look at the examples in the npi library; in my experience, most of the time people ask a question of this kind, these grouped indices are just a means to an end, and not the endgoal of the computation. Chances are that using the functionality in npi you can reach that end goal more efficiently, without ever explicitly computing those indices. Do you care to give some more background to your problem?
EDIT: if your arrays are indeed this big, and always consist of a small number of columns with binary values, wrapping them with the following encoding might boost efficiency a lot further still:
def encode(rows):
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)
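A hedged usage sketch of this encoding on the question's sample data: each binary row is packed into a single integer code, so row matching becomes a 1D comparison. It assumes strictly 0/1 values and at most 8 columns (the limit of np.uint8); the names uniq_codes, test_codes and locations are mine.

```python
import numpy as np

def encode(rows):
    # Treat each binary row as the bits of one small integer (column i has weight 2**i).
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

uniq_codes = encode(uniq_rows)    # one integer per unique row
test_codes = encode(test_rows)    # one integer per test row

# Locations of each unique row, found by comparing scalar codes.
locations = [np.flatnonzero(test_codes == c) for c in uniq_codes]
```

For the wider value range mentioned in the question's edit (0 to 32000), this particular bit-packing does not apply as written.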