How to find the nearest neighbor in numpy? - python

There are two arrays, u and v.
u.shape = (N,d)
v.shape = (q,d)
For every row of v (q of them), I need to find the index of the nearest value in u, separately for each of the d columns.
For example:
u = [[5,3],
[3,4],
[3,2],
[8,7]] , shape (4,2)
v = [[1,3],
[2,4]] , shape (2,2)
I have seen many people suggest doing it like this:
v = np.expand_dims(v, axis=1) # reshape to (2,1,2) for broadcasting
result = np.argmin(abs(v-u), axis=1) # (v-u).shape = (2,4,2)
Of course, this finds the index of the nearest value. But when two values are equally near, I need the index of the "second" one.
In that case:
v-u = [[[-4, 0],
[-2, -1],
[-2, 1],
[-7, -4]],
[[-3, 1],
[-1, 0],
[-1, 2],
[-6, -3]]]
Along axis=1, there are two -2 in (v-u)[0,:,0] and two -1 in (v-u)[1,:,0].
If we directly use:
result = np.argmin(abs(v-u),axis=1)
result will be:
array([[1, 0],
[1, 1]], dtype=int64)
It returns the indices corresponding to the first occurrence, but I need the second one, i.e.
array([[2, 0],
[2, 1]], dtype=int64)
Can anyone help? Thanks!

If there can be at most 2 minimal values, you can retrieve the indices of the last minimum.
To do it:
reverse abs(v-u) along axis 1,
compute argmin, getting a "reversed_index" (actually the index in the reversed array),
map back to "original" indices using the formula u.shape[0] - 1 - <reversed_index> (in your case of 4 rows, reversed index == 3 corresponds to original index == 0).
The whole code is:
u.shape[0] - 1 - np.argmin(abs(v-u)[:,::-1,:],axis=1)
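As a quick check of the reverse-and-argmin idea on the arrays from the question, here is a minimal sketch. The names dist, first_min and last_min are only illustrative, and v[:, None, :] is used as a shorthand for np.expand_dims(v, axis=1).

```python
import numpy as np

u = np.array([[5, 3],
              [3, 4],
              [3, 2],
              [8, 7]])          # shape (4, 2)
v = np.array([[1, 3],
              [2, 4]])          # shape (2, 2)

# Broadcast v against u: dist has shape (2, 4, 2)
dist = np.abs(v[:, None, :] - u)

# Plain argmin returns the first occurrence of each minimum
first_min = np.argmin(dist, axis=1)

# Reverse along axis 1, take argmin, then map back to original indices
last_min = u.shape[0] - 1 - np.argmin(dist[:, ::-1, :], axis=1)

print(first_min)  # [[1 0] [1 1]]
print(last_min)   # [[2 0] [2 1]]
```

Where a minimum is unique (the second column here), both versions agree; they differ only where the minimum occurs more than once.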
Another option, when there can be more than 2 minimal values, is to write a specialized version of argmin for a 1-D input array that returns the index of the second minimal value if the minimum is not unique:
def argmin2(arr):
    ind = arr.argpartition(1)[:2]
    return ind[0] if arr[ind[0]] < arr[ind[1]] else ind[1]
and then apply it to abs(v-u) along axis 1:
np.apply_along_axis(argmin2, 1, abs(v-u))
Note that argpartition makes no promise about the relative order of equal elements, so with ties ind[1] is not guaranteed to be the later of the tied positions.

Related

Reduce sum with condition in tensorflow

I am given a 2D Tensor with stochastic rows. After applying tf.math.greater() and tf.cast(tf.int32) I am left with a Tensor of 0's and 1's. I now want to apply a reduce sum to that matrix, but with a condition: once at least one 1 has been summed and a 0 follows, I want to drop all following 1's as well, meaning 1 0 1 should result in 1 instead of 2.
I have tried to solve the problem with tf.scan(), but I have not yet come up with a function that can handle leading 0's, because a row might look like: 0 0 0 1 0 1
One idea was to set the lower part of the matrix to one (because I know everything left of the diagonal will always be 0) and then have a function like tf.scan() filter out the spots (see code and error message below).
Let z be the matrix after tf.cast.
helper = tf.matrix_band_part(tf.ones_like(z), -1, 0)
z = tf.math.logical_or(tf.cast(z, tf.bool), tf.cast(helper,tf.bool))
z = tf.cast(z, tf.int32)
z = tf.scan(lambda a, x: x if a == 1 else 0 ,z)
Resulting in:
ValueError: Incompatible shape for value ([]), expected ([5])
IIUC, this is one way to do what you want without scanning or looping. It may be a bit convoluted, and is actually iterating the columns twice (one cumsum and one cumprod), but being vectorized operations I think it is probably faster. Code is TF 2.x but runs the same in TF 1.x (except for the last line obviously).
import tensorflow as tf
# Example data
a = tf.constant([[0, 0, 0, 0],
[1, 0, 0, 0],
[0, 1, 1, 0],
[0, 1, 0, 1],
[1, 1, 1, 0],
[1, 1, 0, 1],
[0, 1, 1, 1],
[1, 1, 1, 1]])
# Cumsum columns
c = tf.math.cumsum(a, axis=1)
# Column-wise differences
diffs = tf.concat([tf.ones([tf.shape(c)[0], 1], c.dtype), c[:, 1:] - c[:, :-1]], axis=1)
# Find point where we should not sum anymore (cumsum is not zero and difference is zero)
cutoff = tf.equal(a, 0) & tf.not_equal(c, 0)
# Make mask
mask = tf.math.cumprod(tf.dtypes.cast(~cutoff, tf.uint8), axis=1)
# Compute result
result = tf.reduce_max(c * tf.dtypes.cast(mask, c.dtype), axis=1)
print(result.numpy())
# [0 1 2 1 3 2 3 4]
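To trace the masking logic without TensorFlow, the same cumsum/cumprod steps can be sketched in NumPy. This is an analogue for illustration, not the answer's code; the function name leading_run_sum is made up here.

```python
import numpy as np

def leading_run_sum(a):
    """Sum each row's first run of 1s, ignoring any 1s after that run ends."""
    a = np.asarray(a)
    c = np.cumsum(a, axis=1)
    # A 0 that appears after at least one 1 marks the end of the first run
    cutoff = (a == 0) & (c != 0)
    # The cumulative product turns everything from the first cutoff onward into 0
    mask = np.cumprod(~cutoff, axis=1)
    return (c * mask).max(axis=1)

a = np.array([[0, 0, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 0],
              [1, 1, 0, 1],
              [0, 1, 1, 1],
              [1, 1, 1, 1]])
print(leading_run_sum(a))  # [0 1 2 1 3 2 3 4]
```

Leading zeros are handled because the cumulative sum is still 0 there, so they never count as a cutoff.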

Tensor reduction based off index vector

As an example, I have 2 tensors: A = [1;2;3;4;5;6;7] and B = [2;3;2]. The idea is that I want to reduce A based off B - such that B's values represent how to sum A's values- such that B = [2;3;2] means the reduced A shall be the sum of the first 2 values, next 3, and last 2: A' = [(1+2);(3+4+5);(6+7)]. It is apparent that the sum of B shall always be equal to the length of A. I'm trying to do this as efficiently as possible - preferably specific functions or matrix operations contained within pytorch/python. Thanks!
Here is the solution.
First, we create an array of indices B_idx with the same size as A.
Then we accumulate (add) all elements of A according to the indices in B_idx using index_add_.
A = torch.arange(1, 8)
B = torch.tensor([2, 3, 2])
B_idx = [idx.repeat(times) for idx, times in zip(torch.arange(len(B)), B)]
B_idx = torch.cat(B_idx) # tensor([0, 0, 1, 1, 1, 2, 2])
A_sum = torch.zeros_like(B)
A_sum.index_add_(dim=0, index=B_idx, source=A)
print(A_sum) # tensor([ 3, 12, 13])
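The same segment-sum idea can be sketched in NumPy, which may help in checking the index construction by hand. This is an analogue under the assumption that every entry of B is positive. np.add.at plays the role of index_add_ here, and np.add.reduceat is an alternative that needs only the segment start offsets.

```python
import numpy as np

A = np.arange(1, 8)          # [1 2 3 4 5 6 7]
B = np.array([2, 3, 2])      # segment lengths, assumed to sum to len(A)

# Segment id for every element of A, mirroring B_idx above
B_idx = np.repeat(np.arange(len(B)), B)      # [0 0 1 1 1 2 2]

# Unbuffered scatter-add of A into one slot per segment
A_sum = np.zeros(len(B), dtype=A.dtype)
np.add.at(A_sum, B_idx, A)

# Alternative: sum between consecutive segment start offsets
starts = np.cumsum(B) - B                    # [0 2 5]
A_sum_reduceat = np.add.reduceat(A, starts)

print(A_sum)            # [ 3 12 13]
print(A_sum_reduceat)   # [ 3 12 13]
```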

Sorting by another matrix works in one case but fails for another

I need to sort matrices according to the descending order of the values in another matrix.
E.g. in a first step I would have the following matrix A:
1 0 1 0 1
0 1 0 1 0
0 1 0 1 1
1 0 1 0 0
Then for the procedure I am following I need to take the rows of the matrix as binary numbers and sort them in descending order of their binary value.
I am doing this the following way:
for i in range(0,num_rows):
    for j in range(0,num_cols):
        row_val[i] = row_val[i] + A[i][j] * (2 ** (num_cols - 1 - j))
This gets me a 4x1 vector row_val with the following values:
21
10
11
20
Now I am sorting the rows of the matrix according to row_val by
A = [x for _,x in sorted(zip(row_val,A),reverse=True)]
This works perfectly fine I get the matrix A:
1 0 1 0 1
1 0 1 0 0
0 1 0 1 1
0 1 0 1 0
However, now I need to apply the same procedure to the columns. So I calculate the col_val vector with the binary values of the columns:
12
3
12
3
10
To sort the matrix A according to the vector col_val I thought I could just transpose matrix A and then do the same as before:
At = np.transpose(A)
At = [y for _,y in sorted(zip(col_val,At),reverse=True)]
Unfortunately, this fails with the error message
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I suspect that this might be because there are several entries with the same value in the vector col_val; however, in an example shown in another question, the sorting seems to work for a case with several equal entries.
Your suspicion is correct. You can't sort multidimensional numpy arrays with the Python builtin sorted, because comparing two rows, say, yields a row of truth values instead of a single one:
A[0] < A[1]
# array([False, True, False, True, False])
so sorted can't tell which should go before the other.
In your first example this is masked by lexicographic ordering of tuples: Because tuples are compared left to right and because row_val has unique entries the comparison never looks at the second elements.
But in your second example because some col_val entries are equal, the comparison will look at At for a tie breaker which is where the exception occurs.
Here is a working method which uses numpy methods:
A[np.argsort(np.packbits(A, axis=1).ravel())[::-1]]
# array([[1, 0, 1, 0, 1],
# [1, 0, 1, 0, 0],
# [0, 1, 0, 1, 1],
# [0, 1, 0, 1, 0]])
A[:, np.argsort(np.packbits(A, axis=0).ravel())[::-1]]
# array([[1, 1, 1, 0, 0],
# [0, 0, 0, 1, 1],
# [1, 0, 0, 1, 1],
# [0, 1, 1, 0, 0]])
Explanation:
np.packbits, as the name suggests, packs binary vectors into a bit field. It is almost equivalent to your hand-written code, with one small difference: packbits operates on chunks of 8 bits and pads with zeros on the right, so for example [1, 1] becomes 192, not 3. For vectors of up to 8 bits the padding is the same for every vector, so it does not change the sort order.
np.argsort does an indirect sort: it does not move the elements of its operand A but returns the sequence of indices I that would sort it, so that A[I] == np.sort(A). This is useful when we want to order something based on the order of something else, as in this case.
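If you would rather keep your own col_val computation, the tie-breaking problem described above can also be avoided by never letting the arrays take part in the comparison. The sketch below shows two ways, assuming A is the row-sorted matrix from the question: a key function for sorted, and a stable np.argsort on the negated values. The names weights, order and A_sorted are only illustrative.

```python
import numpy as np

A = np.array([[1, 0, 1, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [0, 1, 0, 1, 0]])

# Binary value of each column, most significant bit in the first row
weights = 2 ** np.arange(A.shape[0] - 1, -1, -1)   # [8 4 2 1]
col_val = weights @ A

# Option 1: compare only the numbers, so ties never reach the arrays
At = [col for _, col in sorted(zip(col_val, A.T),
                               key=lambda pair: pair[0], reverse=True)]
A_sorted = np.array(At).T

# Option 2: stable indirect sort, descending by negating the values
order = np.argsort(-col_val, kind="stable")

print(A_sorted)
print(np.array_equal(A_sorted, A[:, order]))   # True

# The padding behaviour of packbits mentioned above
print(np.packbits(np.array([1, 1], dtype=np.uint8)))   # [192]
```

Both options keep tied columns in their original relative order, because sorted and the "stable" argsort are stable sorts.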

Operations on 'N' dimensional numpy arrays

I am attempting to generalize some Python code to operate on arrays of arbitrary dimension. The operations are applied to each vector in the array. So for a 1D array, there is simply one operation, for a 2-D array it would be both row and column-wise (linearly, so order does not matter). For example, a 1D array (a) is simple:
b = operation(a)
where 'operation' is expecting a 1D array. For a 2D array, the operation might proceed as
for ii in range(0,a.shape[0]):
    b[ii,:] = operation(a[ii,:])
for jj in range(0,b.shape[1]):
    c[:,jj] = operation(b[:,jj])
I would like to make this general where I do not need to know the dimension of the array beforehand, and not have a large set of if/elif statements for each possible dimension.
Solutions that are general for 1 or 2 dimensions are ok, though a completely general solution would be preferred. In reality, I do not imagine needing this for any dimension higher than 2, but if I can see a general example I will learn something!
Extra information:
I have a matlab code that uses cells to do something similar, but I do not fully understand how it works. In this example, each vector is rearranged (basically the same function as fftshift in numpy.fft). Not sure if this helps, but it operates on an array of arbitrary dimension.
function aout=foldfft(ain)
nd = ndims(ain);
for k = 1:nd
    nx = size(ain,k);
    kx = floor(nx/2);
    idx{k} = [kx:nx 1:kx-1];
end
aout = ain(idx{:});
In Octave, your MATLAB code does:
octave:19> size(ain)
ans =
2 3 4
octave:20> idx
idx =
{
[1,1] =
1 2
[1,2] =
1 2 3
[1,3] =
2 3 4 1
}
and then it uses the idx cell array to index ain. With these dimensions it 'rolls' the size 4 dimension.
For 5 and 6 the index lists would be:
2 3 4 5 1
3 4 5 6 1 2
The equivalent in numpy is:
In [161]: ain=np.arange(2*3*4).reshape(2,3,4)
In [162]: idx=np.ix_([0,1],[0,1,2],[1,2,3,0])
In [163]: idx
Out[163]:
(array([[[0]],
[[1]]]), array([[[0],
[1],
[2]]]), array([[[1, 2, 3, 0]]]))
In [164]: ain[idx]
Out[164]:
array([[[ 1,  2,  3,  0],
        [ 5,  6,  7,  4],
        [ 9, 10, 11,  8]],
       [[13, 14, 15, 12],
        [17, 18, 19, 16],
        [21, 22, 23, 20]]])
Besides the 0 based indexing, I used np.ix_ to reshape the indexes. MATLAB and numpy use different syntax to index blocks of values.
The next step is to construct [0,1],[0,1,2],[1,2,3,0] with code, a straightforward translation.
I can use np.r_ as a short cut for turning 2 slices into an index array:
In [201]: idx=[]
In [202]: for nx in ain.shape:
   .....:     kx = int(np.floor(nx/2.))
   .....:     kx = kx-1;
   .....:     idx.append(np.r_[kx:nx, 0:kx])
   .....:
In [203]: idx
Out[203]: [array([0, 1]), array([0, 1, 2]), array([1, 2, 3, 0])]
and pass this through np.ix_ to make the appropriate index tuple:
In [204]: ain[np.ix_(*idx)]
Out[204]:
array([[[ 1,  2,  3,  0],
        [ 5,  6,  7,  4],
        [ 9, 10, 11,  8]],
       [[13, 14, 15, 12],
        [17, 18, 19, 16],
        [21, 22, 23, 20]]])
In this case, where 2 dimensions don't roll anything, slice(None) could replace those:
In [210]: idx=(slice(None),slice(None),[1,2,3,0])
In [211]: ain[idx]
======================
np.roll does:
indexes = concatenate((arange(n - shift, n), arange(n - shift)))
res = a.take(indexes, axis)
np.apply_along_axis is another function that constructs an index array (and turns it into a tuple for indexing).
If you are looking for a programmatic way to index the k-th dimension of an n-dimensional array, then numpy.take might help you.
An implementation of foldfft is given below as an example:
In[1]:
import numpy as np
def foldfft(ain):
    result = ain
    nd = len(ain.shape)
    for k in range(nd):
        nx = ain.shape[k]
        kx = (nx+1)//2
        shifted_index = list(range(kx,nx)) + list(range(kx))
        result = np.take(result, shifted_index, k)
    return result
a = np.indices([3,3])
print("Shape of a = ", a.shape)
print("\nStarting array:\n\n", a)
print("\nFolded array:\n\n", foldfft(a))
Out[1]:
Shape of a = (2, 3, 3)
Starting array:
[[[0 0 0]
[1 1 1]
[2 2 2]]
[[0 1 2]
[0 1 2]
[0 1 2]]]
Folded array:
[[[2 0 1]
[2 0 1]
[2 0 1]]
[[2 2 2]
[0 0 0]
[1 1 1]]]
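With the split point kx = (nx+1)//2 used above, the per-axis np.take appears to reproduce what np.fft.fftshift does when applied over all axes. The sketch below checks that on one small array; foldfft_take is a hypothetical name for a compact variant of the function above, and the comparison is a sanity check rather than a proof.

```python
import numpy as np

def foldfft_take(ain):
    """Rotate every axis so that index (n + 1) // 2 comes first."""
    result = ain
    for k, nx in enumerate(ain.shape):
        kx = (nx + 1) // 2
        # np.r_ joins the two slices into a single index array
        result = np.take(result, np.r_[kx:nx, 0:kx], axis=k)
    return result

a = np.arange(2 * 3 * 4).reshape(2, 3, 4)
folded = foldfft_take(a)

# fftshift with no axes argument shifts every axis by n // 2
print(np.array_equal(folded, np.fft.fftshift(a)))   # True
```

Note that this split point differs from the one in the MATLAB snippet from the question, so the two do not produce the same ordering.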
You could use numpy.ndarray.flat, which allows you to linearly iterate over a n dimensional numpy array. Your code should then look something like this:
b = np.asarray(x)
for i in range(len(x.flat)):
    b.flat[i] = operation(x.flat[i])
The folks above provided multiple appropriate solutions. For completeness, here is my final solution. In this toy example for the case of 3 dimensions, the function 'ops' replaces the first and last element of a vector with 1.
import numpy as np
def ops(s):
    s[0]=1
    s[-1]=1
    return s
a = np.random.rand(4,4,3)
print('------')
print('Array a')
print(a)
print('------')
for ii in np.arange(a.ndim):
    a = np.apply_along_axis(ops,ii,a)
    print('------')
    print(' Axis',str(ii))
    print(a)
    print('------')
    print(' ')
The resulting 3D array has a 1 in every element on the 'border' with the numbers in the middle of the array unchanged. This is of course a toy example; however ops could be any arbitrary function that operates on a 1D vector.
Flattening the vector will also work; I chose not to pursue that simply because the book-keeping is more difficult and apply_along_axis is the simplest approach.
apply_along_axis reference page
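The loop over axes can be wrapped into a small helper that works for any number of dimensions. This is a minimal sketch, assuming the 1-D function returns a vector of the same length; apply_over_all_axes and set_ends_to_one are made-up names for illustration.

```python
import numpy as np

def apply_over_all_axes(func, arr):
    """Apply a 1-D function along every axis of arr in turn."""
    out = np.asarray(arr)
    for axis in range(out.ndim):
        out = np.apply_along_axis(func, axis, out)
    return out

def set_ends_to_one(vec):
    """Toy 1-D operation: set the first and last element to 1."""
    vec = vec.copy()   # avoid modifying the slice passed in
    vec[0] = 1
    vec[-1] = 1
    return vec

a = np.zeros((4, 4, 3))
b = apply_over_all_axes(set_ends_to_one, a)

# Only the interior elements keep their original value of 0
print(b[1:-1, 1:-1, 1:-1].ravel())   # [0. 0. 0. 0.]
print(int(b.sum()))                  # 44
```

The same helper also covers the 1-D and 2-D cases from the question, since the loop simply runs once or twice.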

Finding the row with the highest average in a numpy array

Given the following array:
complete_matrix = numpy.array([
[0, 1, 2, 4],
[1, 0, 3, 5],
[2, 3, 0, 6],
[4, 5, 6, 0]])
I would like to identify the row with the highest average, excluding the diagonal zeros.
So, in this case, I would be able to identify complete_matrix[3] as being the row with the highest average.
Note that the presence of the zeros doesn't affect which row has the highest mean because all rows have the same number of elements. Therefore, we just take the mean of each row, and then ask for the index of the largest element.
#Take the mean along the 1st index, ie collapse into a Nx1 array of means
means = np.mean(complete_matrix, 1)
#Now just get the index of the largest mean
idx = np.argmax(means)
idx is now the index of the row with the highest mean!
You don't need to worry about the 0s; they shouldn't affect how the averages compare, since there will presumably be one in each row. Hence, you can do something like this to get the index of the row with the highest average:
>>> import numpy as np
>>> complete_matrix = np.array([
... [0, 1, 2, 4],
... [1, 0, 3, 5],
... [2, 3, 0, 6],
... [4, 5, 6, 0]])
>>> np.argmax(np.mean(complete_matrix, axis=1))
3
Reference:
numpy.mean
numpy.argmax
As pointed out by a lot of people, presence of zeros isn't an issue as long as you have the same number of zeros in each column. Just in case your intention was to ignore all the zeros, preventing them from participating in the average computation, you could use weights to suppress the contribution of the zeros. The following solution assigns 0 weight to zero entries, 1 otherwise:
numpy.argmax(numpy.average(complete_matrix,axis=0, weights=complete_matrix!=0))
You can always create a weight matrix where the weight is 0 for diagonal entries, and 1 otherwise.
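A weight matrix of that kind can be built from np.eye. The sketch below is one way to do it for the matrix in the question, taking the averages over rows (axis=1); the names weights, row_means and best_row are only illustrative.

```python
import numpy as np

complete_matrix = np.array([
    [0, 1, 2, 4],
    [1, 0, 3, 5],
    [2, 3, 0, 6],
    [4, 5, 6, 0]])

# Weight 0 on the diagonal, 1 everywhere else
weights = 1 - np.eye(*complete_matrix.shape)

# Weighted average per row, so diagonal entries do not contribute
row_means = np.average(complete_matrix, axis=1, weights=weights)
best_row = int(np.argmax(row_means))

print(row_means)   # approximately [2.33 3.   3.67 5.  ]
print(best_row)    # 3
```

Unlike weights=complete_matrix!=0, this ignores only the diagonal, so a genuine zero elsewhere in the matrix still counts toward the average.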
This answer actually fits better with your other question, which was marked as a duplicate of this one (I don't know why, because it is not the same question).
The presence of zeros can indeed affect the columns' or rows' average, for instance:
a = np.array([[ 0, 1, 0.9, 1],
[0.9, 0, 1, 1],
[ 1, 1, 0, 0.5]])
Without eliminating the diagonal, column 3 appears to have the highest average; once the diagonal is eliminated, the highest average belongs to column 1, and column 3 has the lowest average of all columns!
You can correct the calculated mean using the lcm (least common multiple) of the number of rows with and without the diagonal element, making sure the correction is not applied to columns that have no diagonal element:
correction = column_sum/lcm(len(column), len(column)-1)
new_mean = mean + correction
I copied the algorithm for lcm from this answer and proposed a solution for your case:
import numpy as np
def gcd(a, b):
    """Return greatest common divisor using Euclid's Algorithm."""
    while b:
        a, b = b, a % b
    return a
def lcm(a, b):
    """Return lowest common multiple."""
    return a * b // gcd(a, b)
def mymean(a):
    if len(a.diagonal()) < a.shape[1]:
        tmp = np.hstack((a.diagonal()*0+1,0))
    else:
        tmp = a.diagonal()*0+1
    return np.mean(a, axis=0) + np.sum(a,axis=0)*tmp/lcm(a.shape[0],a.shape[0]-1)
Testing with the a given above:
mymean(a)
#array([ 0.95 , 1. , 0.95 , 0.83333333])
With another example:
b = np.array([[ 0, 1, 0.9, 0],
[0.9, 0, 1, 1],
[ 1, 1, 0, 0.5],
[0.9, 0.2, 1, 0],
[ 1, 1, 0.7, 0.5]])
mymean(b)
#array([ 0.95, 0.8 , 0.9 , 0.5 ])
With the corrected average you just use np.argmax() to get the column index with the highest average. Similarly, np.argmin() to get the index of the column with the least average:
np.argmin(mymean(a))
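As a cross-check on the corrected means, the same numbers can be obtained more directly by masking out the diagonal and dividing each column sum by the number of entries that remain. This is a sketch of an alternative, not the method above; masked_column_means is a hypothetical name.

```python
import numpy as np

def masked_column_means(m):
    """Column means of m with main-diagonal entries left out."""
    m = np.asarray(m, dtype=float)
    keep = ~np.eye(*m.shape, dtype=bool)      # False on the diagonal
    return np.where(keep, m, 0.0).sum(axis=0) / keep.sum(axis=0)

a = np.array([[0,   1,   0.9, 1],
              [0.9, 0,   1,   1],
              [1,   1,   0,   0.5]])
b = np.array([[0,   1,   0.9, 0],
              [0.9, 0,   1,   1],
              [1,   1,   0,   0.5],
              [0.9, 0.2, 1,   0],
              [1,   1,   0.7, 0.5]])

print(masked_column_means(a))   # approximately [0.95 1.   0.95 0.83]
print(masked_column_means(b))   # approximately [0.95 0.8  0.9  0.5 ]
print(np.argmax(masked_column_means(a)), np.argmin(masked_column_means(a)))  # 1 3
```

Because np.eye accepts a rectangular shape, columns without a diagonal element (the last column of a) are simply averaged over all rows.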
