I am working on my own implementation of the weighted knn algorithm.
To simplify the logic, let's represent this as a predict method, which takes three parameters:
indices - matrix of the j nearest neighbors from the training sample for each object i (i = 1...n, n objects in total); element [i, j] is the index of an object from the training sample.
For example, for 4 objects and 3 neighbors:
indices = np.asarray([[0, 3, 1],
                      [0, 3, 1],
                      [1, 2, 0],
                      [5, 4, 3]])
distances - matrix of distances from the j nearest neighbors in the training sample to object i (i = 1...n, n objects in total). For example, for 4 objects and 3 neighbors:
distances = np.asarray([[4.12310563, 7.07106781, 7.54983444],
                        [4.89897949, 6.70820393, 8.24621125],
                        [0., 1.73205081, 3.46410162],
                        [1094.09368886, 1102.55022561, 1109.62245832]])
labels - vector with the true class label for each object j of the training sample. For example:
labels = np.asarray([0, 0, 0, 1, 1, 2])
Thus, the function signature is:
def predict(indices, distances, labels):
    ....
    # return [np.bincount(x).argmax() for x in labels[indices]]
    return predict
In the comment you can see the code that returns the prediction for the "non-weighted" kNN method, which does not use distances. Can you please show how predictions can be calculated using the distance matrix? I found the algorithm, but now I'm completely stumped because I don't know how to implement it with numpy.
Thank you!
This should work:
# compute inverses of distances;
# suppress the division-by-zero warning,
# np.nan_to_num replaces np.inf with a very large number
with np.errstate(divide='ignore'):
    dinv = np.nan_to_num(1 / distances)
# an array with distinct class labels
distinct_labels = np.array(list(set(labels)))
# an array with labels of neighbors
neigh_labels = labels[indices]
# compute the weighted score for each potential label
weighted_scores = ((neigh_labels[:, :, np.newaxis] == distinct_labels) * dinv[:, :, np.newaxis]).sum(axis=1)
# choose the label with the highest score
predictions = distinct_labels[weighted_scores.argmax(axis=1)]
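For completeness, here is a sketch of the snippet above wrapped into the predict signature from the question and run on the example data (under the inverse-distance weighting described above, the expected output is [0 0 0 1]):
import numpy as np

def predict(indices, distances, labels):
    # inverse-distance weights; a zero distance becomes a very large weight
    with np.errstate(divide='ignore'):
        dinv = np.nan_to_num(1 / distances)
    distinct_labels = np.array(list(set(labels)))
    neigh_labels = labels[indices]
    weighted_scores = ((neigh_labels[:, :, np.newaxis] == distinct_labels)
                       * dinv[:, :, np.newaxis]).sum(axis=1)
    return distinct_labels[weighted_scores.argmax(axis=1)]

indices = np.asarray([[0, 3, 1], [0, 3, 1], [1, 2, 0], [5, 4, 3]])
distances = np.asarray([[4.12310563, 7.07106781, 7.54983444],
                        [4.89897949, 6.70820393, 8.24621125],
                        [0., 1.73205081, 3.46410162],
                        [1094.09368886, 1102.55022561, 1109.62245832]])
labels = np.asarray([0, 0, 0, 1, 1, 2])
print(predict(indices, distances, labels))  # -> [0 0 0 1]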
I have some code which calculates the nearest neighbors amongst some vectors (values).
However, the values of these vectors are dependent on weights. Each column of the vectors has a different weight at every iteration.
Just for the sake of the example, in the code below I try each time to find the nearest neighbor of the last vector (values[3]).
That's a very simplified version of my code:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=1)
values = [
    [2, 5, 1],
    [4, 2, 3],
    [1, 5, 2],
    [4, 5, 4]
]
weights = [
    [1, 3, 1],
    [0.5, 2, 1],
    [3, 1, 2]
]
# weights set No1
new_values = []
for line in values:
    new_values.append([a * b for a, b in zip(line, weights[0])])
knn.fit(new_values)
print(knn.kneighbors([new_values[3]]))
# weights set No2
new_values = []
for line in values:
    new_values.append([a * b for a, b in zip(line, weights[1])])
knn.fit(new_values)
print(knn.kneighbors([new_values[3]]))
# weights set No3
new_values = []
for line in values:
    new_values.append([a * b for a, b in zip(line, weights[2])])
knn.fit(new_values)
print(knn.kneighbors([new_values[3]]))
(Obviously I could have a for loop over the different weight sets, but I just wanted to point out the repetition.)
My question is: is there any way that I can avoid using the KNN 3 times and just use it once at the beginning, to do the initial similarity ranking/sorting, and then just do some re-calculations?
In other words, is there any way to reduce the computational complexity of this code in terms of calling the KNN fewer times?
PS
I know that there are KNN implementations which are much faster than the scikit-learn one, but that's not really the point; the point is more about using KNN just once instead of N=3 times or something like that.
Assuming "calling the KNN fewer times" means the number of times the KNN is fit, yes, it's possible. If it means the number of times kneighbors is invoked, that might be difficult, because relative distances aren't preserved under affine transformations.
This solution runs in O(wk log n) time compared to the original O(wn) time, with w being the number of weight sets.
What you're doing is:

1. taking the input points
2. scaling their dimensions (projecting the input points into a new coordinate space)
3. building a knn model from the scaled inputs
4. classifying the target based on the scaled inputs.

However, consider:

1. taking the input points
2. building a knn model from the (unscaled) input points
3. inverse scaling the target point (projecting the target into the original coordinate space)
4. classifying the inverse-scaled target based on the inputs.

The result of this process is that steps 1 and 2 can be reused for each target point. Weights with value 0 will require special handling.
This would look something like:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=1, algorithm="kd_tree")
values = [
    [2, 5, 1],
    [4, 2, 3],
    [1, 5, 2],
    [4, 5, 4]
]
weights = [
    [1, 3, 1],
    [0.5, 2, 1],
    [3, 1, 2]
]
targets = [
    [4, 15, 4],    # values[3] * weights[0]
    [2.0, 10, 4],  # values[3] * weights[1]
    [12, 5, 8]     # values[3] * weights[2]
]
knn.fit(values)
# weights set No1
print(knn.kneighbors([[a/b for a, b in zip(targets[0], weights[0])]]))
# weights set No2
print(knn.kneighbors([[a/b for a, b in zip(targets[1], weights[1])]]))
# weights set No3
print(knn.kneighbors([[a/b for a, b in zip(targets[2], weights[2])]]))
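Since the model is fit only once on the unscaled values, the per-weight work reduces to a single inverse-scaled query. Continuing the snippet above, the same thing as a loop (a sketch; zero weights would need the special handling mentioned earlier, since they cannot be inverted):
knn.fit(values)  # fit once, on the unscaled points
for target, weight in zip(targets, weights):
    # project the scaled target back into the original coordinate space
    query = [[t / w for t, w in zip(target, weight)]]
    print(knn.kneighbors(query))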
Arrays of object labels and of distances to those objects are given. I want to apply kNN to find the predicted label. I want to use np.bincount for that. However, I don't understand how to use it.
Here is an example:
labels = [[1, 1, 2, 0, 0, 3, 3, 3, 5, 1, 3],
          [1, 1, 2, 0, 0, 3, 3, 3, 5, 1, 3]]
weights = [[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]]
Imagine the nearest neighbors for 2 objects are given, with their labels and weights (derived from the distances) shown above. So I want the output to be [5, 5], because only the neighbours with that label have nonzero weight. I am doing the following:
eps = 1e-5
lab_weight = np.array(list(zip(labels, weights)))
predict = np.apply_along_axis(lambda x: np.bincount(x[0], weights=x[1]).argmax(), 2, lab_weight)
I expect x to correspond to [[1,1,2,0,0,3,3,3,5,1,3], [0,0,0,0,0,0,0,0,1,0,0]], but it doesn't. Other axis parameters don't work either. How can I achieve the goal? I want to use numpy functions and avoid Python loops.
The following solution gives me the desired result:
labels = [[1, 1, 2, 0, 0, 3, 3, 3, 5, 1, 3],
          [1, 1, 2, 0, 0, 3, 3, 3, 5, 1, 3]]
weights = [[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]]
length = len(labels[0])
lab_weight = np.hstack((labels, weights))
predict = np.apply_along_axis(lambda x: np.bincount(x[:length], weights=x[length:]).argmax(), 1, lab_weight)
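For reference, a loop-free sketch of the same computation using broadcasting instead of apply_along_axis (the same trick as in the weighted-kNN answer at the top of this page; class labels are assumed to be small non-negative integers):
import numpy as np

labels = np.asarray([[1, 1, 2, 0, 0, 3, 3, 3, 5, 1, 3],
                     [1, 1, 2, 0, 0, 3, 3, 3, 5, 1, 3]])
weights = np.asarray([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]])

classes = np.arange(labels.max() + 1)
# for each object, sum the weights of the neighbours carrying each candidate label
scores = ((labels[:, :, None] == classes) * weights[:, :, None]).sum(axis=1)
predict = classes[scores.argmax(axis=1)]  # -> array([5, 5])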
The problem with your code is that you attempt to apply your function to 2-D slices of your array, whereas apply_along_axis applies the given function to 1-D slices.
So your code generates an exception: ValueError: object of too small depth for desired array.
To apply your function to 2-D slices, use a list comprehension based on np.rollaxis and then create a NumPy array from it:
result = np.array([np.bincount(x[0], weights=x[1]).argmax()
                   for x in np.rollaxis(lab_weight, 2)])
The result, for your array, is:
array([1, 1, 2, 0, 0, 3, 3, 3, 5, 1, 3], dtype=int64)
To trace, for each iteration, the source array, intermediate results, and the final result, run:
i = 0
for x in np.rollaxis(lab_weight, 2):
    print(f' i: {i}\n{x}'); i += 1
    bc = np.bincount(x[0], weights=x[1])
    bcm = bc.argmax()
    print(bc, bcm)
I would like to create a lower triangular matrix with unit diagonal elements from a vector.
From a vector
[a_21, a_31, a_32, ..., a_N1, ... , a_N(N-1)]
how to convert it into a lower triangular matrix with unit diagonal elements of the form,
[[1, 0, ..., 0], [a_21, 1, ..., 0], [a_31, a_32, 1, ..., 0], ..., [a_N1, a_N2, ... , a_N(N-1), 1]]
So far with NumPy:
import numpy as np

A = np.eye(N)                   # N is the matrix size
idx = np.tril_indices(N, k=-1)
A[idx] = X                      # X is the vector of strictly-lower-triangular entries
TensorFlow, however, doesn't support item assignment. I think fill_triangular or tf.reshape could help solve the problem, but I'm not sure how to do it.
I found a similar question and answer:
Packing array into lower triangular of a tensor
Based on the page above, I made a function which transforms a vector into a lower triangular matrix with unit diagonal elements:
def flat_to_mat_TF(vector, n):
    idx = list(zip(*np.tril_indices(n, k=-1)))
    idx = tf.constant([list(i) for i in idx], dtype=tf.int64)
    values = tf.constant(vector, dtype=tf.float32)
    dense = tf.sparse_to_dense(sparse_indices=idx, output_shape=[n, n],
                               sparse_values=values, default_value=0,
                               validate_indices=True)
    mat = tf.matrix_set_diag(dense, tf.cast(tf.tile([1], [n]), dtype=tf.float32))
    return mat
If the input vector is already a Tensor, values = tf.constant() could be eliminated.
You could use fill_triangular_inverse on an ascending array (e.g. one from np.arange).
That gives you the indices of where the elements end up in the lower triangle, and you can use them to reorder your array before passing it to fill_triangular, as sketched below.
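For illustration, a minimal sketch of that reordering idea, assuming TensorFlow Probability's tfp.math.fill_triangular and fill_triangular_inverse are the functions meant; splicing in the unit diagonal from the question (e.g. via tf.linalg.set_diag) is omitted here:
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

n = 4
m = n * (n + 1) // 2

# lower-triangular matrix whose entries are their own row-major positions
pos = np.zeros((n, n), dtype=np.float32)
pos[np.tril_indices(n)] = np.arange(m)

# fill_triangular_inverse reveals the order in which fill_triangular consumes
# its input vector; gathering with that permutation re-sorts a row-major vector
perm = tf.cast(tfp.math.fill_triangular_inverse(pos), tf.int32)

vec = tf.range(m, dtype=tf.float32)                   # row-major lower-triangular entries
mat = tfp.math.fill_triangular(tf.gather(vec, perm))  # mat[i, j] == vec[row_major(i, j)]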
Referring to this link,
which calculates the adjusted cosine similarity matrix (given the ratings matrix M with m users and n items) as below:
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
I cannot see how the 'both rated' condition is met as per this definition.
I have manually calculated the adjusted cosine similarities and they seem to differ from the values I get from the above code.
Could anyone please clarify this?
Let's first try to understand the formulation: the matrix is stored such that each row is a user and each column is an item. Users are indexed by u and items by i.
Each user has a different judgement rule for how good or how bad something is; a 1 from one user could be a 3 from another. That is why we subtract the average of each R_u from each R_{u,i}. This is computed as item_mean_subtracted in your code. Notice that we subtract each element's row mean to normalize the user's bias. After that, we normalize each column (item) by dividing it by its norm and then compute the cosine similarity between each pair of columns.
pdist(item_mean_subtracted.T, 'cosine') computes the cosine distance between the items, and it is known that
cosine similarity = 1 - cosine distance
and hence that is why the code works.
Now, what if we compute it directly according to the definition? I have commented what is being performed in each step; copy and paste the code, and you can compare with your calculation by printing out more intermediate steps.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import norm
M = np.asarray([[2, 3, 4, 1, 0],
                [0, 0, 0, 0, 5],
                [5, 4, 3, 0, 0],
                [1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
print(similarity_matrix)
# Computing the cosine similarity directly
n = len(M[0])  # number of columns (items)
normalized = item_mean_subtracted / norm(item_mean_subtracted, axis=0).reshape(1, n)  # divide each column by its norm
normalized = normalized.T  # transpose it
similarity_matrix2 = np.asarray([[np.inner(normalized[i], normalized[j]) for i in range(n)]
                                 for j in range(n)])  # similarity matrix from inner products of item pairs
print(similarity_matrix2)
Both pieces of code give the same result:
[[ 1. 0.86743396 0.39694169 -0.67525773 -0.72426278]
[ 0.86743396 1. 0.80099604 -0.64553225 -0.90790362]
[ 0.39694169 0.80099604 1. -0.37833504 -0.80337196]
[-0.67525773 -0.64553225 -0.37833504 1. 0.26594024]
[-0.72426278 -0.90790362 -0.80337196 0.26594024 1. ]]
I searched a bit around and found comparable questions/answers, but none of them returned the correct results for me.
Situation:
I have an array with a number of clumps of values == 1, while the rest of the cells are set to zero. Each cell is a square (width=height).
Now I want to calculate the average distance between all 1 values.
The formula should be like this: d = sqrt(((x2 - x1) * size)**2 + ((y2 - y1) * size)**2)
Example:
import numpy as np
from scipy.spatial.distance import pdist
a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])
# Given that each cell is 10m wide/high
val = 10
d = pdist(a, lambda u, v: np.sqrt((((u - v) * val)**2).sum()))
d
array([ 14.14213562, 10. , 10. ])
After that I would calculate the average via d.mean(). However, the result in d is obviously wrong, as the distance between the cells in the top row alone should already be 20 (two cells crossed * 10). Is there something wrong with my formula, my math, or my approach?
You need the actual coordinates of the non-zero markers, to compute the distance between them:
>>> import numpy as np
>>> from scipy.spatial.distance import squareform, pdist
>>> a = np.array([[1, 0, 1],
...               [0, 0, 0],
...               [0, 0, 1]])
>>> np.where(a)
(array([0, 0, 2]), array([0, 2, 2]))
>>> x,y = np.where(a)
>>> coords = np.vstack((x,y)).T
>>> coords
array([[0, 0],   # That's the coordinate of the "1" in the top left,
       [0, 2],   # top right,
       [2, 2]])  # and bottom right.
Next you want to calculate the distance between these points. You use pdist for this, like so:
>>> dists = pdist(coords) * 10 # Uses the Euclidean distance metric by default.
>>> squareform(dists)
array([[ 0.        , 20.        , 28.28427125],
       [20.        ,  0.        , 20.        ],
       [28.28427125, 20.        ,  0.        ]])
In this last matrix you will find (above the diagonal) the distance between each marked point in a and the others. In this case, you had 3 coordinates, so it gives you the distance between node 0 (a[0,0]) and node 1 (a[0,2]), between node 0 and node 2 (a[2,2]), and finally between node 1 and node 2. To put it in different words, if S = squareform(dists), then S[i,j] returns the distance between the coordinates on row i of coords and row j.
Just the values in the upper triangle of that last matrix are also present in the variable dists, from which you can derive the mean easily, without having to perform the relatively expensive calculation of the squareform (shown here just for demonstration purposes):
>>> dists
array([ 20. , 28.2842712, 20. ])
>>> dists.mean()
22.761423749153966
Remark that your computed solution "looks" nearly correct (aside from a factor of 2), because of the example you chose. What pdist does is take the Euclidean distance between the first point in n-dimensional space and the second, then between the first and the third, and so on. In your example, that means it computes the distance between the point on row 0, which has coordinates in 3-dimensional space given by [1,0,1], and the 2nd point, [0,0,0]. The Euclidean distance between those two is sqrt(2) ≈ 1.4. Then, the distance between the first and the 3rd coordinate (the last row in a) is only 1. Finally, the distance between the 2nd coordinate (row 1: [0,0,0]) and the 3rd (last row, row 2: [0,0,1]) is also 1. So remember, pdist interprets its first argument as a stack of coordinates in n-dimensional space, n being the number of elements in the tuple of each node.
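A quick check of that interpretation on the array from the question (these are exactly the values of d above, divided by val):
>>> pdist(a)  # rows of `a` treated as three points in 3-D space
array([ 1.41421356,  1.        ,  1.        ])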