Referring to this link, which calculates the adjusted cosine similarity matrix (given a ratings matrix M with m users and n items) as follows:
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
I cannot see how the 'both rated' condition from that definition is enforced here.
I have manually calculated the adjusted cosine similarities and they seem to differ from the values I get from the above code.
Could anyone please clarify this?
Let's first try to understand the formulation. The matrix is stored such that each row is a user and each column is an item; users are indexed by u and items by i.
Each user has a different judgement of how good or bad something is: a 1 from one user could be a 3 from another. That is why we subtract each user's average rating from each R_{u,i}. This is computed as item_mean_subtracted in your code. Notice that we subtract each element by its row mean to remove the user's bias. After that, we normalize each column (item) by dividing it by its norm and then compute the cosine similarity between each pair of columns.
pdist(item_mean_subtracted.T, 'cosine') computes the cosine distance between the items, and it is known that
cosine similarity = 1 - cosine distance
which is why the code works.
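If you want to convince yourself of that identity, here is a quick check on two toy vectors (the values are arbitrary, chosen just for illustration):
import numpy as np
from scipy.spatial.distance import pdist

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.5])
cos_sim = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine similarity by definition
cos_dist = pdist(np.vstack([x, y]), 'cosine')[0]              # cosine distance from pdist
print(np.isclose(cos_sim, 1 - cos_dist))  # True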
Now, what if I compute it directly according to the definition? I have commented what is being performed in each step; try to copy and paste the code, and you can compare with your calculation by printing out more intermediate steps.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import norm
M = np.asarray([[2, 3, 4, 1, 0],
                [0, 0, 0, 0, 5],
                [5, 4, 3, 0, 0],
                [1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
print(similarity_matrix)
#Computing the cosine similarity directly
n = len(M[0]) # find out number of columns(items)
normalized = item_mean_subtracted/norm(item_mean_subtracted, axis = 0).reshape(1,n) #divide each column by its norm, normalize it
normalized = normalized.T #transpose it
similarity_matrix2 = np.asarray([[np.inner(normalized[i],normalized[j] ) for i in range(n)] for j in range(n)]) # compute the similarity matrix by taking inner product of any two items
print(similarity_matrix2)
Both pieces of code give the same result:
[[ 1. 0.86743396 0.39694169 -0.67525773 -0.72426278]
[ 0.86743396 1. 0.80099604 -0.64553225 -0.90790362]
[ 0.39694169 0.80099604 1. -0.37833504 -0.80337196]
[-0.67525773 -0.64553225 -0.37833504 1. 0.26594024]
[-0.72426278 -0.90790362 -0.80337196 0.26594024 1. ]]
I am working on my own implementation of the weighted knn algorithm.
To simplify the logic, let's represent this as a predict method, which takes three parameters:
indices - matrix of the j nearest neighbors from the training sample for each object i (i = 1...n, n objects in total); [i, j] is the index of an object from the training sample.
For example, for 4 objects and 3 neighbors:
indices = np.asarray([[0, 3, 1],
                      [0, 3, 1],
                      [1, 2, 0],
                      [5, 4, 3]])
distances - matrix of the distances from the j nearest training-sample neighbors to object i (i = 1...n, n objects in total). For example, for 4 objects and 3 neighbors:
distances = np.asarray([[4.12310563, 7.07106781, 7.54983444],
                        [4.89897949, 6.70820393, 8.24621125],
                        [0., 1.73205081, 3.46410162],
                        [1094.09368886, 1102.55022561, 1109.62245832]])
labels - vector with the true class label of each object j in the training sample. For example:
labels = np.asarray([0, 0, 0, 1, 1, 2])
Thus, the function signature is:
def predict(indices, distances, labels):
    ...
    # return [np.bincount(x).argmax() for x in labels[indices]]
    return predictions
In the comment you can see the code that returns the prediction for the "non-weighted" kNN method, which does not use distances. Can you please show how the predictions can be calculated using the distance matrix? I found the algorithm, but now I'm completely stumped because I don't know how to implement it with numpy.
Thank you!
This should work:
# compute inverses of distances
# suppress division by 0 warning,
# replace np.inf with a very large number
with np.errstate(divide='ignore'):
    dinv = np.nan_to_num(1 / distances)
# an array with distinct class labels
distinct_labels = np.array(list(set(labels)))
# an array with labels of neighbors
neigh_labels = labels[indices]
# compute the weighted score for each potential label
weighted_scores = ((neigh_labels[:, :, np.newaxis] == distinct_labels) * dinv[:, :, np.newaxis]).sum(axis=1)
# choose the label with the highest score
predictions = distinct_labels[weighted_scores.argmax(axis=1)]
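For completeness, here is one way to wrap this into the predict function from the question and call it on the example arrays above. This is just a sketch; I use np.unique instead of list(set(...)), which gives the same result.
import numpy as np

def predict(indices, distances, labels):
    # inverse-distance weights; a zero distance turns into a very large weight
    with np.errstate(divide='ignore'):
        dinv = np.nan_to_num(1 / distances)
    distinct_labels = np.unique(labels)      # distinct class labels
    neigh_labels = labels[indices]           # labels of each object's neighbors
    # sum the weights of the neighbors voting for each label
    weighted_scores = ((neigh_labels[:, :, np.newaxis] == distinct_labels)
                       * dinv[:, :, np.newaxis]).sum(axis=1)
    return distinct_labels[weighted_scores.argmax(axis=1)]

# with the indices, distances and labels arrays from the question this
# should print [0 0 0 1]
print(predict(indices, distances, labels))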
I am seeking to construct a matrix of which I will calculate the inverse. This will be used in an implicit method for solving a nonlinear parabolic PDE. My current approach, for reasons that will become obvious, gives me a singular (non-invertible) matrix. For context, in reality the matrix will be 30 by 30, but in these examples I am using smaller matrices for testing purposes.
Say I want to create a large square sparse matrix. Using spdiags only allows you to input the members of the main, lower and upper diagonals individually. So how do you make it so that each diagonal has one value for all its entries?
Example Code:
import numpy as np
from scipy.sparse import spdiags
from numpy.linalg import inv
updiag = -0.25
diag = 0.5
lowdiag = -0.25
Jdata = np.array([[diag], [lowdiag], [updiag]])
Diags = [0, -1, 1]
J = spdiags(Jdata, Diags, 3, 3).toarray()
print(J)
inverseJ = inv(J)
print(inverseJ)
This produces a 3 x 3 matrix, but with only the first entry of each diagonal filled in. I wondered about using np.fill_diagonal, but that requires an existing matrix and only handles the main diagonal. Am I misunderstanding something?
The first argument of spdiags is a matrix of values to be used as the diagonals. You can use it this way:
Jdata = np.array([3 * [diag], 3 * [lowdiag], 3 * [updiag]])
Diags = [0, -1, 1]
J = spdiags(Jdata, Diags, 3, 3).toarray()
print(J)
# [[ 0.5 -0.25 0. ]
# [-0.25 0.5 -0.25]
# [ 0. -0.25 0.5 ]]
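Alternatively, scipy.sparse.diags broadcasts scalar values along each diagonal when a shape is given, so you don't have to repeat them yourself (assuming a reasonably recent SciPy version):
from scipy.sparse import diags

updiag, diag, lowdiag = -0.25, 0.5, -0.25
# scalars are broadcast to the full length of each diagonal
J = diags([lowdiag, diag, updiag], [-1, 0, 1], shape=(3, 3)).toarray()
print(J)
# [[ 0.5  -0.25  0.  ]
#  [-0.25  0.5  -0.25]
#  [ 0.   -0.25  0.5 ]]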
Suppose I want to sample 10 times from multiple normal distributions with the same covariance matrix (identity) but different means, which are stored as rows of the following matrix:
means = np.array([[1, 5, 2],
                  [6, 2, 7],
                  [1, 8, 2]])
How can I do that in the most efficient way possible (i.e. avoiding loops)?
I tried this:
scipy.stats.multivariate_normal(means, np.eye(2)).rvs(10)
and
np.random.multivariate_normal(means, np.eye(2))
But they throw an error saying mean should be 1D.
Slow Example
import numpy as np
import scipy.stats

np.r_[[scipy.stats.multivariate_normal(means[i, :], np.eye(3)).rvs() for i in range(len(means))]]
Your covariance matrix indicates that the samples are independent. You can just sample them all at once:
num_samples = 10
flat_means = means.ravel()
# build block covariance matrix
cov = np.eye(3)
block_cov = np.kron(np.eye(3), cov)
out = np.random.multivariate_normal(flat_means, cov=block_cov, size=num_samples)
out = out.reshape((-1,) + means.shape)
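As a quick sanity check (my addition, continuing the snippet above): the result has one 3-vector per row of means for each draw, and with a larger sample size the per-distribution sample means approach means.
print(out.shape)  # (10, 3, 3): 10 draws, one sample per row of `means`
# with many more draws the empirical means should be close to `means`
big = np.random.multivariate_normal(flat_means, cov=block_cov, size=100_000)
print(big.reshape((-1,) + means.shape).mean(axis=0))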
I have a numpy matrix, say a, as below
array([[1, 2, 3],
       [1, 2, 2]])
I want to find the cosine similarity matrix of this matrix, where the cosine similarity is between the columns.
Now, the cosine similarity of two vectors is just their dot product divided by the product of their L2 norms.
But I don't want to iterate over each column in a loop to do it.
So I first tried this:
from scipy.spatial import distance
cos=distance.cdist(a.T,a.T,'cosine')
Here I take the transpose, as otherwise it would compute the cosine over the rows (observations); I want it over the columns.
However, I am not sure this is the right answer. The doc of this function says it gives 1 - cosine_similarity. So should I then do:
cos = 1 - distance.cdist(a.T, a.T, 'cosine')
Please advise.
II)
Also, what if I try doing something like this:
cos=(np.dot(a.T,a))/(np.linalg.norm(a, axis=0, keepdims=True))*(np.linalg.norm(a, axis=0, keepdims=True))
It doesn't work; there is some problem getting the right L2 norm for the right column. Any idea how we can implement this without a library function?
Try this:
a = np.array([[1, 2, 3], [1, 2, 2]])
n = np.linalg.norm(a, axis=0).reshape(1, a.shape[1])
a.T.dot(a) / n.T.dot(n)
array([[ 1. , 1. , 0.98058068],
[ 1. , 1. , 0.98058068],
[ 0.98058068, 0.98058068, 1. ]])
This assignment for n would have also worked.
np.linalg.norm(a, axis=0)[None, :]
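As a cross-check (my addition, reusing a and n from above), this agrees with the cdist route from part I of the question, i.e. one minus the cosine distance:
from scipy.spatial import distance
print(np.allclose(a.T.dot(a) / n.T.dot(n), 1 - distance.cdist(a.T, a.T, 'cosine')))  # True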
I searched around a bit and found comparable questions/answers, but none of them returned the correct results for me.
Situation:
I have an array with a number of clumps of values == 1, while the rest of the cells are set to zero. Each cell is a square (width=height).
Now I want to calculate the average distance between all 1 values.
The formula should be like this: d = sqrt(((x2 - x1) * size)**2 + ((y2 - y1) * size)**2)
Example:
import numpy as np
from scipy.spatial.distance import pdist
a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])
# Given that each cell is 10m wide/high
val = 10
d = pdist(a, lambda u, v: np.sqrt( ( ((u-v)*val)**2).sum() ) )
d
array([ 14.14213562, 10. , 10. ])
After that I would calculate the average via d.mean(). However, the result in d is obviously wrong: the distance between the two cells in the top row should already be 20 (two cells crossed * 10). Is there something wrong with my formula, math or approach?
You need the actual coordinates of the non-zero markers, to compute the distance between them:
>>> import numpy as np
>>> from scipy.spatial.distance import squareform, pdist
>>> a = np.array([[1, 0, 1],
...               [0, 0, 0],
...               [0, 0, 1]])
>>> np.where(a)
(array([0, 0, 2]), array([0, 2, 2]))
>>> x,y = np.where(a)
>>> coords = np.vstack((x,y)).T
>>> coords
array([[0, 0], # That's the coordinate of the "1" in the top left,
[0, 2], # top right,
[2, 2]]) # and bottom right.
Next you want to calculate the distance between these points. You use pdist for this, like so:
>>> dists = pdist(coords) * 10 # Uses the Euclidean distance metric by default.
>>> squareform(dists)
array([[ 0. , 20. , 28.28427125],
[ 20. , 0. , 20. ],
[ 28.28427125, 20. , 0. ]])
In this last matrix, you will find (above the diagonal), the distance between each marked point in a and another coordinate. In this case, you had 3 coordinates, so it gives you the distance between node 0 (a[0,0]) and node 1 (a[0,2]), node 0 and node 2 (a[2,2]) and finally between node 1 and node 2. To put it in different words, if S = squareform(dists), then S[i,j] returns the distance between the coordinates on row i of coords and row j.
Just the values in the upper triangle of that last matrix are also present in the variable dists, from which you can derive the mean easily, without having to perform the relatively expensive calculation of the squareform (shown here just for demonstration purposes):
>>> dists
array([ 20. , 28.2842712, 20. ])
>>> dists.mean()
22.761423749153966
Remark that your computed solution "looks" nearly correct (aside from a factor of 2), because of the example you chose. What pdist does is take the Euclidean distance between the first point in n-dimensional space and the second, then between the first and the third, and so on. In your example that means it computes the distance between the point on row 0, whose coordinates in 3-dimensional space are [1,0,1], and the 2nd point, [0,0,0]. The Euclidean distance between those two is sqrt(2) ≈ 1.4. Then the distance between the first and the 3rd coordinate (the last row in a) is only 1. Finally, the distance between the 2nd coordinate (row 1: [0,0,0]) and the 3rd (last row, row 2: [0,0,1]) is also 1. So remember: pdist interprets its first argument as a stack of coordinates in n-dimensional space, n being the number of elements in the tuple of each node.
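For what it's worth, the whole pipeline can be condensed using np.argwhere, which returns the coordinates of the non-zero cells directly (a sketch, assuming the 10 m cell size from the question):
import numpy as np
from scipy.spatial.distance import pdist

a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])
cell_size = 10                      # each cell is 10 m wide/high
coords = np.argwhere(a)             # row/column indices of every 1
mean_dist = pdist(coords * cell_size).mean()
print(mean_dist)                    # ~22.76 for this example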