Calculating cosine similarity of columns of a python matrix - python

I have a numpy array, say a, as below:
array([[1, 2, 3],
       [1, 2, 2]])
I want to find the cosine similarity matrix of this array, where the cosine similarity is taken between the columns.
Now the cosine similarity of two vectors is just their dot product divided by the product of their L2 norms.
But I don't want to iterate over each column in a loop to do it.
So I first tried this:
from scipy.spatial import distance
cos=distance.cdist(a.T,a.T,'cosine')
Here I am taking the transpose because otherwise it would compute the cosine between rows (observations), and I want it between columns.
However, I am not sure this is the right answer. The doc of this function says it gives 1 - cosine_similarity. So should I then do
cos = 1 - distance.cdist(a.T, a.T, 'cosine')
Please advise.
II)
Also what If I try doing something like this:
cos=(np.dot(a.T,a))/(np.linalg.norm(a, axis=0, keepdims=True))*(np.linalg.norm(a, axis=0, keepdims=True))
It doesn't work because of some problem in getting the right L2 norm for the right column. Any idea how we can implement this without a library function?

Try this:
a = np.array([[1, 2, 3], [1, 2, 2]])
n = np.linalg.norm(a, axis=0).reshape(1, a.shape[1])
a.T.dot(a) / n.T.dot(n)
array([[ 1.        ,  1.        ,  0.98058068],
       [ 1.        ,  1.        ,  0.98058068],
       [ 0.98058068,  0.98058068,  1.        ]])
This assignment for n would have also worked.
np.linalg.norm(a, axis=0)[None, :]
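As a quick sanity check (a minimal sketch, assuming the same array a as above), the normalized dot product should agree with the scipy route from the question:
import numpy as np
from scipy.spatial import distance

a = np.array([[1, 2, 3], [1, 2, 2]])
n = np.linalg.norm(a, axis=0, keepdims=True)        # shape (1, 3): one norm per column
cos_manual = a.T.dot(a) / n.T.dot(n)                # outer product of norms in the denominator
cos_scipy = 1 - distance.cdist(a.T, a.T, 'cosine')  # cdist returns the cosine *distance*
print(np.allclose(cos_manual, cos_scipy))           # True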

Related

NumPy array with largest value on diagonal and other values shuffled

I am trying to create a square NumPy (or PyTorch, since PyTorch code can be turned into NumPy with minimal effort) matrix which has the following property: given a set of values, the diagonal elements in each row have the largest value and the other values are randomly shuffled for the other positions.
For example, if I have [1, 2, 3, 4], a possible desired output is:
[[4, 3, 1, 2],
[1, 4, 3, 2],
[2, 1, 4, 3],
[2, 3, 1, 4]]
There can be (several) other possible outputs, as long as the diagonal elements are the largest value (4 in this case) and the off-diagonal elements in each row contain the other values but shuffled.
A hacky/inefficient way of doing this could be first creating a square matrix (4x4) of zeros and putting the largest value (4) in all the diagonal positions, and then traversing the matrix row by row, where for each row i, populate the elements except index i with shuffled remaining values (shuffled versions of [1, 2, 3]). This would be very slow as the matrix size increases. Is there a cleaner/faster/Pythonic way of doing it? Thank you.
First you can generate a randomized array with np.random.shuffle(); then I've used a (not so easy to understand) mathematical trick to shift each row:
import numpy as np
from numpy.fft import fft, ifft

# First create your randomized array, e.g. with np.random.shuffle()
x = np.array([[1, 2, 3, 4],
              [2, 4, 3, 1],
              [4, 1, 2, 3],
              [2, 3, 1, 4]])

# We use np.where to determine in which column each 4 sits.
_, s = np.where(x == 4)

# We compute the left shift that needs to be applied to each row
# in order to get each 4 onto the diagonal.
s = s - np.r_[0:x.shape[0]]

# And here is the trick: we can use the fast Fourier transform
# to left-shift each row by a given amount.
L = np.real(ifft(fft(x, axis=1) * np.exp(2 * 1j * np.pi / x.shape[1] * s[:, None] * np.r_[0:x.shape[1]][None, :]), axis=1).round())

# Note that we could also use a right shift; we simply negate the exponent:
# np.exp(-2*1j*np.pi...
And we obtain the following matrix:
[[4. 1. 2. 3.]
[2. 4. 1. 3.]
[2. 3. 4. 1.]
[3. 2. 1. 4.]]
No hidden for loop, only pure linear algebra stuff.
To give you an idea, it takes only a few milliseconds for a 1000x1000 matrix on my computer and ~20 s for a 10000x10000 matrix.
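For comparison, the same per-row left shift can also be done with fancy indexing (a minimal sketch, assuming the x and s defined above); it reproduces L, just with an integer dtype:
# Left-shift row i by s[i] positions: shifted[i, j] = x[i, (j + s[i]) % ncols]
rows = np.arange(x.shape[0])[:, None]
cols = (np.arange(x.shape[1])[None, :] + s[:, None]) % x.shape[1]
shifted = x[rows, cols]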

More efficient way to merge columns in pandas

My code calculates the Euclidean distance between all points in a set of samples I have. What I want to know is: in general, is this the most efficient way to perform some operation between all elements in a set and then plot them, for instance to make a correlation matrix?
The index of samples is used to initialize the dataframe and provide labels. The 3D coordinates are then provided as tuples in three_D_coordinate_tuple_list, but this could easily be any measurement, and the variable distance could be any operation. I'm curious about finding a more efficient solution than making each column and then merging them again using pandas or numpy. Am I clogging up any memory with my solution? How can I make this cleaner?
def euclidean_distance_matrix_maker(three_D_coordinate_tuple_list, index_of_samples):
    # three_D_coordinate_tuple_list: list of (x, y, z) tuples
    # index_of_samples: well_id or index as a Series or list
    n = len(three_D_coordinate_tuple_list)
    distance_matrix_df = pd.DataFrame(index_of_samples)
    for i in range(0, n):
        column = []
        # iterates through all elements and calculates the distance vs this element
        for j in range(0, n):
            distance = euclidean_dist_threeD_for_tuples(three_D_coordinate_tuple_list[i],
                                                        three_D_coordinate_tuple_list[j])
            column.append(distance)
        # the euclidean distances form a column, which is appended
        # (concat, column-wise) to the output matrix
        new_column = pd.DataFrame(column)
        distance_matrix_df = pd.concat([distance_matrix_df, new_column], axis=1)
    distance_matrix_df = distance_matrix_df.set_index(distance_matrix_df.iloc[:, 0])
    distance_matrix_df = distance_matrix_df.iloc[:, 1:]
    distance_matrix_df.columns = distance_matrix_df.index
Setup
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
scipy.spatial.distance_matrix
from scipy.spatial import distance_matrix
distance_matrix(x, x)
array([[ 0. , 5.19615242, 10.39230485],
[ 5.19615242, 0. , 5.19615242],
[10.39230485, 5.19615242, 0. ]])
Numpy
from scipy.spatial.distance import squareform
i, j = np.triu_indices(len(x), 1)
((x[i] - x[j]) ** 2).sum(-1) ** .5
array([ 5.19615242, 10.39230485, 5.19615242])
Which we can make into a square form with squareform
squareform(((x[i] - x[j]) ** 2).sum(-1) ** .5)
array([[ 0. , 5.19615242, 10.39230485],
[ 5.19615242, 0. , 5.19615242],
[10.39230485, 5.19615242, 0. ]])
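To get back to the pandas part of the question, the full distance matrix can be computed in one shot and labelled afterwards (a small sketch; labels is a hypothetical stand-in for index_of_samples):
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

labels = ['s1', 's2', 's3']                      # hypothetical sample labels
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# compute all pairwise distances at once, then attach the labels in one step
df = pd.DataFrame(distance_matrix(x, x), index=labels, columns=labels)
print(df)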

Adjusted Cosine Similarity in Python

Referring to this link
which calculates adjusted cosine similarity matrix (given the ratings matrix M having m users and n items) as below:
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
I cannot see how the 'both rated' condition is met as per this definition
I have manually calculated the adjusted cosine similarities and they seem to differ from the values I get from the above code.
Could anyone please clarify this?
Let's first try to understand the formulation: the matrix is stored such that each row is a user and each column is an item. Users are indexed by u and items by i.
Each user has a different judgement of how good or bad something is; a 1 from one user could be a 3 from another. That is why we subtract the average rating of each user, R_u, from each R_{u,i}. This is computed as item_mean_subtracted in your code. Notice that we subtract each element's row mean to remove the user's bias. After that, we normalize each column (item) by dividing it by its norm and then compute the cosine similarity between each pair of columns.
pdist(item_mean_subtracted.T, 'cosine') computes the cosine distance between the items, and it is known that
cosine similarity = 1 - cosine distance
which is why the code works.
Now, what if I compute it directly according to the definition? I have commented what is performed in each step; copy and paste the code and you can compare with your calculation by printing out more intermediate steps.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import norm
M = np.asarray([[2, 3, 4, 1, 0],
                [0, 0, 0, 0, 5],
                [5, 4, 3, 0, 0],
                [1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
print(similarity_matrix)
#Computing the cosine similarity directly
n = len(M[0]) # find out number of columns(items)
normalized = item_mean_subtracted/norm(item_mean_subtracted, axis = 0).reshape(1,n) #divide each column by its norm, normalize it
normalized = normalized.T #transpose it
similarity_matrix2 = np.asarray([[np.inner(normalized[i],normalized[j] ) for i in range(n)] for j in range(n)]) # compute the similarity matrix by taking inner product of any two items
print(similarity_matrix2)
Both of the codes give the same result:
[[ 1. 0.86743396 0.39694169 -0.67525773 -0.72426278]
[ 0.86743396 1. 0.80099604 -0.64553225 -0.90790362]
[ 0.39694169 0.80099604 1. -0.37833504 -0.80337196]
[-0.67525773 -0.64553225 -0.37833504 1. 0.26594024]
[-0.72426278 -0.90790362 -0.80337196 0.26594024 1. ]]
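As a further check against a hand calculation (a minimal sketch, assuming M, item_mean_subtracted and norm from the code above), a single entry can be reproduced directly from the definition:
# Adjusted cosine of items 0 and 1: cosine of their mean-centred columns
u = item_mean_subtracted[:, 0]
v = item_mean_subtracted[:, 1]
print(u.dot(v) / (norm(u) * norm(v)))   # ~0.86743396, matches similarity_matrix[0, 1]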

Given a MxN grid in the x,y plane, compute f(x,y) and store it into matrix (python)

I'm looking for something similar to ARRAYFUN in MATLAB, but for Python. What I need to do is to compute a matrix whose components are exp(j*dot([kx,ky], [x,y])), where [kx,ky] is a fixed known vector, and [x,y] is an element from a meshgrid.
What I was trying to do is to define
RX, RY = np.meshgrid(np.arange(N), np.arange(M))
R = np.dstack((RX,RY))
and then iterate over the R indices, filling a matrix with the same shape as R, in which each component would be exp(j*dot([kx,ky], [x,y])), with [x,y] taken from R. This doesn't look efficient or elegant.
Thanks for your help.
You could do what we used to do in MATLAB before they added ARRAYFUN - change the calculation so it works with arrays. That could be tricky in the days when everything in MATLAB was 2d; allowing more dimensions made it easier. numpy allows more than 2 dimensions.
Anyway, here's a quick attempt:
In [497]: rx,ry=np.meshgrid(np.arange(3),np.arange(4))
In [498]: R=np.dstack((rx,ry))
In [499]: R.shape
Out[499]: (4, 3, 2)
In [500]: kx,ky=1,2
In [501]: np.einsum('i,jki->jk',[kx,ky],R)
Out[501]:
array([[0, 1, 2],
[2, 3, 4],
[4, 5, 6],
[6, 7, 8]])
There are other versions of dot, matmul and tensordot, but einsum is the one I like to use. I've worked with it enough to quickly set up a multidimensional dot.
Now just apply the 1j and exp to each element:
In [502]: np.exp(np.einsum('i,jki->jk',[kx,ky],R)*1j)
Out[502]:
array([[ 1.00000000+0.j , 0.54030231+0.84147098j,
-0.41614684+0.90929743j],
[-0.41614684+0.90929743j, -0.98999250+0.14112001j,
-0.65364362-0.7568025j ],
[-0.65364362-0.7568025j , 0.28366219-0.95892427j,
0.96017029-0.2794155j ],
[ 0.96017029-0.2794155j , 0.75390225+0.6569866j ,
-0.14550003+0.98935825j]])
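Since [kx, ky]·[x, y] is just kx*x + ky*y, the same result can be written with plain broadcasting and no einsum (a small sketch under the same setup):
import numpy as np

M, N = 4, 3
kx, ky = 1, 2
RX, RY = np.meshgrid(np.arange(N), np.arange(M))

# the dot product [kx, ky]·[x, y] written out element-wise; broadcasting covers the grid
phase = kx * RX + ky * RY
result = np.exp(1j * phase)   # same values as the einsum version above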

Find the index of the min value in a pdist condensed distance matrix

I have used scipy.spatial.distance.pdist(X) to calculate the Euclidean distance between each pair of elements of the list X below:
X = [[0, 3, 4, 2], [23, 5, 32, 1], [3, 4, 2, 1], [33, 54, 5, 12]]
This returns a condensed distance matrix:
array([ 36.30426972, 3.87298335, 61.57109712, 36.06937759,
57.88782255, 59.41380311])
For each element X, I need to find the index of the closest other element.
Converting the condensed distance matrix to square form helps visualize the results, but I can't figure out how to programmatically identify, for each element in X, the index of the closest other element.
array([[ 0. , 36.30426972, 3.87298335, 61.57109712],
[ 36.30426972, 0. , 36.06937759, 57.88782255],
[ 3.87298335, 36.06937759, 0. , 59.41380311],
[ 61.57109712, 57.88782255, 59.41380311, 0. ]])
I believe argmin() is the function to use, but I'm lost from here. Thanks for any help in advance.
We'll operate on the square form of the results. First, to exclude "New York is closest to New York" answers,
numpy.fill_diagonal(distances, numpy.inf)
Then, it's a simple argmin along an axis:
closest_points = distances.argmin(axis=0)
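Putting it together (a minimal sketch using the X from the question):
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = [[0, 3, 4, 2], [23, 5, 32, 1], [3, 4, 2, 1], [33, 54, 5, 12]]

distances = squareform(pdist(X))       # expand the condensed form to a square matrix
np.fill_diagonal(distances, np.inf)    # exclude each element's zero distance to itself
closest_points = distances.argmin(axis=0)
print(closest_points)                  # [2 2 0 1]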
