Vectorizing Computation of Cosine Similarity Matrix

I have a matrix of 63695 row vectors of dimension 384.
I would like to compute the cosine similarity matrix for this matrix.
I was thinking of vectorizing it.
How would one proceed to that objective?

If you look at scikit-learn's source code, you will see that X and Y are first normalized and then X_norm @ Y_norm.T (a matrix product) is returned. If, as in your case, there is no Y, the result is X_norm @ X_norm.T.
The cost of normalizing and transposing is negligible by comparison, but multiplying the (63695 x 384) matrix by its (384 x 63695) transpose takes about 63695*63695 (elements in the result matrix) times 384 (multiplications and additions to calculate one element) operations, so something like 63695*63695*384 = 1,557,908,361,600, or roughly 1.6 * 10^12, multiplications. (Strictly, that number of multiplications plus nearly the same number of additions.)
And as you already mentioned, it requires 4 (bytes per float32) * 63695 * 63695 = ~16.2 GB of memory to hold the result matrix.
Do you really need that enormous matrix? What type of data are you handling and what are you trying to do? If we are talking about e.g. vector representations of text data, then you should look at removing duplicates, processing the data in chunks, or reducing the dimensionality before analysing similarity. If you want to rank these cosine similarities and find the k most similar ones, you'd be much better off using algorithms for finding similar data points instead of doing it all by hand yourself.
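As a rough illustration of the chunked approach, here is a minimal NumPy sketch that normalizes the rows once and then computes the similarities block by block, keeping only the k most similar rows for each vector instead of the full 63695 x 63695 matrix. The function name top_k_cosine and the parameters k and chunk_size are made up for this example, and it assumes k is smaller than the number of rows.

```python
import numpy as np

def top_k_cosine(X, k=5, chunk_size=1024):
    """Return indices and cosine similarities of the k nearest rows for every row of X."""
    X = np.asarray(X, dtype=np.float32)
    # Normalize each row once; the small floor avoids division by zero for all-zero rows.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    n = Xn.shape[0]
    idx = np.empty((n, k), dtype=np.int64)
    sims = np.empty((n, k), dtype=np.float32)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only a (chunk, n) slice of the similarity matrix lives in memory at a time.
        block = Xn[start:stop] @ Xn.T
        # Exclude each row's similarity with itself.
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Unordered top k per row, then sort just those k entries.
        part = np.argpartition(-block, k - 1, axis=1)[:, :k]
        part_sims = np.take_along_axis(block, part, axis=1)
        order = np.argsort(-part_sims, axis=1)
        idx[start:stop] = np.take_along_axis(part, order, axis=1)
        sims[start:stop] = np.take_along_axis(part_sims, order, axis=1)
    return idx, sims
```

With chunk_size=1024 and float32, each block of 1024 x 63695 values takes about 260 MB, so the chunk size is the knob to trade memory against the number of matrix products.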

Related

Vectorization of numpy matrix that contains pdf of multiple gaussians and multiple samples

My problem is this: I have a GMM model with K multivariate Gaussians, and I also have N samples.
I want to create an N*K numpy matrix whose [i,k] cell holds the pdf of the k'th Gaussian evaluated at the i'th sample.
In short, I'm interested in the following matrix:
[image: pdf matrix]
This is what I have now (I'm working with Python):
Q = np.array([scipy.stats.multivariate_normal(mu_t[k], cov_t[k]).pdf(X) for k in range(self.K)]).T
X in the code is a matrix whose rows are my samples.
It works fine on a small, low-dimensional toy dataset, but the dataset I'm working with is 10,000 28*28 pictures, and on it this line runs extremely slowly...
I want to find a solution that doesn't involve loops but only vector/matrix operations (i.e. vectorization). The scipy 'multivariate_normal' function cannot take the parameters of more than one Gaussian, as far as I understand it (but its 'pdf' function can evaluate multiple samples at once).
Does someone have an idea?

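I am afraid that the main speed killer in your problem is the inversion and determinant calculation for the cov_t matrices. If you manage to precalculate these, you can unroll the calculation: use broadcasting to get all combinations of x_i - mu_k, and then use np.einsum to evaluate the full formula of the normal density for every sample/Gaussian pair at once.
Try
import numpy as np
D = X.shape[1]                                         # dimension of each sample
S = X[:, None, :] - np.asarray(mu_t)[None, :, :]       # S[i, k] = x_i - mu_k, shape (N, K, D)
cov_t_inv = np.linalg.inv(cov_t)                       # inverse covariances, shape (K, D, D); precompute once
cov_t_inv_det = np.linalg.det(cov_t_inv)               # determinants of the inverses, shape (K,)
maha = np.einsum('ikr,krs,iks->ik', S, cov_t_inv, S)   # squared Mahalanobis distances, shape (N, K)
Q = (2*np.pi)**(-D/2) * np.sqrt(cov_t_inv_det) * np.exp(-0.5*maha)
Here cov_t_inv holds the inverse covariance matrices and cov_t_inv_det their determinants; any precalculated arrays of those shapes can be inserted instead. Note that for high-dimensional data such as 28*28 images these densities underflow in double precision, so in practice you will want to work with log-densities.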
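For high-dimensional samples the same idea is usually more robust in the log domain. The following is only a sketch of that variant, not part of the original answer: it swaps the explicit inverse and determinant for a Cholesky factorization, and the function name gmm_log_pdf is hypothetical. It assumes mu_t has shape (K, D), cov_t has shape (K, D, D), and every covariance matrix is positive definite.

```python
import numpy as np

def gmm_log_pdf(X, mu_t, cov_t):
    """Log-density of every sample under every Gaussian; returns an (N, K) array."""
    X = np.asarray(X, dtype=float)        # (N, D) samples
    mu = np.asarray(mu_t, dtype=float)    # (K, D) means
    cov = np.asarray(cov_t, dtype=float)  # (K, D, D) covariances
    D = X.shape[1]
    # Batched Cholesky factors: cov[k] = L[k] @ L[k].T
    L = np.linalg.cholesky(cov)
    # diff[k, i] = x_i - mu_k, shape (K, N, D)
    diff = X[None, :, :] - mu[:, None, :]
    # Solve L[k] z = diff[k].T for all k at once; z has shape (K, D, N)
    z = np.linalg.solve(L, diff.transpose(0, 2, 1))
    # Squared Mahalanobis distances, shape (K, N)
    maha = np.sum(z ** 2, axis=1)
    # log det(cov[k]) = 2 * sum(log(diag(L[k])))
    log_det = 2.0 * np.sum(np.log(np.diagonal(L, axis1=1, axis2=2)), axis=1)
    log_pdf = -0.5 * (D * np.log(2.0 * np.pi) + log_det[:, None] + maha)
    return log_pdf.T
```

The intermediate diff array has N*K*D entries, so for many Gaussians on 784-dimensional data it may be necessary to process the samples in batches.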

How do I organize an uneven matrix of many calculations in pandas / numpy / pandas

I am calculating a model that requires a large number of calculations in a big matrix, representing interactions between households (numbering N, roughly 10E4) and firms (numbering M roughly 10E4). In particular, I want to perform the following steps:
1. X2 is an N x M matrix representing the pairwise distance between each household and each firm. The step is to multiply every entry by a parameter gamma.
2. delta is a vector of length M. The step is to broadcast multiply delta into the rows of the matrix from 1.
3. Exponentiate the matrix from 2.
4. Calculate the row sums of the matrix from 3.
5. Broadcast division by the row sum vector from 4 into the rows of the matrix from 3.
6. w is a vector of length N. The step is to broadcast multiply w into the columns of the matrix from 5.
7. The final step is to take column sums of the matrix from 6.
These steps have to be performed 1000s of times in the context of matching the model simulation to data. At the moment, I have an implementation using a big NxM numpy array and using matrix algebra operations to perform the steps as described above.
I would like to be able to reduce the number of calculations by eliminating all the "cells" where the distance is greater than some critical value r.
How can I organize my data to do this, while performing all the operations I need to do (exponentiation, row/column sums, broadcasting across rows and columns)?
The solution I have in mind is something like storing the distance matrix in "long form", with each row representing a household / firm pair, rather than the N x M matrix, deleting all the invalid rows to get an array whose length is something less than NM, and then performing all the calculations in this format. In this solution I am wondering if I can use pandas dataframes to make the "broadcasts" and "row sums" work properly (and quickly). How can I make that work?
(Or alternately, if there is a better way I should be exploring, I'd love to know!)
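One possible way to do the "long form" idea without pandas is to keep only the (household, firm) index pairs within distance r and replace the row and column sums with np.bincount. This is a sketch under the assumption that excluded cells should simply drop out of both sums; the function name long_form_column_sums is made up for the example.

```python
import numpy as np

def long_form_column_sums(X2, delta, w, gamma, r):
    """Steps 1-7 of the question, restricted to household/firm pairs with distance <= r."""
    X2 = np.asarray(X2, dtype=float)
    delta = np.asarray(delta, dtype=float)
    w = np.asarray(w, dtype=float)
    n_households, n_firms = X2.shape
    # Long form: one entry per valid (household, firm) pair.
    rows, cols = np.nonzero(X2 <= r)
    # Steps 1-3: scale by gamma, multiply by delta of the firm, exponentiate.
    vals = np.exp(gamma * X2[rows, cols] * delta[cols])
    # Step 4: row sums over the valid pairs only.
    row_sums = np.bincount(rows, weights=vals, minlength=n_households)
    # Step 5: divide every entry by the sum of its own row.
    shares = vals / row_sums[rows]
    # Steps 6-7: weight by w of the household and sum per firm.
    return np.bincount(cols, weights=w[rows] * shares, minlength=n_firms)
```

Since the distances do not change between iterations, rows, cols and X2[rows, cols] could be computed once outside the fitting loop, so that each iteration only touches the retained pairs.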

Clustering a sparse co-occurrence matrix

I have two N x N co-occurrence matrices (484x484 and 1060x1060) that I have to analyze. The matrices are symmetrical along the diagonal and contain lots of zero values. The non-zero values are integers.
I want to group together the positions that are non-zero. In other words, what I want to do is the algorithm on this link. When order by cluster is selected, the matrix gets re-arranged in rows and columns to group the non-zero values together.
Since I am using Python for this task, I looked into SciPy Sparse Linear Algebra library, but couldn't find what I am looking for.
Any help is much appreciated. Thanks in advance.

If you have a matrix dist with pairwise distances between objects, then you can find the order in which to rearrange the matrix by applying a clustering algorithm to this matrix (http://scikit-learn.org/stable/modules/clustering.html). For example, it might be something like:
from sklearn import cluster
import numpy as np
# scikit-learn >= 1.2 calls this parameter metric (older releases: affinity); the default ward linkage cannot use precomputed distances
model = cluster.AgglomerativeClustering(n_clusters=20, metric="precomputed", linkage="average").fit(dist)
new_order = np.argsort(model.labels_)
ordered_dist = dist[new_order] # can be your original matrix instead of dist[]
ordered_dist = ordered_dist[:,new_order]
The order is given by the variable model.labels_, which holds the number of the cluster to which each sample belongs. A few observations:
You have to find a clustering algorithm that accepts a distance matrix as input. AgglomerativeClustering is such an algorithm (notice the metric="precomputed" option, called affinity="precomputed" in older scikit-learn releases, which tells it that we are using pre-computed distances).
What you have seems to be a pairwise similarity matrix, in which case you need to transform it to a distance matrix (e.g. dist=1 - data/data.max()).
In the example I assumed 20 clusters; you may have to play with this variable a bit. Alternatively, you might try to find the best one-dimensional representation of your data (using e.g. MDS) to describe the optimal ordering of samples.

Because your data is sparse, treat it as a graph, not a matrix.
Then try the various graph clustering methods. For example cliques are interesting on such data.
Note that not everything may cluster.
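As one simple instance of the graph view, the sketch below treats every non-zero entry as an edge and groups positions by connected component using scipy.sparse.csgraph. This is a much cruder grouping than clique finding or other graph clustering methods, and the function name reorder_by_component is only illustrative.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components

def reorder_by_component(M):
    """Group rows/columns of a symmetric co-occurrence matrix by connected component."""
    M = np.asarray(M)
    # Non-zero entries become edges of an undirected graph.
    graph = sparse.csr_matrix((M != 0).astype(np.int8))
    n_components, labels = connected_components(graph, directed=False)
    # Sorting by component label puts members of the same component next to each other.
    order = np.argsort(labels, kind="stable")
    reordered = M[np.ix_(order, order)]
    return n_components, labels, order, reordered
```

If the whole matrix turns out to be a single component, this reordering changes nothing, which is a hint that a real clustering method is needed.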

Given a sparse matrix with shape (num_samples, num_features), how do I estimate the co-occurrence matrix?

The sparse matrix has only 0 and 1 at each entry (i,j) (1 means that sample i has feature j). How can I estimate the co-occurrence matrix of the features given this sparse representation of data points? In particular, I want to find pairs of features that co-occur in at least 50 samples. I realize it might be hard to produce the exact result; is there any approximate algorithm in data mining that allows me to do that?

This can be solved reasonably easily if you go to a transposed matrix.
For any two features (now rows, originally columns) you compute the size of the intersection of their sample sets. If it is at least 50, you have a frequent co-occurrence.
If you use an appropriate sparse encoding (now of rows, but originally of columns, so you probably need not only to transpose the matrix but also to re-encode it), this operation takes O(n+m) per pair, where n and m are the numbers of nonzero values in the two rows.
If you have an extremely high number of features this may take a while. But 100000 should be feasible.
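An equivalent way to get all the intersection sizes at once is the sparse product X.T @ X: for a 0/1 matrix, entry (j, l) of that product is exactly the number of samples that have both feature j and feature l. The sketch below uses that product rather than explicit pairwise intersections; the function name frequent_pairs and the parameter min_count are assumptions made for the example.

```python
import numpy as np
from scipy import sparse

def frequent_pairs(X, min_count=50):
    """Return (feature_a, feature_b, count) for pairs co-occurring in at least min_count samples."""
    # Integer dtype so that the counts are exact.
    X = sparse.csr_matrix(X, dtype=np.int64)
    # For a binary matrix, (X.T @ X)[j, l] counts samples containing both j and l.
    C = X.T @ X
    # Keep each unordered pair once and drop the diagonal (a feature with itself).
    upper = sparse.triu(C, k=1).tocoo()
    keep = upper.data >= min_count
    return list(zip(upper.row[keep].tolist(),
                    upper.col[keep].tolist(),
                    upper.data[keep].tolist()))
```

The product only stores pairs that co-occur at least once, but with very many dense features it can still become large, in which case processing blocks of columns at a time would be the natural next step.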

pseudo inverse of sparse matrix in python

I am working with data from neuroimaging and because of the large amount of data, I would like to use sparse matrices for my code (scipy.sparse.lil_matrix or csr_matrix).
In particular, I will need to compute the pseudo-inverse of my matrix to solve a least-square problem.
I have found the method sparse.lsqr, but it is not very efficient. Is there a method to compute the Moore-Penrose pseudo-inverse (corresponding to pinv for normal matrices)?
The size of my matrix A is about 600'000x2000, and every row of the matrix will have from 0 up to 4 non-zero values. The size of A is given by voxel x fiber bundle (white matter fiber tracts), and we expect at most 4 tracts to cross in a voxel. In most of the white matter voxels we expect to have at least 1 tract, but I would say that around 20% of the rows could be all zeros.
The vector b should not be sparse, actually b contains the measure for each voxel, which is in general not zero.
I would need to minimize the error, but there are also some conditions on the vector x. As I tried the model on smaller matrices, I never needed to constrain the system in order to satisfy these conditions (in general 0
Is that of any help? Is there a way to avoid taking the pseudo-inverse of A?
Thanks
Update 1st June:
thanks again for the help.
I can't really show you anything about my data, because the Python code gives me some problems. However, in order to understand how I could choose a good k, I've tried to create a test function in Matlab.
The code is as follows:
F=zeros(100000,1000);
for k=1:150000
p=rand(1);
a=0;
b=0;
while a<=0 || b<=0
a=random('Binomial',100000,p);
b=random('Binomial',1000,p);
end
F(a,b)=rand(1);
end
solution=repmat([0.5,0.5,0.8,0.7,0.9,0.4,0.7,0.7,0.9,0.6],1,100);
size(solution)
solution=solution';
measure=F*solution;
%check=pinvF*measure;
k=250;
F=sparse(F);
[U,S,V]=svds(F,k);
s=svds(F,k);
plot(s)
max(max(U*S*V'-F))
for s=1:k
if S(s,s)~=0
S(s,s)=1/S(s,s);
end
end
inv=V*S'*U';
inv*measure
max(inv*measure-solution)
Do you have any idea of what k should be compared to the size of F? I've taken 250 (out of 1000) and the results are not satisfactory (the waiting time is acceptable, but not short).
Also, here I can compare the results with the known solution, but how could one choose k in general?
I also attached the plot of the 250 singular values that I get and their normalized squares. I don't know exactly how best to do a scree plot in Matlab. I'm now proceeding with bigger k to see if the values suddenly become much smaller.
Thanks again,
Jennifer

You could study more on the alternatives offered in scipy.sparse.linalg.
Anyway, please note that a pseudo-inverse of a sparse matrix is most likely to be a (very) dense one, so it's not really a fruitful avenue (in general) to follow, when solving sparse linear systems.
You may want to describe your particular problem (dot(A, x)= b+ e) in slightly more detail. At least specify:
'typical' size of A
'typical' percentage of nonzero entries in A
least-squares implies that norm(e) is minimized, but please indicate whether your main interest is on x_hat or on b_hat, where e= b- b_hat and b_hat= dot(A, x_hat)
Update: If you have some idea of the rank of A (and it's much smaller than the number of columns), you could try the total least squares method. Here is a simple implementation, where k is the number of leading singular values and vectors to use (i.e. the 'effective' rank).
from scipy.sparse import hstack
from scipy.sparse.linalg import svds
def tls(A, b, k= 6):
    """A tls solution of Ax= b, for sparse A."""
    # b must be a sparse (or 2-D) column of shape (m, 1) so that it can be stacked next to A
    u, s, v= svds(hstack([A, b]), k)
    return v[-1, :-1]/ -v[-1, -1]
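Among the alternatives in scipy.sparse.linalg mentioned above, lsmr is an iterative least-squares solver that only needs products with A and its transpose, so the dense pseudo-inverse is never formed. The wrapper below is a minimal sketch; its name, the tolerances and the iteration limit are arbitrary choices for illustration, not tuned for the 600'000x2000 problem.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsmr

def solve_sparse_least_squares(A, b, damp=0.0):
    """Minimize ||A x - b|| (plus optional damping) for sparse A; returns x_hat and b_hat."""
    # lsmr returns a tuple whose first element is the solution vector.
    result = lsmr(A, b, damp=damp, atol=1e-10, btol=1e-10, maxiter=1000)
    x_hat = result[0]
    b_hat = A @ x_hat
    return x_hat, b_hat
```

A non-zero damp adds Tikhonov-style regularization, which may help when some columns of A are nearly dependent.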

Regardless of the answer to my comment, I would think you could accomplish this fairly easily using the Moore-Penrose SVD representation. Find the SVD with scipy.sparse.linalg.svds, replace Sigma by its pseudoinverse, and then multiply V*Sigma_pi*U' to find the pseudoinverse of your original matrix.
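In Python, the SVD route could look roughly like the sketch below. Instead of building the full (dense) pseudo-inverse V*Sigma_pi*U', it applies the three factors to b directly, which gives the same x for a single right-hand side. The function name truncated_pinv_solve and the rcond cutoff are assumptions for the example; note that svds requires k to be smaller than both dimensions of A, so the result is a rank-k approximation of the pseudo-inverse solution.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

def truncated_pinv_solve(A, b, k, rcond=1e-10):
    """Apply a rank-k truncated pseudo-inverse of sparse A to the vector b."""
    # Partial SVD: A is approximated by U @ diag(s) @ Vt using k singular triplets.
    U, s, Vt = svds(A, k=k)
    # Treat tiny singular values as zero, as a pseudo-inverse does.
    keep = s > rcond * s.max()
    # x = V @ diag(1/s) @ U.T @ b, evaluated right to left without forming the pseudo-inverse.
    coeffs = (U[:, keep].T @ b) / s[keep]
    return Vt[keep].T @ coeffs
```

When A has (numerical) rank at most k, this coincides with the full pseudo-inverse solution; otherwise the discarded singular values determine how much is lost.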
