Clustering a sparse co-occurrence matrix - python

I have two N x N co-occurrence matrices (484x484 and 1060x1060) that I have to analyze. The matrices are symmetrical along the diagonal and contain lots of zero values. The non-zero values are integers.
I want to group together the positions that are non-zero. In other words, what I want to do is the algorithm on this link. When order by cluster is selected, the matrix gets re-arranged in rows and columns to group the non-zero values together.
Since I am using Python for this task, I looked into SciPy Sparse Linear Algebra library, but couldn't find what I am looking for.
Any help is much appreciated. Thanks in advance.

If you have a matrix dist with pairwise distances between objects, then you can find the order on which to rearrange the matrix by applying a clustering algorithm on this matrix (http://scikit-learn.org/stable/modules/clustering.html). For example it might be something like:
from sklearn import cluster
import numpy as np
model = cluster.AgglomerativeClustering(n_clusters=20,affinity="precomputed").fit(dist)
new_order = np.argsort(model.labels_)
ordered_dist = dist[new_order] # can be your original matrix instead of dist[]
ordered_dist = ordered_dist[:,new_order]
The order is given by the variable model.labels_, which has the number of the cluster to which each sample belongs. A few observations:
You have to find a clustering algorithm that accepts a distance matrix as input. AgglomerativeClustering is such an algorithm (notice the affinity="precomputed" option to tell it that we are using pre-computed distances).
What you have seems to be a pairwise similarity matrix, in which case you need to transform it to a distance matrix (e.g. dist=1 - data/data.max())
In the example I assumed 20 clusters, you may have to play with this variable a bit. Alternatively, you might try to find the best one-dimensional representation of your data (using e.g. MDS) to describe the optimal ordering of samples.

Because your data is sparse, treat it as a graph, not a matrix.
Then try the various graph clustering methods. For example cliques are interesting on such data.
Note that not everything may cluster.

Related

Vectorizing Computation of Cosine Similarity Matrix

I have a matrix of 63695 row vectors of dim 384.
I would like to compute the cosine similarity model for this matrix.
I was thinking of vectorizing it.
How would one proceed to that objective?
If you look in scikit-learns source code you will see that X and Y are first normalized and then X_norm # Y_norm.T (dot product) is returned. Or if as in your case no Y exists it is X_norm # X_norm.T.
Normalizing and transposing can be discarded when looking at the runtime, but the matrix multiplaction of a (63695 x 384) matrix should take somewhere in the neighbourhood of 63695*63695 (elements in result matrix) times 384*384 (element-wise multiplactions and additions to calculate one element) calculations, so something like 63695*63695*384*384 = 598,236,810,854,400 operations. (Or strictly, that number of multiplications plus that same number of additions.)
And as you already mentioned it requires 4 (Bytes for float32) * 63695 * 63695 = ~16.2 GB of memory to handle that result matrix.
Do you really need that enormous matrix? What type of data are you handling and what are you trying to do? If we are talking about e.g. vector represenations of text data then you should look at removing duplicates, processing it in chunks or reducing the dimensionality before analysing similarity. If you are looking for something like ranking these cosine similarities and finding then k most similar ones you'd be much better of using algorithms for finding similar data points instead of doing it all by hand yourself.

Vectorization of numpy matrix that contains pdf of multiple gaussians and multiple samples

My problem is this: I have GMM model with K multi-variate gaussians, and also I have N samples.
I want to create a N*K numpy matrix, which in it's [i,k] cell there is the pdf function of the k'th gaussian on the i'th sample, i.e. in this cell there is
In short, I'm intrested in the following matrix:
pdf matrix
This what I have now (I'm working with python):
Q = np.array([scipy.stats.multivariate_normal(mu_t[k], cov_t[k]).pdf(X) for k in range(self.K)]).T
X in the code is a matrix whose lines are my samples.
It's works fine on small toy dataset from small dimension, but the dataset I'm working with is 10,000 28*28 pictures, and on it this line run extremely slowly...
I want to find a solution that doesn't envolve loops but only vector\matrix operation (i.e. vectorization). The scipy 'multivariate_normal' function cannot parameters of more than 1 gaussians, as far as I understand it (but it's 'pdf' function can calculates on multiple samples at once).
Does someone have an Idea?
I am afraid, that the main speed killer in your problem is the inversion and deteminant calculation for the cov_t matrices. If you somehow managed to precalculate these, you could enroll the calculation and use np.add.outer to get all combinations of x_i - mu_k and then use array comprehension to calculate the probabilities with the full formula of the normal distribution function.
Try
S = np.add.outer(X,-mu_t)
cov_t_inv = ??
cov_t_inv_det = ??
Q = 1/(2*np.pi*cov_t_inv_det)**0.5 * np.exp(-0.5*np.einsum('ikr,krs,kis->ik',S,cov_t_inv,S))
Where you insert precalculated arrays cov_t_inv for the inverse covariance matrices and cov_t_inv_det for their determinants.

Does PCA transform preserve the sorting of the data?

I am working with a huge number of n-dimensional arrays in python. All the arrays are stored in a dictionary, so each array is uniquely identified by a key.
I would like to visualize all the arrays in 2D, so I have performed a PCA:
# standardize data before applying PCA
dict_data_std = StandardScaler().fit_transform(dict_data.values())
pca = PCA(n_components=2)
data_post_pca = pca.fit_transform(dict_data_std.values())
My problem is: does PCA transform preserve the order of the data? So, does the first array of dict_data get mapped to the first (2D) array of data_post_pca?
I need a 100% certain answer.
Not necessarily. You can think of PCA with one eigenvector as a weighted linear combination of input. Depending on the decomposition it could do anything from reordering to reversing the weights. However, if your intuition holds true it should produce a weighting similar to what you think it should.
Now once you take more eigenvectors it becomes a set of linear combinations instead of a single one. - https://www.reddit.com/r/MachineLearning/comments/24ywyc/does_pca_preserve_the_order_in_data/

Understanding output from kmeans clustering in python

I have two distance matrices, each 232*232 where the column and row labels are identical. So this would be an abridged version of the two where A, B, C and D are the names of the points between which the distances are measured:
A B C D ... A B C D ...
A 0 1 5 3 A 0 5 3 9
B 4 0 4 1 B 2 0 7 8
C 2 6 0 3 C 2 6 0 1
D 2 7 1 0 D 5 2 5 0
... ...
The two matrices therefore represent the distances between pairs of points in two different networks. I want to identify clusters of pairs that are close together in one network and far apart in the other. I attempted to do this by first adjusting the distances in each matrix by dividing every distance by the largest distance in the matrix. I then subtracted one matrix from the other and applied a clustering algorithm to the resultant matrix. The algorithm I was advised to use for this was the k means algorithm. The hope was that I could identify clusters of positive numbers that would correspond to pairs that were very close in matrix one and far apart in matrix two and vice versa for clusters of negative numbers.
Firstly, I've read quite a bit about how to implement k means in python I'm aware that there are multiple different modules that can be used. I've tried all three of these:
1.
import sklearn.cluster
import numpy as np
data = np.load('difference_matrix_file.npy') #loads difference matrix from file
a = np.array([x[0:] for x in data])
clust_centers = 3
model = sklearn.cluster.k_means(a, clust_centers)
print model
2.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
difference_matrix = np.load('difference_matrix_file.npy') #loads difference matrix from file
data = pd.DataFrame(difference_matrix)
model = KMeans(n_clusters=3)
print model.fit(data)
3.
import numpy as np
from scipy.cluster.vq import vq, kmeans, whiten
np.set_printoptions(threshold=np.nan)
difference_matrix = np.load('difference_matrix_file.npy') #loads difference matrix from file
whitened = whiten(difference_matrix)
centroids = kmeans(whitened, 3)
print centroids
What I'm struggling with is how to interpret the output from these scripts. (I might add at this point that I'm neither a mathematician nor a computer scientist if the reader hadn't already guessed). I was expecting the output of the algorithm to be lists of coordinates of clustered pairs, one for each cluster so three in this case, that I could then trace back to my two original matrices and identify the names of the pairs of interest.
However what I get is an array containing a list of numbers (one for each cluster) but I don't really understand what these numbers are, they don't obviously correspond to what I had in my input matrix other than the fact that there is 232 items in each list which is the same number of rows and columns there are in the input matrix. And the list item in the array is another single number which I presume must be the centroid of the clusters, but there isn't one for each cluster, just one for the whole array.
I've been trying to figure this out for quite a while now but I'm struggling to get anywhere. Whenever I search for interpreting the output of kmeans I just get explanations of how to plot my clusters on a graph which isn't what I want to do. Please can someone explain to me what I'm seeing in my output and how I can get from this to the coordinates of the items in each cluster?
You have two issues where, and the recommendation of k-means probably was not very good...
K-means expects a coordinate data matrix, not a distance matrix.
In order to compute a centroid, it needs the original coordinates. If you don't have coordinates like this, you probably should not be using k-means.
If you compute the difference of two distance matrixes, small values correspond to points that have a similar distance in both. These could still be very far away from each other! So if you use this matrix as a new "distance" matrix, you will get meaningless results. Consider points A and B, which have the maximum distance in both original graphs. After your procedure, they will have a difference of 0, and will thus be considered identical now.
So you haven't understood the input of k-means, no wonder you do not understand the output.
I'd rather treat the difference matrix as a similarity matrix (try absolute values, positives only, negatives only). Then use hierarchical clustering. But you will need an implementation for a similarity, the usual implementations for a distance matrix will not work.
Disclaimer: below, I tried to answer your question about how to interpret what the functions return and how to get the points in a cluster from that. I agree with #Anony-Mousse in that if you have a distance / similarity matrix (as opposed to a feature matrix), you will want to use different techniques, such as spectral clustering.
Sorry for being blunt, I also hate the "RTFM"-type answers, but the functions you used are well documented at:
sklearn.cluster,
scipy.cluster.vq?
In short,
the model sklearn.cluster.k_means() returns a tuple with three fields:
an array with the centroids (that should be 3x232 for you)
the label assignment for each point (i.e. a 232-long array with values 0-2)
and "intertia", a measure of how good the clustering is; there are several measures for that, so you might be better off not paying too much attention to this;
scipy.cluster.vq.kmeans2() returns a tuple with two fields:
the cluster centroids (as above)
the label assignment (as above)
kmeans() returns a "distortion" value instead of the label assignment, so I would definitely use kmeans2().
As for how to get to the coordinates of the points in each cluster, you could:
for cc in range(clust_centers):
print('Points for cluster {}:\n{}'.format(cc, data[model[1] == cc]))
where model is the tuple returned by either sklearn.cluster.k_means or scipy.cluster.vq.kmeans2, and data is a points x coordinates array, difference_matrix in your case.

Given a sparse matrix with shape (num_samples, num_features), how do I estimate the co-occurrence matrix?

The sparse matrix has only 0 and 1 at each entry (i,j) (1 stands for sample i has feature j). How can I estimate the co-occurrence matrix for each feature given this sparse representation of data points? Especially, I want to find pairs of features that co-occur in at least 50 samples. I realize it might be hard to produce the exact result, is there any approximated algorithm in data mining that allows me to do that?
This can be solved reasonably easily if you go to a transposed matrix.
Of any two features (now rows, originally columns) you compute the intersection. If it's larger than 50, you have a frequent cooccurrence.
If you use an appropriate sparse encoding (now of rows, but originally of columns - so you probably need not only to transpose the matrix, but also to reencode it) this operation using O(n+m), where n and m are the number of nonzero values.
If you have an extremely high number of features this make take a while. But 100000 should be feasible.

Categories

Resources