Understanding output from kmeans clustering in python


I have two distance matrices, each 232*232 where the column and row labels are identical. So this would be an abridged version of the two where A, B, C and D are the names of the points between which the distances are measured:
  A B C D ...      A B C D ...
A 0 1 5 3        A 0 5 3 9
B 4 0 4 1        B 2 0 7 8
C 2 6 0 3        C 2 6 0 1
D 2 7 1 0        D 5 2 5 0
...              ...
The two matrices therefore represent the distances between pairs of points in two different networks. I want to identify clusters of pairs that are close together in one network and far apart in the other. I attempted to do this by first adjusting the distances in each matrix by dividing every distance by the largest distance in the matrix. I then subtracted one matrix from the other and applied a clustering algorithm to the resultant matrix. The algorithm I was advised to use for this was the k means algorithm. The hope was that I could identify clusters of positive numbers that would correspond to pairs that were very close in matrix one and far apart in matrix two and vice versa for clusters of negative numbers.
Firstly, I've read quite a bit about how to implement k-means in Python, and I'm aware that there are multiple modules that can be used. I've tried all three of these:
1.
import sklearn.cluster
import numpy as np
data = np.load('difference_matrix_file.npy') #loads difference matrix from file
a = np.array([x[0:] for x in data])
clust_centers = 3
model = sklearn.cluster.k_means(a, clust_centers)
print(model)
2.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
difference_matrix = np.load('difference_matrix_file.npy') #loads difference matrix from file
data = pd.DataFrame(difference_matrix)
model = KMeans(n_clusters=3)
print(model.fit(data))
3.
import sys
import numpy as np
from scipy.cluster.vq import vq, kmeans, whiten
np.set_printoptions(threshold=sys.maxsize)  # print full arrays; threshold=np.nan is rejected by current NumPy
difference_matrix = np.load('difference_matrix_file.npy') #loads difference matrix from file
whitened = whiten(difference_matrix)
centroids = kmeans(whitened, 3)
print(centroids)
What I'm struggling with is how to interpret the output from these scripts. (I might add at this point that I'm neither a mathematician nor a computer scientist if the reader hadn't already guessed). I was expecting the output of the algorithm to be lists of coordinates of clustered pairs, one for each cluster so three in this case, that I could then trace back to my two original matrices and identify the names of the pairs of interest.
However, what I get is an array containing a list of numbers (one for each cluster), but I don't really understand what these numbers are. They don't obviously correspond to what I had in my input matrix, other than the fact that there are 232 items in each list, which is the same as the number of rows and columns in the input matrix. The last item in the array is another single number, which I presume must be the centroid of the clusters, but there isn't one for each cluster, just one for the whole array.
I've been trying to figure this out for quite a while now but I'm struggling to get anywhere. Whenever I search for interpreting the output of kmeans I just get explanations of how to plot my clusters on a graph which isn't what I want to do. Please can someone explain to me what I'm seeing in my output and how I can get from this to the coordinates of the items in each cluster?

You have two issues here, and the recommendation of k-means probably was not very good...
K-means expects a coordinate data matrix, not a distance matrix.
In order to compute a centroid, it needs the original coordinates. If you don't have coordinates like this, you probably should not be using k-means.
If you compute the difference of two distance matrices, small values correspond to pairs of points that have a similar distance in both. These could still be very far away from each other! So if you use this matrix as a new "distance" matrix, you will get meaningless results. Consider points A and B, which have the maximum distance in both original graphs. After your procedure, they will have a difference of 0, and will thus be considered identical now.
So you haven't understood the input of k-means, no wonder you do not understand the output.
I'd rather treat the difference matrix as a similarity matrix (try absolute values, positives only, negatives only). Then use hierarchical clustering. But you will need an implementation for a similarity, the usual implementations for a distance matrix will not work.

Disclaimer: below, I tried to answer your question about how to interpret what the functions return and how to get the points in a cluster from that. I agree with @Anony-Mousse in that if you have a distance / similarity matrix (as opposed to a feature matrix), you will want to use different techniques, such as spectral clustering.
Sorry for being blunt, I also hate the "RTFM"-type answers, but the functions you used are well documented at:
sklearn.cluster,
scipy.cluster.vq.
In short,
the model sklearn.cluster.k_means() returns a tuple with three fields:
an array with the centroids (that should be 3x232 for you)
the label assignment for each point (i.e. a 232-long array with values 0-2)
and "inertia", a measure of how good the clustering is (the sum of squared distances from each point to its closest centroid); there are several measures for that, so you might be better off not paying too much attention to this;
scipy.cluster.vq.kmeans2() returns a tuple with two fields:
the cluster centroids (as above)
the label assignment (as above)
kmeans() returns a "distortion" value instead of the label assignment, so I would definitely use kmeans2().
As for how to get to the coordinates of the points in each cluster, you could:
for cc in range(clust_centers):
    print('Points for cluster {}:\n{}'.format(cc, data[model[1] == cc]))
where model is the tuple returned by either sklearn.cluster.k_means or scipy.cluster.vq.kmeans2, and data is a points x coordinates array, difference_matrix in your case.
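To make the return values described above concrete, here is a minimal sketch on made-up data. The array `data`, the labels in `names` and the choice of 12 points with 4 coordinates are assumptions for illustration only, not the asker's difference matrix. Note that k-means treats every row as one point, so the labels refer to rows, not to pairs of points.

```python
import numpy as np
from sklearn.cluster import k_means

# Hypothetical stand-in data: 12 points (rows) with 4 coordinates (columns).
rng = np.random.default_rng(0)
names = ["P{}".format(i) for i in range(12)]
data = rng.normal(size=(12, 4))

clust_centers = 3
# k_means returns a tuple: (centroids, labels, inertia)
centroids, labels, inertia = k_means(data, clust_centers, n_init=10, random_state=0)

# Map the row indices in each cluster back to the row names.
members = {cc: [names[i] for i in np.flatnonzero(labels == cc)]
           for cc in range(clust_centers)}

for cc in range(clust_centers):
    print("Cluster {}: {}".format(cc, members[cc]))
```

The same indexing should work with the label array returned by scipy.cluster.vq.kmeans2, since it is also one integer per row.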

Related

Using np.cov() on a centered matrix not equivalent to matrix multiplication between the array and its transpose

I'm trying to get the eigenvectors and values for the MNIST dataset.
I'm testing out a concept on the dataset so I can carry it to a different dataset
I have a matrix M where the rows are the images and the columns are the pixel values.
I'm trying to do the above in two ways (taken from https://mml-book.github.io/book/mml-book.pdf, chapter 10, section 1 and section 5):
M is of shape 500 rows x 784 columns
First, I'm using the following code:
V = np.cov(M.T)
and then using:
V2 = np.dot(M.T, M) / 783
According to numpy's guide on cov(), it seems that with only one array given, the results of both should be identical, but they're not. https://numpy.org/doc/stable/reference/generated/numpy.cov.html
Sorry if the question is simple and there's an obvious answer.
EDIT:
If I take the highest eigenvector of both methods and scale it so the lowest number is zero and the highest is 255, I get the same vector. What am I missing here?
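One likely explanation, which this thread does not confirm, is the normalization: np.cov divides by the number of observations minus one, and with np.cov(M.T) the observations are the rows of M (the images), not the pixels. The sketch below uses a small random matrix as a stand-in (the shape 50 x 8 is an assumption chosen to keep it fast) and compares both divisors. A constant factor would also explain why the rescaled eigenvectors agree.

```python
import numpy as np

# Hypothetical stand-in for M: 50 observations (rows) and 8 variables (columns).
rng = np.random.default_rng(0)
n_obs, n_feat = 50, 8
M = rng.normal(size=(n_obs, n_feat))

# Center each column (variable) before forming the product.
Mc = M - M.mean(axis=0)

V = np.cov(M.T)                                   # rows of M.T are the variables
V2_features_divisor = Mc.T @ Mc / (n_feat - 1)    # analogous to dividing by 783
V2_samples_divisor = Mc.T @ Mc / (n_obs - 1)      # number of observations minus one

print(np.allclose(V, V2_samples_divisor))
print(np.allclose(V, V2_features_divisor))
```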

partially define initial centroid for scikit-learn K-Means clustering

Scikit documentation states that:
Method for initialization:
‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
My data has 10 (predicted) clusters and 7 features. However, I would like to pass an array of shape 10 by 6, i.e. I want 6 dimensions of each centroid to be predefined by me, but the 7th dimension to be iterated freely using k-means++. (In other words, I do not want to specify the full initial centroid, but rather control 6 dimensions and leave only one dimension free to vary for the initial cluster.)
I tried to pass a 10x6 array in the hope that it would work, but it just throws an error.

Sklearn does not allow you to perform this kind of fine-grained operation.
The only possibility is to provide a 7th feature value that is random, or similar to what k-means++ would have achieved.
So basically you can estimate a good value for this as follows:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
nb_clust = 10
# your data
X = np.random.randn(1000, 7)
# your 6-column centroids
cent_6cols = np.random.randn(nb_clust, 6)
# find the points lying on each cluster given your partial centroids,
# using the first 6 columns only (nearest centroid by Euclidean distance)
initial_prediction = pairwise_distances_argmin(X[:, 0:6], cent_6cols)
# for the 7th column you'll provide the average value
# of the points lying on the cluster given by your partial centroids
cent_7cols = np.zeros((nb_clust, 7))
cent_7cols[:, 0:6] = cent_6cols
for i in range(nb_clust):
    # note: the mean is NaN if no point was assigned to centroid i
    cent_7cols[i, 6] = X[initial_prediction == i, 6].mean()
# now you have initialized the 7th column in a k-means++-like way,
# so you can use cent_7cols as your initial centroids
truekm = KMeans(n_clusters=nb_clust, init=cent_7cols, n_init=1)

That is a very nonstandard variation of k-means. So you cannot expect sklearn to be prepared for every exotic variation. That would make sklearn slower for everybody else.
In fact, your approach is more like certain regression approaches (predicting the last value of the cluster centers) rather than clustering. I also doubt the results will be much better than simply setting the last value to the average of all points assigned to the cluster center using the other 6 dimensions only. Try partitioning your data based on the nearest center (ignoring the last column) and then setting the last column to be the arithmetic mean of the assigned data.
However, sklearn is open source.
So get the source code, and modify k-means. Initialize the last component randomly, and while running k-means only update the last column. It's easy to modify it this way, but it's very hard to design an efficient API that allows such customizations through trivial parameters. Use the source code to customize at this level.
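Rather than patching sklearn itself, the variation described above can also be sketched as a small stand-alone loop in NumPy. This is only an illustration under stated assumptions: the function name kmeans_free_last_column, the random test data and the stopping rule are all made up here, and it is not how sklearn implements k-means.

```python
import numpy as np

def kmeans_free_last_column(X, fixed_centers, n_iter=50, seed=0):
    """Lloyd-style loop where only the last coordinate of each center is updated.

    X has shape (n_points, d); fixed_centers has shape (k, d - 1) and is never changed.
    """
    rng = np.random.default_rng(seed)
    k = fixed_centers.shape[0]
    centers = np.empty((k, X.shape[1]))
    centers[:, :-1] = fixed_centers
    # Initialize the free coordinate from randomly chosen data values.
    centers[:, -1] = rng.choice(X[:, -1], size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest center using all coordinates.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update only the last coordinate; empty clusters keep their old value.
        new_last = centers[:, -1].copy()
        for c in range(k):
            mask = labels == c
            if mask.any():
                new_last[c] = X[mask, -1].mean()
        if np.allclose(new_last, centers[:, -1]):
            break
        centers[:, -1] = new_last
    return centers, labels

# Made-up data: 200 points with 7 features, 10 centers fixed in 6 dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 7))
fixed = rng.normal(size=(10, 6))
centers, labels = kmeans_free_last_column(X, fixed)
```

If the loop runs out of iterations without converging, the returned labels correspond to the assignment made just before the final update.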

Clustering a sparse co-occurrence matrix

I have two N x N co-occurrence matrices (484x484 and 1060x1060) that I have to analyze. The matrices are symmetrical along the diagonal and contain lots of zero values. The non-zero values are integers.
I want to group together the positions that are non-zero. In other words, what I want to do is the algorithm on this link. When order by cluster is selected, the matrix gets re-arranged in rows and columns to group the non-zero values together.
Since I am using Python for this task, I looked into SciPy Sparse Linear Algebra library, but couldn't find what I am looking for.
Any help is much appreciated. Thanks in advance.

If you have a matrix dist with pairwise distances between objects, then you can find the order in which to rearrange the matrix by applying a clustering algorithm to this matrix (http://scikit-learn.org/stable/modules/clustering.html). For example it might be something like:
from sklearn import cluster
import numpy as np
# dist is your precomputed pairwise distance matrix (a square NumPy array)
# older scikit-learn releases name the metric parameter affinity
# linkage must not be the default "ward", which only works with Euclidean distances
model = cluster.AgglomerativeClustering(n_clusters=20, metric="precomputed", linkage="average").fit(dist)
new_order = np.argsort(model.labels_)
ordered_dist = dist[new_order] # can be your original matrix instead of dist[]
ordered_dist = ordered_dist[:,new_order]
The order is given by the variable model.labels_, which has the number of the cluster to which each sample belongs. A few observations:
You have to find a clustering algorithm that accepts a distance matrix as input. AgglomerativeClustering is such an algorithm (notice the metric="precomputed" option, called affinity in older scikit-learn releases, which tells it that we are using pre-computed distances).
What you have seems to be a pairwise similarity matrix, in which case you need to transform it to a distance matrix (e.g. dist=1 - data/data.max())
In the example I assumed 20 clusters, you may have to play with this variable a bit. Alternatively, you might try to find the best one-dimensional representation of your data (using e.g. MDS) to describe the optimal ordering of samples.

Because your data is sparse, treat it as a graph, not a matrix.
Then try the various graph clustering methods. For example cliques are interesting on such data.
Note that not everything may cluster.
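As a very simple instance of the graph view, the sketch below groups rows by connected component and reorders the matrix so that each component forms a block. Connected components are swapped in here as the simplest graph grouping, not the clique search mentioned above, and the 5 x 5 matrix cooc is made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Made-up symmetric co-occurrence matrix; non-zero entries are graph edges.
cooc = np.array([
    [0, 0, 3, 0, 0],
    [0, 0, 0, 2, 0],
    [3, 0, 0, 0, 1],
    [0, 2, 0, 0, 0],
    [0, 0, 1, 0, 0],
])

# Label every node with the index of its connected component.
n_groups, labels = connected_components(csr_matrix(cooc), directed=False)

# Reorder rows and columns so that nodes of the same component are adjacent.
order = np.argsort(labels, kind="stable")
reordered = cooc[np.ix_(order, order)]

print(n_groups)
print(reordered)
```

On real data one large component may absorb most nodes, in which case a finer graph clustering method would be needed.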

Understanding scikitlearn PCA.transform function in Python

so I'm currently working on a project that involves the use of Principal Component Analysis, or PCA, and I'm attempting to kind of learn it on the fly. Luckily, Python has a very convenient module from scikitlearn.decomposition that seems to do most of the work for you. Before I really start to use it though, I'm trying to figure out exactly what it's doing.
The dataframe I've been testing on looks like this:
   0  1
0  1  2
1  3  1
2  4  6
3  5  3
And when I call PCA.fit() and then view the components I get:
array([[ 0.5172843 , 0.85581362],
[ 0.85581362, -0.5172843 ]])
From my rather limited knowledge of PCA, I kind of grasp how this was calculated, but where I get lost is when I then call PCA.transform. This is the output it gives me:
array([[-2.0197033 , -1.40829634],
[-1.84094831, 0.8206152 ],
[ 2.95540408, -0.9099927 ],
[ 0.90524753, 1.49767383]])
Could someone potentially walk me through how it takes the original dataframe and components and transforms it into this new array? I'd like to be able to understand the exact calculations it's doing so that when I scale up I'll have a better sense of what's going on. Thanks!

When you call fit, PCA is going to compute some vectors that you can project your data onto in order to reduce the dimension of your data. Since each row of your data is 2-dimensional, there will be a maximum of 2 vectors onto which data can be projected, and each of those vectors will be 2-dimensional. Each row of PCA.components_ is a single vector onto which things get projected, and it will have the same size as the number of columns in your training data. Since you did a full PCA you get 2 such vectors, so you get a 2x2 matrix. The first of those vectors will maximize the variance of the projected data. The 2nd will maximize the variance of what's left after the first projection. Typically one passes a value of n_components that's less than the dimension of the input data, so that you get back fewer rows and you have a wide but not tall components_ array.
When you call transform you're asking sklearn to actually do the projection. That is, you are asking it to project each row of your data into the vector space that was learned when fit was called. For each row of the data you pass to transform you'll have 1 row in the output and the number of columns in that row will be the number of vectors that were learned in the fit phase. In other words, the number of columns will be equal to the value of n_components you passed to the constructor.
Typically one uses PCA when the source data has lots of columns and you want to reduce the number of columns while preserving as much information as possible. Suppose you had a data set with 100 rows and each row had 500 columns. If you constructed a PCA like PCA(n_components = 10) and then called fit you'd find that components_ has 10 rows, one for each of the components you requested, and 500 columns as that's the input dimension. If you then called transform all 100 rows of your data would be projected into this 10-dimensional space so the output would have 100 rows (1 for each in the input) but only 10 columns thus reducing the dimension of your data.
The short answer to how this is done is that PCA computes a Singular Value Decomposition and then keeps only some of the columns of one of those matrices. Wikipedia has much more information on the actual linear algebra behind this - it's a bit long for a StackOverflow answer.
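For the 4 x 2 example in the question, the projection can be reproduced by hand: subtract the column means learned in fit, then multiply by the transpose of components_. This is a minimal sketch assuming a default PCA() with no whitening; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# The data from the question: 4 rows, 2 columns.
X = np.array([[1, 2], [3, 1], [4, 6], [5, 3]], dtype=float)

pca = PCA()
pca.fit(X)

# Step 1: center the data with the column means stored by fit.
centered = X - pca.mean_
# Step 2: project each centered row onto every component (rows of components_).
projected = centered @ pca.components_.T

# The manual projection should agree with what transform returns.
print(np.allclose(projected, pca.transform(X)))
```

For the first row, the centered values are (-2.25, -1), and their dot product with the first component (0.5172843, 0.85581362) is about -2.0197, which matches the first entry of the output shown in the question.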

Problems in performing K means clustering

I am trying to cluster the following data from a CSV file with K means clustering.
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
It is basically a graph where Samples are nodes and the numbers are the edges (weights).
I read the file as follows:
import csv
fileopening = open('data.csv', newline='')
reading = csv.reader(fileopening, delimiter=',')
L = list(reading)
I used this code: https://gist.github.com/betzerra/8744068
Here clusters are built based on the following:
num_points, dim, k, cutoff, lower, upper = 10, 2, 3, 0.5, 0, 200
# makeRandomPoint and kmeans are defined in the linked gist
points = [makeRandomPoint(dim, lower, upper) for i in range(num_points)]
clusters = kmeans(points, k, cutoff)
for i, c in enumerate(clusters):
    for p in c.points:
        print(" Cluster: ", i, "\t Point :", p)
I replaced points with the list L, but I got lots of errors: AttributeError, 'int' object has no attribute 'n', etc.
I need to perform k-means clustering based on the third column (the edge weights) of my CSV file. The tutorial uses randomly created points, and I am not sure how to use this CSV data as input to its k-means function. How can I perform k-means (k=2) on my data? How can I pass the CSV file data as input to this k-means function?

In short "you can't".
Long answer:
K-means is defined for Euclidean spaces only and requires valid point positions, while you only have distances between points, and probably not distances in a strict mathematical sense but rather some kind of "similarity". K-means is not designed to work with similarity matrices.
What can you do?
1. Use some other method to embed your points in Euclidean space in such a way that they closely resemble your distances. One such tool is multidimensional scaling (MDS): http://en.wikipedia.org/wiki/Multidimensional_scaling
2. Once point 1 is done, you can run k-means.
Alternatively, you can construct a kernel (valid in Mercer's sense) by applying some kernel learning technique to resemble your data, and then run kernel k-means on the resulting Gram matrix.
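As a rough illustration of the embed-then-cluster route, the sketch below uses classical (Torgerson) MDS written directly in NumPy, followed by sklearn's KMeans. Classical MDS is one specific variant, chosen here because it is short; the six points in coords are made up and only serve to produce a symmetric distance matrix. Real edge weights would first have to be turned into a symmetric dissimilarity matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def classical_mds(dist, n_components=2):
    """Embed points in Euclidean space from a symmetric distance matrix."""
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram-like matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dist ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    # Keep the largest eigenvalues; clip small negatives from non-Euclidean input.
    order = np.argsort(eigvals)[::-1][:n_components]
    vals = np.clip(eigvals[order], 0, None)
    return eigvecs[:, order] * np.sqrt(vals)

# Made-up example: two well-separated groups of three points each.
coords = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

embedded = classical_mds(dist, n_components=2)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedded)
print(labels)
```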

As lejlot said, only distances between points are not enough to run k-means in the classic sense. It's easy to understand if you understand the nature of k-means. On a high level, k-means works as follows:
1) Randomly assign points to cluster.
(Technically, there are more sophisticated ways of initial partitioning,
but that's not essential right now).
2) Compute centroids of the cluster.
(This is where you need the actual coordinates of the points.)
3) Reassign each point to a cluster with the closest centroid.
4) Repeat steps 2)-3) until stop condition is met.
So, as you can see, in the classic interpretation, k-means will not work, because it is unclear how to compute centroids. However, I have several suggestions of what you could do.
Suggestion 1.
Embed your points in N-dimensional space, where N is the number of points, so that the coordinates of each point are the distances to all the other points.
For example the data you showed:
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
becomes:
Sample1: (0,45,69,12,...)
Sample2: (78,46,0,0,...)
Then you can legitimately use Euclidean distance. Note that the actual distances between points will not be preserved, but this could be a simple and reasonable approximation that preserves relative distances between the points. Another disadvantage is that if you have a lot of points, then your memory (and running time) requirements will be of order N^2.
Suggestion 2.
Instead of k-means, try k-medoids. For this one, you do not need the actual coordinates of the points, because instead of centroids, you need to compute medoids. The medoid of a cluster is the point from this cluster which has the smallest average distance to all other points in the cluster. You could look for implementations online, or it's actually pretty easy to implement yourself. The running time will be proportional to N^2 as well.
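A minimal sketch of such a k-medoids loop, working only from a distance matrix, could look like the following. The function k_medoids, its farthest-first initialization and the toy points are assumptions made here for illustration; this is the simple alternating scheme, not the full PAM algorithm.

```python
import numpy as np

def k_medoids(dist, k, n_iter=100):
    """Alternate between assigning points to the nearest medoid and updating medoids."""
    # Deterministic start: the most central point, then repeatedly the point
    # farthest from the medoids chosen so far.
    medoids = [int(np.argmin(dist.sum(axis=1)))]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(np.argmax(nearest)))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # The medoid minimizes the total distance to its cluster members.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Made-up example: distances between two well-separated groups of points.
coords = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

medoids, labels = k_medoids(dist, 2)
print(medoids, labels)
```

Only the matrix dist is used inside k_medoids; the coordinates exist solely to build the example.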
Final remark.
Why do you want to use k-means at all? It seems like you have a weighted directed graph. There are clustering algorithms specially intended for graphs. This is beyond the scope of your question, but maybe this is something that could be worth considering?
