scipy kmeans -- how to get actual threshold between clusters - python

Using scipy kmeans I can get the centroids of my clusters with
centroids, variance = kmeans(pixel,3)
and also an array showing which value is assigned to which cluster:
code, distance = vq(features, centroids)
But how do I get the actual threshold values separating the clusters from each other? Is there any variable or command containing these?
Thanks for any advice.

K-means does not use numerical thresholds.
Every point belongs to the closest cluster center, so the "threshold" is the hyperplane (see "Voronoi diagram" on Wikipedia) between the two nearest cluster centers.
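For one-dimensional data such as pixel intensities, that boundary reduces to the midpoint between adjacent centroids, so you can compute the threshold values yourself. A minimal sketch, assuming the same scipy.cluster.vq functions as in the question and made-up data:
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Illustrative 1-D data (e.g. pixel intensities), reshaped to an (N, 1) array
pixel = np.random.rand(1000).reshape(-1, 1)

centroids, variance = kmeans(pixel, 3)   # centroids has shape (3, 1)
code, distance = vq(pixel, centroids)    # index of the nearest centroid per point

# For 1-D features the decision boundaries are simply the midpoints
# between adjacent (sorted) centroid values.
c = np.sort(centroids[:, 0])
thresholds = (c[:-1] + c[1:]) / 2.0
print(thresholds)   # two values separating the three clusters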

Related

Dendrogram, KMeans, centroids and labels

Excuse me if the questions are too simple, but I am getting into machine learning under time constraints.
I must apply a mixed classification to a df with the following steps:
Apply the KMeans with 50 clusters
From the barycenters and labels obtained for each cluster, a dendrogram must be displayed, in order to choose the right k.
Then apply an HCA algorithm from the barycenters obtained in step 1 with the number of clusters from step 2.
Calculate the barycenters of each new group
Use the calculated barycenters to consolidate the clusters by the KMeans algorithm.
What I do is:
clf = KMeans(n_clusters=50)
clf.fit(df)
centroids = clf.cluster_centers_
labels = clf.labels_
From there I get confused with the dendrogram. So far I have only used it on the whole df, and I am not sure how to correctly involve the barycenters and labels from the KMeans.
Z = linkage(df, method='ward', metric='euclidean')
dendrogram(Z, labels=df.index, leaf_rotation=90., color_threshold=0)
plt.show()
Last but not least, I do not know how to get the barycenters in the AgglomerativeClustering.
Any clarification would be of help. Thanks in advance!
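Not an authoritative recipe, but one way the five steps could be wired together, assuming df is the numeric dataframe from the question and k_from_dendrogram is whatever number of clusters you read off the plot:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram

# Step 1: KMeans with 50 clusters (df assumed numeric)
km = KMeans(n_clusters=50, n_init=10).fit(df)
barycenters = km.cluster_centers_

# Step 2: dendrogram built on the 50 barycenters, not on the whole df
Z = linkage(barycenters, method='ward', metric='euclidean')
dendrogram(Z, leaf_rotation=90., color_threshold=0)
plt.show()

# Step 3: HCA on the barycenters with the k chosen from the dendrogram
k_from_dendrogram = 5   # illustrative value, read it off the plot
hca = AgglomerativeClustering(n_clusters=k_from_dendrogram, linkage='ward')
hca_labels = hca.fit_predict(barycenters)

# Step 4: barycenters of each new group (AgglomerativeClustering has no
# cluster_centers_ attribute, so compute the means yourself)
new_centers = np.array([barycenters[hca_labels == k].mean(axis=0)
                        for k in range(k_from_dendrogram)])

# Step 5: consolidate with KMeans initialised at the new barycenters
final = KMeans(n_clusters=k_from_dendrogram, init=new_centers, n_init=1).fit(df)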

Labels for cluster centers in Python sklearn

When utilizing the sklearn class sklearn.cluster for K-means clustering, a fitted k-means object has 3 attributes, including a numpy array of cluster centers (centers x features) named cluster_centers_. However, these centers don't have an attached label.
My question is: are the centers (rows) in cluster_centers_ ordered by the label value? That is, does row 1 correspond to the center for the cluster labeled 1? Or are they placed in the array randomly? A pointer to any documentation would be more than sufficient.
Thanks.
I couldn't find it in the documentation, but yes, it is ordered by cluster label.
So:
kmeans.cluster_centers_[0] = centroid of cluster 0
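A quick way to convince yourself of that ordering, with made-up data:
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                      # illustrative data
kmeans = KMeans(n_clusters=3, n_init=10).fit(X)

# Row i of cluster_centers_ is the centroid of the points labelled i;
# both printed vectors should agree up to the convergence tolerance.
for label in range(3):
    print(label,
          kmeans.cluster_centers_[label],
          X[kmeans.labels_ == label].mean(axis=0))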

What's a good metric to analyze the quality of the output of a clustering algorithm?

I've been trying out the kmeans clustering algorithm implementation in scipy. Are there any standard, well-defined metrics that could be used to measure the quality of the clusters generated?
ie, I have the expected labels for the data points that are clustered by kmeans. Now, once I get the clusters that have been generated, how do I evaluate the quality of these clusters with respect to the expected labels?
I am doing this very thing at the moment with Spark's KMeans.
I am using:
The sum of squared distances of points to their nearest center (implemented in computeCost()).
The unbalanced factor (see "Unbalanced factor of KMeans?" for an implementation and "Understanding the quality of the KMeans algorithm" for an explanation).
Both quantities indicate a better clustering when they are small (the smaller, the better).
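In Python the first quantity is easy to compute yourself (scikit-learn also exposes it as inertia_); a small sketch with made-up data, using the scipy kmeans/vq functions from the original question:
import numpy as np
from scipy.cluster.vq import kmeans, vq

features = np.random.rand(500, 4)        # illustrative data
centroids, _ = kmeans(features, 3)
code, dist = vq(features, centroids)     # dist[i] = distance of point i to its nearest center

sse = np.sum(dist ** 2)                  # sum of squared distances to the nearest centers
print(sse)                               # the analogue of Spark's computeCost()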
K-means attempts to minimise the sum of squared distances to cluster centers. I would compare that sum for the K-means clusters with the same sum computed for the clusters you get by grouping the points according to their expected labels.
There are two possibilities for the result. If the K-means sum of squares is larger than that of the expected-label clustering, then your K-means implementation is buggy or did not start from a good set of initial cluster assignments, and you could think about increasing the number of random starts you use, or about debugging it. If the K-means sum of squares is smaller than that of the expected-label clustering, but the K-means clusters are not very similar to the expected-label clustering (similar in the sense that two points chosen at random tend to be in the same expected-label cluster exactly when they are in the same K-means cluster), then the sum of squares from cluster centers is not a good way of splitting your points up into clusters, and you need to use a different distance function, look at different attributes, or use a different sort of clustering.
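A rough sketch of that comparison; X, kmeans_labels and expected_labels are stand-ins for your data matrix, your K-means assignment, and the known labels:
import numpy as np

def within_cluster_ss(X, labels):
    # Sum of squared distances of each point to the mean of its own group
    return sum(np.sum((X[labels == lab] - X[labels == lab].mean(axis=0)) ** 2)
               for lab in np.unique(labels))

# Illustrative stand-ins
X = np.random.rand(300, 5)
kmeans_labels = np.random.randint(0, 3, size=300)
expected_labels = np.random.randint(0, 3, size=300)

print(within_cluster_ss(X, kmeans_labels))
print(within_cluster_ss(X, expected_labels))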
In your case, when you do have the samples' true labels, validation is very easy.
First of all, compute the confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix). Then derive from it all the relevant measures: true positives, false negatives, false positives and true negatives. From those you can find the precision, recall, miss rate, etc.
Make sure you understand the meaning of all of the above. They basically tell you how well your clustering predicted / recognized the true nature of your data.
If you're using python, just use the sklearn package:
http://scikit-learn.org/stable/modules/model_evaluation.html
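For example, with illustrative label arrays (confusion_matrix, adjusted_rand_score and adjusted_mutual_info_score all live in sklearn.metrics):
from sklearn import metrics

true_labels = [0, 0, 1, 1, 2, 2]   # known labels (illustrative)
pred_labels = [1, 1, 0, 0, 2, 2]   # cluster assignments (cluster ids are arbitrary)

# Confusion matrix of true classes vs. cluster ids
print(metrics.confusion_matrix(true_labels, pred_labels))

# Permutation-invariant scores are often more convenient for clustering,
# since the cluster numbering itself is arbitrary.
print(metrics.adjusted_rand_score(true_labels, pred_labels))
print(metrics.adjusted_mutual_info_score(true_labels, pred_labels))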
In addition, it's nice to run some internal validation, to see how well your clusters are separated. There are known internal validity measures, like:
Silhouette
DB index
Dunn index
Calinski-Harabasz measure
Gamma score
Normalized Cut
etc.
Read more here: An extensive comparative study of cluster validity indices, by Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, and Iñigo Perona.
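Several of these internal indices are available directly in recent scikit-learn versions, e.g. (made-up data):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X = np.random.rand(300, 4)                         # illustrative data
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

print(silhouette_score(X, labels))                 # in [-1, 1], higher is better
print(calinski_harabasz_score(X, labels))          # higher is better
print(davies_bouldin_score(X, labels))             # lower is better (DB index)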

How to compute cluster assignments from linkage/distance matrices

If you have this hierarchical clustering call in scipy in Python:
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
# dist_matrix is a square-form distance matrix; squareform converts it to the
# condensed form that linkage expects
linkage_matrix = linkage(squareform(dist_matrix), linkage_method)
then what's an efficient way to go from this to cluster assignments for individual points? I.e., a vector of length N, where N is the number of points and each entry i is the cluster number of point i, given the number of clusters generated by a given threshold thresh on the resulting clustering?
To clarify: the cluster number would be the cluster a point ends up in after applying a threshold to the tree, i.e. each leaf node gets a unique "most specific cluster", defined by the threshold at which you cut the dendrogram.
I know that scipy.cluster.hierarchy.fclusterdata gives you this cluster assignment as its return value, but I am starting from a custom made distance matrix and distance metric, so I cannot use fclusterdata. The question boils down to: how can I compute what fclusterdata is computing -- the cluster assignments?
If I understand you right, that is what fcluster does:
scipy.cluster.hierarchy.fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)
Forms flat clusters from the hierarchical clustering defined by the linkage matrix Z.
...
Returns: An array of length n. T[i] is the flat cluster number to which original observation i belongs.
So just call fcluster(linkage_matrix, t), where t is your threshold.
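Note that the default criterion is 'inconsistent'; to cut the tree at an actual distance threshold, pass criterion='distance'. A small self-contained sketch with a stand-in distance matrix:
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

# Illustrative stand-in for the custom square distance matrix from the question
points = np.random.rand(10, 3)
dist_matrix = squareform(pdist(points))

linkage_matrix = linkage(squareform(dist_matrix), 'average')

# criterion='distance' makes t behave as a distance threshold on the
# dendrogram, rather than the default 'inconsistent' statistic.
thresh = 0.5
assignments = fcluster(linkage_matrix, t=thresh, criterion='distance')
print(assignments)   # assignments[i] is the flat cluster number of observation i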

Can I get the features of the clusters using hierarchical clustering - numpy

I am trying to do hierarchical clustering on an m*n array.
Input array : 500 * 1000 (1000 features, 500 observations)
Calculate distance matrix using a self-defined pdist function
Feed this distance matrix to linkage function :
clusters = sch.linkage(distanceMatrix,'single')
Form flat clusters :
fc = sch.fcluster(clusters,cutoff,'distance')
This gives me some clusters (around 80, using a cutoff value of 6.0).
Now, is there any way that I can get the 1000 features corresponding to each cluster as well (like we get the features of the centroids with K-means clustering)?
Clusters in hierarchical clustering (or pretty much any method except k-means and Gaussian mixture EM, which are restricted to "spherical", or more precisely convex, clusters) do not necessarily have sensible means.
Because they allow for non-spherical clusters. That actually is a feature...
https://en.wikipedia.org/wiki/Cluster_analysis#Connectivity_based_clustering_.28hierarchical_clustering.29
Have a look at the right image titled "Linkage clustering examples". What good is a cluster in this "banana" example? The centroid might not even be in the cluster!
Note that you can still compute the centroid yourself if you need it. Since the clustering algorithm does not need the centroid, it will not compute it for you automatically.
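For instance, using numpy on the flat clusters returned by fcluster, the per-cluster mean feature vectors can be computed like this (made-up data standing in for the question's 500 x 1000 array and custom pdist):
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

data = np.random.rand(500, 1000)        # stand-in for the 500 x 1000 input array
distanceMatrix = pdist(data)            # stand-in for the self-defined pdist
clusters = sch.linkage(distanceMatrix, 'single')
fc = sch.fcluster(clusters, 6.0, 'distance')

# Mean feature vector (length 1000) for each flat cluster
centroids = {label: data[fc == label].mean(axis=0) for label in np.unique(fc)}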
