Labels for cluster centers in Python sklearn

When using sklearn.cluster.KMeans for K-means clustering, a fitted KMeans object has several attributes, including a numpy array of cluster centers (shape n_clusters x n_features) named cluster_centers_. However, these centers don't have an attached label.
My question is: are the centers (rows) in cluster_centers_ ordered by label value? That is, does row i correspond to the center of the cluster labeled i, or are they placed in the array in arbitrary order? A pointer to any documentation would be more than sufficient.
Thanks.

I couldn't find it in the documentation, but yes, the rows are ordered by cluster label.
So:
kmeans.cluster_centers_[0]  # centroid of the cluster labeled 0
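You can check this correspondence yourself: on a run that has converged, the mean of the points assigned label i should match row i of cluster_centers_. A quick sanity check with made-up data:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)  # made-up 2-d data
kmeans = KMeans(n_clusters=3, n_init=10).fit(X)

for i in range(3):
    # mean of the points that received label i
    manual_center = X[kmeans.labels_ == i].mean(axis=0)
    # should match row i of cluster_centers_
    print(i, np.allclose(manual_center, kmeans.cluster_centers_[i]))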

Related

How to use KMeans for distance clustering

I have a dataframe with X and Y coordinate values. They don't have any labels and look as shown below:
X-COORDINATE    Y-COORDINATE
12              34
99              42
90              27
49              64
Is it possible to use KMeans for clustering the data?
How do I get the labels and plot the data on a graph for each cluster?
Yes, you can use k-means even though you don't have labels, because k-means is an unsupervised method, but...
First of all you need to scale your data, because k-means is a distance-based algorithm: it uses the distances between data points to measure their similarity.
I found a tutorial on clustering very useful; you could start with one. A good tutorial also describes how to make a silhouette or elbow plot first to choose a sensible number of clusters.
It should look somewhat like this:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(your_dataframe)  # scale first: k-means is distance-based

kmeans_model = KMeans(n_clusters=n_clusters)  # get n_clusters from a silhouette/elbow plot or just try different numbers
kmeans_model.fit(scaled)
labels = kmeans_model.predict(scaled)
print(labels)
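For the plotting part of your question: once you have the labels, you can color a scatter plot by cluster. A minimal matplotlib sketch, continuing from the variables above (it plots the two scaled coordinates):

import matplotlib.pyplot as plt

for cluster in range(n_clusters):
    points = scaled[labels == cluster]
    plt.scatter(points[:, 0], points[:, 1], label=f"cluster {cluster}")

# mark the cluster centers as well
centers = kmeans_model.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], marker="x", color="black", label="centers")
plt.legend()
plt.show()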
K-means does not always perform perfectly; if you want better results, you could also try other algorithms such as DBSCAN, HDBSCAN, or agglomerative clustering. Which one to choose always depends on your data.

Re-using cluster_centers_ for another k-means clustering

I have a matrix of dimension (nw, ny, nx), where nx and ny are the dimensions of an image (photon counts) and, for each pixel, I have a spectral profile of nw wavelength points.
I have applied k-means clustering from the scikit-learn Python package with the number of clusters equal to ncl=5.
dat = dat1.reshape(nw, nx*ny)
# note: these two lines fit two *separate* models, so the labels and
# the centers below come from two different clustering runs
mm[:] = KMeans(n_clusters=ncl).fit(np.transpose(dat)).labels_
x = KMeans(n_clusters=ncl).fit(np.transpose(dat)).cluster_centers_
Plotting x[i,:] (where i is the cluster label), I can see the 5 different average spectral profiles generated by k-means.
Now my question is the following: I would like to use these 5 cluster centers on a different dataset of the same dimensions (nw, ny, nx) to retrieve the labels that here I have called mm. How can I do it?
Thank you in advance for your time.
As @sascha pointed out, you need to persist the fitted KMeans object and call its predict method on the new data:
dat = dat1.reshape(nw, nx*ny)
clusterer = KMeans(n_clusters=ncl).fit(np.transpose(dat))

dat2 = dat2.reshape(nw, nx*ny)
dat2_labels = clusterer.predict(np.transpose(dat2))
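If you only kept the centers themselves (the array x from the question) rather than the fitted object, you can reproduce predict by assigning each sample to its nearest center. A minimal sketch (labels_from_centers is a hypothetical helper, not sklearn API):

import numpy as np

def labels_from_centers(samples, centers):
    # squared Euclidean distance from every sample (row) to every center
    dists = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # the label is the index of the nearest center
    return dists.argmin(axis=1)

# dat2_labels = labels_from_centers(np.transpose(dat2), x)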

How to calculate the distance between a document and each centroid (k-means)?

I executed scikit-learn's k-means algorithm and got the resulting centroids. I have a new document (it was not in the initial collection), and I would like to calculate the distance between every centroid and the new document to know in which cluster it should be placed.
Is there a built-in function to achieve that, or should I write a similarity function manually?
You can use the method predict to get the closest cluster for each sample in a matrix X:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=K)
model.fit(X_train)
label = model.predict(X_test)
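If you want the actual distances rather than just the nearest label, KMeans also has a transform method, which maps each sample to cluster-distance space:

# shape (n_samples, n_clusters): distance from each sample to every centroid
distances = model.transform(X_test)
# the predicted label is the column with the smallest distance
closest = distances.argmin(axis=1)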

ValueError While Clustering in Sklearn

I have an RGB image of shape (3L, 5L, 5L), i.e. a 5-by-5-pixel image with 3 layers (R, G, and B). I want to cluster it using the DBSCAN algorithm as follows, but I get the error ValueError: Found array with dim 3. Expected <= 2. Can I not use DBSCAN on my 3-d image?
import numpy as np
from sklearn.cluster import DBSCAN
from collections import Counter

data = np.random.rand(3, 5, 5)
print np.shape(data)
print data
db = DBSCAN(eps=0.12, min_samples=3).fit(data)  # raises the ValueError here
print db
labels = db.labels_
print Counter(labels)
To cluster, you need to define what the distance between two points is. DBSCAN is not a graph clustering algorithm; it works on feature vectors, so it expects a 2-d array of shape (n_samples, n_features). You need to represent each pixel as a feature vector so that the distances are meaningful.
The features could just be R, G, B, in which case similar colors are clustered together. Or the features could also include the x, y coordinates, in which case spatial distances are taken into account as well; a reshaping sketch follows below.
If you want to consider spatial distances, I'd also suggest you take a look at scikit-image's segmentation module, which contains several popular image segmentation methods.
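A minimal sketch of that reshaping, assuming the channels-first layout (3, 5, 5) from the question (the eps value is arbitrary and would need tuning for real data):

import numpy as np
from sklearn.cluster import DBSCAN

data = np.random.rand(3, 5, 5)  # (channels, height, width)

# one row per pixel, columns = (R, G, B)
rgb = data.reshape(3, -1).T  # shape (25, 3)

# optionally append the x, y coordinates so spatial distance counts too
yy, xx = np.mgrid[0:5, 0:5]
features = np.column_stack([rgb, xx.ravel(), yy.ravel()])  # shape (25, 5)

db = DBSCAN(eps=0.5, min_samples=3).fit(features)
print(db.labels_)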

scipy kmeans -- how to get actual threshold between clusters

Using scipy kmeans I can get the centroids of my clusters with
centroids, variance = kmeans(pixel, 3)
and also an array showing which value is assigned to which cluster:
code, distance = vq(features, centroids)
But how do I get the actual threshold values separating the clusters from each other? Is there any variable or command containing these?
Thanks for any advice.
K-means does not use numerical thresholds.
Every point belongs to the closest cluster, so the "threshold" between two neighboring clusters is the hyperplane halfway between their centers (see "Voronoi diagram" on Wikipedia). For 1-d data this reduces to the midpoint between adjacent centroids.
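So for 1-d data you can compute the boundaries explicitly as the midpoints of adjacent sorted centroids; a minimal sketch with scipy (pixel here is a made-up 1-d feature array):

import numpy as np
from scipy.cluster.vq import kmeans

pixel = np.random.rand(100, 1)  # made-up 1-d feature
centroids, variance = kmeans(pixel, 3)

# sort the centers; each boundary is the midpoint of two neighbors
centers = np.sort(centroids.ravel())
thresholds = (centers[:-1] + centers[1:]) / 2
print(thresholds)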
