I am running K-Means on some statistical data. My matrix size is [192x31634].
K-Means performs well and creates the 7 centroids that I want it to, so my result is [192x7].
As a self-check, I store the index values I obtain from the K-Means run in a dictionary.
centroids,idx = runkMeans(X_train, initial_centroids, max_iters)
resultDict.update({'centroid' : centroids})
resultDict.update({'idx' : idx})
Then I test my K-Means on the same data I used to find the centroids. Strangely, my result differs:
dict= pickle.load(open("MyDictionary.p", "rb"))
currentIdx = findClosestCentroids(X_train, dict['centroid'])
print("idx Differs: ",np.count_nonzero(currentIdx != dict['idx']))
Output:
idx Differs: 189
Can someone explain this difference to me? I turned the max iterations of the algorithm up to 50, which seems like it should be more than enough. @Joe Halliwell pointed out that K-Means is non-deterministic, but findClosestCentroids gets called by runkMeans, so I do not see why the results of the two idx arrays can differ. Thanks for any ideas.
Here is my code:
def findClosestCentroids(X, centroids):
    K = centroids.shape[0]
    m = X.shape[0]
    dist = np.zeros((K, 1))
    idx = np.zeros((m, 1), dtype=int)
    # the number of rows of X is my number of data points
    for i in range(m):
        # every row is one data point
        x = X[i, :]
        # the number of rows of centroids is my number of centroids
        for j in range(K):
            # every row is one centroid
            c = centroids[j, :]
            # distance between the two points c and x
            dist[j] = np.linalg.norm(c - x)
            # if the last centroid has been processed
            if j == K - 1:
                # idx[i] is set to the index of the centroid with minimal distance
                idx[i] = np.argmin(dist)
    return idx
def runkMeans(X, initial_centroids, max_iters):
    # Initialize values
    m, n = X.shape
    K = initial_centroids.shape[0]
    centroids = initial_centroids
    previous_centroids = centroids
    for i in range(max_iters):
        print("K_Means iteration:", i)
        # For each example in X, assign it to the closest centroid
        idx = findClosestCentroids(X, centroids)
        # Given the memberships, compute new centroids
        centroids = computeCentroids(X, idx, K)
    return centroids, idx
Edit: I turned my max_iters up to 60 and get
idx Differs: 0
It seems that was the problem: the algorithm simply had not converged yet within 50 iterations.
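For reference, here is a minimal sketch of a convergence check inside runkMeans, so the result no longer depends on guessing max_iters. This assumes the findClosestCentroids and computeCentroids functions used above; the tolerance value is illustrative:

def runkMeans(X, initial_centroids, max_iters, tol=1e-9):
    centroids = initial_centroids
    for i in range(max_iters):
        idx = findClosestCentroids(X, centroids)
        new_centroids = computeCentroids(X, idx, centroids.shape[0])
        # stop once the centroids stop moving
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, idx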
K-means is a non-deterministic algorithm. One typically controls for this by setting the random seed. For example, scikit-learn's implementation provides the random_state argument for this purpose:
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
See the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
I'm trying to vectorize the following for-loop in Pytorch. I'd be happy with just vectorizing the inner for-loop, but doing the whole batch would also be awesome.
# B: the batch size
# N: the number of training examples
# dim: the dimension of each feature vector
# K: the number of discrete labels. each vector has a single label
# delta: margin for hinge loss

batch_data = torch.tensor(...)    # Tensor of shape [B x N x d]
batch_labels = torch.tensor(...)  # Tensor of shape [B x N x 1], each element is one of K labels (ints)

batch_losses = []     # Ultimately should be [B x 1]
batch_centroids = []  # Ultimately should be [B x K_i x dim]
for i in range(B):
    data = batch_data[i]      # features for this batch element
    labels = batch_labels[i]  # labels for this batch element
    centroids = []  # Keep track of the means for each class.
    classes = torch.unique(labels)  # Get the unique labels for the classes.
    # NOTE: The number of classes K for each item in the batch might actually
    # be different. This may complicate batch-level operations.
    total_loss = 0
    # For each class independently. This is the part I want to vectorize.
    for cl in classes:
        # Take the subset of training examples with that label.
        subset = data[torch.where(labels == cl)]
        # Find the centroid of that subset.
        centroid = subset.mean(dim=0)
        centroids.append(centroid)
        # Get the distance between each point in the subset and the centroid.
        dists = subset - centroid
        norm = torch.linalg.norm(dists, dim=1)
        # The loss is the mean of the hinge loss across the subset.
        margin = norm - delta
        hinge = torch.clamp(margin, min=0.0) ** 2
        total_loss += hinge.mean()
    # Keep track of everything. If it's too hard to keep track of centroids, that's also OK.
    loss = total_loss.mean()
    batch_losses.append(loss)
    batch_centroids.append(centroids)
I've been scratching my head on how to deal with the irregularly sized tensors. The number of classes in each batch K_i is different, and the size of each subset is different.
It turns out it actually is possible to vectorize across ragged arrays. I'll use numpy, but code should be directly translatable to torch. The key technique is to:
1. Sort by ragged array membership
2. Perform an accumulation
3. Find boundary indices, compute adjacent differences
For a single (non-batch) input of an n x d matrix X and an n-length array label, the following returns the k x d centroids and n-length distances to respective centroids:
def vcentroids(X, label):
    """
    Vectorized version of centroids.
    """
    # order points by cluster label
    ix = np.argsort(label)
    label = label[ix]
    Xz = X[ix]

    # compute pos where pos[i]:pos[i+1] is the span of cluster i
    d = np.diff(label, prepend=0)   # nonzero where the label changes
    pos = np.flatnonzero(d)         # indices where the label changes
    pos = np.repeat(pos, d[pos])    # repeat for 0-length clusters
    pos = np.append(np.insert(pos, 0, 0), len(X))

    # prefix sums over the sorted points, then per-cluster sums via differences
    Xz = np.concatenate((np.zeros_like(Xz[0:1]), Xz), axis=0)
    Xsums = np.cumsum(Xz, axis=0)
    Xsums = np.diff(Xsums[pos], axis=0)
    counts = np.diff(pos)
    c = Xsums / np.maximum(counts, 1)[:, np.newaxis]

    # map each point back to its centroid and compute squared distances
    repeated_centroids = np.repeat(c, counts, axis=0)
    aligned_centroids = repeated_centroids[np.argsort(ix)]  # undo the sort (inverse permutation)
    dist = np.sum((X - aligned_centroids) ** 2, axis=1)
    return c, dist
Batching requires little special handling. For an input B x n x d array batch_X, with B x n batch labels batch_labels, create unique labels for each batch:
batch_k = batch_labels.max(axis=1) + 1
batch_k[1:] = batch_k[:-1]
batch_k[0] = 0
base = np.cumsum(batch_k)
batch_labels += base[:, np.newaxis]   # broadcast each batch element's offset over its n labels
So now each batch element has a unique contiguous range of labels. I.e., the first batch element will have n labels in some range [0, k0) where k0 = batch_k[0], the second will have range [k0, k0 + k1) where k1 = batch_k[1], etc.
Then just flatten the B x n x d input to B*n x d and call the same vectorized method; a short sketch of the batched call follows. Your loss function can then be derived from the final distances using the same position-array based reduction technique.
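Here is a minimal sketch of that batched call, assuming the vcentroids function and the label-offsetting code from this answer (the names batch_X and batch_labels follow the description above):

flat_X = batch_X.reshape(-1, batch_X.shape[-1])        # (B*n) x d
flat_labels = batch_labels.reshape(-1)                 # (B*n,)
centroids, dists = vcentroids(flat_X, flat_labels)     # centroids stacked across the batch
per_batch_dists = dists.reshape(batch_X.shape[0], -1)  # B x n, ready for the loss reduction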
For a detailed explanation of how the vectorization works, see my blog post.
You can vectorize the whole thing if you use a one-hot encoding for your classes and a pairwise distance trick for your norms:
import torch

B = 32
N = 1000
dim = 50
K = 25

batch_data = torch.randn((B, N, dim))
batch_labels = torch.randint(0, K, size=(B, N))

# pass num_classes explicitly so the one-hot width is K even if a class happens to be missing
batch_one_hot = torch.nn.functional.one_hot(batch_labels, num_classes=K)

# class sums via matmul, divided by per-class counts -> centroids of shape [B, K, dim]
centroids = torch.matmul(
    batch_one_hot.transpose(-1, 1).type(batch_data.dtype),
    batch_data
) / batch_one_hot.sum(1)[..., None]

# pairwise distances between every point and every centroid: [B, N, K]
norms = torch.linalg.norm(batch_data[:, :, None] - centroids[:, None], dim=-1)

# Compute the rest of your loss
# ...
A couple of things to watch out for:
You'll get a divide-by-zero for any batch that has a missing class. You can handle this by first computing the class sums (with matmul) and the counts (by summing the one-hot tensor along axis 1) separately, then masking the sums where the count is 0 and dividing the rest by their class counts; a short sketch of this follows below.
If you have a large number of classes, this will cause memory problems because the one-hot tensor will be too big. In that case, the answer from @VF1 probably makes more sense.
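Here is a minimal sketch of the masking idea from the first point above, reusing batch_one_hot and batch_data from the example (the intermediate names are illustrative):

one_hot = batch_one_hot.type(batch_data.dtype)                      # [B, N, K]
class_sums = torch.matmul(one_hot.transpose(-1, 1), batch_data)     # [B, K, dim]
class_counts = one_hot.sum(dim=1)                                   # [B, K]
centroids = class_sums / class_counts.clamp(min=1)[..., None]       # safe divide
missing = class_counts == 0                                         # [B, K] mask of empty classes, to exclude from the loss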
I'm getting bad clusters and would like to rewrite this in a way where I can just plug in any algorithm that I would like (e.g. hierarchical, k-NN, k-means, etc.).
import collections
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# takes in our text_extracts dictionary and returns clusters in an indexed list
def run_clustering(plan):
    """ Transform texts to Tf-Idf coordinates and cluster texts using K-Means """
    vectorizer = TfidfVectorizer(tokenizer=process_text,
                                 max_df=0.5,
                                 min_df=0.005,
                                 ngram_range=(1, 4),
                                 lowercase=True)
    # set the model with the vectorizer which will tokenize with our process_text function
    extracts = {}
    for page in plan.page_list:
        if len(page.text_extract) > 50:
            extracts[str(page.document_id) + '_' + str(page.page_number)] = page.text_extract
    extract_lst = [extracts[text] for text in extracts]
    tfidf_model = vectorizer.fit_transform(extract_lst)

    # determine cluster number with the silhouette coefficient
    # start with 2 as a cluster size in case the set is very small
    num_of_clusters_to_test = [2]
    # going to test 25 more sizes in equal intervals based on the number of docs we are clustering
    intervals_to_test = int(len(extracts) / 25)
    #print(intervals_to_test)
    num_of_clusters_to_test += [i for i in range(len(extracts)) if i % intervals_to_test == 0 and i != 0]

    # these variables will help us determine the max silhouette
    # iters_since_new_max is just being held so that if we aren't reaching optimal size for
    # four iterations in a row, we don't have to keep testing huge cluster sizes
    max_silhouette_coef = 0
    iters_since_new_max = 0
    good_size = 2

    # cluster with a certain cluster size and record the silhouette coefficient
    for size in num_of_clusters_to_test:
        kmeans = KMeans(n_clusters=size).fit(tfidf_model)
        label = kmeans.labels_
        sil_coeff = silhouette_score(tfidf_model, label, metric='euclidean')
        if sil_coeff > max_silhouette_coef:
            max_silhouette_coef = sil_coeff
            good_size = size
            iters_since_new_max = 0
        else:
            iters_since_new_max += 1
            if iters_since_new_max > 4:
                break

    # finally cluster with the good size we want
    km_model = KMeans(n_clusters=good_size)
    km_model.fit(tfidf_model)
    clustering = collections.defaultdict(list)
    for idx, label in enumerate(km_model.labels_):
        clustering[label].append(idx)
    return clustering
I left as many comments as I could to help you follow what I am going for. Can anyone help me improve this?
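One possible way to make the algorithm pluggable, sketched here as a suggestion rather than a drop-in fix: keep the vectorization and the cluster-size search as they are, and move the final clustering step into a helper that accepts any scikit-learn style estimator exposing fit_predict (the function name cluster_documents is illustrative):

from collections import defaultdict

def cluster_documents(matrix, clusterer):
    """Cluster the rows of matrix with any estimator that exposes fit_predict."""
    labels = clusterer.fit_predict(matrix)
    clustering = defaultdict(list)
    for idx, label in enumerate(labels):
        clustering[label].append(idx)
    return clustering

# e.g. cluster_documents(tfidf_model, KMeans(n_clusters=good_size))
# hierarchical algorithms such as AgglomerativeClustering need a dense matrix,
# e.g. cluster_documents(tfidf_model.toarray(), AgglomerativeClustering(n_clusters=good_size))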
You know KMeans is for numeric data only, right? I mean, don't expect it to work on categorical (labeled) data. With KMeans, you calculate the distance of each point to the nearest centroid (cluster center) and add the point to that cluster. What is the 'distance' between apple, banana, and watermelon? It doesn't make sense! So, just make sure you are running your KMeans over numeric data.
import numpy as np
import pandas as pd
from pylab import plot, show
from numpy import vstack, array
from scipy.cluster.vq import kmeans, vq
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import seaborn as sns

df = pd.read_csv('foo.csv')

# get only numeric fields from your dataframe
df = df.sample(frac=0.1, replace=True, random_state=1)
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
for col in newdf.columns:
    print(col)

# your independent variables
X = newdf[['NumericField1', 'NumericField2', 'NumericField3', 'list_price']]
# your dependent variable
y = newdf['DependentVariable']

# take all numeric features from the corr exercise and turn them into an array
# so we can feed it into a clustering algorithm
data = np.asarray(newdf)
X = data

# computing K-Means with K = 100 (100 clusters)
centroids, _ = kmeans(data, 100)
# assign each sample to a cluster
idx, _ = vq(data, centroids)

# some plotting using numpy's logical indexing
plot(data[idx==0,0], data[idx==0,1], 'ob',
     data[idx==1,0], data[idx==1,1], 'oy',
     data[idx==2,0], data[idx==2,1], 'or',
     data[idx==3,0], data[idx==3,1], 'og',
     data[idx==4,0], data[idx==4,1], 'om')
plot(centroids[:,0], centroids[:,1], 'sg', markersize=8)
show()

details = [(name, cluster) for name, cluster in zip(df.brand, idx)]
for detail in details:
    print(detail)
I've found Affinity Propagation to produce much tighter clusters than KMeans can achieve. Here is an example.
# Run Affinity Propagation Experiment
from sklearn.cluster import AffinityPropagation

af = AffinityPropagation(preference=20).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % n_clusters_)
# plt.scatter(X[:, 0], X[:, 1], s=50)

# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.close('all')
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Try these concepts and see how you get along.
I figured out that sklearn's KMeans uses synthetic points (not actual data points) as cluster centroids.
So far, I have found no option in sklearn to use real data points as centroids.
I am currently calculating the data point that is closest to each centroid, but thought there might be an easier way.
I am not necessarily restricted to KMeans, by the way.
A Google search around clustering with real data points as centroids wasn't fruitful either.
Has anyone had the same problem before?
import numpy as np
from sklearn.cluster import KMeans
import math

def distance(a, b):
    dist = math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)
    return dist

x = np.random.rand(10)
y = np.random.rand(10)
xy = np.array((x, y)).T

kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids = kmeans.cluster_centers_

print(np.where(xy == centroids[0])[0])

for c in centroids:
    nearest = min(xy, key=lambda x: distance(x, c))
    print('centroid', c)
    print('nearest data point to centroid', nearest)
Actually, sklearn.cluster.KMeans now allows you to pass custom initial centroids.
See the init parameter here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
or in the source code for sklearn KMeans here: https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/cluster/_kmeans.py#L649
"If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers."
I hope that works. Please try.
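For illustration, a small sketch of that init option, reusing the xy array from the question. Note that KMeans still refines the centers during fitting, so the final cluster_centers_ are not guaranteed to stay on the chosen data points:

import numpy as np
from sklearn.cluster import KMeans

x = np.random.rand(10)
y = np.random.rand(10)
xy = np.array((x, y)).T

init_points = xy[[0, 5]]   # two real data points, shape (n_clusters, n_features)
kmeans = KMeans(n_clusters=2, init=init_points, n_init=1).fit(xy)
print(kmeans.cluster_centers_)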
Centroids do not have to be points in your set. Since you are in a 2D space, you will find centroids with 2D coordinates. If you want to print the distances between each centroid and each point, you can:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

x = np.random.rand(10)
y = np.random.rand(10)
xy = np.array((x, y)).T

kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids = kmeans.cluster_centers_

for centroid in centroids:
    print(f'List of distances between centroid {centroid} and each point:\n'
          f'{np.linalg.norm(centroid - xy, axis=1)}\n')
List of distances between centroid [0.87236496 0.74034618] and each point:
[0.21056113 0.84946149 0.83381298 0.31347176 0.40811323 0.85442416
0.44043437 0.66736601 0.55282619 0.14813826]
List of distances between centroid [0.37243631 0.37851987] and each point:
[0.77005698 0.29192851 0.25249753 0.60881231 0.2219568 0.24264077
0.27374379 0.39968813 0.31728732 0.58604271]
As you can see, the prediction corresponds to the centroid to which the distance is minimal:
kmeans.predict(xy)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])

distances = np.vstack([np.linalg.norm(centroids[0] - xy, axis=1),
                       np.linalg.norm(centroids[1] - xy, axis=1)])
distances.argmin(axis=0)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
Let's plot the data: centroids are square-shaped and points are circle-shaped, with sizes inversely proportional to the distance from their centroid.
Although the original figure (not reproduced here) was plotted from other random data points, I hope it helps; a rough sketch of such a plot follows.
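This is a rough sketch of how such a plot could be generated from the code above (the marker sizing is illustrative):

import matplotlib.pyplot as plt

labels = kmeans.predict(xy)
dist_to_centroid = np.linalg.norm(xy - centroids[labels], axis=1)
# points as circles, sized inversely to the distance from their assigned centroid
plt.scatter(xy[:, 0], xy[:, 1], s=20 / (dist_to_centroid + 0.05), marker='o')
# centroids as squares
plt.scatter(centroids[:, 0], centroids[:, 1], marker='s', c='red')
plt.show()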
I've been through the same question: how to find the sample within each cluster that minimizes inertia. I made this function:
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def index_representative_points(km, X):
    ret = []
    for k in range(km.n_clusters):
        mask = (km.labels_ == k).nonzero()[0]
        s = []
        for _ in pairwise_distances_chunked(X=X[mask]):
            s.append(np.square(_).sum(axis=1))
        ret.append(mask[np.argmin(np.concatenate(s))])
    return np.array(ret)
And it can be used like this:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=100, centers=3, cluster_std=0.60, random_state=0)
km = KMeans(n_clusters=3, random_state=0).fit(X)
index_representative_points(km, X)
>>> array([89, 25, 28], dtype=int64)
EDIT:
For very large datasets, the function is very slow. But it can be proven that the point within the cluster that minimizes the inertia is the one closest to the centroid. Hence, this second version:
from sklearn.metrics import pairwise_distances_argmin

def index_representative_points(km, X):
    ret = []
    for k in range(km.n_clusters):
        mask = (km.labels_ == k).nonzero()[0]
        centroid = np.mean(X[mask], axis=0)
        i0 = mask[pairwise_distances_argmin(centroid[None, :], X[mask])[0]]
        ret.append(i0)
    return np.array(ret)
So I'm running a KNN in order to create clusters. From each cluster, I would like to obtain the medoid of the cluster.
I'm employing a fractional distance metric in order to calculate distances:
dist_f(x, y) = ( sum_{i=1..d} |x^i - y^i|^f )^(1/f)

where d is the number of dimensions, the first data point's coordinates are x^i, the second data point's coordinates are y^i, and f is an arbitrary number between 0 and 1.
I would then calculate the medoid as:

medoid = argmin_{x in S} sum_{y in S} δ(x, y)

where S is the set of data points, and δ is the absolute value of the distance metric used above.
I've looked online to no avail trying to find implementations of medoid (even with other distance metrics), but most things were specifically k-means or k-medoids, which [I think] is relatively different from what I want.
Essentially this boils down to me being unable to translate the math into effective programming. Any help or pointers in the right direction would be much appreciated! Here's a short list of what I have so far:
I have figured out how to calculate the fractional distance metric (the first equation) so I think I'm good there.
I know numpy has an argmin() function (documented here).
Extra points for increased efficiency without loss of accuracy (I'm trying not to brute-force it by calculating every single pairwise fractional distance, because the number of point pairs grows quadratically).
compute pairwise distance matrix
compute column or row sum
argmin to find medoid index
i.e. numpy.argmin(distMatrix.sum(axis=0)) or similar.
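For concreteness, here is a minimal brute-force sketch of those three steps with the fractional metric written out explicitly (the value of f is illustrative; this is O(n^2) in the number of points):

import numpy as np

def fractional_dist(p, q, f=0.5):
    # fractional distance metric: (sum_i |p_i - q_i|^f) ** (1/f), with 0 < f < 1
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)) ** f) ** (1.0 / f)

def medoid_index(points, f=0.5):
    points = np.asarray(points)
    n = len(points)
    dist_matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = fractional_dist(points[i], points[j], f)
            dist_matrix[i, j] = dist_matrix[j, i] = d
    # the medoid minimizes the summed distance to all other points
    return np.argmin(dist_matrix.sum(axis=0))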
So I've accepted the answer here, but I thought I'd provide my implementation in case anyone else is trying to do something similar:
(1) This is the distance function:
def fractional(p_coord_array, q_coord_array):
    # f is an arbitrary value, but must be greater than zero and
    # less than one. In this case, I used 3/10. I took advantage
    # of the difference of cubes in this case, so that I wouldn't
    # encounter an overflow error.
    a = np.sum(np.array(p_coord_array, dtype=np.float64))
    b = np.sum(np.array(q_coord_array, dtype=np.float64))
    a2 = np.sum(np.power(p_coord_array, 2))
    ab = np.sum(p_coord_array) * np.sum(q_coord_array)
    b2 = np.sum(np.power(q_coord_array, 2))   # note: originally used p_coord_array, presumably a typo
    diffab = a - b
    suma2abb2 = a2 + ab + b2
    temp_dist = abs(diffab * suma2abb2)
    temp_dist = np.power(temp_dist, 1./10)
    dist = np.power(temp_dist, 10./3)
    return dist
(2) The medoid function (if the length of the dataset was less than 6000 [if greater than that, I ran into overflow errors... I'm still working on that bit to be perfectly honest...]):
def medoid(dataset):
    point = []
    w = len(dataset)
    if len(dataset) < 6000:
        h = len(dataset)
        dist_matrix = [[0 for x in range(w)] for y in range(h)]
        list_combinations = [(counter_1, counter_2, data_1, data_2)
                             for counter_1, data_1 in enumerate(dataset)
                             for counter_2, data_2 in enumerate(dataset)
                             if counter_1 < counter_2]
        for counter_3, tuple in enumerate(list_combinations):
            temp_dist = fractional(tuple[2], tuple[3])
            dist_matrix[tuple[0]][tuple[1]] = abs(temp_dist)
            dist_matrix[tuple[1]][tuple[0]] = abs(temp_dist)
        # following the accepted answer: the medoid minimizes the summed distances
        point = dataset[np.argmin(np.array(dist_matrix).sum(axis=0))]
    return point
Any questions, feel free to comment!
If you don't mind using brute force this might help:
def calc_medoid(X, Y, f=2):
    n = len(X)
    m = len(Y)
    dist_mat = np.zeros((m, n))
    # compute distance matrix
    for j in range(n):
        center = X[j, :]
        for i in range(m):
            if i != j:
                dist_mat[i, j] = np.linalg.norm(Y[i, :] - center, ord=f)
    medoid_id = np.argmin(dist_mat.sum(axis=0))  # sum over y
    return medoid_id, X[medoid_id, :]
Here is an example of computing a medoid for a single cluster with Euclidean distance.
import numpy as np, pandas as pd, matplotlib.pyplot as plt

a, b, c, d = np.array([0, 1]), np.array([1, 3]), np.array([4, 2]), np.array([3, 1.5])
vCenroid = np.mean([a, b, c, d], axis=0)

def GetMedoid(vX):
    vMean = np.mean(vX, axis=0)                               # compute centroid
    return vX[np.argmin([sum((x - vMean)**2) for x in vX])]   # pick the point closest to the centroid

vMedoid = GetMedoid([a, b, c, d])

print(f'centroid = {vCenroid}')
print(f'medoid = {vMedoid}')

df = pd.DataFrame([a, b, c, d], columns=['x', 'y'])
ax = df.plot.scatter('x', 'y', grid=True, title='Centroid in 2D plane', s=100);

plt.plot(vCenroid[0], vCenroid[1], 'ro', ms=10);   # plot centroid as a red circle
plt.plot(vMedoid[0], vMedoid[1], 'rx', ms=20);     # plot medoid as a red x
You can also use the following package to compute medoid for one or more clusters
!pip -q install scikit-learn-extra > log
from sklearn_extra.cluster import KMedoids
GetMedoid = lambda vX: KMedoids(n_clusters=1).fit(vX).cluster_centers_
GetMedoid([a, b, c, d])[0]
I would say that you just need to compute the median.
np.median(np.asarray(points), axis=0)
Your median is then the point with the biggest centrality.
Note: if you are using a distance other than Euclidean, this doesn't hold.
MATLAB has a nice silhouette function to help evaluate the number of clusters for k-means. Is there an equivalent for Python's Numpy/Scipy as well?
I present below a sample silhouette implementation in both MATLAB and Python/Numpy (keep in mind that I am more fluent in MATLAB):
1) MATLAB
function s = mySilhouette(X, IDX)
    %# X  : matrix of size N-by-p, data where rows are instances
    %# IDX: vector of size N, cluster index of each instance (starting from 1)
    %# s  : vector of size N, silhouette score value of each instance

    N = size(X,1);            %# number of instances
    K = numel(unique(IDX));   %# number of clusters

    %# compute pairwise distance matrix
    D = squareform( pdist(X,'euclidean').^2 );

    %# indices belonging to each cluster
    kIndices = accumarray(IDX, 1:N, [K 1], @(x){sort(x)});

    %# compute a,b,s for each instance
    %# a(i): average distance from i to all other data within the same cluster.
    %# b(i): lowest average dist from i to the data of another single cluster
    a = zeros(N,1);
    b = zeros(N,1);
    for i=1:N
        ind = kIndices{IDX(i)}; ind = ind(ind~=i);
        a(i) = mean( D(i,ind) );
        b(i) = min( cellfun(@(ind) mean(D(i,ind)), kIndices([1:K]~=IDX(i))) );
    end
    s = (b-a) ./ max(a,b);
end
To emulate the plot from the silhouette function in MATLAB, we group the silhouette values by cluster, sort within each, then plot the bars horizontally. MATLAB adds NaNs to separate the bars from the different clusters; I found it easier to simply color-code the bars:
%# sample data
load fisheriris
X = meas;
N = size(X,1);

%# cluster and compute silhouette score
K = 3;
[IDX,C] = kmeans(X, K, 'distance','sqEuclidean');
s = mySilhouette(X, IDX);

%# plot
[~,ord] = sortrows([IDX s],[1 -2]);
indices = accumarray(IDX(ord), 1:N, [K 1], @(x){sort(x)});
ytick = cellfun(@(ind) (min(ind)+max(ind))/2, indices);
ytickLabels = num2str((1:K)','%d');       %#'
h = barh(1:N, s(ord),'hist');
set(h, 'EdgeColor','none', 'CData',IDX(ord))
set(gca, 'CLim',[1 K], 'CLimMode','manual')
set(gca, 'YDir','reverse', 'YTick',ytick, 'YTickLabel',ytickLabels)
xlabel('Silhouette Value'), ylabel('Cluster')

%# compare against SILHOUETTE
figure, silhouette(X,IDX)
2) Python
And here is what I came up with in Python:
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn import datasets
import matplotlib.pyplot as plt
from matplotlib import cm

def silhouette(X, cIDX):
    """
    Computes the silhouette score for each instance of a clustered dataset,
    which is defined as:
        s(i) = (b(i)-a(i)) / max{a(i),b(i)}
    with:
        -1 <= s(i) <= 1

    Args:
        X    : A M-by-N array of M observations in N dimensions
        cIDX : array of len M containing cluster indices (starting from zero)

    Returns:
        s    : silhouette value of each observation
    """
    N = X.shape[0]            # number of instances
    K = len(np.unique(cIDX))  # number of clusters

    # compute pairwise distance matrix
    D = squareform(pdist(X))

    # indices belonging to each cluster
    kIndices = [np.flatnonzero(cIDX==k) for k in range(K)]

    # compute a,b,s for each instance
    a = np.zeros(N)
    b = np.zeros(N)
    for i in range(N):
        # instances in same cluster other than instance itself
        a[i] = np.mean( [D[i][ind] for ind in kIndices[cIDX[i]] if ind!=i] )
        # instances in other clusters, one cluster at a time
        b[i] = np.min( [np.mean(D[i][ind])
                        for k,ind in enumerate(kIndices) if cIDX[i]!=k] )
    s = (b-a)/np.maximum(a,b)

    return s

def main():
    # load Iris dataset
    data = datasets.load_iris()
    X = data['data']

    # cluster and compute silhouette score
    K = 3
    C, cIDX = kmeans2(X, K)
    s = silhouette(X, cIDX)

    # plot
    order = np.lexsort((-s,cIDX))
    indices = [np.flatnonzero(cIDX[order]==k) for k in range(K)]
    ytick = [(np.max(ind)+np.min(ind))/2 for ind in indices]
    ytickLabels = ["%d" % x for x in range(K)]
    cmap = cm.jet( np.linspace(0,1,K) ).tolist()
    clr = [cmap[i] for i in cIDX[order]]

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.barh(range(X.shape[0]), s[order], height=1.0,
            edgecolor='none', color=clr)
    ax.set_ylim(ax.get_ylim()[::-1])
    plt.yticks(ytick, ytickLabels)
    plt.xlabel('Silhouette Value')
    plt.ylabel('Cluster')
    plt.show()

if __name__ == '__main__':
    main()
Update:
As noted by others, scikit-learn has since added its own silhouette metric implementation. To use it in the above code, replace the call to the custom-defined silhouette function with:
from sklearn.metrics import silhouette_samples
...
#s = silhouette(X, cIDX)
s = silhouette_samples(X, cIDX) # <-- scikit-learn function
...
the rest of the code can still be used as-is to generate the exact same plot.
I've looked, but I can't find a NumPy/SciPy silhouette function; I even looked in pylab and matplotlib. I think you'll have to implement it yourself.
I can point you to http://orange.biolab.si/trac/browser/trunk/orange/orngClustering.py?rev=7462. It has a few functions which implement a silhouette function.
Hope this helps.
This is a little late, but for what it is worth, it appears that scikit-learn now implements a silhouette function. See their documentation page or view the source code directly.
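For completeness, a minimal example of the scikit-learn functions mentioned in these answers (silhouette_score for the mean value, silhouette_samples for per-sample values):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score, silhouette_samples

X = load_iris().data
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))        # mean silhouette over all samples
print(silhouette_samples(X, labels)[:5])  # per-sample silhouette values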