Excuse me if these questions are too simple, but I am getting into machine learning under time constraints.
I must apply a mixed classification to a df with the following steps:
1. Apply KMeans with 50 clusters.
2. From the barycenters and labels obtained for each cluster, display a dendrogram in order to choose the right k.
3. Apply an HCA algorithm to the barycenters obtained in step 1, with the number of clusters chosen in step 2.
4. Calculate the barycenters of each new group.
5. Use the calculated barycenters to consolidate the clusters with the KMeans algorithm.
What I do is:
from sklearn.cluster import KMeans

clf = KMeans(n_clusters=50)
clf.fit(df)  # fit before reading the attributes below
centroids = clf.cluster_centers_
labels = clf.labels_
From there I get confused with the dendrogram. So far I have only used it on the whole df, and I am not certain how to involve the barycenters and labels from the KMeans correctly.
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

Z = linkage(df, method='ward', metric='euclidean')
dendrogram(Z, labels=df.index, leaf_rotation=90., color_threshold=0)
plt.show()
Last but not least, I do not know how to get the barycenters in the AgglomerativeClustering.
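My best guess for the missing pieces, pieced together from the docs, is something like the following sketch, built on the df and centroids from above (the k=5 is just a placeholder for the value I would read off the dendrogram, and I'm not sure it is correct):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# step 2: build the dendrogram on the 50 KMeans centroids, not on the whole df
Z = linkage(centroids, method='ward', metric='euclidean')
dendrogram(Z, leaf_rotation=90., color_threshold=0)
plt.show()

# step 3: run the HCA on the centroids with the k chosen from the dendrogram
k = 5  # placeholder: the value read off the dendrogram
hca = AgglomerativeClustering(n_clusters=k, linkage='ward')
hca_labels = hca.fit_predict(centroids)

# step 4: AgglomerativeClustering has no cluster_centers_, so compute the barycenters manually
new_centers = np.array([centroids[hca_labels == i].mean(axis=0) for i in range(k)])

# step 5: consolidate with KMeans initialized at those barycenters
final = KMeans(n_clusters=k, init=new_centers, n_init=1).fit(df)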
Any clarification would be of help. Thanks in advance!
I have a dataframe with X and Y axis values. They don't have any labels. It looks as shown below:
X-COORDINATE    Y-COORDINATE
12              34
99              42
90              27
49              64
Is it possible to use KMeans for clustering the data?
How do I get the labels and plot the data on a graph for each cluster?
Yes, you can use k-means even if you don't have labels, because k-means is an unsupervised method, but...
First of all you need to scale your data, because k-means is a distance-based algorithm that uses distances between data points to determine their similarity. More about that here.
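A minimal scaling step could look like this (assuming your_dataframe is numeric):

from sklearn.preprocessing import StandardScaler

# standardize each feature to zero mean and unit variance before clustering
scaled_data = StandardScaler().fit_transform(your_dataframe)

You could then pass scaled_data (instead of the raw dataframe) to the k-means code below.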
I found this tutorial on clustering very useful; you could start with that. It also describes how to first use a silhouette or elbow plot to decide on a suitable number of clusters.
It should look somewhat like this:
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=n_clusters) # you can get n_clusters from silhouette/elbow plot or just try out different numbers
kmeans_model.fit(your_dataframe)
labels = kmeans_model.predict(your_dataframe)
print(labels)
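To plot each cluster from the question's two-column dataframe, a sketch along these lines should work (column names taken from the example above; the centroid markers are optional):

import matplotlib.pyplot as plt

# color each point by its cluster label and mark the cluster centers
plt.scatter(your_dataframe['X-COORDINATE'], your_dataframe['Y-COORDINATE'], c=labels, cmap='viridis')
plt.scatter(kmeans_model.cluster_centers_[:, 0], kmeans_model.cluster_centers_[:, 1], c='red', marker='x')
plt.xlabel('X-COORDINATE')
plt.ylabel('Y-COORDINATE')
plt.show()

Note that if you scaled the data before fitting, the cluster centers will be in the scaled space.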
K-means does not always perform well; if you want better results, you could also try other algorithms like DBSCAN, HDBSCAN, or agglomerative clustering. Which one you should choose always depends on your data.
How do we measure the accuracy of a k-means clustering algorithm (say, generate a confusion matrix), since the cluster indices it assigns are probably a permutation of the original labels?
I don't know exactly what you mean either. Your original labels are presumably the ground-truth labeling. The clustering result produced by k-means is an integer label per sample, ranging over the k clusters you asked the algorithm to find.
I typically use the pandas.crosstab function to visualize how the ground-truth labeling lines up with the k-means labeling via cross-tabulation.
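For example, the cross-tabulation itself could be built like this (y_true and kmeans_labels are placeholder names for the ground-truth labels and the k-means assignments):

import pandas as pd

# rows: ground-truth labels, columns: k-means cluster ids
crosstab_groundtruth_kmeans = pd.crosstab(y_true, kmeans_labels)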
For better visualization, you may want to use the following:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(30,10))
# plot the cross-tabulation as a heatmap
ax = sns.heatmap(crosstab_groundtruth_kmeans.T,
                 square=True, annot=True, fmt='.2f')
ax.set_yticklabels(ax.get_yticklabels(), rotation=0);
Output: (heatmap of the cross-tabulation)
Good luck!~
k-means is a clustering (grouping) algorithm, not a classification algorithm, hence it is not really feasible to check and analyze accuracy. The main idea of k-means is to find clusters of data points that maximize the between-cluster distance; it has no concept of labels, and hence you can't get an accuracy matrix. More insights: https://scikit-learn.org/stable/modules/clustering.html#k-means
The "accuracy" (assuming you want to see which cluster each data point ends up in) has to be analyzed manually using the predict method from sklearn.cluster.KMeans, which "Predicts the closest cluster each sample in X belongs to" (from the documentation).
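A minimal illustration of that (X and X_new are placeholders for the training data and new samples):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3).fit(X)   # fit on the training data
print(km.labels_)                  # cluster assigned to each training sample
print(km.predict(X_new))           # closest cluster for each new sample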
I was recently introduced to clustering techniques because I was given the task of finding "profiles" or "patterns" of professors at my university based on a survey they had to answer. I've been studying some of the available options to perform this, and I came across the k-means clustering algorithm. Since most of my data is categorical I had to perform a one-hot encoding (transforming each categorical variable into 0-1 single-column vectors), and right after that I did a correlation analysis in Excel in order to exclude some redundant variables. After this I used Python with the pandas, numpy, matplotlib and sklearn libraries to perform an optimal-cluster-number check (elbow method) and then, finally, run k-means.
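For reference, the one-hot encoding step can also be done directly in pandas; a sketch (categorical_columns is a placeholder for the survey's categorical column names):

import pandas as pd

# turn each categorical column into 0/1 indicator columns
df_encoded = pd.get_dummies(df, columns=categorical_columns)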
This is the code I used to import the .csv with the data from the professors survey and to run the elbow method:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# loads the .csv dataframe (DF)
df = pd.read_csv('./dados_selecionados.csv', sep=",")
# prints the df
print(df)

# list for the sums of squared distances
SQD = []
# maximum cluster number to test in the elbow method
num_clusters = 10
# runs k-means for each cluster number
for k in range(1, num_clusters+1):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(df)
    SQD.append(kmeans.inertia_)

# sets up the plot and shows it
plt.figure(figsize=(16, 8))
plt.plot(range(1, num_clusters+1), SQD, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances from each point to its cluster center')
plt.title('Elbow method')
plt.show()
According to the figure I decided to go with 3 clusters. After that I ran k-means for 3 clusters and sent the cluster data to a .xlsx file with the following code:
# runs k-means
kmeans = KMeans(n_clusters=3, max_iter=100, verbose=2)
clusters = kmeans.fit_predict(df)

# adds the cluster assignment as a column in the df
df['cluster'] = clusters

# saves the df as a .xlsx
df.to_excel("3_clusters_k_means_selecionado.xlsx")

# shows the resulting df
print(df)

# shows the first rows of each separate cluster
for c in sorted(set(clusters)):
    print(df[df['cluster'] == c].head(10))
My main doubt right now is how to perform a reasonable analysis of each cluster's data in order to understand how the clusters were formed. I began computing the mean of each variable and also using conditional formatting in Excel to see if some patterns would show up, and they kind of did actually, but I think this is not the best option.
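In pandas terms, what I've been doing manually in Excel amounts to roughly this (using the 'cluster' column added above):

# mean of each variable per cluster, plus cluster sizes
print(df.groupby('cluster').mean())
print(df['cluster'].value_counts())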
And I'm also going to use this post to ask for any recommendations on the whole method. Maybe some of the steps I took were not the best.
If you're using scikit-learn's KMeans, there is a parameter called n_init, which is the number of times the k-means algorithm will be run with different centroid seeds. By default it is set to 10, so essentially it does 10 different runs and outputs the single result with the lowest sum of squares. Another parameter you could play with is random_state, which is a seed number used to initialize the centroids randomly. This may give you better reproducibility because you choose the seed number, so if you see an optimal result you know which seed corresponds to that result.
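For example (the parameter values are only illustrative):

from sklearn.cluster import KMeans

# 10 restarts with different centroid seeds; random_state makes the result reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(df)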
You may want to consider testing several different clustering algos. Here is a list of some of the popular ones.
https://scikit-learn.org/stable/modules/clustering.html
I think there are over 100 different clustering algos out there now.
Also, some clustering algos will automatically select the optimal number of clusters for you, so you don't have to "guess". I say guess because the silhouette and elbow techniques will help quantify k for you, but you yourself still need to do some kind of guess-work.
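For example, a density-based method such as DBSCAN infers the number of clusters from the data; a sketch (eps and min_samples are illustrative values, and X is a scaled feature matrix):

from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_found = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise points
print(n_found, "clusters found")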
I've been trying out the k-means clustering implementation in scipy. Are there any standard, well-defined metrics that could be used to measure the quality of the clusters generated?
I.e., I have the expected labels for the data points that are clustered by k-means. Now, once I get the generated clusters, how do I evaluate their quality with respect to the expected labels?
I am doing this very thing at the moment with Spark's KMeans. I am using:
The sum of squared distances of points to their nearest center (implemented in computeCost()).
The unbalanced factor (see "Unbalanced factor of KMeans?" for an implementation and "Understanding the quality of the KMeans algorithm" for an explanation).
Both quantities indicate a better clustering when they are small (the smaller, the better).
K-means attempts to minimize the sum of squared distances to cluster centers. I would compare the value of this sum for the k-means clusters with its value for the clusters you get if you group the points by their expected labels.
There are two possibilities for the result. If the k-means sum of squares is larger than that of the expected-label clustering, then your k-means implementation is buggy or did not start from a good set of initial cluster assignments, and you could think about increasing the number of random starts you use, or debug it. If the k-means sum of squares is smaller than the expected-label sum of squares, and the k-means clusters are not very similar to the expected-label clustering (that is, two points chosen at random from the same expected-label cluster are often not in the same k-means cluster, or vice versa), then the sum of squares from cluster centers is not a good way of splitting your points up into clusters, and you need to use a different distance function, look at different attributes, or use a different sort of clustering.
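A small sketch of that comparison (assuming a numpy array X of points and two label arrays, kmeans_labels and expected_labels; the helper name is just for illustration):

import numpy as np

def within_cluster_sse(X, labels):
    # sum of squared distances of each point to the centroid of its group
    labels = np.asarray(labels)
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

print(within_cluster_sse(X, kmeans_labels))
print(within_cluster_sse(X, expected_labels))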
In your case, when you do have the samples' true labels, validation is very easy.
First of all, compute the confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix). Then derive from it all the relevant measures: true positives, false negatives, false positives and true negatives. From those you can find the precision, recall, miss rate, etc.
Make sure you understand the meaning of all of the above. They basically tell you how well your clustering predicted/recognized the true nature of your data.
If you're using python, just use the sklearn package:
http://scikit-learn.org/stable/modules/model_evaluation.html
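Since cluster ids are only defined up to a permutation, the permutation-invariant external measures in sklearn.metrics are convenient here; a sketch (y_true and cluster_labels are placeholder names):

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

print(adjusted_rand_score(y_true, cluster_labels))         # 1.0 means identical partitions
print(normalized_mutual_info_score(y_true, cluster_labels))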
In addition, it's nice to run some internal validation, to see how well your clusters are separated. There are known internal validity measures, like:
Silhouette
DB index
Dunn index
Calinski-Harabasz measure
Gamma score
Normalized Cut
etc.
Read more here: Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, Iñigo Perona, "An extensive comparative study of cluster validity indices".
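Several of these internal measures are available directly in sklearn.metrics; a sketch (X is the feature matrix, labels the cluster assignments):

from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

print(silhouette_score(X, labels))          # higher is better
print(calinski_harabasz_score(X, labels))   # higher is better
print(davies_bouldin_score(X, labels))      # lower is better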
Today I'm trying to learn something about k-means. I have understood the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the right k, but I do not understand how to use it with scikit-learn?! In scikit-learn I'm clustering things this way:
kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
kmeans.fit(data)
So should I do this several times for n_clusters = 1...n and watch the error rate to get the right k? I think this would be stupid and would take a lot of time?!
If the true labels are not known in advance (as in your case), then k-means clustering can be evaluated using either the elbow criterion or the silhouette coefficient.
Elbow Criterion Method:
The idea behind the elbow method is to run k-means clustering on a given dataset for a range of values of k (num_clusters, e.g. k=1 to 10), and for each value of k calculate the sum of squared errors (SSE).
After that, plot a line graph of the SSE for each value of k. If the line graph looks like an arm, the "elbow" on the arm (the point where the curve bends) is the value of the optimal k (number of clusters).
Here, we want to minimize SSE. SSE tends to decrease toward 0 as we increase k (and SSE is 0 when k is equal to the number of data points in the dataset, because then each data point is its own cluster, and there is no error between it and the center of its cluster).
So the goal is to choose a small value of k that still has a low SSE, and the elbow usually represents where we start to have diminishing returns by increasing k.
Let's consider the iris dataset:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris['feature_names'])
#print(X)
data = X[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']]

sse = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    sse[k] = kmeans.inertia_  # Inertia: sum of squared distances of samples to their closest cluster center

plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
Plot for above code:
We can see in the plot that 3 is the optimal number of clusters for the iris dataset, which is indeed correct.
Silhouette Coefficient Method:
From the sklearn documentation:
A higher Silhouette Coefficient score relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
a: The mean distance between a sample and all other points in the same class.
b: The mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient s for a single sample is then given as:
s = (b - a) / max(a, b)
Now, to find the optimal value of k for KMeans, loop over n_clusters from 2 to n and calculate the Silhouette Coefficient for each value.
A higher Silhouette Coefficient indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
from sklearn.metrics import silhouette_score
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
y = load_iris().target

for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster).fit(X)
    label = kmeans.labels_
    sil_coeff = silhouette_score(X, label, metric='euclidean')
    print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, sil_coeff))
Output -
For n_clusters=2, The Silhouette Coefficient is 0.680813620271
For n_clusters=3, The Silhouette Coefficient is 0.552591944521
For n_clusters=4, The Silhouette Coefficient is 0.496992849949
For n_clusters=5, The Silhouette Coefficient is 0.488517550854
For n_clusters=6, The Silhouette Coefficient is 0.370380309351
For n_clusters=7, The Silhouette Coefficient is 0.356303270516
For n_clusters=8, The Silhouette Coefficient is 0.365164535737
For n_clusters=9, The Silhouette Coefficient is 0.346583642095
For n_clusters=10, The Silhouette Coefficient is 0.328266088778
As we can see, n_clusters=2 has the highest Silhouette Coefficient. This means that 2 should be the optimal number of clusters, right?
But here's the catch.
The iris dataset has 3 species of flower, which contradicts 2 as the optimal number of clusters. So despite n_clusters=2 having the highest Silhouette Coefficient, we would consider n_clusters=3 the optimal number of clusters because:
the iris dataset has 3 species (most important), and
n_clusters=3 has the 2nd highest Silhouette Coefficient.
So n_clusters=3 is the optimal number of clusters for the iris dataset.
Choosing the optimal number of clusters depends on the type of dataset and the problem we are trying to solve. But in most cases, taking the highest Silhouette Coefficient will yield an optimal number of clusters.
Hope it helps!
The elbow criterion is a visual method. I have not yet seen a robust mathematical definition of it.
But k-means is a pretty crude heuristic, too.
So yes, you will need to run k-means with k=1...kmax, then plot the resulting SSQ and decide upon an "optimal" k.
There exist advanced versions of k-means such as X-means that will start with k=2 and then increase it until a secondary criterion (AIC/BIC) no longer improves. Bisecting k-means is an approach that also starts with k=2 and then repeatedly splits clusters until k=kmax. You could probably extract the interim SSQs from it.
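As a rough illustration of the bisecting variant, a sketch (assuming scikit-learn >= 1.1, which ships sklearn.cluster.BisectingKMeans; X is a feature matrix):

from sklearn.cluster import BisectingKMeans

# repeatedly splits an existing cluster until n_clusters is reached
bkm = BisectingKMeans(n_clusters=8, random_state=0).fit(X)
print(bkm.inertia_)  # final sum of squared distances for this k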
Either way, I have the impression that in any actual use case where k-means is really good, you actually know the k you need beforehand. In these cases, k-means is not so much a "clustering" algorithm as a vector quantization algorithm, e.g. reducing the number of colors of an image to k (where you would often choose k to be e.g. 32, because that is 5 bits of color depth and can be stored in a bit-compressed way), or bag-of-visual-words approaches, where you would choose the vocabulary size manually. A popular value seems to be k=1000. You then don't really care much about the quality of the "clusters"; the main point is to be able to reduce an image to a 1000-dimensional sparse vector.
The performance of a 900-dimensional or an 1100-dimensional representation will not be substantially different.
For actual clustering tasks, i.e. when you want to analyze the resulting clusters manually, people usually use more advanced methods than k-means. K-means is more of a data simplification technique.
This answer is inspired by what OmPrakash has written. It contains code to plot both the SSE and the Silhouette Score. What I've given is a general code snippet you can follow in all cases of unsupervised learning where you don't have the labels and want to know the optimal number of clusters. There are two criteria: 1) the sum of squared errors (SSE) and 2) the silhouette score. You can follow OmPrakash's answer for the explanation; he's done a good job at that.
Assume your dataset is a data frame df1. Here I have used a different dataset, just to show how we can use both criteria to help decide the optimal number of clusters. Here I think 6 is the correct number of clusters.
Then:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
elbow = []
ss = []
for n_clusters in range_n_clusters:
    # iterating through cluster sizes
    clusterer = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = clusterer.fit_predict(df1)
    # finding the average silhouette score
    silhouette_avg = silhouette_score(df1, cluster_labels)
    ss.append(silhouette_avg)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
    # finding the SSE
    elbow.append(clusterer.inertia_)  # Inertia: sum of squared distances of samples to their closest cluster center

fig = plt.figure(figsize=(14,7))
fig.add_subplot(121)
plt.plot(range_n_clusters, elbow, 'b-', label='Sum of squared error')
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.legend()

fig.add_subplot(122)
plt.plot(range_n_clusters, ss, 'b-', label='Silhouette Score')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.legend()
plt.show()