I use the elbow method to find an appropriate number of KMeans clusters when I am working in Python with sklearn. I want to do the same when I'm working in PySpark. I am aware that PySpark has limited functionality due to Spark's distributed nature, but is there a way to get this number?
I am using the following code to plot the elbow:
# Using the elbow method to find the optimal number of clusters
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
I did it another way: calculate the training cost for each k using Spark ML, store the results in a Python array, and then plot them.
# Calculate cost and plot
import numpy as np
import pandas as pd
from pyspark.ml.clustering import KMeans
cost = np.zeros(10)
for k in range(2, 10):
    kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol('features')
    model = kmeans.fit(df)
    cost[k] = model.summary.trainingCost
# Plot the cost
df_cost = pd.DataFrame(cost[2:])
df_cost.columns = ["cost"]
new_col = [2,3,4,5,6,7,8, 9]
df_cost.insert(0, 'cluster', new_col)
import pylab as pl
pl.plot(df_cost.cluster, df_cost.cost)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
PySpark is not the right tool for plotting an elbow curve. To plot a chart, the data must be collected into a Pandas dataframe, which is not possible in my case because of the massive amount of data. The alternative is to use silhouette analysis, as below:
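from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Keep changing the number of clusters and re-calculate
kmeans = KMeans().setK(6).setSeed(1)
model = kmeans.fit(dataset.select('features'))
predictions = model.transform(dataset)
# ClusteringEvaluator computes the silhouette by default
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))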
Or evaluate clustering by computing Within Set Sum of Squared Errors, which is explained here
I think the last answer is not completely correct. The first answer, however, is correct. Looking at the documentation and source code of pyspark.ml.clustering, model.summary.trainingCost is the PySpark counterpart of sklearn's inertia. In the link you can find the text:
This is equivalent to sklearn's inertia.
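To make concrete what "inertia" means in that quote, here is a minimal sketch that recomputes it by hand with NumPy and compares it with sklearn's inertia_. The synthetic make_blobs data and the variable names are assumptions made for illustration, and nothing here touches Spark.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, purely for illustration (assumption).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Squared Euclidean distance from every point to every centroid: shape (n_samples, k).
sq_dist = ((X[:, None, :] - kmeans.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# Inertia (WCSS): sum over all points of the squared distance to the nearest centroid.
manual_inertia = sq_dist.min(axis=1).sum()

print(manual_inertia, kmeans.inertia_)
```

The two printed numbers should agree up to floating-point error, and this is the quantity that the elbow plot tracks as k grows.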
The silhouette score is given by the ClusteringEvaluator class of pyspark.ml.evaluation: see this link
The Davies-Bouldin index and Calinski-Harabasz index of sklearn are not yet implemented in PySpark. However, there are some suggested functions for them, for example for the Davies-Bouldin index.
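As an illustration of what such a function has to compute, here is a minimal NumPy sketch of the Davies-Bouldin index, checked against sklearn's davies_bouldin_score on small synthetic data. It is not the suggested PySpark function referred to above. The name davies_bouldin is hypothetical, and a distributed version would have to obtain the per-cluster centroids and mean distances through Spark aggregations instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score


def davies_bouldin(X, labels):
    """Davies-Bouldin index (hypothetical helper): lower means better-separated clusters."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # Scatter of a cluster: mean Euclidean distance of its points to its centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    # Pairwise Euclidean distances between centroids.
    sep = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(sep, np.inf)  # a cluster is never compared with itself
    ratios = (scatter[:, None] + scatter[None, :]) / sep
    # Take the worst (largest) ratio for each cluster, then average over clusters.
    return ratios.max(axis=1).mean()


# Synthetic data, purely for illustration (assumption).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin(X, labels), davies_bouldin_score(X, labels))
```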
Related
Excuse me if the questions are too simple, but I am getting into machine learning under time constraints.
I must apply a mixed classification to a df with the following steps:
Apply the KMeans with 50 clusters
From the barycenters and labels obtained for each cluster, a dendrogram must be displayed, in order to choose the right k.
Then apply an HCA algorithm from the barycenters obtained in step 1 with the number of clusters from step 2.
Calculate the barycenters of each new group
Use the calculated barycenters to consolidate the clusters by the KMeans algorithm.
What I do is:
clf = KMeans(n_clusters=50).fit(df)  # fit is needed before cluster_centers_ and labels_ exist
centroids = clf.cluster_centers_
labels = clf.labels_
From there I get confused with the dendrogram. So far I have used it only over the whole df and I am not certain how to involve the barycenters and labels from the KMeans correctly.
Z = linkage(df, method='ward', metric='euclidean')
dendrogram(Z, labels=df.index, leaf_rotation=90., color_threshold=0)
plt.show()
Last but not least, I do not know how to get the barycenters in the AgglomerativeClustering.
Any clarification would be of help. Thanks in advance!
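One possible reading of the five steps above is sketched below. It rests on several assumptions: the data is a synthetic stand-in for df, the linkage is computed on the 50 barycenters only, k = 4 is a placeholder for the value read off the dendrogram, and scipy's fcluster is used for the HCA cut. AgglomerativeClustering has no cluster_centers_ attribute, so the group barycenters are obtained here by averaging the step 1 barycenters, weighted by cluster size.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the asker's df (assumption).
X, _ = make_blobs(n_samples=2000, centers=4, n_features=5, random_state=0)

# Step 1: coarse k-means with 50 clusters.
coarse = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)
centroids = coarse.cluster_centers_  # shape (50, n_features)

# Step 2: Ward linkage on the 50 barycenters, not on the whole data set.
Z = linkage(centroids, method="ward")
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree used to choose k.
k = 4  # placeholder: assumed to have been read off the dendrogram

# Step 3: cut the tree into k groups of barycenters (group ids run from 1 to k).
groups = fcluster(Z, t=k, criterion="maxclust")

# Step 4: barycenter of each group, weighted by the size of each coarse cluster.
sizes = np.bincount(coarse.labels_, minlength=50)
new_centers = np.array([
    np.average(centroids[groups == g], axis=0, weights=sizes[groups == g])
    for g in range(1, k + 1)
])

# Step 5: consolidate with k-means started from those barycenters.
final = KMeans(n_clusters=k, init=new_centers, n_init=1).fit(X)
print(final.cluster_centers_.shape)
```

Weighting by cluster size makes each group barycenter equal to the mean of all points in that group, rather than the plain mean of its 50-cluster centroids.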
My dataset shape is (248857, 11)
This is what it looks like before StandardScaler. I scaled the data because clustering algorithms such as K-means need feature scaling before the data is fed to them.
After
I performed K-Means with three clusters and I am trying to find a way to show these clusters.
I found T-SNE as a solution but I am stuck.
This is how I implemented it:
# save the clusters into a variable l.
l = df_scale['clusters']
d = df_scale.drop("clusters", axis = 1)
standardized_data = StandardScaler().fit_transform(d)
# t-SNE: pick the first 100000 points
data_points = standardized_data[0:100000, :]
labels_80 = l[0:100000]
model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(data_points)
# create a new data frame which helps us plot the result
tsne_data = np.vstack((tsne_data.T, labels_80)).T
tsne_df = pd.DataFrame(data=tsne_data,
                       columns=("Dimension1", "Dimension2", "Clusters"))
# Plot the result of t-SNE (the FacetGrid argument is height; size was its old name)
sns.FacetGrid(tsne_df, hue="Clusters", height=6).map(
    plt.scatter, 'Dimension1', 'Dimension2').add_legend()
plt.show()
As you see, it is not that good. How to visualize this better?
It seems you need to tune the perplexity hyper-parameter which is:
a tunable parameter that says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures.
Read more about it in this post and more specifically, here.
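A minimal sketch of such a perplexity sweep is given below. The small synthetic data set, the three perplexity values, and the helper name plot_embeddings are assumptions chosen for illustration. On the real data you would probably subsample first, since t-SNE gets slow on large inputs.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the scaled data and its cluster labels (assumption).
X, labels = make_blobs(n_samples=300, centers=3, n_features=11, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit one embedding per perplexity value; perplexity must stay below the sample count.
embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)


def plot_embeddings(embeddings, labels):
    """Draw one scatter plot per perplexity value, coloured by cluster label."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, len(embeddings), figsize=(5 * len(embeddings), 5))
    for ax, (perplexity, emb) in zip(np.atleast_1d(axes), embeddings.items()):
        ax.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="viridis")
        ax.set_title(f"perplexity = {perplexity}")
    plt.show()
```

Calling plot_embeddings(embeddings, labels) then shows the panels side by side, which makes it easier to judge which perplexity separates the clusters most clearly.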
How do we measure the accuracy of a K-Means clustering algorithm (say, generate a confusion matrix), given that the automatically assigned cluster indexes are probably a permutation of the original labels?
I don't know exactly what you mean either. Your original labels are presumably the ground-truth labeling. The clustering result provided by k-means is an integer label per sample, ranging over the k clusters you asked the algorithm to give you.
I typically use the pandas.crosstab function to cross-tabulate the ground-truth labels against the k-means labels and see where they coincide.
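A minimal sketch of that cross-tabulation on the iris data follows. The second half goes beyond what is described above and is an assumption on my part: it resolves the label permutation with the Hungarian algorithm (scipy's linear_sum_assignment), so that an accuracy-like number can be computed. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y_true = load_iris(return_X_y=True)
y_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Rows: ground-truth classes, columns: k-means cluster indexes.
crosstab_groundtruth_kmeans = pd.crosstab(
    pd.Series(y_true, name="ground truth"), pd.Series(y_kmeans, name="k-means")
)
print(crosstab_groundtruth_kmeans)

# Assumption, not part of the answer: undo the permutation by choosing the
# one-to-one cluster/class matching that maximises the matched counts.
counts = crosstab_groundtruth_kmeans.to_numpy()
row_ind, col_ind = linear_sum_assignment(-counts)
cluster_to_class = {
    crosstab_groundtruth_kmeans.columns[c]: crosstab_groundtruth_kmeans.index[r]
    for r, c in zip(row_ind, col_ind)
}
y_mapped = np.array([cluster_to_class[c] for c in y_kmeans])
accuracy = (y_mapped == y_true).mean()
print(accuracy)
```

The one-to-one matching only makes sense when the number of clusters equals the number of ground-truth classes. Otherwise the raw cross-tabulation is the more honest summary.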
For better visualization, you may want to use the following:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(30,10))
# plot the cross-tabulation as a heatmap
ax = sns.heatmap(crosstab_groundtruth_kmeans.T,
                 square=True, annot=True, fmt='.2f')
ax.set_yticklabels(
    ax.get_yticklabels(),
    rotation=0);
Good luck!~
k-means is a clustering (grouping) algorithm, not a classification algorithm, so accuracy is not a natural measure for it. The core idea of k-means is to find clusters of data points that are compact and well separated: it minimizes the within-cluster sum of squares, which is the same as maximizing the "between-cluster" sum of squares. It has no concept of labels, and hence you can't get an accuracy matrix from it directly. More insights: https://scikit-learn.org/stable/modules/clustering.html#k-means
The accuracy (assuming you want to see which cluster consists of which data points) has to be analyzed manually using the predict method of sklearn.cluster.KMeans. It basically "Predicts the closest cluster each sample in X belongs to." (from the documentation)
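A minimal sketch of that manual inspection, on synthetic data (an assumption): predict returns one cluster index per sample, and grouping the sample indices by that value shows which data points ended up in which cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, purely for illustration (assumption).
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# predict assigns each sample the index of its nearest centroid.
assigned = kmeans.predict(X)

# Group the sample indices by cluster so that each cluster can be inspected by hand.
members = {c: np.flatnonzero(assigned == c) for c in range(kmeans.n_clusters)}
for c, idx in members.items():
    print(f"cluster {c}: {len(idx)} samples, first few indices {idx[:5]}")
```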
I'm using K-Means for extracting topics from text. I know it is not the best way, but this is just one step towards a more complex model. What puzzles me is the elbow curve I get (below). How would you interpret it? Why is there a sudden spike around k = 50? Or does the elbow method not really work when dealing with text?
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans
wcse = []
for k in range(5, 100, 5):
    # use the loop variable k as the number of clusters
    kmeans_model = MiniBatchKMeans(n_clusters=k, init='k-means++', n_init=1,
                                   init_size=1000, batch_size=1000, verbose=False, max_iter=1000)
    kmeans = kmeans_model.fit(X)  # where X is my data
    wcse.append(kmeans.inertia_)
# plot it
fig = plt.figure(figsize=(15, 5))
plt.plot(range(5, 100, 5), wcse)
plt.grid(True)
plt.title('Elbow curve')
The problem is that k-means is not stable on such data.
Run it 10 times with each k, and plot all results.
K-means is sensitive to outliers and to high-dimensional data, so it just does not work reliably on text.
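A minimal sketch of the "run it 10 times with each k" advice: the data is a synthetic stand-in for the text feature matrix, and the range of k and the batch size are assumptions. Each (k, seed) pair is fitted separately so that the spread across runs becomes visible.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the text feature matrix X (assumption).
X, _ = make_blobs(n_samples=1000, centers=8, n_features=20, random_state=0)

ks = list(range(5, 30, 5))
n_repeats = 10
inertias = np.zeros((len(ks), n_repeats))

for i, k in enumerate(ks):
    for seed in range(n_repeats):
        # n_init=1 so that every run shows the effect of a single random start.
        model = MiniBatchKMeans(n_clusters=k, n_init=1, batch_size=256, random_state=seed)
        inertias[i, seed] = model.fit(X).inertia_

# A wide min-to-max range for some k signals an unstable fit at that k.
for k, row in zip(ks, inertias):
    print(f"k={k:3d}  min={row.min():.1f}  median={np.median(row):.1f}  max={row.max():.1f}")
```

To plot all results, something like plt.plot(ks, inertias, 'k.') draws every run as a dot. A single spike in the original curve would then show up as one outlying dot among several lower ones.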
How would you define the distance between different topics using k-means?
If you just use word similarity as the distance metric for k-means, you won't get topics; you get some kind of word counter.
I'd use Latent Dirichlet Allocation (LDA) for topic modeling; there are easy-to-use libraries for Python, R, Java, and others.
Today I'm trying to learn something about K-means. I have understood the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the right k, but I do not understand how to use it with scikit-learn. In scikit-learn I'm clustering things this way:
kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
kmeans.fit(data)
So should I do this several times for n_clusters = 1...n and watch the error rate to get the right k? I think this would be stupid and would take a lot of time.
If the true label is not known in advance(as in your case), then K-Means clustering can be evaluated using either Elbow Criterion or Silhouette Coefficient.
Elbow Criterion Method:
The idea behind elbow method is to run k-means clustering on a given dataset for a range of values of k (num_clusters, e.g k=1 to 10), and for each value of k, calculate sum of squared errors (SSE).
After that, plot a line graph of the SSE for each value of k. If the line graph looks like an arm, then the "elbow" on the arm (the bend, circled in red in the line graph below) is the optimal value of k (number of clusters).
Here, we want to minimize SSE. SSE tends to decrease toward 0 as we increase k (and SSE is 0 when k is equal to the number of data points in the dataset, because then each data point is its own cluster, and there is no error between it and the center of its cluster).
So the goal is to choose a small value of k that still has a low SSE, and the elbow usually represents where we start to have diminishing returns by increasing k.
Let's consider the iris dataset:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris['feature_names'])
#print(X)
data = X[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']]
sse = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    # kmeans.labels_ holds the cluster of each sample; it is not written back into data,
    # because a "clusters" column would be used as a feature in the next iteration
    sse[k] = kmeans.inertia_  # Inertia: sum of squared distances of samples to their closest cluster center
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.show()
Plot for above code:
We can see in the plot that 3 is the optimal number of clusters (circled in red) for the iris dataset, which is indeed correct.
Silhouette Coefficient Method:
From the sklearn documentation:
A higher Silhouette Coefficient score relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
a: The mean distance between a sample and all other points in the same class.
b: The mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient s for a single sample is then given as:
s = (b - a) / max(a, b)
Now, to find the optimal value of k for KMeans, loop through 2..n for n_clusters in KMeans (the score needs at least two clusters) and calculate the mean Silhouette Coefficient over all samples for each value.
A higher Silhouette Coefficient indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
from sklearn.metrics import silhouette_score
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
X = load_iris().data
y = load_iris().target
for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster).fit(X)
    label = kmeans.labels_
    sil_coeff = silhouette_score(X, label, metric='euclidean')
    print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, sil_coeff))
Output -
For n_clusters=2, The Silhouette Coefficient is 0.680813620271
For n_clusters=3, The Silhouette Coefficient is 0.552591944521
For n_clusters=4, The Silhouette Coefficient is 0.496992849949
For n_clusters=5, The Silhouette Coefficient is 0.488517550854
For n_clusters=6, The Silhouette Coefficient is 0.370380309351
For n_clusters=7, The Silhouette Coefficient is 0.356303270516
For n_clusters=8, The Silhouette Coefficient is 0.365164535737
For n_clusters=9, The Silhouette Coefficient is 0.346583642095
For n_clusters=10, The Silhouette Coefficient is 0.328266088778
As we can see, n_clusters=2 has the highest Silhouette Coefficient. This means that 2 should be the optimal number of clusters, right?
But here's the catch.
The iris dataset has 3 species of flower, which contradicts 2 as the optimal number of clusters. So despite n_clusters=2 having the highest Silhouette Coefficient, we would consider n_clusters=3 as the optimal number of clusters because:
The iris dataset has 3 species. (Most important)
n_clusters=3 has the 2nd highest value of the Silhouette Coefficient.
So n_clusters=3 is the optimal number of clusters for the iris dataset.
Choosing the optimal number of clusters will depend on the type of dataset and the problem we are trying to solve. But in most cases, taking the highest Silhouette Coefficient will yield an optimal number of clusters.
Hope it helps!
The elbow criterion is a visual method. I have not yet seen a robust mathematical definition of it.
But k-means is a pretty crude heuristic, too.
So yes, you will need to run k-means with k=1...kmax, then plot the resulting SSQ and decide upon an "optimal" k.
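There is no agreed definition, but one common heuristic for reading the elbow off automatically is to pick the point that lies farthest from the straight line joining the first and last points of the SSQ curve. The sketch below is exactly that heuristic and nothing more. The function name and the SSQ values are made up for illustration.

```python
import numpy as np


def elbow_by_chord_distance(ks, ssq):
    """Return the k whose (k, SSQ) point lies farthest from the chord joining the curve's end points."""
    ks = np.asarray(ks, dtype=float)
    ssq = np.asarray(ssq, dtype=float)
    # Rescale both axes to [0, 1] so that the result does not depend on units.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (ssq - ssq.min()) / (ssq.max() - ssq.min())
    # Perpendicular distance of each point to the line through the first and last points.
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(dist)])


# Made-up SSQ values with a visible bend at k = 3.
ks = [1, 2, 3, 4, 5, 6, 7, 8]
ssq = [1000, 500, 150, 120, 100, 90, 82, 76]
print(elbow_by_chord_distance(ks, ssq))  # prints 3
```

The result depends on the range of k that was scanned, so it is best treated as a suggestion to check against the plot.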
There exist advanced versions of k-means such as X-means that will start with k=2 and then increase it until a secondary criterion (AIC/BIC) no longer improves. Bisecting k-means is an approach that also starts with k=2 and then repeatedly splits clusters until k=kmax. You could probably extract the interim SSQs from it.
Either way, I have the impression that in any actual use case where k-means is really good, you do actually know the k you need beforehand. In these cases, k-means is actually not so much a "clustering" algorithm, but a vector quantization algorithm. E.g. reducing the number of colors of an image to k (where often you would choose k to be e.g. 32, because that is then 5 bits of color depth and can be stored in a bit compressed way). Or e.g. in bag-of-visual-words approaches, where you would choose the vocabulary size manually. A popular value seems to be k=1000. You then don't really care much about the quality of the "clusters"; the main point is to be able to reduce an image to a 1000-dimensional sparse vector.
The performance of a 900 dimensional or a 1100 dimensional representation will not be substantially different.
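The color-reduction use mentioned above can be sketched as follows. The random "image" is a stand-in for a real picture and the variable names are illustrative (assumptions). The point is only that k is fixed in advance at 32 and k-means acts as a codebook builder.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random RGB "image" as a stand-in for a real picture (assumption).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

k = 32  # 32 colours fit into 5 bits per pixel
pixels = image.reshape(-1, 3).astype(float)
kmeans = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)

# The codebook of k colours, plus one small index per pixel.
palette = kmeans.cluster_centers_.round().astype(np.uint8)
indices = kmeans.labels_.astype(np.uint8)

# Rebuild the quantized image by looking every index up in the palette.
quantized = palette[indices].reshape(image.shape)
print(len(np.unique(quantized.reshape(-1, 3), axis=0)))
```

Nothing here evaluates cluster quality. The only outputs of interest are the palette and the per-pixel indexes.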
For actual clustering tasks, i.e. when you want to analyze the resulting clusters manually, people usually use more advanced methods than k-means. K-means is more of a data simplification technique.
This answer is inspired by what OmPrakash has written. It contains code to plot both the SSE and the silhouette score. What I've given is a general code snippet you can follow in all cases of unsupervised learning where you don't have the labels and want to know the optimal number of clusters. There are two criteria: 1) the sum of squared errors (SSE) and 2) the silhouette score. You can follow OmPrakash's answer for the explanation. He's done a good job at that.
Assume your dataset is a data frame df1. Here I have used a different dataset just to show how we can use both criteria to help decide the optimal number of clusters. Here I think 6 is the correct number of clusters.
Then
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
elbow = []
ss = []
for n_clusters in range_n_clusters:
    # iterating through cluster sizes
    clusterer = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = clusterer.fit_predict(df1)
    # Finding the average silhouette score
    silhouette_avg = silhouette_score(df1, cluster_labels)
    ss.append(silhouette_avg)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
    # Finding the SSE
    elbow.append(clusterer.inertia_)  # Inertia: sum of squared distances of samples to their closest cluster center
fig = plt.figure(figsize=(14,7))
fig.add_subplot(121)
plt.plot(range_n_clusters, elbow,'b-',label='Sum of squared error')
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.legend()
fig.add_subplot(122)
plt.plot(range_n_clusters, ss,'b-',label='Silhouette Score')
plt.xlabel("Number of cluster")
plt.ylabel("Silhouette Score")
plt.legend()
plt.show()