I have used nltk for k-means clustering because I would like to change the distance metric. Does nltk's k-means have an inertia attribute similar to sklearn's? I can't seem to find it in the documentation or online...
The code below is how people usually find the inertia using sklearn's k-means.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertia = []
for n_clusters in range(2, 26):
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(features)   # features: your data matrix
    centers = clusterer.cluster_centers_
    inertia.append(clusterer.inertia_)

plt.plot(list(range(2, 26)), inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()
You can write your own function to obtain the inertia for a k-means clusterer in nltk.
As in your earlier question, How do I obtain individual centroids of K mean cluster using nltk (python), I am using the same dummy data, which looks like this after making 2 clusters.
Referring to the docs (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), inertia is the sum of squared distances of samples to their closest cluster center.
import numpy as np

feature_matrix = df[['feature1', 'feature2', 'feature3']].to_numpy()
centroid = df['centroid'].to_numpy()

def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # inertia as defined in the scikit-learn docs, i.e. sum of squared distances
        sum_.append(np.sum((feature_matrix[i] - centroid[i])**2))
    return sum(sum_)

nltk_inertia(feature_matrix, centroid)
# output: 27.495250000000002
# now using scikit-learn's k-means on feature1, feature2, and feature3 with the same number of clusters (2)
scikit_kmeans = KMeans(n_clusters=2)
scikit_kmeans.fit(vectors)  # vectors = [np.array(f) for f in df.values], containing feature1, feature2, feature3
scikit_kmeans.inertia_
# output: 27.495250000000006
The previous answer is actually missing a small detail:
feature_matrix = df[['feature1', 'feature2', 'feature3']].to_numpy()
centroid = df['centroid'].to_numpy()
cluster = df['predicted_cluster'].to_numpy()

def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # use the centroid of the cluster each point was assigned to
        sum_.append(np.sum((feature_matrix[i] - centroid[cluster[i]])**2))
    return sum(sum_)
You have to select the corresponding cluster centroid when calculating the distance between a data point and a centroid. Notice the cluster variable in the code above.
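For completeness, here is a minimal sketch (not from the original answer) of how the pieces might fit together with nltk's KMeansClusterer; the number of repeats and the choice of euclidean_distance are arbitrary, and df is assumed to hold the same feature1/feature2/feature3 columns as above:

import numpy as np
from nltk.cluster import KMeansClusterer, euclidean_distance

vectors = [np.array(f) for f in df[['feature1', 'feature2', 'feature3']].values]

# cluster with nltk, where a custom distance function can be plugged in
kclusterer = KMeansClusterer(2, distance=euclidean_distance, repeats=10)
assigned_clusters = kclusterer.cluster(vectors, assign_clusters=True)

feature_matrix = np.array(vectors)
centroid = np.array(kclusterer.means())        # one mean vector per cluster
cluster = np.array(assigned_clusters)          # per-point cluster index, used inside nltk_inertia

print(nltk_inertia(feature_matrix, centroid))  # the same "inertia" sklearn would report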
I am doing K-means on the MNIST dataset. However, I ran into difficulties in the implementation, both at initialization and in some later steps.
For the initialization, I have to first pick one random data point as the first centroid. Then, for the remaining centroids, data points are also picked randomly, but from a weighted probability distribution, until all the centroids are chosen.
I am stuck at this step: how can I sample from this distribution? I mean, how do I implement it? For D_{k-1}(x), can I just use np.linalg.norm to compute the distance and square it?
In my implementation, I have so far only initialized the first centroid:
self.centroids = np.zeros((self.num_clusters, input_x.shape[1]))
ran_num = np.random.choice(input_x.shape[0])
self.centroids[0] = input_x[ran_num]

for k in range(1, self.num_clusters):
    ...  # remaining centroids still to be chosen
For the next step, do I need to find the next centroid by taking the sample point with the largest distance from the previous centroid?
You need to create a distribution where the probability of selecting an observation is its (normalized, squared) distance to the closest already-chosen cluster center. Thus, when selecting a new cluster center, there is a high probability of picking observations that are far from all existing cluster centers, and a low probability of picking observations that are close to an existing cluster center.
This would look like this:
centers = []
centers.append(X[np.random.randint(X.shape[0])])  # initial center = one random sample
distance = np.full(X.shape[0], np.inf)

for j in range(1, self.n_clusters):
    # distance from each point to its closest already-chosen center
    distance = np.minimum(np.linalg.norm(X - centers[-1], axis=1), distance)
    p = np.square(distance) / np.sum(np.square(distance))  # probability vector [p1, ..., pn]
    sample = np.random.choice(X.shape[0], p=p)
    centers.append(X[sample])
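If it helps, here is the same idea as a standalone function (no self) that can be tried on dummy data; the function name, the rng argument, and the test array are just for illustration:

import numpy as np

def kmeans_pp_init(X, n_clusters, rng=np.random.default_rng()):
    # k-means++ style initialization: each new center is sampled with probability
    # proportional to the squared distance to the closest already-chosen center
    centers = [X[rng.integers(X.shape[0])]]
    distance = np.full(X.shape[0], np.inf)
    for _ in range(1, n_clusters):
        distance = np.minimum(np.linalg.norm(X - centers[-1], axis=1), distance)
        p = np.square(distance) / np.sum(np.square(distance))
        centers.append(X[rng.choice(X.shape[0], p=p)])
    return np.array(centers)

X = np.random.rand(1000, 784)        # e.g. flattened MNIST-sized vectors
print(kmeans_pp_init(X, 10).shape)   # (10, 784)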
Data & Objective
I have a DataFrame df of 50-dimensional vector embeddings with 50,000 rows:

id        v1      v2      v3    ...    v50
a0       0.231   0.370   0.071  ...  -0.578
...
a49999   0.510  -0.111   0.235  ...  -0.004

Within my df, I have selected 3 ids: [a15000, a30000, a45000]. I plan to do a t-SNE analysis to cluster my embedded vectors and analyze the data points that are closest to my target ids.
My Work
First, I decided on the optimal number of clusters by running a silhouette-score search:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_n, best_silscore = -1, -1

# exploring from 2 clusters to 30
for k in range(2, 31):
    kmeans = KMeans(n_clusters=k, random_state=200)
    kmeans.fit(df.iloc[:, 1:])
    clusters = kmeans.predict(df.iloc[:, 1:])
    score = silhouette_score(df.iloc[:, 1:], clusters)
    if score > best_silscore:
        best_n = k
        best_silscore = score

print('best k:', best_n, '\t best score:', best_silscore)
This gives an optimal number of clusters of 15, with a silhouette score of roughly 0.097.
After retrieving the optimal number of clusters, I performed the t-SNE analysis:
from sklearn.manifold import TSNE
import seaborn as sns
import matplotlib.pyplot as plt

# clustering
k = 15
kmeans = KMeans(n_clusters=k, random_state=200)
y_pred = kmeans.fit_predict(df.iloc[:, 1:])

# t-SNE
tsne = TSNE(n_components=2, verbose=1, perplexity=50, random_state=200)
X_embedded = tsne.fit_transform(df.iloc[:, 1:])

# visualization
sns.set(rc={'figure.figsize': (20, 20)})
sns.scatterplot(X_embedded[:, 0], X_embedded[:, 1], hue=y_pred,
                legend='full', palette=sns.hls_palette(k, l=.4, s=.9))
plt.title('t-SNE with KMeans Labels for Embedded Vectors')
plt.show()
(Since my dataset is pretty large, I set the perplexity to the highest commonly recommended level.)
>>> [t-SNE] Mean sigma: 0.195604
>>> [t-SNE] KL divergence after 250 iterations with early exaggeration: 94.565384
>>> [t-SNE] KL divergence after 1000 iterations: 2.455614
Problem
Now, I'm aware that it would be easy to plot a separate scatterplot just to highlight the three pre-selected points from earlier. Currently, they sit somewhere in the scatterplot above, but are not distinguishable.
However, if I do a separate t-SNE analysis on just those three points, they (understandably) are way off any of the clusters in the grid. They look something like this, with the target ids marked as black dots.
Instead of this bizarre-looking scatterplot, I want something almost identical to the first one, but with the corresponding black, highlighted data points in the clusters.
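One way to get that (a sketch, not from the original post; it assumes df has an 'id' column as in the table above, and reuses X_embedded, y_pred and k from the code above) is to keep the single t-SNE embedding and simply overlay the rows belonging to the target ids:

target_ids = ['a15000', 'a30000', 'a45000']
mask = df['id'].isin(target_ids).to_numpy()

# same scatterplot as before, then the three target rows drawn on top in black
sns.scatterplot(X_embedded[:, 0], X_embedded[:, 1], hue=y_pred,
                legend='full', palette=sns.hls_palette(k, l=.4, s=.9))
plt.scatter(X_embedded[mask, 0], X_embedded[mask, 1],
            c='black', s=200, label='target ids')
plt.legend()
plt.title('t-SNE with KMeans Labels, target ids highlighted')
plt.show()

Because the three points are embedded together with the other 49,997 rows, they land inside the existing clusters instead of forming their own detached group.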
I have implemented the kmeans++ algorithm to initialize clusters when performing K-means clustering. The loop has to run k times. I was wondering if there was any way to vectorize the algorithm to get it to run faster?
points is an array of points in d dimensions and k is the number of centroids to return.
It works by calculating the minimum distance from each point to the already-chosen centroids, and then using those distances as the probabilities for choosing the next centroid from the points.
The issue is really that it scales badly when k is large.
def init_plus_plus(points, k):
    centroids = np.zeros_like(points[:k])
    r = np.random.randint(0, points.shape[0])
    centroids[0] = points[r]

    for i in range(1, k):
        # self.euclidian_distance: pairwise distance helper from the surrounding class
        min_distances = self.euclidian_distance(centroids[:i], points).min(1)
        prob = min_distances / min_distances.sum()
        cs = np.cumsum(prob)
        idx = np.sum(cs < np.random.rand())
        centroids[i] = points[int(idx)]
    return centroids
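Not an answer from the original thread, but as a sketch of the usual optimization: instead of recomputing the distances to every centroid chosen so far on each iteration (which is what makes the loop scale badly in k), you can keep a running minimum distance per point and only measure against the newest centroid. Something like this, using scipy's cdist; the probabilities are left unsquared to match the code above:

import numpy as np
from scipy.spatial.distance import cdist

def init_plus_plus_fast(points, k, rng=np.random.default_rng()):
    centroids = np.zeros_like(points[:k])
    centroids[0] = points[rng.integers(points.shape[0])]
    # running minimum distance from each point to its nearest chosen centroid
    min_distances = np.full(points.shape[0], np.inf)
    for i in range(1, k):
        new_d = cdist(points, centroids[i - 1:i]).ravel()
        min_distances = np.minimum(min_distances, new_d)
        prob = min_distances / min_distances.sum()
        centroids[i] = points[rng.choice(points.shape[0], p=prob)]
    return centroids

This brings each iteration down to one distance computation against a single centroid rather than i of them.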
The k-means clustering algorithm's objective is to find the partition $S = \{S_1, \dots, S_k\}$ minimizing

$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$

where $\mu_i$ is the mean of the points in $S_i$.
I looked at several implementations of it in python, and in some of them the norm is not squared.
For example (taken from here):
def form_clusters(labelled_data, unlabelled_centroids):
    """
    given some data and centroids for the data, allocate each
    datapoint to its closest centroid. This forms clusters.
    """
    # enumerate because centroids are arrays which are unhashable
    centroids_indices = range(len(unlabelled_centroids))

    # initialize an empty list for each centroid. The list will
    # contain all the datapoints that are closer to that centroid
    # than to any other. That list is the cluster of that centroid.
    clusters = {c: [] for c in centroids_indices}

    for (label, Xi) in labelled_data:
        # for each datapoint, pick the closest centroid.
        smallest_distance = float("inf")
        for cj_index in centroids_indices:
            cj = unlabelled_centroids[cj_index]
            distance = np.linalg.norm(Xi - cj)
            if distance < smallest_distance:
                closest_centroid_index = cj_index
                smallest_distance = distance
        # allocate that datapoint to the cluster of that centroid.
        clusters[closest_centroid_index].append((label, Xi))
    return clusters.values()
And, for contrast, the expected implementation (taken from here; this is just the distance calculation):
import numpy as np
from numpy.linalg import norm

def compute_distance(self, X, centroids):
    distance = np.zeros((X.shape[0], self.n_clusters))
    for k in range(self.n_clusters):
        row_norm = norm(X - centroids[k, :], axis=1)
        distance[:, k] = np.square(row_norm)
    return distance
Now, I know there are several ways to calculate the norm/distance, but I only looked at implementations that used np.linalg.norm with ord=None or ord=2, and, as I said, in some of them the norm is not squared, yet they cluster correctly.
Why?
In my experience, using either the norm or the squared norm as the objective function of an optimization algorithm yields similar results. The minimum value of the objective function changes, but the parameters that attain it are the same: squaring is a monotonically increasing transformation of the non-negative distance, so it changes the magnitude of the objective without changing its topology, i.e. which point achieves the minimum. A more detailed answer can be found here: https://math.stackexchange.com/questions/2253443/difference-between-least-squares-and-minimum-norm-solution
Hope it helps.
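Not from the original answer, but a tiny sketch illustrating the point for the assignment step: the centroid chosen by argmin is the same whether or not the distances are squared.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 3))           # 5 points in 3-D
centroids = rng.random((2, 3))   # 2 candidate centroids

# distance from every point to every centroid
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# the assignment (argmin over centroids) is identical with or without squaring
assert np.array_equal(d.argmin(axis=1), np.square(d).argmin(axis=1))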
I am using sklearn's k-means clustering to cluster my data. Now I want the distances between my clusters, but can't find them. I could calculate the distance between each pair of centroids myself, but wanted to know whether there is a function for this, and whether there is a way to get the minimum/maximum/average linkage distance between each pair of clusters. My code is very simple:
km = KMeans(n_clusters=5, random_state=1)
km.fit(X_tfidf)
clusterkm = km.cluster_centers_
clusters = km.labels_.tolist()
Thank you!
Unfortunately, you're going to have to compute those distances on the cluster centers yourself. Scikit doesn't provide a method for that right out of the box. Here's a comparable problem setup:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances
X, y = load_iris(return_X_y=True)
km = KMeans(n_clusters = 5, random_state = 1).fit(X)
And how you'd compute the distances:
dists = euclidean_distances(km.cluster_centers_)
And then, to get the stats you're interested in, you only want to compute on the upper (or lower) triangular part of the distance matrix:
import numpy as np
tri_dists = dists[np.triu_indices(5, 1)]
max_dist, avg_dist, min_dist = tri_dists.max(), tri_dists.mean(), tri_dists.min()
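If you also want linkage-style distances between the cluster members themselves (rather than between the centroids), a minimal sketch using scipy's cdist on the fitted labels, here for clusters 0 and 1 of the iris example above, could look like this:

from scipy.spatial.distance import cdist

labels = km.labels_
d = cdist(X[labels == 0], X[labels == 1])   # all member-to-member distances between the two clusters
min_link, avg_link, max_link = d.min(), d.mean(), d.max()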
km.inertia_ is sklearn KMeans' measure of the sum of squared distances.
From the sklearn website:
inertia_: float
Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html