Data & Objective
I have a DataFrame df of 50-dimensional vector embeddings with 50,000 rows:
id v1 v2 v3 ... v50
a0 0.231 0.370 0.071 -0.578
...
a49999 0.510 -0.111 0.235 -0.004
Within df, I have selected 3 ids: [a15000, a30000, a45000]. I plan to run a t-SNE analysis to cluster my embedded vectors and analyze the data points closest to my target ids.
My Work
First, I decided on the optimal number of clusters by running a silhouette-score search:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_n, best_silscore = -1, -1
# exploring from 2 clusters to 30
for k in range(2, 31):
    kmeans = KMeans(n_clusters=k, random_state=200)
    kmeans.fit(df.iloc[:, 1:])
    clusters = kmeans.predict(df.iloc[:, 1:])
    score = silhouette_score(df.iloc[:, 1:], clusters)
    if score > best_silscore:
        best_n = k
        best_silscore = score

print('best k:', best_n, '\t best score:', best_silscore)
This gives an optimal number of clusters of 15, with a silhouette score of roughly 0.097.
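As a side note on runtime (not part of the original workflow): with 50,000 rows the pairwise distances behind the silhouette score are expensive, and silhouette_score accepts a sample_size argument that scores a random subsample instead. A minimal sketch under that assumption, with df laid out as above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = df.iloc[:, 1:]  # v1..v50, dropping the id column
best_n, best_silscore = -1, -1
for k in range(2, 31):
    clusters = KMeans(n_clusters=k, random_state=200).fit_predict(X)
    # score on a random 10,000-row subsample rather than all 50,000 rows
    score = silhouette_score(X, clusters, sample_size=10000, random_state=200)
    if score > best_silscore:
        best_n, best_silscore = k, score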
After retrieving the optimal number of clusters, I performed the t-SNE analysis:
from sklearn.manifold import TSNE
import seaborn as sns
import matplotlib.pyplot as plt

# clustering
k = 15
kmeans = KMeans(n_clusters=k, random_state=200)
y_pred = kmeans.fit_predict(df.iloc[:, 1:])

# tsne
tsne = TSNE(n_components=2, verbose=1, perplexity=50, random_state=200)
X_embedded = tsne.fit_transform(df.iloc[:, 1:])

# visualization
sns.set(rc={'figure.figsize': (20, 20)})
sns.scatterplot(x=X_embedded[:, 0], y=X_embedded[:, 1], hue=y_pred,
                legend='full', palette=sns.hls_palette(k, l=.4, s=.9))
plt.title('t-SNE with KMeans Labels for Embedded Vectors')
plt.show()
(Since my dataset is pretty large, I set the perplexity to the highest commonly recommended level.)
>>> [t-SNE] Mean sigma: 0.195604
>>> [t-SNE] KL divergence after 250 iterations with early exaggeration: 94.565384
>>> [t-SNE] KL divergence after 1000 iterations: 2.455614
Problem
Now, I would like to highlight the three pre-selected ids from earlier; currently they sit somewhere in the scatterplot, unmarked.
However, if I run a separate t-SNE analysis on just those three points, they (understandably) end up nowhere near any of the clusters in the grid. The result looks something like this, with the target ids marked as black dots.
Instead of this bizarre-looking scatterplot, I want something almost identical to the first one, but with the corresponding data points highlighted in black within their clusters.
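One way to get that (a sketch, not from the original post): reuse the embedding already computed for the full DataFrame, look up the row positions of the three ids, and overlay a second scatter layer in black. It assumes df has an 'id' column and that the rows of X_embedded are aligned with the rows of df, and reuses X_embedded, y_pred, k, sns and plt from the code above:

import numpy as np

target_ids = ['a15000', 'a30000', 'a45000']
target_idx = np.flatnonzero(df['id'].isin(target_ids))  # row positions of the three ids

sns.set(rc={'figure.figsize': (20, 20)})
sns.scatterplot(x=X_embedded[:, 0], y=X_embedded[:, 1], hue=y_pred,
                legend='full', palette=sns.hls_palette(k, l=.4, s=.9))
# overlay the pre-selected points on the same embedding
plt.scatter(X_embedded[target_idx, 0], X_embedded[target_idx, 1],
            c='black', s=200, marker='o', label='target ids')
plt.legend()
plt.title('t-SNE with KMeans Labels, target ids in black')
plt.show()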
Related
I have used k-means to cluster my data. Now I want to cluster the clusters, so that the second-round clusters consist of the individual clusters from the first round of clustering.
Minimal reproducible example:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# create dataframe
n1 = 336
n2 = 200
x_list = np.array(range(0, n1))
y_list = np.array(range(0, n2))
x_list = np.repeat([x_list], n2, axis=0).flatten()  # width
y_list = np.repeat(y_list, n1, axis=0).flatten()    # height

# normalize x, y to avoid skewing the clustering
norm_x = np.linalg.norm(x_list)
norm_y = np.linalg.norm(y_list)
normal_array_x = np.round(x_list / norm_x, 6)
normal_array_y = np.round(y_list / norm_y, 6)
data = {'x_position_norm': normal_array_x,
        'y_position_norm': normal_array_y}
features = pd.DataFrame(data).to_numpy()

kmeans = KMeans(init='k-means++', n_clusters=16800, n_init=3, max_iter=3, random_state=1)
kmeans.fit(features)
kmeans2 = KMeans(init='k-means++', n_clusters=4200, n_init=3, max_iter=3, random_state=1)
kmeans2.fit(kmeans.cluster_centers_)
At the moment, I am clustering the cluster centers. Is there a better/more efficient way to cluster that guarantees that clusters in the second round of clustering consist of clusters from the first round?
Each clustering method has its own characteristics; which one fits best depends on what your data looks like.
For bottom-up clustering, spectral clustering will work well.
Plot a cluster map
Here is a list of clustering methods.
Overview of clustering methods
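As a side note on the code in the question (a sketch, not from the original answer): because the second KMeans is fit on the first run's cluster centers, every second-round cluster is by construction a union of first-round clusters, and the second-round label of each original point can be read off directly:

# kmeans.labels_[i] is the first-round cluster of point i, and
# kmeans2.labels_ maps each first-round center to its second-round cluster
final_labels = kmeans2.labels_[kmeans.labels_]  # shape: (n_samples,)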
For a single exponential curve such as the one shown in the image here (curve_fit for a single exponential curve), I am able to fit the data using scipy.optimize.curve_fit. However, I am unsure how to realize a fit for a similar dataset composed of multiple exponential curves, as shown here (double exponential curves).
I achieved the fit for the single curve using the following approach:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def exp_decay(x, a, r):
    return a * ((1 - r)**x)

x = np.linspace(0, 50, 50)
y = exp_decay(x, 400, 0.06)
y1 = exp_decay(x, 550, 0.06)  # this is to be appended to y to generate two curves

pars, cov = curve_fit(exp_decay, x, y, p0=[0, 0])
plt.scatter(x, y)
plt.plot(x, exp_decay(x, *pars), 'r-')  # this realizes the fit for a single curve

yx = np.append(y, y1)  # this realizes two exponential curves (as shown above - double exponential curves), which is the dataset I want to fit
Can someone help describe how to achieve this for a dataset of two curves? My actual dataset comprises multiple exponential curves, but I think that if I can realize a fit for two curves, I may be able to replicate the same for my dataset. This does not have to be done with scipy's curve_fit; any implementation that works is fine.
Your problem can easily be tackled by splitting your dataset using a simple criterion, such as a first-derivative estimate, and then applying a simple curve-fitting procedure to each sub-dataset.
Trial Dataset
First, let's import some packages and create a synthetic dataset with three curves to represent your problem.
We use a two-parameter exponential model, since the time-origin shift will be handled by the splitting methodology. We also add noise, as there is always noise on real-world data:
import numpy as np
import pandas as pd
from scipy import optimize
import matplotlib.pyplot as plt

def func(x, a, b):
    return a*np.exp(b*x)

N = 1001
n1 = N//3
n2 = 2*n1

t = np.linspace(0, 10, N)
x0 = func(t[:n1], 1, -0.2)
x1 = func(t[n1:n2]-t[n1], 5, -0.4)
x2 = func(t[n2:]-t[n2], 2, -1.2)
x = np.hstack([x0, x1, x2])
xr = x + 0.025*np.random.randn(x.size)
Graphically it renders as follows:
Dataset Splitting
We can split the dataset into three sub-datasets using a simple criterion: a first-derivative estimate, assessed using first differences. The goal is to detect where the curve jumps drastically up or down (which is where the dataset should be split). The first derivative is estimated as follows:
dxrdt = np.abs(np.diff(xr)/np.diff(t))
The criterion requires an extra parameter (a threshold) that must be tuned according to your signal specifications. The criterion is equivalent to:
xcrit = 20
q = np.where(dxrdt > xcrit) # (array([332, 665], dtype=int64),)
And the split indices are:
idx = [0] + list(q[0]+1) + [t.size] # [0, 333, 666, 1001]
The criterion threshold will mainly be affected by the nature and power of the noise in your data and by the magnitude of the gaps between two curves. The usefulness of this methodology depends on being able to detect the gaps between curves in the presence of noise: it will break when the noise power has the same magnitude as the gap we want to detect. You may also observe false split indices if the noise is heavy-tailed (a few strong outliers).
In this MCVE, we have set the threshold to 20 [Signal Units/Time Units]:
An alternative to this hand-crafted criterion is to delegate the identification to the excellent find_peaks function of scipy. But it does not remove the requirement to tune the detection to your signal specifications.
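A minimal sketch of that alternative (not part of the original answer), reusing dxrdt, xcrit and t from above; the height threshold plays the same role as the hand-tuned xcrit:

from scipy.signal import find_peaks

# peaks of the derivative magnitude above the threshold mark the split points
peaks, _ = find_peaks(dxrdt, height=xcrit)
idx = [0] + list(peaks + 1) + [t.size]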
Fit origin-shifted dataset
Now we can apply the curve fitting on each sub-dataset (with origin shifted time), collect parameters and statistics and plot the result:
trials = []
fig, axe = plt.subplots()
for k, (i, j) in enumerate(zip(idx[:-1], idx[1:])):
    p, s = optimize.curve_fit(func, t[i:j]-t[i], xr[i:j])
    axe.plot(t[i:j], xr[i:j], '.', label="Data #{}".format(k+1))
    axe.plot(t[i:j], func(t[i:j]-t[i], *p), label="Data Fit #{}".format(k+1))
    trials.append({"n0": i, "n1": j, "t0": t[i], "a": p[0], "b": p[1],
                   "s_a": s[0,0], "s_b": s[1,1], "s_ab": s[0,1]})
axe.set_title("Curve Fits")
axe.set_xlabel("Time, $t$")
axe.set_ylabel(r"Signal Estimate, $\hat{g}(t)$")
axe.legend()
axe.grid()
df = pd.DataFrame(trials)
It returns the following fitting results:
n0 n1 t0 a b s_a s_b s_ab
0 0 333 0.00 0.998032 -0.199102 0.000011 4.199937e-06 -0.000005
1 333 666 3.33 5.001710 -0.399537 0.000013 3.072542e-07 -0.000002
2 666 1001 6.66 2.002495 -1.203943 0.000030 2.256274e-05 -0.000018
This agrees with our original parameters (see the Trial Dataset section).
Graphically, we can check the goodness of the fits:
I have used NLTK for k-means clustering because I would like to change the distance metric. Does NLTK's k-means have an inertia attribute similar to sklearn's? I can't seem to find it in their documentation or online...
The code below is how people usually find inertia using sklearn's k-means.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertia = []
for n_clusters in range(2, 26, 1):
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(features)
    centers = clusterer.cluster_centers_
    inertia.append(clusterer.inertia_)

plt.plot([i for i in range(2, 26, 1)], inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
You can write your own function to obtain the inertia for the KMeansClusterer in NLTK.
As per your earlier question, How do I obtain individual centroids of K means cluster using nltk (python), I am using the same dummy data, which looks like this after making 2 clusters.
Referring to the docs https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html, inertia is the sum of squared distances of samples to their closest cluster center.
feature_matrix = df[['feature1','feature2','feature3']].to_numpy()
centroid = df['centroid'].to_numpy()

def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # implementing inertia as given in the scikit-learn docs, i.e. sum of squared distances
        sum_.append(np.sum((feature_matrix[i] - centroid[i])**2))
    return sum(sum_)

nltk_inertia(feature_matrix, centroid)
#op 27.495250000000002

# now using scikit-learn's KMeans for feature1, feature2, and feature3 with the same number of clusters (2)
scikit_kmeans = KMeans(n_clusters=2)
scikit_kmeans.fit(vectors)  # vectors = [np.array(f) for f in df.values], which contain feature1, feature2, feature3
scikit_kmeans.inertia_
#op
27.495250000000006
The previous answer is actually missing a small detail:
feature_matrix = df[['feature1','feature2','feature3']].to_numpy()
centroid = df['centroid'].to_numpy()
cluster = df['predicted_cluster'].to_numpy()

def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # use the centroid of the cluster that point i was assigned to
        sum_.append(np.sum((feature_matrix[i] - centroid[cluster[i]])**2))
    return sum(sum_)
You have to select the corresponding cluster centroid when calculating distance between centroids and data points. Notice the cluster variable in the above code.
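For reference, a minimal sketch of how the inputs above could be produced with NLTK's KMeansClusterer (this step is not shown in the original posts; the cosine distance and the 25 repeats are arbitrary assumed choices):

import numpy as np
from nltk.cluster import KMeansClusterer, cosine_distance

vectors = [np.array(f) for f in df[['feature1', 'feature2', 'feature3']].values]
kclusterer = KMeansClusterer(2, distance=cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(vectors, assign_clusters=True)

centroid = np.array(kclusterer.means())   # one centroid per cluster
cluster = np.array(assigned_clusters)     # cluster index per data point
feature_matrix = np.array(vectors)

nltk_inertia(feature_matrix, centroid)    # uses the corrected function above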
I want to classify the Iris flower dataset (I removed the labels, so it is unlabeled data now) using sklearn's k-means clustering function. I have made the prediction model, and the output seems to classify the data correctly for the most part. However, it chooses the labels randomly (0, 1 and 2), so I cannot compare them to my own labels to determine the accuracy (I have marked setosa as 0, versicolor as 1, virginica as 2). Is there any way to correctly label the flowers?
Here's the code:
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

cluster = KMeans(n_clusters=3)
cluster.fit(features)
pred = cluster.labels_
score = round(accuracy_score(pred, name_val), 4)
print('Accuracy scored using k-means clustering: ', score)
features, as expected, contains the features; name_val is the array of flower labels: 0 for setosa, 1 for versicolor, 2 for virginica.
Edit: one solution I came up with was setting random_state to a fixed number so that the labeling is constant. Is there any other solution?
You need to take a look at clustering metrics to evaluate your predictions. These include:
Homogeneity score
V-measure
Completeness score, and so on
Now take Completeness Score for example,
A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.
For example
from sklearn.metrics.cluster import completeness_score
print(completeness_score([0, 0, 1, 1], [1, 1, 0, 0]))
# Output: 1.0
This is similar to what you want. For you, the code would be completeness_score(pred, name_val). Note that the particular label assigned to a data point is not important; rather, the labelling of points with respect to each other is what matters.
Homogeneity, on the other hand, focuses on the quality of data points within the same cluster (each cluster should contain only members of a single class), whereas V-measure is defined as 2 * (homogeneity * completeness) / (homogeneity + completeness).
Read the official documentation here: Homogeneity, completeness and V-measure.
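For reference (not in the original answer), all three metrics are available in sklearn.metrics; a minimal sketch, assuming pred and name_val from the question:

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

print('Homogeneity:  ', homogeneity_score(name_val, pred))
print('Completeness: ', completeness_score(name_val, pred))
print('V-measure:    ', v_measure_score(name_val, pred))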
First of all, you are not classifying, you are clustering the data. Classification is a different process.
The K-Means algorithm includes randomness in choosing the initial cluster centers. By setting random_state you can reproduce the same clustering, since the initial cluster centers will be the same. However, this does not fix your problem. What you want is for the cluster with id 0 to be setosa, 1 to be versicolor, etc. This is not possible, because the K-Means algorithm has no knowledge of these categories; it only groups flowers by their similarity. What you can do is create a rule to determine which cluster corresponds to which category. For example, you can say that if more than 50% of the flowers that belong to a cluster are also in the setosa category, then this cluster should be mapped to the setosa category.
That's the best way of doing it that I can think of. However, this is not how we usually evaluate clustering quality; there are metrics you can use for that, such as the Silhouette Coefficient. I hope I helped.
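A minimal sketch of such a majority-vote mapping (not from the original answer), assuming pred and name_val from the question are integer NumPy arrays and breaking ties arbitrarily:

import numpy as np

# for each k-means cluster, find the most common true label among its members
mapping = {}
for c in np.unique(pred):
    members = name_val[pred == c]
    mapping[c] = np.bincount(members).argmax()

mapped_pred = np.array([mapping[c] for c in pred])
# mapped_pred can now be compared directly with name_val, e.g. via accuracy_score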
Reference from this blog https://smorbieu.gitlab.io/accuracy-from-classification-to-clustering-evaluation/
You need to get the cluster-to-label mapping from the confusion matrix with the Hungarian algorithm.
The code is below:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment as linear_assignment

def cluster_acc(y_true, y_pred):
    cm = metrics.confusion_matrix(y_true, y_pred)
    _make_cost_m = lambda x: -x + np.max(x)
    indexes = linear_assignment(_make_cost_m(cm))
    indexes = np.concatenate([indexes[0][:, np.newaxis], indexes[1][:, np.newaxis]], axis=-1)
    js = [e[1] for e in sorted(indexes, key=lambda x: x[0])]
    cm2 = cm[:, js]
    acc = np.trace(cm2) / np.sum(cm2)
    return acc
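For the question's variables this would be called as follows (a usage sketch, assuming name_val holds the true labels and pred the k-means labels):

acc = cluster_acc(name_val, pred)
print('Label-matched clustering accuracy:', acc)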
Or just use the coclust library:
from coclust.evaluation.external import accuracy
accuracy(labels, predicted_labels)
Today I'm trying to learn something about k-means. I have understood the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the right k, but I don't understand how to use it with scikit-learn. In scikit-learn I'm clustering things this way:
kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
kmeans.fit(data)
So should I do this several times for n_clusters = 1...n and watch the error rate to get the right k? I think this would be clumsy and take a lot of time?!
If the true label is not known in advance (as in your case), then K-Means clustering can be evaluated using either the Elbow Criterion or the Silhouette Coefficient.
Elbow Criterion Method:
The idea behind the elbow method is to run k-means clustering on a given dataset for a range of values of k (num_clusters, e.g. k=1 to 10) and, for each value of k, calculate the sum of squared errors (SSE).
After that, plot a line graph of the SSE for each value of k. If the line graph looks like an arm, the "elbow" on the arm (circled in red in the line graph below) is the optimal value of k (number of clusters).
Here, we want to minimize SSE. SSE tends to decrease toward 0 as we increase k (and SSE is 0 when k is equal to the number of data points in the dataset, because then each data point is its own cluster, and there is no error between it and the center of its cluster).
So the goal is to choose a small value of k that still has a low SSE, and the elbow usually represents where we start to have diminishing returns by increasing k.
Let's consider the iris dataset:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris['feature_names'])
#print(X)
data = X[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']]

sse = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    clusters = kmeans.labels_  # kept in a separate variable so the labels are not added to the features
    #print(clusters)
    sse[k] = kmeans.inertia_  # Inertia: sum of squared distances of samples to their closest cluster center

plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
Plot for above code:
We can see in the plot that 3 is the optimal number of clusters (circled in red) for the iris dataset, which is indeed correct.
Silhouette Coefficient Method:
From sklearn documentation,
A higher Silhouette Coefficient score relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
a: The mean distance between a sample and all other points in the same class.
b: The mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient for a single sample is then given as s = (b - a) / max(a, b).
Now, to find the optimal value of k for KMeans, loop through 2..n for n_clusters in KMeans and calculate the average Silhouette Coefficient over all samples for each k.
A higher Silhouette Coefficient indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
from sklearn.metrics import silhouette_score
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
y = load_iris().target

for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster).fit(X)
    label = kmeans.labels_
    sil_coeff = silhouette_score(X, label, metric='euclidean')
    print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, sil_coeff))
Output -
For n_clusters=2, The Silhouette Coefficient is 0.680813620271
For n_clusters=3, The Silhouette Coefficient is 0.552591944521
For n_clusters=4, The Silhouette Coefficient is 0.496992849949
For n_clusters=5, The Silhouette Coefficient is 0.488517550854
For n_clusters=6, The Silhouette Coefficient is 0.370380309351
For n_clusters=7, The Silhouette Coefficient is 0.356303270516
For n_clusters=8, The Silhouette Coefficient is 0.365164535737
For n_clusters=9, The Silhouette Coefficient is 0.346583642095
For n_clusters=10, The Silhouette Coefficient is 0.328266088778
As we can see, n_clusters=2 has the highest Silhouette Coefficient. This means that 2 should be the optimal number of clusters, right?
But here's the catch.
The iris dataset has 3 species of flower, which contradicts 2 as the optimal number of clusters. So despite n_clusters=2 having the highest Silhouette Coefficient, we would consider n_clusters=3 as the optimal number of clusters because:
The iris dataset has 3 species. (Most important)
n_clusters=3 has the 2nd highest Silhouette Coefficient.
So n_clusters=3 is the optimal number of clusters for the iris dataset.
Choosing the optimal number of clusters depends on the type of dataset and the problem we are trying to solve. But in most cases, taking the highest Silhouette Coefficient will yield an optimal number of clusters.
Hope it helps!
The elbow criterion is a visual method. I have not yet seen a robust mathematical definition of it.
But k-means is a pretty crude heuristic, too.
So yes, you will need to run k-means with k=1...kmax, then plot the resulting SSQ and decide upon an "optimal" k.
There exist advanced versions of k-means such as X-means that will start with k=2 and then increase it until a secondary criterion (AIC/BIC) no longer improves. Bisecting k-means is an approach that also starts with k=2 and then repeatedly splits clusters until k=kmax. You could probably extract the interim SSQs from it.
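As a side note (not from the original answer): recent scikit-learn versions (1.1+) ship a BisectingKMeans estimator. A rough sketch of using it over a range of k, collecting the SSQ (inertia) for each fit rather than extracting interim values from a single run, and assuming data is the array from the question:

from sklearn.cluster import BisectingKMeans

ssq = {}
for k in range(2, 11):
    model = BisectingKMeans(n_clusters=k, random_state=0).fit(data)
    ssq[k] = model.inertia_  # sum of squared distances to the assigned centers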
Either way, I have the impression that in any actual use case where k-means is really good, you do actually know the k you need beforehand. In these cases, k-means is not so much a "clustering" algorithm as a vector quantization algorithm, e.g. reducing the number of colors of an image to k (where you would often choose k to be, say, 32, because that is 5 bits of color depth and can be stored in a bit-compressed way), or e.g. in bag-of-visual-words approaches, where you would choose the vocabulary size manually; a popular value seems to be k=1000. You then don't really care much about the quality of the "clusters"; the main point is to be able to reduce an image to a 1000-dimensional sparse vector.
The performance of a 900-dimensional or an 1100-dimensional representation will not be substantially different.
For actual clustering tasks, i.e. when you want to analyze the resulting clusters manually, people usually use more advanced methods than k-means. K-means is more of a data simplification technique.
This answer is inspired by what OmPrakash has written. It contains code to plot both the SSE and the Silhouette Score. It is a general code snippet you can follow in any unsupervised-learning setting where you don't have labels and want to know the optimal number of clusters. There are 2 criteria: 1) the sum of squared errors (SSE) and 2) the Silhouette Score. You can follow OmPrakash's answer for the explanation; he's done a good job at that.
Assume your dataset is a data frame df1. Here I have used a different dataset just to show how we can use both criteria to help decide the optimal number of clusters. Here I think 6 is the correct number of clusters.
Then
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
elbow = []
ss = []
for n_clusters in range_n_clusters:
    # iterating through cluster sizes
    clusterer = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = clusterer.fit_predict(df1)
    # finding the average silhouette score
    silhouette_avg = silhouette_score(df1, cluster_labels)
    ss.append(silhouette_avg)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
    # finding the average SSE
    elbow.append(clusterer.inertia_)  # Inertia: sum of squared distances of samples to their closest cluster center

fig = plt.figure(figsize=(14, 7))
fig.add_subplot(121)
plt.plot(range_n_clusters, elbow, 'b-', label='Sum of squared error')
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.legend()
fig.add_subplot(122)
plt.plot(range_n_clusters, ss, 'b-', label='Silhouette Score')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.legend()
plt.show()