So I was trying to use the Elbow curve to find the value of optimum 'K' (number of clusters) in K-Means clustering.
The clustering was done for the average vectors (using Word2Vec) of a text column in my dataset (1467 rows). But looking at my text data, I can clearly find more than 3 groups the data can be grouped into.
I read the reasoning is to have a small value of k while keeping the Sum of Squared Errors (SSE) low. Can somebody tell me how reliable the Elbow Curve is?
Please also let me know if there's something I'm missing.
I am attaching the elbow curve for reference. As an exploratory check, I also tried plotting it for up to 70 clusters.
The "elbow" is not even well defined so how can it be reliable?
You can "normalize" the values by the expected dropoff from splitting the data into k clusters and it will become a bit more readable.
One example is the Calinski and Harabasz (1974) variance ratio criterion, which is essentially a rescaled version that makes much more sense.
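A minimal sketch of that idea, assuming scikit-learn is available: `calinski_harabasz_score` computes the variance ratio criterion, and the synthetic blobs below are only a stand-in for the averaged Word2Vec vectors (the four centers are an assumption made for illustration).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic stand-in data with four well-separated groups (an assumption).
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Score each candidate k with the variance ratio criterion (higher is better).
ch_scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    ch_scores[k] = calinski_harabasz_score(X, labels)

best_k = max(ch_scores, key=ch_scores.get)
print("k with the highest Calinski-Harabasz score:", best_k)
```

Unlike the raw SSE curve, this score is not guaranteed to keep improving as k grows, so a peak is usually easier to read than an elbow. On real text embeddings the peak may still be weak or absent.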
I have a dataset with 28000 records. The data consists of menu items from e-commerce stores. The challenge is the following:
Multiple stores have similar products but with different names. For example, 'HP laptop 1102' is present in different stores as 'HP laptop 1102', 'Hewlett-Packard laptop 1102', 'HP notebook 1102' and many other different names.
I have opted to convert the product list into TF-IDF vectors and use KMeans clustering to group similar products together. I am also using some other features like product category, sub category etc. (I have one-hot encoded all the categorical features.)
Now my challenge is to estimate the optimal n_clusters for the KMeans algorithm. As the clustering should occur at product level, I'm assuming I need a high n_clusters value. Is there any upper limit for n_clusters?
Also any suggestions and advice on the solution approach would be really helpful.
Thanks in advance.
You are optimising for k, so you could try an approach similar to this one here: how do I cluster a list of geographic points by distance?
As for max k, you can only ever have as many clusters as you have data points, so try using that as your upper bound.
The upper limit is the number of data points, but you almost surely want a number a good bit lower for clustering to provide any value. If you have 10,000 products I would think 5,000 clusters would be a rough maximum from a usefulness standpoint.
You can use the silhouette score and inertia metrics to help determine the optimal number of clusters.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of....
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. - from the scikit-learn docs
inertia_ is an attribute of a fitted clustering object in scikit-learn - not a separate evaluation metric.
It is the "Sum of squared distances of samples to their closest cluster center." - see the KMeans clustering docs in scikit-learn, for example.
Note that inertia decreases as you add more clusters, so you may want to use an elbow plot to visualize where the change becomes minimal.
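As a rough illustration of how the two quantities behave, here is a small sketch using scikit-learn's `silhouette_score` and the `inertia_` attribute. The three-blob data set is made up for the example and is not the product data from the question.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with three groups (an assumption, only for illustration).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [8, 0], [0, 8]],
    cluster_std=0.7,
    random_state=1,
)

results = []
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    # inertia_: sum of squared distances to the closest center (lower is tighter).
    # silhouette_score: mean of (b - a) / max(a, b) over all samples (higher is better).
    results.append((k, km.inertia_, silhouette_score(X, km.labels_)))

for k, inertia, sil in results:
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")

best_k_by_silhouette = max(results, key=lambda r: r[2])[0]
```

Inertia on its own always favours more clusters, which is why it is usually read from an elbow plot, while the silhouette score can be compared across k directly.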
So I am exploring using a logistic regression model to predict the probability of a shot resulting in a goal. I have two predictors, but for simplicity let's assume I have one predictor: distance from the goal. While doing some data exploration, I decided to investigate the relationship between distance and the result of a shot. I did this graphically by splitting the data into equal-size bins and then taking the mean of all the results (0 for a miss and 1 for a goal) within each bin. Then I plotted the average distance from goal for each bin vs. the probability of scoring. I did this in Python:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# df is assumed to be the shots DataFrame with 'Distance' and 'Goal' columns
# use the seaborn library to inspect the distribution of the shots by result (goal or no goal)
fig, axes = plt.subplots(1, 2, figsize=(11, 5))
# first we want to create bins to calculate our probability
# pandas has a function qcut that evenly distributes the data
# into n bins based on a desired column value
df['Goal'] = df['Goal'].astype(int)
df['Distance_Bins'] = pd.qcut(df['Distance'], q=50)
# now we want to find the mean of the Goal column (our estimated probability) for each bin
# and the mean of the distance for each bin
dist_prob = df.groupby('Distance_Bins', as_index=False)['Goal'].mean()['Goal']
dist_mean = df.groupby('Distance_Bins', as_index=False)['Distance'].mean()['Distance']
dist_trend = sns.scatterplot(x=dist_mean, y=dist_prob, ax=axes[0])
dist_trend.set(xlabel="Avg. Distance of Bin",
               ylabel="Probability of Goal",
               title="Probability of Scoring Based on Distance")
[Figure: Probability of Scoring Based on Distance]
So my question is: why would we go through the process of creating a logistic regression model when I could fit a curve to the plot in the image? Would that not provide a function that predicts a probability for a shot at distance x?
I guess the problem would be that we are reducing, say, 40,000 data points to 50, but I'm not entirely sure why this would be a problem for predicting future shots. Could we increase the number of bins, or would that just add variability? Is this a case of the bias-variance trade-off? I'm just a little confused about why this would not be as good as a logistic model.
The binning method is a bit more finicky than the logistic regression since you need to try different types of plots to fit the curve (e.g. inverse relationship, log, square, etc.), while for logistic regression you only need to adjust the learning rate to see results.
If you are using one feature (your "Distance" predictor), I wouldn't see much difference between the binning method and the logistic regression. However, when you are using two or more features (I see "Distance" and "Angle" in the image you provided), how would you plan to combine the probabilities for each to make a final 0/1 classification? It can be tricky. For one, perhaps "Distance" is more useful a predictor than "Angle". However, logistic regression does that for you because it can adjust the weights.
Regarding your binning method, if you use fewer bins you might see more bias since the data may be more complicated than you think, but this is not that likely because your data looks quite simple at first glance. However, if you use more bins that would not significantly increase variance, assuming that you fit the curve without varying the order of the curve. If you change the order of the curve you fit, then yes, it will increase variance. However, your data seems like it is amenable to a very simple fit if you go with this method.
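To make the point about weights concrete, here is a hedged sketch with scikit-learn's `LogisticRegression` on simulated shots. The data, the "true" coefficients, and the column names `Distance` and `Angle` are assumptions for illustration only, not your data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated shots (an assumption, not the asker's data).
rng = np.random.default_rng(0)
n = 40_000
shots = pd.DataFrame({
    "Distance": rng.uniform(1, 40, n),
    "Angle": rng.uniform(0, 90, n),
})
# Hypothetical "true" relationship, used only to generate the labels.
true_logit = 1.0 - 0.15 * shots["Distance"] - 0.01 * shots["Angle"]
shots["Goal"] = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

# A large C makes the default L2 penalty negligible.
model = LogisticRegression(C=1e6, max_iter=1000)
model.fit(shots[["Distance", "Angle"]], shots["Goal"])

# One fitted weight per feature, learned jointly from all 40,000 rows.
dist_coef, angle_coef = model.coef_[0]
print(f"distance weight={dist_coef:.3f}, angle weight={angle_coef:.3f}")

# Probability for a single new shot that uses both features.
new_shot = pd.DataFrame({"Distance": [12.0], "Angle": [30.0]})
p_goal = model.predict_proba(new_shot)[0, 1]
print(f"P(goal) for the new shot: {p_goal:.3f}")
```

The model works on every raw observation rather than 50 bin averages, and it weighs the two predictors against each other without a separate step for combining per-feature curves.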
I performed K-means clustering with a variety of k values and got the inertia for each k value (inertia being the sum of squared distances of samples to their closest cluster center, to my knowledge):
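import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# trialsX is the feature matrix being clustered
ks = range(1, 30)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k).fit(trialsX)
    inertias.append(km.inertia_)
plt.plot(ks, inertias)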
Based on my reading, the optimal k value lies at the 'elbow' of this plot, but the calculation of the elbow has proven elusive. How can you programmatically use this data to calculate k?
I'll post this, because it's the best I have come up with thus far:
It seems like using some threshold scaled to the range of the first derivative along the curve might do a good job. This can be done by fitting a spline:
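import numpy as np
from scipy.interpolate import UnivariateSpline

# interpolating quartic spline through the (k, inertia) points
y_spl = UnivariateSpline(ks, inertias, s=0, k=4)
x_range = np.linspace(ks[0], ks[-1], 1000)
# first derivative of the spline, plotted along the curve
y_spl_1d = y_spl.derivative(n=1)
plt.plot(x_range, y_spl_1d(x_range))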
Then you can probably define k as the point that is, say, 90% of the way up this derivative curve. I would imagine this is a pretty consistent way to do it, but there may be a better option.
EDIT: 2 years later, just use np.diff to generate this plot without fitting a spline, then find the point where the slope equals -1. See the comments for more info.
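A minimal sketch of the np.diff idea. One assumption is added here that the answer does not state: both axes are rescaled to [0, 1] first, so that a slope of -1 means the same thing whatever the units of inertia. The helper name `elbow_by_slope` and the 1000 / k curve are made up for the example.

```python
import numpy as np

def elbow_by_slope(ks, inertias, threshold=-1.0):
    """Return the first k at which the rescaled curve is flatter than `threshold`.

    Hypothetical helper: both axes are scaled to [0, 1] before differencing.
    """
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (inertias - inertias.min()) / (inertias.max() - inertias.min())
    slopes = np.diff(y) / np.diff(x)  # finite-difference slope between neighbours
    flat = np.flatnonzero(slopes > threshold)
    # Fall back to the largest k if the curve never flattens enough.
    return int(ks[flat[0]]) if flat.size else int(ks[-1])

# Made-up inertia curve standing in for the `inertias` list computed above.
ks = range(1, 30)
inertias = [1000.0 / k for k in ks]
elbow_k = elbow_by_slope(ks, inertias)
print("estimated elbow:", elbow_k)
```

The result depends on the range of k that was scanned and on the chosen threshold, so treat it as a heuristic rather than a definitive answer.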
I would like to choose an optimal number of clusters for my dataset using the silhouette score. My data set contains information about 2,000+ brands, including the number of customers who purchased each brand, the sales for the brand, and the number of goods the brand sells under each category.
Since my data set is quite sparse, I've used MaxAbsScaler and TruncatedSVD before clustering.
The clustering method I use is k-means since I'm most familiar with this one (I would appreciate your suggestion on other clustering method).
When I set the cluster number to 80 and run k-means, I get a different silhouette score each time. Is it because k-means gives different clusters each time?
Sometimes the silhouette score for 80 clusters is lower than the score for 200 clusters, and sometimes it's the opposite. So I'm confused about how to choose a reasonable number of clusters.
Besides, the range of my silhouette score is quite small (from 0.15 to 0.2), and the score doesn't change a lot as I increase the number of clusters.
Here is the result I got from running Silhouette score:
For n_clusters=80, The Silhouette Coefficient is 0.17329035592930178
For n_clusters=100, The Silhouette Coefficient is 0.16970208098407866
For n_clusters=200, The Silhouette Coefficient is 0.1961679920561574
For n_clusters=300, The Silhouette Coefficient is 0.19367019831221857
For n_clusters=400, The Silhouette Coefficient is 0.19818865972762675
For n_clusters=500, The Silhouette Coefficient is 0.19551544844885604
For n_clusters=600, The Silhouette Coefficient is 0.19611760638136203
I would much appreciate your suggestions! Thanks in advance!
Yes, k-means is randomized, so it doesn't always give the same result.
Usually that means this k is NOT good.
But don't blindly rely on silhouette. It's not reliable enough to find the "best" k. Largely, because there usually is no best k at all.
Look at the data, and use your understanding to choose a good clustering instead. Don't expect anything good to come out automatically.
I think you are using sklearn, so setting the random_state parameter to a number should give you reproducible results across different executions of k-means for the same k. You can set that number to 0, 42 or whatever you want; just keep the same number for different runs of your code and the results will be the same.
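For example, a small sketch of this (the blob data and the helper name `fit_and_score` are made up for illustration): two runs with the same `random_state` should produce the same labels and therefore the same silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data standing in for the scaled and SVD-reduced brand features (an assumption).
X, _ = make_blobs(n_samples=500, centers=6, random_state=0)

def fit_and_score(seed):
    # Hypothetical helper: fit k-means with a fixed seed and score the result.
    km = KMeans(n_clusters=8, n_init=10, random_state=seed).fit(X)
    return km.labels_, silhouette_score(X, km.labels_)

labels_a, score_a = fit_and_score(42)
labels_b, score_b = fit_and_score(42)

print("same labels:", np.array_equal(labels_a, labels_b))
print("scores:", score_a, score_b)
```

Fixing the seed only makes a run repeatable. It does not make that particular clustering better than the ones other seeds would give.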
I've been trying out the kmeans clustering algorithm implementation in scipy. Are there any standard, well-defined metrics that could be used to measure the quality of the clusters generated?
That is, I have the expected labels for the data points that are clustered by kmeans. Once I get the clusters that have been generated, how do I evaluate the quality of these clusters with respect to the expected labels?
I am doing this very thing at the moment with Spark's KMeans.
I am using:
The sum of squared distances of points to their nearest center (implemented in computeCost()).
The Unbalanced factor (see "Unbalanced factor of KMeans?" for an implementation and "Understanding the quality of the KMeans algorithm" for an explanation).
Both quantities indicate a better clustering when they are small (the smaller, the better).
K-means attempts to minimise the sum of squared distances to cluster centers. I would compare this sum for the K-means clusters with the same sum computed for the clusters you get if you group the points by their expected labels.
There are two possibilities for the result. If the K-means sum of squares is larger than that of the expected-label clustering, then your k-means implementation is buggy or did not start from a good set of initial cluster assignments; you could think about increasing the number of random starts you are using, or debugging it. If the K-means sum of squares is smaller than that of the expected-label clustering, and the K-means clusters are not very similar to the expected-label clustering (that is, two points chosen at random that share an expected label are not usually in the same K-means cluster, and vice versa), then the sum of squares from cluster centers is not a good way of splitting your points into clusters, and you need to use a different distance function, look at different attributes, or use a different sort of clustering.
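A sketch of that comparison, assuming NumPy arrays and scikit-learn's KMeans rather than scipy's. The helper `within_cluster_ss` and the toy data are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def within_cluster_ss(X, labels):
    """Sum of squared distances of points to the mean of their own group."""
    total = 0.0
    for lab in np.unique(labels):
        pts = X[labels == lab]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

# Toy data in which `expected` plays the role of the known labels (an assumption).
X, expected = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 0], [0, 6]],
    cluster_std=1.0,
    random_state=0,
)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
ss_kmeans = within_cluster_ss(X, km.labels_)
ss_expected = within_cluster_ss(X, expected)

print(f"k-means SS:        {ss_kmeans:.1f}")
print(f"expected-label SS: {ss_expected:.1f}")
```

If `ss_kmeans` came out clearly larger than `ss_expected`, that would point to the first case above (a bad start or a bug). If it is smaller but the two partitions disagree, that is the second case.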
In your case, since you do have the samples' true labels, validation is very easy.
First of all, compute the confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix). Then derive all the relevant measures from it: true positives, false negatives, false positives and true negatives. From those you can find the precision, recall, miss rate, etc.
Make sure you understand the meaning of all above. They basically tell you how well your clustering predicted / recognized the true nature of your data.
If you're using python, just use the sklearn package:
http://scikit-learn.org/stable/modules/model_evaluation.html
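One caveat when doing this in code: cluster ids are arbitrary, so they have to be matched to the true labels before a confusion matrix reads like a classification result. A rough sketch, assuming integer labels starting at 0 and as many clusters as classes. The Hungarian matching via `scipy.optimize.linear_sum_assignment` is one way to do the matching, not the only one, and the tiny arrays are made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Made-up example: true labels and the cluster ids returned by k-means.
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
cluster_ids = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1, 1])

# Rows are true labels, columns are cluster ids.
cm = confusion_matrix(true_labels, cluster_ids)

# Match each cluster id to the true label that maximizes the total overlap.
rows, cols = linear_sum_assignment(-cm)
mapping = dict(zip(cols, rows))  # cluster id -> true label
remapped = np.array([mapping[c] for c in cluster_ids])

accuracy = (remapped == true_labels).mean()
# The adjusted Rand index needs no matching step at all.
ari = adjusted_rand_score(true_labels, cluster_ids)

print(confusion_matrix(true_labels, remapped))
print("accuracy after matching:", accuracy)
print("adjusted Rand index:", ari)
```

After remapping, precision and recall can be read off the matrix as usual, or computed with the functions listed on the scikit-learn page above.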
In addition, it's nice to run some internal validation, to see how well your clusters are separated. There are known internal validity measures, like:
Silhouette
DB index
Dunn index
Calinski-Harabasz measure
Gamma score
Normalized Cut
etc.
Read more here: "An extensive comparative study of cluster validity indices",
Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, Iñigo Perona