Maximum value for n_clusters in K Means algorithm - python

I have a dataset with 28,000 records. The data consists of menu items from e-commerce stores. The challenge is the following:
Multiple stores have similar products but with different names. For example, 'HP laptop 1102' is present in different stores as 'HP laptop 1102', 'Hewlett-Packard laptop 1102', 'HP notebook 1102' and many other different names.
I have opted to convert the product list into TF-IDF vectors and use KMeans clustering to group similar products together. I am also using some other features like product category, sub-category, etc. (I have one-hot encoded all the categorical features.)
Now my challenge is to estimate the optimal n_clusters for the KMeans algorithm. Since the clustering should occur at product level, I'm assuming I need a high n_clusters value. Is there an upper limit for n_clusters?
Also any suggestions and advice on the solution approach would be really helpful.
Thanks in advance.
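For context, the approach described (TF-IDF on names plus one-hot encoded categories, fed to KMeans) could be sketched in scikit-learn roughly as follows; the toy DataFrame and column names are hypothetical stand-ins for the real 28,000-record data:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the real product list
df = pd.DataFrame({
    "name": ["HP laptop 1102", "Hewlett-Packard laptop 1102",
             "HP notebook 1102", "Dell XPS 13"],
    "category": ["laptop", "laptop", "laptop", "laptop"],
})

# TF-IDF features from product names, one-hot features from categories
tfidf = TfidfVectorizer().fit_transform(df["name"])
onehot = OneHotEncoder().fit_transform(df[["category"]])
X = hstack([tfidf, onehot])  # combined sparse feature matrix

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

KMeans in scikit-learn accepts the sparse matrix directly, so there is no need to densify the TF-IDF features.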

You are optimising for k, so you could try an approach similar to this one here: how do I cluster a list of geographic points by distance?
As for max k, you can only ever have as many clusters as you have data points, so use that as your upper bound.

The upper limit is the number of data points, but you almost surely want a number a good bit lower for clustering to provide any value. If you have 10,000 products I would think 5,000 clusters would be a rough maximum from a usefulness standpoint.
You can use the silhouette score and inertia metrics to help determine the optimal number of clusters.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of....
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. - from the scikit-learn docs
inertia_ is an attribute of a fitted clustering object in scikit-learn - not a separate evaluation metric.
It is the "Sum of squared distances of samples to their closest cluster center." - see the KMeans clustering docs in scikit-learn, for example.
Note that inertia always decreases as you add more clusters, so you may want to use an elbow plot to visualize where the change becomes minimal.
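As a sketch, both metrics can be computed with scikit-learn over a range of candidate k values (synthetic blob data here stands in for the real feature matrix):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real feature matrix
X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

results = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

# Inertia always decreases as k grows; look for the "elbow" where the
# marginal decrease flattens. Silhouette has a peak, so the k that
# maximizes it can be read off directly.
best_k = max(results, key=lambda k: results[k][1])
print(best_k, results[best_k])
```

For a dataset the size of the one in the question, it may be worth sampling the data or using MiniBatchKMeans to keep this search affordable.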

Related

k means clustering with fixed constraints (sum of specific attribute should be less than or equal 90,000)

Suppose I have 20,000 features on a map, and each feature has many attributes (as well as the latitude and longitude). One of the attributes is called population.
I want to split these 20,000 features into 3 clusters where the total sum of population of each cluster equals a specific value, 90,000, and the features in each cluster should be near each other (i.e., locations are taken into consideration).
So, the output clusters should have the following conditions:
Sum(population) of all points/items/features in cluster 1=90,000
Sum(population) of all points/items/features in cluster 2=90,000
Sum(population) of all points/items/features in cluster 3=90,000
I tried to use the k-means clustering algorithm, which gave me 3 clusters, but how do I enforce the above constraint (sum of population should equal 90,000)?
Any idea is appreciated.
A turnkey solution will not work for you.
You'll have to formulate this as a standard constrained optimization problem and run a solver to optimize it. It's fairly straightforward: take the k-means objective and add your constraints...
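As an illustration only (not the solver formulation the answer suggests), one crude heuristic is to greedily assign each point to its nearest center subject to a population cap. This hypothetical sketch will not hit an exact 90,000 target, but it shows the shape of a capacity-constrained assignment:

```python
import numpy as np

def capped_assign(points, populations, centers, cap):
    """Greedily assign each point to the nearest center whose running
    population total stays within `cap`. Returns cluster indices
    (-1 if no cluster can accept the point) and per-cluster totals."""
    totals = np.zeros(len(centers))
    labels = np.full(len(points), -1)
    # Place high-population points first, since they are hardest to fit
    order = np.argsort(-populations)
    for i in order:
        d = np.linalg.norm(centers - points[i], axis=1)
        for c in np.argsort(d):  # try centers from nearest to farthest
            if totals[c] + populations[i] <= cap:
                labels[i] = c
                totals[c] += populations[i]
                break
    return labels, totals

rng = np.random.default_rng(0)
pts = rng.random((30, 2))
pops = rng.integers(1, 10, size=30)
centers = np.array([[0.2, 0.2], [0.8, 0.2], [0.5, 0.8]])
cap = pops.sum() / 3 + 5  # hypothetical per-cluster population cap
labels, totals = capped_assign(pts, pops, centers, cap)
print(labels, totals)
```

A proper solution would treat this as an integer program (minimize total distance subject to equality constraints on population sums) and hand it to a real solver, iterating center updates as in ordinary k-means.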

How reliable is the Elbow curve in finding K in K-Means?

So I was trying to use the Elbow curve to find the value of optimum 'K' (number of clusters) in K-Means clustering.
The clustering was done for the average vectors (using Word2Vec) of a text column in my dataset (1467 rows). But looking at my text data, I can clearly find more than 3 groups the data can be grouped into.
I read the reasoning is to have a small value of k while keeping the Sum of Squared Errors (SSE) low. Can somebody tell me how reliable the Elbow Curve is?
Also if there's something I'm missing.
Attaching the elbow curve for reference. I also tried plotting it up to 70 clusters, just to explore.
The "elbow" is not even well defined so how can it be reliable?
You can "normalize" the values by the expected dropoff from splitting the data into k clusters and it will become a bit more readable.
For example, the Calinski and Harabasz (1974) variance ratio criterion. It is essentially a rescaled version that makes much more sense.
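The Calinski-Harabasz criterion is available in scikit-learn; a minimal sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic stand-in for the Word2Vec average vectors
X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

# Unlike raw SSE, this between/within variance ratio has a peak rather
# than a monotone trend, so the best k can be read off directly.
best_k = max(scores, key=scores.get)
print(best_k)
```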

Different silhouette scores for the same data and number of clusters

I would like to choose an optimal number of clusters for my dataset using the silhouette score. My data set contains information about 2,000+ brands, including the number of customers who purchased each brand, sales for the brand, and the number of goods the brand sells under each category.
Since my data set is quite sparse, I've used MaxAbsScaler and TruncatedSVD before clustering.
The clustering method I use is k-means since I'm most familiar with this one (I would appreciate your suggestion on other clustering method).
When I set the cluster number to 80 and run k-means, I got different silhouette score each time. Is it because k-means gives different clusters each time?
Sometimes the silhouette score for 80 clusters is less than the score for 200 clusters, and sometimes it's the opposite. So I'm confused about how to choose a reasonable number of clusters.
Besides, the range of my silhouette score is quite small and doesn't change a lot as I increase the number of clusters, which ranges from 0.15 to 0.2.
Here is the result I got from running Silhouette score:
For n_clusters=80, The Silhouette Coefficient is 0.17329035592930178
For n_clusters=100, The Silhouette Coefficient is 0.16970208098407866
For n_clusters=200, The Silhouette Coefficient is 0.1961679920561574
For n_clusters=300, The Silhouette Coefficient is 0.19367019831221857
For n_clusters=400, The Silhouette Coefficient is 0.19818865972762675
For n_clusters=500, The Silhouette Coefficient is 0.19551544844885604
For n_clusters=600, The Silhouette Coefficient is 0.19611760638136203
I would much appreciate your suggestions! Thanks in advance!
Yes, k-means is randomized, so it doesn't always give the same result.
Usually that means this k is NOT good.
But don't blindly rely on silhouette. It's not reliable enough to find the "best" k, largely because there usually is no best k at all.
Look at the data, and use your understanding to choose a good clustering instead. Don't expect anything good to come out automatically.
I think you are using sklearn, so setting the random_state parameter to a number should give you reproducible results across executions of k-means for the same k. You can set that number to 0, 42 or whatever you want; just keep the same number for different runs of your code and the results will be the same.
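A small sketch of this on synthetic data: with random_state fixed, two fits of the same k converge to the same partition, so the silhouette scores match.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=6, random_state=0)

# Without a fixed seed, two runs may land in different local optima and
# report different silhouette scores; with random_state fixed they agree.
s1 = silhouette_score(X, KMeans(n_clusters=6, n_init=10,
                                random_state=42).fit_predict(X))
s2 = silhouette_score(X, KMeans(n_clusters=6, n_init=10,
                                random_state=42).fit_predict(X))
print(s1 == s2)
```

Note this only makes the score reproducible; it does not make it a better estimate of cluster quality, as the answer above points out.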

Kmeans is not producing an elbow

I have a data frame of about 300,000 unique product names and I am trying to use k means to cluster similar names together. I used sklearn's tfidfvectorizer to vectorize the names and convert to a tf-idf matrix.
Next I ran k means on the tf-idf matrix with number of clusters ranging from 5 to 25. Then I plotted the inertia for each # of clusters.
Based on the plot am I approaching the problem wrong? What are some takeaways from this if there is no distinct elbow?
Most likely because k-means with TF-IDF doesn't work well on short text such as product names.
Not seeing an elbow is an indication that the results aren't good.

What's a good metric to analyze the quality of the output of a clustering algorithm?

I've been trying out the kmeans clustering algorithm implementation in scipy. Are there any standard, well-defined metrics that could be used to measure the quality of the clusters generated?
ie, I have the expected labels for the data points that are clustered by kmeans. Now, once I get the clusters that have been generated, how do I evaluate the quality of these clusters with respect to the expected labels?
I was doing this very thing at the time with Spark's KMeans.
I am using:
The sum of squared distances of points to their nearest center (implemented in computeCost()).
The unbalanced factor (see Unbalanced factor of KMeans? for an implementation and Understanding the quality of the KMeans algorithm for an explanation).
Both quantities indicate a better clustering when they are small (the less, the better).
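A scikit-learn equivalent of both quantities can be sketched as follows. The cost is just inertia_; the unbalanced factor is shown in one common formulation (an assumption here, not taken from the linked posts), normalized so that a perfectly balanced clustering scores 1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Sum of squared distances to the nearest center: Spark's computeCost()
# corresponds to scikit-learn's inertia_ attribute.
cost = km.inertia_

# Unbalanced factor, in one common formulation: k * sum(n_i^2) / n^2.
# It equals 1 for perfectly balanced clusters and grows with imbalance.
sizes = np.bincount(km.labels_)
k, n = len(sizes), len(X)
unbalanced = k * np.sum(sizes.astype(float) ** 2) / n ** 2
print(cost, unbalanced)
```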
Kmeans attempts to minimise a sum of squared distances to cluster centers. I would compare the result of this with the Kmeans clusters with the result of this using the clusters you get if you sort by expected labels.
There are two possibilities for the result.
If the KMeans sum of squares is larger than that of the expected-label clustering, then your KMeans implementation is buggy or did not start from a good set of initial cluster assignments; you could think about increasing the number of random starts, or debug it.
If the KMeans sum of squares is smaller than the expected-label sum of squares, but the KMeans clusters are not very similar to the expected-label clustering (that is, two points chosen at random are often in the same expected-label cluster but not in the same KMeans cluster, or vice versa), then sum of squares from cluster centers is not a good way of splitting your points into clusters, and you need to use a different distance function, look at different attributes, or use a different sort of clustering.
In your case, when you do have the samples true label, validation is very easy.
First of all, compute the confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix). Then derive from it all the relevant counts: true positives, false negatives, false positives and true negatives. From those you can compute precision, recall, miss rate, etc.
Make sure you understand the meaning of all above. They basically tell you how well your clustering predicted / recognized the true nature of your data.
If you're using python, just use the sklearn package:
http://scikit-learn.org/stable/modules/model_evaluation.html
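One caveat when comparing cluster labels against expected labels: cluster IDs are arbitrary, so permutation-invariant scores such as adjusted Rand index or normalized mutual information are often more convenient than a raw confusion matrix. A minimal sketch with sklearn:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Synthetic data with known ground-truth labels
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Both scores are invariant to relabeling of the clusters
ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
print(ari, nmi)
```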
In addition, it's nice to run some internal validation, to see how well your clusters are separated. There are known internal validity measures, like:
Silhouette
DB index
Dunn index
Calinski-Harabasz measure
Gamma score
Normalized Cut
etc.
Read more here: An extensive comparative study of cluster validity indices
Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, Iñigo Perona
