I have a data frame of about 300,000 unique product names and I am trying to use k-means to cluster similar names together. I used scikit-learn's TfidfVectorizer to vectorize the names and convert them to a TF-IDF matrix.
Next I ran k-means on the TF-IDF matrix with the number of clusters ranging from 5 to 25, then plotted the inertia for each number of clusters.
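Roughly, the pipeline looks like this (a sketch; the column name and parameter choices are placeholders rather than my exact code):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# df["product_name"] holds the ~300,000 unique product names
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df["product_name"])

inertias = []
ks = range(5, 26)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("inertia")
plt.show()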
Based on the plot, am I approaching the problem wrong? What are some takeaways if there is no distinct elbow?
Most likely it is because k-means with TF-IDF doesn't work well on very short texts such as product names.
Not seeing an elbow is an indication that the results aren't good.
Related
I have a dataset with 28,000 records. The data consists of e-commerce store menu items. The challenge is the following:
Multiple stores have similar products but with different names. For example, 'HP laptop 1102' is present in different stores as 'HP laptop 1102', 'Hewlett-Packard laptop 1102', 'HP notebook 1102' and many other different names.
I have opted to convert the product list to TF-IDF vectors and use KMeans clustering to group similar products together. I am also using some other features like product category, sub-category, etc. (I have one-hot encoded all the categorical features.)
Now my challenge is to estimate the optimal n_clusters for the KMeans algorithm. As the clustering should occur at the product level, I'm assuming I need a high n_clusters value. Is there any upper limit for n_clusters?
Also any suggestions and advice on the solution approach would be really helpful.
Thanks in advance.
You are optimising for k, so you could try an approach similar to this one here: how do I cluster a list of geographic points by distance?
As for the maximum k, you can only ever have as many clusters as you have data points, so try using that as your upper bound.
The upper limit is the number of data points, but you almost surely want a number a good bit lower for clustering to provide any value. If you have 10,000 products I would think 5,000 clusters would be a rough maximum from a usefulness standpoint.
You can use the silhouette score and inertia metrics to help determine the optimal number of clusters.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of....
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. - from the scikit-learn docs
inertia_ is an attribute of a fitted clustering object in scikit-learn - not a separate evaluation metric.
It is the "Sum of squared distances of samples to their closest cluster center." - see the KMeans clustering docs in scikit-learn, for example.
Note that inertia increases as you add more clusters, so you may want to use an elbow plot to visualize where the change becomes minimal.
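As a rough sketch of computing both over a range of k with scikit-learn (X is a placeholder for your feature matrix; on large datasets the silhouette is usually computed on a sample via sample_size):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 30):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_, sample_size=10000, random_state=0)
    # inertia_ always shrinks as k grows; silhouette rewards well-separated clusters
    print(k, km.inertia_, sil)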
Suppose I have 20,000 features on a map, and each feature has many attributes (as well as latitude and longitude). One of the attributes is called population.
I want to split these 20,000 features into 3 clusters where the total sum of population in each cluster equals a specific value, 90,000, and the features in each cluster are near each other (i.e., locations are taken into consideration).
So, the output clusters should have the following conditions:
Sum(population) of all points/items/features in cluster 1=90,000
Sum(population) of all points/items/features in cluster 2=90,000
Sum(population) of all points/items/features in cluster 3=90,000
I tried to use the k-means clustering algorithm, which gave me 3 clusters, but how do I enforce the above constraint (the sum of population should equal 90,000)?
Any idea is appreciated.
A turnkey solution will not work for you.
You'll have to formulate this as a standard constrained optimization problem and run a solver on it. It's fairly straightforward: take the k-means objective and add your constraints...
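As an illustration only, here is one way to sketch the assignment step with the cluster centres held fixed, using PuLP as the solver (the solver choice, the tolerance on the equality constraint, and all names are my assumptions, not a prescribed solution; in practice you would alternate this step with re-computing the centres, as in ordinary k-means, and an exact MIP may be slow for 20,000 points):

import numpy as np
import pulp

def constrained_assignment(X, pop, centers, target=90000, tol=500):
    # X: (n, 2) lat/lon array, pop: (n,) populations, centers: (k, 2) fixed centres
    n, k = len(X), len(centers)
    # squared distance of every point to every centre (the k-means objective term)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

    prob = pulp.LpProblem("constrained_kmeans_assignment", pulp.LpMinimize)
    # binary assignment variables: a[i][j] == 1 if point i goes to cluster j
    a = pulp.LpVariable.dicts("a", (range(n), range(k)), cat="Binary")

    # k-means objective: total squared distance of points to their assigned centre
    prob += pulp.lpSum(d2[i, j] * a[i][j] for i in range(n) for j in range(k))

    # each point belongs to exactly one cluster
    for i in range(n):
        prob += pulp.lpSum(a[i][j] for j in range(k)) == 1

    # population constraint: each cluster's total population must be near the target
    for j in range(k):
        cluster_pop = pulp.lpSum(pop[i] * a[i][j] for i in range(n))
        prob += cluster_pop >= target - tol
        prob += cluster_pop <= target + tol

    prob.solve()
    return np.array([max(range(k), key=lambda j: a[i][j].value()) for i in range(n)])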
So I was trying to use the Elbow curve to find the value of optimum 'K' (number of clusters) in K-Means clustering.
The clustering was done for the average vectors (using Word2Vec) of a text column in my dataset (1467 rows). But looking at my text data, I can clearly find more than 3 groups the data can be grouped into.
I read the reasoning is to have a small value of k while keeping the Sum of Squared Errors (SSE) low. Can somebody tell me how reliable the Elbow Curve is?
Also, is there something I'm missing?
Attaching the elbow curve for reference. I also tried plotting it up to 70 clusters, just to explore.
The "elbow" is not even well defined so how can it be reliable?
You can "normalize" the values by the expected dropoff from splitting the data into k clusters and it will become a bit more readable.
For example, the Calinski and Harabasz (1974) variance ratio criterion. It is essentially a rescaled version that makes much more sense.
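A rough sketch of using that criterion with scikit-learn (X and the range of k are placeholders):

from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

for k in range(2, 20):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # ratio of between-cluster to within-cluster dispersion; higher is better
    print(k, calinski_harabasz_score(X, labels))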
I've been trying out the kmeans clustering algorithm implementation in scipy. Are there any standard, well-defined metrics that could be used to measure the quality of the clusters generated?
I.e., I have the expected labels for the data points that are clustered by kmeans. Now, once I get the clusters that have been generated, how do I evaluate their quality with respect to the expected labels?
I was doing this very thing at the time with Spark's KMeans.
I am using:
The sum of squared distances of points to their nearest center (implemented in computeCost()).
The unbalanced factor (see "Unbalanced factor of KMeans?" for an implementation and "Understanding the quality of the KMeans algorithm" for an explanation).
Both quantities indicate a better clustering when they are small (the less, the better).
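For the first metric, a minimal sketch with the RDD-based PySpark API (assuming an existing SparkContext and an RDD of feature vectors called rdd; k and the other parameters are placeholders):

from pyspark.mllib.clustering import KMeans

model = KMeans.train(rdd, k=10, maxIterations=20, seed=1)
# sum of squared distances of points to their nearest centre
wssse = model.computeCost(rdd)
print(wssse)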
K-means attempts to minimise the sum of squared distances to cluster centers. I would compare this sum for the KMeans clusters with the same sum computed for the clusters you get if you group the points by their expected labels.
There are two possibilities for the result. If the KMeans sum of squares is larger than that of the expected-label clustering, then your KMeans implementation is buggy or did not start from a good set of initial cluster assignments; you could think about increasing the number of random starts you use, or debug it. If the KMeans sum of squares is smaller than that of the expected-label clustering, and the KMeans clusters are not very similar to the expected-label clustering (that is, two points chosen at random are often in the same expected-label cluster but in different KMeans clusters, or vice versa), then sum of squares from cluster centers is not a good way of splitting your points up into clusters, and you need to use a different distance function, look at different attributes, or use a different sort of clustering.
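A small sketch of that comparison with scikit-learn and NumPy (X and true_labels are placeholders for your data and expected labels):

import numpy as np
from sklearn.cluster import KMeans

k = len(np.unique(true_labels))
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
kmeans_sse = km.inertia_  # sum of squared distances to the fitted centers

# the same objective, but for the grouping given by the expected labels
label_sse = 0.0
for lab in np.unique(true_labels):
    pts = X[true_labels == lab]
    label_sse += ((pts - pts.mean(axis=0)) ** 2).sum()

print(kmeans_sse, label_sse)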
In your case, when you do have the samples' true labels, validation is very easy.
First of all, compute the confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix). Then derive from it all the relevant quantities: true positives, false negatives, false positives and true negatives. From these you can find the precision, recall, miss rate, etc.
Make sure you understand the meaning of all of the above. They basically tell you how well your clustering predicted / recognized the true nature of your data.
If you're using python, just use the sklearn package:
http://scikit-learn.org/stable/modules/model_evaluation.html
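For example, a quick sketch (true_labels and cluster_labels are placeholders; since cluster IDs are arbitrary, a permutation-invariant score such as the adjusted Rand index is often more convenient than reading the raw confusion matrix):

from sklearn.metrics import confusion_matrix, adjusted_rand_score

print(confusion_matrix(true_labels, cluster_labels))
# agreement between clustering and true labels, invariant to relabelling
print(adjusted_rand_score(true_labels, cluster_labels))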
In addition, it's nice to run some internal validation, to see how well your clusters are separated. There are known internal validity measures, like:
Silhouette
DB index
Dunn index
Calinski-Harabasz measure
Gamma score
Normalized Cut
etc.
Read more here: An extensive comparative study of cluster validity indices
Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, Iñigo Perona
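Of the measures above, a few are available directly in scikit-learn (a sketch; X and cluster_labels are placeholders for your data and fitted cluster assignments):

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

print(silhouette_score(X, cluster_labels))         # in [-1, 1], higher is better
print(davies_bouldin_score(X, cluster_labels))     # lower is better
print(calinski_harabasz_score(X, cluster_labels))  # higher is better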
I am trying to do hierarchical clustering on an m*n array.
Input array : 500 * 1000 (1000 features, 500 observations)
Calculate distance matrix using a self-defined pdist function
Feed this distance matrix to the linkage function:
clusters = sch.linkage(distanceMatrix, 'single')
Form flat clusters:
fc = sch.fcluster(clusters, cutoff, 'distance')
This gives me some clusters (around 80, using a cutoff value of 6.0).
Now, is there any way that I can get the 1000 feature values corresponding to each cluster as well (like we get the features of the centroids using k-means clustering)?
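For reference, the full pipeline looks roughly like this (scipy's own pdist stands in here for my self-defined distance function):

import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# X is the 500 x 1000 observation matrix
distanceMatrix = pdist(X, metric='euclidean')    # condensed form expected by linkage
clusters = sch.linkage(distanceMatrix, 'single')
cutoff = 6.0
fc = sch.fcluster(clusters, cutoff, 'distance')  # ~80 flat clusters at this cutoff
print(np.unique(fc).size)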
Clusters in hierarchical clustering (or pretty much anything except k-means and Gaussian mixture EM, which are restricted to "spherical" - actually convex - clusters) do not necessarily have sensible means.
Because they allow for non-spherical clusters. That actually is a feature...
https://en.wikipedia.org/wiki/Cluster_analysis#Connectivity_based_clustering_.28hierarchical_clustering.29
Have a look at the right image titled "Linkage clustering examples". What good is a cluster in this "banana" example? The centroid might not even be in the cluster!
Note that you can still compute the centroid yourself, if you need it. As the clustering algorithm does not need the centroid, it will not compute it for you automatically, obviously.
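For example, a sketch of computing per-cluster means from the fcluster output in the question (X is the 500 x 1000 observation matrix and fc the flat cluster labels):

import numpy as np

# one row of 1000 mean feature values per flat cluster
centroids = np.vstack([X[fc == c].mean(axis=0) for c in np.unique(fc)])
print(centroids.shape)  # (number_of_clusters, 1000)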