I am clustering a data set in Python using k-means. Before clustering, I determined the optimal number of clusters using an elbow curve.
The optimal number of clusters was 5, so after k-means ran I had 5 different clusters.
So here's my question. Now that I have 5 clusters, I would like to cluster each of them again to get smaller clusters, and then cluster those smaller clusters again. I would like to repeat this until I have only about 20 points in each cluster. The dataset has 1,000,000+ observations.
What is the best way to do this? Is there a way to build a clustering loop? Is there a completely different better way to do this? I know this isn’t a specific coding question, but I’d love to hear some thoughts.
I'm going to provide a rough sketch since you didn't provide any details about your code (which you should, by the way):
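from sklearn.cluster import KMeans

def cluster_until_20(data, k=5, max_size=20):
    # Base case: the group is already small enough, so stop splitting.
    if len(data) <= max_size:
        return data
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(data)
    clusters = [data[labels == i] for i in range(k) if (labels == i).any()]
    # If k-means could not separate the points (e.g. duplicates), stop here.
    if len(clusters) < 2:
        return data
    # Recurse into each sub-cluster; the result is a nested list of arrays.
    return [cluster_until_20(c, k, max_size) for c in clusters]

The key is using recursion with a list comprehension, which goes "deeper" as long as a cluster holds more than 20 points. Testing for a size of exactly 20 would never terminate, because k-means does not produce clusters of a fixed size. Here data is assumed to be a NumPy array with one row per observation.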
Related
I have a dataset with 28,000 records. The data consists of menu items from e-commerce stores. The challenge is the following:
Multiple stores have similar products but with different names. For example, 'HP laptop 1102' is present in different stores as 'HP laptop 1102', 'Hewlett-Packard laptop 1102', 'HP notebook 1102' and many other different names.
I have opted to convert the product list to TF-IDF vectors and use KMeans clustering to group similar products together. I am also using some other features like product category, sub category etc. (I have one-hot encoded all the categorical features.)
Now my challenge is to estimate the optimal n_clusters for the KMeans algorithm. As the clustering should occur at product level, I'm assuming I need a high n_clusters value. Is there any upper limit for n_clusters?
Also any suggestions and advice on the solution approach would be really helpful.
Thanks in advance.
You are optimising for k, so you could try an approach similar to this one here: how do I cluster a list of geographic points by distance?
As for max k, you can only ever have as many clusters as you do data points, so try using that as your upper bound.
The upper limit is the number of data points, but you almost surely want a number a good bit lower for clustering to provide any value. If you have 10,000 products I would think 5,000 clusters would be a rough maximum from a usefulness standpoint.
You can use the silhouette score and inertia metrics to help determine the optimal number of clusters.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of....
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. - from the scikit-learn docs
inertia_ is an attribute of a fitted clustering object in scikit-learn - not a separate evaluation metric.
It is the "Sum of squared distances of samples to their closest cluster center." - see the KMeans clustering docs in scikit-learn, for example.
Note that inertia decreases as you add more clusters, so you may want to use an elbow plot to visualize where the change becomes minimal.
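A minimal sketch of scanning several values of k and recording both quantities, assuming scikit-learn is installed. The synthetic make_blobs data and the helper name scan_k are illustrative assumptions, not part of the original question.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real feature matrix (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

def scan_k(X, k_values):
    """Return {k: (inertia, silhouette)} for a k-means fit at each candidate k."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results

scores = scan_k(X, range(2, 9))
for k, (inertia, sil) in scores.items():
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")

# One possible rule: take the k with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k][1])
```

On real data the silhouette peak and the point where the inertia curve flattens may disagree, so it is probably worth looking at both before settling on a value.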
Suppose I have 20,000 features on a map, and each feature has many attributes (as well as latitude and longitude). One of the attributes is called population.
I want to split these 20,000 features into 3 clusters so that the total population of each cluster equals a specific value, 90,000, and the features in each cluster are near each other (i.e. location is taken into consideration).
So, the output clusters should have the following conditions:
Sum(population) of all points/items/features in cluster 1=90,000
Sum(population) of all points/items/features in cluster 2=90,000
Sum(population) of all points/items/features in cluster 3=90,000
I tried the k-means clustering algorithm, which gave me 3 clusters, but how do I enforce the above constraint (sum of population should equal 90,000)?
Any idea is appreciated.
A turnkey solution will not work for you.
You'll have to formulate this as a standard constraint optimization problem and run a solver to optimize this. It's fairly straightforward: take the k-means objective and add your constraints...
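One way this could be written down, as a sketch rather than a definitive model. The notation is my own assumption: x_{ij} is 1 if feature i is assigned to cluster j and 0 otherwise, p_i is the location of feature i, w_i is its population, and c_j is the center of cluster j.

```latex
\begin{aligned}
\min_{x,\,c}\quad & \sum_{i=1}^{n}\sum_{j=1}^{3} x_{ij}\,\lVert p_i - c_j \rVert^2 \\
\text{s.t.}\quad & \sum_{j=1}^{3} x_{ij} = 1 \quad \text{for all } i, \\
& \sum_{i=1}^{n} w_i\,x_{ij} = 90{,}000 \quad \text{for } j = 1, 2, 3, \\
& x_{ij} \in \{0, 1\}.
\end{aligned}
```

Assigning every feature while hitting 90,000 exactly in each cluster is only possible if the total population is 270,000. Otherwise the first constraint would have to be relaxed to "at most 1", or the equality replaced by a tolerance band around 90,000.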
So I was trying to use the Elbow curve to find the value of optimum 'K' (number of clusters) in K-Means clustering.
The clustering was done for the average vectors (using Word2Vec) of a text column in my dataset (1467 rows). But looking at my text data, I can clearly find more than 3 groups the data can be grouped into.
I read the reasoning is to have a small value of k while keeping the Sum of Squared Errors (SSE) low. Can somebody tell me how reliable the Elbow Curve is?
Also if there's something I'm missing.
Attaching the Elbow curve for reference. I also tried plotting it up to 70 clusters, exploratory..
The "elbow" is not even well defined so how can it be reliable?
You can "normalize" the values by the expected dropoff from splitting the data into k clusters and it will become a bit more readable.
For example, the Calinski and Harabasz (1974) variance ratio criterion. It is essentially a rescaled version that makes much more sense.
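As a hedged illustration, scikit-learn exposes this criterion as calinski_harabasz_score. It is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom, and higher values are better. The synthetic data and the helper name variance_ratio_by_k below are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data standing in for the averaged Word2Vec vectors (an assumption).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.0, random_state=42)

def variance_ratio_by_k(X, k_values):
    """Return {k: Calinski-Harabasz score} for a k-means fit at each k."""
    out = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        out[k] = calinski_harabasz_score(X, labels)
    return out

ch = variance_ratio_by_k(X, range(2, 11))
best_k = max(ch, key=ch.get)  # candidate k with the highest variance ratio
print(best_k, ch[best_k])
```

Unlike the raw SSE curve, this score does not improve automatically with every extra cluster, so a maximum is usually easier to read off than an elbow.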
I have the following points in 3D space:
I need to group the points, according to D_max and d_max:
D_max = max dimension of each group
d_max = max distance of points inside each group
Like this:
The shape of the group in the above image looks like a box, but the shape can be anything which would be the output of the grouping algorithm.
I'm using Python and visualizing the results with Blender. I'm considering using scipy.spatial.KDTree and calling its query API; however, I'm not sure if that's the right tool for the job at hand. I'm worried that there might be a better tool which I'm not aware of. I'm curious to know if there is any other tool/library/algorithm which can help me.
As @CoMartel pointed out, there are the DBSCAN and HDBSCAN clustering modules, which look like a good fit for this type of problem. However, as pointed out by @Paul, they lack an option for the maximum size of a cluster, which corresponds to my D_max parameter. I'm not sure how to add a max cluster size feature to DBSCAN and HDBSCAN clustering.
Thanks to @Anony-Mousse I watched Agglomerative Clustering: how it works and Hierarchical Clustering 3: single-link vs. complete-link, and I'm studying Comparing Python Clustering Algorithms. I feel like it's getting clearer how these algorithms work.
As requested, my comment as an answer:
You could use DBSCAN (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) or HDBSCAN.
Both of these algorithms group points according to d_max (maximum distance between 2 points of the same dataset), but they don't take a maximum cluster size. The only way to limit the maximum size of a cluster is by reducing the eps parameter, which controls the max distance between 2 points of the same cluster.
Use hierarchical agglomerative clustering.
If you use complete linkage you can control the maximum diameter of the clusters. The complete link is the maximum distance.
DBSCAN's epsilon parameter is not a maximum distance because multiple steps are joined transitively. Clusters can become much larger than epsilon!
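A small sketch of the difference, assuming scikit-learn and NumPy. The toy points, the value of d_max and the helper max_diameter are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# Ten points on a line in 3D, one unit apart (toy data, purely illustrative).
points = np.column_stack([np.arange(10.0), np.zeros(10), np.zeros(10)])
d_max = 2.5  # hypothetical maximum allowed distance inside a group

def max_diameter(points, labels):
    """Largest pairwise distance found inside any single cluster."""
    worst = 0.0
    for lab in set(labels):
        members = points[labels == lab]
        # All pairwise Euclidean distances between members of this cluster.
        dists = np.linalg.norm(members[:, None] - members[None, :], axis=-1)
        worst = max(worst, dists.max())
    return worst

# DBSCAN chains neighbours together, so one cluster can span far more than eps.
db_labels = DBSCAN(eps=d_max, min_samples=2).fit_predict(points)

# Complete linkage only merges while the farthest pair stays below the threshold.
hac_labels = AgglomerativeClustering(
    n_clusters=None, linkage="complete", distance_threshold=d_max
).fit_predict(points)

print("DBSCAN diameter:", max_diameter(points, db_labels))
print("Complete-link diameter:", max_diameter(points, hac_labels))
```

On this chain DBSCAN returns a single cluster covering all ten points, while every complete-linkage cluster stays below d_max.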
DBSCAN clustering algorithm with the maximum distance of points inside each group extension
You can use the DBSCAN algorithm recursively.
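from sklearn.cluster import DBSCAN
def dbscan_with_max_size(data, eps, max_size):
    labels = DBSCAN(eps=eps).fit_predict(data)
    clusters = [data[labels == lab] for lab in set(labels) if lab != -1]  # label -1 is noise
    result = []
    for c in clusters:
        # Assumption: "size" means point count. Oversized groups are
        # re-clustered with a smaller eps (here eps / 2).
        result += dbscan_with_max_size(c, eps / 2, max_size) if len(c) > max_size else [c]
    return result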
I have a data frame of about 300,000 unique product names and I am trying to use k-means to cluster similar names together. I used sklearn's TfidfVectorizer to vectorize the names and convert them to a TF-IDF matrix.
Next I ran k means on the tf-idf matrix with number of clusters ranging from 5 to 25. Then I plotted the inertia for each # of clusters.
Based on the plot am I approaching the problem wrong? What are some takeaways from this if there is no distinct elbow?
Most likely because k-means with TF-IDF doesn't work well on short texts such as product names.
Not seeing an elbow is an indication that the results aren't good.