In KMeans clustering we can define the number of clusters. But is it possible to specify that cluster_1 will contain 20% of the data, cluster_2 will have 30%, and cluster_3 will have the rest of the data points?
I tried to do it in Python but couldn't.
Here is a discussion on how to modify KMeans so that the clusters all have the same size. You could modify it further to make the clusters have your desired respective sizes.
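As a rough illustration of that idea (not the code from the linked discussion; the helper below is a hypothetical sketch), you could run ordinary KMeans first and then greedily reassign points to their nearest centroid that still has free capacity, with the capacities set to your 20%/30%/50% split:

import numpy as np
from sklearn.cluster import KMeans

def capacity_constrained_labels(X, fractions, random_state=0):
    # Run ordinary KMeans first, then reassign points to their nearest
    # centroid that still has free capacity (capacities from `fractions`).
    # If the rounded capacities don't sum to len(X), a few points may stay -1.
    k = len(fractions)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    capacity = np.round(np.array(fractions) * len(X)).astype(int)
    dist = km.transform(X)                # distance of every point to every centroid
    order = np.argsort(dist.min(axis=1))  # assign the "easiest" points first
    labels = np.full(len(X), -1)
    for i in order:
        for c in np.argsort(dist[i]):     # nearest centroid with room left
            if capacity[c] > 0:
                labels[i] = c
                capacity[c] -= 1
                break
    return labels

# Example: a 20% / 30% / 50% split on random toy data
X = np.random.RandomState(0).rand(100, 2)
print(np.bincount(capacity_constrained_labels(X, [0.2, 0.3, 0.5])))  # roughly [20, 30, 50]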
Using K-means clustering, as you said, we specify the number of clusters, but it's not actually possible to specify the percentage of data points. I would recommend looking at Fuzzy C-means if you want to specify an exact percentage of data points allotted to each cluster.
I am new to Python and working on a consumer dataset where we have used LCA, K-Means, DBSCAN and Spectral Clustering to compute the clusters. In all these methods the number of clusters differs (e.g. 5 clusters in K-Means but 7 in LCA) and the independent variables may or may not be the same (e.g. 12 independent variables in K-Means but 10 in LCA). Now I want to validate the clusters using Cluster Cohesion, Cluster Separation, Entropy, Purity, Jaccard Coefficient, Rand Index, etc. I need help with the following:
Are these measures appropriate for cluster validation?
Is there any function/library in Python where I can calculate all of these at once?
How can I calculate these in Python if no function/library is available?
I hope I am clear. Thanks in advance for the help.
Sklearn has all these metrics readily available. Are they appropriate? These are the standard and accepted metrics for scoring clustering results. If clustering was the right tool for your question, these metrics are appropriate for validating your results.
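As a minimal sketch of what that looks like in sklearn (on toy data with known labels; purity is not built in, but it follows from the contingency matrix):

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with reference labels, just to demonstrate the calls
X, y_true = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal measure (cohesion/separation style): no ground truth needed
print("silhouette:", metrics.silhouette_score(X, labels))

# External measures: compare cluster labels against reference labels
print("adjusted Rand:", metrics.adjusted_rand_score(y_true, labels))
print("homogeneity/completeness/v-measure:",
      metrics.homogeneity_completeness_v_measure(y_true, labels))

# Purity: fraction of points in each cluster belonging to its majority class
contingency = metrics.cluster.contingency_matrix(y_true, labels)
print("purity:", contingency.max(axis=0).sum() / contingency.sum())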
I was going through a document from Western Michigan University to understand the limitations of the K-means clustering algorithm. Below is the link:
https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
On slide 33 it is mentioned that K-means has problems when clusters are of different:
Sizes
Densities
Non-globular shapes
Since we explore our data and try to figure out the different groups present in it through the k-means clustering algorithm, how would we know beforehand that the sizes of the clusters are different? We can visualize it if we have two-dimensional data, but how can it be done if the data is n-dimensional? Is there any way to examine the data before proceeding to apply k-means?
Also, the explanation for the limitation is: if we have clusters of different sizes, k-means will not give the desirable clusters as it tries to partition the clusters equally. But I don't think that's always the case. I applied k-means with k-means++ initialization on the following dataset:
https://archive.ics.uci.edu/ml/datasets/online+retail
It gave me clusters with a highly uneven distribution of 4346, 23 and 3 points.
I think I am missing some prerequisite steps before proceeding. Please help me clear my doubts. Thanks.
That's a limitation of k-means. You don't really have a hard criterion for whether your clustering is good or not.
Preprocessing steps could be (a minimal sketch follows after this list):
Normalization/Standardization of the data with StandardScaler
Missing value handling
Dimensionality reduction (there are several techniques, e.g. PCA), especially if you have a lot of dimensions
Repeated random initialization (the result can vary depending on the starting points)
A real method for telling how good your k-means clustering is doesn't really exist; here is a topic about how to "measure" it: Clustering Quality Measure
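To make the list above concrete, here is a minimal sketch on toy data (the parameter values are just assumptions) of a standardize → reduce → cluster pipeline, with a silhouette score as one rough quality check:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for your own (already imputed) dataset
X, _ = make_blobs(n_samples=500, n_features=10, centers=4, random_state=0)

# Standardize, reduce dimensionality, then cluster; n_init repeats the
# random initialization several times and keeps the best run
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

# Silhouette score on the transformed features as a rough quality check
X_reduced = pipeline[:-1].transform(X)
print(silhouette_score(X_reduced, labels))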
For PCA we can look at the explained variance ratio and say what percentage of the original data's variance is captured by each principal component. With those variance scores we can plot an elbow graph and decide the number of dimensions for visualising the data. But for t-SNE, I couldn't find anything similar.
Is there any way to decide the number of dimensions in t-SNE?
Use the n_components parameter when constructing the object. If you're using a large dataset, UMAP scales better than t-SNE at reducing into multiple dimensions.
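A minimal sketch of both options (the UMAP part assumes the separate umap-learn package; the data and parameter values are placeholders):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).rand(200, 50)    # toy high-dimensional data

# t-SNE: choose the output dimensionality via n_components
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X_tsne.shape)                           # (200, 2)

# UMAP (from the umap-learn package) exposes the same parameter and
# scales better for large datasets or higher output dimensions
# import umap
# X_umap = umap.UMAP(n_components=3).fit_transform(X)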
I have a data set with a dozen dimensions (columns) and about 200 observations (rows). This dataset has been normalized using quantile_transform_normalize. (Edit: I tried running the clustering without normalization, but still no luck, so I don't believe this is the cause.) Now I want to cluster the data into several clusters. Until now I had been using KMeans, but I have read that it may not be accurate in higher dimensions and doesn't handle outliers well, so I wanted to compare to DBSCAN to see if I get a different result.
However, when I try to cluster the data with DBSCAN using the Mahalanobis distance metric, every item is clustered into -1. According to the documentation:
Noisy samples are given the label -1.
I'm not really sure what this means, but I was getting some OK clusters with KMeans so I know there is something there to cluster -- it's not just random.
Here is the code I am using for clustering:
import numpy as np
import sklearn.cluster
# Covariance matrix of the features, needed for the Mahalanobis metric
covariance = np.cov(data.values.astype("float32"), rowvar=False)
clusterer = sklearn.cluster.DBSCAN(min_samples=6, metric="mahalanobis", metric_params={"V": covariance})
clusterer.fit(data)
And that's all. I know for certain that data is a numeric Pandas DataFrame as I have inspected it in the debugger.
What could be causing this issue?
You need to choose the parameter eps, too.
DBSCAN results depend on this parameter very much. You can find some methods for estimating it in the literature (one common heuristic, the k-distance plot, is sketched at the end of this answer).
IMHO, sklearn should not provide a default for this parameter, because it rarely ever works (on normalized toy data it is usually okay, but that's about it).
200 instances is probably too small to reliably estimate density, in particular with a dozen variables.
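As an illustration of one such heuristic, the k-distance plot (a general sketch on toy data, not specific to the Mahalanobis setup above): sort every point's distance to its k-th nearest neighbor and look for the elbow in that curve, which gives a reasonable starting value for eps.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Toy data standing in for your own feature matrix
X, _ = make_blobs(n_samples=200, n_features=12, centers=3, random_state=0)

k = 6  # often set to the min_samples you plan to use
neighbors = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
distances, _ = neighbors.kneighbors(X)

# Sorted distance of every point to its k-th nearest neighbor;
# the "elbow" of this curve is a common starting value for eps
k_distances = np.sort(distances[:, -1])
plt.plot(k_distances)
plt.ylabel("distance to %d-th nearest neighbor" % k)
plt.show()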
I tried CountVectorizer + KMeans, but I don't know the number of clusters, and estimating it with the gap statistic method took a lot of time. NMF requires determining the number of components beforehand too.
There is no single algorithm which is best for unsupervised text classification. It depends on the data you have, what you are trying to achieve, etc.
If you wish to avoid the number of clusters issue, you can try DBSCAN, which is a density-based clustering algorithm:
DBSCAN on Wikipedia:
a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).
DBSCAN automatically finds the number of clusters by recursively connecting points to a nearby dense group of points (i.e. a cluster).
To use DBSCAN, the most important parameters to tune are epsilon (which controls the maximum distance to be considered a neighbor) and min_samples (the number of samples in a neighborhood to be considered a core point). Try starting with the default parameters sklearn provides, and tune them to get better results for your specific task.
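A minimal sketch of that workflow on text (the documents, the choice of TfidfVectorizer instead of CountVectorizer, and the parameter values are just assumptions to illustrate the calls):

from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents; use your own corpus here
docs = [
    "the cat sat on the mat",
    "a cat and a dog",
    "stock markets fell sharply today",
    "the market closed lower on monday",
]

# TF-IDF often works better than raw counts for distance-based clustering
X = TfidfVectorizer().fit_transform(docs)

# Cosine distance is a common choice for sparse text vectors;
# eps and min_samples need tuning for your specific data
labels = DBSCAN(eps=0.8, min_samples=2, metric="cosine").fit_predict(X)
print(labels)   # -1 marks points treated as noise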