I was going through a document from Western Michigan University to understand the limitations of the K-means clustering algorithm. Below is the link:
https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
On slide 33 it is mentioned that K-means has problems when clusters are of different
sizes
densities
non-globular shapes
Since we explore our data and try to figure out the different groups present in it through the k-means clustering algorithm, how would we know beforehand that the sizes of the clusters differ? We can visualize it if we have two-dimensional data, but how can it be done if the data is n-dimensional? Is there any way to examine the data before proceeding to apply k-means?
Also, the explanation given for the limitation is: if we have clusters of different sizes, k-means will not give the desirable clusters, as it tries to partition the clusters equally. But I don't think that's always the case. I applied k-means, with k-means++ initialization, to the following dataset:
https://archive.ics.uci.edu/ml/datasets/online+retail
It gave me clusters with a highly uneven distribution of 4346, 23 and 3 points.
I think I am missing some prerequisite steps before proceeding. Please help me clear my doubts. Thanks.
That's a limitation of k-means: you don't really have a hard criterion for whether your clustering is good or not.
Preprocessing steps could be (a rough sketch follows this list):
Normalization/Standardization of the data with StandardScaler
Missing value handling
Dimension reduction (there are several techniques, e.g. PCA), especially if you have a lot of dimensions
Trying different random initializations (the result can vary depending on the starting points)
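A rough sketch of those steps with scikit-learn (the file name, the column handling and the choice of 3 clusters are placeholders, not anything specific to the retail dataset):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df = pd.read_csv("retail_features.csv")            # hypothetical numeric feature table
df = df.dropna()                                    # simplest possible missing-value handling

X = StandardScaler().fit_transform(df)              # normalization/standardization
X = PCA(n_components=2).fit_transform(X)            # optional dimension reduction

labels = KMeans(n_clusters=3, init="k-means++", n_init=10).fit_predict(X)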
A definitive way to measure how good your k-means clustering is doesn't really exist; here is a topic about how to "measure" it: Clustering Quality Measure
Firstly, I want to apologize if any of my questions seem obvious or dumb. I'm still doing a lot of learning in this field, so I would appreciate any help I can get, even if the answer is obvious. I'm not sure whether this belongs on Stack Overflow, but I thought I would give it a try here.
My understanding of clustering has always been that you never need to split the data into training and test sets, since you don't have any labels or are ignoring them. But from some recent reading I have done online, I see people saying that a train/test split is still necessary for some clustering algorithms (those that generate centroids). So I have some scenarios listed below, and I was hoping people could help me understand what I have done wrong in those scenarios and what I need to do to fix it.
For all scenarios, my dataset is 95% unlabeled and 5% labeled, with 2 features. There has also been some scaling done on the entire dataset prior to these scenarios. I know you're supposed to train/test split before scaling, but again, I thought you don't need to split for clustering.
Scenario 1
Using sklearn.cluster.KMeans, I ran KMeans on the dataset via the fit_predict() method. Then, for the 5% of labeled data, I took the clusters that were returned and compared them to the labels I have, to see how close the clusters are to the labels. I now feel that what I have done is incorrect because it introduces data leakage. Was I supposed to train/test split first into the unlabeled/labeled data, then do the scaling separately, then fit() on the unlabeled data and predict() on the labeled data?
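To make the order being asked about concrete, here is a rough sketch (X_unlabeled, X_labeled and y_labeled stand for the 95%/5% parts of the dataset, and the number of clusters is a placeholder):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler().fit(X_unlabeled)              # scaler fitted on the unlabeled part only
km = KMeans(n_clusters=3, n_init=10).fit(scaler.transform(X_unlabeled))

clusters = km.predict(scaler.transform(X_labeled))      # assign the labeled 5% to the learned clusters
# compare clusters against y_labeled here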
Scenario 2
Using scipy.cluster.hierarchy.linkage, I ran linkage on the dataset to get my linkage array. I drew the dendrogram, selected a distance cutoff point, saw how many clusters I would get, and generated the cluster labels via the fcluster method. Then, for the 5% of labeled data, I took the clusters that were returned and compared them to the labels I have, to see how close the clusters are to the labels. Am I doing anything wrong in this situation so far? I'm not sure what my next steps are from here. Since hierarchical clustering doesn't generate any centroids, how can it be used to properly classify/predict data?
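For reference, a sketch of the workflow just described (the ward linkage and the distance cutoff of 5 are arbitrary placeholders):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

Z = linkage(X, method="ward")                        # linkage array
dendrogram(Z); plt.show()                            # inspect the tree and pick a distance cutoff
clusters = fcluster(Z, t=5, criterion="distance")    # flat cluster labels at that cutoff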
Thank you.
I have implemented a K-Means clustering on a dataset in which I have reduced the dimensionality to 2 features with PCA.
Now I am wondering how to interpret this analysis, since there is no reference to which variables end up on the axes. Given that doubt, I am also wondering whether it is good practice to run K-Means on a dataset reduced with PCA.
How can I interpret this kind of clustering?
Thank you!
It is hard to give an answer addressing your question, since it is not specific enough and I have no idea about the data or the objective of your research. So let me answer your question from a general perspective, if it helps.
First of all, PCA strictly decreases the interpretability of the analysis, because it reduces the dimensions based on linear relations among the variables and you cannot name the reduced components anymore. In addition, check the correlation scores among the variables before PCA to get an intuition of how successful PCA will be, and check the variance explained by PCA. The lower the explained variance ratio, the greater the information loss, so it may mislead your interpretations.
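A rough sketch of those two checks, assuming df is your feature DataFrame:

from sklearn.decomposition import PCA

print(df.corr())                            # correlation among the original variables

pca = PCA(n_components=2).fit(df)
print(pca.explained_variance_ratio_)        # low values here mean large information loss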
If your objective is to analyse the data and make inferences, I would suggest not reducing the dimensionality. You only have 3 dimensions. You can apply K-Means without PCA and plot the clusters in 3D; Matplotlib and Plotly provide interactive features for this.
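A minimal Matplotlib sketch of that 3D view, assuming the three original columns are in an array X and 3 clusters are wanted (both are assumptions, not your setup):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

ax = plt.axes(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels)   # one colour per cluster
plt.show()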
However, if your objective is to build a machine learning model, then you should reduce the dimensions if they are highly correlated. This would do your model a big favor.
Finally, applying K-Means after PCA is not something to avoid, but it does make interpretation more difficult.
I am working on a project and I wish to cluster multi-dimensional data. I tried K-Means clustering and DBSCAN clustering, which are two quite different algorithms.
The K-Means model returned fairly good output: it returned 5 clusters. But I have read that when the dimensionality is large, the Euclidean distance breaks down, so I don't know whether I can trust this model.
On trying the DBSCAN model, it generated a lot of noise points and put a lot of points into one cluster. I tried the KNN distance plot method to find the optimal eps for the model, but I can't seem to make it work. This led me to conclude that maybe the density of the points is very high, and maybe that is why I am getting so many points in one cluster.
For clustering, I am using 10 different columns of data. Should I change the algorithm I am using? What would be a better algorithm for multi-dimensional data and with less-varying density?
You can first perform dimension reduction on your dataset with PCA/LDA/t-SNE or autoencoders, then run some standard clustering algorithms.
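For example, a hedged sketch of "reduce first, then cluster" with t-SNE (the 2 components and k=5 are arbitrary choices, not a recommendation):

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

X_2d = TSNE(n_components=2).fit_transform(X)          # non-linear dimension reduction
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X_2d)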
Another way is to use fancier deep clustering methods. This blog post is a really nice explanation of how deep clustering can be applied to a high-dimensional dataset.
Maybe this provides you with some inspiration: Scikit-learn clustering algorithms
I suggest you try a few out. Hope that helps!
I have a data set with a dozen dimensions (columns) and about 200 observations (rows). This dataset has been normalized using quantile_transform_normalize. (Edit: I tried running the clustering without normalization, but still no luck, so I don't believe this is the cause.) Now I want to cluster the data into several clusters. Until now I had been using KMeans, but I have read that it may not be accurate in higher dimensions and doesn't handle outliers well, so I wanted to compare to DBSCAN to see if I get a different result.
However, when I try to cluster the data with DBSCAN using the Mahalanobis distance metric, every item is clustered into -1. According to the documentation:
Noisy samples are given the label -1.
I'm not really sure what this means, but I was getting some OK clusters with KMeans so I know there is something there to cluster -- it's not just random.
Here is the code I am using for clustering:
import numpy as np
import sklearn.cluster

# Covariance matrix of the features, used by the Mahalanobis metric
covariance = np.cov(data.values.astype("float32"), rowvar=False)
clusterer = sklearn.cluster.DBSCAN(min_samples=6, metric="mahalanobis", metric_params={"V": covariance})
clusterer.fit(data)
And that's all. I know for certain that data is a numeric Pandas DataFrame as I have inspected it in the debugger.
What could be causing this issue?
You need to choose the parameter eps, too.
DBSCAN results depend very much on this parameter. You can find some methods for estimating it in the literature.
IMHO, sklearn should not provide a default for this parameter, because it rarely ever works (on normalized toy data it is usually okay, but that's about it).
200 instances probably is too small to reliably measure density, in particular with a dozen variables.
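As an illustration only, a common heuristic is to sort the distance of every point to its min_samples-th nearest neighbour and read eps off the "knee" of that curve; the percentile below is just a crude stand-in for picking the knee by eye, and the rest reuses the question's Mahalanobis setup:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

covariance = np.cov(data.values.astype("float32"), rowvar=False)

nn = NearestNeighbors(n_neighbors=6, metric="mahalanobis", metric_params={"V": covariance}).fit(data)
k_dist = np.sort(nn.kneighbors()[0][:, -1])        # distance of each point to its 6th-nearest neighbour
eps_guess = np.percentile(k_dist, 90)              # crude stand-in for reading the knee off a plot

labels = DBSCAN(eps=eps_guess, min_samples=6, metric="mahalanobis",
                metric_params={"V": covariance}).fit_predict(data)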
I would like some suggestions on the best clustering technique to use, with Python and scikits.learn. Our data come from a Phenotype Microarray, which measures the metabolic activity of a cell on various substrates over time. The output is a series of sigmoid curves, from which we extract a series of curve parameters by fitting a sigmoid function.
We would like to "rank" these activity curves through clustering, using a fixed number of clusters. For now we are using the k-means algorithm provided by the package, with (init='random', k=10, n_init=100, max_iter=1000). The input is a matrix with n_samples rows and 5 parameters for each sample. The number of samples can vary, but it is usually around several thousand (e.g. 5,000). The clustering seems efficient and effective, but I would appreciate any suggestions on different methods, or on the best way to assess the clustering quality.
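For reference, roughly the same call with the current scikit-learn API (the old k= keyword is now n_clusters=; X is the n_samples x 5 parameter matrix):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=10, init="random", n_init=100, max_iter=1000)
labels = km.fit_predict(X)        # one cluster index per activity curve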
Here are a couple of diagrams that may help:
the scatterplot of the input parameters (some of them are quite correlated), where the colour of each sample corresponds to its assigned cluster
the sigmoid curves from which the input parameters have been extracted, coloured by their assigned cluster
EDIT
Below are some elbow plots and the silhouette score for each number of clusters.
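(Plots of this kind can be produced along the following lines; X is again the parameter matrix, and the range of k values is arbitrary.)

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

ks = range(2, 16)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    inertias.append(km.inertia_)                           # within-cluster sum of squares (elbow plot)
    silhouettes.append(silhouette_score(X, km.labels_))

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(list(ks), inertias, marker="o"); ax1.set(xlabel="k", ylabel="inertia")
ax2.plot(list(ks), silhouettes, marker="o"); ax2.set(xlabel="k", ylabel="silhouette score")
plt.show()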
Have you noticed the striped pattern in your plots?
This indicates that you didn't normalize your data well enough.
"Area" and "Height" are highly correlated and probably on the largest scale. All the clustering happened on this axis.
You absolutely must:
perform careful preprocessing
check that your distance functions produce a meaningful (to you, not just the computer) notion of similarity
reality-check your results, and check that they aren't too simple, determined e.g. by a single attribute
Don't blindly follow the numbers. K-means will happily produce k clusters no matter what data you give it; it just optimizes some number. It's up to you to check that the results are useful and to analyze what their semantic meaning is - it might well be that the result is mathematically a local optimum but meaningless for your task.
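To see this concretely: run k-means on pure noise and it still returns k tidy clusters, e.g.

import numpy as np
from sklearn.cluster import KMeans

noise = np.random.default_rng(0).uniform(size=(5000, 5))   # data with no real cluster structure
km = KMeans(n_clusters=10, n_init=10).fit(noise)
print(np.bincount(km.labels_))                              # ten "clusters" of similar size anyway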
For 5000 samples, all methods should work without problem.
There is a pretty good overview here.
One thing to consider is whether you want to fix the number of clusters or not.
See the table for possible choices of the clustering algorithm depending on that.
I think spectral clustering is a pretty good method. You can use it for example together with the RBF kernel. You have to adjust gamma, though, and possibly restrict connectivity.
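A minimal sketch of that suggestion (gamma and the number of clusters are values you would have to tune for your data):

from sklearn.cluster import SpectralClustering

sc = SpectralClustering(n_clusters=10, affinity="rbf", gamma=1.0)
labels = sc.fit_predict(X)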
Choices that don't need n_clusters are WARD and DBSCAN, also solid choices.
You can also consult this chart, which reflects my personal opinion; I can't find the link to it in the scikit-learn docs...
For judging the result: if you have no ground truth of any kind (which I imagine you don't, if this is exploratory), there is no good measure [yet] in scikit-learn.
There is one unsupervised measure, the silhouette score, but as far as I know it favours very compact clusters like the ones found by k-means.
There are stability measures for clusters which might help, though they are not implemented in sklearn yet.
My best bet would be to find a good way to inspect the data and visualize the clustering.
Have you tried PCA and thought about manifold learning techniques?