Scikit-Learn DBSCAN clustering yielding no clusters - python

I have a data set with a dozen dimensions (columns) and about 200 observations (rows). This dataset has been normalized using quantile_transform_normalize. (Edit: I tried running the clustering without normalization, but still no luck, so I don't believe this is the cause.) Now I want to cluster the data into several clusters. Until now I had been using KMeans, but I have read that it may not be accurate in higher dimensions and doesn't handle outliers well, so I wanted to compare to DBSCAN to see if I get a different result.
However, when I try to cluster the data with DBSCAN using the Mahalanobis distance metric, every item is clustered into -1. According to the documentation:
Noisy samples are given the label -1.
I'm not really sure what this means, but I was getting some OK clusters with KMeans so I know there is something there to cluster -- it's not just random.
Here is the code I am using for clustering:
covariance = np.cov(data.values.astype("float32"), rowvar=False)
clusterer = sklearn.cluster.DBSCAN(min_samples=6, metric="mahalanobis", metric_params={"V": covariance})
clusterer.fit(data)
And that's all. I know for certain that data is a numeric Pandas DataFrame as I have inspected it in the debugger.
What could be causing this issue?

You need to choose the parameter eps, too.
DBSCAN results depend on this parameter very much. You can find some methods for estimating it in literature.
IMHO, sklearn should not provide a default for this parameter, because it rarely ever works (on normalized toy data it is usually okay, but that's about it).
200 instances probably is too small to reliably measure density, in particular with a dozen variables.

Related

Confusion on using clustering algorithms that do and do not generate centroids

Firstly I want to apologize if any of my questions seem obvious or dumb. I'm still doing a lot of learning in this field so I would still appreciate any help I can get, even if the answer is obvious. I'm not sure if this belongs in stack overflow or not, but I thought I would give it a try here.
My understanding for clustering has always been that you never need to split the data to training and testing since you don't have any labels or are ignoring the labels. But from some recent reading that I have done online, I see that some people are saying that train/test splitting is still necessary for some clustering algorithms (those that generate centroids). So I just have some scenarios listed below, and I was hoping if some people could help me better understand what I have done wrong in those scenarios and what I need to do to fix it.
For all scenarios, my dataset is 95% unlabeled and 5% labeled, with 2 features. There has also been some scaling done on the dataset entirely prior to these scenarios. I know you're supposed to train/test split before scaling, but again, I thought you don't need to split for clustering.
Scenario 1
Using sklearn.cluster.KMeans, I ran KMeans on the dataset via the fit_predict() method. Then for the 5% of the labeled data, I use the clusters that were returned and compare it to the labels I have to see how close the clusters are to the labels. I now feel that what I have done now is incorrect because the method introduces data leakage. Was I supposed to train/test split first with the unlabeled/labeled data, then do the scaling separately, then fit() on the unlabeled data, then predict() on the labeled data?
Scenario 2
Using scipy.cluster.hierarchy.linkage, I ran linkage on the dataset to get my linkage array. I drew the dendrogram, selected a distance cutoff point and saw how many clusters I would get, and generated the cluster labels via the fcluster method. Then for the 5% of the labeled data, I use the clusters that were returned and compare it to the labels I have to see how close the clusters are to the labels. Am I doing anything wrong in this situation so far? I'm not sure what my next steps are from here. Since hierarchical clustering doesn't generate any centroids, how can it be used to properly classify/predict data?
Thank you.

Limitations of K-Means Clustering

I was going through a document of Western Michigan University to understand the limitations of K-means clustering algorithms. Below is the link:
https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
On slide no 33 its mentioned that K-means has problems when clusters are of different
Sizes
Densities
Non globular shapes
Since we explore our data and try to figure out the different groups that are present in our data through the k-means clustering algorithm, how would we know that the size of the clusters is different beforehand? We can visualize it if we have two-dimensional data but how it can be done if the data is n-dimensional? Is there any way to examine the data before proceeding to apply k-means.
Also, the explanation for the limitation is: if we have different sizes of clusters, k-means will not give the desirable clusters as it tries to partition the clusters equally. But I don't think its always the case. I had applied k-means on the following dataset with k-means++ initialization
https://archive.ics.uci.edu/ml/datasets/online+retail
It gave me clusters with highly uneven distribution of 4346, 23, 3
I think I am missing some prerequisite steps before proceeding. Please help me clear my doubts. Thanks.
That's a limit of k-means. You don't really have a hard fact if your clustering is good or not.
Pre-steps could be:
Normalization/Standardization of the data with StandardScaler
Missing value handling
Dimension reduction (there are several techniques like: PCA), especially if you have a lot of dimensions
Random initialization (it can vary from the start point)
A real method how good your k-means clustering is doesn't really exists, here is a topic about how to "measure" it: Clustering Quality Measure

What is the purpose of the holdout set in k-means clustering?

Link to the MIT problem set
Here are my current thoughts--please point to where I'm wrong :)
What I believe: The holdout set's purpose is to foil,
contrast, for the training set - to prove that the
k-means eliminates error at each round.
To do this, the holdout set shows the error at the very begin-
ning, i.e. it doesn't recompute the centroid of each clusters
to be at the very center of each cluster, after each
point has been assigned. It just stops, and the error is
calculated.
The training set, for the initial 80% of the points--
partitioned using randomPartition()--simply go through
the entire k-means function, and return the error after
that.)
Where I'm probably wrong: The problem probably just
requests another run of k-means, but with a smaller set.
Also, the way of calculating error for training set vs. the holdout
set seem identical to me. They're probably not.
Also, I heard something about it involving feature selection.
Current methods I'm considering based on current belief:
Duplicate the k-means function, and modify the duplicate
so that it returns the clusters, maxDistance after initial
run. Use this function for the holdout set.
The goal of clustering is to group similar data points. But how would you know if the similar data points you have grouped are grouped correctly? How can you judge your results? For this reason you divide your available data into 2 sets: training and holdout.
Take this as an analogy.
Think about training set as practice questions for some examination. You work the practice questions, try to do best in it and improve your skills.
You can think holdout set as the actual examination. If you have worked good on the practice questions (training set) then you will probably perform good in the examination (holdout set).
Now you know how well did you do in practice and examination (of-course after attempting ) based on which you can infer your overall performance and judge what is good (what number of clusters are good or how good is the data clustered).
So you will apply your clustering algorithm on the training data but not on holdout data and find out cluster centers (representatives of clusters). For holdout data, you will simply use the cluster centers you have found from algorithm and assign data-points to cluster whose center is nearest. Calculate your performance on training and holdout data based on some performance metric (squared distance error in your case). Finally compare these metrics over different values of k to get a good judgement. There is more to it but for assignment sake it seems enough.
In practice, there are many other methods. But the key idea in most of them is same. There is a statistics community where you can find more similar questions: https://stats.stackexchange.com/
References:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Holdout_method

Semi-supervised Gaussian mixture model clustering in Python

I have images that I am segmenting using a gaussian mixture model from scikit-learn. Some images are labeled, so I have a good bit of prior information that I would like to use. I would like to run a semi-supervised training of a mixture model, by providing some of the cluster assignments ahead of time.
From the Matlab documentation, I can see that Matlab allows initial values to be set. Are there any python libraries, especially scikit-learn approaches that would allow this?
The standard GMM does not work in a semi-supervised fashion. The initial values you mentioned is likely the initial values for the mean vectors and covariance matrices for the gaussians which will be updated by the EM algorithm.
A simple hack will be to group your labeled data based on their labels and individually estimate mean vectors and covariance matrices for them and pass these as the initial values to your MATLAB function (scikit-learn does not allow this as far as I'm aware). Hopefully this will position your Gaussians at the "correct locations". The EM algorithm will then take it from there to adjust these parameters.
The downside of this hack is that it does not guarantee that it will respect your true label assignment, hence even if a data point is assigned a particular cluster label, there is a chance that it might be re-assigned to another cluster. Also, noise in your feature vectors or labels could also cause your initial Gaussians to cover a much larger region than it is suppose to, hence wrecking havoc on the EM algorithm. Also, if you do not have sufficient data points for a particular cluster, your estimated covariance matrices might be singular, hence breaking this trick altogether.
Unless it is a must for you to use GMM to cluster your data (for e.g., you know for sure that gaussians model your data well), then perhaps you can just try the semi-supervised methods in scikit-learn . These will propagate the labels based on feature similarities to your other data point. However, I doubt this can handle large dataset as it requires the graph laplacian matrix to be built from pairs of samples, unless there is some special implementation trick to handle this in scikit-learn.

scikits.learn clusterization methods for curve fitting parameters

I would like some suggestion on the best clusterization technique to be used, using python and scikits.learn. Our data comes from a Phenotype Microarray, which measures the metabolism activity of a cell on various substrates over time. The output are a series of sigmoid curves for which we extract a series of curve parameters through a fitting to a sigmoid function.
We would like to "rank" this activity curves through clusterization, using a fixed number of clusters. For now we are using the k-means algorithm provided by the package, with (init='random', k=10, n_init=100, max_iter=1000). The input is a matrix with n_samples and 5 parameters for each sample. The number of samples can vary, but it is usually around several thousands (i.e. 5'000). The clustering seems efficient and effective, but I would appreciate any suggestion on different methods or on the best way to perform an assessment of the clustering quality.
Here a couple of diagrams that may help:
the scatterplot of the input parameters (some of them are quite correlated), the color of the single samples is relative to the assigned cluster.
the sigmoid curves from which the input parameters have been extracted, whose color is relative to their assigned cluster
EDIT
Below some elbow plots and the silhouette score for each number of cluster.
Have you noticed the striped pattern in your plots?
This indicates that you didn't normalize your data good enough.
"Area" and "Height" are highly correlated and probably on the largest scale. All the clustering happened on this axis.
You absolutely must:
perform careful preprocessing
check that your distance functions produce a meaningful (to you, not just the computer) notion of similarity
reality-check your results, and check that they aren't too simple, determined e.g. by a single attribute
Don't blindly follow the numbers. K-means will happily produce k clusters no matter what data you give. It just optimizes some number. It's up to you to check that the results are useful, and analyze what their semantic meaning is - and it might well be that it just is mathematically a local optimum, but meaningless for your task.
For 5000 samples, all methods should work without problem.
The is a pretty good overview here.
One thing to consider is whether you want to fix the number of clusters or not.
See the table for possible choices of the clustering algorithm depending on that.
I think spectral clustering is a pretty good method. You can use it for example together with the RBF kernel. You have to adjust gamma, though, and possibly restrict connectivity.
Choices that don't need n_clusters are WARD and DBSCAN, also solid choices.
You can also consult this chart of my personal opinion which I can't find the link to in the scikit-learn docs...
For judging the result: If you have no ground truth of any kind (which I imagine you don't have if this is exploratory) there is no good measure [yet] (in scikit-learn).
There is one unsupervised measure, silhouette score, but afaik that favours very compact clusters as found by k-means.
There are stability measures for clusters which might help, though they are not implemented in sklearn yet.
My best bet would be to find a good way to inspect the data and visualize the clustering.
Have you tried PCA and thought about manifold learning techniques?

Categories

Resources