Suppose we have 1000 random data points in a cube (as shown in the following image). The distribution of points in the X and Y directions is uniform, but not in the Z direction: the deeper we go, the denser the points become. Is there any straightforward way in Python to cluster these data points such that:
each cluster has equal size
each cluster consists of local points, i.e., points that are close to each other.
I have already tried K-means clustering from the SciPy package, but it did not give me a good result: the points of each cluster were very spread out rather than concentrated.
Try using scikit-learn's implementation. It initializes the cluster centers with the "k-means++" scheme, which picks the initial means probabilistically so that they are well spread out over the data. That makes a good final result much more likely.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
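For illustration, a minimal sketch of how that might look (the generated data and the choice of 10 clusters are just assumptions for this example):

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-in for the question's data: uniform in X and Y, denser towards one face in Z.
    rng = np.random.default_rng(0)
    xy = rng.uniform(size=(1000, 2))
    z = rng.power(3, size=(1000, 1))
    points = np.hstack([xy, z])

    km = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(points)    # cluster index per point
    centers = km.cluster_centers_      # one 3D centroid per cluster

Note that k-means++ only improves the starting centers; plain k-means still does not enforce equal-sized clusters.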
I am currently learning how to use OPTICS in sklearn. I am inputting a NumPy array of shape (205, 22). I am able to get plots out of it, but I do not understand how I am getting a 2D plot out of multiple dimensions and how I am supposed to read it. I more or less understand the reachability plot, but the rest of it makes no sense to me. Can someone please explain what is happening? Is the function just simplifying the data to two dimensions somehow? Thank you
From the sklearn user guide:
The reachability distances generated by OPTICS allow for variable density extraction of clusters within a single data set. As shown in the above plot, combining reachability distances and data set ordering_ produces a reachability plot, where point density is represented on the Y-axis, and points are ordered such that nearby points are adjacent. ‘Cutting’ the reachability plot at a single value produces DBSCAN like results; all points above the ‘cut’ are classified as noise, and each time that there is a break when reading from left to right signifies a new cluster.
The other three plots are a visual representation of the actual clusters found with three different clustering settings.
As you can see in the OPTICS clustering plot, there are two high-density clusters (blue and cyan); according to the reachability plot, the gray crosses are classified as noise because of the low xi value.
In the DBSCAN clustering with eps = 0.5, everything is considered noise, since the epsilon value is too low and the algorithm cannot find any dense regions.
In the third plot, the algorithm finds just a single cluster because epsilon has been raised to 2.0, and everything above the 2.0 line in the reachability plot is considered noise.
For more details, please refer to the user guide.
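If it helps, here is a minimal sketch of how the reachability plot relates to the OPTICS attributes and how a DBSCAN-style "cut" is made (the random data and parameter values are only placeholders for the (205, 22) array from the question):

    import numpy as np
    from sklearn.cluster import OPTICS, cluster_optics_dbscan

    rng = np.random.default_rng(0)
    X = rng.normal(size=(205, 22))            # stand-in for the real data

    opt = OPTICS(min_samples=5, xi=0.05).fit(X)

    # The reachability plot: reachability distance on the Y-axis, with the
    # points re-ordered so that neighbouring points end up side by side.
    reachability = opt.reachability_[opt.ordering_]

    # 'Cutting' the plot at a fixed epsilon reproduces DBSCAN-like labels.
    labels_eps_05 = cluster_optics_dbscan(
        reachability=opt.reachability_,
        core_distances=opt.core_distances_,
        ordering=opt.ordering_,
        eps=0.5,
    )

No dimensionality reduction is going on: the X-axis of the reachability plot is simply the ordering of the points, which is why a 2D plot can summarize 22-dimensional data.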
I am trying to reconstruct a brain tumor image after clustering using hdbscan.
However, unlike k-means, hdbscan does not provide cluster centers, so I am a bit confused about how to obtain the clustered image. I have tried obtaining reference cluster centers by matching the (65536, 3) array against the hdbscan labels (r), computing the mean of each cluster's points, and storing the results in crs.
I am unsure whether this is the best way to reconstruct the image, i.e., computing mean centers per cluster and rebuilding the image from those means plus the labels.
    # One mean colour per cluster: mriarr is the flattened (65536, 3) image
    # and r holds the hdbscan cluster labels.
    crs = np.zeros((dbnumber_of_clusters, 3))
    for i in range(0, dbnumber_of_clusters):
        dbcluster_points = mriarr[r == i]                  # pixels assigned to cluster i
        dbcluster_mean = np.mean(dbcluster_points, axis=0)
        crs[i, :] = dbcluster_mean
HDBSCAN is not designed to "reconstruct" data. So there may not be an elegant way.
Using the mean of each cluster is the obvious way to mimic what k-means does, but such a point may lie outside the actual cluster if the cluster is not convex. So it may be more appropriate to choose the densest point instead.
Furthermore, the clustering is supposed to be hierarchical, so when computing a cluster representative, you should also take the data of nested clusters into account...
Last but not least, it can produce a "noise cluster". That is not actually a cluster, but simply all the unclustered data. Computing a single representative object for such points is not meaningful. Instead, you probably want to treat each of these points as its own cluster.
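One possible sketch of that advice (assumptions: hdbscan's probabilities_ is used as a proxy for how dense/typical a member is, the most typical member of each cluster becomes its representative, noise points are left unchanged, and the small random array only stands in for the flattened image from the question):

    import numpy as np
    import hdbscan

    rng = np.random.default_rng(0)
    mriarr = rng.random((4096, 3))            # stand-in, kept small for speed

    clusterer = hdbscan.HDBSCAN(min_cluster_size=50).fit(mriarr)
    labels = clusterer.labels_                # -1 marks noise
    probs = clusterer.probabilities_          # how strongly each point belongs

    reconstructed = mriarr.copy()
    for k in np.unique(labels):
        if k == -1:
            continue                                   # noise: keep original values
        members = np.where(labels == k)[0]
        rep = members[np.argmax(probs[members])]       # most "typical" member
        reconstructed[members] = mriarr[rep]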
If I apply Scikit's DBSCAN (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) on a similarity matrix, I get a series of labels back. Some of these labels are -1. The documentation calls them noisy samples.
What are these? Do they all belong to a single cluster, or do they each belong to their own cluster since they're noisy?
Thank you
These are not exactly part of a cluster. They are simply points that do not belong to any cluster and can, to some extent, be ignored.
Remember, DBSCAN stands for "Density-Based Spatial Clustering of Applications with Noise." DBSCAN checks whether a point has enough neighbors within a specified radius before assigning it to a cluster.
But what happens to the points that do not meet the criteria for falling into any of the main clusters? What if a point does not have enough neighbors within the specified radius to be considered part of a cluster? These are the points that are given the cluster label of -1 and are considered noise.
So what?
Well, if you are only interested in the main clusters, you can cut out the noise and reduce the size of your data. Or, if you are using cluster analysis to classify data, in some cases it is reasonable to discard the noise as outliers.
In anomaly detection, points that do not fit into any category are also significant, as they can represent a problem or rare event.
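As a small illustration (the toy data and parameter values here are arbitrary), the -1 labels are easy to separate out after fitting:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two dense blobs plus a few scattered outliers.
    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.normal(0.0, 0.1, size=(50, 2)),
        rng.normal(3.0, 0.1, size=(50, 2)),
        rng.uniform(-2, 5, size=(5, 2)),
    ])

    labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

    clustered = X[labels != -1]   # points that belong to some cluster
    noise = X[labels == -1]       # the -1 points: potential outliers/anomalies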
I am trying to write a clustering algorithm, and I would like to create some easy 2D test cases: points in [0, 1] x [0, 1] that form clusters.
E.g. something like this:
It would be better if the clusters had different (but random) shapes, e.g. like:
Is there an easy way to do this with Python / NumPy? Unfortunately, the generation must be very efficient. I wrote some code, but the clusters always have the same shape and they are often far away from each other. Probably a nice algorithm already exists?
Thank you
No, there isn't a packaged way to do this. However, the generation algorithms aren't that hard to write. The first one appears to be a Gaussian distribution in each dimension (X and Y), repeated for each of three centroids. Alternatively, perhaps it's a uniform direction with a "decay function" distance.
The second is a pair of sets: choose the radius from a Gaussian with a small variance, while the direction is uniform over the full circle. Do that for mean radius 1 and mean radius 3.
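A quick NumPy-only sketch of both generators (the counts, spreads, and radii are arbitrary; rescale the ring data into [0, 1] x [0, 1] if you need it there):

    import numpy as np

    rng = np.random.default_rng(0)

    # Gaussian blobs: one isotropic Gaussian per randomly placed centroid.
    def gaussian_blobs(n_clusters=3, points_per_cluster=200, spread=0.04):
        centers = rng.uniform(0.2, 0.8, size=(n_clusters, 2))
        return np.vstack([
            rng.normal(c, spread, size=(points_per_cluster, 2)) for c in centers
        ])

    # Ring-shaped clusters: Gaussian radius, uniform angle.
    def rings(radii=(1.0, 3.0), points_per_ring=300, radial_std=0.1):
        out = []
        for mean_r in radii:
            r = rng.normal(mean_r, radial_std, size=points_per_ring)
            theta = rng.uniform(0.0, 2.0 * np.pi, size=points_per_ring)
            out.append(np.column_stack([r * np.cos(theta), r * np.sin(theta)]))
        return np.vstack(out)

    blobs = gaussian_blobs()   # roughly the first picture
    ring_data = rings()        # roughly the second picture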
Does that get you moving?
So basically, I use the Python module scipy-cluster to plot a lot of data points. Is there a way/function that gives the representative of each cluster, given either the threshold or the number of representatives I want? Ideally, each representative should be the point closest to the center of the cluster it belongs to.
Edit: I'm looking for the data point closest to the centroid in each cluster.
Scipy-cluster provides coordinates for each centroid and identifies which points are in each cluster. Once you have that, I believe scipy.cluster.vq.py_vq will give you the distance between observations and centroids.
I don't really know my way around scipy-cluster, but it sounds like it gives you the centroid coordinates. Given that information and the knowledge of which points are in each cluster, it should be trivial to calculate the distance from the centroid for each point in the cluster. Just make sure your calculation is based on the same distance metric you used for clustering (probably Euclidean distance).
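If it helps, here is a rough sketch of that idea with scipy.cluster.vq (the toy data, the whitening step, and the choice of 4 clusters are just assumptions; with scipy-cluster's hierarchical output you would plug in your own labels and centroids the same way):

    import numpy as np
    from scipy.cluster.vq import kmeans, vq, whiten

    rng = np.random.default_rng(0)
    obs = whiten(rng.random((500, 2)))     # scipy's k-means expects whitened data

    centroids, _ = kmeans(obs, 4)          # codebook of 4 centroids
    codes, dists = vq(obs, centroids)      # nearest centroid and distance per point

    # The representative of each cluster is the member closest to its centroid.
    representatives = np.array([
        obs[codes == k][np.argmin(dists[codes == k])] for k in np.unique(codes)
    ])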