How to reconstruct an image after clustering with hdbscan?

I am trying to reconstruct a brain tumor image after clustering using hdbscan.
Unlike kmeans, however, hdbscan does not provide cluster centers, so I am unsure how to obtain the clustered image. I have tried computing a reference center for each cluster by matching the (65536, 3) array with the hdbscan labels (r), taking the mean of the points in each cluster, and storing those means in crs.
I am unsure whether this is the best way to reconstruct the image: that is, compute a mean center for each cluster and rebuild the image from those mean centers plus the labels.
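# One mean colour per cluster; labels in r run from 0 to dbnumber_of_clusters - 1 (noise, labelled -1, is skipped)
crs = np.zeros((dbnumber_of_clusters, 3))
for i in range(0, dbnumber_of_clusters):
    dbcluster_points = mriarr[r == i]  # all pixels assigned to cluster i
    dbcluster_mean = np.mean(dbcluster_points, axis=0)
    crs[i, :] = dbcluster_mean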

HDBSCAN is not designed to "reconstruct" data, so there may not be an elegant way.
Using the mean of each cluster is an obvious choice if you want to imitate what k-means does, but such a point may lie outside the actual cluster if the cluster is not convex. So it may be more appropriate to choose the densest point instead.
Furthermore, the clustering is supposed to be hierarchical, so when computing a cluster representative, you should also take the data of nested clusters into account...
Last but not least, it can produce a "noise cluster". That is not actually a cluster, but simply all the unclustered data. Computing a single representative object for such points is not meaningful. Instead, you probably want to treat each of these points as its own cluster.
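The points above can be sketched as follows. This is a minimal example, not a definitive method: it assumes the labels follow the hdbscan convention of -1 for noise, uses the cluster mean as the representative (the densest point would be an alternative), and keeps every noise pixel at its original value, i.e. treats it as its own cluster. The function name reconstruct_from_labels and the tiny arrays are made up for illustration; mriarr and r stand in for the arrays from the question.

```python
import numpy as np

def reconstruct_from_labels(pixels, labels, image_shape):
    """Replace each clustered pixel by its cluster mean; leave noise pixels unchanged."""
    pixels = np.asarray(pixels, dtype=float)
    labels = np.asarray(labels)
    out = pixels.copy()  # noise pixels (label -1) keep their original value
    for lab in np.unique(labels):
        if lab == -1:
            continue  # no single representative for unclustered points
        mask = labels == lab
        out[mask] = pixels[mask].mean(axis=0)
    return out.reshape(image_shape)

# Tiny stand-in for the (65536, 3) pixel array and the HDBSCAN labels
mriarr = np.array([[0, 0, 0], [2, 2, 2], [10, 10, 10], [12, 12, 12], [100, 50, 0]], dtype=float)
r = np.array([0, 0, 1, 1, -1])

img = reconstruct_from_labels(mriarr, r, (5, 1, 3))
```

For the image in the question, the last argument would be the original image shape, for example (256, 256, 3) if the 65536 pixels come from a 256 x 256 image.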

Related

explanation of sklearn optics plot

I am currently learning how to use OPTICS in sklearn. My input is a numpy array of shape (205, 22). I am able to get plots out of it, but I do not understand how I am getting a 2D plot out of multiple dimensions or how I am supposed to read it. I more or less understand the reachability plot, but the rest of it makes no sense to me. Can someone please explain what is happening? Is the function just simplifying the data to two dimensions somehow? Thank you

From the sklearn user guide:
The reachability distances generated by OPTICS allow for variable density extraction of clusters within a single data set. As shown in the above plot, combining reachability distances and data set ordering_ produces a reachability plot, where point density is represented on the Y-axis, and points are ordered such that nearby points are adjacent. ‘Cutting’ the reachability plot at a single value produces DBSCAN like results; all points above the ‘cut’ are classified as noise, and each time that there is a break when reading from left to right signifies a new cluster.
The other three plots are a visual representation of the actual clusters found by three different algorithms.
As you can see in the OPTICS Clustering plot, there are two high-density clusters (blue and cyan). The gray crosses are classified as noise, according to the reachability plot, because of the low xi value.
In the DBSCAN clustering with eps = 0.5, everything is considered noise, since the epsilon value is too low and the algorithm cannot find any sufficiently dense points.
In the third plot, the algorithm found just a single cluster because of the adjusted epsilon value, and everything above the 2.0 line is considered noise.
Please refer to the sklearn user guide for details.
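A small sketch may help show why the reachability plot is always two-dimensional regardless of the number of input features: it only plots one reachability distance per point against that point's position in the ordering. The data below is synthetic (two blobs plus a few scattered points, shaped like the (205, 22) array in the question), and min_samples=10 and eps=3.0 are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

rng = np.random.default_rng(0)
# Synthetic stand-in for a (205, 22) array: two dense blobs plus scattered points
X = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 22)),
    rng.normal(5.0, 0.3, size=(100, 22)),
    rng.uniform(-10.0, 10.0, size=(5, 22)),
])

optics = OPTICS(min_samples=10).fit(X)

# The reachability plot is built from these two 1-D arrays only
order = optics.ordering_                     # x-axis: position in the cluster ordering
reachability = optics.reachability_[order]   # y-axis: reachability distance of each point

# "Cutting" the plot at a fixed height gives DBSCAN-like labels (-1 means noise)
labels_cut = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=3.0,
)
```

Plotting reachability against np.arange(len(order)) gives the reachability plot; the scatter plots of the clusters, by contrast, can only show two of the features at a time.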

What are noisy samples in Scikit's DBSCAN clustering algorithm?

If I apply Scikit's DBSCAN (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) on a similarity matrix, I get a series of labels back. Some of these labels are -1. The documentation calls them noisy samples.
What are these? Do they all belong to a single cluster, or do they each belong to their own cluster since they're noisy?
Thank you

These are not exactly part of a cluster. They are simply points that do not belong to any clusters and can be "ignored" to some extent.
Remember, DBSCAN stands for "Density-Based Spatial Clustering of Applications with Noise." DBSCAN checks to make sure a point has enough neighbors within a specified range to classify the points into the clusters.
But what happens to the points that do not meet the criteria for falling into any of the main clusters? What if a point does not have enough neighbors within the specified radius to be considered part of a cluster? These are the points that are given the cluster label of -1 and are considered noise.
So what?
Well, if you are analyzing data points and you are only interested in the general clusters, you lower the size of the data and cut out the noise. Or, if you are using cluster analysis to classify data, in some cases it is possible to discard the noise as outliers.
In anomaly detection, points that do not fit into any category are also significant, as they can represent a problem or rare event.
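As a minimal illustration of the -1 label, here is a sketch on made-up 2-D points (two tight groups and one isolated point). The eps and min_samples values are arbitrary choices for this toy data, and it uses raw coordinates rather than a similarity matrix.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5],          # dense group A
    [10.0, 10.0], [10.5, 10.0], [10.0, 10.5], [10.5, 10.5],  # dense group B
    [50.0, 50.0],                                            # isolated point
])

labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)

noise_mask = labels == -1                                    # True for noisy samples
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # -1 is not a cluster
clustered_points = X[~noise_mask]                            # data with the noise cut out
```

Here the isolated point has no neighbors within eps, so it receives the label -1, while the two groups form two clusters. The noise points do not form a cluster of their own; they are simply everything that was left unassigned.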

How to improve HOG detector with linear SVM performance for car detection?

I want to detect cars in video recorded by a dashboard camera. I have read and researched quite a lot, but I still do not quite get it. I am thinking of using a HOG descriptor with a linear SVM. How can this approach be improved so that it is easier to implement and more robust, given that this will be a kind of research project for me?
I am thinking of combining another technique/algorithm with HOG, but I am still somewhat lost. I am quite new to this.
Any help is greatly appreciated. I am also open to other, better ideas.

HOG (histogram of oriented gradients) is merely a certain type of feature vector that can be computed from your data. You compute the gradient vector at each pixel in your image and then you divide up the possible angles into a discrete number of bins. Within a given image sub-region, you add the total magnitude of the gradient pointing in a given direction as the entry for the relevant angular bin containing that direction.
This leaves you with a vector that has a length equal to the number of bins you've chosen for dividing up the range of angles and acts as an unnormalized histogram.
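The description above can be sketched in a few lines of NumPy. This is a simplified illustration for a single sub-region only, not a full HOG implementation (it has no cell/block structure, bin interpolation, or normalization). The function name orientation_histogram, the choice of 9 bins, and the use of unsigned angles in [0, 180) degrees are assumptions made for the example.

```python
import numpy as np

def orientation_histogram(region, n_bins=9):
    """Magnitude-weighted histogram of gradient orientations for one image sub-region."""
    region = np.asarray(region, dtype=float)
    gy, gx = np.gradient(region)           # gradients along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)           # gradient strength at each pixel
    # Unsigned orientation in degrees, folded into [0, 180]
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, 180.0), weights=magnitude)
    return hist

# A horizontal intensity ramp: every gradient points along x with magnitude 1
ramp = np.tile(np.arange(8.0), (8, 1))
hist = orientation_histogram(ramp)
```

For the ramp, all of the gradient magnitude lands in the first angular bin, which is what an unnormalized histogram of a purely horizontal gradient should look like.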
If you want to compute other image features for the same sub-region, such as the sum of the pixels, some measurement of sharp angles or lines, aspects of the color distribution, or so forth, you can compute as many or as few as you would like, arrange them into a long vector as well, and simply concatenate that feature vector with the HOG vector.
You may also want to repeat the computation of the HOG vector for several different scale levels to help capture some scale variability, concatenating each scale-specific HOG vector onto the overall feature vector. There are other feature concepts like SIFT and others, which are created to automatically account for scale invariance.
You may need to do some normalization or scaling, which you can read about in any standard SVM guide. The standard LIBSVM guide is a great place to start.
You will have to be careful to organize your feature vector correctly since you will likely have a very large number of components to the feature vector, and you have to ensure they are always calculated and placed into the same ordering and undergo exactly the same scaling or normalization treatments.
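One way to keep the ordering and scaling consistent is to fix the feature order in one place and to compute the scaling statistics once, on the training data only. The sketch below assumes exactly that; the feature names in FEATURE_ORDER and the helper functions are hypothetical, and the random numbers merely stand in for real HOG vectors and image statistics.

```python
import numpy as np

# Hypothetical feature names; the order is fixed here and nowhere else
FEATURE_ORDER = ("hog", "mean_intensity", "intensity_std")

def build_feature_vector(parts):
    """Concatenate the named feature parts in the fixed FEATURE_ORDER."""
    return np.concatenate(
        [np.atleast_1d(np.asarray(parts[name], dtype=float)) for name in FEATURE_ORDER]
    )

def fit_scaler(train):
    """Compute per-component mean and standard deviation on the training set."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0   # avoid division by zero for constant components
    return mean, std

def apply_scaler(X, mean, std):
    """Apply the same scaling to training and test vectors."""
    return (X - mean) / std

rng = np.random.default_rng(0)
train = np.stack([
    build_feature_vector({
        "hog": rng.random(9),
        "mean_intensity": rng.random(),
        "intensity_std": rng.random(),
    })
    for _ in range(20)
])
mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)
```

Any vector passed to the trained SVM later would go through build_feature_vector and apply_scaler with the stored mean and std, never with statistics recomputed on the new data.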

Clustering points in 3D plane

Suppose we have 1000 random data points in a cube (as shown in the following image). The distribution of points is uniform in the X and Y directions but not in the Z direction: the deeper we go, the denser the data points are. Is there any straightforward way in Python to cluster these data points such that:
each cluster has equal size
each cluster consists of local points, i.e., each cluster consists of points that are close to each other.
I have already tried K-means clustering from the SciPy package, but it did not give me a good result: the points of each cluster were very widespread rather than concentrated.

Try using Scikit-Learn's implementation. It initializes the cluster centers using a technique known as "K-Means++", which picks the initial means probabilistically so that they tend to be spread out across the data. This gives a higher probability of a good result.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
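A minimal sketch of that suggestion, assuming synthetic data similar to the question (uniform in X and Y, density increasing with Z) and an arbitrary choice of 10 clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 1000
xy = rng.uniform(0.0, 1.0, size=(n, 2))          # uniform in X and Y
z = np.sqrt(rng.uniform(0.0, 1.0, size=n))       # density increases with Z
points = np.column_stack([xy, z])

kmeans = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

sizes = np.bincount(labels, minlength=10)        # number of points per cluster
```

Note that K-Means itself does not enforce equal cluster sizes, so sizes will generally vary between clusters; the K-Means++ initialization only helps with getting compact, local clusters.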

Can anyone provide me with some clustering examples?

I am having a hard time understanding what scipy.cluster.vq really does.
Wikipedia says clustering can be used to divide a digital image into distinct regions for border detection or object recognition.
Other sites and books say we can use clustering methods on images to find groups of similar images.
As I am interested in image processing, I really need to fully understand what clustering is.
So:
Can anyone show me simple examples of using scipy.cluster.vq with images?

The kind of clustering performed by scipy.cluster.vq is definitely of the latter (groups of similar images) variety.
The only clustering algorithm implemented in scipy.cluster.vq is the K-Means algorithm, which typically treats input data as points in n-dimensional euclidean space, and attempts to divide that space so that new, incoming data can be summarized by saying "example x is most like centroid y". Centroids can be thought of as prototypical examples of the input data. Vector quantization leads to concise, or compressed representations because, instead of remembering all 100 pixels of each new image we see, we can remember a single integer which points at the prototypical example that the new image is most like.
If you had many small grayscale images:
>>> import numpy as np
>>> images = np.random.random_sample((100,10,10))
So, we've got 100 10x10 pixel images. Let's assume they already all have similar brightness and contrast. The scipy kmeans implementation expects flat vectors:
>>> images = images.reshape((100,100))
>>> images.shape
(100, 100)
Now, let's train the K-Means algorithm so that any new incoming image can be assigned to one of 10 clusters:
>>> from scipy.cluster.vq import kmeans, vq
>>> codebook,distortion = kmeans(images,10)
Finally, let's say we have five new images we'd like to assign to one of the ten clusters:
>>> newimages = np.random.random_sample((5,10,10))
>>> clusters, distances = vq(newimages.reshape((5,100)),codebook)
clusters will contain the integer index of the best matching centroid for each of the five examples, and distances the distance from each example to that centroid.
This is kind of a toy example, and won't yield great results unless the objects of interest in the images you're working with are all centered. Since objects of interest might appear anywhere in larger images, it's typical to learn centroids for smaller image "patches", and then convolve them (compare them at many different locations) with larger images to promote translation-invariance.
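The patch idea can be sketched as follows, under the assumption of random grayscale data, 4x4 patches, and 8 centroids (all arbitrary choices). The helper extract_patches is hypothetical; a step of 1 on the new image means the centroids are compared at every possible location.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def extract_patches(image, size, step):
    """Collect size x size patches at a given stride, each flattened to one row."""
    h, w = image.shape
    rows = []
    for top in range(0, h - size + 1, step):
        for left in range(0, w - size + 1, step):
            rows.append(image[top:top + size, left:left + size].ravel())
    return np.array(rows)

rng = np.random.default_rng(0)

# Learn centroids from patches of a training image
train_image = rng.random((40, 40))
patches = extract_patches(train_image, size=4, step=4)    # 100 patches of 16 values
codebook, distortion = kmeans(patches, 8)

# Compare the centroids with a new image at every location
new_image = rng.random((20, 20))
codes, dists = vq(extract_patches(new_image, size=4, step=1), codebook)
code_map = codes.reshape(17, 17)   # best-matching centroid index per location
```

With real images, code_map shows which learned patch prototype matches best at each position, independent of where an object sits in the larger image.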

The second description is what clustering is: grouping objects that are somewhat similar (and those objects could be images). Clustering is not a pure imaging technique.
When processing a single image, it can for example be applied to colors. This is a quite good approach for reducing the number of colors in an image. If you cluster by colors and pixel coordinates, you can also use it for image segmentation, as it will group pixels that have a similar color and are close to each other. But this is an application domain of clustering, not pure clustering.
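A short sketch of the color-reduction case with scipy.cluster.vq, assuming a random RGB image as a stand-in for real data and an arbitrary palette size of 4:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))            # stand-in for a real RGB image

pixels = image.reshape(-1, 3)              # one row per pixel, one column per channel
codebook, _ = kmeans(pixels, 4)            # palette of up to 4 representative colors
codes, _ = vq(pixels, codebook)            # palette index for every pixel

# Rebuild the image using only the palette colors
quantized = codebook[codes].reshape(image.shape)
```

Appending each pixel's row and column coordinates as two extra (suitably scaled) columns before clustering would turn the same approach into the simple segmentation described above.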
