ValueError While Clustering in Sklearn - python

I have an RGB image of shape (3L, 5L, 5L), i.e. a 5 by 5 pixel image with 3 layers (R, G, and B). I want to cluster it using the DBSCAN algorithm as follows, but I get the error ValueError: Found array with dim 3. Expected <= 2. Can't I use DBSCAN on my 3D image?
import numpy as np
from sklearn.cluster import DBSCAN
from collections import Counter

data = np.random.rand(3, 5, 5)
print(np.shape(data))
print(data)
db = DBSCAN(eps=0.12, min_samples=3).fit(data)  # <-- raises the ValueError
print(db)
# DBSCAN(algorithm='auto', eps=0.12, leaf_size=30, metric='euclidean',
#        min_samples=3, p=None, random_state=None)
labels = db.labels_
print(Counter(labels))

To cluster, you need to define what the distance between two points is. DBSCAN is not a graph clustering algorithm; it works on feature vectors. You need to represent each pixel by features such that the distances between them are meaningful.
The features could be just RGB values, in which case similar colors are clustered together. Or the features could also include the x, y coordinates, in which case spatial distances are considered as well.
If you want to consider spatial distances, I'd suggest you take a look at scikit-image's segmentation module, which contains a couple of popular image segmentation methods.
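For the plain RGB case, flatten the image so that each pixel becomes one sample with 3 features; a minimal sketch, assuming the (3, 5, 5) channel-first layout from the question (eps and min_samples are illustrative, not tuned):

import numpy as np
from sklearn.cluster import DBSCAN

data = np.random.rand(3, 5, 5)           # (channels, height, width)
pixels = data.reshape(3, -1).T           # (25, 3): one row per pixel, one column per channel
db = DBSCAN(eps=0.12, min_samples=3).fit(pixels)
labels = db.labels_.reshape(5, 5)        # label map back in image layout; -1 marks noise
print(labels)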

Related

PCA from scratch and Sklearn PCA giving different output

I am trying to implement PCA from scratch. Here is the code:
import numpy as np
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()                        # standardization
X_new = sc.fit_transform(X)
Z = np.dot(X_new.T, X_new) / X_new.shape[0]  # covariance matrix
eig_values, eig_vectors = np.linalg.eig(Z)   # eigen decomposition
ev_index = np.argsort(eig_values)[::-1]      # eigenvalue indices, largest first
pc = eig_vectors[:, ev_index]                # eigenvectors sorted by eigenvalue
W = pc[:, 0:2]                               # extract the first 2 components
print(W)
and getting the following components:
[[ 0.52237162 -0.37231836]
[-0.26335492 -0.92555649]
[ 0.58125401 -0.02109478]
[ 0.56561105 -0.06541577]]
When I use the sklearn's PCA I get the following two components:
array([[ 0.52237162, -0.26335492, 0.58125401, 0.56561105],
[ 0.37231836, 0.92555649, 0.02109478, 0.06541577]])
Projecting onto the new feature space gives noticeably different plots (figures omitted).
Where am I going wrong, and what can be done to resolve the problem?
The result of a PCA is technically not n vectors but a subspace of dimension n. This subspace is represented by n vectors that span it.
In your case, while the vectors are different, the spanned subspace is the same, so the result of the PCA is the same.
If you want to align your solution perfectly with the sklearn solution, you need to normalise it the same way. sklearn applies a deterministic sign convention to its components (an internal svd_flip step); you'd have to dig into its implementation for the details.
edit:
Yes, of course, what I wrote above is wrong. The algorithm itself returns an ordered orthonormal basis: vectors of length one, orthogonal to each other, ordered by their 'importance' to the dataset. So that is considerably more information than just the subspace.
However, if v, w, u is a solution of the PCA, then so is ±v, ±w, ±u.
edit: It seems that np.linalg.eig makes no guarantee about which representative of an eigenspace it returns, see also here:
NumPy linalg.eig
So a new version of numpy, or just how the stars are aligned today, can change your result, although for a PCA it should only vary in sign.
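If you want sign-stable output from the scratch implementation, you can impose a convention yourself. A minimal sketch: flip each component so that its largest-magnitude entry is positive (this often, but not necessarily always, matches sklearn's convention):

import numpy as np

def normalize_signs(W):
    # For each column (component), find the entry with the largest magnitude
    # and flip the column so that this entry is positive.
    dominant = np.argmax(np.abs(W), axis=0)
    signs = np.sign(W[dominant, np.arange(W.shape[1])])
    return W * signs

W_aligned = normalize_signs(W)   # W from the question: shape (n_features, 2)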

plot clusters of kmeans of sparse matrix

I have a Python script that performs clustering over a data file in svmlight format.
I use the function sklearn.datasets.load_svmlight_file to load the data from the data file.
I know that this function returns a sparse matrix.
I need to scatter plot the clusters; can anybody help me, please?
This is what I have done:
import sklearn.datasets
import sys
from sklearn.cluster import KMeans
dataFilename = sys.argv[1]
X, y = sklearn.datasets.load_svmlight_file(dataFilename)
kmeans = KMeans(n_clusters = 3)
kmeans.fit(X)
labels = kmeans.labels_
print(labels)
centroids = kmeans.cluster_centers_
Without having the dataset, I would suggest the following:
Since load_svmlight_file() returns a sparse matrix, turn X into a NumPy array using samples = X.toarray() prior to fitting the model.
Plot two features (for example) of the dataset using:
plt.scatter(samples[:,0], samples[:,1], c=labels). This colours the clusters by their predicted labels.
Follow this with plt.scatter(centroids[:,0], centroids[:,1], marker='D') to see the location of the centroids with diamonds.
Note that samples[:,n] represents an array containing the sample values for the nth feature of the dataset.
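Putting those steps together, a minimal end-to-end sketch (assuming the data file path is passed as the first command-line argument, as in the question):

import sys
import sklearn.datasets
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X, y = sklearn.datasets.load_svmlight_file(sys.argv[1])
samples = X.toarray()                      # densify the sparse matrix for plotting

kmeans = KMeans(n_clusters=3)
kmeans.fit(samples)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# First two features, coloured by predicted cluster; centroids drawn as diamonds.
plt.scatter(samples[:, 0], samples[:, 1], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='D')
plt.show()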
I hope this helps. If not, please let me know.

How to use FLANN for labeling and clustering?

I read a paper whose retrieval system is based on SIFT descriptors and fast approximate k-means clustering. I installed pyflann. If I am not mistaken, the following commands only find the indices of the data points closest to a given sample (here, the indices of the 5 nearest points in dataset to each point in testset):
from pyflann import FLANN
from numpy.random import rand

dataset = rand(10000, 128)
testset = rand(1000, 128)
flann = FLANN()
result, dists = flann.nn(dataset, testset, 5, algorithm="kmeans",
                         branching=32, iterations=7, checks=16)
I went through the user manual but could not find how to do k-means clustering with FLANN, nor how to fit the test set based on the cluster centers. With scikit-learn we can use k-means++ clustering and then fit the dataset to the model:
kmeans = KMeans(n_clusters=100, init='k-means++', random_state=0, verbose=0)
kmeans.fit(dataset)
and later we can assign labels to the test set by using a KDTree, for example:
from sklearn.neighbors import KDTree

kdt = KDTree(kmeans.cluster_centers_)
Q = testset                            # query
kdt_dist, kdt_idx = kdt.query(Q, k=1)  # knn
test_labels = kdt_idx                  # knn=1 labels
Could someone please show me how to do the same procedure with FLANN? (I mean clustering the dataset, i.e. finding the cluster centers and quantizing the features, and then quantizing the test set based on the cluster centers found in the previous step.)
You won't be able to implement the best k-means variations with FLANN, because these use two indexes at the same time and are ugly to implement.
You can, however, build a new index on the centers for every iteration, though unless you have k > 1000 it probably will not help much.
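For plain approximate k-means, the pyflann binding exposes a kmeans() helper in the versions I have seen (treat the exact signature as an assumption and check your installed copy); a sketch of the cluster-then-assign workflow, mirroring the sklearn snippet above:

import numpy as np
from pyflann import FLANN

dataset = np.random.rand(10000, 128)
testset = np.random.rand(1000, 128)

flann = FLANN()
# Compute 100 cluster centers with FLANN's hierarchical k-means (assumed API).
centers = flann.kmeans(dataset, num_clusters=100, max_iterations=7)

# Quantize the test set: the index of the nearest center is the cluster label.
test_labels, dists = flann.nn(centers, testset, 1, algorithm="kmeans",
                              branching=32, iterations=7, checks=16)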

Re-using cluster_centers for another K-mean clustering

I have a matrix of dimension (nw, ny, nx), where nx and ny are the dimensions of an image (photon counts) and, for each pixel, I have a spectral profile of nw wavelength points.
I have applied K-means clustering from the scikit-learn Python package with the number of clusters ncl = 5.
dat = dat1.reshape(nw, nx*ny)
mm[:] = KMeans(n_clusters=ncl).fit(np.transpose(dat)).labels_
x = KMeans(n_clusters=ncl).fit(np.transpose(dat)).cluster_centers_
and then, plotting x[i,:] (i = cluster label), I can see the 5 different average spectral profiles generated by K-means.
Now my question is the following: I would like to use these 5 cluster_centers on a different dataset of the same dimensions (nw, ny, nx) to retrieve the labels that I have here called mm. How can I do it?
Thank you in advance for your time.
As @sascha pointed out, you need to persist the fitted KMeans object to predict labels for future data:
dat = dat1.reshape(nw, nx*ny)
clusterer = KMeans(n_clusters=ncl).fit(np.transpose(dat))

dat2 = dat2.reshape(nw, nx*ny)
dat2_labels = clusterer.predict(np.transpose(dat2))
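If you also want the label maps back in image layout, you can reshape the 1-D label arrays; a small continuation sketch, assuming the same nw, nx, ny names and C-order reshapes as above:

mm = clusterer.labels_.reshape(ny, nx)   # per-pixel cluster map for dat1
mm2 = dat2_labels.reshape(ny, nx)        # per-pixel cluster map for dat2
x = clusterer.cluster_centers_           # x[i, :] is the mean spectrum of cluster i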

Labels for cluster centers in Python sklearn

When using sklearn's sklearn.cluster module for K-means clustering, a fitted k-means object has 3 attributes, including a numpy array of cluster centers (centers x features) named cluster_centers_. However, these centers don't have an attached label.
My question is: are the centers (rows) in cluster_centers_ ordered by label value? That is, does row 1 correspond to the center of the cluster labeled 1? Or are they placed in the array in arbitrary order? A pointer to any documentation would be more than sufficient.
Thanks.
I couldn't find it in the documentation, but yes, the centers are ordered by cluster label.
So:
kmeans.cluster_centers_[0] = centroid of cluster 0
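You can verify this empirically; a quick sketch on hypothetical data, comparing each stored center with the mean of the samples assigned to it (exact equality should hold once the run has fully converged):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)               # hypothetical data
kmeans = KMeans(n_clusters=3).fit(X)

for i in range(3):
    # Row i of cluster_centers_ should be the mean of the samples labeled i.
    assigned = X[kmeans.labels_ == i]
    print(i, np.allclose(kmeans.cluster_centers_[i], assigned.mean(axis=0)))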
