Spectral embedding - spectral clustering - python

I'm trying to perform spectral embedding/clustering using Normalized Cuts. I wrote the following code but I am stuck at a logical bottleneck: what do I have to do after clustering the eigenvectors? I don't know how to form the clusters on my original dataset. (A is my affinity matrix.)
import numpy as np
from scipy.cluster.vq import kmeans2

d = np.sum(A, axis=0)                     # node degrees
D = np.diag(d)
D_half_inv = np.diag(1.0 / np.sqrt(d))
M = D_half_inv @ (D - A) @ D_half_inv     # normalized Laplacian
# compute eigenvalues (ascending) and eigenvectors (columns of v)
w, v = np.linalg.eigh(M)
# scale each eigenvector by D^(-1/2), then renormalize every column to norm 1
v = v / np.sqrt(d)[:, None]
v = v / np.linalg.norm(v, axis=0)
v_trailing = v[:, 1:45]  # omit the eigenvector of the smallest eigenvalue (which is 0); 45 is my embedding dimension
k = 20  # number of clusters
centroids, idx = kmeans2(v_trailing, k)
After that, I get a label for each row of the eigenvector matrix. But how can I link these labels to my original dataset?

The labels map back to the original dataset by index: the i-th label belongs to the i-th row of the embedding, which corresponds to the i-th data point (the i-th row of A).
So if y_i is in C_m, then the i-th entry of A will be in A_m.
To put it another way:
Let C_1, ..., C_M be the set of clusters generated by clustering the rows of the eigenvector matrix. The clusters you want are A_1, ..., A_M, where A_i = { j | y_j is an element of C_i }.
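The index mapping described above can be sketched as follows. This is only an illustration: `data` and the hard-coded `idx` array are hypothetical stand-ins for the original dataset and for the label array returned by kmeans2, assumed to be in the same row order as A.

```python
import numpy as np

# Hypothetical stand-ins: `idx` plays the role of the labels returned by
# kmeans2, and `data` is the original dataset in the same row order as A.
data = np.array([[0.0, 0.1], [5.0, 5.2], [0.2, 0.0], [5.1, 4.9]])
idx = np.array([0, 1, 0, 1])

# A_m = { j | y_j in C_m }: indices of the original points in cluster m
clusters = {int(m): np.flatnonzero(idx == m) for m in np.unique(idx)}

for m, members in clusters.items():
    # data[members] selects the original points that belong to cluster m
    print(m, members.tolist(), data[members].tolist())
```

Because no rows are reordered between A, the eigenvector matrix and the labels, plain integer indexing is all that is needed.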

Related

Generate singular values of a matrix (Gaussian Random) using Quadrant law?

I need to generate Gaussian random matrices with a given rank and condition number in Python.
For a given target condition number K = 5, I select the largest singular value sigma_max uniformly at random in the interval [1, 500]. This sets the smallest singular value to sigma_min = sigma_max / K.
The remaining singular values need to be generated from the quadrant law. For reference, see Theorem 2 of the paper "On the singular values of Gaussian random matrices" by Jianhong Shen.
I can use SVD and QR decomposition to generate the final matrix.
Here is the code I'm working on in Python.
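import numpy as np

# rows and cols (the matrix dimensions) are assumed to be defined elsewhere
rank = 5
condn_num = 5  # condition number
sigma_max = np.random.uniform(1, 500)
sigma_min = sigma_max / condn_num
# orthonormal columns for U and V via QR of Gaussian matrices
U_ = np.random.normal(0, 1, (rows, rank))
U, _ = np.linalg.qr(U_)
V_ = np.random.normal(0, 1, (cols, rank))
V, _ = np.linalg.qr(V_)
# s is a diagonal matrix of dimension (rank x rank), still to be populated
final_matrix = U @ s @ V.transpose()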
I need to populate the diagonal matrix s with the remaining singular values drawn from the quadrant law mentioned in the paper above.
Does anybody know how the quadrant law is used to generate the remaining singular values?
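For the assembly step only, a minimal sketch is given below. It does not answer the quadrant-law part: the interior singular values are drawn uniformly between sigma_min and sigma_max purely as a placeholder, and the matrix sizes are hypothetical. It shows that fixing the largest and smallest singular values pins the condition number regardless of how the interior values are chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, rank = 8, 6, 5          # hypothetical sizes
cond = 5.0
sigma_max = rng.uniform(1, 500)
sigma_min = sigma_max / cond

# Placeholder: the interior singular values are drawn uniformly here,
# NOT from the quadrant law discussed in the question.
inner = rng.uniform(sigma_min, sigma_max, size=rank - 2)
sigma = np.sort(np.concatenate(([sigma_max], inner, [sigma_min])))[::-1]

# Orthonormal columns from the reduced QR of Gaussian matrices
U, _ = np.linalg.qr(rng.normal(size=(rows, rank)))
V, _ = np.linalg.qr(rng.normal(size=(cols, rank)))
final_matrix = U @ np.diag(sigma) @ V.T

# The nonzero singular values of the result are exactly `sigma`
recovered = np.linalg.svd(final_matrix, compute_uv=False)[:rank]
print(recovered[0] / recovered[-1])
```

Swapping the `inner` line for a sampler of the desired distribution leaves the rest unchanged.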

Get inertia for nltk k means clustering using cosine_similarity

I have used nltk for k-means clustering because I would like to change the distance metric. Does nltk's k-means have an inertia similar to that of sklearn? I can't seem to find it in their documentation or online.
The code below is how people usually find the inertia using sklearn's k-means.
inertia = []
for n_clusters in range(2, 26, 1):
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(features)
    centers = clusterer.cluster_centers_
    inertia.append(clusterer.inertia_)
plt.plot([i for i in range(2,26,1)], inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
You can write your own function to obtain the inertia for the KMeansClusterer in nltk.
This uses the same dummy data as your earlier question, "How do I obtain individual centroids of K mean cluster using nltk (python)", after forming 2 clusters.
Referring to the docs (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), inertia is the sum of squared distances of samples to their closest cluster center.
feature_matrix = df[['feature1','feature2','feature3']].to_numpy()
centroid = df['centroid'].to_numpy()

def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # inertia as given in the scikit-learn docs, i.e. the sum of squared distances
        sum_.append(np.sum((feature_matrix[i] - centroid[i])**2))
    return sum(sum_)

nltk_inertia(feature_matrix, centroid)
# output: 27.495250000000002

# now using scikit-learn k-means on feature1, feature2 and feature3 with the same number of clusters (2)
scikit_kmeans = KMeans(n_clusters=2)
scikit_kmeans.fit(vectors)  # vectors = [np.array(f) for f in df.values], which contain feature1, feature2, feature3
scikit_kmeans.inertia_
# output: 27.495250000000006
The previous comment is actually missing a small detail:
feature_matrix = df[['feature1','feature2','feature3']].to_numpy()
centroid = df['centroid'].to_numpy()
cluster = df['predicted_cluster'].to_numpy()

def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        sum_.append(np.sum((feature_matrix[i] - centroid[cluster[i]])**2))
    return sum(sum_)
You have to select the corresponding cluster centroid when calculating distance between centroids and data points. Notice the cluster variable in the above code.
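A self-contained sketch of the same idea, using only NumPy, is given below. The function name `inertia` and its `metric` switch are illustrative and not part of nltk or scikit-learn; the labels and centroids are assumed to come from whatever clusterer you used, with `centroids[m]` being the centre of cluster m. Squaring the cosine distance is a design choice made here to mirror the Euclidean definition, not something the scikit-learn docs prescribe.

```python
import numpy as np

def inertia(points, labels, centroids, metric="euclidean"):
    """Sum of squared distances from each point to its assigned centroid."""
    points = np.asarray(points, dtype=float)
    # pick, for every point, the centroid of the cluster it was assigned to
    assigned = np.asarray(centroids, dtype=float)[np.asarray(labels)]
    if metric == "euclidean":
        dist = np.linalg.norm(points - assigned, axis=1)
    elif metric == "cosine":
        # cosine distance = 1 - cosine similarity (undefined for zero vectors)
        num = np.sum(points * assigned, axis=1)
        den = np.linalg.norm(points, axis=1) * np.linalg.norm(assigned, axis=1)
        dist = 1.0 - num / den
    else:
        raise ValueError(f"unknown metric: {metric}")
    return float(np.sum(dist ** 2))

points = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [4.0, 1.0]])
print(inertia(points, labels, centroids))  # 4.0
```

With the Euclidean metric this matches the scikit-learn definition quoted above; with the cosine metric it gives a comparable quantity for an elbow plot.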

Calculating Mean Squared Error through Matrix Arithmetic on Numpy Matrices of Binary Images

I have 2 binary images: one is a ground truth, and one is an image segmentation that I produced.
I am trying to calculate the mean squared distance between the two images, where:
Let G = {g1, g2, . . . , gN} be the points in the ground truth boundary.
Let B = {b1, b2, . . . , bM} be the points in the segmented boundary.
Let d(p, p0) be a measure of distance between points p and p0 (e.g. Euclidean, city block, etc.).
I am using the following algorithm.
def MSD(A,G):
    '''
    Takes a thresholded binary image, and a ground truth img(binary), and computes the mean squared absolute difference
    :param A: The thresholded binary image
    :param G: The ground truth img
    :return:
    '''
    sim = np.bitwise_xor(A,G)
    sum = 0
    for i in range(0,sim.shape[0]):
        for j in range(0,sim.shape[1]):
            if (sim[i,j] == True):
                min = 9999999
                for k in range(0,sim.shape[0]):
                    for l in range(0,sim.shape[1]):
                        if (sim[k, l] == True):
                            e = abs(i-k) + abs(j-l)
                            if e < min:
                                min = e
                                mink = k
                                minl = l
                sum += min
    return sum/(sim.shape[0]*sim.shape[1])
This algorithm is too slow though and never completes.
This example and this example (Answer 3) might show method of how to get the mean squared error using Matrix arithmetic, but I do not understand how these examples make any sense or why they work.
So if I understand your formula and code correctly, you have one (binary) image B and a (ground truth) image G. "Points" are defined by the pixel positions where either image has a True (or at least nonzero) value. From your bitwise_xor I deduce that both images have the same shape (M,N).
So the quantity d^2(b,g) is at worst an (M*N, M*N)-sized array, relating each pixel of B to each pixel of G. It's even better: we only need shape (m,n) if there are m nonzeros in B and n nonzeros in G. Unless your images are huge, we can get away with storing this large array. This costs memory, but vectorization saves a lot of CPU time. We then only have to find, for each of the m points of B, the minimum distance over the n points of G, and then sum up these minima (dividing by m to get the mean). Note that the solution below uses extreme vectorization, and it can easily eat up your memory if the images are large.
Assuming Manhattan distance (with the square in d^2 which seems to be missing from your code):
import numpy as np

# generate dummy data
M,N = 100,100
B = np.random.rand(M,N) > 0.5
G = np.random.rand(M,N) > 0.5

def MSD(B, G):
    # get indices of nonzero pixels
    nnz_B = B.nonzero() # (x_inds, y_inds) tuple, x_inds and y_inds are shape (m,)
    nnz_G = G.nonzero() # (x_inds', y_inds') each with shape (n,)
    # np.array(nnz_B) has shape (2,m)
    # compute squared Manhattan distance
    dist2 = abs(np.array(nnz_B)[...,None] - np.array(nnz_G)[:,None,:]).sum(axis=0)**2 # shape (m,n)
    # alternatively: Euclidean for comparison:
    #dist2 = ((np.array(nnz_B)[...,None] - np.array(nnz_G)[:,None,:])**2).sum(axis=0)
    mindist2 = dist2.min(axis=-1) # shape (m,) of minimum square distances
    return mindist2.mean() # sum divided by m, i.e. the MSD itself

print(MSD(B, G))
If the above uses too much memory we can introduce a loop over the elements of nnz_B, and only vectorize in the elements of nnz_G. This will take more CPU power and less memory. This trade-off is typical for vectorization.
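The loop variant mentioned above might look like the sketch below. The function name `msd_low_memory` is made up for this example; it uses the squared Manhattan distance like the vectorized version, but only ever holds one row of the (m,n) distance array in memory.

```python
import numpy as np

def msd_low_memory(B, G):
    """Mean squared Manhattan distance from each nonzero pixel of B
    to the nearest nonzero pixel of G, using O(n) extra memory."""
    pts_B = np.argwhere(B)   # shape (m, 2)
    pts_G = np.argwhere(G)   # shape (n, 2)
    total = 0.0
    for p in pts_B:
        # distances from one pixel of B to all pixels of G
        d = np.abs(pts_G - p).sum(axis=1)
        total += d.min() ** 2
    return total / len(pts_B)

rng = np.random.default_rng(0)
B = rng.random((30, 30)) > 0.9
G = rng.random((30, 30)) > 0.9
print(msd_low_memory(B, G))
```

The Python-level loop runs m times, so this is slower than the fully vectorized version but stays usable when m*n no longer fits in memory.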
An efficient method for calculating this distance is the distance transform. SciPy has an implementation in the ndimage package: scipy.ndimage.distance_transform_edt (older SciPy versions also exposed it as scipy.ndimage.morphology.distance_transform_edt).
The idea is to compute a distance transform for the background of the ground-truth image G. This leads to a new image D that is 0 for each pixel that is nonzero in G, and for each zero pixel in G there will be the distance to the nearest nonzero pixel.
Next, for each nonzero pixel in B (or A in the code that you posted), you look at the corresponding pixel in D. This is the distance to G for that pixel. So, simply average all the values in D for which B is nonzero to obtain your result.
import numpy as np
import scipy.ndimage as nd
import matplotlib.pyplot as pp

# Create some test data
img = pp.imread('erika.tif')  # a random image
G = img > 120  # the ground truth
img = img + np.random.normal(0, 20, img.shape)
B = img > 120  # the other image

# distance from every pixel to the nearest nonzero pixel of G
D = nd.distance_transform_edt(~G)
msd = np.mean(D[B]**2)
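To check the distance-transform approach without an image file, here is a small sketch on synthetic data. The helper name `msd_edt` and the two line-shaped test images are assumptions made for this example only.

```python
import numpy as np
from scipy import ndimage

def msd_edt(B, G):
    """Mean squared Euclidean distance from the nonzero pixels of B
    to the nearest nonzero pixel of G."""
    # distance_transform_edt measures the distance to the nearest zero,
    # so invert G to make its nonzero pixels the targets
    D = ndimage.distance_transform_edt(~G)
    return np.mean(D[B] ** 2)

G = np.zeros((20, 20), dtype=bool)
G[5, 5:15] = True          # ground truth: a horizontal line in row 5
B = np.zeros((20, 20), dtype=bool)
B[8, 5:15] = True          # segmentation: the same line shifted down by 3 rows
print(msd_edt(B, G))       # 9.0
```

Every pixel of B lies exactly 3 pixels below a pixel of G, so each squared distance is 9 and so is the mean.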

Plane fitting in a 3d point cloud

I am trying to find planes in a 3D point cloud, using the regression formula Z = aX + bY + C.
I implemented least-squares and RANSAC solutions,
but the three-parameter equation limits the plane fitting to 2.5D: the formula cannot be applied to planes parallel to the Z-axis.
My question is: how can I generalize the plane fitting to full 3D?
I want to add the fourth parameter in order to get the full equation
aX + bY + cZ + d = 0
How can I avoid the trivial (0,0,0,0) solution?
Thanks!
The Code I'm using:
from sklearn import linear_model

def local_regression_plane_ransac(neighborhood):
    """
    Computes parameters for a local regression plane using RANSAC
    """
    XY = neighborhood[:, :2]
    Z = neighborhood[:, 2]
    ransac = linear_model.RANSACRegressor(
        linear_model.LinearRegression(),
        residual_threshold=0.1
    )
    ransac.fit(XY, Z)
    inlier_mask = ransac.inlier_mask_
    coeff = ransac.estimator_.coef_
    intercept = ransac.estimator_.intercept_
    return coeff, intercept, inlier_mask
Update
This functionality is now integrated in https://github.com/daavoo/pyntcloud and makes the plane fitting process much simpler.
Given a point cloud, you just need to add a scalar field like this:
is_floor = cloud.add_scalar_field("plane_fit")
This will add a new column with value 1 for the points of the fitted plane.
You can then visualize the scalar field.
Old answer
I think that you could easily use PCA to fit the plane to the 3D points instead of regression.
Here is a simple PCA implementation:
import numpy as np

def PCA(data, correlation=False, sort=True):
    """ Applies Principal Component Analysis to the data

    Parameters
    ----------
    data: array
        The array containing the data. The array must have NxM dimensions, where each
        of the N rows represents a different individual record and each of the M columns
        represents a different variable recorded for that individual record.
            array([
            [V11, ... , V1m],
            ...,
            [Vn1, ... , Vnm]])
    correlation(Optional) : bool
        Set the type of matrix to be computed (see Notes):
            If True compute the correlation matrix.
            If False(Default) compute the covariance matrix.
    sort(Optional) : bool
        Set the order that the eigenvalues/vectors will have
            If True(Default) they will be sorted (from higher value to less).
            If False they won't.

    Returns
    -------
    eigenvalues: (1,M) array
        The eigenvalues of the corresponding matrix.
    eigenvector: (M,M) array
        The eigenvectors of the corresponding matrix.

    Notes
    -----
    The correlation matrix is a better choice when there are different magnitudes
    representing the M variables. Use covariance matrix in other cases.
    """
    mean = np.mean(data, axis=0)
    data_adjust = data - mean
    #: the data is transposed due to np.cov/corrcoef syntax
    if correlation:
        matrix = np.corrcoef(data_adjust.T)
    else:
        matrix = np.cov(data_adjust.T)
    eigenvalues, eigenvectors = np.linalg.eig(matrix)
    if sort:
        #: sort eigenvalues and eigenvectors
        sort = eigenvalues.argsort()[::-1]
        eigenvalues = eigenvalues[sort]
        eigenvectors = eigenvectors[:,sort]
    return eigenvalues, eigenvectors
And here is how you could fit the points to a plane:
def best_fitting_plane(points, equation=False):
    """ Computes the best fitting plane of the given points

    Parameters
    ----------
    points: array
        The x,y,z coordinates corresponding to the points from which we want
        to define the best fitting plane. Expected format:
            array([
            [x1,y1,z1],
            ...,
            [xn,yn,zn]])
    equation(Optional) : bool
        Set the output plane format:
            If True return the a,b,c,d coefficients of the plane.
            If False(Default) return 1 Point and 1 Normal vector.

    Returns
    -------
    a, b, c, d : float
        The coefficients solving the plane equation.
    or
    point, normal: array
        The plane defined by 1 Point and 1 Normal vector. With format:
        array([Px,Py,Pz]), array([Nx,Ny,Nz])
    """
    w, v = PCA(points)
    #: the normal of the plane is the last eigenvector
    normal = v[:,2]
    #: get a point from the plane
    point = np.mean(points, axis=0)
    if equation:
        a, b, c = normal
        d = -(np.dot(normal, point))
        return a, b, c, d
    else:
        return point, normal
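As a quick check that this kind of fit handles planes parallel to the Z axis, here is a compact, self-contained sketch. It uses the SVD of the centred points instead of the eigendecomposition above (the two give the same normal up to sign), and the helper name `fit_plane_svd` is made up for this example. Constraining the normal (a, b, c) to unit length is what rules out the trivial (0,0,0,0) solution.

```python
import numpy as np

def fit_plane_svd(points):
    """Return (a, b, c, d) with a*x + b*y + c*z + d = 0 and (a, b, c) of unit length."""
    centroid = points.mean(axis=0)
    # the right singular vector with the smallest singular value is the
    # direction of least variance, i.e. the plane normal
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    d = -normal.dot(centroid)
    return (*normal, d)

# a plane parallel to the Z axis (x = 2), which Z = aX + bY + C cannot represent
rng = np.random.default_rng(0)
y, z = rng.uniform(-1, 1, (2, 100))
points = np.column_stack([np.full(100, 2.0), y, z])
a, b, c, d = fit_plane_svd(points)
print(a, b, c, d)
```

The sign of the normal is arbitrary, so the result is either approximately (1, 0, 0, -2) or (-1, 0, 0, 2); both describe the plane x = 2.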
However as this method is sensitive to outliers you could use RANSAC to make the fit robust to outliers.
There is a Python implementation of ransac here.
And you should only need to define a Plane Model class in order to use it for fitting planes to 3D points.
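A minimal RANSAC loop for planes, written directly in NumPy, could look like the sketch below. It is not the linked implementation: the function names, the default threshold and iteration count, and the synthetic test cloud are all assumptions chosen for illustration.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through the points: unit normal n and offset d with n.p + d = 0."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return normal, -normal.dot(centroid)

def ransac_plane(points, threshold=0.05, iterations=200, seed=0):
    """Fit a plane robustly: keep the 3-point sample with the most inliers, then refit."""
    rng = np.random.default_rng(seed)
    best_mask = None
    for _ in range(iterations):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal, d = fit_plane(sample)
        # inliers are the points within `threshold` of the candidate plane
        mask = np.abs(points @ normal + d) < threshold
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask = mask
    normal, d = fit_plane(points[best_mask])
    return normal, d, best_mask

# synthetic cloud: 200 noisy points on z = 0.5x - 0.2y + 1 plus 40 random outliers
rng = np.random.default_rng(1)
xy = rng.uniform(-1, 1, (200, 2))
z = 0.5 * xy[:, 0] - 0.2 * xy[:, 1] + 1.0 + rng.normal(0, 0.01, 200)
points = np.vstack([np.column_stack([xy, z]), rng.uniform(-3, 3, (40, 3))])
normal, d, mask = ransac_plane(points)
print(normal, d, mask.sum())
```

The final refit on all inliers uses the same least-squares fit as the PCA approach, so the outliers only influence which points are selected, not the fit itself.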
In any case, if you can clean the outliers from the 3D points (maybe you could use a KD-Tree S.O.R. filter for that), you should get pretty good results with PCA.
Here is an implementation of an S.O.R:
import numpy as np
from scipy import stats

def statistical_outilier_removal(kdtree, k=8, z_max=2):
    """ Compute a Statistical Outlier Removal filter on the given KDTree.

    Parameters
    ----------
    kdtree: scipy's KDTree instance
        The KDTree's structure which will be used to
        compute the filter.
    k(Optional): int
        The number of nearest neighbors which will be used to estimate the
        mean distance from each point to its nearest neighbors.
        Default : 8
    z_max(Optional): int
        The maximum Z score which determines if the point is an outlier or
        not.

    Returns
    -------
    sor_filter : boolean array
        The boolean mask indicating whether a point should be kept or not.
        The size of the boolean mask will be the same as the number of points
        in the KDTree.

    Notes
    -----
    The 2 optional parameters (k and z_max) should be used in order to adjust
    the filter to the desired result.
    A HIGHER 'k' value will result (normally) in a HIGHER number of points trimmed.
    A LOWER 'z_max' value will result (normally) in a HIGHER number of points trimmed.
    """
    # recent SciPy versions use `workers` for the parallelism argument (formerly `n_jobs`)
    distances, i = kdtree.query(kdtree.data, k=k, workers=-1)
    z_distances = stats.zscore(np.mean(distances, axis=1))
    sor_filter = abs(z_distances) < z_max
    return sor_filter
You could feed the function with a KDTree of your 3D points, computed for instance using this implementation.
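As a usage illustration, the sketch below builds the tree with scipy.spatial.cKDTree and applies the same z-score rule. The helper `sor_mask` is a condensed stand-in written for this example, and the test cloud with one far-away point is synthetic.

```python
import numpy as np
from scipy import stats
from scipy.spatial import cKDTree

def sor_mask(points, k=8, z_max=2.0):
    """Boolean mask that is True for the points kept by the filter."""
    tree = cKDTree(points)
    # k nearest neighbours of every point (the first one is the point itself)
    distances, _ = tree.query(points, k=k)
    z = stats.zscore(distances.mean(axis=1))
    return np.abs(z) < z_max

rng = np.random.default_rng(0)
cloud = rng.uniform(0, 1, (100, 3))
cloud = np.vstack([cloud, [[50.0, 50.0, 50.0]]])   # one obvious outlier
mask = sor_mask(cloud)
print(mask.sum(), mask[-1])
clean = cloud[mask]   # points to pass on to the plane fit
```

The isolated point has a mean neighbour distance far above the rest, so its z-score exceeds z_max and it is dropped while the dense points are kept.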
import pcl
cloud = pcl.PointCloud()
cloud.from_array(points)
seg = cloud.make_segmenter_normals(ksearch=50)
seg.set_optimize_coefficients(True)
seg.set_model_type(pcl.SACMODEL_PLANE)
seg.set_normal_distance_weight(0.05)
seg.set_method_type(pcl.SAC_RANSAC)
seg.set_max_iterations(100)
seg.set_distance_threshold(0.005)
inliers, model = seg.segment()
You need to install python-pcl first. Feel free to play with the parameters. Here, points is an n x 3 numpy array of n 3D points. model will be [a, b, c, d] such that ax + by + cz + d = 0.

theano: summation by class label

I have a matrix which represents a distances to the k-nearest neighbour of a set of points,
and there is a matrix of class labels of the nearest neighbours. (both N-by-k matrix)
What is the best way in theano to build a (N-by-#classes) matrix whose (i,j) element will be the sum of distances from i-th point to its k-NN points with the class label 'j'?
Example:
# N = 2
# k = 5
# number of classes = 3
K_val = [[1,2,3,4,6],
[2,4,5,5,7]]
l_val = [[0,1,2,0,1],
[2,0,1,2,0]]
result = [[5,8,3],
[11,5,7]]
How can I do this task in theano?
K = theano.tensor.matrix()
l = theano.tensor.matrix()
result = <..some code..>
f = theano.function(inputs=[K,l], outputs=result)
You might be interested in having a look at this repo:
https://github.com/erogol/KLP_KMEANS/blob/master/klp_kmeans.py
It is a K-Means implementation using theano (func kpl_kmeans). I believe what you want is the matrix W used in the function find_bmu.
Hope you find it useful.
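For reference, the computation itself can be expressed with a one-hot comparison followed by a sum over the neighbour axis. The sketch below is written in NumPy rather than Theano, using the example values from the question. The same broadcasting pattern should carry over to Theano tensor operations, but that translation is an assumption and is not tested here.

```python
import numpy as np

K_val = np.array([[1, 2, 3, 4, 6],
                  [2, 4, 5, 5, 7]])
l_val = np.array([[0, 1, 2, 0, 1],
                  [2, 0, 1, 2, 0]])
n_classes = 3

# one_hot[i, m, j] is True when the m-th neighbour of point i has label j
one_hot = l_val[:, :, None] == np.arange(n_classes)
# sum the distances over the neighbour axis, separately for each class
result = (K_val[:, :, None] * one_hot).sum(axis=1)
print(result.tolist())  # [[5, 8, 3], [11, 5, 7]]
```

The intermediate array has shape (N, k, #classes), so memory grows with the number of classes.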
