theano: summation by class label - python

I have a matrix which represents a distances to the k-nearest neighbour of a set of points,
and there is a matrix of class labels of the nearest neighbours. (both N-by-k matrix)
What is the best way in theano to build a (N-by-#classes) matrix whose (i,j) element will be the sum of distances from i-th point to its k-NN points with the class label 'j'?
Example:
# N = 2
# k = 5
# number of classes = 3
K_val = [[1,2,3,4,6],
[2,4,5,5,7]]
l_val = [[0,1,2,0,1],
[2,0,1,2,0]]
result = [[5,8,3],
[11,5,7]]
this task in theano?
K = theano.tensor.matrix()
l = theano.tensor.matrix()
result = <..some code..>
f = theano.function(inputs=[K,l], outputs=result)

You might be interesting in having a look to this repo:
https://github.com/erogol/KLP_KMEANS/blob/master/klp_kmeans.py
Is a K-Means implementation using theano (func kpl_kmeans). I believe what you want is the matrix W used in the function find_bmu.
Hope you find it useful.

Related

Getting nodes without edges (When the N is larger than 60)

First I generated an NxN matrix of zeros and ones using NumPy. After that, I generated a copy matrix from the previous matrix, I replaced the ones in the first matrix with the weight of the edges. ( The matrix is symmetric and connected and undirected and its diagonal is zero like the original matrix) and I used BSF to check if it's connected and I found it connected every time. Then I used SciPy to find the MST (Minimum Spanning Tree). After that, I illustrated the MST using Network X
for generating NxN Matrix of zeros and ones
base = np.zeros((shape,shape))
for _ in range(100):
a = np.random.randint(shape)
b = np.random.randint(shape)
if a != b:
base[a, b] = 1
base[b, a] = 1
for generating NxN Matrix with the weight of edges
# Fetch the location of the 1s.
Weightofedges = base
ones = np.argwhere(Weightofedges == 1)
ones = ones[ones[:, 0] < ones[:, 1], :]
# Assign random values.
for a, b in ones:
Weightofedges[a, b] = Weightofedges[b, a] = np.random.randint(100)
Find the MST using SciPy
from scipy.sparse.csgraph import minimum_spanning_tree
X = minimum_spanning_tree(Weightofedges)
print("The Output Of The MST By Kruskal Algorithm:")
print(" Edges: Weights:")
print(X)
print("-----------------------")
my_matrix3 = X.toarray().astype(int)
The Problem: When I input a matrix with a large number of nodes I got some nodes not connected with an edge
e.g.
Number Of Nodes equals 75
Number Of Edges equals 65
In the MST the edges must be N-1 where N is the number of nodes
This is the graph using N = 75 ( as shown there are nodes without edges )
enter image description here
You have created a weighted version of the Erdős–Rényi model - to be exact the ER model G(n,M) with n nodes and M edges. Currently, you have fixed M=100 and you observe for n>60 that your becomes disconnected. This is typical and (at least for the second ER model variant G(n,p) with n nodes and probability of an edge p) you can even calculate the threshold where you (almost surely) get a single/large connected component. But even without the math, you can intuitively see that it becomes difficult to connect 75 nodes with only 100 random edges.
I recommend that you check out the networkx package, if you want to do more with graphs on python. For example, the implementation of the G(n,p) variant: erdos_renyi_graph.

Should np.linalg.norm be squared when implementing k-means clustering algorithm?

The k-means clustering algorithm objective is to find:
I looked at several implementations of it in python, and in some of them the norm is not squared.
For example (taken from here):
def form_clusters(labelled_data, unlabelled_centroids):
"""
given some data and centroids for the data, allocate each
datapoint to its closest centroid. This forms clusters.
"""
# enumerate because centroids are arrays which are unhashable
centroids_indices = range(len(unlabelled_centroids))
# initialize an empty list for each centroid. The list will
# contain all the datapoints that are closer to that centroid
# than to any other. That list is the cluster of that centroid.
clusters = {c: [] for c in centroids_indices}
for (label,Xi) in labelled_data:
# for each datapoint, pick the closest centroid.
smallest_distance = float("inf")
for cj_index in centroids_indices:
cj = unlabelled_centroids[cj_index]
distance = np.linalg.norm(Xi - cj)
if distance < smallest_distance:
closest_centroid_index = cj_index
smallest_distance = distance
# allocate that datapoint to the cluster of that centroid.
clusters[closest_centroid_index].append((label,Xi))
return clusters.values()
And to give the contrary, expected, implementation (taken from here; this is just the distance calculation):
import numpy as np
from numpy.linalg import norm
def compute_distance(self, X, centroids):
distance = np.zeros((X.shape[0], self.n_clusters))
for k in range(self.n_clusters):
row_norm = norm(X - centroids[k, :], axis=1)
distance[:, k] = np.square(row_norm)
return distance
Now, I know there are several ways to calculate the norm\distance, but I looked only at implementations that used np.linalg.norm with ord=None or ord=2, and as I said, in some of them the norm is not squared, yet they cluster correctly.
Why?
By experience, to use the norm or the squared norm as the objective function of an optimization algorithm yields to similar results. The minimum value of the objetive function will change, but the parameters obtained will be the same. I always guessed that the inner product generates a quadratic function and the root of that product only changed the magnitude but not the objetive function topology. A more detailed answer can be found in here. https://math.stackexchange.com/questions/2253443/difference-between-least-squares-and-minimum-norm-solution
Hope it helps.

Sorting a matrix by similarity

I have 100 matrices in which each row corresponds to an individual and column refers to sites. I want to sort the row by a measure of similarity such that the most similar individuals are next to each other in a matrix. I used k-nearest neighbours to sort the matrices by rows and I give these sorted matrices to a convolutional neural network. I want to know if there are other measures by which I can achieve the task in hand. The code I use for k-nearest neighbour is:
def sort_min_diff(amat):
mb = NearestNeighbors(len(amat), metric='manhattan').fit(amat)
v = mb.kneighbors(amat)
smallest = np.argmin(v[0].sum(axis=1))
return amat[v[1][smallest]]
X_snp = np.array(snp_matrix)
q = []
for i in range(len(X_snp)):
q.append((sort_min_diff(X_snp[i])))
q = np.array(q)
My X_snp matrix is of shape (100,60,4500) that is I have 100 such matrices. Also, my matrices are filled with 0 and 1.
Suggestions would be appreciated.

How can I improve my Shared Nearest Neighbor clustering algorithm?

I wrote my own Shared Nearest Neighbor(SNN) clustering algorithm, according to the original paper. Essentially, I get the nearest neighbors for each data point, precompute the distance matrix with Jaccard distance, and pass the distance matrix to DBSCAN.
To accelerate the algorithm, I only compute the Jaccard distance between two data points if they are nearest neighbors of each other and have over a certain number of shared neighbors. I also take advantage of the symmetry of the distance matrix, as I only compute half the matrix.
However, my algorithm is slow and takes much longer than common clustering algorithms, such as K-Means or DBSCAN. Can someone look at my codes and suggest how I can improve my codes and make the algorithm faster?
def jaccard(a,b):
"""
Computes the Jaccard distance between two arrays.
Parameters
----------
a: an array.
b: an array.
"""
A = np.array(a, dtype='int')
B = np.array(b, dtype='int')
A = A[np.where(A > -1)[0]]
B = B[np.where(B > -1)[0]]
union = np.union1d(A,B)
intersection = np.intersect1d(A,B)
return 1.0 - len(intersection)*1.0 / len(union)
def iterator_dist(indices, k_min=5):
"""
An iterator that computes the Jaccard distance for any pair of stars.
Parameters:
indices: the indices of nearest neighbors in the chemistry-velocity
space.
"""
for n in range(len(indices)):
for m in indices[n][indices[n] > n]:
if len(np.intersect1d(indices[n], indices[m])) > k_min:
dist = jaccard(indices[n], indices[m])
yield (n, m, dist)
# load data here
data =
# hyperparameters
n_neighbors =
eps =
min_samples =
k_min =
# K Nearest Neighbors
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
distances, indices = nbrs.kneighbors()
# distance matrix
S = lil_matrix((len(distances), len(distances)))
for (n, m, dist) in iterator_dist(indices, k_min):
S[n,m] = dist
S[m,n] = dist
db = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed',
n_jobs=-1).fit(S)
labels = db.labels_
Writing fast python code is hard. The key is to avoid python wherever possible, and instead either use BLAS routines via numpy or, e.g., cython that is compiled code not interpreted. So at some point you'll need to switch from "real" python at least to typed cython code. Unless you can find a library that already implemented these operations low level enough for you.
But the obvious first step to do is to run a profiler to identify slow operations!
Secondly, consider avoiding a distance matrix. Anything involving a distance matrix tends to scale with O(n²) unless done very carefully. That is of course much slower than k-means and Euclidean DBSCAN.

Plane fitting in a 3d point cloud

I am trying to find planes in a 3d point cloud, using the regression formula Z= aX + bY +C
I implemented least squares and ransac solutions,
but the 3 parameters equation limits the plane fitting to 2.5D- the formula can not be applied on planes parallel to the Z-axis.
My question is how can I generalize the plane fitting to full 3d?
I want to add the fourth parameter in order to get the full equation
aX +bY +c*Z + d
how can I avoid the trivial (0,0,0,0) solution?
Thanks!
The Code I'm using:
from sklearn import linear_model
def local_regression_plane_ransac(neighborhood):
"""
Computes parameters for a local regression plane using RANSAC
"""
XY = neighborhood[:,:2]
Z = neighborhood[:,2]
ransac = linear_model.RANSACRegressor(
linear_model.LinearRegression(),
residual_threshold=0.1
)
ransac.fit(XY, Z)
inlier_mask = ransac.inlier_mask_
coeff = model_ransac.estimator_.coef_
intercept = model_ransac.estimator_.intercept_
Update
This functionality is now integrated in https://github.com/daavoo/pyntcloud and makes the plane fitting process much simplier:
Given a point cloud:
You just need to add a scalar field like this:
is_floor = cloud.add_scalar_field("plane_fit")
Wich will add a new column with value 1 for the points of the plane fitted.
You can visualize the scalar field:
Old answer
I think that you could easily use PCA to fit the plane to the 3D points instead of regression.
Here is a simple PCA implementation:
def PCA(data, correlation = False, sort = True):
""" Applies Principal Component Analysis to the data
Parameters
----------
data: array
The array containing the data. The array must have NxM dimensions, where each
of the N rows represents a different individual record and each of the M columns
represents a different variable recorded for that individual record.
array([
[V11, ... , V1m],
...,
[Vn1, ... , Vnm]])
correlation(Optional) : bool
Set the type of matrix to be computed (see Notes):
If True compute the correlation matrix.
If False(Default) compute the covariance matrix.
sort(Optional) : bool
Set the order that the eigenvalues/vectors will have
If True(Default) they will be sorted (from higher value to less).
If False they won't.
Returns
-------
eigenvalues: (1,M) array
The eigenvalues of the corresponding matrix.
eigenvector: (M,M) array
The eigenvectors of the corresponding matrix.
Notes
-----
The correlation matrix is a better choice when there are different magnitudes
representing the M variables. Use covariance matrix in other cases.
"""
mean = np.mean(data, axis=0)
data_adjust = data - mean
#: the data is transposed due to np.cov/corrcoef syntax
if correlation:
matrix = np.corrcoef(data_adjust.T)
else:
matrix = np.cov(data_adjust.T)
eigenvalues, eigenvectors = np.linalg.eig(matrix)
if sort:
#: sort eigenvalues and eigenvectors
sort = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[sort]
eigenvectors = eigenvectors[:,sort]
return eigenvalues, eigenvectors
And here is how you could fit the points to a plane:
def best_fitting_plane(points, equation=False):
""" Computes the best fitting plane of the given points
Parameters
----------
points: array
The x,y,z coordinates corresponding to the points from which we want
to define the best fitting plane. Expected format:
array([
[x1,y1,z1],
...,
[xn,yn,zn]])
equation(Optional) : bool
Set the oputput plane format:
If True return the a,b,c,d coefficients of the plane.
If False(Default) return 1 Point and 1 Normal vector.
Returns
-------
a, b, c, d : float
The coefficients solving the plane equation.
or
point, normal: array
The plane defined by 1 Point and 1 Normal vector. With format:
array([Px,Py,Pz]), array([Nx,Ny,Nz])
"""
w, v = PCA(points)
#: the normal of the plane is the last eigenvector
normal = v[:,2]
#: get a point from the plane
point = np.mean(points, axis=0)
if equation:
a, b, c = normal
d = -(np.dot(normal, point))
return a, b, c, d
else:
return point, normal
However as this method is sensitive to outliers you could use RANSAC to make the fit robust to outliers.
There is a Python implementation of ransac here.
And you should only need to define a Plane Model class in order to use it for fitting planes to 3D points.
In any case if you can clean the 3D points from outliers (maybe you could use a KD-Tree S.O.R filter to that) you should get pretty good results with PCA.
Here is an implementation of an S.O.R:
def statistical_outilier_removal(kdtree, k=8, z_max=2 ):
""" Compute a Statistical Outlier Removal filter on the given KDTree.
Parameters
----------
kdtree: scipy's KDTree instance
The KDTree's structure which will be used to
compute the filter.
k(Optional): int
The number of nearest neighbors wich will be used to estimate the
mean distance from each point to his nearest neighbors.
Default : 8
z_max(Optional): int
The maximum Z score wich determines if the point is an outlier or
not.
Returns
-------
sor_filter : boolean array
The boolean mask indicating wherever a point should be keeped or not.
The size of the boolean mask will be the same as the number of points
in the KDTree.
Notes
-----
The 2 optional parameters (k and z_max) should be used in order to adjust
the filter to the desired result.
A HIGHER 'k' value will result(normally) in a HIGHER number of points trimmed.
A LOWER 'z_max' value will result(normally) in a HIGHER number of points trimmed.
"""
distances, i = kdtree.query(kdtree.data, k=k, n_jobs=-1)
z_distances = stats.zscore(np.mean(distances, axis=1))
sor_filter = abs(z_distances) < z_max
return sor_filter
You could feed the function with a KDtree of your 3D points computed maybe using this implementation
import pcl
cloud = pcl.PointCloud()
cloud.from_array(points)
seg = cloud.make_segmenter_normals(ksearch=50)
seg.set_optimize_coefficients(True)
seg.set_model_type(pcl.SACMODEL_PLANE)
seg.set_normal_distance_weight(0.05)
seg.set_method_type(pcl.SAC_RANSAC)
seg.set_max_iterations(100)
seg.set_distance_threshold(0.005)
inliers, model = seg.segment()
you need to install python-pcl first. Feel free to play with the parameters. points here is a nx3 numpy array with n 3d points. Model will be [a, b, c, d] such that ax + by + cz + d = 0

Categories

Resources