Sorting a matrix by similarity - python

I have 100 matrices in which each row corresponds to an individual and each column to a site. I want to sort the rows by a measure of similarity so that the most similar individuals end up next to each other in each matrix. I currently use k-nearest neighbours to sort the rows and feed the sorted matrices to a convolutional neural network. I would like to know whether there are other measures with which I could achieve the task at hand. The code I use for k-nearest neighbours is:
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sort_min_diff(amat):
    # fit on the rows of amat, using all rows as neighbours (Manhattan distance)
    mb = NearestNeighbors(n_neighbors=len(amat), metric='manhattan').fit(amat)
    dists, indices = mb.kneighbors(amat)
    # pick the row with the smallest summed distance to all other rows
    smallest = np.argmin(dists.sum(axis=1))
    # reorder the rows according to that row's neighbour ranking
    return amat[indices[smallest]]
X_snp = np.array(snp_matrix)
q = []
for i in range(len(X_snp)):
    q.append(sort_min_diff(X_snp[i]))
q = np.array(q)
My X_snp array has shape (100, 60, 4500); that is, I have 100 such matrices. The matrices contain only 0s and 1s.
Suggestions would be appreciated.
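One alternative worth considering (a sketch, not from the original post) is to order the rows by hierarchical clustering of a pairwise distance matrix. Since the matrices are binary, Hamming distance is a natural choice, and scipy's optimal leaf ordering places similar rows next to each other:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, leaves_list, optimal_leaf_ordering

def sort_by_linkage(amat):
    # pairwise Hamming distances between rows (fraction of differing sites)
    d = pdist(amat, metric='hamming')
    # average-linkage tree, with leaves reordered so adjacent rows are similar
    Z = optimal_leaf_ordering(linkage(d, method='average'), d)
    return amat[leaves_list(Z)]

q = np.array([sort_by_linkage(m) for m in X_snp])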

Related

locality sensitive hashing: Python lshash library to get index in a matrix

I need to find the distance between a 3D float array (the query) and a very large matrix of size 1M x 3, and the lookup has to run 10 times per second, so I need LSH or a related technique. For a given query q of size 1 x 3 (or 3 x 1), I have to find the index of the nearest neighbour in the matrix D of size 1M x 3; I only need that index.
I used lshash to find the nearest neighbour (NN) very quickly. However, I cannot recover the index, since lshash returns the distance and the corresponding vector, not its position in D.
Here is my code.
from lshash import LSHash

lsh = LSHash(16, 3)   # 16-bit hashes, 3-dimensional input points
for a in D:
    lsh.index(a)
To search for the NN of q, the simple way is:
a = lsh.query(q, distance_func='euclidean')
# or
a = lsh.query(q, num_results=1, distance_func='euclidean')
But I want the index in D of the row that is nearest to q.
Any suggestions?
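One possible way to get the index (a sketch, not from the original post): store each row's position alongside the point when indexing, using the extra_data argument that the lshash index method accepts in the versions I have used; query() then returns that value with the match. If your version does not support extra_data, you can match the returned vector back to D with NumPy instead.

import numpy as np
from lshash import LSHash

lsh = LSHash(16, 3)
for i, a in enumerate(D):
    lsh.index(a, extra_data=i)     # store the row index with the point

res = lsh.query(q, num_results=1, distance_func='euclidean')
# with extra_data, each result looks like ((point, extra_data), distance);
# the exact layout can differ between lshash versions, so inspect res first
(point, idx), dist = res[0]

# fallback: match the returned vector against the rows of D
idx_fallback = np.where((np.asarray(D) == np.asarray(point)).all(axis=1))[0][0]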

Using np.cov() on a centered matrix not equivalent to matrix multiplication between the array and its transpose

I'm trying to get the eigenvectors and eigenvalues for the MNIST dataset.
I'm testing out a concept on this dataset so I can carry it over to a different dataset.
I have a matrix M where the rows are the images and the columns are the pixel values.
I'm trying to do the above in two ways (taken from https://mml-book.github.io/book/mml-book.pdf, chapter 10, section 1 and section 5):
M is of shape 500 rows x 784 columns
First, I'm using the following code:
V = np.cov(M.T)
and then using:
V2 = np.dot(M.T, M) / 783
According to NumPy's guide on cov(), it seems like with a single matrix given the results of both should be identical, but they're not: https://numpy.org/doc/stable/reference/generated/numpy.cov.html
Sorry if the question is simple and there's an obvious answer.
EDIT:
If I take the eigenvector with the largest eigenvalue from both methods and scale it so that the lowest entry is zero and the highest is 255, I get the same vector. What am I missing here?
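For reference, here is a quick sketch (with a random stand-in for M, not the MNIST data) of what np.cov computes: it centres each pixel column over the 500 images and divides by n_samples - 1 = 499, not by 783, which is one place where the two expressions can diverge:

import numpy as np

# random stand-in for the 500 x 784 matrix M
M = np.random.rand(500, 784)
Mc = M - M.mean(axis=0)               # centre each pixel column over the images

V = np.cov(M.T)                       # 784 x 784; np.cov centres internally
V2 = Mc.T @ Mc / (M.shape[0] - 1)     # divide by n_samples - 1 = 499, not 783

print(np.allclose(V, V2))             # True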

Computing Nearest neighbor graph using sklearn?

This question is about creating a K-nearest neighbor graph [KNNG] from a dataset with an unknown number of centroids (which is not the same as K-means clustering).
Suppose that you have a dataset of observations stored in a data matrix X[n_samples, n_features] with each row being an observation or feature vector and each column being a feature. Now suppose you want to compute the (weighted) k-Neighbors graph for points in X using sklearn.neighbors.kneighbors_graph.
What are the basic methods to pick the number of neighbors to use for each sample? What algorithms scale well when you have lots of observations?
I have seen the brute-force method below, but it does not scale well when the dataset becomes large, and you have to pick a good starting upper bound for n_neighbors_max. Does this algorithm have a name?
import numpy
import sklearn.metrics

def autoselect_K(X, n_neighbors_max, threshold):
    # get the pairwise euclidean distance between every pair of observations
    D = sklearn.metrics.pairwise.euclidean_distances(X, X)
    chosen_k = n_neighbors_max
    for k in range(2, n_neighbors_max):
        k_avg = []
        # loop over each row in the distance matrix
        for row in D:
            # sort the row from smallest distance to largest distance
            sorted_row = numpy.sort(row)
            # mean of the k smallest distances (the first is the zero self-distance)
            k_avg.append(numpy.mean(sorted_row[0:k]))
        # find the median of the per-row averages
        kmedian_dist = numpy.median(k_avg)
        if kmedian_dist >= threshold:
            chosen_k = k
            break
    # return the number of nearest neighbors to use
    return chosen_k
From your code, it appears that you are looking for a classification result based on the nearest neighbours.
In that case your search over the full distance matrix is a brute-force search and defeats the purpose of nearest-neighbour algorithms.
Perhaps what you are looking for is KNeighborsClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Regarding the choice of the number of nearest neighbours, this depends on the sparsity of your data. It helps to view nearest-neighbour search as a way to bound your search: rather than looking over all samples, it lets you narrow the search down to the top-N (nearest-neighbour) samples, after which you can apply a domain-specific technique to those N samples to get the desired result.
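For illustration, a short sketch with made-up data (not from the original post) of both routes: building the weighted k-neighbors graph directly with kneighbors_graph, and using NearestNeighbors to narrow each sample down to its top-N candidates:

import numpy as np
from sklearn.neighbors import NearestNeighbors, kneighbors_graph

X = np.random.rand(1000, 8)          # stand-in for X[n_samples, n_features]

# weighted k-neighbors graph: sparse (1000 x 1000), edge weights are distances
G = kneighbors_graph(X, n_neighbors=10, mode='distance')

# or: narrow the search to each sample's top-10 neighbours and
# post-process those candidates with a domain-specific rule
nbrs = NearestNeighbors(n_neighbors=10).fit(X)
distances, indices = nbrs.kneighbors(X)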

PCA -- Calculating Reduced Size Matrix With Numpy

I am trying to use PCA to reduce the size of an input image from 4096 x 4096 to 4096 x 163 while keeping its important attributes. However, there is something off with my method, as I get incorrect results; I believe the problem is in how I construct my matrix U. My results vs. the correct results are listed below.
Start code:
# Reshape data to 4096 x 163
X_reshape = np.transpose(X_all, (1,2,3,0)).reshape(-1, X_all.shape[0])
X = X_reshape[:, :163]
mean_array = np.mean(X, axis = 1)
X_tilde = np.reshape(mean_array, (4096,1))
X_tilde = X - X_tilde
# Construct the covariance matrix for computing u'_i
covmat = np.cov(X_tilde.T)
# Compute u'_i, which is stored in the variable v
w, v = np.linalg.eig(covmat)
# Compute u_i from u'_i, and store it in the variable U
U = np.dot(X_tilde, v)
# Normalize u_i, i.e., each column of U
U = U / np.linalg.norm(U)
My results:
PC1 explains 0.08589754681312775% of the total variance
PC2 explains 0.07613195994874819% of the total variance
First 100 PCs explains 0.943423133356313% of the total variance
Shape of U: (4096, 163)
First 5 elements of first column of U: [-0.00908046 -0.00905446 -0.00887831 -0.00879843 -0.00850907]
First 5 elements of last column of U: [0.00047628 0.00048451 0.00045043 0.00035762 0.00025785]
Expected results:
PC1 explains 14.32% of the total variance
PC2 explains 7.08% of the total variance
First 100 PCs explains 94.84% of the total variance
Shape of U: (4096, 163)
First 5 elements of first column of U: [0.03381537 0.03353881 0.03292298 0.03238798 0.03146345]
First 5 elements of last column of U: [-0.00672667 -0.00496044 -0.00672151 -0.00759426 -0.00543667]
There must be something off with my calculations, I just can't figure out what. Let me know if you need additional information.
Proof I am using: (derivation image not reproduced here)
It looks to me like you have the steps out of order. You're dropping dimensions from the input before you calculate the eigenvectors and eigenvalues, so you're effectively randomly dropping a bunch of input at this stage with no justification.
# Reshape data to 4096 x 163
X_reshape = np.transpose(X_all, (1,2,3,0)).reshape(-1, X_all.shape[0])
X = X_reshape[:, :163]
I don't quite follow what the intent is behind the call to transpose above, but I don't think it matters. You can only drop dimensions from the input after calculating the eigenvectors and eigenvalues of the covariance matrix. And you don't drop dimensions from the data explicitly; you truncate the matrix of eigenvectors and then use that reduced eigenvector matrix for the projection step.
The covariance matrix in this case should be a 4096 x 4096 matrix. Note that NumPy does not guarantee any particular ordering of the eigenvalues, so sort them (and the corresponding eigenvectors) so that the largest eigenvalue comes first. You can then truncate to the leading 163 eigenvectors and create the dimension-reduced projection.
It's possible that I've misunderstood something about the assignment, but I am pretty sure this is the problem. I'm reluctant to say more since it's homework.
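To make the order of operations concrete, here is a minimal sketch under those assumptions, using random data in place of X_all and np.linalg.eigh (since the covariance matrix is symmetric) instead of eig; it illustrates the procedure described above rather than the assignment's reference solution:

import numpy as np

# hypothetical stand-in: one 4096-pixel image per column
X = np.random.rand(4096, 500)

# 1. centre the data
X_tilde = X - X.mean(axis=1, keepdims=True)

# 2. covariance over pixels: 4096 x 4096
covmat = np.cov(X_tilde)

# 3. eigendecomposition; eigh returns eigenvalues in ascending order,
#    so reverse to put the largest first
w, v = np.linalg.eigh(covmat)
order = np.argsort(w)[::-1]
w, v = w[order], v[:, order]

# 4. only now truncate: keep the leading 163 eigenvectors ...
U = v[:, :163]

# 5. ... and project to get the reduced 163 x 500 representation
X_reduced = U.T @ X_tilde

The fractions of explained variance (the PC1/PC2 figures above) are then given by w / w.sum().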

Python while loop execution speed up [duplicate]

I am trying to implement a very simple greedy clustering algorithm in Python, but am hard-pressed to optimize it for speed. The algorithm takes a distance matrix, finds the column with the most entries less than a predetermined distance cutoff, and stores the row indices (of the entries below the cutoff) as the members of the cluster. The centroid of the cluster is the column index. The columns and rows of each member index are then removed from the distance matrix (resulting in a smaller, but still square, matrix), and the algorithm iterates through successively smaller distance matrices until all clusters are found.

Because each iteration depends on the last (a new distance matrix is formed so that there are no overlapping members between clusters), I think I cannot avoid a slow for loop in Python. I've tried numba (jit) to speed it up, but I think it is falling back to object mode and so does not yield any speed gains.

Here are two implementations of the algorithm; the first is slower than the second. Any suggestions for speedups are most welcome. I am aware of other clustering algorithms implemented in scipy or sklearn (such as DBSCAN and k-means/medoids), but am very keen to use the current one for my application. Thanks in advance for any suggestions.
Method 1 (slower):
import numpy as np

def cluster(distance_matrix, cutoff=1):
    indices = np.arange(0, len(distance_matrix))
    boolean_distance_matrix = distance_matrix <= cutoff
    centroids = []
    members = []
    while boolean_distance_matrix.any():
        # the column with the most within-cutoff entries becomes the centroid
        centroid = np.argmax(np.sum(boolean_distance_matrix, axis=0))
        mem_indices = boolean_distance_matrix[:, centroid]
        mems = indices[mem_indices]
        # remove the clustered rows/columns from further consideration
        boolean_distance_matrix[mems, :] = False
        boolean_distance_matrix[:, mems] = False
        centroids.append(centroid)
        members.append(mems)
    return members, centroids
Method 2 (faster, but still slow for large matrices):
It takes as input an adjacency (sparse) matrix formed from sklearn's nearest neighbors implementation. This is the simplest and fastest way I could think of to get the relevant distance matrix for clustering. I believe working with the sparse matrix also speeds up the clustering algorithm.
import numpy as np
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(metric='euclidean', radius=1.5, algorithm='kd_tree')
nbrs.fit(data)
adjacency_matrix = nbrs.radius_neighbors_graph(data)

def cluster(adjacency_matrix, gt=1):
    rows, cols = adjacency_matrix.nonzero()
    members = []
    member = np.ones(gt + 1)          # dummy so the loop runs at least once
    centroids = []
    appendc = centroids.append
    appendm = members.append
    while len(member) > gt:
        # the column with the most neighbours becomes the next centroid
        un, coun = np.unique(cols, return_counts=True)
        centroid = un[np.argmax(coun)]
        appendc(centroid)
        member = rows[cols == centroid]
        appendm(member)
        # drop all edges that touch the newly clustered rows
        keep = np.in1d(rows, member, invert=True)
        cols = cols[keep]
        rows = rows[keep]
    return members, centroids
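Since the question asks for speed-ups, here is a sketch of two micro-optimisations (not a benchmarked drop-in replacement): counting column occurrences with np.bincount instead of sorting via np.unique on every pass, and computing the keep-mask only once per iteration:

import numpy as np

def cluster_bincount(adjacency_matrix, gt=1):
    # same greedy scheme as Method 2, but column counts come from
    # np.bincount and the keep-mask is computed once per iteration
    rows, cols = adjacency_matrix.nonzero()
    n = adjacency_matrix.shape[0]
    members, centroids = [], []
    member = np.ones(gt + 1)                     # dummy so the loop starts
    while len(member) > gt:
        counts = np.bincount(cols, minlength=n)  # neighbours per column
        centroid = int(np.argmax(counts))
        member = rows[cols == centroid]
        centroids.append(centroid)
        members.append(member)
        keep = ~np.isin(rows, member)            # drop clustered rows once
        rows, cols = rows[keep], cols[keep]
    return members, centroids

np.bincount avoids re-sorting cols on every iteration, which is where np.unique spends most of its time on large edge lists.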
