Creating adjacency matrix from multiple clustering results of various sizes - python

I have collected outputs from several clustering algorithms on the same data set and would like to generate an adjacency matrix indicating in how many of the runs any two samples were clustered together. That is, for each of the I = 10 clusterings I have a one-hot representation of shape N x C_i indicating whether sample n belongs to cluster c in the i'th run, with a possibly different number of clusters for each run. The goal is then to build an N x N adjacency matrix that I can threshold to select only consistent clusters for further analysis.
It is quite easy to build an algorithm that does this:
import itertools

import numpy as np
import scipy.sparse

n_samples = runs[0].shape[0]
i = []
j = []
for iter_no, ca in enumerate(runs):
    print("Processing adjacency", iter_no)
    for col in range(ca.shape[1]):
        comb = itertools.combinations(np.where(ca[:, col])[0], 2)
        for c in comb:
            i.append(c[0])
            j.append(c[1])
i = np.array(i)
j = np.array(j)
adj_mat = scipy.sparse.coo_matrix((np.ones(len(i)), (i, j)), shape=[n_samples, n_samples])
This scales very poorly with cluster size, and I typically have N = 15000 with cluster sizes occasionally reaching 12k. Hence, I'm looking at the networkx library to possibly speed this up. Is there any obvious way to do this?
UPDATE: Solution found (see answer).

Linear algebra to the rescue: for each run, multiplying the one-hot matrix by its transpose gives an N x N co-membership indicator, and summing those over runs gives the counts.
import numpy as np
import scipy.sparse


def cooccurrence(runs):
    assert len(runs) > 0
    N = runs[0].shape[0]
    R = len(runs)
    # Iteratively populate the output matrix (dense).
    # int8 is fine here since the number of runs is small (R = 10).
    S = np.zeros((N, N), dtype=np.int8)
    for i, scan in enumerate(runs):
        print("Adjacency", i)
        S += np.matmul(scan, scan.T).astype(np.int8)
    # Convert to sparse and return
    return scipy.sparse.csr_matrix(S)
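A quick usage sketch showing how to threshold the result for consistent pairs, tying it back to the goal in the question (the fake data, the fixed cluster count of 5, and the threshold of 8 out of 10 runs are made up purely for illustration):
import numpy as np

rng = np.random.default_rng(0)
N, n_runs = 100, 10
runs = []
for _ in range(n_runs):
    labels = rng.integers(0, 5, size=N)            # fake cluster assignments
    runs.append(np.eye(5, dtype=np.int8)[labels])  # one-hot, shape (N, 5)

S = cooccurrence(runs)
# Keep only pairs clustered together in at least 8 of the 10 runs.
# Note the diagonal always qualifies, since each sample is "with itself" in every run.
consistent = S >= 8
print(consistent.nnz, "entries at or above the threshold")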

Related

Getting nodes without edges (when N is larger than 60)

First I generated an NxN matrix of zeros and ones using NumPy. After that, I made a copy of that matrix and replaced the ones with edge weights (the matrix is symmetric, connected, and undirected, and its diagonal is zero, like the original matrix). I used BFS to check whether it is connected, and it was connected every time. Then I used SciPy to find the MST (minimum spanning tree). After that, I plotted the MST using NetworkX.
For generating the NxN matrix of zeros and ones:
import numpy as np

shape = 75  # number of nodes (75 in the example below)
base = np.zeros((shape, shape))
for _ in range(100):
    a = np.random.randint(shape)
    b = np.random.randint(shape)
    if a != b:
        base[a, b] = 1
        base[b, a] = 1
For generating the NxN matrix with the edge weights:
# Fetch the locations of the 1s.
Weightofedges = base
ones = np.argwhere(Weightofedges == 1)
ones = ones[ones[:, 0] < ones[:, 1], :]
# Assign random weights, keeping the matrix symmetric.
for a, b in ones:
    Weightofedges[a, b] = Weightofedges[b, a] = np.random.randint(100)
For finding the MST using SciPy:
from scipy.sparse.csgraph import minimum_spanning_tree
X = minimum_spanning_tree(Weightofedges)
print("The Output Of The MST By Kruskal Algorithm:")
print(" Edges: Weights:")
print(X)
print("-----------------------")
my_matrix3 = X.toarray().astype(int)
The problem: when I input a matrix with a large number of nodes, I get some nodes that are not connected by any edge. For example:
Number of nodes: 75
Number of edges: 65
In an MST the number of edges must be N-1, where N is the number of nodes. This is the graph for N = 75 (as shown, there are nodes without edges).
You have created a weighted version of the Erdős–Rényi model, to be exact the ER model G(n, M) with n nodes and M edges. Currently you have fixed M = 100, and you observe for n > 60 that your graph becomes disconnected. This is typical, and (at least for the second ER model variant G(n, p), with n nodes and edge probability p) you can even calculate the threshold at which you almost surely get a single/large connected component: for G(n, p) the graph is almost surely connected once p is above roughly ln(n)/n. But even without the math, you can intuitively see that it becomes difficult to connect 75 nodes with only 100 random edges.
I recommend that you check out the networkx package if you want to do more with graphs in Python, for example its implementation of the G(n, p) variant: erdos_renyi_graph.
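A small illustration of both variants with networkx (a sketch; the numbers simply mirror the example above):
import math

import networkx as nx

n, M = 75, 100
G_nm = nx.gnm_random_graph(n, M)      # G(n, M): 75 nodes, 100 random edges
print("G(n, M) connected:", nx.is_connected(G_nm))

# For G(n, p), picking p a bit above ln(n)/n makes connectivity very likely.
p = 1.2 * math.log(n) / n
G_np = nx.erdos_renyi_graph(n, p)
print("G(n, p) connected:", nx.is_connected(G_np))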

How to generate a rank 5 matrix with uniform entries?

I want to generate a rank 5 100x600 matrix in numpy with all the entries sampled from np.random.uniform(0, 20), so that all the entries are uniformly distributed on [0, 20). What would be the best way to do this in Python?
I see there is an SVD-inspired way to do so here (https://math.stackexchange.com/questions/3567510/how-to-generate-a-rank-r-matrix-with-entries-uniform), but I am not sure how to code it up. I am looking for a working example of this SVD-inspired way to get uniformly distributed entries.
I have actually managed to code up a rank 5 100x100 matrix by vertically stacking five 20x100 rank 1 matrices, then shuffling the vertical indices. However, the resulting 100x100 matrix does not have uniformly distributed entries in [0, 20).
Here is my code (my best attempt):
import numpy as np

def randomMatrix(m, n, p, q):
    # Creates an m x n matrix with lower bound p and upper bound q, randomly.
    count = np.random.uniform(p, q, size=(m, n))
    return count

Qs = []
my_rank = 5
for i in range(my_rank):
    L = randomMatrix(20, 1, 0, np.sqrt(20))   # L is tall
    R = randomMatrix(1, 100, 0, np.sqrt(20))  # R is long
    Q = np.outer(L, R)
    Qs.append(Q)
Q = np.vstack(Qs)
# Shuffle the rows (preserves rank 5 [confirmed])
np.random.shuffle(Q)
Not a perfect solution, I must admit, but it's simple and comes pretty close.
I create 5 vectors that will span the space of the matrix and create random linear combinations of them to fill the rest of the matrix.
My initial thought was that a trivial solution would be to copy those vectors 20 times.
To improve on that, I created linear combinations of them with weights drawn from a uniform distribution, but then the distribution of the entries in the matrix becomes normal, because the weighted averaging basically causes the central limit theorem to take effect.
A middle ground between the trivial approach and the second approach that doesn't work is to use sets of weights that each favor one of the vectors over the others. You can generate these sorts of weight vectors by passing any random vector through the softmax function with an appropriately high temperature parameter.
The resulting distribution is almost uniform, but the vectors are still very close to the base vectors. You can play with the temperature parameter to find a sweet spot that suits your purpose.
from scipy.special import softmax
import numpy as np
from matplotlib import pyplot as plt

N = 100
R = 5
low = 0
high = 20
sm_temperature = 100

# R base vectors with uniform entries.
p = np.random.uniform(low, high, (1, R, N))
# One softmax weight vector per remaining row; the high temperature makes
# each weight vector strongly favor a single base vector.
weights = np.random.uniform(0, 1, (N - R, R, 1))
weights = softmax(weights * sm_temperature, axis=1)
p_lc = (weights * p).sum(1)
rand_mat = np.concatenate([p[0], p_lc])
plt.hist(rand_mat.flatten())
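As a quick sanity check, np.linalg.matrix_rank(rand_mat) should come out as R (5 here), and the histogram should look close to flat.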
I just couldn't accept the fact that my previous solution (the "selection" method) did not really produce strictly uniformly distributed entries, but only ones close enough to fool a statistical test sometimes; asymptotically, it will almost surely not be distributed uniformly. So I dreamed up another crazy idea that's just as bad, but in a different way: it's not really random.
In this solution, I do something similar to the OP's method of forming R matrices of rank 1 and then concatenating them, but a little differently. I create each matrix by stacking a base vector on top of itself multiplied by 0.5, and then I stack those on the same base vector shifted by half the dynamic range of the uniform distribution. The process continues with multiplication by one third and two thirds and then shifting, and so on, until I have the required number of vectors in that part of the matrix.
I know it sounds incomprehensible, but unfortunately I couldn't find a way to explain it better. Hopefully reading the code will shed some more light.
I hope this "staircase" method will be more reliable and useful.
import numpy as np
from matplotlib import pyplot as plt
'''
params:
N - base dimension
M - matrix length
R - matrix rank
high - max value of matrix
low - min value of the matrix
'''
N = 100
M = 600
R = 5
high = 20
low = 0
# base vectors of the matrix
base = low+np.random.rand(R-1, N)*(high-low)
def build_staircase(base, num_stairs, low, high):
    '''
    Create 'num_stairs' different vectors (a rank-2 set) whose elements are
    all uniformly distributed like the values of 'base'.
    '''
    l = levels(num_stairs)
    vectors = []
    for l_i in l:
        for i in range(l_i):
            vector_dynamic = (base - low) / l_i
            vector_bias = low + np.ones_like(base) * i * ((high - low) / l_i)
            vectors.append(vector_dynamic + vector_bias)
    return np.array(vectors)
def levels(total):
    '''
    Create a sequence of strictly increasing numbers summing up to the total.
    '''
    l = []
    sum_l = 0
    i = 1
    while sum_l < total:
        l.append(i)
        i += 1
        sum_l = sum(l)
    i = 0
    while sum_l > total:
        l[i] -= 1
        if l[i] == 0:
            l.pop(i)
        else:
            i += 1
        if i == len(l):
            i = 0
        sum_l = sum(l)
    return l
n_rm = R - 1  # number of matrix subsections
m_rm = M // n_rm
len_rms = [M // n_rm for i in range(n_rm)]
len_rms[-1] += M % n_rm
rm_list = []
for i, len_rm in enumerate(len_rms):
    # Create a matrix with uniform entries and rank 2
    # out of the vector 'base[i]' and a ones vector.
    rm_list.append(build_staircase(
        base=base[i],
        num_stairs=len_rms[i],
        low=low,
        high=high,
    ))
rm = np.concatenate(rm_list)
plt.hist(rm.flatten(), bins = 100)
[Example histograms omitted, including a run with N = 1000, M = 6000 that empirically demonstrates the nearly asymptotic behavior.]
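A quick sanity check with plain numpy, continuing the script above:
print(rm.shape)                           # (600, 100), i.e. M x N
print(np.linalg.matrix_rank(rm))          # expect R = 5
print(rm.min() >= low, rm.max() < high)   # entries stay within [low, high)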

Why does my k-means convergence condition give different results than sklearn?

I've written a function that executes k-means clustering with data, K, and n (number of iterations) as inputs. I can set n=100 so that the function calculates Euclidean distance, assigns clusters and calculates new cluster centroids 100 times over before completing. When comparing the results of this with sklearn's KMeans (using the same data and K) I get the same centroids.
I would like my function to run only until convergence (so as to avoid unnecessary iterations), so it no longer needs number of iterations as an input. To implement this I've used a while loop, whereas in my original function I used a for loop that iterates n times. The general steps of my code are as follows:
1. Randomly create the centroids and store them in the variable centroids.
2. Calculate the Euclidean distance of each data point from the centroids generated above and store it in the variable EucDist.
3. Assign each data point to a cluster based on the EucDist from centroids.
4. Regroup the data points based on the cluster index cluster, then compute the mean of each separated cluster and assign it as the new centroids.
Steps 2-4 are then looped over until convergence. Below is the code for my while loop (I'm aware there is probably a bit of unnecessary code in here, however I'm unsure where the problem is originating from, if it is even a problem).
while True:
    previouscentroids = centroids
    EucDist = np.array([]).reshape(a, 0)
    for k in range(K):
        Dist = np.sqrt(np.sum((data - centroids[:, k]) ** 2, axis=1))
        EucDist = np.c_[EucDist, Dist]
    cluster = np.argmin(EucDist, axis=1) + 1
    points = {}
    for k in range(K):
        points[k + 1] = np.array([]).reshape(2, 0)
    for i in range(a):
        points[cluster[i]] = np.c_[points[cluster[i]], data[i]]
    for k in range(K):
        points[k + 1] = points[k + 1].T
    for k in range(K):
        centroids[:, k] = np.mean(points[k + 1], axis=0)
    if np.all(previouscentroids - centroids == 0):
        break
My understanding of this is that once centroids is the same as it was at the start of the iteration (previouscentroids), the loop will break and I'll have my final clusters. I assumed this would be the same as the centroids produced from my original function (and the same as from sklearn), as I thought once convergence is reached, no matter how many times you iterate the loop, the clusters will remain unchanged (thus so will the centroids). Below is the for loop from my original function for comparison (very little has been changed).
for i in range(n):
    EucDist = np.array([]).reshape(a, 0)
    for k in range(K):
        Dist = np.sqrt(np.sum((data - centroids[:, k]) ** 2, axis=1))
        EucDist = np.c_[EucDist, Dist]
    cluster = np.argmin(EucDist, axis=1) + 1
    points = {}
    for k in range(K):
        points[k + 1] = np.array([]).reshape(2, 0)
    for i in range(a):
        points[cluster[i]] = np.c_[points[cluster[i]], data[i]]
    for k in range(K):
        points[k + 1] = points[k + 1].T
    for k in range(K):
        centroids[:, k] = np.mean(points[k + 1], axis=0)
K-means is a non-deterministic algorithm. If you don't use the same initialization as scikit-learn, you are not guaranteed to find the same clusters. The scikit-learn documentation states:
Given enough time, K-means will always converge, however this may be to a local minimum. This is highly dependent on the initialization of the centroids. As a result, the computation is often done several times, with different initializations of the centroids.
K-means results will change each time you run it because the initial cluster centers are chosen at random. So unless you use exactly the same initial cluster centers, the results won't be the same.
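One way to make the comparison apples-to-apples is to pin the initialization on both sides. A minimal sketch (the synthetic data and the (K, n_features) centroid layout are illustrative assumptions; the loops above store centroids column-wise, so you would pass your own array transposed):
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 2))
K = 3

# Choose starting centroids once and reuse them for both implementations.
initial_centroids = data[rng.choice(len(data), size=K, replace=False)]

# n_init=1 disables sklearn's random restarts, so it uses exactly this init.
km = KMeans(n_clusters=K, init=initial_centroids, n_init=1).fit(data)
print(km.cluster_centers_)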

Vectorized implementation of kmeans++

I have implemented the kmeans++ algorithm to initialize clusters when performing K-means clustering. The loop has to run k times. I was wondering if there was any way to vectorize the algorithm to get it to run faster?
points is an array of points in d-dimensions and k is the number of centroids to return.
It works by calculating the minimum distances from the already found centroids to all the points, and then calculating the probability of choosing the next centroid from the points.
The issue is really that it scales badly when k is large.
def init_plus_plus(points, k):
    centroids = np.zeros_like(points[:k])
    r = np.random.randint(0, points.shape[0])
    centroids[0] = points[r]

    for i in range(1, k):
        min_distances = self.euclidian_distance(centroids[:i], points).min(1)
        prob = min_distances / min_distances.sum()
        cs = np.cumsum(prob)
        idx = np.sum(cs < np.random.rand())
        centroids[i] = points[int(idx)]
    return centroids
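For reference, here is a self-contained sketch of the same loop. The broadcasting-based distance and np.random.choice are substitutions for self.euclidian_distance and the cumsum trick (note that kmeans++ as usually described weights points by squared distance). Updating the per-point minimum distance incrementally avoids recomputing distances to all previously chosen centroids, which is where most of the cost for large k comes from:
import numpy as np

def init_plus_plus_sketch(points, k, seed=None):
    rng = np.random.default_rng(seed)
    centroids = np.empty((k, points.shape[1]), dtype=points.dtype)
    centroids[0] = points[rng.integers(points.shape[0])]

    # Squared distance of every point to its nearest chosen centroid so far.
    min_sq_dist = ((points - centroids[0]) ** 2).sum(axis=1)
    for i in range(1, k):
        prob = min_sq_dist / min_sq_dist.sum()
        idx = rng.choice(points.shape[0], p=prob)
        centroids[i] = points[idx]
        # Only distances to the newly added centroid need computing.
        new_sq_dist = ((points - centroids[i]) ** 2).sum(axis=1)
        min_sq_dist = np.minimum(min_sq_dist, new_sq_dist)
    return centroids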

Python while loop execution speed up [duplicate]

I am trying to implement a very simple greedy clustering algorithm in Python, but am hard-pressed to optimize it for speed. The algorithm takes a distance matrix, finds the column with the most components less than a predetermined distance cutoff, and stores the row indices (with components less than the cutoff) as the members of the cluster. The centroid of the cluster is the column index. The columns and rows of each member index are then removed from the distance matrix (resulting in a smaller, but still square, matrix), and the algorithm iterates through successively smaller distance matrices until all clusters are found.
Because each iteration depends on the last (a new distance matrix is formed so that there are no overlapping members between clusters), I think I cannot avoid a slow for loop in Python. I've tried numba (jit) to speed it up, but I think it is reverting to Python mode and so does not result in any speed gains.
Here are two implementations of the algorithm. The first is slower than the latter. Any suggestions for speedups are most welcome. I am aware of other clustering algorithms as implemented in scipy or sklearn (such as DBSCAN, k-means/medoids, etc.), but am very keen to use the current one for my application. Thanks in advance for any suggestions.
Method 1 (slower):
import numpy as np

def cluster(distance_matrix, cutoff=1):
    indices = np.arange(0, len(distance_matrix))
    boolean_distance_matrix = distance_matrix <= cutoff
    centroids = []
    members = []
    while boolean_distance_matrix.any():
        centroid = np.argmax(np.sum(boolean_distance_matrix, axis=0))
        mem_indices = boolean_distance_matrix[:, centroid]
        mems = indices[mem_indices]
        boolean_distance_matrix[mems, :] = False
        boolean_distance_matrix[:, mems] = False
        centroids.append(centroid)
        members.append(mems)
    return members, centroids
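A hedged usage sketch for Method 1 on synthetic data (the point count, cutoff, and use of scipy.spatial.distance.cdist are illustrative assumptions, not part of the original post):
import numpy as np
from scipy.spatial.distance import cdist

data = np.random.rand(500, 2)
distance_matrix = cdist(data, data)   # dense pairwise Euclidean distances
members, centroids = cluster(distance_matrix, cutoff=0.1)
print(len(centroids), "clusters,", sum(len(m) for m in members), "points assigned")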
Method 2 (faster, but still slow for large matrices):
It takes as input an adjacency (sparse) matrix formed from sklearn's nearest neighbors implementation. This is the simplest and fastest way I could think to get the relevant distance matrix for clustering. I believe working with the sparse matrix also speeds up the clustering algorithm.
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(metric='euclidean', radius=1.5,
                        algorithm='kd_tree')
nbrs.fit(data)
adjacency_matrix = nbrs.radius_neighbors_graph(data)

def cluster(adjacency_matrix, gt=1):
    rows = adjacency_matrix.nonzero()[0]
    cols = adjacency_matrix.nonzero()[1]
    members = []
    member = np.ones(len(range(gt+1)))
    centroids = []
    appendc = centroids.append
    appendm = members.append
    while len(member) > gt:
        un, coun = np.unique(cols, return_counts=True)
        centroid = un[np.argmax(coun)]
        appendc(centroid)
        member = rows[cols == centroid]
        appendm(member)
        cols = cols[np.in1d(rows, member, invert=True)]
        rows = rows[np.in1d(rows, member, invert=True)]
    return members, centroids
