Why does my k-means convergence condition give different results than sklearn? - python

I've written a function that executes k-means clustering with data, K, and n (number of iterations) as inputs. I can set n=100 so that the function calculates Euclidean distance, assigns clusters and calculates new cluster centroids 100 times over before completing. When comparing the results of this with sklearn's KMeans (using the same data and K) I get the same centroids.
I would like my function to run only until convergence (so as to avoid unnecessary iterations), so it no longer needs number of iterations as an input. To implement this I've used a while loop, whereas in my original function I used a for loop that iterates n times. The general steps of my code are as follows:
1. Randomly create the centroids and store in the variable centroids.
2. Calculate the Euclidean distance of each data point from the centroids generated above and store in the variable EucDist.
3. Assign each data point a cluster based on the EucDist from centroids.
4. Regroup the data points based on the cluster index cluster. Then compute the mean of separated clusters and assign it as new centroids.
Steps 2-4 are then looped over until convergence. Below is the code for my while loop (I'm aware there is probably a bit of unnecessary code in here, however I'm unsure where the problem is originating from, if it is even a problem).
while True:
    previouscentroids = centroids
    EucDist = np.array([]).reshape(a, 0)
    for k in range(K):
        Dist = np.sqrt(np.sum((data - centroids[:, k]) ** 2, axis = 1))
        EucDist = np.c_[EucDist, Dist]
    cluster = np.argmin(EucDist, axis = 1) + 1
    points = {}
    for k in range(K):
        points[k + 1] = np.array([]).reshape(2, 0)
    for i in range(a):
        points[cluster[i]] = np.c_[points[cluster[i]], data[i]]
    for k in range(K):
        points[k + 1] = points[k + 1].T
    for k in range(K):
        centroids[:, k] = np.mean(points[k + 1], axis = 0)
    if np.all(previouscentroids - centroids == 0):
        break
My understanding of this is that once centroids is the same as it was at the start of the iteration (previouscentroids), the loop will break and I'll have my final clusters. I assumed this would be the same as the centroids produced from my original function (and the same as from sklearn), as I thought once convergence is reached, no matter how many times you iterate the loop, the clusters will remain unchanged (thus so will the centroids). Below is the for loop from my original function for comparison (very little has been changed).
for i in range(n):
    EucDist = np.array([]).reshape(a, 0)
    for k in range(K):
        Dist = np.sqrt(np.sum((data - centroids[:, k]) ** 2, axis = 1))
        EucDist = np.c_[EucDist, Dist]
    cluster = np.argmin(EucDist, axis = 1) + 1
    points = {}
    for k in range(K):
        points[k + 1] = np.array([]).reshape(2, 0)
    for i in range(a):
        points[cluster[i]] = np.c_[points[cluster[i]], data[i]]
    for k in range(K):
        points[k + 1] = points[k + 1].T
    for k in range(K):
        centroids[:, k] = np.mean(points[k + 1], axis = 0)

K-means is a non-deterministic algorithm: its result depends on how the centroids are initialized. If your initialization is not the same as scikit-learn's, you are not guaranteed to find the same clusters. The scikit-learn documentation says:
Given enough time, K-means will always converge, however this may be to a local minimum. This is highly dependent on the initialization of the centroids. As a result, the computation is often done several times, with different initializations of the centroids.

K-means results will change each time you run it because the initial cluster centers are chosen at random. So unless you have exactly the same initial cluster centers, the results won't be the same.
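Separately from initialization, one thing that may be worth checking in the while loop from the question: assuming centroids is a NumPy array (as the centroids[:, k] indexing suggests), the line previouscentroids = centroids does not copy it. Both names refer to the same array, and centroids[:, k] = ... updates it in place, so previouscentroids - centroids is always zero and the loop would stop after the first pass. The sketch below is not the asker's function; the name kmeans_until_converged, the max_iter cap, the 0-based labels and the empty-cluster handling are all assumptions made for illustration. It keeps the question's (dimensions, K) layout for centroids and takes a real snapshot with .copy().

```python
import numpy as np

def kmeans_until_converged(data, centroids, max_iter=300):
    """data has shape (a, d); centroids has shape (d, K), as in the question.

    Hypothetical helper: returns (centroids, cluster, n_iter) with 0-based labels.
    """
    centroids = np.array(centroids, dtype=float)  # private copy of the start values
    K = centroids.shape[1]
    for n_iter in range(1, max_iter + 1):
        # A real snapshot; `previous = centroids` would only add a second name.
        previous = centroids.copy()
        # Pairwise Euclidean distances, shape (a, K).
        dist = np.linalg.norm(data[:, :, None] - centroids[None, :, :], axis=1)
        cluster = np.argmin(dist, axis=1)
        for k in range(K):
            members = data[cluster == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                centroids[:, k] = members.mean(axis=0)
        if np.allclose(previous, centroids):
            break
    return centroids, cluster, n_iter

# Tiny deterministic example: two obvious groups, one start point in each.
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
start = np.array([[0.0, 10.0], [0.0, 10.0]])  # columns are the centroids
final, labels, n_iter = kmeans_until_converged(data, start)
```

With the same starting centroids, a loop like this and a fixed number of iterations should end at the same centroids once the fixed count is large enough. A comparison with sklearn is only meaningful if the same starting centroids are passed to it as well.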


K-means documents of the BERT vector

I transformed a number of sentences (documents) through BERT and got their vectors. Now I need to build the following algorithm.
1. Create centroids and cluster the documents with k-means.
2. Subtract the mean of its cluster from vector [i], which gives a new vector [i].
3. Generate new random centroids and repeat steps 1 and 2, n times.
The teacher said that two properties should hold:
- The MSE between the vector and its new vector (after subtracting the cluster mean) should decrease.
- For a document, the sum of the means of the clusters it was assigned to over the n stages should, in theory, be approximately equal to the document's original vector.
The problem is that:
- The MSE usually does decrease, but only for a few stages; after that it grows and then decreases again.
- The sum of the cluster means has never been equal to the document vector.
- With 256 clusters for 4000 documents, after several stages the number of clusters almost always becomes 2 and does not change.
# generate centroids (256 x 768), with values between the max and min values of the document data
# the first stage: k-means with Euclidean distance
# edistrix - document vectors after BERT
for i in range(0, 4000):
    min_dist = 100000
    for j in range(0, 256):
        dist = np.linalg.norm(edistrix[i] - start_centroids[j])
        if dist < min_dist:
            min_dist = dist
            centr[i] = j  # centr - the number of the nearest centroid of document i
    cluster_p[int(centr[i])] += 1  # number of cluster points
    cluster_sum[int(centr[i])] += edistrix[i]  # sum of cluster points
near_cluster_etapa[0] = centr  # the nearest centroid id for each document vector
# other stages
for n in range(0, 7):  # stage number
    for i in range(0, 256):  # calculate the average of the cluster
        if cluster_p[i] != 0:
            centr_centroidov[n][i] = np.divide(cluster_sum[i], cluster_p[i])
    for i in range(0, 4000):
        Dif = edistrix[i] - centr_centroidov[n][int(near_cluster_etapa[n][i])]  # the difference between the cluster point and the average value
        sred[i][n] = ((edistrix[i] - Dif)**2).mean(axis=None)  # mse between the current vector and the difference Dif
        edistrix[i] = Dif  # Dif becomes the new vector of the i-th document
        all_doc_koord[i][n+1] = edistrix[i]
    # generate new centroids
    # reset to zero the number of points, the sum of the cluster vectors and the nearest centroid of each vector (cluster_p, cluster_sum, centr)
    # again k-means
    for i in range(0, 4000):
        min_dist = 100000
        for j in range(0, 256):
            dist = np.linalg.norm(edistrix[i][0] - start_centroids[j])
            if dist < min_dist:
                min_dist = dist
                centr[i] = j
        cluster_sum[int(centr[i])] += edistrix[i][0]
    near_cluster_etapa[n+1] = centr
# check features
for i in range(0, 7):
    print('{0:.10f}'.format(sred[0][i]))  # for document 0, print the mse at every stage
summ = 0
for i in range(0, 7):
    summ += centr_centroidov[i][int(near_cluster_etapa[i][0])]  # add up the cluster averages from all stages for document 0 (the first document)
I really don't understand what the problem is. As far as I can tell, the k-means step, the Euclidean distance, the number of cluster points, the cluster mean, the MSE and the difference between the vector and the cluster mean are all computed correctly.
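The two properties can be checked on a small synthetic example. The sketch below is only an illustration under stated assumptions: the data is random rather than BERT output, the sizes are made up, and the random centroids are drawn from the current vectors rather than from the min/max range described above. It tracks the residual (the vector after subtraction) and the running sum of subtracted cluster means. By construction, original vector = sum of subtracted means + final residual, so the sum of means only approximates the original when the residual is small. The squared residual averaged over all documents cannot increase when each cluster's own mean is subtracted, but the value for a single document can go up and down.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(400, 16))  # stand-in for the BERT vectors (hypothetical sizes)
n_clusters, n_stages = 8, 5        # hypothetical; the question uses 256 clusters

residual = docs.copy()             # current vector of every document
subtracted = np.zeros_like(docs)   # running sum of the cluster means removed so far
mse_per_stage = []

for stage in range(n_stages):
    # Assumption: new random centroids are sampled from the current vectors.
    centroids = residual[rng.choice(len(residual), n_clusters, replace=False)]
    # Distance of every document to every centroid, shape (n_docs, n_clusters).
    dist = np.linalg.norm(residual[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    # Mean of the documents assigned to each centroid (empty clusters stay at zero
    # and are never indexed, because no document points to them).
    means = np.zeros_like(centroids)
    for c in range(n_clusters):
        members = residual[nearest == c]
        if len(members) > 0:
            means[c] = members.mean(axis=0)
    residual = residual - means[nearest]
    subtracted += means[nearest]
    mse_per_stage.append(float(np.mean(residual ** 2)))

# Reconstruction identity: original = subtracted means + what is left over.
reconstruction_error = float(np.abs(docs - (subtracted + residual)).max())
```

Note that in the question's code sred[i][n] is computed from edistrix[i] - Dif, which equals the cluster mean rather than the residual, and that only document 0 is printed. Either could explain an MSE that goes up and down.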

How to generate a Rank 5 matrix with entries Uniform?

I want to generate a rank 5 100x600 matrix in numpy with all the entries sampled from np.random.uniform(0, 20), so that all the entries will be uniformly distributed between [0, 20). What will be the best way to do so in python?
I see there is an SVD-inspired way to do so here (https://math.stackexchange.com/questions/3567510/how-to-generate-a-rank-r-matrix-with-entries-uniform), but I am not sure how to code it up. I am looking for a working example of this SVD-inspired way to get uniformly distributed entries.
I have actually managed to code up a rank 5 100x100 matrix by vertically stacking five 20x100 rank 1 matrices, then shuffling the vertical indices. However, the resulting 100x100 matrix does not have uniformly distributed entries [0, 20).
Here is my code (my best attempt):
import numpy as np
def randomMatrix(m, n, p, q):
    # creates an m x n matrix with lower bound p and upper bound q, randomly.
    count = np.random.uniform(p, q, size=(m, n))
    return count
Qs = []
my_rank = 5
for i in range(my_rank):
    L = randomMatrix(20, 1, 0, np.sqrt(20))
    # L is tall
    R = randomMatrix(1, 100, 0, np.sqrt(20))
    # R is long
    Q = np.outer(L, R)
    Qs.append(Q)
Q = np.vstack(Qs)
#shuffle (preserves rank 5 [confirmed])
Not a perfect solution, I must admit. But it's simple and comes pretty close.
I create 5 vectors that span the space of the matrix and create random linear combinations of them to fill the rest of the matrix.
My initial thought was that a trivial solution would be to copy those vectors 20 times.
To improve on that, I created linear combinations of them with weights drawn from a uniform distribution, but then the distribution of the entries in the matrix becomes approximately normal, because the weighted mean basically causes the central limit theorem to take effect.
A middle ground between the trivial approach and the second approach (which doesn't work) is to use sets of weights that favor one of the vectors over the others. You can generate these sorts of weight vectors by passing any vector through the softmax function with an appropriately high temperature parameter.
The distribution is almost uniform, but the vectors are still very close to the base vectors. You can play with the temperature parameter to find a sweet spot that suits your purpose.
from scipy.stats import ortho_group
from scipy.special import softmax
import numpy as np
from matplotlib import pyplot as plt
N = 100
R = 5
low = 0
high = 20
sm_temperature = 100
p = np.random.uniform(low, high, (1, R, N))
weights = np.random.uniform(0, 1, (N-R, R, 1))
weights = softmax(weights*sm_temperature, axis = 1)
p_lc = (weights*p).sum(1)
rand_mat = np.concatenate([p[0], p_lc])
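A quick way to check both requirements on the result is to compute the numerical rank and run a Kolmogorov-Smirnov test against Uniform(low, high). The sketch below rebuilds the same matrix with a seeded generator so that it is self-contained. The seed and the use of np.random.default_rng are choices made here for reproducibility, not part of the answer above. The KS test treats the entries as independent samples, which they are not in a low-rank matrix, so the p-value is only a rough indication.

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import kstest

rng = np.random.default_rng(0)  # seeded only to make the check reproducible
N, R, low, high, sm_temperature = 100, 5, 0, 20, 100

# Same construction as above: R base rows plus N - R softmax-weighted combinations.
p = rng.uniform(low, high, (1, R, N))
weights = softmax(rng.uniform(0, 1, (N - R, R, 1)) * sm_temperature, axis=1)
rand_mat = np.concatenate([p[0], (weights * p).sum(1)])

# Numerical rank (SVD-based, with NumPy's default tolerance).
rank = np.linalg.matrix_rank(rand_mat)

# KS test against Uniform(low, high); scipy parametrizes it as (loc, scale).
ks = kstest(rand_mat.ravel(), "uniform", args=(low, high - low))
print(rank, ks.statistic, ks.pvalue)
```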
I just couldn't accept the fact that my previous solution (the "selection" method) did not really produce strictly uniformly distributed entries, but only came close enough to fool a statistical test sometimes. In the asymptotic case, however, the entries will almost surely not be distributed uniformly. But I did dream up another crazy idea that's just as bad, but in a different way: it's not really random.
In this solution, I do something similar to OP's method of forming R matrices with rank 1 and then concatenating them, but a little differently. I create each matrix by stacking a base vector on top of itself multiplied by 0.5, and then I stack those on the same base vector shifted by half the dynamic range of the uniform distribution. This process continues with multiplication by a third, two thirds and 1, then shifting, and so on until I have the required number of vectors in that part of the matrix.
I know it sounds incomprehensible. But, unfortunately, I couldn't find a way to explain it better. Hopefully, reading the code will shed some more light.
I hope this "staircase" method will be more reliable and useful.
import numpy as np
from matplotlib import pyplot as plt
# N - base dimension
# M - matrix length
# R - matrix rank
# high - max value of the matrix
# low - min value of the matrix
N = 100
M = 600
R = 5
high = 20
low = 0
# base vectors of the matrix
base = low+np.random.rand(R-1, N)*(high-low)
def build_staircase(base, num_stairs, low, high):
    # create a uniformly distributed matrix with rank 2 'num_stairs' different
    # vectors whose elements are all uniformly distributed like the values of
    l = levels(num_stairs)
    vectors = []
    for l_i in l:
        for i in range(l_i):
            vector_dynamic = (base-low)/l_i
            vector_bias = low+np.ones_like(base)*i*((high-low)/l_i)
            vectors.append(vector_dynamic + vector_bias)
    return np.array(vectors)
def levels(total):
    # create a sequence of strictly increasing numbers summing up to the total.
    l = []
    sum_l = 0
    i = 1
    while sum_l < total:
        l.append(i)
        i +=1
        sum_l = sum(l)
    i = 0
    while sum_l > total:
        l[i] -= 1
        if l[i] == 0:
            i += 1
        if i == len(l):
            i = 0
        sum_l = sum(l)
    return l
n_rm = R-1 # number of matrix subsections
m_rm = M//n_rm
len_rms = [ M//n_rm for i in range(n_rm)]
len_rms[-1] += M%n_rm
rm_list = []
for i, len_rm in enumerate(len_rms):
    # create a matrix with uniform entries with rank 2
    # out of the vector 'base[i]' and a ones vector.
    rm_list.append(build_staircase(
        base = base[i],
        num_stairs = len_rms[i],
        low = low,
        high = high,
    ))
rm = np.concatenate(rm_list)
plt.hist(rm.flatten(), bins = 100)
A few examples:
and now with N = 1000, M = 6000 to empirically demonstrate the nearly asymptotic behavior:

Vectorized implementation of kmeans++

I have implemented the kmeans++ algorithm to initialize clusters when performing K-means clustering. The loop has to run k times. I was wondering if there was any way to vectorize the algorithm to get it to run faster?
points is an array of points in d-dimensions and k is the number of centroids to return.
It works by calculating the minimum distances from the already found clusters, to all the points and then calculating the probability of choosing the next cluster from the points.
The issue is really that it scales badly when k is large.
def init_plus_plus(points, k):
    centroids = np.zeros_like(points[:k])
    r = np.random.randint(0, points.shape[0])
    centroids[0] = points[r]
    for i in range(1, k):
        min_distances = self.euclidian_distance(centroids[:i], points).min(1)
        prob = min_distances / min_distances.sum()
        cs = np.cumsum(prob)
        idx = np.sum(cs < np.random.rand())
        centroids[i] = points[int(idx)]
    return centroids
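The loop over k is hard to remove because each new centroid depends on the ones already chosen. The work inside the loop can be reduced, though. The code above recomputes the distances to all i centroids found so far on every pass, which is what makes large k expensive. One common alternative is to keep a running minimum and only compute distances to the newest centroid. The sketch below assumes points is a 2-D float array with at least two distinct rows. The function name and the rng parameter are made up for illustration, and euclidian_distance from the question is not used because it is not shown. It weights by squared distance, as in the usual description of k-means++, whereas the question's code weights by plain distance. Take a square root of min_sq before normalizing if that is what is intended.

```python
import numpy as np

def init_plus_plus_incremental(points, k, rng=None):
    """k-means++ style seeding with a running minimum distance (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    centroids = np.empty((k, points.shape[1]), dtype=points.dtype)
    centroids[0] = points[rng.integers(n)]
    # Squared distance from every point to its nearest centroid chosen so far.
    min_sq = np.sum((points - centroids[0]) ** 2, axis=1)
    for i in range(1, k):
        # Sample the next centroid with probability proportional to min_sq.
        prob = min_sq / min_sq.sum()
        idx = rng.choice(n, p=prob)
        centroids[i] = points[idx]
        # Only distances to the new centroid are computed, then folded into the minimum.
        new_sq = np.sum((points - centroids[i]) ** 2, axis=1)
        np.minimum(min_sq, new_sq, out=min_sq)
    return centroids

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3))
seeds = init_plus_plus_incremental(pts, 5, rng)
```

Each pass now costs one distance computation over all points, independent of how many centroids have already been picked.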

Creating adjacency matrix from multiple clustering results of various sizes

I have collected outputs from several clustering algorithms on the same data set, based on which I would like to generate an adjacency matrix indicating in how many different runs any two samples were clustered together. I.e. for each of the I = 10 clusterings, I have a one-hot representation N x C_i indicating whether or not sample n belongs to cluster c (for the i'th run), with a possibly different number of clusters for each run. The goal is then to build an N x N adjacency matrix that I can threshold to select only consistent clusters for further analysis.
It is quite easy to build an algorithm that does this:
import itertools
import numpy as np
import scipy.sparse
n_samples = runs[0].shape[0]
i = []
j = []
for iter_no, ca in enumerate(runs):
    print("Processing adjacency", iter_no)
    for col in range(ca.shape[1]):
        comb = itertools.combinations(np.where(ca[:, col])[0], 2)
        for c in comb:
            i.append(c[0])
            j.append(c[1])
i = np.array(i)
j = np.array(j)
adj_mat = scipy.sparse.coo_matrix((np.ones(len(i)), (i, j)), shape=[n_samples, n_samples])
This scales very poorly with cluster size, and I typically have N = 15000 with cluster sizes occasionally reaching 12k. Hence, I'm wondering whether a library such as networkx could speed this up. Is there any obvious way to do this?
UPDATE: Solution found (see answer).
Linear algebra to the rescue:
assert len(runs) > 0
N = runs[0].shape[0]
R = len(runs)
# Iteratively populate the output matrix (dense)
S = np.zeros((N, N), dtype=np.int8)
for i, scan in enumerate(runs):
    print("Adjacency", i)
    S += np.matmul(scan, scan.T).astype(np.int8)
# Convert to sparse and return
return scipy.sparse.csr_matrix(S)
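The idea is that for a one-hot matrix A of shape N x C, entry (n, m) of A @ A.T is 1 exactly when samples n and m share a cluster in that run, so summing these products over runs gives the co-clustering counts. A self-contained variant that stays sparse throughout might look like the sketch below. The function name coassociation_counts and the int32 dtype are choices made here for illustration. Note that unlike the pair-enumeration code in the question, the diagonal is also filled (each sample is counted with itself once per run). Sparse storage only helps when clusters are small; a cluster with 12k members still produces a 12k x 12k block of nonzeros.

```python
import numpy as np
import scipy.sparse

def coassociation_counts(runs):
    """Count, for every pair of samples, in how many runs they share a cluster.

    runs: list of one-hot arrays of shape (N, C_i). Hypothetical helper name.
    """
    n = runs[0].shape[0]
    total = scipy.sparse.csr_matrix((n, n), dtype=np.int32)
    for onehot in runs:
        a = scipy.sparse.csr_matrix(onehot, dtype=np.int32)
        # (a @ a.T)[n, m] == 1 iff samples n and m are in the same cluster of this run.
        total = total + a @ a.T
    return total

# Three runs over three samples, with 2, 3 and 1 clusters respectively.
runs = [
    np.array([[1, 0], [1, 0], [0, 1]]),
    np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]),
    np.array([[1], [1], [1]]),
]
counts = coassociation_counts(runs).toarray()
```

Thresholding is then a comparison on the result, for example counts >= 2 to keep pairs that were clustered together in at least two runs.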

Fitting arbitrary gaussian functions, massive memory consumption in python

I'm trying to (in python) fit a series of an arbitrary number of gaussian functions (determined by a simple algorithm still being improved) to a data set. For my current sample data set, I have 174 gaussian functions. I have a procedure for doing the fit, but it's basically complicated guess-and-check, and consumes all 4GB of memory available.
Is there any way to accomplish this using something in scipy or numpy?
Here is what I'm trying to use, where wavelength[] is the list of x-coordinates, and fluxc[] is the list of y-coordinates:
#Pick a gaussian
for repeat in range(0,2):
    for f in range(0,len(centroid)):
        #Iterate over every other gaussian
        for i in range(0,len(centroid)):
            if i!= f:
                #For every wavelength,
                for w in wavelength:
                    #Append the value of each to an list, called others
        #Optimize the centroid of the current gaussian
        prev = centroid[f]
        best = centroid[f]
        #Pick an order of magnitude
        for p in range (int(round(math.log10(centroid[i]))-3-repeat),int(round(math.log10(centroid[i])))-6-repeat,-1):
            #Pick a value of that order of magnitude
            for m in range (-5,9):
                #Change the value of the current item
                centroid[f] = prev + m * 10 **(p)
                #Increment over all wavelengths, make a list of the new values
                variancy = 0
                residual = 0
                test = []
                #Increment across every wavelength and evaluate if this change gets R^2 any larger
                for k in range(0,len(wavelength)):
                    residual += (test[k]+others[k]-cflux[k])**2
                    variancy += (test[k]+others[k]-avgcflux)**2
                rsquare = 1-(residual/variancy)
                #Check the R^2 value for this new fit
                if rsquare > bestr:
                    bestr = rsquare
                    best = centroid[f]
        centroid[f] = best
        #Optimize the height of the current gaussian
        prev = height[f]
        best = height[f]
        #Pick an order of magnitude
        for p in range (int(round(math.log10(height[i]))-repeat),int(round(math.log10(height[i])))-3-repeat,-1):
            #Pick a value of that order of magnitude
            for m in range (-5,9):
                #Change the value of the current item
                height[f] = prev + m * 10 **(p)
                #Increment over all wavelengths, make a list of the new values
                variancy = 0
                residual = 0
                test = []
                #Increment across every wavelength and evaluate if this change gets R^2 any larger
                for k in range(0,len(wavelength)):
                    residual += (test[k]+others[k]-cflux[k])**2
                    variancy += (test[k]+others[k]-avgcflux)**2
                rsquare = 1-(residual/variancy)
                #Check the R^2 value for this new fit
                if rsquare > bestr:
                    bestr = rsquare
                    best = height[f]
        height[f] = best
        #Optimize the width of the current gaussian
        prev = width[f]
        best = width[f]
        #Pick an order of magnitude
        for p in range (int(round(math.log10(width[i]))-repeat),int(round(math.log10(width[i])))-3-repeat,-1):
            #Pick a value of that order of magnitude
            for m in range (-5,9):
                if prev + m * 10**(p) == 0:
                #Change the value of the current item
                width[f] = prev + m * 10 **(p)
                #Increment over all wavelengths, make a list of the new values
                variancy = 0
                residual = 0
                test = []
                #Increment across every wavelength and evaluate if this change gets R^2 any larger
                for k in range(0,len(wavelength)):
                    residual += (test[k]+others[k]-cflux[k])**2
                    variancy += (test[k]+others[k]-avgcflux)**2
                rsquare = 1-(residual/variancy)
                #Check the R^2 value for this new fit
                if rsquare > bestr:
                    bestr = rsquare
                    best = width[f]
        width[f] = best
        count += 1
        #print '{} of {} peaks optimized, iteration {} of {}'.format(f+1,len(centroid),repeat+1,2)
        complete = round(100*(count/(float(len(centroid))*2)),2)
        print '{}% completed'.format(complete)
print 'New R^2 = {}'.format(bestr)
Yes, it can likely be done better (and more easily) using scipy. But first, refactor your code into smaller functions; it just makes it a lot easier to read and understand what's going on.
As for the memory consumption: you're probably overextending a list far too much somewhere (others is a candidate: I never see it cleared (or initialized!), while it gets filled in a quadruple loop). That, or your data is simply that large (in which case you really should be using numpy arrays, just to speed up things). I can't tell, because you're introducing various variables without giving any idea of the size (how big is wavelengths? How large does others get? What and where are all the initializations of your data arrays?)
Also, fitting 174 Gaussians is just a bit crazy; either look into another way of determining whatever you want to get out of your data, or split things up. From the wavelengths variable, it appears you're trying to fit lines in a high resolution spectrum; perhaps isolating most of the lines and fitting those isolated groups separately is better. If they all overlap, I doubt any normal fitting technique is going to help you.
Lastly, perhaps a package like pandas can help (e.g., the computation subpackage).
Perhaps very lastly, since I see a lot that can be improved in the code. At some point codereview may also be useful. Though for now I guess your memory usage is the most problematic part.
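As a rough illustration of the scipy route, scipy.optimize.curve_fit can fit a sum of Gaussians directly if the model takes a flat parameter list. The sketch below is written for Python 3 and uses synthetic data with two peaks. The function name multi_gaussian, the (height, centroid, width) ordering, and the particular Gaussian form height * exp(-0.5 * ((x - centroid) / width)**2) are assumptions, since the question does not show its own formula. It also assumes reasonable initial guesses are available, such as those produced by the question's peak-finding step.

```python
import numpy as np
from scipy.optimize import curve_fit

def multi_gaussian(x, *params):
    """Sum of Gaussians; params is flat: (height, centroid, width) for each peak."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(params, dtype=float).reshape(-1, 3)
    heights, centroids, widths = p[:, 0], p[:, 1], p[:, 2]
    # Broadcast to shape (n_peaks, n_points), then sum over the peaks.
    z = (x[None, :] - centroids[:, None]) / widths[:, None]
    return np.sum(heights[:, None] * np.exp(-0.5 * z ** 2), axis=0)

# Synthetic stand-in for (wavelength, flux): two peaks plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 500)
true_params = [1.0, 3.0, 0.4, 0.6, 7.0, 0.8]
y = multi_gaussian(x, *true_params) + rng.normal(scale=0.01, size=x.size)

# Initial guesses, deliberately a bit off; one (height, centroid, width) triple per peak.
guess = [0.8, 2.8, 0.5, 0.5, 7.3, 1.0]
fit_params, covariance = curve_fit(multi_gaussian, x, y, p0=guess)
print(np.round(fit_params, 3))
```

The intermediate array has one row per peak and one column per data point, so memory grows as n_peaks times n_points rather than with the number of trial values. For many peaks, fitting isolated groups of lines separately, as suggested above, keeps each call small.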

