K-means clustering of BERT document vectors - python

I passed a number of sentences (documents) through BERT and got their vectors. Now I need to build the following algorithm:
1) Create centroids and cluster the documents with k-means.
2) Subtract the mean of its cluster from vector[i] to get a new vector[i].
3) Generate new random centroids and repeat steps 1-2 n times.
The teacher said that two properties should hold:
1) The MSE between a vector and its new vector (after subtracting the cluster mean) should decrease from stage to stage.
2) For a document, the sum of the cluster means it was assigned to over the n stages should, in theory, be approximately equal to the document's original vector.
The problems are:
- The MSE usually does decrease, but only for a few stages; after that it grows again and then decreases again.
- The sum of the cluster means never comes close to the document vector.
- With 256 clusters for 4000 documents, after several stages the number of non-empty clusters almost always drops to 2 and stays there.
# generate centroids (256 x 768) with values between the min and max of the document data
# first stage: k-means with Euclidean distance
# edistrix - document vectors from BERT
for i in range(0, 4000):
    min_dist = 100000
    for j in range(0, 256):
        dist = np.linalg.norm(edistrix[i] - start_centroids[j])
        if dist < min_dist:
            min_dist = dist
            centr[i] = j                       # centr - index of the nearest centroid for document i
    cluster_p[int(centr[i])] += 1              # number of points in the cluster
    cluster_sum[int(centr[i])] += edistrix[i]  # sum of the cluster's points
near_cluster_etapa[0] = centr                  # nearest centroid id for each document vector
# other stages
for n in range(0, 7):                          # stage number
    for i in range(0, 256):                    # calculate the mean of each cluster
        if cluster_p[i] != 0:
            centr_centroidov[n][i] = np.divide(cluster_sum[i], cluster_p[i])
    for i in range(0, 4000):
        Dif = edistrix[i] - centr_centroidov[n][int(near_cluster_etapa[n][i])]  # difference between the cluster point and the cluster mean
        sred[i][n] = ((edistrix[i] - Dif)**2).mean(axis=None)                   # mse between the current vector and the difference Dif
        edistrix[i] = Dif                      # Dif becomes the new vector of the i-th document
        all_doc_koord[i][n+1] = edistrix[i]
    # generate new centroids
    # reset the point counts, the cluster sums and the nearest-centroid ids (cluster_p, cluster_sum, centr)
    # k-means again
    for i in range(0, 4000):
        min_dist = 100000
        for j in range(0, 256):
            dist = np.linalg.norm(edistrix[i][0] - start_centroids[j])
            if dist < min_dist:
                min_dist = dist
                centr[i] = j
        cluster_p[int(centr[i])] += 1
        cluster_sum[int(centr[i])] += edistrix[i][0]
    near_cluster_etapa[n+1] = centr
# check the properties
# 1)
for i in range(0, 7):
    print('{0:.10f}'.format(sred[0][i]))       # print the MSE of every stage for document 0
# 2)
summ = 0
for i in range(0, 7):
    summ += centr_centroidov[i][int(near_cluster_etapa[i][0])]  # sum the cluster means over the stages for document 0
summ
I really don't understand what the problem is. I'm fairly sure the k-means step, the Euclidean distance, the cluster point counts, the cluster means, the MSE, and the difference between a vector and its cluster mean are all computed correctly.
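For reference, a compact sketch of the procedure as described above, done with sklearn's KMeans instead of the hand-rolled loop (this is a hypothetical reimplementation, not the code in question; vectors stands for the 4000 x 768 BERT matrix, and "MSE" here is taken as the mean squared residual at each stage, which is one way to read property 1):

import numpy as np
from sklearn.cluster import KMeans

def residual_kmeans(vectors, n_clusters=256, n_stages=7):
    residual = vectors.copy()
    reconstruction = np.zeros_like(vectors)
    for stage in range(n_stages):
        km = KMeans(n_clusters=n_clusters, n_init=1).fit(residual)
        means = km.cluster_centers_[km.labels_]   # each vector's cluster mean at this stage
        mse = ((residual - means) ** 2).mean()    # mean squared value of the new residual
        print("stage {}: mean squared residual = {:.6f}".format(stage, mse))
        reconstruction += means                   # property 2: these sums should approach the originals
        residual = residual - means               # the residual becomes the new vector
    return reconstruction, residual

Note that property 2 only has a chance of holding if each stage's cluster means are computed from the residuals of the previous stage, which is what the loop above does.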


Why does my k-means convergence condition give different results than sklearn?

I've written a function that executes k-means clustering with data, K, and n (number of iterations) as inputs. I can set n=100 so that the function calculates Euclidean distance, assigns clusters and calculates new cluster centroids 100 times over before completing. When comparing the results of this with sklearn's KMeans (using the same data and K) I get the same centroids.
I would like my function to run only until convergence (so as to avoid unnecessary iterations), so it no longer needs number of iterations as an input. To implement this I've used a while loop, whereas in my original function I used a for loop that iterates n times. The general steps of my code are as follows:
Randomly create the centroids and store in the variable centroids.
Calculate the Euclidean distance of each data point from the centroids generated above and store in the variable EucDist.
Assign each data point a cluster based on the EucDist from centroids.
Regroup the data points based on the cluster index cluster. Then compute the mean of separated clusters and assign it as new centroids.
Steps 2-4 are then looped over until convergence. Below is the code for my while loop (I'm aware there is probably a bit of unnecessary code in here, however I'm unsure where the problem is originating from, if it is even a problem).
while True:
    previouscentroids = centroids
    EucDist = np.array([]).reshape(a, 0)
    for k in range(K):
        Dist = np.sqrt(np.sum((data - centroids[:, k]) ** 2, axis = 1))
        EucDist = np.c_[EucDist, Dist]
    cluster = np.argmin(EucDist, axis = 1) + 1
    points = {}
    for k in range(K):
        points[k + 1] = np.array([]).reshape(2, 0)
    for i in range(a):
        points[cluster[i]] = np.c_[points[cluster[i]], data[i]]
    for k in range(K):
        points[k + 1] = points[k + 1].T
    for k in range(K):
        centroids[:, k] = np.mean(points[k + 1], axis = 0)
    if np.all(previouscentroids - centroids == 0):
        break
My understanding of this is that once centroids is the same as it was at the start of the iteration (previouscentroids), the loop will break and I'll have my final clusters. I assumed this would be the same as the centroids produced from my original function (and the same as from sklearn), as I thought once convergence is reached, no matter how many times you iterate the loop, the clusters will remain unchanged (thus so will the centroids). Below is the for loop from my original function for comparison (very little has been changed).
for i in range(n):
    EucDist = np.array([]).reshape(a, 0)
    for k in range(K):
        Dist = np.sqrt(np.sum((data - centroids[:, k]) ** 2, axis = 1))
        EucDist = np.c_[EucDist, Dist]
    cluster = np.argmin(EucDist, axis = 1) + 1
    points = {}
    for k in range(K):
        points[k + 1] = np.array([]).reshape(2, 0)
    for i in range(a):
        points[cluster[i]] = np.c_[points[cluster[i]], data[i]]
    for k in range(K):
        points[k + 1] = points[k + 1].T
    for k in range(K):
        centroids[:, k] = np.mean(points[k + 1], axis = 0)
K-means is a non-deterministic algorithm. If you don't use the same initialization as scikit-learn, you are not guaranteed to find the same clusters. The scikit-learn documentation says:
Given enough time, K-means will always converge, however this may be to a local minimum. This is highly dependent on the initialization of the centroids. As a result, the computation is often done several times, with different initializations of the centroids.
K-means results will change each time you run it because the initial cluster centers are chosen at random. So unless you use exactly the same initial cluster centers, the results won't be the same.
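To illustrate that point, here is a minimal sketch (my own example, not from the answers above) of pinning both implementations to the same starting centroids; sklearn's KMeans accepts an array for init, and n_init=1 stops it from re-seeding:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
K = 3

# choose K data points as the shared initial centroids
initial = data[rng.choice(len(data), size=K, replace=False)]

# sklearn, forced to start from exactly these centroids
km = KMeans(n_clusters=K, init=initial, n_init=1).fit(data)
print(km.cluster_centers_)

# a custom loop started from the same `initial` centroids should converge to the
# same solution (up to the ordering of the clusters)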

How to find out standard deviation of estimated positions against real world position

I hope the title isn't too confusing but it's the best I could think of (feel free to suggest better titles!)
I have a physical sensor placed at a fixed location in a room, say (1, 1, 1) in a coordinate system. This sensor is able to estimate its position within the coordinate system. I let the sensor estimate the position 10 times a second for a time period of 30 seconds, so in total I have 300 position estimations which are saved to a file.
Now, in order to evaluate the position estimations, I calculated the distance from every estimation to the reference point (1, 1, 1) and saved all distances to a list. I'd like to find out the standard deviation of the distances to the reference point (1, 1, 1).
I am not that familiar with calculating standard deviations but as multiple explanations and tutorials suggested, I should
1) calculate the mean of all distances
2) subtract the mean from every single distance and square it
3) add all values from step 2) to a list and calculate their mean
4) take the square root of the mean
But, I think I shouldn't use the mean of the calculated distances in step 2) but instead use the value of 0 because I don't want to calculate the standard deviation of the calculated distances to their mean but to my reference point (1, 1, 1). Since my reference point obviously has a distance of 0 to itself, I thought that this might be the right approach.
Here's my python script:
import sys
from math import sqrt, pow

# Returns the number of samples collected - needed for the mean and standard deviation calculations
def get_sample_count(filename):
    with open(filename) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def distanceBetweenTwoPoints2D(sample_point, reference_point):
    return sqrt(pow(sample_point[0] - reference_point[0], 2) + pow(sample_point[1] - reference_point[1], 2))

def distanceBetweenTwoPoints3D(sample_point, reference_point):
    return sqrt(pow(sample_point[0] - reference_point[0], 2) + pow(sample_point[1] - reference_point[1], 2) + pow(sample_point[2] - reference_point[2], 2))

def standard_deviation(distances_2D, sample_distance_mean_2D, distances_3D, sample_distance_mean_3D, sample_count):
    squared_distances_2D = []
    squared_distances_3D = []
    for distance in distances_2D:
        squared = pow(distance - 0, 2)
        squared_distances_2D.append(squared)
    for distance in distances_3D:
        squared = pow(distance - 0, 2)
        squared_distances_3D.append(squared)
    std2D = sqrt(sum(squared_distances_2D) / sample_count)
    std3D = sqrt(sum(squared_distances_3D) / sample_count)
    return std2D, std3D

def evaluateData(filename, reference_point):
    sample_x_mean = 0.0
    sample_y_mean = 0.0
    sample_z_mean = 0.0
    distances_2D = []
    distances_3D = []
    sample_count = get_sample_count(filename)
    with open(filename) as file:
        for line in file:
            x = float(line.split(',')[0])
            y = float(line.split(',')[1])
            z = float(line.split(',')[2])
            # Add individual coordinates to the means
            sample_x_mean += x
            sample_y_mean += y
            sample_z_mean += z
            # Calculate the 2D and 3D distances and add them to the distance lists
            sample_point = [x, y, z]
            sample_distance_2D = distanceBetweenTwoPoints2D(sample_point, reference_point)
            sample_distance_3D = distanceBetweenTwoPoints3D(sample_point, reference_point)
            distances_2D.append(sample_distance_2D)
            distances_3D.append(sample_distance_3D)
    sample_x_mean /= sample_count
    sample_y_mean /= sample_count
    sample_z_mean /= sample_count
    sample_distance_mean_2D = sum(distances_2D) / sample_count
    sample_distance_mean_3D = sum(distances_3D) / sample_count
    std2D, std3D = standard_deviation(distances_2D, sample_distance_mean_2D, distances_3D, sample_distance_mean_3D, sample_count)
    return sample_count, sample_x_mean, sample_y_mean, sample_z_mean, sample_distance_mean_2D, sample_distance_mean_3D, std2D, std3D

if __name__ == "__main__":
    filename = sys.argv[1]
    direction = filename.split('(')[0]
    x_reference = float((filename.split('(')[1].split(')')[0].split('_')[0]).replace(',', '.'))
    y_reference = float((filename.split('(')[1].split(')')[0].split('_')[1]).replace(',', '.'))
    z_reference = float((filename.split('(')[1].split(')')[0].split('_')[2]).replace(',', '.'))
    reference_point = [x_reference, y_reference, z_reference]
    print("\n")
    sample_count, x_mean, y_mean, z_mean, distance_mean_2D, distance_mean_3D, std2D, std3D = evaluateData(filename, reference_point)
    print("DIRECTION: {}, SAMPLE COUNT: {}".format(direction, sample_count))
    print("X REFERENCE: {}, Y REFERENCE: {}, Z REFERENCE: {}".format(x_reference, y_reference, z_reference))
    print("X MEAN: {}, Y MEAN: {}, Z MEAN: {}".format(x_mean, y_mean, z_mean))
    print("DISTANCE MEAN 2D: {}, DISTANCE MEAN 3D: {}".format(distance_mean_2D, distance_mean_3D))
    print("STD2D: {}, STD3D: {}".format(std2D, std3D))
    print("\n")
Can anybody prove me right or wrong please?
Regards
Two things:
1) If you're already calculating the distance between the reference point and your observation point in distanceBetweenTwoPoints[23]D(), you wouldn't want to use the reference point as the mean. That's already sort of baked into the calculation.
2) statistics.stdev() will calculate the standard deviation for you.
The code appears to correctly calculate the distances to the reference point, and then calculate the root mean square of these distances.
(There is a minor efficiency issue with evaluating the square root when applying Pythagoras' theorem only to square the distances again during the RMS calculation, but for 300 points it is not worth worrying about.)
If the stated aim is to calculate the standard deviation of the distances, then you do need -- as you have indicated -- to subtract off the mean of the distances before you evaluate the root mean square, because the standard deviation is not the root mean square of the distances themselves, but instead the root mean square of the deviation from the mean. This is not what you have calculated.
Which of these is more appropriate will depend on what you are trying to measure. Suppose for example that all of your estimates are at the same (non-zero) distance from the reference point. The root mean square of those distances will be equal to that value. But the standard deviation of the distances will be equal to zero because it is a measure of spread in the variable concerned, and all the values (distance from the reference point) are equal.
If you are trying to evaluate the accuracy of the measurements, then the root mean square distance (rather than the standard deviation) is probably of greater relevance.
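To make the distinction concrete, here is a small sketch with made-up numbers (not the asker's data) contrasting the two quantities:

import numpy as np

# hypothetical distances from four estimates to the reference point
distances = np.array([0.9, 1.1, 1.0, 1.2])

rms = np.sqrt(np.mean(distances ** 2))   # root mean square of the distances (overall error)
std = np.std(distances)                  # spread of the distances around their own mean

print(rms, std)
# If every estimate were exactly 1.0 away from the reference point,
# rms would be 1.0 while std would be 0.0.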

Vectorized implementation of kmeans++

I have implemented the kmeans++ algorithm to initialize clusters when performing K-means clustering. The loop has to run k times. I was wondering if there was any way to vectorize the algorithm to get it to run faster?
points is an array of points in d-dimensions and k is the number of centroids to return.
It works by calculating each point's minimum distance to the centroids found so far, and then using those distances to form the probability of choosing the next centroid from the points.
The issue is really that it scales badly when k is large.
def init_plus_plus(points, k):
    centroids = np.zeros_like(points[:k])
    r = np.random.randint(0, points.shape[0])
    centroids[0] = points[r]
    for i in range(1, k):
        min_distances = self.euclidian_distance(centroids[:i], points).min(1)
        prob = min_distances / min_distances.sum()
        cs = np.cumsum(prob)
        idx = np.sum(cs < np.random.rand())
        centroids[i] = points[int(idx)]
    return centroids
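One way to avoid recomputing distances to every chosen centroid on each iteration is to keep a running minimum distance. A minimal sketch of that idea (assuming scipy is available; scipy.spatial.distance.cdist stands in for the class's euclidian_distance helper):

import numpy as np
from scipy.spatial.distance import cdist

def init_plus_plus_running_min(points, k, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    centroids = np.empty((k, points.shape[1]))
    centroids[0] = points[rng.integers(points.shape[0])]
    # distance from every point to its nearest centroid chosen so far
    min_distances = cdist(points, centroids[:1]).ravel()
    for i in range(1, k):
        prob = min_distances / min_distances.sum()
        idx = rng.choice(points.shape[0], p=prob)
        centroids[i] = points[idx]
        # only the distances to the newly added centroid need to be computed
        new_dist = cdist(points, centroids[i:i + 1]).ravel()
        min_distances = np.minimum(min_distances, new_dist)
    return centroids

Each iteration then computes distances against a single centroid instead of i of them, which helps when k is large. (The weighting here follows the original code in using plain distances rather than the squared distances used in the canonical kmeans++ formulation.)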

Instance IOU fast calculation on large image

I have a boolean instance mask of shape (448, 1000, 1000) for 448 instances; the average instance is around 100 pixels.
I also have a prediction matrix of shape (1000, 1000) that encodes instances as integers, i.e. if the matrix predicts 500 instances, np.unique(pred) will have 501 values (500 classes + 1 background).
I need to calculate the IoU (Jaccard index) for each prediction/mask pair and find the maximum IoU. I have written the code below, but it's super slow and inefficient.
c = 0            # intersection count
u = 0            # union count
pred_used = []   # record of predictions already used
# loop over every ground-truth mask
for idx_m in range(len(mask[:, 0, 0])):
    m = mask[idx_m, :, :]        # take one mask
    intersect_list = []
    union_list = []
    # loop over every prediction
    for idx_pred in range(1, int(np.max(pred)) + 1):
        p = (pred == idx_pred)   # take one prediction mask
        intersect = np.sum(m.ravel() * p.ravel())                        # calculate the intersection
        union = np.sum(m.ravel() + p.ravel() - m.ravel() * p.ravel())    # calculate the union
        intersect_list.append(intersect)
        union_list.append(union)
    if np.sum(intersect_list) > 0:
        idx_max_iou = np.argmax(np.array(intersect_list))
        c += intersect_list[idx_max_iou]
        u += union_list[idx_max_iou]
        pred_used.append(idx_max_iou)
So you have an output image of size [1000, 1000], which is the array/tensor predicted by your model.
One of the first things you can do is reshape the labels and predictions from [1000, 1000] to [1000*1000, ]. This reduces the complexity from N^2 to N and should boost the speed significantly.
You can also try the IoU from scikit-learn, which may be a bit faster than your version.
You can find an example here: How to find IoU from segmentation masks?
Doc: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html
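If the ground-truth instances do not overlap, another option (my own sketch, not from the answer above) is to collapse the boolean masks into a single label image and get every intersection at once from a 2-D histogram, which removes the Python double loop entirely:

import numpy as np

def pairwise_iou(mask, pred):
    """mask: (n_gt, H, W) boolean, pred: (H, W) integer labels with 0 = background."""
    n_gt = mask.shape[0]
    n_pred = int(pred.max())
    # collapse the boolean instance masks into one label image (0 = background);
    # this assumes the ground-truth instances do not overlap
    gt = np.zeros(pred.shape, dtype=np.int64)
    for idx in range(n_gt):
        gt[mask[idx]] = idx + 1
    # joint histogram of (gt label, pred label) pairs = all pairwise intersections
    edges_gt = np.arange(n_gt + 2)
    edges_pred = np.arange(n_pred + 2)
    inter, _, _ = np.histogram2d(gt.ravel(), pred.ravel(), bins=(edges_gt, edges_pred))
    area_gt = inter.sum(axis=1, keepdims=True)
    area_pred = inter.sum(axis=0, keepdims=True)
    union = area_gt + area_pred - inter
    iou = np.where(union > 0, inter / np.maximum(union, 1), 0.0)
    return iou[1:, 1:]   # drop the background row and column

# usage: best-matching prediction for each ground-truth instance
# iou = pairwise_iou(mask, pred); best = iou.argmax(axis=1); best_iou = iou.max(axis=1)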

Spectral embedding - spectral clustering

I'm trying to perform spectral embedding/clustering using Normalized Cuts. I wrote the following code, but I am stuck at a logical bottleneck. What do I have to do after clustering the eigenvectors? I don't know how to form the clusters on my original dataset. (A is my affinity matrix.)
import numpy as np
from scipy.cluster.vq import kmeans2

D = np.diag(np.sum(A, 0))
D_half_inv = np.diag(1.0 / np.sqrt(np.sum(A, 0)))
M = np.dot(D_half_inv, np.dot((D - A), D_half_inv))
# compute eigenvectors and eigenvalues
(w, v) = np.linalg.eigh(M)
# renormalize the eigenvectors to have norm 1
var = len(w)
v1 = np.array(np.zeros((var, var)))
for j in range(var):
    v[:][j] = v[:][j] / np.sqrt(np.sum(A, 0))
    v[:][j] = v[:][j] / np.linalg.norm(v1[:][j])
v_trailing = v[:, 1:45]   # omit the eigenvector of the smallest eigenvalue (which is 0); 45 is my embedding dimension
k = 20                    # number of clusters
centroids, idx = kmeans2(v_trailing, k)
After that, I get labels for each eigenvector row. But how can I link these labels to my original dataset?
The mapping back to the original dataset is by index: the label of the i-th row of the embedded matrix is the label of the i-th original data point.
So if y_i is in C_m, then the i-th entry of A (the i-th original point) belongs to A_m.
Or, to put it another way:
Let C_1, ..., C_M be the clusters obtained by clustering the eigenvector rows. The clusters you want are A_1, ..., A_M, where A_i = { j | y_j is in C_i }.
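In code terms, a minimal sketch (assuming data is the original dataset whose rows, in the same order, were used to build the affinity matrix A):

import numpy as np
from scipy.cluster.vq import kmeans2

centroids, idx = kmeans2(v_trailing, k)                 # idx[i] is the cluster label of sample i
clusters = [np.where(idx == m)[0] for m in range(k)]    # original-sample indices per cluster

# e.g. the original points assigned to cluster 0:
# cluster0_points = data[clusters[0]]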
