I am implementing the k-means algorithm from scratch in Python on Spark; it is a homework assignment. The task is to implement k-means with predefined centroids using two different initialization methods: random initialization (c1) and k-means++ (c2). It is also required to support two distance metrics, Euclidean distance and Manhattan distance. The formulas for both are introduced as follows:
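In the notation of the code below, they are:

$d_E(x, c) = \sqrt{\sum_i (x_i - c_i)^2}, \qquad \phi_E = \sum_{x} d_E(x, c(x))^2$

$d_M(x, c) = \sum_i |x_i - c_i|, \qquad \phi_M = \sum_{x} d_M(x, c(x))$

where $c(x)$ is the centroid assigned to point $x$.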
The second formula in each case is the corresponding cost function, which is to be minimized. I have implemented both, but I think there is a problem. These are the graphs of the cost function per iteration of k-means under the different settings:
The first graph looks fine, but the second one seems to have a problem because, as far as I know, the cost of k-means must decrease after each iteration. So what is the problem? Is it my code or the formula?
And these are my functions for computing distances and cost:
import numpy as np

def Euclidean_distance(point1, point2):
    # L2 (Euclidean) distance between two points
    return np.sqrt(np.sum((point1 - point2) ** 2))

def Manhattan_distance(point1, point2):
    # L1 (Manhattan) distance between two points
    return np.sum(np.absolute(point1 - point2))

def cost_per_point(point, center, cost_type='E'):
    # Squared Euclidean distance in the Euclidean setting,
    # plain Manhattan distance in the Manhattan setting
    if cost_type == 'E':
        return Euclidean_distance(point, center) ** 2
    else:
        return Manhattan_distance(point, center)
And here is my full code on GitHub:
https://github.com/mrasoolmirzaei/My-Data-Science-Projects/blob/master/Implementing%20Kmeans%20With%20Spark.ipynb
K-means does not minimize distances.
It minimizes the sum of squares (which is not a metric).
If you assign points to the nearest cluster by Euclidean distance, it will still minimize the sum of squares, not the sum of Euclidean distances. In particular, the sum of Euclidean distances may increase.
Minimizing Euclidean distances is the Weber problem. The mean is not optimal; you need the geometric median (which is harder to compute) to minimize Euclidean distances.
If you assign points with Manhattan distance, it is not clear what is being minimized: you have two competing objectives. While I would assume that it will still converge, that may be tricky to prove, because using the mean may increase the sum of Manhattan distances (the component-wise median, not the mean, minimizes that sum; that variant is k-medians).
I think I posted a counterexample for k-means minimizing Euclidean distance here on SO or stats.SE some time ago. So your code and analysis may even be fine; it is the assignment that is flawed.
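A quick toy illustration (1-D, not from the original post): the mean minimizes the sum of squared distances, while the median gives a smaller sum of absolute distances:

import numpy as np

x = np.array([0.0, 0.0, 10.0])      # toy 1-D "cluster"
mean, median = x.mean(), np.median(x)

# Sum of squared distances: smallest at the mean
print(((x - mean) ** 2).sum())      # ~66.67
print(((x - median) ** 2).sum())    # 100.0

# Sum of absolute distances: smallest at the median, not the mean
print(np.abs(x - mean).sum())       # ~13.33
print(np.abs(x - median).sum())     # 10.0

So moving the center from the median to the mean (which is what the k-means update step does) lowers the sum of squares but raises the sum of distances.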
Related
Does anyone know a fast way to find the line closest to a set of points in Python? (The line should always cross the origin, in other words f(0) = 0.)
Given the equation of the line y = mx, I want to find the m that minimizes the total distance to every point in the set.
The image above is an example; the line should be as close as possible to all the points. I tried doing this using scipy.optimize.minimize_scalar, but the performance was not good enough. I wonder if there is a faster algorithm or an analytical way of doing this.
The distance formula can be found here: https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line. You want to minimize the sum of the distances, but the distance formula contains an absolute value, which creates a discontinuity in the first derivative and can prevent numerical solvers from finding the solution. Alternatively, you could minimize the sum of the squared distances, which is smooth. However, that is not the same as minimizing the sum of the distances.
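If you do go with the squared-distance objective, there is a closed form: the best line through the origin lies along the principal eigenvector of the 2x2 scatter matrix of the points (total least squares). A minimal sketch with made-up example data:

import numpy as np

# Hypothetical example data: rows are (x, y) points
pts = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]])

# The line y = m*x through the origin that minimizes the sum of *squared*
# perpendicular distances lies along the principal eigenvector of the
# scatter matrix M = sum_i p_i p_i^T.
M = pts.T @ pts
eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
vx, vy = eigvecs[:, -1]                # eigenvector of the largest eigenvalue
m = vy / vx
print("slope:", m)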
Suppose I have a k x n array of data whose columns are vectors, and a distance function defined on these vectors. How do I convert the k x n array into another array of the same shape such that the Euclidean norm between the converted vectors equals the distance given by that distance function? I know you can directly calculate the distance matrix for the data with the given distance function and derive coordinates in R^k from it, but this method is really expensive, especially when the distance function has complexity O(n^2) or more. So I wonder if there is any simpler algorithm to do that.
It sounds like you are describing multidimensional scaling (MDS). One way to do it in Python is with scikit-learn's sklearn.manifold.MDS.
MDS expects the N x N distance (or "dissimilarity") matrix as input, so it doesn't get around the cost of evaluating the distance function. The distance matrix is unavoidably needed for this conversion, so if the distance function itself is expensive, the best you can do is reduce the number of samples or look for a way to compute fast approximate distances. Also, beware that MDS is usually only approximate: a numerical optimization looks for the best fit of Euclidean norms to the given distances.
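A minimal sketch of that workflow (the data and the my_distance function here are placeholders, not from the question):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def my_distance(u, v):
    # placeholder for the expensive custom distance function
    return np.sum(np.abs(u - v))

X = np.random.rand(50, 4)                     # 50 sample vectors
D = squareform(pdist(X, metric=my_distance))  # N x N dissimilarity matrix

# Embed so that Euclidean distances approximate the entries of D
mds = MDS(n_components=4, dissimilarity='precomputed', random_state=0)
X_embedded = mds.fit_transform(D)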
I have a set of points p and I need to transform them so that they align with another given set of points q (i.e., find the transform T from source to target).
So far this is an easy problem. My problem is that I have some freedom in aligning these points, i.e., I only have to keep the alignment error below some given threshold (alpha), not minimize it. I want to exploit this alignment freedom to minimize the distances between p and a different set of points r. I denote the vectors to be minimized as E = Tp - r.
So basically I want to use the first alignment as a hard constraint and minimize over another set of correspondences (I attached a picture): I want to minimize |E| (the green distances) under the constraint that the black points stay within the red circles (radius alpha) after applying the transformation T.
I tried some heuristic solutions, like calculating the maximum allowed rotation around the centroid and only then taking the maximum allowed translation, but none of these solutions guarantees the optimal solution.
Have you heard of Lagrange multipliers (constrained optimization)?
Here's the corresponding article.
You minimize a cost function (in your case |E|) under certain inequality constraints and equality constraints (in your case there are no equality constraints).
This may be an approach to your problem:
Step 1: Build the augmented cost function: |E| + L * (|Tp - q| - alpha), with L >= 0.
Step 2: Find the partial derivatives with respect to T and L.
Step 3: Solve for the zeros of the partial derivatives.
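In practice you could also hand the problem to a numerical constrained solver. Below is a minimal sketch using scipy.optimize.minimize with SLSQP, assuming a 2-D rigid transform (rotation + translation) and made-up data for p, q, r and alpha:

import numpy as np
from scipy.optimize import minimize

# Hypothetical example: Tp must stay within alpha of q, while being pulled towards r
rng = np.random.default_rng(0)
p = rng.random((5, 2))
q = p + 0.05 * rng.standard_normal((5, 2))   # alignment targets
r = p + 0.3 * rng.standard_normal((5, 2))    # points we would like to reach
alpha = 0.1

def transform(params, pts):
    theta, tx, ty = params
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return pts @ R.T + np.array([tx, ty])

def objective(params):
    # |E|: total distance from the transformed points to r
    return np.sum(np.linalg.norm(transform(params, p) - r, axis=1))

def constraint(params):
    # must stay >= 0: every transformed point remains within alpha of its q
    return alpha - np.linalg.norm(transform(params, p) - q, axis=1)

res = minimize(objective, x0=[0.0, 0.0, 0.0],
               constraints=[{'type': 'ineq', 'fun': constraint}],
               method='SLSQP')
print(res.x)   # fitted (theta, tx, ty)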
I've implemented a k-means algorithm, and performance is highly dependent on how the centroids were initialized. I'm finding that random uniform initialization gives a good k-means result about 5% of the time, whereas with k-means++ it's closer to 50%. Why is the yield of good k-means runs so low? I should disclaim that I've only used a handful of data sets, and my good/bad rates are indicative of only those, not of k-means++ broadly.
Here's an example using k-means++ where the end result was not great. The Dunn Index of this clustering is 0.16.
And an example where it worked perfectly with a Dunn Index of 0.67.
I was maybe under the naive impression that k-means++ produced a good k-means result every time. Is there perhaps something wrong with my code?
from math import inf
from random import choice

def initialize_centroids(points, k):
    """
    Parameters:
        points : a list of Points.
        k : how many centroids to place.
    Returns:
        A list of centroids.
    """
    clusters = []
    clusters.append(choice(points))  # first centroid is a random point
    for _ in range(k - 1):           # for the other centroids
        distances = []
        for p in points:
            d = inf
            for c in clusters:       # find the minimal distance between p and the chosen centroids
                d = min(d, distance(p, c))   # distance() is defined elsewhere in my code
            distances.append(d)
        # pick the point whose minimal distance is the largest
        clusters.append(points[distances.index(max(distances))])
    return clusters
This is adapted from the algorithm as found on Wikipedia:
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
4. Repeat steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
The difference is that my code always picks the point at the furthest distance deterministically, rather than choosing randomly with probability proportional to the squared distance (a sketch of the weighted version is below).
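For comparison, here is a minimal sketch of the D(x)²-weighted selection step described above; it reuses my points list and distance() helper, and the numpy dependency is only for the weighted draw:

import numpy as np
from random import choice

def initialize_centroids_pp(points, k):
    """k-means++ seeding with the D(x)^2-weighted random choice."""
    centroids = [choice(points)]                 # first centroid is a random point
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen centroid
        d2 = np.array([min(distance(p, c) for c in centroids) ** 2
                       for p in points])
        idx = np.random.choice(len(points), p=d2 / d2.sum())
        centroids.append(points[idx])
    return centroids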
My intention is to compare the Dunn Index over different values of k; empirically, a higher Dunn Index means better clustering. I can't collect (good) data if it doesn't work half of the time, so my results are skewed due to the faultiness of k-means++ or of my implementation of it.
What other initialization strategies can be employed to get a more consistent result?
I am trying to implement the k-means algorithm in Python using cosine distance instead of Euclidean distance as the distance metric.
I understand that using a different distance function can be fatal and should be done carefully. Using cosine distance as the metric forces me to change the averaging function (the average consistent with cosine distance must be an element-by-element average of the normalized vectors).
I have seen this elegant solution for manually overriding the distance function of sklearn, and I want to use the same technique to override the averaging section of the code, but I couldn't find it.
Does anyone know how it can be done?
How critical is it that the distance metric doesn't satisfy the triangle inequality?
If anyone knows of a different efficient implementation of k-means that uses the cosine metric, or accepts custom distance and averaging functions, that would also be really helpful.
Thank you very much!
Edit:
After switching to angular distance instead of cosine distance, the code looks something like this:
import numpy as np
from sklearn.cluster import k_means_                 # sklearn's (older) private k-means module
from sklearn.metrics.pairwise import cosine_similarity

def KMeans_cosine_fit(sparse_data, nclust=10, njobs=-1, randomstate=None):
    # Manually override the Euclidean distance used internally by sklearn
    def euc_dist(X, Y=None, Y_norm_squared=None, squared=False):
        # return pairwise_distances(X, Y, metric='cosine', n_jobs=10)
        return np.arccos(cosine_similarity(X, Y)) / np.pi   # angular distance
    k_means_.euclidean_distances = euc_dist
    kmeans = k_means_.KMeans(n_clusters=nclust, n_jobs=njobs, random_state=randomstate)
    _ = kmeans.fit(sparse_data)
    return kmeans
I noticed (by doing the math) that if the vectors are normalized, the standard average works well for the angular metric. As far as I understand, I would have to change _mini_batch_step() in k_means_.py, but the function is pretty complicated and I couldn't figure out how to do it.
Does anyone know of an alternative solution?
Or maybe, does anyone know how I can edit this function so that it always forces the centroids to be normalized?
So it turns out you can just normalise X to unit length and use k-means as normal. The reason is that if X1 and X2 are unit vectors, then, looking at the following equation, the term inside the brackets in the last line is the cosine distance.
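The expansion in question is presumably the standard one; for unit vectors $\|X_1\| = \|X_2\| = 1$:

$\|X_1 - X_2\|^2 = \|X_1\|^2 + \|X_2\|^2 - 2\,X_1^\top X_2 = 2\,(1 - \cos(X_1, X_2))$

so minimising the squared Euclidean distance on normalised vectors is equivalent (up to a factor of 2) to minimising the cosine distance.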
So in terms of using k-means, simply do:
import numpy as np
from sklearn.cluster import KMeans

length = np.sqrt((X**2).sum(axis=1))[:, None]
X = X / length                                          # unit-length rows
kmeans = KMeans(n_clusters=10, random_state=0).fit(X)
And if you need the centroids and distance matrix do:
len_ = np.sqrt(np.square(kmeans.cluster_centers_).sum(axis=1)[:,None])
centers = kmeans.cluster_centers_ / len_
dist = 1 - np.dot(centers, X.T) # K x N matrix of cosine distances
Notes:
I just realised that you are trying to minimise the distance between the mean vector of the cluster and its constituents. The mean vector has length less than one when you simply average the normalised vectors. But in practice it's still worth running the normal sklearn algorithm and checking the length of the mean vectors. In my case the mean vectors were close to unit length (averaging around 0.9, but this depends on how dense your data is).
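For example, to check how far the fitted centroids are from unit length (reusing the kmeans object fitted above):

import numpy as np

centre_lengths = np.linalg.norm(kmeans.cluster_centers_, axis=1)
print(centre_lengths.min(), centre_lengths.mean(), centre_lengths.max())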
TLDR: Use the spherecluster package as @σηγ pointed out.
You can normalize your data and then use KMeans.
from sklearn import preprocessing
from sklearn.cluster import KMeans
kmeans = KMeans().fit(preprocessing.normalize(X))
Unfortunately, no.
sklearn's current implementation of k-means only uses Euclidean distance.
The reason is that k-means alternates between computing cluster centers and assigning each sample to the closest center, and only with the (squared) Euclidean distance is the mean of the samples the right notion of "center".
If you want to use k-means with cosine distance, you need to implement it yourself in a function or class, or try another clustering algorithm that accepts arbitrary distance metrics, such as DBSCAN.
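If you do roll your own, a minimal sketch of that idea (assign by cosine similarity and re-normalise each centroid after averaging, i.e. spherical k-means; this is illustrative, not sklearn's API):

import numpy as np

def spherical_kmeans(X, k, n_iter=100, seed=0):
    """Minimal spherical k-means: assign by cosine similarity,
    then re-normalize each centroid to unit length after averaging."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length rows
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = (X @ centers.T).argmax(axis=1)        # most similar center
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centers[j] = c / np.linalg.norm(c)     # keep centroids on the unit sphere
    return labels, centers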