I am trying to implement the k-means algorithm in Python using cosine distance instead of Euclidean distance as the distance metric.
I understand that using a different distance function can be fatal and should be done carefully. Using cosine distance as the metric forces me to change the averaging function (the average consistent with cosine distance has to be an element-by-element average of the normalized vectors).
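For concreteness, this is the kind of centroid update I have in mind (just my own sketch, not code from any library):

import numpy as np

def cosine_mean(points):
    # Element-by-element average of the normalized vectors,
    # renormalized so the centroid lies back on the unit sphere.
    unit = points / np.linalg.norm(points, axis=1, keepdims=True)
    mean = unit.mean(axis=0)
    return mean / np.linalg.norm(mean)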
I have seen this elegant solution for manually overriding the distance function of sklearn, and I want to use the same technique to override the averaging part of the code, but I couldn't find it.
Does anyone know how it can be done?
How critical is it that the distance metric doesn't satisfy the triangle inequality?
If anyone knows of a different efficient implementation of k-means that uses the cosine metric, or that accepts custom distance and averaging functions, that would also be really helpful.
Thank you very much!
Edit:
After switching to the angular distance instead of cosine distance, the code looks something like this:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import k_means_  # private module; this relies on an older scikit-learn layout

def KMeans_cosine_fit(sparse_data, nclust=10, njobs=-1, randomstate=None):
    # Manually override the Euclidean distance used internally by KMeans
    def euc_dist(X, Y=None, Y_norm_squared=None, squared=False):
        # return pairwise_distances(X, Y, metric='cosine', n_jobs=10)
        return np.arccos(cosine_similarity(X, Y)) / np.pi
    k_means_.euclidean_distances = euc_dist
    kmeans = k_means_.KMeans(n_clusters=nclust, n_jobs=njobs, random_state=randomstate)
    _ = kmeans.fit(sparse_data)
    return kmeans
I noticed (working through the math) that if the vectors are normalized, the standard average works well for the angular metric. As far as I understand, I would have to change _mini_batch_step() in k_means_.py, but the function is pretty complicated and I couldn't figure out how to do it.
Does anyone know of an alternative solution?
Or does anyone know how I can replace this function with one that always forces the centroids to be normalized?
So it turns out you can just normalise X to unit length and use K-means as normal. The reason is that if x1 and x2 are unit vectors, then

||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2 x1·x2
             = 2 - 2 x1·x2
             = 2 (1 - cos(x1, x2))

so the term inside the brackets in the last line is the cosine distance, and minimising the squared Euclidean distance between unit vectors is equivalent to minimising their cosine distance.
So in terms of using k-means, simply do:
import numpy as np
from sklearn.cluster import KMeans

length = np.sqrt((X**2).sum(axis=1))[:, None]
X = X / length
kmeans = KMeans(n_clusters=10, random_state=0).fit(X)
And if you need the centroids and distance matrix do:
len_ = np.sqrt(np.square(kmeans.cluster_centers_).sum(axis=1)[:,None])
centers = kmeans.cluster_centers_ / len_
dist = 1 - np.dot(centers, X.T) # K x N matrix of cosine distances
Notes:
I just realised that you are trying to minimise the distance between the mean vector of the cluster and its constituents. The mean vector has length less than one when you simply average several unit vectors. But in practice it's still worth running the normal sklearn algorithm and checking the lengths of the mean vectors. In my case the mean vectors were close to unit length (averaging around 0.9, but this depends on how dense your data is).
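For example, that check can be done directly on the fitted kmeans object from the snippet above:

import numpy as np

# Lengths of the (unnormalized) cluster centers; values close to 1 mean
# the plain average is a reasonable proxy for the spherical mean.
center_lengths = np.linalg.norm(kmeans.cluster_centers_, axis=1)
print(center_lengths.min(), center_lengths.mean())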
TLDR: Use the spherecluster package, as @σηγ pointed out.
You can normalize your data and then use KMeans.
from sklearn import preprocessing
from sklearn.cluster import KMeans
kmeans = KMeans().fit(preprocessing.normalize(X))
Unfortunately no.
sklearn's current implementation of k-means only uses Euclidean distance.
The reason is that k-means alternates between assigning each sample to the closest center and recomputing the centers, and only for (squared) Euclidean distance is the arithmetic mean actually the optimal center of the samples.
If you want to use k-means with cosine distance, you need to write your own function or class, or try another clustering algorithm such as DBSCAN.
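If you do roll your own, a minimal sketch of k-means with cosine distance could look like this (illustration only, assumes a dense array with non-zero rows, and makes no claims about efficiency):

import numpy as np

def cosine_kmeans(X, k, n_iter=100, seed=0):
    # Work on unit vectors so cosine similarity is a plain dot product.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: most similar center = smallest cosine distance.
        labels = np.argmax(X @ centers.T, axis=1)
        # Update step: average each cluster's members and renormalize.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centers[j] = c / np.linalg.norm(c)
    return centers, labels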
Related
I performed k-means clustering with a variety of k values and got the inertia for each k (inertia being, as I understand it, the sum of squared distances from each sample to its nearest cluster center):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 30)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k).fit(trialsX)
    inertias.append(km.inertia_)
plt.plot(ks, inertias)
Based on my reading, the optimal k value lies at the 'elbow' of this plot, but calculating the elbow has proven elusive. How can you programmatically use this data to calculate k?
I'll post this, because it's the best I have come up with thus far:
It seems like using some threshold scaled to the range of the first derivative along the curve might do a good job. This can be done by fitting a spline:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

y_spl = UnivariateSpline(ks, inertias, s=0, k=4)
x_range = np.linspace(ks[0], ks[-1], 1000)
y_spl_1d = y_spl.derivative(n=1)
plt.plot(x_range, y_spl_1d(x_range))
Then you could define k as the point, say, 90% of the way up this curve. I would imagine this is a fairly consistent way to do it, but there may be a better option.
EDIT: Two years later, just use np.diff to generate this plot without fitting a spline, then find the point where the slope equals -1. See the comments for more info.
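A rough version of that idea, reusing ks and inertias from the question (the exact scaling is my own choice):

import numpy as np

ks_arr = np.asarray(ks, dtype=float)
inert = np.asarray(inertias, dtype=float)
# Slope of the inertia curve, rescaled so both axes run over [0, 1].
slopes = np.diff(inert) / np.diff(ks_arr)
scaled = slopes * (ks_arr[-1] - ks_arr[0]) / (inert[0] - inert[-1])
# First k at which the normalized slope becomes shallower than -1.
elbow_k = int(ks_arr[1:][scaled > -1][0])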
I wrote my own Shared Nearest Neighbor(SNN) clustering algorithm, according to the original paper. Essentially, I get the nearest neighbors for each data point, precompute the distance matrix with Jaccard distance, and pass the distance matrix to DBSCAN.
To accelerate the algorithm, I only compute the Jaccard distance between two data points if they are nearest neighbors of each other and have over a certain number of shared neighbors. I also take advantage of the symmetry of the distance matrix, as I only compute half the matrix.
However, my algorithm is slow and takes much longer than common clustering algorithms such as K-means or DBSCAN. Can someone look at my code and suggest how I can improve it and make the algorithm faster?
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

def jaccard(a, b):
    """
    Computes the Jaccard distance between two arrays of neighbor indices.

    Parameters
    ----------
    a: an array of indices.
    b: an array of indices.
    """
    A = np.array(a, dtype='int')
    B = np.array(b, dtype='int')
    A = A[A > -1]  # keep only valid (non-negative) indices
    B = B[B > -1]
    union = np.union1d(A, B)
    intersection = np.intersect1d(A, B)
    return 1.0 - len(intersection) * 1.0 / len(union)
def iterator_dist(indices, k_min=5):
    """
    An iterator that yields the Jaccard distance for pairs of stars (n, m)
    where m is in n's neighbor list and the pair shares more than k_min
    neighbors.

    Parameters
    ----------
    indices: the indices of nearest neighbors in the chemistry-velocity
        space.
    k_min: minimum number of shared neighbors required before the
        Jaccard distance is computed.
    """
    for n in range(len(indices)):
        for m in indices[n][indices[n] > n]:
            if len(np.intersect1d(indices[n], indices[m])) > k_min:
                dist = jaccard(indices[n], indices[m])
                yield (n, m, dist)
# load data here
data =

# hyperparameters
n_neighbors =
eps =
min_samples =
k_min =

# K Nearest Neighbors
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
distances, indices = nbrs.kneighbors()

# distance matrix
S = lil_matrix((len(distances), len(distances)))
for (n, m, dist) in iterator_dist(indices, k_min):
    S[n, m] = dist
    S[m, n] = dist

db = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed',
            n_jobs=-1).fit(S)
labels = db.labels_
Writing fast Python code is hard. The key is to avoid Python-level loops wherever possible and instead use BLAS routines via numpy, or e.g. Cython, which is compiled rather than interpreted code. So at some point you'll need to switch from "real" Python to at least typed Cython code, unless you can find a library that already implements these operations at a low enough level for you.
But the obvious first step is to run a profiler to identify the slow operations!
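For example, something along these lines with the standard-library profiler, assuming iterator_dist, indices and k_min are defined at module level as in your script:

import cProfile
import pstats

# Profile the pair generation and print the functions that dominate runtime.
cProfile.run("list(iterator_dist(indices, k_min))", "snn.prof")
pstats.Stats("snn.prof").sort_stats("cumulative").print_stats(15)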
Secondly, consider avoiding a distance matrix. Anything involving a distance matrix tends to scale as O(n²) unless done very carefully. That is of course much slower than k-means and Euclidean DBSCAN.
Using the following code to cluster geolocation coordinates results in 3 clusters:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2, whiten
coordinates = np.array([
    [lat, long],
    [lat, long],
    ...
    [lat, long]
])
x, y = kmeans2(whiten(coordinates), 3, iter = 20)
plt.scatter(coordinates[:,0], coordinates[:,1], c=y);
plt.show()
Is it right to use Kmeans for location clustering, as it uses Euclidean distance and not Haversine formula as a distance function?
k-means is not a good algorithm to use for spatial clustering, for the reasons you mentioned. Instead, you could do this clustering job using scikit-learn's DBSCAN with the haversine metric and ball-tree algorithm.
This tutorial demonstrates clustering latitude-longitude spatial data with DBSCAN/haversine and avoids all those Euclidean-distance problems:
df = pd.read_csv('gps.csv')
coords = df.as_matrix(columns=['lat', 'lon'])
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
Note that this specifically uses scikit-learn v0.15, as some earlier/later versions seem to require a full distance matrix to be computed. Also notice that the eps value is in radians and that .fit() takes the coordinates in radian units for the haversine metric.
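If it helps, a common way to choose eps here (the 1.5 km is just an example value, not from the tutorial) is to express a physical distance in radians by dividing by the Earth's radius:

kms_per_radian = 6371.0088   # mean Earth radius in kilometres
eps = 1.5 / kms_per_radian   # e.g. cluster points within roughly 1.5 km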
It highly depends on your application:
Around the equator the results should be fairly accurate. Close to one of the poles the results won't be useful at all.
It might, however, work as a pre-processing step or for applications with low precision requirements, e.g. small, non-overlapping and very distinct clusters.
If you really need the Haversine formula, you might want to look into this discussion. As Anony-Mousse says:
Note that Haversine distance is not appropriate for k-means or average-linkage clustering, unless you find a smart way of computing the mean that minimizes variance. Do not use the arithmetic average if you have the -180/+180 wrap-around of latitude-longitude coordinates.
Is there a way to perform sequential k-means clustering using scikit-learn? I can't seem to find a proper way to add new data, without re-fitting all the data.
Thank you
scikit-learn's KMeans class has a predict method that, given some (new) points, determines which of the clusters these points would belong to. Calling this method does not change the cluster centroids.
If you do want the centroids to be changed by the addition of new data, i.e. you want to do clustering in an online setting, use the MiniBatchKMeans estimator and its partial_fit method.
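A minimal sketch of that (X_initial and X_new are placeholders for your own data):

from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=10, random_state=0)
mbk.partial_fit(X_initial)   # fit on the data you have now
# ... later, when new data arrives ...
mbk.partial_fit(X_new)       # centroids are updated, not recomputed from scratch
labels_new = mbk.predict(X_new)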
You can pass in initial values for the centroids with the init parameter to sklearn.cluster.k_means. So then you can just do:
import numpy as np
from sklearn.cluster import k_means

centroids, labels, inertia = k_means(data, k)
new_data = np.vstack([data, extra_pts])  # append the new points as rows
new_centroids, new_labels, new_inertia = k_means(new_data, k, init=centroids, n_init=1)
assuming you're just adding data points and not changing k.
I think this will sometimes mean you get a suboptimal result, but it should usually be faster. You might want to occasionally redo the fit with, say, 10 random seeds and take the best one.
It's also relatively easy to write your own function that finds out which centroid is closest to a point that you are considering. Assuming you have some matrix X that is ready for kmeans:
centroids, labels, inertia = cluster.k_means(X, 5)
def pred(arr):
return np.argmin([np.linalg.norm(arr-b) for b in centroids])
You can confirm that this works via:
[pred(X[i]) == labels[i] for i in range(len(X))]
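If X is large, the same check can be vectorized with scipy's cdist (just an alternative way to write it):

import numpy as np
from scipy.spatial.distance import cdist

# Nearest centroid (Euclidean) for every row of X at once.
pred_all = np.argmin(cdist(X, centroids), axis=1)
print(np.array_equal(pred_all, labels))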
I am implementing the k-means algorithm from scratch in Python and on Spark. Actually, it is my homework. The problem is to implement k-means with predefined centroids and different initialization methods, one of them being random initialization (c1) and the other k-means++ (c2). It is also required to use two distance metrics, Euclidean distance and Manhattan distance, defined as follows:

Euclidean: d(x, c) = sqrt(Σ_i (x_i − c_i)²),  cost = Σ_x d(x, c(x))²
Manhattan: d(x, c) = Σ_i |x_i − c_i|,         cost = Σ_x d(x, c(x))

The second formula in each line is the corresponding cost function, which is to be minimized. I have implemented both of them, but I think there is a problem. This is the graph of the cost function per iteration of k-means under the different settings:
The first graph looks fine, but the second one seems to have a problem, because as far as I know the cost of k-means must decrease after each iteration. So what is the problem? Is it my code or the formula?
And these are my functions for computing distances and costs:
import numpy as np

def Euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

def Manhattan_distance(point1, point2):
    return np.sum(np.absolute(point1 - point2))

def cost_per_point(point, center, cost_type='E'):
    if cost_type == 'E':
        return Euclidean_distance(point, center) ** 2
    else:
        return Manhattan_distance(point, center)
And here is my full code on GitHub:
https://github.com/mrasoolmirzaei/My-Data-Science-Projects/blob/master/Implementing%20Kmeans%20With%20Spark.ipynb
K-means does not minimize distances.
It minimizes the sum of squares (which is not a metric).
If you assign points to the nearest cluster by Euclidean distance, it will still minimize the sum of squares, not Euclidean distances. In particular, the sum of Euclidean distances may increase.
Minimizing Euclidean distances is the Weber problem. The mean is not optimal. You need the geometric median (which is harder to compute) to minimize Euclidean distances.
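For reference, the geometric median has no closed form, but it can be approximated iteratively, e.g. with Weiszfeld's algorithm (a rough sketch, not production code):

import numpy as np

def geometric_median(points, n_iter=100, eps=1e-8):
    # Weiszfeld iteration: repeatedly re-weight points by 1/distance to the
    # current estimate; this minimizes the sum of Euclidean distances.
    y = points.mean(axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(points - y, axis=1), eps)
        w = 1.0 / d
        y = (points * w[:, None]).sum(axis=0) / w.sum()
    return y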
If you assign points with Manhattan distance, it is not clear what is being minimized... You have two competing objectives. While I would assume that it will still converge, that may be tricky to prove, because using the mean may increase the sum of Manhattan distances.
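A tiny one-dimensional illustration of that last point (my own example):

import numpy as np

x = np.array([0.0, 0.0, 10.0])
print(np.abs(x - x.mean()).sum())      # ~13.33: Manhattan cost of the mean
print(np.abs(x - np.median(x)).sum())  # 10.0: the component-wise median does better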
I think I posted a counterexample for k-means minimizing Euclidean distance here at SO or stats.SE some time ago. So your code and analysis may even be fine - it is the assignment that is flawed.