I am using sklearn's k-means clustering to cluster my data. Now I want to have the distance between my clusters, but can't find it. I could calculate the distance between each centroid, but wanted to know if there is a function to get it and if there is a way to get the minimum/maximum/average linkage distance between each cluster. My code is very simple:
km = KMeans(n_clusters = 5, random_state = 1)
km.fit(X_tfidf )
clusterkm = km.cluster_centers_
clusters = km.labels_.tolist()
Thank you!
Unfortunately, you're going to have to compute those distances on the cluster centers yourself. Scikit doesn't provide a method for that right out of the box. Here's a comparable problem setup:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances
X, y = load_iris(return_X_y=True)
km = KMeans(n_clusters = 5, random_state = 1).fit(X)
And how you'd compute the distances:
dists = euclidean_distances(km.cluster_centers_)
And then to get the stats you're interested in, you'll only want to compute on the upper (or lower) triangular corner of the distance matrix:
import numpy as np
tri_dists = dists[np.triu_indices(5, 1)]
max_dist, avg_dist, min_dist = tri_dists.max(), tri_dists.mean(), tri_dists.min()
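The question also asks about minimum/maximum/average linkage distances, which depend on the members of each cluster and not only on the centroids. Below is a minimal sketch, assuming plain Euclidean distance and dense data; `linkage_distances` is a hypothetical helper name, not a scikit-learn function. It builds the full pairwise distance block for each pair of clusters, so it is only suitable for modest cluster sizes.

```python
import numpy as np

def linkage_distances(X, labels):
    """Return (min, max, mean) inter-cluster distance matrices.

    Entry [i, j] of each matrix summarises all Euclidean distances between
    members of the i-th and j-th cluster (clusters sorted by label).
    `linkage_distances` is an illustrative helper, not part of scikit-learn.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    k = len(ids)
    d_min = np.zeros((k, k))
    d_max = np.zeros((k, k))
    d_avg = np.zeros((k, k))
    for a in range(k):
        for b in range(a + 1, k):
            A = X[labels == ids[a]]
            B = X[labels == ids[b]]
            # All pairwise distances between members of the two clusters
            block = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
            d_min[a, b] = d_min[b, a] = block.min()   # single linkage
            d_max[a, b] = d_max[b, a] = block.max()   # complete linkage
            d_avg[a, b] = d_avg[b, a] = block.mean()  # average linkage
    return d_min, d_max, d_avg

# Tiny hand-checkable example: two clusters of two points each
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
labels_demo = np.array([0, 0, 1, 1])
d_min, d_max, d_avg = linkage_distances(X_demo, labels_demo)
```

With the fitted model from the iris example this would be called as linkage_distances(X, km.labels_). A sparse TF-IDF matrix would need converting to a dense array first, or a sparse-aware distance routine.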
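km.inertia_ is the measure that sklearn's KMeans reports: the sum of squared distances of samples to their closest cluster center. Note that it describes the spread within clusters, not the distance between clusters.
From the sklearn website: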
inertia_: float
Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Say I have the following dataframe stored as a variable called coordinates, where the first few rows look like:
business_lat business_lng business_rating
0 19.111841 72.910729 5.
1 19.111342 72.908387 5.
2 19.111342 72.908387 4.
3 19.137815 72.914085 5.
4 19.119677 72.905081 2.
5 19.119677 72.905081 2.
. . .
. . .
. . .
As you can see, this data is geospatial (it has a lat and a lng) AND every row has an additional value, business_rating, that corresponds to the rating of the business at the lat/lng in that row. I want to cluster the data so that businesses that are nearby and have similar ratings are assigned to the same cluster. Essentially, I need a geospatial clustering with the additional requirement that it must consider the rating column.
I've looked online and can't find much that addresses this: only approaches for strictly geospatial clustering (where the only features to cluster on are lat/lng) or for non-spatial clustering.
I have a simple DBSCAN running below, but when I plot the results of the clustering it does not seem to be doing what I want.
from sklearn.cluster import DBSCAN
import numpy as np
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
Would I be better served trying to tweak the parameters of the DBSCAN, doing some additional processing of the data or using a different approach all together?
The tricky part about clustering two different types of information (location and rating) is determining how they should relate to each other. The question is simple when there is just one domain and you are comparing the same units. My approach would be to look at how to relate rows within a domain and then determine some interaction between the domains. This could be done with scaling options such as the MinMaxScaler mentioned in another answer; however, I think this is a bit heavy-handed, and we can use our knowledge of the domains to cluster better.
Handling Location
Location distance is best handled directly as this has real world meaning that we can precalculate distances for. The meaning of meters apart is direct to what we
You could use the scaling option mentioned in the previous answer but this risks distorting the location data. For example, if you have a long and thin set of locations, MinMaxScaling would give more importance to variation on the thin axis than the long axis. If you are going to use scaling, do it on the computed distance matrix, not on the lat lon themselves.
import numpy as np
from sklearn.metrics.pairwise import haversine_distances
points_in_radians = df[['business_lat','business_lng']].apply(np.radians).values
distances_in_km = haversine_distances(points_in_radians) * 6371
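To illustrate the distortion warned about above, here is a small sketch with made-up planar coordinates in kilometres (an assumption for readability; real lat/lng values would need converting first). The min-max scaling is written out by hand so that the effect is visible.

```python
import numpy as np

# Hypothetical long, thin strip: 100 km along the first axis, 1 km along the second
points_km = np.array([
    [0.0, 0.0],
    [100.0, 0.0],
    [0.0, 1.0],
    [100.0, 1.0],
])

# Per-column min-max scaling, the same transform MinMaxScaler applies by default
col_min = points_km.min(axis=0)
col_max = points_km.max(axis=0)
scaled = (points_km - col_min) / (col_max - col_min)

def dist(a, b):
    return float(np.linalg.norm(a - b))

true_long = dist(points_km[0], points_km[1])   # 100 km apart
true_thin = dist(points_km[0], points_km[2])   # 1 km apart
scaled_long = dist(scaled[0], scaled[1])
scaled_thin = dist(scaled[0], scaled[2])

print(true_long, true_thin)      # real distances differ by a factor of 100
print(scaled_long, scaled_thin)  # after scaling both pairs look equally far apart
```

Scaling the finished distance matrix by a single constant keeps the ratio between those two pairs intact, which is why the matrix is the safer place to rescale.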
Adding in Rating
We can think of the problem through asking a couple of questions that relate rating to distance. We could ask, how different must ratings be to separate observations in the same place? What is the meter difference to rating difference ratio? With an idea of ratio, we can calculate another distance matrix for the rating difference for all observations and use this to scale or add on the original location distance matrix or we could increase the distance for every gap in rating. This location-plus-ratings-difference matrix can then be clustered on.
from sklearn.metrics.pairwise import euclidean_distances
added_km_per_rating_gap = 1
rating_distances = euclidean_distances(df[['business_rating']].values) * added_km_per_rating_gap
We can then simply add these together and cluster on the resulting matrix.
from sklearn.cluster import DBSCAN
distance_matrix = rating_distances + distances_in_km
clustering = DBSCAN(metric='precomputed', eps=1, min_samples=2)
clustering.fit(distance_matrix)
What we have done is cluster by location, adding a penalty for ratings difference. Making that penalty direct and controllable allows for optimisation to find the best clustering.
Testing
The problem I'm finding is that (with my test data at least) DBSCAN has a tendency to 'walk' from observation to observation forming clusters that either blend ratings together because the penalty is not high enough or separates into single rating groups. It might be that DBSCAN is not suitable for this type of clustering. If I had more time, I would look for some open data to test this on and try other clustering methods.
Here is the code I used to test. I used the square of the ratings distance to emphasise larger gaps.
import random
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import euclidean_distances
X, y = make_blobs(n_samples=300, centers=6, cluster_std=0.60, random_state=0)
# First half of the points is rated 1-4, second half 2-5
ratings = np.array([random.randint(1, 4) for _ in range(len(X) // 2)]
                   + [random.randint(2, 5) for _ in range(len(X) // 2)]).reshape(-1, 1)
distances_in_km = euclidean_distances(X)
rating_distances = euclidean_distances(ratings)
def build_clusters(multiplier, eps):
    # Square the rating gap to emphasise larger differences
    rating_addition = (rating_distances ** 2) * multiplier
    distance_matrix = rating_addition + distances_in_km
    clustering = DBSCAN(metric='precomputed', eps=eps, min_samples=10)
    clustering.fit(distance_matrix)
    return clustering.labels_
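The 'walking' described above is the chaining that density-based methods are prone to. One alternative worth trying is complete-linkage hierarchical clustering on the same kind of precomputed matrix, because it bounds the largest distance inside each cluster. This is a swapped-in technique, not something the answer tested. Below is a sketch using SciPy, where the synthetic data and the cut-off `max_gap` are chosen purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Synthetic stand-ins: planar positions in km and integer ratings 1-5
positions_km = rng.uniform(0, 10, size=(60, 2))
ratings = rng.integers(1, 6, size=60).astype(float)

# Location distance plus a penalty per rating gap (same idea as above)
added_km_per_rating_gap = 1.0
loc_dist = np.linalg.norm(positions_km[:, None, :] - positions_km[None, :, :], axis=-1)
rating_dist = np.abs(ratings[:, None] - ratings[None, :]) * added_km_per_rating_gap
distance_matrix = loc_dist + rating_dist

# linkage() expects the condensed (upper-triangle) form of the matrix
condensed = squareform(distance_matrix, checks=False)
tree = linkage(condensed, method="complete")

# Cut the tree so that no two members of a cluster are more than max_gap apart
max_gap = 3.0  # illustrative threshold in combined km units
labels = fcluster(tree, t=max_gap, criterion="distance")
```

Because the cut is expressed in the same combined units as the matrix, the rating penalty and the threshold can be tuned together.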
Using the DBSCAN methodology, we can calculate the distance between points (the Euclidean distance or some other distance) and look for points which are far away from others. You may want to consider using the MinMaxScaler to normalize values, so one feature doesn't overwhelm other features.
Where is your code and what are your final results? Without an actual code sample, I can only guess what you are doing.
I hacked together some sample code for you. You can see the results below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()
import csv
df = pd.read_csv('C:\\your_path_here\\business.csv')
X=df.loc[:,['review_count','latitude','longitude']]
K_clusters = range(1,10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
Y_axis = df[['latitude']]
X_axis = df[['longitude']]
score = [kmeans[i].fit(Y_axis).score(Y_axis) for i in range(len(kmeans))]
# Visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
kmeans = KMeans(n_clusters = 3, init ='k-means++')
geo_cols = ['latitude', 'longitude'] # Cluster on the coordinates that are plotted below, not on review_count.
kmeans.fit(X[geo_cols]) # Compute k-means clustering.
X['cluster_label'] = kmeans.fit_predict(X[geo_cols])
centers = kmeans.cluster_centers_ # Coordinates of cluster centers.
labels = kmeans.predict(X[geo_cols]) # Labels of each point
X.head(10)
X.plot.scatter(x = 'latitude', y = 'longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
from scipy.stats import zscore
df["zscore"] = zscore(df["review_count"])
df["outlier"] = df["zscore"].apply(lambda x: x <= -2.5 or x >= 2.5)
df[df["outlier"]]
df_cord = df[["latitude", "longitude"]]
df_cord.plot.scatter(x = "latitude", y = "longitude")
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_cord = scaler.fit_transform(df_cord)
df_cord = pd.DataFrame(df_cord, columns = ["latitude", "longitude"])
df_cord.plot.scatter(x = "latitude", y = "longitude")
from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
    eps = 0.5,
    metric="euclidean",
    min_samples = 3,
    n_jobs = -1)
clusters = outlier_detection.fit_predict(df_cord)
clusters
cmap = plt.get_cmap('Accent') # cm.get_cmap was removed in Matplotlib 3.9
df_cord.plot.scatter(
    x = "latitude",
    y = "longitude",
    c = clusters,
    cmap = cmap,
    colorbar = False
)
The final result looks a little weird, to tell you the truth. Remember, not everything is clusterable.
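Since the stated aim is to find points that are far away from the others, it helps to know that DBSCAN marks such points with the label -1 (noise). Below is a small sketch on made-up, already-scaled coordinates; the data and the `eps` value are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one isolated point (made-up coordinates in [0, 1])
coords = np.array([
    [0.10, 0.10], [0.11, 0.10], [0.10, 0.11], [0.12, 0.12],
    [0.80, 0.80], [0.81, 0.80], [0.80, 0.81], [0.82, 0.82],
    [0.50, 0.05],
])

labels = DBSCAN(eps=0.05, min_samples=3).fit_predict(coords)

# DBSCAN assigns -1 to points that belong to no dense region
outlier_mask = labels == -1
outliers = coords[outlier_mask]
n_clusters = len(set(labels) - {-1})
```

On coordinates min-max scaled to [0, 1], an eps of 0.5 spans half of each axis, so most points are likely to land in a single cluster; a much smaller eps may give a more useful picture.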
I am trying to implement a simple version of spectral clustering using the normalized (random walk) Laplacian matrix in Python. After testing my function with a toy dataset, I found that my Laplacian matrix has negative eigenvalues. Here is my spectral clustering code:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_kernels, euclidean_distances, pairwise_distances
from sklearn.neighbors import NearestNeighbors
def nlapl(W):
    Dinv = 1 / np.sum(W, axis=1)
    Id = np.eye(W.shape[0])
    W = np.multiply(Dinv, W.T).T
    return Id - W
def sc(X, n_clusters, gamma):
    W = pairwise_kernels(X, metric='rbf', gamma=gamma)
    L = nlapl(W)
    lambdas, vs = np.linalg.eigh(L)
    lambdas = lambdas[:n_clusters]
    vs = vs[:,:n_clusters]
    print("lambdas:")
    print(lambdas)
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=20, random_state=0).fit(vs)
    return vs, kmeans
Here is my test code:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs # sklearn.datasets.samples_generator was removed in newer releases
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
vs, kmeans = sc(X, 4, 2)
The function successfully identifies the clusters:
plt.figure()
plt.scatter(X[:,0], X[:,1], c=y, alpha=0.7)
plt.title('True Clusters')
plt.figure()
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, alpha=0.7)
plt.title('Spectral Clustering')
[Figures: two scatter plots, titled "True Clusters" and "Spectral Clustering"]
However, the Laplacian matrix has negative eigenvalues:
lambdas:
[-0.03429643 -0.02670478 -0.01684407 -0.0073953 ]
I'm pretty sure that my problem is in nlapl because if I use the unnormalized laplacian D - W, the eigenvalues are [-4.96328563e-15 5.94245930e-03 1.15181852e-02 1.51614560e-01]. However, I'm having trouble figuring out where my calculation is wrong. Am I missing something obvious? Thank you in advance for any advice.
EDIT: Since my toy dataset has 4 well-separated clusters, the theoretical multiplicity of the zero eigenvalue of L should be 4. However, the apparent multiplicity of zero with the unnormalized Laplacian is 1. Admittedly, some of the purple datapoints (in True Clusters) are pretty close to the other clusters, so maybe this isn't completely unexpected?
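One likely cause, offered as a suggestion: the random-walk Laplacian I - D^-1 W is generally not symmetric, while np.linalg.eigh assumes a symmetric (Hermitian) input and reads only one triangle of the matrix, so the values it returns need not be eigenvalues of L. The symmetric normalized Laplacian I - D^-1/2 W D^-1/2 has the same eigenvalues as the random-walk version, and each of its eigenvectors u maps to a random-walk eigenvector D^-1/2 u. Below is a minimal sketch with the RBF kernel written in NumPy; the function names are hypothetical.

```python
import numpy as np

def rbf_affinity(X, gamma):
    # W[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def rw_laplacian_eig(W, n_vectors):
    """Smallest eigenpairs of L_rw = I - D^-1 W via the symmetric form."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # L_sym = I - D^-1/2 W D^-1/2 is symmetric, so eigh is appropriate here
    L_sym = np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lambdas, U = np.linalg.eigh(L_sym)
    # Eigenvectors of L_rw are D^-1/2 times those of L_sym; eigenvalues match
    V = d_inv_sqrt[:, None] * U[:, :n_vectors]
    return lambdas[:n_vectors], V

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
W = rbf_affinity(X_demo, gamma=2.0)
lambdas, V = rw_laplacian_eig(W, n_vectors=4)

# Check that each pair really satisfies L_rw v = lambda v
L_rw = np.eye(len(W)) - W / W.sum(axis=1, keepdims=True)
residual = np.abs(L_rw @ V - V * lambdas).max()
```

On the EDIT: an RBF affinity is strictly positive for every pair of points, so the graph is a single connected component, and a zero eigenvalue of multiplicity 1 with further small eigenvalues is what one would expect.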
I have used nltk for k-means clustering because I would like to change the distance metric. Does nltk's k-means have an inertia similar to that of sklearn? I can't seem to find it in their documentation or online...
The code below is how people usually find inertia using sklearn k means.
inertia = []
for n_clusters in range(2, 26, 1):
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(features)
    centers = clusterer.cluster_centers_
    inertia.append(clusterer.inertia_)
plt.plot([i for i in range(2,26,1)], inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
You can write your own function to obtain the inertia for k-means clustering in nltk.
This follows on from your earlier question, "How do I obtain individual centroids of K mean cluster using nltk (python)", and uses the same dummy data after making 2 clusters.
Referring to the docs (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), inertia is the sum of squared distances of samples to their closest cluster center.
feature_matrix = df[['feature1','feature2','feature3']].to_numpy()
centroid = df['centroid'].to_numpy()
def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # inertia as given in the scikit docs, i.e. the sum of squared distances
        sum_.append(np.sum((feature_matrix[i] - centroid[i])**2))
    return sum(sum_)
nltk_inertia(feature_matrix, centroid)
# output: 27.495250000000002
# now using sklearn KMeans on feature1, feature2 and feature3 with the same number of clusters (2)
scikit_kmeans = KMeans(n_clusters= 2)
scikit_kmeans.fit(vectors) # vectors = [np.array(f) for f in df.values] which contain feature1, feature2, feature3
scikit_kmeans.inertia_
# output: 27.495250000000006
The previous comment is actually missing a small detail:
feature_matrix = df[['feature1','feature2','feature3']].to_numpy()
centroid = df['centroid'].to_numpy()
cluster = df['predicted_cluster'].to_numpy()
def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        sum_.append(np.sum((feature_matrix[i] - centroid[cluster[i]])**2))
    return sum(sum_)
You have to select the corresponding cluster centroid when calculating distance between centroids and data points. Notice the cluster variable in the above code.
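The same calculation can be vectorised. This sketch assumes the centroids are held in one array of shape (n_clusters, n_features) and the assignments in an integer array; `inertia_from_assignments` is a hypothetical name. Squared Euclidean distance is used to match sklearn's definition, so with a different nltk distance metric the squared term would need swapping accordingly.

```python
import numpy as np

def inertia_from_assignments(features, centroids, assignments):
    """Sum of squared Euclidean distances of samples to their assigned centroid."""
    features = np.asarray(features, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    assignments = np.asarray(assignments)
    # centroids[assignments] picks, for every row, the centroid of its cluster
    diffs = features - centroids[assignments]
    return float(np.sum(diffs ** 2))

# Hand-checkable toy data: squared distances are 1, 1, 4 and 4
features_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
centroids_demo = np.array([[1.0, 0.0], [10.0, 2.0]])
assignments_demo = np.array([0, 0, 1, 1])
total = inertia_from_assignments(features_demo, centroids_demo, assignments_demo)
```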
I'm currently learning about Mahalanobis Distance and I find it quite difficult. To get the idea better I generated 2 sets of random values (x and y) and a random point, where for all 3 mean=0 and standard deviation=1. How can I calculate the Mahalanobis Distance between them? Please find my Python code below
Many thanks for your help!
import random
import numpy as np
from numpy import cov
from scipy.spatial import distance
# generate 20 random values where mean = 0 and standard deviation = 1, assign one set to x and one to y
x = [random.normalvariate(0,1) for i in range(20)]
y = [random.normalvariate(0,1) for i in range(20)]
r_point = [random.normalvariate(0,1)] # that's my random point
sigma = cov(x, y)
print(sigma)
print("random point =", r_point)
# use the covariance to calculate the mahalanobis distance from a random point
Here is an example that shows how to compute the Mahalanobis distance of a point r_point to some data. The Mahalanobis distance takes into account the variance and correlation of the data you are measuring the distance to (using the inverse of its covariance matrix). Here, the Mahalanobis distance and the Euclidean distance should be very close because of the distribution of the data (0-mean and standard-deviation of 1). For other data, they will be different.
import numpy as np
N = 5000
mean = 0.0
stdDev = 1.0
data = np.random.normal(mean, stdDev, (2, N)) # 2D random points
r_point = np.random.randn(2)
cov = np.cov(data)
mahalanobis_dist = np.sqrt(r_point.T @ np.linalg.inv(cov) @ r_point)
print("Mahalanobis distance = ", mahalanobis_dist)
euclidean_dist = np.sqrt(r_point.T @ r_point)
print("Euclidean distance = ", euclidean_dist)
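To see the two distances diverge, here is a sketch with a hand-picked covariance (an assumption chosen for easy arithmetic, not estimated from data). `mahalanobis` below is a small local helper; it also subtracts the mean, a step the snippet above can skip only because its data is centred at 0.

```python
import numpy as np

def mahalanobis(point, mean, cov):
    """sqrt((p - m)^T cov^-1 (p - m)) for a single point."""
    delta = np.asarray(point, dtype=float) - np.asarray(mean, dtype=float)
    # solve() avoids forming the explicit inverse of the covariance matrix
    return float(np.sqrt(delta @ np.linalg.solve(cov, delta)))

mean = np.array([0.0, 0.0])
cov = np.array([[4.0, 0.0],   # variance 4 (std 2) along the first axis
                [0.0, 1.0]])  # variance 1 (std 1) along the second axis
point = np.array([2.0, 0.0])

m_dist = mahalanobis(point, mean, cov)          # one standard deviation away
e_dist = float(np.linalg.norm(point - mean))    # two units away
```

The point sits two units from the mean, but the data is spread with a standard deviation of 2 in that direction, so the Mahalanobis distance counts it as one standard deviation away.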
Is there any function or library in python which can help me find the DISTANCE between a point (having 19 features) and its 20th nearest neighbor?
I have tried computing Euclidean distances, but as I have nearly 600 000 records (points), I am encountering a MemoryError. Is there a more efficient and pythonic way of finding the same?
An option is sklearn.neighbors.KNeighborsClassifier.
This prepares a dataset similar to yours (600000 samples with 19 features) and fits a knn model:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
N = 600000
N_ATTR = 19
samples = np.random.normal(size=(N, N_ATTR))
y = np.ones(N,)
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(samples, y)
Here we use a knn with up to 20 neighbors. The distances between query and all of its nearest neighbors can be obtained by calling the kneighbors function:
query = np.random.normal(size=(1, N_ATTR))
distances = knn.kneighbors(query)[0]
and for the one corresponding to the 20th neighbor:
distance_to_20th = distances[0,-1]
KNeighborsClassifier uses the Euclidean distance by default.
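If the goal is the distance from every record to its own 20th nearest neighbour, an unsupervised `NearestNeighbors` index avoids the dummy labels. Calling `kneighbors()` with no argument queries the fitted points themselves without counting each point as its own neighbour. Below is a small sketch on toy data, with k=2 so that the result can be checked by hand; the helper name is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kth_neighbor_distance(samples, k):
    """Distance from each sample to its k-th nearest other sample."""
    nn = NearestNeighbors(n_neighbors=k).fit(samples)
    # With no query argument, each fitted point is excluded from its own neighbours
    distances, _ = nn.kneighbors()
    return distances[:, -1]  # columns are sorted by increasing distance

# Ten points on a line at 0, 1, ..., 9
toy = np.arange(10, dtype=float).reshape(-1, 1)
kth = kth_neighbor_distance(toy, k=2)
```

For the 600 000 x 19 case this would be called as kth_neighbor_distance(samples, k=20). The search never builds the full pairwise matrix, which is what avoids the MemoryError, though with 19 features it may still take a while.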