What is the most efficient way to cluster clusters? - python

I have used kmeans to cluster my data. Now I want to cluster the clusters so that the clustered clusters consist of the individual clusters from the first round of clustering.
Minimal reproducible example:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
# create dataframe
n1 = 336
n2 = 200
x_list = np.array(range(0, n1))
y_list = np.array(range(0, n2))
x_list = np.repeat([x_list], n2, axis=0).flatten() # width
y_list = np.repeat(y_list, n1, axis=0).flatten() # height
# normalize x, y to avoid skewing the clustering
norm_x = np.linalg.norm(x_list)
norm_y = np.linalg.norm(y_list)
normal_array_x = np.round(x_list / norm_x, 6)
normal_array_y = np.round(y_list / norm_y, 6)
data = {'x_position_norm': normal_array_x,
'y_position_norm': normal_array_y}
features = pd.DataFrame(data).to_numpy()
kmeans = KMeans(init='k-means++', n_clusters=16800, n_init=3, max_iter=3, random_state=1)
kmeans.fit(features)
kmeans2 = KMeans(init='k-means++', n_clusters=4200, n_init=3, max_iter=3, random_state=1)
kmeans2.fit(kmeans.cluster_centers_)
At the moment, I am clustering the cluster centers. Is there a better or more efficient way to cluster that guarantees the clusters from the second round consist of whole clusters from the first round?
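A minimal sketch of how the two rounds can be tied together, assuming the approach from the question (clustering the first-round centers). The smaller sizes and the random `features` array are stand-ins, not the original data. Because each point inherits the second-round label of its first-round cluster, every second-round cluster is a union of whole first-round clusters by construction.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((2000, 2))  # stand-in for the normalized x/y positions

# First round: many small clusters.
kmeans = KMeans(n_clusters=100, n_init=3, random_state=1).fit(features)

# Second round: cluster the first-round centers.
kmeans2 = KMeans(n_clusters=25, n_init=3, random_state=1).fit(kmeans.cluster_centers_)

# kmeans2.labels_[j] is the second-round cluster of first-round cluster j,
# so indexing it with the per-point first-round labels gives each point
# the second-round cluster of the first-round cluster it belongs to.
first_labels = kmeans.labels_
second_labels = kmeans2.labels_[first_labels]
```

Note that calling `kmeans2.predict(features)` on the raw points would not give this guarantee, since a point can lie closer to a different second-round center than its first-round center does.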

Each clustering method has its own characteristics. Which one works best depends on what your data looks like.
For Bottom-up clustering, spectral clustering will work well.
Plot a cluster map
Here is a list of clustering methods.
Overview of clustering methods
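As an illustration of the bottom-up idea, here is a hedged sketch using SciPy's hierarchical clustering rather than spectral clustering. It is an assumption that this fits the use case; the `points` array is a stand-in for the data (or for the first-round cluster centers, since building a linkage over tens of thousands of points needs memory quadratic in the number of points). Cutting one merge tree at two levels yields partitions that are nested.

```python
import numpy as np
from scipy.cluster.hierarchy import cut_tree, linkage

rng = np.random.default_rng(0)
points = rng.random((500, 2))  # hypothetical stand-in data

# Build one bottom-up merge tree with Ward linkage.
Z = linkage(points, method="ward")

# Cut the same tree at two levels: 40 fine clusters and 10 coarse clusters.
levels = cut_tree(Z, n_clusters=[40, 10])
fine_labels, coarse_labels = levels[:, 0], levels[:, 1]

# Check that every fine cluster falls inside exactly one coarse cluster.
nested = all(
    len(np.unique(coarse_labels[fine_labels == c])) == 1
    for c in np.unique(fine_labels)
)
print("fine clusters nest inside coarse clusters:", nested)
```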

Related

T-SNE for better data visualization

My dataset shape is (248857, 11)
This is how it looks before StandardScaler. I scaled the features because clustering algorithms such as K-means need feature scaling before the data is fed to the algorithm.
After
I performed K-Means with three clusters and I am trying to find a way to show these clusters.
I found T-SNE as a solution but I am stuck.
This is how I implemented it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
# save the cluster labels into a variable l
l = df_scale['clusters']
d = df_scale.drop("clusters", axis = 1)
standardized_data = StandardScaler().fit_transform(d)
# run t-SNE on the first 100000 points only
data_points = standardized_data[0:100000, :]
labels_80 = l[0:100000]
model = TSNE(n_components = 2, random_state = 0)
tsne_data = model.fit_transform(data_points)
# create a new data frame that helps with plotting the result
tsne_data = np.vstack((tsne_data.T, labels_80)).T
tsne_df = pd.DataFrame(data = tsne_data,
                       columns = ("Dimension1", "Dimension2", "Clusters"))
# plot the t-SNE result (FacetGrid's `size` argument was renamed to `height` in seaborn 0.9)
sns.FacetGrid(tsne_df, hue = "Clusters", height = 6).map(
    plt.scatter, 'Dimension1', 'Dimension2').add_legend()
plt.show()
As you can see, the result is not that good. How can I visualize this better?
It seems you need to tune the perplexity hyper-parameter which is:
a tunable parameter that says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures.
Read more about it in this post and more specifically, here.
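A minimal sketch of what tuning perplexity can look like in practice. The synthetic blobs, the sample size and the two perplexity values are assumptions chosen to keep the example fast; they are not recommendations for the dataset in the question. Each embedding would normally be plotted and compared by eye.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in with 11 features, like the dataset in the question.
X, labels = make_blobs(n_samples=150, centers=3, n_features=11, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit one embedding per perplexity value; perplexity must be below n_samples.
embeddings = {}
for perplexity in (5, 30):
    model = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = model.fit_transform(X)
    print(f"perplexity={perplexity}: final KL divergence {model.kl_divergence_:.3f}")
```

The KL divergence values are not directly comparable across perplexities, so the scatterplots themselves remain the main thing to inspect.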

Highlighting particular data points in my t-SNE clustered scatterplots

Data & Objective
I have a DataFrame df of 50-dimensional embedding vectors with 50000 rows:
id      v1     v2      v3     ...  v50
a0      0.231  0.370   0.071  ...  -0.578
...
a49999  0.510  -0.111  0.235  ...  -0.004
And within my df, I have selected 3 ids: [a15000, a30000, a45000]. I plan to do a t-SNE analysis to cluster my embedded vectors and analyze the data points that are the closest to my target ids.
My Work
First, I decided on the optimal number of clusters using silhouette scores:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
best_n, best_silscore = -1, -1
# exploring from 2 clusters to 30
for k in range(2, 31):
    kmeans = KMeans(n_clusters = k, random_state=200)
    kmeans.fit(df.iloc[:, 1:])
    clusters = kmeans.predict(df.iloc[:, 1:])
    score = silhouette_score(df.iloc[:, 1:], clusters)
    if score > best_silscore:
        best_n = k
        best_silscore = score
print('best k:', best_n, '\t best score:', best_silscore)
This gives an optimal cluster count of 15, with a silhouette score of roughly 0.097.
After retrieving the number of optimal clusters, I have performed the t-SNE analysis:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
# clustering
k = 15
kmeans = KMeans(n_clusters=k, random_state=200)
y_pred = kmeans.fit_predict(df.iloc[:, 1:])  # df is a DataFrame, so slice with .iloc
# tsne
tsne = TSNE(n_components=2, verbose=1, perplexity=50, random_state=200)
X_embedded = tsne.fit_transform(df.iloc[:, 1:])
# visualization
sns.set(rc={'figure.figsize':(20,20)})
sns.scatterplot(x=X_embedded[:,0], y=X_embedded[:,1], hue=y_pred,
                legend='full', palette=sns.hls_palette(k, l=.4, s=.9))
plt.title('t-SNE with KMeans Labels for Embedded Vectors')
plt.show()
(with my dataset being pretty large, I put the perplexity level to the highest recommended level.)
>>> [t-SNE] Mean sigma: 0.195604
>>> [t-SNE] KL divergence after 250 iterations with early exaggeration: 94.565384
>>> [t-SNE] KL divergence after 1000 iterations: 2.455614
Problem
Now, I'm aware that it's easier to plot a separate scatterplot to highlight the three pre-selected points from earlier. Currently, they are somewhere in the scatterplot.
However, if I do a separate t-SNE analysis on just those three points, they (understandably) are way off any of the clusters in the grid. They look something like this, with the target ids marked as black dots.
Instead of this bizarre-looking scatterplot, I want something almost identical to the first one, but with the corresponding black, highlighted data points in the clusters.
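One possible approach, sketched under assumptions rather than taken from an answer in this thread: run t-SNE once on all rows and then pick out the rows of the target ids from that same embedding, instead of embedding the three points separately. The random data, the smaller dimensions, the ids `a50`, `a150`, `a250` and the output file name are hypothetical stand-ins.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to display the figure instead
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Hypothetical stand-in for the embedding DataFrame: an id column plus vector columns.
rng = np.random.default_rng(200)
n = 300
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"v{i}" for i in range(1, 6)])
df.insert(0, "id", [f"a{i}" for i in range(n)])
target_ids = ["a50", "a150", "a250"]

features = df.iloc[:, 1:].to_numpy()
y_pred = KMeans(n_clusters=4, n_init=10, random_state=200).fit_predict(features)

# One t-SNE run over all rows; row i of X_embedded corresponds to row i of df.
X_embedded = TSNE(n_components=2, perplexity=30, random_state=200).fit_transform(features)

# Boolean mask marking the rows whose id is one of the targets.
is_target = df["id"].isin(target_ids).to_numpy()

fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X_embedded[~is_target, 0], X_embedded[~is_target, 1],
           c=y_pred[~is_target], cmap="tab10", s=10)
ax.scatter(X_embedded[is_target, 0], X_embedded[is_target, 1],
           c="black", s=80, label="target ids")
ax.legend()
ax.set_title("t-SNE with KMeans labels, target ids highlighted")
fig.savefig("tsne_highlight.png")
```

Drawing the highlighted points last keeps them on top of the cluster colors.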

KMeans Clustering: adding results to an initial dataset

I defined features for the clustering with the help of KMeans:
x = df_1.iloc[:, np.r_[9:12,26:78]]
And run the code to get 6 clusters:
kmeans = KMeans(n_clusters = 6)
kmeans.fit(x)
Now I want a column in my initial dataset with the cluster number (df_1["new"] = ...): 1 for the data in cluster one, 2 for the data in cluster two, etc.
How exactly do I do that?
Thanks!
You seem to be looking for fit_predict(x) (or fit(x).predict(x)), which returns the cluster for each sample.
fit_predict(X, y=None, sample_weight=None)
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
So I suppose this would do:
df_1['cluster'] = kmeans.fit_predict(x)
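A small self-contained sketch of the same idea. The random `df_1` and the column slice are hypothetical stand-ins for the real data. scikit-learn numbers clusters from 0, so 1 is added here to get the 1-based numbering asked for in the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the original dataset.
rng = np.random.default_rng(0)
df_1 = pd.DataFrame(rng.random((60, 5)), columns=list("abcde"))
x = df_1.iloc[:, 1:4]  # feature columns used for clustering

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)

# fit_predict returns labels 0..5 in row order; shift them to 1..6.
df_1["new"] = kmeans.fit_predict(x) + 1
print(df_1["new"].value_counts().sort_index())
```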

Clustering geospatial data on coordinates AND non spatial feature

Say I have the following dataframe stored as a variable called coordinates, where the first few rows look like:
   business_lat  business_lng  business_rating
0     19.111841     72.910729               5.
1     19.111342     72.908387               5.
2     19.111342     72.908387               4.
3     19.137815     72.914085               5.
4     19.119677     72.905081               2.
5     19.119677     72.905081               2.
...
As you can see, this data is geospatial (it has a lat and a lng) AND every row has an additional value, business_rating, that corresponds to the rating of the business at the latlng in that row. I want to cluster the data so that businesses that are nearby and have similar ratings are assigned to the same cluster. Essentially I need a geospatial clustering with the additional requirement that the clustering must consider the rating column.
I've looked online and can't really find much addressing this: only approaches for strictly geospatial clustering (the only features to cluster on are latlng) or for non-spatial clustering.
I have a simple DBSCAN running below, but when I plot the results of the clustering it does not seem to be doing what I want.
from sklearn.cluster import DBSCAN
import numpy as np
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
Would I be better served trying to tweak the parameters of the DBSCAN, doing some additional processing of the data or using a different approach all together?
The tricky part about clustering two different types of information (location and rating) is determining how they should relate to each other. It's simple to ask when it is just one domain and you are comparing the same units. My approach would be to look at how to relate rows within a domain and then determine some interaction between the domains. This could be done using scaling options like MinMaxScaler mentioned, however, I think this is a bit heavy handed and we could use our knowledge of the domains to cluster better.
Handling Location
Location distance is best handled directly as this has real world meaning that we can precalculate distances for. The meaning of meters apart is direct to what we
You could use the scaling option mentioned in the previous answer but this risks distorting the location data. For example, if you have a long and thin set of locations, MinMaxScaling would give more importance to variation on the thin axis than the long axis. If you are going to use scaling, do it on the computed distance matrix, not on the lat lon themselves.
import numpy as np
from sklearn.metrics.pairwise import haversine_distances
points_in_radians = df[['business_lat','business_lng']].apply(np.radians).values
distances_in_km = haversine_distances(points_in_radians) * 6371
Adding in Rating
We can think of the problem through asking a couple of questions that relate rating to distance. We could ask, how different must ratings be to separate observations in the same place? What is the meter difference to rating difference ratio? With an idea of ratio, we can calculate another distance matrix for the rating difference for all observations and use this to scale or add on the original location distance matrix or we could increase the distance for every gap in rating. This location-plus-ratings-difference matrix can then be clustered on.
from sklearn.metrics.pairwise import euclidean_distances
added_km_per_rating_gap = 1
rating_distances = euclidean_distances(df[['business_rating']].values) * added_km_per_rating_gap
We can then simply add these together and cluster on the resulting matrix.
from sklearn.cluster import DBSCAN
distance_matrix = rating_distances + distances_in_km
clustering = DBSCAN(metric='precomputed', eps=1, min_samples=2)
clustering.fit(distance_matrix)
What we have done is cluster by location, adding a penalty for ratings difference. Making that penalty direct and controllable allows for optimisation to find the best clustering.
Testing
The problem I'm finding is that (with my test data at least) DBSCAN has a tendency to 'walk' from observation to observation forming clusters that either blend ratings together because the penalty is not high enough or separates into single rating groups. It might be that DBSCAN is not suitable for this type of clustering. If I had more time, I would look for some open data to test this on and try other clustering methods.
Here is the code I used to test. I used the square of the ratings distance to emphasise larger gaps.
import random
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=6, cluster_std=0.60, random_state=0)
ratings = np.array([random.randint(1,4) for _ in range(len(X)//2)] \
                   +[random.randint(2,5) for _ in range(len(X)//2)]).reshape(-1, 1)
distances_in_km = euclidean_distances(X)
rating_distances = euclidean_distances(ratings)
def build_clusters(multiplier, eps):
    rating_addition = (rating_distances ** 2) * multiplier
    distance_matrix = rating_addition + distances_in_km
    clustering = DBSCAN(metric='precomputed', eps=eps, min_samples=10)
    clustering.fit(distance_matrix)
    return clustering.labels_
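The test helper above can be exercised with a small parameter sweep. The following is a self-contained sketch under assumptions: the synthetic blobs stand in for locations, the ratings are drawn uniformly at random, and the multiplier and eps values are arbitrary illustrative choices, not tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import euclidean_distances

# Synthetic stand-in data: blob coordinates play the role of locations.
X, _ = make_blobs(n_samples=300, centers=6, cluster_std=0.60, random_state=0)
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(len(X), 1))  # ratings from 1 to 5

location_distances = euclidean_distances(X)
rating_distances = euclidean_distances(ratings)

def build_clusters(multiplier, eps, min_samples=10):
    """Cluster on location distance plus a squared rating-gap penalty."""
    distance_matrix = location_distances + (rating_distances ** 2) * multiplier
    model = DBSCAN(metric="precomputed", eps=eps, min_samples=min_samples)
    return model.fit(distance_matrix).labels_

# Sweep a few hypothetical settings and summarise each result.
results = {}
for multiplier in (0.0, 0.5, 2.0):
    for eps in (0.5, 1.0):
        labels = build_clusters(multiplier, eps)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))  # DBSCAN marks noise points with -1
        results[(multiplier, eps)] = (n_clusters, n_noise)
        print(f"multiplier={multiplier}, eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

Comparing the cluster and noise counts across settings gives a rough view of how strongly the rating penalty splits or merges location clusters.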
Using the DBSCAN methodology, we can calculate the distance between points (the Euclidean distance or some other distance) and look for points which are far away from others. You may want to consider using the MinMaxScaler to normalize values, so one feature doesn't overwhelm other features.
Where is your code and what are your final results? Without an actual code sample, I can only guess what you are doing.
I hacked together some sample code for you. You can see the results below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()
import csv
df = pd.read_csv('C:\\your_path_here\\business.csv')
X=df.loc[:,['review_count','latitude','longitude']]
K_clusters = range(1,10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
Y_axis = df[['latitude']]
X_axis = df[['longitude']]
score = [kmeans[i].fit(Y_axis).score(Y_axis) for i in range(len(kmeans))]
# Visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
# Cluster on the coordinates only (X.columns[0:2] would be review_count and latitude).
coords = X[['latitude', 'longitude']]
kmeans = KMeans(n_clusters = 3, init ='k-means++')
labels = kmeans.fit_predict(coords) # Compute k-means clustering and get the label of each point.
X['cluster_label'] = labels
centers = kmeans.cluster_centers_ # Coordinates of cluster centers.
X.head(10)
X.plot.scatter(x = 'latitude', y = 'longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
from scipy.stats import zscore
df["zscore"] = zscore(df["review_count"])
df["outlier"] = df["zscore"].apply(lambda x: x <= -2.5 or x >= 2.5)
df[df["outlier"]]
df_cord = df[["latitude", "longitude"]]
df_cord.plot.scatter(x = "latitude", y = "longitude")
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_cord = scaler.fit_transform(df_cord)
df_cord = pd.DataFrame(df_cord, columns = ["latitude", "longitude"])
df_cord.plot.scatter(x = "latitude", y = "longitude")
from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
    eps = 0.5,
    metric="euclidean",
    min_samples = 3,
    n_jobs = -1)
clusters = outlier_detection.fit_predict(df_cord)
clusters
cmap = plt.get_cmap('Accent')  # cm.get_cmap was removed in matplotlib 3.9
df_cord.plot.scatter(
    x = "latitude",
    y = "longitude",
    c = clusters,
    cmap = cmap,
    colorbar = False
)
The final result looks a little weird, to tell you the truth. Remember, not everything is clusterable.

Get inertia for nltk k means clustering using cosine_similarity

I have used nltk for k-means clustering because I would like to change the distance metric. Does nltk's k-means have an inertia similar to sklearn's? I can't seem to find it in their documentation or online...
The code below is how people usually find inertia using sklearn k means.
inertia = []
for n_clusters in range(2, 26, 1):
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(features)
    centers = clusterer.cluster_centers_
    inertia.append(clusterer.inertia_)
plt.plot([i for i in range(2,26,1)], inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
You can write your own function to obtain the inertia for the nltk KMeansClusterer.
This builds on your earlier question, "How do I obtain individual centroids of K mean cluster using nltk (python)", and uses the same dummy data after making 2 clusters.
According to the docs at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html, inertia is the sum of squared distances of samples to their closest cluster center.
feature_matrix = df[['feature1','feature2','feature3']].to_numpy()
centroid = df['centroid'].to_numpy()
def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # inertia as defined in the scikit-learn docs, i.e. the sum of squared distances
        sum_.append(np.sum((feature_matrix[i] - centroid[i])**2))
    return sum(sum_)
nltk_inertia(feature_matrix, centroid)
# output: 27.495250000000002
# now use scikit-learn KMeans on feature1, feature2 and feature3 with the same number of clusters (2)
scikit_kmeans = KMeans(n_clusters= 2)
scikit_kmeans.fit(vectors) # vectors = [np.array(f) for f in df.values], which contains feature1, feature2, feature3
scikit_kmeans.inertia_
# output: 27.495250000000006
The previous comment is actually missing a small detail:
feature_matrix = df[['feature1','feature2','feature3']].to_numpy()
centroid = df['centroid'].to_numpy()
cluster = df['predicted_cluster'].to_numpy()
def nltk_inertia(feature_matrix, centroid):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        sum_.append(np.sum((feature_matrix[i] - centroid[cluster[i]])**2))
    return sum(sum_)
You have to select the corresponding cluster centroid when calculating distance between centroids and data points. Notice the cluster variable in the above code.
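A vectorized sketch of the same calculation, checked against scikit-learn's own `inertia_`. The helper name `inertia` and the random data are assumptions for illustration. With nltk, the centroids would presumably come from the clusterer's `means()` and the assignments from `cluster(vectors, assign_clusters=True)`; that part is not run here.

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia(vectors, centroids, assignments):
    """Sum of squared Euclidean distances from each vector to its assigned centroid."""
    vectors = np.asarray(vectors, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Pick, for every sample, the centroid of the cluster it was assigned to.
    assigned = centroids[np.asarray(assignments)]
    return float(np.sum((vectors - assigned) ** 2))

# Sanity check against scikit-learn on random stand-in data.
rng = np.random.default_rng(0)
vectors = rng.random((100, 3))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
manual = inertia(vectors, km.cluster_centers_, km.labels_)
print(manual, km.inertia_)
```

If the clustering used cosine distance, this squared-Euclidean inertia is still computable, but it is no longer the quantity the clusterer was minimizing, so an elbow plot built from it should be read with that in mind.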
