KMeans Clustering: adding results to an initial dataset - python

I defined the features for clustering with KMeans:
x = df_1.iloc[:, np.r_[9:12,26:78]]
And ran the code to get 6 clusters:
kmeans = KMeans(n_clusters = 6)
kmeans.fit(x)
Now I want my initial dataset to have a column (df_1("new") = ...) with the cluster number: 1 for the data in cluster one, 2 for the data in cluster two, etc.
How exactly do I do that?
Thanks!

You seem to be looking for fit_predict(x) (or fit(x).predict(x)), which returns the cluster for each sample.
fit_predict(X, y=None, sample_weight=None)
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
So I suppose this would do:
df_1['cluster'] = kmeans.fit_predict(x)
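For completeness, a minimal end-to-end sketch of that flow; the frame name df_1 and the column selection are taken from the question, the rest is standard scikit-learn usage:
import numpy as np
from sklearn.cluster import KMeans

# Select the feature columns used in the question (positions 9:12 and 26:78).
x = df_1.iloc[:, np.r_[9:12, 26:78]]

# Fit k-means and attach each row's cluster label to the original frame.
kmeans = KMeans(n_clusters=6)
df_1["cluster"] = kmeans.fit_predict(x)

# fit_predict returns labels 0-5; add 1 if you want 1-based numbering.
df_1["cluster"] = df_1["cluster"] + 1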

Related

Printing Values in the Same Cluster Python

I am working on a 'songs' dataset that has 2973 records and 2973 unique genres. In the end, I want to create a function that takes one genre as input and prints other similar genres.
I thought about doing this by applying label or one-hot encoding and then clustering with K-Means. The ultimate idea is that a function called 'genre_recommender' searches for the input genre within the clusters and prints the other values in that cluster. I have done the encoding and the clustering, but I can't progress even 1% on the function. How can I do it?
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(scaled_data)
# As it's difficult to visualise clusters when the data is high-dimensional - we'll use
# boxplots to help us see how the clusters are grouping the samples
df_bonus["cluster"] = cluster_labels
I clustered using k-means with 4 clusters after doing the elbow method. After this point, I am stuck.
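A minimal sketch of the lookup described above, reusing df_bonus and its 'cluster' column from the snippet; the 'genre' column name is an assumption, not from the original code:
def genre_recommender(genre, df=df_bonus):
    # Find the cluster the input genre was assigned to ('genre' column assumed).
    match = df.loc[df["genre"] == genre, "cluster"]
    if match.empty:
        print(f"Genre '{genre}' not found.")
        return
    cluster_id = match.iloc[0]
    # Print the other genres that landed in the same cluster.
    similar = df.loc[(df["cluster"] == cluster_id) & (df["genre"] != genre), "genre"]
    print(similar.tolist())

genre_recommender("jazz")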

Dendrogram, KMeans, centroids and labels

Excuse me if the questions are too simple, but I am getting into machine learning under time constraints.
I must apply a mixed classification to a df with the following steps:
Apply the KMeans with 50 clusters
From the barycenters and labels obtained for each cluster, a dendrogram must be displayed, in order to choose the right k.
Then apply an HCA algorithm from the barycenters obtained in step 1 with the number of clusters from step 2.
Calculate the barycenters of each new group
Use the calculated barycenters to consolidate the clusters by the KMeans algorithm.
What I do is:
clf = KMeans(n_clusters=50)
clf.fit(df)  # the model must be fitted before the attributes below exist
centroids = clf.cluster_centers_
labels = clf.labels_
From there I get confused with the dendrogram. So far I have used it only over the whole df and I am not certain how to involve the barycenters and labels from the KMeans correctly.
Z = linkage(df, method='ward', metric='euclidean')
dendrogram(Z, labels=df.index, leaf_rotation=90., color_threshold=0)
plt.show()
Last but not least, I do not know how to get the barycenters from the AgglomerativeClustering.
Any clarification would be of help. Thanks in advance!
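A hedged sketch of steps 2-5, reusing centroids and df from the snippets above: build the dendrogram over the 50 barycenters rather than the whole df, cut it at the chosen k (k=6 here is a placeholder), average the KMeans centroids within each group, and seed the final KMeans with those means:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.cluster import KMeans

# Step 2: dendrogram over the 50 barycenters, not over the full df.
Z = linkage(centroids, method='ward', metric='euclidean')
dendrogram(Z, leaf_rotation=90., color_threshold=0)
plt.show()

# Step 3: cut the tree at the k read off the dendrogram (placeholder k=6).
k = 6
group = fcluster(Z, t=k, criterion='maxclust')

# Step 4: barycenter of each new group = mean of the KMeans centroids in it.
new_barycenters = np.array([centroids[group == g].mean(axis=0)
                            for g in range(1, k + 1)])

# Step 5: consolidate by running KMeans on df, initialized at those barycenters.
clf2 = KMeans(n_clusters=k, init=new_barycenters, n_init=1)
clf2.fit(df)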

T-SNE for better data visualization

My dataset shape is (248857, 11)
This is how the data looks before and after StandardScaler (screenshots omitted). I scaled the features first because clustering algorithms such as K-means need feature scaling before the data is fed to the algorithm.
I performed K-Means with three clusters and I am trying to find a way to show these clusters.
I found T-SNE as a solution but I am stuck.
This is how I implemented it:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# Save the cluster labels into a variable l, then drop them from the features.
l = df_scale['clusters']
d = df_scale.drop("clusters", axis=1)
standardized_data = StandardScaler().fit_transform(d)

# t-SNE is slow on ~249k rows, so pick the top 100,000 points.
data_points = standardized_data[0:100000, :]
labels_80 = l[0:100000]
model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(data_points)

# Create a new data frame that helps us plot the result.
tsne_data = np.vstack((tsne_data.T, labels_80)).T
tsne_df = pd.DataFrame(data=tsne_data,
                       columns=("Dimension1", "Dimension2", "Clusters"))

# Plot the t-SNE result, coloured by cluster.
sns.FacetGrid(tsne_df, hue="Clusters", height=6).map(
    plt.scatter, 'Dimension1', 'Dimension2').add_legend()
plt.show()
As you can see, the result is not that good. How can I visualize this better?
It seems you need to tune the perplexity hyper-parameter, which is:
a tunable parameter that says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures.
Read more about it in this post and more specifically, here.
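A minimal sketch of such a sweep, reusing data_points and labels_80 from the question; the candidate perplexity values are arbitrary:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Embed with several perplexity values and compare the pictures side by side.
perplexities = [5, 30, 50, 100]
fig, axes = plt.subplots(1, len(perplexities), figsize=(20, 5))
for ax, p in zip(axes, perplexities):
    emb = TSNE(n_components=2, perplexity=p, random_state=0).fit_transform(data_points)
    ax.scatter(emb[:, 0], emb[:, 1], c=labels_80, s=2)
    ax.set_title(f"perplexity = {p}")
plt.show()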

Determining accuracy for k-means clustering

I want to classify the Iris flower dataset (I removed the labels though, so it's unlabeled data now) using sklearn's k-means clustering function. I have made the prediction model and the output seems to classify the data correctly for the most part; however, it chooses the labels randomly (0, 1 and 2) and I cannot compare them to my own labels to determine the accuracy (I have marked setosa as 0, versicolor as 1, virginica as 2). Is there any way to correctly label the flowers?
Here's the code:
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

cluster = KMeans(n_clusters=3)
cluster.fit(features)
pred = cluster.labels_
score = round(accuracy_score(pred, name_val), 4)
print('Accuracy scored using k-means clustering: ', score)
features, as expected, contains the features; name_val is an array containing the flower labels: 0 for setosa, 1 for versicolor, 2 for virginica.
Edit: one solution I came up with was setting random_state to a fixed number so that the labeling is constant. Is there any other solution?
You need to take a look at clustering metrics to evaluate your predictions; these include:
Homogeneity score
V-measure
Completeness score, and so on
Now take the completeness score, for example:
A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.
For example
from sklearn.metrics.cluster import completeness_score
print(completeness_score([0, 0, 1, 1], [1, 1, 0, 0]))
# Output: 1.0
This is similar to what you want. In your case, the code would be completeness_score(name_val, pred) (true labels first, then the predicted clusters). Note that the label assigned to a data point is not important; rather, the labelling of the points with respect to each other is what matters.
Homogeneity, on the other hand, focuses on the quality of the data points within the same cluster, whereas the V-measure is defined as 2 * (homogeneity * completeness) / (homogeneity + completeness).
Read the official documentation here: Homogeneity, completeness and V-measure.
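A quick sketch of all three metrics on the same toy labels, using the corresponding sklearn.metrics functions:
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

y_true = [0, 0, 1, 1]
y_pred = [1, 1, 0, 0]
# The labels are permuted but the grouping is identical, so all three are 1.0.
print(homogeneity_score(y_true, y_pred))   # 1.0
print(completeness_score(y_true, y_pred))  # 1.0
print(v_measure_score(y_true, y_pred))     # 1.0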
First of all, you are not classifying, you are clustering the data. Classification is a different process.
The K-Means algorithm includes randomness in choosing the initial cluster centers. By setting the random_state you can reproduce the same clustering, as the initial cluster centers will be the same. However, this does not fix your problem. What you want is for the cluster with id 0 to be setosa, 1 to be versicolor, etc. This is not possible, because the K-Means algorithm has no knowledge of these categories; it only groups flowers by their similarity. What you can do is create a rule that determines which cluster corresponds to which category: for example, if more than 50% of the flowers that belong to a cluster are in the setosa category, map that cluster to setosa, as sketched below.
That's the best way of doing it that I can think of. However, this is not how we usually evaluate clustering quality; there are metrics you can use, such as the Silhouette Coefficient. I hope I helped.
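A minimal sketch of that majority-vote rule, assuming pred and name_val from the question are integer numpy arrays:
import numpy as np

# Map each cluster id to the most common true label among its members,
# then score the relabelled predictions against the true labels.
mapped = np.zeros_like(pred)
for cluster_id in np.unique(pred):
    mask = pred == cluster_id
    mapped[mask] = np.bincount(name_val[mask]).argmax()

accuracy = (mapped == name_val).mean()
print(accuracy)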
Reference from this blog https://smorbieu.gitlab.io/accuracy-from-classification-to-clustering-evaluation/
You need to get the cluster-to-label mapping from the confusion matrix with the Hungarian algorithm.
The code is below:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment as linear_assignment

def cluster_acc(y_true, y_pred):
    # Confusion matrix between true labels and cluster assignments.
    cm = metrics.confusion_matrix(y_true, y_pred)
    # Turn it into a cost matrix so the assignment maximizes correct matches.
    _make_cost_m = lambda x: -x + np.max(x)
    indexes = linear_assignment(_make_cost_m(cm))
    indexes = np.concatenate([indexes[0][:, np.newaxis],
                              indexes[1][:, np.newaxis]], axis=-1)
    # Reorder the confusion-matrix columns by the optimal matching.
    js = [e[1] for e in sorted(indexes, key=lambda x: x[0])]
    cm2 = cm[:, js]
    # Accuracy is the trace of the reordered matrix over the total count.
    acc = np.trace(cm2) / np.sum(cm2)
    return acc
Or just use the coclust library:
from coclust.evaluation.external import accuracy
accuracy(labels, predicted_labels)

Labels for cluster centers in Python sklearn

When using sklearn.cluster.KMeans for K-means clustering, a fitted k-means object has several attributes, including a numpy array of cluster centers (n_clusters × n_features) named cluster_centers_. However, these centers don't have an attached label.
My question is: are the centers (rows) in cluster_centers_ ordered by the label value? That is, does row 1 correspond to the center for the cluster labeled 1? Or are they placed in the array randomly? A pointer to any documentation would be more than sufficient.
Thanks.
I couldn't find it in the documentation, but yes, it is ordered by cluster label.
So:
kmeans.cluster_centers_[0] = centroid of cluster 0
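A quick way to convince yourself, as a sketch on random data: each center is closest to itself, so predicting the centers returns their own row indices:
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

# Row k of cluster_centers_ is assigned to cluster k.
print(kmeans.predict(kmeans.cluster_centers_))  # -> [0 1 2]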
