How to use KMeans for distance clustering - python

I have a dataframe with X and Y axis values
They don't have any labels
They look as shown below
X-COORDINATE
Y-COORDINATE
12
34
99
42
90
27
49
64
Is it possible to use KMeans for clustering the data?
How do I get the labels and plot the data on a graph for each cluster?

Yes, you can use k-means even if you don't have labels because k-means is an unsupervised method, but...
First of all you need to scale your data, because k-means is a distance algorithm and using distances between data points to determine their similarity. More about that here.
I found this tutorial for clustering very useful, you could start with that. It also describes how to plot your data first with silhouette or elbow plot to define perfect number of clusters.
It should look somewhat like that:
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=n_clusters) # you can get n_clusters from silhouette/elbow plot or just try out different numbers
kmeans_model.fit(your_dataframe)
labels = kmeans_model.predict(your_dataframe)
print(labels)
K-Means is not always performing perfect, if you want to get better results, you could also try out other algorithms like DBSCAN, HDBSCAN, Agglomerative clustering.... It always depends on your data which one you should choose.

Related

Dendrogram, KMeans, centroids and labels

Excuse me if the questions are too simple but I am getting into machine learning having time constraints.
I must apply a mixed classification to a df with the following steps:
Apply the KMeans with 50 clusters
From the barycenters and labels obtained for each cluster, a dendrogram must be displayed, in order to choose the right k.
Then apply an HCA algorithm from the barycenters obtained in step 1 with the number of clusters from step 2.
Calculate the barycenters of each new group
Use the calculated barycenters to consolidate the clusters by the KMeans algorithm.
What I do is:
clf = KMeans(n_clusters=50)
centroids = clf.cluster_centers_
labels = clf.labels_
From there I get confused with the dendrogram. So far I have used it only over the whole df and I am not certain how to involve the barycenters and labels from the KMeans correctly.
Z = linkage(df, method='ward', metric='euclidean')
dendrogram(Z, labels=df.index, leaf_rotation=90., color_threshold=0)
plt.show()
Last but not least, I do not know how to get the barycenters in the AgglomerativeClustering.
Any clarification would be of help. Thanks in advance!

Logistic Regression vs predicting probability by splitting data into bin

So I am exploring using a logistic regression model to predict the probability of a shot resulting in a goal. I have two predictors but for simplicity lets assume I have one predictor: distance from the goal. When doing some data exploration I decided to investigate the relationship between distance and the result of a goal. I did this graphical by splitting the data into equal size bins and then taking the mean of all the results (0 for a miss and 1 for a goal) within each bin. Then I plotted the average distance from goal for each bin vs the probability of scoring. I did this in python
#use the seaborn library to inspect the distribution of the shots by result (goal or no goal)
fig, axes = plt.subplots(1, 2,figsize=(11, 5))
#first we want to create bins to calc our probability
#pandas has a function qcut that evenly distibutes the data
#into n bins based on a desired column value
df['Goal']=df['Goal'].astype(int)
df['Distance_Bins'] = pd.qcut(df['Distance'],q=50)
#now we want to find the mean of the Goal column(our prob density) for each bin
#and the mean of the distance for each bin
dist_prob = df.groupby('Distance_Bins',as_index=False)['Goal'].mean()['Goal']
dist_mean = df.groupby('Distance_Bins',as_index=False)['Distance'].mean()['Distance']
dist_trend = sns.scatterplot(x=dist_mean,y=dist_prob,ax=axes[0])
dist_trend.set(xlabel="Avg. Distance of Bin",
ylabel="Probabilty of Goal",
title="Probability of Scoring Based on Distance")
Probability of Scoring Based on Distance
So my question is why would we go through the process of creating a logistic regression model when I could fit a curve to the plot in the image? Would that not provide a function that would predict a probability for a shot with distance x.
I guess the problem would be that we are reducing say 40,000 data point into 50 but I'm not entirely sure why this would be a problem for predict future shot. Could we increase the number of bins or would that just add variability? Is this a case of bias-variance trade off? Im just a little confused about why this would not be as good as a logistic model.
The binning method is a bit more finicky than the logistic regression since you need to try different types of plots to fit the curve (e.g. inverse relationship, log, square, etc.), while for logistic regression you only need to adjust the learning rate to see results.
If you are using one feature (your "Distance" predictor), I wouldn't see much difference between the binning method and the logistic regression. However, when you are using two or more features (I see "Distance" and "Angle" in the image you provided), how would you plan to combine the probabilities for each to make a final 0/1 classification? It can be tricky. For one, perhaps "Distance" is more useful a predictor than "Angle". However, logistic regression does that for you because it can adjust the weights.
Regarding your binning method, if you use fewer bins you might see more bias since the data may be more complicated than you think, but this is not that likely because your data looks quite simple at first glance. However, if you use more bins that would not significantly increase variance, assuming that you fit the curve without varying the order of the curve. If you change the order of the curve you fit, then yes, it will increase variance. However, your data seems like it is amenable to a very simple fit if you go with this method.

How to evaluate K-Means Clustering since automatic indexes of clusters don't match true labels?

How do we measure the accuracy of a K-Means clustering algorithm (say, generate a confusion matrix) since the automatic indexes of cluster is probably a permutation of the original labels?
I don't exactly know what you mean too. Your original labels perhaps is the ground truth labeling. The clustering results provided by k-means is usually an integer with range given as many as the k clusters you wish the k-means algorithm to give you.
I typically use pandas.crosstab function to visualize the localizations of the groundtruth labeling with kmeans labeling with cross-tabulation.
For better visualization, you may want to use the following:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(30,10))
# plot the heatmap for correlation matrix
ax = sns.heatmap(crosstab_groundtruth_kmeans.T,
square=True, annot=True, fmt='.2f')
ax.set_yticklabels(
ax.get_yticklabels(),
rotation=0);
out:
Good luck!~
k-means is a clustering (grouping algorithm, not used for classification), hence, it is not feasible to check and analyze accuracy. Major concept of k-means is to find a cluster of data-points which maximize the "between-cluster" distance (and does not have the concept of labels, and hence, you can't get accuracy matrix). More insights: https://scikit-learn.org/stable/modules/clustering.html#k-means
The accuracy (assuming, you want to visualize which cluster consists of which data points) has to be analyzed manually using the predict method from sklearn.cluster.KMeans. It basically "Predicts the closest cluster each sample in X belongs to." (from documentation)

How can I initialize K means clustering on a data matrix with 569 rows (samples), and 30 columns (features)?

I'm having trouble understanding how to begin my solution. I have a matrix with 569 rows, each representing a single sample of my data, and 30 columns representing the features of each sample. My intuition is to plot each individual row, and see what the clusters (if any) look like, but I can't figure out how to do more than 2 rows on a single scatter plot.
I've spent several hours looking through tutorials, but have not been able to understand how to apply it to my data. I know a scatter plot takes 2 vectors as a parameter, so how could I possibly plot all 569 samples to cluster them? Am I missing something fundamental here?
#our_data is a 2-dimensional matrix of size 569 x 30
plt.scatter(our_data[0,:], our_data[1,:], s = 40)
My goal is to start k means clustering on the 569 samples.
Since you have a 30-dimensinal factor space, it is difficult to plot such data in 2D space (i.e. on canvas). In such cases usually apply dimension reduction techniques first. This could help to understand data structure. You can try to apply,e.g. PCA (principal component analysis) first, e.g.
#your_matrix.shape = (569, 30)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
projected_data = pca.fit_transform(your_matrix)
plt.scatter(projected_data[:,0], projected_data[:, 1]) # This might be very helpful for data structure understanding...
plt.show()
You can also look on other (including non-linear) dimension reduction techniques, such as, e.g. T-sne.
Further you can apply k-means or something else; or apply k-means to projected data.
If by initialize you mean picking the k initial clusters, one of the common ways of doing so is to use K-means++ described here which was developed in order to avoid poor clusterings.
It essentially entails semi-randomly choosing centers based upon a probability distribution of distances away from a first center that is chosen completely randomly.

What's a good metric to analyze the quality of the output of a clustering algorithm?

I've been trying out the kmeans clustering algorithm implementation in scipy. Are there any standard, well-defined metrics that could be used to measure the quality of the clusters generated?
ie, I have the expected labels for the data points that are clustered by kmeans. Now, once I get the clusters that have been generated, how do I evaluate the quality of these clusters with respect to the expected labels?
I am doing this very thing at that time with Spark's KMeans.
I am using:
The sum of squared distances of points to their nearest center
(implemented in computeCost()).
The Unbalanced factor (see
Unbalanced factor of KMeans?
for an implementation and
Understanding the quality of the KMeans algorithm
for an explanation).
Both quantities promise a better cluster, when the are small (the less, the better).
Kmeans attempts to minimise a sum of squared distances to cluster centers. I would compare the result of this with the Kmeans clusters with the result of this using the clusters you get if you sort by expected labels.
There are two possibilities for the result. If the KMeans sum of squares is larger than the expected label clustering then your kmeans implementation is buggy or did not get started from a good set of initial cluster assignments and you could think about increasing the number of random starts you using or debugging it. If the KMeans sum of squares is smaller than the expected label clustering sum of squares and the KMeans clusters are not very similar to the expected label clustering (that is, two points chosen at random from the expected label clustering are/are not usually in the same expected label clustering when they are/are not in the KMeans clustering) then sum of squares from cluster centers is not a good way of splitting your points up into clusters and you need to use a different distance function or look at different attributes or use a different sort of clustering.
In your case, when you do have the samples true label, validation is very easy.
First of all, compute the confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix). Then, derive from it all relevant measures: True Positive, false negatives, false positives and true negatives. Then, you can find the Precision, Recall, Miss rate, etc.
Make sure you understand the meaning of all above. They basically tell you how well your clustering predicted / recognized the true nature of your data.
If you're using python, just use the sklearn package:
http://scikit-learn.org/stable/modules/model_evaluation.html
In addition, it's nice to run some internal validation, to see how well your clusters are separated. There are known internal validity measures, like:
Silhouette
DB index
Dunn index
Calinski-Harabasz measure
Gamma score
Normalized Cut
etc.
Read more here: An extensive comparative study of cluster validity indices
Olatz Arbelaitz , Ibai Gurrutxaga, , Javier Muguerza , Jesús M. Pérez , Iñigo Perona

Categories

Resources