I have been looking at this tutorial on fitting the digits dataset with k-means clustering in Python, and some of the code is just confusing me.
I do understand this part where we need to train our model using 10 clusters.
from sklearn.cluster import KMeans  # needed for KMeans below
from sklearn.datasets import load_digits

digits = load_digits()
digits.data.shape  # (1797, 64)
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape  # (10, 64)
The following shows us the 10 cluster centroids. It first creates a figure and a grid of axes with two rows of 5 subplots each, returning both; figsize=(8, 3) is the size of the displayed figure in inches. But after that, I just do not understand how the commands in the for loop display the cluster centroids.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])  # hide the tick marks
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
Also, this part checks how accurately the clustering found the similar digits within the data. I know that we need to create a labels array of the same size as clusters, filled with zeros, so we can place our predicted labels in it.
But again, I just do not understand how they implement it inside the for loop.
import numpy as np
from scipy.stats import mode

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)                        # samples assigned to cluster i
    labels[mask] = mode(digits.target[mask])[0]   # most common true digit in cluster i
Can someone please explain what each line of the commands do? Thank you.
Question 1: How does the code plot the centroids?
It's important to see that each centroid is a point in the feature space. In other words, a centroid looks like one of the training samples. In this case, each training sample is an 8 × 8 image (although they've been flattened into rows of 64 elements, because sklearn always wants the input X to be a two-dimensional array). So each centroid also represents an 8 × 8 image.
The loop steps over the axes (a 2 × 5 grid) and the centroids (kmeans.cluster_centers_) together. The purpose of zip is to pair each Axes object with its corresponding center (this is a common way to plot a bunch of n things into a bunch of n subplots). The centroids have been reshaped into a 10 × 8 × 8 array, so that each of the 10 centroids is the 8 × 8 image we're expecting.
Since each centroid is now a 2D array, you can use imshow to plot it.
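A quick way to convince yourself of this, assuming you have run the question's code: inspect the shapes and watch zip pair each subplot with one centroid.

print(kmeans.cluster_centers_.shape)   # (10, 64): one flattened image per centroid
print(centers.shape)                   # (10, 8, 8) after the reshape
for i, (axi, center) in enumerate(zip(ax.flat, centers)):
    print(i, center.shape)             # pairs 0..9, each center an (8, 8) image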
Question 2: How does the code assign labels?
The easiest thing might be to take the code apart and run bits of it on their own. For example, take a look at clusters == 0. This is a Boolean array. You can use Boolean arrays to index other arrays of the same shape. The first line of code in the loop assigns this array to mask so we can use it.
Then we index into labels using the Boolean array (try it!) to say, "Change these values to the mode (the most common value) of the corresponding elements of the label vector, i.e. digits.target." The index [0] is just needed because of what the scipy.stats.mode() function returns (again, try it out).
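For example, a sketch assuming you have run the question's code (the exact return type of mode depends on your SciPy version):

mask = (clusters == 0)               # Boolean array: True where a sample landed in cluster 0
print(mask.shape, mask.dtype)        # same length as clusters, dtype bool
true_digits = digits.target[mask]    # the true labels of the samples in cluster 0
print(true_digits[:20])              # mostly one digit, plus a few strays
print(mode(true_digits))             # a ModeResult; indexing with [0] extracts the mode itself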
I have a problem: I want to cluster my dataset, but the centroids end up outside of the clusters rather than inside them. I have already read Python k-mean, centroids are placed outside of the clusters about this. However, I do not know what the reason could be. How can I cluster correctly?
You can find the dataset at https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster import hierarchy
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import silhouette_samples
import matplotlib as mpl
import matplotlib.pyplot as plt
df = pd.read_csv(r'https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset')
df.shape
features_clustering = ['review_scores_accuracy',
'distance_to_center',
'bedrooms',
'review_scores_location',
'review_scores_value',
'number_of_reviews',
'beds',
'review_scores_communication',
'accommodates',
'review_scores_checkin',
'amenities_count',
'review_scores_rating',
'reviews_per_month',
'corrected_price']
df_cluster = df[features_clustering].copy()
X = df_cluster.copy()
model = KMeans(n_clusters=4, random_state=53, n_init=10, max_iter=1000, tol=0.0001)
clusters = model.fit_predict(X)
df_cluster["cluster"] = clusters
fig = plt.figure(figsize=(8, 8))
sns.scatterplot(data=df_cluster, x="amenities_count", y="corrected_price", hue="cluster", palette='Set2_r')
sns.scatterplot(x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1], color='blue',marker='*',
label='centroid', s=250)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#plt.ylim(ymin=0)
plt.xlim(xmin=-0.1)
plt.show()
model.cluster_centers_
inertia = model.inertia_
sil = metrics.silhouette_score(X,model.labels_)
print(f'inertia {inertia:.3f}')
print(f'silhouette {sil:.3f}')
[OUT]
inertia 4490.076
silhouette 0.156
The answer to your main question: the cluster centers are not outside of your clusters.
1. You are clustering over the 14 features in the features_clustering list.
2. You are viewing the clusters over a two-dimensional space, arbitrarily choosing amenities_count and corrected_price for the data, while the two coordinates you chose for the cluster centers, x=model.cluster_centers_[:,0] and y=model.cluster_centers_[:,1], don't correspond to those same features.
For these reasons you are going to get strange results; they really don't mean anything.
The bottom line is that you cannot view 14-dimensional clustering in two dimensions.
To show point 2 more clearly, change the line that plots the cluster centers to
sns.scatterplot(x=model.cluster_centers_[:,10], y=model.cluster_centers_[:,13], color='blue', marker='*', label='centroid', s=250)
so that the cluster centers are plotted against the same features as the data (indices 10 and 13 are amenities_count and corrected_price in features_clustering).
The linked SO answer about cluster centers being outside of the cluster data is about scaling the data to [0, 1] before clustering and then not scaling the cluster centers back up when plotting against the real data. That is not the same as your issue here.
You are making multidimensional clusters and want them to fit on a two-dimensional map; by itself that will not work. Let me explain: each variable is a dimension, x1, x2, x3, ..., xn, and clustering gives you cluster centers with the same coordinates, y1, y2, y3, ..., yn. If you map the result in 2D as you are doing (taking your example, where the plotted data variables are amenities_count and corrected_price), the plot shows only those two variables of the data, while your code takes the first two coordinates of each cluster center, y1 and y2, to plot. Note that y1 and y2 have no direct relationship with the features you plotted.
You must either: 1) do a conversion to find each cluster center's coordinates for the two plotted features, or 2) reduce the dimensionality of the data so that a 2D map carries information from all the variables.
For the first case, I am not very sure, because I have never done it (remapping the data).
But for dimensionality reduction, I recommend t-SNE (https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) or the classic PCA.
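For example, a minimal PCA sketch (assuming X, clusters, and model from the question's code; note that with unscaled features the projection will be dominated by the largest-variance columns):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                          # project the data to 2D
centers_2d = pca.transform(model.cluster_centers_)   # project the centers with the same mapping

plt.figure(figsize=(8, 8))
sns.scatterplot(x=X_2d[:, 0], y=X_2d[:, 1], hue=clusters, palette='Set2_r')
plt.scatter(centers_2d[:, 0], centers_2d[:, 1], color='blue', marker='*', s=250, label='centroid')
plt.legend()
plt.show()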
Moral: if you want to see a 2D cluster, make sure you only have 2 variables.
There are two outputs to numpy.histogram:
hist: the values of the histogram
bin_edges: the bin edges, of length len(hist) + 1
Both are vectors, but in the example below the second vector has length 101, which is 1 more than the first vector's length of 100:
import numpy as np
from numpy.random import rand, randn
n = 100 # number of bins
X = randn(n)*.1
a,bins1 = np.histogram(X,bins=n)
The following shape error occurs if I then try plt.plot(bins1,a):
ValueError: x and y must have same first dimension, but have shapes (101,) and (100,)
Why, and how do I fix the unequal-shape error so I can plot the histogram?
The unequal shapes occur because bin_edges, as the name implies, specifies the bin edges. Since every bin has a left and a right edge, bin_edges has length len(hist) + 1.
As already noted in the comments, an appropriate way to plot this is plt.hist.
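If you do want a line plot rather than a bar histogram, another option is to plot the histogram values against the bin centers (the midpoints of adjacent edges), which makes the two arrays the same length. A sketch, assuming a and bins1 from the question:

import matplotlib.pyplot as plt

bin_centers = (bins1[:-1] + bins1[1:]) / 2   # length 100, matching a
plt.plot(bin_centers, a)
plt.show()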
I had this question as well because I wanted to transform my data before computing a histogram but display the results un-transformed (e.g. just keep the autogenerated bin edges). The other answers here got you most of the way, but what I found useful was to do something like this:
h, bin_edges = np.histogram(np.log(X), bins=100)
plt.hist(X, bins=np.exp(bin_edges))
Of course, you could do this manually by choosing your bin edges up front and passing them to plt.hist without using np.histogram. But this was nice because the automated calculations simplified things for me.
I'm trying to use the fastKDE package (https://pypi.python.org/pypi/fastkde/1.0.8) to find the KDE of a point in a 2D plot. However, I want to know the KDE beyond the limits of the data points, and cannot figure out how to do this.
Using the code listed on the site linked above:
#!python
import numpy as np
from fastkde import fastKDE
import pylab as PP
#Generate a dataset of two random variables (representing 200,000 pairs of datapoints)
N = int(2e5)  # size must be an integer for np.random.normal
var1 = 50*np.random.normal(size=N) + 0.1
var2 = 0.01*np.random.normal(size=N) - 300
#Do the self-consistent density estimate
myPDF,axes = fastKDE.pdf(var1,var2)
#Extract the axes from the axis list
v1,v2 = axes
#Plot contours of the PDF; these should be a set of concentric ellipsoids
#centered on (0.1, -300). Comparatively, the y axis range should be tiny
#and the x axis range should be large
PP.contour(v1,v2,myPDF)
PP.show()
I'm able to find the KDE for any point within the limits of the data, but how do I find the KDE for, say, the point (0, 300), without having to include it in var1 and var2? I don't want the KDE to be calculated with this data point; I want to know the KDE at that point.
I guess what I really want to be able to do is give the fastKDE a histogram of the data, so that I can set its axes myself. I just don't know if this is possible?
Cheers
I, too, have been experimenting with this code and have run into the same issues. What I've done (in lieu of a good N-D extrapolator) is to build a KDTree (with scipy.spatial) from the grid points that fastKDE returns and find the grid point nearest to the point I want to evaluate. I then look up the corresponding PDF value at that grid point (it should be small near the edge of the PDF grid, if not identically zero) and assign that value accordingly.
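A rough sketch of that lookup, assuming myPDF, v1, and v2 from the question's code (the variable names are illustrative):

import numpy as np
from scipy.spatial import cKDTree

# Build the full grid of (x, y) points that fastKDE evaluated the PDF on;
# myPDF has shape (len(v2), len(v1)), matching the contour call above.
xx, yy = np.meshgrid(v1, v2)
tree = cKDTree(np.column_stack([xx.ravel(), yy.ravel()]))

# Find the grid point nearest the query point and read off its PDF value.
_, nearest = tree.query(np.array([0.0, 300.0]))
pdf_at_query = myPDF.ravel()[nearest]   # small (or exactly zero) at the grid edge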
I came across this post while searching for a solution to this problem. Similar to building a KDTree, you could just calculate your step size in every grid dimension, and then get the index of your query point by subtracting the start of the axis from the point value, dividing by the step size of that dimension, and finally rounding off and converting to an integer, and voilà. So, for example, in 1D:
def fastkde_test(test_x, num_p=2**10 + 1):  # num_p was undefined in the original; fastKDE expects a 2**n + 1 point count
    kde, axes = fastKDE.pdf(test_x, numPoints=num_p)
    x_step = (max(axes) - min(axes)) / (len(axes) - 1)  # grid spacing: len(axes) points span the range
    x_ind = np.int32(np.round((test_x - min(axes)) / x_step))
    return kde[x_ind]
where test_x in this case is both the set used to define the KDE and the query set. Doing it this way was faster by roughly a factor of 10 in my case (at least in 1D; I have not yet tested higher dimensions) and does basically the same thing as the KDTree query.
I hope this helps anyone coming across this problem in the future, as I just did.
Edit: if you're querying points outside of the range over which the KDE was calculated, this method can of course only give you the same result as the KDTree query, namely the value at the corresponding border of your KDE grid. You would, however, have to handle this yourself by clamping x_ind to the highest valid index, i.e. len(axes) - 1.
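For example, a one-line sketch, applied to x_ind inside the function above:

x_ind = np.clip(x_ind, 0, len(axes) - 1)   # keep out-of-range queries on the grid edge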
Suppose I have a matrix where the first column contains all the x points, the second column all the y points, and the third and fourth are indicator variables telling whether a point belongs to a particular 'cluster' (each entry is either 1 or 0; so if column 3 has a 1 in the third row, the point in the third row belongs to, say, cluster 1, which is represented by column 3).
My question is: how do I create a figure, scatter-plot all the points belonging to cluster 1, and then, on the same plot, scatter the remaining points in another color? In MATLAB, I would just say figure, then hold on, and write out my commands. I am new to plotting in Python and not sure how this would be done.
EDIT:
I think I made it work. How would I, however, change the marker size depending on which cluster a point belongs to?
Let's start with how we'd do this in MATLAB.
Supposing you have N unique clusters, you can simply loop through as many clusters as you have and plot the points in a different colour. Also, we can change the marker size at each iteration. You'll need to use logical indexing to extract out the points that belong to each cluster. Given that your matrix is stored in M, something like this comes to mind:
rng(123); %// Set random seed

%// Total number of clusters
N = max(M(:,3));

%// Create a colour map
cmap = rand(N,3);

%// Store point sizes per cluster
sizes = [10 14 18];

figure; hold on; %// Create a blank figure and hold for changes
for ii = 1 : N
    %// Determine those points belonging to the ith cluster
    ind = M(:,3) == ii;

    %// Get the x and y coordinates
    x = M(ind,1);
    y = M(ind,2);

    %// Plot the points in a different colour
    plot(x,y,'.','Color', cmap(ii,:), 'MarkerSize', sizes(ii));
end

%// Create labels
labels = sprintfc('Label %d', 1:N);

%// Make our legend
legend(labels{:});
The code is pretty self-explanatory: you define your matrix M, and we determine the total number of clusters by taking the max of the third column. Next we create a random colour map with as many rows as there are clusters and three columns giving a unique RGB colour per cluster. Each row defines the colour we'll use when plotting that cluster.
Next we create an array sizes that stores the marker size for each cluster. We create a blank figure, hold it for the changes we make to the plot, then iterate over each cluster of points. For each cluster, we figure out the right points in M to extract through logical indexing, extract the x and y coordinates for those points, then plot these points on the figure, manually specifying the colour as an RGB tuple as well as the desired marker size.
We then create a cell array of labels that denote which set of points each cluster belongs to, then show a legend illustrating which points belong to which clusters given this array of labels.
Generating random data with random labels, where we have 20 points uniformly distributed between [0,1] for both x and y and generating a random set of up to three labels:
rng(123);
M = [rand(20,2) randi(3,20,1)];
I get this plot when I run the above code:
To get the equivalent in Python, well that's pretty easy. It's just a transcription from MATLAB to Python and the plotting mechanisms are exactly the same. You're using matplotlib and so I'm assuming numpy can be used as it's a dependency.
As such, the equivalent code would look something like this:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)

# Total number of clusters (M is assumed to be defined, e.g. as at the end of this answer)
N = int(np.max(M[:,2]))

# Create a colour map
cmap = np.random.rand(N, 3)

# Store point sizes per cluster
sizes = np.array([10, 14, 18])

plt.figure()  # Create blank figure. No need to hold on
for ii in range(N):
    # Determine those points belonging to the ith cluster
    ind = M[:,2] == (ii+1)

    # Get the x and y coordinates
    x = M[ind,0]
    y = M[ind,1]

    # Plot the points in a different colour
    # Also add in labels for the legend
    plt.plot(x, y, '.', color=tuple(cmap[ii]), markersize=sizes[ii],
             label='Cluster #' + str(ii+1))

# Make our legend
plt.legend()

# Show the image
plt.show()
I won't bother explaining this one because it's pretty much the same as the MATLAB code. There are some nuances, such as the way hold on works in matplotlib: you don't need hold on because any changes you make to the figure will be remembered until you decide to show it. You also have the nuance that numpy and Python start indexing at 0 instead of 1.
Using the same data-generation code as in MATLAB:
M = np.column_stack([np.random.rand(20,2), np.random.randint(1,4,size=(20,1))])
I get this figure:
My data is 2250 x 100. I would like to plot the output, like http://glowingpython.blogspot.com/2012/04/k-means-clustering-with-scipy.html. However, the problem is that all the examples use only a small number of clusters, usually 2 or 3. How would you plot the output of kmeans in scipy if you wanted more clusters, like 100?
Here's what I got:
from scipy.cluster.vq import kmeans, vq  # kmeans and vq come from scipy.cluster.vq

#get the centroids
centroids,_ = kmeans(data,100)
idx,_ = vq(data,centroids)

#do some plotting here...
Maybe with 10 colors and 10 point types?
Or you could plot each in a 10 x 10 grid of plots. The first would show the relationships better. The second would allow easier inspection of an arbitrary cluster.
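A sketch of the first suggestion, assuming data and idx from the code above and plotting only the first two of the 100 features (so it is a projection, with the usual caveats about viewing high-dimensional clusters in 2D):

import numpy as np
import matplotlib.pyplot as plt

colors = plt.cm.tab10(np.arange(10))                           # 10 distinct colours
markers = ['o', 's', '^', 'v', 'D', 'P', 'X', '*', '<', '>']   # 10 marker shapes

plt.figure(figsize=(10, 10))
for k in range(100):
    pts = data[idx == k]
    # the colour cycles within each block of 10 clusters; the marker changes between blocks
    plt.scatter(pts[:, 0], pts[:, 1], color=colors[k % 10],
                marker=markers[k // 10], s=10)
plt.show()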