How to use DBSCAN method from sklearn for clustering


I have a dataset with three features that I want to cluster. With KMeans from sklearn I can easily plot the result, like this (val is my data, with shape (3000, 3)):
y_pred = KMeans(n_clusters= 4 , random_state=0).fit_predict(val)
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1,projection='3d')
ax1.scatter(val[:, 0], val[:, 1], val[:, 2], c=y_pred)
plt.show()
However, with DBSCAN, this is all I have so far:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
val = StandardScaler().fit_transform(val)
db = DBSCAN(eps=3, min_samples=4).fit(val)
labels = db.labels_
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[db.core_sample_indices_] =True
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
So how do I plot the DBSCAN result, just like with KMeans?

You can reuse the plotting code from your KMeans model. All you need to do is re-assign val and y_pred so that the noise points (label -1) are left out.
# DBSCAN snippet from the question
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
val = StandardScaler().fit_transform(val)
db = DBSCAN(eps=3, min_samples=4).fit(val)
labels = db.labels_
# re-assign y_pred and core (as val)
y_pred, core = labels[labels != -1], val[labels != -1]
# plotting snippet from the question
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1,projection='3d')
ax1.scatter(core[:, 0], core[:, 1], core[:, 2], c=y_pred)
plt.show()
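If you would rather keep the noise points visible instead of dropping them, one option is to draw them in a neutral colour. The following is only a sketch under stated assumptions: the data comes from make_blobs as a stand-in for the question's val, and the eps and min_samples values are illustrative, not tuned.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line for an interactive window
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in for the question's `val`: 3000 points with 3 features (assumed data).
val, _ = make_blobs(n_samples=3000, n_features=3, centers=4, random_state=0)
val = StandardScaler().fit_transform(val)

# eps and min_samples are illustrative values; tune them for real data.
labels = DBSCAN(eps=0.3, min_samples=4).fit_predict(val)
noise = labels == -1  # DBSCAN marks noise points with the label -1

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection="3d")
# Clustered points, coloured by their cluster label.
ax.scatter(val[~noise, 0], val[~noise, 1], val[~noise, 2], c=labels[~noise], s=10)
# Noise points in grey so they stay visible but distinct.
ax.scatter(val[noise, 0], val[noise, 1], val[noise, 2],
           c="lightgrey", marker="x", s=10, label="noise")
ax.legend()

n_clusters = len(set(labels)) - (1 if noise.any() else 0)
n_noise = int(noise.sum())
```

Call plt.show() (without the Agg line) or fig.savefig(...) to view the figure.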

Related

Clustering between two sets of data points - Python

I'm hoping to use k-means clustering to plot and return the position of each cluster's centroid. The following groups two sets of xy scatter points into 6 clusters.
Using the df below, the coordinates in columns A, B and in columns C, D are plotted as two sets of scatter points. I'm hoping to plot and return the centroid of each cluster.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
df = pd.DataFrame(np.random.randint(-50,50,size=(100, 4)), columns=list('ABCD'))
fig, ax = plt.subplots()
Y_sklearn = df[['A','B','C','D']].values
model = KMeans(n_clusters = 4)
model.fit(Y_sklearn)
plt.scatter(Y_sklearn[:,0],Y_sklearn[:,1], c = model.labels_);
plt.scatter(Y_sklearn[:,2],Y_sklearn[:,3], c = model.labels_);
plt.show()

Solution
When you plot a KMeans prediction and the data has more than two features, a 2D scatterplot can only use two of them (in your case, say, columns A and B) as the x and y coordinates. A better way to represent higher-dimensional data on a 2D plane is some form of dimensionality reduction, such as PCA. However, to keep the scope of this answer manageable, I only use the first two columns of X_train or X_test below and do NOT use PCA to find the two most important dimensions.
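For readers who do want the PCA route mentioned above, a minimal sketch could look like the following. It is not part of the original answer: the data is generated with make_blobs purely for illustration, and the same fitted PCA is applied to both the points and the KMeans cluster centers so that they land in the same 2D space.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line for an interactive window
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Illustrative 4-feature data (assumption, not the asker's dataframe).
X, _ = make_blobs(n_samples=500, n_features=4, centers=5, random_state=42)

# Cluster in the original 4-dimensional space.
model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# Fit PCA once, then project both the data and the cluster centers.
pca = PCA(n_components=2, random_state=42).fit(X)
X_2d = pca.transform(X)
centers_2d = pca.transform(model.cluster_centers_)

fig, ax = plt.subplots()
ax.scatter(X_2d[:, 0], X_2d[:, 1], c=model.labels_, cmap="viridis", alpha=0.5)
ax.scatter(centers_2d[:, 0], centers_2d[:, 1],
           marker="s", color="orange", s=100, label="pred")
ax.legend(title="Cluster Centers")
```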
I tried writing this answer so that anyone could start from zero experience and still follow along the code and run it to see what it does. Yes, it is long, and hence I have broken it down into multiple sections, so you could skip them if needed.
⭐⭐⭐ Jump to Section G to see the code used to make the plots.
👉 👉 👉 Section A gives a summary and is useful if you are just interested in the code to add the cluster-centers to your scatterplot.
List of Sections
A. Identification of Clusters in Data using KMeans Method
B. Import Libraries
C. Dummy Data
D. Custom Functions
E. Calculate True Cluster Centers
F. Define, Fit and Predict using KMeans Model
F.1. Predict for y_train using X_train
F.2. Predict for y_test using X_test
G. Make Figure with train, test and prediction data
References
A. Identification of Clusters in Data using KMeans Method
We will use sklearn.cluster.KMeans to identify the clusters. The attribute model.cluster_centers_ will give us the predicted cluster centers. Say, we want to find out 5 clusters in our training data, X_train with shape: (n_samples, n_features) and labels, y_train with shape: (n_samples,). The following code block fits the model to the data (X_train) and then predicts y and saves the prediction in y_pred_train variable.
# Define model
model = KMeans(n_clusters = 5)
# Fit model to training data
model.fit(X_train)
# Make prediction on training data
y_pred_train = model.predict(X_train)
# Get predicted cluster centers
model.cluster_centers_ # shape: (n_cluster, n_features)
## Displaying cluster centers on a plot
# if you just want to add cluster centers
# to your existing scatter-plot,
# just do this --->>
cluster_centers = model.cluster_centers_
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1],
marker='s', color='orange', s = 100,
alpha=0.5, label='pred')
Jump to Section G to see the code used to make the plots.
B. Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import pprint
%matplotlib inline
%config InlineBackend.figure_format = 'svg' # 'svg', 'retina'
plt.style.use('seaborn-v0_8-white') # this style was named 'seaborn-white' before matplotlib 3.6
C. Dummy Data
We will use data generated in the following code-block. By design we create a dataset with 5 clusters and the following specifications. And then split the data into train and test blocks using sklearn.model_selection.train_test_split.
## Creating data with
# n_samples = 2500
# n_features = 4
# Expected clusters = 5
# centers = 5
# cluster_std = [1.0, 2.5, 0.5, 1.5, 2.0]
NUM_SAMPLES = 2500
RANDOM_STATE = 42
NUM_FEATURES = 4
NUM_CLUSTERS = 5
CLUSTER_STD = [1.0, 2.5, 0.5, 1.5, 2.0]
TEST_SIZE = 0.20
def dummy_data():
    ## Creating data with
    # n_samples = 2500
    # n_features = 4
    # Expected clusters = 5
    # centers = 5
    # cluster_std = [1.0, 2.5, 0.5, 1.5, 2.0]
    X, y = make_blobs(
        n_samples = NUM_SAMPLES,
        random_state = RANDOM_STATE,
        n_features = NUM_FEATURES,
        centers = NUM_CLUSTERS,
        cluster_std = CLUSTER_STD
    )
    return X, y
def test_dummy_data(X, y):
    assert X.shape == (NUM_SAMPLES, NUM_FEATURES), "Shape mismatch for X"
    assert set(y) == set(np.arange(NUM_CLUSTERS)), "NUM_CLUSTER mismatch for y"
## Create Dummy Data
X, y = dummy_data()
test_dummy_data(X, y)
## Create train-test-split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
D. Custom Functions
We will use the following 3 custom defined functions:
get_cluster_centers()
scatterplot()
add_cluster_centers()
def get_cluster_centers(X, y, num_clusters=None):
    """Returns the cluster-centers as numpy.array of
    shape: (num_cluster, num_features).
    """
    num_clusters = NUM_CLUSTERS if (num_clusters is None) else num_clusters
    # use the num_clusters argument (not the global) so the parameter takes effect
    return np.stack([X[y==i].mean(axis=0) for i in range(num_clusters)])
def scatterplot(X, y,
                cluster_centers=None,
                alpha=0.5,
                cmap='viridis',
                legend_title="Classes",
                legend_loc="upper left",
                ax=None):
    # fall back to the current axes so that ax.legend() below also works when ax is None
    if ax is None:
        ax = plt.gca()
    plt.sca(ax)
    scatter = plt.scatter(X[:, 0], X[:, 1],
                          s=None, c=y, alpha=alpha, cmap=cmap)
    legend = ax.legend(*scatter.legend_elements(),
                       loc=legend_loc, title=legend_title)
    ax.add_artist(legend)
    if cluster_centers is not None:
        plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1],
                    marker='o', color='red', alpha=1.0)
    return ax
def add_cluster_centers(true_cluster_centers=None,
                        pred_cluster_centers=None,
                        markers=('o', 's'),
                        colors=('red', 'orange'),
                        s = (None, 200),
                        alphas = (1.0, 0.5),
                        center_labels = ('true', 'pred'),
                        legend_title = "Cluster Centers",
                        legend_loc = "upper right",
                        ax = None):
    # fall back to the current axes so that ax.legend() below also works when ax is None
    if ax is None:
        ax = plt.gca()
    plt.sca(ax)
    for idx, cluster_centers in enumerate([true_cluster_centers,
                                           pred_cluster_centers]):
        if cluster_centers is not None:
            scatter = plt.scatter(
                cluster_centers[:, 0], cluster_centers[:, 1],
                marker = markers[idx],
                color = colors[idx],
                s = s[idx],
                alpha = alphas[idx],
                label = center_labels[idx]
            )
    legend = ax.legend(loc=legend_loc, title=legend_title)
    ax.add_artist(legend)
    return ax
E. Calculate True Cluster Centers
We will calculate the true cluster centers for train and test datasets and save the results to a dict: true_cluster_centers.
true_cluster_centers = {
'train': get_cluster_centers(X = X_train, y = y_train, num_clusters = NUM_CLUSTERS),
'test': get_cluster_centers(X = X_test, y = y_test, num_clusters = NUM_CLUSTERS)
}
# Show result
pprint.pprint(true_cluster_centers, indent=2)
Output:
{ 'test': array([[-2.44425795, 9.06004013, 4.7765817 , 2.02559904],
[-6.68967507, -7.09292101, -8.90860337, 7.16545582],
[ 1.99527271, 4.11374524, -9.62610383, 9.32625443],
[ 6.46362854, -5.90122349, -6.2972843 , -6.04963714],
[-4.07799392, 0.61599582, -1.82653858, -4.34758032]]),
'train': array([[-2.49685525, 9.08826 , 4.64928719, 2.01326914],
[-6.82913109, -6.86790673, -8.99780554, 7.39449295],
[ 2.04443863, 4.12623661, -9.64146529, 9.39444917],
[ 6.74707792, -5.83405806, -6.3480674 , -6.37184345],
[-3.98420601, 0.45335025, -1.23919526, -3.98642807]])}
F. Define, Fit and Predict using KMeans Model
model = KMeans(n_clusters = NUM_CLUSTERS, random_state = RANDOM_STATE)
model.fit(X_train)
## Output
# KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
# n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
# random_state=42, tol=0.0001, verbose=0)
F.1. Predict for y_train using X_train
## Process Prediction: train data
y_pred_train = model.predict(X_train)
# get model predicted cluster-centers
pred_train_cluster_centers = model.cluster_centers_ # shape: (n_cluster, n_features)
# sanity check
assert all([
y_pred_train.shape == (NUM_SAMPLES * (1 - TEST_SIZE),),
set(y_pred_train) == set(y_train)
])
F.2. Predict for y_test using X_test
## Process Prediction: test data
y_pred_test = model.predict(X_test)
# get model predicted cluster-centers
pred_test_cluster_centers = model.cluster_centers_ # shape: (n_cluster, n_features)
# sanity check
assert all([
y_pred_test.shape == (NUM_SAMPLES * TEST_SIZE,),
set(y_pred_test) == set(y_test)
])
G. Make Figure with train, test and prediction data
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))
FONTSIZE = {'title': 16, 'suptitle': 20}
TITLE = {
'train': 'Train Data Clusters',
'test': 'Test Data Clusters',
'suptitle': 'Cluster Identification using KMeans Method',
}
CENTER_LEGEND_LABELS = ('true', 'pred')
LEGEND_PARAMS = {
    'data': {'title': "Classes", 'loc': "upper left"},
    'cluster_centers': {'title': "Cluster Centers", 'loc': "upper right"}
}
SCATTER_ALPHA = 0.4
CMAP = 'viridis'
CLUSTER_CENTER_PLOT_PARAMS = dict(
    markers = ('o', 's'),
    colors = ('red', 'orange'),
    s = (None, 200),
    alphas = (1.0, 0.5),
    center_labels = CENTER_LEGEND_LABELS,
    legend_title = LEGEND_PARAMS['cluster_centers']['title'],
    legend_loc = LEGEND_PARAMS['cluster_centers']['loc']
)
SCATTER_PLOT_PARAMS = dict(
    alpha = SCATTER_ALPHA,
    cmap = CMAP,
    legend_title = LEGEND_PARAMS['data']['title'],
    legend_loc = LEGEND_PARAMS['data']['loc'],
)
## plot train data
data_label = 'train'
ax = axs[0]
plt.sca(ax)
ax = scatterplot(X = X_train, y = y_train,
cluster_centers = None,
ax = ax, **SCATTER_PLOT_PARAMS)
ax = add_cluster_centers(
true_cluster_centers = true_cluster_centers[data_label],
pred_cluster_centers = pred_train_cluster_centers,
ax = ax, **CLUSTER_CENTER_PLOT_PARAMS)
plt.title(TITLE[data_label], fontsize = FONTSIZE['title'])
## plot test data
data_label = 'test'
ax = axs[1]
plt.sca(ax)
ax = scatterplot(X = X_test, y = y_test,
cluster_centers = None,
ax = ax, **SCATTER_PLOT_PARAMS)
ax = add_cluster_centers(
true_cluster_centers = true_cluster_centers[data_label],
pred_cluster_centers = pred_test_cluster_centers,
ax = ax, **CLUSTER_CENTER_PLOT_PARAMS)
plt.title(TITLE[data_label], fontsize = FONTSIZE['title'])
plt.suptitle(TITLE['suptitle'],
fontsize = FONTSIZE['suptitle'])
plt.show()
# save figure
fig.savefig("kmeans_fit_result.png", dpi=300)
Result:
References
Documentation: sklearn.cluster.KMeans
Documentation: sklearn.model_selection.train_test_split
Documentation: matplotlib.pyplot.legend
Documentation: sklearn.decomposition.PCA
Managing legend in scatterplot using matplotlib
Demo of KMeans Assumptions

Based on how you make the scatter plot, I guess A and B are the xy coordinates of the first set of points, while C and D are the xy coordinates of the second set. If so, you should not apply KMeans to the four-column dataframe directly, because each point set has only two features (its x and y coordinates); fit one model per point set instead. Finding the centroids is then quite simple: they are stored in the fitted model's cluster_centers_ attribute, for example model_zero.cluster_centers_.
Let's first construct a dataframe that will be better for visualization
import numpy as np
# set the seed for reproducible datasets
np.random.seed(365)
# cov matrix of a 2d gaussian
stds = np.eye(2)
# four cluster means
means_zero = np.random.randint(10,20,(4,2))
sizes_zero = np.array([20,30,15,35])
# four cluster means
means_one = np.random.randint(0,10,(4,2))
sizes_one = np.array([20,20,25,35])
points_zero = np.vstack([np.random.multivariate_normal(mean,stds,size=(size)) for mean,size in zip(means_zero,sizes_zero)])
points_one = np.vstack([np.random.multivariate_normal(mean,stds,size=(size)) for mean,size in zip(means_one,sizes_one)])
all_points = np.hstack((points_zero,points_one))
As you can see, the four clusters are constructed by sampling points from four Gaussians with different means. With this dataframe, here is how you can plot it
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects
from sklearn.cluster import KMeans
df = pd.DataFrame(all_points, columns=list('ABCD'))
fig, ax = plt.subplots(figsize=(10,8))
scatter_zero = df[['A','B']].values
scatter_one = df[['C','D']].values
model_zero = KMeans(n_clusters=4)
model_zero.fit(scatter_zero)
model_one = KMeans(n_clusters=4)
model_one.fit(scatter_one)
plt.scatter(scatter_zero[:,0],scatter_zero[:,1],c=model_zero.labels_,cmap='bwr');
plt.scatter(scatter_one[:,0],scatter_one[:,1],c=model_one.labels_,cmap='bwr');
# plot the cluster centers
txts = []
for ind,pos in enumerate(model_zero.cluster_centers_):
    txt = ax.text(pos[0],pos[1],
                  'cluster %i \n (%.1f,%.1f)' % (ind,pos[0],pos[1]),
                  fontsize=12,zorder=100)
    txt.set_path_effects([PathEffects.Stroke(linewidth=5, foreground="aquamarine"),PathEffects.Normal()])
    txts.append(txt)
for ind,pos in enumerate(model_one.cluster_centers_):
    txt = ax.text(pos[0],pos[1],
                  'cluster %i \n (%.1f,%.1f)' % (ind,pos[0],pos[1]),
                  fontsize=12,zorder=100)
    txt.set_path_effects([PathEffects.Stroke(linewidth=5, foreground="lime"),PathEffects.Normal()])
    txts.append(txt)
zero_mean = np.mean(model_zero.cluster_centers_,axis=0)
one_mean = np.mean(model_one.cluster_centers_,axis=0)
txt = ax.text(zero_mean[0],zero_mean[1],
'point set zero',
fontsize=15)
txt.set_path_effects([PathEffects.Stroke(linewidth=5, foreground="violet"),PathEffects.Normal()])
txts.append(txt)
txt = ax.text(one_mean[0],one_mean[1],
'point set one',
fontsize=15)
txt.set_path_effects([PathEffects.Stroke(linewidth=5, foreground="violet"),PathEffects.Normal()])
txts.append(txt)
plt.show()
Running this code, you will get

3D plotting of a dataset that uses K-means

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3) # Number of clusters == 3
kmeans = kmeans.fit(X) # Fitting the input data
labels = kmeans.predict(X) # Getting the cluster labels
centroids = kmeans.cluster_centers_ # Centroid values
print("Centroids are:", centroids) # From sci-kit learn
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(labels==0)
y = np.array(labels==1)
z = np.array(labels==2)
ax.scatter(x,y,z, marker="s"[kmeans.labels_], s=40, cmap="RdBu")
I am trying to plot the clusters in 3D, colouring each point by its cluster label, and to plot the centroids using a separate symbol. I managed to get KMeans working, at least I believe I did, but I'm stuck trying to plot it in 3D. I believe there is a simple solution that I'm just not seeing. Does anyone have any idea what I need to change in my code to achieve this?

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from mpl_toolkits.mplot3d import Axes3D
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3) # Number of clusters == 3
kmeans = kmeans.fit(X) # Fitting the input data
labels = kmeans.predict(X) # Getting the cluster labels
centroids = kmeans.cluster_centers_ # Centroid values
# print("Centroids are:", centroids) # From sci-kit learn
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(projection='3d') # recent matplotlib no longer accepts fig.gca(projection='3d')
x = np.array(labels==0)
y = np.array(labels==1)
z = np.array(labels==2)
ax.scatter(centroids[:,0],centroids[:,1],centroids[:,2],c="black",s=150,label="Centers",alpha=1)
ax.scatter(X[x,0],X[x,1],X[x,2],c="blue",s=40,label="C1")
ax.scatter(X[y,0],X[y,1],X[y,2],c="yellow",s=40,label="C2")
ax.scatter(X[z,0],X[z,1],X[z,2],c="red",s=40,label="C3")
ax.legend() # show the labels defined above
plt.show()

Try this; the cluster centers are now drawn as black X markers:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3) # Number of clusters == 3
kmeans = kmeans.fit(X) # Fitting the input data
labels = kmeans.predict(X) # Getting the cluster labels
centroids = kmeans.cluster_centers_ # Centroid values
print("Centroids are:", centroids) # From sci-kit learn
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(labels==0)
y = np.array(labels==1)
z = np.array(labels==2)
ax.scatter(X[x][:, 0], X[x][:, 1], X[x][:, 2], color='red')
ax.scatter(X[y][:, 0], X[y][:, 1], X[y][:, 2], color='blue')
ax.scatter(X[z][:, 0], X[z][:, 1], X[z][:, 2], color='yellow')
ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2],
marker='x', s=169, linewidths=10,
color='black', zorder=50)
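The three hand-written masks above can also be replaced by a loop over the cluster labels, which keeps working if n_clusters changes. This is a sketch of that variation, not a different method; the colours are left to matplotlib's default cycle and the fixed random_state is an assumption added for reproducibility.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_swiss_roll

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection="3d")

# One scatter call per cluster, so each cluster gets its own colour and legend entry.
for k in np.unique(kmeans.labels_):
    mask = kmeans.labels_ == k
    ax.scatter(X[mask, 0], X[mask, 1], X[mask, 2], s=40, label=f"C{k + 1}")

# Centroids with a separate marker; .T unpacks the columns into xs, ys, zs.
ax.scatter(*kmeans.cluster_centers_.T, marker="x", s=169,
           color="black", label="Centers")
ax.legend()
```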

Visualizing Kmeans cluster after application of TSNE

k_model = KMeans(n_clusters = 3).fit(actor_w2vec)
cluster_dict = {i: np.where(k_model.labels_ == i)[0] for i in range(k_model.n_clusters)}
I have applied KMeans to word2vec vectors (3411 x 128). cluster_dict maps each cluster label (0, 1, 2) to the row indices of the vectors assigned to that cluster, so all the indices are distributed among the three clusters.
Now I want to visualize these clusters, so I used TSNE to reduce the 128-dimensional vectors to 2 dimensions:
node_embeddings = actor_w2vec
transform = TSNE #PCA
trans = transform(n_components=2)
node_embeddings_2d = trans.fit_transform(node_embeddings)
But I don't know how to combine these two to create a scatter plot in which all the points belonging to one cluster are shown together.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import plotly.express as px
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
#remember to scale your data if the ranges are too broad
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)
kmeans_model = KMeans(n_clusters=3, max_iter=500, random_state=42)
y_km = kmeans_model.fit_predict(scaled_features)
pca_model = PCA(n_components=2, random_state=42)
transformed = pca_model.fit_transform(scaled_features)
centers = pca_model.transform(kmeans_model.cluster_centers_)
fig = px.scatter(x=transformed[:, 0], y=transformed[:, 1], color=y_km)
fig.add_scatter(
    x=centers[:, 0],
    y=centers[:, 1],
    mode="markers", # draw the centers as points only, without connecting lines
    marker=dict(size=20, color="LightSeaGreen"), name="Centers"
)
fig.show()
If you only call kmeans_model.fit(...) instead of fit_predict, you can read the labels from kmeans_model.labels_.

import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns; sns.set()
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
plt.rcParams['figure.dpi'] = 150
# create dataset
X, y = make_blobs(
n_samples=150, n_features=2,
centers=3, cluster_std=0.5,
shuffle=True, random_state=0
)
# plot
plt.scatter(
X[:, 0], X[:, 1],
edgecolor='black', s=50
)
plt.show()
km = KMeans(
n_clusters=3, init='random',
n_init=10, max_iter=10000,
tol=1e-04, random_state=0
)
y_km = km.fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=y_km, s=50, cmap=plt.cm.Paired, alpha=0.4)
plt.scatter(km.cluster_centers_[:, 0],km.cluster_centers_[:, 1],
s=250, marker='*', label='centroids',
edgecolor='black',
c=np.arange(0,3),cmap=plt.cm.Paired,)
The commented-out line import seaborn as sns; sns.set() is not necessary; it just gives the plots a nicer style.
For plotting you can use matplotlib.pyplot. You can also check the shape of your data with node_embeddings_2d.shape to make sure that plt.scatter receives the right arguments.
Good luck! ;)
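To connect this back to the question's variable names, here is a minimal sketch that colours the t-SNE embedding by the KMeans labels. The random array is only a small stand-in for actor_w2vec (the real one is 3411 x 128), and the random_state values are assumptions added for reproducibility. Note that sklearn's TSNE has no transform method, so the cluster centers cannot be projected afterwards the way they can with PCA.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Stand-in for actor_w2vec; kept small so the sketch runs quickly.
rng = np.random.default_rng(0)
actor_w2vec = rng.normal(size=(300, 128))

# Cluster in the original 128-dimensional space.
k_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(actor_w2vec)

# Reduce to 2 dimensions for plotting only.
node_embeddings_2d = TSNE(n_components=2, random_state=0).fit_transform(actor_w2vec)

fig, ax = plt.subplots()
for i in range(k_model.n_clusters):
    # Same row indices that cluster_dict[i] holds in the question.
    idx = np.where(k_model.labels_ == i)[0]
    ax.scatter(node_embeddings_2d[idx, 0], node_embeddings_2d[idx, 1],
               s=10, label=f"cluster {i}")
ax.legend()
```

Row order is preserved by both KMeans and TSNE, which is why the same indices can be used to select points in the 2D embedding.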

Finding Accuracy for this K-Means model

This program predicts the cluster that a coordinate belongs to; it divides the given points into two clusters, 0 and 1.
How do I get the accuracy of this model for the variable prediction?
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
#from sklearn.metrics import accuracy_score
X = np.array([[1, 2],[5, 8],[1.5, 1.8],[8, 8],[6,7],[9, 11]])
print(X)
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print("Centroids :\n ",centroids)
print("Labels : ",labels)
colors = ["g.","r.","c.","y."]
for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)
plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
plt.show()
prediction=kmeans.predict ( [ [ 5,6 ] ] )
print(prediction)

If you know the correct values for the coordinates' labels, you can use scikit-learn's accuracy_score:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_true, y_pred))
This is tricky for a clustering problem, though: KMeans assigns arbitrary cluster ids, so a cluster labelled 0 need not correspond to your class 0. Think about how you would determine whether a prediction is correct or not and calculate the accuracy around that.
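One common way to handle the arbitrary cluster ids is to match clusters to classes before scoring. The sketch below assumes a hypothetical y_true (the question does not provide ground truth) and uses scipy's linear_sum_assignment to pick the cluster-to-class mapping with the largest overlap; adjusted_rand_score is shown as an alternative that needs no mapping at all.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix

X = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [6, 7], [9, 11]])
# Hypothetical ground-truth labels, invented for this illustration.
y_true = np.array([0, 1, 0, 1, 1, 1])

y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Rows are true classes, columns are cluster ids.
cm = confusion_matrix(y_true, y_pred)
# Negate the counts so that minimising cost maximises the matched samples.
true_idx, cluster_idx = linear_sum_assignment(-cm)
mapping = {int(c): int(t) for t, c in zip(true_idx, cluster_idx)}

# Relabel the clusters with their matched class, then score as usual.
y_pred_aligned = np.array([mapping[int(c)] for c in y_pred])
acc = accuracy_score(y_true, y_pred_aligned)

# Permutation-invariant alternative: no relabelling needed.
ari = adjusted_rand_score(y_true, y_pred)
print("accuracy after matching:", acc, "adjusted Rand index:", ari)
```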

Plot confusion matrix sklearn with multiple labels

I am plotting a confusion matrix for multi-label data, where the labels look like:
label1: 1, 0, 0, 0
label2: 0, 1, 0, 0
label3: 0, 0, 1, 0
label4: 0, 0, 0, 1
I am able to classify successfully using the code below. I only need some help plotting the confusion matrix.
for i in range(4):
    y_train= y[:,i]
    print('Train subject %d, class %s' % (subject, cols[i]))
    lr.fit(X_train[::sample,:],y_train[::sample])
    pred[:,i] = lr.predict_proba(X_test)[:,1]
I used the following code to print the confusion matrix, but it always returns a 2x2 matrix:
prediction = lr.predict(X_train)
print(confusion_matrix(y_train, prediction))

I found a function that plots a confusion matrix generated by sklearn.
import numpy as np
def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot
    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix
    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']
    title:        the text to display at the top of the matrix
    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  plt.get_cmap('jet') or plt.cm.Blues
    normalize:    If False, plot the raw numbers
                  If True, plot the proportions
    Usage
    -----
    plot_confusion_matrix(cm           = cm,                  # confusion matrix created by
                                                              # sklearn.metrics.confusion_matrix
                          normalize    = True,                # show proportions
                          target_names = y_labels_vals,       # list of names of the classes
                          title        = best_estimator_name) # title of graph
    Citation
    ---------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools
    # overall accuracy: correctly classified samples (diagonal) over all samples
    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy
    if cmap is None:
        cmap = plt.get_cmap('Blues')
    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)
    if normalize:
        # divide each row by its sum so the cells show per-class proportions
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    plt.show()
It will look like this

This works best for me:
from sklearn.metrics import multilabel_confusion_matrix
y_unique = y_test.unique()
mcm = multilabel_confusion_matrix(y_test, y_pred, labels = y_unique)
mcm

I see this is still an open issue in sklearn's repository:
https://github.com/scikit-learn/scikit-learn/issues/3452
However there have been some attempts at implementing it. From the same #3452 thread issue:
https://github.com/Magellanea/scikit-learn/commit/514287c1d5dad2f0ab4918dc4da5cf7053fe6734#diff-b04acd877dd793f28ae7be13a999ed88R187
You can check the code proposed in the function and see if that fits your needs.

from sklearn.metrics import multilabel_confusion_matrix
mul_c = multilabel_confusion_matrix(
test_Y,
pred_k,
labels=["benign", "dos","probe","r2l","u2r"])
mul_c

I found an easy solution with sklearn and seaborn libraries.
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
from matplotlib import pyplot as plt
import seaborn as sns
def plot_confusion_matrix(y_test,y_scores, classNames):
    # convert one-hot labels and per-class scores to class indices
    y_test=np.argmax(y_test, axis=1)
    y_scores=np.argmax(y_scores, axis=1)
    classes = len(classNames)
    cm = confusion_matrix(y_test, y_scores)
    print("**** Confusion Matrix ****")
    print(cm)
    print("**** Classification Report ****")
    print(classification_report(y_test, y_scores, target_names=classNames))
    # normalize each row so the cells show per-class proportions
    con = np.zeros((classes,classes))
    for x in range(classes):
        for y in range(classes):
            con[x,y] = cm[x,y]/np.sum(cm[x,:])
    plt.figure(figsize=(40,40))
    sns.set(font_scale=3.0) # for label size
    df = sns.heatmap(con, annot=True,fmt='.2', cmap='Blues',xticklabels= classNames , yticklabels= classNames)
    df.figure.savefig("image2.png")
classNames = ['A', 'B', 'C', 'D', 'E']
plot_confusion_matrix(y_test,y_scores, classNames)
#y_test is your ground truth
#y_scores is your predicted probabilities

Just use pandas with gradient coloring:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)
cm = pd.DataFrame(data=cm, columns = np.unique(y_true), index = np.unique(y_true))
cm = (cm / cm.sum(axis = 1).values.reshape(-1,1)) # to fractions of 1
cm.style.background_gradient().format(precision=2)
By now pandas has nice options for table formatting and decoration.
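A note on why the question's code returns a 2x2 matrix: y_train there is a single binary column, so confusion_matrix only sees two classes. If each sample has exactly one active label, as in the one-hot rows shown in the question, the columns can be collapsed to class indices with argmax first. The arrays below are small hypothetical examples, not the asker's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical one-hot ground truth (4 classes, 5 samples).
y_onehot = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 1, 0],
                     [0, 0, 0, 1],
                     [0, 1, 0, 0]])

# Hypothetical per-class probabilities, shaped like `pred` in the question.
pred = np.array([[0.70, 0.10, 0.10, 0.10],
                 [0.20, 0.60, 0.10, 0.10],
                 [0.10, 0.20, 0.30, 0.40],
                 [0.10, 0.10, 0.10, 0.70],
                 [0.10, 0.80, 0.05, 0.05]])

# Collapse each row to the index of its active / most probable class.
y_true_idx = y_onehot.argmax(axis=1)
y_pred_idx = pred.argmax(axis=1)

# Passing labels explicitly keeps the matrix 4x4 even if a class is never predicted.
cm = confusion_matrix(y_true_idx, y_pred_idx, labels=[0, 1, 2, 3])
print(cm)
```

The resulting cm can be passed to any of the plotting helpers above. For data where a sample can carry several labels at once, multilabel_confusion_matrix is the more appropriate tool.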
