Finding Accuracy for this K-Means model

This program divides the given points into two clusters, 0 and 1, and predicts the cluster to which a new coordinate belongs.
How do I get the accuracy of this model for the variable prediction?
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
#from sklearn.metrics import accuracy_score
X = np.array([[1, 2],[5, 8],[1.5, 1.8],[8, 8],[6,7],[9, 11]])
print(X)
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print("Centroids :\n ",centroids)
print("Labels : ",labels)
colors = ["g.","r.","c.","y."]
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
prediction = kmeans.predict([[5, 6]])
print(prediction)

If you know the true labels for the coordinates, you can use scikit-learn's accuracy_score, where y_true holds the known labels and y_pred the predicted ones:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_true, y_pred))
This is tricky for a clustering problem, though. K-Means cluster IDs are arbitrary (the cluster called 0 in one run may be called 1 in the next), so you first have to decide which cluster corresponds to which true label, and only then calculate the accuracy around that.
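A minimal sketch of one way to decide which cluster corresponds to which true label before computing accuracy. The `y_true` array is hypothetical (the question does not supply ground-truth labels), and matching clusters to classes with `scipy.optimize.linear_sum_assignment` is just one possible choice, not something the question or answer prescribes:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix

X = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [6, 7], [9, 11]])
# Hypothetical ground truth, assumed only for this illustration.
y_true = np.array([0, 1, 0, 1, 1, 1])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
y_cluster = kmeans.labels_  # arbitrary cluster IDs

# Rows are true labels, columns are cluster IDs.
cm = confusion_matrix(y_true, y_cluster)

# Pick the one-to-one cluster -> label assignment with the most matches
# (negating the counts turns the minimisation into a maximisation).
true_ids, cluster_ids = linear_sum_assignment(-cm)
mapping = {int(c): int(t) for t, c in zip(true_ids, cluster_ids)}

# Translate cluster IDs into label space, then score as usual.
y_pred = np.array([mapping[int(c)] for c in y_cluster])
acc = accuracy_score(y_true, y_pred)
print("accuracy after matching:", acc)
```

If no ground truth exists at all, accuracy is not defined; an internal measure such as `sklearn.metrics.silhouette_score` may be a better fit in that case.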

Related

3D plotting of a dataset that uses K-means

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3) # Number of clusters == 3
kmeans = kmeans.fit(X) # Fitting the input data
labels = kmeans.predict(X) # Getting the cluster labels
centroids = kmeans.cluster_centers_ # Centroid values
print("Centroids are:", centroids) # From sci-kit learn
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(labels==0)
y = np.array(labels==1)
z = np.array(labels==2)
ax.scatter(x,y,z, marker="s"[kmeans.labels_], s=40, cmap="RdBu")
I am trying to plot the clusters in 3D, colouring each point according to its cluster label, and to plot the centroids using a separate symbol. I managed to get KMeans working, or at least I believe I did, but I'm stuck trying to plot the result in 3D. I suspect there is a simple solution that I'm just not seeing. Does anyone have any idea what I need to change in my code to achieve this?

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from mpl_toolkits.mplot3d import Axes3D
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3) # Number of clusters == 3
kmeans = kmeans.fit(X) # Fitting the input data
labels = kmeans.predict(X) # Getting the cluster labels
centroids = kmeans.cluster_centers_ # Centroid values
# print("Centroids are:", centroids) # From sci-kit learn
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d') # fig.gca(projection='3d') no longer works in recent Matplotlib
x = np.array(labels==0)
y = np.array(labels==1)
z = np.array(labels==2)
ax.scatter(centroids[:,0],centroids[:,1],centroids[:,2],c="black",s=150,label="Centers",alpha=1)
ax.scatter(X[x,0],X[x,1],X[x,2],c="blue",s=40,label="C1")
ax.scatter(X[y,0],X[y,1],X[y,2],c="yellow",s=40,label="C2")
ax.scatter(X[z,0],X[z,1],X[z,2],c="red",s=40,label="C3")
ax.legend() # show the labels given above
plt.show()

Try this; the centroids are now drawn as black X markers:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3) # Number of clusters == 3
kmeans = kmeans.fit(X) # Fitting the input data
labels = kmeans.predict(X) # Getting the cluster labels
centroids = kmeans.cluster_centers_ # Centroid values
print("Centroids are:", centroids) # From sci-kit learn
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(labels==0)
y = np.array(labels==1)
z = np.array(labels==2)
ax.scatter(X[x][:, 0], X[x][:, 1], X[x][:, 2], color='red')
ax.scatter(X[y][:, 0], X[y][:, 1], X[y][:, 2], color='blue')
ax.scatter(X[z][:, 0], X[z][:, 1], X[z][:, 2], color='yellow')
ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2],
           marker='x', s=169, linewidths=10,
           color='black', zorder=50)
plt.show()

How to use DBSCAN method from sklearn for clustering

I have a dataset with three features that I want to cluster. With sklearn's KMeans I can easily plot the result like this (val is my data, with shape (3000, 3)):
y_pred = KMeans(n_clusters= 4 , random_state=0).fit_predict(val)
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1,projection='3d')
ax1.scatter(val[:, 0], val[:, 1], val[:, 2], c=y_pred)
plt.show()
With DBSCAN, however, I only have this so far:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
val = StandardScaler().fit_transform(val)
db = DBSCAN(eps=3, min_samples=4).fit(val)
labels = db.labels_
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[db.core_sample_indices_] =True
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
So how do I plot the DBSCAN result, just as I did for KMeans?

You can reuse the same code from your KMeans model. All you need to do is re-assign val and y_pred so that they ignore the noise labels (DBSCAN marks noise points with the label -1).
# DBSCAN snippet from the question
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
val = StandardScaler().fit_transform(val)
db = DBSCAN(eps=3, min_samples=4).fit(val)
labels = db.labels_
# keep only the points DBSCAN assigned to a cluster (label != -1);
# core plays the role of val and holds all non-noise points
y_pred, core = labels[labels != -1], val[labels != -1]
# plotting snippet from the question
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1,projection='3d')
ax1.scatter(core[:, 0], core[:, 1], core[:, 2], c=y_pred)
plt.show()
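A self-contained sketch of the same filtering step, in case it helps to try it without the original data. The synthetic `val` array, the injected outliers, and the `eps`/`min_samples` values are all assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's `val`: three features per sample.
blobs, _ = make_blobs(n_samples=300, n_features=3, centers=3,
                      cluster_std=0.6, random_state=0)
rng = np.random.default_rng(0)
outliers = rng.uniform(-30, 30, size=(10, 3))  # scattered points, likely noise
val = StandardScaler().fit_transform(np.vstack([blobs, outliers]))

# Illustrative parameters; they need tuning for real data.
labels = DBSCAN(eps=0.3, min_samples=4).fit(val).labels_

# DBSCAN labels noise as -1; split the data with a boolean mask.
is_noise = labels == -1
clustered, y_pred = val[~is_noise], labels[~is_noise]
noise = val[is_noise]

print("clusters:", len(set(y_pred)), "noise points:", len(noise))
```

`clustered` and `y_pred` can be passed to the same 3D `scatter` call used for KMeans, and `noise` could be drawn separately (for example in gray) rather than dropped.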

Visualizing Kmeans cluster after application of TSNE

k_model = KMeans(n_clusters = 3).fit(actor_w2vec)
cluster_dict = {i: np.where(k_model.labels_ == i)[0] for i in range(k_model.n_clusters)}
I have applied KMeans to word2vec vectors (3411x128). cluster_dict maps each cluster label (0, 1, 2) to the indices of the rows assigned to it, so the 3411 row indices are distributed among the three clusters.
Now I want to visualize these clusters, so I used TSNE to reduce the 128-dimensional vectors to 2 dimensions:
node_embeddings = actor_w2vec
transform = TSNE #PCA
trans = transform(n_components=2)
node_embeddings_2d = trans.fit_transform(node_embeddings)
But I don't know how to combine the two in order to create a scatter plot in which all the points belonging to one cluster are grouped (for example, coloured) together.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import plotly.express as px
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
#remember to scale your data if the ranges are too broad
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)
kmeans_model = KMeans(n_clusters=3, max_iter=500, random_state=42)
y_km = kmeans_model.fit_predict(scaled_features)
pca_model = PCA(n_components=2, random_state=42)
transformed = pca_model.fit_transform(scaled_features)
centers = pca_model.transform(kmeans_model.cluster_centers_)
fig = px.scatter(x=transformed[:, 0], y=transformed[:, 1], color=y_km)
fig.add_scatter(
    x=centers[:, 0],
    y=centers[:, 1],
    mode="markers",
    marker=dict(size=20, color="LightSeaGreen"), name="Centers"
)
fig.show()
If you only call kmeans_model.fit(scaled_features), you can get the labels from kmeans_model.labels_.

import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns; sns.set()
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
plt.rcParams['figure.dpi'] = 150
# create dataset
X, y = make_blobs(
    n_samples=150, n_features=2,
    centers=3, cluster_std=0.5,
    shuffle=True, random_state=0
)
# plot
plt.scatter(
    X[:, 0], X[:, 1],
    edgecolor='black', s=50
)
plt.show()
km = KMeans(
    n_clusters=3, init='random',
    n_init=10, max_iter=10000,
    tol=1e-04, random_state=0
)
y_km = km.fit_predict(X)
# colour each point by its cluster label
plt.scatter(X[:, 0], X[:, 1], c=y_km, s=50, cmap=plt.cm.Paired, alpha=0.4)
# draw the centroids as stars in the matching colours
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            s=250, marker='*', label='centroids',
            edgecolor='black',
            c=np.arange(0, 3), cmap=plt.cm.Paired)
plt.legend()
plt.show()
The line import seaborn as sns; sns.set() is not necessary; it just gives the plot a nicer style.
For plotting you can use matplotlib.pyplot. You can also check the shape of your data with node_embeddings_2d.shape to make sure that plt.scatter receives the right arguments.
Good luck! ;)
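Applied to the question's own variables, the idea is to colour the 2D t-SNE coordinates with the KMeans labels; this works because both are in the same row order as the input. A minimal sketch, assuming a small random array as a stand-in for `actor_w2vec` and an illustrative `perplexity` value:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Random stand-in for the question's 3411x128 word2vec matrix.
rng = np.random.default_rng(0)
actor_w2vec = rng.normal(size=(90, 16))

# Cluster in the original high-dimensional space.
k_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(actor_w2vec)

# Reduce to 2D for display only; perplexity must be below the sample count.
tsne = TSNE(n_components=2, perplexity=15, random_state=0)
node_embeddings_2d = tsne.fit_transform(actor_w2vec)

# Row i of node_embeddings_2d belongs to cluster k_model.labels_[i].
fig, ax = plt.subplots()
for cluster_id in range(k_model.n_clusters):
    mask = k_model.labels_ == cluster_id
    ax.scatter(node_embeddings_2d[mask, 0], node_embeddings_2d[mask, 1],
               s=20, label=f"cluster {cluster_id}")
ax.legend()
plt.show()
```

Because t-SNE distorts distances, clusters found in 128 dimensions will not necessarily appear as tidy, separated groups in the 2D picture.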

Why is my code not predicting and only computes the target value

I have this code that loads the digits dataset with load_digits and uses an SVM model to predict digits. But after fitting the model, its prediction for a sample looks incorrect: the predicted and target values do not correspond to the image I display. Below is the code:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
digits = datasets.load_digits()
my_OCR_model = svm.SVC(gamma = 0.001, C = 100)
X, y = digits.data[:-10], digits.target[:-10]
my_OCR_model.fit(X, y)
print(my_OCR_model.predict(X[[-6]]))
print(y[-6])
plt.imshow(digits.images[-6], cmap=plt.cm.gray_r, interpolation="nearest")
plt.show()

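Remove the slicing. X and y are sliced with [:-10], so X[-6] and y[-6] refer to digits.data[-16], while digits.images[-6] shows a different sample. The prediction matches the target; it is the displayed image that does not belong to it.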
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
digits = datasets.load_digits()
my_OCR_model = svm.SVC(gamma = 0.001, C = 100)
X, y = digits.data, digits.target # remove slicing here
my_OCR_model.fit(X, y)
print(my_OCR_model.predict(X[[-6]]))
print(y[-6])
plt.imshow(digits.images[-6], cmap=plt.cm.gray_r, interpolation="nearest")
plt.show()
Alternatively, if you had a good reason for slicing, keep the same data for X, y, and images. Use this as the last line:
plt.imshow(digits.images[:-10][-6], cmap=plt.cm.gray_r, interpolation="nearest")
plt.show()
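If the slicing was meant to hold out the last 10 digits for testing, a sketch along these lines keeps the indices consistent. The names `X_new`, `y_new` and `same_sample` are made up for this illustration:

```python
import numpy as np
from sklearn import datasets, svm

digits = datasets.load_digits()

# Train on everything except the last 10 samples, as in the question.
X, y = digits.data[:-10], digits.target[:-10]
# Hypothetical names for the held-out samples the model never saw.
X_new, y_new = digits.data[-10:], digits.target[-10:]

# After slicing, X[-6] is the 16th sample from the end of the full dataset,
# so the matching image is digits.images[-16], not digits.images[-6].
same_sample = np.array_equal(X[-6].reshape(8, 8), digits.images[-16])
print("X[-6] matches digits.images[-16]:", same_sample)

model = svm.SVC(gamma=0.001, C=100).fit(X, y)

# Predict on the held-out digits and compare against their true targets.
predictions = model.predict(X_new)
print("predicted:", predictions)
print("actual:   ", y_new)
```

To display a held-out digit, index the images the same way as the data, for example `digits.images[-10:][i]` for `X_new[i]`.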

sklearn clustering: calculate silhouette coefficient on TF-IDF-weighted data

I'd like to calculate the silhouette_score like the scikit-learn example silhouette_analysis.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
sampleText = []
sampleText.append("Some text for document clustering")
tfidf_matrix = tfidf_vectorizer.fit_transform(sampleText)
How do I need to convert my tfidf_matrix to do something like this:
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
for num_clusters in range(2,6):
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(tfidf_matrix) + (num_clusters + 1) * 10])
    km = KMeans(n_clusters=num_clusters,
                n_init=10, # number of iterations with different seeds
                random_state=1 # fixes the seed
                )
    cluster_labels = km.fit_predict(tfidf_matrix)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(tfidf_matrix, cluster_labels)

tf-idf is high-dimensional. One approach is to reduce it to two dimensions before clustering, keeping the two directions with the highest variance. I use PCA for this. The complete example:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
sampleText = []
sampleText.append("Some text for document clustering")
tfidf_matrix = tfidf_vectorizer.fit_transform(sampleText)
# jobDescriptions is the list of document strings to cluster;
# PCA needs a dense array, so convert the sparse tf-idf matrix with toarray()
X = tfidf_vectorizer.fit_transform(jobDescriptions).toarray()
from sklearn.decomposition import PCA
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
for num_clusters in range(2,6):
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(data2D) + (num_clusters + 1) * 10])
    km = KMeans(n_clusters=num_clusters,
                n_init=10, # number of iterations with different seeds
                random_state=1 # fixes the seed
                )
    cluster_labels = km.fit_predict(data2D)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(data2D, cluster_labels)
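If the 2D projection is not needed for plotting, the score can also be computed on the sparse tf-idf matrix directly. As far as I can tell, the part of the question's code that fails on a sparse matrix is len(tfidf_matrix), for which shape[0] is the usual replacement, while KMeans and silhouette_score both accept sparse input. A minimal sketch with a made-up corpus (the `docs` list is purely illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up corpus, assumed only for this illustration.
docs = [
    "python developer with django experience",
    "senior python engineer for backend services",
    "java developer for enterprise applications",
    "java spring engineer for banking systems",
    "nurse for hospital night shifts",
    "registered nurse in intensive care unit",
    "truck driver for regional deliveries",
    "delivery driver with clean licence",
]

tfidf_matrix = TfidfVectorizer(use_idf=True).fit_transform(docs)

# len() is ambiguous for sparse matrices; use shape[0] for the sample count.
n_samples = tfidf_matrix.shape[0]

scores = {}
for num_clusters in range(2, 5):
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=1)
    cluster_labels = km.fit_predict(tfidf_matrix)
    # Mean silhouette coefficient over all samples, computed on sparse input.
    scores[num_clusters] = silhouette_score(tfidf_matrix, cluster_labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```

Scores computed this way describe the clustering in the full tf-idf space, so they will generally differ from scores computed on a 2D PCA projection.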
