Calculating Loss function for kmeans in pandas dataframe

I have a dataframe with 5 columns. I am trying to cluster the points on three variables, X, Y and Z, and compute the loss function for k-means clustering. The following code does that, but on my real dataframe with 160,000 rows it takes forever. I assume it can be done a lot faster.
PS: It seems that the KMeans module in sklearn does not provide the loss function, which is why I am writing my own code.
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000, 5), columns=list('XYZVW'))
kmeans = KMeans(n_clusters=6, random_state=0).fit(df[['X', 'Y', 'Z']].values)
df['Cluster'] = kmeans.labels_
# accumulate the distance from every row to its assigned cluster center
loss = 0.0
for i in range(df.shape[0]):
    cluster = int(df.loc[i, "Cluster"])
    a = np.array(df.loc[i, ['X', 'Y', 'Z']])
    b = kmeans.cluster_centers_[cluster]
    loss += np.linalg.norm(a - b)
print(loss)

It seems that the scipy package takes care of the loss function, and it is pretty fast. Here's the code:
from scipy.cluster.vq import vq, kmeans, whiten
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000, 5), columns=list('XYZVW'))
features = df[['X', 'Y', 'Z']].values
# kmeans returns the centroids and the distortion (mean distance of the points to their nearest centroid)
centers, loss = kmeans(features, 6)
df['Cluster'] = vq(features, centers)[0]
That being said, I am still interested in the fastest way of calculating the loss function with the sklearn KMeans module.

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
inertia_ : float
Sum of squared distances of samples to their closest cluster center.
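To connect this attribute to the loop in the question, here is a minimal sketch of a vectorized version. It assumes the same random 1000-row frame as the question; the names `points`, `assigned_centers`, `distances` and `sq_loss` are illustrative. Note that `inertia_` is a sum of squared distances, so it is not the same number as the sum of plain distances that the loop accumulates.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 5)), columns=list("XYZVW"))
points = df[["X", "Y", "Z"]].to_numpy()

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(points)
df["Cluster"] = kmeans.labels_

# Look up the center assigned to every row at once: shape (n_samples, 3)
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]

# Euclidean distance of each row to its own center, with no Python loop
distances = np.linalg.norm(points - assigned_centers, axis=1)

loss = distances.sum()            # what the loop in the question computes
sq_loss = (distances ** 2).sum()  # sum of squared distances, comparable to inertia_

print(loss, sq_loss, kmeans.inertia_)
```

The work is done by indexing `cluster_centers_` with `labels_`, which replaces the per-row `df.loc` lookups that make the loop slow on 160,000 rows.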

Related

How to label data in a CSV file as outliers detected by DBSCAN clustering

I am using DBSCAN to cluster data so that I can label the anomalous records. Here is my code. I want to write a 1 next to each outlier record in my CSV file, but for now my code only reports the record numbers and prints the records that are outliers.
# data wrangling
import pandas as pd
# visualization
import matplotlib.pyplot as plt
# algorithm
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn import metrics
# import data
df = pd.read_csv("C:/Users/user1/Desktop/4005_20200101_20200331.csv")
print(df.head())
# setting up data to cluster
X = df
# scaling and standardizing data
X = StandardScaler().fit_transform(X)
# Instantiating our DBSCAN model. In the code below, epsilon = 3 and min_samples is the minimum number of points needed to constitute a cluster.
dbscan = DBSCAN(eps=3, min_samples=4)
# fitting model
model = dbscan.fit(X)
# Storing the labels formed by DBSCAN
labels = model.labels_
# Identifying which points make up our "core points"
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[dbscan.core_sample_indices_] = True
print(core_samples)
# Calculating the number of clusters (the noise label -1 is not a cluster)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
# Computing the Silhouette Score
# print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))
# rows that DBSCAN labelled as noise
outliers = df[model.labels_ == -1]
print(outliers)
I want to print 1 next to the outlier records in my CSV file.
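One way to get that flag is to turn the noise label into a 0/1 column and write the frame back out. The sketch below is only an illustration. It uses a small synthetic frame instead of the CSV from the question, and the column names `a`, `b`, `outlier`, the `eps` value and the output file name are all assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the CSV: one tight group plus two far-away rows
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)), [[10.0, 10.0], [-10.0, 10.0]]])
df = pd.DataFrame(data, columns=["a", "b"])

X = StandardScaler().fit_transform(df)
labels = DBSCAN(eps=0.5, min_samples=4).fit(X).labels_

# DBSCAN marks noise points with -1; store that as 1 (outlier) or 0 (not an outlier)
df["outlier"] = (labels == -1).astype(int)

# Write the labelled records to a new CSV (path chosen only for this example)
out_path = os.path.join(tempfile.gettempdir(), "labelled_outliers.csv")
df.to_csv(out_path, index=False)
print(df.tail())
```

The flag column is added after scaling, so it does not take part in the clustering itself.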

DBSCAN fit_predict on precomputed metrics outputs strange clusters

I am trying to practice ML. Specifically, I am applying DBSCAN to a precomputed distance matrix (just to see how this works). Yes, I know I could use the Euclidean metric directly, but I wanted to test the precomputed option.
I am unsure why the labels all have the same value for a data set of random pairs drawn from 3 different regions; I expected DBSCAN to separate them. Note: even if I use non-overlapping ranges for data1/2/3, I still get a single cluster in the output.
Here is the code:
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import DBSCAN
from scipy.spatial.distance import pdist, squareform
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
data1 = np.array([[random.randint(1, 400) for i in range(2)] for j in range(50)], dtype=np.float64)
data2 = np.array([[random.randint(300, 700) for i in range(2)] for j in range(50)], dtype=np.float64)
data3 = np.array([[random.randint(600, 900) for i in range(2)] for j in range(50)], dtype=np.float64)
data = np.append(np.append(data1, data2, axis=0), data3, axis=0)
d = pdist(data, lambda u, v: np.sqrt(((u - v) ** 2).sum()))
distance_matrix = squareform(d)
cluster = DBSCAN(eps=0.3, min_samples=2, metric='precomputed')
dbscan_model = cluster.fit_predict(distance_matrix)
plt.scatter(data[:, 0], data[:, 1], s=100, c=dbscan_model)
plt.show()
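A likely cause, offered as a sketch rather than a definitive diagnosis: the points have integer coordinates, so two distinct points are at least 1 apart, and with eps=0.3 only exact duplicates can be neighbours. Nearly every point then gets the noise label -1, which looks like a single cluster in the plot. The example below compares that eps with a much larger one on three non-overlapping regions. The region bounds and eps=150 are illustrative assumptions, chosen so that eps is larger than the width of a region but smaller than the gap between regions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)

# Three non-overlapping square regions, 50 integer-valued points each
blobs = [rng.integers(low, low + 100, size=(50, 2)) for low in (0, 400, 800)]
data = np.vstack(blobs).astype(np.float64)

# Full square matrix of Euclidean distances, as metric="precomputed" expects
distance_matrix = pairwise_distances(data, metric="euclidean")

# eps=0.3: smaller than the minimum distance between distinct integer points
labels_small = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(distance_matrix)

# eps=150: larger than the diameter of a region, smaller than the gap between regions
labels_large = DBSCAN(eps=150, min_samples=2, metric="precomputed").fit_predict(distance_matrix)

print(np.unique(labels_small))
print(np.unique(labels_large))
```

Since eps is measured in the same units as the entries of the distance matrix, it has to be chosen relative to the spacing of the data.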

Outlier detection with Local Outlier Factor (LOF)

I am working with healthcare insurance claims data and would like to identify fraudulent claims. I have been reading online to try to find a better method, and I came across the following code on scikit-learn.org.
Does anyone know how to select the outliers? The code plots them in a graph, but I would like to select those outliers if possible.
I have tried appending the y predictions to the X dataframe, but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code I tried.
X['outliers'] = y_pred

The first 200 points are inliers while the last 20 are outliers. When you call fit_predict on X, y_pred contains either -1 (outlier) or 1 (inlier) for each point. So to get the predicted outliers, select the entries where y_pred is -1 and take the corresponding rows of X. The line below gives you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
This pairs y_pred with X and collects the X values where the prediction is -1.
However, there are eight errors in the predictions (8 out of 220). These errors are -1 values in y_pred[:200] and 1 values in y_pred[200:220]. Please be aware of these errors as well.
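The same selection can also be written with a boolean mask, which keeps X as a NumPy array. The sketch below reuses the data from the question; the names `mask`, `outlier_rows`, `x1`, `x2` and `outlier` are illustrative. Because X is an array rather than a dataframe, the assignment `X['outliers'] = y_pred` from the question cannot work as written, so the sketch wraps X in a DataFrame first.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

np.random.seed(42)
X = 0.3 * np.random.randn(100, 2)
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]

y_pred = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

# Boolean mask: True where LOF predicted an outlier (-1)
mask = y_pred == -1
X_pred_outliers = X[mask]           # the outlying rows themselves
outlier_rows = np.flatnonzero(mask) # their positions in X

# To keep predictions next to the data, wrap the array in a DataFrame first
df = pd.DataFrame(X, columns=["x1", "x2"])
df["outlier"] = y_pred
print(df[df["outlier"] == -1])
```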

Determining a threshold value for a bimodal distribution via KMeans clustering

I'd like to find a threshold value for a bimodal distribution. For example, a bimodal distribution could look like the following:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000; b = n//10; i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)
An attempt to find the cluster centers did not work, as I wasn't sure how the matrix, h, should be formatted:
from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(h)
I would expect to be able to find the cluster centers around -2 and 2. The threshold value would then be the midpoint of the two cluster centers.

Your question is a bit confusing to me, so please let me know if I've interpreted it incorrectly. I think you are basically trying to do 1D k-means and introduced frequency as a second dimension to get KMeans to work, but you would really be happy with [-2, 2] as the output for the centers instead of [(-2, y1), (2, y2)].
To do 1D k-means, you can just reshape your data into n vectors of length 1 (similar question: Scikit-learn: How to run KMeans on a one-dimensional array?).
Code:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000
b = n // 10
i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)
from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(x.reshape(n,1))
print(kmeans.cluster_centers_)
Output:
[[-1.9896414]
[ 2.0176039]]
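With the two centers in hand, the threshold described in the question is their midpoint. A minimal sketch, assuming the same seed and sample as above; `n_init`, `random_state` and the names `centers` and `threshold` are additions for reproducibility and illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(45)
n = 1000
i = np.random.randint(0, 2, n)
x = i * np.random.normal(-2.0, 0.8, n) + (1 - i) * np.random.normal(2.0, 0.8, n)

# 1D k-means: each sample becomes a vector of length 1
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x.reshape(-1, 1))

# Sort the centers so the result does not depend on the cluster numbering
centers = np.sort(kmeans.cluster_centers_.ravel())

# Midpoint of the two cluster centers, used as the threshold
threshold = centers.mean()
print(centers, threshold)
```

For two clusters in one dimension, each point is assigned to the nearer center, so this midpoint is also where the k-means assignment switches from one cluster to the other.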

fit_transform PCA inconsistent results

I am trying to run PCA from sklearn with n_components = 5. I apply the dimensionality reduction to my data using fit_transform(data).
Initially I tried the classical matrix multiplication between the pca.components_ values and my x_features data, but the results are different. So either I am doing my multiplication incorrectly or I did not understand how fit_transform works.
Below is a mock-up to compare classic matrix multiplication and fit_transform:
import numpy as np
from sklearn import decomposition
np.random.seed(0)
my_matrix = np.random.randn(100, 5)
mdl = decomposition.PCA(n_components=5)
mdl_FitTrans = mdl.fit_transform(my_matrix)
pca_components = mdl.components_
mdl_FitTrans_manual = np.dot(pca_components, my_matrix.transpose())
mdl_FitTrans_manualT = mdl_FitTrans_manual.transpose()
I am expecting mdl_FitTrans == mdl_FitTrans_manualT, but the result is False.

Check how the transform() method is implemented in sklearn: https://github.com/scikit-learn/scikit-learn/blob/a5ab948/sklearn/decomposition/base.py#L101
According to it, the manual reduction is done as follows (note that the data is centered with mean_ before the projection):
import numpy as np
from sklearn import decomposition
np.random.seed(0)
data = np.random.randn(100, 100)
mdl = decomposition.PCA(n_components=5)
mdl_fit = mdl.fit(data)
data_transformed = mdl_fit.transform(data)
data_transformed_manual = np.dot(data - mdl_fit.mean_, mdl.components_.T)
# compare with a tolerance, since exact floating-point equality is fragile
np.allclose(data_transformed, data_transformed_manual)
True
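Applied to the 100 x 5 mock-up from the question, the missing step is the subtraction of `mean_`. The sketch below compares the uncentered and centered products; the names `uncentered`, `centered` and `offset` are illustrative, and the default `whiten=False` is assumed.

```python
import numpy as np
from sklearn import decomposition

np.random.seed(0)
my_matrix = np.random.randn(100, 5)

mdl = decomposition.PCA(n_components=5)
transformed = mdl.fit_transform(my_matrix)

# Product as in the question: no centering
uncentered = my_matrix @ mdl.components_.T

# Product with the per-feature mean removed first
centered = (my_matrix - mdl.mean_) @ mdl.components_.T

# The two products differ by the same vector in every row
offset = mdl.mean_ @ mdl.components_.T

print(np.allclose(centered, transformed))
print(np.allclose(uncentered - transformed, offset))
```

So the multiplication in the question is not wrong in itself. It is shifted by the projection of the feature means, which is why the comparison against fit_transform fails.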
