Count data points for each K-means cluster - python

I have a dataset of wavelet data for genuine and forged banknotes with 2 features:
X axis: Variance of Wavelet Transformed image
Y axis: Skewness of Wavelet Transformed image
I ran K-means on this dataset to identify 2 clusters, which are basically genuine and forged banknotes.
Now I have 3 questions:
How can I count the data points of each cluster?
How can I set the color of each data point based on its cluster?
How do I know without another feature in the data if the datapoint is genuine or forged? I know the data set has a "class" which shows 1 and 2 for genuine and forged but can I identify this without the "class" feature?
My code:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as patches
import pandas as pd
from sklearn.cluster import KMeans
data = pd.read_csv('Banknote-authentication-dataset-all.csv')
V1 = data['V1']
V2 = data['V2']
bn_class = data['Class']
V1_min = np.min(V1)
V1_max = np.max(V1)
V2_min = np.min(V2)
V2_max = np.max(V2)
normed_V1 = (V1 - V1_min)/(V1_max - V1_min)
normed_V2 = (V2 - V2_min)/(V2_max - V2_min)
V1_mean = normed_V1.mean()
V2_mean = normed_V2.mean()
V1_std_dev = np.std(normed_V1)
V2_std_dev = np.std(normed_V2)
ellipse = patches.Ellipse([V1_mean, V2_mean], V1_std_dev*2, V2_std_dev*2, alpha=0.4)
V1_V2 = np.column_stack((normed_V1, normed_V2))
km_res = KMeans(n_clusters=2).fit(V1_V2)
clusters = km_res.cluster_centers_
plt.xlabel('Variance of Wavelet Transformed image')
plt.ylabel('Skewness of Wavelet Transformed image')
scatter = plt.scatter(normed_V1,normed_V2, s=10, c=bn_class, cmap='coolwarm')
#plt.scatter(V1_std_dev, V2_std_dev,s=400, alpha=0.5)
plt.scatter(V1_mean, V2_mean, s=400, alpha=0.8, c='lightblue')
plt.scatter(clusters[:,0], clusters[:,1],s=3000,c='orange', alpha=0.8)
unique = list(set(bn_class))
plt.text(1.1, 0, 'Kmeans cluster centers', bbox=dict(facecolor='orange'))
plt.text(1.1, 0.11, 'Arithmetic Mean', bbox=dict(facecolor='lightblue'))
plt.text(1.1, 0.33, 'Class 1 - Genuine Notes',color='white', bbox=dict(facecolor='blue'))
plt.text(1.1, 0.22, 'Class 2 - Forged Notes', bbox=dict(facecolor='red'))
plt.savefig('figure.png',bbox_inches='tight')
plt.show()
Appendix image for better visibility

How to count the data points of each cluster
You can do this easily by using fit_predict instead of fit, or calling predict on your training data after fitting it.
Here's a working example:
labels = KMeans(...).fit_predict(V1_V2)  # fit_predict returns the label array directly
clusterCount = np.bincount(labels)
clusterCount now holds the number of points in each cluster. You can just as easily do this with fit followed by predict, although fit_predict should be more efficient:
kM = KMeans(...).fit(V1_V2)
labels = kM.predict(V1_V2)
clusterCount = np.bincount(labels)
To set its color, use kM.labels_ or the output of kM.predict() as a coloring index.
labels = kM.predict(V1_V2)
plt.scatter(normed_V1, normed_V2, s=10, c=labels, cmap='coolwarm') # instead of c=bn_class
For a new data point, notice how your KMeans quite nicely separates out the majority of the two classes. This separability means you can actually use your KMeans clusters as predictors. Simply call predict on the fitted model (the new point must be a 2D array scaled the same way as the training data):
predictedClass = kM.predict(newDataPoint)
Each cluster is then assigned the class that makes up the majority of its points, or even a percentage chance for each class.
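The majority rule described above can be sketched in a few lines. This is only an illustration: the function name majority_class_per_cluster and the toy lists are hypothetical, and with the real data you would pass the KMeans labels and the Class column (bn_class) instead.

```python
from collections import Counter, defaultdict

def majority_class_per_cluster(cluster_labels, true_classes):
    """Return {cluster_id: (majority_class, share_of_cluster)}."""
    per_cluster = defaultdict(Counter)
    # Count how often each known class appears inside each cluster.
    for cluster_id, true_class in zip(cluster_labels, true_classes):
        per_cluster[cluster_id][true_class] += 1
    result = {}
    for cluster_id, class_counts in per_cluster.items():
        majority_class, majority_count = class_counts.most_common(1)[0]
        share = majority_count / sum(class_counts.values())
        result[cluster_id] = (majority_class, share)
    return result

# Hypothetical toy data: cluster ids from KMeans and the known class of each point.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
toy_classes = [1, 1, 1, 2, 2, 2, 2, 2, 1, 2]

mapping = majority_class_per_cluster(toy_labels, toy_classes)
print(mapping)
```

The share stored next to each majority class is the "percentage chance" mentioned above: the fraction of points in that cluster that belong to the majority class.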

Related

Gaussian Mixture Model with discrete data

I have 136 numbers that come from 8 overlapping Gaussian distributions. I want to find the mean and variance of each Gaussian distribution. Can you find any mistakes in my code?
file = open("1.txt",'r') #data is in 1.txt like 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y=[int (i) for i in list((file.read()).split(','))] # I want to make list which element is above data
x=list(range(1,len(y)+1)) # it is x values
z=list(zip(x,y)) # z elements consist as (1, 0), (2, 0), ...
Through the above process, the given data are used as y values for 136 points (x, y) in the xy-plane, and the list z holds these points as its elements.
Now I want to obtain the mean and variance of each Gaussian distribution. The basic assumption is that the given data consist of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to make this code print the 8 means and variances. Can anyone help me?
You can use this. I have written some sample code with a visualization:
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline
#Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
model = GaussianMixture(n_components=num_components).fit(data)
#Get list of means and variances
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50 #this is for zooming into or out of the graph, higher it is , more zoom out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
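To print one mean and one variance per component, a sketch along the following lines may help. It rests on an assumption the question does not confirm: that the numbers are histogram counts (how many observations fall at each position), so they are expanded into individual samples with np.repeat before fitting. The positions and counts here are made up for illustration, with two components instead of eight.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical histogram: bin positions and a count for each bin,
# built from two bell-shaped bumps centred at 20 and 40.
positions = np.arange(1, 61)
counts = np.round(
    200 * np.exp(-0.5 * ((positions - 20) / 3.0) ** 2)
    + 200 * np.exp(-0.5 * ((positions - 40) / 4.0) ** 2)
).astype(int)

# Expand the histogram into one sample per counted observation (assumption:
# the raw numbers are counts, not measurements).
samples = np.repeat(positions, counts).reshape(-1, 1).astype(float)

model = GaussianMixture(n_components=2, random_state=0).fit(samples)

# With the default covariance_type='full', covariances_ has shape
# (n_components, 1, 1) for 1D data, so ravel() gives one variance per component.
order = np.argsort(model.means_.ravel())
means = model.means_.ravel()[order]
variances = model.covariances_.ravel()[order]
weights = model.weights_[order]

for m, v, w in zip(means, variances, weights):
    print(f"mean={m:.2f}  variance={v:.2f}  weight={w:.2f}")
```

For the data in the question, n_components would be 8 and the positions and counts would come from the file.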

Python find peaks of distribution

In a dataset like this (y is the angle and x is the data point index):
How can I find the weighted average of each "band" (in this case about 0.1 and about -90) while ignoring potential random points?
I was thinking of using an FFT, but that might not be the right approach.
Perhaps I could transform the data into something like a normal distribution and find the peaks?
Solve Using KMeans
Step 1. Generate Data
from random import randint, choice
from numpy import random
import numpy as np
from matplotlib import pyplot as plt
def gen_pts(mean_, std_, n):
    """Generate gaussian distributed random data
    mean: mean_
    standard deviation: std_
    number points: n
    """
    return np.random.normal(loc=mean_, scale = std_, size = n)
# Number of groups of horizontal blobs
n_groups = 20
# Generate random count for each group
counts = [randint(100, 200) for _ in range(n_groups)]
# Generate random mean for each group (i.e. 0 or -90)
means = [random.choice([0, -90]) for _ in range(n_groups)]
# All the groups
data = [gen_pts(mean_, 5, n) for mean_, n in zip(means, counts)]
# Concatenate groups into 1D array
X = np.concatenate(data, axis=0)
# Show Data
plt.plot(X)
plt.show()
Step 2. Find Cluster Centers
from sklearn.cluster import KMeans
# Reshape 1D data so it's suitable for kmeans model
X = X.reshape(-1,1)
# Get model for two clusters
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
# Fit Data to model
pred_y = kmeans.fit_predict(X)
# Cluster Centers
centers = kmeans.cluster_centers_
print(*centers)
# Output: [-89.79165334] [-0.07875314]
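Since the question also asks to ignore stray points, one possible refinement is to take the median of each cluster instead of its mean. The sketch below is self-contained and uses a hand-written two-center Lloyd iteration in NumPy rather than sklearn's KMeans; the data, the stray points, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two bands near 0 and -90 plus a few stray points.
band_a = rng.normal(0.0, 5.0, size=500)
band_b = rng.normal(-90.0, 5.0, size=500)
strays = rng.uniform(-180.0, 180.0, size=20)
values = np.concatenate([band_a, band_b, strays])

# Two-cluster Lloyd iterations in 1D, started from the extremes of the data.
centers = np.array([values.min(), values.max()])
for _ in range(50):
    # Assign every value to its nearest center.
    labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
    new_centers = np.array([values[labels == k].mean() for k in range(2)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

# The median of each cluster is less sensitive to stray points than the mean.
robust_centers = np.array([np.median(values[labels == k]) for k in range(2)])
sorted_centers = np.sort(robust_centers)
print(sorted_centers)
```

With labels from sklearn's fit_predict, the same median step applies unchanged.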

Gaussian Processes in scikit-learn: good performance on training data, bad performance on testing data

I wrote a Python script that uses scikit-learn to fit Gaussian Processes to some data.
IN SHORT: the problem I am facing is that while the Gaussian Processes seem to learn the training dataset very well, the predictions for the testing dataset are off, and it seems to me there is a normalization problem behind this.
IN DETAIL: my training dataset is a set of 1500 time series. Each time series has 50 time components. The mapping learnt by the Gaussian Processes is between a set of three coordinates x,y,z (which represent the parameters of my model) and one time series. In other words, there is a 1:1 mapping between x,y,z and one time series, and the GPs learn this mapping. The idea is that, by giving to the trained GPs new coordinates, they should be able to give me the predicted time series associated to those coordinates.
Here is my code:
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
coordinates_training = np.loadtxt(...) # read coordinates training x, y, z from file
coordinates_testing = np.loadtxt(...) # read coordinates testing x, y, z from file
# z-score of the coordinates for the training and testing data.
# Note I am using the mean and std of the training dataset ALSO to normalize the testing dataset
mean_coords_training = np.zeros(3)
std_coords_training = np.zeros(3)
for i in range(3):
    mean_coords_training[i] = coordinates_training[:, i].mean()
    std_coords_training[i] = coordinates_training[:, i].std()
    coordinates_training[:, i] = (coordinates_training[:, i] - mean_coords_training[i])/std_coords_training[i]
    coordinates_testing[:, i] = (coordinates_testing[:, i] - mean_coords_training[i])/std_coords_training[i]
time_series_training = np.loadtxt(...)# reading time series of training data from file
number_of_time_components = np.shape(time_series_training)[1] # 100 time components
# z_score of the time series
mean_time_series_training = np.zeros(number_of_time_components)
std_time_series_training = np.zeros(number_of_time_components)
for i in range(number_of_time_components):
    mean_time_series_training[i] = time_series_training[:, i].mean()
    std_time_series_training[i] = time_series_training[:, i].std()
    time_series_training[:, i] = (time_series_training[:, i] - mean_time_series_training[i])/std_time_series_training[i]
time_series_testing = np.loadtxt(...)# reading test data from file
# the number of time components is the same for training and testing dataset
# z-score of testing data, again using mean and std of training data
for i in range(number_of_time_components):
    time_series_testing[:, i] = (time_series_testing[:, i] - mean_time_series_training[i])/std_time_series_training[i]
# GPs
pred_time_series_training = np.zeros((np.shape(time_series_training)))
pred_time_series_testing = np.zeros((np.shape(time_series_testing)))
# Instantiate a Gaussian Process model
kernel = 1.0 * Matern(nu=1.5)
gp = GaussianProcessRegressor(kernel=kernel)
for i in range(number_of_time_components):
    print("time component", i)
    # Fit to data using Maximum Likelihood Estimation of the parameters
    gp.fit(coordinates_training, time_series_training[:,i])
    # Make the prediction on the meshed x-axis (ask for MSE as well)
    y_pred_train, sigma_train = gp.predict(coordinates_training, return_std=True)
    y_pred_test, sigma_test = gp.predict(coordinates_testing, return_std=True)
    pred_time_series_training[:,i] = y_pred_train*std_time_series_training[i] + mean_time_series_training[i]
    pred_time_series_testing[:,i] = y_pred_test*std_time_series_training[i] + mean_time_series_training[i]
# plot training
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(time_series_training[100*i], color='blue', label='Original training')
    ax[i].plot(pred_time_series_training[100*i], color='black', label='GP predicted - training')
# plot testing
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(time_series_testing[100*i], color='blue', label='Original testing')
    ax[i].plot(pred_time_series_testing[100*i], color='black', label='GP predicted - testing')
Here are examples of the performance on the training data.
Here are examples of the performance on the testing data.
First, you should use the sklearn preprocessing tools to prepare your data.
from sklearn.preprocessing import StandardScaler
There are other useful tools for organizing data, but this one specifically normalizes it.
Second, you should normalize the training set and the test set with the same parameters. The model fits the "geometry" of the data to define its parameters, so training the model on one scale and testing it on another is like using the wrong system of units.
scale = StandardScaler()
training_set = scale.fit_transform(data_train)
test_set = scale.transform(data_test)
This applies the same transformation to both sets.
Finally, you need to normalize the features, not the target; that is, normalize the X inputs, not the Y output. Normalization helps the model find the answer faster by changing the topology of the objective function in the optimization process; the output doesn't affect this.
I hope this answers your question.
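For reference, the fit-on-training, apply-to-both pattern can also be written in plain NumPy without explicit loops. This is a minimal sketch with made-up arrays standing in for the coordinate files; coords_train, coords_test and the other names are hypothetical. It uses the population standard deviation (NumPy's default), which is also what StandardScaler uses.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the coordinate arrays (rows = samples, columns = x, y, z).
coords_train = rng.normal(loc=[1.0, -2.0, 10.0], scale=[0.5, 3.0, 20.0], size=(200, 3))
coords_test = rng.normal(loc=[1.0, -2.0, 10.0], scale=[0.5, 3.0, 20.0], size=(50, 3))

# Statistics come from the training set only.
mean_train = coords_train.mean(axis=0)
std_train = coords_train.std(axis=0)

# The same shift and scale are applied to both sets.
coords_train_z = (coords_train - mean_train) / std_train
coords_test_z = (coords_test - mean_train) / std_train

# Undo the scaling, e.g. for values expressed in z-score units.
restored = coords_test_z * std_train + mean_train
```

Only the training columns end up with exactly zero mean and unit standard deviation; the test columns are close to that but not identical, which is expected.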

Standard scaling data before using spectral biclustering in scikit-learn?

Hi,
I have a dataset from different cohorts and I want to bicluster them with the sklearn function Spectral Biclustering.
As you can see in the link above this approach is using a kind of normalization to calculate the SVD.
Is it necessary to normalize the data before biclustering, eg with StandardScaling (zero mean and std of one)? Because the function above still uses a kind of normalization.
Is that enough or do I have to normalise them before, eg when the data is coming from different distributions?
I am getting different results with and without standardscaling and I can not find information in the original paper if it is necessary or not.
You can find the code and an example of my dataset. This is real data so I do not know the truth. I calculated at the end the consensus score to compare the 2 biclusters. Unfortunately the clusters are not the same.
I tried it also with artificial data (see example last link) and here the results are the same, but not with the real data.
So how do I know which approach is the right one?
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.cluster import SpectralBiclustering
from sklearn.metrics import consensus_score
from sklearn.preprocessing import StandardScaler
n_clusters = (4, 4)
data_org = pd.read_csv('raw_data_biclustering.csv', sep=',', index_col=0)
# scale data & transform to dataframe
data_scaled = StandardScaler().fit_transform(data_org)
data_scaled = pd.DataFrame(data_scaled, columns=data_org.columns, index=data_org.index)
# plot original clusters
plt.imshow(data_scaled, aspect='auto', vmin=-3, vmax=5)
plt.title("Original dataset")
plt.show()
data_type = ['none_scaled', 'scaled']
data_all = [data_org, data_scaled]
models_all = []
for name, data in zip(data_type,data_all):
    # spectral biclustering on the shuffled dataset
    model = SpectralBiclustering(n_clusters=n_clusters, method='bistochastic'
                                 , svd_method='randomized', n_jobs=-1
                                 , random_state=0
                                 )
    model.fit(data)
    newOrder_row = [list(r) for r in zip(model.row_labels_, data.index)]
    newOrder_row.sort(key=lambda k: (k[0], k[1]), reverse=False)
    order_row = [i[1] for i in newOrder_row]
    newOrder_col = [list(c) for c in zip(model.column_labels_, [int(x) for x in data.keys()])]
    newOrder_col.sort(key=lambda k: (k[0], k[1]), reverse=False)
    order_col = [i[1] for i in newOrder_col]
    # reorder the data matrix
    X_plot = data_scaled.copy()
    X_plot = X_plot.reindex(order_row) # rows
    X_plot = X_plot[[str(x) for x in order_col]] # columns
    # use clustermap without clustering
    cm=sns.clustermap(X_plot, method=None, metric=None, cmap='viridis'
                      ,row_cluster=False, row_colors=None
                      , col_cluster=False, col_colors=None
                      , yticklabels=1, xticklabels=1
                      , standard_scale=None, z_score=None, robust=False
                      , vmin=-3, vmax=5
                      )
    ax = cm.ax_heatmap
    # set labelsize smaller
    cm_ax = plt.gcf().axes[-2]
    cm_ax.tick_params(labelsize=5.5)
    # plot lines for the different clusters
    hor_lines = [sum(item) for item in model.biclusters_[0]]
    hor_lines = list(np.cumsum(hor_lines[::n_clusters[1]]))
    ver_lines = [sum(item) for item in model.biclusters_[1]]
    ver_lines = list(np.cumsum(ver_lines[:n_clusters[0]]))
    for pp in range(len(hor_lines)-1):
        cm.ax_heatmap.hlines(hor_lines[pp],0,X_plot.shape[1], colors='r')
    for pp in range(len(ver_lines)-1):
        cm.ax_heatmap.vlines(ver_lines[pp],0,X_plot.shape[0], colors='r')
    # title
    title = name+' - '+str(n_clusters[1])+'-'+str(n_clusters[0])
    plt.title(title)
    cm.savefig(title,dpi=300)
    plt.show()
    # save models
    models_all.append(model)
# compare models
score = consensus_score(models_all[0].biclusters_, models_all[1].biclusters_)
print("consensus score between: {:.1f}".format(score))

Relation between 2D KDE bandwidth in sklearn vs bandwidth in scipy

I'm attempting to compare the performance of sklearn.neighbors.KernelDensity versus scipy.stats.gaussian_kde for a two dimensional array.
From this article I see that the bandwidths (bw) are treated differently in each function. The article gives a recipe for setting the correct bw in scipy so it will be equivalent to the one used in sklearn . Basically it divides the bw by the sample standard deviation. The result is this:
# For sklearn
bw = 0.15
# For scipy
bw = 0.15/x.std(ddof=1)
where x is the sample array I'm using to obtain the KDE. This works just fine in 1D, but I can't make it work in 2D.
Here's a MWE of what I got:
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
# Generate random data.
n = 1000
m1, m2 = np.random.normal(0.2, 0.2, size=n), np.random.normal(0.2, 0.2, size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5
# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
kernel = stats.gaussian_kde(values, bw_method=0.15/np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1,y1))
print('iso1 = ', iso1[0])
# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=0.15).fit(values.T)
# Get KDE value for the point.
iso2 = kernel_sk.score_samples([[x1, y1]])
print('iso2 = ', np.exp(iso2[0]))
(iso2 is presented as an exponential since sklearn returns the log values.)
The results I get for iso1 and iso2 are different, and I'm lost as to how I should adjust the bandwidth (in either function) to make them equal (as they should be).
Add
I was advised over at sklearn chat (by ep) that I should scale the values in (x,y) before calculating the kernel with scipy in order to obtain comparable results with sklearn.
So this is what I did:
# Scale values.
x_val_sca = np.asarray(values[0])/np.asarray(values).std(axis=1)[0]
y_val_sca = np.asarray(values[1])/np.asarray(values).std(axis=1)[1]
values = [x_val_sca, y_val_sca]
kernel = stats.gaussian_kde(values, bw_method=bw_value)
ie: I scaled both dimensions before getting the kernel with scipy while leaving the line that obtains the kernel in sklearn untouched.
This gave better results, but there are still differences in the kernels obtained:
where the red dot is the (x1,y1) point in the code. So as can be seen, there are still differences in the shapes of the density estimates, albeit very small ones. Perhaps this is the best that can be achieved?
A couple of years later I tried this and think I got it to work with no re-scaling needed for the data. Bandwidth values do need some scaling though:
# For sklearn
bw = 0.15
# For scipy
bw = 0.15/x.std(ddof=1)
The evaluation of both KDEs for the same point is not exactly equal. For example here's an evaluation for the (x1, y1) point:
iso1 = 0.00984751705005 # Scipy
iso2 = 0.00989788224787 # Sklearn
but I guess it's close enough.
Here's the MWE for the 2D case and the output which, as far as I can see, look almost exactly the same:
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
# Generate random data.
n = 1000
m1, m2 = np.random.normal(-3., 3., size=n), np.random.normal(-3., 3., size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
ext_range = [xmin, xmax, ymin, ymax]
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5
# Bandwidth value.
bw = 0.15
# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
# **Bandwidth needs to be scaled to match Sklearn results**
kernel = stats.gaussian_kde(
values, bw_method=bw/np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1, y1))
print('iso1 = ', iso1[0])
# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=bw).fit(values.T)
# Get KDE value for the point. Use exponential since sklearn returns the
# log values
iso2 = np.exp(kernel_sk.score_samples([[x1, y1]]))
print('iso2 = ', iso2[0])
# Plot
fig = plt.figure(figsize=(10, 10))
gs = gridspec.GridSpec(1, 2)
# Scipy
plt.subplot(gs[0])
plt.title("Scipy", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos = kernel(positions)
kde = np.reshape(k_pos.T, x.shape)
plt.imshow(np.rot90(kde), cmap=plt.cm.YlOrBr, extent=ext_range)
plt.contour(x, y, kde, 5, colors='k', linewidths=0.6)
# Sklearn
plt.subplot(gs[1])
plt.title("Sklearn", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos2 = np.exp(kernel_sk.score_samples(positions.T))
kde2 = np.reshape(k_pos2.T, x.shape)
plt.imshow(np.rot90(kde2), cmap=plt.cm.YlOrBr, extent=ext_range)
plt.contour(x, y, kde2, 5, colors='k', linewidths=0.6)
fig.tight_layout()
plt.savefig('KDEs', dpi=300, bbox_inches='tight')
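A possible explanation for why the two values are close but not identical: as far as I understand, scipy's gaussian_kde multiplies the full sample covariance matrix by the squared scalar factor, while sklearn's KernelDensity applies a single bandwidth to every dimension. The sketch below evaluates a Gaussian KDE by hand with both kinds of kernel covariance so the two conventions can be compared directly. The helper gaussian_kde_at and all variable names are hypothetical, and no library KDE is called.

```python
import numpy as np

def gaussian_kde_at(point, samples, kernel_cov):
    """Evaluate a Gaussian KDE at one point.

    samples has shape (n, d); kernel_cov is the (d, d) covariance of each kernel.
    """
    diff = samples - np.asarray(point)
    inv_cov = np.linalg.inv(kernel_cov)
    # Squared Mahalanobis distance from the point to every sample.
    maha = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    d = samples.shape[1]
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(kernel_cov))
    return np.exp(-0.5 * maha).sum() / (len(samples) * norm)

rng = np.random.default_rng(0)
samples = rng.normal(-3.0, 3.0, size=(1000, 2))  # hypothetical data, similar to the MWE
h = 0.15  # absolute bandwidth, in data units

# Isotropic kernel: every dimension gets standard deviation h.
iso_cov = h ** 2 * np.eye(2)

# Data-covariance kernel: squared scalar factor times the sample covariance.
factor = h / samples.std(ddof=1)
scaled_cov = factor ** 2 * np.cov(samples, rowvar=False)

point = (0.5, 0.5)
density_iso = gaussian_kde_at(point, samples, iso_cov)
density_scaled = gaussian_kde_at(point, samples, scaled_cov)
print(density_iso, density_scaled)
```

When the two dimensions are independent and have the same spread, scaled_cov is nearly equal to iso_cov, which would explain the close agreement; with correlated or differently scaled dimensions the two kernels, and hence the densities, would differ more.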
