Determining a threshold value for a bimodal distribution via KMeans clustering - python

I'd like to find a threshold value for a bimodal distribution. For example, a bimodal distribution could look like the following:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000; b = n//10; i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)
An attempt to find the cluster centers did not work, as I wasn't sure how the matrix, h, should be formatted:
from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(h)
I would expect to be able to find the cluster centers around -2 and 2. The threshold value would then be the midpoint of the two cluster centers.

I may have misread the question, so let me know if this misses the point. It looks like you want a 1D k-means: you added the bin frequency as a second dimension to get KMeans to run, but you would really be happy with [-2, 2] as the centers rather than [(-2, y1), (2, y2)].
To run k-means in 1D, reshape the data into n vectors of length 1 (similar question: Scikit-learn: How to run KMeans on a one-dimensional array?).
code:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000;
b = n//10;
i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)
from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(x.reshape(n,1))
print(kmeans.cluster_centers_)
output:
[[-1.9896414]
[ 2.0176039]]
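The question defines the threshold as the midpoint of the two cluster centers, so the last step can be sketched as follows. This is a minimal sketch on the question's synthetic data; `n_init` and `random_state` are assumptions added only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same synthetic bimodal sample as in the question.
np.random.seed(45)
n = 1000
i = np.random.randint(0, 2, n)
x = i * np.random.normal(-2.0, 0.8, n) + (1 - i) * np.random.normal(2.0, 0.8, n)

# Fit k-means on the raw 1D values, reshaped to n rows of length 1.
# n_init and random_state are set only for repeatability (assumption).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x.reshape(-1, 1))
centers = np.sort(kmeans.cluster_centers_.ravel())

# Threshold = midpoint between the two 1D cluster centers.
threshold = centers.mean()
print(centers, threshold)
```

Values below `threshold` then belong to the left mode and values above it to the right mode.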

Related

DBSCAN fit_predict on precomputed metrics outputs strange clusters

I am practising with ML, specifically applying DBSCAN to a precomputed distance matrix (just to see how this works). Yes, I know I could use the Euclidean metric directly, but I wanted to test the precomputed option.
I am unsure why all the labels have the same value for a data set of random points drawn from 3 different regions; I expected DBSCAN to separate them. Note: even if I use non-overlapping ranges for data1/2/3, I still get a single cluster in the output.
Here is the code:
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import DBSCAN
from scipy.spatial.distance import pdist, squareform
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
data1 = np.array ([[random.randint(1,400) for i in range(2)] for j in range (50)], dtype=np.float64)
data2 = np.array ([[random.randint(300,700) for i in range(2)] for j in range (50)], dtype=np.float64)
data3 = np.array ([[random.randint(600,900) for i in range(2)] for j in range (50)], dtype=np.float64)
data= np.append (np.append (data1,data2,axis=0), data3, axis=0)
d = pdist(data, lambda u, v: np.sqrt(((u-v)**2).sum()))
distance_matrix = squareform(d)
cluster = DBSCAN (eps=0.3, min_samples=2,metric='precomputed')
dbscan_model = cluster.fit_predict (distance_matrix)
plt.scatter (data[:,0], data[:,1], s=100, c=dbscan_model)
plt.show ()
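One likely cause, offered as an assumption because this thread has no answer: eps=0.3 is smaller than the distance between any two distinct integer-valued points, so no point has a neighbour and every label comes out as -1 (noise). The sketch below uses a hypothetical, deterministic stand-in for the random data (three small grids) to show the effect of eps on a precomputed matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for the random data: three 5x5 grids (spacing 20)
# starting at 0, 400 and 800 on both axes.
grid = np.array(
    [[gx, gy] for gx in range(0, 100, 20) for gy in range(0, 100, 20)],
    dtype=np.float64,
)
data = np.vstack([grid + offset for offset in (0.0, 400.0, 800.0)])

distance_matrix = squareform(pdist(data, metric="euclidean"))

# eps=0.3 is below every pairwise distance, so all points are noise (-1).
labels_small = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(distance_matrix)

# eps=25 reaches the nearest grid neighbours (20 apart) but not the other blobs.
labels_large = DBSCAN(eps=25, min_samples=2, metric="precomputed").fit_predict(distance_matrix)

print(np.unique(labels_small))
print(np.unique(labels_large))
```

For the question's data, eps would have to be on the order of the typical spacing between neighbouring points, and overlapping ranges can still merge into one cluster.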

Clustering binary image using sparse matrix representation

I have seen this question on StackOverflow, and I want to solve the problem of clustering binary images, such as the one from that question, using a sparse matrix representation:
I know there are simpler and more efficient clustering methods (KMeans, Mean-shift, etc.), and I'm aware the problem can also be solved with other techniques such as connected components,
but my aim is to use the sparse matrix representation approach.
What I have tried so far:
Reading the image and defining a distance function (Euclidean distance as an example):
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import cv2
import numpy as np
from math import sqrt
from scipy.sparse import csr_matrix
original = cv2.imread("km.png", 0)
img = original.copy()
def distance(p1, p2):
    """
    Euclidean Distance
    """
    return sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)
Constructing the sparse matrix from the nonzero points of the binary image. I'm considering two clusters for simplicity; this can later be extended to k clusters:
data = np.nonzero(img)
data = tuple(zip(data[0],data[1]))
data = np.asarray(data).reshape(-1,2)
# print(data.shape)
l = data.shape[0] # 68245
# num clusters
k = 2
cov_mat = csr_matrix((l, l, k), dtype = np.int8).toarray()
First loop, to pick the centroids as the two points farthest from each other:
# inefficient loop!
for i in range(l):
    for j in range(l):
        if(i!=j):
            cov_mat[i, j, 0] = distance(data[i], data[j])
# Centroids
# TODO: check if more than two datapoints with same max distance then take first!
ci = cov_mat[...,0].argmax()
Calculating the distance from each centroid:
# inefficient loop!
# TODO: repeat for k clusters!
for i in range(l):
    for j in range(l):
        if(i!=j):
            cov_mat[i, j, 0] = distance(data[i], data[ci[0]])
            cov_mat[i, j, 1] = distance(data[i], data[ci[1]])
Clustering depending on min distance:
# Labeling
cov_mat[cov_mat[...,0]>cov_mat[...,1]] = +1
cov_mat[cov_mat[...,0]<cov_mat[...,1]] = -1
# Labeling Centroids
cov_mat[ci[0], ci[0]] = +1
cov_mat[ci[1], ci[1]] = -1
Obtaining the indices of the clusters:
# cluster indices
cl1 = cov_mat[...,0].argmax()
cl2 = cov_mat[...,0].argmin()
# TODO: pass back the indices to the image to cluster it.
This approach is very slow. How can I make it more efficient? Thanks in advance.
As pointed out in the comments, when working with NumPy arrays, vectorised code should be preferred over scalar code (i.e., code with explicit for loops). Besides, as I see it, in this particular problem a dense NumPy array is a better choice than a Scipy sparse array. If you are not constrained to utilize a sparse array, this code would do:
import numpy as np
from skimage.io import imread
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
n_clusters = 2
palette = [[255, 0, 0], [0, 255, 0]]
img = imread('https://i.stack.imgur.com/v1NDe.png')
rows, cols = img.nonzero()
coords = np.stack([rows, cols], axis=1)
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(coords)
result = np.zeros(shape=img.shape+(3,), dtype=np.uint8)
for label in range(n_clusters):
    idx = (labels == label)
    result[rows[idx], cols[idx]] = palette[label]
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 8))
ax0.imshow(img, cmap='gray')
ax0.set_axis_off()
ax0.set_title('Binary image')
ax1.imshow(result)
ax1.set_axis_off()
ax1.set_title('Cluster labels');
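To illustrate the point about vectorised code, here is a minimal sketch of how the question's two double loops (farthest-pair search, then nearest-centroid labeling) could be written without explicit loops. The `coords` array is a hypothetical stand-in for the nonzero pixel coordinates, and this follows the asker's scheme, not KMeans.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical stand-in for the nonzero pixel coordinates of a binary image.
coords = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]])

# All pairwise Euclidean distances in one call (replaces the double loop).
dist = cdist(coords, coords)

# The two points farthest from each other serve as the initial centroids.
c0, c1 = np.unravel_index(dist.argmax(), dist.shape)

# Label each point by whichever of the two centroids is nearer.
labels = (dist[:, c1] < dist[:, c0]).astype(int)
print(c0, c1, labels)
```

The full distance matrix needs memory quadratic in the number of points, so for the 68245 points mentioned in the question it would take tens of gigabytes; this form only suits smaller or subsampled point sets.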

Gaussian Mixture Model with discrete data

I have 136 numbers whose distribution is an overlap of 8 Gaussian distributions. I want to find the mean and variance of each Gaussian. Can you find any mistakes in my code?
file = open("1.txt",'r') #data is in 1.txt like 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y=[int (i) for i in list((file.read()).split(','))] # parse the comma-separated data into a list of ints
x=list(range(1,len(y)+1)) # the x values
z=list(zip(x,y)) # z consists of the pairs (1, 0), (2, 0), ...
So, for the 136 points (x, y) in the xy plane that take the given data as their y values, z is the list of those points.
Now I want the mean and variance of each Gaussian distribution. The basic assumption is that the given data consists of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to make the code print the 8 means and variances. Can anyone help me?
You can use this; I have written some sample code that also visualizes the result:
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.stats  # plain "import scipy" does not guarantee that scipy.stats is loaded
import matplotlib.pyplot as plt
# In a Jupyter notebook, add "%matplotlib inline" here
#Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
model = GaussianMixture(n_components=num_components).fit(data)
#Get list of means and standard deviations
mu = model.means_.flatten()
sd = np.sqrt(model.covariances_.flatten())
#Plotting
extend_window = 50 #padding on both sides of the data range; the higher it is, the more zoomed out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
plt.show()
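If the numbers in 1.txt are counts per position (a histogram), which is an assumption about the question's data, then the mixture should be fitted to samples instead of to the counts themselves. The sketch below builds a hypothetical two-component histogram, repeats each position by its count, and prints one mean and variance per component; for the question's data, n_components would be 8.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical histogram: counts per integer position built from two Gaussians
# (means 40 and 100). In the question these counts would come from 1.txt.
x = np.arange(1, 137)

def bump(mean, sd, total):
    """Expected count per position for a Gaussian holding `total` samples."""
    return total * np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

y = np.rint(bump(40, 5, 1000) + bump(100, 8, 1000)).astype(int)

# Repeat each position by its count so the model sees samples, not counts.
samples = np.repeat(x, y).reshape(-1, 1)

# random_state is fixed only for repeatability (assumption).
model = GaussianMixture(n_components=2, random_state=0).fit(samples)
order = np.argsort(model.means_.ravel())
means = model.means_.ravel()[order]
variances = model.covariances_.ravel()[order]
for m, v in zip(means, variances):
    print(f"mean={m:.2f}  variance={v:.2f}")
```

`model.weights_` gives the relative size of each component if that is also needed.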

Python find peaks of distribution

In a dataset like this (y is the angle, x is the data point index):
How can I find the weighted average of each "band" (in this case roughly 0.1 and -90) while ignoring stray random points?
I was thinking of an FFT, but that might not be the right approach.
Perhaps convert the data into something like a normal-distribution curve and find the peaks?
Solve Using KMeans
Step 1. Generate Data
from random import randint, choice
from numpy import random
import numpy as np
from matplotlib import pyplot as plt
def gen_pts(mean_, std_, n):
    """Generate gaussian distributed random data
    mean: mean_
    standard deviation: std_
    number points: n
    """
    return np.random.normal(loc=mean_, scale = std_, size = n)
# Number of groups of horizontal blobs
n_groups = 20
# Generate random count for each group
counts = [randint(100, 200) for _ in range(n_groups)]
# Generate random mean for each group (i.e. 0 or -90)
means = [random.choice([0, -90]) for _ in range(n_groups)]
# All the groups
data = [gen_pts(mean_, 5, n) for mean_, n in zip(means, counts)]
# Concatenate groups into 1D array
X = np.concatenate(data, axis=0)
# Show Data
plt.plot(X)
plt.show()
Step 2. Find Cluster Centers
from sklearn.cluster import KMeans
# Reshape 1D data so it's suitable for kmeans model
X = X.reshape(-1,1)
# Get model for two clusters
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
# Fit Data to model
pred_y = kmeans.fit_predict(X)
# Cluster Centers
centers = kmeans.cluster_centers_
print(*centers)
# Example output (the data is unseeded, so values vary between runs): [-89.79165334] [-0.07875314]
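The question also suggests finding peaks of the distribution. As an alternative to KMeans, here is a minimal sketch of that idea using a histogram and scipy.signal.find_peaks. The data, bin width, height cut-off and averaging window are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
# Hypothetical data: two bands (near 0 and -90) plus a few stray points.
angles = np.concatenate([
    rng.normal(0, 5, 2000),
    rng.normal(-90, 5, 2000),
    rng.uniform(-180, 180, 50),
])

# Histogram with 5-degree bins over the full angle range.
counts, edges = np.histogram(angles, bins=72, range=(-180, 180))
bin_centers = 0.5 * (edges[:-1] + edges[1:])

# Keep local maxima that reach at least half of the tallest bin;
# distance=5 suppresses secondary maxima inside the same band.
peaks, _ = find_peaks(counts, height=0.5 * counts.max(), distance=5)

# Average only the points within 20 degrees of each peak, which leaves out
# most stray points.
band_means = [angles[np.abs(angles - c) < 20].mean() for c in bin_centers[peaks]]
print(bin_centers[peaks], band_means)
```

The window of 20 degrees is about four standard deviations of the synthetic bands; real data would need a width matched to its own spread.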

Principal component analysis dimension reduction in python

I have to implement my own PCA function, Y, V = PCA(data, M, whitening), that computes the first M principal
components and transforms the data so that y_n = U^T x_n. The function should also
return V, the amount of variance explained by the transformation.
I have to reduce the dimension of the data from D=4 to M=2, using the given function below:
def PCA(data,nr_dimensions=None, whitening=False):
    """ perform PCA and reduce the dimension of the data (D) to nr_dimensions
    Input:
        data... samples, nr_samples x D
        nr_dimensions... dimension after the transformation, scalar
        whitening... False -> standard PCA, True -> PCA with whitening
    Returns:
        transformed data... nr_samples x nr_dimensions
        variance_explained... amount of variance explained by the first nr_dimensions principal components, scalar"""
    if nr_dimensions is not None:
        dim = nr_dimensions
    else:
        dim = 2
What I have done is the following:
import numpy as np
import matplotlib.cm as cm
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import multivariate_normal
import pdb
import sklearn
from sklearn import datasets
#covariance matrix
mean_vec = np.mean(data)
cov_mat = (data - mean_vec).T.dot((data - mean_vec)) / (data.shape[0] - 1)
print('Covariance matrix \n%s' % cov_mat)
#now the eigendecomposition of the cov matrix
cov_mat = np.cov(data.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
This is the point where I don't know what to do next or how to reduce the dimension.
Any help would be welcome! :)
Here is a simple example for the case where the initial matrix A, which holds the samples and features, has shape=[samples, features]:
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# calculate the mean of each column, assuming each column is a variable/feature
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)
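PCA can be computed from the singular value decomposition of the centered data, so one option is numpy.linalg.svd:
import numpy as np
def PCA(U,ndim,whitening=False):
    Uc=U-U.mean(axis=0) # center each feature column
    L,G,R=np.linalg.svd(Uc,full_matrices=False) # Uc = L @ diag(G) @ R
    Y=L[:,:ndim] # whitened scores: orthonormal columns
    if not whitening:
        Y=Y*G[:ndim] # scale by singular values; same as Uc @ R[:ndim].T
    return Y,G[:ndim] # G holds the singular values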
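If you want to use the eigenvalue problem instead, then assuming that the number of samples is higher than the number of features (or your data would be underfit), it is inefficient to calculate the spatial correlations (left eigenvectors) directly. Instead, compute the right eigenvectors from the smaller features-by-features matrix and reconstruct the left ones by projecting the data onto them:
def PCA(U,ndim,whitening=False):
    Uc=U-U.mean(axis=0) # center each feature column
    K=Uc.T @ Uc # features x features matrix for the right eigenvectors
    G,R=np.linalg.eigh(K) # eigenvalues ascending, eigenvectors in columns
    G=G[::-1] # reorder to descending eigenvalues
    R=R[:,::-1]
    Y=Uc @ R[:,:ndim] # reconstructing the left ones, scaled by singular values
    if whitening:
        Y/=np.linalg.norm(Y,axis=0,keepdims=True) # normalizing them
    return Y,G[:ndim] # here G holds eigenvalues of K (squared singular values)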
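Picking up where the question stopped (after building the eigenvalue/eigenvector pairs), the remaining steps are to sort by eigenvalue, keep the top M eigenvectors, project, and report the explained variance. The sketch below is one way to do that; the name `pca_reduce` and the test data are hypothetical, and returning the explained variance as a ratio is an assumption about what the assignment expects.

```python
import numpy as np

def pca_reduce(data, nr_dimensions=2, whitening=False):
    """Project data (nr_samples x D) onto its first nr_dimensions principal axes."""
    centered = data - data.mean(axis=0)
    cov_mat = np.cov(centered, rowvar=False)
    # eigh suits symmetric matrices and returns eigenvalues in ascending order.
    eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
    # Indices of the nr_dimensions largest eigenvalues, largest first.
    order = np.argsort(eig_vals)[::-1][:nr_dimensions]
    top_vals, top_vecs = eig_vals[order], eig_vecs[:, order]
    transformed = centered @ top_vecs
    if whitening:
        # Divide by the standard deviation along each axis: unit variance.
        transformed = transformed / np.sqrt(top_vals)
    # Fraction of the total variance kept by the selected components.
    variance_explained = top_vals.sum() / eig_vals.sum()
    return transformed, variance_explained

rng = np.random.default_rng(0)
# Hypothetical 4-dimensional data with very different spreads per feature.
data = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 0.5, 0.1])
Y, V = pca_reduce(data, nr_dimensions=2)
print(Y.shape, round(V, 3))
```

With whitening=True the projected columns additionally have unit sample variance, which can be checked with `Y.var(axis=0, ddof=1)`.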
