Implementation details of K-means++ without sklearn - python

I am implementing k-means++ on the MNIST dataset, but I am having difficulty with the initialization and some of the later steps.
For the initialization, I have to first pick one random data point as the first centroid. Then, for each remaining centroid, I again pick a data point at random, but from a weighted probability distribution, until all the centroids have been chosen.
I am stuck at this step: how do I apply this distribution to choose the next centroid? In other words, how do I implement it? For D_{k-1}(x), can I just use np.linalg.norm to compute the distance and square it?
So far I have only initialized the first centroid:
self.centroids = np.zeros((self.num_clusters, input_x.shape[1]))
ran_num = np.random.choice(input_x.shape[0])
self.centroids[0] = input_x[ran_num]
for k in range(1, self.num_clusters):
    ...  # the remaining centroids should be chosen here, but how?
For the next step, do I need to pick the next centroid as the sample point with the largest distance to the previously chosen centroids?

You need to create a distribution in which the probability of selecting an observation is proportional to its (normalized) squared distance to the closest already-chosen cluster center. That way, when selecting a new cluster center, observations that are far from every existing center have a high probability of being selected, while observations close to an existing center have a low probability of being selected.
This would look like this:
centers = []
centers.append(X[np.random.randint(X.shape[0])])  # initial center = one random sample
distance = np.full(X.shape[0], np.inf)
for j in range(1, self.n_clusters):
    # distance of each point to its closest already-chosen center
    distance = np.minimum(np.linalg.norm(X - centers[-1], axis=1), distance)
    p = np.square(distance) / np.sum(np.square(distance))  # probability vector [p1, ..., pn]
    sample = np.random.choice(X.shape[0], p=p)
    centers.append(X[sample])
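If it is useful, the same idea can be folded into a standalone helper along these lines (the function name and the Generator-based RNG are my own choices, not part of your class):

import numpy as np

def kmeanspp_init(X, num_clusters, rng=None):
    """k-means++ initialization for an (n_samples, n_features) array X."""
    rng = np.random.default_rng() if rng is None else rng
    centroids = np.zeros((num_clusters, X.shape[1]))
    centroids[0] = X[rng.integers(X.shape[0])]  # first centroid: one uniformly random sample
    dist = np.full(X.shape[0], np.inf)
    for k in range(1, num_clusters):
        # distance of every point to its nearest centroid chosen so far
        dist = np.minimum(np.linalg.norm(X - centroids[k - 1], axis=1), dist)
        # k-means++ weighting: sample with probability proportional to the squared distance
        p = dist**2 / np.sum(dist**2)
        centroids[k] = X[rng.choice(X.shape[0], p=p)]
    return centroids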

Related

Is k-means++ meant to be perfect every time? What other initialization strategies can yield the best k-means?

I've implemented a k-means algorithm, and performance is highly dependent on how the centroids were initialized. With random uniform initialization I get a good k-means result about 5% of the time, whereas with k-means++ it's closer to 50%. Why is the yield of good k-means runs so low? I should note that I've only used a handful of data sets, so my good/bad rates reflect only those, not k-means in general.
Here's an example using k-means++ where the end result was not great. The Dunn Index of this clustering is 0.16.
And an example where it worked perfectly with a Dunn Index of 0.67.
I was maybe under the naive impression k-means++ produced a good k-means every time. Is there perhaps something wrong with my code?
from random import choice
from math import inf

def initialize_centroids(points, k):
    """
    Parameters:
        points : a list of Points.
        k : how many centroids to place.
    Returns:
        A list of centroids.
    """
    clusters = []
    clusters.append(choice(points))  # first centroid is a random point
    for _ in range(k - 1):  # for the other centroids
        distances = []
        for p in points:
            d = inf
            for c in clusters:  # find the minimal distance between p and the chosen centroids
                d = min(d, distance(p, c))
            distances.append(d)
        # find the index of the maximum among the minimal distances
        clusters.append(points[distances.index(max(distances))])
    return clusters
This is adapted from the algorithm as found on Wikipedia:
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
The difference is that my centroids are chosen deterministically as the farthest point, rather than sampled at random with probability proportional to the squared distance.
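For reference, I believe the probabilistic version of step 3 would look something like this sketch (distance here stands for the same helper used in my code above, and the random.choices weighting is my reading of the algorithm, not code I have tested):

import random

def choose_next_centroid(points, clusters):
    # D(x): minimal distance from each point to the already-chosen centers
    d_min = [min(distance(p, c) for c in clusters) for p in points]
    # sample one point with probability proportional to D(x)^2
    weights = [d * d for d in d_min]
    return random.choices(points, weights=weights, k=1)[0]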
My intention is to compare the Dunn Index over different values of k; empirically, a higher Dunn Index means better clustering. I can't collect (good) data if the initialization fails half of the time, so my results are skewed either by a shortcoming of k-means++ itself or by my implementation of it.
What other initialization strategies can be employed to get a more consistent result?

Computing Nearest neighbor graph using sklearn?

This question is about creating a K-nearest neighbor graph [KNNG] from a dataset with an unknown number of centroids (which is not the same as K-means clustering).
Suppose that you have a dataset of observations stored in a data matrix X[n_samples, n_features] with each row being an observation or feature vector and each column being a feature. Now suppose you want to compute the (weighted) k-Neighbors graph for points in X using sklearn.neighbors.kneighbors_graph.
What are the basic methods to pick the number of neighbors to use for each sample? What algorithms scale well when you have lots of observations?
I have seen the brute-force method below, but it does not scale well as the dataset grows, and you have to pick a good upper bound for n_neighbors_max up front. Does this algorithm have a name?
import numpy
import sklearn.metrics.pairwise

def autoselect_K(X, n_neighbors_max, threshold):
    # get the pairwise euclidean distance between every observation
    D = sklearn.metrics.pairwise.euclidean_distances(X, X)
    chosen_k = n_neighbors_max
    for k in range(2, n_neighbors_max):
        k_avg = []
        # loop over each row in the distance matrix
        for row in D:
            # sort the row from smallest distance to largest distance
            sorted_row = numpy.sort(row)
            # calculate the mean of the k smallest distances (the first is the zero self-distance)
            k_avg.append(numpy.mean(sorted_row[0:k]))
        # find the median of the averages
        kmedian_dist = numpy.median(k_avg)
        if kmedian_dist >= threshold:
            chosen_k = k
            break
    # return the number of nearest neighbors to use
    return chosen_k
From your code, it appears that you are looking for a classification result based on the nearest neighbours.
In that case, your search over the full distance matrix is a brute-force search and defeats the purpose of nearest-neighbour algorithms.
Perhaps what you are looking for is KNeighborsClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Regarding the choice of the number of nearest neighbours, this depends on the sparsity of your data. It helps to view nearest-neighbour search as a way to bound your search: rather than looking over all samples, it narrows the search to the top-N (nearest-neighbour) samples, after which you can apply a domain-specific technique to those N samples to get the desired result.
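To illustrate narrowing the search to the top-N samples without building the full n x n distance matrix, something along these lines should work (a sketch; the random data and n_neighbors=10 are arbitrary choices):

import numpy as np
from sklearn.neighbors import NearestNeighbors, kneighbors_graph

X = np.random.rand(5000, 8)  # stand-in data: 5000 observations, 8 features

# distances and indices of the 10 nearest neighbours of every sample,
# computed via a tree-based index rather than a full pairwise distance matrix
nn = NearestNeighbors(n_neighbors=10).fit(X)
distances, indices = nn.kneighbors(X)

# or build the weighted k-neighbors graph directly, as in the question
G = kneighbors_graph(X, n_neighbors=10, mode='distance')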

Speeding up nearest Neighbor with scipy.spatial.cKDTree

I'm trying to optimize my nearest-neighbour distance code, which I need to run for many iterations against the same reference dataset.
I am calculating the nearest-neighbour distance from each point in dataset A to the points in dataset B. Both datasets contain roughly 1000-2000 two-dimensional points. While the points in dataset A stay the same, I have about 100,000 different realizations of dataset B: B0, B1, ..., B100000. I wonder if I can somehow speed this up, given that A stays the same.
To calculate the nearest-neighbour distances I use:
for i in range(100000):
    tree = spatial.cKDTree(B[i])
    mindist1, minid = tree.query(A)
    score[i] = (np.mean(mindist1**4))**0.25
    # And some other calculations
    ...
It seems to me that there should be a smarter way to do this, given that A is identical throughout the entire loop.
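(For reference, one generic speedup that does not exploit A being fixed: recent SciPy versions let cKDTree.query run its queries in parallel via the workers argument, so the loop body could become something like the sketch below. Whether workers is available depends on the SciPy version.)

from scipy import spatial
import numpy as np

for i in range(100000):
    tree = spatial.cKDTree(B[i])
    # workers=-1 uses all available cores for the query (recent SciPy only)
    mindist1, minid = tree.query(A, workers=-1)
    score[i] = (np.mean(mindist1**4))**0.25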

Should np.linalg.norm be squared when implementing k-means clustering algorithm?

The k-means clustering algorithm objective is to find the partition S = {S_1, ..., S_k} minimizing the within-cluster sum of squares:
argmin_S sum_{i=1}^{k} sum_{x in S_i} ||x - mu_i||^2
where mu_i is the mean of the points in cluster S_i.
I looked at several implementations of it in python, and in some of them the norm is not squared.
For example (taken from here):
def form_clusters(labelled_data, unlabelled_centroids):
    """
    given some data and centroids for the data, allocate each
    datapoint to its closest centroid. This forms clusters.
    """
    # enumerate because centroids are arrays which are unhashable
    centroids_indices = range(len(unlabelled_centroids))
    # initialize an empty list for each centroid. The list will
    # contain all the datapoints that are closer to that centroid
    # than to any other. That list is the cluster of that centroid.
    clusters = {c: [] for c in centroids_indices}
    for (label, Xi) in labelled_data:
        # for each datapoint, pick the closest centroid.
        smallest_distance = float("inf")
        for cj_index in centroids_indices:
            cj = unlabelled_centroids[cj_index]
            distance = np.linalg.norm(Xi - cj)
            if distance < smallest_distance:
                closest_centroid_index = cj_index
                smallest_distance = distance
        # allocate that datapoint to the cluster of that centroid.
        clusters[closest_centroid_index].append((label, Xi))
    return clusters.values()
And, by contrast, the expected implementation (taken from here; this is just the distance calculation):
import numpy as np
from numpy.linalg import norm

def compute_distance(self, X, centroids):
    distance = np.zeros((X.shape[0], self.n_clusters))
    for k in range(self.n_clusters):
        row_norm = norm(X - centroids[k, :], axis=1)
        distance[:, k] = np.square(row_norm)
    return distance
Now, I know there are several ways to calculate the norm/distance, but I looked only at implementations that used np.linalg.norm with ord=None or ord=2, and, as I said, in some of them the norm is not squared, yet they cluster correctly.
Why?
In my experience, using the norm or the squared norm as the objective function of an optimization algorithm yields similar results. The minimum value of the objective function will change, but the parameters obtained will be the same: since squaring is monotonic for non-negative values, the centroid that minimizes the distance to a datapoint also minimizes the squared distance, so every datapoint is assigned to the same cluster either way. My intuition has always been that the inner product produces a quadratic function, and taking its root only changes the magnitude, not the topology, of the objective function. A more detailed answer can be found here: https://math.stackexchange.com/questions/2253443/difference-between-least-squares-and-minimum-norm-solution
Hope it helps.
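To see this concretely, a small check with made-up data shows that the cluster assignments produced by the plain and the squared distances are identical:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # made-up datapoints
centroids = rng.normal(size=(3, 2))    # made-up centroids

# distance of every point to every centroid, plain and squared
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
d_sq = d**2

# the argmin, i.e. the centroid each point is assigned to, is the same either way
assert np.array_equal(d.argmin(axis=1), d_sq.argmin(axis=1))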

Simulating correlated multivariate data

I'm trying to generate synthetic realizations from historical hurricane data. A hurricane is parameterized in my problem by a set of descriptors (i.e. storm size, storm intensity, storm speed, and storm heading - all referenced to the values at the time the hurricane crosses some shoreline). The realizations will be used to make probabilistic forecasts of hurricane-generated flooding. The assumption is that the historical hurricane data comes from some underlying multivariate distribution. The idea is to draw additional samples from this underlying distribution (preserving moments, correlation, physical bounds such as positive storm size, etc).
I've implemented a nearest neighbor Gaussian dispersion method modified from a technique developed by Taylor and Thompson - published in Computational Statistics and Data Analysis, 1986. I'd like to see if there are better ways to do this.
Data sample (Gulf of Mexico hurricanes 1940-2005):
import random
import numpy
import scipy.spatial

def TT_alg(data_list, sample_size, num_neighbors=5, metric=2):
    dummy_list = []
    dimension = len(data_list[0])
    # transform the data to the interval [0, 1]
    aa = numpy.array([(max([row[i] for row in data_list]) - min([row[i] for row in data_list])) for i in range(dimension)])
    bb = numpy.array([min([row[j] for row in data_list]) for j in range(dimension)])
    data_array = numpy.array(data_list)
    data_array_normed = (data_array - bb) / aa
    # set up the nearest neighbor tree
    tree = scipy.spatial.KDTree(data_array_normed)
    # perform the nearest neighbor random walk
    for ijk in range(sample_size):
        sample = random.choice(data_array_normed)
        kNN = tree.query(sample, k=num_neighbors, p=metric)
        x_mu = numpy.array([numpy.average([data_array_normed[i][j] for i in kNN[1]]) for j in range(dimension)])
        x_si = numpy.array([numpy.std([data_array_normed[i][j] for i in kNN[1]]) for j in range(dimension)])
        s_gs = [numpy.random.normal(mu, si) for mu, si in zip(x_mu, x_si)]
        dummy_list.append(s_gs)
    dummy_array = numpy.array(dummy_list)
    # go back to the original scale
    data_array_unnormed = (dummy_array * aa) + bb
    return data_array_unnormed.tolist()
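For completeness, a call of the kind that produces such a sample looks like this (the input array below is a random placeholder rather than the real hurricane table, and sample_size is arbitrary):

import numpy

# placeholder rows standing in for the descriptors (size, intensity, speed, heading)
hurricane_rows = numpy.random.rand(66, 4).tolist()
synthetic_storms = TT_alg(hurricane_rows, sample_size=500, num_neighbors=5, metric=2)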
Example result for num_neighbors=5 and a Euclidean distance metric (p=2).
Your data are almost certainly not Gaussian: speed, intensity, and size must all be positive, and size and intensity are clearly skewed. A log-normal distribution is plausible, so I'd recommend log-transforming your data before attempting distributional fits.
One way to try to capture the correlation structure (which is definitely present in your posted data!) would be to estimate the mean M and variance/covariance matrix V of the log-transformed data. Then decompose the variance/covariance matrix using Cholesky decomposition to get V = transpose(C) C. If Z is a vector of independent normals, then X = M + transpose(C) Z will be a vector of correlated normals with the desired mean and variance/covariance structure. Exponentiating the elements of X will yield your simulated results. The results should avoid artifacts such as the negative storm sizes visible in your last graph. See this paper for more details.
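A minimal numpy sketch of that recipe (data is a placeholder array of positive descriptors; note that numpy's Cholesky factor is lower triangular, so the covariance factors as V = C C^T and the transpose sits on the right):

import numpy as np

# placeholder for the historical descriptors: rows = storms, columns = size, intensity, speed, ...
data = np.random.rand(66, 4) + 0.1

logged = np.log(data)                    # work on the log scale
M = logged.mean(axis=0)                  # mean vector of the log-data
V = np.cov(logged, rowvar=False)         # variance/covariance matrix of the log-data

C = np.linalg.cholesky(V)                # lower-triangular factor, V = C @ C.T
Z = np.random.standard_normal((1000, logged.shape[1]))  # independent standard normals
X = M + Z @ C.T                          # correlated normals with mean M and covariance V
simulated = np.exp(X)                    # back on the original, strictly positive scale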
