k-means with selected initial centers - python

I am trying to do k-means clustering with selected initial centroids.
The documentation says that to specify your initial centers:
init : {‘k-means++’, ‘random’ or an ndarray}
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
My code in Python:
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[-19.07480000, -8.536],
              [22.010800000, -10.9737],
              [12.659700000, 19.2601]], np.float64)
km = KMeans(n_clusters=3, init=X).fit(data)  # data: the array being clustered
# print(km)
centers = km.cluster_centers_
print(centers)
This returns a warning:
RuntimeWarning: Explicit initial center position passed: performing only one init in k-means instead of n_init=10
n_jobs=self.n_jobs)
and returns the same initial centers. Any idea how to form the initial centers so they are accepted?

The default behavior of KMeans is to initialize the algorithm multiple times using different randomly seeded centroids. The number of random initializations is controlled by the n_init= parameter (docs):
n_init : int, default: 10
Number of times the k-means algorithm will be run with different
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.
If you pass an array as the init= argument then only a single initialization will be performed using the centroids explicitly specified in the array. You are getting a RuntimeWarning because you are still passing the default value of n_init=10 (here are the relevant lines of source code).
It's actually totally fine to ignore this warning, but you can make it go away completely by passing n_init=1 if your init= parameter is an array.
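For reference, a minimal sketch of the fix described above (synthetic data stands in for the asker's data array):
import numpy as np
from sklearn.cluster import KMeans

init_centers = np.array([[-19.0748, -8.536],
                         [22.0108, -10.9737],
                         [12.6597, 19.2601]])

# stand-in data; any (n_samples, 2) array works here
data = np.random.RandomState(0).randn(100, 2)

# n_init=1 runs a single initialization from the explicit centers, so no warning is raised
km = KMeans(n_clusters=3, init=init_centers, n_init=1).fit(data)
print(km.cluster_centers_)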

Related

How does GPFlow handle the length-scales of additive multidimensional Kernels?

I am working with GPFlow to implement a basic Gaussian Process Regressor. In particular, I have defined the following kernel for my GPR:
kernel = gpflow.kernels.RBF(lengthscales=1., active_dims=[0, 1]) + \
         gpflow.kernels.RBF(lengthscales=1., active_dims=[2, 3])
self.gp = gpflow.models.GPR(data=(gpXVar, gpYVar), kernel=kernel)
The overall parameter space is 4-dimensional, where I expect dimensions (0,1) to share one lengthscale parameter and (2,3) to share another. I bound the input space to the unit hypercube, and I normalize my labels. My question is: when setting the prior distributions over my kernel lengthscales, how does the internal implementation of GPflow inform my choices?
In particular, I only add data points to the regression if they are a minimum Euclidean distance (1e-4) away from any other point in the dataset. Since I bound the input domain to the unit cube, I know the maximum distance between samples on any given dimension is 1, and the maximum Euclidean distance within the input domain is 2 = sqrt(1+1+1+1). From the excellent case study here:
https://betanalpha.github.io/assets/case_studies/gaussian_processes.html#3_Inferring_A_Gaussian_Process
I know that the prior distribution should lower bound the lengthscale to be 1e-4, but how am I to upper bound the lengthscales? Is the upper bound the maximum distance for a single dimension (1), is it the maximum distance for a given component kernel (sqrt(2)), or is it the maximum distance over the entire input domain (2)?
I have been poring over the source code for GPflow, but I can't seem to figure out which is correct.
Thank you.
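For what it's worth, a minimal sketch of how a lengthscale prior can be attached to each component kernel in GPflow 2 (the LogNormal location and scale below are placeholder values, since choosing the upper bound is exactly the open question; tensorflow_probability is assumed to be available):
import numpy as np
import gpflow
import tensorflow_probability as tfp

f64 = gpflow.utilities.to_default_float

# two RBF components, each acting on its own pair of input dimensions, as in the question
k1 = gpflow.kernels.RBF(lengthscales=1.0, active_dims=[0, 1])
k2 = gpflow.kernels.RBF(lengthscales=1.0, active_dims=[2, 3])
kernel = k1 + k2

# each component only computes distances over its own active_dims,
# so each lengthscale only ever "sees" a 2-d slice of the unit cube
for k in (k1, k2):
    k.lengthscales.prior = tfp.distributions.LogNormal(
        loc=f64(np.log(0.5)), scale=f64(1.0))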

Output all guesses from scipy.optimize.leastsq()

I'm hoping to make an animation about how the least-squares regression analysis provided by scipy.optimize.leastsq() converges on a specific result. Is there any way to get the function to, say, append to a list a tuple of guess values for each iteration until the function converges to a local minimum? Or is there a different library which includes this feature?
Below is what I have:
import numpy as np
from scipy.optimize import leastsq

# initial guesses for the gaussian distributions to optimize [height, position, width].
# if more than 2 distributions are required, add a new set of [h,p,w] initial parameters
# to 'initials' for each new distribution, keeping the same format: [h,p,w],[h,p,w],... etc.
# A 'w' guess of 1 is typically a sufficient estimate.
initials = [6.5, 13, 1], [4.5, 19, 1]
# number of gaussian functions to compute, taken from the initial guesses
n = len(initials)
# flatten initials into a 1D array
var = np.concatenate(initials)
# data matrix built from the raw data ('master')
M = np.array(master)

# a typical gaussian function of independent variable x,
# amplitude a, position b, and width parameter c
def gaussian(x, a, b, c):
    return a * np.exp((-(x - b)**2.0) / c**2.0)

# the expected result: a sum of the individual gaussian functions
def GaussSum(x, p):
    return sum(gaussian(x, p[3*k], p[3*k+1], p[3*k+2]) for k in range(n))

# minimization condition: the square of the difference between the data (y)
# and the model GaussSum(x, p)
def residuals(p, y, x):
    return (y - GaussSum(x, p))**2

# least-squares regression to optimize the initial parameters
cnsts = leastsq(residuals, var, args=(M[:, 1], M[:, 0]))[0]
What I'm eventually hoping for is for 'cnsts' to be a list of tuples of every guess, from the initial guess to the final one.
If I'm understanding your question correctly, you want to make a guess at each of the different coefficients while fitting a linear regression line, then have a list of all the coefficients that have been guessed? Similar to how a NN will back-propagate the error to better fit a model?
Linear regression isn't guessing the different coefficients. It's just calculating them... https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/regression-analysis/find-a-linear-regression-equation/#FindaLinear
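That said, if the goal is just to see the sequence of parameter vectors that leastsq tries, one workaround (a sketch reusing GaussSum, var and M from the question, not a built-in leastsq feature) is to record every parameter vector passed to the residual function:
guesses = []  # one entry per call to the residual function

def residuals_recording(p, y, x):
    guesses.append(tuple(p))           # record the current trial parameters
    return (y - GaussSum(x, p))**2     # same residuals as in the question

cnsts = leastsq(residuals_recording, var, args=(M[:, 1], M[:, 0]))[0]
# 'guesses' now holds every parameter vector leastsq evaluated, including the
# extra evaluations used to approximate the Jacobian by finite differences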

partially define initial centroid for scikit-learn K-Means clustering

Scikit documentation states that:
Method for initialization:
‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
My data has 10 (predicted) clusters and 7 features. However, I would like to pass an array of shape 10 by 6, i.e. I want 6 dimensions of each centroid to be predefined by me, but the 7th dimension to be chosen freely by k-means++. (In other words, I do not want to fully specify the initial centroids, but rather control 6 dimensions and leave only one dimension to vary for the initial clusters.)
I tried passing a 10x6 array, hoping it would work, but it just throws an error.
Sklearn does not allow you to perform this kind of fine-grained operation.
The only possibility is to provide a 7th feature value that is random, or similar to what k-means++ would have achieved.
So basically you can estimate a good value for this as follows:
import numpy as np
from sklearn.cluster import KMeans

nb_clust = 10
# your data
X = np.random.randn(7 * 1000).reshape((1000, 7))
# your 6-column centroids
cent_6cols = np.random.randn(6 * nb_clust).reshape((nb_clust, 6))
# artificially fix your centroids
km = KMeans(n_clusters=10)
km.cluster_centers_ = cent_6cols
# find the points lying in each cluster given your initialization
initial_prediction = km.predict(X[:, 0:6])
# for the 7th column, provide the average value of the points
# lying in the cluster given by your partial centroids
cent_7cols = np.zeros((nb_clust, 7))
cent_7cols[:, 0:6] = cent_6cols
for i in range(nb_clust):
    init_7th = X[np.where(initial_prediction == i), 6].mean()
    cent_7cols[i, 6] = init_7th
# now the 7th column is initialized in a k-means++-like fashion,
# so cent_7cols can be used as the initial centroids
truekm = KMeans(n_clusters=10, init=cent_7cols)
That is a very nonstandard variation of k-means. So you cannot expect sklearn to be prepared for every exotic variation. That would make sklearn slower for everybody else.
In fact, your approach is more like certain regression approaches (predicting the last value of the cluster centers) rather than clustering. I also doubt the results will be much better than simply setting the last value to the average of all points assigned to the cluster center using the other 6 dimensions only. Try partitioning your data based on the nearest center (ignoring the last column) and then setting the last column to be the arithmetic mean of the assigned data.
However, sklearn is open source.
So get the source code and modify k-means: initialize the last component randomly, and while running k-means only update the last column. It's easy to modify it this way, but it's very hard to design an efficient API to allow such customizations through trivial parameters, so use the source code to customize at this level.
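A rough sketch of that idea in plain NumPy (a toy Lloyd-style loop, not sklearn's implementation): only the 7th centroid column is ever updated, while the 6 user-supplied columns stay fixed.
import numpy as np

def kmeans_fixed_6cols(X, cent_6cols, n_iter=20, seed=0):
    rng = np.random.RandomState(seed)
    n_clust = cent_6cols.shape[0]
    # start with the fixed 6 columns plus a randomly initialized 7th column
    centers = np.hstack([cent_6cols, rng.randn(n_clust, 1)])
    for _ in range(n_iter):
        # assign each point to the nearest full 7-d centroid
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # update only the last column; columns 0..5 never change
        for k in range(n_clust):
            members = X[labels == k]
            if len(members):
                centers[k, 6] = members[:, 6].mean()
    return centers, labels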

Evaluate SmoothBivariateSpline for two 1d array lists

I have three arrays x, y, z. I wanted to smooth the z-data, so I used the SmoothBivariateSpline function. But when I evaluate the result, I get completely different values compared to my original z-data. Below is my code:
import numpy as np
from scipy.interpolate import SmoothBivariateSpline

def envinterpolate(x, y, z):
    x_interp = np.linspace(min(x), max(x), len(x) * 4)
    y_interp = np.linspace(min(y), max(y), len(x) * 4)
    sbsp = SmoothBivariateSpline(x, y, z)
    z_interp = sbsp.ev(x_interp, y_interp)
    return z_interp
Is there anything wrong in my code while evaluating the values of spline?
Attaching the plot after trying the s=0 parameter (red line: my actual z-data, black line: z_interp data).
By convention, "smoothing" refers specifically to cases where you don't want the interpolant to pass exactly through your input data points (for example if you know that your input data is noisy).
SmoothBivariateSpline takes a parameter s that controls the degree of smoothing that is applied to the interpolant:
s : float, optional
Positive smoothing factor defined for estimation condition:
sum((w[i]*(z[i]-s(x[i], y[i])))**2, axis=0) <= s
Default s=len(w) which should be a good value if 1/w[i] is an
estimate of the standard deviation of z[i].
If you don't want any smoothing you could simply set s=0.
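For illustration, a small sketch on synthetic data (not the asker's) comparing the default smoothing with s=0:
import numpy as np
from scipy.interpolate import SmoothBivariateSpline

# synthetic scattered data, purely for illustration
rng = np.random.RandomState(0)
x = rng.uniform(0.0, 1.0, 200)
y = rng.uniform(0.0, 1.0, 200)
z = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)

smoothed = SmoothBivariateSpline(x, y, z)       # default s: allows smoothing
interp = SmoothBivariateSpline(x, y, z, s=0)    # s=0: tries to pass through the data

# compare both fits to the original z at the input points
print(np.abs(smoothed.ev(x, y) - z).max())
print(np.abs(interp.ev(x, y) - z).max())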

sklearn.mixture.GMM to fit multiple Gaussian curves into a histogram, an EM algorithm error

I'm using sklearn.mixture.GMM to fit two Gaussian curves to an array of data and then overlay them on the data histogram (the data distribution is a mixture of 2 Gaussian curves).
My data is a list of float numbers, and here are the lines of code I am using:
from sklearn import mixture

clf = mixture.GMM(n_components=1, covariance_type='diag')
clf.fit(listOffValues)
If I set n_components to 1, I get the following error:
"(or increasing n_init) or check for degenerate data.")
RuntimeError: EM algorithm was never able to compute a valid likelihood given initial parameters. Try different init parameters (or increasing n_init) or check for degenerate data.
and if I set n_components to 2, the error is:
(self.n_components, X.shape[0]))
ValueError: GMM estimation with 2 components, but got only 1 samples.
For the first error, I tried changing all init parameters of GMM, but it didn't make any difference.
I tried an array of random numbers and the code works perfectly fine.
I can't figure out what the issue could be.
Is there an implementation issue I'm overlooking?
Thank you for your help.
If I understood you correctly, you would like to fit your data distribution with Gaussians, and you have only one feature per element. Then you should reshape your vector into a column vector:
listOffValues = np.reshape(listOffValues, (-1, 1))
Otherwise, if your listOffValues corresponds to some curve that you want to fit with several Gaussians, then you should use curve_fit. See Gaussian fit for Python.
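A minimal sketch of the reshape fix on synthetic data (note that sklearn.mixture.GMM was later replaced by GaussianMixture, which is used here):
import numpy as np
from sklearn.mixture import GaussianMixture

# synthetic 1-d samples drawn from two Gaussians, for illustration only
rng = np.random.RandomState(0)
values = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 1.0, 500)])

# reshape the flat list of samples into an (n_samples, 1) column vector
X = np.reshape(values, (-1, 1))

gmm = GaussianMixture(n_components=2, covariance_type='diag').fit(X)
print(gmm.means_.ravel())   # approximate centers of the two Gaussians
print(gmm.weights_)         # mixing proportions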
