I have images that I am segmenting using a gaussian mixture model from scikit-learn. Some images are labeled, so I have a good bit of prior information that I would like to use. I would like to run a semi-supervised training of a mixture model, by providing some of the cluster assignments ahead of time.
From the Matlab documentation, I can see that Matlab allows initial values to be set. Are there any python libraries, especially scikit-learn approaches that would allow this?
The standard GMM does not work in a semi-supervised fashion. The initial values you mention are likely the initial values for the mean vectors and covariance matrices of the Gaussians, which are then updated by the EM algorithm.
A simple hack is to group your labeled data by label, estimate a mean vector and covariance matrix for each group individually, and pass these as the initial values to the mixture-model fit (MATLAB allows this directly, and newer versions of scikit-learn's GaussianMixture also accept initial means and precisions). Hopefully this will position your Gaussians at the "correct" locations, and the EM algorithm will take it from there to adjust these parameters.
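A minimal sketch of this initialization, assuming a reasonably recent scikit-learn where GaussianMixture accepts means_init and precisions_init; the arrays X_labeled, y_labeled, and X_all below are hypothetical placeholders:

import numpy as np
from sklearn.mixture import GaussianMixture

def init_from_labels(X_labeled, y_labeled, n_components):
    """Estimate per-cluster means and precision matrices from labeled data."""
    n_features = X_labeled.shape[1]
    means, precisions = [], []
    for k in range(n_components):
        X_k = X_labeled[y_labeled == k]
        means.append(X_k.mean(axis=0))
        # A small ridge guards against singular covariances when a cluster has few points
        cov_k = np.cov(X_k, rowvar=False) + 1e-6 * np.eye(n_features)
        precisions.append(np.linalg.inv(cov_k))
    return np.array(means), np.array(precisions)

# means, precisions = init_from_labels(X_labeled, y_labeled, n_components=10)
# gmm = GaussianMixture(n_components=10, covariance_type="full",
#                       means_init=means, precisions_init=precisions)
# gmm.fit(X_all)  # EM then refines these supervised starting points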
The downside of this hack is that it does not guarantee the true label assignments will be respected: even if a data point is assigned a particular cluster label up front, there is a chance it will be re-assigned to another cluster. Also, noise in your feature vectors or labels could cause your initial Gaussians to cover a much larger region than they are supposed to, wreaking havoc on the EM algorithm. Finally, if you do not have enough data points for a particular cluster, the estimated covariance matrix might be singular, breaking this trick altogether.
Unless you absolutely must use a GMM to cluster your data (e.g., you know for sure that Gaussians model your data well), you could instead try the semi-supervised methods in scikit-learn. These propagate labels to the other data points based on feature similarities. However, I doubt they can handle large datasets, since they require building a graph Laplacian from pairs of samples, unless scikit-learn has some special implementation trick for this.
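For reference, a minimal sketch of label propagation with scikit-learn's LabelSpreading, where unlabeled samples are marked with -1 (the data and kernel settings here are just illustrative placeholders):

import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Placeholder data: one feature vector per sample, -1 marks an unlabeled sample
X = np.random.rand(200, 5)
y = np.full(200, -1)
y[:20] = np.random.randint(0, 3, size=20)  # a handful of labeled samples

model = LabelSpreading(kernel="rbf", gamma=20)
model.fit(X, y)
predicted = model.transduction_  # labels propagated to every sample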
I'm studying machine learning from this course, which uses scikit-learn for regression - https://www.udemy.com/machinelearning/
I can see that for some regression algorithms the author uses feature scaling and for some he doesn't, because some scikit-learn regression algorithms take care of feature scaling by themselves.
How do I know which training algorithms need feature scaling and which don't?
Strictly speaking, no machine learning technique needs feature scaling; for some algorithms, scaled inputs simply make the optimization easier for the computer, which results in faster training times.
Typically, algorithms that leverage distance or assume normality benefit from feature scaling. https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
It depends on the algorithm you are using and your dataset.
Support Vector Machines (SVMs): these models converge faster if you scale your features. The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges.
In k-means clustering, Euclidean distance determines how data points are grouped, so it makes sense to scale your features so that the centroids aren't dominated by large or abnormal values.
In the case of linear regression, scaling your features will not be of much help, since the coefficients simply rescale to compensate and the fit stays the same.
Decision trees don't usually require feature scaling, since splits depend only on the ordering of values within each feature.
For models that involve a learning rate and use gradient descent, the input scale does affect the gradients, so feature scaling should be considered in this case, as sketched below.
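As an illustration of the SVM and gradient-descent cases above, a common pattern is to put a StandardScaler in front of the estimator inside a Pipeline, so the scaling is learned only from the training data (the toy data below is made up):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Made-up features on wildly different scales
X = np.column_stack([np.random.rand(100), 1e4 * np.random.rand(100)])
y = X[:, 0] + 1e-4 * X[:, 1] + 0.01 * np.random.randn(100)

# The scaler is fit inside the pipeline, so cross-validation stays leak-free
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
model.fit(X, y)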
A very simple answer: some algorithms handle feature scaling internally and some do not. If the algorithm does not, you need to scale the features manually.
You can look up which algorithms handle scaling themselves, but it's safer to scale the features manually. Always make sure the features are scaled; otherwise, the algorithm's output will be offset from the ideal.
I am using scikit-learn for Gaussian process regression (GPR) to predict data. My training data are as follows:
x_train = np.array([[0,0],[2,2],[3,3]]) #2-D cartesian coordinate points
y_train = np.array([[200,250, 155],[321,345,210],[417,445,851]]) #observed output from three different datasources at respective input data points (x_train)
The test points (2-D) where mean and variance/standard deviation need to be predicted are:
xvalues = np.array([0,1,2,3])
yvalues = np.array([0,1,2,3])
x,y = np.meshgrid(xvalues,yvalues) #Total 16 locations (2-D)
positions = np.vstack([x.ravel(), y.ravel()])
x_test = (np.array(positions)).T
Now, after running the GPR fit (here, the product of ConstantKernel and RBF is used as the kernel in GaussianProcessRegressor), the mean and variance/standard deviation can be predicted with the following line of code:
y_pred_test, sigma = gp.predict(x_test, return_std =True)
While printing the predicted mean (y_pred_test) and variance (sigma), I get the following output in the console:
In the predicted values (mean), a 'nested array' with three values per inner array is printed. I presume the inner arrays are the predicted mean values for each data source at each 2-D test point location. However, the printed variance contains only a single array with 16 entries (presumably one per test location). I know that the variance provides an indication of the uncertainty of the estimation, so I was expecting a predicted variance for each data source at each test point. Is my expectation wrong? How can I get the predicted variance for each data source at each test point? Is there something wrong with my code?
Well, you have inadvertently hit on an iceberg indeed...
As a prelude, let's make clear that the concepts of variance & standard deviation are defined only for scalar variables; for vector variables (like your own 3d output here), the concept of variance is no longer meaningful, and the covariance matrix is used instead (Wikipedia, Wolfram).
Continuing on the prelude, the shape of your sigma is indeed as expected according to the scikit-learn docs on the predict method (i.e. there is no coding error in your case):
Returns:
y_mean : array, shape = (n_samples, [n_output_dims])
Mean of predictive distribution a query points
y_std : array, shape = (n_samples,), optional
Standard deviation of predictive distribution at query points. Only returned when return_std is True.
y_cov : array, shape = (n_samples, n_samples), optional
Covariance of joint predictive distribution a query points. Only returned when return_cov is True.
Combined with my previous remark about the covariance matrix, the first choice would be to try the predict function with the argument return_cov=True instead (since asking for the variance of a vector variable is meaningless); but again, this will lead to a 16x16 matrix, instead of a 3x3 one (the expected shape of a covariance matrix for 3 output variables)...
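For completeness, this is roughly what that would look like, assuming gp is the fitted GaussianProcessRegressor from the question:

y_pred_test, cov = gp.predict(x_test, return_cov=True)
print(cov.shape)  # (16, 16): covariance between the 16 query points, not a 3x3 covariance between the outputs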
Having clarified these details, let's proceed to the essence of the issue.
At the heart of your issue lies something rarely mentioned (or even hinted at) in practice and in relevant tutorials: Gaussian Process regression with multiple outputs is highly non-trivial and still a field of active research. Arguably, scikit-learn cannot really handle the case, despite the fact that it will superficially appear to do so, without issuing at least some relevant warning.
Let's look for some corroboration of this claim in the recent scientific literature:
Gaussian process regression with multiple response variables (2015) - quoting (emphasis mine):
most GPR implementations model only a single response variable, due to
the difficulty in the formulation of covariance function for
correlated multiple response variables, which describes not only the
correlation between data points, but also the correlation between
responses. In the paper we propose a direct formulation of the
covariance function for multi-response GPR, based on the idea that [...]
Despite the high uptake of GPR for various modelling tasks, there
still exists some outstanding issues with the GPR method. Of
particular interest in this paper is the need to model multiple
response variables. Traditionally, one response variable is treated as
a Gaussian process, and multiple responses are modelled independently
without considering their correlation. This pragmatic and
straightforward approach was taken in many applications (e.g. [7, 26,
27]), though it is not ideal. A key to modelling multi-response
Gaussian processes is the formulation of covariance function that
describes not only the correlation between data points, but also the
correlation between responses.
Remarks on multi-output Gaussian process regression (2018) - quoting (emphasis in the original):
Typical GPs are usually designed for single-output scenarios wherein
the output is a scalar. However, the multi-output problems have
arisen in various fields, [...]. Suppose that we attempt to approximate T outputs {f(t}, 1 ≤t ≤T , one intuitive idea is to use the single-output GP (SOGP) to approximate them individually using the associated training data D(t) = { X(t), y(t) }, see Fig. 1(a). Considering that the outputs are correlated in some way, modeling them individually may result in the loss of valuable information. Hence, an increasing diversity of engineering applications are embarking on the use of multi-output GP (MOGP), which is conceptually depicted in Fig. 1(b), for surrogate modeling.
The study of MOGP has a long history and is known as multivariate
Kriging or Co-Kriging in the geostatistic community; [...] The MOGP handles problems with the basic assumption that the outputs are correlated in some way. Hence, a key issue in MOGP is to exploit the output correlations such that the outputs can leverage information from one another in order to provide more accurate predictions in comparison to modeling them individually.
Physics-Based Covariance Models for Gaussian Processes with Multiple Outputs (2013) - quoting:
Gaussian process analysis of processes with multiple outputs is
limited by the fact that far fewer good classes of covariance
functions exist compared with the scalar (single-output) case. [...]
The difficulty of finding “good” covariance models for multiple
outputs can have important practical consequences. An incorrect
structure of the covariance matrix can significantly reduce the
efficiency of the uncertainty quantification process, as well as the
forecast efficiency in kriging inferences [16]. Therefore, we argue,
the covariance model may play an even more profound role in co-kriging
[7, 17]. This argument applies when the covariance structure is
inferred from data, as is typically the case.
Hence, my understanding, as I said, is that scikit-learn is not really capable of handling such cases, despite the fact that something like that is not mentioned or hinted at in the documentation (it may be interesting to open a relevant issue at the project page). This seems to be the conclusion in this relevant SO thread, too, as well as in this CrossValidated thread regarding the GPML (Matlab) toolbox.
Having said that, and apart from reverting to the choice of simply modeling each output separately (not an invalid choice, as long as you keep in mind that you may be throwing away useful information from the correlation between your 3-D output elements), there is at least one Python toolbox which seems capable of modeling multiple-output GPs, namely the runlmc (paper, code, documentation).
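If you do go with the separate-model route, a sketch along these lines (reusing the x_train, y_train, and x_test arrays from the question, with an illustrative kernel) fits one independent GP per output column and gives a per-point standard deviation for each output, at the cost of ignoring correlations between outputs:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)

# One independent single-output GP per data source
models = []
for j in range(y_train.shape[1]):
    gp_j = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
    gp_j.fit(x_train, y_train[:, j])
    models.append(gp_j)

# Each model now returns its own mean and standard deviation at the 16 test points
predictions = [m.predict(x_test, return_std=True) for m in models]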
First of all, if the parameter used is "sigma", that's referring to standard deviation, not variance (recall, variance is just standard deviation squared).
It's easier to conceptualize this in terms of variance, where the variance of a set is the average squared distance from each data point to the mean of the set.
In your case, you have a set of 2D points. If you think of these as points on a 2D plane, the variance summarizes how far the points spread around their mean, and the standard deviation is the positive square root of the variance.
In this case, you have 16 test points, and 16 values of standard deviation. This makes perfect sense, since each test point has its own defined distance from the mean of the set.
If you want to compute the variance of the SET of points, you can do that by summing the square of each point, dividing by the number of points, and then subtracting the square of the mean. The positive root of this number will yield the standard deviation of the set.
ASIDE: this also means that if you change the set through insertion, deletion, or substitution, the standard deviation of EVERY point will change. This is because the mean will be recomputed to accommodate the new data. This iterative process is the fundamental force behind k-means clustering.
I have a dataset of images that I would like to run nonlinear dimensionality reduction on. To decide what number of output dimensions to use, I need to be able to find the retained variance (or explained variance, I believe they are similar). Scikit-learn seems to have by far the best selection of manifold learning algorithms, but I can't see any way of getting a retained variance statistic. Is there a part of the scikit-learn API that I'm missing, or simple way to calculate the retained variance?
I don't think there is a clean way to derive the "explained variance" of most non-linear dimensionality techniques, in the same way as it is done for PCA.
For PCA, it is trivial: you are simply taking the weight of a principal component in the eigendecomposition (i.e. its eigenvalue) and summing the weights of the ones you use for linear dimensionality reduction.
Of course, if you keep all the eigenvectors, then you will have "explained" 100% of the variance (i.e. perfectly reconstructed the covariance matrix).
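For PCA in scikit-learn this is exposed directly; a minimal sketch, assuming a data matrix X of shape (n_samples, n_features):

from sklearn.decomposition import PCA

pca = PCA(n_components=10).fit(X)
print(pca.explained_variance_ratio_)        # per-component share of the total variance
print(pca.explained_variance_ratio_.sum())  # variance retained by the first 10 components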
Now, one could try to define a notion of explained variance in a similar fashion for other techniques, but it might not have the same meaning.
For instance, some dimensionality reduction methods might actively try to push apart more dissimilar points and end up with more variance than what we started with. Or much less if it chooses to cluster some points tightly together.
However, in many non-linear dimensionality reduction techniques, there are other measures that give notions of "goodness-of-fit".
For instance, in scikit-learn, Isomap exposes a reconstruction error, t-SNE can return its KL divergence, and MDS can return the reconstruction stress.
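A quick sketch of where those quantities live in scikit-learn (X is a placeholder data matrix; the parameters are defaults, not recommendations):

from sklearn.manifold import Isomap, TSNE, MDS

iso = Isomap(n_components=2).fit(X)
print(iso.reconstruction_error())  # Isomap reconstruction error, lower is better

tsne = TSNE(n_components=2).fit(X)
print(tsne.kl_divergence_)         # KL divergence after t-SNE optimization

mds = MDS(n_components=2).fit(X)
print(mds.stress_)                 # final value of the MDS stress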
I am using scikit-learn to make predictions on a very large dataset. The data is very wide but not very long, so I want to assign weights to parts of the data. If I know that some parts of the data are more important than other parts, how should I inform scikit-learn of this? Or does that kind of pre-teaching break the whole machine learning approach?
The most straightforward way of doing this is perhaps to use Principal Component Analysis on your data matrix X. The principal vectors form an orthogonal basis for X, and each is a linear combination of the original feature space (normally the columns) of X. The decomposition is such that each principal vector has a corresponding eigenvalue (or singular value, depending on how you compute the PCA), a scalar that reflects how much of X can be reconstructed from that principal vector alone, in a least-squares sense.
The magnitudes of the coefficients of a principal vector can be interpreted as the importance of the individual features of your data, since each coefficient maps 1:1 to a feature, or column, of the matrix. By selecting one or two principal vectors and examining their coefficient magnitudes, you can get a preliminary idea of which columns are more relevant, of course only to the extent that these vectors approximate the matrix.
This is the detailed scikit-learn API description. Again, PCA is a simple way of doing it, but just one among others.
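A minimal sketch of inspecting the leading principal vector's coefficients with scikit-learn's PCA (X stands in for your data matrix):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(X)
# Each row of components_ is a principal vector; larger |coefficient| -> more influential feature
importance = np.abs(pca.components_[0])
most_relevant = np.argsort(importance)[::-1][:10]  # indices of the ten most influential columns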
This probably depends a bit on the machine learning algorithm you're using -- many will discover feature importances on their own (exposed, for example, via the feature_importances_ attribute of random forests and other estimators).
If you're using a distance-based measure (e.g. k-means, knn) you could manually weight the features differently by scaling the values of each feature accordingly (though it's possible scikit does some normalization...).
Alternatively, if you know some features really don't carry much information you could simply eliminate them, though you'd lose any diagnostic value these features might unexpectedly bring. There are some tools in scikit for feature selection that might help make this kind of judgement.
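Two hedged sketches of the ideas above, assuming a supervised problem with feature matrix X and labels y (the k=20 cutoff is a placeholder):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Let the model estimate feature importances itself
rf = RandomForestClassifier(n_estimators=200).fit(X, y)
print(rf.feature_importances_)

# Or drop the least informative features up front
X_reduced = SelectKBest(f_classif, k=20).fit_transform(X, y)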
I would like some suggestions on the best clustering technique to use with Python and scikits.learn. Our data come from a Phenotype Microarray, which measures the metabolic activity of a cell on various substrates over time. The output is a series of sigmoid curves, from which we extract a set of curve parameters by fitting a sigmoid function.
We would like to "rank" these activity curves through clustering, using a fixed number of clusters. For now we are using the k-means algorithm provided by the package, with (init='random', k=10, n_init=100, max_iter=1000). The input is a matrix with n_samples rows and 5 parameters per sample. The number of samples can vary, but it is usually around several thousand (e.g. 5,000). The clustering seems efficient and effective, but I would appreciate any suggestion on different methods or on the best way to assess the clustering quality.
Here are a couple of diagrams that may help:
the scatterplot of the input parameters (some of them are quite correlated), with each sample colored by its assigned cluster
the sigmoid curves from which the input parameters were extracted, colored by their assigned cluster
EDIT
Below are some elbow plots and the silhouette score for each number of clusters.
Have you noticed the striped pattern in your plots?
This indicates that you didn't normalize your data well enough.
"Area" and "Height" are highly correlated and probably on the largest scale. All the clustering happened on this axis.
You absolutely must:
perform careful preprocessing
check that your distance functions produce a meaningful (to you, not just the computer) notion of similarity
reality-check your results, and check that they aren't too simple, determined e.g. by a single attribute
Don't blindly follow the numbers. K-means will happily produce k clusters no matter what data you give it; it just optimizes some number. It's up to you to check that the results are useful and to analyze their semantic meaning - it may well be that the result is merely a mathematical local optimum that is meaningless for your task.
For 5000 samples, all methods should work without problem.
There is a pretty good overview here.
One thing to consider is whether you want to fix the number of clusters or not.
See the table for possible choices of the clustering algorithm depending on that.
I think spectral clustering is a pretty good method. You can use it for example together with the RBF kernel. You have to adjust gamma, though, and possibly restrict connectivity.
Choices that don't need n_clusters are WARD and DBSCAN, also solid choices.
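A rough sketch of both options with scikit-learn (X is your n_samples x 5 parameter matrix; the gamma, eps, and min_samples values are placeholders to tune):

from sklearn.cluster import SpectralClustering, DBSCAN
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # careful preprocessing first

# Fixed number of clusters with an RBF affinity; gamma needs tuning
spectral_labels = SpectralClustering(n_clusters=10, affinity="rbf", gamma=1.0).fit_predict(X_scaled)

# Density-based alternative that does not need n_clusters
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)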
You can also consult this chart of my personal opinion which I can't find the link to in the scikit-learn docs...
For judging the result: If you have no ground truth of any kind (which I imagine you don't have if this is exploratory) there is no good measure [yet] (in scikit-learn).
There is one unsupervised measure, the silhouette score, but AFAIK it favours very compact clusters such as those found by k-means.
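For reference, a one-line sketch of the silhouette score, assuming X_scaled and the cluster labels from your k-means run:

from sklearn.metrics import silhouette_score

score = silhouette_score(X_scaled, labels)  # in [-1, 1]; higher means tighter, better-separated clusters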
There are stability measures for clusters which might help, though they are not implemented in sklearn yet.
My best bet would be to find a good way to inspect the data and visualize the clustering.
Have you tried PCA and thought about manifold learning techniques?