Scipy Global Optimization Algorithms: rescaling needed?

Dear community members,
I am considering four global optimization algorithms offered in Scipy, namely differential_evolution, dual_annealing, basinhopping and shgo, to calibrate a model. My parameter space is 13-dimensional and the parameter values lie in very different ranges, as they represent different physical quantities, ranging from the hundreds for one parameter down to about 1e-6 for another. I was wondering whether I need to rescale all the parameters to a common scale (e.g., a linear scaling to [0, 1])? As I understand it, at least the differential_evolution method is more or less affine-invariant, so rescaling should not be required there. Do the other methods perform better if the parameters are pre-scaled? Do any of these methods rescale the parameters internally? Any insights will be helpful.
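For concreteness, here is a minimal sketch of the kind of 0-to-1 rescaling I have in mind (the bounds and the objective below are made-up placeholders, and only three of the 13 parameters are shown):

    import numpy as np
    from scipy.optimize import differential_evolution

    # Made-up physical bounds for three illustrative parameters spanning very different scales.
    phys_bounds = np.array([[50.0, 500.0],    # a parameter in the hundreds
                            [1e-7, 1e-5],     # a parameter around 1e-6
                            [0.1, 10.0]])     # something in between

    def to_physical(u):
        """Map unit-cube variables u in [0, 1] back to the physical parameter ranges."""
        lo, hi = phys_bounds[:, 0], phys_bounds[:, 1]
        return lo + u * (hi - lo)

    def misfit(params):
        """Placeholder calibration objective; the real one would run the model."""
        return float(np.sum((params - phys_bounds.mean(axis=1)) ** 2))

    def misfit_scaled(u):
        return misfit(to_physical(u))

    # Every parameter now lives on [0, 1], so the optimizer sees a uniformly scaled box.
    result = differential_evolution(misfit_scaled, bounds=[(0.0, 1.0)] * len(phys_bounds), seed=0)
    print(to_physical(result.x), result.fun)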
Thanks

Related

Gaussian process regression with scikit-learn

Context:
in Gaussian Process (GP) regression we can use two approaches:
(I) Fit the kernel parameters via Maximum Likelihood (maximize data likelihood) and use the GP defined by these
parameters for prediction.
(II) Bayesian approach: put a parametric prior distribution on the kernel parameters.
The parameters of this prior distribution are called the hyperparameters.
Condition on the data to obtain a posterior distribution for the kernel parameters and now either
(IIa) fit the kernel parameters by maximizing the posterior kernel-parameter likelihood (MAP parameters)
and use the GP defined by the MAP-parameters for prediction, or
(IIb) (the full Bayesian approach): predict using the mixture model that integrates over all the GPs defined by
the admissible kernel parameters, weighted by the posterior distribution of the kernel parameters.
(IIb) is the principal approach advocated in the reference [RW2006] cited in the package.
The point is that hyperparameters exist only in the Bayesian approach and are the parameters of the prior
distribution on kernel parameters.
Therefore I am confused about the use of the term "hyperparameters" in the documentation, e.g.
here
where it is stated that
"Kernels are parameterized by a vector of hyperparameters".
This must be interpreted as a sort of indirect parameterization via conditioning on the data as the hyperparameters
do not directly determine the kernel parameters.
Then an example is given of the exponential kernel and its length-scale parameter.
This is definitely not a hyperparameter as this term is generally used.
No distinction seems to be drawn between kernel-parameters and hyperparameters.
This is confusing and it is now unclear if the package uses the Bayesian approach at all.
For example where do we specify the parametric family of prior distributions on kernel parameters?
Question: does scikit-learn use approach (I) or (II)?
Here is my own tentative answer:
the confusion comes from the fact that a Gaussian Process is often called a "prior on functions", indicating some sort of Bayesianism. Worse still, the process is infinite-dimensional, so restricting it to the finite data dimensions is some sort of "marginalization".
This is also confusing since in general you have marginalization only in the Bayesian approach where you have a joint distribution of data and parameters,
so you often marginalize out one or the other.
The correct view here, however, is the following: the Gaussian Process is the model and the kernel parameters are the model parameters. In scikit-learn there are no hyperparameters, since there is no prior distribution on kernel parameters; the so-called LML (log marginal likelihood) is the ordinary data likelihood given the model parameters, and the parameter fit is ordinary maximum data likelihood. In short, the approach is (I) and not (II).
If you read the scikit-learn documentation on GP regression, you clearly see that the kernel (hyper)parameters are optimized. Take a look, for example, at the description of the argument n_restarts_optimizer: "The number of restarts of the optimizer for finding the kernel’s parameters which maximize the log-marginal likelihood." In the terms of your question, that is approach (I).
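A minimal sketch on toy data (everything below is made up purely for illustration) of what that optimization looks like in code:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel

    # Toy 1-D data.
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 10.0, size=(30, 1))
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)

    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
    # alpha accounts for the observation noise on the diagonal of the kernel matrix.
    gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.1, n_restarts_optimizer=5).fit(X, y)

    # The kernel parameters (amplitude, length-scale) are now the maximum-likelihood values:
    print(gpr.kernel_)                          # fitted kernel with optimized parameters
    print(gpr.log_marginal_likelihood_value_)   # the maximized log marginal likelihood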
I would note two more things though:
In my mind, the fact that they are called "hyperparameters" automatically implies that they are deterministic and can be estimated directly. Otherwise, they are random variables and that is why they can have a distribution. Another way to think of it is: did you define a prior for it? If not, then it is a parameter! If you did, then the prior's hyperparameter(s) may be what needs to be determined.
Note that the GaussianProcessRegressor class "exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, e.g., via Markov chain Monte Carlo." So, technically, it is possible to make it "fully Bayesian" (your approach (II)), but you must provide the inference method yourself.
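For example (a minimal, self-contained sketch; the data and kernel are made up), the method can be evaluated at arbitrary kernel parameters, which is exactly what an external sampler would need:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    # Minimal setup just to obtain a fitted model.
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 10.0, size=(20, 1))
    y = np.sin(X).ravel()
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X, y)

    # log_marginal_likelihood takes the kernel parameters in log-space; add a log-prior
    # to this value and you have a log-posterior that an MCMC sampler could explore.
    theta_ml = gpr.kernel_.theta                                   # log of the fitted length-scale
    print(gpr.log_marginal_likelihood(theta_ml))
    print(gpr.log_marginal_likelihood(theta_ml + np.log(2.0)))     # after doubling the length-scale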

Results of sklearn/statsmodels ordinary least squares under singular covariance matrix

When computing ordinary least squares regression with either sklearn.linear_model.LinearRegression or statsmodels.regression.linear_model.OLS, they don't seem to throw any errors when the covariance matrix is exactly singular. It looks like, under the hood, they use the Moore-Penrose pseudoinverse rather than the ordinary inverse, which would be impossible to compute with a singular covariance matrix.
The question is then twofold:
What is the point of this design? Under what circumstances is it deemed useful to compute OLS regardless of whether the covariance matrix is singular?
What does it output as coefficients then? To my understanding, since the covariance matrix is singular, there are infinitely many solutions (an entire affine set of them), so which one does the pseudoinverse return?
Two related questions and answers:
Differences in Linear Regression in R and Python
Statsmodels with partly identified model
To 1) Under what circumstances is it deemed useful to compute OLS regardless of whether the covariance matrix is singular?
Even though some parameters are not identified and an "arbitrary" unique solution is picked out of the infinitely many possible solutions, some result statistics are not affected by the non-identification; the main ones are estimable linear combinations, prediction and R-squared.
Some linear combinations of parameters are identified even if not all parameters are identified separately. For example, we can still test whether all means in a one-way categorical variable are equal. These are estimable functions even under singularity, and they are the reason statsmodels inherited the pinv behavior from its precursor package. However, statsmodels does not have functions to identify estimable functions given a singular covariance matrix of the parameter estimates.
We get a unique prediction for any values of the explanatory variables, which is still useful as long as the perfect collinearity persists.
Some summary and inferential statistics, like R-squared, are independent of the way the unique parameters are chosen. This is sometimes convenient and is used, for example, in diagnostics and specification tests where the LM test can be computed from R-squared.
To 2) What does it output as coefficients then?
The parameters estimated by the Moore-Penrose inverse can be interpreted as symmetrically penalized or regularized estimates. The Moore-Penrose solution is also obtained as the limit of ridge regression when the penalization weight goes to zero. (I don't remember where I read this.)
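That limit is easy to check numerically; a small sketch with made-up, perfectly collinear data (np.linalg.pinv stands in for the pinv used inside the packages):

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.standard_normal(50)
    x3 = rng.standard_normal(50)
    X = np.column_stack([x1, x1, x3])          # two identical columns -> X'X is singular
    y = 2.0 * x1 + 0.5 * x3 + 0.1 * rng.standard_normal(50)

    beta_pinv = np.linalg.pinv(X) @ y          # minimum-norm solution: the weight 2.0 is
                                               # split evenly over the duplicated columns

    lam = 1e-6                                 # small ridge penalty
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

    print(beta_pinv)
    print(beta_ridge)                          # essentially identical to the pinv solution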
Also, in some cases with singular design, the indeterminacy only affects some parameters. Even though we have to be careful in what we infer about those parameters, other parameters might still be identified and unaffected by the perfectly collinear part.
A software package has essentially 3 options to handle singular cases
raise an exception and refuse to compute anything
drop some variables (the question then is which variables to drop)
switch to penalized solution including generalized inverse
statsmodels picks option 3, mainly because of the symmetric treatment of variables. R and Stata pick option 2 in many models (where, I think, it is difficult to predict which variable is dropped).
One reason for the symmetric treatment is that it makes it easier to compare the same regression across many datasets, which would be more difficult if, under option 2, the dropped variable is not always the same.
That's indeed the case. As you can see here
sklearn.linear_model.LinearRegression is based on scipy.linalg.lstsq or scipy.optimize.nnls; the default lstsq path computes the solution via an SVD-based pseudoinverse of the feature matrix (it does not form the normal equations - for which you would hit the mentioned issue - as that is less efficient). Moreover, observe that each sklearn.linear_model.LinearRegression instance exposes the singular values of the feature matrix in the singular_ attribute and its rank in the rank_ attribute.
A similar argument applies to statsmodels.regression.linear_model.OLS, where the fit() method of class RegressionModel uses the following:
The fit method uses the pseudoinverse of the design/exogenous variables to solve the least squares minimization.
(see here for reference).
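To see both behaviours in one place, here is a small sketch with a made-up, perfectly collinear design (attribute names as documented above):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x1 = rng.standard_normal(100)
    X = np.column_stack([x1, 2.0 * x1])            # second column is an exact multiple of the first
    y = 3.0 * x1 + rng.standard_normal(100)

    lr = LinearRegression().fit(X, y)
    print(lr.rank_, lr.singular_)                  # rank 1; one singular value is (numerically) zero
    print(lr.coef_)                                # one particular solution out of infinitely many

    ols = sm.OLS(y, sm.add_constant(X)).fit()      # no error despite the singular design
    print(ols.params)                              # pinv-based, symmetrically chosen coefficients
    print(ols.rsquared)                            # R-squared and predictions are still well defined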
I noticed the same thing; it seems sklearn and statsmodels are pretty robust, maybe a little too robust, making you wonder how to interpret the results after all. I guess it is still up to the modeler to do due diligence, identify any collinearity between variables, and eliminate unnecessary ones. Funnily, sklearn won't even give you p-values, which to me are the most important measure from these regressions. When you play with the variables, the coefficients change, which is why I pay much more attention to the p-values.

Alternative to scipy.optimize.minimize constrained optimization?

I am interested in finding optimized parameters of a model (by minimizing the mismatch between the model's output and known values). The parameters I am interested in have bounds, and they are also constrained by an inequality of the form 1 - sum(x_par) >= 0, where x_par is a list of some of the parameters out of the total parameter list. I have used scipy.optimize.minimize on this problem with different methods (such as COBYLA and SLSQP), but the fitting performance of this function is quite poor and the error is generally above 50%.
I have noticed that scipy.optimize.curve_fit and scipy.optimize.differential_evolution work very well in terms of fitting the given values, but these functions do not allow constraints on parameters. I am looking for an alternative in python to optimize my problem that allows constraining parameters and can do a better job in fitting the given curve/values than scipy.optimize.minimize.
You might find lmfit useful. This module is a wrapper around many of the scipy.optimize routines (including leastsq, differential_evolution, and most of the scalar minimizers) that replaces all variables with Parameter objects that can be fixed or free, have bounds applied, or be constrained as mathematical expressions of other Parameters, all independent of the method used to solve the minimization problem. There is also a Model class to support many curve-fitting problems, and support for improved analysis of confidence intervals for parameters.
With some care, inequality constraints can be applied, as is discussed briefly at
http://lmfit.github.io/lmfit-py/constraints.html#using-inequality-constraints .
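For a constraint of the form 1 - sum(x_par) >= 0, one common trick is to introduce a non-negative slack parameter and define one of the parameters through an expression. A minimal sketch (the model, data, and parameter names a, b, c are all made up):

    import numpy as np
    import lmfit

    # Made-up data from a toy polynomial model with a + b + c <= 1.
    x = np.linspace(0.0, 1.0, 50)
    y_obs = 0.3 * x + 0.4 * x**2 + 0.2 * x**3 + 0.01 * np.random.default_rng(0).standard_normal(50)

    def residual(params, x, y_obs):
        p = params.valuesdict()
        y_model = p['a'] * x + p['b'] * x**2 + p['c'] * x**3
        return y_model - y_obs

    params = lmfit.Parameters()
    params.add('a', value=0.2, min=0.0, max=1.0)
    params.add('b', value=0.2, min=0.0, max=1.0)
    # Enforce a + b + c <= 1 by making the slack non-negative and defining c via an expression.
    params.add('slack', value=0.3, min=0.0, max=1.0)
    params.add('c', expr='1 - a - b - slack')

    result = lmfit.minimize(residual, params, args=(x, y_obs))
    print(lmfit.fit_report(result))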

See retained variance in scikit-learn manifold learning methods

I have a dataset of images that I would like to run nonlinear dimensionality reduction on. To decide what number of output dimensions to use, I need to be able to find the retained variance (or explained variance; I believe they are similar). Scikit-learn seems to have by far the best selection of manifold learning algorithms, but I can't see any way of getting a retained-variance statistic. Is there a part of the scikit-learn API that I'm missing, or a simple way to calculate the retained variance?
I don't think there is a clean way to derive the "explained variance" of most non-linear dimensionality reduction techniques, in the same way as it is done for PCA.
For PCA, it is trivial: you are simply taking the weight of a principal component in the eigendecomposition (i.e. its eigenvalue) and summing the weights of the ones you use for linear dimensionality reduction.
Of course, if you keep all the eigenvectors, then you will have "explained" 100% of the variance (i.e. perfectly reconstructed the covariance matrix).
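For reference, scikit-learn exposes this directly for PCA; a quick sketch on made-up data:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10))   # made-up correlated data

    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_)             # normalized eigenvalues, one per component
    print(np.cumsum(pca.explained_variance_ratio_))  # retained variance for the first k components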
Now, one could try to define a notion of explained variance in a similar fashion for other techniques, but it might not have the same meaning.
For instance, some dimensionality reduction methods might actively try to push apart more dissimilar points and end up with more variance than what we started with. Or much less if it chooses to cluster some points tightly together.
However, in many non-linear dimensionality reduction techniques, there are other measures that give notions of "goodness-of-fit".
For instance, in scikit-learn, isomap has a reconstruction error, tsne can return its KL-divergence, and MDS can return the reconstruction stress.
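A sketch of where those quantities live in the scikit-learn API (random placeholder data stands in for the image features):

    import numpy as np
    from sklearn.manifold import Isomap, TSNE, MDS

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))    # placeholder for the image feature vectors

    iso = Isomap(n_components=2).fit(X)
    print(iso.reconstruction_error())     # Isomap's reconstruction error (lower is better)

    tsne = TSNE(n_components=2, random_state=0)
    tsne.fit_transform(X)
    print(tsne.kl_divergence_)            # KL divergence of the final t-SNE embedding

    mds = MDS(n_components=2, random_state=0)
    mds.fit(X)
    print(mds.stress_)                    # raw stress of the final MDS configuration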

Semi-supervised Gaussian mixture model clustering in Python

I have images that I am segmenting using a gaussian mixture model from scikit-learn. Some images are labeled, so I have a good bit of prior information that I would like to use. I would like to run a semi-supervised training of a mixture model, by providing some of the cluster assignments ahead of time.
From the Matlab documentation, I can see that Matlab allows initial values to be set. Are there any python libraries, especially scikit-learn approaches that would allow this?
The standard GMM does not work in a semi-supervised fashion. The initial values you mentioned are most likely the initial values of the mean vectors and covariance matrices of the Gaussians, which are then updated by the EM algorithm.
A simple hack would be to group your labeled data by label, individually estimate a mean vector and covariance matrix for each group, and pass these as the initial values to your MATLAB function (scikit-learn does not allow this as far as I'm aware). Hopefully this will position your Gaussians at the "correct" locations. The EM algorithm will then take it from there to adjust these parameters.
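For what it is worth, newer scikit-learn versions do let you pass initial means (and, optionally, weights and precisions) to GaussianMixture, so the hack described above can be sketched roughly like this (all data here is made up):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Made-up stand-ins: X_labeled / y_labeled are the labeled samples, X_all is everything to cluster.
    rng = np.random.default_rng(0)
    X_labeled = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(5.0, 1.0, (50, 3))])
    y_labeled = np.repeat([0, 1], 50)
    X_all = np.vstack([X_labeled, rng.normal(2.5, 2.0, (200, 3))])

    n_components = 2
    means_init = np.array([X_labeled[y_labeled == k].mean(axis=0) for k in range(n_components)])

    # EM starts from the label-derived means but is still free to move the components afterwards.
    gmm = GaussianMixture(n_components=n_components, means_init=means_init, random_state=0).fit(X_all)
    labels = gmm.predict(X_all)
    print(np.bincount(labels))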
The downside of this hack is that it does not guarantee that your true label assignment will be respected, so even if a data point was given a particular cluster label, there is a chance it might be re-assigned to another cluster. Also, noise in your feature vectors or labels could cause your initial Gaussians to cover a much larger region than they are supposed to, wreaking havoc on the EM algorithm. Finally, if you do not have sufficient data points for a particular cluster, your estimated covariance matrices might be singular, breaking this trick altogether.
Unless it is a must for you to use a GMM to cluster your data (e.g., you know for sure that Gaussians model your data well), perhaps you can just try the semi-supervised methods in scikit-learn. These will propagate the labels to your other data points based on feature similarities. However, I doubt this can handle large datasets, as it requires a graph Laplacian matrix built from pairs of samples, unless there is some special implementation trick to handle this in scikit-learn.
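A minimal sketch of that graph-based route (made-up features; the convention is that -1 marks unlabeled samples):

    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 5))       # placeholder feature vectors
    y = np.full(300, -1)                    # -1 means "unlabeled"
    y[:30] = np.arange(30) % 3              # a handful of known labels for three classes

    # The kNN kernel keeps the affinity graph sparse, which helps with larger datasets.
    model = LabelSpreading(kernel='knn', n_neighbors=7).fit(X, y)
    print(model.transduction_[:10])         # propagated labels for the first few samples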
