I am interested in finding optimized parameters of a model (by minimizing the difference between the model's output and known values). The parameters I want to find have bounds, and they are also constrained by an inequality of the form 1 - sum(x_par) >= 0, where x_par is a list of some of the parameters out of the total parameter list. I have used scipy.optimize.minimize on this problem with different methods (such as COBYLA and SLSQP), but the fitting performance of this function is quite poor and the error is generally above 50%.
I have noticed that scipy.optimize.curve_fit and scipy.optimize.differential_evolution work very well in terms of fitting the given values, but these functions do not allow this kind of constraint on the parameters. I am looking for an alternative in Python that allows constraining parameters and does a better job of fitting the given curve/values than scipy.optimize.minimize.
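For reference, my setup looks roughly like the sketch below (the model, data and bounds are placeholders, not my actual problem):

```python
import numpy as np
from scipy.optimize import minimize

# placeholder model: the real model and data go here
def model(x, t):
    return x[0] * np.exp(-x[1] * t) + x[2]

t_data = np.linspace(0, 10, 50)
y_data = model([0.3, 0.5, 0.1], t_data)          # toy "known values"

def objective(x):
    return np.sum((model(x, t_data) - y_data) ** 2)

bounds = [(0, 1), (0, 5), (0, 1)]
# inequality of the form 1 - sum(x_par) >= 0, here applied to
# the first and third parameters as an example
cons = [{"type": "ineq", "fun": lambda x: 1 - (x[0] + x[2])}]

res = minimize(objective, x0=[0.5, 1.0, 0.5], method="SLSQP",
               bounds=bounds, constraints=cons)
print(res.x, res.fun)
```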
You might find lmfit useful. This module is a wrapper around many of the scipy.optimize routines (including leastsq, differential_evolution, and most of the scalar minimizers) that replaces all variables with Parameter objects that can be fixed or free, have bounds applied, or be constrained as mathematical expressions of other Parameters, all independent of the method used to solve the minimization problem. There is also a Model class to support many curve-fitting problems, and support for improved analysis of confidence intervals for parameters.
With some care, inequality constraints can be applied, as is discussed briefly at http://lmfit.github.io/lmfit-py/constraints.html#using-inequality-constraints.
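A minimal sketch of what that can look like, assuming a toy exponential model and an inequality a + c <= 1 expressed through a slack parameter (the documented trick for inequality constraints); all names and numbers here are placeholders:

```python
import numpy as np
from lmfit import Parameters, minimize

# toy model and data standing in for the real problem
def residual(params, t, data):
    a = params["a"].value
    b = params["b"].value
    c = params["c"].value
    return a * np.exp(-b * t) + c - data

t = np.linspace(0, 10, 50)
data = 0.3 * np.exp(-0.5 * t) + 0.1

params = Parameters()
params.add("a", value=0.5, min=0, max=1)
params.add("b", value=1.0, min=0, max=5)
# enforce 1 - (a + c) >= 0 by writing c = 1 - a - slack with slack >= 0,
# so a + c <= 1 holds automatically for any allowed slack value
params.add("slack", value=0.4, min=0, max=1)
params.add("c", expr="1 - a - slack")

out = minimize(residual, params, args=(t, data), method="leastsq")
out.params.pretty_print()
```

The same Parameters object can be reused with other solvers (e.g. method="differential_evolution", which requires finite bounds on all varying parameters), which is the main convenience here.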
Context: in Gaussian Process (GP) regression we can use two approaches:
(I) Fit the kernel parameters via Maximum Likelihood (maximize the data likelihood) and use the GP defined by these parameters for prediction.
(II) Bayesian approach: put a parametric prior distribution on the kernel parameters. The parameters of this prior distribution are called the hyperparameters. Condition on the data to obtain a posterior distribution for the kernel parameters, and now either
(IIa) fit the kernel parameters by maximizing the posterior kernel-parameter likelihood (MAP parameters) and use the GP defined by the MAP parameters for prediction, or
(IIb) (the full Bayesian approach): predict using the mixture model which integrates all the GPs defined by the admissible kernel parameters over the posterior distribution of the kernel parameters.
(IIb) is the principal approach advocated in the reference [RW2006] cited in the package.
The point is that hyperparameters exist only in the Bayesian approach and are the parameters of the prior distribution on kernel parameters.
Therefore I am confused about the use of the term "hyperparameters" in the documentation, e.g. here, where it is stated that "Kernels are parameterized by a vector of hyperparameters".
This must be interpreted as a sort of indirect parameterization via conditioning on the data, as the hyperparameters do not directly determine the kernel parameters.
Then an example is given of the exponential kernel and its length-scale parameter.
This is definitely not a hyperparameter as this term is generally used.
No distinction seems to be drawn between kernel-parameters and hyperparameters.
This is confusing and it is now unclear if the package uses the Bayesian approach at all.
For example where do we specify the parametric family of prior distributions on kernel parameters?
Question: does scikit-learn use approach (I) or (II)?
Here is my own tentative answer:
the confusion comes from the fact that a Gaussian Process is often called a "prior on functions", indicating some sort of Bayesianism. Worse still, the process is infinite-dimensional, so restricting to the finite data dimensions is some sort of "marginalization".
This is also confusing, since in general you have marginalization only in the Bayesian approach, where you have a joint distribution of data and parameters, so you often marginalize out one or the other.
The correct view here, however, is the following: the Gaussian Process is the model, and the kernel parameters are the model parameters; in scikit-learn there are no hyperparameters since there is no prior distribution on kernel parameters; the so-called LML (log marginal likelihood) is the ordinary data likelihood given the model parameters, and the parameter fit is ordinary maximum data likelihood. In short, the approach is (I), not (II).
If you read the scikit-learn documentation on GP regression, you clearly see that the kernel (hyper)parameters are optimized. Take a look, for example, at the description of the argument n_restarts_optimizer: "The number of restarts of the optimizer for finding the kernel’s parameters which maximize the log-marginal likelihood." In the terms of your question, that is approach (I).
I would note two more things though:
In my mind, the fact that they are called "hyperparameters" automatically implies that they are deterministic and can be estimated directly. Otherwise, they are random variables and that is why they can have a distribution. Another way to think of it is: did you define a prior for it? If not, then it is a parameter! If you did, then the prior's hyperparameter(s) may be what needs to be determined.
Note that the GaussianProcessRegressor class "exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, e.g., via Markov chain Monte Carlo." So, technically it is possible to make it "fully Bayesian" (your approach (II)), but you must provide the inference method yourself.
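For concreteness, here is a minimal sketch of approach (I) as scikit-learn exposes it (toy 1-D data; the kernel choice is just an example): the kernel parameters are point-estimated by maximizing the log marginal likelihood, and the log_marginal_likelihood method mentioned above is available for external use.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# toy 1-D data
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, 20).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(20)

# the kernel parameters (constant value, length scale) are fitted by
# maximizing the log marginal likelihood (approach (I) above)
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.01,
                               n_restarts_optimizer=5).fit(X, y)

print(gpr.kernel_)                                       # fitted kernel parameters
print(gpr.log_marginal_likelihood(gpr.kernel_.theta))    # LML at the optimum
```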
When computing ordinary least squares regression using either sklearn.linear_model.LinearRegression or statsmodels.regression.linear_model.OLS, neither seems to throw any error when the covariance matrix is exactly singular. It looks like under the hood they use the Moore-Penrose pseudoinverse rather than the usual inverse, which would be impossible with a singular covariance matrix.
The question is then twofold:
What is the point of this design? Under what circumstances is it deemed useful to compute OLS regardless of whether the covariance matrix is singular?
What does it output as coefficients then? To my understanding, since the covariance matrix is singular there are infinitely many least-squares solutions, so which one does the pseudoinverse pick?
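A minimal example of what I mean, with a deliberately rank-deficient design (one column duplicated; numbers are made up):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
X = np.column_stack([x, x])              # two identical columns: X'X is singular
y = 3 * x + 0.1 * rng.standard_normal(50)

# neither call raises; both fall back to a pseudoinverse-based solution
sk = LinearRegression(fit_intercept=False).fit(X, y)
ols = sm.OLS(y, X).fit()

print(sk.coef_)      # roughly [1.5, 1.5]: the effect is split between the columns
print(ols.params)    # essentially the same minimum-norm solution
```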
Two related questions and answers:
Differences in Linear Regression in R and Python
Statsmodels with partly identified model
To 1) Under what circumstances is it deemed useful to compute OLS regardless of whether the covariance matrix is singular?
Even though some parameters are not identified and are picked as an "arbitrary" unique solution out of the infinitely many possible solutions, some result statistics are not affected by the non-identification; the main ones are estimable linear combinations, prediction and R-squared.
Some linear combinations of parameters are identified even if not all parameters are identified separately. For example, we can still test whether all means in a one-way categorical variable are equal. These are estimable functions even under singularity, and that is the reason statsmodels inherited the pinv behavior from its precursor package. However, statsmodels does not have functions to identify estimable functions from a singular covariance matrix of the parameter estimate.
We get a unique prediction for any values of the explanatory variables, which is still useful if the perfect collinearity persists.
Some summary and inferential statistics like R-squared are independent of the way unique parameters are chosen. This is sometimes convenient and used, for example, in diagnostics and specification tests where the LM test can be computed from the R-squared.
To 2) What does it output as coefficients then?
The parameters estimated by the Moore-Penrose inverse can be interpreted as symmetrically penalized or regularized estimates. The Moore-Penrose solution is also obtained as the limit of ridge regression when the penalization weight goes to zero. (I don't remember where I read this.)
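A quick numeric check of that limit claim, on a made-up design with two identical columns:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
X = np.column_stack([x, x])                  # singular design
y = 3 * x + 0.1 * rng.standard_normal(50)

beta_pinv = np.linalg.pinv(X) @ y            # Moore-Penrose (minimum-norm) solution
beta_ridge = Ridge(alpha=1e-8, fit_intercept=False).fit(X, y).coef_

print(beta_pinv)     # roughly [1.5, 1.5]
print(beta_ridge)    # essentially identical as the penalty goes to zero
```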
Also, in some cases with singular design, the indeterminacy only affects some parameters. Even though we have to be careful in what we infer about those parameters, other parameters might still be identified and unaffected by the perfectly collinear part.
A software package has essentially three options to handle singular cases:
raise an exception and refuse to compute anything
drop some variables (the question is which variables to drop)
switch to penalized solution including generalized inverse
statsmodels picks option 3, mainly because of the symmetric treatment of variables. R and Stata pick option 2 in many models (where I think it is difficult to predict which variable is dropped).
One reason for the symmetric treatment is that it makes it easier to compare the same regression across many datasets, which is more difficult if, under option 2, it is not always the same variable that gets dropped.
That's indeed the case. As you can see here, sklearn.linear_model.LinearRegression is based on scipy.linalg.lstsq or scipy.optimize.nnls, which in turn compute the pseudoinverse of the feature matrix via its SVD (they do not exploit the normal equations, for which you would have the mentioned issue, as that is less efficient). Moreover, observe that each sklearn.linear_model.LinearRegression instance returns the singular values of the feature matrix in the singular_ attribute and its rank in the rank_ attribute.
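For example, with a toy rank-deficient design those attributes make the collinearity visible right after fitting:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.standard_normal(30)
X = np.column_stack([x, 2 * x, rng.standard_normal(30)])  # first two columns collinear
y = x + X[:, 2]

reg = LinearRegression(fit_intercept=False).fit(X, y)
print(reg.rank_)       # 2, not 3: the rank deficiency is detected
print(reg.singular_)   # singular values of the feature matrix
```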
A similar argument applies to statsmodels.regression.linear_model.OLS, where the fit() method of class RegressionModel uses the following:
"The fit method uses the pseudoinverse of the design/exogenous variables to solve the least squares minimization."
(see here for reference).
I noticed the same thing; it seems sklearn and statsmodels are pretty robust, maybe a little too robust, making you wonder how to interpret the results after all. I guess it is still up to the modeler to do due diligence to identify any collinearity between variables and eliminate unnecessary ones. Funnily enough, sklearn won't even give you p-values, which to me are the most important measure out of these regressions. When you play with the variables, the coefficients change, which is why I pay much more attention to p-values.
Dear community members,
I am considering four global optimization algorithms offered in SciPy, namely differential_evolution, dual_annealing, basinhopping and shgo, to calibrate a model. My parameter space is 13-dimensional, and the parameter values lie in very different ranges, as they represent different physical quantities ranging from hundreds for one parameter to 1e-6 for another. I was wondering whether I need to scale all the parameters to the same scale (e.g. linearly to [0, 1]). As per my understanding, at least the differential_evolution method is more or less affine-invariant, so scaling should not be required. Do the other methods perform better if parameters are pre-scaled? Do any of these methods scale parameters internally? Any insights will be helpful.
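To make the question concrete, this is roughly what I mean by pre-scaling (the bounds and objective are placeholders): let the optimizer work on [0, 1] for every parameter and map back to physical units inside the objective.

```python
import numpy as np
from scipy.optimize import differential_evolution

# placeholder physical bounds spanning very different orders of magnitude
lower = np.array([1e-6, 1.0, 100.0])
upper = np.array([1e-3, 10.0, 1000.0])

def model_error(params_physical):
    # placeholder objective; the real model calibration goes here
    return np.sum((params_physical - (lower + upper) / 2) ** 2)

def scaled_objective(u):
    # u lives in [0, 1]^n; map it back to the physical ranges
    return model_error(lower + u * (upper - lower))

bounds_unit = [(0, 1)] * len(lower)
result = differential_evolution(scaled_objective, bounds_unit, seed=0)
print(lower + result.x * (upper - lower))    # best parameters in physical units
```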
Thanks
I am not too convinced by the parameter optimization classes sklearn provides; in fact, GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) just loops over the parameters I pass in via param_grid, and RandomizedSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) seems like a we-gave-up approach to me.
One of my approaches was to reuse the idea of adaptive integration of a function (from numerical mathematics): basically, reduce the search space of a parameter in every iteration until a certain error threshold is reached. The biggest advantage of this method is that the error is also reduced in each iteration.
The problem is that certain parameter values may never be visited if you consider the different score values (precision, ROC AUC, etc.) as functions. I still got good results in one case, and not so good ones in other cases.
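Roughly, the idea looks like the sketch below (single numeric parameter, toy score function; a real run would wrap a cross-validated score):

```python
import numpy as np

def adaptive_search(score, low, high, n_points=5, tol=1e-3, max_iter=20):
    """Evaluate the score on a coarse grid and repeatedly shrink the
    search interval around the best point until it is narrow enough."""
    for _ in range(max_iter):
        grid = np.linspace(low, high, n_points)
        scores = [score(v) for v in grid]
        best = grid[int(np.argmax(scores))]
        width = (high - low) / 2
        low, high = max(low, best - width / 2), min(high, best + width / 2)
        if high - low < tol:
            break
    return best

# toy score with a single optimum; this is also where the weakness shows:
# values outside the shrunken interval are never visited again
print(adaptive_search(lambda v: -(v - 3.7) ** 2, 0.0, 10.0))
```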
What would be a good mathematical approach to optimize values more efficiently than GridSearch and RandomizedSearch?
I have some functional, such as S[f] = \int_\Omega f^2(x) dx. If you're familiar with physics, it's the action. This object takes in a function defined on a certain domain \Omega and gives you a number. The math jargon for this is functional.
Now I need to minimize this thing with respect to f. I know SciPy has an optimize package that allows one to minimize multivariable functions, but I am curious whether there is a better way, considering that if I used it I would be minimizing over ~10,000 variables (because the functions are essentially just lists of 10,000 numbers).
Do I have any other options?
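For reference, the brute-force route I have in mind looks roughly like this: discretize f on a grid, approximate the integral by a Riemann sum, and hand the vector of values to scipy.optimize.minimize with an analytic gradient (for this particular S the minimizer is trivially f = 0, so this is only meant to show the scale of the problem):

```python
import numpy as np
from scipy.optimize import minimize

# discretize the domain Omega = [0, 1] on N grid points
N = 10_000
x = np.linspace(0.0, 1.0, N)
dx = x[1] - x[0]

def action(f):
    # S[f] = \int_Omega f^2(x) dx, approximated by a Riemann sum
    return np.sum(f ** 2) * dx

def action_grad(f):
    # analytic gradient keeps this tractable at ~10,000 variables
    return 2.0 * f * dx

f0 = np.sin(np.pi * x)                      # arbitrary starting guess
res = minimize(action, f0, jac=action_grad, method="L-BFGS-B")
print(res.fun)                              # close to 0, i.e. f is driven to 0
```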
You could use symbolic regression to find the function.
There are several packages available:
deap
glyph
gplearn
monkeys
Here is a good paper on symbolic regression by Schmidt and Lipson.
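As a rough illustration of the workflow with gplearn (one of the packages above): the data here is synthetic, and casting your functional-minimization problem into a fitting problem of this form is a separate step.

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor

# synthetic data: y = x0**2 - x1 + 0.5
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 2))
y = X[:, 0] ** 2 - X[:, 1] + 0.5

est = SymbolicRegressor(population_size=500, generations=10,
                        function_set=("add", "sub", "mul"),
                        random_state=0)
est.fit(X, y)
print(est._program)    # the evolved symbolic expression
```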
Although it is designed more for doing neural network stuff, TensorFlow sounds like it would work for you. It has the ability to differentiate vector equations and also to optimize them using gradient descent.
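Along those lines, here is a minimal sketch of what that could look like: discretize f, approximate S[f] by a Riemann sum, and take plain gradient-descent steps on the vector of values (purely illustrative; step size and iteration count are made up):

```python
import numpy as np
import tensorflow as tf

# discretize f on N points of Omega = [0, 1]
N = 10_000
dx = 1.0 / (N - 1)
f = tf.Variable(np.sin(np.linspace(0.0, np.pi, N)), dtype=tf.float64)

step_size = 0.1 / dx          # compensate for the dx factor in the Riemann sum
for _ in range(200):
    with tf.GradientTape() as tape:
        action = tf.reduce_sum(f ** 2) * dx   # S[f] = \int f^2 dx, discretized
    grad = tape.gradient(action, f)
    f.assign_sub(step_size * grad)            # plain gradient-descent step

print(float(action))          # approaches the trivial minimum S = 0 (f = 0)
```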