Scipy.Optimize.Minimize inefficient? Double calls to cost/gradient function

Scipy.Optimize.Minimize inefficient? Double calls to cost/gradient function - python

I'm relatively new to using SciPy; I'm currently using it to minimize a cost function for a multi-layer-perceptron model. I can't use scikit-learn because I need to have the ability to set the coefficients (they are read-only in the MLPClassifer) and add random permutations and noise to any and all parameters. I haven't finished the implementation quite yet, but I am confused about the parameters required for the minimize function.
For example, I have a function that I have written to calculate the "cost" (energy to minimize) of the function, and it calculates the gradient at the same time. That's nothing special as it's common practice. However, when calling scipy.optimize.minimize, it asks for two different functions: one that returns the scalar that is to be minimized (i.e., the cost in my case) and one that calculates the gradient of the current state. Example:
j,grad = myCostFunction(X,y)
Unless I am mistaken, it seems that it would need to call my function twice, with each call needing to be specified to return either the cost or the gradient, like so:
opt = scipy.optimize.minimize(fun=myJFunction, jac=myGradFunction, args = args,...)
Isn't this a waste of computation time? My data set will be > 1 million samples and 10ish features, so reducing redundant computation would be preferred since I will be training and retraining this thing tens of thousands of times for my project.
Another point of confusion is with the args input. Are the arguments passed like this:
# This is what I expect happens
myJFunction(x0,*args)
myGradFunction(x0,*args)
or like this:
# This is what I wish it did
myJFunction(x0,arg0,arg1,arg2)
myGradFunction(x0,arg3,arg4,arg5)
Thanks in advance!

After doing some experimentation and searching, I found the answers to my own questions.
While I can't say for sure about the scipy.optimize.minimize function, using other optimization functions (for example, scipy.optimize.fmin_tnc) explicitly states that the callable function func can either (1) return both the energy and the gradient, (2) return the energy and specify the gradient function for that parameter fprime (slower), or (3) return only the energy and have the function estimate the gradient through perturbation (much slower).
See the docs here: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.fmin_tnc.html
I was very happy to see that I could use only one function to return both parameters. I assume it is the same case for the minimize function, but I have not tested it to be sure (See Edit 1)
As for my second question, if you specify two different functions, the *args parameters are passed to both functions the same; you cannot specify individual parameters for both.
EDIT 1: Reading through the minimize documentation more, if the parameter jac is set to True, then the optimizer assumes that the func returns energy and gradient. Reading the docs thoroughly is helpful, it seems.

Related

Optimization method selection & dealing with convergence and variability

The Problem
I am looking to tackle a minimization problem using scipy's optimization utilities.
Specifically, I've been using this function:
result = spo.minimize(s21_mag, goto_start, options={"disp": True}, bounds=bnds)
My s21_mag function takes a couple of seconds to return an output (due to physically moving motors). It consists of 3 parameters (3 moving parts), with no constraints - just three bounds (identical for all 3 parameters):
bnds = ((0,45000),(0,45000),(0,45000))
The limit on the amount of iterations is not very constraint (1000 is probably a good enough upper limit for me), but I expect the optimizer to try many configurations in this set of iterations to identify an optimal value. So far, some methods I've tried just seem to converge somewhere with meaningless progress.
Here's progress beyond the 50th iteration (full code here) - the goal is the maximization of S21 at a specific frequency (purple vertical line):
This is with no method passed tospo.minimize(), so it uses the default (and it looks like it applies the exact same movement to each motor).
Questions
Although scipy's minimization function offers a wide variety of optimization methods/algorithms, how could I (as a beginner in optimization math) select the one that would work best for my application? What kind of aspects of my problem should I take into account to jump to such conclusions? Assume I have no idea about the initial value of each parameter and want the optimizer to figure that out (I usually just set it to the midpoint, i.e. initial: x1=x2=x3=22500).
The same set of parameters as an input to my s21_mag function could yield different results at different times the function is called.
This happens for two reasons:
(a) The parameter step of the optimizer can get extremely small (particularly as the number of iterations increase and the convergence is approached), whereas the motor expects a minimum value of ~100 to make a step.
Is there a way to somehow set a minimum step? Otherwise, it tries to step from e.g. 1234.0 to 1234.0001 and eventually gets "stuck" between trying tiny changes.
(b) The output of the function goes through a measuring instrument, which exhibits a little bit of noise (e.g. one measurement may yield 5.42 dB, while another measurement (with the exact same parameters) may yield 5.43 dB).
Is there a way to deal with these kinds of small variabilities/errors to avoid confusions for the optimizer?

Why are finite differences of static input variables used to calculate the Jacobian? (OpenMDAO 2.4)

I have been using the SLSQP algorithm to run some MDO problems with ExplicitComponents only. Each component has a runtime of around 10 seconds and 60-100 input variables. Most of the input variables are static input variables that will remain constant during the entire optimization. The static input variables originate from an IndepVarComp. The ExplicitComponents are black boxes, so no information is available on the partials.
I noticed that when the Jacobian is calculated in the compute_totals(), the components are linearized with respect to all their input values. In the compute_approximations() a finite difference is calculated over all the input values, including the static input values. So, my question is: why is a finite difference calculation performed over these static input variables? As the values remain constant, I’m not sure why this information would be useful?
Furthermore, if I understand it correctly, the components are linearized to get the sub-Jacobians, which are then used to calculate the total Jacobian. However, is it possible to directly calculate a finite-difference over the entire group instead of linearizing each component? With the runtimes of my components and amount of input variables, it will take a long time to perform the linearization of each component. However, the optimization problem has only 3 design variables. So, if I could perform three finite difference calculations over the entire MDA to calculate the total Jacobian, the total runtime will decrease significantly.

To answer your questions in reverse order:
1) Can you FD over the entire model instead of each individual component? Yes!
You can set up FD over any group in your model, including the top-level group. Then the FD is taken across that group rather than across each component in it.
We call that computing a semi-total derivative, because in general you can select a sub-group in your model, in which case the FD is approximating a total-derivative across that group but that total-derivative is still effectively a partial-derivative for the overall model. hence semi-total derivative.
2) Why is a finite difference calculation performed over these static input variables?
In theory, you're correct that you don't really need partial derivatives of the inputs that can't change. As of OpenMDAO 2.4, we don't handle that situation automatically though, and we don't have plans to add that in the near future. However, the framework is only taking FD across the partials you tell it to. It sounds like you are declaring your partials like this:
self.declare_partials(of=['*'], wrt=['*'], method='fd')
So you're specifically asking the framework to compute all those partials. Instead, you could specify in the wrt argument only the inputs you know are actually changing. Of course, this is mathematically incorrect because there is a derivative wrt to the static-inputs. If someone later on connects something to those inputs and tries and optimization, they would get a wrong answer. But as long as your careful, you can specifically ask for only the partials you wanted from any component and simple leave the non-changing inputs as effectively 0.

Custom minimizer based on Levenberg-Marquardt in scipy.optimize.basinhopping

I'm having troubles to minimize a complex non linear function in python. This function is actually the chiSquare of a fitting model used to fit experimental data. In order to get the global minimum, I'm using the basinhopping function in scipy. This function is a wrapper of the minimize() function that adds some perturbation to look for different local minima. Right now my problem is that it has troubles to find the local minima.
There are a bunch of solvers that can be used in minimize(), and since I'm using bounds I chose between 'L-BFGS-B', 'SLSQP' and 'TNC'. None of them really find local minima. Is there a method based on the popular Levenberg-Marquardt algorithm that can be use to minimize? Maybe this does not make sense otherwise it would be already implemented, but I can't understand why.
My original idea was actually to use the leastsqbound function(https://pypi.python.org/pypi/leastsqbound) that I know is very good at providing accurate covariance matrix despide the bounds, and include it in a larger algorithm that would look for global minima (like the basinhopping function). Do you know if something like this already exist?
Thanks a lot for you advices!

Scipy has a Levenberg-Marquardt implementation: scipy.optimize.leastsq. It does not have the right return type to use with minimize (and therefore basin_hopping). However, it appears this could be remedied fairly straightforwardly.
Though I have not run it, this should do the trick:
def leastsq_for_minimize( *args, **kwargs ):
results = leastsq( *args, **kwargs )
optimize_results = scipy.optimize.OptimizeResult()
# Some code here to correctly copy results to optimize results
return optimize_results
scipy.optimize.basinhopping(
# your arguments here
minimizer_kwargs=dict(method=leastsq_for_minimize),
)
minimize documentation
basinhopping documentation:
OptimizeResult documentation

How to disable the local minimization process in scipy.optimize.basinhopping?

I am using scipy.optimize.basinhopping for finding the minima of a scalar function. I wonder whether it is possible to disable the local minimization part of scipy.optimize.basinhopping? As we can see from the output message below, minimization_failures and nit are nearly the same, indicating that the local minimization part may be useless for the global optimization process of basinhopping --- reason why I would like to disable the local minimization part, for the sake of efficiency.

You can avoid running the minimizer by using a custom minimizer that does nothing.
See the discussion on "Custom minimizers" in the documentation of minimize():
**Custom minimizers**
It may be useful to pass a custom minimization method, for example
when using a frontend to this method such as `scipy.optimize.basinhopping`
or a different library. You can simply pass a callable as the ``method``
parameter.
The callable is called as ``method(fun, x0, args, **kwargs, **options)``
where ``kwargs`` corresponds to any other parameters passed to `minimize`
(such as `callback`, `hess`, etc.), except the `options` dict, which has
its contents also passed as `method` parameters pair by pair. Also, if
`jac` has been passed as a bool type, `jac` and `fun` are mangled so that
`fun` returns just the function values and `jac` is converted to a function
returning the Jacobian. The method shall return an ``OptimizeResult``
object.
The provided `method` callable must be able to accept (and possibly ignore)
arbitrary parameters; the set of parameters accepted by `minimize` may
expand in future versions and then these parameters will be passed to
the method. You can find an example in the scipy.optimize tutorial.
Basically, you need to write a custom function that returns an OptimizeResult and pass it to basinhopping via the method part of minimizer_kwargs, for example
from scipy.optimize import OptimizeResult
def noop_min(fun, x0, args, **options):
return OptimizeResult(x=x0, fun=fun(x0), success=True, nfev=1)
...
sol = basinhopping(..., minimizer_kwargs=dict(method=noop_min))
Note: I don't know how skipping local minimization affects the convergence properties of the basinhopping algorithm.

You can use minimizer_kwargs to specify to minimize() what options your prefer to the local minimization step. See the dedicated part of the docs.
It is then up to what type of solver you ask minimize for. You can try setting a larger tol to make the local minimization step terminate earlier.
EDIT, in reply to the comment "What if I want to disable the local minimization part completely?"
The basinhopping algorithm from the docs works like:
The algorithm is iterative with each cycle composed of the following
features
random perturbation of the coordinates
local minimization accept or
reject the new coordinates based on the minimized function value
If the above is accurate there is no way to skip the local minimization step entirely, because its output is required by the algorithm to proceed further, i.e. keep or discard the new coordinate. However, I am not an expert of this algorithm.

Parallel many dimensional optimization

I am building a script that generates input data [parameters] for another program to calculate. I would like to optimize the resulting data. Previously I have been using the numpy powell optimization. The psuedo code looks something like this.
def value(param):
run_program(param)
#Parse output
return value
scipy.optimize.fmin_powell(value,param)
This works great; however, it is incredibly slow as each iteration of the program can take days to run. What I would like to do is coarse grain parallelize this. So instead of running a single iteration at a time it would run (number of parameters)*2 at a time. For example:
Initial guess: param=[1,2,3,4,5]
#Modify guess by plus minus another matrix that is changeable at each iteration
jump=[1,1,1,1,1]
#Modify each variable plus/minus jump.
for num,a in enumerate(param):
new_param1=param[:]
new_param1[num]=new_param1[num]+jump[num]
run_program(new_param1)
new_param2=param[:]
new_param2[num]=new_param2[num]-jump[num]
run_program(new_param2)
#Wait until all programs are complete -> Parse Output
Output=[[value,param],...]
#Create new guess
#Repeat
Number of variable can range from 3-12 so something such as this could potentially speed up the code from taking a year down to a week. All variables are dependent on each other and I am only looking for local minima from the initial guess. I have started an implementation using hessian matrices; however, that is quite involved. Is there anything out there that either does this, is there a simpler way, or any suggestions to get started?
So the primary question is the following:
Is there an algorithm that takes a starting guess, generates multiple guesses, then uses those multiple guesses to create a new guess, and repeats until a threshold is found. Only analytic derivatives are available. What is a good way of going about this, is there something built already that does this, is there other options?
Thank you for your time.
As a small update I do have this working by calculating simple parabolas through the three points of each dimension and then using the minima as the next guess. This seems to work decently, but is not optimal. I am still looking for additional options.
Current best implementation is parallelizing the inner loop of powell's method.
Thank you everyone for your comments. Unfortunately it looks like there is simply not a concise answer to this particular problem. If I get around to implementing something that does this I will paste it here; however, as the project is not particularly important or the need of results pressing I will likely be content letting it take up a node for awhile.

I had the same problem while I was in the university, we had a fortran algorithm to calculate the efficiency of an engine based on a group of variables. At the time we use modeFRONTIER and if I recall correctly, none of the algorithms were able to generate multiple guesses.
The normal approach would be to have a DOE and there where some algorithms to generate the DOE to best fit your problem. After that we would run the single DOE entries parallely and an algorithm would "watch" the development of the optimizations showing the current best design.
Side note: If you don't have a cluster and needs more computing power HTCondor may help you.

Are derivatives of your goal function available? If yes, you can use gradient descent (old, slow but reliable) or conjugate gradient. If not, you can approximate the derivatives using finite differences and still use these methods. I think in general, if using finite difference approximations to the derivatives, you are much better off using conjugate gradients rather than Newton's method.
A more modern method is SPSA which is a stochastic method and doesn't require derivatives. SPSA requires much fewer evaluations of the goal function for the same rate of convergence than the finite difference approximation to conjugate gradients, for somewhat well-behaved problems.

There are two ways of estimating gradients, one easily parallelizable, one not:
around a single point, e.g. (f( x + h directioni ) - f(x)) / h;
this is easily parallelizable up to Ndim
"walking" gradient: walk from x0 in direction e0 to x1,
then from x1 in direction e1 to x2 ...;
this is sequential.
Minimizers that use gradients are highly developed, powerful, converge quadratically (on smooth enough functions).
The user-supplied gradient function
can of course be a parallel-gradient-estimator.
A few minimizers use "walking" gradients, among them Powell's method,
see Numerical Recipes p. 509.
So I'm confused: how do you parallelize its inner loop ?
I'd suggest scipy fmin_tnc
with a parallel-gradient-estimator, maybe using central, not one-sided, differences.
(Fwiw,
this
compares some of the scipy no-derivative optimizers on two 10-d functions; ymmv.)

I think what you want to do is use the threading capabilities built-in python.
Provided you your working function has more or less the same run-time whatever the params, it would be efficient.
Create 8 threads in a pool, run 8 instances of your function, get 8 result, run your optimisation algo to change the params with 8 results, repeat.... profit ?

If I haven't gotten wrong what you are asking, you are trying to minimize your function one parameter at the time.
you can obtain it by creating a set of function of a single argument, where for each function you freeze all the arguments except one.
Then you go on a loop optimizing each variable and updating the partial solution.
This method can speed up by a great deal function of many parameters where the energy landscape is not too complex (the dependency between the parameters is not too strong).
given a function
energy(*args) -> value
you create the guess and the function:
guess = [1,1,1,1]
funcs = [ lambda x,i=i: energy( guess[:i]+[x]+guess[i+1:] ) for i in range(len(guess)) ]
than you put them in a while cycle for the optimization
while convergence_condition:
for func in funcs:
optimize fot func
update the guess
check for convergence
This is a very simple yet effective method of simplify your minimization task. I can't really recall how this method is called, but A close look to the wikipedia entry on minimization should do the trick.

You could do parallel at two parts: 1) parallel the calculation of single iteration or 2) parallel start N initial guessing.
On 2) you need a job controller to control the N initial guess discovery threads.
Please add an extra output on your program: "lower bound" that indicates the output values of current input parameter's decents wont lower than this lower bound.
The initial N guessing thread can compete with each other; if any one thread's lower bound is higher than existing thread's current value, then this thread can be dropped by your job controller.

Parallelizing local optimizers is intrinsically limited: they start from a single initial point and try to work downhill, so later points depend on the values of previous evaluations. Nevertheless there are some avenues where a modest amount of parallelization can be added.
As another answer points out, if you need to evaluate your derivative using a finite-difference method, preferably with an adaptive step size, this may require many function evaluations, but the derivative with respect to each variable may be independent; you could maybe get a speedup by a factor of twice the number of dimensions of your problem. If you've got more processors than you know what to do with, you can use higher-order-accurate gradient formulae that require more (parallel) evaluations.
Some algorithms, at certain stages, use finite differences to estimate the Hessian matrix; this requires about half the square of the number of dimensions of your matrix, and all can be done in parallel.
Some algorithms may also be able to use more parallelism at a modest algorithmic cost. For example, quasi-Newton methods try to build an approximation of the Hessian matrix, often updating this by evaluating a gradient. They then take a step towards the minimum and evaluate a new gradient to update the Hessian. If you've got enough processors so that evaluating a Hessian is as fast as evaluating the function once, you could probably improve these by evaluating the Hessian at every step.
As far as implementations go, I'm afraid you're somewhat out of luck. There are a number of clever and/or well-tested implementations out there, but they're all, as far as I know, single-threaded. Your best bet is to use an algorithm that requires a gradient and compute your own in parallel. It's not that hard to write an adaptive one that runs in parallel and chooses sensible step sizes for its numerical derivatives.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.