Tolerance for termination in Nelder-Mead optimization

Tolerance for termination in Nelder-Mead optimization - python

I am trying to optimize a certain function using the Nelder-Mead method and I need help understanding some of the arguments. I am fairly new to the world of numerical optimizations so, please, forgive my ignorance of what might be obvious to more experienced users. I note that I already looked at minimize(method=’Nelder-Mead’) and at scipy.optimize.minimize but it was not of as much help as I would have hoped. I am trying to optimize function $f$ under two conditions: (i) I want the optimization to stop once the $f$ value is below a certain value and (ii) once the argument is around the optimal value, I don't want the optimizer to increase the step again (i.e., once it gets below the threshold value and stays below for a couple of iterations, I would like the optimization to terminate). Here is the optimization code I use:
scipy.optimize.minimize(fun=f, x0=init_pos, method="nelder-mead",
options={"initial_simplex": simplex,
"disp": True, "maxiter" : 25,
"fatol": 0.50, "adaptive": True})
where f is my function (f : RxR -> [0,sqrt(2))). I understand that x0=init_pos are initial values for f, "initial_simplex": simplex is the initial triangle (in my 2D case), "maxiter" : 25 means that the optimizer will run up to 25 iterations before terminating.
Here are things I do not understand/I am not sure about:
The website 1 says "fatol: Absolute error in func(xopt) between iterations that is acceptable for convergence." Since the optimal value for my function is f(xopt)=0, does "fatol": 0.50 mean that the optimization will terminate once the f(x) will have the value of 0.5 or less? If not, how do I modify the condition for termination (in my case, how do I assure that it does stop once f(x)<=0.5)? I am ok if the optimizer runs a few more iterations around the region giving <0.5 but right now it tends to jump out of the near optimal region in a completely random way and I would like to be able to prevent it (if possible).
Likewise, as far as I understand, "xatol: Absolute error in xopt between iterations that is acceptable for convergence." means that the optimization will terminate once the difference between the optimal and the present arguments is at most xatol. Since in principle I do not know a-priori what the xopt is, does it mean in practice that once |x_n - x_(n+1)|, the optimizer will stop? If no, is there a way of adding a constraint to stop the function once it is near the optimal point?
I will appreciate if someone can either answer or give me a better reference than the SciPy documentation.

this condition stops the algorithm as soon as |f(x_n) - f(x_(n+1))| < fatol
same : this condition stops the algorithm as soon as |x_n - x_(n+1)| < xatol

Related

Optimization method selection & dealing with convergence and variability

The Problem
I am looking to tackle a minimization problem using scipy's optimization utilities.
Specifically, I've been using this function:
result = spo.minimize(s21_mag, goto_start, options={"disp": True}, bounds=bnds)
My s21_mag function takes a couple of seconds to return an output (due to physically moving motors). It consists of 3 parameters (3 moving parts), with no constraints - just three bounds (identical for all 3 parameters):
bnds = ((0,45000),(0,45000),(0,45000))
The limit on the amount of iterations is not very constraint (1000 is probably a good enough upper limit for me), but I expect the optimizer to try many configurations in this set of iterations to identify an optimal value. So far, some methods I've tried just seem to converge somewhere with meaningless progress.
Here's progress beyond the 50th iteration (full code here) - the goal is the maximization of S21 at a specific frequency (purple vertical line):
This is with no method passed tospo.minimize(), so it uses the default (and it looks like it applies the exact same movement to each motor).
Questions
Although scipy's minimization function offers a wide variety of optimization methods/algorithms, how could I (as a beginner in optimization math) select the one that would work best for my application? What kind of aspects of my problem should I take into account to jump to such conclusions? Assume I have no idea about the initial value of each parameter and want the optimizer to figure that out (I usually just set it to the midpoint, i.e. initial: x1=x2=x3=22500).
The same set of parameters as an input to my s21_mag function could yield different results at different times the function is called.
This happens for two reasons:
(a) The parameter step of the optimizer can get extremely small (particularly as the number of iterations increase and the convergence is approached), whereas the motor expects a minimum value of ~100 to make a step.
Is there a way to somehow set a minimum step? Otherwise, it tries to step from e.g. 1234.0 to 1234.0001 and eventually gets "stuck" between trying tiny changes.
(b) The output of the function goes through a measuring instrument, which exhibits a little bit of noise (e.g. one measurement may yield 5.42 dB, while another measurement (with the exact same parameters) may yield 5.43 dB).
Is there a way to deal with these kinds of small variabilities/errors to avoid confusions for the optimizer?

Linear programming (optimization) with Pulp

I would like to ask you something regarding a Linear program for optimization. I already successfully setup the model. However, I have problems in setting up a metaheuristic to reduce the computation time. The basic optimization model can be seen here:
In the metaheuristic algorithm there is an while loop with a condition as follows:
while $ \sum_{i=1}^I b_i y_i \leq \sum_{k=1}^K q_k $ do
I tried to realize this condition with the following code:
while lpSum(b[i]*y[i] for i in I)<=lpSum(q[k] for k in K):
If calculate the two sums separately I get the right results for both. However, when I put them into this condition, the code runs into an endless loop, even when the condition gets fulfilled and it should break the loop. I guess it has to do with the data type, and that the argument can't be an LpAffineExpression. However, I am really struggling to understand this problem.
I hope you understood my problem and I would really appreciate your ideas and explanations a lot! Please tell me, if you need more information on something specific - sorry for being a beginner.
Thanks a lot in advance and best regards,
Bernhard

lpSums do not have a value, like a regular sum has.
Any Python objects can have be compared to other objects using built-in equations like __eq__. That is how I can say date(2000, 1, 1) < date(2000, 1, 2). However, lpAffineExpressionss (which lpSums are a type of) are meant to be used in constraints. Their contents are variables, which are solved by the LP solver, so they do not yet have any values.
Thus the return value for lpSum(x) <= lpSum(y) is not true or false, like with normal equations, but it's an equation. And an equation is not None, or False, or any other falsey value. What you are saying is equivalent to while <some object>:, which is always true. Hence your infinite loop.
I don't know what "using a metaheuristic to reduce computation time" implies in this context - maybe you run a few iterations of the LP solver and then employ your metaheuristic on the result.
If that is the case, use b[i].value() to get the value the variable b[i] was given in that solution, and be sure to compute the total in a regular sum.

SLSQP yields complete different - vs COBYLA

Why does SLSQP gets stuck around the initial values, while COBYLA moves towards
the right direction ?
Optimization Problem is implemented using OpenMDAO 2.2.X;
3 design variables --> input to external code comp --> that outputs y which is
the objective(y, scaler=-1). There are no constraints.
The plot below shows the behavior of the two optimizer for the same problem. I have tried to change the finite difference setup of the SLSQP but it did not help. The output is Optimization terminated successfully. (Exit mode 0).
The sample driver and wrapper codes are uploaded:
https://gist.github.com/stackoverflow38/0219eda12d4c56ce84c68d201d1f1926

I do not see anything obviously wrong with your problem setup, but without gf_run.py its not possible to run the model you've provided to test it out. So in lieu of that, the best guess I can give you is one of the following options:
1) COBYLA is a gradient free optimizer that has a bit more ability to search over the design space. Perhaps its finding a a different optimum, while SLSQP is getting stuck at a lesser optimum near the starting point. To test this, you can use the result from COBYLA as the initial guess for SLSQP. If SLSQP converges to the same (or close to the same) point as COBYLA then, its likely a local-optimum problem.
2) SLSQP uses gradients, which you are approximating using central difference. Those derivative approximations could be poor, even using the 2nd order central differencing. Its not clear if the underlying code has some kind of an implicit solver in it (like a newton solver or a while-loop convergence). If it does have some kind of internal solver, then you need to make sure the tolerance is set pretty tight--- at least two order of magnitude lower than your FD step size would be preferable. Even then, it simply may not be possible to get a quality FD approximation around an code with a solver in it. You could also try changing the FD step size a bit.

I think the reason was that FD steps were too small that the gradients were inaccurate. I have used so far steps of 1e-3 to 1e-6. And now I use 1, the optimizer did not get stuck with step of 1 for all design variables. I am guessing that the difference in the output of the external code was very small with small FD steps (1e-3) such that the optimizer did not calculate the gradient accurately.

Baum Welch (EM Algorithm) likelihood (P(X)) is not monotonically converging

So I am sort of an amateur when comes to machine learning and I am trying to program the Baum Welch algorithm, which is a derivation of the EM algorithm for Hidden Markov Models. Inside my program I am testing for convergence using the probability of each observation sequence in the new model and then terminating once the new model is less than or equal to the old model. However, when I run the algorithm it seems to converge somewhat and gives results that are far better than random but when converging it goes down on the last iteration. Is this a sign of a bug or am I doing something wrong?
It seems to me that I should have been using the summation of the log of each observation's probability for the comparison instead since it seems like the function I am maximizing. However, the paper I read said to use the log of the sum of probabilities(which I am pretty sure is the same as the sum of the probabilities) of the observations(https://www.cs.utah.edu/~piyush/teaching/EM_algorithm.pdf).
I fixed this on another project where I implemented backpropogation with feed-forward neural nets by implementing a for loop with pre-set number of epochs instead of a while loop with a condition for the new iteration to be strictly greater than but I am wondering if this is a bad practice.
My code is at https://github.com/icantrell/Natural-Language-Processing
inside the nlp.py file.
Any advice would be appreciated.
Thank You.

For EM iterations, or any other iteration proved to be non-decreasing, you should be seeing increases until the size of increases becomes small compared with floating point error, at which time floating point errors violate the assumptions in the proof, and you may see not only a failure to increase, but a very small decrease - but this should only be very small.
One good way to check these sorts of probability based calculations is to create a small test problem where the right answer is glaringly obvious - so obvious that you can see whether the answers from the code under test are obviously correct at all.
It might be worth comparing the paper you reference with https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm#Proof_of_correctness. I think equations such as (11) and (12) are not intended for you to actually calculate, but as arguments to motivate and prove the final result. I think the equation corresponding to the traditional EM step, which you do calculate, is equation (15) which says that you change the parameters at each step to increase the expected log-likelihood, which is the expectation under the distribution of hidden states calculated according to the old parameters, which is the standard EM step. In fact, turning over I see this is stated explicitly at the top of P 8.

Parallel many dimensional optimization

I am building a script that generates input data [parameters] for another program to calculate. I would like to optimize the resulting data. Previously I have been using the numpy powell optimization. The psuedo code looks something like this.
def value(param):
run_program(param)
#Parse output
return value
scipy.optimize.fmin_powell(value,param)
This works great; however, it is incredibly slow as each iteration of the program can take days to run. What I would like to do is coarse grain parallelize this. So instead of running a single iteration at a time it would run (number of parameters)*2 at a time. For example:
Initial guess: param=[1,2,3,4,5]
#Modify guess by plus minus another matrix that is changeable at each iteration
jump=[1,1,1,1,1]
#Modify each variable plus/minus jump.
for num,a in enumerate(param):
new_param1=param[:]
new_param1[num]=new_param1[num]+jump[num]
run_program(new_param1)
new_param2=param[:]
new_param2[num]=new_param2[num]-jump[num]
run_program(new_param2)
#Wait until all programs are complete -> Parse Output
Output=[[value,param],...]
#Create new guess
#Repeat
Number of variable can range from 3-12 so something such as this could potentially speed up the code from taking a year down to a week. All variables are dependent on each other and I am only looking for local minima from the initial guess. I have started an implementation using hessian matrices; however, that is quite involved. Is there anything out there that either does this, is there a simpler way, or any suggestions to get started?
So the primary question is the following:
Is there an algorithm that takes a starting guess, generates multiple guesses, then uses those multiple guesses to create a new guess, and repeats until a threshold is found. Only analytic derivatives are available. What is a good way of going about this, is there something built already that does this, is there other options?
Thank you for your time.
As a small update I do have this working by calculating simple parabolas through the three points of each dimension and then using the minima as the next guess. This seems to work decently, but is not optimal. I am still looking for additional options.
Current best implementation is parallelizing the inner loop of powell's method.
Thank you everyone for your comments. Unfortunately it looks like there is simply not a concise answer to this particular problem. If I get around to implementing something that does this I will paste it here; however, as the project is not particularly important or the need of results pressing I will likely be content letting it take up a node for awhile.

I had the same problem while I was in the university, we had a fortran algorithm to calculate the efficiency of an engine based on a group of variables. At the time we use modeFRONTIER and if I recall correctly, none of the algorithms were able to generate multiple guesses.
The normal approach would be to have a DOE and there where some algorithms to generate the DOE to best fit your problem. After that we would run the single DOE entries parallely and an algorithm would "watch" the development of the optimizations showing the current best design.
Side note: If you don't have a cluster and needs more computing power HTCondor may help you.

Are derivatives of your goal function available? If yes, you can use gradient descent (old, slow but reliable) or conjugate gradient. If not, you can approximate the derivatives using finite differences and still use these methods. I think in general, if using finite difference approximations to the derivatives, you are much better off using conjugate gradients rather than Newton's method.
A more modern method is SPSA which is a stochastic method and doesn't require derivatives. SPSA requires much fewer evaluations of the goal function for the same rate of convergence than the finite difference approximation to conjugate gradients, for somewhat well-behaved problems.

There are two ways of estimating gradients, one easily parallelizable, one not:
around a single point, e.g. (f( x + h directioni ) - f(x)) / h;
this is easily parallelizable up to Ndim
"walking" gradient: walk from x0 in direction e0 to x1,
then from x1 in direction e1 to x2 ...;
this is sequential.
Minimizers that use gradients are highly developed, powerful, converge quadratically (on smooth enough functions).
The user-supplied gradient function
can of course be a parallel-gradient-estimator.
A few minimizers use "walking" gradients, among them Powell's method,
see Numerical Recipes p. 509.
So I'm confused: how do you parallelize its inner loop ?
I'd suggest scipy fmin_tnc
with a parallel-gradient-estimator, maybe using central, not one-sided, differences.
(Fwiw,
this
compares some of the scipy no-derivative optimizers on two 10-d functions; ymmv.)

I think what you want to do is use the threading capabilities built-in python.
Provided you your working function has more or less the same run-time whatever the params, it would be efficient.
Create 8 threads in a pool, run 8 instances of your function, get 8 result, run your optimisation algo to change the params with 8 results, repeat.... profit ?

If I haven't gotten wrong what you are asking, you are trying to minimize your function one parameter at the time.
you can obtain it by creating a set of function of a single argument, where for each function you freeze all the arguments except one.
Then you go on a loop optimizing each variable and updating the partial solution.
This method can speed up by a great deal function of many parameters where the energy landscape is not too complex (the dependency between the parameters is not too strong).
given a function
energy(*args) -> value
you create the guess and the function:
guess = [1,1,1,1]
funcs = [ lambda x,i=i: energy( guess[:i]+[x]+guess[i+1:] ) for i in range(len(guess)) ]
than you put them in a while cycle for the optimization
while convergence_condition:
for func in funcs:
optimize fot func
update the guess
check for convergence
This is a very simple yet effective method of simplify your minimization task. I can't really recall how this method is called, but A close look to the wikipedia entry on minimization should do the trick.

You could do parallel at two parts: 1) parallel the calculation of single iteration or 2) parallel start N initial guessing.
On 2) you need a job controller to control the N initial guess discovery threads.
Please add an extra output on your program: "lower bound" that indicates the output values of current input parameter's decents wont lower than this lower bound.
The initial N guessing thread can compete with each other; if any one thread's lower bound is higher than existing thread's current value, then this thread can be dropped by your job controller.

Parallelizing local optimizers is intrinsically limited: they start from a single initial point and try to work downhill, so later points depend on the values of previous evaluations. Nevertheless there are some avenues where a modest amount of parallelization can be added.
As another answer points out, if you need to evaluate your derivative using a finite-difference method, preferably with an adaptive step size, this may require many function evaluations, but the derivative with respect to each variable may be independent; you could maybe get a speedup by a factor of twice the number of dimensions of your problem. If you've got more processors than you know what to do with, you can use higher-order-accurate gradient formulae that require more (parallel) evaluations.
Some algorithms, at certain stages, use finite differences to estimate the Hessian matrix; this requires about half the square of the number of dimensions of your matrix, and all can be done in parallel.
Some algorithms may also be able to use more parallelism at a modest algorithmic cost. For example, quasi-Newton methods try to build an approximation of the Hessian matrix, often updating this by evaluating a gradient. They then take a step towards the minimum and evaluate a new gradient to update the Hessian. If you've got enough processors so that evaluating a Hessian is as fast as evaluating the function once, you could probably improve these by evaluating the Hessian at every step.
As far as implementations go, I'm afraid you're somewhat out of luck. There are a number of clever and/or well-tested implementations out there, but they're all, as far as I know, single-threaded. Your best bet is to use an algorithm that requires a gradient and compute your own in parallel. It's not that hard to write an adaptive one that runs in parallel and chooses sensible step sizes for its numerical derivatives.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Tolerance for termination in Nelder-Mead optimization - python

this condition stops the algorithm as soon as |f(x_n) - f(x_(n+1))| < fatol same : this condition stops the algorithm as soon as |x_n - x_(n+1)| < xatol

Related

Optimization method selection & dealing with convergence and variability

Linear programming (optimization) with Pulp

SLSQP yields complete different - vs COBYLA

Baum Welch (EM Algorithm) likelihood (P(X)) is not monotonically converging

Parallel many dimensional optimization

Categories

Resources