Rewriting SAS optimization in Python

I am trying to rewrite something similar to the following SAS optimization code in Python. The goal of the code is to find parameters for a continuous (in this case normal) distribution that best fits some empirical data. I am familiar with Python but very new to SAS.
proc nlin data=mydata outest=est METHOD=MARQUARDT SMETHOD=GOLDEN SAVE;
    parms m = -1.0 to 1.0 by 0.1
          s = 1.0 to 2.0 by 0.01;
    bounds -10<m<10;
    bounds 0<s<10;
    model NCDF = CDF('NORMAL',XValue,m,s);
run;
In Python, I have something set up like this:
import pandas as pd
from scipy import stats
from scipy.optimize import minimize

def distance_empirical_fitted_cdf(params: list, data: pd.Series):
    m, s = params
    empirical_cdf = (data / 100).cumsum()
    cost = 0
    for point in range(10):
        empirical_cdf_at_point = empirical_cdf.iloc[point]
        fitted_cdf_at_point = stats.norm.cdf(x=point, loc=m, scale=s)
        cost += (fitted_cdf_at_point - empirical_cdf_at_point) ** 2
    return cost

result = minimize(distance_empirical_fitted_cdf, x0=[0, 1.5], args=(distribution,),
                  bounds=[(-10, 10), (0, 10)])
fitted_m, fitted_s = result.x
The code I have now gets me fairly close to the existing code's output in most cases, but not in all. Ideally, I could get them to match or be as close as possible and understand why they don't.
As far as I can tell, there are two sources of discrepancy. First, the SAS code takes a set of possible starting values (in this case a grid from -1 to 1 for m and 1 to 2 for s) to initialize the parameters. Is there an equivalent of this in Python?
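The closest I've come up with myself is a brute-force loop over the grid of starting values, keeping the best result, though I'm not sure this matches what SAS actually does internally:

import numpy as np
from scipy.optimize import minimize

best = None
# mimic the SAS parms grid: m from -1 to 1 by 0.1, s from 1 to 2 by 0.01
for m0 in np.arange(-1.0, 1.0 + 1e-9, 0.1):
    for s0 in np.arange(1.0, 2.0 + 1e-9, 0.01):
        res = minimize(distance_empirical_fitted_cdf, x0=[m0, s0],
                       args=(distribution,), bounds=[(-10, 10), (0, 10)])
        if best is None or res.fun < best.fun:
            best = res
fitted_m, fitted_s = best.x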
Second, the SAS code is specifically using the Marquardt optimization method and the Golden step size search method. The only Python code I could find referencing the Marquardt method is scipy.optimize.least_squares with method="lm", but this doesn't support bounds (and is much further off compared to scipy.optimize.minimize when I try without bounds).
The only Python code I could find referencing the golden step size search method is scipy.optimize.golden, but the documentation says that this is for minimizing functions of one variable and doesn't seem to support bounds either.
Any insight on getting the Python output closer to the SAS output would be greatly appreciated, thanks!

Not an answer, as it's still inconclusive why (or how) the two sets of code produce different results, so this will likely change as more info is introduced.
In the meantime, here are some observations that might be useful but are too much to fit into the comments section.
Algorithm
The Marquardt method as called by SAS is more commonly referred to as the Levenberg-Marquardt algorithm (at least to me), abbreviated LMA or LM.
Bounds
The math of LMA is not defined to handle bounds, which is why scipy's LM-based routines provide no bounds options.
The MINPACK-1 implementation used in scipy.optimize.leastsq for the Levenberg-Marquardt algorithm does not explicitly support bounds on parameters, and expects to be able to fully explore the available range of values for any Parameter. Simply placing hard constraints (that is, resetting the value when it exceeds the desired bounds) prevents the algorithm from determining the partial derivatives, and leads to unstable results.
The discrepancy in results may arise from here, depending on how SAS decides to handle the provided bounds.
It may be that SAS is ignoring the provided bounds, re-running until a solution within the bounds is found, or using some other method entirely.
Another possible cause is that a better solution exists outside of your provided bounds.
Then either SAS limits its solutions to within the bounds but Python doesn't, or vice versa.
This would result in one code returning a local minimum within the bounds, and the other returning a better (possibly global) minimum outside them.
However, I can't find where the SAS docs for NLIN explain how this is handled, so this is still inconclusive.
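One quick diagnostic on the Python side (a sketch reusing distance_empirical_fitted_cdf and distribution from your question) is to run the same objective with and without bounds and compare:

from scipy.optimize import minimize

bounded = minimize(distance_empirical_fitted_cdf, x0=[0, 1.5],
                   args=(distribution,), bounds=[(-10, 10), (0, 10)])
unbounded = minimize(distance_empirical_fitted_cdf, x0=[0, 1.5],
                     args=(distribution,))

print(bounded.x, bounded.fun)
print(unbounded.x, unbounded.fun)
# if the unbounded run finds a clearly lower cost at a point outside the bounds,
# the two tools are probably converging to different minima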
Step-Size Search
Note that the NLIN procedure's SMETHOD is used to search for an optimal step size.
It's unclear from the SAS docs on SMETHOD what the "step size" is exactly, but I believe in this context it could refer to the damping parameter lambda.
As this parameter is quite important to the performance of LMA, different ways of determining it could also affect convergence (and thus the final results).
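For reference, the damping parameter appears in the usual textbook statement of the LMA step (this is the common form, not taken from the SAS docs):

$$(J^T J + \lambda \,\operatorname{diag}(J^T J))\,\delta = J^T r$$

where $J$ is the Jacobian of the residuals $r$ and $\delta$ is the parameter update; a large $\lambda$ shortens the step and pushes it toward gradient descent, while a small $\lambda$ approaches a Gauss-Newton step.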
Again, all of this depends on how the two results differ.
If the two results are significantly different (completely different output CDF), then chances are only one CDF would match the actual data.
The code that doesn't match the actual data is probably doing something wrong, and needs to be scrutinized.

Related

How can I use implicit components to assemble a full system?

I'm working on a panel method code at the moment. To keep us from being bogged down in the minutiae, I won't show the code - this is a question about overall program structure.
Currently, I solve my system by:
1. Generating the corresponding rows of the A matrix and b vector in an explicit component for each boundary condition.
2. Assembling the partial outputs into the full A, b.
3. Solving the linear system, Ax=b, using a LinearSystemComp.
Here's a (crude) diagram:
I would prefer to be able to do this by just writing one implicit component to represent each boundary condition, vectorising the inputs/outputs to represent multiple rows/cols in the matrix, then allowing openMDAO to solve for the x while driving the residual for each boundary condition to 0.
I've run into trouble trying to make this work, as each implicit component is underdetermined (more entries in the output vector x than residuals in the component; that is, A1.x - b1 = R1 with length(R1) < length(x)). Essentially, I would like openMDAO to take each of these underdetermined implicit systems and find the value of x that solves the determined full system - without needing to do all of the assembling stuff myself.
Something like this:
To try and make my goal clearer, I'll explain what I actually want from the perspective of my panel method. I'd like a component, let's say Influence, that computes the potential induced by a given panel at a given point in the panel's reference frame. I'd like to vectorise the input panels and points such that it can compute the influence coefficient of many panels on one point, of many points on one panel, or of many points on many panels.
I'd then like a system of implicit boundary conditions to find the correct value of mu to solve the system. These boundary conditions, again, should be able to be vectorised to compute the violation of the boundary condition at many points under the influence of many panels.
I get confused again at this part. Not every boundary condition will use the influence coefficient values - some, like the Kutta condition, are just enforced on the mu vector directly (e.g. as a relation between a few of its entries).
How would I implement this as an implicit component? It has no inputs, and doesn't output the full mu vector.
I appreciate that the question is rather long and rambling, but I'm pretty confused. To summarise:
How can I use openMDAO to solve multiple individually underdetermined (but combined, fully determined) implicit systems?
How can I use openMDAO to write an implicit component that takes no inputs and only uses a portion of the overall solution vector?
In the OpenMDAO docs there is a close analog to what you are trying to accomplish: the node-voltage analysis tutorial. In that code, the balance comp is used to create an implicit relationship that is similar to what you're describing. It's singular on its own, but as part of a larger group it forms a well-defined system.
You'll need to find a way to build similar components for your model. Each "row" in your equation will be associated with one state variable (one entry in your x vector).
In the simplest case, each row (or set of rows) would have one input which is the associated row of the A matrix, and a second input which is ALL of the other values for x, and a final input which is the entry of the b vector (right hand side vector). Then you could evaluate the residual for that specific row, which would be the following
R['x_i'] = np.sum(A*x_full) - b
where x_full is the assembly of the full x-vector from the x_other input and the x_i state variable.
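A minimal sketch of what such a row component might look like (the class, option, and variable names here are illustrative, not from your model or from any OpenMDAO tutorial):

import numpy as np
import openmdao.api as om

class RowResidual(om.ImplicitComponent):
    """Drives A_row . x_full - b_i to zero for the single state x_i,
    given the rest of the x vector as an input."""

    def initialize(self):
        self.options.declare('index', types=int)   # which entry of x this component owns
        self.options.declare('size', types=int)    # total length of x

    def setup(self):
        n = self.options['size']
        self.add_input('A_row', shape=n)        # the row of A associated with x_i
        self.add_input('x_other', shape=n - 1)  # all the other entries of x
        self.add_input('b_i', val=0.0)          # the matching entry of the right hand side
        self.add_output('x_i', val=0.0)         # the state this component solves for
        self.declare_partials('*', '*', method='fd')

    def apply_nonlinear(self, inputs, outputs, residuals):
        i = self.options['index']
        x_full = np.insert(inputs['x_other'], i, outputs['x_i'])
        residuals['x_i'] = np.dot(inputs['A_row'], x_full) - inputs['b_i']

A group of these components, with each one's x_other connected to the x_i outputs of the others and a Newton solver on the group, would then converge the whole set of rows simultaneously.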
#########
Having proposed the above solution, I have to say that I don't think this is a particularly efficient way to build or solve this linear system. It is modular, and might give you some flexibility, but you're jumping through a lot of hoops to avoid doing some index math and shoving everything into a matrix.
Granted, the derivatives might be a bit easier in your design, because the matrix assembly is going to get handled "magically" by the connections you have to create between the various row-components. So maybe it's worth the trade... but I would say you might be better off trying a more traditional coding approach and using JAX or some other AD code to make the derivatives easier.
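For what it's worth, a tiny illustration of that latter approach: assemble A and b inside one function and let JAX differentiate through a direct solve (the assembly below is a toy stand-in, not your panel method):

import jax
import jax.numpy as jnp

def solve_system(params):
    # toy stand-in for the panel-method assembly: build A and b from the
    # design parameters, then solve the dense linear system directly
    n = params.shape[0]
    A = jnp.eye(n) + 0.1 * jnp.outer(params, params)
    b = jnp.sin(params)
    return jnp.linalg.solve(A, b)

params = jnp.linspace(0.1, 1.0, 5)
x = solve_system(params)
# full Jacobian d(x)/d(params) via forward-mode automatic differentiation
dx_dparams = jax.jacfwd(solve_system)(params)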

Is it possible to specify starting values for each parameter (instead of bounds) for scipy's differential evolution?

Scipy's differential evolution implementation (https://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.optimize.differential_evolution.html) uses either a Latin hypercube or a random method for population initialization. Latin hypercube sampling tries to maximize coverage of the available parameter space. ‘random’ initializes the population randomly. I am wondering if it would be possible to specify starting values for each parameter, instead of relying on these default algorithms.
For complex models (particularly those that are mathematically intractable and thus need to be simulated), I have observed that 2 independent runs of scipy's differential evolution will likely give different results after X iterations of the algorithm (I usually set X = 100 to avoid running the algorithm for several days). I think this is because (1) population initialization is not identical between 2 independent runs (because of the stochastic nature of the 'random' and 'latinhypercube' initialization methods) and (2) there is noise in the model prediction. I am thus thinking of running ~10 independent runs of DE with 100 iterations, picking the best-fitting parameter set across the 10 runs, and using this set as the starting values for a final run with more iterations (say 200). The problem is that I see no way to manually enter these starting values in scipy's DE implementation. I would be very grateful if somebody in the community could help me.
This has indeed been possible since version 1.1 of SciPy (note that you're referring to the dated 0.17.0 documentation). In particular, recent versions let you specify any array, instead of just 'latinhypercube' or 'random'. From the documentation, a possible value of init is:
array specifying the initial population. The array should have shape (M, len(x)), where len(x) is the number of parameters. init is clipped to bounds before use.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html
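For example, a minimal sketch of seeding the population with your own candidate vectors (the Rosenbrock function here is just a stand-in objective):

import numpy as np
from scipy.optimize import differential_evolution, rosen

bounds = [(-5, 5), (-5, 5)]

# one row per population member; rows are clipped to the bounds before use
rng = np.random.default_rng(0)
init_pop = rng.uniform(-5, 5, size=(15, len(bounds)))
init_pop[0] = [1.0, 1.0]   # e.g. the best-fitting set from a previous run

result = differential_evolution(rosen, bounds, init=init_pop, maxiter=100)
print(result.x, result.fun)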
If, for some reason, you're forced to use the old version, it's still possible to obtain what you want by just using the underlying DifferentialEvolutionSolver instead. There, you can either monkey patch one of the initializing functions, or just run them and manually override the population attribute post-initialization.

CPLEX finds optimal LP solution but returns no basis error

I am solving a series of LP problems using the CPLEX Python API.
Since many of the problems are essentially the same, save a handful of parameters, I want to warm start most of them with the solution of the previous problem by calling the function cpx.start.set_start(col_status, row_status, col_primal, row_primal, col_dual, row_dual), where cpx = cplex.Cplex().
This function is documented here. Two of the arguments, col_status and row_status, are obtained by calling cpx.solution.basis.get_col_basis() and cpx.solution.basis.get_row_basis().
However, despite cpx.solution.status[cpx.solution.get_status()] returning optimal and being able to obtain both cpx.solution.get_values() and cpx.solution.get_dual_values() ...
Calling cpx.solution.basis.get_basis() returns CPLEX Error 1262: No basis exists.
Now, according to this post one can call the warm start function with empty lists for the column and row basis statuses, as follows.
lastsolution = cpx.solution.get_values()
cpx.start.set_start(col_status=[], row_status=[],
                    col_primal=lastsolution, row_primal=[],
                    col_dual=[], row_dual=[])
However, this actually results in a few more CPLEX iterations. Why it takes more is unclear, but the overall goal is obviously to take significantly fewer.
Version Info
Python 2.7.12
CPLEX 12.6.3
I'm not sure why you're getting the CPXERR_NO_BASIS. See my comment.
You may have better luck if you provide the values for row_primal, col_dual, and row_dual too. For example:
cpx2.start.set_start(col_status=[],
                     row_status=[],
                     col_primal=cpx.solution.get_values(),
                     row_primal=cpx.solution.get_linear_slacks(),
                     col_dual=cpx.solution.get_reduced_costs(),
                     row_dual=cpx.solution.get_dual_values())
I was able to reproduce the behavior you describe using the afiro.mps model that comes with the CPLEX examples (number of deterministic ticks actually increased when specifying col_primal alone). However, when doing the above, it did help (number of det ticks improved and iterations went to 0).
Finally, I don't believe that there is any guarantee that using set_start will always help (it may even be a bad idea in some cases). I don't have a reference for this.

Function to determine a reasonable initial guess for scipy.optimize?

I'm using scipy.optimize.minimize to find the minimum of a 4D function that is rather sensitive to the initial guess used. If I vary it a little bit, the solution will change considerably.
There are many questions similar to this one already in SO (e.g.: 1, 2, 3), but no real answer.
In an old question of mine, one of the developers of the zunzun.com site (apparently no longer online) explained how they managed this:
Zunzun.com uses the Differential Evolution genetic algorithm (DE) to find initial parameter estimates which are then passed to the Levenberg-Marquardt solver in scipy. DE is not actually used as a global optimizer per se, but rather as an "initial parameter guesser".
The closest I've found to this algorithm is this answer where a for block is used to call the minimizing function many times with random initial guesses. This generates multiple minimized solutions, and finally the best (smallest value) one is picked.
Is there something like what the zunzun dev described already implemented in Python?
There is no general answer to such a question, as minimizing an arbitrary function is impossible in general. You can do better or worse on particular classes of functions, so it is rather a job for a mathematician to analyze what your function probably looks like.
You can also work with dozens of so-called "meta optimizers", which are just bunches of heuristics that might (or might not) work for your particular application. Those include randomly sampling the starting point in a loop, using genetic algorithms, or - which is, as far as I know, the most mathematically justified approach - using Bayesian optimization. The general idea is to model your function at the same time as you try to minimize it; this way you can make an informed guess about where to start next time (which is a level of abstraction higher than random guessing or than genetic algorithms/differential evolution). Thus, I would order these methods as follows:
- grid search / random sampling - uses no information from previous runs, thus the worst results
- genetic/evolutionary approaches, basin-hopping, annealing - use information from previous runs as (x, f(x)) pairs, for a limited period of time (generations) - thus average results (see the sketch below for the common differential-evolution-then-local-refine variant)
- Bayesian optimization (and similar methods) - uses information from all previous experiences by modeling the underlying function and selecting samples based on expected improvement - best results (at the cost of the most complex method)
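As a concrete sketch of the middle option (differential evolution used only as an initial-parameter guesser and then handed to a local solver, as the zunzun developer described; the Rosenbrock function is just a stand-in objective):

from scipy.optimize import differential_evolution, minimize, rosen

bounds = [(-5, 5)] * 4

# stage 1: differential evolution as a cheap global "initial parameter guesser"
de_result = differential_evolution(rosen, bounds, maxiter=50, seed=1, polish=False)

# stage 2: refine the best point found with a local gradient-based solver
local_result = minimize(rosen, x0=de_result.x, method='L-BFGS-B', bounds=bounds)
print(local_result.x, local_result.fun)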

Parallel many dimensional optimization

I am building a script that generates input data [parameters] for another program to calculate. I would like to optimize the resulting data. Previously I have been using scipy's Powell optimizer. The pseudocode looks something like this:
def value(param):
    run_program(param)
    # parse the program's output into a single scalar
    return parsed_output

scipy.optimize.fmin_powell(value, param)
This works great; however, it is incredibly slow as each iteration of the program can take days to run. What I would like to do is coarse grain parallelize this. So instead of running a single iteration at a time it would run (number of parameters)*2 at a time. For example:
# Initial guess
param = [1, 2, 3, 4, 5]
# Modify guess by plus/minus another matrix that is changeable at each iteration
jump = [1, 1, 1, 1, 1]
# Modify each variable plus/minus jump.
for num, a in enumerate(param):
    new_param1 = param[:]
    new_param1[num] = new_param1[num] + jump[num]
    run_program(new_param1)
    new_param2 = param[:]
    new_param2[num] = new_param2[num] - jump[num]
    run_program(new_param2)
# Wait until all programs are complete -> Parse Output
# Output = [[value, param], ...]
# Create new guess
# Repeat
The number of variables can range from 3 to 12, so something like this could potentially cut the runtime from a year down to a week. All variables are dependent on each other, and I am only looking for a local minimum from the initial guess. I have started an implementation using Hessian matrices; however, that is quite involved. Is there anything out there that already does this, is there a simpler way, or do you have any suggestions to get started?
So the primary question is the following:
Is there an algorithm that takes a starting guess, generates multiple guesses, then uses those multiple guesses to create a new guess, and repeats until a threshold is found. Only analytic derivatives are available. What is a good way of going about this, is there something built already that does this, is there other options?
Thank you for your time.
As a small update I do have this working by calculating simple parabolas through the three points of each dimension and then using the minima as the next guess. This seems to work decently, but is not optimal. I am still looking for additional options.
Current best implementation is parallelizing the inner loop of powell's method.
Thank you everyone for your comments. Unfortunately, it looks like there is simply no concise answer to this particular problem. If I get around to implementing something that does this, I will paste it here; however, as the project is not particularly important nor the need for results pressing, I will likely be content letting it take up a node for a while.
I had the same problem while I was at university: we had a Fortran algorithm to calculate the efficiency of an engine based on a group of variables. At the time we used modeFRONTIER and, if I recall correctly, none of the algorithms were able to generate multiple guesses.
The normal approach would be to have a DOE, and there were some algorithms to generate the DOE that best fits your problem. After that we would run the individual DOE entries in parallel, and an algorithm would "watch" the development of the optimizations, showing the current best design.
Side note: if you don't have a cluster and need more computing power, HTCondor may help you.
Are derivatives of your goal function available? If yes, you can use gradient descent (old, slow but reliable) or conjugate gradient. If not, you can approximate the derivatives using finite differences and still use these methods. I think in general, if using finite difference approximations to the derivatives, you are much better off using conjugate gradients rather than Newton's method.
A more modern method is SPSA, which is a stochastic method and doesn't require derivatives. For somewhat well-behaved problems, SPSA needs far fewer evaluations of the goal function than finite-difference conjugate gradients to achieve the same rate of convergence.
There are two ways of estimating gradients, one easily parallelizable, one not:
- around a single point, e.g. (f(x + h*direction_i) - f(x)) / h; this is easily parallelizable up to Ndim
- "walking" gradient: walk from x0 in direction e0 to x1, then from x1 in direction e1 to x2, ...; this is sequential.
Minimizers that use gradients are highly developed, powerful, and converge quadratically (on smooth enough functions). The user-supplied gradient function can of course be a parallel-gradient-estimator. A few minimizers use "walking" gradients, among them Powell's method; see Numerical Recipes p. 509. So I'm confused: how do you parallelize its inner loop?
I'd suggest scipy fmin_tnc with a parallel-gradient-estimator, maybe using central, not one-sided, differences.
(Fwiw, this compares some of the scipy no-derivative optimizers on two 10-d functions; ymmv.)
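Here is a minimal sketch of such a parallel central-difference gradient estimator, fed to scipy's TNC wrapper via the jac argument (the objective is a cheap stand-in for the external program):

import numpy as np
from multiprocessing import Pool
from scipy.optimize import minimize

def expensive_objective(x):
    # stand-in for "run the external program and parse its output"
    return float(np.sum((x - 1.0) ** 2))

def parallel_central_gradient(x, pool, h=1e-4):
    # the 2*ndim evaluations are independent, so farm them out to the pool
    points = []
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        points.extend([x + step, x - step])
    values = pool.map(expensive_objective, points)
    return np.array([(values[2 * i] - values[2 * i + 1]) / (2 * h)
                     for i in range(len(x))])

if __name__ == '__main__':
    with Pool() as pool:
        result = minimize(expensive_objective, x0=np.zeros(5), method='TNC',
                          jac=lambda x: parallel_central_gradient(x, pool))
    print(result.x)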
I think what you want to do is use the threading capabilities built into Python.
Provided your worker function has more or less the same run-time whatever the params, it would be efficient.
Create 8 threads in a pool, run 8 instances of your function, get 8 results, run your optimisation algorithm to update the params from those 8 results, repeat... profit?
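A rough sketch of that loop's building block, assuming run_program launches the external calculation (the helper and toy objective below are illustrative):

from concurrent.futures import ThreadPoolExecutor

def run_program(params):
    # stand-in for launching the external calculation (e.g. via subprocess)
    # and parsing its output into a single scalar value
    return sum((p - 3.0) ** 2 for p in params)

def evaluate_batch(candidates, max_workers=8):
    # evaluate a batch of candidate parameter sets concurrently and
    # return (value, params) pairs sorted best-first
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(run_program, candidates))
    return sorted(zip(values, candidates))

# the 2*N perturbed guesses from the question, evaluated together
param = [1, 2, 3, 4, 5]
jump = [1, 1, 1, 1, 1]
candidates = []
for i in range(len(param)):
    up, down = param[:], param[:]
    up[i] += jump[i]
    down[i] -= jump[i]
    candidates.extend([up, down])

best_value, best_params = evaluate_batch(candidates)[0]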
If I haven't misunderstood what you are asking, you are trying to minimize your function one parameter at a time.
You can do that by creating a set of functions of a single argument, where for each function you freeze all the arguments except one.
Then you loop over the variables, optimizing each one and updating the partial solution.
This method can speed things up a great deal for functions of many parameters where the energy landscape is not too complex (i.e. the dependency between the parameters is not too strong).
given a function
energy(*args) -> value
you create the guess and the functions:
guess = [1,1,1,1]
funcs = [ lambda x,i=i: energy( guess[:i]+[x]+guess[i+1:] ) for i in range(len(guess)) ]
Then you put them in a while loop for the optimization:
while convergence_condition:
    for func in funcs:
        optimize for func
        update the guess
    check for convergence
This is a very simple yet effective method to simplify your minimization task. I can't really recall what this method is called (it is essentially coordinate descent), but a close look at the Wikipedia entry on minimization should do the trick.
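A runnable version of the idea, assuming a toy energy function in place of your expensive calculation and using scipy's 1-D minimizer for each coordinate:

from scipy.optimize import minimize_scalar

def energy(*args):
    # toy objective standing in for the expensive external calculation
    return sum((a - i) ** 2 for i, a in enumerate(args))

guess = [1.0, 1.0, 1.0, 1.0]
funcs = [lambda x, i=i: energy(*(guess[:i] + [x] + guess[i + 1:]))
         for i in range(len(guess))]

for sweep in range(20):                      # bounded sweep count as a simple stopping rule
    old = guess[:]
    for i, func in enumerate(funcs):
        guess[i] = minimize_scalar(func).x   # 1-D optimization of coordinate i
    if max(abs(g - o) for g, o in zip(guess, old)) < 1e-8:
        break
print(guess)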
You could parallelize at two levels: 1) parallelize the calculation within a single iteration, or 2) start N initial guesses in parallel.
For 2) you need a job controller to manage the N initial-guess discovery threads.
Please add an extra output to your program: a "lower bound" indicating that the output values reachable by descent from the current input parameters won't go below this lower bound.
The N initial-guess threads can then compete with each other; if any one thread's lower bound is higher than an existing thread's current value, that thread can be dropped by your job controller.
Parallelizing local optimizers is intrinsically limited: they start from a single initial point and try to work downhill, so later points depend on the values of previous evaluations. Nevertheless there are some avenues where a modest amount of parallelization can be added.
As another answer points out, if you need to evaluate your derivative using a finite-difference method, preferably with an adaptive step size, this may require many function evaluations, but the derivative with respect to each variable may be independent; you could maybe get a speedup by a factor of twice the number of dimensions of your problem. If you've got more processors than you know what to do with, you can use higher-order-accurate gradient formulae that require more (parallel) evaluations.
Some algorithms, at certain stages, use finite differences to estimate the Hessian matrix; this requires about half the square of the number of dimensions of your problem in function evaluations, all of which can be done in parallel.
Some algorithms may also be able to use more parallelism at a modest algorithmic cost. For example, quasi-Newton methods try to build an approximation of the Hessian matrix, often updating this by evaluating a gradient. They then take a step towards the minimum and evaluate a new gradient to update the Hessian. If you've got enough processors so that evaluating a Hessian is as fast as evaluating the function once, you could probably improve these by evaluating the Hessian at every step.
As far as implementations go, I'm afraid you're somewhat out of luck. There are a number of clever and/or well-tested implementations out there, but they're all, as far as I know, single-threaded. Your best bet is to use an algorithm that requires a gradient and compute your own in parallel. It's not that hard to write an adaptive one that runs in parallel and chooses sensible step sizes for its numerical derivatives.
