How does automatic differentiation with respect to the input work?

How does automatic differentiation with respect to the input work? - python

I've been trying to understand how automatic differentiation (autodiff) works. There are several implementations of this that can be found in Tensorflow, PyTorch and other programs.
There are three aspects of automatic differentiation that currently seem vague to me.
The exact process used to calculate the gradients
How autodiff works with respect to inputs
How autodiff works with respect to a singular value as input
So far, it seems to roughly follow the following steps:
Break up original function into elementary operations (individual arithmetic operations, composition and function calls).
The elementary operations are combined to form a computational graph in such a way that the original function can be calculated using the computational graph.
The computational graph is executed for a certain input, and each operation is recorded
Walking through the recorded operations in reverse using the chain rule gives us the gradient.
First of all, is this a correct overview of the steps that are taken in automatic differentiation?
Secondly, how would the above process work for a derivative with respect to the inputs. For instance, a function would need a difference in the x value. Does that mean that the derivative can only be calculated after at least two different x values have been provided as the input? Or does it require multiple inputs at once (i.e. vector input) over which it can calculate a difference? And how does this compare when we calculate the gradient with respect to the model weights (i.e. as done in backpropagation).
Thirdly, how can we take the derivative of a singular value. Take, for instance, the following Python code where the derivative of is calculated:
x = tf.constant(3.0)
with tf.GradientTape() as tape:
  tape.watch(x)
  y = x**2
# dy = 2x * dx
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy()) # prints: '6.0'
Since dx is the difference between several x inputs, would that not mean that dx = 0?
I found that this paper had a pretty good overview of the various modes of autodiff. As well as the differences as compared to numerical and symbolic differentiation. However, it did not bring a full understanding and I would still like to understand the autodiff process in context of these traditional differentiation techniques.
Rather than applying it practically, I would love to get a more theoretical understanding.

I had similar questions in my mind a few weeks ago until I started to code my own Automatic Differentiation package tensortrax in Python. It uses forward-mode AD with a hyper-dual number approach. I wrote a Readme (landing page of the repository, section Theory) with an example which could be of interest for you.

I think what you need to understand first is what is a derivative, many math textbooks could help you with that. The notation dx means an infinitesimal variation, so you not actually compute any difference, but do a symbolic operation on your function f that transforms it to a function f' also noted df/dx, which you then apply at any point where it is defined.
Regarding the algorithm used for automatic differentiation, you understood it right, the part that you seem to be missing is how the derivatives of elementary operations are computed and what do they mean, but it would be hard to do a crash course about that in a SO answer.

Related

Checkgradient without solving optimization problem in MATLAB

I have a relatively complicated function and I have calculated the analytical form of the Jacobian of this function. However, sometimes, I mess up this Jacobian.
MATLAB has a nice way to check for the accuracy of the Jacobian when using some optimization technique as described here.
The problem though is that it looks like MATLAB solves the optimization problem and then returns if the Jacobian was correct or not. This is extremely time consuming, especially considering that some of my optimization problems take hours or even days to compute.
Python has a somewhat similar function in scipy as described here which just compares the analytical gradient with a finite difference approximation of the gradient for some user provided input.
Is there anything I can do to check the accuracy of the Jacobian in MATLAB without having to solve the entire optimization problem?

A laborious but useful method I've used for this sort of thing is to check that the (numerical) integral of the purported derivative is the difference of the function at the end points. I have found this more convenient than comparing fractions like (f(x+h)-f(x))/h with f'(x) because of the difficulty of choosing h so that on the one hand h is not so small that the fraction is not dominated by rounding error and on the other h is small enough that the fraction should be close to f'(x)
In the case of a function F of a single variable, the assumption is that you have code f to evaluate F and fd say to evaluate F'. Then the test is, for various intervals [a,b] to look at the differences, which the fundamental theorem of calculus says should be 0,
Integral{ 0<=x<=b | fd(x)} - (f(b)-f(a))
with the integral being computed numerically. There is no need for the intervals to be small.
Part of the error will, of course, be due to the error in the numerical approximation to the integral. For this reason I tend to use, for example, and order 40 Gausss Legendre integrator.
For functions of several variables, you can test one variable at a time. For several functions, these can be tested one at a time.
I've found that these tests, which are of course by no means exhaustive, show up the kinds of mistakes that occur in computing derivatives quire readily.

Have you considered the usage of Complex step differentiation to check your gradient? See this description

How can I use implicit components to assemble a full system?

I'm working on a panel method code at the moment. To keep us from being bogged down in the minutia, I won't show the code - this is a question about overall program structure.
Currently, I solve my system by:
Generating the corresponding rows of the A matrix and b vector in an explicit component for each boundary condition
Assembling the partial outputs into the full A, b.
Solving the linear system, Ax=b, using a LinearSystemComp.
Here's a (crude) diagram:
I would prefer to be able to do this by just writing one implicit component to represent each boundary condition, vectorising the inputs/outputs to represent multiple rows/cols in the matrix, then allowing openMDAO to solve for the x while driving the residual for each boundary condition to 0.
I've run into trouble trying to make this work, as each implicit component is underdetermined (more rows in the output vector x than the component output residuals; that is, A1.x - b1= R1, length(R1) < length(x). Essentially, I would like openMDAO to take each of these underdetermined implicit systems, and find the value of x that solves the determined full system - without needing to do all of the assembling stuff myself.
Something like this:
To try and make my goal clearer, I'll explain what I actually want from the perspective of my panel method. I'd like a component, let's say Influence, that computes the potential induced by a given panel at a given point in the panel's reference frame. I'd like to vectorise the input panels and points such that it can compute the influence coefficent of many panels on one point, of many points on one panel, or of many points on many panels.
I'd then like a system of implicit boundary conditions to find the correct value of mu to solve the system. These boundary conditions, again, should be able to be vectorised to compute the violation of the boundary condition at many points under the influence of many panels.
I get confused again at this part. Not every boundary condition will use the influence coefficient values - some, like the Kutta condition, are just enforced on the mu vector, e.g .
How would I implement this as an implicit component? It has no inputs, and doesn't output the full mu vector.
I appreciate that the question is rather long and rambling, but I'm pretty confused. To summarise:
How can I use openMDAO to solve multiple individually underdetermined (but combined, fully determined) implicit systems?
How can I use openMDAO to write an implicit component that takes no inputs and only uses a portion of the overall solution vector?

In the OpenMDAO docs there is a close analog to what you are trying to accomplish, with the node-voltage analysis tutorial. In that code, the balance comp is used to create an implicit relationship that is similar to what you're describing. Its singular on its own, but part of a larger group is a well defined system.
You'll need to find a way to build similar components for your model. Each "row" in your equation will be associated with one state variables (one entry in your x vector).
In the simplest case, each row (or set of rows) would have one input which is the associated row of the A matrix, and a second input which is ALL of the other values for x, and a final input which is the entry of the b vector (right hand side vector). Then you could evaluate the residual for that specific row, which would be the following
R['x_i'] = np.sum(A*x_full) - b
where x_full is the assembly of the full x-vector from the x_other input and the x_i state variable.
#########
Having proposed the above solution, I have to say that I don't think this is a particularly efficient way to build or solve this linear system. It is modular, and might give you some flexibility, but you're jumping through a lot of hoops to avoid doing some index-math, and shoving everything into a matrix.
Granted, the derivatives might be a bit easier in your design, because the matrix assembly is going to get handled "magically" by the connections you have to create between the various row-components. So maybe its worth the trade... but i would say you might be better of trying a more traditional coding approach and using JAX or some other AD code to make the derivatives easier.

how to improve this adaptive trapezoidal rule?

I have attempted an exercise from the computational physics written by Newman and written the following code for an adaptive trapezoidal rule. When the error estimate of each slide is larger than the permitted value, it divides that portion into two halves. I am just wondering what else I can do to make the algorithm more efficient.
xm=[]
def trap_adapt(f,a,b,epsilon=1.0e-8):
def step(x1,x2,f1,f2):
xm = (x1+x2)/2.0
fm = f(xm)
h1 = x2-x1
h2 = h1/2.0
I1 = (f1+f2)*h1/2.0
I2 = (f1+2*fm+f2)*h2/2.0
error = abs((I2-I1)/3.0) # leading term in the error expression
if error <= h2*delta:
points.append(xm) # add the points to the list to check if it is really using more points for more rapid-varying regions
return h2/3*(f1 + 4*fm + f2)
else:
return step(x1,xm,f1,fm)+step(xm,x2,fm,f2)
delta = epsilon/(b-a)
fa, fb = f(a), f(b)
return step(a,b,fa,fb)
Besides, I used a few simple formula to compare this to Romberg integration, and found that for the same accuracy, this adaptive method uses many more point to calculate the integral.
Is it just because of its inherent limitations? Any advantages of using this adaptive algorithm instead of the Romberg method? any ways to make it faster and more accurate?

Your code is refining to meet an error tolerance in each individual subinterval. It's also using a low-order integration rule. Improvements in both of these can significantly reduce the number of function evaluations.
Rather than considering the error in each subinterval separately, more advanced codes compute the total error over all the subintervals and refine until the total error is below the desired threshold. Subintervals are chosen for refinement according to their contribution to the total error, with larger errors being refined first. Typically a priority queue is used to quickly chose the subinterval for refinement.
Higher-order integration rules can integrate more complicated functions exactly. For example, your code is based on Simpson's rule, which is exact for polynomials of degree up to 3. A more advanced code will probably use a rule that's exact for polynomials of much higher degree (say 10-15).
From a practical point of view, the simplest thing is to use a canned routine that implements the above ideas, e.g., scipy.integrate.quad. Unless you have particular knowledge of what you want to integrate, you're unlikely to do better.
Romberg integration requires evaluation at equally-spaced points. If you can evaluate the function at any point, then other methods are generally more accurate for "smooth" (polynomial-like) functions. And if your function is not smooth everywhere, then an adaptive code will do much better because it can focus on beating down the error in the non-smooth regions.

parameter within an interval while optimizing

Usually I use Mathematica, but now trying to shift to python, so this question might be a trivial one, so I am sorry about that.
Anyways, is there any built-in function in python which is similar to the function named Interval[{min,max}] in Mathematica ? link is : http://reference.wolfram.com/language/ref/Interval.html
What I am trying to do is, I have a function and I am trying to minimize it, but it is a constrained minimization, by that I mean, the parameters of the function are only allowed within some particular interval.
For a very simple example, lets say f(x) is a function with parameter x and I am looking for the value of x which minimizes the function but x is constrained within an interval (min,max) . [ Obviously the actual problem is just not one-dimensional rather multi-dimensional optimization, so different paramters may have different intervals. ]
Since it is an optimization problem, so ofcourse I do not want to pick the paramter randomly from an interval.
Any help will be highly appreciated , thanks!

If it's a highly non-linear problem, you'll need to use an algorithm such as the Generalized Reduced Gradient (GRG) Method.
The idea of the generalized reduced gradient algorithm (GRG) is to solve a sequence of subproblems, each of which uses a linear approximation of the constraints. (Ref)
You'll need to ensure that certain conditions known as the KKT conditions are met, etc. but for most continuous problems with reasonable constraints, you'll be able to apply this algorithm.
This is a good reference for such problems with a few examples provided. Ref. pg. 104.
Regarding implementation:
While I am not familiar with Python, I have built solver libraries in C++ using templates as well as using function pointers so you can pass on functions (for the objective as well as constraints) as arguments to the solver and you'll get your result - hopefully in polynomial time for convex problems or in cases where the initial values are reasonable.
If an ability to do that exists in Python, it shouldn't be difficult to build a generalized GRG solver.
The Python Solution:
Edit: Here is the python solution to your problem: Python constrained non-linear optimization

Parallel many dimensional optimization

I am building a script that generates input data [parameters] for another program to calculate. I would like to optimize the resulting data. Previously I have been using the numpy powell optimization. The psuedo code looks something like this.
def value(param):
run_program(param)
#Parse output
return value
scipy.optimize.fmin_powell(value,param)
This works great; however, it is incredibly slow as each iteration of the program can take days to run. What I would like to do is coarse grain parallelize this. So instead of running a single iteration at a time it would run (number of parameters)*2 at a time. For example:
Initial guess: param=[1,2,3,4,5]
#Modify guess by plus minus another matrix that is changeable at each iteration
jump=[1,1,1,1,1]
#Modify each variable plus/minus jump.
for num,a in enumerate(param):
new_param1=param[:]
new_param1[num]=new_param1[num]+jump[num]
run_program(new_param1)
new_param2=param[:]
new_param2[num]=new_param2[num]-jump[num]
run_program(new_param2)
#Wait until all programs are complete -> Parse Output
Output=[[value,param],...]
#Create new guess
#Repeat
Number of variable can range from 3-12 so something such as this could potentially speed up the code from taking a year down to a week. All variables are dependent on each other and I am only looking for local minima from the initial guess. I have started an implementation using hessian matrices; however, that is quite involved. Is there anything out there that either does this, is there a simpler way, or any suggestions to get started?
So the primary question is the following:
Is there an algorithm that takes a starting guess, generates multiple guesses, then uses those multiple guesses to create a new guess, and repeats until a threshold is found. Only analytic derivatives are available. What is a good way of going about this, is there something built already that does this, is there other options?
Thank you for your time.
As a small update I do have this working by calculating simple parabolas through the three points of each dimension and then using the minima as the next guess. This seems to work decently, but is not optimal. I am still looking for additional options.
Current best implementation is parallelizing the inner loop of powell's method.
Thank you everyone for your comments. Unfortunately it looks like there is simply not a concise answer to this particular problem. If I get around to implementing something that does this I will paste it here; however, as the project is not particularly important or the need of results pressing I will likely be content letting it take up a node for awhile.

I had the same problem while I was in the university, we had a fortran algorithm to calculate the efficiency of an engine based on a group of variables. At the time we use modeFRONTIER and if I recall correctly, none of the algorithms were able to generate multiple guesses.
The normal approach would be to have a DOE and there where some algorithms to generate the DOE to best fit your problem. After that we would run the single DOE entries parallely and an algorithm would "watch" the development of the optimizations showing the current best design.
Side note: If you don't have a cluster and needs more computing power HTCondor may help you.

Are derivatives of your goal function available? If yes, you can use gradient descent (old, slow but reliable) or conjugate gradient. If not, you can approximate the derivatives using finite differences and still use these methods. I think in general, if using finite difference approximations to the derivatives, you are much better off using conjugate gradients rather than Newton's method.
A more modern method is SPSA which is a stochastic method and doesn't require derivatives. SPSA requires much fewer evaluations of the goal function for the same rate of convergence than the finite difference approximation to conjugate gradients, for somewhat well-behaved problems.

There are two ways of estimating gradients, one easily parallelizable, one not:
around a single point, e.g. (f( x + h directioni ) - f(x)) / h;
this is easily parallelizable up to Ndim
"walking" gradient: walk from x0 in direction e0 to x1,
then from x1 in direction e1 to x2 ...;
this is sequential.
Minimizers that use gradients are highly developed, powerful, converge quadratically (on smooth enough functions).
The user-supplied gradient function
can of course be a parallel-gradient-estimator.
A few minimizers use "walking" gradients, among them Powell's method,
see Numerical Recipes p. 509.
So I'm confused: how do you parallelize its inner loop ?
I'd suggest scipy fmin_tnc
with a parallel-gradient-estimator, maybe using central, not one-sided, differences.
(Fwiw,
this
compares some of the scipy no-derivative optimizers on two 10-d functions; ymmv.)

I think what you want to do is use the threading capabilities built-in python.
Provided you your working function has more or less the same run-time whatever the params, it would be efficient.
Create 8 threads in a pool, run 8 instances of your function, get 8 result, run your optimisation algo to change the params with 8 results, repeat.... profit ?

If I haven't gotten wrong what you are asking, you are trying to minimize your function one parameter at the time.
you can obtain it by creating a set of function of a single argument, where for each function you freeze all the arguments except one.
Then you go on a loop optimizing each variable and updating the partial solution.
This method can speed up by a great deal function of many parameters where the energy landscape is not too complex (the dependency between the parameters is not too strong).
given a function
energy(*args) -> value
you create the guess and the function:
guess = [1,1,1,1]
funcs = [ lambda x,i=i: energy( guess[:i]+[x]+guess[i+1:] ) for i in range(len(guess)) ]
than you put them in a while cycle for the optimization
while convergence_condition:
for func in funcs:
optimize fot func
update the guess
check for convergence
This is a very simple yet effective method of simplify your minimization task. I can't really recall how this method is called, but A close look to the wikipedia entry on minimization should do the trick.

You could do parallel at two parts: 1) parallel the calculation of single iteration or 2) parallel start N initial guessing.
On 2) you need a job controller to control the N initial guess discovery threads.
Please add an extra output on your program: "lower bound" that indicates the output values of current input parameter's decents wont lower than this lower bound.
The initial N guessing thread can compete with each other; if any one thread's lower bound is higher than existing thread's current value, then this thread can be dropped by your job controller.

Parallelizing local optimizers is intrinsically limited: they start from a single initial point and try to work downhill, so later points depend on the values of previous evaluations. Nevertheless there are some avenues where a modest amount of parallelization can be added.
As another answer points out, if you need to evaluate your derivative using a finite-difference method, preferably with an adaptive step size, this may require many function evaluations, but the derivative with respect to each variable may be independent; you could maybe get a speedup by a factor of twice the number of dimensions of your problem. If you've got more processors than you know what to do with, you can use higher-order-accurate gradient formulae that require more (parallel) evaluations.
Some algorithms, at certain stages, use finite differences to estimate the Hessian matrix; this requires about half the square of the number of dimensions of your matrix, and all can be done in parallel.
Some algorithms may also be able to use more parallelism at a modest algorithmic cost. For example, quasi-Newton methods try to build an approximation of the Hessian matrix, often updating this by evaluating a gradient. They then take a step towards the minimum and evaluate a new gradient to update the Hessian. If you've got enough processors so that evaluating a Hessian is as fast as evaluating the function once, you could probably improve these by evaluating the Hessian at every step.
As far as implementations go, I'm afraid you're somewhat out of luck. There are a number of clever and/or well-tested implementations out there, but they're all, as far as I know, single-threaded. Your best bet is to use an algorithm that requires a gradient and compute your own in parallel. It's not that hard to write an adaptive one that runs in parallel and chooses sensible step sizes for its numerical derivatives.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.