I'm looking to back-propagate gradients through a singular value decomposition for regularisation purposes. PyTorch currently does not support backpropagation through a singular value decomposition.
I know that I could write my own custom function that operates on a Variable: in the forward pass it would take the .data tensor, apply torch.svd to it, wrap a Variable around the singular values and return them; in the backward pass it would apply the appropriate Jacobian matrix to the incoming gradients.
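Roughly, the sketch I have in mind (written here against the current torch.autograd.Function API, and assuming I only need the singular values and that they are distinct, so that dσ_i/dA = u_i v_iᵀ holds):

import torch

class SingularValues(torch.autograd.Function):
    # Forward returns only the singular values; backward maps the incoming
    # gradient on the sigmas back to the input via U diag(grad_sigma) Vh.
    @staticmethod
    def forward(ctx, A):
        U, S, Vh = torch.linalg.svd(A, full_matrices=False)
        ctx.save_for_backward(U, Vh)
        return S

    @staticmethod
    def backward(ctx, grad_S):
        U, Vh = ctx.saved_tensors
        return U @ torch.diag(grad_S) @ Vh

A = torch.randn(4, 3, requires_grad=True)
SingularValues.apply(A).sum().backward()   # e.g. a nuclear-norm style regulariser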
However, I was wondering whether there is a more elegant (and potentially faster) solution, where I could get around the "Type Variable doesn't implement stateless method svd" error directly, call LAPACK myself, etc.?
If someone could guide me through the appropriate steps and source files I need to look at, I'd be very grateful. I suppose these steps would similarly apply to other linear algebra operations which have no associated backward method currently.
torch.svd with forward and backward passes is now available in PyTorch master:
http://pytorch.org/docs/master/torch.html#torch.svd
You need to install PyTorch from source:
https://github.com/pytorch/pytorch/#from-source
PyTorch's torch.linalg.svd operation supports gradient calculations, but note:
Gradients computed using U and Vh may be unstable if the input is not full rank or has non-unique singular values.
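A minimal sketch of using this built-in gradient support (the tensor shapes are arbitrary; penalising only S sidesteps the U/Vh stability caveat above):

import torch

A = torch.randn(5, 3, requires_grad=True)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)

# Regularise the singular values, e.g. penalise their sum (the nuclear norm),
# and let autograd handle the backward pass through the decomposition.
loss = S.sum()
loss.backward()

print(A.grad.shape)   # torch.Size([5, 3])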
During the course of my training process, I tend to use a lot of calls to torch.cat() and copying tensors into new tensors. How are these operations handled by autograd? Is the gradient value affected by these operations?
As pointed out in the comments, cat is itself a mathematical function. For example, a (special case) definition of cat for two vectors a ∈ R^m and b ∈ R^n could be written in more traditional mathematical notation as

cat(a, b) = (a_1, …, a_m, b_1, …, b_n)^T ∈ R^(m+n).

The Jacobian of this function w.r.t. either of its inputs is just an identity block stacked on a zero block:

∂cat(a, b)/∂a = [I_m ; 0_(n×m)],   ∂cat(a, b)/∂b = [0_(m×n) ; I_n].
Since the Jacobian is well defined you can, of course, apply back-propagation.
In reality you generally wouldn't define these operations with such notation, and writing a fully general definition of the cat operation PyTorch uses in this way would be cumbersome.
That said, internally autograd uses backward algorithms that take into account the gradients of such "index style" operations just like any other function.
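To see this concretely, a small sketch (the weights are just illustrative) in which the gradient of a weighted sum of the concatenated result flows straight back into the two inputs:

import torch

# Autograd handles cat like any other op: the gradient of a function of the
# concatenated tensor is routed back to the corresponding slices of the inputs.
a = torch.randn(3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

c = torch.cat([a, b])                  # c = (a_1, a_2, a_3, b_1, b_2)
loss = (c * torch.arange(5.0)).sum()   # weights 0..4
loss.backward()

print(a.grad)   # tensor([0., 1., 2.]) -- the first three weights
print(b.grad)   # tensor([3., 4.])     -- the last two weights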
I am solving a problem of minimizing a function using the BFGS optimizer available in SciPy from https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html.
In certain cases I would like to perform just a single optimization step with my SciPy optimizer. I would think that this should be easy, but I cannot find any way to do it based on the documentation available in the link. There is an option 'maxiter', which I have tried to set to 1. But this seems to be the number of internal iterations of the BFGS algorithm before it returns the new function value, and hence not the number of function evaluations. Does anyone have an idea about how to solve my problem?
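For reference, a minimal sketch of what I have tried (the quadratic objective f and the starting point are only placeholders for my real problem):

import numpy as np
from scipy.optimize import minimize

# Placeholder objective; in my real problem f is more involved.
def f(x):
    return np.sum((x - 1.0) ** 2)

x0 = np.zeros(3)

# My attempt at taking "one step": limit the number of iterations to 1.
res = minimize(f, x0, method='BFGS', options={'maxiter': 1})
print(res.x, res.nfev)   # nfev reports several function evaluations, not one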
Kind regards
I wrote a custom layer that is part of a neural network and it contains some operations that I am using for the first time such as tf.scan and tf.slice.
I can easily test that the forward pass works and it makes sense, but how do I know that it will still work during the learning, when it has to do backpropagation? Can I safely assume that everything is going to be fine because the results I get make sense in the forward pass?
I was thinking that one possibility might be to create a neural network, replace one or two layers with the custom ones I have just created, train it, and see what happens. However, besides taking quite a long time, this wouldn't be conclusive: the network might learn in the other layers while my custom layer still doesn't work well.
In conclusion, is there any way I can see that back-propagation will work well and I won't have any problems during the learning in this layer?
As far as I know, almost all TensorFlow ops are differentiable, including ops such as tf.abs or tf.where, and gradients flow correctly through them. TensorFlow has an automatic differentiation engine that takes any TensorFlow graph and computes derivatives w.r.t. the desired variables.
So if your graph is composed of TensorFlow ops, I wouldn't worry about the gradients being wrong (if you posted the code of your layer, I could expand further). However, there are still issues like numerical stability which can make an otherwise mathematically sound operation fail in practice (e.g. a naive softmax computation, or tf.exp in your graph in general). Apart from that, TensorFlow differentiation should be correct and taken care of from the user's point of view.
If you still want to examine your gradients by hand, you can compute the derivatives in your graph using tf.gradients op, which will get you the gradients that you wish and you can check by hand if TensorFlow did the differentiation correctly. (See https://www.tensorflow.org/api_docs/python/tf/gradients)
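For example, here is a small sketch (written against the 1.x graph API, with a tf.scan-based cumulative sum standing in for your custom layer) that compares the symbolic gradient from tf.gradients against a finite-difference estimate:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float64, shape=[4])
# Stand-in for the custom layer: a cumulative sum built with tf.scan.
y = tf.reduce_sum(tf.scan(lambda acc, v: acc + v, x) ** 2)

grad = tf.gradients(y, x)[0]   # symbolic gradient from autodiff

with tf.Session() as sess:
    x0 = np.array([0.5, -1.0, 2.0, 0.1])
    g_sym = sess.run(grad, {x: x0})

    # Central finite differences on the same function, for comparison.
    eps, g_num = 1e-6, np.zeros_like(x0)
    for i in range(len(x0)):
        d = np.zeros_like(x0)
        d[i] = eps
        g_num[i] = (sess.run(y, {x: x0 + d}) - sess.run(y, {x: x0 - d})) / (2 * eps)

    print(np.max(np.abs(g_sym - g_num)))   # should be tiny, ~1e-8 or smaller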
I am learning how to use TensorFlow and at this one particular point I am really stuck and cannot make sense of it. Imagine I have a 5-layer network and the output is represented by output. Now suppose I want to find the gradient of output with respect to layer_2. For that purpose, the code I would write in TensorFlow would be something like:
gradients_i_want = tf.gradients(output, layer_2)
Theoretically, this gradient should be calculated via the chain rule. I want to ask whether TensorFlow calculates these gradients via the chain rule, or whether it just takes the derivative of output with respect to layer_2 directly.
Tensorflow will create a graph for your model, where each node is an operation (e.g. addition, multiplication, or a combination of them). Basic ops have manually defined gradient functions, and those functions will be used when applying the chain rule while traveling backwards through the graph.
If you write your own custom op, you might need to also write the corresponding gradient function.
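A tiny sketch (the names and values are illustrative, 1.x graph API) showing that tf.gradients(output, layer_2) is exactly the chain-rule derivative of everything between layer_2 and output:

import tensorflow as tf

x = tf.constant([[1.0, 2.0]])
w1 = tf.constant([[0.5], [0.25]])
layer_2 = tf.matmul(x, w1)             # shape [1, 1], value 1.0
output = tf.reduce_sum(layer_2 ** 3)   # further ops stacked on layer_2

# Autodiff result: the chain rule applied node by node backwards through the graph.
g_auto = tf.gradients(output, layer_2)[0]

# Hand-written chain rule for this tiny graph: d(output)/d(layer_2) = 3 * layer_2^2.
g_manual = 3.0 * layer_2 ** 2

with tf.Session() as sess:
    print(sess.run([g_auto, g_manual]))   # both give [[3.0]]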
I would like to get the gradient of tf.cholesky with respect to its input. As of the moment, the tf.cholesky does not have a registered gradient:
LookupError: No gradient defined for operation 'Cholesky' (op type: Cholesky)
The code used to generate this error is:
import tensorflow as tf
A = tf.diag(tf.ones([3]))
chol = tf.cholesky(A)
cholgrad = tf.gradients(chol, A)
While it is possible for me to compute the gradient myself and register it, the only existing implementation I've seen of the Cholesky gradient relies on for loops and requires knowing the shape of the input matrix. However, to the best of my knowledge, symbolic loops aren't currently available in TensorFlow.
One possible workaround to getting the shape of the input matrix A would probably be to use:
[int(elem) for elem in list(A.get_shape())]
But this approach doesn't work if the dimensions of A are dependent on a TensorFlow placeholder object with shape TensorShape([Dimension(None)]).
If anyone has any idea for how to compute and register a gradient of tf.cholesky, I would very much appreciate knowing about it.
We discussed this a bit in the answers and comments to this question: TensorFlow cholesky decomposition.
It might (?) be possible to port the Theano implementation of CholeskyGrad, provided its semantics are actually what you want. Theano's is based upon Smith's "Differentiation of the Cholesky Algorithm".
If you implement it as a C++ operation that Python just calls into, you have unrestricted access to all the looping constructs you could desire, and anything Eigen provides. If you wanted to do it in pure TensorFlow, you could use the control flow ops, such as tf.control_flow_ops.While, to loop.
Once you know the actual formula you want to apply, the answer here: matrix determinant differentiation in tensorflow
shows how to implement and register a gradient for an op in tensorflow.
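For reference, a small sketch of that registration pattern (1.x-era graph API, as in the question's snippet; the name MyDeterminantGrad is made up, and the determinant is used only because its gradient formula is short; a Cholesky gradient would put Smith's formula in the body instead):

import tensorflow as tf

# Register a gradient function under a custom name, then tell the graph to
# use it for a given op type via gradient_override_map.
@tf.RegisterGradient("MyDeterminantGrad")
def _my_determinant_grad(op, grad):
    # d(det A)/dA = det(A) * inv(A)^T, scaled by the incoming gradient.
    A = op.inputs[0]
    return grad * op.outputs[0] * tf.transpose(tf.matrix_inverse(A))

g = tf.Graph()
with g.as_default():
    A = tf.constant([[2.0, 0.0], [0.0, 3.0]])
    with g.gradient_override_map({"MatrixDeterminant": "MyDeterminantGrad"}):
        d = tf.matrix_determinant(A)
    (dA,) = tf.gradients(d, A)   # now resolvable instead of raising LookupError
    with tf.Session(graph=g) as sess:
        print(sess.run(dA))      # [[3., 0.], [0., 2.]]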
You could also create an issue on github to request this feature, though, of course, you'll probably get it faster if you implement it yourself and then send in a pull request. :)