During the course of my training process, I tend to use a lot of calls to torch.cat() and copying tensors into new tensors. How are these operations handled by autograd? Is the gradient value affected by these operations?
As pointed out in the comments, cat is a mathematical function. For example, we could write the following (special case) definition of cat in more traditional mathematical notation as
The Jacobian of this function w.r.t. either of its inputs can be expressed as
Since the Jacobian is well defined you can, of course, apply back-propagation.
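For concreteness, here is one way such a special case and its Jacobians could be written. This is only a sketch: it assumes two vector inputs a and b of lengths n and m, and the symbols are illustrative rather than necessarily the ones used in the original answer.

```latex
% Assumed special case: a \in \mathbb{R}^n, b \in \mathbb{R}^m, concatenated into one vector.
\operatorname{cat}(a, b) =
\begin{bmatrix} a \\ b \end{bmatrix} \in \mathbb{R}^{n+m},
\qquad
\operatorname{cat}(a, b)_i =
\begin{cases}
a_i     & 1 \le i \le n \\
b_{i-n} & n < i \le n + m
\end{cases}

% Jacobians with respect to each input: an identity block padded with zeros.
\frac{\partial \operatorname{cat}(a, b)}{\partial a} =
\begin{bmatrix} I_n \\ 0_{m \times n} \end{bmatrix},
\qquad
\frac{\partial \operatorname{cat}(a, b)}{\partial b} =
\begin{bmatrix} 0_{n \times m} \\ I_m \end{bmatrix}
```

Under these assumptions, multiplying an upstream gradient by either Jacobian simply selects the entries that belong to that input, so the gradient values pass through unchanged.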
In practice you generally wouldn't define these operations with such notation, and a fully general definition of PyTorch's cat operation written this way would be cumbersome.
That said, internally autograd uses backward algorithms that take into account the gradients of such "index style" operations just like any other function.
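To make the "index style" idea tangible, here is a minimal pure-Python sketch of what a backward rule for concatenation amounts to. The functions cat, cat_backward and loss are hypothetical names for illustration; this is not PyTorch's actual implementation.

```python
def cat(a, b):
    """Forward pass: concatenate two 1-D sequences of floats."""
    return list(a) + list(b)


def cat_backward(grad_out, len_a):
    """Backward pass: split the upstream gradient at the original boundary."""
    return grad_out[:len_a], grad_out[len_a:]


def loss(a, b):
    """Toy scalar loss: a weighted sum of the concatenated vector."""
    weights = [1.0, 2.0, 3.0, 4.0, 5.0]
    return sum(w * y for w, y in zip(weights, cat(a, b)))


a = [0.5, -1.0]
b = [2.0, 0.0, 1.5]

# For this toy loss, d(loss)/d(cat output) is just the weight vector.
grad_y = [1.0, 2.0, 3.0, 4.0, 5.0]
grad_a, grad_b = cat_backward(grad_y, len(a))
print(grad_a)  # [1.0, 2.0]
print(grad_b)  # [3.0, 4.0, 5.0]
```

In PyTorch itself the same behaviour can be observed by calling backward() on a scalar function of torch.cat([a, b]) and inspecting a.grad and b.grad: each input receives exactly its own slice of the upstream gradient.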
Related
I am using TensorFlow v1.14 for creating networks and training them. Everything works fine and I don't have any problem with the code. I use the function tf.reduce_min() in my loss function. For the gradients to flow, it is essential that the loss function is differentiable, but a min operator is not differentiable as such. This link gives the necessary explanation for the tf.reduce_min() function, but without references.
In general there are functions in Tensorflow (tf.cond, tf.where, among many more) that are inherently not differentiable by their definition. I want to know how these are made differentiable by defining "pseudo gradients" and the proper references to documentation. Thanks.
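As a rough illustration of the idea behind such "pseudo gradients": min is piecewise linear, so it is differentiable everywhere except at ties, and a backward rule can simply route the upstream gradient to the element that was selected. The sketch below is plain Python with hypothetical function names; it sends the gradient to the first minimum, and how tf.reduce_min actually treats ties is not shown here.

```python
def reduce_min(xs):
    """Forward pass: return the minimum and the index where it occurs."""
    idx = min(range(len(xs)), key=xs.__getitem__)
    return xs[idx], idx


def reduce_min_backward(grad_out, idx, n):
    """Backward pass: only the selected element receives gradient."""
    grad = [0.0] * n
    grad[idx] = grad_out
    return grad


xs = [3.0, 1.0, 2.0]
value, idx = reduce_min(xs)
grad = reduce_min_backward(1.0, idx, len(xs))
print(value, grad)  # 1.0 [0.0, 1.0, 0.0]
```

Selection-style ops such as tf.where follow the same pattern: the gradient flows to whichever branch was picked for each element, and the condition itself gets no gradient.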
I am not too convinced by the parameter optimization classes sklearn provides. In fact, GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) just loops over the parameters I pass in via param_grid. RandomizedSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) seems like a we-gave-up approach to me.
One of my approaches was to re-use adaptive integration of a function (used in numerical mathematics): basically reduce the search-space of a parameter in every iteration until a certain error-threshold is reached. The biggest advantage of this method is that the error is also reduced in each iteration.
The problem is that certain parameter-values may be untouched if you consider the different score-values (precision, roc-auc etc.) as functions. I still got good results in one case, not so good in other cases.
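A minimal sketch of the shrinking-interval idea for a single numeric parameter might look like the following. Everything here is an assumption made for illustration: zoom_search and toy_score are hypothetical names, and toy_score is a stand-in for a cross-validated score, not an sklearn call.

```python
def zoom_search(score, low, high, points=5, shrink=0.5, tol=1e-3, max_iter=50):
    """Maximise score(x) on [low, high] by repeatedly shrinking the interval.

    Each round evaluates `points` evenly spaced candidates, remembers the best
    one seen so far, and narrows the interval around it by the factor `shrink`.
    """
    best_x, best_s = None, float("-inf")
    for _ in range(max_iter):
        step = (high - low) / (points - 1)
        for i in range(points):
            x = low + i * step
            s = score(x)
            if s > best_s:
                best_x, best_s = x, s
        half = (high - low) * shrink / 2
        # Re-centre on the best point, staying inside the current interval.
        low, high = max(low, best_x - half), min(high, best_x + half)
        if high - low < tol:
            break
    return best_x, best_s


def toy_score(x):
    """Hypothetical smooth score with a single peak at x = 2."""
    return -(x - 2.0) ** 2


best_x, best_s = zoom_search(toy_score, 0.0, 10.0)
print(best_x, best_s)
```

The caveat mentioned above shows up directly in this sketch: once the interval has shrunk, values outside it are never evaluated again, so a score with several peaks can lead the search into the wrong region.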
What would be a good mathematical approach to optimize values more efficiently than GridSearchCV and RandomizedSearchCV?
I'm looking to back-propagate gradients through a singular value decomposition for regularisation purposes. PyTorch currently does not support backpropagation through a singular value decomposition.
I know that I could write my own custom function that operates on a Variable; takes its .data tensor, applies the torch.svd to it, wraps a Variable around its singular values and returns it in the forward pass, and in the backward pass applies the appropriate Jacobian matrix to the incoming gradients.
However, I was wondering whether there is a more elegant (and potentially faster) solution, where I could override the "Type Variable doesn't implement stateless method svd" error directly, call LAPACK, etc.?
If someone could guide me through the appropriate steps and source files I need to look at, I'd be very grateful. I suppose these steps would similarly apply to other linear algebra operations which have no associated backward method currently.
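For reference, the forward/backward pair described in the question can be sketched in NumPy as below. This is only an illustration under stated assumptions: it covers gradients through the singular values only (not U or V), it assumes distinct singular values, and the function names are hypothetical. In current PyTorch the equivalent wrapper would subclass torch.autograd.Function instead of handling Variable and .data by hand.

```python
import numpy as np


def singular_values_forward(A):
    """Forward pass: return the singular values plus what backward needs."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return s, (U, Vt)


def singular_values_backward(grad_s, saved):
    """Backward pass: dLoss/dA = U @ diag(dLoss/ds) @ V^T."""
    U, Vt = saved
    return (U * grad_s) @ Vt


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
s, saved = singular_values_forward(A)

# Example regulariser: the nuclear norm, i.e. the sum of the singular values.
grad_A = singular_values_backward(np.ones_like(s), saved)
print(grad_A.shape)  # (4, 3)
```

The rule follows from the fact that each singular value changes by u_i^T dA v_i under a small perturbation dA, which is why no explicit Jacobian matrix has to be formed.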
torch.svd with forward and backward pass is now available in the PyTorch master:
http://pytorch.org/docs/master/torch.html#torch.svd
You need to install PyTorch from source:
https://github.com/pytorch/pytorch/#from-source
PyTorch's torch.linalg.svd operation supports gradient calculations, but note:
Gradients computed using U and Vh may be unstable if input is not full rank or has non-unique singular values.
I have some functional, such as S[f] = \int_\Omega f^2(x) dx. If you're familiar with physics, it's the action. This object takes in a function defined on a certain domain \Omega and gives you a number. The math jargon for this is a functional.
Now I need to minimize this thing with respect to f. I know SciPy has an optimize package that allows one to minimize multivariable functions, but I am curious whether there is a better way, since with that approach I would be minimizing over ~10,000 variables (the functions are essentially just lists of 10,000 numbers).
Do I have any other options?
You could use symbolic regression to find the function.
There are several packages available:
deap
glyph
gplearn
monkeys
Here is a good paper on symbolic regression by Schmidt and Lipson.
Although it is designed more for neural networks, TensorFlow sounds like it would work for you. It can differentiate vector expressions and optimize them using gradient descent.
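To show what gradient descent on a discretised functional looks like without any framework, here is a small NumPy sketch for the S[f] from the question. The domain [0, 1], the starting guess, the step size and the function names are all assumptions made for illustration.

```python
import numpy as np


def action(f, dx):
    """Riemann-sum approximation of S[f] = integral of f(x)^2 dx."""
    return np.sum(f ** 2) * dx


def action_grad(f, dx):
    """Gradient of the discretised action with respect to each sample f_i."""
    return 2.0 * f * dx


n = 10_000
x = np.linspace(0.0, 1.0, n)          # assumed domain Omega = [0, 1]
dx = x[1] - x[0]
f = np.sin(2 * np.pi * x) + 1.0       # arbitrary starting guess

step = 0.25 / dx                      # assumed step size; halves f each update
for _ in range(60):
    f = f - step * action_grad(f, dx)

print(action(f, dx))
```

The useful part is the analytic gradient: with it, 10,000 variables is not a large problem. The same two functions could also be handed to scipy.optimize.minimize via its jac argument with method="L-BFGS-B", which avoids the finite-difference gradient estimates that make the default usage slow at this size.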
I would like to get the gradient of tf.cholesky with respect to its input. At the moment, tf.cholesky does not have a registered gradient:
LookupError: No gradient defined for operation 'Cholesky' (op type: Cholesky)
The code used to generate this error is:
import tensorflow as tf

A = tf.diag(tf.ones([3]))         # 3x3 identity matrix
chol = tf.cholesky(A)             # the forward pass itself works
cholgrad = tf.gradients(chol, A)  # raises the LookupError shown above
While it is possible for me to compute the gradient myself and register it, the only existing implementations of the Cholesky gradient I have seen involve for loops and need the shape of the input matrix. However, to the best of my knowledge, symbolic loops aren't currently available in TensorFlow.
One possible workaround to getting the shape of the input matrix A would probably be to use:
[int(elem) for elem in list(A.get_shape())]
But this approach doesn't work if the dimensions of A depend on a TensorFlow placeholder object with shape TensorShape([Dimension(None)]).
If anyone has any idea for how to compute and register a gradient of tf.cholesky, I would very much appreciate knowing about it.
We discussed this a bit in the answers and comments to this question: TensorFlow cholesky decomposition.
It might (?) be possible to port the Theano implementation of CholeskyGrad, provided its semantics are actually what you want. Theano's is based upon Smith's "Differentiation of the Cholesky Algorithm".
If you implement it as a C++ operation that the Python just calls into, you have unrestricted access to all the looping constructs you could desire, and anything Eigen provides. If you wanted to do it in pure tensorflow, you could use the control flow ops, such as tf.control_flow_ops.While to loop.
Once you know the actual formula you want to apply, the answer here: matrix determinant differentiation in tensorflow shows how to implement and register a gradient for an op in tensorflow.
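As one candidate for "the actual formula", here is a NumPy sketch of a closed-form reverse-mode rule for the Cholesky factor. Note that this is a different technique from Smith's loop-based algorithm mentioned above: it uses only matrix products and triangular solves, so it needs no explicit loops and no static shape. The function names are hypothetical, A is assumed symmetric positive definite, and this is not a registered TensorFlow gradient.

```python
import numpy as np


def phi(X):
    """Keep the lower triangle of X and halve its diagonal."""
    out = np.tril(X)  # np.tril returns a copy, so X is not modified
    out[np.diag_indices_from(out)] *= 0.5
    return out


def cholesky_backward(A, L_bar):
    """Map dLoss/dL to dLoss/dA for L = cholesky(A), with A symmetric positive definite."""
    L = np.linalg.cholesky(A)
    P = phi(L.T @ L_bar)
    S = np.linalg.solve(L.T, P)        # L^{-T} P
    S = np.linalg.solve(L.T, S.T).T    # (L^{-T} P) L^{-1}
    return 0.5 * (S + S.T)             # symmetrise, since A is symmetric


rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4.0 * np.eye(4)          # symmetric positive definite test matrix
L_bar = np.tril(rng.standard_normal((4, 4)))  # some upstream gradient dLoss/dL
A_bar = cholesky_backward(A, L_bar)
print(A_bar.shape)  # (4, 4)
```

Before registering anything like this as a gradient, it is worth comparing it against finite differences with symmetric perturbations of A, since the symmetrisation step encodes the assumption that both triangles of A move together.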
You could also create an issue on github to request this feature, though, of course, you'll probably get it faster if you implement it yourself and then send in a pull request. :)