I would like to get the gradient of tf.cholesky with respect to its input. At the moment, tf.cholesky does not have a registered gradient:
LookupError: No gradient defined for operation 'Cholesky' (op type: Cholesky)
The code used to generate this error is:
import tensorflow as tf
A = tf.diag(tf.ones([3]))         # 3x3 identity matrix
chol = tf.cholesky(A)             # Cholesky factor of A
cholgrad = tf.gradients(chol, A)  # raises the LookupError above
While it is possible for me to compute the gradient myself and register it, the only existing approach I've seen for computing the Cholesky gradient uses for loops and requires the shape of the input matrix. However, to the best of my knowledge, symbolic loops aren't currently available in TensorFlow.
One possible workaround to getting the shape of the input matrix A would probably be to use:
[int(elem) for elem in list(A.get_shape())]
But this approach doesn't work if the dimensions of A depend on a TensorFlow placeholder object with shape TensorShape([Dimension(None)]).
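To illustrate the problem (TF 1.x API): for such a placeholder the static shape is simply unknown, and the shape only exists as a run-time tensor:

import tensorflow as tf

A = tf.placeholder(tf.float32, shape=[None, None])
print(A.get_shape())  # (?, ?) -- the static dimensions are unknown
dim = tf.shape(A)     # the shape as a 1-D int32 tensor, known only at run time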
If anyone has any idea for how to compute and register a gradient of tf.cholesky, I would very much appreciate knowing about it.
We discussed this a bit in the answers and comments to this question: TensorFlow cholesky decomposition.
It might (?) be possible to port the Theano implementation of CholeskyGrad, provided its semantics are actually what you want. Theano's is based upon Smith's "Differentiation of the Cholesky Algorithm".
If you implement it as a C++ operation that the Python code just calls into, you have unrestricted access to all the looping constructs you could desire, and to anything Eigen provides. If you want to do it in pure TensorFlow, you can use the control flow ops, such as tf.control_flow_ops.While, to loop.
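As a sketch of the pure-TensorFlow route, the public tf.while_loop wrapper around the control flow ops gives you a symbolic loop (the loop body below is a trivial placeholder, not the Cholesky gradient):

import tensorflow as tf

i0 = tf.constant(0)
acc0 = tf.constant(0.0)

def cond(i, acc):
    return i < 10  # loop condition, evaluated symbolically

def body(i, acc):
    return i + 1, acc + tf.cast(i, tf.float32)  # placeholder loop body

_, total = tf.while_loop(cond, body, [i0, acc0])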
Once you know the actual formula you want to apply, the answer here: matrix determinant differentiation in tensorflow shows how to implement and register a gradient for an op in TensorFlow.
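A minimal sketch of that registration mechanism (TF 1.x; the body below is a placeholder, not the actual Cholesky gradient):

import tensorflow as tf

@tf.RegisterGradient("Cholesky")
def _cholesky_grad(op, grad):
    L = op.outputs[0]  # the Cholesky factor computed in the forward pass
    # ... compute the gradient w.r.t. the input A from L and grad here,
    # e.g. via Smith's algorithm ...
    return grad  # placeholder return value; the real gradient goes here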
You could also create an issue on github to request this feature, though, of course, you'll probably get it faster if you implement it yourself and then send in a pull request. :)
I wrote a custom layer that is part of a neural network and it contains some operations that I am using for the first time such as tf.scan and tf.slice.
I can easily test that the forward pass works and that its output makes sense, but how do I know that it will still work during learning, when it has to do backpropagation? Can I safely assume that everything is going to be fine just because the results make sense in the forward pass?
I was thinking that one possibility might be to create a neural network, replace one or two layers with the custom ones I have just created, train it, and see what happens. However, besides taking quite a long time, this could be misleading: the network might learn in the other layers while my custom layer does not work well anyway.
In conclusion, is there any way I can check that back-propagation will work well, so that I won't have any problems during learning in this layer?
As far as I know, almost all TensorFlow ops are differentiable, including ops such as tf.abs or tf.where, and gradients flow correctly through them. TensorFlow has an automatic differentiation engine that takes any TensorFlow graph and computes derivatives w.r.t. the desired variables.
So if your graph is composed of TensorFlow ops, I wouldn't worry about the gradients being wrong (if you post the code of your layer, I can expand further). However, there are still issues like numerical stability, which can make an otherwise mathematically sound operation fail in practice (e.g. a naive softmax computation, or tf.exp in your graph in general). Apart from that, TensorFlow differentiation should be correct and taken care of, from the user's point of view.
If you still want to examine your gradients by hand, you can compute the derivatives in your graph using the tf.gradients op, which will give you the gradients you want; you can then check by hand whether TensorFlow did the differentiation correctly. (See https://www.tensorflow.org/api_docs/python/tf/gradients)
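For example, a minimal sketch of such a check (TF 1.x API; tf.square here is just a stand-in for your custom layer):

import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.square(x)  # replace with your custom layer

with tf.Session():
    # compares the symbolic gradient against a finite-difference estimate
    err = tf.test.compute_gradient_error(x, [2, 2], y, [2, 2])
    print("max gradient error:", err)  # should be small, e.g. < 1e-3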
I am learning how to use TensorFlow, and at one particular point I am really stuck and cannot make sense of it. Imagine I have a 5-layer network whose output is represented by output. Now suppose I want the gradient of output with respect to layer_2. For that purpose, the code I would write in TensorFlow is something like:
gradients_i_want = tf.gradients(output, layer_2)
Theoretically, this gradient should be calculated via the chain rule. I want to ask whether TensorFlow calculates these gradients via the chain rule, or whether it just takes the derivative of output with respect to layer_2 directly.
TensorFlow will create a graph for your model, where each node is an operation (e.g. an addition, a multiplication, or a combination of them). Basic ops have manually defined gradient functions, and those functions are used when applying the chain rule while travelling backwards through the graph.
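A small sketch to convince yourself that tf.gradients walks the chain rule backwards through the graph (TF 1.x style, matching the question):

import tensorflow as tf

x = tf.constant(2.0)
layer_2 = tf.square(x)   # layer_2 = x^2
output = 3.0 * layer_2   # output = 3 * layer_2

# d(output)/d(layer_2) = 3, found by traversing the graph from output back to layer_2
g = tf.gradients(output, layer_2)
with tf.Session() as sess:
    print(sess.run(g))   # [3.0]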
If you write your own custom op, you might need to also write the corresponding gradient function.
I'm looking to back-propagate gradients through a singular value decomposition for regularisation purposes. PyTorch currently does not support backpropagation through a singular value decomposition.
I know that I could write my own custom function that operates on a Variable: in the forward pass, take its .data tensor, apply torch.svd to it, wrap a Variable around the singular values and return it; in the backward pass, apply the appropriate Jacobian matrix to the incoming gradients.
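For concreteness, a sketch of that approach using the current torch.autograd.Function API; this handles only the gradient through the singular values (often all a regulariser needs) and omits the full SVD Jacobian:

import torch
from torch.autograd import Function

class SVDValues(Function):
    @staticmethod
    def forward(ctx, A):
        U, S, V = torch.svd(A)
        ctx.save_for_backward(U, V)
        return S  # return only the singular values

    @staticmethod
    def backward(ctx, grad_S):
        U, V = ctx.saved_tensors
        # gradient w.r.t. A when only S is used downstream: dA = U diag(dS) V^T
        return U @ torch.diag(grad_S) @ V.t()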
However, I was wondering whether there is a more elegant (and potentially faster) solution, where I could overwrite the "Type Variable doesn't implement stateless method svd" error directly, call LAPACK, etc.?
If someone could guide me through the appropriate steps and source files I need to look at, I'd be very grateful. I suppose these steps would similarly apply to other linear algebra operations which have no associated backward method currently.
torch.svd with forward and backward passes is now available in PyTorch master:
http://pytorch.org/docs/master/torch.html#torch.svd
You need to install PyTorch from source:
https://github.com/pytorch/pytorch/#from-source
PyTorch's torch.linalg.svd operation supports gradient calculations, but note:
Gradients computed using U and Vh may be unstable if input is not full rank or has non-unique singular values.
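A quick sketch of using it, e.g. as a nuclear-norm style regulariser:

import torch

A = torch.randn(5, 3, requires_grad=True)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
loss = S.sum()       # sum of singular values (nuclear norm) as a penalty term
loss.backward()      # gradients flow back through the SVD
print(A.grad.shape)  # torch.Size([5, 3])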
I am trying to derive the conditional distribution of the visible variables, P(v | h), for the Replicated Softmax Model (RSM) or, equivalently, the Restricted Boltzmann Machine (RBM) for word counts, according to the paper "Replicated Softmax: an Undirected Topic Model" by Salakhutdinov and Hinton.
Paper can be found at: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=B04C8D67D381B8106FF6FA4203A86264?doi=10.1.1.164.71&rep=rep1&type=pdf
However, despite all efforts, I've been unable to see how the conditional can turn out to be a softmax distribution:

P(v^i = k | h) = exp(b_k + Sigma_j h_j W_{jk}) / Sigma_{q=1}^{K} exp(b_q + Sigma_j h_j W_{jq})
Also, I'm confused whether the weights W and biases b are a 3D tensor and a 2D matrix respectively, or instead a 2D matrix and a vector. I believe it is the latter. Hoping someone can demonstrate the derivations.
I am looking to implement the RSM to do topic modelling in Python's Theano. I am aware that there are implementations out there, but I prefer to understand the derivation myself so that I can extend or optimize the code without the risk of breaking the model.
P.S. Apologies, this is a repost of https://math.stackexchange.com/questions/2085616/rbm-deriving-the-replicated-softmax-model-rsm, but I did so as there aren't as many users on math.stackexchange.
After some time I found out where I had misunderstood things and managed to derive the equations. Please refer to math.stackexchange:
https://math.stackexchange.com/questions/2085616/rbm-deriving-the-replicated-softmax-model-rsm/2087272#2087272
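For readers who don't want to click through, here is a sketch of the key step in LaTeX (my reading of the paper's notation, where h are the hidden units, W the shared weights, b the visible biases, D the document length and K the dictionary size; see the linked answer for the full derivation):

\begin{aligned}
P(\mathbf{v} \mid \mathbf{h})
  &\propto \exp\bigl(-E(\mathbf{v}, \mathbf{h})\bigr)
   \propto \prod_{i=1}^{D} \exp\Bigl(b_{k_i} + \sum_j h_j W_{j k_i}\Bigr), \\
P(v^i = k \mid \mathbf{h})
  &= \frac{\exp\bigl(b_k + \sum_j h_j W_{jk}\bigr)}
          {\sum_{q=1}^{K} \exp\bigl(b_q + \sum_j h_j W_{jq}\bigr)},
\end{aligned}

where k_i is the dictionary index of word i: the energy factorizes over the D words, and normalizing each word's factor over the K dictionary entries gives the softmax.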
I have a Stochastic Optimal Control problem that I wish to solve using some type of Bayesian simulation-based framework. My problem has the following general structure:
s_{t+1} = r*s_t*(1 - s_t) - x_{t+1} + epsilon_{t+1}
x_{t+1} ~ Beta(u_{t+1}, w_{t+1})
u_{t+1} = f_1(u_t, w_t, s_t, x_t)
w_{t+1} = f_2(u_t, w_t, s_t, x_t)
epsilon_t ~ Normal(0, sigma)
objective function: max_{x_t} E[ Sigma_{t=0}^{T} V(s_t, x_t, c) * rho^t ]
My goal is to explore different functional forms of f_1, f_2, and V to determine how this model differs from a non-stochastic model and from another, simpler stochastic model.
The state variable is s_t and the control variable is x_t, with u_t and w_t representing a belief about the current state. The objective is the expected discounted sum of gains (the function V) over the time period t = 0 to t = T.
I was thinking of using Python, specifically PyMC, to solve this, though I am not sure how to proceed, in particular how to optimize the control variables. I found a book published in 1967, Optimization of Stochastic Systems by Masanao Aoki, that references some Bayesian techniques that may be useful. Is there a current Python implementation that may help? Or is there a much better way to simulate an optimal path using Python?
The first guess coming to my mind is to try neural-network packages like Chainer or Theano, which can track the derivative of your cost function with respect to the control function parameters; they also come with a bunch of optimization routines. You can use numpy.random to generate samples (particles), compose your control functions from the libraries' components, and run them through an explicit Euler scheme for a first try. This will give you the cost function on your particles and its derivative with respect to the parameters, which can be fed to the optimizers.

The issue that can arise here is that the solver's iterations will create a host of derivative-tracking objects.
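A bare-bones sketch of the particle simulation itself in plain numpy (every functional form and constant below, f_1, f_2, V, r, sigma, rho, is a placeholder assumption just to have something runnable; optimizing the control parameters would then come on top of this):

import numpy as np

rng = np.random.default_rng(0)
N, T = 1000, 50                    # number of particles, time horizon
r, sigma, rho = 3.7, 0.05, 0.95    # placeholder constants

def f_1(u, w, s, x):
    return np.maximum(u + 0.1 * s, 0.1)          # placeholder belief update, kept positive

def f_2(u, w, s, x):
    return np.maximum(w + 0.1 * (1.0 - s), 0.1)  # placeholder belief update, kept positive

def V(s, x, c=1.0):
    return c * x - (s - 0.5) ** 2                # placeholder gain function

s = np.full(N, 0.5)   # initial state for every particle
u = np.full(N, 2.0)
w = np.full(N, 2.0)
x = np.zeros(N)
total = np.zeros(N)
for t in range(T):
    u, w = f_1(u, w, s, x), f_2(u, w, s, x)    # belief update from previous step
    x = rng.beta(u, w, size=N)                 # draw the control x_{t+1}
    eps = rng.normal(0.0, sigma, size=N)
    s = r * s * (1.0 - s) - x + eps            # state transition (explicit Euler step)
    total += rho ** t * V(s, x)                # accumulate the discounted gain

print("Monte Carlo estimate of the objective:", total.mean())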
Update: please see this example on GitHub.

There are also a number of hits on GitHub for the keywords particle filter python:
https://github.com/strohel/PyBayes
https://github.com/jerkern/pyParticleEst
There is also a manuscript around that mentions the author implemented such filters in Python, so you might want to contact them.