I wrote a custom layer that is part of a neural network and it contains some operations that I am using for the first time such as tf.scan and tf.slice.
I can easily test that the forward pass works and it makes sense, but how do I know that it will still work during the learning, when it has to do backpropagation? Can I safely assume that everything is going to be fine because the results I get make sense in the forward pass?
I was thinking that one possibility might be to create a neural network, replace one or two layers with the custom ones I have just created, train it, and see what happens. However, besides taking quite a long time, this approach has a flaw: the network may learn in its other layers even if my custom layer doesn't work well, so I wouldn't be able to tell.
In conclusion, is there any way I can see that back-propagation will work well and I won't have any problems during the learning in this layer?
As far as I know, almost all TensorFlow ops are differentiable, including ops such as tf.abs or tf.where, and gradients flow correctly through them. TensorFlow has an automatic differentiation engine that takes any TensorFlow graph and computes derivatives w.r.t. the desired variables.
So if your graph is composed of TensorFlow ops, I wouldn't worry about the gradients being wrong (if you posted the code of your layer, I could expand further). However, there are still issues like numerical stability, which can make an otherwise mathematically sound operation fail in practice (e.g. a naive softmax computation, or tf.exp in your graph in general). Apart from that, TensorFlow differentiation should be correct and, from the user's point of view, taken care of.
If you still want to examine your gradients by hand, you can compute the derivatives in your graph with the tf.gradients op, which returns the gradients you want so you can check whether TensorFlow did the differentiation correctly (see https://www.tensorflow.org/api_docs/python/tf/gradients).
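As a minimal sketch of that manual check, assuming a toy "layer" of y = sum(|x|) standing in for your custom ops (the tf.compat.v1 shims are only there so the TF 1.x-style tf.gradients call also runs under TF 2.x):

```python
import numpy as np
import tensorflow as tf

# Graph mode is required for tf.gradients; on TF 1.x this line is a no-op.
tf.compat.v1.disable_eager_execution()

x = tf.compat.v1.placeholder(tf.float32, shape=[3])
y = tf.reduce_sum(tf.abs(x))  # toy stand-in for a custom layer

# tf.gradients returns a list with one dy/dx tensor per input tensor.
grad = tf.compat.v1.gradients(y, x)[0]

with tf.compat.v1.Session() as sess:
    g = sess.run(grad, feed_dict={x: np.array([-2.0, 0.5, 3.0], np.float32)})
    print(g)  # d|x|/dx = sign(x) -> [-1.  1.  1.]
```

Comparing this analytic gradient against a finite-difference estimate (perturb each input by a small epsilon and re-run the forward pass) is the usual sanity check.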
Related
I am new to Pytorch and I am now following the tutorial on transforms. I see that the transformations are configured into the dataset object. I am wondering, however, why they aren't configured within the neural network itself. My naive point of view is that the transformations should in any case be the outermost layers of the network, in the same way as the eye comes before the brain to transform light into signals for it; you don't modify the world to adapt it to the brain.
So, is there any technical reason for putting the transformations in the dataset instead of the net? Is it a good/bad practice to put the transformations within my neural network instead? Why?
These are some of the reasons that can explain why one would do this.
We would like to use the same NN code for training as well as testing/inference. Typically during inference, we don't want to do any transformation and hence one might want to keep it out of the network. However, you could argue that one can simply use the model.training flag to skip the transformation.
Most of the transformations happen on the CPU. Doing transformations in the dataset makes it easy to use multi-processing and prefetching: the dataset code can prefetch the data, transform it, and keep it ready to be fed into the NN from a separate worker. If instead we did this inside the forward function, the GPU would idle during the transformations (as these happen on the CPU), likely leading to longer training time.
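A minimal sketch of this pattern, with a hypothetical dataset and transform: the transform runs in `__getitem__`, which is exactly the code the DataLoader's worker processes execute, so raising `num_workers` parallelizes it off the training loop.

```python
import torch
from torch.utils.data import Dataset, DataLoader

def normalize(t):
    # CPU-side transform; with num_workers > 0 this runs in the
    # DataLoader's background workers, overlapping with GPU training.
    return (t - t.mean()) / (t.std() + 1e-8)

class ToyDataset(Dataset):
    """Hypothetical dataset applying its transform per-sample in __getitem__."""
    def __init__(self, data, transform=None):
        self.data, self.transform = data, transform
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        x = self.data[idx]
        return self.transform(x) if self.transform else x

data = [torch.randn(4) for _ in range(8)]
# num_workers=0 here for portability; set it > 0 to prefetch and
# transform batches in parallel with the training step.
loader = DataLoader(ToyDataset(data, transform=normalize),
                    batch_size=4, num_workers=0)
shapes = [batch.shape for batch in loader]
print(shapes)  # [torch.Size([4, 4]), torch.Size([4, 4])]
```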
I am using Tensorflow v1.14 for creating networks and training them. Everything works fine and I don't have any problem with the code. I use the function tf.reduce_min() in my loss function. For the gradients to flow, it is essential that the loss function is differentiable. But a min operator is not differentiable as such. This link gives the necessary explanation for the tf.reduce_min() function, but without references.
In general there are functions in Tensorflow (tf.cond and tf.where, among many more) that are inherently not differentiable by their definition. I want to know how these are made differentiable by defining "pseudo gradients", and the proper references to documentation. Thanks.
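The registered gradient for tf.reduce_min can be observed directly: it routes the incoming gradient entirely to the element(s) that attained the minimum and zero elsewhere, much like an indexing op. A small illustration (tf.compat.v1 shims only so the TF 1.x-style graph code also runs under TF 2.x):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

x = tf.constant([3.0, 1.0, 2.0])
y = tf.reduce_min(x)  # 1.0, attained at index 1

# The "pseudo gradient": dy/dx is 1 at the argmin and 0 elsewhere.
# (With ties, TF splits the gradient evenly among the tied elements.)
grad = tf.compat.v1.gradients(y, x)[0]

with tf.compat.v1.Session() as sess:
    g = sess.run(grad)
    print(g)  # [0. 1. 0.]
```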
I need to implement a neural network which is NOT layer based, meaning that ANY neuron may be connected to any other neuron, and that there's no way to logically organize them in consecutive layers.
What I'm asking for is an example or a reference to proper and clear documentation about how to implement the following:
Originally I had my own implementation in Matlab; however, I've been using TensorFlow and Keras to test simple models, since they let you tune your networks very quickly and the implementations are quite efficient, so I decided to try out more complex models. However, I just got stuck creating this type of network.
HINT: It MAY be OK to create single-neuron layers, as long as you can connect a layer to ANY layer (without caring if it is not adjacent) and to MORE THAN ONE LAYER.
I'm new to Tf and Keras, so a simple python example would be appreciated, although pointing me in the right direction would be OK.
This is an example network (loops are intentional!):
I don't need to train at the moment, just to evaluate models. However, keep in mind that evaluating this kind of network is different too: one possible way is to keep sending the signal through until the output stabilizes, but that is just an example.
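One hedged sketch of the single-neuron-layer hint, using the Keras functional API: each Dense(1) acts as one neuron, and Concatenate lets a neuron read from any earlier neurons, adjacent or not. Note that a Keras graph must be acyclic, so the intentional loops from the example figure would need manual unrolling over time steps (or a custom recurrent cell) rather than this plain feed-forward wiring.

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(2,))
n1 = layers.Dense(1, activation="tanh")(inp)        # neuron 1
n2 = layers.Dense(1, activation="tanh")(n1)         # neuron 2 reads n1
# Neuron 3 reads the input AND n1 AND n2: multiple, non-adjacent sources.
n3 = layers.Dense(1, activation="tanh")(layers.Concatenate()([inp, n1, n2]))
model = tf.keras.Model(inp, n3)

out = model(tf.ones((1, 2)))
print(out.shape)  # (1, 1)
```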
Occasionally we may encounter NaN/Inf in gradients during backprop on seq2seq Tensorflow models. How can we easily find the cause of such an issue, e.g. by locating the op and time step at which the NaN/Inf is produced?
Since the error occurs during backpropagation, we cannot simply observe the gradient values with tf.Print(). Also, in an RNN model, tf.add_check_numerics_ops() doesn't work, and we cannot use tf.check_numerics() unless we dig into the messy tf libraries or reimplement the control flow manually. Meanwhile tfdbg, as a general solution, is hard to use and extremely slow on large models.
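One lightweight workaround is to wrap the gradient tensors returned by tf.gradients in tf.check_numerics by hand, with a distinctive message per tensor, so the raised error names the offending gradient. A sketch with a hypothetical loss whose gradient (1/x) blows up at zero (tf.compat.v1 shims only so the graph-mode code also runs under TF 2.x):

```python
import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

x = tf.compat.v1.placeholder(tf.float32, shape=[2], name="x")
loss = tf.reduce_sum(tf.math.log(x))  # d(loss)/dx = 1/x: Inf at x == 0

grads = tf.compat.v1.gradients(loss, [x])
# Wrap each gradient with a named check; if a NaN/Inf appears, the
# InvalidArgumentError carries this message, identifying the tensor --
# a lighter alternative to tf.add_check_numerics_ops() that avoids
# touching the RNN control-flow internals.
checked = [tf.debugging.check_numerics(g, message="NaN/Inf in grad wrt x")
           for g in grads]

with tf.compat.v1.Session() as sess:
    g = sess.run(checked, feed_dict={x: np.array([1.0, 2.0], np.float32)})[0]
    print(g)  # healthy run: [1.  0.5]
    # feeding [0., 2.] instead would raise InvalidArgumentError with our message
```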
I am learning how to use Tensorflow, and at one particular point I am really stuck and cannot make sense of it. Imagine I have a 5-layer network and the output is represented by output. Now suppose I want to find the gradient of output with respect to layer_2. For that purpose, the code I will write in Tensorflow will be something like:
gradients_i_want = tf.gradients(output, layer_2)
Theoretically, this gradient should be calculated via the chain rule. I want to ask whether Tensorflow calculates these gradients via the chain rule, or whether it just takes the derivative of output with respect to layer_2.
Tensorflow will create a graph for your model, where each node is an operation (e.g. addition, multiplication, or a combination of them). Basic ops have manually defined gradient functions, and those functions will be used when applying the chain rule while traveling backwards through the graph.
If you write your own custom op, you might need to also write the corresponding gradient function.
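One place this machinery becomes visible is tf.custom_gradient, which lets you supply the per-op gradient function yourself while TensorFlow still applies the chain rule around it. A sketch with a made-up op (the clipping is purely for illustration), shown with the TF 2.x eager API for brevity:

```python
import tensorflow as tf

@tf.custom_gradient
def clipped_square(x):
    # Forward pass: y = x^2.
    y = tf.square(x)
    def grad(dy):
        # Backward pass: chain rule by hand -- upstream gradient dy times
        # the local derivative 2x, with an illustrative clip applied.
        return dy * tf.clip_by_value(2.0 * x, -1.0, 1.0)
    return y, grad

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = clipped_square(x)
g = tape.gradient(y, x)
print(g.numpy())  # 1.0  (local gradient 2*3 = 6, clipped to 1)
```

Every built-in op has exactly this kind of gradient function registered internally; the backward traversal just multiplies them together following the chain rule.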