I am using TensorFlow v1.14 for creating networks and training them. Everything works fine and I don't have any problem with the code. I use the function tf.reduce_min() in my loss function. For the gradients to flow, it is essential that the loss function is differentiable, but a min operator is not differentiable as such. This link gives the necessary explanation for the tf.reduce_min() function, but without references.
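For example, a quick experiment of my own (not from any documentation) suggests that the gradient is simply routed to the element that attains the minimum:

import tensorflow as tf

# Quick experiment: where does the gradient of tf.reduce_min go?
x = tf.constant([3.0, 1.0, 2.0])
y = tf.reduce_min(x)
g = tf.gradients(y, x)[0]

with tf.Session() as sess:
    print(sess.run(g))  # [0. 1. 0.] -- the whole gradient goes to the minimum element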
In general there are functions in TensorFlow (tf.cond, tf.where, among many more) that are inherently not differentiable by their definition. I want to know how these are made differentiable by defining "pseudo-gradients", and where to find the proper references in the documentation. Thanks.
I am following the tutorial on neural style transfer. The style transfer is done by minimizing a loss function with respect to an image (initialized with the content image). What confuses me is the following piece of code:
preprocessed_input = tf.keras.applications.vgg19.preprocess_input(inputs)
which is part of the call method in the StyleContentModel class. How does TensorFlow know the gradient of this operation? I have checked whether this operation has a gradient function using get_gradient_function in the module tensorflow.python.framework.ops, and as far as I can tell it does not have one.
It is very simple: the function internally uses symbolic tensor operations that are differentiable. TensorFlow can compute gradients through any function that is built from TensorFlow operations, so there is no need to manually define a gradient for each such function.
You can confirm this by looking at the code of that function here, especially the _preprocess_symbolic_input function here, which uses normal scalar operations and Keras backend functions (which are just TensorFlow functions in tf.keras).
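As a quick sanity check (my own snippet, not from the tutorial), you can verify that a gradient flows through preprocess_input with a GradientTape:

import tensorflow as tf

# The preprocessing is built from ordinary TF ops, so autodiff handles it.
image = tf.Variable(tf.random.uniform((1, 224, 224, 3), maxval=255.0))
with tf.GradientTape() as tape:
    preprocessed = tf.keras.applications.vgg19.preprocess_input(image)
    loss = tf.reduce_sum(tf.square(preprocessed))
grad = tape.gradient(loss, image)
print(grad.shape)  # (1, 224, 224, 3): no hand-written gradient needed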
This has nothing to do with the model or gradients. What this function does is transform the input images into the form the pretrained network expects: for models such as MobileNet that means scaling the pixels to the range -1 to +1, while for VGG it means converting RGB to BGR and subtracting the ImageNet channel means. If you use the ImageDataGenerator, it has a preprocessing_function parameter which the generator calls to preprocess the images. Make sure that if you preprocess the training images, you do the same for the test and validation images.
I want to make an accumulated SGD optimizer for tf.keras (not standalone Keras). I have found a couple of implementations of accumulated SGD optimizers for standalone Keras, including this one on PyPI. Nevertheless, I am working on a project that makes use of tf.keras, and from what I have seen it is not a good idea to mix the two.
The problem is that the documentation for writing this custom optimizer is not really straightforward. The base class (which I should inherit from) is OptimizerV2 in optimizer_v2.py, which contains some information about the task in its comment section.
The required methods that should be overridden are:
- resource_apply_dense (update variable given gradient tensor is dense)
- resource_apply_sparse (update variable given gradient tensor is sparse)
- create_slots (if your optimizer algorithm requires additional variables)
- get_config (serialization of the optimizer, include all hyper parameters)
Of course, of these, only get_config() actually exists in the base class: resource_apply_dense is actually _resource_apply_dense, resource_apply_sparse is _resource_apply_sparse, and create_slots does not exist in the base class at all. In subclasses such as SGD in gradient_descent.py, create_slots also exists as _create_slots.
Anyway, the documentation is apparently out of date (there is also a GitHub issue pointing out this inconsistency with the documentation, but I don't remember the link), and this makes the whole procedure difficult. For example, in SGD I have to override the _resource_apply_dense() method, but I cannot understand where the gradients are calculated and where they are applied.
The actual code is given below:
def _resource_apply_dense(self, grad, var, apply_state=None):
  var_device, var_dtype = var.device, var.dtype.base_dtype
  coefficients = ((apply_state or {}).get((var_device, var_dtype))
                  or self._fallback_apply_state(var_device, var_dtype))

  if self._momentum:
    momentum_var = self.get_slot(var, "momentum")
    return training_ops.resource_apply_keras_momentum(
        var.handle,
        momentum_var.handle,
        coefficients["lr_t"],
        grad,
        coefficients["momentum"],
        use_locking=self._use_locking,
        use_nesterov=self.nesterov)
  else:
    return training_ops.resource_apply_gradient_descent(
        var.handle, coefficients["lr_t"], grad, use_locking=self._use_locking)
which obviously relies on training_ops.resource_apply_keras_momentum and training_ops.resource_apply_gradient_descent to do the actual job. How can I separate, starting from the above code, the two parts mentioned in the minimize() method of OptimizerV2? The two parts are:
_compute_gradients() and apply_gradients().
There are a lot of confusing parts in these comments, for example in the base class:
Many optimizer subclasses, such as Adam and Adagrad allocate and
manage additional variables associated with the variables to train.
These are called Slots. Slots have names and you can ask the
optimizer for the names of the slots that it uses.
although if I declare an Adam optimizer and ask for slot names I get an empty list (?).
optimizer = Adam(lr=1e-3)
optimizer.get_slot_names()
[]
Another confusing issue is the use of private methods: it is not clear when they are called or what their purpose is. For example, _prepare_local() is contained within SGD and includes the line:
apply_state[(var_device, var_dtype)]["momentum"] = array_ops.identity(self._get_hyper("momentum", var_dtype))
Anyway, the problem here is that I do not know exactly which approach to follow to create a custom tf.keras optimizer. The instructions in the comments seem to contradict the actual implemented subclasses, and the latter also seem to hand the dirty work off to the underlying C++ functions, without it being clear how this is done or how (in my case) to separate the actions (like the gradient calculation and its application). So, is there any advice someone can provide on how to proceed, and the steps to follow, to accomplish this (relatively) simple task?
I am using tf 1.15 by the way (so the links are from there).
Reference for the optimizer: DiffGrad (kind of Adam-like)
https://github.com/evanatyourservice/diffGrad-tf/blob/master/diffgrad.py
It is based on a paper called DiffGrad; the authors give good explanations and it is generally a good read.
First of all, good question; secondly, the TensorFlow documentation could do a lot better. Answers to the various questions, in no particular order:
In reference to the empty slot list for Adam: as far as I have seen, you have to run model.fit once on a model for the slots to be initialized. I remember reading about this while looking up how to save and load optimizer states (check whether model.compile is enough).
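A minimal sketch of what I mean (the model, shapes, and data here are placeholders of my own):

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
optimizer = tf.keras.optimizers.Adam(lr=1e-3)
model.compile(optimizer=optimizer, loss="mse")

print(optimizer.get_slot_names())  # [] -- no slots before any update has run
model.fit(np.random.rand(8, 4), np.random.rand(8, 1), epochs=1, verbose=0)
print(optimizer.get_slot_names())  # ['m', 'v'] for Adam, once slots exist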
As for _prepare_local, that line creates the momentum value from the hyperparameter you set on creation. I suppose it makes the value accessible to all the weights the optimizer is trying to update; why they use identity is deep TensorFlow graph stuff.
The general purpose of _prepare_local is to create values that are common across all the weights being updated, like decays, learning rates, or time steps. On every iteration these values are shared across all variables tracked in the optimizer's var_list.
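A sketch of how such an override typically looks, modelled on SGD's _prepare_local (MyOptimizer is a placeholder name of my own):

def _prepare_local(self, var_device, var_dtype, apply_state):
    super(MyOptimizer, self)._prepare_local(var_device, var_dtype, apply_state)
    # Cache a per-device/per-dtype copy of a hyperparameter once per step,
    # so every variable update can read it from apply_state.
    apply_state[(var_device, var_dtype)]["momentum"] = tf.identity(
        self._get_hyper("momentum", var_dtype))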
Unlike the values from _prepare_local above, slots are separate variables for each weight tracked by the optimizer, so you might have moments, a history, or a cumulative sum there: anything tied to that specific individual weight.
Gradient compute and apply: if I understand this correctly, _compute_gradients takes the loss, does backpropagation via automatic differentiation, and gets you the gradients for each weight. When you go to apply them is when the optimizer comes into play with its slots and variables. Finally, the optimizer performs the update with the computed gradients as inputs.
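Putting the pieces together, here is a minimal, untested sketch of a gradient-accumulating SGD against the tf 1.15 OptimizerV2 API. The class name AccumSGD, the slot name "accum", and the accum_steps hyperparameter are my own inventions for illustration:

import tensorflow as tf
from tensorflow.python.keras.optimizer_v2 import optimizer_v2


class AccumSGD(optimizer_v2.OptimizerV2):
    """Plain SGD that applies the averaged gradient every accum_steps steps."""

    def __init__(self, learning_rate=0.01, accum_steps=4, name="AccumSGD", **kwargs):
        super(AccumSGD, self).__init__(name, **kwargs)
        self._set_hyper("learning_rate", kwargs.get("lr", learning_rate))
        self._accum_steps = accum_steps

    def _create_slots(self, var_list):
        # One zero-initialized accumulator per trainable variable.
        for var in var_list:
            self.add_slot(var, "accum")

    def _resource_apply_dense(self, grad, var, apply_state=None):
        # The gradient has already been computed by _compute_gradients;
        # this method only decides how to turn it into a variable update.
        accum = self.get_slot(var, "accum")
        lr = tf.cast(self._get_hyper("learning_rate"), var.dtype.base_dtype)
        accum_t = accum.assign_add(grad)
        apply_now = tf.equal(self.iterations % self._accum_steps,
                             self._accum_steps - 1)

        def _apply_and_reset():
            update = var.assign_sub(lr * accum_t / self._accum_steps)
            with tf.control_dependencies([update]):
                return accum.assign(tf.zeros_like(accum))

        return tf.cond(apply_now, _apply_and_reset, lambda: tf.identity(accum_t))

    def get_config(self):
        config = super(AccumSGD, self).get_config()
        config.update({
            "learning_rate": self._serialize_hyperparameter("learning_rate"),
            "accum_steps": self._accum_steps,
        })
        return config

Note that the split you asked about stays intact: minimize() still calls _compute_gradients() for the backward pass, and apply_gradients() eventually lands in _resource_apply_dense above, which is where the slots live.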
I wrote a custom layer that is part of a neural network and it contains some operations that I am using for the first time such as tf.scan and tf.slice.
I can easily test that the forward pass works and that its output makes sense, but how do I know that it will still work during learning, when it has to do backpropagation? Can I safely assume that everything is going to be fine just because the results make sense in the forward pass?
I was thinking that one possibility might be to create a neural network, replace one or two layers with the custom ones I have just created, train it, and see what happens. However, besides taking quite a long time, this is inconclusive: the network might learn in the other layers even if my custom layer does not work well.
In conclusion, is there any way to see that back-propagation will work well, so that I won't have any problems during learning in this layer?
As far as I know, almost all TensorFlow ops are differentiable, including ops such as tf.abs or tf.where, and gradients flow correctly through them. TensorFlow has an automatic differentiation engine that takes any TensorFlow graph and computes derivatives w.r.t. the desired variables.
So if your graph is composed of TensorFlow ops, I wouldn't worry about the gradients being wrong (if you post the code of your layer, I can expand further). However, there are still issues like numerical stability, which can make an otherwise mathematically sound operation fail in practice (e.g. a naive softmax computation, or tf.exp in your graph in general). Apart from that, TensorFlow differentiation should be correct and, from the user's point of view, taken care of.
If you still want to examine your gradients by hand, you can compute the derivatives in your graph using the tf.gradients op, which will get you the gradients you want, and then check by hand whether TensorFlow did the differentiation correctly (see https://www.tensorflow.org/api_docs/python/tf/gradients).
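There is also a built-in numeric-versus-analytic check. A rough sketch in TF 1.x (the tf.scan line is just a stand-in for your custom layer's computation):

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(3, 4))
y = tf.scan(lambda acc, row: acc + row, x)  # cumulative row sums, as an example

with tf.Session():
    # Compares the symbolic gradient against a finite-difference estimate.
    err = tf.test.compute_gradient_error(
        x, (3, 4), y, (3, 4),
        x_init_value=np.random.rand(3, 4).astype(np.float32))
    print("max gradient error:", err)  # should be tiny, e.g. < 1e-3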
I am learning how to use TensorFlow, and at this one particular point I am really stuck and cannot make sense of it. Imagine I have a 5-layer network whose output is represented by output. Now suppose I want to find the gradient of output with respect to layer_2. For that purpose, the code I would write in TensorFlow is something like:
gradients_i_want = tf.gradients(output, layer_2)
Theoretically, this gradient should be calculated via the chain rule. I want to ask whether TensorFlow calculates these gradients via the chain rule, or whether it will just take the derivative of output with respect to layer_2.
TensorFlow will create a graph for your model, where each node is an operation (e.g. an addition, a multiplication, or a combination of them). Basic ops have manually defined gradient functions, and those functions are used when applying the chain rule while travelling backwards through the graph.
If you write your own custom op, you might need to also write the corresponding gradient function.
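For instance, one way to attach your own gradient to a Python-level function is tf.custom_gradient; this sketch uses the straight-through trick purely as an example:

import tensorflow as tf

@tf.custom_gradient
def binarize(x):
    y = tf.round(x)    # forward: piecewise constant, so the true gradient is useless
    def grad(dy):
        return dy      # backward: pretend the op was the identity ("straight-through")
    return y, grad

x = tf.constant([0.2, 0.7])
g = tf.gradients(binarize(x), x)[0]
with tf.Session() as sess:
    print(sess.run(g))  # [1. 1.] thanks to the custom gradient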
I'm looking to back-propagate gradients through a singular value decomposition for regularisation purposes. PyTorch currently does not support backpropagation through a singular value decomposition.
I know that I could write my own custom function that operates on a Variable: take its .data tensor, apply torch.svd to it, wrap a Variable around the singular values and return it in the forward pass, and in the backward pass apply the appropriate Jacobian matrix to the incoming gradients.
However, I was wondering whether there is a more elegant (and potentially faster) solution, where I could override the "Type Variable doesn't implement stateless method svd" error directly, call LAPACK, etc.
If someone could guide me through the appropriate steps and the source files I need to look at, I'd be very grateful. I suppose these steps would similarly apply to other linear algebra operations which currently have no associated backward method.
torch.svd with forward and backward passes is now available in the PyTorch master branch:
http://pytorch.org/docs/master/torch.html#torch.svd
You need to install PyTorch from source:
https://github.com/pytorch/pytorch/#from-source
In current releases, PyTorch's torch.linalg.svd operation supports gradient calculations, but note the caveat from its documentation:
Gradients computed using U and Vh may be unstable if input is not full rank or has non-unique singular values.
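A quick check of my own that autograd flows through the decomposition:

import torch

A = torch.randn(5, 3, requires_grad=True)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
loss = S.sum()        # the nuclear norm, a common low-rank regulariser
loss.backward()
print(A.grad.shape)   # torch.Size([5, 3])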