I would like to run the same RNN cell over two inputs in TensorFlow.
My code:
def lstm_cell():
    return tf.contrib.rnn.BasicLSTMCell(self.hidden_size, forget_bias=1.0, state_is_tuple=True)

self.forward_cell = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(layers)], state_is_tuple=True)
self.initial_state = self.forward_cell.zero_state(self.batch_size, tf.float32)

outputs1, state1 = tf.nn.dynamic_rnn(self.forward_cell, input1, initial_state=self.initial_state)
outputs2, state2 = tf.nn.dynamic_rnn(self.forward_cell, input2, initial_state=self.initial_state)
My question now is: is this the correct code to do what I want, i.e. use the SAME RNN on both inputs and share the weights?
On a similar post I found a solution using reuse_variables(): Running the same RNN over two tensors in tensorflow
I would go for that, but with my current solution I do not get a reuse error, which confuses me. When I print my variables, everything looks fine too.
Could you explain why there is no reuse error in my case, and if this is correct?
Update:
After I double-checked the source code in 1.6, I found that my memories from early versions are no longer accurate (so thanks for bringing this up!). Your code indeed reuses the cell variables, because cells are initialized lazily and only once (see the RNNCell.build() method, which actually creates the kernel and bias). After the cell is built, it is not rebuilt upon the next call. This means that a single instance of a cell always holds the same variables, no matter how often it is used in different networks, until you manually reset its built state. That's why the reuse parameter no longer matters.
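For what it's worth, a quick check (not from the original post) you can run after building both networks:

for v in tf.trainable_variables():
    print(v.name, v.shape)
# With shared cells each LSTM kernel and bias appears exactly once,
# not one set per dynamic_rnn call.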
Original answer (no longer valid):
Your current code creates two independent RNN layers (each one deep), with the same initial state. This means they have different weight matrices, different nodes in the graph, etc. TensorFlow has nothing to complain about, because it doesn't know they are intended to be shared. That's why you should specify reuse=True before calling tf.nn.dynamic_rnn, as the question you refer to suggests; this will cause TensorFlow to share the kernels of all cells.
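For reference, a minimal sketch of that variable-scope approach (hypothetical scope name, adapted from the linked question):

# Both dynamic_rnn calls live in the same variable scope; marking the scope
# for reuse before the second call maps it onto the same kernels.
with tf.variable_scope("shared_rnn") as scope:
    outputs1, state1 = tf.nn.dynamic_rnn(self.forward_cell, input1,
                                         initial_state=self.initial_state)
    scope.reuse_variables()
    outputs2, state2 = tf.nn.dynamic_rnn(self.forward_cell, input2,
                                         initial_state=self.initial_state)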
Related
For a project I'm working on I have to compare the performance of a small, simulated dataset to the performance of well-known datasets (e.g. ImageNet, CIFAR-10) using two neural networks, ResNet v2 at different depths and Inception v3.
To account for the imbalance between my set and the big popular ones, one of the methods I want to employ is augmenting the big sets using either a class from my simulated set or the same class but instead with real-life images. Then, using a pre-trained model, I want to train on both augmented sets and later compare performance.
The issue
The augmentation works fine: I copy over the .tfrecord files containing only images of the extra class to the original dataset and I modify the labels.txt file to contain the appropriate extra labels. However, the issue occurs when trying to train on this augmented dataset:
I0125 16:40:31.279634 140094914221888 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Assign requires shapes of both tensors to match. lhs shape= [258] rhs shape= [257]
which is to be expected, as the graph cannot handle the extra class.
My attempted solution
To alleviate this issue I tried applying the same method as was proposed in this answer, using the Python script below (modified for brevity):
import numpy as np
import tensorflow as tf

# TRAIN_PATH and CHECKPOINT_NUM are defined elsewhere in the script
checkpoint = {}
saver = tf.compat.v1.train.import_meta_graph(f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}.meta")
with tf.Session() as sess:
    saver.restore(sess, f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}")
    # get appropriate nodes where number of classes is one of its dimensions
    if "inception_v3" in TRAIN_PATH:
        target_nodes = [
            '<nodes where one of its dimensions is equal to the number of classes>'
        ]
    elif "resnet_v2" in TRAIN_PATH:
        target_nodes = [
            '<nodes where one of its dimensions is equal to the number of classes>'
        ]
    # select the nodes previously defined
    nodes = [node for node in tf.global_variables() if node.name in target_nodes]
    for node in nodes:
        # load the values of those nodes from the checkpoint
        checkpoint[node.name] = node.eval()
        # extend dimensions for extra class
        checkpoint.update({node.name: np.insert(checkpoint[node.name], checkpoint[node.name].shape[-1], 0.0, axis=-1)})
        # assign new nodes to the checkpoint
        sess.run(tf.assign(node, checkpoint[node.name], validate_shape=False))
    saver.save(sess, f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}")
In short, this script loads the checkpoint, isolates all nodes where one of the dimensions is equal to the number of classes, and increases that dimension. I have verified that those dimensions are always equal to the number of classes by inspecting the checkpoints from the different datasets the models have been trained on. In fact, it seems that those tensors are the only thing that differs between the checkpoints.
Although this method does execute correctly and allows me to run the training, it feels very boneheaded and I'm almost positive it is not at all what I'm supposed to do. However, I'm curious as to what exactly is wrong with it (apart from initializing the new values as 0.0), because aside from a wild fluctuation in loss at the start of the training it seems to work as I had hoped.
Other solutions
When looking further into this issue I also found other users' answers on similar questions suggesting that it is in fact impossible to modify a checkpoint in the way that I want to and suggested transfer learning or modifying the output layer. I am very new to neural network training and I don't know who to believe, as the linked answer seems to suggest that what I'm trying to do should work.
My question
So I ask: which approach is correct? Could my initial approach work, or should I try a different method to fix this issue? Starting from scratch would not be ideal, as the progress so far has taken over a month of continuous training.
I am using a modified version of the TensorFlow Model Garden as my environment to train the networks, with the modifications pertaining to creating custom datasets and training on them with the supplied training scripts.
I want to make an accumulated SGD optimizer for tf.keras (not standalone Keras). I have found a couple of implementations of accumulated SGD optimizers for standalone Keras, including this one on PyPI. Nevertheless, I am using a project which makes use of tf.keras, and from what I have seen it's not a good idea to mix the two.
The problem is that the documentation for implementing such a custom optimizer is not really straightforward. The base class (which I should inherit from) is OptimizerV2 in optimizer_v2.py, which contains some information in its comment section about the task.
The required methods that should be overridden are:
- resource_apply_dense (update variable given gradient tensor is dense)
- resource_apply_sparse (update variable given gradient tensor is sparse)
- create_slots (if your optimizer algorithm requires additional variables)
- get_config (serialization of the optimizer, include all hyper parameters)
Of course, of these only get_config() actually exists in the base class: resource_apply_dense is actually _resource_apply_dense, resource_apply_sparse is _resource_apply_sparse, and create_slots does not even exist in the base class; in subclasses such as SGD in gradient_descent.py it exists as _create_slots.
Anyway, apparently the documentation is not up to date (there is also a GitHub issue pointing out this inconsistency, but I don't remember the link), and this makes the whole procedure difficult. For example, in SGD I have to override the _resource_apply_dense() method, but I cannot understand where the gradients are being calculated and where they are applied to the variables.
The actual code is given below:
def _resource_apply_dense(self, grad, var, apply_state=None):
    var_device, var_dtype = var.device, var.dtype.base_dtype
    coefficients = ((apply_state or {}).get((var_device, var_dtype))
                    or self._fallback_apply_state(var_device, var_dtype))

    if self._momentum:
        momentum_var = self.get_slot(var, "momentum")
        return training_ops.resource_apply_keras_momentum(
            var.handle,
            momentum_var.handle,
            coefficients["lr_t"],
            grad,
            coefficients["momentum"],
            use_locking=self._use_locking,
            use_nesterov=self.nesterov)
    else:
        return training_ops.resource_apply_gradient_descent(
            var.handle, coefficients["lr_t"], grad, use_locking=self._use_locking)
which obviously relies on training_ops.resource_apply_keras_momentum and training_ops.resource_apply_gradient_descent to do the actual job. How can I separate, in the above code, the two parts mentioned in the minimize() method of OptimizerV2? The two parts are:
_compute_gradients() and apply_gradients().
There are a lot of confusing parts in these comments, for example in the base class:
Many optimizer subclasses, such as Adam and Adagrad allocate and
manage additional variables associated with the variables to train.
These are called Slots. Slots have names and you can ask the
optimizer for the names of the slots that it uses.
although if I declare an Adam optimizer and ask for slot names I get an empty list (?).
optimizer = Adam(lr=1e-3)
optimizer.get_slot_names()
[]
Another confusing issue is the use of private methods: it is not clear when they are called and what their purpose is. For example, _prepare_local() is contained within SGD and includes the line:
apply_state[(var_device, var_dtype)]["momentum"] = array_ops.identity(self._get_hyper("momentum", var_dtype))
Anyway, the problem here is that I do not know exactly which approach to follow to create a custom tf.keras optimizer. The instructions included in the comments seem to contradict the actual implemented subclasses, and the latter also seem to delegate the dirty work to the underlying C++ functions without making it clear how this is done or how (in my case) to separate the actions (like the gradient calculation and its application). So, is there any advice someone can provide on how to proceed, and which steps to follow, to accomplish this (relatively) simple task?
I am using tf 1.15 by the way (so the links are from there).
Reference for the optimizer: DiffGrad (Adam-like)
https://github.com/evanatyourservice/diffGrad-tf/blob/master/diffgrad.py
It is based on a paper called DiffGrad; it has good explanations and is generally a good read.
First of all, good question; secondly, the TensorFlow documentation could do a lot better. Answers to the various questions, in no particular order:
In reference to the empty slot list for Adam: as far as I have seen, you have to run model.fit once on a model for the slots to be initialized. I remember reading about this while looking up saving and loading optimizer states (check whether model.compile alone is enough).
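A small check along those lines (a sketch, assuming TF 1.15 with tf.keras):

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=optimizer, loss="mse")
print(optimizer.get_slot_names())   # [] -- slots not created yet

model.fit(np.zeros((2, 4)), np.zeros((2, 1)), epochs=1, verbose=0)
print(optimizer.get_slot_names())   # now populated, e.g. ['m', 'v']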
As for _prepare_local, that line creates the momentum variable from the hyperparameter you set on creation. I suppose it makes it accessible to all the weights the optimizer is trying to update; why they use identity is deep TensorFlow graph stuff.
_prepare_local is generally used to create values that are common across all weights being updated, like decays, learning rates, time steps and such. For every iteration these values are shared across all variables tracked in the optimizer's var_list.
Unlike _prepare_local, slots are separate variables for each weight tracked by the optimizer, so you might have moments, history or a cumulative sum there: anything tied to that specific individual weight.
Gradient compute and apply: if I understand this correctly, computing the gradients takes the loss, does backpropagation and automatic differentiation, and gets you the "gradients" for each weight. When you go to apply them is when the optimizer comes into play with its slots and variables; finally the optimizer does the updating with the computed gradients as inputs.
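To make the structure more concrete, here is a minimal skeleton (my sketch, not the accumulated-SGD logic itself, assuming TF 1.15) showing where each override fits:

import tensorflow as tf
from tensorflow.python.keras.optimizer_v2 import optimizer_v2


class PlainSGD(optimizer_v2.OptimizerV2):
    """Bare-bones SGD, written only to show where each override goes."""

    def __init__(self, learning_rate=0.01, name="PlainSGD", **kwargs):
        super(PlainSGD, self).__init__(name, **kwargs)
        self._set_hyper("learning_rate", learning_rate)

    def _create_slots(self, var_list):
        # Only needed if the algorithm keeps extra per-variable state,
        # e.g. a gradient accumulator: self.add_slot(var, "accumulator")
        pass

    def _resource_apply_dense(self, grad, var, apply_state=None):
        # `grad` is already computed by minimize()/apply_gradients();
        # this method only has to turn it into a variable update.
        lr = self._decayed_lr(var.dtype.base_dtype)
        return var.assign_sub(lr * grad)

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        # Sparse updates (e.g. for embeddings); omitted in this sketch.
        raise NotImplementedError

    def get_config(self):
        config = super(PlainSGD, self).get_config()
        config.update({
            "learning_rate": self._serialize_hyperparameter("learning_rate"),
        })
        return config

An accumulated-SGD variant would add a slot (e.g. a gradient accumulator) in _create_slots and only apply the accumulated sum every few steps inside _resource_apply_dense.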
In several parts of the documentation (e.g. Dataset Iterators here) there are references to Stateful Objects. What exactly are they and what role do they play in the graph?
To clarify, in the Dataset documentation there's an example with the one_shot_iterator that works because it's stateless:
dataset = tf.data.Dataset.range(100)
iterator = dataset.make_one_shot_iterator()
What makes the iterator stateless?
As others have mentioned, stateful objects are those holding a state. Now, a state, in TensorFlow terms, is some value or data that is saved between different calls to tf.Session.run. The most common and basic kind of stateful object is the variable. You can call run once to update a model's parameters, which are variables, and they will maintain their assigned value for the next call to run. This is different from most operations; for example, if you have an addition operation, which takes two tensors and outputs a third one, the output value it computes in one call to run is not saved. Indeed, even if your graph consists only of operations with constant values, every tensor operation will be evaluated each time you call run, even though the result will always be the same. When you assign a value to a variable, however, it will "stick" (and, by the way, take the corresponding memory and, if you choose so, be serialized in checkpoints).
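As a small illustration (a sketch, TF 1.x style):

import tensorflow as tf

v = tf.Variable(0)
increment = tf.assign_add(v, 1)        # stateful: updates v in place
add = tf.constant(1) + tf.constant(2)  # stateless: recomputed on every run

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(increment))  # 1
    print(sess.run(increment))  # 2 -- the variable kept its value between runs
    print(sess.run(add))        # 3 -- nothing is remembered for plain ops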
Dataset iterators are also stateful. When you get a piece of data in one run, it is consumed, and in the next run you get a different piece of data; the iterator "remembers" where it was between runs. That is why, similarly to how you initialize variables, you can initialize iterators (when they support it) to reset them back to a known state.
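For example, an initializable iterator makes that state explicit (a sketch):

import tensorflow as tf

dataset = tf.data.Dataset.range(3)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)   # put the iterator in a known state
    print(sess.run(next_element))    # 0
    print(sess.run(next_element))    # 1 -- it remembered its position
    sess.run(iterator.initializer)   # reset the state
    print(sess.run(next_element))    # 0 again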
Technically speaking, another kind of stateful object is random operations. One usually regards random operations as, well, random, but in reality they hold a random number generator that does have a state which is kept between runs, and if you provide a seed then they will be in a well-defined state when you start the session. However, as far as I know, there is no way to reset random operations to their initial state within the same session.
Note that frequently (when one is not referring to TensorFlow in particular) the term "stateful" is used in a slightly different sense, or at a different level of abstraction. For example, recurrent neural networks (RNNs) are generally said to be stateful, because, conceptually, they have an internal state that changes with every input they receive. When you build an RNN in TensorFlow, however, that internal state does not necessarily have to live in a stateful object! Like any other kind of neural network, RNNs in TensorFlow will in principle have some parameters or weights, typically stored in trainable variables; so, in TensorFlow terms, all trainable models, RNN or not, have stateful objects for the trained parameters. However, the internal state of the RNN is represented in TensorFlow with an input state value and an output state value that you get on each run (see tf.nn.dynamic_rnn), and you can simply start with a "zero" state on each run and forget about the final output state. Of course, you can also, if you want, take the input state from the value of a variable and write the output state back to that variable; then your RNN internal state will be "stateful" for TensorFlow, that is, you would be able to process some data in one run and then "pick up where you left off" in the next run (which may or may not make sense depending on the case). I understand this can be a bit confusing, but I hope it makes sense.
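If you do want that behaviour, a rough sketch (hypothetical sizes, using a GRU cell so the state is a single tensor) could look like this:

import numpy as np
import tensorflow as tf

batch_size, time_steps, input_dim, num_units = 4, 5, 3, 8
inputs = tf.placeholder(tf.float32, [batch_size, time_steps, input_dim])

cell = tf.nn.rnn_cell.GRUCell(num_units)
# Non-trainable variable that persists the RNN state between runs.
saved_state = tf.get_variable("saved_state", [batch_size, num_units],
                              initializer=tf.zeros_initializer(),
                              trainable=False)

outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=saved_state)
# Write the final state back so the next run picks up where this one left off.
keep_state = tf.assign(saved_state, final_state)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.zeros((batch_size, time_steps, input_dim), dtype=np.float32)
    sess.run([outputs, keep_state], feed_dict={inputs: batch})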
I'm studying the code of the TensorFlow convolutional neural network tutorial, which trains a CNN on the CIFAR-10 dataset. The source code is here on GitHub and the documentation is here.
My question is specifically about the use of ExponentialMovingAverage (doc here) in cifar10.py, lines 375-378, which is:
with tf.control_dependencies([apply_gradient_op, variables_averages_op]):
    train_op = tf.no_op(name='train')

return train_op
Here, the variables_averages_op is an operation that updates all the shadow variables and apply_gradient_op is an operation that applies computed gradients to all original variables(which updates the original variables, a.k.a. model weights).
Since control_dependencies doesn't guarantee the execution order of the operations passed to it, the execution order of apply_gradient_op and variables_averages_op is arbitrary in this example. This further indicates that upon running train_op we could end up first updating the original variables and then the corresponding shadow variables, or updating the shadow variables before the original ones. The latter seems unreasonable to me.
According to the official doc of ExponentialMovingAverage(link above), the update of the shadow variable relies on the original variable:
shadow_variable = decay * shadow_variable + (1 - decay) * variable
The update of the original variable should therefore happen before the update of the shadow one, which is not guaranteed by the tutorial code.
Can anyone help me clear that? Thanks.
I believe you are right; it does seem like a bug in the example. It is probably not important in practice, as the order of the variable update and the moving-average update is likely to be stable. Even if it is in the "wrong" order, in the worst case your moving average will be "one step ahead of the variable", which is likely to have a less significant effect than changing your decay from 0.999 to 0.998 or something like that.
Just created a pull request to fix this: https://github.com/tensorflow/models/pull/3946
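For reference, one way to make the intended order explicit (a sketch, using the names from cifar10.py) is to nest the dependencies:

# Apply the gradients first, and only then update the shadow (moving-average)
# variables, so the averages always see the freshly updated weights.
with tf.control_dependencies([apply_gradient_op]):
    variables_averages_op = variable_averages.apply(tf.trainable_variables())
with tf.control_dependencies([variables_averages_op]):
    train_op = tf.no_op(name='train')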
I was checking the Caffe LeNet Tutorial here and a question came to mind:
What's the difference between these two calls:
self.solver.step(1)
and
self.solver.net.forward() # train net
They both seem to train the network at least according to the comment.
Personally I think the first one trains the network on the training data and updates the weights of both net and test_net, while the second one seems to only forward a batch of data and apply the weights learned in the previous step.
If what I think is right, then what is the purpose of the second call in the tutorial? Why does the code do a net.forward? Can't solver.step(1) do this itself?
Thanks for your time
step does one full iteration, covering all three phases: forward evaluation, backward propagation, and update. The call to forward does only the first of these. There are also differences in the signature (parameter list).
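A rough usage sketch (pycaffe, assuming a solver.prototxt exists; the file name is hypothetical):

import caffe

solver = caffe.SGDSolver('solver.prototxt')

# One full training iteration: forward pass, backward pass and weight update.
solver.step(1)

# Forward pass only: runs the training net on the next batch and computes the
# outputs (e.g. the loss), but does not touch the weights.
outputs = solver.net.forward()
print(outputs)  # dict of the net's output blobs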
I discovered a strange behavior in solver.step(1) and solver.net.forward(). When I used a custom layer for the input network, my layer instance needed a variable to be set before use:
solver.net.layers[0].mySet(variable)
That variable was stored as a local attribute of my layer. But when I called solver.step, the variable was not there; however, it was when I used solver.net.forward(). I am not certain, but maybe solver.step is instantiating a new variable for the layer.