What is a "stateful object" in tensorflow? - python

In several parts of the documentation (e.g. Dataset Iterators here) there are references to Stateful Objects. What exactly are they and what role do they play in the graph?
To clarify, in the Dataset documentation there's an example with the one_shot_iterator that works because it's stateless:
dataset = tf.data.Dataset.range(100)
iterator = dataset.make_one_shot_iterator()
What makes the iterator stateless?

As others have mentioned, stateful objects are those holding a state. Now, a state, in TensorFlow terms, is some value or data that is saved between different calls to tf.Session.run. The most common and basic kind of stateful object is the variable. You can call run once to update a model's parameters, which are variables, and they will maintain their assigned value for the next call to run. This is different from most operations; for example, if you have an addition operation, which takes two tensors and outputs a third one, the output value it computes in one call to run is not saved. Indeed, even if your graph consists only of operations with constant values, tensor operations will be evaluated every time you call run, even though the result will always be the same. When you assign a value to a variable, however, it will "stick" (and, by the way, take the corresponding memory and, if you choose so, be serialized in checkpoints).
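A minimal sketch of that difference (TF 1.x style, names are illustrative): the variable keeps the value it was assigned between calls to run, which an ordinary operation would not do.

import tensorflow as tf

counter = tf.Variable(0, name="counter")
increment = tf.assign_add(counter, 1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(increment))  # 1
    print(sess.run(increment))  # 2 -- the variable kept its value between runs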
Dataset iterators are also stateful. When you get a piece of data in one run it is consumed, and in the next run you get a different piece of data; the iterator "remembers" where it was between runs. That is why, similarly to how you initialize variables, you can initialize iterators (when they support it) to reset them back to a known state.
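For instance, a rough sketch with an initializable iterator (TF 1.x style):

import tensorflow as tf

dataset = tf.data.Dataset.range(100)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    print(sess.run(next_element))  # 0
    print(sess.run(next_element))  # 1 -- the iterator advanced between runs
    sess.run(iterator.initializer)  # reset it back to a known state
    print(sess.run(next_element))  # 0 again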
Technically speaking, another kind of stateful object is random operations. One usually regards random operations as, well, random, but in reality they hold a random number generator whose state is kept between runs, and if you provide a seed then they will be in a well-defined state when you start the session. However, as far as I know, there is no way to reset random operations to their initial state within the same session.
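A small sketch of the idea (TF 1.x style): the op is seeded, so the sequence is well-defined when the session starts, but successive runs within a session still give different values because the generator state advances.

import tensorflow as tf

rand = tf.random_uniform([], seed=42)

with tf.Session() as sess:
    print(sess.run(rand))  # some value
    print(sess.run(rand))  # a different value -- the generator state advanced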
Note that frequently (when one is not referring to TensorFlow in particular) the term "stateful" is used in a slightly different sense, or at a different level of abstraction. For example, recurrent neural networks (RNNs) are generally said to be stateful, because, conceptually, they have an internal state that changes with every input they receive. When you make an RNN in TensorFlow, however, that internal state does not necessarily have to live in a stateful object! Like any other kind of neural network, RNNs in TensorFlow will in principle have some parameters or weights, typically stored in trainable variables - so, in TensorFlow terms, all trainable models, RNN or not, have stateful objects for the trained parameters.

However, the internal state of the RNN is represented in TensorFlow with an input state value and an output state value that you get on each run (see tf.nn.dynamic_rnn), and you can just start with a "zero" state on each run and forget about the final output state. Of course, you can also, if you want, take the input state to be the value of a variable and write the output state back to that variable, and then your RNN internal state will be "stateful" for TensorFlow; that is, you would be able to process some data in one run and then "pick up where you left off" in the next run (which may or may not make sense depending on the case). I understand this can be a bit confusing but I hope it makes sense.
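For concreteness, here is a minimal sketch of that last idea (TF 1.x style; sizes and names are made up): the LSTM state lives in non-trainable variables, so it persists between runs until you reinitialize it.

import tensorflow as tf

# Hypothetical sizes, just for illustration.
batch_size, num_steps, num_features, num_units = 4, 5, 3, 8
inputs = tf.placeholder(tf.float32, [batch_size, num_steps, num_features])

cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

# Non-trainable variables that hold the LSTM state between runs.
state_c = tf.Variable(tf.zeros([batch_size, num_units]), trainable=False)
state_h = tf.Variable(tf.zeros([batch_size, num_units]), trainable=False)
initial_state = tf.nn.rnn_cell.LSTMStateTuple(state_c, state_h)

outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state)

# Writing the final state back makes the next run "pick up where it left off".
update_state = tf.group(state_c.assign(final_state.c),
                        state_h.assign(final_state.h))

On each call you would run outputs together with update_state (e.g. sess.run([outputs, update_state], feed_dict={inputs: batch})), and re-run the variables' initializer whenever you want to reset the state.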

Related

Why does setting backward(retain_graph=True) use up a lot of GPU memory?

I need to backpropagate through my neural network multiple times, so I set backward(retain_graph=True).
However, this is causing
RuntimeError: CUDA out of memory
I don't understand why this is.
Is the number of variables or weights doubling? Shouldn't the amount of memory used remain the same regardless of how many times backward() is called?
The source of the issue:
You are right that no matter how many times we call the backward function, the memory should not, in theory, increase.
Yet your issue is not caused by the backpropagation itself, but by the retain_graph argument that you have set to True when calling the backward function.
When you run your network by passing a set of input data, you call the forward function, which will create a "computation graph".
A computation graph contains all the operations that your network has performed.
Then when you call the backward function, the saved computation graph is essentially run backward to determine which weights should be adjusted and in which direction (these are the gradients).
So PyTorch keeps the computation graph in memory in order to be able to call the backward function.
After the backward function has been called and the gradients have been calculated, the graph is freed from memory, as explained in the docs at https://pytorch.org/docs/stable/autograd.html:
retain_graph (bool, optional) – If False, the graph used to compute the grad will be freed. Note that in nearly all cases setting this option to True is not needed and often can be worked around in a much more efficient way. Defaults to the value of create_graph.
Then usually during training we apply the gradients to the network in order to minimise the loss, then we re-run the network, and so we create a new computation graph. Yet there is only one graph in memory at any given time.
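A minimal sketch of that usual loop (made-up model and data): backward() with the default retain_graph=False frees the graph on every iteration, so only one graph is alive at a time.

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # the forward pass builds a fresh graph
    loss.backward()               # default retain_graph=False: the graph is freed here
    optimizer.step()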
The issue:
If you set retain_graph to True when you call the backward function, you will keep in memory the computation graphs of ALL the previous runs of your network.
And since on every run of your network, you create a new computation graph, if you store them all in memory, you can and will eventually run out of memory.
On the first iteration and run of your network, you will have only one graph in memory. Yet on the 10th run of the network, you have 10 graphs in memory. And on the 10000th run you have 10000 graphs in memory. It is not sustainable, and it is understandable why it is not recommended in the docs.
So even if it may seem that the issue is the backpropagation, it is actually the storing of the computation graphs; and since we usually call the forward and backward functions once per iteration or network run, the confusion is understandable.
Solution:
What you need to do is find a way to make your network and architecture work without using retain_graph. Using it will make it almost impossible to train your network, since each iteration increases your memory usage and decreases the speed of training, and in your case it even causes you to run out of memory.
You did not mention why you need to backpropagate multiple times, yet it is rarely needed, and I do not know of a case where it cannot be "worked around". For example, if you need to access variables or weights of previous runs, you could save them inside variables and later access them, instead of trying to do a new backpropagation, as in the sketch below.
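A rough sketch of that workaround (hypothetical model and data): keep detached copies of the values you need, so no graph has to be retained.

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

previous_outputs = []
for step in range(10):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    out = model(x)
    previous_outputs.append(out.detach())  # keep the value, drop its graph
    loss = loss_fn(out, y)
    optimizer.zero_grad()
    loss.backward()   # no retain_graph needed
    optimizer.step()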
You likely need to backpropagate multiple times for another reason, yet believe me, having been in this situation myself, there is likely a way to accomplish what you are trying to do without storing the previous computation graphs.
If you want to share why you need to backpropagate multiple times, maybe others and I could help you more.
More about the backward process:
If you want to learn more about the backward process, it is called the "Jacobian-vector product". It is a bit complex and is handled by PyTorch. I do not yet fully understand it, but this resource seems like a good starting point, as it seems less intimidating than the PyTorch documentation (in terms of algebra): https://mc.ai/how-pytorch-backward-function-works/

Custom metric across rows in TensorFlow

I'm trying to calculate metrics for my TensorFlow model across rows with a common key -- specifically precision at k for an information retrieval task -- and I'm finding this extremely nontrivial. My data include a field that indicates the session ID of each row, and there is a variable number of rows for each session ID (but small, under 100). My task is to train across the rows as independent observations, so I don't want to group on the session ID and train per-session, as that would bias the model. The entire point of the model is to train and evaluate on individual items independently of context, but to evaluate the quality of that evaluation within the context, by session ID.
As a side note, part of the challenge I'm concerned about is data locality as I'm performing distributed training. However it seems that there may be a single evaluator? ("Evaluator is a special task that is not part of the training cluster.")
Once I realized that tf.metrics.precision_at_k calculates precision at k within the class predictions for a given row / data point, and not across rows, I have considered writing a custom metric function to call from within the Estimator train_and_evaluate method that keeps an internal dict of session ID to tuples of labels and predictions, and transforms these into Tensors to feed to tf.metrics.precision_at_k.
Caveats:
I don't know if I can store this dict, as I don't think I can put it in the computational graph. Can / should I try to store its state in the metric method itself? Will that even be retained after the graph is created, and will it be correctly accessed on subsequent calls to the method?
I don't know how or if I can group items with the same session ID onto the same executor -- even if a method like group_by_window or group_by_reducer on the Dataset works, how does this affect locality in a distributed context?
I could reduce my eval set size to fit into memory but I don't know how to force this to run on only one executor.
I haven't had much luck finding any examples or more information online about anything like this, and the TF code and docs can be somewhat unhelpful, so I'd appreciate any advice! Thanks!

Use same RNN twice

I would like to run the same RNN cell over two inputs in Tensorflow.
My code:
def lstm_cell():
    return tf.contrib.rnn.BasicLSTMCell(self.hidden_size, forget_bias=1.0, state_is_tuple=True)

self.forward_cell = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(layers)], state_is_tuple=True)
self.initial_state = self.forward_cell.zero_state(self.batch_size, tf.float32)
outputs1, state1 = tf.nn.dynamic_rnn(self.forward_cell, input1, initial_state=self.initial_state)
outputs2, state2 = tf.nn.dynamic_rnn(self.forward_cell, input2, initial_state=self.initial_state)
My question now is: is this the correct code to do what I want (use the SAME RNN on both inputs, i.e. share the weights)?
On a similar post I found a similar solution using reuse_variables(): Running the same RNN over two tensors in tensorflow
I would go for that, but with my current solution I do not get a reuse error, which confuses me. When I print my variables it seems to be fine, too.
Could you explain why there is no reuse error in my case, and if this is correct?
Update:
After I double-checked the source code in 1.6, I found that my memories from early versions are no longer accurate (so thanks for bringing this up!). Your code indeed reuses the cell variables, because cells are initialized lazily and only once (see the RNNCell.build() method, which actually creates the kernel and bias). After the cell is built, it is not rebuilt upon the next call. This means that a single instance of a cell always holds the same variables, no matter how often it is used in different networks, until you manually reset its built state. That is why the reuse parameter no longer matters.
Original answer (no longer valid):
Your current code creates two independent RNN layers (each one is deep), with the same initial state. This means they have different weight matrices, different nodes in the graph, etc. TensorFlow has nothing to complain about, because it doesn't know they are intended to be shared. That's why you would have to specify reuse=True before calling tf.nn.dynamic_rnn, as the question you refer to suggests; this would cause TensorFlow to share the kernels of all cells.
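If you want to double-check the sharing in your own code, one simple sanity check (after building both tf.nn.dynamic_rnn calls on the same cell instance) is to list the trainable variables; each kernel and bias should appear only once if the weights are really shared.

for v in tf.trainable_variables():
    print(v.name, v.shape)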

Where does a machine learning algorithm store the result?

I think this is kind of "blasphemy" for someone who comes from the AI world, but since I come from the world where we program and get a result, and there is the concept of storing something in memory, here is my question:
Machine learning works by iterations: the more iterations there are, the better our algorithm becomes. But after those iterations, is there a result stored somewhere? Because if I think as a programmer, when I re-run the program, I must store previous results somewhere, or they will be overwritten; or do I need to use, for example, an array to store my results?
For example, if I train my image recognition algorithm with a bunch of data sets of cat pictures, what are the variables I need to add to my algorithm so that, if I use it on an image library, it will succeed every time it finds a cat? But what will I use, since there is nothing saved for my next step?
All the videos and tutorials I have seen only draw a graph as a visual decision-making aid, and do not apply something that can be used in a future program.
For example, in this example, kNN is used to teach how to detect a written digit, but where is the explicit value to use?
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/2_BasicModels/nearest_neighbor.py
NB: to people voting to close or downvoting, please at least give a reason.
the more iterations there are, the better our algorithm becomes, but after those iterations, is there a result stored somewhere
What you're alluding to here is the optimization part.
However to optimize a model, we first have to represent it.
For example, if I'm creating a very simple linear model to predict house prices using its surface in square meters I might go for this model:
price = a * surface + b
That's the representation.
Now that you have represented the model, you want to optimize it, i.e. find the params a and b that minimize the prediction error.
is there a result stored somewhere?
In the above, we say that we have learned the params or weights a and b.
That's what you keep: the weights, which come from optimization (also called training), and of course the model itself.
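As a toy sketch (made-up numbers) of what that looks like in practice: the "result" of training is just the learned numbers a and b, which you can store and reuse.

import numpy as np

surface = np.array([30.0, 50.0, 80.0, 120.0])   # toy training data
price = np.array([90.0, 150.0, 240.0, 360.0])

a, b = np.polyfit(surface, price, 1)   # "training": find the best a and b
print(a, b)                            # these learned weights are the result

np.save("weights.npy", np.array([a, b]))   # store them for later use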
I think there is some confusion. Let's clear it up.
Machine Learning models usually have parameters, and these parameters are trainable. This means a training algorithm finds the "right" values of these parameters in order for the model to work properly on a given task.
This is the learning part. The actual parameter values are "inferred" from the training data.
What you would call the result of the training process is a model. The model is represented by formulas with parameters, and these parameters must be stored. Typically when you use an ML/DL framework (like scikit-learn or Keras), the parameters are stored alongside some information about the type of model, so it can be reconstructed at runtime.
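For example, a small sketch (toy data, hypothetical file name) of the usual scikit-learn workflow: train, save the fitted model with its learned parameters, and load it back later in another program.

import joblib
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit([[30.0], [50.0], [80.0]], [90.0, 150.0, 240.0])   # training on toy data
joblib.dump(model, "house_price_model.joblib")              # store the fitted model

restored = joblib.load("house_price_model.joblib")          # later, in another run
print(restored.predict([[100.0]]))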

Changing learning rate of snapshotted Tensorflow Optimizer

As the question says...is there any way to change the learning rate of an Optimizer object? So say I continue some training procedure using a snapshot of a prior model, but since creating that snapshot, I realise that my model is skipping round the energy landscape instead of settling down and generating some pretty output. Is there any way to load in the Optimizer object and setting a new learning rate post-hoc?
There's the object field _lr in the Python memory space and _lr_t (in the TF memory space) that are numerically equal; however, assigning to _lr doesn't change the value of _lr_t. Presumably it hence does not affect computations on the graph. Sooo, how can that learning rate be changed? Do I have to construct a new Optimizer object that then must also be attached to the network output? And just ignore the "old" Optimizer object restored from the snapshot?
If so, that seems a bit wasteful in terms of memory, and messy in terms of code, rather than just providing some setters.
