I am attempting to train a sequence to sequence model using tensorflow and have been looking at their example code.
I want to be able to access the vector embeddings created by the encoder as they seem to have some interesting properties.
However, it really isn't clear to me how this can be.
In the vector representations of words example they talk a lot about what these embeddings can be used for and then don't appear to provide a simple way of accessing them, unless I am mistaken.
Any help figuring out how to access them would be greatly appreciated.
As with all Tensorflow operations, most variables are dynamically created. There are different ways to access these variables ( and their values ). Here, the variable you are interested in is part of the set of trained variables. To access these, we can thus use the tf.trainable_variables() function:
for var in tf.trainable_variables():
print var.name
which will give us - for a GRU seq2seq model, the following list:
embedding_rnn_seq2seq/RNN/EmbeddingWrapper/embedding:0
embedding_rnn_seq2seq/RNN/GRUCell/Gates/Linear/Matrix:0
embedding_rnn_seq2seq/RNN/GRUCell/Gates/Linear/Bias:0
embedding_rnn_seq2seq/RNN/GRUCell/Candidate/Linear/Matrix:0
embedding_rnn_seq2seq/RNN/GRUCell/Candidate/Linear/Bias:0
embedding_rnn_seq2seq/embedding_rnn_decoder/embedding:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/GRUCell/Gates/Linear/Matrix:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/GRUCell/Gates/Linear/Bias:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/GRUCell/Candidate/Linear/Matrix:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/GRUCell/Candidate/Linear/Bias:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/OutputProjectionWrapper/Linear/Matrix:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/OutputProjectionWrapper/Linear/Bias:0
This tells us that the embedding is called embedding_rnn_seq2seq/RNN/EmbeddingWrapper/embedding:0, which we can then use to retrieve a pointer to that variable in our earlier iterator:
for var in tf.trainable_variables():
print var.name
if var.name == 'embedding_rnn_seq2seq/RNN/EmbeddingWrapper/embedding:0':
embedding_op = var
This we can then pass along with other ops to our session-run:
_, loss_t, summary, embedding = sess.run([train_op, loss, summary_op, embedding_op], feed_dict)
and we have ourselves the (batch-list of) embeddings ...
There is a related post, but it is based on tensorflow-0.6 which is quite out of date. So I update his answer in tensorflow-0.8 which is also similar to that in the newest version.
(*represent where to modify)
losses = []
outputs = []
*states = []
with ops.op_scope(all_inputs, name, "model_with_buckets"):
for j, bucket in enumerate(buckets):
with variable_scope.variable_scope(variable_scope.get_variable_scope(),
reuse=True if j > 0 else None):
*bucket_outputs, _ ,bucket_states= seq2seq(encoder_inputs[:bucket[0]],
decoder_inputs[:bucket[1]])
outputs.append(bucket_outputs)
if per_example_loss:
losses.append(sequence_loss_by_example(
outputs[-1], targets[:bucket[1]], weights[:bucket[1]],
softmax_loss_function=softmax_loss_function))
else:
losses.append(sequence_loss(
outputs[-1], targets[:bucket[1]], weights[:bucket[1]],
softmax_loss_function=softmax_loss_function))
return outputs, losses, *states
at python/ops/seq2seq, modify embedding_attention_seq2seq()
if isinstance(feed_previous, bool):
*outputs, states = embedding_attention_decoder(
decoder_inputs, encoder_state, attention_states, cell,
num_decoder_symbols, embedding_size, num_heads=num_heads,
output_size=output_size, output_projection=output_projection,
feed_previous=feed_previous,
initial_state_attention=initial_state_attention)
*return outputs, states, encoder_state
# If feed_previous is a Tensor, we construct 2 graphs and use cond.
def decoder(feed_previous_bool):
reuse = None if feed_previous_bool else True
with variable_scope.variable_scope(variable_scope.get_variable_scope(),reuse=reuse):
outputs, state = embedding_attention_decoder(
decoder_inputs, encoder_state, attention_states, cell,
num_decoder_symbols, embedding_size, num_heads=num_heads,
output_size=output_size, output_projection=output_projection,
feed_previous=feed_previous_bool,
update_embedding_for_previous=False,
initial_state_attention=initial_state_attention)
return outputs + [state]
outputs_and_state = control_flow_ops.cond(feed_previous, lambda: decoder(True), lambda: decoder(False))
*return outputs_and_state[:-1], outputs_and_state[-1], encoder_state
at model/rnn/translate/seq2seq_model.py modify init()
if forward_only:
*self.outputs, self.losses, self.states= tf.nn.seq2seq.model_with_buckets(
self.encoder_inputs, self.decoder_inputs, targets,
self.target_weights, buckets, lambda x, y: seq2seq_f(x, y, True),
softmax_loss_function=softmax_loss_function)
# If we use output projection, we need to project outputs for decoding.
if output_projection is not None:
for b in xrange(len(buckets)):
self.outputs[b] = [
tf.matmul(output, output_projection[0]) + output_projection[1]
for output in self.outputs[b]
]
else:
*self.outputs, self.losses, _ = tf.nn.seq2seq.model_with_buckets(
self.encoder_inputs, self.decoder_inputs, targets,
self.target_weights, buckets,
lambda x, y: seq2seq_f(x, y, False),
softmax_loss_function=softmax_loss_function)
at model/rnn/translate/seq2seq_model.py modify step()
if not forward_only:
return outputs[1], outputs[2], None # Gradient norm, loss, no outputs.
else:
*return None, outputs[0], outputs[1:], outputs[-1] # No gradient norm, loss, outputs.
with all these done, we can get the encoded states by calling :
_, _, output_logits, states = model.step(sess, encoder_inputs, decoder_inputs,
target_weights, bucket_id, True)
print (states)
in the translate.py.
Related
I'm puzzled by the fact that I've seen blocks of code that require tf.GradientTape().watch() to work and blocks that seem to work without it.
For example, this block of code requires the watch() function:
x = tf.ones((2,2)) #Note: In original posting of question,
# I didn't include this key line.
with tf.GradientTape() as t:
# Record the actions performed on tensor x with `watch`
t.watch(x)
# Define y as the sum of the elements in x
y = tf.reduce_sum(x)
# Let z be the square of y
z = tf.square(y)
# Get the derivative of z wrt the original input tensor x
dz_dx = t.gradient(z, x)
But, this block does not:
with tf.GradientTape() as tape:
logits = model(images, images = True)
loss_value = loss_object(labels, logits)
loss_history.append(loss_value.numpy.mean()
grads = tape.gradient(loss_value, model.trainable_variables)
optimizer.apply_gradient(zip(grads, model.trainable_vaiables))
What is the difference between these two cases?
The documentation of watch says the following:
Ensures that tensor is being traced by this tape.
Any trainable variable that is accessed within the context of tape is watched by default. This means we can calculate the gradient with respect to that trainable variable by calling t.gradient(loss, variable). Check the following example:
def grad(model, inputs, targets):
with tf.GradientTape() as tape:
loss_value = loss(model, inputs, targets, training=True)
return loss_value, tape.gradient(loss_value, model.trainable_variables)
Hence no need to use tape.watch in the above code. But sometimes we need to calculate gradients with respect to some non trainable variable. In those cases we need to watch them.
with tf.GradientTape() as t:
t.watch(images)
predictions = cnn_model(images)
loss = tf.keras.losses.categorical_crossentropy(expected_class_output, predictions)
gradients = t.gradient(loss, images)
In the above code, images is input to model and not a trainable variable. I need to calculate the gradient of loss with respect to the images and hence I need to watch it.
Although the answer and comment given above were helpful, I think this code most clearly illustrates what I needed to know.
# Define a 2x2 array of 1's
x = tf.ones((2,2))
x1 = tf.Variable([[1,1],[1,1]],name='x1',dtype=tf.float32)
with tf.GradientTape(persistent=True) as t:
# Record the actions performed on tensor x with `watch`
# Define y as the sum of the elements in x
y = tf.reduce_sum(x) + tf.reduce_sum(x1)
# Let z be the square of y
z = tf.square(y)
# Get the derivative of z wrt the original input tensor x
dz_dx = t.gradient(z, x)
dz_dx1 = t.gradient(z, x1)
# Print results
print("dz_dx =",dz_dx)
print("dz_dx1 =", dz_dx1)
print("type(x) =", type(x))
print("type(x1) =", type(x1))
which gives,
dz_dx = None
dz_dx1 = tf.Tensor(
[[16. 16.]
[16. 16.]], shape=(2, 2), dtype=float32)
type(x) = <class 'tensorflow.python.framework.ops.EagerTensor'>
type(x1) = <class 'tensorflow.python.ops.resource_variable_ops.ResourceVariable'>
In Tensorflow, all variables are are given the property Trainable=True by default. So, they are automatically watched. I neglected (out of ignorance) to originally indicate that x was not a variable, but instead, was an EagerTensor, which is not watched by default.
model.trainable_variables is a list of trainable variables, which are watched by gradient tape, whereas a regular Tensor is not watched without specifically telling the tape to watch it.
Here is the code https://github.com/tensorflow/tensorflow/blob/r0.10/tensorflow/models/rnn/ptb/ptb_word_lm.py,I wonder why we can feed the model by:
cost, state, _ = session.run([m.cost, m.final_state, eval_op],
{m.input_data: x,
m.targets: y,
m.initial_state: state})
Because initial_state isn't with tf.placeholder,so how can we feed it?
In the code, it defines a class. And defines self._initial_state = cell.zero_state(batch_size, data_type()), then state = self._initial_state, and (cell_output, state) = cell(inputs[:, time_step, :], state). After that, self._final_state = state. What's more, it defines a function in the class:
#property
def final_state(self):
return self._final_state
And here comes
state = m.initial_state.eval()
cost, state, _ = session.run([m.cost, m.final_state, eval_op],
{m.input_data: x,
m.targets: y,
m.initial_state: state})
And I have ran the code locally, it quite have the difference without state in feeddict.
Can anyone help?
Because initial_state isn't with tf.placeholder,so how can we feed it?
Placeholder is just a tensorflow tensor; we can feed any tensor using feed_dict mechanism,
And I have ran the code locally, it quite have the difference without state in feed_dict.
Here if you don't feed the learned state from previous batch, state will be initialized as zero for the next batch, hence there will be degradation of results.
With Tensorflow 0.12, there have been changes to the way that MultiRNNCell works, for starters, state_is_tuple is now set to True by default, furthermore, there is this discussion on it:
state_is_tuple: If True, accepted and returned states are n-tuples, where n = len(cells). If False, the states are all concatenated along the column axis. This latter behavior will soon be deprecated.
I'm wondering how exactly I could use a multi layer RNN with GRU cells, here is my code so far:
def _run_rnn(self, inputs):
# embedded inputs are passed in here
self.initial_state = tf.zeros([self._batch_size, self._hidden_size], tf.float32)
cell = tf.nn.rnn_cell.GRUCell(self._hidden_size)
cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=self._dropout_placeholder)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * self._num_layers, state_is_tuple=False)
outputs, last_state = tf.nn.dynamic_rnn(
cell = cell,
inputs = inputs,
sequence_length = self.sequence_length,
initial_state = self.initial_state
)
return outputs, last_state
My inputs look up word ids and return a corresponding embedding vectors. Now, running with the code above I'm greeted by the following error:
ValueError: Dimension 1 in both shapes must be equal, but are 100 and 200 for 'rnn/while/Select_1' (op: 'Select') with input shapes: [?], [64,100], [64,200]
The places I've got a ? in is within my placeholders:
def _add_placeholders(self):
self.input_placeholder = tf.placeholder(tf.int32, shape=[None, self._max_steps])
self.label_placeholder = tf.placeholder(tf.int32, shape=[None, self._max_steps])
self.sequence_length = tf.placeholder(tf.int32, shape=[None])
self._dropout_placeholder = tf.placeholder(tf.float32)
Your main issue is in the setting of the initial_state. Since your state is now a tuple, (more specifically an LSTMStateTuple, you cannot directly assign it to tf.zeros. Instead use,
self.initial_state = cell.zero_state(self._batch_size, tf.float32)
Have a look at the documentation for more.
To use this in code, you will need to pass this tensor in the feed_dict. Do something like this,
state = sess.run(model.initial_state)
for batch in batches:
# Logic to add input placeholder in `feed_dict`
feed_dict[model.initial_state] = state
# Note I'm re-using `state` below
(loss, state) = sess.run([model.loss, model.final_state], feed_dict=feed_dict)
Given a trained LSTM model I want to perform inference for single timesteps, i.e. seq_length = 1 in the example below. After each timestep the internal LSTM (memory and hidden) states need to be remembered for the next 'batch'. For the very beginning of the inference the internal LSTM states init_c, init_h are computed given the input. These are then stored in a LSTMStateTuple object which is passed to the LSTM. During training this state is updated every timestep. However for inference I want the state to be saved in between batches, i.e. the initial states only need to be computed at the very beginning and after that the LSTM states should be saved after each 'batch' (n=1).
I found this related StackOverflow question: Tensorflow, best way to save state in RNNs?. However this only works if state_is_tuple=False, but this behavior is soon to be deprecated by TensorFlow (see rnn_cell.py). Keras seems to have a nice wrapper to make stateful LSTMs possible but I don't know the best way to achieve this in TensorFlow. This issue on the TensorFlow GitHub is also related to my question: https://github.com/tensorflow/tensorflow/issues/2838
Anyone good suggestions for building a stateful LSTM model?
inputs = tf.placeholder(tf.float32, shape=[None, seq_length, 84, 84], name="inputs")
targets = tf.placeholder(tf.float32, shape=[None, seq_length], name="targets")
num_lstm_layers = 2
with tf.variable_scope("LSTM") as scope:
lstm_cell = tf.nn.rnn_cell.LSTMCell(512, initializer=initializer, state_is_tuple=True)
self.lstm = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * num_lstm_layers, state_is_tuple=True)
init_c = # compute initial LSTM memory state using contents in placeholder 'inputs'
init_h = # compute initial LSTM hidden state using contents in placeholder 'inputs'
self.state = [tf.nn.rnn_cell.LSTMStateTuple(init_c, init_h)] * num_lstm_layers
outputs = []
for step in range(seq_length):
if step != 0:
scope.reuse_variables()
# CNN features, as input for LSTM
x_t = # ...
# LSTM step through time
output, self.state = self.lstm(x_t, self.state)
outputs.append(output)
I found out it was easiest to save the whole state for all layers in a placeholder.
init_state = np.zeros((num_layers, 2, batch_size, state_size))
...
state_placeholder = tf.placeholder(tf.float32, [num_layers, 2, batch_size, state_size])
Then unpack it and create a tuple of LSTMStateTuples before using the native tensorflow RNN Api.
l = tf.unpack(state_placeholder, axis=0)
rnn_tuple_state = tuple(
[tf.nn.rnn_cell.LSTMStateTuple(l[idx][0], l[idx][1])
for idx in range(num_layers)]
)
RNN passes in the API:
cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)
cell = tf.nn.rnn_cell.MultiRNNCell([cell]*num_layers, state_is_tuple=True)
outputs, state = tf.nn.dynamic_rnn(cell, x_input_batch, initial_state=rnn_tuple_state)
The state - variable will then be feeded to the next batch as a placeholder.
Tensorflow, best way to save state in RNNs? was actually my original question. The code bellow is how I use the state tuples.
with tf.variable_scope('decoder') as scope:
rnn_cell = tf.nn.rnn_cell.MultiRNNCell \
([
tf.nn.rnn_cell.LSTMCell(512, num_proj = 256, state_is_tuple = True),
tf.nn.rnn_cell.LSTMCell(512, num_proj = WORD_VEC_SIZE, state_is_tuple = True)
], state_is_tuple = True)
state = [[tf.zeros((BATCH_SIZE, sz)) for sz in sz_outer] for sz_outer in rnn_cell.state_size]
for t in range(TIME_STEPS):
if t:
last = y_[t - 1] if TRAINING else y[t - 1]
else:
last = tf.zeros((BATCH_SIZE, WORD_VEC_SIZE))
y[t] = tf.concat(1, (y[t], last))
y[t], state = rnn_cell(y[t], state)
scope.reuse_variables()
Rather than using tf.nn.rnn_cell.LSTMStateTuple I just create a lists of lists which works fine. In this example I am not saving the state. However you could easily have made state out of variables and just used assign to save the values.
I currently have the following code for a series of chained together RNNs in tensorflow. I am not using MultiRNN since I was to do something later on with the output of each layer.
for r in range(RNNS):
with tf.variable_scope('recurent_%d' % r) as scope:
state = [tf.zeros((BATCH_SIZE, sz)) for sz in rnn_func.state_size]
time_outputs = [None] * TIME_STEPS
for t in range(TIME_STEPS):
rnn_input = getTimeStep(rnn_outputs[r - 1], t)
time_outputs[t], state = rnn_func(rnn_input, state)
time_outputs[t] = tf.reshape(time_outputs[t], (-1, 1, RNN_SIZE))
scope.reuse_variables()
rnn_outputs[r] = tf.concat(1, time_outputs)
Currently I have a fixed number of time steps. However I would like to change it to have only one timestep but remember the state between batches. I would therefore need to create a state variable for each layer and assign it the final state of each of the layers. Something like this.
for r in range(RNNS):
with tf.variable_scope('recurent_%d' % r) as scope:
saved_state = tf.get_variable('saved_state', ...)
rnn_outputs[r], state = rnn_func(rnn_outputs[r - 1], saved_state)
saved_state = tf.assign(saved_state, state)
Then for each of the layers I would need to evaluate the saved state in my sess.run function as well as calling my training function. I would need to do this for every rnn layer. This seems like kind of a hassle. I would need to track every saved state and evaluate it in run. Also then run would need to copy the state from my GPU to host memory which would be inefficient and unnecessary. Is there a better way of doing this?
Here is the code to update the LSTM's initial state, when state_is_tuple=True by defining state variables. It also supports multiple layers.
We define two functions - one for getting the state variables with an initial zero state and one function for returning an operation, which we can pass to session.run in order to update the state variables with the LSTM's last hidden state.
def get_state_variables(batch_size, cell):
# For each layer, get the initial state and make a variable out of it
# to enable updating its value.
state_variables = []
for state_c, state_h in cell.zero_state(batch_size, tf.float32):
state_variables.append(tf.contrib.rnn.LSTMStateTuple(
tf.Variable(state_c, trainable=False),
tf.Variable(state_h, trainable=False)))
# Return as a tuple, so that it can be fed to dynamic_rnn as an initial state
return tuple(state_variables)
def get_state_update_op(state_variables, new_states):
# Add an operation to update the train states with the last state tensors
update_ops = []
for state_variable, new_state in zip(state_variables, new_states):
# Assign the new state to the state variables on this layer
update_ops.extend([state_variable[0].assign(new_state[0]),
state_variable[1].assign(new_state[1])])
# Return a tuple in order to combine all update_ops into a single operation.
# The tuple's actual value should not be used.
return tf.tuple(update_ops)
We can use that to update the LSTM's state after each batch. Note that I use tf.nn.dynamic_rnn for unrolling:
data = tf.placeholder(tf.float32, (batch_size, max_length, frame_size))
cell_layer = tf.contrib.rnn.GRUCell(256)
cell = tf.contrib.rnn.MultiRNNCell([cell] * num_layers)
# For each layer, get the initial state. states will be a tuple of LSTMStateTuples.
states = get_state_variables(batch_size, cell)
# Unroll the LSTM
outputs, new_states = tf.nn.dynamic_rnn(cell, data, initial_state=states)
# Add an operation to update the train states with the last state tensors.
update_op = get_state_update_op(states, new_states)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run([outputs, update_op], {data: ...})
The main difference to this answer is that state_is_tuple=True makes the LSTM's state a LSTMStateTuple containing two variables (cell state and hidden state) instead of just a single variable. Using multiple layers then makes the LSTM's state a tuple of LSTMStateTuples - one per layer.
Resetting to zero
When using a trained model for prediction / decoding, you might want to reset the state to zero. Then, you can make use of this function:
def get_state_reset_op(state_variables, cell, batch_size):
# Return an operation to set each variable in a list of LSTMStateTuples to zero
zero_states = cell.zero_state(batch_size, tf.float32)
return get_state_update_op(state_variables, zero_states)
For example like above:
reset_state_op = get_state_reset_op(state, cell, max_batch_size)
# Reset the state to zero before feeding input
sess.run([reset_state_op])
sess.run([outputs, update_op], {data: ...})
I am now saving the RNN states using the tf.control_dependencies. Here is an example.
saved_states = [tf.get_variable('saved_state_%d' % i, shape = (BATCH_SIZE, sz), trainable = False, initializer = tf.constant_initializer()) for i, sz in enumerate(rnn.state_size)]
W = tf.get_variable('W', shape = (2 * RNN_SIZE, RNN_SIZE), initializer = tf.truncated_normal_initializer(0.0, 1 / np.sqrt(2 * RNN_SIZE)))
b = tf.get_variable('b', shape = (RNN_SIZE,), initializer = tf.constant_initializer())
rnn_output, states = rnn(last_output, saved_states)
with tf.control_dependencies([tf.assign(a, b) for a, b in zip(saved_states, states)]):
dense_input = tf.concat(1, (last_output, rnn_output))
dense_output = tf.tanh(tf.matmul(dense_input, W) + b)
last_output = dense_output + last_output
I just make sure that part of my graph is dependent on saving the state.
These two links are also related and useful for this question:
https://github.com/tensorflow/tensorflow/issues/2695
https://github.com/tensorflow/tensorflow/issues/2838