Why am I getting NaN after adding relu activation in LSTM?

I have a simple LSTM network that looks roughly like this:
lstm_activation = tf.nn.relu
cells_fw = [LSTMCell(num_units=100, activation=lstm_activation),
LSTMCell(num_units=10, activation=lstm_activation)]
stacked_cells_fw = MultiRNNCell(cells_fw)
_, states = tf.nn.dynamic_rnn(cell=stacked_cells_fw,
output_states = [s.h for s in states]
states = tf.concat(output_states, 1)
My question is: when I don't set an activation (activation=None) or use tanh, everything works, but when I switch to relu I keep getting "NaN loss during training". Why is that? It's 100% reproducible.

When you use the relu activation function inside the LSTM cell, all the outputs from the cell, as well as the cell state, are guaranteed to be >= 0 (assuming the default zero initial state). Unlike tanh, relu is also unbounded, so nothing stops the state from growing at every time step. Because of that, your activations and gradients can become extremely large and explode. For example, run the following code snippet and observe that the outputs are never < 0.
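import numpy as np
import tensorflow as tf  # TF 1.x API
X = np.random.rand(4,3,2)
lstm_cell = tf.nn.rnn_cell.LSTMCell(5, activation=tf.nn.relu)
hidden_states, _ = tf.nn.dynamic_rnn(cell=lstm_cell, inputs=X, dtype=tf.float64)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
print(sess.run(hidden_states))  # every value printed is >= 0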
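To see why an unbounded activation is a problem here, a minimal sketch in plain Python (not TensorFlow) may help. It steps a one-unit LSTM recurrence with hand-picked weights, which are assumptions chosen only for illustration, and compares relu with tanh. With tanh the output stays below 1; with relu the state feeds back into itself and keeps growing.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def run_cell(activation, steps=50, x=1.0):
    """Step a scalar LSTM cell and return the final output h."""
    # Hypothetical weights, picked only to make the effect visible.
    w_i, w_f, w_o, w_c = 2.0, 2.0, 2.0, 1.5
    c, h = 0.0, 0.0
    for _ in range(steps):
        i = sigmoid(w_i * (x + h))      # input gate
        f = sigmoid(w_f * (x + h))      # forget gate
        o = sigmoid(w_o * (x + h))      # output gate
        g = activation(w_c * (x + h))   # candidate value
        c = f * c + i * g               # new cell state
        h = o * activation(c)           # new output
    return h

h_tanh = run_cell(math.tanh)
h_relu = run_cell(relu)
print("tanh output:", h_tanh)
print("relu output:", h_relu)
```

In a real network the same growth shows up in the loss and its gradients, which is where the NaN eventually comes from. Gradient clipping or a smaller learning rate are the usual things to try if relu has to stay.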


Extracting the dropout mask from a keras dropout layer?

I would like to extract and store the dropout mask [array of 1/0s] from a dropout layer in a Sequential Keras model at each batch while training. I was wondering if there is a straightforward way to do this within Keras or if I would need to switch over to TensorFlow (How to get the dropout mask in Tensorflow).
Would appreciate any help! I'm quite new to TensorFlow and Keras.
There are a couple of functions (dropout_layer.get_output_mask(), dropout_layer.get_input_mask()) for the dropout layer that I tried using but got None after calling on the previous layer.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(name="flat", input_shape=(28, 28, 1)))
name = 'dense_1',
dropout = tf.keras.layers.Dropout(0.2, name = 'dropout') #want this layer's mask
x = dropout.output_mask
y = dropout.input_mask
It's not easily exposed in Keras. It goes deep until it calls the TensorFlow dropout.
So, although you're using Keras, the mask will also be a tensor in the graph that can be fetched by name (to find its name, see: In Tensorflow, get the names of all the Tensors in a graph).
This option, of course, will lack some Keras information; you would probably have to do that inside a Lambda layer so that Keras adds certain information to the tensor. And you must take extra care because the tensor will exist even when not training (where the mask is skipped).
Now, you can also use a less hacky way, which may cost a little extra processing:
def getMask(x):
    boolMask = tf.not_equal(x, 0)
    floatMask = tf.cast(boolMask, tf.float32) #or tf.float64
    return floatMask
Use a Lambda(getMask)(output_of_dropout_layer)
But instead of using a Sequential model, you will need a functional API Model.
inputs = tf.keras.layers.Input((28, 28, 1))
outputs = tf.keras.layers.Flatten(name="flat")(inputs)
outputs = tf.keras.layers.Dense(
# activation='relu', #relu will be a problem here
name = 'dense_1',
outputs = tf.keras.layers.Dropout(0.2, name = 'dropout')(outputs)
mask = Lambda(getMask)(outputs)
#there isn't "input_mask"
#add the missing relu:
outputs = tf.keras.layers.Activation('relu')(outputs)
outputs = tf.keras.layers.Dense(
model = Model(inputs, outputs)
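The reason relu is moved after the dropout layer can be sketched in plain Python. This is an illustration only, not the Keras implementation; apply_dropout and recover_mask are hypothetical helper names. Recovering the mask with a "not equal to zero" test only works if the values going into dropout are never exactly zero, and relu produces exact zeros.

```python
import random

def apply_dropout(values, mask, rate):
    """Inverted dropout: zero the masked-out entries and rescale the kept ones."""
    keep_prob = 1.0 - rate
    return [v * m / keep_prob for v, m in zip(values, mask)]

def recover_mask(dropped):
    """Same idea as getMask: an entry counts as kept if it is non-zero."""
    return [1.0 if v != 0 else 0.0 for v in dropped]

rate = 0.5
rng = random.Random(0)
pre_activation = [0.7, -1.2, 0.3, -0.4, 2.0, 1.1]  # no exact zeros
mask = [1.0 if rng.random() >= rate else 0.0 for _ in pre_activation]

# Dropout applied before relu: the non-zero test recovers the mask.
recovered = recover_mask(apply_dropout(pre_activation, mask, rate))
print(recovered == mask)

# Dropout applied after relu: even with nothing dropped, the entries that
# relu set to zero are reported as dropped.
keep_all = [1.0] * len(pre_activation)
after_relu = [max(0.0, v) for v in pre_activation]
recovered_relu = recover_mask(apply_dropout(after_relu, keep_all, rate))
print(recovered_relu)
```

The same caveat applies to any input that can legitimately contain zeros, so the recovered mask is best treated as an approximation in that case.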
Training and predicting
Since you can't train the masks (it doesn't make any sense), it should not be an output of the model for training.
Now, we could try this:
trainingModel = Model(inputs, outputs)
predictingModel = Model(inputs, [outputs, mask])
But masks don't exist in prediction, because dropout is only applied in training. So this doesn't bring us anything good in the end.
The only way for training is then using a dummy loss and dummy targets:
def dummyLoss(y_true, y_pred):
    return y_true #but this might evoke a "None" gradient problem since it's not trainable, there is no connection to any weights, etc.
model.compile(loss=[loss_for_main_output, dummyLoss], ....)
model.fit(x_train, [y_train, np.zeros((len(y_train),) + mask_shape)], ...)
It's not guaranteed that these will work.
I found a very hacky way to do this by trivially extending the provided dropout layer. (Almost all code from TF.)
from tensorflow.python.framework import tensor_shape
from tensorflow.python.ops import array_ops, math_ops
class MyDR(tf.keras.layers.Layer):
    def __init__(self, rate, **kwargs):
        super(MyDR, self).__init__(**kwargs)
        self.noise_shape = None
        self.rate = rate
    def _get_noise_shape(self, x, noise_shape=None):
        # If noise_shape is none return immediately.
        if noise_shape is None:
            return array_ops.shape(x)
        try:
            # Best effort to figure out the intended shape.
            # If not possible, let the op to handle it.
            # In eager mode exception will show up.
            noise_shape_ = tensor_shape.as_shape(noise_shape)
        except (TypeError, ValueError):
            return noise_shape
        if x.shape.dims is not None and len(x.shape.dims) == len(noise_shape_.dims):
            new_dims = []
            for i, dim in enumerate(x.shape.dims):
                if noise_shape_.dims[i].value is None and dim.value is not None:
                    new_dims.append(dim.value)
                else:
                    new_dims.append(noise_shape_.dims[i].value)
            return tensor_shape.TensorShape(new_dims)
        return noise_shape
    def build(self, input_shape):
        self.noise_shape = input_shape
    def call(self, input):
        self.noise_shape = self._get_noise_shape(input)
        random_tensor = tf.random.uniform(self.noise_shape, seed=1235, dtype=input.dtype)
        keep_prob = 1 - self.rate
        scale = 1 / keep_prob
        # NOTE: if (1.0 + rate) - 1 is equal to rate, then we want to consider that
        # float to be selected, hence we use a >= comparison.
        self.keep_mask = random_tensor >= self.rate
        #NOTE: here is where I save the binary masks.
        #the file grows quite big!
        ret = input * scale * math_ops.cast(self.keep_mask, input.dtype)
        return ret

Do we use different weights in Bidirectional LSTM for each batch?

For example, this is one of the functions we need to call for each batch. It looks like different parameters are used for each batch. Is that correct? If so, why? Shouldn't we be using the same parameters for the whole training set?
def bidirectional_lstm(input_data, num_layers=3, rnn_size=200, keep_prob=0.6):
    output = input_data
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size, initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, input_keep_prob = keep_prob)
            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size, initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, input_keep_prob = keep_prob)
            outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw,
            output = tf.concat(outputs,2)
    return output
for batch_i, batch in enumerate(get_batches(X_train, batch_size)):
    embeddings = tf.nn.embedding_lookup(word_embedding_matrix, batch)
    output = bidirectional_lstm(embeddings)
I have figured out the issue. It turns out that we do use the same parameters, and the code above raises an error in the second iteration saying that the bidirectional kernel already exists. To fix this, we need to set reuse=tf.AUTO_REUSE when defining the variable scope. Therefore, the line
with tf.variable_scope('encoder_{}'.format(layer)):
will become
with tf.variable_scope('encoder_{}'.format(layer), reuse=tf.AUTO_REUSE):
Now we are using the same layers for each batch.
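The behaviour described here can be mimicked with a toy variable store in plain Python. This is only an analogy for how named variables are looked up, not TensorFlow's implementation; VariableStore and its auto_reuse flag are hypothetical names. Asking for an existing name a second time fails unless reuse is allowed, in which case the same object comes back.

```python
class VariableStore:
    """Toy stand-in for a graph's variable store (not the real TensorFlow API)."""

    def __init__(self):
        self._vars = {}

    def get_variable(self, name, init, auto_reuse=False):
        if name in self._vars:
            if not auto_reuse:
                raise ValueError("Variable %s already exists" % name)
            return self._vars[name]      # reuse the existing weights
        self._vars[name] = init()        # first call: create the weights
        return self._vars[name]

store = VariableStore()

# First batch: the kernel is created.
first = store.get_variable("encoder_0/kernel", lambda: [0.1, -0.1])

# Second batch without reuse: creation is attempted again and fails.
try:
    store.get_variable("encoder_0/kernel", lambda: [0.0, 0.0])
    second_call_failed = False
except ValueError:
    second_call_failed = True

# Second batch with reuse allowed: the same object is returned.
shared = store.get_variable("encoder_0/kernel", lambda: [0.0, 0.0], auto_reuse=True)
print(second_call_failed, shared is first)
```

Building the graph once, outside the batch loop, and only feeding different batches into it avoids the question altogether.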

TensorFlow Assign

I am trying to write a custom version of an RNN and would like to just store the state and last output of the cells in variables, but it is not working. My guess is that TensorFlow sees the storing of the values as unnecessary and does not execute it. Here is a snippet that illustrates the problem.
For this example, I have five layers of "cells" that intentionally ignore the input and output the sum of the biases for the cell and the previous output, which is initialized to zero. However, as we run this, the output of the network is always just the values of the biases in the final layer and the value of last_output remains zero.
import tensorflow as tf
import numpy as np
def cell_function(cell_inputs, layer):
last_output = tf.get_variable('last_output_{}'.format(layer), shape=(10, 1),
initializer=tf.zeros_initializer, trainable=False)
biases = tf.get_variable('biases_{}'.format(layer), shape=(10, 1),
cell_output = last_output + biases
return cell_output
def rnn_function(inputs):
with tf.variable_scope('rnn', reuse=tf.AUTO_REUSE):
next_inputs = inputs
for layer in range(num_layers):
next_inputs = cell_function(next_inputs, layer)
return next_inputs
num_layers = 5
data = np.random.uniform(0, 10, size=(1001, 10, 1))
x = tf.placeholder('float', shape=(10, 1))
y = tf.placeholder('float', shape=(10, 1))
predictions = rnn_function(x)
loss = tf.losses.mean_squared_error(predictions=predictions, labels=y)
optimizer = tf.train.AdamOptimizer(learning_rate=0.1).minimize(loss=loss)
with tf.variable_scope('rnn', reuse=tf.AUTO_REUSE):
last = tf.get_variable('last_output_4', shape=(10, 1),
initializer=tf.zeros_initializer, trainable=False)
layer_biases = tf.get_variable('biases_4', shape=(10, 1),
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for t in range(1000):
        rnn_input = data[t]
        rnn_output = data[t+1]
        feed_dict = {x: rnn_input, y: rnn_output}
        fetches = [optimizer, predictions, loss, last, layer_biases]
        _, pred, mse, value, bias = sess.run(fetches, feed_dict=feed_dict)
If I change the last line of cell_function before the return to last_output = tf.assign(last_output, cell_output), return it together with cell_output, return it again out of rnn_function, and use that for the variable last, everything works. I think it is because we are forcing TensorFlow to compute that node in the graph.
Is there any way to make this work without passing last_output out of the cell? It would be much nicer if I didn't have to keep passing all this stuff out to get the assignment operation to be executed.
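Make an operation that will be run depend on the assign. tf.control_dependencies takes a list of operations, and every op created inside the with block waits for them, so create the assign first and then re-create the tensor you return inside the block (in this example inside cell_function, but use whatever makes sense):
assign_op = tf.assign(last_output, cell_output)
with tf.control_dependencies([assign_op]):
    cell_output = tf.identity(cell_output)
Now the assign operation is required in order for cell_output, and therefore the cost, to be computed, which should solve your problem. For any operation you request TensorFlow to compute with sess.run(some_op), TensorFlow will work backwards through the dependency graph and only compute the minimum elements necessary to produce the requested output.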
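The "work backwards from the fetch" rule can be illustrated with a small dependency walk in plain Python. This is a toy model of graph pruning, not TensorFlow code; the node names and the execution_order helper are assumptions made up for the example.

```python
def execution_order(fetch, deps):
    """Return the nodes that must run to produce `fetch`, dependencies first."""
    order = []

    def visit(node):
        if node in order:
            return
        for parent in deps[node]:
            visit(parent)
        order.append(node)

    visit(fetch)
    return order

# Each node lists the nodes it depends on.
deps = {
    "last_output": [],
    "biases": [],
    "cell_output": ["last_output", "biases"],
    "assign": ["cell_output"],                 # nothing depends on this node
    "loss": ["cell_output"],
    "loss_with_dep": ["cell_output", "assign"] # extra dependency on the assign
}

print(execution_order("loss", deps))           # the assign is pruned
print(execution_order("loss_with_dep", deps))  # the assign now has to run
```

Fetching "loss" never touches "assign" because no path leads from the fetch to it, which matches the behaviour seen in the question. Once something on the fetched path depends on it, it runs.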

Evaluation of Regression Neural Network

I am trying to write a small program to solve a regression problem. My dataset consists of 4 random x values (x1, x2, x3 and x4) and 1 y value. One of the rows looks like this:
0.634585 0.552366 0.873447 0.196890 8.75
I now want to predict the y value as closely as possible, so after training I would like to evaluate how good my model is by showing the loss. Unfortunately I always receive
Training cost= nan
The most important lines of code are:
X_data = tf.placeholder(shape=[None, 4], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
# Input neurons : 4
# Hidden neurons : 2 x 8
# Output neurons : 3
hidden_layer_nodes = 8
w1 = tf.Variable(tf.random_normal(shape=[4,hidden_layer_nodes])) # Inputs -> Hidden Layer1
b1 = tf.Variable(tf.random_normal(shape=[hidden_layer_nodes])) # First Bias
w2 = tf.Variable(tf.random_normal(shape=[hidden_layer_nodes,1])) # Hidden layer2 -> Outputs
b2 = tf.Variable(tf.random_normal(shape=[1])) # Third Bias
hidden_output = tf.nn.relu(tf.add(tf.matmul(X_data, w1), b1))
final_output = tf.nn.relu(tf.add(tf.matmul(hidden_output, w2), b2))
loss = tf.reduce_mean(-tf.reduce_sum(y_target * tf.log(final_output), axis=0))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
steps = 10000
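with tf.Session() as sess:
    sess.run(init)
    for i in range(steps):
        # X_train / y_train are assumed names for the training split
        sess.run(train, feed_dict={X_data:X_train,y_target:y_train})
        if i%500 == 0:
            print('Currently on step {}'.format(i))
    training_cost = sess.run(loss, feed_dict={X_data:X_test,y_target:y_test})
    print("Training cost=", training_cost)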
Maybe someone knows where my mistake is or even better, how to constantly show the error during my training :) I know how this is done with the tf.estimator, but not without. If you need the dataset, let me know.
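This is likely because the ReLU activation function causes exploding gradients, so you need to reduce the learning rate accordingly. You can also try a different activation function (for this you may have to normalize your dataset first). Note also that the loss takes tf.log(final_output), and final_output comes out of a ReLU, so it can be exactly 0; tf.log(0) is -inf, which is enough to turn the loss into inf or NaN. For a regression target, a squared-error loss avoids that.
A similar problem is discussed in "In simple multi-layer FFNN only ReLU activation function doesn't converge". Follow the answer there and you will understand.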
Hope this helps.
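A minimal sketch in plain Python shows how a log taken on a relu output produces a non-finite loss, and how a mean squared error on the same predictions stays finite. The numbers are made up for illustration, and log_like_tf is a hypothetical helper that imitates the convention of returning -inf for log(0) instead of raising.

```python
import math

def relu(z):
    return max(0.0, z)

def log_like_tf(v):
    # Imitates a tensor-library log: log(0) gives -inf rather than an exception.
    return math.log(v) if v > 0 else float("-inf")

def cross_entropy_style_loss(y_true, y_pred):
    """The shape of loss used in the question: -sum(y * log(prediction))."""
    return -sum(t * log_like_tf(p) for t, p in zip(y_true, y_pred))

def mse(y_true, y_pred):
    """Mean squared error, the usual choice for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [8.75, 3.2]
pre_activations = [1.5, -0.4]               # the second unit is negative
y_pred = [relu(z) for z in pre_activations] # so relu outputs exactly 0 for it

bad = cross_entropy_style_loss(y_true, y_pred)
ok = mse(y_true, y_pred)
print("cross-entropy style loss:", bad)
print("mean squared error:", ok)
```

Once the loss is inf, the gradients derived from it are not usable, and the reported cost ends up as nan. To watch the error during training, evaluating the loss tensor every few hundred steps inside the training loop and printing it is usually enough.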

TensorFlow: Remember LSTM state for next batch (stateful LSTM)

Given a trained LSTM model I want to perform inference for single timesteps, i.e. seq_length = 1 in the example below. After each timestep the internal LSTM (memory and hidden) states need to be remembered for the next 'batch'. For the very beginning of the inference the internal LSTM states init_c, init_h are computed given the input. These are then stored in a LSTMStateTuple object which is passed to the LSTM. During training this state is updated every timestep. However for inference I want the state to be saved in between batches, i.e. the initial states only need to be computed at the very beginning and after that the LSTM states should be saved after each 'batch' (n=1).
I found this related StackOverflow question: Tensorflow, best way to save state in RNNs?. However this only works if state_is_tuple=False, but this behavior is soon to be deprecated by TensorFlow (see rnn_cell.py). Keras seems to have a nice wrapper to make stateful LSTMs possible but I don't know the best way to achieve this in TensorFlow. This issue on the TensorFlow GitHub is also related to my question: https://github.com/tensorflow/tensorflow/issues/2838
Does anyone have good suggestions for building a stateful LSTM model?
inputs = tf.placeholder(tf.float32, shape=[None, seq_length, 84, 84], name="inputs")
targets = tf.placeholder(tf.float32, shape=[None, seq_length], name="targets")
num_lstm_layers = 2
with tf.variable_scope("LSTM") as scope:
lstm_cell = tf.nn.rnn_cell.LSTMCell(512, initializer=initializer, state_is_tuple=True)
self.lstm = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * num_lstm_layers, state_is_tuple=True)
init_c = # compute initial LSTM memory state using contents in placeholder 'inputs'
init_h = # compute initial LSTM hidden state using contents in placeholder 'inputs'
self.state = [tf.nn.rnn_cell.LSTMStateTuple(init_c, init_h)] * num_lstm_layers
outputs = []
for step in range(seq_length):
if step != 0:
# CNN features, as input for LSTM
x_t = # ...
# LSTM step through time
output, self.state = self.lstm(x_t, self.state)
I found out it was easiest to save the whole state for all layers in a placeholder.
init_state = np.zeros((num_layers, 2, batch_size, state_size))
state_placeholder = tf.placeholder(tf.float32, [num_layers, 2, batch_size, state_size])
Then unpack it and create a tuple of LSTMStateTuples before using the native TensorFlow RNN API.
l = tf.unstack(state_placeholder, axis=0)  # named tf.unpack before TensorFlow 1.0
rnn_tuple_state = tuple(
    [tf.nn.rnn_cell.LSTMStateTuple(l[idx][0], l[idx][1])
     for idx in range(num_layers)]
)
The RNN then receives the state through the API:
cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)
cell = tf.nn.rnn_cell.MultiRNNCell([cell]*num_layers, state_is_tuple=True)
outputs, state = tf.nn.dynamic_rnn(cell, x_input_batch, initial_state=rnn_tuple_state)
The state variable is then fed to the next batch through the placeholder.
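The bookkeeping around the placeholder can be sketched without TensorFlow. In this sketch LSTMState stands in for LSTMStateTuple and fake_rnn_step is a hypothetical placeholder for the sess.run call that returns the new state; only the pack, unpack and carry-over logic is the point.

```python
from collections import namedtuple

# Stand-in for tf.nn.rnn_cell.LSTMStateTuple, so the sketch runs without TensorFlow.
LSTMState = namedtuple("LSTMState", ["c", "h"])

num_layers, batch_size, state_size = 2, 1, 3

def zeros():
    return [[0.0] * state_size for _ in range(batch_size)]

def unpack_state(packed):
    """[num_layers][2][batch][state] -> tuple of (c, h) pairs, one per layer."""
    return tuple(LSTMState(c=layer[0], h=layer[1]) for layer in packed)

def pack_state(states):
    """Inverse of unpack_state, ready to be fed back in for the next batch."""
    return [[s.c, s.h] for s in states]

def fake_rnn_step(states, x):
    """Hypothetical placeholder for running the RNN on one batch."""
    return tuple(
        LSTMState(c=[[v + x for v in row] for row in s.c],
                  h=[[x for _ in row] for row in s.h])
        for s in states)

# Zero state at the very beginning, then carry the state from batch to batch.
packed = [[zeros(), zeros()] for _ in range(num_layers)]
for x in [1.0, 2.0, 3.0]:
    states = unpack_state(packed)
    packed = pack_state(fake_rnn_step(states, x))

print(packed[0][0])  # memory state of layer 0 after three batches
```

With real TensorFlow code the loop body would fetch the state tensor returned by dynamic_rnn and feed its value into state_placeholder on the next call.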
Tensorflow, best way to save state in RNNs? was actually my original question. The code below is how I use the state tuples.
with tf.variable_scope('decoder') as scope:
    rnn_cell = tf.nn.rnn_cell.MultiRNNCell([
        tf.nn.rnn_cell.LSTMCell(512, num_proj = 256, state_is_tuple = True),
        tf.nn.rnn_cell.LSTMCell(512, num_proj = WORD_VEC_SIZE, state_is_tuple = True)
    ], state_is_tuple = True)
    state = [[tf.zeros((BATCH_SIZE, sz)) for sz in sz_outer] for sz_outer in rnn_cell.state_size]
    for t in range(TIME_STEPS):
        if t:
            last = y_[t - 1] if TRAINING else y[t - 1]
        else:
            last = tf.zeros((BATCH_SIZE, WORD_VEC_SIZE))
        y[t] = tf.concat(1, (y[t], last))
        y[t], state = rnn_cell(y[t], state)
Rather than using tf.nn.rnn_cell.LSTMStateTuple I just create a list of lists, which works fine. In this example I am not saving the state. However, you could easily make the state out of variables and just use assign to save the values.

