How to use the reset_states(states) function in Keras?

I'm trying to set the LSTM internal state before training each batch.
I'm sharing my test code and findings, hoping to find an answer and help others that are addressing similar problems.
In particular, for each data I have a feature X (which doesn't change over time) and a sequence P = p1, p2, p3,... p30.
The goal is: given X and p1,p2,p3 predict p4, p5, .. p30.
To this aim, I want to initialize the hidden state of an LSTM with X, as done in several works (e.g., neuraltalk), then the LSTM has to be fit with p1,p2,p3 to predict p4,..,p30.
This initialization is needed before each batch (batch_size=1), so I need control over how the LSTM states are initialized.
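To make the intent concrete, here is a rough sketch of the per-batch loop I have in mind (encode_X and batches are placeholders for my own code, not anything shown below):
for X, p_in, p_target in batches:           # placeholder iterable of samples
    h0 = encode_X(X)                        # placeholder: maps X to a (1, units) numpy array
    c0 = np.zeros_like(h0)
    model.layers[1].reset_states([h0, c0])  # the call this question is about
    model.train_on_batch(p_in, p_target)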
Considering the question Initializing LSTM hidden state Tensorflow/Keras, I've tested the following code:
First of all, I've added some prints to the reset_states() function defined in recurrent.py, in order to understand what exactly happens.
def reset_states(self, states=None):
    if not self.stateful:
        raise AttributeError('Layer must be stateful.')
    batch_size = self.input_spec[0].shape[0]
    if not batch_size:
        raise ValueError('If a RNN is stateful, it needs to know '
                         'its batch size. Specify the batch size '
                         'of your input tensors: \n'
                         '- If using a Sequential model, '
                         'specify the batch size by passing '
                         'a `batch_input_shape` '
                         'argument to your first layer.\n'
                         '- If using the functional API, specify '
                         'the time dimension by passing a '
                         '`batch_shape` argument to your Input layer.')
    # initialize state if None
    if self.states[0] is None:
        self.states = [K.zeros((batch_size, self.units))
                       for _ in self.states]
        print "reset states A (all zeros)"
    elif states is None:
        for state in self.states:
            K.set_value(state, np.zeros((batch_size, self.units)))
        print "reset states B (all zeros)"
    else:
        if not isinstance(states, (list, tuple)):
            states = [states]
            print "reset states C (list or tuple copying)"
        if len(states) != len(self.states):
            raise ValueError('Layer ' + self.name + ' expects ' +
                             str(len(self.states)) + ' states, '
                             'but it received ' + str(len(states)) +
                             ' state values. Input received: ' +
                             str(states))
        for index, (value, state) in enumerate(zip(states, self.states)):
            if value.shape != (batch_size, self.units):
                raise ValueError('State ' + str(index) +
                                 ' is incompatible with layer ' +
                                 self.name + ': expected shape=' +
                                 str((batch_size, self.units)) +
                                 ', found shape=' + str(value.shape))
            K.set_value(state, value)
            print "reset states D (set values)"
            print value
            print "\n"
Here is the test code:
import tensorflow as tf
from keras.layers import LSTM
from keras.layers import Input
from keras.models import Model
import numpy as np
import keras.backend as K
input = Input(batch_shape=(1,3,1))
lstm_layer = LSTM(10,stateful=True)(input)
>>> reset states A (all zeros)
As you can see, the first print is executed when the LSTM layer is created.
model = Model(input,lstm_layer)
model.compile(optimizer="adam", loss="mse")
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    h = sess.run(model.layers[1].states[0])
    c = sess.run(model.layers[1].states[1])
print h
>>> [[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]
print c
>>> [[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]
The internal states have been set to all zeros.
As an alternative, the function reset_states() can be used:
model.layers[1].reset_states()
>>> reset states B (all zeros)
The second message has been printed in this case. Everything seems to work correctly.
Now I want to set the states with arbitrary values.
new_h = K.variable(value=np.ones((1, 10)))
new_c = K.variable(value=np.ones((1, 10))+1)
model.layers[1].states[0] = new_h
model.layers[1].states[1] = new_c
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    h = sess.run(model.layers[1].states[0])
    c = sess.run(model.layers[1].states[1])
print h
>>> [[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]
print c
>>> [[ 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]]
OK, I've successfully set both internal states to my vectors of all ones and all twos.
However, it is worth using the class method reset_states(), which takes the states as input.
This method relies on K.set_value(x, value), which expects value to be a numpy array.
new_h_5 = np.zeros((1,10))+5
new_c_24 = np.zeros((1,10))+24
model.layers[1].reset_states([new_h_5,new_c_24])
It seems to work; indeed, the output is:
>>> reset states D (set values)
>>> [[ 5. 5. 5. 5. 5. 5. 5. 5. 5. 5.]]
>>>
>>>
>>>
>>>
>>> reset states D (set values)
>>> [[ 24. 24. 24. 24. 24. 24. 24. 24. 24. 24.]]
However, if I then check whether the states have been set, I find the previous initialization values (all ones, all twos).
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    hh = sess.run(model.layers[1].states[0])
    cc = sess.run(model.layers[1].states[1])
print hh
>>> [[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]
print cc
>>> [[ 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]]
What exactly is happening here? Why does the function seem to work, judging by the prints, yet not change the values of the internal states?

As you may read here, the value parameter sets the value with which a variable is initialized. So when you call tf.global_variables_initializer().run(), your states are re-initialized with the values defined here:
new_h = K.variable(value=np.ones((1, 10)))
new_c = K.variable(value=np.ones((1, 10))+1)
Edit:
It seemed obvious to me, but I will explain once more why reset_states doesn't appear to work.
Variable definition: when you define your inner states to be variables initialized with a certain value, then this value will be set every time you call the variable initializer.
Reset states: it updates the current value of the variable, but it does not change the default value of its initializer. To change that, you would need to reassign the states to yet another variable whose initializer has the desired values as its default.
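To see the difference concretely, a small check based on the code above: read the states with K.get_value, which uses the session Keras already manages, instead of opening a fresh session and re-running the initializer.
# Read the current values without re-running the variable initializer;
# per the explanation above, these reflect what
# reset_states([new_h_5, new_c_24]) actually wrote (all 5s and all 24s),
# whereas tf.global_variables_initializer() would overwrite them with
# the initializer values (the ones and twos) again.
print K.get_value(model.layers[1].states[0])
print K.get_value(model.layers[1].states[1])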

Related

Tensor slicing loses the shape information in TensorFlow

I'm trying to dynamically slice a tensor to automatically adjust its shape for the next iteration. However, I realized that when slicing in graph mode, the shape information of the tensor is lost, so I cannot apply further operations that require knowing the shape of the tensor. Below I attach example code; in my specific case the opt_with_slicing function is inside a vectorized_map, which is defined in a larger function that takes care of auto-differentiation. Since the original function is too large to include here, I have simplified it accordingly:
a = tf.constant(np.linspace(0., 10., 11, endpoint=True)[::-1])
b = tf.ones((2, 10))

def opt_with_slicing(x, some_cutoff: float):
    a, b = x
    new_size = tf.math.count_nonzero(
        tf.cast(a >= some_cutoff, dtype=tf.int32), dtype=tf.int32
    )
    tf.print(f"new size {new_size}, initial size {a.get_shape()}")
    test1 = b[:, :new_size]
    test2 = tf.slice(b, [0, 0], [b.get_shape()[0], new_size])
    tf.print(f"test1 shape {test1.get_shape()}, test2 shape {test2.get_shape()}")
    return test1, test2

tf.function(opt_with_slicing)([a, b], 5.)
# Output:
# new size Tensor("count_nonzero/Cast_1:0", shape=(), dtype=int32), initial size (11,)
# test1 shape (2, None), test2 shape (2, None)
# (<tf.Tensor: shape=(2, 6), dtype=float32, numpy=
#  array([[1., 1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1., 1.]], dtype=float32)>,
#  <tf.Tensor: shape=(2, 6), dtype=float32, numpy=
#  array([[1., 1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1., 1.]], dtype=float32)>)
As you can see from the printout, the shape information of test1 and test2 is lost, and since this is a dynamic operation I have no way of knowing new_size prior to execution. Is there a way to reinstate the shape information without breaking graph mode?
PS: I tried the same with boolean_mask as well;
mask = tf.greater_equal(a, some_cutoff)
masked_shape = tf.boolean_mask(a, mask).get_shape()[0]
but masked_shape turns out to be None as well.
System info:
Tensorflow v2.5.0
Python v3.8.2
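For reference, the static shape returned by get_shape() and the dynamic shape available at runtime via tf.shape are two different things; below is a minimal sketch of that distinction (it does not by itself restore the static shape information):
import tensorflow as tf

@tf.function
def slice_demo(b, new_size):
    sliced = b[:, :new_size]
    # Static shape, fixed at trace time: the last dimension is None
    # because new_size is only known when the graph actually runs.
    print("static shape:", sliced.get_shape())
    # Dynamic shape, available at run time as a tensor.
    tf.print("dynamic shape:", tf.shape(sliced))
    return sliced

slice_demo(tf.ones((2, 10)), tf.constant(6))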

tf.Variable assign method breaks the tf.GradientTape

When I use the assign method of tf.Variable to change the value of a variable, it breaks the tf.GradientTape; e.g., see the code for a toy example below:
(NOTE: I am interested in TensorFlow 2 only.)
x = tf.Variable([[2.0, 3.0, 4.0], [1., 10., 100.]])
patch = tf.Variable([[0., 1.], [2., 3.]])

with tf.GradientTape() as g:
    g.watch(patch)
    x[:2, :2].assign(patch)
    y = tf.tensordot(x, tf.transpose(x), axes=1)
    o = tf.reduce_mean(y)
do_dpatch = g.gradient(o, patch)
Then it gives me None for the do_dpatch.
Note that if I do the following it works perfectly fine:
x = tf.Variable([[2.0, 3.0, 4.0], [1., 10., 100.]])
patch = tf.Variable([[0., 1.], [2., 3.]])

with tf.GradientTape() as g:
    g.watch(patch)
    x[:2, :2].assign(patch)
    y = tf.tensordot(x, tf.transpose(x), axes=1)
    o = tf.reduce_mean(y)
do_dx = g.gradient(o, x)
and gives me:
>>>do_dx
<tf.Tensor: id=106, shape=(2, 3), dtype=float32, numpy=
array([[ 1.,  2., 52.],
       [ 1.,  2., 52.]], dtype=float32)>
This behavior does make sense. Let's take your first example:
x = tf.Variable([[2.0, 3.0, 4.0], [1., 10., 100.]])
patch = tf.Variable([[1., 1.], [1., 1.]])

with tf.GradientTape() as g:
    g.watch(patch)
    x[:2, :2].assign(patch)
    y = tf.tensordot(x, tf.transpose(x), axes=1)
dy_dx = g.gradient(y, patch)
You are computing dy/d(patch), but your y depends only on x, not on patch. Yes, you do assign values to x from patch, but this operation doesn't carry a reference to the patch variable; it just copies the values.
In short, you are trying to take a gradient with respect to something y doesn't depend on, so you get None.
Let's look at the second example and why it works.
x = tf.Variable([[2.0, 3.0, 4.0], [1., 10., 100.]])

with tf.GradientTape() as g:
    g.watch(x)
    x[:2, :2].assign([[1., 1.], [1., 1.]])
    y = tf.tensordot(x, tf.transpose(x), axes=1)
dy_dx = g.gradient(y, x)
This example is perfectly fine: y depends on x and you are computing dy/dx, so you get actual gradients in this example.
As explained HERE (see the quote below from alextp), tf.assign does not support gradients.
"There is no plan to add a gradient to tf.assign because it's not possible in general to connect the uses of the assigned variable with the graph which assigned it."
So, the above problem can be resolved by the following code:
x = tf.Variable([[0.0, 0.0, 4.0], [0., 0., 100.]])
patch = tf.Variable([[0., 1.], [2., 3.]])

with tf.GradientTape() as g:
    g.watch(patch)
    padding = tf.constant([[0, 0], [0, 1]])
    padded_patch = tf.pad(patch, padding, mode='CONSTANT', constant_values=0)
    revised_x = x + padded_patch
    y = tf.tensordot(revised_x, tf.transpose(revised_x), axes=1)
    o = tf.reduce_mean(y)
do_dpatch = g.gradient(o, patch)
which results in
do_dpatch
<tf.Tensor: id=65, shape=(2, 2), dtype=float32, numpy=
array([[1., 2.],
       [1., 2.]], dtype=float32)>

Tensorflow RNN text generation example tutorial

Looking at this tutorial here, they use a starting sequence of “Romeo: “.
print(generate_text(model, start_string=u"ROMEO: "))
However, looking at the actual generation step, is it fair to say it’s only using the last character “ “? So it’s the same whether we use “ROMEO: “ or just “ “? It’s hard to test as it samples from the output distribution ...
Relatedly, it’s unclear how it would predict from such a short string as the original training sequence is much longer. I understand if we trained on a history of 100 chars we predict the 101st and then use 2-101 to predict 102... but how does it start with just 7 characters?
EDIT
As a specific example, I reworked my model to be of the following form:
model = tf.keras.Sequential()
model.add(tf.keras.layers.SimpleRNN(units=512, input_shape = (seq_len, 1), activation="tanh"))
model.add(tf.keras.layers.Dense(len(vocab)))
model.compile(loss=loss, optimizer='adam')
model.summary()
Notice, I use a simpleRNN instead of a GRU and drop the embedding step. Both of those changes are to simplify the model but that shouldn't matter.
My training and output data is as follows:
>>> input_array_reshaped
array([[46., 47., 53., ..., 39., 58.,  1.],
       [ 8.,  0., 20., ..., 33., 31., 10.],
       [63.,  1., 44., ..., 58., 46., 43.],
       ...,
       [47., 41., 47., ...,  0., 21., 57.],
       [59., 58.,  1., ...,  1., 61., 43.],
       [52., 57., 43., ...,  1., 63., 53.]])
>>> input_array_reshaped.shape
(5000, 100)
>>> output_array_reshaped.shape
(5000, 1, 1)
>>> output_array_reshaped
array([[[40.]],
       [[ 0.]],
       [[56.]],
       ...,
       [[ 1.]],
       [[56.]],
       [[59.]]])
However, if I try to predict on a string less than 100 characters I get:
ValueError: Error when checking input: expected simple_rnn_1_input to have shape (100, 1) but got array with shape (50, 1)
Below is my prediction function, if needed. If I change required_training_length to anything but 100 it crashes; it specifically requires time_steps of length 100.
Can someone tell me how to adjust the model to make it more flexible, as in the example? What subtlety am I missing?
def generateText(starting_corpus, num_char_to_generate=1000, required_training_length=100):
    random_starting_int = random.sample(range(len(text)), 1)[0]
    ending_position = random_starting_int + required_training_length
    starting_string = text[random_starting_int:ending_position]
    print("Starting string is: " + starting_string)
    numeric_starting_string = [char2idx[x] for x in starting_string]
    reshaped_numeric_string = np.reshape(numeric_starting_string, (1, len(numeric_starting_string), 1)).astype('float32')
    output_numeric_vector = []
    for i in range(num_char_to_generate):
        if i % 50 == 0:
            print("Processing character index: " + str(i))
        predicted_values = model.predict(reshaped_numeric_string)
        selected_predicted_value = tf.random.categorical(predicted_values, num_samples=1)[0][0].numpy().astype('float32')  # sample from the predicted values
        #temp = reshaped_numeric_string.copy()
        output_numeric_vector.append(selected_predicted_value)
        reshaped_numeric_string = np.append(reshaped_numeric_string[:, 1:, :], np.reshape(selected_predicted_value, (1, 1, 1)), axis=1)
    predicted_chars = [idx2char[x] for x in output_numeric_vector]
    final_text = ''.join(predicted_chars)
    return(final_text)
However, looking at the actual generation step, is it fair to say
it’s only using the last character “ “? So it’s the same whether we
use “ROMEO: “ or just “ “? It’s hard to test as it samples from the
output distribution ...
No, it is taking all characters into consideration. You can easily
verify that by using a fixed random seed:
from numpy.random import seed
from tensorflow.random import set_seed
seed(1)
set_seed(1)
print('======')
print(generate_text(m, 'ROMEO: '))
seed(1)
set_seed(1)
print('======')
print(generate_text(m, ' '))
Relatedly, it’s unclear how it would predict from such a short
string as the original training sequence is much longer. I
understand if we trained on a history of 100 chars we predict the
101st and then use 2-101 to predict 102... but how does it start
with just 7 characters?
Internally it runs the sequence in a loop. It takes the first
character and predicts the second. Then the second to predict the
third, and so on. While doing so, it updates its hidden state so that
its predictions become better and better. Eventually it plateaus,
because it cannot remember arbitrarily long sequences.
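As for letting the reworked SimpleRNN model accept seed strings shorter than 100 characters, one option (a sketch, assuming the same vocab, loss and training data as in the question) is to leave the time dimension unspecified so the network can be called on sequences of any length:
model = tf.keras.Sequential()
# input_shape=(None, 1): the number of time steps is left unspecified,
# so predict() accepts sequences of any length (7 characters or 100).
model.add(tf.keras.layers.SimpleRNN(units=512, input_shape=(None, 1), activation="tanh"))
model.add(tf.keras.layers.Dense(len(vocab)))
model.compile(loss=loss, optimizer='adam')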

Getting Keras / Tensorflow to output OneHotCategorical, but operation has None for gradient

Problem description
I have inputs x that are indicator variables, and outputs y, where each row is a random one-hot vector that depends on the values of x (data sample shown below).
I want to train a model that essentially learns the probabilistic relationship between x and y in the form of per-column weights. The model must "choose" one, and only one, indicator to output. My current approach is to sample a categorical random variable and produce a one-hot vector as a prediction.
The issue is that I'm getting an error ValueError: An operation has `None` for gradient when I try to train my Keras model.
I find this error odd, because I've trained mixture networks using Keras and Tensorflow, which use tf.contrib.distributions.Categorical, and I did not run into any gradient-related issues.
Code
Experiment
import tensorflow as tf
import tensorflow.contrib.distributions as tfd
import numpy as np
from keras import backend as K
from keras.layers import Layer
from keras.models import Sequential
from keras.utils import to_categorical
def make_xy_prob(rng, size=10000):
    rng = np.random.RandomState(rng) if isinstance(rng, int) else rng
    cols = 3
    weights = np.array([[1, 2, 3]])

    # generate data and drop zeros for now
    x = rng.choice(2, (size, cols))
    is_zeros = x.sum(axis=1) == 0
    x = x[~is_zeros]

    # use weights to create probabilities for determining y
    weighted_x = x * weights
    prob_x = weighted_x / weighted_x.sum(axis=1, keepdims=True)
    y = np.row_stack([to_categorical(rng.choice(cols, p=p), cols) for p in prob_x])

    # add zeros back and shuffle
    zeros = np.zeros(((size - len(x), cols)))
    x = np.row_stack([x, zeros])
    y = np.row_stack([y, zeros])
    shuffle_idx = rng.permutation(size)
    x = x[shuffle_idx]
    y = y[shuffle_idx]
    return x, y
class OneHotGate(Layer):
    def build(self, input_shape):
        self.kernel = self.add_weight(name='kernel', shape=(1, input_shape[1]), initializer='ones')

    def call(self, x):
        zero_cond = x < 1
        x_shape = tf.shape(x)

        # weight indicators so that more probability is assigned to more likely columns
        weighted_x = x * self.kernel

        # fill zeros with -inf so that zero probability is assigned to that column
        ninf_fill = tf.fill(x_shape, -np.inf)
        masked_x = tf.where(zero_cond, ninf_fill, weighted_x)
        onehot_gate = tf.squeeze(tfd.OneHotCategorical(logits=masked_x, dtype=x.dtype).sample(1))

        # fill gate with zeros where input was originally zero
        zeros_fill = tf.fill(x_shape, 0.0)
        masked_gate = tf.where(zero_cond, zeros_fill, onehot_gate)
        return masked_gate
def experiment(epochs=10):
    K.clear_session()
    rng = np.random.RandomState(2)
    X, y = make_xy_prob(rng)
    input_shape = (X.shape[1], )

    model = Sequential()
    gate_layer = OneHotGate(input_shape=input_shape)
    model.add(gate_layer)
    model.compile('adam', 'categorical_crossentropy')
    model.fit(X, y, 64, epochs, verbose=1)
Data sample
>>> x
array([[1., 1., 1.],
       [0., 1., 0.],
       [1., 0., 1.],
       ...,
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 0.]])
>>> y
array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       ...,
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.]])
Error
ValueError: An operation has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
The problem lies in the fact that OneHotCategorical performs discontinuous sampling, which causes gradient computation to fail. In order to replace this discontinuous sampling with a continuous (relaxed) version, one may try RelaxedOneHotCategorical (which is based on the interesting Gumbel-Softmax technique).
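A rough sketch of what that change might look like inside the layer's call() (the temperature value here is an arbitrary choice, and note that the relaxed samples are soft probabilities rather than strict one-hot vectors):
def call(self, x):
    zero_cond = x < 1
    x_shape = tf.shape(x)
    weighted_x = x * self.kernel
    # fill zeros with -inf so that zero probability is assigned to that column
    ninf_fill = tf.fill(x_shape, -np.inf)
    masked_x = tf.where(zero_cond, ninf_fill, weighted_x)
    # Relaxed (Gumbel-Softmax) sampling is differentiable; temperature=0.5
    # is arbitrary, and lower values give samples closer to one-hot.
    relaxed = tfd.RelaxedOneHotCategorical(temperature=0.5, logits=masked_x)
    gate = tf.squeeze(relaxed.sample(1))
    # zero out columns where the input was originally zero
    zeros_fill = tf.fill(x_shape, 0.0)
    return tf.where(zero_cond, zeros_fill, gate)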

Input an integer with placeholder in tensorflow?

I want to feed a batch_size integer as a placeholder in Tensorflow. But it does not act as an integer. Consider the following example:
import tensorflow as tf
max_length = 5
batch_size = 3
batch_size_placeholder = tf.placeholder(dtype=tf.int32)
mask_0 = tf.one_hot(indices=[0]*batch_size_placeholder, depth=max_length, on_value=0., off_value=1.)
mask_1 = tf.one_hot(indices=[0]*batch_size, depth=max_length, on_value=0., off_value=1.)
# new session
with tf.Session() as sess:
    feed = {batch_size_placeholder: 3}
    batch, mask0, mask1 = sess.run([
        batch_size_placeholder, mask_0, mask_1
    ], feed_dict=feed)
When I print the values of batch, mask0 and mask1 I have the following:
print(batch)
>>> array(3, dtype=int32)
print(mask0)
>>> array([[0., 1., 1., 1., 1.]], dtype=float32)
print(mask1)
>>> array([[0., 1., 1., 1., 1.],
           [0., 1., 1., 1., 1.],
           [0., 1., 1., 1., 1.]], dtype=float32)
Indeed, I thought mask0 and mask1 should be the same, but it seems that TensorFlow does not treat batch_size_placeholder as an integer. I believe it is a tensor; is there any way I can use it as an integer in my computations?
Is there any way to fix this problem? Just FYI, I used tf.one_hot only as an example; I want to run training and validation steps in my code, where I will need many other computations with different batch_size values for training and validation.
Any help would be appreciated.
In pure Python, [0]*3 gives [0, 0, 0]. However, batch_size_placeholder is a placeholder; during graph execution it is a tensor, so [0]*tensor is treated as tensor multiplication. In your case, the result is a 1-D tensor containing a single 0. To use batch_size_placeholder correctly, you should create a tensor whose length equals batch_size_placeholder.
mask_0 = tf.one_hot(tf.zeros(batch_size_placeholder, dtype=tf.int32), depth=max_length, on_value=0., off_value=1.)
It will have the same result as mask_1.
A simple example to show the difference.
batch_size_placeholder = tf.placeholder(dtype=tf.int32)
a = [0] * batch_size_placeholder
b = tf.zeros(batch_size_placeholder, dtype=tf.int32)

with tf.Session() as sess:
    print(sess.run([a, b], feed_dict={batch_size_placeholder: 3}))
    # [array([0], dtype=int32), array([0, 0, 0], dtype=int32)]
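More generally, when the batch size is only needed to shape other tensors, a common pattern is to derive it from the data tensor itself via tf.shape instead of feeding it separately. A small sketch (x here is a stand-in for your actual input placeholder):
x = tf.placeholder(dtype=tf.float32, shape=[None, max_length])
# tf.shape(x)[0] is the runtime batch size, whatever is fed in.
dynamic_batch_size = tf.shape(x)[0]
mask = tf.one_hot(tf.zeros(dynamic_batch_size, dtype=tf.int32),
                  depth=max_length, on_value=0., off_value=1.)

with tf.Session() as sess:
    print(sess.run(mask, feed_dict={x: [[0.0] * max_length] * 3}))
    # three rows of [0., 1., 1., 1., 1.]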
