Cascade multiple RNN models for N-dimensional output - python

I'm having some difficulty with chaining together two models in an unusual way.
I am trying to replicate the following flowchart:
For clarity, at each timestep of Model[0] I am attempting to generate an entire time series from IR[i] (Intermediate Representation), fed as a repeated input to Model[1]. The purpose of this scheme is that it allows the generation of a ragged 2-D time series from a 1-D input, while both allowing the second model to be omitted when the output for that timestep is not needed and not requiring Model[0] to constantly "switch modes" between accepting input and generating output.
I assume a custom training loop will be required, and I already have a custom training loop for handling statefulness in the first model (the previous version only had a single output at each timestep). As depicted, the second model should have reasonably short outputs (able to be constrained to fewer than 10 timesteps).
But at the end of the day, while I can wrap my head around what I want to do, I'm not nearly adroit enough with Keras and/or Tensorflow to actually implement it. (In fact, this is my first non-toy project with the library.)
I have unsuccessfully searched literature for similar schemes to parrot, or example code to fiddle with. And I don't even know if this idea is possible from within TF/Keras.
I already have the two models working in isolation. (As in I've worked out the dimensionality, and done some training with dummy data to get garbage outputs for the second model, and the first model is based off of a previous iteration of this problem and has been fully trained.) If I have Model[0] and Model[1] as python variables (let's call them model_a and model_b), then how would I chain them together to do this?
Edit to add:
If this is all unclear, perhaps the dimensions of each input and output will help:
Input: (batch_size, model_a_timesteps, input_size)
IR: (batch_size, model_a_timesteps, ir_size)
IR[i] (after duplication): (batch_size, model_b_timesteps, ir_size)
Out[i]: (batch_size, model_b_timesteps, output_size)
Out: (batch_size, model_a_timesteps, model_b_timesteps, output_size)
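To make the shapes above concrete, here is a minimal NumPy sketch of the data flow only. It is not the Keras implementation: `fake_model_b` is a hypothetical stand-in for model_b, and all sizes are made up for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
batch_size, model_a_timesteps, model_b_timesteps = 2, 3, 4
ir_size, output_size = 5, 6


def fake_model_b(ir_repeated):
    """Stand-in for model_b: (batch, b_steps, ir_size) -> (batch, b_steps, output_size)."""
    weights = np.ones((ir_size, output_size))
    return ir_repeated @ weights


# IR as produced by model_a: (batch_size, model_a_timesteps, ir_size)
ir = np.random.randn(batch_size, model_a_timesteps, ir_size)

outs = []
for i in range(model_a_timesteps):
    ir_i = ir[:, i, :]  # (batch_size, ir_size)
    # Duplicate IR[i] along a new time axis: (batch_size, model_b_timesteps, ir_size)
    ir_i_rep = np.repeat(ir_i[:, None, :], model_b_timesteps, axis=1)
    outs.append(fake_model_b(ir_i_rep))  # Out[i]: (batch_size, model_b_timesteps, output_size)

# Out: (batch_size, model_a_timesteps, model_b_timesteps, output_size)
out = np.stack(outs, axis=1)
```

Here every IR[i] is expanded to the same number of steps; the ragged case would use a different repeat count per `i`.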

As this question has multiple major parts, I've dedicated a Q&A to the core challenge: stateful backpropagation. This answer focuses on implementing the variable output step length.
As validated in Case 5, we can take a bottom-up approach: first we feed the complete input to model_a (A), then feed its outputs as input to model_b (B), this time one step at a time.
Note that we must chain B's output steps per A's input step, not between A's input steps; i.e., in your diagram, gradient is to flow between Out[0][1] and Out[0][0], but not between Out[2][0] and Out[0][1].
For computing loss it won't matter whether we use a ragged or padded tensor; we must however use a padded tensor for writing to TensorArray.
Loop logic in code below is general; specific attribute handling and hidden state passing, however, is hard-coded for simplicity, but can be rewritten for generality.
Code: at bottom.
Here we predefine the number of iterations for B per input from A, but we can implement any arbitrary stopping logic. For example, we can take a Dense layer's output from B as a hidden state and check if its L2-norm exceeds a threshold.
Per above, if longest_step is unknown to us, we can simply set it, which is common for NLP & other tasks with a STOP token.
Alternatively, we may write to separate TensorArrays at every A's input with dynamic_size=True; see "point of uncertainty" below.
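The padding idea can be sketched outside TensorFlow. This is a NumPy illustration only, assuming each of A's steps yields a `(steps_i, features)` array; `pad_to_longest` and the mask are hypothetical helpers, not part of the code in this answer.

```python
import numpy as np


def pad_to_longest(step_outputs):
    """Zero-pad a list of (steps_i, features) arrays to shape (n, longest, features)."""
    longest = max(o.shape[0] for o in step_outputs)
    padded = [np.pad(o, [(0, longest - o.shape[0]), (0, 0)]) for o in step_outputs]
    return np.stack(padded), longest


steps_at_t = [3, 4, 2]  # number of B steps for each of A's steps
features = 4
ragged = [np.ones((s, features)) for s in steps_at_t]

padded, longest = pad_to_longest(ragged)

# Boolean mask marking real (non-padded) steps, e.g. for masking the loss.
mask = np.arange(longest)[None, :] < np.array(steps_at_t)[:, None]
```

A mask like this lets the loss ignore the zero-padded steps, so padded and ragged targets give the same result.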
A valid concern is: how do we know gradients flow correctly? Note that we've validated them for both vertical and horizontal flow in the linked Q&A, but it didn't cover multiple output steps per input step, for multiple input steps. See below.
Point of uncertainty: I'm not entirely sure whether gradients interact between e.g. Out[0][1] and Out[2][0]. I did, however, verify that gradients will not flow horizontally if we write to separate TensorArrays for B's outputs per A's inputs (case 2); reimplementing for cases 4 & 5, grads will differ for both models, including lower one with a complete single horizontal pass.
Thus we must write to a unified TensorArray. For such, as there are no ops leading from e.g. IR[1] to Out[0][1], I can't see how TF would trace it as such, so it seems we're safe. Note, however, that in the example below, using steps_at_t=[1]*6 will make gradients flow horizontally in both models, as we're writing to a single TensorArray and passing hidden states.
The examined case is confounded, however, with B being stateful at all steps; lifting this requirement, we might not need to write to a unified TensorArray for all Out[0], Out[1], etc, but we must still test against something we know works, which is no longer as straightforward.
Example [code]:
import numpy as np
import tensorflow as tf
#%%# Make data & models, then fit ###########################################
x0 = y0 = tf.constant(np.random.randn(2, 3, 4))
msn = MultiStatefulNetwork(batch_shape=(2, 3, 4), steps_at_t=[3, 4, 2])
with tf.GradientTape(persistent=True) as tape:
    outputs = msn(x0)
    # shape: (3, 4, 2, 4), 0-padded
    # We can pad labels accordingly.
    # Note the (2, 4) model_b's output shape, which is a timestep slice;
    # model_b is a *slice model*. Careful in implementing various logics
    # which are and aren't intended to be stateful.
Not the cleanest, nor most optimal code, but it works; room for improvement.
More importantly: I implemented this in Eager, and have no idea how it'll work in Graph, and making it work for both can be quite tricky. If needed, just run in Graph and compare all values as done in the "cases".
# ideally we won't `import tensorflow` at all; kept for code simplicity
import tensorflow as tf
from tensorflow.python.util import nest
from tensorflow.python.ops import array_ops, tensor_array_ops
from tensorflow.python.framework import ops
from tensorflow.keras.layers import Input, SimpleRNN, SimpleRNNCell
from tensorflow.keras.models import Model
class MultiStatefulNetwork():
    def __init__(self, batch_shape=(2, 6, 4), steps_at_t=[]):
        # stored because _build_models and _forward_pass_b read them
        self.batch_shape = batch_shape
        self.steps_at_t = steps_at_t
        self.batch_size = batch_shape[0]
        self.units = batch_shape[-1]
        self._build_models()

    def __call__(self, inputs):
        outputs = self._forward_pass_a(inputs)
        outputs = self._forward_pass_b(outputs)
        return outputs

    def _forward_pass_a(self, inputs):
        return self.model_a(inputs, training=True)

    def _forward_pass_b(self, inputs):
        return model_rnn_outer(self.model_b, inputs, self.steps_at_t)

    def _build_models(self):
        ipt = Input(batch_shape=self.batch_shape)
        out = SimpleRNN(self.units, return_sequences=True)(ipt)
        self.model_a = Model(ipt, out)

        ipt = Input(batch_shape=(self.batch_size, self.units))
        sipt = Input(batch_shape=(self.batch_size, self.units))
        out, state = SimpleRNNCell(4)(ipt, sipt)
        self.model_b = Model([ipt, sipt], [out, state])

        self.model_a.compile('sgd', 'mse')
        self.model_b.compile('sgd', 'mse')
def inner_pass(model, inputs, states):
return model_rnn(model, inputs, states)
def model_rnn_outer(model, inputs, steps_at_t=[2, 2, 4, 3]):
def outer_step_function(inputs, states):
x, steps = inputs
x = array_ops.expand_dims(x, 0)
x = array_ops.tile(x, [steps, *[1] * (x.ndim - 1)]) # repeat steps times
output, new_states = inner_pass(model, x, states)
return output, new_states
(outer_steps, steps_at_t, longest_step, outer_t, initial_states,
output_ta, input_ta) = _process_args_outer(model, inputs, steps_at_t)
def _outer_step(outer_t, output_ta_t, *states):
current_input = [,]
output, new_states = outer_step_function(current_input, tuple(states))
# pad if shorter than longest_step.
# model_b may output twice, but longest in `steps_at_t` is 4; then we need
# output.shape == (2, *model_b.output_shape) -> (4, *...)
# checking directly on `output` is more reliable than from `steps_at_t`
output = tf.cond(
tf.math.less(output.shape[0], longest_step),
lambda: tf.pad(output, [[0, longest_step - output.shape[0]],
*[[0, 0]] * (output.ndim - 1)]),
lambda: output)
output_ta_t = output_ta_t.write(outer_t, output)
return (outer_t + 1, output_ta_t) + tuple(new_states)
final_outputs = tf.while_loop(
loop_vars=(outer_t, output_ta) + initial_states,
cond=lambda outer_t, *_: tf.math.less(outer_t, outer_steps))
output_ta = final_outputs[1]
outputs = output_ta.stack()
return outputs
def _process_args_outer(model, inputs, steps_at_t):
def swap_batch_timestep(input_t):
# Swap the batch and timestep dim for the incoming tensor.
# (samples, timesteps, channels) -> (timesteps, samples, channels)
# iterating dim0 to feed (samples, channels) slices expected by RNN
axes = list(range(len(input_t.shape)))
axes[0], axes[1] = 1, 0
return array_ops.transpose(input_t, axes)
inputs = nest.map_structure(swap_batch_timestep, inputs)
assert inputs.shape[0] == len(steps_at_t)
outer_steps = array_ops.shape(inputs)[0] # model_a_steps
longest_step = max(steps_at_t)
steps_at_t = tensor_array_ops.TensorArray(
dtype=tf.int32, size=len(steps_at_t)).unstack(steps_at_t)
# assume single-input network, excluding states which are handled separately
input_ta = tensor_array_ops.TensorArray(
# TensorArray is used to write outputs at every timestep, but does not
# support RaggedTensor; thus we must make TensorArray such that column length
# is that of the longest outer step, # and pad model_b's outputs accordingly
element_shape = tf.TensorShape((longest_step, *model.output_shape[0]))
# overall shape: (outer_steps, longest_step, *model_b.output_shape)
# for every input / at each step we write in dim0 (outer_steps)
output_ta = tensor_array_ops.TensorArray(
outer_t = tf.constant(0, dtype='int32')
initial_states = (tf.zeros(model.input_shape[0], dtype='float32'),)
return (outer_steps, steps_at_t, longest_step, outer_t, initial_states,
output_ta, input_ta)
def model_rnn(model, inputs, states):
def step_function(inputs, states):
output, new_states = model([inputs, *states], training=True)
return output, new_states
initial_states = states
input_ta, output_ta, time, time_steps_t = _process_args(model, inputs)
def _step(time, output_ta_t, *states):
current_input =
output, new_states = step_function(current_input, tuple(states))
flat_state = nest.flatten(states)
flat_new_state = nest.flatten(new_states)
for state, new_state in zip(flat_state, flat_new_state):
if isinstance(new_state, ops.Tensor):
output_ta_t = output_ta_t.write(time, output)
new_states = nest.pack_sequence_as(initial_states, flat_new_state)
return (time + 1, output_ta_t) + tuple(new_states)
final_outputs = tf.while_loop(
loop_vars=(time, output_ta) + tuple(initial_states),
cond=lambda time, *_: tf.math.less(time, time_steps_t))
new_states = final_outputs[2:]
output_ta = final_outputs[1]
outputs = output_ta.stack()
return outputs, new_states
def _process_args(model, inputs):
time_steps_t = tf.constant(inputs.shape[0], dtype='int32')
# assume single-input network (excluding states)
input_ta = tensor_array_ops.TensorArray(
# assume single-output network (excluding states)
output_ta = tensor_array_ops.TensorArray(
time = tf.constant(0, dtype='int32', name='time')
return input_ta, output_ta, time, time_steps_t


Stateful LSTM in custom training loop

I am writing a simple custom model + training in tensorflow. My goal is to build a stateful LSTM based model and being able to reset the states when I want to.
So far this is my custom model:
class ResNetModel(Model):
    def __init__(self, num_inputs, **kwargs):
        """
        The class initialiser should call the base class initialiser, passing any keyword
        arguments along. It should also create the layers of the network according to the
        above specification.
        """
        super(ResNetModel, self).__init__(**kwargs)
        self.lstm_1 = tf.keras.layers.LSTM(units=32, input_shape=(None, num_inputs), return_sequences=True)
        self.dense = tf.keras.layers.Dense(units=1, activation=None)

    def call(self, inputs, training=False):
        """
        This method should contain the code for calling the layer according to the above
        specification, using the layer objects set up in the initialiser.
        """
        x = self.lstm_1(inputs)
        y = self.dense(x)
        return y + inputs
And this is my custom training loop (I am omitting the whole code because it is quite big, but the function is self contained for the purpose of my question):
def run_training(self, in_train, out_train, epoch_loss, epoch_error, n_skip, n_block):
    n_samples = in_train.shape[1]
    self.model.reset_states()  # clear existing state
    self.model(in_train[:, :n_skip, :])  # process some samples to build up state
    for n in range(n_skip, n_samples - n_block, n_block):
        # compute loss
        with tf.GradientTape() as tape:
            y_pred = self.model(in_train[:, n:n + n_block, :])
            loss = self.loss_func(out_train[:, n:n + n_block, :], y_pred)
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.opt.apply_gradients(zip(grads, self.model.trainable_variables))
        epoch_error.update_state(out_train[:, n:n + n_block, :], y_pred)
And it trains fine, the whole code works as expected.
Then I make predictions like this:
for i in range(0, math.floor(24000/4096)):
    predictions[i*4096: (i+1)*4096] = np.array(residual_net.model(X_test[idx][i*4096: (i+1)*4096].reshape(1, 4096, 1))).ravel()
So basically I am passing my test input to my model in residual_net.model(my_test_data) (the numpy slicing etc. is just to make my input data consistent with the network; it works fine).
However, when I make predictions with my trained network (to give some context, it is working with audio data), the output audio is as expected (an input song processed by the network, which adds some distortion), but there are clicks in the output audio that are directly related to the input buffer size.
To make this point clearer: if I predict on chunks of 512 samples, I get clicks every 512 samples; if I predict on chunks of 4096 samples, I get clicks every 4096.
This behaviour is pretty similar to the one you get with IIR filters that do not carry their filter state across audio buffers, so that got me thinking that my LSTM network is not working statefully as I expected.
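The IIR analogy can be reproduced in a few lines of plain Python. This is only an illustration of the symptom, using a hypothetical one-pole low-pass filter (`one_pole_lowpass`) rather than the LSTM: processing the signal in buffers matches the unbuffered result only when the state is carried from one buffer to the next.

```python
def one_pole_lowpass(samples, alpha, state=0.0):
    """y[n] = alpha * x[n] + (1 - alpha) * y[n-1]; returns (outputs, final_state)."""
    outputs = []
    for x in samples:
        state = alpha * x + (1.0 - alpha) * state
        outputs.append(state)
    return outputs, state


signal = [1.0] * 8
alpha = 0.5
buffer_size = 4

# Reference: the whole signal processed in one go.
reference, _ = one_pole_lowpass(signal, alpha)

# Buffered, carrying the state across buffers.
carried, state = [], 0.0
for start in range(0, len(signal), buffer_size):
    out, state = one_pole_lowpass(signal[start:start + buffer_size], alpha, state)
    carried.extend(out)

# Buffered, resetting the state at every buffer: a jump appears at each boundary.
reset = []
for start in range(0, len(signal), buffer_size):
    out, _ = one_pole_lowpass(signal[start:start + buffer_size], alpha)
    reset.extend(out)
```

With the state reset, the output drops back at sample 4 (the buffer boundary), which is the kind of discontinuity that is audible as a click.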
So my question is:
Does TensorFlow automatically reset the state of the network after each processed buffer (even in the case of custom training/prediction loops) if the parameter stateful=True is not specified in the LSTM layer?
I found no information about this, but I expected that behaviour for "standard" training (.fit/.predict functions) and not for custom training loops.
Does this hold also for the training step? (so basically I am messing up also the training)

LSTM Autoencoder problems

Autoencoder underfits timeseries reconstruction and just predicts average value.
Question Set-up:
Here is a summary of my attempt at a sequence-to-sequence autoencoder. This image was taken from this paper:
Encoder: Standard LSTM layer. Input sequence is encoded in the final hidden state.
Decoder: LSTM Cell (I think!). Reconstruct the sequence one element at a time, starting with the last element x[N].
Decoder algorithm is as follows for a sequence of length N:
Get Decoder initial hidden state hs[N]: Just use encoder final hidden state.
Reconstruct last element in the sequence: x[N] = w.dot(hs[N]) + b.
Same pattern for other elements: x[i] = w.dot(hs[i]) + b
use x[i] and hs[i] as inputs to LSTMCell to get x[i-1] and hs[i-1]
Minimum Working Example:
Here is my implementation, starting with the encoder:
class SeqEncoderLSTM(nn.Module):
def __init__(self, n_features, latent_size):
super(SeqEncoderLSTM, self).__init__()
self.lstm = nn.LSTM(
def forward(self, x):
_, hs = self.lstm(x)
return hs
Decoder class:
class SeqDecoderLSTM(nn.Module):
    def __init__(self, emb_size, n_features):
        super(SeqDecoderLSTM, self).__init__()
        self.cell = nn.LSTMCell(n_features, emb_size)
        self.dense = nn.Linear(emb_size, n_features)

    def forward(self, hs_0, seq_len):
        x = torch.tensor([])
        # Final hidden and cell state from encoder
        hs_i, cs_i = hs_0
        # reconstruct first element with encoder output
        x_i = self.dense(hs_i)
        x = torch.cat([x, x_i])
        # reconstruct remaining elements
        for i in range(1, seq_len):
            hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
            x_i = self.dense(hs_i)
            x = torch.cat([x, x_i])
        return x
Bringing the two together:
class LSTMEncoderDecoder(nn.Module):
    def __init__(self, n_features, emb_size):
        super(LSTMEncoderDecoder, self).__init__()
        self.n_features = n_features
        self.hidden_size = emb_size
        self.encoder = SeqEncoderLSTM(n_features, emb_size)
        self.decoder = SeqDecoderLSTM(emb_size, n_features)

    def forward(self, x):
        seq_len = x.shape[1]
        hs = self.encoder(x)
        hs = tuple([h.squeeze(0) for h in hs])
        out = self.decoder(hs, seq_len)
        return out.unsqueeze(0)
And here's my training function:
def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-6, reverse=False):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Training model on {device}')
    model = model.to(device)
    opt = optimizer(model.parameters(), lr)
    train_loss = []
    valid_loss = []
    for e in tqdm(range(epochs)):
        running_tl = 0
        running_vl = 0
        for x in trainload:
            x = x.to(device)
            opt.zero_grad()
            x_hat = model(x)
            if reverse:
                x = torch.flip(x, [1])
            loss = criterion(x_hat, x)
            loss.backward()
            opt.step()
            running_tl += loss.item()
        if testload is not None:
            with torch.no_grad():
                for x in testload:
                    x = x.to(device)
                    loss = criterion(model(x), x)
                    running_vl += loss.item()
                valid_loss.append(running_vl / len(testload))
        train_loss.append(running_tl / len(trainload))
    return train_loss, valid_loss
Data:
Large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and only contains 5 variables. Here is the sequence I'm trying to overfit:
tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)
And here is the custom Dataset class:
class TimeseriesDataSet(Dataset):
    def __init__(self, data, window, n_features, overlap=0):
        if isinstance(data, (np.ndarray)):
            data = torch.tensor(data)
        elif isinstance(data, (pd.Series, pd.DataFrame)):
            data = torch.tensor(data.copy().to_numpy())
        else:
            raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")
        self.n_features = n_features
        self.seqs = torch.split(data, window)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        try:
            return self.seqs[idx].view(-1, self.n_features)
        except TypeError:
            raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")
The model only learns the average, no matter how complex I make the model or how long I train it.
My research:
This problem is identical to the one discussed in this question: LSTM autoencoder always returns the average of the input sequence
The problem in that case ended up being that the objective function was averaging the target timeseries before calculating loss. This was due to some broadcasting errors because the author didn't have the right sized inputs to the objective function.
In my case, I do not see this being the issue. I have checked and double checked that all of my dimensions/sizes line up. I am at a loss.
Other Things I've Tried
I've tried this with varied sequence lengths from 7 timesteps to 100 time steps.
I've tried with varied number of variables in the time series. I've tried with univariate all the way to all 274 variables that the data contains.
I've tried with various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
The paper calls for reconstructing the sequence in reverse order (see graphic above). I have tried this method using the flipud on the original input (after training but before calculating loss). This makes no difference.
I tried making the model more complex by adding an extra LSTM layer in the encoder.
I've tried playing with the latent space. I've tried from 50% of the input number of features to 150%.
I've tried overfitting a single sequence (provided in the Data section above).
What is causing my model to predict the average and how do I fix it?
Okay, after some debugging I think I know the reasons.
You try to predict next timestep value instead of difference between current timestep and the previous one
Your hidden_features number is too small making the model unable to fit even a single sample
Code used
Let's start with the code (model is the same):
import torch
import seaborn as sns
import matplotlib.pyplot as plt

def get_data(subtract: bool = False):
    # (1, 14, 5)
    input_tensor = torch.tensor(
        [
            [0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
            [0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
            [0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
            [0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
            [0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
            [0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
            [0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
            [0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
            [0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
            [0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
            [0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
            [0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
            [0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
            [0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
        ]
    ).unsqueeze(0)
    if subtract:
        # clone: a plain slice is a view and would be overwritten by the in-place subtraction
        initial_values = input_tensor[:, 0, :].clone()
        input_tensor -= torch.roll(input_tensor, 1, 1)
        input_tensor[:, 0, :] = initial_values
    return input_tensor

if __name__ == "__main__":
    # HIDDEN_SIZE and SUBTRACT are the two settings varied in the runs below
    input_tensor = get_data(SUBTRACT)
    model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.MSELoss()
    for i in range(1000):
        outputs = model(input_tensor)
        loss = criterion(outputs, input_tensor)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"{i}: {loss}")
        if loss < 1e-4:
            break
    # Plotting
What it does:
get_data either works on the data you provided if subtract=False or (if subtract=True) it subtracts the value of the previous timestep from the current timestep
The rest of the code optimizes the model until a loss of 1e-4 is reached (so we can compare how the model's capacity and its increase help, and what happens when we use the difference of timesteps instead of the timesteps themselves)
We will only vary HIDDEN_SIZE and SUBTRACT parameters!
In this case we get a straight line. Model is unable to fit and grasp the phenomena presented in the data (hence flat lines you mentioned).
1000 iterations limit reached
Targets are now far from flat lines, but model is unable to fit due to too small capacity.
1000 iterations limit reached
It got a lot better and our target was hit after 942 steps. No more flat lines, model capacity seems quite fine (for this single example!)
Although the graph does not look that pretty, we got to desired loss after only 215 iterations.
Usually, use the difference of timesteps instead of the timesteps themselves (or some other transformation, see here for more info about that). Otherwise, the neural network will try to simply... copy the output from the previous step (as that's the easiest thing to do). Some minima will be found this way, and getting out of them will require more capacity.
When you use the difference between timesteps, there is no way to "extrapolate" the trend from the previous timestep; the neural network has to learn how the function actually varies.
Use a larger model (for the whole dataset you should try something like 300, I think), but you can simply tune that one.
Don't use flipud. Use bidirectional LSTMs; this way you can get info from the forward and backward pass of the LSTM (not to be confused with backprop!). This should also boost your score.
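As a rough sketch of the differencing transform, and of how to get the original values back from a reconstruction, here is a NumPy version. The helper names `to_differences` and `from_differences` are hypothetical, and the two-feature series is just the first three rows of the data above, truncated for brevity.

```python
import numpy as np


def to_differences(series):
    """series: (timesteps, features). Row 0 keeps the initial values; row i holds x[i] - x[i-1]."""
    diffs = series.copy()
    diffs[1:] = series[1:] - series[:-1]
    return diffs


def from_differences(diffs):
    """Invert to_differences with a cumulative sum along the time axis."""
    return np.cumsum(diffs, axis=0)


series = np.array([[0.5122, 0.0360],
                   [0.5177, 0.0833],
                   [0.4643, 0.0364]])

diffs = to_differences(series)
restored = from_differences(diffs)
```

Because the first row keeps the initial values, the cumulative sum recovers the original series, so the model can be trained on `diffs` and its output mapped back afterwards.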
Okay, question 1: You are saying that for variable x in the time
series, I should train the model to learn x[i] - x[i-1] rather than
the value of x[i]? Am I correctly interpreting?
Yes, exactly. Differencing removes the urge of the neural network to base its predictions too much on the past timestep (by simply taking the last value and maybe changing it a little)
Question 2: You said my calculations for zero bottleneck were
incorrect. But, for example, let's say I'm using a simple dense
network as an auto encoder. Getting the right bottleneck indeed
depends on the data. But if you make the bottleneck the same size as
the input, you get the identity function.
Yes, assuming that there is no non-linearity involved, which makes things harder (see here for a similar case). In the case of LSTMs there are non-linearities; that's one point.
Another one is that we are accumulating timesteps into single encoder state. So essentially we would have to accumulate timesteps identities into a single hidden and cell states which is highly unlikely.
One last point, depending on the length of sequence, LSTMs are prone to forgetting some of the least relevant information (that's what they were designed to do, not only to remember everything), hence even more unlikely.
Is num_features * num_timesteps not a bottle neck of the same size as
the input, and therefore shouldn't it facilitate the model learning
the identity?
It is, but it assumes you have num_timesteps for each data point, which is rarely the case, might be here. About the identity and why it is hard to do with non-linearities for the network it was answered above.
One last point, about identity functions; if they were actually easy to learn, ResNets architectures would be unlikely to succeed. Network could converge to identity and make "small fixes" to the output without it, which is not the case.
I'm curious about the statement : "always use difference of timesteps
instead of timesteps" It seem to have some normalizing effect by
bringing all the features closer together but I don't understand why
this is key ? Having a larger model seemed to be the solution and the
substract is just helping.
Key here was, indeed, increasing model capacity. Subtraction trick depends on the data really. Let's imagine an extreme situation:
We have 100 timesteps, single feature
Initial timestep value is 1000
Other timestep values vary by 1 at most
What would the neural network do (what is easiest here)? It would probably discard these changes of 1 or less as noise and just predict 1000 for all of them (especially if some regularization is in place), as being off by 1/1000 is not much.
What if we subtract? The whole neural network loss is in the [0, 1] margin for each timestep instead of [0, 1001], hence it is more severe to be wrong.
And yes, it is connected to normalization in some sense come to think about it.
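A small numeric sketch of that extreme situation, in plain Python with made-up step values: a "lazy" constant prediction looks almost perfect on the raw values, relative to their scale, but the same laziness on the differences is as wrong as it can be.

```python
# Hypothetical series: starts at a large value (1000 here), each step changes by at most 1.
steps = [1, -1, 1, 1, -1, 0, 1, -1, -1]
series = [1000.0]
for s in steps:
    series.append(series[-1] + s)


def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Lazy baseline on raw values: always predict the initial value.
raw_error = mse([1000.0] * len(series), series)
raw_scale = sum(v * v for v in series) / len(series)  # mean squared magnitude of the targets

# Same laziness on the differences: always predict "no change".
diff_error = mse([0.0] * len(steps), steps)
diff_scale = sum(s * s for s in steps) / len(steps)

raw_relative = raw_error / raw_scale     # tiny: the lazy answer looks fine
diff_relative = diff_error / diff_scale  # 1.0: the lazy answer explains nothing
```

This is the sense in which subtraction acts like a normalization: the error is measured against the scale of the changes rather than the scale of the offset.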

Counting with Keras

What I'm trying to do
I'm currently making a very simple sequence-to-sequence LSTM using Keras with a minor twist: earlier predictions in the sequence should count against the loss less than later ones. The way I'm trying to do this is by counting the sequence number and multiplying by the square root of this count. (I want to do this because this value is representative of the relative ratio of uncertainty in a Poisson process based on the number of samples collected. My network is gathering data and attempting to estimate an invariant value based on the data gathered so far.)
How I'm trying to do it
I've implemented both a custom loss function and a custom layer.
Loss function:
def loss_function(y_true, y_pred):
    # extract_output essentially concatenates the first three regression outputs of y
    # into a list representing an [x, y, z] vector, and returns it along with the rest as a tuple
    r, e, n = extract_output(y_true)
    r_h, e_h, n_h = extract_output(y_pred)
    # Hyperparameters
    dir_loss_weight = 10
    dist_loss_weight = 1
    energy_loss_weight = 3
    norm_r = sqrt(dot(r, r))
    norm_r_h = sqrt(dot(r_h, r_h))
    dir_loss = mean_squared_error(r/norm_r, r_h/norm_r_h)
    dist_loss = mean_squared_error(norm_r, norm_r_h)
    energy_loss = mean_squared_error(e, e_h)
    return sqrt(n) * (dir_loss_weight * dir_loss + dist_loss_weight * dist_loss + energy_loss_weight * energy_loss)
Custom Layer:
class CounterLayer(Layer):
    def __init__(self, **kwargs):
        super(CounterLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.sequence_number = 0

    def call(self, x):
        self.sequence_number += 1
        return [self.sequence_number]

    def compute_output_shape(self, input_shape):
        return (1,)
I then added the input as a concatenation to the regular output:
seq_num = CounterLayer()(inputs)
outputs = concatenate([out, seq_num])
What's going wrong
My error is:
Traceback (most recent call last):
File "", line 119, in <module>
File "", line 115, in main
model = create_model()
File "", line 74, in create_model
seq_num = CounterLayer()(inputs)
File "/usr/lib/python3.7/site-packages/keras/engine/", line 497, in __call__
File "/usr/lib/python3.7/site-packages/keras/engine/", line 565, in _add_inbound_node
output_tensors[i]._keras_shape = output_shapes[i]
AttributeError: 'int' object has no attribute '_keras_shape'
I'm assuming I have the shape wrong. But I do not know how. Does anyone know if I'm going about this in the wrong way? What should I do to make this happen?
Further Adventures
Per @Mohammad Jafar Mashhadi's comment, my call return needed to be wrapped in a keras.backend.variable; however, per his linked answer, my approach will not work, because call is not called multiple times, as I initially assumed it was.
How can I get a counter for the RNN?
For clarity, if the RNN given input xi outputs yi, I'm trying to get i as part of my output.
x1 -> RNN -> (y1, 1)
h1 |
x2 -> RNN -> (y2, 2)
h2 |
x3 -> RNN -> (y3, 3)
h3 |
x4 -> RNN -> (y4, 4)
The error is saying that inputs, in the line seq_num = CounterLayer()(inputs), is an integer.
You can't pass integers as inputs to layers. You must pass keras tensors, and only keras tensors.
Second, this will not work because Keras works in a static graph style. A call in a layer doesn't calculate things, it only builds the graph of empty tensors. Only tensors will ever get updated as you pass data to them, integer values will not. When you say self.sequence_number += 1, it will be called only when building the model, it will not be called over and over.
We need details
We can't really understand what is going on if you don't give us enough information, such as:
the model
the summary
your input data shapes
the target data shapes
the custom functions
If the interpretation below is correct, the model's output shape in the summary and the target data shapes as you pass it to fit are absolutely important to know.
Proposed solution
If I understood what you described, you want to have a sequence of increasing integers along with the time steps of your sequences, so these numbers are used in your loss function.
If this interpretation is right, you don't need to keep updating numbers, you just need to create a range tensor and that's it.
So, inside your loss (which I don't understand unless you provide us with the custom functions) you should create this tensor:
def custom_loss(y_true, y_pred):
    # use this if you defined your model with a static sequence length - input_shape=(length, features)
    length = K.int_shape(y_pred)[1]
    # use this instead if you defined your model with a dynamic sequence length - input_shape=(None, features)
    length = K.shape(y_pred)[1]
    # this is the sequence vector (cast to float so the square root is defined):
    seq = K.cast(tf.range(1, length + 1), K.floatx())
    # you can get the root with
    sec = K.sqrt(seq)
    # you reshape to match the shape of the loss, which is probably (batch, length)
    sec = K.reshape(sec, (1, -1))  # shape (1, length)
    # compute your loss normally, taking care to reduce the last axis and keep the first two
    loss = .....  # shape (batch, length)
    # multiply the loss by the weights
    return loss * sec
You must work with everything as a whole tensor! You cannot interpret it step by step. You must do everything while keeping the first and second dimensions in your loss.
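To illustrate the "whole tensor" idea without Keras, here is a NumPy sketch of the same weighting. The per-step mean squared error is only an assumed placeholder for whatever loss you actually compute, and `sqrt_weighted_loss` is a hypothetical name.

```python
import numpy as np


def sqrt_weighted_loss(y_true, y_pred):
    """y_true, y_pred: (batch, length, features) -> weighted per-step loss of shape (batch, length)."""
    length = y_pred.shape[1]
    seq = np.arange(1, length + 1, dtype=np.float64)   # 1, 2, ..., length
    weights = np.sqrt(seq).reshape(1, -1)              # shape (1, length), broadcasts over the batch
    per_step = np.mean((y_true - y_pred) ** 2, axis=-1)  # reduce only the feature axis
    return per_step * weights


y_true = np.zeros((2, 4, 3))
y_pred = np.ones((2, 4, 3))
loss = sqrt_weighted_loss(y_true, y_pred)
```

Every step here has an unweighted error of 1, so the result is just the weights themselves: the last step counts twice as much as the first for a sequence of length 4.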
I'm not sure I have understood the question completely, but based on the final diagram, I think that to get an extra feature (the time step) fed into the loss function along with the predictions, you might try the second approach suggested in this other accepted answer:
Custom loss function in Keras based on the input data
The idea is to expand the label vector with the extra feature, and then separate them again inside the loss function.
# y_true_plus_timesteps has shape [n_training_instances, 2]
def custom_loss(y_true_plus_timesteps, y_pred):
    # labels are stored in the first column, time steps in the second
    y_true = y_true_plus_timesteps[:, 0]
    time_steps = y_true_plus_timesteps[:, 1]
    return K.mean(K.square(y_pred - y_true), axis=-1)  # + your loss function, which can use time_steps
# note that the labels are fed into the model together with the time steps, i.e. the target passed when fitting is np.append(Y_true, time_steps, axis=1), along with batch_size=batch_size, epochs=90, shuffle=True, verbose=1
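The packing and unpacking can be checked with plain NumPy. This is only a sketch with toy values; the prediction shape (n, 1) is an assumption (typical for a single-unit output layer). It also shows why slicing with 0:1 instead of 0 may be safer: it keeps the column axis, so the subtraction stays elementwise.

```python
import numpy as np

n = 5
labels = np.arange(n, dtype=np.float32).reshape(n, 1)             # shape (n, 1)
time_steps = np.arange(1, n + 1, dtype=np.float32).reshape(n, 1)  # shape (n, 1)

# Pack: this combined array is what would be passed as the target.
y_true_plus_timesteps = np.append(labels, time_steps, axis=1)     # shape (n, 2)

# Unpack inside the loss; 0:1 and 1:2 keep a trailing axis of size 1.
y_true = y_true_plus_timesteps[:, 0:1]
steps = y_true_plus_timesteps[:, 1:2]

# Assumed prediction shape (n, 1).
y_pred = np.zeros((n, 1), dtype=np.float32)
elementwise = y_pred - y_true                       # shape (n, 1)
broadcasted = y_pred - y_true_plus_timesteps[:, 0]  # shape (n, n), usually not intended
```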

tensorflow 2 keras shuffle each row gradient problem

I need a NN that gives the same output for any permutation of the same input. I searched for a solution ('permutation invariance') and found some layers, but failed to make them work.
I chose a different approach: I want to create a layer, added as the first one in the model, which randomly shuffles the input (each row independently). Please let's follow this approach; I know it can be done outside the model, but I want it as part of the model. I tried:
class ShuffleLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(ShuffleLayer, self).__init__(**kwargs)

    def call(self, inputs):
        batchSize = tf.shape(inputs)[0]
        cols = tf.shape(inputs)[-1]
        order0 = tf.tile(tf.expand_dims(tf.range(0, batchSize), -1), [1, cols])
        order1 = tf.argsort(tf.random.uniform(shape=(batchSize, cols)))
        indices = tf.stack([tf.reshape(order0, [-1]), tf.reshape(order1, [-1])], axis=-1)
        outputs = tf.reshape(tf.gather_nd(inputs, indices), [batchSize, cols])
        return outputs
I am getting following error:
ValueError: Variable has None for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without
gradient: K.argmax, K.round, K.eval.
How can I avoid it? I tried to use tf.stop_gradient, but without success.
Use Lambda layers:
First of all, if your layer doesn't have trainable weights, you should use a Lambda layer, not a custom layer. It's way simpler and easier.
def shuffleColumns(inputs):
    batchSize = tf.shape(inputs)[0]
    cols = tf.shape(inputs)[-1]
    order0 = tf.tile(tf.expand_dims(tf.range(0, batchSize), -1), [1, cols])
    order1 = tf.argsort(tf.random.uniform(shape=(batchSize, cols)))
    indices = tf.stack([tf.reshape(order0, [-1]), tf.reshape(order1, [-1])], axis=-1)
    outputs = tf.reshape(tf.gather_nd(inputs, indices), [batchSize, cols])
    return outputs
In the model, use a Lambda(shuffleColumns) layer.
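For intuition about what the index construction does, here is a NumPy mirror of the same idea (argsort of uniform noise gives an independent random column order per row). It is only a sketch on a toy array, not a replacement for the TensorFlow function.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = np.arange(12).reshape(3, 4)
batch_size, cols = inputs.shape

# Row index repeated across the columns: [[0,0,0,0],[1,1,1,1],[2,2,2,2]]
order0 = np.tile(np.arange(batch_size)[:, None], (1, cols))
# Independent random column order for every row.
order1 = np.argsort(rng.uniform(size=(batch_size, cols)), axis=-1)

# Pairing (row, column) indices gathers each row in its own shuffled order.
shuffled = inputs[order0, order1]
```

Each row of shuffled contains exactly the values of the matching row of inputs, only in a different order.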
About the error
If this is the first layer, this error is probably not caused by this layer. (Unless newer versions of Tensorflow are demanding that custom layers have weights and def build(self, input_shape): defined, which doesn't seem very logical).
It seems you are doing something else in another place. The error is: you are using some operation that blocks backpropagation because it's impossible to have the derivative of that operation.
Since the derivatives are taken with respect to the model's "weights", this means that the operation is necessarily after the first weight tensor in the model (ie: after the first layer that contains trainable weights).
You need to search for anything in your model that doesn't have derivatives, like the error suggests: round, argmax, conditionals that return constants, losses that return sorted y_true but don't return operations on y_pred, etc.
Of course, K.stop_gradient is also an operation that blocks backpropagation and will certainly cause this error if you just use it like that. (This may even be the "cause" of your problem, not the solution.)
Below there are easier suggestions for your operation, but none of them will fix this error because this error is somewhere else.
Suggested operation 1
Now, it would be way easier to use tf.random.shuffle for this:
def shuffleColumns(x):
    x = tf.transpose(x)
    x = tf.random.shuffle(x)
    return tf.transpose(x)
Use a Lambda(shuffleColumns) layer in your model. It's true that this will shuffle all columns equally, but every batch will have a different permutation. And since you're going to have many epochs, and you will be shuffling (I presume) samples between each epoch (this is automatic in fit), you will hardly ever have repeated batches. So:
each batch will have a different permutation
it will be almost impossible to have the same batch two times
This approach will probably be way faster than yours.
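The "all columns shuffled equally" behaviour can be seen in a NumPy sketch of the same transpose, shuffle, transpose pattern. The toy array is chosen so that the column order of each row is easy to read off.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([[0, 1, 2, 3],
              [10, 11, 12, 13],
              [20, 21, 22, 23]])

# Shuffling the first axis of the transpose moves whole columns of x.
shuffled = rng.permutation(x.T).T

# Subtract each row's offset (0, 10, 20) to recover the column order per row.
col_order = shuffled - x[:, :1]
```

Every row of col_order is identical, which is the trade-off mentioned above: one permutation per batch rather than one per row.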
Suggested operation 2
If you want them permutation invariant, why not use tf.sort instead of permutations? Sort the columns and, instead of having infinite permutations to train, you simply eliminate any possibility of permutation. The model should learn faster, and yet the order of the columns in your input will not be taken into account.
Use the layer Lambda(lambda x: tf.sort(x, axis=-1))
This suggestion must be used both in training and inference.
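A quick NumPy check of why sorting removes the dependence on input order. The values are arbitrary toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
row = np.array([3.0, 1.0, 4.0, 1.5, 9.0])

# Two different orderings of the same values.
a = rng.permutation(row)
b = rng.permutation(row)

# After sorting along the last axis both orderings become the same vector,
# so whatever comes after the sort sees identical input.
sorted_a = np.sort(a, axis=-1)
sorted_b = np.sort(b, axis=-1)
```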

How to use Tensorflow's batch_sequences_with_states utility

I am trying to build a generative RNN using Tensorflow. I have a preprocessed dataset which is a list of sequence_length x 2048 x 2 numpy arrays. The sequences have different lengths. I have been looking through examples and documentation but I really couldn't understand, for example, what key is, or how I should create the input_sequences dictionary, etc.
So how should one format a list of numpy arrays, each of which represents a sequence of rank n (2 in this case) tensors, in order to be able to use this batch_sequences_with_states method?
Toy Implementations
I tried this and am glad to share my findings. This is a toy example: I set out to create something that works and to observe how the output varies. In particular, I used an LSTM as the case study; in your case you can define a conv net instead. Feel free to add more inputs, adjust as usual, and follow the docs.
I tried other, more subtle examples, but I keep this simple version to show how the operation can be useful. In particular, add more elements to the dictionaries (input sequences and context) and observe the changes.
Two Approaches
Basically I will use two approaches:
tf.contrib.training.batch_sequences_with_states( )
tf.train.batch( )
I will start with the first one because it is directly helpful; then I will show how to solve a similar problem with train.batch.
I will basically generate toy numpy arrays and tensors and use them for testing the operations.
import tensorflow as tf
batch_size = 32
num_unroll = 20
num_enqueue_threads = 20
lstm_size = 8
cell = tf.contrib.rnn.BasicLSTMCell(num_units=lstm_size)
# State size
state_size = cell.state_size[0]
# Initial states
initial_state_values = tf.zeros((state_size,), dtype=tf.float32)
initial_states = {"lstm_state": initial_state_values}
# Key should be string
#I used x as input sequence and y as input context. So that the
# keys should be 2.
key = ["1","2"]
#Toy data for our sample
x = tf.range(0, 12, name="x")
y = tf.range(12,24,name="y")
# Convert to float
# I converted to float so as not to raise a type mismatch error
x = tf.to_float(x)
y = tf.to_float(y)
#the input sequence as dictionary
#This is needed according to the tensorflow doc
sequences = {"x": x }
#Context Input
context = {"batch1": y}
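# Train batch with sequence state
batch_new = tf.contrib.training.batch_sequences_with_states(
    input_key=key,
    input_sequences=sequences,
    input_context=context,
    initial_states=initial_states,
    num_unroll=num_unroll,
    batch_size=batch_size,
    num_threads=num_enqueue_threads,
    input_length=None,
    pad=True,
    capacity=batch_size * num_enqueue_threads * 2)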
# To test what we have got, type the following and observe the output.
# In short, once in an IPython notebook,
# type batch_new.[press tab] to see all options
# Splitting of the input. This generates the input per epoch
inputs_by_time = tf.split(inputs, num_unroll)
assert len(inputs_by_time) == num_unroll
# Get lstm or conv net output
lstm_output, _ = tf.contrib.rnn.static_state_saving_rnn(
Create Graph and Queue as Usual
The parts marked with #* can be further adapted to suit your requirements.
# Create the graph, etc.
init_op = tf.global_variables_initializer()
#Create a session for running operations in the Graph.
sess = tf.Session()
# Initialize the variables (like the epoch counter).
sess.run(init_op)
# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
# For the part below, uncomment
#* the lines marked with asterisks to do other operations
#* try:
#*     while not coord.should_stop():
#*         # Run training steps or whatever
#*         # uncomment to run other ops
#* except tf.errors.OutOfRangeError:
#*     print('Done training -- epoch limit reached')
# When done, ask the threads to stop.
coord.request_stop()
# Wait for threads to finish.
coord.join(threads)
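As a rough mental model only (this is not the TensorFlow implementation), the utility can be thought of as cutting each variable-length sequence into segments of num_unroll steps, padding the last one, and keeping every segment tied to its sequence through the key. The NumPy sketch below illustrates that idea; split_into_segments, the key strings and the toy shapes are all hypothetical.

```python
import numpy as np


def split_into_segments(key, sequence, num_unroll):
    """Split one sequence of shape (length, ...) into zero-padded segments of num_unroll steps."""
    length = sequence.shape[0]
    num_segments = -(-length // num_unroll)  # ceiling division
    padded_length = num_segments * num_unroll
    # Pad only the time axis, leave the feature axes untouched.
    pad_width = [(0, padded_length - length)] + [(0, 0)] * (sequence.ndim - 1)
    padded = np.pad(sequence, pad_width)
    # Each segment remembers which sequence (key) and which position it came from.
    return [
        (key, i, padded[i * num_unroll:(i + 1) * num_unroll])
        for i in range(num_segments)
    ]


# Two toy sequences of different lengths; each time step is a (4, 2) array.
sequences = {"seq_a": np.ones((45, 4, 2)), "seq_b": np.ones((20, 4, 2))}
segments = []
for key, seq in sequences.items():
    segments.extend(split_into_segments(key, seq, num_unroll=20))
```

In this picture, a list of numpy arrays would be fed one sequence at a time, each with its own unique string key, and the batching then works on the fixed-size segments.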
Second Approach
You can also use train.batch in a very interesting way:
import tensorflow as tf
#[0, 1, 2, 3, 4 ,...]
x = tf.range(0, 11, name="x")
# A queue that outputs 0,1,2,3,..
# slice end is useful for dequeuing
slice_end = 10
# instantiate variable y
y = tf.slice(x, [0], [slice_end], name="y")
# Reshape y
y = tf.reshape(y,[10,1])
y=tf.to_float(y, name='ToFloat')
Note the use of dynamic and enqueue many with padding. Feel free to play with both options. And compare output!
batched_data = tf.train.batch(
batch_size = 128
lstm_cell = tf.contrib.rnn.LSTMCell(batch_size, forget_bias=1, state_is_tuple=True)
val, state = tf.nn.dynamic_rnn(lstm_cell, batched_data, dtype=tf.float32)
The aim is to show that simple examples can give insight into the details of the operations. You can adapt this to a convolutional net in your case.
Hope this helps!

