What I'm trying to do
I'm currently building a very simple sequence-to-sequence LSTM in Keras with a minor twist: earlier predictions in the sequence should count against the loss less than later ones. The way I'm trying to do this is by counting the sequence number and multiplying by the square root of that count. (I want to do this because this value represents the relative uncertainty of a Poisson process, given the number of samples collected. My network is gathering data and attempting to estimate an invariant value from the data gathered so far.)
How I'm trying to do it
I've implemented both a custom loss function and a custom layer.
Loss function:
def loss_function(y_true, y_pred):
    # extract_output essentially concatenates the first three regression outputs of y
    # into a list representing an [x, y, z] vector, and returns it along with the rest as a tuple
    r, e, n = extract_output(y_true)
    r_h, e_h, n_h = extract_output(y_pred)

    # Hyperparameters
    dir_loss_weight = 10
    dist_loss_weight = 1
    energy_loss_weight = 3

    norm_r = sqrt(dot(r, r))
    norm_r_h = sqrt(dot(r_h, r_h))

    dir_loss = mean_squared_error(r/norm_r, r_h/norm_r_h)
    dist_loss = mean_squared_error(norm_r, norm_r_h)
    energy_loss = mean_squared_error(e, e_h)

    return sqrt(n) * (dir_loss_weight * dir_loss + dist_loss_weight * dist_loss + energy_loss_weight * energy_loss)
Custom Layer:
class CounterLayer(Layer):
    def __init__(self, **kwargs):
        super(CounterLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.sequence_number = 0
        pass

    def call(self, x):
        self.sequence_number += 1
        return [self.sequence_number]

    def compute_output_shape(self, input_shape):
        return (1,)
I then concatenated this layer's output to the regular output:
seq_num = CounterLayer()(inputs)
outputs = concatenate([out, seq_num])
What's going wrong
My error is:
Traceback (most recent call last):
File "lstm.py", line 119, in <module>
main()
File "lstm.py", line 115, in main
model = create_model()
File "lstm.py", line 74, in create_model
seq_num = CounterLayer()(inputs)
File "/usr/lib/python3.7/site-packages/keras/engine/base_layer.py", line 497, in __call__
arguments=user_kwargs)
File "/usr/lib/python3.7/site-packages/keras/engine/base_layer.py", line 565, in _add_inbound_node
output_tensors[i]._keras_shape = output_shapes[i]
AttributeError: 'int' object has no attribute '_keras_shape'
I'm assuming I have the shape wrong, but I don't know how. Am I going about this the wrong way? What should I do to make this work?
Further Adventures
Per Mohammad Jafar Mashhadi's comment, my call return needed to be wrapped in a keras.backend.variable; however, per his linked answer, my approach will not work, because call is not invoked multiple times, as I initially assumed it was.
How can I get a counter for the RNN?
For clarity, if the RNN given input xi outputs yi, I'm trying to get i as part of my output.
x1 -> RNN -> (y1, 1)
      h1 |
         v
x2 -> RNN -> (y2, 2)
      h2 |
         v
x3 -> RNN -> (y3, 3)
      h3 |
         v
x4 -> RNN -> (y4, 4)
The error is saying that inputs, in the line seq_num = CounterLayer()(inputs), is an integer.
You can't pass integers as inputs to layers. You must pass keras tensors, and only keras tensors.
Second, this will not work because Keras builds a static graph. A layer's call doesn't calculate anything; it only builds the graph of empty tensors. Only tensors get updated as you pass data through them; integer values do not. When you write self.sequence_number += 1, it runs only once, while the model is being built; it does not run again for every batch.
We need details
We can't really understand what is going on if you don't give us enough information, such as:
the model
the summary
your input data shapes
the target data shapes
the custom functions
etc.
If the interpretation below is correct, the model's output shape in the summary and the target data shapes as you pass it to fit are absolutely important to know.
Proposed solution
If I understood what you described, you want to have a sequence of increasing integers along with the time steps of your sequences, so these numbers are used in your loss function.
If this interpretation is right, you don't need to keep updating numbers, you just need to create a range tensor and that's it.
So, inside your loss (which I don't understand unless you provide us with the custom functions) you should create this tensor:
def custom_loss(y_true, y_pred):
    #use this if you defined your model with static sequence length - input_shape = (length, features)
    length = K.int_shape(y_pred)[1]

    #use this if you defined your model with dynamic sequence length - input_shape = (None, features)
    length = K.shape(y_pred)[1]

    #this is the sequence vector:
    seq = tf.range(1, length + 1)

    #you can get the root with (cast to float first):
    seq = K.sqrt(K.cast(seq, K.floatx()))

    #reshape to match the shape of the loss, which is probably (batch, length)
    seq = K.reshape(seq, (1, -1)) #shape (1, length)

    #compute your loss normally, taking care to reduce the last axis and keep the first two
    loss = ..... #shape (batch, length)

    #multiply the weights by the loss
    return loss * seq
You must work with everything as whole tensors! You cannot handle it step by step; keep the first and second dimensions intact throughout your loss.
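For reference, here is a minimal self-contained sketch of that idea, using a plain per-step MSE as a stand-in for your direction/distance/energy terms (the function name and the base loss are my own illustration, not your actual loss):

import tensorflow as tf
from tensorflow.keras import backend as K

def weighted_custom_loss(y_true, y_pred):
    length = K.shape(y_pred)[1]                              # dynamic sequence length
    seq = tf.range(1, length + 1)                            # 1, 2, ..., length
    weights = K.sqrt(K.cast(seq, K.floatx()))                # sqrt of the step index
    weights = K.reshape(weights, (1, -1))                    # shape (1, length), broadcasts over the batch
    step_loss = K.mean(K.square(y_true - y_pred), axis=-1)   # shape (batch, length)
    return step_loss * weights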
I'm not sure I have understood the question completely, but based on the final diagram, I think that to get an extra feature (the time step) fed into the loss function along with the predictions, you might try the second approach suggested in this other accepted answer:
Custom loss function in Keras based on the input data
The idea is to expand the label vector with the extra feature, and then separate them again inside the loss function.
# y_true_plus_timesteps has shape [n_training_instances, 2]
def custom_loss(y_true_plus_timesteps, y_pred):
    # labels stored in the first column
    y_true = y_true_plus_timesteps[:, 0]
    time_steps = y_true_plus_timesteps[:, 1]

    return K.mean(K.square(y_pred - y_true), axis=-1) + your loss function
# note that labels are fed into the model with time steps
model.fit(X, np.append(Y_true, time_steps, axis=1), batch_size=batch_size, epochs=90, shuffle=True, verbose=1)
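For instance, one way to build that time_steps column (purely illustrative; your actual indices may come from elsewhere) might be:

import numpy as np

# hypothetical: assign each training instance its step index, as a column vector
time_steps = np.arange(1, len(Y_true) + 1).reshape(-1, 1)   # shape (n_training_instances, 1)
# Y_true should also be a column so np.append(Y_true, time_steps, axis=1) yields shape (n, 2)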
Related
I am handling a timeseries dataset with n timesteps, m features and k objects.
As a result my feature vector has a shape of (n, k, m), while my targets' shape is (n, m).
I want to predict the targets for every timestep and object, but with the same weights for every object. Also, my loss function looks like this:
average_loss = loss_func(prediction, labels)
sum_loss = loss_func(sum(prediction), sum(labels))
loss = loss_weight * average_loss + (1-loss_weight) * sum_loss
My plan is to make sure not only that I predict every item as well as possible, but also that the sum of all items is predicted. loss_weight is a constant.
Currently I am doing this kind of ugly solution:
features = local_batch.squeeze(dim = 0)
labels = torch.unsqueeze(local_labels.squeeze(dim = 0), 1)
prediction = net(features)
I set my batch size to 1 and squeeze it to make the k objects my batch.
My network looks like this:
class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.predict = torch.nn.Linear(n_hidden, n_output)   # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))   # activation function for hidden layer
        x = self.predict(x)          # linear output
        return x
How do I make sure I do a reasonable convolution over the object dimension in order to keep the same weights for all objects, without committing to batch_size = 1? Also, how do I achieve the same loss function, where I compute the loss of the prediction sum vs. the target sum for each timestep?
It's not exactly ugly -- I would do the same but generalize it a bit for batch size >1 using view.
# Using your notations
n, k, m = features.shape
features = local_batch.view(n*k, m)
prediction = net(features).view(n, k, m)
With the prediction in the correct shape (n, k, m), implementing your loss function should not be difficult.
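If it helps, a minimal sketch of that combined loss in PyTorch could look like the following (this assumes prediction and labels are both of shape (n, k, m) and loss_weight is your constant; adjust the summed dimension to whatever "sum of all items" means for your data):

import torch
import torch.nn.functional as F

def combined_loss(prediction, labels, loss_weight):
    average_loss = F.mse_loss(prediction, labels)
    # sum over the object dimension k, then compare the per-timestep totals
    sum_loss = F.mse_loss(prediction.sum(dim=1), labels.sum(dim=1))
    return loss_weight * average_loss + (1 - loss_weight) * sum_loss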
TLDR:
Autoencoder underfits timeseries reconstruction and just predicts average value.
Question Set-up:
Here is a summary of my attempt at a sequence-to-sequence autoencoder. This image was taken from this paper: https://arxiv.org/pdf/1607.00148.pdf
Encoder: Standard LSTM layer. Input sequence is encoded in the final hidden state.
Decoder: LSTM Cell (I think!). Reconstruct the sequence one element at a time, starting with the last element x[N].
Decoder algorithm is as follows for a sequence of length N:
Get Decoder initial hidden state hs[N]: Just use encoder final hidden state.
Reconstruct last element in the sequence: x[N]= w.dot(hs[N]) + b.
Same pattern for other elements: x[i]= w.dot(hs[i]) + b
use x[i] and hs[i] as inputs to LSTMCell to get x[i-1] and hs[i-1]
Minimum Working Example:
Here is my implementation, starting with the encoder:
class SeqEncoderLSTM(nn.Module):
    def __init__(self, n_features, latent_size):
        super(SeqEncoderLSTM, self).__init__()
        self.lstm = nn.LSTM(
            n_features,
            latent_size,
            batch_first=True)

    def forward(self, x):
        _, hs = self.lstm(x)
        return hs
Decoder class:
class SeqDecoderLSTM(nn.Module):
    def __init__(self, emb_size, n_features):
        super(SeqDecoderLSTM, self).__init__()
        self.cell = nn.LSTMCell(n_features, emb_size)
        self.dense = nn.Linear(emb_size, n_features)

    def forward(self, hs_0, seq_len):
        x = torch.tensor([])

        # Final hidden and cell state from encoder
        hs_i, cs_i = hs_0

        # reconstruct first element with encoder output
        x_i = self.dense(hs_i)
        x = torch.cat([x, x_i])

        # reconstruct remaining elements
        for i in range(1, seq_len):
            hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
            x_i = self.dense(hs_i)
            x = torch.cat([x, x_i])

        return x
Bringing the two together:
class LSTMEncoderDecoder(nn.Module):
    def __init__(self, n_features, emb_size):
        super(LSTMEncoderDecoder, self).__init__()
        self.n_features = n_features
        self.hidden_size = emb_size
        self.encoder = SeqEncoderLSTM(n_features, emb_size)
        self.decoder = SeqDecoderLSTM(emb_size, n_features)

    def forward(self, x):
        seq_len = x.shape[1]
        hs = self.encoder(x)
        hs = tuple([h.squeeze(0) for h in hs])
        out = self.decoder(hs, seq_len)
        return out.unsqueeze(0)
And here's my training function:
def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-6, reverse=False):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Training model on {device}')
    model = model.to(device)
    opt = optimizer(model.parameters(), lr)

    train_loss = []
    valid_loss = []

    for e in tqdm(range(epochs)):
        running_tl = 0
        running_vl = 0
        for x in trainload:
            x = x.to(device).float()
            opt.zero_grad()
            x_hat = model(x)
            if reverse:
                x = torch.flip(x, [1])
            loss = criterion(x_hat, x)
            loss.backward()
            opt.step()
            running_tl += loss.item()

        if testload is not None:
            model.eval()
            with torch.no_grad():
                for x in testload:
                    x = x.to(device).float()
                    loss = criterion(model(x), x)
                    running_vl += loss.item()
                valid_loss.append(running_vl / len(testload))
            model.train()

        train_loss.append(running_tl / len(trainload))

    return train_loss, valid_loss
Data:
Large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and only contains 5 variables. Here is the sequence I'm trying to overfit:
tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)
And here is the custom Dataset class:
class TimeseriesDataSet(Dataset):
    def __init__(self, data, window, n_features, overlap=0):
        super().__init__()
        if isinstance(data, (np.ndarray)):
            data = torch.tensor(data)
        elif isinstance(data, (pd.Series, pd.DataFrame)):
            data = torch.tensor(data.copy().to_numpy())
        else:
            raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")

        self.n_features = n_features
        self.seqs = torch.split(data, window)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        try:
            return self.seqs[idx].view(-1, self.n_features)
        except TypeError:
            raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")
Problem:
The model only learns the average, no matter how complex I make the model or how long I train it.
(Predicted/Reconstruction and Actual plots omitted.)
My research:
This problem is identical to the one discussed in this question: LSTM autoencoder always returns the average of the input sequence
The problem in that case ended up being that the objective function was averaging the target timeseries before calculating loss. This was due to some broadcasting errors because the author didn't have the right sized inputs to the objective function.
In my case, I do not see this being the issue. I have checked and double checked that all of my dimensions/sizes line up. I am at a loss.
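For reference, the kind of silent broadcasting mismatch described there looks like this (illustrative only, not my actual code):

import torch

pred = torch.randn(14, 1)    # column vector
target = torch.randn(14)     # flat vector
loss = torch.nn.MSELoss()(pred, target)   # broadcasts to (14, 14) and warns about mismatched sizes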
Other Things I've Tried
I've tried this with varied sequence lengths from 7 timesteps to 100 time steps.
I've tried with varied number of variables in the time series. I've tried with univariate all the way to all 274 variables that the data contains.
I've tried with various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
The paper calls for reconstructing the sequence in reverse order (see graphic above). I have tried this method using flipud on the original input (after the forward pass but before calculating loss). This makes no difference.
I tried making the model more complex by adding an extra LSTM layer in the encoder.
I've tried playing with the latent space. I've tried from 50% of the input number of features to 150%.
I've tried overfitting a single sequence (provided in the Data section above).
Question:
What is causing my model to predict the average and how do I fix it?
Okay, after some debugging I think I know the reasons.
TLDR
You are trying to predict the next timestep's value instead of the difference between the current timestep and the previous one
Your hidden_features number is too small, making the model unable to fit even a single sample
Analysis
Code used
Let's start with the code (model is the same):
import torch
import seaborn as sns
import matplotlib.pyplot as plt


def get_data(subtract: bool = False):
    # (1, 14, 5)
    input_tensor = torch.tensor(
        [
            [0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
            [0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
            [0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
            [0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
            [0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
            [0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
            [0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
            [0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
            [0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
            [0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
            [0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
            [0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
            [0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
            [0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
        ]
    ).unsqueeze(0)
    if subtract:
        initial_values = input_tensor[:, 0, :]
        input_tensor -= torch.roll(input_tensor, 1, 1)
        input_tensor[:, 0, :] = initial_values
    return input_tensor


if __name__ == "__main__":
    torch.manual_seed(0)

    HIDDEN_SIZE = 10
    SUBTRACT = False

    input_tensor = get_data(SUBTRACT)
    model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.MSELoss()
    for i in range(1000):
        outputs = model(input_tensor)
        loss = criterion(outputs, input_tensor)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"{i}: {loss}")
        if loss < 1e-4:
            break

    # Plotting
    sns.lineplot(data=outputs.detach().numpy().squeeze())
    sns.lineplot(data=input_tensor.detach().numpy().squeeze())
    plt.show()
What it does:
get_data either works on the data you provided (if subtract=False) or, if subtract=True, it subtracts the value of the previous timestep from the current timestep
The rest of the code optimizes the model until a loss of 1e-4 is reached (so we can compare how the model's capacity, and increasing it, helps, and what happens when we use the difference of timesteps instead of the raw timesteps)
We will only vary HIDDEN_SIZE and SUBTRACT parameters!
NO SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=False
In this case we get a straight line: the model is unable to fit and grasp the phenomena present in the data (hence the flat lines you mentioned).
1000 iterations limit reached
SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=True
Targets are now far from flat lines, but the model is unable to fit due to insufficient capacity.
1000 iterations limit reached
NO SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=False
It got a lot better and our target was hit after 942 steps. No more flat lines; the model's capacity seems quite fine (for this single example!).
SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=True
Although the graph does not look that pretty, we got to the desired loss after only 215 iterations.
Finally
Usually, use the difference of timesteps instead of raw timesteps (or some other transformation; see here for more info about that). Otherwise, the neural network will simply try to... copy the output from the previous step (as that's the easiest thing to do). Some minimum will be found this way, and getting out of it will require more capacity.
When you use the difference between timesteps, there is no way to "extrapolate" the trend from the previous timestep; the neural network has to learn how the function actually varies.
Use a larger model (for the whole dataset you should try something like 300, I think), but you can simply tune that one.
Don't use flipud. Use bidirectional LSTMs; this way you can get info from the forward and backward passes of the LSTM (not to be confused with backprop!). This should also boost your score (a minimal sketch follows below).
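Here is a small sketch of a bidirectional encoder in that spirit (this is my own illustration, not your original code; note the decoder's emb_size would then need to be 2 * latent_size, since the forward and backward states are concatenated):

import torch
import torch.nn as nn

class BiSeqEncoderLSTM(nn.Module):
    def __init__(self, n_features, latent_size):
        super().__init__()
        self.lstm = nn.LSTM(n_features, latent_size,
                            batch_first=True, bidirectional=True)

    def forward(self, x):
        _, (h, c) = self.lstm(x)
        # h and c have shape (2, batch, latent_size); concatenate both directions
        h = torch.cat([h[0], h[1]], dim=-1)
        c = torch.cat([c[0], c[1]], dim=-1)
        return (h, c)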
Questions
Okay, question 1: You are saying that for variable x in the time
series, I should train the model to learn x[i] - x[i-1] rather than
the value of x[i]? Am I correctly interpreting?
Yes, exactly. The difference removes the urge of the neural network to base its predictions too much on the past timestep (by simply taking the last value and maybe changing it a little).
Question 2: You said my calculations for zero bottleneck were
incorrect. But, for example, let's say I'm using a simple dense
network as an auto encoder. Getting the right bottleneck indeed
depends on the data. But if you make the bottleneck the same size as
the input, you get the identity function.
Yes, assuming there is no non-linearity involved, which makes things harder (see here for a similar case). In the case of LSTMs there are non-linearities; that's one point.
Another is that we are accumulating timesteps into a single encoder state. So essentially we would have to accumulate the identities of all timesteps into a single hidden and cell state, which is highly unlikely.
One last point: depending on the length of the sequence, LSTMs are prone to forgetting the least relevant information (that's what they were designed to do, not only to remember everything), which makes it even more unlikely.
Is num_features * num_timesteps not a bottle neck of the same size as
the input, and therefore shouldn't it facilitate the model learning
the identity?
It is, but it assumes you have num_timesteps for each data point, which is rarely the case (it might be here). Why the identity is hard for the network to learn with non-linearities was answered above.
One last point about identity functions: if they were actually easy to learn, ResNet architectures would be unlikely to succeed. The network could converge to the identity and make "small fixes" to the output without them, which is not the case.
I'm curious about the statement: "always use difference of timesteps instead of timesteps". It seems to have some normalizing effect by bringing all the features closer together, but I don't understand why this is key. Having a larger model seemed to be the solution, and the subtraction is just helping.
Key here was, indeed, increasing model capacity. Subtraction trick depends on the data really. Let's imagine an extreme situation:
We have 100 timesteps, single feature
Initial timestep value is 10000
Other timestep values vary by 1 at most
What would the neural network do (what is easiest here)? It would probably discard this change of 1 or less as noise and just predict 10000 for all of them (especially if some regularization is in place), as being off by 1 part in 10000 is not much.
What if we subtract? The whole neural-network loss is then within a [0, 1] margin for each timestep instead of roughly [0, 10000], hence it is more costly to be wrong.
And yes, it is connected to normalization in some sense, come to think of it.
I'm having some difficulty with chaining together two models in an unusual way.
I am trying to replicate the following flowchart:
For clarity, at each timestep of Model[0] I am attempting to generate an entire time series from IR[i] (Intermediate Representation) as a repeated input, using Model[1]. The purpose of this scheme is that it allows the generation of a ragged 2-D time series from a 1-D input (while both allowing the second model to be omitted when the output for that timestep is not needed, and not requiring Model[0] to constantly "switch modes" between accepting input and generating output).
I assume a custom training loop will be required, and I already have a custom training loop for handling statefulness in the first model (the previous version only had a single output at each timestep). As depicted, the second model should have reasonably short outputs (able to be constrained to fewer than 10 timesteps).
But at the end of the day, while I can wrap my head around what I want to do, I'm not nearly adroit enough with Keras and/or Tensorflow to actually implement it. (In fact, this is my first non-toy project with the library.)
I have unsuccessfully searched literature for similar schemes to parrot, or example code to fiddle with. And I don't even know if this idea is possible from within TF/Keras.
I already have the two models working in isolation. (As in I've worked out the dimensionality, and done some training with dummy data to get garbage outputs for the second model, and the first model is based off of a previous iteration of this problem and has been fully trained.) If I have Model[0] and Model[1] as python variables (let's call them model_a and model_b), then how would I chain them together to do this?
Edit to add:
If this is all unclear, perhaps the dimensions of each input and output will help:
Input: (batch_size, model_a_timesteps, input_size)
IR: (batch_size, model_a_timesteps, ir_size)
IR[i] (after duplication): (batch_size, model_b_timesteps, ir_size)
Out[i]: (batch_size, model_b_timesteps, output_size)
Out: (batch_size, model_a_timesteps, model_b_timesteps, output_size)
As this question has multiple major parts, I've dedicated a Q&A to the core challenge: stateful backpropagation. This answer focuses on implementing the variable output step length.
Description:
As validated in Case 5, we can take a bottom-up first approach. First we feed the complete input to model_a (A) - then, feed its outputs as input to model_b (B), but this time one step at a time.
Note that we must chain B's output steps per A's input step, not between A's input steps; i.e., in your diagram, gradient is to flow between Out[0][1] and Out[0][0], but not between Out[2][0] and Out[0][1].
For computing loss it won't matter whether we use a ragged or padded tensor; we must however use a padded tensor for writing to TensorArray.
Loop logic in code below is general; specific attribute handling and hidden state passing, however, is hard-coded for simplicity, but can be rewritten for generality.
Code: at bottom.
Example:
Here we predefine the number of iterations for B per input from A, but we can implement any arbitrary stopping logic. For example, we can take a Dense layer's output from B as a hidden state and check if its L2-norm exceeds a threshold (a minimal sketch of such a check appears a few lines below).
Per above, if longest_step is unknown to us, we can simply set it, which is common for NLP & other tasks with a STOP token.
Alternatively, we may write to separate TensorArrays at every A's input with dynamic_size=True; see "point of uncertainty" below.
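Picking up the L2-norm idea mentioned above, such a stopping check might look like this (hidden_state and threshold are illustrative names, not part of the full code at the bottom):

def norm_exceeds(hidden_state, threshold=10.0):
    # stop B's inner loop once the hidden state's L2-norm crosses the threshold
    return tf.norm(hidden_state, ord='euclidean') > threshold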
A valid concern is: how do we know gradients flow correctly? Note that we've validated them for both the vertical and the horizontal direction in the linked Q&A, but it didn't cover multiple output steps per input step, for multiple input steps. See below.
Point of uncertainty: I'm not entirely sure whether gradients interact between e.g. Out[0][1] and Out[2][0]. I did, however, verify that gradients will not flow horizontally if we write to separate TensorArrays for B's outputs per A's inputs (case 2); reimplementing for cases 4 & 5, grads will differ for both models, including the lower one with a complete single horizontal pass.
Thus we must write to a unified TensorArray. Since there are no ops leading from e.g. IR[1] to Out[0][1], I can't see how TF would trace it as such - so it seems we're safe. Note, however, that in the example below, using steps_at_t=[1]*6 will make gradients flow horizontally in both models, as we're writing to a single TensorArray and passing hidden states.
The examined case is confounded, however, with B being stateful at all steps; lifting this requirement, we might not need to write to a unified TensorArray for all Out[0], Out[1], etc, but we must still test against something we know works, which is no longer as straightforward.
Example [code]:
import numpy as np
import tensorflow as tf
#%%# Make data & models, then fit ###########################################
x0 = y0 = tf.constant(np.random.randn(2, 3, 4))
msn = MultiStatefulNetwork(batch_shape=(2, 3, 4), steps_at_t=[3, 4, 2])
#%%#############################################
with tf.GradientTape(persistent=True) as tape:
    outputs = msn(x0)  # shape: (3, 4, 2, 4), 0-padded

# We can pad labels accordingly.
# Note the (2, 4) model_b's output shape, which is a timestep slice;
# model_b is a *slice model*. Careful in implementing various logics
# which are and aren't intended to be stateful.
Methods:
Not the cleanest, nor most optimal code, but it works; room for improvement.
More importantly: I implemented this in Eager, and have no idea how it'll work in Graph, and making it work for both can be quite tricky. If needed, just run in Graph and compare all values as done in the "cases".
# ideally we won't `import tensorflow` at all; kept for code simplicity
import tensorflow as tf
from tensorflow.python.util import nest
from tensorflow.python.ops import array_ops, tensor_array_ops
from tensorflow.python.framework import ops
from tensorflow.keras.layers import Input, SimpleRNN, SimpleRNNCell
from tensorflow.keras.models import Model
#######################################################################
class MultiStatefulNetwork():
    def __init__(self, batch_shape=(2, 6, 4), steps_at_t=[]):
        self.batch_shape = batch_shape
        self.steps_at_t = steps_at_t
        self.batch_size = batch_shape[0]
        self.units = batch_shape[-1]
        self._build_models()

    def __call__(self, inputs):
        outputs = self._forward_pass_a(inputs)
        outputs = self._forward_pass_b(outputs)
        return outputs

    def _forward_pass_a(self, inputs):
        return self.model_a(inputs, training=True)

    def _forward_pass_b(self, inputs):
        return model_rnn_outer(self.model_b, inputs, self.steps_at_t)

    def _build_models(self):
        ipt = Input(batch_shape=self.batch_shape)
        out = SimpleRNN(self.units, return_sequences=True)(ipt)
        self.model_a = Model(ipt, out)

        ipt  = Input(batch_shape=(self.batch_size, self.units))
        sipt = Input(batch_shape=(self.batch_size, self.units))
        out, state = SimpleRNNCell(4)(ipt, sipt)
        self.model_b = Model([ipt, sipt], [out, state])

        self.model_a.compile('sgd', 'mse')
        self.model_b.compile('sgd', 'mse')
def inner_pass(model, inputs, states):
    return model_rnn(model, inputs, states)
def model_rnn_outer(model, inputs, steps_at_t=[2, 2, 4, 3]):
    def outer_step_function(inputs, states):
        x, steps = inputs
        x = array_ops.expand_dims(x, 0)
        x = array_ops.tile(x, [steps, *[1] * (x.ndim - 1)])  # repeat `steps` times
        output, new_states = inner_pass(model, x, states)
        return output, new_states

    (outer_steps, steps_at_t, longest_step, outer_t, initial_states,
     output_ta, input_ta) = _process_args_outer(model, inputs, steps_at_t)

    def _outer_step(outer_t, output_ta_t, *states):
        current_input = [input_ta.read(outer_t), steps_at_t.read(outer_t)]
        output, new_states = outer_step_function(current_input, tuple(states))

        # pad if shorter than longest_step.
        # model_b may output twice, but longest in `steps_at_t` is 4; then we need
        # output.shape == (2, *model_b.output_shape) -> (4, *...)
        # checking directly on `output` is more reliable than from `steps_at_t`
        output = tf.cond(
            tf.math.less(output.shape[0], longest_step),
            lambda: tf.pad(output, [[0, longest_step - output.shape[0]],
                                    *[[0, 0]] * (output.ndim - 1)]),
            lambda: output)

        output_ta_t = output_ta_t.write(outer_t, output)
        return (outer_t + 1, output_ta_t) + tuple(new_states)

    final_outputs = tf.while_loop(
        body=_outer_step,
        loop_vars=(outer_t, output_ta) + initial_states,
        cond=lambda outer_t, *_: tf.math.less(outer_t, outer_steps))

    output_ta = final_outputs[1]
    outputs = output_ta.stack()
    return outputs
def _process_args_outer(model, inputs, steps_at_t):
    def swap_batch_timestep(input_t):
        # Swap the batch and timestep dim for the incoming tensor.
        # (samples, timesteps, channels) -> (timesteps, samples, channels)
        # iterating dim0 to feed (samples, channels) slices expected by RNN
        axes = list(range(len(input_t.shape)))
        axes[0], axes[1] = 1, 0
        return array_ops.transpose(input_t, axes)

    inputs = nest.map_structure(swap_batch_timestep, inputs)

    assert inputs.shape[0] == len(steps_at_t)
    outer_steps = array_ops.shape(inputs)[0]  # model_a_steps
    longest_step = max(steps_at_t)
    steps_at_t = tensor_array_ops.TensorArray(
        dtype=tf.int32, size=len(steps_at_t)).unstack(steps_at_t)

    # assume single-input network, excluding states which are handled separately
    input_ta = tensor_array_ops.TensorArray(
        dtype=inputs.dtype,
        size=outer_steps,
        element_shape=tf.TensorShape(model.input_shape[0]),
        tensor_array_name='outer_input_ta_0').unstack(inputs)

    # TensorArray is used to write outputs at every timestep, but does not
    # support RaggedTensor; thus we must make the TensorArray such that column length
    # is that of the longest outer step, and pad model_b's outputs accordingly
    element_shape = tf.TensorShape((longest_step, *model.output_shape[0]))

    # overall shape: (outer_steps, longest_step, *model_b.output_shape)
    # for every input / at each step we write in dim0 (outer_steps)
    output_ta = tensor_array_ops.TensorArray(
        dtype=model.output[0].dtype,
        size=outer_steps,
        element_shape=element_shape,
        tensor_array_name='outer_output_ta_0')

    outer_t = tf.constant(0, dtype='int32')
    initial_states = (tf.zeros(model.input_shape[0], dtype='float32'),)

    return (outer_steps, steps_at_t, longest_step, outer_t, initial_states,
            output_ta, input_ta)
def model_rnn(model, inputs, states):
    def step_function(inputs, states):
        output, new_states = model([inputs, *states], training=True)
        return output, new_states

    initial_states = states
    input_ta, output_ta, time, time_steps_t = _process_args(model, inputs)

    def _step(time, output_ta_t, *states):
        current_input = input_ta.read(time)
        output, new_states = step_function(current_input, tuple(states))

        flat_state = nest.flatten(states)
        flat_new_state = nest.flatten(new_states)
        for state, new_state in zip(flat_state, flat_new_state):
            if isinstance(new_state, ops.Tensor):
                new_state.set_shape(state.shape)

        output_ta_t = output_ta_t.write(time, output)
        new_states = nest.pack_sequence_as(initial_states, flat_new_state)
        return (time + 1, output_ta_t) + tuple(new_states)

    final_outputs = tf.while_loop(
        body=_step,
        loop_vars=(time, output_ta) + tuple(initial_states),
        cond=lambda time, *_: tf.math.less(time, time_steps_t))

    new_states = final_outputs[2:]
    output_ta = final_outputs[1]
    outputs = output_ta.stack()
    return outputs, new_states
def _process_args(model, inputs):
    time_steps_t = tf.constant(inputs.shape[0], dtype='int32')

    # assume single-input network (excluding states)
    input_ta = tensor_array_ops.TensorArray(
        dtype=inputs.dtype,
        size=time_steps_t,
        tensor_array_name='input_ta_0').unstack(inputs)

    # assume single-output network (excluding states)
    output_ta = tensor_array_ops.TensorArray(
        dtype=model.output[0].dtype,
        size=time_steps_t,
        element_shape=tf.TensorShape(model.output_shape[0]),
        tensor_array_name='output_ta_0')

    time = tf.constant(0, dtype='int32', name='time')
    return input_ta, output_ta, time, time_steps_t
I am trying to understand how CTC loss works for speech recognition and how it can be implemented in Keras.
What I think I understood (please correct me if I'm wrong!)
Roughly, the CTC loss is added on top of a classical network in order to decode sequential information element by element (letter by letter for text or speech) rather than decoding a block of elements directly (a word, for example).
Let's say we're feeding utterances of some sentences as MFCCs.
The goal in using CTC loss is to learn how to make each letter match the MFCC at each time step. Thus, the Dense+softmax output layer is composed of as many neurons as the number of elements needed for the composition of the sentences:
alphabet (a, b, ..., z)
a blank token (-)
a space (_) and an end-character (>)
Then, the softmax layer has 29 neurons (26 for the alphabet plus the special characters).
To implement it, I found that I can do something like this:
# CTC implementation from Keras example found at
# https://github.com/keras-team/keras/blob/master/examples/image_ocr.py
def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    # print "y_pred_shape: ", y_pred.shape
    y_pred = y_pred[:, 2:, :]
    # print "y_pred_shape: ", y_pred.shape
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

input_data = Input(shape=(1000, 20))
# let's say each MFCC is (1000 timesteps x 20 features)
x = Bidirectional(lstm(..., return_sequences=True))(input_data)
x = Bidirectional(lstm(..., return_sequences=True))(x)
y_pred = TimeDistributed(Dense(units=ALPHABET_LENGTH, activation='softmax'))(x)

loss_out = Lambda(function=ctc_lambda_func, name='ctc', output_shape=(1,))(
    [y_pred, y_true, input_length, label_length])

model = Model(inputs=[input_data, y_true, input_length, label_length],
              outputs=loss_out)
With ALPHABET_LENGTH = 29 (alphabet length + special characters)
And:
y_true: tensor (samples, max_string_length) containing the truth labels.
y_pred: tensor (samples, time_steps, num_categories) containing the prediction, or output of the softmax.
input_length: tensor (samples, 1) containing the sequence length for each batch item in y_pred.
label_length: tensor (samples, 1) containing the sequence length for each batch item in y_true.
(source)
Now, I'm facing some problems:
What I don't understand
Is this implementation the right way to code and use CTC loss?
I do not understand what y_true, input_length and label_length concretely are. Any examples?
In what form should I give the labels to the network? Again, any examples?
What are these?
y_true: your ground truth data, the data you are going to compare with the model's outputs in training. (y_pred, on the other hand, is the model's calculated output.)
input_length: the length (in steps, or chars in this case) of each sample (sentence) in the y_pred tensor (as said here).
label_length: the length (in steps, or chars in this case) of each sample (sentence) in the y_true (or labels) tensor.
It seems this loss expects that your model's outputs (y_pred) have different lengths, as well as your ground truth data (y_true). This is probably to avoid calculating the loss for garbage characters after the end of the sentences (since you will need a fixed size tensor for working with lots of sentences at once)
Form of the labels:
Since the function's documentation is asking for shape (samples, length), the format is that... the char index for each char in each sentence.
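For a concrete toy illustration of that format (the index assignments below are my own assumption, not something Keras prescribes):

import numpy as np

# toy encoding: 'a'..'z' -> 0..25, '_' (space) -> 26, '>' (end) -> 27
sentences = ["hi there", "go"]
char_to_idx = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
char_to_idx.update({"_": 26, ">": 27})

max_string_length = max(len(s) for s in sentences)
y_true = np.zeros((len(sentences), max_string_length), dtype="int32")
label_length = np.zeros((len(sentences), 1), dtype="int32")
for i, s in enumerate(sentences):
    encoded = [char_to_idx["_" if c == " " else c] for c in s]
    y_true[i, :len(encoded)] = encoded
    label_length[i, 0] = len(encoded)

# input_length holds the usable timesteps of y_pred per sample, e.g.
# np.full((len(sentences), 1), 998) if two warm-up steps were sliced off of 1000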
How to use this?
There are some possibilities.
1- If you don't care about lengths:
If all lengths are the same, you can easily use it as a regular loss:
def ctc_loss(y_true, y_pred):
    return K.ctc_batch_cost(y_true, y_pred, input_length, label_length)
    # where input_length and label_length are constants you created previously
    # the easiest way here is to have a fixed batch size in training
    # the lengths should have the same batch size (see shapes in the link for ctc_cost)

model.compile(loss=ctc_loss, ...)

# here is how you pass the labels for training
model.fit(input_data_X_train, ground_truth_data_Y_train, ....)
2 - If you care about the lengths.
This is a little more complicated: you need your model to somehow tell you the length of each output sentence.
There are again several creative ways of doing this:
Have an "end_of_sentence" char and detect where in the sentence it is.
Have a branch of your model to calculate this number and round it to an integer.
(Hardcore) If you are using a stateful manual training loop, get the index of the iteration at which you decided to finish a sentence.
I like the first idea, and will exemplify it here.
def ctc_find_eos(y_true, y_pred):
    # convert y_pred from one-hot to label indices
    y_pred_ind = K.argmax(y_pred, axis=-1)

    # to make sure y_pred has one end_of_sentence (to avoid errors)
    y_pred_end = K.concatenate([
        y_pred_ind[:, :-1],
        eos_index * K.ones_like(y_pred_ind[:, -1:])
    ], axis=1)

    # to make sure the first occurrence of the char is more important than subsequent ones
    occurrence_weights = K.arange(start=max_length, stop=0, dtype=K.floatx())

    # is eos?
    is_eos_true = K.cast_to_floatx(K.equal(y_true, eos_index))
    is_eos_pred = K.cast_to_floatx(K.equal(y_pred_end, eos_index))

    # lengths
    true_lengths = 1 + K.argmax(occurrence_weights * is_eos_true, axis=1)
    pred_lengths = 1 + K.argmax(occurrence_weights * is_eos_pred, axis=1)

    # reshape
    true_lengths = K.reshape(true_lengths, (-1, 1))
    pred_lengths = K.reshape(pred_lengths, (-1, 1))

    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)
model.compile(loss=ctc_find_eos, ....)
If you use the other option, use a model branch to calculate the lengths, concatenate these lengths to the first or last step of the output, and make sure you do the same with the true lengths in your ground truth data. Then, in the loss function, just take the section for the lengths:
def ctc_concatenated_length(y_true, y_pred):
    # assuming you concatenated the length in the first step
    true_lengths = y_true[:, :1]  # may need to cast to int
    y_true = y_true[:, 1:]

    # since y_pred uses one-hot, you will need to concatenate to full size of the last axis,
    # thus the 0 here
    pred_lengths = K.cast(y_pred[:, :1, 0], "int32")
    y_pred = y_pred[:, 1:]

    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)
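And a small sketch of how the matching ground truth could be assembled for that option (true_lengths and padded_labels are placeholder names for however you encode your sentences):

import numpy as np

# hypothetical: prepend each sample's true length as an extra first "step"
# true_lengths: shape (samples, 1); padded_labels: shape (samples, max_string_length)
y_true_with_length = np.concatenate([true_lengths, padded_labels], axis=1)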
I'm using Keras with the TensorFlow backend. My goal is to query the batch size of the current batch in a custom loss function. This is needed to compute values of the custom loss function which depend on the index of particular observations. I'd like to make this clearer with the minimal reproducible examples below.
(BTW: Of course I could use the batch size defined for the training procedure and plug its value in when defining the custom loss function, but there are some reasons why this can vary; in particular, if epochsize % batchsize (epoch size modulo batch size) is non-zero, then the last batch of an epoch has a different size. I didn't find a suitable approach on Stack Overflow, e.g.
Tensor indexing in custom loss function, Tensorflow custom loss function in Keras - loop over tensor, and Looping over a tensor, because obviously the shape of a tensor can't be inferred when building the graph, which is the case for a loss function; shape inference is only possible when evaluating given the data, which is only possible given the graph. Hence I need to tell the custom loss function to do something with particular elements along a certain dimension without knowing the length of that dimension.)
(this is the same in all examples)
from keras.models import Sequential
from keras.layers import Dense, Activation
# Generate dummy data
import numpy as np
data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
example 1: nothing special, no issue, no custom loss

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)

(Output omitted, this runs perfectly fine)
example 2: nothing special, with a fairly simple custom loss
def custom_loss(yTrue, yPred):
    loss = np.abs(yTrue - yPred)
    return loss

model.compile(optimizer='rmsprop',
              loss=custom_loss,
              metrics=['accuracy'])

# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)

(Output omitted, this runs perfectly fine)
example 3: the issue
def custom_loss(yTrue, yPred):
    print(yPred)  # Output: Tensor("dense_2/Sigmoid:0", shape=(?, 1), dtype=float32)
    n = yPred.shape[0]
    for i in range(n):  # TypeError: __index__ returned non-int (type NoneType)
        loss = np.abs(yTrue[i] - yPred[int(i/2)])
    return loss

model.compile(optimizer='rmsprop',
              loss=custom_loss,
              metrics=['accuracy'])

# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
Of course the tensor has no shape info yet; it can't be inferred when building the graph, only at training time. Hence for i in range(n) raises an error. Is there any way to do this?
(Traceback omitted.)
BTW here's my true custom loss function in case of any questions. I skipped it above for clarity and simplicity.
def neg_log_likelihood(yTrue, yPred):
    yStatus = yTrue[:, 0]
    yTime = yTrue[:, 1]
    n = yTrue.shape[0]
    for i in range(n):
        s1 = K.greater_equal(yTime, yTime[i])
        s2 = K.exp(yPred[s1])
        s3 = K.sum(s2)
        logsum = K.log(s3)
        loss = K.sum(yStatus[i] * yPred[i] - logsum)
    return loss
Here's an image of the partial negative log-likelihood of the Cox proportional hazards model.
This is to clarify a question in the comments to avoid confusion. I don't think it is necessary to understand this in detail to answer the question.
As usual, don't loop. There are severe performance drawbacks, and also bugs. Use only backend functions unless it's totally unavoidable (usually it isn't).
Solution for example 3:
So, there is a very weird thing there...
Do you really want to simply ignore half of your model's predictions? (Example 3)
Assuming this is true, just duplicate your tensor in the last dimension, flatten and discard half of it. You have the exact effect you want.
def custom_loss(true, pred):
    n = K.shape(pred)[0:1]

    pred = K.concatenate([pred]*2, axis=-1)  # duplicate in the last axis
    pred = K.flatten(pred)                   # flatten
    pred = K.slice(pred,                     # take only half (= n samples)
                   K.constant([0], dtype="int32"),
                   n)

    return K.abs(true - pred)
Solution for your loss function:
If you have sorted times from greater to lower, just do a cumulative sum.
Warning: If you have one time per sample, you cannot train with mini-batches!!!
batch_size = len(labels)
It makes sense to have time in an additional dimension (many times per sample), as is done in recurrent and 1D conv networks. Anyway, considering your example as expressed, that is shape (samples_equal_times,) for yTime:
def neg_log_likelihood(yTrue, yPred):
    yStatus = yTrue[:, 0]
    yTime = yTrue[:, 1]
    n = K.shape(yTrue)[0]

    # sort the times and everything else from greater to lower:
    # obs: you can have the data sorted already and avoid doing it here for performance
    # important: yTime will be sorted in the last dimension, make sure it's (None,) in this case
    # or that it's (None, time_length) in the case of many times per sample
    sortedTime, sortedIndices = tf.math.top_k(yTime, n, True)
    sortedStatus = K.gather(yStatus, sortedIndices)
    sortedPreds = K.gather(yPred, sortedIndices)

    # do the calculations
    exp = K.exp(sortedPreds)
    sums = K.cumsum(exp)  # this will have the sum for j >= i in the loop
    logsums = K.log(sums)

    return K.sum(sortedStatus * sortedPreds - logsums)