Understanding CTC loss for speech recognition in Keras - python

I am trying to understand how CTC loss works for speech recognition and how it can be implemented in Keras.
What I think I understood (please correct me if I'm wrong!)
Roughly speaking, the CTC loss is added on top of a classical network in order to decode sequential information element by element (letter by letter for text or speech) rather than decoding a whole block at once (a word, for example).
Let's say we're feeding utterances of some sentences as MFCCs.
The goal in using CTC loss is to learn how to make each letter match the MFCC at each time step. Thus, the Dense+softmax output layer is composed of as many neurons as the number of elements needed to compose the sentences:
alphabet (a, b, ..., z)
a blank token (-)
a space (_) and an end-character (>)
Then, the softmax layer has 29 neurons (26 for alphabet + some special characters).
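To make the role of the blank token concrete, here is a minimal pure-Python sketch of how a per-time-step CTC output is usually collapsed into a transcript: repeated symbols are merged first, then blanks are dropped. The frame string and the function name ctc_collapse are made up for illustration; "-" is the blank symbol from the list above.

```python
from itertools import groupby

BLANK = "-"  # blank symbol, as in the alphabet described above


def ctc_collapse(frame_labels):
    """Collapse a per-frame label sequence the way CTC decoding does."""
    # 1) merge runs of identical consecutive symbols
    merged = [symbol for symbol, _ in groupby(frame_labels)]
    # 2) drop the blank symbol
    return "".join(symbol for symbol in merged if symbol != BLANK)


# One hypothetical best-path output, one symbol per time step
print(ctc_collapse("hh-e-ll-lo--"))  # hello
```

The blank between the two "l" runs is what allows a doubled letter to survive the merge step.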
To implement it, I found that I can do something like this:
# CTC implementation from the Keras example found at
# https://github.com/keras-team/keras/blob/master/examples/image_ocr.py
from keras import backend as K
from keras.layers import Input, Dense, Lambda, Bidirectional, TimeDistributed
from keras.models import Model

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    # print "y_pred_shape: ", y_pred.shape
    y_pred = y_pred[:, 2:, :]
    # print "y_pred_shape: ", y_pred.shape
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

# let's say each MFCC is (1000 timestamps x 20 features)
input_data = Input(shape=(1000, 20))
x = Bidirectional(lstm(..., return_sequences=True))(input_data)
x = Bidirectional(lstm(..., return_sequences=True))(x)
y_pred = TimeDistributed(Dense(units=ALPHABET_LENGTH, activation='softmax'))(x)
loss_out = Lambda(function=ctc_lambda_func, name='ctc', output_shape=(1,))(
    [y_pred, y_true, input_length, label_length])
model = Model(inputs=[input_data, y_true, input_length, label_length],
              outputs=loss_out)
With ALPHABET_LENGTH = 29 (alphabet length + special characters)
And:
y_true: tensor (samples, max_string_length) containing the truth labels.
y_pred: tensor (samples, time_steps, num_categories) containing the prediction, or output of the softmax.
input_length: tensor (samples, 1) containing the sequence length for each batch item in y_pred.
label_length: tensor (samples, 1) containing the sequence length for each batch item in y_true.
(source)
Now, I'm facing some problems:
What I don't understand
Is this implementation the right way to code and use CTC loss?
I do not understand what y_true, input_length and label_length concretely are. Any examples?
In what form should I give the labels to the network? Again, any examples?

What are these?
y_true: your ground truth data, i.e. the data you are going to compare with the model's outputs in training. (y_pred, on the other hand, is the model's calculated output.)
input_length: the length (in steps, or chars in this case) of each sample (sentence) in the y_pred tensor (as said here).
label_length: the length (in steps, or chars in this case) of each sample (sentence) in the y_true (or labels) tensor.
It seems this loss expects that your model's outputs (y_pred) have different lengths, as well as your ground truth data (y_true). This is probably to avoid calculating the loss for garbage characters after the end of the sentences (since you will need a fixed size tensor for working with lots of sentences at once)
Form of the labels:
Since the function's documentation is asking for shape (samples, length), the format is that... the char index for each char in each sentence.
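As an illustration of that label format, here is a small sketch in plain Python that turns sentences into a padded (samples, max_string_length) index matrix plus the matching (samples, 1) label_length. The alphabet ordering, the padding value and the helper name encode_batch are assumptions made for this example, not something Keras prescribes. As far as I know, the TensorFlow backend reserves the last class index for the blank, so it is left unused here; please check this against your Keras version.

```python
import string

# Assumed alphabet: 26 letters followed by the space. In a 29-class setup,
# index 27 could be the end-character and index 28 the blank (assumption).
ALPHABET = list(string.ascii_lowercase) + [" "]
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
PAD_VALUE = 0  # arbitrary: positions beyond label_length are ignored


def encode_batch(sentences):
    """Return (y_true, label_length) as nested lists for a batch of sentences."""
    max_len = max(len(s) for s in sentences)
    y_true, label_length = [], []
    for sentence in sentences:
        indices = [CHAR_TO_INDEX[ch] for ch in sentence]
        label_length.append([len(indices)])                 # shape (samples, 1)
        y_true.append(indices + [PAD_VALUE] * (max_len - len(indices)))
    return y_true, label_length


y_true, label_length = encode_batch(["hi", "a cab"])
print(y_true)        # [[7, 8, 0, 0, 0], [0, 26, 2, 0, 1]]
print(label_length)  # [[2], [5]]
```

input_length follows the same (samples, 1) layout but counts the usable time steps of y_pred for each sample.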
How to use this?
There are some possibilities.
1- If you don't care about lengths:
If all lengths are the same, you can easily use it as a regular loss:
def ctc_loss(y_true, y_pred):
    # where input_length and label_length are constants you created previously
    # the easiest way here is to have a fixed batch size in training
    # the lengths should have the same batch size (see shapes in the link for ctc_cost)
    return K.ctc_batch_cost(y_true, y_pred, input_length, label_length)

model.compile(loss=ctc_loss, ...)
# here is how you pass the labels for training
model.fit(input_data_X_train, ground_truth_data_Y_train, ....)
2 - If you care about the lengths.
This is a little more complicated: your model needs to somehow tell you the length of each output sentence.
There are again several creative forms of doing this:
Have an "end_of_sentence" char and detect where in the sentence it is.
Have a branch of your model to calculate this number and round it to integer.
(Hardcore) If you are using stateful manual training loop, get the index of the iteration you decided to finish a sentence
I like the first idea, and will exemplify it here.
def ctc_find_eos(y_true, y_pred):
    # convert y_pred from one-hot to label indices
    y_pred_ind = K.argmax(y_pred, axis=-1)
    # to make sure y_pred has one end_of_sentence (to avoid errors)
    y_pred_end = K.concatenate([
        y_pred_ind[:, :-1],
        eos_index * K.ones_like(y_pred_ind[:, -1:])
    ], axis=1)
    # to make sure the first occurrence of the char is more important than subsequent ones
    # (step=-1 is required: with the default step of 1 this range would be empty)
    occurrence_weights = K.arange(start=max_length, stop=0, step=-1, dtype=K.floatx())
    # is eos? (K.cast is used because K.cast_to_floatx only accepts
    # NumPy arrays in some Keras versions)
    is_eos_true = K.cast(K.equal(y_true, eos_index), K.floatx())
    is_eos_pred = K.cast(K.equal(y_pred_end, eos_index), K.floatx())
    # lengths
    true_lengths = 1 + K.argmax(occurrence_weights * is_eos_true, axis=1)
    pred_lengths = 1 + K.argmax(occurrence_weights * is_eos_pred, axis=1)
    # reshape
    true_lengths = K.reshape(true_lengths, (-1, 1))
    pred_lengths = K.reshape(pred_lengths, (-1, 1))
    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)

model.compile(loss=ctc_find_eos, ....)
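The weighting trick used above can be checked on a single sequence without Keras. This is only a plain-Python sketch of the same idea (decreasing weights multiplied by an "is end-of-sentence" indicator, then argmax); the function name and the sample label sequences are made up for illustration.

```python
def first_eos_length(labels, eos_index):
    """Length up to and including the first end-of-sentence symbol."""
    max_length = len(labels)
    # decreasing weights: max_length, max_length - 1, ..., 1
    weights = range(max_length, 0, -1)
    # earlier occurrences of eos get a larger score than later ones
    scores = [w * (1 if label == eos_index else 0)
              for w, label in zip(weights, labels)]
    # index() returns the first position of the maximum, like argmax
    return 1 + scores.index(max(scores))


print(first_eos_length([3, 5, 27, 1, 27], eos_index=27))  # 3
```

Note that a sequence with no end-of-sentence symbol at all yields a length of 1 with this trick, which is why the function above forces one at the last position of y_pred.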
If you use the other option, use a model branch to calculate the lengths, concatenate these lengths to the first or last step of the output, and make sure you do the same with the true lengths in your ground truth data. Then, in the loss function, just take the section for lengths:
def ctc_concatenated_length(y_true, y_pred):
    # assuming you concatenated the length in the first step
    true_lengths = y_true[:, :1]  # may need to cast to int
    y_true = y_true[:, 1:]
    # since y_pred uses one-hot, you will need to concatenate to full size of the last axis,
    # thus the 0 here
    pred_lengths = K.cast(y_pred[:, :1, 0], "int32")
    y_pred = y_pred[:, 1:]
    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)

Related

LSTM Autoencoder problems

TLDR:
Autoencoder underfits timeseries reconstruction and just predicts average value.
Question Set-up:
Here is a summary of my attempt at a sequence-to-sequence autoencoder. This image was taken from this paper: https://arxiv.org/pdf/1607.00148.pdf
Encoder: Standard LSTM layer. Input sequence is encoded in the final hidden state.
Decoder: LSTM Cell (I think!). Reconstruct the sequence one element at a time, starting with the last element x[N].
Decoder algorithm is as follows for a sequence of length N:
Get Decoder initial hidden state hs[N]: Just use encoder final hidden state.
Reconstruct last element in the sequence: x[N]= w.dot(hs[N]) + b.
Same pattern for other elements: x[i]= w.dot(hs[i]) + b
use x[i] and hs[i] as inputs to LSTMCell to get x[i-1] and hs[i-1]
Minimum Working Example:
Here is my implementation, starting with the encoder:
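import torch
from torch import nn, optim
from tqdm import tqdm

class SeqEncoderLSTM(nn.Module):
    def __init__(self, n_features, latent_size):
        super(SeqEncoderLSTM, self).__init__()
        self.lstm = nn.LSTM(
            n_features,
            latent_size,
            batch_first=True)

    def forward(self, x):
        # hs is the tuple (final hidden state, final cell state)
        _, hs = self.lstm(x)
        return hs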
Decoder class:
class SeqDecoderLSTM(nn.Module):
    def __init__(self, emb_size, n_features):
        super(SeqDecoderLSTM, self).__init__()
        self.cell = nn.LSTMCell(n_features, emb_size)
        self.dense = nn.Linear(emb_size, n_features)

    def forward(self, hs_0, seq_len):
        x = torch.tensor([])
        # Final hidden and cell state from encoder
        hs_i, cs_i = hs_0
        # reconstruct first element with encoder output
        x_i = self.dense(hs_i)
        x = torch.cat([x, x_i])
        # reconstruct remaining elements
        for i in range(1, seq_len):
            hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
            x_i = self.dense(hs_i)
            x = torch.cat([x, x_i])
        return x
Bringing the two together:
class LSTMEncoderDecoder(nn.Module):
    def __init__(self, n_features, emb_size):
        super(LSTMEncoderDecoder, self).__init__()
        self.n_features = n_features
        self.hidden_size = emb_size
        self.encoder = SeqEncoderLSTM(n_features, emb_size)
        self.decoder = SeqDecoderLSTM(emb_size, n_features)

    def forward(self, x):
        seq_len = x.shape[1]
        hs = self.encoder(x)
        hs = tuple([h.squeeze(0) for h in hs])
        out = self.decoder(hs, seq_len)
        return out.unsqueeze(0)
And here's my training function:
def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-6, reverse=False):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Training model on {device}')
    model = model.to(device)
    opt = optimizer(model.parameters(), lr)
    train_loss = []
    valid_loss = []
    for e in tqdm(range(epochs)):
        running_tl = 0
        running_vl = 0
        for x in trainload:
            x = x.to(device).float()
            opt.zero_grad()
            x_hat = model(x)
            if reverse:
                x = torch.flip(x, [1])
            loss = criterion(x_hat, x)
            loss.backward()
            opt.step()
            running_tl += loss.item()
        if testload is not None:
            model.eval()
            with torch.no_grad():
                for x in testload:
                    x = x.to(device).float()
                    loss = criterion(model(x), x)
                    running_vl += loss.item()
                valid_loss.append(running_vl / len(testload))
            model.train()
        train_loss.append(running_tl / len(trainload))
    return train_loss, valid_loss
Data:
Large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and only contains 5 variables. Here is the sequence I'm trying to overfit:
tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)
And here is the custom Dataset class:
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

class TimeseriesDataSet(Dataset):
    def __init__(self, data, window, n_features, overlap=0):
        super().__init__()
        if isinstance(data, (np.ndarray)):
            data = torch.tensor(data)
        elif isinstance(data, (pd.Series, pd.DataFrame)):
            data = torch.tensor(data.copy().to_numpy())
        else:
            raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")
        self.n_features = n_features
        # split into consecutive windows of `window` timesteps
        self.seqs = torch.split(data, window)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        try:
            return self.seqs[idx].view(-1, self.n_features)
        except TypeError:
            raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")
Problem:
The model only learns the average, no matter how complex I make the model or how long I train it.
Predicted/Reconstruction:
Actual:
My research:
This problem is identical to the one discussed in this question: LSTM autoencoder always returns the average of the input sequence
The problem in that case ended up being that the objective function was averaging the target timeseries before calculating loss. This was due to some broadcasting errors because the author didn't have the right sized inputs to the objective function.
In my case, I do not see this being the issue. I have checked and double checked that all of my dimensions/sizes line up. I am at a loss.
Other Things I've Tried
I've tried this with varied sequence lengths from 7 timesteps to 100 time steps.
I've tried with varied number of variables in the time series. I've tried with univariate all the way to all 274 variables that the data contains.
I've tried with various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
The paper calls for reconstructing the sequence in reverse order (see graphic above). I have tried this method using the flipud on the original input (after training but before calculating loss). This makes no difference.
I tried making the model more complex by adding an extra LSTM layer in the encoder.
I've tried playing with the latent space. I've tried from 50% of the input number of features to 150%.
I've tried overfitting a single sequence (provided in the Data section above).
Question:
What is causing my model to predict the average and how do I fix it?
Okay, after some debugging I think I know the reasons.
TLDR
You are trying to predict the next timestep's value instead of the difference between the current timestep and the previous one
Your hidden_features number is too small, making the model unable to fit even a single sample
Analysis
Code used
Let's start with the code (model is the same):
import seaborn as sns
import matplotlib.pyplot as plt
import torch

def get_data(subtract: bool = False):
    # (1, 14, 5)
    input_tensor = torch.tensor(
        [
            [0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
            [0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
            [0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
            [0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
            [0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
            [0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
            [0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
            [0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
            [0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
            [0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
            [0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
            [0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
            [0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
            [0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
        ]
    ).unsqueeze(0)
    if subtract:
        # keep the first timestep as is, replace the others by x[i] - x[i-1]
        initial_values = input_tensor[:, 0, :]
        input_tensor -= torch.roll(input_tensor, 1, 1)
        input_tensor[:, 0, :] = initial_values
    return input_tensor

if __name__ == "__main__":
    torch.manual_seed(0)
    HIDDEN_SIZE = 10
    SUBTRACT = False
    input_tensor = get_data(SUBTRACT)
    model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.MSELoss()
    for i in range(1000):
        outputs = model(input_tensor)
        loss = criterion(outputs, input_tensor)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"{i}: {loss}")
        if loss < 1e-4:
            break
    # Plotting
    sns.lineplot(data=outputs.detach().numpy().squeeze())
    sns.lineplot(data=input_tensor.detach().numpy().squeeze())
    plt.show()
What it does:
get_data either works on the data you provided (if subtract=False) or, if subtract=True, subtracts the value of the previous timestep from the current timestep
The rest of the code optimizes the model until a loss of 1e-4 is reached (so we can compare how the model's capacity and its increase help, and what happens when we use the difference of timesteps instead of the timesteps themselves)
We will only vary HIDDEN_SIZE and SUBTRACT parameters!
NO SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=False
In this case we get a straight line. Model is unable to fit and grasp the phenomena presented in the data (hence flat lines you mentioned).
1000 iterations limit reached
SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=True
Targets are now far from flat lines, but model is unable to fit due to too small capacity.
1000 iterations limit reached
NO SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=False
It got a lot better and our target was hit after 942 steps. No more flat lines, model capacity seems quite fine (for this single example!)
SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=True
Although the graph does not look that pretty, we got to desired loss after only 215 iterations.
Finally
Usually, use the difference of timesteps instead of the timesteps themselves (or some other transformation, see here for more info about that). Otherwise, the neural network will try to simply... copy the output from the previous step (as that's the easiest thing to do). Some minimum will be found this way, and getting out of it will require more capacity.
When you use the difference between timesteps there is no way to "extrapolate" the trend from the previous timestep; the neural network has to learn how the function actually varies
Use a larger model (for the whole dataset you should try something like 300, I think), but you can simply tune that one.
Don't use flipud. Use bidirectional LSTMs; this way you can get info from the forward and backward pass of the LSTM (not to be confused with backprop!). This should also boost your score
Questions
Okay, question 1: You are saying that for variable x in the time
series, I should train the model to learn x[i] - x[i-1] rather than
the value of x[i]? Am I correctly interpreting?
Yes, exactly. Taking the difference removes the urge of the neural network to base its predictions too much on the past timestep (by simply taking the last value and maybe changing it a little)
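For reference, here is a small plain-Python sketch of that transformation and its inverse, following the same convention as get_data above (the first value is kept as is, every other entry becomes x[i] - x[i-1]). The function names and the sample series are made up for illustration.

```python
from itertools import accumulate


def to_differences(series):
    """Keep the first value, replace the rest by x[i] - x[i-1]."""
    return [series[0]] + [b - a for a, b in zip(series, series[1:])]


def from_differences(diffs):
    """Invert to_differences with a running (cumulative) sum."""
    return list(accumulate(diffs))


series = [10, 12, 11, 15]
diffs = to_differences(series)
print(diffs)                    # [10, 2, -1, 4]
print(from_differences(diffs))  # [10, 12, 11, 15]
```

The running sum is what you would apply to the model's outputs to get back to the original scale after training on differences.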
Question 2: You said my calculations for zero bottleneck were
incorrect. But, for example, let's say I'm using a simple dense
network as an auto encoder. Getting the right bottleneck indeed
depends on the data. But if you make the bottleneck the same size as
the input, you get the identity function.
Yes, assuming that there is no non-linearity involved, which makes things harder (see here for a similar case). In the case of LSTMs there are non-linearities, that's one point.
Another one is that we are accumulating timesteps into a single encoder state. So essentially we would have to accumulate the identities of all timesteps into a single hidden state and cell state, which is highly unlikely.
One last point: depending on the length of the sequence, LSTMs are prone to forgetting some of the least relevant information (that's what they were designed to do, not to remember everything), hence it is even more unlikely.
Is num_features * num_timesteps not a bottle neck of the same size as
the input, and therefore shouldn't it facilitate the model learning
the identity?
It is, but it assumes you have num_timesteps for each data point, which is rarely the case, might be here. About the identity and why it is hard to do with non-linearities for the network it was answered above.
One last point, about identity functions; if they were actually easy to learn, ResNets architectures would be unlikely to succeed. Network could converge to identity and make "small fixes" to the output without it, which is not the case.
I'm curious about the statement : "always use difference of timesteps
instead of timesteps" It seem to have some normalizing effect by
bringing all the features closer together but I don't understand why
this is key ? Having a larger model seemed to be the solution and the
substract is just helping.
Key here was, indeed, increasing model capacity. Subtraction trick depends on the data really. Let's imagine an extreme situation:
We have 100 timesteps, single feature
Initial timestep value is 1000
Other timestep values vary by 1 at most
What would the neural network do (what is the easiest thing here)? It would probably discard this change of 1 or less as noise and just predict 1000 for all of them (especially if some regularization is in place), as being off by 1/1000 is not much.
What if we subtract? The neural network's loss is in the [0, 1] range for each timestep instead of [0, 1001], hence it is more severe to be wrong.
And yes, it is connected to normalization in some sense come to think about it.

Counting with Keras

What I'm trying to do
I'm currently making a very simple sequence-to-sequence LSTM using Keras with a minor twist: earlier predictions in the sequence should count against the loss less than later ones. The way I'm trying to do this is by counting the sequence number and multiplying the loss by the square root of this count. (I want to do this because this value is representative of the relative ratio of uncertainty in a Poisson process based on the number of samples collected. My network is gathering data and attempting to estimate an invariant value based on the data gathered so far.)
How I'm trying to do it
I've implemented both a custom loss function and a custom layer.
Loss function:
def loss_function(y_true, y_pred):
    # extract_output essentially concatenates the first three regression outputs of y
    # into a list representing an [x, y, z] vector, and returns it along with the rest as a tuple
    r, e, n = extract_output(y_true)
    r_h, e_h, n_h = extract_output(y_pred)
    # Hyperparameters
    dir_loss_weight = 10
    dist_loss_weight = 1
    energy_loss_weight = 3
    norm_r = sqrt(dot(r, r))
    norm_r_h = sqrt(dot(r_h, r_h))
    dir_loss = mean_squared_error(r/norm_r, r_h/norm_r_h)
    dist_loss = mean_squared_error(norm_r, norm_r_h)
    energy_loss = mean_squared_error(e, e_h)
    return sqrt(n) * (dir_loss_weight * dir_loss + dist_loss_weight * dist_loss + energy_loss_weight * energy_loss)
Custom Layer:
class CounterLayer(Layer):
    def __init__(self, **kwargs):
        super(CounterLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.sequence_number = 0
        pass

    def call(self, x):
        self.sequence_number += 1
        return [self.sequence_number]

    def compute_output_shape(self, input_shape):
        return (1,)
I then added the input as a concatenation to the regular output:
seq_num = CounterLayer()(inputs)
outputs = concatenate([out, seq_num])
What's going wrong
My error is:
Traceback (most recent call last):
File "lstm.py", line 119, in <module>
main()
File "lstm.py", line 115, in main
model = create_model()
File "lstm.py", line 74, in create_model
seq_num = CounterLayer()(inputs)
File "/usr/lib/python3.7/site-packages/keras/engine/base_layer.py", line 497, in __call__
arguments=user_kwargs)
File "/usr/lib/python3.7/site-packages/keras/engine/base_layer.py", line 565, in _add_inbound_node
output_tensors[i]._keras_shape = output_shapes[i]
AttributeError: 'int' object has no attribute '_keras_shape'
I'm assuming I have the shape wrong. But I do not know how. Does anyone know if I'm going about this in the wrong way? What should I do to make this happen?
Further Adventures
Per @Mohammad Jafar Mashhadi's comment, my call return needed to be wrapped in a keras.backend.variable; however, per his linked answer, my approach will not work, because call is not called multiple times, as I initially assumed it was.
How can I get a counter for the RNN?
For clarity, if the RNN given input xi outputs yi, I'm trying to get i as part of my output.
x1 -> RNN -> (y1, 1)
    h1 |
       v
x2 -> RNN -> (y2, 2)
    h2 |
       v
x3 -> RNN -> (y3, 3)
    h3 |
       v
x4 -> RNN -> (y4, 4)
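The error is saying that the value produced by call in the line seq_num = CounterLayer()(inputs) is a plain Python integer rather than a tensor (the traceback fails while setting _keras_shape on the layer's output).
Layers can't take or return integers. You must pass keras tensors, and only keras tensors.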
Second, this will not work because Keras works in a static graph style. A call in a layer doesn't calculate things, it only builds the graph of empty tensors. Only tensors will ever get updated as you pass data to them, integer values will not. When you say self.sequence_number += 1, it will be called only when building the model, it will not be called over and over.
We need details
We can't really understand what is going on if you don't give us enough information, such as:
the model
the summary
your input data shapes
the target data shapes
the custom functions
etc.
If the interpretation below is correct, the model's output shape in the summary and the target data shapes as you pass it to fit are absolutely important to know.
Proposed solution
If I understood what you described, you want to have a sequence of increasing integers along with the time steps of your sequences, so these numbers are used in your loss function.
If this interpretation is right, you don't need to keep updating numbers, you just need to create a range tensor and that's it.
So, inside your loss (which I don't understand unless you provide us with the custom functions) you should create this tensor:
def custom_loss(y_true, y_pred):
    # keep only ONE of the two definitions of length below
    # use this if you defined your model with static sequence length - input_shape = (length, features)
    length = K.int_shape(y_pred)[1]
    # use this if you defined your model with dynamic sequence length - input_shape = (None, features)
    length = K.shape(y_pred)[1]
    # this is the sequence vector (cast to float, since tf.range returns integers
    # and the square root needs a floating point tensor):
    seq = K.cast(tf.range(1, length + 1), K.floatx())
    # you can get the root with
    sec = K.sqrt(seq)
    # you reshape to match the shape of the loss, which is probably (batch, length)
    sec = K.reshape(sec, (1, -1))  # shape (1, length)
    # compute your loss normally, taking care to reduce the last axis and keep the two first
    loss = .....  # shape (batch, length)
    # multiply the weights by the loss
    return loss * sec
You must work with everything as a whole tensor! You cannot interpret it step by step. You must do everything keeping the first and second dimensions in your loss.
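To see what that weighting does numerically, here is a plain-Python sketch that applies the weights sqrt(1), sqrt(2), ..., sqrt(length) to a per-step loss of shape (batch, length), stored as nested lists. The function name and the sample loss values are made up; in the actual Keras loss this is the single broadcasted multiplication loss * sec.

```python
import math


def weight_by_sqrt_step(per_step_loss):
    """Multiply each time step t (1-based) of every row by sqrt(t)."""
    length = len(per_step_loss[0])
    weights = [math.sqrt(t) for t in range(1, length + 1)]
    # the same weight vector is applied to every sample of the batch
    return [[loss * w for loss, w in zip(row, weights)]
            for row in per_step_loss]


# batch of 2 sequences with 4 time steps each (hypothetical loss values)
weighted = weight_by_sqrt_step([[1.0, 1.0, 1.0, 1.0],
                                [2.0, 0.0, 0.0, 0.5]])
print(weighted[0][3])  # 2.0
print(weighted[1][3])  # 1.0
```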
I'm not sure I have understood the question completely, but based on the final drawing, I think that to get an extra feature (the time step) fed into the loss function along with the predictions, you might try to use the second approach suggested in this other accepted answer:
Custom loss function in Keras based on the input data
The idea is to expand the label vector with the extra feature, and then separate them again inside the loss function.
# y_true_plus_timesteps has shape [n_training_instances, 2]
def custom_loss(y_true_plus_timesteps, y_pred):
    # labels stored in the first column
    y_true = y_true_plus_timesteps[:, 0]
    # time steps stored in the second column
    time_steps = y_true_plus_timesteps[:, 1]
    # add your own loss terms (e.g. using time_steps) to this expression
    return K.mean(K.square(y_pred - y_true), axis=-1)

# note that labels are fed into the model with time steps
model.fit(X, np.append(Y_true, time_steps, axis=1), batch_size=batch_size, epochs=90, shuffle=True, verbose=1)

How to pad sequences during training for an encoder decoder model

I've got an encoder-decoder model for character level English language spelling correction, it is pretty basic stuff with a two LSTM encoder and another LSTM decoder.
However, up until now, I have been pre-padding the encoder input sequences, like below:
abc -> -abc
defg -> defg
ad -> --ad
And next I have been splitting the data into several groups with the same decoder input length, e.g.
train_data = {'15': [...], '16': [...], ...}
where the key is the length of the decoder input data and I have been training the model once for each length in a loop.
However, there has to be a better way to do this, such as padding after the EOS or before SOS characters etc. But if this is the case, how would I change the loss function so that this padding isn't counted into the loss?
The standard way of doing padding is putting it after the end-of-sequence token, but it shouldn't really matter where the padding is.
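As a sketch of that standard layout, the snippet below appends an end-of-sequence symbol and then pads on the right up to the longest sequence in the batch. The symbols ">" and "-" and the function name are assumptions chosen for illustration (the question uses "-" for padding; the end-of-sequence symbol is not specified there).

```python
def pad_after_eos(words, eos=">", pad="-"):
    """Append the EOS symbol, then right-pad every word to the same length."""
    target_len = max(len(w) for w in words) + 1  # +1 for the EOS symbol
    return [(w + eos).ljust(target_len, pad) for w in words]


print(pad_after_eos(["abc", "defg", "ad"]))  # ['abc>-', 'defg>', 'ad>--']
```

With this layout a single batch can mix sequences of different lengths, so the per-length training loop is no longer needed.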
The trick for not including the padded positions in the loss is to mask them out before reducing the loss. Assuming the PAD_ID variable contains the index of the symbol that you use for padding:
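def custom_loss(y_true, y_pred):
    # y_true is assumed to hold integer symbol indices with shape (batch, length),
    # not one-hot vectors; this is what the comparison with PAD_ID requires
    mask = 1 - K.cast(K.equal(y_true, PAD_ID), K.floatx())
    # sparse variant, because the targets are indices rather than one-hot vectors
    loss = K.sparse_categorical_crossentropy(y_true, y_pred) * mask
    return K.sum(loss) / K.sum(mask)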
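The effect of the mask can be verified on toy numbers without Keras. The sketch below computes the same masked mean over nested lists; the per-token loss values, the token ids and the function name are invented for illustration.

```python
def masked_mean_loss(per_token_loss, token_ids, pad_id):
    """Average the per-token loss over non-padding positions only."""
    total, count = 0.0, 0
    for loss_row, id_row in zip(per_token_loss, token_ids):
        for loss, token in zip(loss_row, id_row):
            if token != pad_id:  # padded positions contribute nothing
                total += loss
                count += 1
    return total / count if count else 0.0


# The large values sit on padded positions and must not affect the result
losses = [[1.0, 3.0, 100.0],
          [2.0, 100.0, 100.0]]
ids = [[5, 6, 0],
       [7, 0, 0]]
print(masked_mean_loss(losses, ids, pad_id=0))  # 2.0
```

Dividing by the number of unmasked positions, rather than by the total number of positions, keeps the loss scale independent of how much padding a batch happens to contain.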

Custom combined hinge/kb-divergence loss function in siamese-net fails to generate meaningful speaker-embeddings

I'm currently trying to implement a siamese-net in Keras where I have to implement the following loss function:
loss(p ∥ q) = Is · KL(p ∥ q) + Ids · HL(p ∥ q)
detailed description of loss function from paper
Where KL is the Kullback-Leibler divergence and HL is the Hinge-loss.
During training, I label same-speaker pairs as 1, different speakers as 0.
The goal is to use the trained net to extract embeddings from spectrograms.
A spectrogram is a 2-dimensional numpy-array 40x128 (time x frequency)
The problem is I never get over 0.5 accuracy, and when clustering speaker-embeddings the results show there seems to be no correlation between embeddings and speakers.
I implemented the KL divergence as the distance measure, and adjusted the hinge-loss accordingly:
def kullback_leibler_divergence(vects):
    x, y = vects
    x = ks.backend.clip(x, ks.backend.epsilon(), 1)
    y = ks.backend.clip(y, ks.backend.epsilon(), 1)
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)

def kullback_leibler_shape(shapes):
    shape1, shape2 = shapes
    return shape1[0], 1

def kb_hinge_loss(y_true, y_pred):
    """
    y_true: binary label, 1 = same speaker
    y_pred: output of siamese net i.e. kullback-leibler divergence
    """
    MARGIN = 1.
    hinge = ks.backend.mean(ks.backend.maximum(MARGIN - y_pred, 0.), axis=-1)
    return y_true * y_pred + (1 - y_true) * hinge
A single spectrogram would be fed into a branch of the base network, the siamese-net consists of two such branches, so two spectrograms are fed simultaneously, and joined in the distance-layer. The output of the base network is 1 x 128. The distance layer computes the kullback-leibler divergence and its output is fed into the kb_hinge_loss. The architecture of the base-network is as follows:
def create_lstm(units: int, gpu: bool, name: str, is_sequence: bool = True):
    if gpu:
        return ks.layers.CuDNNLSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
    else:
        return ks.layers.LSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)

def build_model(mode: str = 'train') -> ks.Model:
    topology = TRAIN_CONF['topology']
    is_gpu = tf.test.is_gpu_available(cuda_only=True)
    model = ks.Sequential(name='base_network')
    model.add(
        ks.layers.Bidirectional(create_lstm(topology['blstm1_units'], is_gpu, name='blstm_1'), input_shape=INPUT_DIMS))
    model.add(ks.layers.Dropout(topology['dropout1']))
    model.add(ks.layers.Bidirectional(create_lstm(topology['blstm2_units'], is_gpu, is_sequence=False, name='blstm_2')))
    if mode == 'extraction':
        return model
    num_units = topology['dense1_units']
    model.add(ks.layers.Dense(num_units, name='dense_1'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    model.add(ks.layers.Dropout(topology['dropout2']))
    num_units = topology['dense2_units']
    model.add(ks.layers.Dense(num_units, name='dense_2'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense3_units']
    model.add(ks.layers.Dense(num_units, name='dense_3'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense4_units']
    model.add(ks.layers.Dense(num_units, name='dense_4'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    return model
I then build a siamese net as follows:
base_network = build_model()
input_a = ks.Input(shape=INPUT_DIMS, name='input_a')
input_b = ks.Input(shape=INPUT_DIMS, name='input_b')
processed_a = base_network(input_a)
processed_b = base_network(input_b)
distance = ks.layers.Lambda(kullback_leibler_divergence,
                            output_shape=kullback_leibler_shape,
                            name='distance')([processed_a, processed_b])
model = ks.Model(inputs=[input_a, input_b], outputs=distance)
adam = build_optimizer()
model.compile(loss=kb_hinge_loss, optimizer=adam, metrics=['accuracy'])
Lastly, I build a net with the same architecture with only one input, and try to extract embeddings, and then build the mean over them, where an embedding should serve as a representation for a speaker, to be used during clustering:
utterance_embedding = np.mean(embedding_extractor.predict_on_batch(spectrogram), axis=0)
We train the net on the voxceleb speaker set.
The full code can be seen here: GitHub repo
I'm trying to figure out if I have made any wrong assumptions and how to improve my accuracy.
Issue with accuracy
Notice that in your model:
y_true = labels
y_pred = kullback-leibler divergence
These two cannot be compared directly, see this example:
For correct results, when y_true == 1 (same speaker), the Kullback-Leibler divergence should be y_pred == 0 (no divergence).
So it's totally expected that metrics will not work properly.
Then, either you create a custom metric, or you count only on the loss for evaluations.
This custom metric should need a few adjustments in order to be feasible, as explained below.
Possible issues with the loss
Clipping
This might be a problem
First, notice that you're using clip on the values for the Kullback-Leibler. This may be bad because clipping loses the gradients in the clipped regions. And since your activation is a PReLU, you have values lower than zero and bigger than 1. So there are certainly zero-gradient cases here and there, with the risk of having a frozen model.
So, you might not want to clip these values. And to avoid having negative values with the PReLU, you can try to use a 'softplus' activation, which is kind of a soft ReLU without negative values. You might also add an epsilon to avoid trouble, but there is no problem in leaving values bigger than one:
# assuming you used 'softplus' instead of 'PReLU' for speakers
def kullback_leibler_divergence(speakers):
    x, y = speakers
    x = x + ks.backend.epsilon()
    y = y + ks.backend.epsilon()
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
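To illustrate the gradient argument, here is a minimal NumPy sketch (not your Keras code). It assumes the clip range is (1e-7, 1), which is a guess on my part, and the helper names are hypothetical. It compares the derivative of a clip with the derivative of softplus on a few sample activations:

```python
import numpy as np

def clip_grad(x, lo=1e-7, hi=1.0):
    # Derivative of np.clip(x, lo, hi) w.r.t. x: 1 inside the range, 0 outside.
    # The bounds are assumed for illustration only.
    return ((x > lo) & (x < hi)).astype(float)

def softplus(x):
    # Numerically stable log(1 + exp(x)); always positive.
    return np.logaddexp(0.0, x)

def softplus_grad(x):
    # The derivative of softplus is the logistic sigmoid, which is never zero.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.3, 0.9, 1.5, 4.0])
clip_g = clip_grad(x)        # zero wherever the value was clipped
soft_v = softplus(x)
soft_g = softplus_grad(x)    # strictly positive everywhere
```

Every sample outside the clip range gets a zero gradient, while softplus keeps both its output and its gradient positive for all inputs.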
Asymmetry in Kullback-Leibler
This IS a problem
Notice also that Kullback-Leibler is not a symmetric function, and, when computed on outputs that are not normalized probability distributions, it doesn't have its minimum at zero! A perfect match gives zero, but bad matches can give lower (negative) values, and this is bad for a loss function because minimizing it will drive you to divergence.
See this picture showing KL's graph
Your paper states that you should sum two losses: (p||q) and (q||p).
This eliminates the asymmetry and also the negative values.
So:
distance1 = ks.layers.Lambda(kullback_leibler_divergence,
name='distance1')([processed_a, processed_b])
distance2 = ks.layers.Lambda(kullback_leibler_divergence,
name='distance2')([processed_b, processed_a])
distance = ks.layers.Add(name='dist_add')([distance1,distance2])
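A quick numeric check of this claim, as a NumPy sketch rather than Keras code. The epsilon value and the sample vectors are made up for illustration. Summing both directions gives sum((x - y) * log(x / y)), where every term is non-negative for positive x and y:

```python
import numpy as np

EPS = 1e-7  # assumed epsilon, standing in for ks.backend.epsilon()

def kl(x, y):
    # Same formula as kullback_leibler_divergence above, in NumPy.
    x = x + EPS
    y = y + EPS
    return np.sum(x * np.log(x / y), axis=-1)

def symmetric_kl(x, y):
    # KL(x||y) + KL(y||x)
    return kl(x, y) + kl(y, x)

# Positive vectors that do NOT sum to 1 (like softplus outputs).
a = np.array([0.2, 0.1, 0.3])
b = np.array([0.9, 0.8, 0.7])

one_sided = kl(a, b)        # negative for this pair
both = symmetric_kl(a, b)   # non-negative, and equal to symmetric_kl(b, a)
```

For this pair the one-sided value is below zero even though the vectors clearly differ, while the summed version is positive and only reaches zero when the two vectors are equal.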
Very low margin and clipped hinge
This might be a problem
Finally, see that the hinge loss also clips values below zero!
Since Kullback-Leibler is not limited to 1, samples with high divergence may not be controlled by this loss. Not sure if this is really an issue, but you might want to either:
increase the margin
inside the Kullback-Leibler, use mean instead of sum
use a softplus in hinge instead of a max, to avoid losing gradients.
See:
MARGIN = someValue
hinge = ks.backend.mean(ks.backend.softplus(MARGIN - y_pred), axis=-1)
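To see the difference between the two hinge variants on the MARGIN - y_pred term, here is a small NumPy sketch. MARGIN = 1.0 and the distances are hypothetical values chosen for illustration, not values from your code:

```python
import numpy as np

MARGIN = 1.0  # hypothetical value; tune for your data

def hard_hinge(d, margin=MARGIN):
    # max(0, margin - d): exactly zero (and flat) once d exceeds the margin.
    return np.maximum(0.0, margin - d)

def soft_hinge(d, margin=MARGIN):
    # softplus(margin - d): a smooth version that never reaches zero.
    return np.logaddexp(0.0, margin - d)

d = np.array([0.1, 0.9, 1.5, 5.0, 20.0])
hard = hard_hinge(d)
soft = soft_hinge(d)
```

The hard hinge is flat at zero for every distance past the margin, so those samples contribute no gradient. The softplus version keeps decreasing, so it still provides a (small) gradient there.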
Now we can think of a custom accuracy
This is not very easy, since we don't have clear limits on the KL divergence that tell us "correct/not correct".
You might try a threshold at random, but you'd need to tune it until you find a value that represents reality well. You may, for instance, use your validation data to find the threshold that gives the best accuracy.
# threshold is a value you choose and tune, e.g. on validation data
def customMetric(y_true_targets, y_pred_KBL):
    isMatch = ks.backend.less(y_pred_KBL, threshold)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    isMatch = ks.backend.equal(y_true_targets, isMatch)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    return ks.backend.mean(isMatch)
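One possible way to pick the threshold from validation data is a plain grid search, sketched below in NumPy. The function names, the candidate grid, and the toy labels and distances are all made up for illustration. In practice the distances would come from model.predict on validation pairs:

```python
import numpy as np

def pair_accuracy(y_true, distances, threshold):
    # Same rule as customMetric: predict "same speaker" (1) when the
    # distance is below the threshold, then compare with the labels.
    predicted_same = (distances < threshold).astype(y_true.dtype)
    return float(np.mean(predicted_same == y_true))

def best_threshold(y_true, distances, candidates):
    # Evaluate every candidate and keep the first one with the best accuracy.
    accuracies = [pair_accuracy(y_true, distances, t) for t in candidates]
    best = int(np.argmax(accuracies))
    return float(candidates[best]), accuracies[best]

# Toy validation set: 1 = same speaker, 0 = different speakers.
val_labels = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
val_dist = np.array([0.15, 0.35, 0.85, 0.65, 1.55, 2.45])

candidates = np.linspace(0.0, 3.0, 31)
threshold, acc = best_threshold(val_labels, val_dist, candidates)
```

The chosen value can then be used as the threshold in customMetric. Since it was tuned on the validation set, accuracy should be reported on separate data.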

Keras, output of model predict_proba

In the docs, predict_proba(self, x, batch_size=32, verbose=1) is described as:
Generates class probability predictions for the input samples batch by batch.
and returns
A Numpy array of probability predictions.
Suppose my model is a binary classification model. Is the output [a, b], where a is the probability of class_0 and b is the probability of class_1?
Here the situation is different and somewhat misleading, especially if you compare the predict_proba method to the sklearn methods with the same name. In Keras (not the sklearn wrappers), predict_proba is exactly the same as predict, apart from a warning when values fall outside [0, 1]. You can even check it here:
def predict_proba(self, x, batch_size=32, verbose=1):
    """Generates class probability predictions for the input samples
    batch by batch.
    # Arguments
        x: input data, as a Numpy array or list of Numpy arrays
            (if the model has multiple inputs).
        batch_size: integer.
        verbose: verbosity mode, 0 or 1.
    # Returns
        A Numpy array of probability predictions.
    """
    preds = self.predict(x, batch_size, verbose)
    if preds.min() < 0. or preds.max() > 1.:
        warnings.warn('Network returning invalid probability values. '
                      'The last layer might not normalize predictions '
                      'into probabilities '
                      '(like softmax or sigmoid would).')
    return preds
So, in a binary classification case, the output you get depends on the design of your network:
if the final output of your network is a single sigmoid unit, then the output of predict_proba is simply the probability assigned to class 1.
if the final output of your network is a two-dimensional output to which you apply a softmax, then the output of predict_proba is a pair [a, b] where a = P(class(x) = 0) and b = P(class(x) = 1).
This second method is rarely used and there are some theoretical advantages to using the first method, but I wanted to inform you, just in case.
It depends on how you specify the output of your model and your targets. It can be either. Usually, when doing binary classification, the output is a single value which is the probability of the positive class. One minus the output is the probability of the negative class.
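If you have the single-sigmoid design but want the sklearn-style [P(class 0), P(class 1)] layout, you can build it yourself from the predictions. A minimal NumPy sketch, where the helper name is hypothetical and preds stands in for whatever model.predict returns for a one-unit sigmoid output:

```python
import numpy as np

def to_two_class_probs(sigmoid_output):
    # sigmoid_output: array of shape (n_samples, 1) or (n_samples,),
    # each value being P(class 1).
    p1 = np.asarray(sigmoid_output, dtype=float).reshape(-1)
    # Column 0 is P(class 0) = 1 - P(class 1), column 1 is P(class 1).
    return np.stack([1.0 - p1, p1], axis=1)

# Stand-in for the output of a model ending in a single sigmoid unit.
preds = np.array([[0.9], [0.2], [0.5]])
probs = to_two_class_probs(preds)   # shape (3, 2), each row sums to 1
```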
