Tensorflow NMT with Attention Tutorial -- need help understanding loss function - python

I'm following Tensorflow's Neural Machine Translation with Attention tutorial (link) but am unclear about some implementation details. It'd be great if someone could help clarify or refer me to a source/better place to ask:
1) def loss_function(real, pred): This function computes the loss at a specific time step (say t), averaged over the entire batch. Examples whose label at t is <pad> (i.e. no real data, only padding so that all example sequences have the same length) are masked so as not to count towards the loss.
My question: It seems the loss should get smaller the bigger t is (since more examples are <pad> the closer we get to the maximum length). So why is the loss averaged over the entire batch size, and not just over the number of valid (non-<pad>) examples? (This is analogous to using tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS instead of tf.losses.Reduction.SUM_OVER_BATCH_SIZE.)
2) In the training loop (for epoch in range(EPOCHS)), two loss variables are defined:
loss = sum of loss_function() outputs over all time steps
batch_loss = loss divided by number of time steps
My question: Why are gradients computed w.r.t. loss and not batch_loss? Shouldn't batch_loss be the average loss over all time steps and the entire batch?
Many thanks!

It seems loss should get smaller the bigger t
The loss does get smaller since the pad token is getting masked while calculating the loss.
batch_loss is used only to print the loss calculated for each batch; it is computed for every batch, averaged across all the time steps. Since batch_loss is just loss divided by a constant (the number of time steps), the gradients with respect to either differ only by that constant factor, which merely rescales the step size.
for t in range(1, targ.shape[1]):
This loop runs over all the time steps for the batch and accumulates the loss, masking the padded values.
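For reference, here is a minimal sketch of that masking logic, along the lines of the tutorial's loss_function (loss_object is assumed to be a per-example SparseCategoricalCrossentropy with reduction='none', and the <pad> token is assumed to have id 0):
import tensorflow as tf

# Assumed: an unreduced per-example cross-entropy, as in the tutorial.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask positions where the target is the <pad> token (assumed id 0).
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    loss_ *= tf.cast(mask, dtype=loss_.dtype)
    # Note: averaged over the full batch size, masked positions included.
    return tf.reduce_mean(loss_)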
I hope this clears it up :)

Related

Simple L1 loss in PyTorch

I want to calculate the L1 loss in a neural network. I came across this example at https://discuss.pytorch.org/t/simple-l2-regularization/139/2, but there are some errors in this code.
Is this really how to calculate L1 Loss in a NN or is there a simpler way?
l1_crit = nn.L1Loss()
reg_loss = 0
for param in model.parameters():
    reg_loss += l1_crit(param)
factor = 0.0005
loss += factor * reg_loss
Is this equivalent in any way to simply doing:
loss = torch.nn.L1Loss()
I assume not, because I am not passing along any network parameters. Just checking whether there is an existing function to do this.
If I understand correctly, you want to compute the L1 loss of your model (as you say in the beginning). However, I think you might have gotten confused by the discussion on the PyTorch forum.
From what I understand of the PyTorch forum thread and the code you posted, the author is trying to normalize the network weights with L1 regularization, i.e. to enforce that the weight values fall in a sensible range (not too big, not too small). That is weight normalization using the L1 norm (which is why it iterates over model.parameters()). Normalization takes a value as input and produces a normalized value as output.
Check this for weights normalization: https://pytorch.org/docs/master/generated/torch.nn.utils.weight_norm.html
On the other hand, the L1 loss is just a way to determine how two values differ from each other, so the "loss" is just a measure of this difference. In the case of the L1 loss this error is computed as the mean absolute error, loss = |x - y|, where x and y are the values to compare. So the error computation takes two values as input and produces one value as output.
Check this for loss computing: https://pytorch.org/docs/master/generated/torch.nn.L1Loss.html
To answer your question: no, the two snippets are not equivalent, since the first is doing weight normalization and in the second you are computing a loss. This is the loss computation with some context:
sample, target = dataset[i]
target_predicted = model(sample)
loss = torch.nn.L1Loss()
loss_value = loss(target_predicted, target)
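For completeness, here is a minimal sketch of what the forum snippet was presumably aiming for: adding an L1 penalty on the weights to an ordinary data loss. The model, data, and l1_lambda value below are made up purely for illustration.
import torch
import torch.nn as nn

# Hypothetical model and data, just to make the snippet runnable.
model = nn.Linear(10, 1)
x, y = torch.randn(4, 10), torch.randn(4, 1)

criterion = nn.L1Loss()   # data loss: mean absolute error
l1_lambda = 0.0005        # regularization strength (assumed value)

prediction = model(x)
data_loss = criterion(prediction, y)

# L1 penalty: sum of absolute values of all trainable parameters.
l1_penalty = sum(param.abs().sum() for param in model.parameters())

loss = data_loss + l1_lambda * l1_penalty
loss.backward()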

Is there a difference between multiple times loss.backward() and loss.backward() after multiplying loss by n in Pytorch?

Is there a difference between these two code snippets?
1
Loss.backward(retain_graph=True)
Loss.backward(retain_graph=True)
Loss.backward()
optimizer.step()
2
Loss = 3 * Loss
Loss.backward()
optimizer.step()
When I checked the gradients of the parameters after the last backward(), there was no difference between the two code snippets. However, there is a small difference in test accuracy after training.
I know this is not a common case, but it is related to the research I'm doing.
To me, the two mechanisms look different, even though they end up producing the same gradient.
Calling .backward() three times (first code snippet) does not compute anything new each time; every call accumulates the same gradient into the .grad attribute of your leaf tensors, so after three calls the stored gradient is three times the single-call gradient (check the .grad attribute on your leaf tensors).
The second code snippet just multiplies the gradients by three through the loss, thus scaling up the gradient descent step. For a standard gradient descent optimizer, it would be like multiplying the learning rate by 3.
Hope this helps.
In option 1, every time you call .backward(), gradients are computed and accumulated into .grad. After 3 calls, when you perform optimizer.step(), the accumulated (added-up) gradients are used and the weights are updated accordingly.
In option 2, you multiply the loss with a constant, so the gradients will be multiplied with that constant too.
So, adding a gradient value 3 times and multiplying the gradient value by 3 would result in the same parameter update.
Please note, I assume there is no loss due to floating point precision (as noted in the comments).
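A quick way to convince yourself of the equivalence is a toy example with a single parameter (not the actual training code, just an illustration):
import torch

w = torch.tensor([2.0], requires_grad=True)

# Option 1: call backward three times on the same loss.
loss = (w ** 2).sum()
loss.backward(retain_graph=True)
loss.backward(retain_graph=True)
loss.backward()
print(w.grad)  # tensor([12.]) -- three times the single-call gradient of 4

# Option 2: multiply the loss by 3 and call backward once.
w.grad = None
loss = 3 * (w ** 2).sum()
loss.backward()
print(w.grad)  # tensor([12.]) -- the same accumulated gradient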

Proper way to feed time-series data to stateful LSTM?

Let's suppose I have a sequence of integers:
0,1,2, ..
and want to predict the next integer given the last 3 integers, e.g.:
[0,1,2]->3, [3,4,5]->6, etc
Suppose I setup my model like so:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

batch_size = 1
time_steps = 3
model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, time_steps, 1), stateful=True))
model.add(Dense(1))
It is my understanding that the model has the following structure (please excuse the crude drawing):
First Question: is my understanding correct?
Note I have drawn the previous states C_{t-1}, h_{t-1} entering the picture as this is exposed when specifying stateful=True. In this simple "next integer prediction" problem, the performance should improve by providing this extra information (as long as the previous state results from the previous 3 integers).
This brings me to my main question: it seems the standard practice (for example, see this blog post and the TimeseriesGenerator Keras preprocessing utility) is to feed a staggered set of inputs to the model during training.
For example:
batch0: [[0, 1, 2]]
batch1: [[1, 2, 3]]
batch2: [[2, 3, 4]]
etc
This has me confused because it seems this requires the output of the 1st LSTM cell (corresponding to the 1st time step). See this figure:
From the tensorflow docs:
stateful: Boolean (default False). If True, the last state for each
sample at index i in a batch will be used as initial state for the
sample of index i in the following batch.
it seems this "internal" state isn't available and all that is available is the final state. See this figure:
So, if my understanding is correct (which it's clearly not), shouldn't we be feeding non-overlapped windows of samples to the model when using stateful=True? E.g.:
batch0: [[0, 1, 2]]
batch1: [[3, 4, 5]]
batch2: [[6, 7, 8]]
etc
The answer is: it depends on the problem at hand. For your case of one-step prediction - yes, you can, but you don't have to. But whether you do or not will significantly impact learning.
Batch vs. sample mechanism ("see AI" = see "additional info" section)
All models treat samples as independent examples; a batch of 32 samples is like feeding 1 sample at a time, 32 times (with differences - see AI). From model's perspective, data is split into the batch dimension, batch_shape[0], and the features dimensions, batch_shape[1:] - the two "don't talk." The only relation between the two is via the gradient (see AI).
Overlap vs no-overlap batch
Perhaps the best approach to understand it is information-based. I'll begin with timeseries binary classification, then tie it to prediction: suppose you have 10-minute EEG recordings, 240000 timesteps each. Task: seizure or non-seizure?
As 240k is too much for an RNN to handle, we use CNN for dimensionality reduction
We have the option to use "sliding windows" - i.e. feed a subsegment at a time; let's use 54k
Take 10 samples, shape (240000, 1). How to feed?
(1) (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[54000:108000] ...
(2) (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[1:54001] ...
Which of the two above do you take? If (2), your neural net will never confuse a seizure for a non-seizure for those 10 samples. But it'll also be clueless about any other sample. I.e., it will massively overfit, because the information it sees per iteration barely differs (1/54000 = 0.0019%) - so you're basically feeding it the same batch several times in a row. Now suppose (3):
(3) (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[27000:81000] ...
A lot more reasonable; now our windows have a 50% overlap, rather than 99.998%.
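For concreteness, here is how the three slicing schemes above look in NumPy, with random placeholder data standing in for the EEG recordings (only the first two windows of each scheme are shown):
import numpy as np

recordings = np.random.randn(10, 240000, 1)
window = 54000

# (1) no overlap
scheme1 = [recordings[:, 0:window], recordings[:, window:2 * window]]
# (2) 99.998% overlap (shift by one timestep)
scheme2 = [recordings[:, 0:window], recordings[:, 1:window + 1]]
# (3) 50% overlap (shift by half a window)
scheme3 = [recordings[:, 0:window], recordings[:, window // 2:window // 2 + window]]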
Prediction: overlap bad?
If you are doing a one-step prediction, the information landscape is now changed:
Chances are, your sequence length is faaar from 240000, so overlaps of any kind don't suffer the "same batch several times" effect
Prediction fundamentally differs from classification in that the labels (next timestep) differ for every subsample you feed; classification uses one label for the entire sequence
This dramatically changes your loss function, and what is 'good practice' for minimizing it:
A predictor must be robust to its initial sample, especially for LSTM - so we train for every such "start" by sliding the sequence as you have shown
Since labels differ timestep-to-timestep, the loss function changes substantially timestep-to-timestep, so risks of overfitting are far less
What should I do?
First, make sure you understand this entire post, as nothing here's really "optional." Then, here's the key about overlap vs no-overlap, per batch:
One sample shifted: model learns to better predict one step ahead for each starting step - meaning: (1) LSTM's robust against initial cell state; (2) LSTM predicts well for any step ahead given X steps behind
Many samples, shifted in later batch: model less likely to 'memorize' train set and overfit
Your goal: balance the two; 1's main edge over 2 is:
2 can handicap the model by making it forget seen samples
1 allows model to extract better quality features by examining the sample over several starts and ends (labels), and averaging the gradient accordingly
Should I ever use (2) in prediction?
If your sequence lengths are very long and you can afford to "slide window" w/ ~50% its length, maybe, but depends on the nature of data: signals (EEG)? Yes. Stocks, weather? Doubt it.
Many-to-many prediction: more common to see (2), especially for longer sequences.
LSTM stateful: may actually be entirely useless for your problem.
Stateful is used when LSTM can't process the entire sequence at once, so it's "split up" - or when different gradients are desired from backpropagation. With former, the idea is - LSTM considers former sequence in its assessment of latter:
t0=seq[0:50]; t1=seq[50:100] makes sense; t0 logically leads to t1
seq[0:50] --> seq[1:51] makes no sense; t1 doesn't causally derive from t0
In other words: do not overlap in stateful in separate batches. Same batch is OK, as again, independence - no "state" between the samples.
When to use stateful: when LSTM benefits from considering previous batch in its assessment of the next. This can include one-step predictions, but only if you can't feed the entire seq at once:
Desired: 100 timesteps. Can do: 50. So we set up t0, t1 as in the first bullet above.
Problem: not straightforward to implement programmatically. You'll need to find a way to feed the LSTM while not applying gradients - e.g. freezing weights or setting lr = 0.
When and how does LSTM "pass states" in stateful?
When: only batch-to-batch; samples are entirely independent
How: in Keras, only batch-sample to batch-sample: stateful=True requires you to specify batch_shape instead of input_shape - because Keras builds batch_size separate states for the LSTM at compile time
Per above, you cannot do this:
# sampleNM = sample N at timestep(s) M
batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample21, sample41, sample11, sample31]
This implies 21 causally follows 10 - and will wreck training. Instead do:
batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample11, sample21, sample31, sample41]
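A minimal Keras sketch of that batch ordering (the data, layer sizes, and targets below are made up for illustration): sample i in the second batch is the continuation of sample i in the first batch, and states are reset once the full sequences have been consumed.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# 4 independent sequences of 100 steps, split into two consecutive
# 50-step windows each (t0 = seq[0:50], t1 = seq[50:100]).
n_seqs, window = 4, 50
data = np.random.rand(n_seqs, 2 * window, 1)
t0, t1 = data[:, :window], data[:, window:]
y0, y1 = np.random.rand(n_seqs, 1), np.random.rand(n_seqs, 1)  # dummy targets

model = Sequential([
    LSTM(8, batch_input_shape=(n_seqs, window, 1), stateful=True),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

for epoch in range(3):
    # Batch 1 (t0), then batch 2 (t1): the carried-over state is meaningful
    # because sample i of t1 is the continuation of sample i of t0.
    model.train_on_batch(t0, y0)
    model.train_on_batch(t1, y1)
    # Reset states once the full sequences have been consumed.
    model.reset_states()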
Batch vs. sample: additional info
A "batch" is a set of samples - 1 or greater (assume always latter for this answer)
. Three approaches to iterate over data: Batch Gradient Descent (entire dataset at once), Stochastic GD (one sample at a time), and Minibatch GD (in-between). (In practice, however, we call the last SGD also and only distinguish vs BGD - assume it so for this answer.) Differences:
SGD never actually optimizes the train set's loss function - only its 'approximations'; every batch is a subset of the entire dataset, and the gradients computed only pertain to minimizing loss of that batch. The greater the batch size, the better its loss function resembles that of the train set.
Above can extend to fitting batch vs. sample: a sample is an approximation of the batch - or, a poorer approximation of the dataset
First fitting 16 samples and then 16 more is not the same as fitting 32 at once - since weights are updated in-between, so model outputs for the latter half will change
The main reason for picking SGD over BGD is not, in fact, computational limitations - but that it's superior, most of the time. Explained simply: a lot easier to overfit with BGD, and SGD converges to better solutions on test data by exploring a more diverse loss space.

How can I predict the expected value and the variance simultaneously with a neural network?

I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network that has two outputs - the first output should approximate the deterministic part - the function, and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there was an example with Python for TensorFlow, but I'm also interested in general answers. I'm also not quite clear how I could write something like that in Python code - none of the examples I found so far show how to address individual outputs from the loss function.)
You can use dropout for that. With a dropout layer you can make several different predictions based on different settings of which nodes are dropped out. Then you can simply count the outcomes and interpret the result as a measure of uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." International Conference on Machine Learning, 2016.
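A minimal sketch of that idea (Monte Carlo dropout) with a made-up Keras model: keep dropout active at prediction time, run several stochastic forward passes, and use the spread of the predictions as the uncertainty estimate.
import numpy as np
from tensorflow.keras import layers, Model

# Hypothetical regression model with a dropout layer.
inputs = layers.Input(shape=(16,))
x = layers.Dense(64, activation="relu")(inputs)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1)(x)
model = Model(inputs, outputs)

x_new = np.random.randn(1, 16).astype("float32")
# training=True keeps dropout active, so every forward pass differs.
samples = np.stack([model(x_new, training=True).numpy() for _ in range(100)])
mean, variance = samples.mean(axis=0), samples.var(axis=0)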
Since I've found nothing simple to implement, I wrote something myself that models this explicitly: here is a custom loss function that tries to predict the mean and variance. It seems to work, but I'm not quite sure how well it works out in practice, and I'd appreciate feedback. This is my loss function:
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.python.ops import math_ops

def meanAndVariance(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Loss function that has the values in the last axis of y_pred
    approximate the mean and variance of each value in the last axis of y_true."""
    y_pred = tf.convert_to_tensor(y_pred)
    y_true = math_ops.cast(y_true, y_pred.dtype)
    mean = y_pred[..., 0::2]
    variance = y_pred[..., 1::2]
    res = K.square(mean - y_true) + K.square(variance - K.square(mean - y_true))
    return K.mean(res, axis=-1)
The output dimension is twice the label dimension - the mean and variance of each value in the label. The loss function consists of two parts: a mean squared error term that has the mean output approximate the label value, and a term that has the variance output approximate the squared deviation of the value from the predicted mean.
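As a rough illustration of how such a loss could be wired up (the input size, layer sizes, and label_dim below are made up; the only important point is that the final layer emits 2 * label_dim values, interleaved to match the 0::2 / 1::2 slicing above):
from tensorflow.keras import layers, Model

label_dim = 3  # assumed label dimension
inputs = layers.Input(shape=(16,))
hidden = layers.Dense(32, activation="relu")(inputs)
# 2 * label_dim outputs: interleaved mean and variance predictions.
outputs = layers.Dense(2 * label_dim)(hidden)
model = Model(inputs, outputs)

# meanAndVariance is the custom loss defined above.
model.compile(optimizer="adam", loss=meanAndVariance)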
When using dropout to estimate the uncertainty (or any other stochastic regularization method), make sure to also check out our recent work on providing a sampling-free approximation of Monte-Carlo dropout.
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow your idea: treat the activations as random variables and then propagate mean and variance using error propagation to the output layer. Consequently, we obtain two outputs - the mean and the variance.

tf.nn.sigmoid_cross_entropy_with_logits weights

I have a multi-label problem with ~1000 classes, yet only a handful are selected at a time. When using tf.nn.sigmoid_cross_entropy_with_logits this causes the loss to very quickly approach 0 because there are 990+ 0's being predicted.
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
Is it mathematically possible to just multiply the loss by a large constant (say 1000) just so that I can plot loss numbers in TensorBoard that I can actually distinguish between? I realize that I could simply multiply the values that I am plotting (without affecting the value that I pass to the train_op), but I am trying to gain a better understanding of whether multiplying the loss used by the train_op by a constant would have any real effect. For example, I could implement either of the following choices and am trying to think through the potential consequences:
loss = tf.reduce_mean(tf.multiply(tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits), 1000.0))
loss = tf.multiply(tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)), 1000.0)
Would you expect the training results to differ if a constant is introduced like this?
The larger your loss is, the bigger your gradient will be. Therefore, if you multiply your loss by 1000, your gradient step will be big and can lead to divergence. Look into gradient descent and backpropagation to understand this better.
Moreover, reduce_mean computes the mean of all the elements of your tensor. Multiplying before the mean or after is mathematically identical, so your two lines are doing the same thing.
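A tiny worked example of that identity, with made-up numbers (mean(1000 * x) = 1000 * mean(x)):
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0, 4.0])
a = tf.reduce_mean(tf.multiply(x, 1000.0))   # multiply, then average
b = tf.multiply(tf.reduce_mean(x), 1000.0)   # average, then multiply
# Both evaluate to 2500.0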
If you want bigger numbers just for plotting, create another tensor that is the loss multiplied by the constant. You'll use loss for training and multiplied_loss for plotting.
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
multiplied_loss = tf.multiply(loss, 1000.0)
train_op = optimizer.minimize(loss)
tf.summary.scalar('loss*1000', multiplied_loss)
This code is not enough of course, adapt it to your case.
