I am training a simple feed-forward model with 3 or 4 hidden layers and dropout between each (hidden layer + non-linearity) combination.
Sometimes after a few epochs (about 10-11) the model starts outputting Infs and NaNs as the NLL error, and the accuracy falls to 0.0%. This problem does not happen when I do not use dropout. Is this a known issue with dropout in Theano? The way I implement dropout is:
def drop(self, input):
    # sample a binary mask (1 = keep) with probability self.p and apply it element-wise
    mask = self.theano_rng.binomial(n=1, p=self.p, size=input.shape,
                                    dtype=theano.config.floatX)
    return input * mask
where input is the feature vector to which we want to apply dropout.
I have also observed that NaNs appear earlier the more aggressive the dropout is: with p = 0.5 (p being the keep probability in the code above) NaNs occur around epoch 1 or 2, whereas with p = 0.7 they occur around epoch 10 or 11.
Also, the NaNs occur only when the hidden layers are large. For example (800, 700, 700) gives NaNs whereas (500, 500, 500) does not.
In my experience, NaNs during network training usually arise from one of two problems:
First, a mathematical error, e.g. taking the log of a negative value. This can happen when you use log() in your loss function.
Second, a value that becomes too big for floating-point arithmetic to handle (overflow).
In your case, given your observations, I think it is the second case: your loss value may be growing too big to be represented. Try initializing with smaller weights when you expand your network, or use a different weight-initialization approach such as the ones described by Glorot & Bengio (2010) or He et al. (2015). Hope it helps.
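For concreteness, a minimal sketch of those two initialization schemes (the helper name and the use of a Theano shared variable are illustrative, not taken from your code):

import numpy as np
import theano

def init_weights(n_in, n_out, rng=np.random, scheme="glorot"):
    # "glorot": Uniform(-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out)))  (Glorot & Bengio, 2010)
    # "he":     Normal(0, sqrt(2/n_in)), intended for ReLU units       (He et al., 2015)
    if scheme == "glorot":
        bound = np.sqrt(6.0 / (n_in + n_out))
        W = rng.uniform(-bound, bound, size=(n_in, n_out))
    else:
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
    return theano.shared(W.astype(theano.config.floatX), name="W")

# e.g. the first weight matrix of the (800, 700, 700) configuration:
W1 = init_weights(800, 700)

The scale of these weights shrinks as the layers get wider, which is exactly the situation where you observe the NaNs.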
I'm using PyTorch as the basis for some machine translation tasks I want to do.
When instantiating my dataset, each torchtext.data.Example is a [list(untokenized_string_sentence), untokenized_string_sentence, untokenized_string_sentence], where untokenized_string_sentence is just a raw example sentence. I've made sure to handle the first input as a list in some custom forward functions. However, I don't know whether the batch loss computation treats it accordingly, or whether I need to be careful about this as well. I'm simply calling nn.CrossEntropyLoss() with label smoothing and padding. The result after a couple of batches is a completely NaN forward pass (and subsequently NaN softmax values and loss).
Does anyone know if there is an issue with mixing these kinds of inputs in PyTorch?
Here is how my fields are defined:
from torchtext import data

src_field = data.Field(init_token=None, eos_token=EOS_TOKEN,
                       pad_token=PAD_TOKEN, tokenize=tok_fun,
                       batch_first=True, lower=lowercase,
                       unk_token=UNK_TOKEN,
                       include_lengths=True)

trg_field = data.Field(init_token=BOS_TOKEN, eos_token=EOS_TOKEN,
                       pad_token=PAD_TOKEN, tokenize=tok_fun,
                       unk_token=UNK_TOKEN,
                       batch_first=True, lower=lowercase,
                       include_lengths=True)

prev_inputs_field = data.NestedField(
    data.Field(init_token=CONTEXT_TOKEN, eos_token=CONTEXT_EOS_TOKEN,
               pad_token=PAD_TOKEN, tokenize=tok_fun,
               batch_first=True, lower=lowercase,
               unk_token=UNK_TOKEN,
               include_lengths=False),
    tokenize=sent_fun)
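For reference, the loss call described above boils down to something like the following sketch, with dummy tensors standing in for the real model output and targets; the torch.isnan check is one way to tell whether the NaNs already exist before the loss is computed (label_smoothing requires PyTorch >= 1.10):

import torch
import torch.nn as nn

# dummy shapes standing in for the real model output and target batch
batch, trg_len, vocab, pad_idx = 2, 7, 100, 1
logits = torch.randn(batch, trg_len, vocab)              # would be the model's forward pass in practice
trg = torch.randint(0, vocab, (batch, trg_len))

if torch.isnan(logits).any():
    raise RuntimeError("NaNs appear in the forward pass itself, before the loss")

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx,    # skip padding positions
                                label_smoothing=0.1)
loss = criterion(logits.reshape(-1, vocab), trg.reshape(-1))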
What will happen when I use batch normalization but set batch_size = 1?
Because I am using 3D medical images as the training dataset, the batch size can only be set to 1 due to GPU memory limitations. Normally, as I understand it, when batch_size = 1 the variance will be 0, and (x - mean)/variance will lead to an error because of division by 0.
But why did no errors occur when I set batch_size = 1? Why was my network trained as well as I expected? Could anyone explain it?
Some people argued that:
The ZeroDivisionError may not be encountered because of two cases. First, the exception is caught in a try/except block. Second, a small rational number is added (1e-19) to the variance term so that it is never zero.
But some people disagree. They said that:
You should calculate mean and std across all pixels in the images of the batch. (So even batch_size = 1, there are still a lot of pixels in the batch. So the reason why batch_size=1 can still work is not because of 1e-19)
I have checked the Pytorch source code, and from the code I think the latter one is right.
Does anyone have a different opinion?
variance will be 0
No, it won't; BatchNormalization computes statistics only with respect to a single axis (usually the channels axis, =-1 (last) by default); every other axis is collapsed, i.e. summed over for averaging; details below.
More importantly, however, unless you can explicitly justify it, I advise against using BatchNormalization with batch_size=1; there are strong theoretical reasons against it, and multiple publications have shown BN performance to degrade for batch_size under 32, and severely for <=8. In a nutshell, batch statistics "averaged" over a single sample vary greatly sample-to-sample (high variance), and BN mechanisms don't work as intended.
Small mini-batch alternatives: Batch Renormalization -- Layer Normalization -- Weight Normalization
Implementation details: from source code:
reduction_axes = list(range(len(input_shape)))
del reduction_axes[self.axis]
Eventually, tf.nn.moments is called with axes=reduction_axes, which performs a reduce_sum to compute the variance. Then, in the TensorFlow backend, mean and variance are passed to tf.nn.batch_normalization to return train- or inference-normalized inputs.
In other words, if your input is (batch_size, height, width, depth, channels), or (1, height, width, depth, channels), then BN will run calculations over the 1, height, width, and depth dimensions.
Can variance ever be zero? - yes, if every single datapoint for any given channel slice (along every dimension) is the same. But this should be near-impossible for real data.
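A quick numpy check of the same reduction (shapes chosen arbitrarily) shows that with a single 3D sample the per-channel variance is pooled over all spatial positions and is nowhere near zero:

import numpy as np

# one sample in (batch, height, width, depth, channels) layout
x = np.random.randn(1, 32, 32, 32, 4).astype("float32")

# BN reduces over every axis except the channel axis (-1)
reduction_axes = (0, 1, 2, 3)
mean = x.mean(axis=reduction_axes)   # shape (4,): one mean per channel
var = x.var(axis=reduction_axes)     # shape (4,): roughly 1.0 here, far from zero

print(mean.shape, var)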
Other answers: the first one is misleading:
a small rational number is added (1e-19) to the variance
This doesn't happen when computing the variance; it is added to the variance when normalizing. Nonetheless, it is rarely necessary, as the variance is far from zero. Also, Keras actually defaults the epsilon term to 1e-3; it plays a regularizing role beyond merely avoiding division by zero.
Update: I failed to address an important piece of intuition behind suspecting the variance to be 0; indeed, the variance of the batch statistics is zero, since there is only one statistic - but the "statistic" itself is the mean & variance of the channel + spatial dimensions. In other words, the variance of the mean & variance (of the single train sample) is zero, but the mean & variance themselves aren't.
when batch_size = 1, variance will be 0
No, because when you compute the mean and variance for BN (for example using tf.nn.moments) you compute them over axes [0, 1, 2] (assuming you have NHWC tensor channel order).
From "Group Normalization" paper:
https://arxiv.org/pdf/1803.08494.pdf
With batch_size=1 batch normalization is equal to instance normalization and it can be helpful in some tasks.
But if you are using some sort of encoder-decoder and in some layer you have a tensor with a spatial size of 1x1, it will be a problem, because then each channel has only one value, the mean equals that value, and BN will zero out the information.
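A small numpy illustration of that degenerate case (shapes are arbitrary):

import numpy as np

# batch_size = 1 and a 1x1 spatial map: one value per channel (NHWC)
x = np.random.randn(1, 1, 1, 8).astype("float32")

mean = x.mean(axis=(0, 1, 2))              # equals x itself, channel-wise
var = x.var(axis=(0, 1, 2))                # exactly zero for every channel
eps = 1e-3

normalized = (x - mean) / np.sqrt(var + eps)
print(var)          # [0. 0. ... 0.]
print(normalized)   # all zeros: the layer has erased the activations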
Let's suppose I have a sequence of integers:
0,1,2, ..
and want to predict the next integer given the last 3 integers, e.g.:
[0,1,2]->3, [3,4,5]->6, etc
Suppose I set up my model like so:
from keras.models import Sequential
from keras.layers import LSTM, Dense

batch_size = 1
time_steps = 3

model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, time_steps, 1), stateful=True))
model.add(Dense(1))
It is my understanding that model has the following structure (please excuse the crude drawing):
First Question: is my understanding correct?
Note I have drawn the previous states C_{t-1}, h_{t-1} entering the picture as this is exposed when specifying stateful=True. In this simple "next integer prediction" problem, the performance should improve by providing this extra information (as long as the previous state results from the previous 3 integers).
This brings me to my main question: it seems the standard practice (see for example this blog post and the TimeseriesGenerator Keras preprocessing utility) is to feed a staggered set of inputs to the model during training.
For example:
batch0: [[0, 1, 2]]
batch1: [[1, 2, 3]]
batch2: [[2, 3, 4]]
etc
This has me confused because it seems this requires the output of the 1st LSTM cell (corresponding to the 1st time step). See this figure:
From the tensorflow docs:
stateful: Boolean (default False). If True, the last state for each
sample at index i in a batch will be used as initial state for the
sample of index i in the following batch.
it seems this "internal" state isn't available and all that is available is the final state. See this figure:
So, if my understanding is correct (which it's clearly not), shouldn't we be feeding non-overlapped windows of samples to the model when using stateful=True? E.g.:
batch0: [[0, 1, 2]]
batch1: [[3, 4, 5]]
batch2: [[6, 7, 8]]
etc
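For concreteness, a small illustrative helper (the function name is arbitrary) that generates either batching scheme from the integer sequence:

import numpy as np

def make_windows(seq, time_steps=3, stride=1):
    # stride=1 gives the overlapping ("staggered") windows, stride=time_steps the non-overlapping ones
    X, y = [], []
    for start in range(0, len(seq) - time_steps, stride):
        X.append(seq[start:start + time_steps])
        y.append(seq[start + time_steps])
    return np.array(X)[..., None], np.array(y)   # X shape: (num_windows, time_steps, 1)

seq = np.arange(12)
X_overlap, _ = make_windows(seq, stride=1)       # [0,1,2], [1,2,3], [2,3,4], ...
X_no_overlap, _ = make_windows(seq, stride=3)    # [0,1,2], [3,4,5], [6,7,8], ...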
The answer is: it depends on the problem at hand. For your case of one-step prediction - yes, you can, but you don't have to. But whether you do or not will significantly impact learning.
Batch vs. sample mechanism ("see AI" = see "additional info" section)
All models treat samples as independent examples; a batch of 32 samples is like feeding 1 sample at a time, 32 times (with differences - see AI). From model's perspective, data is split into the batch dimension, batch_shape[0], and the features dimensions, batch_shape[1:] - the two "don't talk." The only relation between the two is via the gradient (see AI).
Overlap vs no-overlap batch
Perhaps the best approach to understand it is information-based. I'll begin with timeseries binary classification, then tie it to prediction: suppose you have 10-minute EEG recordings, 240000 timesteps each. Task: seizure or non-seizure?
As 240k is too much for an RNN to handle, we use CNN for dimensionality reduction
We have the option to use "sliding windows" - i.e. feed a subsegment at a time; let's use 54k
Take 10 samples, shape (240000, 1). How to feed?
(1) (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[54000:108000] ...
(2) (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[1:54001] ...
Which of the two above do you take? If (2), your neural net will never confuse a seizure for a non-seizure for those 10 samples. But it'll also be clueless about any other sample. I.e., it will massively overfit, because the information it sees per iteration barely differs (1/54000 = 0.0019%) - so you're basically feeding it the same batch several times in a row. Now suppose (3):
(3) (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[24000:81000] ...
A lot more reasonable; now our windows have a 50% overlap, rather than 99.998%.
Prediction: overlap bad?
If you are doing a one-step prediction, the information landscape is now changed:
Chances are, your sequence length is faaar from 240000, so overlaps of any kind don't suffer the "same batch several times" effect
Prediction fundamentally differs from classification in that the labels (next timestep) differ for every subsample you feed; classification uses a single label for the entire sequence
This dramatically changes your loss function, and what is 'good practice' for minimizing it:
A predictor must be robust to its initial sample, especially for LSTM - so we train for every such "start" by sliding the sequence as you have shown
Since labels differ timestep-to-timestep, the loss function changes substantially timestep-to-timestep, so risks of overfitting are far less
What should I do?
First, make sure you understand this entire post, as nothing here's really "optional." Then, here's the key about overlap vs no-overlap, per batch:
One sample shifted: model learns to better predict one step ahead for each starting step - meaning: (1) LSTM's robust against initial cell state; (2) LSTM predicts well for any step ahead given X steps behind
Many samples, shifted in later batch: model less likely to 'memorize' train set and overfit
Your goal: balance the two; 1's main edge over 2 is:
2 can handicap the model by making it forget seen samples
1 allows model to extract better quality features by examining the sample over several starts and ends (labels), and averaging the gradient accordingly
Should I ever use (2) in prediction?
If your sequence lengths are very long and you can afford to "slide window" w/ ~50% its length, maybe, but depends on the nature of data: signals (EEG)? Yes. Stocks, weather? Doubt it.
Many-to-many prediction: more common to see (2), largely for longer sequences.
LSTM stateful: may actually be entirely useless for your problem.
Stateful is used when the LSTM can't process the entire sequence at once, so it's "split up" - or when different gradients are desired from backpropagation. With the former, the idea is that the LSTM considers the former sequence in its assessment of the latter:
t0=seq[0:50]; t1=seq[50:100] makes sense; t0 logically leads to t1
seq[0:50] --> seq[1:51] makes no sense; t1 doesn't causally derive from t0
In other words: do not overlap in stateful in separate batches. Same batch is OK, as again, independence - no "state" between the samples.
When to use stateful: when LSTM benefits from considering previous batch in its assessment of the next. This can include one-step predictions, but only if you can't feed the entire seq at once:
Desired: 100 timesteps. Can do: 50. So we set up t0, t1 as in above's first bullet.
Problem: not straightforward to implement programmatically. You'll need to find a way to feed to LSTM while not applying gradients - e.g. freezing weights or setting lr = 0.
When and how does LSTM "pass states" in stateful?
When: only batch-to-batch; samples are entirely independent
How: in Keras, only batch-sample to batch-sample: stateful=True requires you to specify batch_shape instead of input_shape - because, Keras builds batch_size separate states of the LSTM at compiling
Per above, you cannot do this:
# sampleNM = sample N at timestep(s) M
batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample21, sample41, sample11, sample31]
This implies 21 causally follows 10 - and will wreck training. Instead do:
batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample11, sample21, sample31, sample41]
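A minimal, self-contained sketch of that batch-to-batch pattern in Keras (toy random sequences; the array names are mine): each sequence stays in the same batch slot across consecutive calls, and reset_states() is called before moving on to unrelated sequences.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

batch_size, time_steps = 4, 50

# four toy sequences of 101 steps; t0/t1 are their consecutive halves
seq = np.cumsum(np.random.rand(batch_size, 101, 1), axis=1)
chunk_t0, chunk_t1 = seq[:, 0:50], seq[:, 50:100]
target_t0, target_t1 = seq[:, 50, :], seq[:, 100, :]     # next value after each chunk

model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, time_steps, 1), stateful=True))
model.add(Dense(1))
model.compile(optimizer="adam", loss="mae")

# slot i of chunk_t1 continues the state left by slot i of chunk_t0
model.train_on_batch(chunk_t0, target_t0)
model.train_on_batch(chunk_t1, target_t1)
model.reset_states()    # clear carried states before feeding unrelated sequences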
Batch vs. sample: additional info
A "batch" is a set of samples - 1 or greater (assume always latter for this answer)
Three approaches to iterating over data: Batch Gradient Descent (entire dataset at once), Stochastic GD (one sample at a time), and Minibatch GD (in-between). (In practice, however, we also call the last one SGD and only distinguish it from BGD - assume it so for this answer.) Differences:
SGD never actually optimizes the train set's loss function - only its 'approximations'; every batch is a subset of the entire dataset, and the gradients computed only pertain to minimizing loss of that batch. The greater the batch size, the better its loss function resembles that of the train set.
Above can extend to fitting batch vs. sample: a sample is an approximation of the batch - or, a poorer approximation of the dataset
First fitting 16 samples and then 16 more is not the same as fitting 32 at once - since weights are updated in-between, the model outputs for the latter half will change (see the sketch after this list)
The main reason for picking SGD over BGD is not, in fact, computational limitations - but that it's superior, most of the time. Explained simply: a lot easier to overfit with BGD, and SGD converges to better solutions on test data by exploring a more diverse loss space.
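To make the 16 + 16 vs. 32 point concrete, a small sketch (toy model and data, names are mine) comparing two identically initialized models:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

x, y = np.random.rand(32, 8), np.random.rand(32, 1)

def tiny_model():
    m = Sequential([Dense(4, activation="relu", input_shape=(8,)), Dense(1)])
    m.compile(optimizer="sgd", loss="mse")
    return m

a, b = tiny_model(), tiny_model()
b.set_weights(a.get_weights())      # identical starting points

a.train_on_batch(x, y)              # one 32-sample batch: a single gradient step

b.train_on_batch(x[:16], y[:16])    # two 16-sample batches: the second step
b.train_on_batch(x[16:], y[16:])    # uses weights already updated by the first

# the resulting weights differ, confirming 16 + 16 != 32-at-once
print(np.allclose(a.get_weights()[0], b.get_weights()[0]))   # typically False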
I have a dataset of the following form: A series of M observations of N-dimensional data. In order to obtain latent factors from this data, I wish to make a single hidden-layer autoencoder trained on this data. Every dimension of a single observation is either a 0 or a 1. But the keras Model returns floats. Is there a way to add a layer to enforce a 0 or 1 as output?
I tried using a simple keras Model to solve this problem. It claims good accuracy on the data, but when looking at the raw data it predicts the 0's correctly and often completely ignores the 1's.
import tensorflow as tf

n_nodes = 50

input_1 = tf.keras.layers.Input(shape=(x_train.shape[1],))
x = tf.keras.layers.Dense(n_nodes, activation='relu')(input_1)
output_1 = tf.keras.layers.Dense(x_train.shape[1], activation='sigmoid')(x)

model = tf.keras.models.Model(input_1, output_1)

my_optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.002)
model.compile(optimizer=my_optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10000)
predictions = model.predict(x_test)
I then validate these predictions by looking at all experiments and checking whether a large (>0.1) value is returned for the elements that are 1. The performance is very poor on the 1's.
I have seen that the loss converges after around 10000 epochs. However, the autoencoder fails to properly predict almost all the 1's in the data set. Even when setting the width of the hidden layer to be identical to the dimensionality of the data (n_nodes = x_train.shape[1]) the autoencoder still performs badly, and it gets even worse if I increase the width of the hidden layer.
[0, 1] outputs should generally be rounded such that >=0.5 rounds to 1 when outputting a final prediction and <0.5 rounds to 0. However your labels should be float values {0.0, 1.0} for the loss function (which I expect they are already). You can compute accuracy by rounding the outputs and comparing to your binary labels to count errors for {0, 1}, but they must be in continuous form [0.0, 1.0] for the loss and gradient calculations to work.
If you are doing all of that (and it does appear that things are set up correctly in your code), there might be a number of reasons for poor performance:
Your dense, "constriction" layer should be significantly smaller than your input. In making it smaller you are forcing the auto-encoder to learn a representative form of the input that can be used to produce the output. This representative form is likely to generalize well. If you increase the size of your hidden layer the network will have much more capacity to memorize the inputs.
You might have many more 0 values than 1 values. If this is the case, then in the absence of actual learning the network can get stuck just predicting 0 as a "best guess" because that's "usually right". This is a harder problem to tackle. You might consider multiplying the loss by a vector of labels * eta + 1, which effectively increases the learning rate for the 1 labels. Example: your labels are [0, 1, 0], eta is a hyper-parameter value > 1, say eta = 2.0; then labels * eta + 1 = [1.0, 3.0, 1.0], which scales up the gradient signal for 1 values by increasing the loss for the 1's only. This isn't a bulletproof method of increasing the importance of the 1's class, but it's something simple to try; if it makes any improvement, follow up on this line of reasoning in more detail. (A minimal sketch of this weighting appears after this list.)
You have 1 hidden layer, which means you're limited to linear relationships; you might try 3 hidden layers to add a little non-linearity. Your center layer should be fairly small, try something like 5 or 10 neurons; it should need to squeeze the data into a fairly tight constriction point to extract a general-purpose representation.
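A minimal sketch of the labels * eta + 1 weighting, assuming a binary cross-entropy reconstruction loss (rather than the categorical_crossentropy in the question), which matches the 0/1-per-dimension targets:

import tensorflow as tf

def weighted_bce(eta=2.0):
    # positions whose true label is 1 contribute (eta + 1) times as much to the loss
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)   # avoid log(0)
        bce = -(y_true * tf.math.log(y_pred) +
                (1.0 - y_true) * tf.math.log(1.0 - y_pred))   # element-wise cross-entropy
        weights = y_true * eta + 1.0                          # e.g. [0, 1, 0] -> [1, 3, 1] for eta = 2
        return tf.reduce_mean(weights * bce)
    return loss

# e.g. model.compile(optimizer=my_optimizer, loss=weighted_bce(eta=2.0), metrics=['accuracy'])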
I am implementing a custom loss function in Keras. The output of the model is a 10-dimensional softmax layer. To calculate the loss, I first need to find the index of the output neuron firing 1 and then subtract it from the index of the true value. I'm doing the following:
from keras import backend as K

def diff_loss(y_true, y_pred):
    # find the indices of the neuron firing 1
    true_ind = K.tf.argmax(y_true, axis=0)
    pred_ind = K.tf.argmax(y_pred, axis=0)

    # cast to float32
    x = K.tf.cast(true_ind, K.tf.float32)
    y = K.tf.cast(pred_ind, K.tf.float32)

    return K.abs(x - y)
but it gives the error:
raise ValueError("None values not supported.")
ValueError: None values not supported.
What's the problem here?
This happens because your function is not differentiable - it's made of constants (argmax has no gradient).
There is simply no solution for this if you want argmax as result.
An approach to test
Since you're using "softmax", that means that only one class is correct (you don't have two classes at the same time).
And since you want index differences, maybe you could work with a single continuous result (continuous values are differentiable).
Work with only one output ranging from -0.5 to 9.5, and take the classes by rounding the result.
That way, you can have the last layer with only one unit:
lastLayer = Dense(1, activation='sigmoid', ....)  # or another kind of layer if it's not Dense
And change the range with a lambda layer:
lambdaLayer = Lambda(lambda x: 10*x - 0.5)
Now your loss can be a simple 'mae' (mean absolute error).
The downside of this attempt is that the 'sigmoid' activation is not evenly distributed between the classes. Some classes will be more probable than others. But since it's important to have a limit, it seems at first the best idea.
This will only work if your classes follow a logical increasing sequence. (I guess they do, otherwise you wouldn't be trying that kind of loss, right?)
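Putting those pieces together, a minimal sketch (the input size and hidden layer are placeholders for your actual network body):

from keras.layers import Input, Dense, Lambda
from keras.models import Model

inputs = Input(shape=(128,))                        # placeholder input size
features = Dense(64, activation='relu')(inputs)     # stands in for your existing layers

last = Dense(1, activation='sigmoid')(features)     # one continuous output in (0, 1)
scaled = Lambda(lambda x: 10 * x - 0.5)(last)       # rescale to (-0.5, 9.5)

model = Model(inputs, scaled)
model.compile(optimizer='adam', loss='mae')          # differentiable proxy for |index difference|

# at prediction time, recover the class by rounding model.predict(...) to the
# nearest integer and clipping to [0, 9]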