I am using the official Batch Normalization (BN) function (tf.contrib.layers.batch_norm()) of Tensorflow on the MNIST data. I use the following code for adding BN:
local4_bn = tf.contrib.layers.batch_norm(local4, is_training=True)
During testing, I set is_training=False in the line above and observe only 20% accuracy. However, it gives ~99% accuracy if I keep is_training=True at test time as well, with a batch size of 100 images. This observation indicates that the exponential moving average and variance computed by batch_norm() are probably incorrect, or that I am missing something in my code.
Can anyone please suggest a solution to the above problem?
You get ~99% accuracy when you test your model with is_training=True only because of the batch size of 100.
If you change the batch size to 1 your accuracy will decrease.
This is due to the fact that you're computing the exponential moving average and variance for the input batch and then you're (batch-)normalizing the layer's output using these values.
The batch_norm function has a parameter, variables_collections, that lets you store the computed moving average and variance during the training phase and reuse them during the test phase.
If you define a collection for these variables, then the batch_norm layer will use them during the testing phase, instead of calculating new values.
Therefore, if you change your batch normalization layer definition to
local4_bn = tf.contrib.layers.batch_norm(local4, is_training=True, variables_collections=["batch_norm_non_trainable_variables_collection"])
The layer will store the computed variables into the "batch_norm_non_trainable_variables_collection" collection.
In the test phase, when you pass the is_training=False parameter, the layer will reuse the computed values that it finds in the collection.
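For instance, a minimal sketch of what the two graph definitions could look like (the placeholder shape and the scope name here are assumptions for illustration, not part of the original code):
import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 384])  # stand-in for local4

# Training graph: statistics come from the current batch; the moving
# mean/variance variables are added to the named collection.
train_bn = tf.contrib.layers.batch_norm(
    inputs,
    is_training=True,
    scope="local4_bn",
    variables_collections=["batch_norm_non_trainable_variables_collection"])

# Test graph: with is_training=False (and the same scope reused) the layer
# uses the stored moving mean/variance instead of the batch statistics.
test_bn = tf.contrib.layers.batch_norm(
    inputs,
    is_training=False,
    reuse=True,
    scope="local4_bn",
    variables_collections=["batch_norm_non_trainable_variables_collection"])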
Note that the moving average and the variance are not trainable parameters; therefore, if you save only your model's trainable parameters in the checkpoint files, you have to manually add the non-trainable variables stored in the previously defined collection.
You can do it when you create the Saver object:
saver = tf.train.Saver(tf.trainable_variables() + tf.get_collection_ref("batch_norm_non_trainable_variables_collection") + otherlistofvariables)
In addition, since batch normalization can limit the expressive power of the layer it is applied to (because it restricts the range of the values), you should let the network learn the parameters gamma and beta (the affine transformation coefficients described in the paper), which allows it to learn an affine transformation that increases the representational power of the layer.
You can enable the learning of these parameters by setting the center and scale parameters of the batch_norm function to True, like this:
local4_bn = tf.contrib.layers.batch_norm(
    local4,
    is_training=True,
    center=True,  # beta
    scale=True,   # gamma
    variables_collections=["batch_norm_non_trainable_variables_collection"])
I encountered the same problem when processing MNIST. My training accuracy is normal, while the test accuracy is very low at the beginning and then grows gradually.
I changed the default momentum=0.99 to momentum=0.9, and then it worked fine.
My source code is here:
mnist_bn_fixed.py
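A minimal sketch of what that change could look like (assuming tf.layers.batch_normalization, whose default momentum is 0.99; the layer sizes and names are illustrative only):
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
training = tf.placeholder(tf.bool)

h = tf.layers.dense(x, 256)
# Lower momentum so the moving mean/variance track the data faster.
h = tf.layers.batch_normalization(h, momentum=0.9, training=training)
h = tf.nn.relu(h)

# The moving statistics are updated through ops in tf.GraphKeys.UPDATE_OPS,
# so the train op should be grouped with them, e.g.
#   with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
#       train_op = optimizer.minimize(loss)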
I am using tensorflow.keras to train a CNN in an image recognition problem, using the Adam minimiser to minimise a custom loss (some code is at the bottom of the question). I am experimenting with how much data I need to use in my training set, and thought I should look into whether each of my models have properly converged. However, when plotting loss vs number of epochs of training for different training set fractions, I noticed approximately periodic spikes in the loss function, as in the plot below. Here, the different lines show different training set sizes as a fraction of my total dataset.
As I decrease the size of the training set (blue -> orange -> green), the frequency of these spikes appears to decrease, though the amplitude appears to increase. Intuitively, I would associate this kind of behaviour with a minimiser jumping out of a local minimum, but I am not experienced enough with TensorFlow/CNNs to know if that is the correct way to interpret this behaviour. Equally, I can't quite understand the variation with training set size.
Can anyone help me to understand this behaviour? And should I be concerned by these features?
from quasarnet.models import QuasarNET, custom_loss
from tensorflow.keras.optimizers import Adam
...
model = QuasarNET(
    X[0, :, None].shape,
    nlines=len(args.lines) + len(args.lines_bal)
)

loss = []
for i in args.lines:
    loss.append(custom_loss)
for i in args.lines_bal:
    loss.append(custom_loss)

adam = Adam(decay=0.)
model.compile(optimizer=adam, loss=loss, metrics=[])

box, sample_weight = io.objective(z, Y, bal, lines=args.lines,
                                  lines_bal=args.lines_bal)

print("starting fit")
history = model.fit(X[:, :, None], box,
                    epochs=args.epochs,
                    batch_size=256,
                    sample_weight=sample_weight)
Following some discussion from a colleague, I believe that we have solved this problem. As a default, the Adam minimiser uses an adaptive learning rate that is inversely proportional to the variance of the gradient in its recent history. When the loss starts to flatten out, the variance of the gradient decreases, and so the minimiser increases the learning rate. This can happen quite drastically, causing the minimiser to "jump" to a higher loss point in parameter space.
You can avoid this by setting amsgrad=True when initialising the minimiser (http://www.satyenkale.com/papers/amsgrad.pdf). This prevents the learning rate from increasing in this way, and thus results in better convergence. The (somewhat basic) plot below shows loss vs number of training epochs for the normal setup, as in the original question (norm loss) compared to the loss when setting amsgrad=True in the minimiser (amsgrad loss).
Clearly, the loss function is much better behaved with amsgrad=True, and, with more epochs of training, should result in a stable convergence.
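For reference, a minimal sketch of that change against the compile step from the question (only the optimizer line differs):
from tensorflow.keras.optimizers import Adam

# amsgrad=True keeps a running maximum of the second-moment estimate, so the
# effective learning rate cannot grow back when recent gradients shrink.
adam = Adam(decay=0., amsgrad=True)
model.compile(optimizer=adam, loss=loss, metrics=[])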
In model.evaluate of TensorFlow, there is a variable called "average_loss". Is it the same as the MSE between labels and predictions? However, in tf.losses there is also a function mean_squared_error. Which one is the correct MSE?
The loss is computed depending on the model (see here).
Very likely your model uses mean squared error for the loss function, thus the mean_squared_error is also the loss variable.
Instead, the average_loss would be the loss averaged over the last N iterations, because the training is sometimes performed with a small batch size and the loss might be noisy.
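If in doubt, one quick way to see exactly what is being reported is to print everything the evaluation returns (the sketch below assumes a tf.estimator regressor and a hypothetical eval_input_fn; a Keras model exposes the analogous names via model.metrics_names):
# Assumption: `estimator` is a tf.estimator.DNNRegressor (or similar) and
# `eval_input_fn` feeds the held-out data.
metrics = estimator.evaluate(input_fn=eval_input_fn)
print(metrics)  # typically contains 'average_loss', 'loss' and 'global_step'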
I have a neural network with three layers. I've tried using tanh and sigmoid functions for my activations and then the output layer is just a simple linear function (I'm trying to model a regression problem).
For some reason my model seems to have a hard cut off where it will never predict a value above some threshold (even though it should). What reason could there be for this?
Here is what predictions from the model look like (with sigmoid activations):
Update:
With relu activations, switching from gradient descent to Adam, and adding L2 regularization... the model predicts the same value for every input...
A linear layer regressing a single value will have outputs of the form
output = bias + sum(kernel * inputs)
If inputs comes from a tanh, then -1 <= inputs <= 1, and hence
bias - sum(abs(kernel)) <= output <= bias + sum(abs(kernel))
If you want an unbounded output, consider using an unbounded activation on all intermediate layers, e.g. relu.
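As an illustration of that suggestion, a minimal sketch of a regression network with unbounded hidden activations and a linear output (the layer sizes and input shape are made up):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1)  # linear output layer, so predictions are unbounded
])
model.compile(optimizer="adam", loss="mse")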
I think your problem concerns the generalization/expressiveness of the model. Regression is a basic task, so there should be no problem with the method itself, only with the execution. @DomJack explained how the output is restricted for a specific set of parameters, but that only happens in anomalous cases; in general, training should tune the parameters so that the network predicts the output correctly.
So the first point is about the quality of the training data. Make sure you have enough training data (and that it is split randomly if you split train/test from one dataset). Also, perhaps trivially, make sure you didn't mix up input/output values in preprocessing.
Another point is about the size of the network. Make sure you use a large enough hidden layer.
In the Tensorflow package tf.contrib.quantize, there is a module that folds batch norm layers. It has a parameter called freeze_batch_norm_delay, which is supposed to freeze the moving mean and variance of the folded batch norm layers.
I am running some network (MobileNet+SSD) and inserted tf.contrib.quantize support. After 30k steps, the batch norms become frozen (freeze_bn_delay=30000). This is what happens to the loss:
Plot of the loss function, with the batch norm freeze occurring at 30k steps
The loss makes a sudden jump when the batch norm layers are frozen. I would expect the behaviour before and after the freeze to be identical, except that the mean and variance are no longer updated ("frozen").
Could somebody explain to me what those corrections are?
Here is what the source code states, but it didn't help:
Computes batch norm correction params.
Before batch normalization is frozen:
We use batch statistics for batch norm.
correction_scale = sigma_b/sigma_mv
correction_recip = 1/correction_scale
correction_offset = 0
After batch normalization is frozen:
correction_scale = sigma_b/sigma_mv
correction_recip = 1
correction_offset = gamma*(mu_b/sigma_b-mu_mv/sigma_mv).
Batch norm is frozen if global_step > bn_freeze_delay.
The corrections ensure that:
a) The weights are quantized after scaling by gamma/sigma_mv. This enables
smoother training as the scaling on the weights changes slowly, rather than
jump across mini-batches
b) Changing the values of the corrections allows for one to switch between
using batch statistics to using moving mean and average, without requiring
changes to batch_norm
Here is the function definition:
_ComputeBatchNormCorrections
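To be concrete, this is how I read those formulas (a small numeric sketch with made-up values; sigma_b/mu_b are the batch statistics, sigma_mv/mu_mv the moving statistics):
gamma, mu_b, sigma_b, mu_mv, sigma_mv = 1.0, 0.2, 1.5, 0.1, 1.2

# Before batch norm is frozen:
correction_scale = sigma_b / sigma_mv        # 1.25
correction_recip = 1.0 / correction_scale    # 0.8
correction_offset = 0.0

# After batch norm is frozen (correction_scale stays the same):
correction_recip_frozen = 1.0
correction_offset_frozen = gamma * (mu_b / sigma_b - mu_mv / sigma_mv)  # 0.05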
I do not understand why the text above claims that we switch between batch statistics and the moving mean and variance when a freeze is supposed to occur. Is that what a "freeze" means here?
I have searched for an answer on the web; however, as this package is apparently in active development, I have not found any explanation.
I am using the TensorFlow DNNRegressor Estimator model for making a neural network. But calling the estimator.train() function gives output as follows:
That is, my loss function varies a lot with every step. But as far as I know, my loss function should decrease with the number of iterations. Also, see the attached screenshot of the TensorBoard visualisation of the loss function:
The doubts I'm not able to figure out are:
Whether it is overall loss function value (combined loss for every step processed till now) or just that step's loss value?
If it is that step's loss value, then how do I get the value of the overall loss function and see its trend, which I feel should decrease with an increasing number of iterations? In my understanding, that is the value we should look at while training on a dataset.
If this is overall loss value, then why is it fluctuating so much? Am I missing something?
First of all, let me point out that tf.contrib.learn.DNNRegressor uses a linear regression head with mean_squared_loss, i.e. simple L2 loss.
Whether it is overall loss function value (combined loss for every step processed till now) or just that step's loss value?
Each point on the chart is the value of the loss function on the corresponding step, i.e. the loss computed on the batch processed at that step.
If it is that step's loss value, then how do I get the value of the overall loss function and see its trend, which I feel should decrease with an increasing number of iterations?
There's no overall loss function; you probably mean a chart of how the loss changed after each step. That's exactly what TensorBoard is showing you. You are right that its trend is not downwards, as it should be. This indicates that your neural network is not learning.
If this is overall loss value, then why is it fluctuating so much? Am I missing something?
A common reason for the neural network not learning is poor choice of hyperparameters (though there are many more mistakes you can possibly make). For example:
the learning rate is too large
it's also possible that the learning rate is too small, which means the neural network is learning, but very, very slowly, so you can't see it
the weight initialization scale is probably too large; try to decrease it
the batch size may be too large as well
you're passing the wrong labels for the inputs
the training data contains missing values, or is not normalized
...
What I usually do to check whether the neural network is at least somehow working is to reduce the training set to a few examples and try to overfit the network. This experiment is very fast, so I can try various learning rates, initialization variances and other parameters to find a sweet spot. Once I have a steadily decreasing loss chart, I go on with a bigger set.
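A minimal sketch of that sanity check for a DNNRegressor (the feature shape, hidden units and data here are made up purely for illustration):
import numpy as np
import tensorflow as tf

# A tiny, fixed dataset that the network should be able to memorize.
x_small = np.random.rand(16, 4).astype(np.float32)
y_small = np.random.rand(16).astype(np.float32)

feature_columns = [tf.feature_column.numeric_column("x", shape=[4])]
estimator = tf.estimator.DNNRegressor(feature_columns=feature_columns,
                                      hidden_units=[32, 32])

input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": x_small}, y=y_small, batch_size=16, num_epochs=None, shuffle=True)

# With sane hyperparameters the reported loss should drop steadily here.
estimator.train(input_fn=input_fn, steps=1000)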
Though the previous answer is very informative and good, it doesn't quite address your issue. When you instantiate DNNRegressor, add:
loss_reduction=tf.losses.Reduction.MEAN
in the constructor, and you'll see your average loss converge.
estimator = tf.estimator.DNNRegressor(
    feature_columns=feat_clmns,
    hidden_units=[32, 64, 32],
    weight_column=weight_clmn,
    loss_reduction=tf.losses.Reduction.MEAN)