Ordering of batch normalization and dropout? - python

The original question was in regard to TensorFlow implementations specifically. However, the answers are for implementations in general. This general answer is also the correct answer for TensorFlow.
When using batch normalization and dropout in TensorFlow (specifically using the contrib.layers) do I need to be worried about the ordering?
It seems possible that if I use dropout followed immediately by batch normalization there might be trouble. For example, if the shift in the batch normalization trains to the larger scale numbers of the training outputs, but then that same shift is applied to the smaller (due to the compensation for having more outputs) scale numbers without dropout during testing, then that shift may be off. Does the TensorFlow batch normalization layer automatically compensate for this? Or does this not happen for some reason I'm missing?
Also, are there other pitfalls to look out for when using these two together? For example, assuming I'm using them in the correct order with regard to the above (assuming there is a correct order), could there be trouble with using both batch normalization and dropout on multiple successive layers? I don't immediately see a problem with that, but I might be missing something.
Thank you much!
UPDATE:
An experimental test seems to suggest that ordering does matter. I ran the same network twice with only the batch norm and dropout reversed. When the dropout is before the batch norm, validation loss seems to go up as training loss goes down. Both go down in the other case. But in my case the movements are slow, so things may change after more training, and it's just a single test. A more definitive and informed answer would still be appreciated.

In Ioffe and Szegedy (2015), the authors state that "we would like to ensure that for any parameter values, the network always produces activations with the desired distribution". So the batch normalization layer is inserted right after a conv layer or fully connected layer, but before feeding into ReLU (or any other kind of) activation. See this video at around the 53-minute mark for more details.
As far as dropout goes, I believe dropout is applied after the activation layer. In figure 3b of the dropout paper, the dropout mask r(l) for hidden layer l is applied to y(l), where y(l) is the result after applying the activation function f.
So in summary, the order of using batch normalization and dropout is:
-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC ->
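The ordering above can be sketched in plain NumPy. This is a minimal illustration of the training-time forward pass only, not TensorFlow's implementation; the helper names (`dense`, `batch_norm_train`, `dropout_train`, `block_forward`), the layer sizes and the 0.2 drop rate are assumptions chosen for the example.

```python
import numpy as np

def dense(x, w, b):
    # Fully connected layer: one row per example.
    return x @ w + b

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # Normalize each feature with the statistics of the current mini-batch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def relu(x):
    return np.maximum(x, 0.0)

def dropout_train(x, rate, rng):
    # Inverted dropout: zero a fraction `rate` of units, rescale the survivors.
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

def block_forward(x, w, b, gamma, beta, rate, rng):
    # FC -> BatchNorm -> ReLU -> Dropout
    z = dense(x, w, b)
    z = batch_norm_train(z, gamma, beta)
    a = relu(z)
    return dropout_train(a, rate, rng)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 20))          # 64 examples, 20 input features
w = rng.normal(size=(20, 8)) * 0.1     # 8 output units
b = np.zeros(8)
gamma, beta = np.ones(8), np.zeros(8)

out = block_forward(x, w, b, gamma, beta, rate=0.2, rng=rng)
print(out.shape)
```

At inference time the dropout step would become the identity and batch normalization would use running statistics instead of the batch statistics; neither is shown here.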

As noted in the comments, an amazing resource to read up on the order of layers is here. I have gone through the comments, and it is the best resource on the topic I have found on the internet.
My 2 cents:
Dropout is meant to block information from certain neurons completely, to make sure the neurons do not co-adapt.
So batch normalization has to come after dropout; otherwise you are passing information through the normalization statistics.
If you think about it, this is the same reason that, in typical ML problems, we don't compute the mean and standard deviation over the entire data and then split it into train, test and validation sets. We split first, compute the statistics over the train set, and use them to normalize and center the validation and test sets.
So I suggest Scheme 1 (this takes pseudomarvin's comment on the accepted answer into consideration):
-> CONV/FC -> ReLu(or other activation) -> Dropout -> BatchNorm -> CONV/FC
as opposed to Scheme 2 (from the accepted answer):
-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC ->
Please note that this means a network under Scheme 2 should overfit more than a network under Scheme 1. However, the OP ran some tests, as mentioned in the question, and they support Scheme 2.

Usually, just drop the Dropout (when you have BN):
"BN eliminates the need for Dropout in some cases cause BN provides similar regularization benefits as Dropout intuitively"
"Architectures like ResNet, DenseNet, etc. not using Dropout"
For more details, refer to the paper "Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift", as already mentioned by @Haramoz in the comments.

Conv - Activation - DropOut - BatchNorm - Pool --> Test_loss: 0.04261355847120285
Conv - Activation - DropOut - Pool - BatchNorm --> Test_loss: 0.050065308809280396
Conv - Activation - BatchNorm - Pool - DropOut --> Test_loss: 0.04911309853196144
Conv - Activation - BatchNorm - DropOut - Pool --> Test_loss: 0.06809622049331665
Conv - BatchNorm - Activation - DropOut - Pool --> Test_loss: 0.038886815309524536
Conv - BatchNorm - Activation - Pool - DropOut --> Test_loss: 0.04126095026731491
Conv - BatchNorm - DropOut - Activation - Pool --> Test_loss: 0.05142546817660332
Conv - DropOut - Activation - BatchNorm - Pool --> Test_loss: 0.04827788099646568
Conv - DropOut - Activation - Pool - BatchNorm --> Test_loss: 0.04722036048769951
Conv - DropOut - BatchNorm - Activation - Pool --> Test_loss: 0.03238215297460556
Trained on the MNIST dataset (20 epochs) with 2 convolutional modules (see below), followed each time with
model.add(Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))
The convolutional layers have a kernel size of (3,3) and default padding; the activation is elu. The pooling is a MaxPooling with a pool size of (2,2). Loss is categorical_crossentropy and the optimizer is adam.
The Dropout probability is 0.2 for the first module and 0.3 for the second. The number of feature maps is 32 and 64, respectively.
Edit:
When I dropped the Dropout, as recommended in some answers, it converged faster but had a worse generalization ability than when I use BatchNorm and Dropout.

I found a paper that explains the disharmony between Dropout and Batch Norm(BN). The key idea is what they call the "variance shift". This is due to the fact that dropout has a different behavior between training and testing phases, which shifts the input statistics that BN learns.
The main idea can be found in this figure which is taken from this paper.
A small demo for this effect can be found in this notebook.
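The variance shift can be reproduced in a few lines of NumPy. This is a sketch under stated assumptions, not the paper's experiment: the activations are stand-in standard normal samples, dropout is the "inverted" variant that rescales survivors by 1/keep_prob, and `keep_prob = 0.5` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5                       # assumed value, for illustration only
x = rng.normal(size=1_000_000)        # stand-in for zero-mean, unit-variance activations

# Training: inverted dropout zeroes units and rescales the survivors by 1/keep_prob.
mask = rng.random(x.shape) < keep_prob
x_train = x * mask / keep_prob

# Inference: dropout is the identity.
x_test = x

var_train = x_train.var()
var_test = x_test.var()
print(f"variance seen by BN during training: {var_train:.3f}")
print(f"variance seen by BN at inference:   {var_test:.3f}")
```

For zero-mean inputs the training-time variance is about 1/keep_prob times the inference-time variance, so a batch normalization layer placed directly after dropout accumulates running statistics that do not match what it sees at test time.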

I read the recommended papers in the answer and comments from
https://stackoverflow.com/a/40295999/8625228
From the point of view of Ioffe and Szegedy (2015), only BN should be
used in the network structure. Li et al. (2018) give statistical and
experimental analyses showing that there is a variance shift when
practitioners use Dropout before BN. Thus, Li et al. (2018) recommend
applying Dropout after all BN layers.
From the point of view of Ioffe and Szegedy (2015), BN is located
inside/before the activation function. However, Chen et al. (2019)
use an IC layer, which combines Dropout and BN, and recommend
using BN after ReLU.
To be on the safe side, I use only one of Dropout or BN in a network.
Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao,
and Shengyu Zhang. 2019. “Rethinking the Usage of Batch Normalization
and Dropout in the Training of Deep Neural Networks.” CoRR
abs/1905.05928. http://arxiv.org/abs/1905.05928.
Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization:
Accelerating Deep Network Training by Reducing Internal Covariate
Shift.” CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.
Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. “Understanding
the Disharmony Between Dropout and Batch Normalization by Variance
Shift.” CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.

Based on the research paper, for better performance we should use BN before applying Dropout.

Conv/FC - BN - Sigmoid/tanh - Dropout.
If the activation function is ReLU or something else, the order of normalization and dropout depends on your task.

After reading multiple answers and conducting some tests, these are my hypotheses:
a) Always BN -> AC (nothing between them).
b) Prefer BN -> Dropout over Dropout -> BN, but try both. [Newer research finds the first better.]
c) BN eliminates the need for Dropout; no need to use Dropout.
d) Pool at the end.
e) BN before Dropout is data leakage.
f) The best thing is to try every combination.
So-called best method:
Layer -> BN -> AC -> Dropout -> Pool ->Layer

Related

How can I increase the VGG16 model's accuracy? (Underfitting or overfitting issues)

I have a problem increasing the accuracy of my VGG16 model.
Even though I added some Dense layers, I couldn't improve it. Could you help me get the best result? I tried to use Dropout, but it didn't increase the accuracy. The model is shown below in case you don't want to open the project file.
I think the model may be overfitting or underfitting.
Here is my model shown below.
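from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Sequential
# IMAGE_SIZE and num_classes are defined elsewhere in the project.
base_model = VGG16(
    include_top=False,
    weights="imagenet",
    input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3))
# freeze the base model so only the new Dense layers are trained
base_model.trainable = False
model = Sequential()
model.add(base_model)
model.add(Flatten())
model.add(Dense(512, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(256, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))
model.summary()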
Here is my project link : Project
There are a number of different things you can do, and that depends on your problem scope. What you are showing is the basic transfer learning model with a couple of dense layers.
Regularisation is one thing that you have already tried by using Dropout, but you have turned it off. Other simple regularisation tools are L2 and L1 regularisation. You can also use a lower learning rate, reduce the batch size, use batch normalisation, change the optimiser, or do all of the above at the same time.
Creating a neural network model is the easy part. The more important and harder skill to master is optimising it to perform well on general data by tweaking each parameter until you get better results.
Try looking at these three guides (or other ones that you can find about hyperparameter optimisation) to understand more:
Optimising Neural Networks Where To Start?
A Comprehensive Guide on Neural Networks Performance Optimization
An Overview of Regularization Techniques in Deep Learning 
Dropout is a regularization technique that introduces noise into the network in order to avoid overspecialization of the network to the training inputs (overfitting). A Dropout layer introduces this noise by setting a random subset of its input activations to 0 at each training step. In Keras, the value passed to the Dropout layer is the fraction of activations to drop; each activation is kept with probability 1 minus that value.
In your case, a dropout value of 0.2 means that ~20% of the activations feeding each following layer are set to 0 during training. Typically you want to keep this value moderate: at 0 no regularization is performed, and values close to 1 remove most of the signal, which largely reduces the training effectiveness of any neural model.
Try reintroducing Dropout with different values (like 0.05, 0.1, 0.2, 0.3) and compare them with no regularization (dropout=0, or just comment out the Dropout layer lines) and with an excess of regularization (dropout=0.8).
This fix on dropout may come in handy for boosting your model performances.
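A minimal NumPy sketch of training-time dropout, assuming the Keras convention in which the argument to `Dropout` is the fraction of units to drop. The function name `dropout`, the array size and the three rates are illustrative choices, not part of the original model.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout as applied during training: drop a fraction `rate` of units."""
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    # Survivors are scaled by 1/keep_prob so the expected activation is unchanged.
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 512))

for rate in (0.05, 0.2, 0.8):
    dropped = dropout(activations, rate, rng)
    zero_fraction = float((dropped == 0).mean())
    print(f"rate={rate}: fraction of activations zeroed ~ {zero_fraction:.3f}")
```

With this convention a rate of 0.2 zeroes roughly one activation in five, while a rate of 0.8 zeroes roughly four in five.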

num_units in GRU and LSTM layers in Keras TensorFlow 2 - confusing meaning

I know that this question has been raised many times, but I could not get a clear answer because there are different answers:
In the tf.keras.layers.LSTM and tf.keras.layers.GRU layers there is a parameter called num_units. I have seen a lot of questions on the internet about this parameter, and there is no clear answer for what it means, except for the obvious meaning, which is the shape of the output. Some say it means that each layer contains num_units LSTM or GRU units; some say that it is only one LSTM or GRU unit, but with num_units hidden units (num_units tanh and sigmoid units for each gate, and so on) inside the LSTM or GRU layer.
The num_units argument of an LSTM layer refers to the number of LSTM units in that layer, with each LSTM unit having the architecture shown in the original answer's diagram.
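One way to make the two readings concrete is to write a single LSTM time step by hand. In tf.keras the constructor argument is named `units`. The sketch below is plain NumPy, not the Keras implementation; the function `lstm_step`, the weight names and the sizes (`units = 4`, `input_dim = 3`) are assumptions for illustration. It shows that `units` is the length of the hidden and cell state vectors, and that each of the four gates produces a vector of that same length.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, kernel, recurrent_kernel, bias):
    # All four gates are computed at once; each gate is a vector of length `units`.
    z = x_t @ kernel + h_prev @ recurrent_kernel + bias
    i, f, g, o = np.split(z, 4, axis=-1)   # input gate, forget gate, candidate, output gate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

units, input_dim, batch = 4, 3, 2
rng = np.random.default_rng(0)
kernel = rng.normal(size=(input_dim, 4 * units))
recurrent_kernel = rng.normal(size=(units, 4 * units))
bias = np.zeros(4 * units)

h, c = lstm_step(rng.normal(size=(batch, input_dim)),
                 np.zeros((batch, units)), np.zeros((batch, units)),
                 kernel, recurrent_kernel, bias)

n_params = kernel.size + recurrent_kernel.size + bias.size
print(h.shape, n_params)   # (2, 4) 128
```

Seen this way, "num_units LSTM units" and "one cell with num_units hidden units per gate" describe the same computation: element j of every gate, together with element j of the cell state, is what the diagrams usually draw as one unit.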

Why does NN training loss flatten?

I implemented a deep learning neural network from scratch without using any Python frameworks like TensorFlow or Keras.
The problem is that no matter what I change in my code, like adjusting the learning rate, changing the layers, changing the number of nodes, or changing the activation functions from sigmoid to relu to leaky relu, I end up with a training loss that starts at 6.98 but always converges to 3.24...
Why is that?
Please review my forward and backprop algorithms. Maybe there's something wrong in them that I couldn't identify.
My hidden layers use leaky relu and the final layer uses sigmoid activation.
I'm trying to classify the MNIST handwritten digits.
code:
#FORWARDPROPAGATION
for i in range(layers-1):
    cache["a"+str(i+1)]=lrelu((np.dot(param["w"+str(i+1)],cache["a"+str(i)]))+param["b"+str(i+1)])
cache["a"+str(layers)]=sigmoid((np.dot(param["w"+str(layers)],cache["a"+str(layers-1)]))+param["b"+str(layers)])
yn=cache["a"+str(layers)]
m=X.shape[1]
cost=-np.sum((y*np.log(yn)+(1-y)*np.log(1-yn)))/m
if j%10==0:
    print(cost)
    costs.append(cost)
#BACKPROPAGATION
grad={"dz"+str(layers):yn-y}
for i in range(layers):
    grad["dw"+str(layers-i)]=np.dot(grad["dz"+str(layers-i)],cache["a"+str(layers-i-1)].T)/m
    grad["db"+str(layers-i)]=np.sum(grad["dz"+str(layers-i)],1,keepdims=True)/m
    if i<layers-1:
        grad["dz"+str(layers-i-1)]=np.dot(param["w"+str(layers-i)].T,grad["dz"+str(layers-i)])*lreluDer(cache["a"+str(layers-i-1)])
for i in range(layers):
    param["w"+str(i+1)]=param["w"+str(i+1)] - alpha*grad["dw"+str(i+1)]
    param["b"+str(i+1)]=param["b"+str(i+1)] - alpha*grad["db"+str(i+1)]
The implementation seems okay. While you could converge to the same value with different models, learning rates or hyperparameters, what's worrying is having the same starting value every time, 6.98 in your case.
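As a rough sanity check on those two numbers, the cost formula from the question can be evaluated for a network whose ten sigmoid outputs are all the same constant. This is only an observation under the assumption of one-hot labels over 10 classes; the helper `constant_prediction_cost` is hypothetical and not part of the asker's code.

```python
import math

def constant_prediction_cost(p, n_classes=10):
    # Cost from the question's formula for one one-hot example
    # when every output unit predicts the same probability p.
    true_class = -math.log(p)
    other_classes = -(n_classes - 1) * math.log(1.0 - p)
    return true_class + other_classes

print(round(constant_prediction_cost(0.5), 2))   # 6.93: every output at 0.5
print(round(constant_prediction_cost(0.1), 2))   # 3.25: every output at 0.1
```

These values are close to the reported 6.98 and 3.24, which would be consistent with a network that starts near an output of 0.5 everywhere and ends up predicting roughly the class frequency for every input, regardless of the image.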
I suspect it has to do with your initialisation. If you're setting all your weights to zero initially, you're not going to break symmetry. That is explained here and here in adequate detail.
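The symmetry problem can be illustrated with a tiny two-layer network in NumPy. This is a sketch under assumptions, not the asker's network: it uses a tanh hidden layer, a linear output and squared error, and the function name `hidden_weight_grad` and all sizes are made up for the example.

```python
import numpy as np

def hidden_weight_grad(w1, w2, x, y):
    # Two-layer net: tanh hidden layer, linear output, squared error.
    h = np.tanh(w1 @ x)
    out = w2 @ h
    d_out = out - y
    d_h = (w2.T @ d_out) * (1.0 - h ** 2)
    return d_h @ x.T / x.shape[1]      # gradient w.r.t. w1, one row per hidden unit

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 32))           # 5 features, 32 examples
y = rng.normal(size=(1, 32))

# All-zero initialisation: no gradient reaches the hidden weights at all.
grad_zero = hidden_weight_grad(np.zeros((3, 5)), np.zeros((1, 3)), x, y)
# Constant initialisation: every hidden unit gets exactly the same update.
grad_const = hidden_weight_grad(np.full((3, 5), 0.1), np.full((1, 3), 0.1), x, y)
# Random initialisation: hidden units receive different updates.
grad_rand = hidden_weight_grad(rng.normal(size=(3, 5)), rng.normal(size=(1, 3)), x, y)

rows_identical_const = np.allclose(grad_const, grad_const[0])
rows_identical_rand = np.allclose(grad_rand, grad_rand[0])
print(np.all(grad_zero == 0), rows_identical_const, rows_identical_rand)
```

With zero or constant weights the hidden units never become different from one another, so the network cannot learn distinct features; small random initial weights avoid this.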

Homemade deep learning library: numerical issue with relu activation

For the sake of learning the finer details of a deep learning neural network, I have coded my own library with everything (optimizer, layers, activations, cost function) homemade.
It seems to work fine when benchmarking it on the MNIST dataset using only sigmoid activation functions.
Unfortunately, I seem to get issues when replacing these with relus.
This is what my learning curve looks like for 50 epochs on a training dataset of ~500 examples:
Everything is fine for the first ~8 epochs and then I get a complete collapse on the score of a dummy classifier (~0.1 accuracy). I checked the code of the relu and it seems fine. Here are my forward and backward passes:
def fprop(self, inputs):
    return np.maximum(inputs, 0.)
def bprop(self, inputs, outputs, grads_wrt_outputs):
    derivative = (outputs > 0).astype(float)
    return derivative * grads_wrt_outputs
The culprit seems to be in the numerical stability of the relu. I tried different learning rates and many parameter initializers for the same result. Tanh and sigmoid work properly. Is this a known issue? Is it a consequence of non-continuous derivative of the relu function?
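One way to confirm that the forward and backward passes above agree is a finite-difference gradient check. The sketch below rewrites the two methods as free functions (`relu_fprop`, `relu_bprop`, names chosen here for illustration) and compares the analytic gradient with central differences on random inputs; the shapes and `eps` are arbitrary assumptions.

```python
import numpy as np

def relu_fprop(inputs):
    return np.maximum(inputs, 0.0)

def relu_bprop(inputs, outputs, grads_wrt_outputs):
    derivative = (outputs > 0).astype(float)
    return derivative * grads_wrt_outputs

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 6))
grads_wrt_outputs = rng.normal(size=(4, 6))

analytic = relu_bprop(inputs, relu_fprop(inputs), grads_wrt_outputs)

# Central differences of the scalar sum(fprop(x) * grads_wrt_outputs),
# perturbing one input element at a time.
eps = 1e-6
numeric = np.zeros_like(inputs)
for idx in np.ndindex(*inputs.shape):
    bump = np.zeros_like(inputs)
    bump[idx] = eps
    plus = np.sum(relu_fprop(inputs + bump) * grads_wrt_outputs)
    minus = np.sum(relu_fprop(inputs - bump) * grads_wrt_outputs)
    numeric[idx] = (plus - minus) / (2 * eps)

max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print(max_abs_diff)
```

A small difference here indicates that the layer itself is implemented consistently, which would point the search towards other parts of the training loop such as the learning rate, the initialisation or the cost function.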
Yes, it's quite possible that the ReLU's are to blame. Most of the classic perceptron-based models, including ConvNet (the classic MNIST trainer), depend on both positive and negative weights for their training accuracy. ReLU ignores the negative characteristics, thus detracting from the model's capabilities.
ReLU is better-suited for convolution layers; it's a filter that says, "If the kernel isn't excited about this part of the input, I don't care how deep the boredom goes; just ignore it." MNIST training depends on counter-correction, allowing nodes to say "No, this isn't good, run the other way!"

Where to apply batch normalization on standard CNNs

I have the following architecture:
Conv1
Relu1
Pooling1
Conv2
Relu2
Pooling3
FullyConnect1
FullyConnect2
My question is, where do I apply batch normalization? And what would be the best function to do this in TensorFlow?
The original batch-norm paper prescribes using the batch-norm before ReLU activation. But there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on Keras GitHub by Francois Chollet:
... I can guarantee that recent code written by Christian [Szegedy] applies relu before BN. It is still occasionally a topic of debate, though.
To your second question: in tensorflow, you can use a high-level tf.layers.batch_normalization function, or a low-level tf.nn.batch_normalization.
There's some debate on this question. This Stack Overflow thread and this keras thread are examples of the debate. Andrew Ng says that batch normalization should be applied immediately before the non-linearity of the current layer. The authors of the BN paper said that as well, but now according to François Chollet on the keras thread, the BN paper authors use BN after the activation layer. On the other hand, there are some benchmarks such as the one discussed on this torch-residual-networks github issue that show BN performing better after the activation layers.
My current opinion (open to being corrected) is that you should do BN after the activation layer, and if you have the budget for it and are trying to squeeze out extra accuracy, try before the activation layer.
So adding Batch Normalization to your CNN would look like this:
Conv1
Relu1
BatchNormalization
Pooling1
Conv2
Relu2
BatchNormalization
Pooling3
FullyConnect1
BatchNormalization
FullyConnect2
BatchNormalization
In addition to the original paper using batch normalization before the activation, the Deep Learning book by Goodfellow, Bengio and Courville (section 8.7.1) gives some reasoning for why applying batch normalization after the activation (or directly before the input to the next layer) may cause some issues:
It is natural to wonder whether we should apply batch normalization to
the input X, or to the transformed value XW+b. Ioffe and Szegedy (2015)
recommend the latter. More specifically, XW+b should be replaced by a
normalized version of XW. The bias term should be omitted because it
becomes redundant with the β parameter applied by the batch
normalization reparameterization. The input to a layer is usually the
output of a nonlinear activation function such as the rectified linear
function in a previous layer. The statistics of the input are thus
more non-Gaussian and less amenable to standardization by linear
operations.
In other words, if we use a relu activation, all negative values are mapped to zero. The result is a pile-up of values at exactly zero and a distribution that is heavily skewed to the right. Trying to normalize that data to a nice bell-shaped curve probably won't give the best results. For activations outside of the relu family this may not be as big of an issue.
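The skew described above can be checked numerically. This is a small NumPy sketch that assumes the pre-activations are roughly standard normal, which real layers only approximate; the helper `skewness` and the sample size are illustrative.

```python
import numpy as np

def skewness(v):
    # Third standardized moment: 0 for a symmetric distribution, > 0 for a right skew.
    centered = v - v.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

rng = np.random.default_rng(0)
pre_activation = rng.normal(size=200_000)            # assumed roughly Gaussian XW + b
post_activation = np.maximum(pre_activation, 0.0)    # relu output

skew_pre = skewness(pre_activation)
skew_post = skewness(post_activation)
print(f"skewness before relu: {skew_pre:.2f}")
print(f"skewness after relu:  {skew_post:.2f}")
```

Batch normalization only shifts and rescales each feature, so it can change the mean and variance of the relu output but cannot remove this asymmetry.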
Some report better results when placing batch normalization after activation, while others get better results with batch normalization before activation. It's an open debate. I suggest that you test your model using both configurations, and if batch normalization after activation gives a significant decrease in validation loss, use that configuration instead.
