Learning with Batch Normalization vs without Batch Normalization - python

The primary benefit of Batch Normalization is supposed to be faster optimization. To confirm this, I created a model with and without batch normalization, as shown below:
Without BN:
from keras.models import Sequential
from keras.layers import Dense, Flatten, Activation

# init_1 and init_2 are kernel initializers defined elsewhere
model = Sequential()
model.add(Dense(units=1, activation='linear', use_bias=False,
                kernel_initializer=init_1, input_shape=(28, 28, 1), trainable=False))
model.add(Flatten())
model.add(Dense(units=1024, kernel_initializer=init_2))
model.add(Activation('relu'))
model.add(Dense(units=10, kernel_initializer=init_2))
model.add(Activation('softmax'))
With BN:
from keras.layers import BatchNormalization

model = Sequential()
model.add(Dense(units=1, activation='linear', use_bias=False,
                kernel_initializer=init_1, input_shape=(28, 28, 1), trainable=False))
model.add(Flatten())
model.add(Dense(units=1024, kernel_initializer=init_2))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dense(units=10, kernel_initializer=init_2))
model.add(Activation('softmax'))
The model was then compiled and trained on the augmented data as shown below:
from keras import optimizers
import numpy as np

opt = optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
# `iter` is the augmented-data generator defined elsewhere (note that the
# name shadows the Python builtin of the same name)
model.fit_generator(iter, steps_per_epoch=int(np.ceil(len(X_train) / 64)), epochs=30)
To my surprise, I found that the model without BN converges faster.
To further confirm the findings, I repeated the experiment while varying the number of units in the hidden layer, and got the following results (this time using the fit() method, sketched below, rather than fit_generator(), and the original data rather than augmented data):
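For reference, the unaugmented fit() call looked roughly like this (a sketch; the exact arguments were not given in the original post, and y_train is the assumed label array):

model.fit(X_train, y_train, batch_size=64, epochs=30)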
[Training-curve graphs for hidden layer sizes of 5, 50, and 100 units]
From the graphs, we can observe that the model converges faster without batch normalization and achieves higher accuracy than with batch normalization.
Please help me in finding out where I am doing it wrong. Thank you.

Related

Need help defining a simple neural network

I am very new to this and I have several questions. I have code snippets of a neural network created in Python with Keras. The model is used for sentiment analysis. A training dataset of labeled data (sentiment = 1 or 0) was used.
Now I have several questions on how to describe the neural network.
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(train_x, train_y,
          batch_size=32,
          epochs=5,
          verbose=1,
          validation_split=0.1,
          shuffle=True)
I am not very clear on many of the following terms so don't be too hard on me.
1: Is there anything that makes this a typical model for sentiment analysis?
2: Is it "bag of words"? (My guess is yes, since the data was pre-processed using a tokenizer)
3: Is it "convolusional"?
4: Is it deep?
5: Is it dense - How dense is it?
6: What is the reason for the density(?)-numbers: 512, 256, 2
7: How many layers does it have (input and output layer included/excluded?)
8: Is it supervised / unsupervised?
9: What is the reason behind the three different activation functions 'relu', 'sigmoid', 'softmax' in the used order?
I appreciate any help!
Categorical cross entropy, the loss function of this neural network, is what makes it usable for sentiment analysis. The softmax output layer produces probabilities for the different classes, and categorical cross entropy measures the loss over those probabilities. In your case, you need probabilities for two possible classes (0 or 1).
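As a hand-worked illustration (my example, not from the original answer), the loss for a single sample is -sum(y_true * log(y_pred)):

import numpy as np

# Categorical cross entropy for one sample: -sum(y_true * log(y_pred))
y_true = np.array([1.0, 0.0])  # one-hot label: the true class is the first class
y_pred = np.array([0.9, 0.1])  # softmax output of the network
loss = -np.sum(y_true * np.log(y_pred))
print(loss)  # ~0.105, i.e. -log(0.9)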
I am not sure if you are using a tokenizer, since it is not apparent from the code you provided, but if you are, then yes, it is a bag of words model. A bag of words model essentially builds a count of the words (or word roots) that occur in your text.
From Wikipedia, if the following is your text:
John likes to watch movies. Mary likes movies too.
then, a BoW for this text would be:
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
The network architecture you are using is not convolutional; rather, it is a feedforward model, which connects all units in one layer to all units in the next, computing a weighted sum (dot product) of the previous layer's outputs.
There is no one accepted definition of a network being deep. But, as a rule of thumb, if a network has more than 2 middle layers (layers excluding the input and output layer), then it can be considered as a deep network.
In the code provided above, Dense reflects the fact that all 512 units in the first hidden layer are connected to every unit in the next layer, i.e., a total of 512x256 connections between those two layers.
Yes, the connections between the 512 units and the 256 units, resulting in a 512x256-dimensional matrix of parameters, make it dense. But the usage of Dense here is more from an API perspective than a semantic one. Similarly, the parameter matrix between the second hidden layer and the output layer is 256x2-dimensional.
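As a quick sanity check (a worked count, not from the original answer): that middle Dense layer has 512 x 256 = 131,072 weights plus 256 biases, i.e. 131,328 trainable parameters, which model.summary() will report for it.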
Strictly speaking, the input layer here has max_words units (one per token), and the output layer has 2 units (for the 0/1 outcome). Excluding those, the network has two hidden layers, with 512 and 256 units.
This model is supervised, since the sentiment analysis task has an output (positive or negative) associated with every input data point. You can see this output as a supervisor that indicates to the network whether a data point has a positive or negative sentiment. An unsupervised task does not have an output signal associated with the data points.
The activation functions used here serve the purpose of introducing nonlinearity into the network's computations. In a little more detail, sigmoid has the nice property that its output can be interpreted as a probability. So if the network outputs 0.89 for a data point, it means the model evaluates that data point as positive with probability 0.89.
The usage of sigmoid in the middle layer is probably for teaching purposes; ReLU units are generally favored over sigmoid/tanh because of better convergence properties, and I don't see a convincing reason to use sigmoid instead of ReLU here.

Keras neural network predicts the same number for all inputs

I am trying to create a Keras neural network to predict the road distance between two points in a city. I use Google Maps to get the travel distance and then train the neural network on it.
import pandas as pd

arr = []
for i in range(0, 100):
    # generateTwoPoints (defined elsewhere) returns two random points in the
    # bounding box together with the Google Maps distance and travel time
    arr.append(generateTwoPoints(55.901819, 37.344735, 55.589537, 37.832254))
df = pd.DataFrame(arr, columns=['p1Lat', 'p1Lon', 'p2Lat', 'p2Lon', 'distnaceInMeters', 'timeInSeconds'])
print(df)
Neural network architecture:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

sgd = SGD(lr=0.00000001)
model = Sequential()
model.add(Dense(100, input_dim=4, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Note: compiling with the string 'sgd' uses a default SGD instance, not the
# `sgd` object with the tiny learning rate defined above
model.compile(loss='mse', optimizer='sgd', metrics=['mse'])
Then I split the data into train/test sets:
Xtrain=train[['p1Lat','p1Lon','p2Lat','p2Lon']]/100
Ytrain=train[['distnaceInMeters']]/100000
Xtest=test[['p1Lat','p1Lon','p2Lat','p2Lon']]/100
Ytest=test[['distnaceInMeters']]/100000
Then I fit the data to the model, but the loss stays the same:
history = model.fit(Xtrain, Ytrain,
                    batch_size=1,
                    epochs=1000,
                    # pass validation data to monitor validation loss and
                    # metrics at the end of each epoch
                    validation_data=(Xtest, Ytest))
Later I print the predictions and the test labels:
prediction = model.predict(Xtest)
print(prediction)
print (Ytest)
But the result is practically the same for all inputs:
[[0.26150784]
[0.26171574]
[0.2617755 ]
[0.2615582 ]
[0.26173398]
[0.26166356]
[0.26185763]
[0.26188275]
[0.2614446 ]
[0.2616575 ]
[0.26175532]
[0.2615183 ]
[0.2618127 ]]
distnaceInMeters
2 0.13595
6 0.27998
7 0.48849
16 0.36553
21 0.37910
22 0.40176
33 0.09173
39 0.24542
53 0.04216
55 0.38212
62 0.39972
64 0.29153
87 0.08788
I cannot find the problem. What is it? I am new to machine learning.
You are making a very elementary mistake: since you are in a regression setting, you should not use a sigmoid activation for your final layer (that is used for binary classification); change your last layer to
model.add(Dense(1,activation='linear'))
or even
model.add(Dense(1))
since, according to the docs, if you do not specify the activation argument it defaults to linear.
Various other advice offered already in the other answer and the comments may be useful (lower LR, more layers, other optimizers e.g. Adam), and you certainly need to increase your batch size; but nothing will work with the sigmoid activation function you currently use for your last layer.
Irrelevant to the issue, but in regression settings you don't need to repeat your loss function as a metric; this
model.compile(loss='mse', optimizer='sgd')
will suffice.
It would be very useful if you could post the progression of the loss and MSE (on both the training and validation/test sets) throughout training. Even better, visualize it as per https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/ and post the visualization here.
In the meantime, based on the facts:
1) You say the loss isn't decreasing (I'm assuming on the training set, during training, based on your compile args).
2) You say that the prediction "accuracy" on your test set is bad.
3) My experience/intuition (not an empirical assessment) tells me that your two-layer dense model is a little too small to capture the complexity inherent in your data, i.e. your model is suffering from too high a bias; see https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
The fastest and easiest thing you can try is to add both more layers and more nodes to each layer.
However, I should note that there is a lot of causal information affecting driving distance and driving time beyond just the straight-line distance between two coordinates, which is probably the feature your neural network will most readily extract. For example: whether you drive on a highway or side streets, traffic lights, whether the roads twist and turn or go straight... To infer all of that from just this data, you will need enormous amounts of examples, in my opinion. If you could add input columns with, e.g., the distance to the nearest highway from both points, you might be able to train with less data.
I would also recommend that you double-check that you are feeding as input what you think you are feeding (and its shape), and you should use some standardization, e.g. from sklearn, which might help the model learn faster and converge to a higher "accuracy".
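A minimal sketch of such standardization with sklearn (assuming the Xtrain/Xtest DataFrames from the question):

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply the same
# transformation to the test features, so no test-set information leaks in
scaler = StandardScaler()
Xtrain_scaled = scaler.fit_transform(Xtrain)
Xtest_scaled = scaler.transform(Xtest)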
If and when you post more code or the training history (and how many training samples you have), I can help you more.
EDIT 1: Try changing the batch size to a larger number, preferably batch_size=32 if it fits in your memory. You can use a small batch size (such as 1) when working with an "info rich" input like an image, but when using a very "info poor" datum like 4 floats (2 coordinates), the gradient of each batch (with batch_size=1) points in a practically random (pseudo...) direction and does not necessarily get any closer to a local minimum. Only when taking the gradient on the collective loss of a larger batch (like 32, or perhaps more) will you get a gradient that points at least approximately in the direction of the local minimum and converge to a better result. Also, I suggest that you don't mess with the learning rate manually and instead change to an optimizer like "adam" or "RMSProp".
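A sketch combining both suggestions against the code from the question (Adam with its default learning rate; note newer Keras versions take learning_rate while older ones take lr):

from keras.optimizers import Adam

model.compile(loss='mse', optimizer=Adam())  # default learning rate, no manual tuning
history = model.fit(Xtrain, Ytrain,
                    batch_size=32,
                    epochs=1000,
                    validation_data=(Xtest, Ytest))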
EDIT 2: @desertnaut made an excellent point that I totally missed, a correction without which your code will not work properly. He deserves the credit, so I will not include it here; please refer to his answer. Also, don't forget to raise your batch size and not "manually mess" with your learning rate; "adam", for example, will do it for you.

Accuracy Decreasing with higher epochs

I am a newbie to Keras and machine learning in general. I'm trying to build a binary classification model using the Sequential model. After some experimenting, I saw that on multiple runs (not always) I was getting an accuracy of even 97% on my validation data in the second or third epoch itself, but this dramatically decreased to as low as 12%. What is the reason behind this? How do I fine-tune my model?
Here's my code -
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

# `size` is the second input dimension, defined elsewhere
model = Sequential()
model.add(Flatten(input_shape=(6, size)))
model.add(Dense(6, activation='relu'))
model.add(Dropout(0.35))
model.add(Dense(3, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])
model.fit(x, y, epochs=60, batch_size=40, validation_split=0.2)
In my opinion, you can take the following factors into consideration:
Reduce your learning rate to a very small number like 0.001 or even 0.0001.
Provide more data.
Set Dropout rates to a number like 0.2. Keep them uniform across the network.
Try decreasing the batch size.
Using an appropriate optimizer: you may need to experiment a bit here. Use different optimizers on the same network and select the one that gives you the least loss (a sketch follows this list).
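A sketch of the learning-rate and optimizer suggestions combined (build_model is a hypothetical helper wrapping the Sequential definition from the question, so each optimizer starts from fresh weights; newer Keras takes learning_rate, older takes lr):

from keras.optimizers import Adam, RMSprop, SGD

# Retrain the same architecture with each optimizer at a reduced learning
# rate, then compare the resulting losses
for make_opt in [Adam, RMSprop, SGD]:
    model = build_model()  # hypothetical helper recreating the model above
    model.compile(loss='binary_crossentropy',
                  optimizer=make_opt(learning_rate=0.0001),
                  metrics=['binary_accuracy'])
    model.fit(x, y, epochs=60, batch_size=40, validation_split=0.2)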
If any of the above factors work for you, please let me know about it in the comments section.
When I had this issue, I solved it by changing the optimizer from RMSprop to Adam. I also reduced the learning rate and added dropout after each fully connected layer. If your FC layers have a small number of neurons, then adding dropout does not make a major difference.
Reduce the learning rate (to something like 0.001 or 0.0001) and add batch normalization after every convolutional layer.
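A sketch of that pattern (a hypothetical convolutional block; note the question's model is dense-only, so this applies if convolutions are added):

from keras.layers import Conv2D, BatchNormalization, Activation

# Batch normalization sits between the convolution and its activation
model.add(Conv2D(32, (3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))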

Validation and training accuracy high in the first epoch [Keras]

I am training an image classifier with 2 classes and 53k images, and validating it with 1.3k images, using Keras. Here is the structure of the neural network:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

model = Sequential()
model.add(Flatten(input_shape=train_data.shape[1:]))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy', metrics=['accuracy'])
Training accuracy increases from ~50% to ~85% in the first epoch, with 85% validation accuracy. Subsequent epochs increase the training accuracy consistently, however, validation accuracy stays in the 80-90% region.
I'm curious, is it possible to get high validation and training accuracy in the first epoch? If my understanding is correct, it starts small and increases steadily with each passing epoch.
Thanks
EDIT: The image size is 150x150 after rescaling, and the mini-batch size is 16.
Yes, it is entirely possible to get high accuracy in the first epoch and then only modest improvements.
If there is enough redundancy in the data and you make enough updates in the first epoch (relative to the complexity of your model, which seems fairly easy to optimize), i.e. you use small minibatches, it's entirely possible that you learn most of the important stuff during the first epoch. When you show the data again, the model will start overfitting to peculiarities introduced by the specific images in your train set (and you get increasing training accuracy), but since you do not provide any novel samples, it will not learn anything new about the underlying properties of your classes.
You can think of your training data as an infinite stream (which is in fact the setting in which SGD enjoys all its convergence theorems). Do you think you need more than 50k samples to learn what is important? You can actually test the data-hunger of your model by providing less data, or by reporting performance after some sub-epoch number of updates.
You cannot expect to get an accuracy over 90-95% with image classification using only feedforward neural networks.
You need to use another architecture called a Convolutional Neural Network (CNN), the state of the art in image recognition.
It is also very easy to build one using Keras, but it is computationally more intensive than this.
If you want to stick with feedforward layers, the best thing you can do is early stopping, but even that won't give you accuracy over 90%.
Yes, epochs are supposed to fit the data to the model.
Try using 2 neurons at the end and one-hot encoding your class labels, as sketched below.
I have seen a case where this gave better results than a single binary output.
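A sketch of that variant (assuming the same model up to the output layer; to_categorical is the standard Keras one-hot helper):

from keras.utils import to_categorical

# 2-unit softmax output instead of the single sigmoid unit
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# One-hot encode the binary labels: 0 -> [1, 0], 1 -> [0, 1]
y_onehot = to_categorical(y, num_classes=2)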

Keras: Training loss decrases (accuracy increase) while validation loss increases (accuracy decrease)

I am working on a very sparse dataset with the goal of predicting 6 classes.
I have tried working with a lot of models and architectures, but the problem remains the same.
When I start training, the training accuracy slowly starts to increase and the loss decreases, whereas the validation does the exact opposite.
I have really tried to deal with overfitting, and I still cannot believe that this is what is causing the issue.
What have I tried
Transfer learning on VGG16:
exclude the top layer and add a dense layer with 256 units and a 6-unit softmax output layer
fine-tune the top CNN block
fine-tune the top 3-4 CNN blocks
To deal with overfitting, I use heavy augmentation in Keras and dropout after the 256-unit dense layer with p=0.5.
Creating own CNN with VGG16-ish architecture:
including batch normalization wherever possible
L2 regularization on each CNN+dense layer
Dropout from anywhere between 0.5-0.8 after each CNN+dense+pooling layer
Heavy data augmentation "on the fly" in Keras
Realising that perhaps I have too many free parameters:
decreasing the network to only contain 2 CNN blocks + dense + output.
dealing with overfitting in the same manner as above.
Without exception, all training sessions look like this:
[Plot: training & validation loss and accuracy]
The last mentioned architecture looks like this:
from keras.models import Sequential
from keras.layers import (Conv2D, BatchNormalization, Activation, Dropout,
                          MaxPooling2D, Flatten, Dense)
from keras import regularizers

reg = 0.0001
model = Sequential()
model.add(Conv2D(8, (3, 3), input_shape=input_shape, padding='same',
                 kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Conv2D(16, (3, 3), padding='same',
                 kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(16, kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(6))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])
The data is augmented by the generator in Keras and loaded with flow_from_directory:
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rotation_range=10,
                                   width_shift_range=0.05,
                                   height_shift_range=0.05,
                                   shear_range=0.05,
                                   zoom_range=0.05,
                                   rescale=1/255.,
                                   fill_mode='nearest',
                                   channel_shift_range=0.2*255)
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle=True,
    class_mode='categorical')
validation_datagen = ImageDataGenerator(rescale=1/255.)
validation_generator = validation_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=1,
    shuffle=True,
    class_mode='categorical')
What I can think of by analyzing your metric outputs (from the link you provided):
It seems to me that approximately around epoch 30 your model starts to overfit. Therefore you can try stopping your training at that iteration, or just train it for ~30 epochs (or the exact number). The Keras Callbacks may be useful here, especially ModelCheckpoint, which saves your model as training progresses, enabling you to stop training whenever desired (Ctrl+C) or when a certain criterion is met. Here is an example of basic ModelCheckpoint use:
from keras.callbacks import ModelCheckpoint

# save_best_only=True would save a checkpoint only when the monitored
# metric improves; False saves at every epoch
chk = ModelCheckpoint("myModel.h5", monitor='val_loss', save_best_only=False)
callbacks_list = [chk]
# pass the callbacks on fit
history = model.fit(X, Y, ..., callbacks=callbacks_list)
(Edit:) As suggested in the comments, another option is the EarlyStopping callback, where you can specify the minimum change tolerated and the 'patience' (epochs without such improvement) before stopping the training. If you use it, pass it to the callbacks argument as explained above.
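A minimal sketch of that option (the min_delta and patience values here are illustrative, not from the original answer):

from keras.callbacks import EarlyStopping

# Stop once val_loss has failed to improve by at least min_delta for
# `patience` consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=5)
history = model.fit(X, Y, ..., callbacks=[early_stop])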
With the current setup your model has (and with the modifications you have tried), that point in your training seems to be the optimal training time for your case; training it further brings no benefit to your model (in fact, it will make it generalize worse).
Given you have tried several modifications, one thing you can do is try to increase your network depth, to give it more capacity. Try adding more layers, one at a time, and check for improvements. Also, you usually want to start with simpler models first, before attempting a multi-layer solution.
If a simple model doesn't work, add one layer and test again, repeating until satisfied or until it is no longer possible. And by simple I mean really simple: have you tried a non-convolutional approach? Although CNNs are great for images, maybe they are overkill here.
If nothing seems to work, maybe it is time to get more data, or to generate more data from what you have by sampling or other techniques. For that last suggestion, try checking this Keras blog post, which I have found really useful. Deep learning algorithms usually require substantial amounts of training data, especially for complex inputs like images, so be aware this may not be an easy task. Hope this helps.
IMHO, this is just a normal situation for DL. In Keras you can set up a callback that will save the best model (depending on the evaluation metric you provide), and a callback that will stop training if the model isn't improving.
See ModelCheckpoint & EarlyStopping callbacks respectively.
P.S. Sorry, maybe I misunderstood the question - do you have the validation loss decreasing from the first step?
Validation loss is increasing. This means you need more data or more regularization. This is a standard situation here, and nothing to be worried about. By the way, more parameters (a bigger model) will just worsen this problem unless you fix it.
So you can now investigate profitably by introducing more examples, L2/L1 regularization, or dropout.
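As a sketch, an L1 penalty can be added alongside (or instead of) the L2 already in the model via keras.regularizers; the coefficients here are illustrative:

from keras import regularizers
from keras.layers import Dense

# Combined L1 + L2 penalty on a dense layer's weights
model.add(Dense(16, kernel_regularizer=regularizers.l1_l2(l1=0.0001, l2=0.0001)))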
I faced a similar problem and managed to fix it by removing the batch normalization layer just before the output dense layer. This made a ton of difference. One of the suggestions I was given was also to remove the dropout layer, as it might be causing variance shift; check this paper.
I got part of the solution from this thread.
