Please help me understand why my model overfits when my input data is normalized to [-0.5, 0.5], whereas it does not overfit otherwise.
I am solving a regression ML problem: detecting the locations of 4 key points on images. To do that, I import a pretrained ResNet50 and replace its top layer with the following architecture (a code sketch follows the list):
Flattening layer right after ResNet
Fully Connected (dense) layer with 256 nodes, followed by a LeakyReLU activation and Batch Normalization
Another Fully Connected layer with 128 nodes, also followed by LeakyReLU and Batch Normalization
Last Fully Connected layer (with 8 nodes), which gives me the 8 coordinates (4 Xs and 4 Ys) of the 4 key points.
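For reference, here is a minimal sketch of that head in tf.keras (the 224x224x3 input size, the optimizer and the loss are assumptions, not taken from the question):

import tensorflow as tf
from tensorflow.keras import layers, Model

base = tf.keras.applications.ResNet50(include_top=False, weights='imagenet',
                                      input_shape=(224, 224, 3))
base.trainable = False  # freeze the backbone for the first training stage

x = layers.Flatten()(base.output)
x = layers.Dense(256)(x)
x = layers.LeakyReLU()(x)
x = layers.BatchNormalization()(x)
x = layers.Dense(128)(x)
x = layers.LeakyReLU()(x)
x = layers.BatchNormalization()(x)
outputs = layers.Dense(8)(x)  # 4 (x, y) pairs, targets scaled to [-0.5, 0.5]

model = Model(base.input, outputs)
model.compile(optimizer='adam', loss='mse')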
Since I stick with the Keras framework, I use ImageDataGenerator to produce the flow of data (images). Since the output of my model (8 numbers: 2 coordinates for each of the 4 key points) is normalized to the [-0.5, 0.5] range, I decided that the input to my model (images) should also be in this range, and therefore normalized it to the same range using the preprocessing_function argument of Keras' ImageDataGenerator.
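For illustration, such a preprocessing function could look roughly like this (a hypothetical version assuming 8-bit images; the question does not show the exact function used):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

def scale_to_unit_range(img):
    # map pixel values from [0, 255] to [-0.5, 0.5]
    return img / 255.0 - 0.5

datagen = ImageDataGenerator(preprocessing_function=scale_to_unit_range)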
The problem showed up right after I started training. I froze the entire ResNet (training = False) with the following plan: first bring the weights of the top layers to a reasonable state, and only then unfreeze half of the ResNet and fine-tune the model. When training with ResNet frozen, I noticed that my model starts overfitting after just a couple of epochs. Surprisingly, this happens even though my dataset is quite decent in size (25k images) and Batch Normalization is employed.
What's even more surprising: the problem completely disappears if I move away from input normalization to [-0.5, 0.5] and instead preprocess the images with tf.keras.applications.resnet50.preprocess_input. This preprocessing method DOES NOT normalize image data to [-0.5, 0.5] (it zero-centers each channel with respect to the ImageNet means, without scaling) and, surprisingly to me, it leads to proper model training without any overfitting.
I tried using Dropout with different probabilities and L2 regularization. I also tried to reduce the complexity of my model by reducing the number of top layers and the number of nodes in each top layer. I played with the learning rate and the batch size. Nothing really helped as long as my input data was normalized, and I have no idea why this happens.
IMPORTANT NOTE: when VGG is employed instead of ResNet everything seems to work well!
I really want to figure out why this happens.
UPD: the problem was caused by two issues:
- the Batch Normalization layers within ResNet didn't work properly when frozen
- image preprocessing for ResNet should be done using Z-score normalization
After the two fixes mentioned above, everything seems to work well!
Mentioning the Solution below for the benefit of the community.
The problem is resolved by making the changes mentioned below:
Batch Normalization layers within ResNet don't work properly when frozen, so the Batch Normalization layers within ResNet should be unfrozen before training the model.
Image preprocessing (normalization) for ResNet should be done using Z-score standardization (subtract the mean, divide by the standard deviation) instead of the simple rescaling via preprocessing_function in Keras' ImageDataGenerator.
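A minimal sketch of the two fixes, assuming tf.keras (X_train here is a hypothetical array of training images used to compute the Z-score statistics):

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

base = tf.keras.applications.ResNet50(include_top=False, weights='imagenet',
                                      input_shape=(224, 224, 3))

# Fix 1: keep the backbone frozen except for its Batch Normalization layers
base.trainable = True
for layer in base.layers:
    if not isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.trainable = False

# Fix 2: Z-score the images with statistics computed on the training set
# X_train: hypothetical (N, H, W, 3) float array of training images
train_mean = X_train.mean(axis=(0, 1, 2))  # per-channel mean
train_std = X_train.std(axis=(0, 1, 2))    # per-channel std

def zscore(img):
    return (img - train_mean) / train_std

datagen = ImageDataGenerator(preprocessing_function=zscore)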
Related
I had a NaNs issue that prevented me from running my model for even one iteration, and the batch normalization solution in this post enabled me to run my model. But I still have some iterations that return NaNs/Infs, and they go away after a few iterations. Is this all right?
I also noticed that the number of LSTM nodes has an effect on this result. Can anyone explain the proper way to use batch normalization and how to choose the number of nodes in LSTM layers?
The structure of my model is similar to the one in this post, and I'm just wondering where in the model I should use batch normalization. Is this batch normalization correctly implemented, or do I need to add it after each LSTM layer?
I'm trying to switch from PyTorch to TensorFlow, and since the model now seems to be a fixed thing in TensorFlow, I stumbled upon a problem when working with Convolutional Neural Networks.
I have a very simple model: just one Conv1D layer with a kernel of size 2.
I want to train it on a small configuration, say an input size of 16, and then transfer the training results to a model with an input size of 32.
How can I access the 3 parameters in this network (2 kernel weights, 1 bias)? I want to do so to apply them to the higher-size case. I struggle because I need to pre-define an input size for the model, which was not the case with PyTorch.
Thanks for answering; I've only found outdated answers to this question.
model.layers[0].get_weights() yields the weights of the first layer, assuming model is a tf.keras.Model object.
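A short sketch of using that to transfer the kernel and bias from a 16-input model to a 32-input one (the layer configuration follows the question; everything else is an assumption):

import tensorflow as tf

# small model: one Conv1D layer, kernel size 2, trained on inputs of length 16
small = tf.keras.Sequential([
    tf.keras.layers.Conv1D(filters=1, kernel_size=2, input_shape=(16, 1)),
])
# ... train `small` here ...

kernel, bias = small.layers[0].get_weights()   # shapes: (2, 1, 1) and (1,)

# larger model: same layer, but inputs of length 32
large = tf.keras.Sequential([
    tf.keras.layers.Conv1D(filters=1, kernel_size=2, input_shape=(32, 1)),
])
large.layers[0].set_weights([kernel, bias])    # conv weights do not depend on the input length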
I'm training a pretty basic NN on the Fashion-MNIST dataset. I'm using my own code, which is not important. I use a rather simplified algorithm similar to ADAM and a quadratic loss, (train_value - real_value)**2, for training and error calculation. I apply a basic backpropagation algorithm for each weight, and I update 1/5 of the network weights for each training image. I use only a single 128-node hidden layer, as in the basic TensorFlow beginners' example, plus the input and output layers (the last with softmax and the first normalized to 0-1).
I'm not an expert at all, and I've only been able to train my network up to 77% accuracy on the test set.
As shown in the image below, I noticed that the gradients of the weights for most of my neurons converge to zero after a few epochs. But there are a few remarkable exceptions that just remain rebellious (the vertical lines in the first image divide the weights by neuron).
Could you recommend some general techniques to train these rogue neurons without affecting the others?
You could add a constraint to the given kernel (the weight matrix in the Dense layer).
With one of those constraints, the weights can be kept within a given, user-defined range.
See: TensorFlow.Keras Constraints
In addition, you can try to use regularizers to prevent overfitting of the model, which may be indicated by some very large (absolute) weight values.
For that, see for example the L1 or L2 regularizers: TensorFlow.Keras Regularizers
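A minimal sketch of combining a constraint with an L2 regularizer on a Dense layer (the layer size, constraint bounds and regularization strength are arbitrary examples):

import tensorflow as tf
from tensorflow.keras import layers, constraints, regularizers

dense = layers.Dense(
    128,
    activation='relu',
    kernel_constraint=constraints.MinMaxNorm(min_value=0.0, max_value=2.0),
    kernel_regularizer=regularizers.l2(1e-4),
)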
I want to extract 1000 image features using the pretrained Xception model,
but the Xception model's last layer (avg_pool) gives 2048 features.
Can I reduce the final number of output features without additional training?
I want the image features before the softmax, not the prediction result.
from tensorflow.keras.applications import xception
from tensorflow.keras.models import Model

base_model = xception.Xception(include_top=True, weights='imagenet')
base_model.summary()
# take the 2048-d 'avg_pool' output as the feature extractor
self.model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)
This model was trained to produce embeddings in a 2048-dimensional space for the classifier that follows it. There is no point in trying to reduce the dimensionality of the embedding space unless you are combining very complex and inflexible models. If you are just doing simple transfer learning with no memory constraints, just snap your new classifier (extra layers) on top of it, and retrain after freezing (or not) all layers in the original Xception. That should work regardless of the Xception output_shape. See the Keras docs.
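A sketch of that approach in tf.keras (the head size and the number of classes are placeholders):

import tensorflow as tf

base = tf.keras.applications.Xception(include_top=False, weights='imagenet',
                                      pooling='avg')   # outputs the 2048-d avg_pool features
base.trainable = False                                 # freeze the original Xception

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),   # e.g. 10 target classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])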
That said, if you REALLY need to reduce the dimensionality to 1000-d, you will need a method that preserves (or at least tries to preserve) the original topology of the embedding space; otherwise your model will not benefit at all from transfer learning. Take a look at PCA, SVD, or t-SNE.
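For example, a PCA-based reduction to 1000 dimensions with scikit-learn could look like this (features stands in for the real (n_samples, 2048) array extracted with the model above):

import numpy as np
from sklearn.decomposition import PCA

features = np.random.rand(5000, 2048)   # placeholder for the real extracted features
pca = PCA(n_components=1000)
reduced = pca.fit_transform(features)   # shape: (5000, 1000)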
I am studying some machine learning on my own and I am practicing (in Python) with the assignments of the course held by Andrew Ng.
After completing the fourth exercise by hand, I thought I would do it in Keras to practice with the library.
In the exercise we have 5000 images of handwritten digits, going from 0 to 9. Each image is a 20x20 matrix. The dataset is stored in a matrix X of shape 5000x400 (each image has been 'unrolled') and the labels are stored in a matrix y of shape 5000x10. Each row of y is a one-hot vector.
The exercise asks to implement backpropagation to maximize the log likelihood, for a simple neural network with one input layer, one hidden layer and one output layer. The hidden layer has 25 neurons and the output layer 10. We use the sigmoid as activation for both layers.
My code in Keras is this:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

model = Sequential()
model.add(Dense(25, input_shape=(400,), use_bias=True, kernel_regularizer=regularizers.l2(1), activation='sigmoid', kernel_initializer='glorot_uniform'))
model.add(Dense(10, use_bias=True, kernel_regularizer=regularizers.l2(1), activation='sigmoid', kernel_initializer='glorot_uniform'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(X, y, batch_size=5000, epochs=100, verbose=1)
Since I want this to be as similar as possible to the assignment, I have used the same initial weights, the same regularization parameter, the same activations, and gradient descent as the optimizer (actually the assignment uses the Truncated Newton Method, but I don't think my problem lies there).
I thought I was doing everything correctly, but when I train the network I get a 10% accuracy on the training dataset. Even playing a little bit with the parameters, the accuracy doesn't change much. To understand the problem better, I tested it with smaller pieces of the dataset. For instance, if I select a sub-dataset of 100 elements containing x images of zeros and 100-x images of ones, I get an x% training accuracy. My guess is that the network is optimizing the parameters to recognise only the first digit.
Now my questions are: what am I missing? Why isn't this the right implementation of the neural network described above?
If you are practising on the MNIST dataset to classify 10 digits, you have 10 classes to predict. Rather than sigmoid, use ReLU in the hidden layers (in your case, the first layer) and a softmax activation on the output layer. Use the categorical crossentropy loss function with the adam or sgd optimizer.
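A sketch of the model above with those changes applied (other settings kept as in the question):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

model = Sequential()
model.add(Dense(25, input_shape=(400,), activation='relu',
                kernel_regularizer=regularizers.l2(1), kernel_initializer='glorot_uniform'))
model.add(Dense(10, activation='softmax',
                kernel_regularizer=regularizers.l2(1), kernel_initializer='glorot_uniform'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])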