NaNs and Infs during training of a model with LSTM layers - python

I had a NaN issue that prevented me from running my model for even one iteration, and the batch normalization solution in this post enabled me to run it. But some iterations still return NaNs/Infs, and they go away after a few more iterations. Is this all right?
I also noticed that the number of LSTM nodes has an effect on this result. Can anyone explain the proper way to use batch normalization and how to choose the number of nodes in the LSTM layers?
The structure of my model is similar to this post, and I'm just wondering where in the model I should use batch normalization. Is the batch normalization there implemented correctly, or do I need to add it after each LSTM layer?
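For concreteness, here is a simplified sketch of the kind of structure I mean (not the exact model from the linked posts; the input shape and layer sizes are placeholders), with BatchNormalization placed after each LSTM layer:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Simplified sketch: stacked LSTMs with BatchNormalization after each recurrent layer.
# The input shape (timesteps, features) and unit counts are placeholders.
model = models.Sequential([
    layers.Input(shape=(50, 8)),
    layers.LSTM(64, return_sequences=True),
    layers.BatchNormalization(),   # normalizes the per-timestep LSTM outputs
    layers.LSTM(32),
    layers.BatchNormalization(),   # normalizes the final LSTM output
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```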

Related

How to train rebel neurons?

I'm training a pretty basic NN over the Fashion-MNIST dataset. I'm using my own code, which is not important. I use a rather simplified algorithm similar to Adam and a quadratic formula (train_value - real_value)**2 for training and error calculation. I apply a basic backpropagation algorithm for each weight, and I analyse 1/5 of the network weights for each training image. I use only one 128-unit layer, as in the basic TensorFlow beginners' example, plus the input and output layers (the last with softmax and the first normalized to 0-1).
I'm not an expert at all, and I've only been able to train my network up to 77% accuracy over the test set.
As shown in the image below, I noticed that the gradients of the weights for most of my neurons converge to zero after a few epochs. But there are a few remarkable exceptions that just remain rebel (the vertical lines in the first image divide the weights by neuron).
Could you recommend some general techniques to train these rogue neurons without affecting the others?
You could add a constraint to the layer's kernel (the weight matrix in the Dense layer).
With one of those constraints, the weights can be kept within a given, user-defined range.
See: TensorFlow.Keras Constraints
In addition, you can try regularizers to prevent overfitting of the model, which may be indicated by some very large (absolute) weight values.
For that, see for example the L1 or L2 regularizers: TensorFlow.Keras Regularizers
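For example, a minimal sketch combining both ideas (the layer size, norm bounds and regularization factor are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers, constraints, regularizers

# Minimal sketch: a Dense layer whose incoming weight vectors are constrained to a
# norm between 0 and 1, plus an L2 penalty on the kernel. Bounds/factor are arbitrary.
dense = layers.Dense(
    128,
    activation="relu",
    kernel_constraint=constraints.MinMaxNorm(min_value=0.0, max_value=1.0),
    kernel_regularizer=regularizers.l2(1e-4),
)
```

Note that MinMaxNorm constrains the norm of each unit's incoming weight vector rather than clipping individual weights; MaxNorm and NonNeg are other built-in options.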

ML Model Overfits if input data is normalized

Please help me understand why my model overfits if my input data is normalized to [-0.5, 0.5], whereas it does not overfit otherwise.
I am solving a regression ML problem, trying to detect the locations of 4 key points in images. To do that, I import a pretrained ResNet50 and replace its top layer with the following architecture:
Flattening layer right after ResNet
Fully Connected (dense) layer with 256 nodes followed by LeakyRelu activation and Batch Normalization
Another Fully Connected layer with 128 nodes also followed by LeakyRelu and Batch Normalization
Last Fully Connected layer (with 8 nodes) which gives me 8 coordinates (4 Xs and 4 Ys) of the 4 key points.
Since I stick with the Keras framework, I use ImageDataGenerator to produce the flow of data (images). Since the output of my model (8 numbers: 2 coordinates for each of the 4 key points) is normalized to the [-0.5, 0.5] range, I decided that the input to my model (images) should also be in this range, and therefore normalized it to the same range using the preprocessing_function in Keras' ImageDataGenerator.
The problem appeared right after I started training. I had frozen the entire ResNet (training = False) with the goal of first letting the top layers train to a proper degree and only then unfreezing half of ResNet to fine-tune the model. When training with ResNet frozen, I noticed that my model suffers from overfitting after just a couple of epochs. Surprisingly, this happens even though my dataset is quite decent in size (25k images) and Batch Normalization is employed.
What's even more surprising, the problem completely disappears if I move away from input normalization to [-0.5, 0.5] and instead preprocess the images with tf.keras.applications.resnet50.preprocess_input. This preprocessing method DOES NOT normalize the image data and, surprisingly to me, leads to proper model training without any overfitting.
I tried using Dropout with different probabilities and L2 regularization. I also tried to reduce the complexity of my model by reducing the number of top layers and the number of nodes in each top layer. I played with the learning rate and the batch size. Nothing really helped while my input data was normalized, and I have no idea why this happens.
IMPORTANT NOTE: when VGG is employed instead of ResNet everything seems to work well!
I really want to figure out why this happens.
UPD: the problem was caused by 2 reasons:
- batch normalization layers within ResNet didn't work properly when frozen
- image preprocessing for ResNet should be done using Z-score
After two fixes mentioned above, everything seems to work well!
Mentioning the solution below for the benefit of the community.
The problem is resolved by making the changes mentioned below (a code sketch follows):
The Batch Normalization layers within ResNet didn't work properly when frozen, so they should be unfrozen before training the model.
Image preprocessing (normalization) for ResNet should be done using a Z-score instead of the [-0.5, 0.5] scaling applied through the preprocessing_function in Keras' ImageDataGenerator.
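A minimal sketch of those two changes (the input size and the function below are illustrative, not taken from the original code):

```python
import tensorflow as tf

# Fix 1: load ResNet50 without its top and freeze everything EXCEPT the
# BatchNormalization layers, which misbehave when frozen during this first phase.
base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                       input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = isinstance(layer, tf.keras.layers.BatchNormalization)

# Fix 2: Z-score the images instead of scaling them to [-0.5, 0.5].
# Per-image statistics are used here; a dataset-wide mean/std works as well.
def zscore(image):
    image = tf.cast(image, tf.float32)
    return (image - tf.reduce_mean(image)) / (tf.math.reduce_std(image) + 1e-7)
```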

Is there a way to increase the number of units in a dense layer and still be able to load previously saved weights that used a lower number of units?

I am starting to learn how to build neural networks. Here is what I did:
I ran a number of epochs with units in my dense layer at 512.
Then I saved the weights with the best accuracy.
Then I increased the number of units in my dense layer to 1024 and attempted to reload my best weights, which had been saved with the old 512-unit layer.
I got an error. I understand why I got the error, but I am wondering if there is a way to increase the number of units and still be able to use my saved weights, or do I need to retrain my model from the beginning?
In theory you could add more units and initialize the new ones randomly, but that would make the original training worthless. A more common way to increase the complexity of a model while leveraging earlier training is to add more layers and resume training.
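For example, a minimal sketch of that idea (assuming the original 512-unit model was built in Keras and its weights were saved with model.save_weights; the file name and shapes are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Rebuild the original architecture and restore the saved 512-unit weights into it.
pretrained_dense = layers.Dense(512, activation="relu", name="dense_512")
original = models.Sequential([
    layers.Input(shape=(784,)),
    pretrained_dense,
    layers.Dense(10, activation="softmax", name="head"),
])
original.load_weights("best_weights.h5")  # hypothetical path to the saved weights

# Grow the model by stacking a new layer on top of the pretrained one and resume
# training. The pretrained Dense(512) keeps its loaded weights; the new Dense(1024)
# and the new head are randomly initialized (the head's input size has changed).
extended = models.Sequential([
    layers.Input(shape=(784,)),
    pretrained_dense,
    layers.Dense(1024, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
extended.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
```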

Does the squared error depend on the number of hidden layers?

I wanted to know if the squared error also depends on the number of hidden layers and the number of neurons in each hidden layer, because I've created a neural network with one hidden layer but I can't reach a small squared error. Maybe the function is not convex? Can I optimize the weights by adding more hidden layers?
The more neurons (and layers) you add to your model, the better you can approximate arbitrary functions. If the loss on your training data is not decreasing any further, you are underfitting. This can be solved by making the model more complex, i.e. adding more trainable parameters. But be careful not to overdo it and end up overfitting.
Though this is not a programming question, I'll try my best to answer it here.
The squared error, i.e. the 'loss' of your neural network, depends on your network's prediction and the ground truth. As a function of the prediction it is convex by definition, although as a function of the network's weights it generally is not.
The reasons that you're not getting low losses could be:
You're not normalizing your inputs (see the sketch after this list). For example, if your input is a series of house prices, around 500k to 1m, and you don't normalize them, your prediction will be a linear combination of the prices, which is about the same order of magnitude, then passed through the activation function. This can result in large losses.
You're not initializing your weights and biases correctly. Similar to the above, you could have large weights/biases which lead to large prediction values.
You didn't choose a proper activation function. When you're doing classification, your labels are generally one-hot encoded, so your output activation should limit the prediction to [0, 1] or similar, which means relu is not a proper option there. Likewise, you don't want sigmoid as the output activation for regression problems.
Your labels are not predictable or have too much noise. Or maybe your network is not complex enough to capture the important patterns; in that case you could try adding more layers and more nodes per layer.
Your learning rate is too small, which leads to slow convergence.
That's all I have in mind. You probably need more work to find out the reason for your problem.
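As a minimal sketch of the first and third points (synthetic data and placeholder shapes; not the asker's setup):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Synthetic regression data: house-price-like inputs in the 500k-1m range.
X = np.random.uniform(500_000, 1_000_000, size=(1000, 4)).astype("float32")
y = X.mean(axis=1, keepdims=True)

# Point 1: Z-score the inputs so the network never sees huge raw values.
X = (X - X.mean(axis=0)) / X.std(axis=0)
# Scaling the target as well keeps the squared error in a sensible range.
y = (y - y.mean()) / y.std()

# Point 3: relu in the hidden layer, a plain linear output for regression (no sigmoid).
model = models.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
```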

Why does my TensorFlow NN model's predicted values have upper limit?

I have a neural network with three layers. I've tried using tanh and sigmoid functions for my activations, and the output layer is just a simple linear function (I'm trying to model a regression problem).
For some reason my model seems to have a hard cut off where it will never predict a value above some threshold (even though it should). What reason could there be for this?
Here is what predictions from the model look like (with sigmoid activations):
Update:
With relu activations, switching from gradient descent to Adam, and adding L2 regularization, the model predicts the same value for every input...
A linear layer regressing a single value will have outputs of the form
output = bias + sum(kernel * inputs)
If inputs come from a tanh, then -1 <= inputs <= 1, and hence
bias - sum(abs(kernel)) <= output <= bias + sum(abs(kernel))
If you want an unbounded output, consider using an unbounded activation on all intermediate layers, e.g. relu.
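A small sketch illustrating that bound (random, untrained weights; not the asker's model):

```python
import numpy as np
from tensorflow.keras import layers, models

# Sketch: a tanh hidden layer feeding a linear output. Whatever the input is,
# the prediction can never leave [bias - sum(|kernel|), bias + sum(|kernel|)].
model = models.Sequential([
    layers.Input(shape=(10,)),
    layers.Dense(32, activation="tanh"),
    layers.Dense(1, name="out"),  # linear output layer
])
kernel, bias = model.get_layer("out").get_weights()
print("hard output bounds:",
      bias[0] - np.abs(kernel).sum(), "to", bias[0] + np.abs(kernel).sum())

# Swapping the tanh for relu removes the upper bound, since relu is unbounded above.
```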
I think your problem concerns the generalization/expressiveness of the model. Regression is a basic task; there should be no problem with the method itself, only with its execution. @DomJack explained how the output is restricted for a specific set of parameters, but that only matters in anomalous cases; in general, training tunes the parameters so that the output is predicted correctly.
So the first point is about the quality of the training data. Make sure you have enough training data (and that it is split randomly if you split train/test from one dataset). Also, perhaps trivially, make sure you didn't mix up input/output values in preprocessing.
Another point is about the size of the network. Make sure you use a large enough hidden layer.
