Strange pattern in convolutional neural network performance - python

I am learning about VGG, and I was struck by the following performance graph:
My question is this: Looking at the graph, it seems that there is rapid growth at first, which then gradually slows down. This makes sense to me, since it becomes more difficult to improve a model the smaller the loss gets. However, there are also three sudden drops, around the 50-, 75-, and 100-epoch marks. I am curious why all the models experience this drop and rebound at the same time. What is causing it?
Thank you in advance for any help.

This is a common observation in complex model training. For instance, classic CNNs exhibit this behaviour: AlexNet and GoogLeNet each show two of these drop-and-improve interruptions in their training curves. It is a very complex and organic effect of the model's holistic learning characteristics.
To oversimplify ... there are learning bottlenecks inherent in most models, even when the topology appears to be smooth. The model learns for a while, with the later layers adapting well during back-prop ... until that learning bumps into one of the bottlenecks: some interference between input drive and feedback that tends to stall further real progress in training the earlier layers. This points to a few false assumptions in the learning of those lower layers, assumptions which now run into statistical realities in the upper layers.
The natural operation of the training process forces some early-layer chaos back into the somewhat-stable late layers -- sort of an organic drop-out effect, although less random. Some of the "learned" kernels in the late layers prove to be incorrect and get their weights re-scrambled. As a result of this drop-out, the model gets briefly less accurate, but soon learns better than before, as seen in the graph.
I know of no way to predict when and how this will happen with a given topology. My personal hope is that it turns out to be some sort of harmonic resonance inherent in the topology, something like audio resonance in a closed space, or the spots/stripes on many animals.

Why do we drop different nodes for each training example in Dropout Regularisation?

In the Deep Learning course, Prof. Ng explains how to implement Dropout Regularisation.
Implementation by Prof. Ng:
The first image shows the implementation of dropout and the generation of the d3 matrix, i.e. the dropout matrix for the third layer, with shape (#nodes, #training examples).
As per my understanding, D3 would look like this for a single iteration and keeps changing with every iteration (here I've taken 5 nodes and 10 training examples).
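For reference, here is a minimal NumPy sketch of that step as I understand it (keep_prob and the activation values are placeholders, not the course's exact numbers):

```python
import numpy as np

keep_prob = 0.8                    # probability of keeping a node
a3 = np.random.randn(5, 10)        # layer-3 activations: 5 nodes x 10 training examples (toy values)

# D3 has the same shape as a3, so every column (training example) gets its own random mask
D3 = (np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob).astype(float)

a3 = a3 * D3                       # zero out the dropped nodes, independently per example
a3 = a3 / keep_prob                # inverted dropout: rescale so the expected activation is unchanged
```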
Query: One thing I didn't get is why we need to drop different nodes for each training example. Why can't we keep the dropped nodes the same for all examples and then randomly drop again in the next iteration? For example, in the second image the 2nd training example passes through 4 nodes while the first one passes through all nodes. Why not the same nodes for all examples?
We choose different nodes within the same batch for the simple reason that it gives slightly better results. This is the rationale for virtually any choice in a neural network. As to why it gives better results ...
If you have many batches per epoch, you will notice very little difference, if any. The reason we have any dropout at all is that models sometimes exhibit superstitious learning (in the statistical sense): if several examples in an early batch of training data just happen to share a particularly strong correlation of some sort, then the model will learn that correlation early on (primacy effect) and will take a long time to un-learn it. For instance, students doing the canonical dogs-vs-cats exercise will often use their own data set. Some students find that the model learns to identify anything on a soft chair as a cat, and anything on a lawn as a dog -- because those are common places to photograph each type of pet.
Now, imagine that your shuffling algorithm brings up several such photos in the first three batches. Your model will learn this correlation. It will take quite a few counter-examples (cat in the yard, dog on the furniture) to counteract the original assumption. Dropout disables one or another of the "learned" nodes, allowing others to be more strongly trained.
The broad concept is that a valid correlation will be re-learned easily; an invalid one is likely to disappear. This is the same concept as in repeating other experiments. For instance, if a particular experiment shows significance at p < .05 (a typical standard of scientific acceptance), then there is no more than a 5% chance that the correlation could have happened by chance, rather than the functional connection we hope to find. This is considered to be confident enough to move forward.
However, it's not certain enough to make large, sweeping changes in thousands of lives -- say, with a new drug. In those cases, we repeat the experiment enough times to achieve the social confidence desired. If we achieve the same result in a second, independent experiment, then instead of a 1/20 chance of being mistaken, we have a 1/20 × 1/20 = 1/400 chance.
The same idea applies to training a model: dropout reduces the chance that we've learned something that isn't really true.
Back to the original question: why drop out per example rather than per iteration? Statistically, we do a little better if we drop out more frequently, but for shorter spans. If we drop the same nodes for several iterations, that slightly increases the chance that the active nodes will learn a mistake while those few nodes are inactive.
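To make the two options concrete, here is a small NumPy sketch (the shapes and keep_prob are just illustrative):

```python
import numpy as np

keep_prob = 0.8
n_nodes, n_examples = 5, 10

# Standard dropout: an independent mask per training example (one column per example)
per_example_mask = np.random.rand(n_nodes, n_examples) < keep_prob

# The alternative from the question: draw one mask per iteration and share it
# across every example in the batch
shared_mask = np.broadcast_to(np.random.rand(n_nodes, 1) < keep_prob,
                              (n_nodes, n_examples))

# With the shared mask, the same nodes stay silent for the whole iteration,
# so a spurious correlation learned by the active nodes goes uncorrected for longer.
```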
Do note that this is mostly ex post facto rationale: it wasn't predicted in advance; the explanation came only after the false-training effect was discovered, a solution was found, and the mechanics of that solution were studied.

What does it mean if my network can never overfit no matter how much I train it or expand its capacity?

I trained a model and got decent results, but then I got greedy and wanted even more accuracy, so I trained the model for longer, and longer, and longer, but to no avail: nothing happens! According to theory, at some point the validation accuracy must start to decrease after too much training (the loss starts to INCREASE)! But this never seems to happen. So I figured maybe the NN is too simple to ever be able to overfit, so I increased its capacity and ended up with millions of parameters, and I trained it for 10,000 epochs, but still no overfitting happens.
The same question was asked here, but the answers there are anything but satisfying.
What does that mean?
It is a known thing with high-capacity models. They are surprisingly resistant to overfitting, which contradicts classical statistical learning theory, which says that without explicit regularization you are going to overfit. For example, this paper says:
"most of deep neural networks with learned parameters often generalize very well empirically, even equipped with much more effective parameters than the number of training samples, i.e. high capacity... Thus, statistical learning theory cannot explain the generalization ability of deep learning models."
Also, this paper and this one talk about it. You can keep following the references in these papers to read more.
Personally, I have never seen a high-capacity model overfit, even after training for tens of thousands of epochs. If you want an example that does overfit: take LeNet-5 for CIFAR-10, with ReLU activations and without dropout, and train it using SGD with a learning rate of 0.01. The number of trainable parameters in this model is ~60,000, which is about the same as the number of samples in CIFAR-10 (a low-capacity model). After at most 500-1000 epochs you are going to see very clear overfitting, with the loss and error increasing over time.
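A rough Keras sketch of that experiment (the exact LeNet-5 filter counts and dense sizes here are my assumptions; the essential ingredients are ReLU, no dropout, and plain SGD at 0.01):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# LeNet-5-style network: two conv/pool stages followed by three dense layers,
# ReLU activations and no dropout, so it is free to overfit
model = models.Sequential([
    layers.Conv2D(6, 5, activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(16, 5, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(120, activation="relu"),
    layers.Dense(84, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train long enough to watch the validation loss turn around (500-1000 epochs as above)
model.fit(x_train, y_train, epochs=1000, batch_size=128,
          validation_data=(x_test, y_test))
```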

How do I better process my data and set parameters for my Neural Network?

When I run my NN, the only way to get any training to occur is if I divide X by 1000. The network also needs to be trained fewer than 70,000 times with a 0.03 learning rate; if those values are larger the NN gets worse. I think this is due to bad processing of the data and maybe the lack of biases, but I don't really know.
Code on Google Colab
In short: all of the problems you mentioned and more.
Scaling is essential, typically to 0 mean and a variance of 1; otherwise you will quickly saturate the hidden units, their gradients will be near zero, and (almost) no learning will be possible (see the short sketch at the end of this answer).
Bias is mandatory for such an ANN. It's like the offset when fitting a linear function: if you drop it, getting a good fit will be very difficult.
You seem to be checking accuracy on your training data.
You have very few training samples.
Sigmoid has proven to be a poor choice. Use ReLU and check e.g. here for an explanation.
Also, I'd recommend spending some time on learning Python before going into this. For starters, avoid using global; it can cause unforeseen behaviour if you're not careful.
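For the scaling point, a minimal sketch, assuming X is a plain (samples, features) NumPy array (the values below are placeholders):

```python
import numpy as np

# X is assumed to be your (n_samples, n_features) input array
X = np.array([[1200.0, 3.0],
              [ 850.0, 2.0],
              [2100.0, 4.0]])

# Standardize each feature to zero mean and unit variance so the hidden
# units are not pushed straight into saturation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Equivalent with scikit-learn:
# from sklearn.preprocessing import StandardScaler
# X_scaled = StandardScaler().fit_transform(X)
```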

Dataset with only values (0, 1, -1) with LSTM or CNN is giving 50% accuracy whereas RF, SVM, ELM, Neural networks are giving above 90%

I have a dataset with 11k instances containing 0s, 1s, and -1s. I heard that deep learning can be applied to feature values, so I applied it to my dataset, but surprisingly it resulted in lower accuracy (<50%) compared to traditional machine learning algorithms (RF, SVM, ELM). Is it appropriate to apply deep learning algorithms to feature values for a classification task? Any suggestion is greatly appreciated.
First of all, Deep Learning isn't a mythical hammer you can throw at every problem and expect better results. It requires careful analysis of your problem, choosing the right method, crafting your network, and properly setting up your training; only then, with a lot of luck, will you see significantly better results than classical methods.
From what you describe (and without any more details about your implementation), it seems to me that there could have been several things going wrong:
Your task is simply not designed for a neural network. Some tasks are still better solved with classical methods, since they manually account for patterns in your data, or distill your advanced reasoning/knowledge into a prediction. You might not be directly aware of it, but sometimes neural networks are just overkill.
You don't describe how your 11000 instances are distributed with respect to the target classes, how big the input is, what kind of preprocessing you are performing for either method, etc, etc. Maybe your data is simply processed wrong, your training is diverging due to unfortunate parameter setups, or plenty of other things.
To expect a reasonable answer, you would have to share at least a bit of code regarding the implementation of your task, and parameters you are using for training.

How to study the effect of each data on a deep neural network model?

I'm working on training a neural network model using Python and the Keras library.
My model's test accuracy is very low (60.0%) and I have tried a lot to raise it, but I couldn't. I'm using the DEAP dataset (32 participants in total) to train the model. The splitting technique I'm using is a fixed one: 28 participants for training, 2 for validation, and 2 for testing.
The model I'm using is as follows:
sequential model
Optimizer = Adam
With L2_regularizer, Gaussian noise, dropout, and Batch normalization
Number of hidden layers = 3
Activation = relu
Compile loss = categorical_crossentropy
initializer = he_normal
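Roughly, the model looks like this (a sketch only; the layer widths, noise level, dropout rate, and input/output sizes below are placeholders, not my exact values):

```python
from tensorflow.keras import layers, models, regularizers, optimizers

n_features, n_classes = 40, 4        # placeholders, not the real DEAP dimensions

model = models.Sequential()
model.add(layers.GaussianNoise(0.1, input_shape=(n_features,)))
for units in (128, 64, 32):          # three hidden layers
    model.add(layers.Dense(units,
                           activation="relu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=regularizers.l2(1e-4)))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.3))
model.add(layers.Dense(n_classes, activation="softmax"))

model.compile(optimizer=optimizers.Adam(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```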
Now I'm using a train-test technique (also a fixed one) to split the data, and I got better results. However, I figured out that some of the participants are affecting the training accuracy in a negative way. So I want to know: is there a way to study the effect of each piece of data (each participant) on the accuracy (performance) of the model?
Best Regards,
From my Starting deep learning hands-on: image classification on CIFAR-10 tutorial, in which I insist on keeping track of both:
global metrics (log-loss, accuracy),
examples (correctly and incorrectly classified cases).
The latter may help us tell which kinds of patterns are problematic, and on numerous occasions it has helped me change the network (or supplement the training data, where that was the issue).
An example of how this works (here with Neptune, though you can do it manually in a Jupyter Notebook or using the TensorBoard image channel): log the misclassified cases during training, and then look at particular examples along with their predicted probabilities.
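You can get the same kind of per-example view manually; a minimal sketch, assuming a trained Keras classifier `model`, test inputs `x_test`, and integer labels `y_true` (take `argmax(axis=1)` first if your labels are one-hot):

```python
import numpy as np

# Assumed to already exist: model (trained Keras classifier), x_test, y_true (integer labels)
proba = model.predict(x_test)                 # predicted probabilities, shape (n, n_classes)
pred = proba.argmax(axis=1)
wrong = np.where(pred != y_true)[0]           # indices of misclassified examples

# Look at the most confidently wrong predictions first
order = wrong[np.argsort(-proba[wrong, pred[wrong]])]
for i in order[:10]:
    print(f"example {i}: true={y_true[i]}, predicted={pred[i]}, p={proba[i, pred[i]]:.2f}")
```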
Full disclaimer: I collaborate with deepsense.ai, the creators of Neptune - Machine Learning Lab.
This is, perhaps, more broad an answer than you may like, but I hope it'll be useful nevertheless.
Neural networks are great. I like them. But the vast majority of top-performance, hyper-tuned models are ensembles; use a combination of stats-on-crack techniques, neural networks among them. One of the main reasons for this is that some techniques handle some situations better. In your case, you've run into a situation for which I'd recommend exploring alternative techniques.
In the case of outliers, rigorous value analyses are the first line of defense. You might also consider using principle component analysis or linear discriminant analysis. You could also try to chase them out with density estimation or nearest neighbors. There are many other techniques for handling outliers, and hopefully you'll find the tools I've pointed to easily implemented (with help from their docs); sklearn tends to readily accept data prepared for Keras.
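As one hedged example of the density/nearest-neighbour idea, scikit-learn's LocalOutlierFactor can flag suspicious rows in the same feature matrix you feed to Keras (the data below is a random placeholder):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# X is assumed to be the (n_samples, n_features) array you would feed to Keras
X = np.random.randn(200, 16)                   # placeholder data

lof = LocalOutlierFactor(n_neighbors=20)       # density-based outlier detector
labels = lof.fit_predict(X)                    # -1 = likely outlier, 1 = inlier

outlier_idx = np.where(labels == -1)[0]
print(f"{len(outlier_idx)} suspected outliers at rows {outlier_idx[:10]}")
```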
