Strategies for solving overfitting - other options? - python

I am building a predictive model where I want to know whether I can predict if a package will be delivered on time (binary yes/no). In the event that the package is not delivered on time, I wish to be able to predict by when it will be delivered, in categories of <7 days, <14 days, <21 days, >28 days after the expected date.
I have built and tested a model for the binary classification and got an F-score of 0.92, which is satisfactory for my needs. However, when I train my categorical model, I start to see training accuracy and validation accuracy diverge (training accuracy is much better than validation accuracy). This is a sign of overfitting.
However, I have tried regularization with different values, plus dropout with different values, and the validation accuracy never gets above 0.7. My total training set is ~10k examples with ~3k for validation, and whilst the categorical spread is not equal, there are sufficient examples of each category (I think). I am using a NN and have increased/decreased both the layers and the activations, and still no joy.
Any thoughts on where to go next? Thanks.

Since you are using a NN, introduce dropout layers and see if they help to reduce the overfitting problem. Also check out this question: How to choose the number of hidden layers and nodes in a feedforward neural network?
The more complex the network (more hidden layers, more neurons in them), the more it contributes to the overfitting problem.
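As a rough illustration of this suggestion, a minimal Keras sketch with dropout between the dense layers might look as follows; the layer sizes, dropout rates, feature count, and the four delay categories are assumptions, not the asker's actual architecture:

# Minimal sketch: Dropout between dense layers, combined with a small L2 penalty.
# Layer sizes, rates, feature count and the four-class output are illustrative assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2

num_features = 20          # assumed input width
num_categories = 4         # assumed: <7, <14, <21, >28 days

model = Sequential([
    Dense(64, activation='relu', kernel_regularizer=l2(1e-4),
          input_shape=(num_features,)),
    Dropout(0.3),          # randomly silences 30% of units during training
    Dense(32, activation='relu', kernel_regularizer=l2(1e-4)),
    Dropout(0.3),
    Dense(num_categories, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])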

The approach we have chosen is to carry out a linear regression with the expected duration as the target variable. We excluded some outliers and then took the differences between the actual and predicted days. We then took the max and min of those differences, so we now have a prediction with a tolerable range. We will keep working on the other techniques to see if we can improve. Thanks to everyone who suggested ideas.
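For anyone wanting to reproduce something like this, a minimal scikit-learn sketch of the residual-range idea could look as follows; the synthetic data, the 99th-percentile outlier rule, and the variable names are illustrative assumptions, not the actual pipeline:

# Sketch of the residual-range approach described above: regress the delivery
# duration, then use the spread of the residuals as a tolerance band.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                      # placeholder shipment features
y = 10 + X @ rng.normal(size=5) + rng.normal(scale=2, size=1000)    # placeholder durations (days)

mask = y < np.percentile(y, 99)                 # assumed crude outlier exclusion
reg = LinearRegression().fit(X[mask], y[mask])

residuals = y[mask] - reg.predict(X[mask])      # actual minus predicted days
lo, hi = residuals.min(), residuals.max()       # tolerance band around a prediction

pred = reg.predict(X[:1])                       # score a new shipment
print(f"expected delivery in {pred[0] + lo:.1f} to {pred[0] + hi:.1f} days")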

Related

What causes neural network accuracy to sharply increase after only one epoch?

I'm using a relatively simple neural network with fully connected layers in keras. For some reason, the accuracy drastically increases basically to its final value after only one training epoch (likewise, the loss sharply decreases). I've tried architectures with larger and smaller numbers of hidden layers too. This network also performs poorly on the testing data, so I am trying to find a more optimal architecture or improve my training set accordingly.
It is trained on a set of 6500 1D array-like data, and I'm using a batch size of 512.
As said by Murilo, hard to say much without more information but it can come from multiple things:
Your network learns through the batches of each epoch, meaning that your ~12 batches (6500/512) are already enough to learn a good bit of classification.
Your weights are not really well initialized, and produce a huge loss for the first epoch. The massive decrease in the loss is actually the solver 'squishing' the weights. The best explanation I found for this comes from A. Karpathy in his 'MakeMore' tutorial: https://youtu.be/P6sfmUTpUmc?t=260
Now, this sudden decrease of the loss is not extreme here (from 0.5 to 0.2), so I would not worry much. I agree with Murilo that low accuracy in validation can come from too few samples in your validation set, or from bad shuffling between the train and validation sets.
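As a small, assumed illustration of the initialization point (in the spirit of the Karpathy video linked above), the expected initial cross-entropy loss for a roughly uniform K-class classifier can be sanity-checked like this; num_classes is a placeholder:

# For a K-class softmax classifier with roughly uniform initial predictions, the
# starting cross-entropy loss should be close to -ln(1/K). A much larger
# first-epoch loss hints that the output layer's initial weights/biases are badly scaled.
import numpy as np

num_classes = 10            # placeholder; use your own class count
expected_initial_loss = -np.log(1.0 / num_classes)
print(f"expected initial loss for {num_classes} classes: {expected_initial_loss:.3f}")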

Model doesn't learn from data

We have a dataset with ~40000 data points, each having 160 features. We know nothing about what each feature represents, but they are integers from 0 to 5, most probably some kind of rankings. Our task is to take a subset of those features, let's say (40000, 30), and predict the initial (40000, 160) data. In other words, we need to create a model that takes 30 features as input and outputs the full set of 160 features.
An example of the dataset: https://i.stack.imgur.com/Ko6nR.png
What we have done so far: we trained an ANN with the following architecture:
30 -> 200 -> 150 -> 163
We calculate an accuracy score by rounding the prediction (let's say the model predicted 3.6 for a target of 4; 3.6 rounds to 4, and 4 == 4, so it counts as True).
We got ~52% accuracy and nothing makes it go higher.
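For reference, a minimal, assumed reimplementation of that rounding-based accuracy (not the original code) might look like this:

# A predicted value counts as correct when it rounds to the integer target.
import numpy as np

y_true = np.array([[4, 2, 0], [1, 5, 3]])                  # placeholder 0-5 targets
y_pred = np.array([[3.6, 2.2, 0.4], [0.9, 4.4, 3.1]])      # placeholder model outputs

accuracy = np.mean(np.round(y_pred) == y_true)             # fraction of matching entries
print(accuracy)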
So, the problem is a multi-output regression problem. The prediction is made from 30 discrete numeric features. Normalization was done with both min-max scaling and standardization (the target is also normalized). In the model, we tried different numbers of layers with different capacities, tried batch norm, different activations (ReLU is used now, with no activation on the output layer), different losses (MSE is the current one), and different optimizers (Adam is the current one). Both Keras and PyTorch were used in case something was wrong with the PyTorch implementation.
Still, the accuracy remains at 50-52%. There is one straightforward observation: when we increase the model capacity (the number of parameters), the model should become more prone to overfitting. Yet even after increasing the capacity very substantially, we couldn't make the model overfit the data. We tried to use the features separately (for example, predicting one feature from another) with nothing useful. We also tried to predict 1 feature from the other 159, but again got ~52% and even less.
What I understand and can conclude from this is that there is no relationship between those ratings, and most of them can't predict the others. What do you think about this case?

In TensorFlow, why does the same dropout value of 0.8, run with the Adam optimizer for 50 epochs, give a different accuracy each time I run it?

I am building an ANN as below:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten

model = Sequential()
model.add(Flatten(input_shape=(25,)))
model.add(Dense(25, activation='relu'))
model.add(Dropout(0.8))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.8))
model.add(Dense(5, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(xtraindata, ytraindata, epochs=50)
test_loss, test_acc = model.evaluate(xtestdata, ytestdata)
print(test_acc)
I am adding different features to the model and checking whether each newly added feature decreases or increases the accuracy. The problem is that each time I run this code with the same values I get a different accuracy, sometimes as low as 0.50. I have a few doubts; kindly answer them:
Is the model giving a different accuracy each time because, with dropout regularization, nodes are dropped at random, so on each run different nodes get silenced, thereby giving different accuracies, i.e. sometimes low and sometimes high?
How can I trust the accuracy of the model if it gives a different accuracy each time? How can I know whether the feature I added has decreased or increased the accuracy?
If I get high accuracy and want to reproduce these results, how do I save the parameters that the model has used?
Great questions. Answers:
I think your theory is largely right: the dropout layers introduce randomness into every training run (and so does the random weight initialization, unless you fix a seed), so they are the likely culprits. Try removing the dropout layers, leaving everything else fixed, and run multiple times. Check if the accuracy is the same.
Cross validation. This article explains how it works, but the gist is that it is a statistical technique that trains and checks the accuracy of multiple runs of your model, all with different slices of data. The average accuracy of all runs is used. So highs and lows will be averaged to a true(ish) accuracy. That being said, if your model has inconsistent results by just varying dropout, it's an indicator that when you move the model to production and use real data, it will perform poorly.
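As a rough illustration of this point, a manual k-fold loop around the network from the question might look like the sketch below; the build_model() helper is an assumption that simply recreates the compiled model shown above, and xtraindata/ytraindata are assumed to be NumPy arrays:

# Sketch of 5-fold cross-validation: train a fresh copy of the model on each fold
# and average the held-out accuracies.
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten

def build_model():
    model = Sequential()
    model.add(Flatten(input_shape=(25,)))
    model.add(Dense(25, activation='relu'))
    model.add(Dropout(0.8))
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.8))
    model.add(Dense(5, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(xtraindata):
    model = build_model()                                   # fresh weights for every fold
    model.fit(xtraindata[train_idx], ytraindata[train_idx], epochs=50, verbose=0)
    _, acc = model.evaluate(xtraindata[val_idx], ytraindata[val_idx], verbose=0)
    scores.append(acc)

print(np.mean(scores), np.std(scores))                      # average accuracy and its spread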
The Keras API has a method model.save("model_name") to save models, and you can use keras.models.load_model("model_name") to get it back. As I said in point 2, though: if your model is so finicky that individual training runs drastically affect the accuracy, then even if you train it and get good accuracy once, it probably won't be useful on new data. So when you say "If I get high accuracy and wanted to reproduce these results", you really shouldn't be thinking along those lines. Instead, try to get consistently high training accuracy.
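For completeness, here is a minimal, assumed sketch of both reproducibility pieces: fixing the random seeds so repeated runs match, and saving/reloading a trained model. The tiny stand-in network and the file name are placeholders, not the asker's actual model:

# Fix the seeds so weight initialisation and dropout masks repeat across runs,
# then save the model and reload it later.
import random
import numpy as np
import tensorflow as tf

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

model = tf.keras.Sequential([                               # tiny stand-in model
    tf.keras.layers.Dense(16, activation='relu', input_shape=(25,)),
    tf.keras.layers.Dropout(0.8),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.save("my_model.h5")                                   # persist architecture + weights
restored = tf.keras.models.load_model("my_model.h5")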

Why does the validation accuracy not increase in a normal way over the epochs?

I'm trying transfer learning with a VGG16 model pretrained on ImageNet, on a dataset of retinal images, but I'm confused because I get a graph like this. I don't know why the validation accuracy doesn't increase in a normal way over the epochs, like the training accuracy does. Is it an indication of overfitting? If yes, how can I overcome it?
My first suggestion would be to use a ResNet (a network that contains residual connections) as a first step towards improvement, in order to avoid the vanishing gradient problem.
VGGs have become less used and are no longer that relevant for benchmarking. What you should use instead is a ResNet50, which is available in tensorflow.keras.applications alongside other relevant pre-trained neural networks.
Also, the validation accuracy fluctuates a lot; in addition to the improvement mentioned above, you may want to recheck the construction of your training and validation sets.
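A hedged sketch of what that ResNet50 transfer-learning setup could look like follows; the image size, the binary classification head, and the frozen-backbone strategy are assumptions rather than details from the question:

# Load an ImageNet-pretrained ResNet50 backbone, freeze it, and add a small head.
import tensorflow as tf
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                                      # freeze the pre-trained backbone first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid'),         # assumed binary retinal label
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])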

training loss decreases while dev loss increases

I'm observing the following patterns in a one-layer CNN, binary classification model:
Training loss decreases while dev loss increases with the number of steps.
Training accuracy increases while dev accuracy decreases with the number of steps.
Based on past SO questions and literature review, it seems that these patterns are indicative of over-fitting (the model performs well in training, but cannot generalize to new examples).
The graphs below illustrate the loss and accuracy with respect to the number of steps in training.
In both,
The orange line represents the summary of the dev set performance.
The blue line represents the summary of the training set performance.
Loss: [graph omitted]
Accuracy: [graph omitted]
Traditional remedies I've considered, and my observations about them:
Adding L2 regularization: I've tried many coefficients of L2 regularization, from 0.0 to 4.5; all of these tests yield a similar pattern by the 5,000th step in both loss and accuracy.
Cross-validation: It seems that the role of cross-validation is widely misunderstood online. As this answer states, cross-validation is for model checking, not model building. Indeed, cross-validation is a way to check whether the model generalizes well. And in fact, the graphs I show are from one fold of a 4-fold cross-validation. If I observe a similar pattern in the loss/accuracy in all the folds, what other insight does cross-validation offer beyond confirming that the model does not generalize well?
Early stopping: This would seem the most intuitive option, but the loss graph seems to indicate that the loss levels out only after a divergence in the dev set loss is observed; the starting point of this early stop, then, doesn't seem easy to decide.
Data: The amount of labeled data I have available is limited, so training on more data is not an option right now.
All this said, what I am asking is:
If the patterns observed in the loss and accuracy are indeed indicative of over-fitting, are there any other methods to counteract over-fitting that I haven't considered?
If these patterns are not indicative of over-fitting, what else could they mean?
Thanks -- any insight would be much appreciated.
I think that you are totally on the right track. It looks like classic overfitting.
One option is adding dropout if you don't already have it. It falls into the category of regularization, but it is more commonly used now than L1 and L2 regularization.
Changing the model architecture could get better results, but it's hard to say what specifically would be best. It could help to make it deeper with more layers and possibly some pooling layers. It will likely still overfit, but you might get a higher accuracy on the dev set before that happens.
Getting more data may be one of the best things you could do. If you can't get more data, you can try to augment the data. You can also try cleaning the data to remove noise, which can help prevent the model from fitting to noise.
You may ultimately want to try setting up a hyperparameter optimization search. This, however, can take a while with neural nets, which take a while to train. Make sure you set aside a test set before hyperparameter tuning.
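As a rough, assumed sketch of two of the ideas raised in this thread (dropout in a small 1D CNN, plus early stopping that restores the best weights so the stopping point doesn't have to be picked by hand); the input shape, filter sizes, and patience are placeholders:

# Small 1D CNN with dropout, trained with early stopping on the dev loss.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(64, kernel_size=5, activation='relu', input_shape=(100, 1)),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True)

# model.fit(x_train, y_train, validation_data=(x_dev, y_dev),
#           epochs=100, callbacks=[early_stop])             # x_train etc. are placeholders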
