I trained a model using CNN,
The results are,
Accuracy
Loss
I read from fast.ai, the experts say, a good model has the val_loss is slightly greater than the loss.
My model is different in points, So, Can I take this model as good or I need to train again...
Thank you
it seems that maybe your validation set is "too easy" for the CNN you trained or doesn't represent your problem enough. it's difficult to say witch the amount of information you provided.
I would say, your validation set is not properly chosen. or you could try using crossvalidation to get more insights.
First of all as already pointed out I'd check your data setup: How many data points do the training and the test (validation) data sets contain? And have they been properly chosen?
Additionally, this is an indicator that your model might be underfitting the training data. Generally, you are looking for the "sweet spot" regarding the trade off between predicting your training data well and not overfitting (i.e. doing bad on the test data):
If accuracy on the training data is not high then you might be underfitting (left hand side of the green dashed line). And doing worse on the training data than on the test data is an indication for that (although this case is not being shown in the plot). Therefore, I'd check what happens to your accuracy if you increase the model complexity and fit it better to the training data (moving towards the 'sweet spot' in the illustrative graph).
In contrast if your test data accuracy was low you might be in an overfitting situation (right hand side the green dashed line).
Related
I am performing multi-class text classification using BERT in python. The dataset that I am using for retraining my model is highly imbalanced. Now, I am very clear that the class imbalance leads to a poor model and one should balance the training set by undersampling, oversampling, etc. before model training.
However, it is also a fact that the distribution of the training set should be similar to the distribution of the production data.
Now, if I am sure that the data thrown at me in the production environment will also be imbalanced, i.e., the samples to be classified will likely belong to one or more classes as compared to some other classes, should I balance my training set?
OR
Should I keep the training set as it is as I know that the distribution of the training set is similar to the distribution of data that I will encounter in the production?
Please give me some ideas, or provide some blogs or papers for understanding this problem.
Class imbalance is not a problem by itself, the problem is too few minority class' samples make it harder to describe its statistical distribution, which is especially true for high-dimensional data (and BERT embeddings have 768 dimensions IIRC).
Additionally, logistic function tends to underestimate the probability of rare events (see e.g. https://gking.harvard.edu/files/gking/files/0s.pdf for the mechanics), which can be offset by selecting a classification threshold as well as resampling.
There's quite a few discussions on CrossValidated regarding this (like https://stats.stackexchange.com/questions/357466). TL;DR:
while too few class' samples may degrade the prediction quality, resampling is not guaranteed to give an overall improvement; at least, there's no universal recipe to a perfect resampling proportion, you'll have to test it out for yourself;
however, real life tasks often weigh classification errors unequally: resampling may help improving certain class' metrics at the cost of overall accuracy. Same applies to classification threshold selection however.
This depends on the goal of your classification:
Do you want a high probability that a random sample is classified correctly? -> Do not balance your training set.
Do you want a high probability that a random sample from a rare class is classified correctly? -> balance your training set or apply weighting during training increasing the weights for rare classes.
For example in web applications seen by clients, it is important that most samples are classified correctly, disregarding rare classes, whereas in the case of anomaly detection/classification, it is very important that rare classes are classified correctly.
Keep in mind that a highly imbalanced dataset tends to always predicting the majority class, therefore increasing the number or weights of rare classes can be a good idea, even without perfectly balancing the training set..
P(label | sample) is not the same as P(label).
P(label | sample) is your training goal.
In the case of gradient-based learning with mini-batches on models with large parameter space, rare labels have a small footprint on the model training. So, your model fits in P(label).
To avoid fitting to P(label), you can balance batches.
Overall batches of an epoch, data looks like an up-sampled minority class. The goal is to get a better loss function that its gradients move parameters toward a better classification goal.
UPDATE
I don't have any proof to show this here. It is perhaps not an accurate statement. With enough training data (with respect to the complexity of features) and enough training steps you may not need balancing. But most language tasks are quite complex and there is not enough data for training. That was the situation I imagined in the statements above.
I am learning Convolution Neural Network now and practicing it on kaggle digit recognizer (MNIST) dataset.
While training the data, I noticed that inspite of initial gradually growing accuracy, in between there was a huge jump i.e from 0.8984 to 0.9814.
As a beginner, I want to investigate what does this jump really show about my model. Here is the image of the epochs:
enter image description here
I have circled the jump in yellow. Thanks in advance!
As the loss gradually starts to decrease, this create an impact on fitting of the model. The cost function makes the loss go down, which directly creates an impact on the fitting of model. Better the fitting of model into training data, better the accuracy (which we can easily see as the accuracy increases with the reduction in loss). There is almost a difference of 0.08 in your consecutive loss function which is enough for the model to fit more from the current state.
Now as the model progresses, we try it on the testing dataset because the real world data is nothing like the data we trained it on.
However, a higher accuracy might not always be good as the model is considered to be over-evaluated which is also known as overfitting which means the model is performing too well that it can't handle any little changes. Therefore, a correct balance between learning rate and epochs are required in order to predict the classes correctly. It also depends on the architecture, Optimizing function which make sure the oscillations are low and numerous other things.
I'm training a fusion network. I get now these kinds of fluctuations in the training accuracy. I'm not entirely sure what that means. I've read a few answers where it says that it is because the boundaries between the categories are switching and it causes the model to swap between classifying some items correctly. I'm now expanding the training set as well with some augmented data but still quite worried about this.
My training set is fairly small - only 348 images. This is what is worrying me that it might be a sign of overfitting. The loss on the training set continues to go down and I stop training when the validation loss flattens using early stopping. The results are quite good on the test set but I'm just worried about this fluctuation.
Edit: The model minimizes categorical cross entropy using the Adam optimizer.
Edit 2: Loss
Edit 3: After adding the augmented images to the training set/testing set/validation set it smooths out:
Pictures before and after augmentation, wondering why first picture includes so many fluctuations.
Most of my code is based on this article and the issue I'm asking about is evident there, but also in my own testing. It is a sequential model with LSTM layers.
Here is a plotted prediction over real data from a model that was trained with around 20 small data sets for one epoch.
Here is another plot but this time with a model trained on more data for 10 epochs.
What causes this and how can I fix it? Also that first link I sent shows the same result at the bottom - 1 epoch does great and 3500 epochs is terrible.
Furthermore, when I run a training session for the higher data count but with only 1 epoch, I get identical results to the second plot.
What could be causing this issue?
A few questions:
Is this graph for training data or validation data?
Do you consider it better because:
The graph seems cool?
You actually have a better "loss" value?
If so, was it training loss?
Or validation loss?
Cool graph
The early graph seems interesting, indeed, but take a close look at it:
I clearly see huge predicted valleys where the expected data should be a peak
Is this really better? It sounds like a random wave that is completely out of phase, meaning that a straight line would indeed represent a better loss than this.
Take a look a the "training loss", this is what can surely tell you if your model is better or not.
If this is the case and your model isn't reaching the desired output, then you should probably make a more capable model (more layers, more units, a different method, etc.). But be aware that many datasets are simply too random to be learned, no matter how good the model.
Overfitting - Training loss gets better, but validation loss gets worse
In case you actually have a better training loss. Ok, so your model is indeed getting better.
Are you plotting training data? - Then this straight line is actually better than a wave out of phase
Are you plotting validation data?
What is happening with the validation loss? Better or worse?
If your "validation" loss is getting worse, your model is overfitting. It's memorizing the training data instead of learning generally. You need a less capable model, or a lot of "dropout".
Often, there is an optimal point where the validation loss stops going down, while the training loss keeps going down. This is the point to stop training if you're overfitting. Read about the EarlyStopping callback in keras documentation.
Bad learning rate - Training loss is going up indefinitely
If your training loss is going up, then you've got a real problem there, either a bug, a badly prepared calculation somewhere if you're using custom layers, or simply a learning rate that is too big.
Reduce the learning rate (divide it by 10, or 100), create and compile a "new" model and restart training.
Another problem?
Then you need to detail your question properly.
I'm observing the following patterns in a one-layer CNN, binary classification model:
Training loss decreases while dev loss increased with number of steps
Training accuracy increases while dev accuracy decreases with number of steps
Based on past SO questions and literature review, it seems that these patterns are indicative of over-fitting (the model performs well in training, but cannot generalize to new examples).
The graphs below illustrate the loss and accuracy with respect to the number of steps in training.
In both,
The orange line represents the summary of the dev set performance.
The blue line represents the summary of the training set performance.
Loss:
Accuracy:
Traditional remedies I've considered, and my observations about them:
Adding L2 Regularization : I've tried many coefficients of L2 regularization -- from 0.0 to 4.5; all of these tests yield a similar pattern by the 5,000th step in both loss and accuracy.
Cross validation : It seems that the role of cross-validation is widely mis-understood online. As this answer states, cross-validation is for model checking, not model building. Indeed, cross-validation would be a way to check if the model generalizes well. And actually, the graphs I show are from one fold of a 4-fold cross-validation. If I observe a similar pattern in the loss/accuracy in all the folds, what other insight does cross-validation offer other than the confirmation that the model does not generalize well?
Early stopping : This would seem the most intuitive, but the loss graph seems to indicate that the loss levels out only after a divergence in the dev set loss is observed; the starting point of this early stop, then, doesn't seem easy to decide.
Data : The amount of labeled data I have available is limited, so training on more data is not an option right now.
All this said, what I am asking is:
If the patterns observed in the loss and accuracy are indeed indicative of over-fitting, are there any other methods to counteract over-fitting that I haven't considered?
If these patterns are not indicative of over-fitting, what else could they mean?
Thanks -- any insight would be much appreciated.
I think that you are totally on the right track. Looks like classic over-fitting.
One option is adding dropout if you don't already have it. It falls into the category of regularization, but it is more commonly used now then L1 and L2 regularization.
Changing the model architcture could get better results but it's hard to say what specifically would be best. It could help to make it deeper with more layers and possibly some pooling layers. It will likely still overfit but you might get a higher accuracy on the dev set before that happens.
Getting more data may be one of the best things you could do. If you can't get more data you can try to augment the data. You can also try cleaning the data to remove noise which can help prevent the model from fitting to noise.
You may ultimately want to try setting up a hyperparameter optimization search. This, however, can take a while on neural nets which take a while to train. Make sure you remove a test set before hyper parameter tuning.