I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
Now, I want to make them better. I know from speaking with people that my classifier is 'overfitting' the data; what I am looking for is a solid way to prove this so that the next time I write a classifier I will be able to run a test and see if I am overfitting or underfitting.
What is the best way of doing this? I am open to all suggestions!
I've spent literally weeks googling this topic and found no canonical or trusted ways to do this effectively, so any response will be appreciated. I will be putting a bounty on this question.
Edit:
Let's assume my classifier spits out a .tsv containing:
the website UID<tab>the likelihood that it is evergreen (0 being ephemeral, 1 being evergreen)<tab>whether the page actually is ephemeral or evergreen
The simplest way to check your classifier's "efficiency" is to perform cross-validation:
Take your data; let's call it X
Split X into K batches of equal size
For each i = 1 to K:
Train your classifier on all batches but the i-th
Test on the i-th batch
Return the average result
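A minimal sketch of this procedure with scikit-learn (X, y, and the LogisticRegression stand-in classifier are illustrative assumptions, not the asker's actual model):

```python
# Hypothetical sketch of the K-fold procedure above (scikit-learn).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression  # stand-in classifier

def cross_validated_accuracy(X, y, k=10):
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in kf.split(X):
        clf = LogisticRegression(max_iter=1000)   # fresh model per fold
        clf.fit(X[train_idx], y[train_idx])       # train on all batches but the i-th
        scores.append(clf.score(X[test_idx], y[test_idx]))  # test on the i-th
    return np.mean(scores)                        # return the average result
```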
One more important aspect: if your classifier uses any parameters (constants, thresholds, etc.) that are not trained but are instead supplied by the user, you cannot simply pick the values that give the best results in the above procedure. Their selection has to be automated inside the "train your classifier on all batches but the i-th" step; in other words, you cannot use the testing data to fit any parameters of your model. Once this is done, there are four possible outcomes:
Training error is low but testing error is much higher - overfitting
Both errors are low - ok
Both errors are high - underfitting
Training error is high but testing error is low - error in the implementation or a very small dataset
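One way (a sketch only; the SVC and its parameter grid are arbitrary stand-ins) to keep user-supplied parameters out of the test batches is to nest the tuning inside each training fold, then compare the averaged training and testing errors against the four outcomes above:

```python
# Sketch: nested tuning inside the training folds, plus train/test comparison.
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.svm import SVC

inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=3)
outer = cross_validate(inner, X, y, cv=10, return_train_score=True)

train_err = 1 - outer["train_score"].mean()
test_err = 1 - outer["test_score"].mean()
print(f"train error {train_err:.3f}  vs  test error {test_err:.3f}")
# train << test -> overfitting; both high -> underfitting; both low -> ok
```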
There are many ways that people try to handle overfitting:
Cross-validation, you might also see it mentioned as x-validation
see lejlot's post for details
choose a simpler model
linear classifiers have a high bias because the model must be linear, but lower variance in the optimal solution because of that high bias. This means you wouldn't expect to see much difference in the final model given a large number of random training samples.
Regularization is a common practice to combat overfitting.
It is generally done by adding a term to the minimization function
Typically this term is the sum of squares of the model's weights because it is easy to differentiate.
Generally there is a constant C associated with the regularization term. Tuning this constant will increase or decrease the effect of regularization. A high weight applied to regularization generally helps with overfitting. C should always be greater than or equal to zero. (Note: some training packages apply 1/C as the regularization weight. In this case, the closer C gets to zero, the greater the weight applied to regularization.)
Regardless of the specifics, regularization works by reducing the variance of a model, biasing it toward solutions with a small regularization term.
Finally, boosting is a method of training that mysteriously/magically does not overfit. Not sure if anyone has discovered why, but it is a process of combining high-bias, low-variance simple learners into a low-bias model. It's pretty slick.
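To make the regularization point above concrete, here is a rough sketch of sweeping scikit-learn's LogisticRegression constant C (this library uses the 1/C convention mentioned in the parenthetical, so smaller C means stronger regularization); the grid of values and the names X, y are assumptions:

```python
# Sketch: how the regularization constant shifts train vs. test accuracy.
# scikit-learn's C is the *inverse* regularization strength (1/C convention).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=C, penalty="l2", max_iter=1000)
    res = cross_validate(clf, X, y, cv=5, return_train_score=True)
    print(f"C={C:>7}: train={res['train_score'].mean():.3f} "
          f"test={res['test_score'].mean():.3f}")
# A large train/test gap at large C (weak regularization) is a sign of overfitting.
```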
Related
For an exploratory semester project, I am trying to predict the outcome value of a quality control measurement using various measurements made during production. For the project I was testing different algorithms (LinearRegression, RandomForestRegressor, GradientBoostingRegressor, ...). I generally get rather low R2 values (around 0.3), which is probably due to the scattering of the feature values and is not my real problem here.
Initially I have around 100 features, which I am trying to reduce using RFE with LinearRegression() as the estimator. Cross-validation indicates I should reduce to only 60 features. However, when I do so, the R2 value increases for some models. How is that possible? I was under the impression that adding variables to a model always increases R2, and thus that reducing the number of variables should lead to lower R2 values.
Can anyone comment on this or provide an explanation?
Thanks in advance.
It depends on whether you are using the testing or the training data to measure R2. R2 measures how much of the variance of the data your model captures. So if you increase the number of predictors, you are correct that you do a better job of predicting exactly where the training data lie, and thus your training R2 should increase (the converse is true for decreasing the number of predictors).
However, if you increase the number of predictors too much, you can overfit to the training data. This means the variance of the model is artificially high, and your predictions on the test set will begin to suffer. Therefore, by reducing the number of predictors you might actually do a better job of predicting the test-set data, and thus your R2 on the test set should increase.
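A sketch of what this looks like in practice - training R2 keeps rising as features are added, while cross-validated R2 can peak around some intermediate feature count (the grid of feature counts and the names X, y are assumptions):

```python
# Sketch: training R2 vs. cross-validated R2 as a function of the number of
# features kept by RFE.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

for n_features in [20, 40, 60, 80, 100]:
    pipe = make_pipeline(RFE(LinearRegression(), n_features_to_select=n_features),
                         LinearRegression())
    pipe.fit(X, y)
    train_r2 = pipe.score(X, y)                                     # R2 on training data
    cv_r2 = cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()  # R2 on held-out folds
    print(f"{n_features:>3} features: train R2={train_r2:.3f}, CV R2={cv_r2:.3f}")
```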
When I train a classification model using lightgbm, I usually use validation set and early stopping to determine the number of iterations.
Now I want to combine training and validation set to train a model (so I have more training examples), and use the model to predict the test data, should I change the number of iterations derived from the validation process?
Thanks!
As you said in your comment, this is not comparable to the Deep Learning number of epochs because deep learning is usually stochastic.
With LGBM, all parameters and features being equal, adding 10% to 15% more training points should give trees that look alike: with more information your split values will be better, but it is unlikely to drastically change your model (this is less true if you use parameters such as bagging_fraction, or if the added points come from a different distribution).
I have seen people multiply the number of iterations by 1.1 (can't find my source, sorry). Intuitively it makes sense to add some trees, since you are potentially adding information. Experimentally this value worked well, but the optimal value will depend on your model and data.
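A sketch of that recipe with LightGBM's scikit-learn interface (the 1.1 factor is just the heuristic mentioned above, not a guarantee; X_train, y_train, X_valid, y_valid are assumed to already exist):

```python
# Sketch: find the iteration count with early stopping, then retrain on
# train+valid with ~10% more trees (heuristic factor from the answer above).
import numpy as np
import lightgbm as lgb

model = lgb.LGBMClassifier(n_estimators=10_000, learning_rate=0.05)
model.fit(X_train, y_train,
          eval_set=[(X_valid, y_valid)],
          callbacks=[lgb.early_stopping(stopping_rounds=100)])

best_iter = model.best_iteration_
final = lgb.LGBMClassifier(n_estimators=int(best_iter * 1.1), learning_rate=0.05)
final.fit(np.concatenate([X_train, X_valid]), np.concatenate([y_train, y_valid]))
```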
For a similar problem in deep learning with Keras, I do it by using an early stopper and cross-validation over the training and validation data, letting the model optimize itself against the validation data during training.
After each training run I test the model on the test data and examine the mean accuracies. Meanwhile, after each run I also save the stopped_epoch from the EarlyStopper. If the CV scores are satisfactory, I take the mean of the stopped epochs, do a full training (including all the data I have) for that mean number of epochs, and save the model.
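A rough sketch of that workflow with tf.keras; build_model() is a placeholder for whatever architecture is in use, and X, y are assumed to hold all available training data:

```python
# Sketch: record stopped_epoch across CV folds, then retrain on all data
# for the mean number of epochs. build_model() is hypothetical.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

stopped_epochs = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = build_model()
    es = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                          restore_best_weights=True)
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[valid_idx], y[valid_idx]),
              epochs=500, callbacks=[es], verbose=0)
    stopped_epochs.append(es.stopped_epoch)

final_model = build_model()
final_model.fit(X, y, epochs=int(np.mean(stopped_epochs)), verbose=0)
```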
I'm not aware of a well-established rule of thumb for such an estimate. As Florian pointed out, people sometimes rescale the number of iterations obtained from early stopping by a factor. If I remember correctly, the factor typically assumes that the optimal number of trees depends linearly on the data size; i.e. in 10-fold CV the rescaling factor would be about 1.1, since the full data set is 10/9 of each training fold. But there is no solid justification for this. As Florian also pointed out, the dependence around the optimum is typically reasonably flat, so a few trees more or less will not have a dramatic effect.
Two suggestions:
do k-fold validation instead of a single train-validation split. This will let you evaluate how stable the estimate of the optimal number of trees is; if it fluctuates a lot between folds, do not rely on such an estimate :) (a sketch of this is given after the list)
fix the size of the validation sample and re-train your model with early stopping on a gradually increasing training set. This will let you evaluate how the optimal number of trees depends on the sample size and extrapolate it to the full sample size.
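A sketch of the first suggestion with LightGBM (variable names are assumptions): run early stopping in every fold and look at the spread of the optimal tree counts.

```python
# Sketch: estimate the optimal number of trees per fold to see how stable it is.
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold

best_iters = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m = lgb.LGBMClassifier(n_estimators=10_000)
    m.fit(X[train_idx], y[train_idx],
          eval_set=[(X[valid_idx], y[valid_idx])],
          callbacks=[lgb.early_stopping(stopping_rounds=100)])
    best_iters.append(m.best_iteration_)

print("best iterations per fold:", best_iters,
      "mean:", np.mean(best_iters), "std:", np.std(best_iters))
# A large spread across folds means the estimate is unreliable.
```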
I am training a neural network model, and my model fits the training data well. The training loss decreases stably. Everything works fine. However, when I output the weights of my model, I found that they haven't changed much since random initialization (I didn't use any pretrained weights; all weights are initialized by default in PyTorch). All dimensions of the weights changed by only about 1%, while the accuracy on the training data climbed from 50% to 90%.
What could account for this phenomenon? Is the dimensionality of the weights too high, so that I need to reduce the size of my model? Or are there other possible explanations?
I understand this is a quite broad question, but I think it's impractical for me to show my model and analyze it mathematically here. So I just want to know what could be the general / common cause for this problem.
There are almost always many local optima in a problem, so one thing you cannot say, especially in high-dimensional feature spaces, is which optimum your model parameters will settle into. One important point here is that, because the weights are real-valued, for every set of weights your model computes there are infinitely many other sets corresponding to the same optimum; the ratio of the weights to each other is the only thing that matters, because you are trying to minimize the cost, not to find a unique set of weights with a loss of 0 for every sample. Every time you train you may get a different result depending on the initial weights. When the weights change very little, keeping almost the same ratio to each other, it means your features are highly correlated (i.e. redundant), and since you are getting very high accuracy with only a small change in the weights, the only thing I can think of is that your data set's classes are far apart. Try removing features one at a time, train, and check the results; if the accuracy is still good, continue removing features until you hopefully reach a 3- or 2-dimensional space in which you can plot your data, visualize how the data points are distributed, and make some sense of it.
EDIT: A better approach is to use PCA for dimensionality reduction instead of removing features one by one.
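A minimal sketch of that edit, assuming the inputs sit in a feature matrix X with labels y:

```python
# Sketch: project the features to 2 dimensions with PCA and plot them by class.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="coolwarm", s=10)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.show()
```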
I'm observing the following patterns in a one-layer CNN, binary classification model:
Training loss decreases while dev loss increases with the number of steps
Training accuracy increases while dev accuracy decreases with the number of steps
Based on past SO questions and literature review, it seems that these patterns are indicative of over-fitting (the model performs well in training, but cannot generalize to new examples).
The graphs below illustrate the loss and accuracy with respect to the number of steps in training.
In both,
The orange line represents the summary of the dev set performance.
The blue line represents the summary of the training set performance.
Loss: [graph omitted]
Accuracy: [graph omitted]
Traditional remedies I've considered, and my observations about them:
Adding L2 regularization: I've tried many coefficients of L2 regularization, from 0.0 to 4.5; all of these tests yield a similar pattern by the 5,000th step in both loss and accuracy.
Cross-validation: It seems that the role of cross-validation is widely misunderstood online. As this answer states, cross-validation is for model checking, not model building. Indeed, cross-validation would be a way to check whether the model generalizes well. And in fact, the graphs I show are from one fold of a 4-fold cross-validation. If I observe a similar pattern in the loss/accuracy in all the folds, what other insight does cross-validation offer beyond the confirmation that the model does not generalize well?
Early stopping: This would seem the most intuitive, but the loss graph seems to indicate that the loss levels out only after a divergence in the dev-set loss is observed; the starting point of this early stop, then, doesn't seem easy to decide.
Data: The amount of labeled data I have available is limited, so training on more data is not an option right now.
All this said, what I am asking is:
If the patterns observed in the loss and accuracy are indeed indicative of over-fitting, are there any other methods to counteract over-fitting that I haven't considered?
If these patterns are not indicative of over-fitting, what else could they mean?
Thanks -- any insight would be much appreciated.
I think that you are totally on the right track. Looks like classic over-fitting.
One option is adding dropout, if you don't already have it. It falls into the category of regularization, but it is more commonly used now than L1 and L2 regularization.
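For illustration, a sketch of a one-layer CNN with dropout in PyTorch; the question does not say which framework or layer sizes are used, so everything here is a made-up stand-in:

```python
# Sketch: a one-layer sequence CNN with dropout added before the classifier.
# Framework (PyTorch) and all sizes are illustrative, not from the question.
import torch.nn as nn

class OneLayerCNN(nn.Module):
    def __init__(self, in_channels=128, n_filters=100, kernel_size=5, p_drop=0.5):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, n_filters, kernel_size)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.drop = nn.Dropout(p_drop)     # randomly zeroes activations during training
        self.fc = nn.Linear(n_filters, 2)  # binary classification

    def forward(self, x):                  # x: (batch, in_channels, seq_len)
        h = self.pool(self.conv(x).relu()).squeeze(-1)
        return self.fc(self.drop(h))
```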
Changing the model architecture could get better results, but it's hard to say what specifically would be best. It could help to make it deeper, with more layers and possibly some pooling layers. It will likely still overfit, but you might get a higher accuracy on the dev set before that happens.
Getting more data may be one of the best things you could do. If you can't get more data you can try to augment the data. You can also try cleaning the data to remove noise which can help prevent the model from fitting to noise.
You may ultimately want to try setting up a hyperparameter optimization search. This, however, can be slow for neural nets, which take a while to train. Make sure you set aside a test set before hyperparameter tuning.
Can someone explain why does the random_state parameter affects the model so much?
I have a RandomForestClassifier model and want to set the random_state (for reproducibility purposes), but depending on the value I use I get very different values of my overall evaluation metric (F1 score).
For example, I tried to fit the same model with 100 different random_state values, and after training and testing the smallest F1 was 0.64516129 and the largest 0.808823529. That is a huge difference.
This behaviour also seems to make it very hard to compare two models.
Thoughts?
If the random_state affects your results, it means that your model has high variance. In the case of Random Forest, this simply means that the forest is too small and you should increase the number of trees (which, thanks to bagging, reduces variance). In scikit-learn this is controlled by the n_estimators parameter in the constructor.
Why does this happen? Every ML method tries to minimize the error, which from a mathematical perspective can usually be decomposed into bias and variance [+ noise] (see the bias-variance dilemma/tradeoff). Bias is simply how far from the true values your model has to end up in expectation; this part of the error usually comes from prior assumptions, such as using a linear model for a nonlinear problem. Variance is how much your results differ when you train on different subsets of the data (or use different hyperparameters, and in the case of randomized methods the random seed is such a parameter). Hyperparameters are set by us, while parameters are learnt by the model itself during training. Finally, noise is the irreducible error coming from the problem itself (or the data representation). Thus, in your case, you have simply encountered a model with high variance; decision trees are well known for their extremely high variance (and small bias). To reduce the variance, Breiman proposed the specific bagging method known today as Random Forest. The larger the forest, the stronger the variance reduction; in particular, a forest with 1 tree has huge variance, while a forest of 1000 trees is nearly deterministic for moderate-sized problems.
To sum up, what can you do?
Increase the number of trees - this has to work, and it is a well-understood and justified method (see the sketch after this list).
Treat the random seed (random_state) as a hyperparameter during your evaluation, because that is exactly what it is - meta-knowledge you need to fix beforehand if you do not wish to increase the size of the forest.
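A sketch of how to check the first point (variable names and grids are assumptions): measure the spread of F1 across random_state values as n_estimators grows; the spread should shrink with a larger forest.

```python
# Sketch: the F1 spread across random_state values should shrink as n_estimators grows.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

for n_trees in [1, 10, 100, 1000]:
    scores = []
    for seed in range(20):
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        clf.fit(X_train, y_train)
        scores.append(f1_score(y_test, clf.predict(X_test)))
    print(f"n_estimators={n_trees:>4}: F1 range "
          f"{min(scores):.3f}-{max(scores):.3f} (std {np.std(scores):.3f})")
```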