Identifying overfitting in a cross-validated SVM when tuning parameters - Python

I have an RBF SVM that I'm tuning with GridSearchCV. How do I tell whether my good results are actually good results or whether the model is overfitting?

Overfitting is generally associated with high variance, meaning that the model parameters that would result from being fitted to some realized data set have a high variance from data set to data set. You collected some data, fit some model, got some parameters ... you do it again and get new data and now your parameters are totally different.
One consequence of this is that in the presence of overfitting, usually the training error (the error from re-running the model directly on the data used to train it) will be very low, or at least low in contrast to the test error (running the model on some previously unused test data).
One diagnostic suggested by Andrew Ng is to set aside some of your data as a test set. Ideally this should be done from the very beginning, so that seeing the model fit results inclusive of this data never has the chance to influence your decisions. But you can also do it after the fact, as long as you say so in your model discussion.
With the test data, you want to compute the same error or loss score that you compute on the training data. If training error is very low, but testing error is unacceptably high, you probably have overfitting.
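For concreteness, here is a minimal sketch of that check, assuming scikit-learn; the SVC settings and the synthetic data are placeholders for your own grid-searched model and data set.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Stand-in data; replace with your own features and labels.
    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Fit the (tuned) RBF SVM on the training portion only.
    model = SVC(kernel="rbf", C=10, gamma="scale").fit(X_train, y_train)

    train_err = 1 - model.score(X_train, y_train)
    test_err = 1 - model.score(X_test, y_test)
    print(f"training error: {train_err:.3f}, testing error: {test_err:.3f}")
    # A training error far below the testing error is the classic overfitting signature.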
Further, you can vary the size of your test data and generate a diagnostic graph. Let's say that you randomly sample 5% of your data, then 10%, then 15% ... on up to 30%. This will give you six different data points showing the resulting training error and testing error.
As you increase the training set size (decrease testing set size), the shape of the two curves can give some insight.
The test error will be decreasing and the training error will be increasing. The two curves should flatten out and converge with some gap between them.
If that gap is large, you are likely dealing with overfitting; it suggests using a large training set and trying to collect more data if possible.
If the gap is small, or if the training error itself is already too large, it suggests that model bias is the problem, and you should consider a different model class altogether.
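One way to produce such a curve with scikit-learn (my assumption; the answer above does not name a library) is learning_curve, which varies the training-set size directly:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import learning_curve
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=600, random_state=0)  # stand-in data

    # Vary the training-set size and record train/test scores at each size.
    sizes, train_scores, test_scores = learning_curve(
        SVC(kernel="rbf"), X, y,
        train_sizes=np.linspace(0.2, 1.0, 6), cv=5, scoring="accuracy")

    print("train error:", 1 - train_scores.mean(axis=1))
    print("test  error:", 1 - test_scores.mean(axis=1))
    # A large persistent gap between the two curves suggests overfitting;
    # curves that converge at a high error suggest bias (underfitting).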
Note that in the above setting, you can also substitute k-fold cross validation for the test set approach. Then, to generate a similar diagnostic curve, you vary the number of folds (hence varying the size of the test sets). For a given value of k, each subset is used once for testing while the other (k-1) subsets are used for training, and the training and testing errors are averaged over each way of assigning the folds. This gives you both a training error and a testing error metric for a given choice of k. As k becomes larger, the training set sizes become bigger (for example, if k=10, then training errors are reported on 90% of the data), so again you can see how the scores vary as a function of training set size.
The downside is that CV scores are already expensive to compute, and repeated CV for many different values of k makes it even worse.
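A sketch of that idea with scikit-learn's cross_validate (my choice of helper, not from the original answer): ask for training scores and sweep k.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_validate
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=600, random_state=0)  # stand-in data

    for k in (2, 3, 5, 10):
        res = cross_validate(SVC(kernel="rbf"), X, y, cv=k, return_train_score=True)
        print(f"k={k:2d}  train error={1 - res['train_score'].mean():.3f}  "
              f"test error={1 - res['test_score'].mean():.3f}")
    # Larger k means larger training splits, so this traces out the same
    # train-vs-test curve as the hold-out approach, at a higher compute cost.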
One other cause of overfitting can be too large of a feature space. In that case, you can try to look at importance scores of each of your features. If you prune out some of the least important features and then re-do the above overfitting diagnostic and observe improvement, it's also some evidence that the problem is overfitting and you may want to use a simpler set of features or a different model class.
On the other hand, if you still have high bias, it suggests the opposite: your model doesn't have enough feature space to adequately account for the variability of the data, so instead you may want to augment the model with even more features.
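As an illustration of the pruning check, here is a rough sketch using permutation importance, which is one way to get importance scores for a kernel SVM (the method choice and all numbers below are my own illustrative assumptions):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import cross_validate, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                               random_state=0)  # stand-in data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = SVC(kernel="rbf").fit(X_tr, y_tr)
    imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

    keep = np.argsort(imp.importances_mean)[-15:]  # keep the 15 most important features

    for name, data in (("all features", X), ("pruned", X[:, keep])):
        res = cross_validate(SVC(kernel="rbf"), data, y, cv=5, return_train_score=True)
        print(f"{name:12s} train={res['train_score'].mean():.3f} "
              f"test={res['test_score'].mean():.3f}")
    # If pruning shrinks the train/test gap, the large feature space was
    # likely contributing to the overfitting.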

Related

Sklearn Pipeline: is there leakage /bias when including scaling in the pipeline?

In machine learning, you split the data into training data and test data.
In cross validation, you split the training data into training sets and a validation set.
"And if scaling is required, at each iteration of the CV, the means and standard deviations of the training sets (not the entire training data), excluding the validation set, are computed and used to scale the validation set, so that the scaling step never includes information from the validation set."
My question is: when I include scaling in the pipeline, at each CV iteration, is the scaling computed from the smaller training sets (excluding the validation set) or from the entire training data (including the validation set)? Because if it computes means and stds from the entire training data, then this will lead to estimation bias in the validation set.
I thought about this, too, and although I think that scaling with the full data leaks some information from training data into validation data, I don't think it's that severe.
On the one hand, you shuffle the data anyway, and you assume that the distributions in all sets are the same, so you expect the means and standard deviations to be the same. (Of course, this only holds in theory, by the law of large numbers.)
On the other hand, even if the means and stds are different, the difference will not be significant.
In my opinion, yes, you might have some bias, but it should be negligible.
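To the concrete question: if the scaler sits inside a scikit-learn Pipeline that is passed to cross_val_score (or GridSearchCV), the whole pipeline is refit on the training folds at each CV iteration, so the means and standard deviations are computed without the validation fold. A minimal sketch:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)  # stand-in data

    # The scaler is refit on each training split; the validation fold is
    # only transformed, never used to compute means/stds.
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    print(cross_val_score(pipe, X, y, cv=5).mean())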

Why does R2-value increase after feature-reduction with RFE?

For an exploratory semester project, I am trying to predict the outcome value of a quality control measurement using various measurements made during production. For the project I was testing different algorithms (LinearRegression, RandomForestRegressor, GradientBoostingRegressor, ...). I generally get rather low R2 values (around 0.3), which is probably due to the scattering of the feature values and is not my real problem here.
Initially, I have around 100 features, which I am trying to reduce using RFE with LinearRegression() as the estimator. Cross validation indicates I should reduce to only 60 features. However, when I do so, for some models the R2 value increases. How is that possible? I was under the impression that adding variables to the model always increases R2, and thus reducing the number of variables should lead to lower R2 values.
Can anyone comment on this or provide an explanation?
Thanks in advance.
It depends on whether you are using the testing or training data to measure R2. This is a measure of how much of the variance of the data your model captures. So, if you increase the number of predictors then you are correct in that you do a better job predicting exactly where the training data lie and thus your R2 should increase (converse is true for decreasing the number of predictors).
However, if you increase number of predictors too much you can overfit to the training data. This means the variance of the model is actually artificially high and thus your predictions on the test set will begin to suffer. Therefore, by reducing the number of predictors you actually might do a better job of predicting the test set data and thus your R2 should increase.
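A rough sketch of that comparison (synthetic data; the numbers are my own illustrative choices): measure R2 on both the training and the test split, with and without RFE.

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=200, n_features=100, n_informative=20,
                           noise=20.0, random_state=0)  # stand-in data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    full = LinearRegression().fit(X_tr, y_tr)
    rfe = RFE(LinearRegression(), n_features_to_select=60).fit(X_tr, y_tr)

    print("all 100 features: train R2 =", round(full.score(X_tr, y_tr), 3),
          " test R2 =", round(full.score(X_te, y_te), 3))
    print("RFE, 60 features: train R2 =", round(rfe.score(X_tr, y_tr), 3),
          " test R2 =", round(rfe.score(X_te, y_te), 3))
    # Training R2 can only go down when predictors are removed, but test R2
    # may go up if the removed predictors were mostly fitting noise.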

Text classification - is it overfitting? How can I prove?

I have a multi-class classification problem and my data consists of sequences of letters. It is labelled data (I used a label encoder to turn the string labels into numeric ones). There can be partial strings for the same class. Many strings match, but some are just slightly different.
I am preparing my data with k-mers and CountVectorizer (fitted on the train data, then used to transform both train and test data). Depending on the combination of k-mer size and n-gram sizes, the dimension (feature size) varies between 8000+ and 35000+. I do not think there is any leak of test information into the training of the model.
I fit different algorithms on the train data and test them to review the generalisation. The test scores (accuracy, f1-score, precision and recall) come out pretty high (more than 99%). Even though this is on the test set, do you think the model could be overfitting due to high dimensionality (curse of dimensionality)? I understand that if the training score is high and the model generalises poorly then it's overfitting, but here the test scores are very high. It is not about the models either, as different algorithms give similar results; it's certainly about the data.
If I apply PCA to get 10 components, which cover 99% of the variance, the test score is high too. If I use SelectKBest to select just the 10 or so best features, then the scores come down.
Really looking for your thoughts on how I can prove that this is not overfitting. Should I always go for a reduced feature set (through selection or PCA) with such a high dimension count? Thanks.
Regards,
Vijay
If your test score is high, then here are the possibilities:
Overlap in test and train data: this can happen if you have duplicate records and, when splitting, one falls into train and the other into test.
Data leak: the class label information is somehow encoded in the features. This is easy to check: train scores will be almost 100% even with basic models. Check this resource to understand what a data leak is.
You really have succeeded in building a good model
I suggest checking the above two possibilities first and then trying out k-fold cross validation.
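For the first possibility, a quick sketch of an overlap check (pandas assumed; the column name and the toy sequences are just placeholders):

    import pandas as pd

    # Stand-ins for your train/test sequence tables.
    df_train = pd.DataFrame({"sequence": ["ACGTAC", "GGTACC", "TTGACA"]})
    df_test = pd.DataFrame({"sequence": ["GGTACC", "CCATGG"]})

    overlap = pd.merge(df_train, df_test, on="sequence", how="inner")
    print(f"{len(overlap)} test records also appear verbatim in the training data")
    # Exact merges miss near-duplicate partial strings, so for sequence data
    # it is worth also checking substring containment between the two sets.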

Is there any rules of thumb for the relation of number of iterations and training size for lightgbm?

When I train a classification model using lightgbm, I usually use validation set and early stopping to determine the number of iterations.
Now I want to combine training and validation set to train a model (so I have more training examples), and use the model to predict the test data, should I change the number of iterations derived from the validation process?
Thanks!
As you said in your comment, this is not comparable to the Deep Learning number of epochs because deep learning is usually stochastic.
With LGBM, all parameters and features being equal, by adding 10% to 15% more training points we can expect the trees to look alike: since you have more information, your split values will be better, but it is unlikely to drastically change your model (this is less true if you use parameters such as bagging_fraction, or if the added points come from a different distribution).
I have seen people multiply the number of iterations by 1.1 (can't find my sources, sorry). Intuitively it makes sense to add some trees, as you potentially add information. Experimentally this value worked well, but the optimal value will depend on your model and data.
For a similar problem in deep learning with Keras, I do it by using an early stopper and cross validation with train and validation data, letting the model optimize itself against the validation data during training.
After each training run, I test the model on the test data and examine the mean accuracies. I also save the stopped_epoch from the early stopper after each run. If the CV scores are satisfying, I take the mean of the stopped epochs, do a full training (including all the data I have) with that mean number of epochs, and save the model.
I'm not aware of a well-established rule of thumb for such an estimate. As Florian has pointed out, sometimes people rescale the number of iterations obtained from early stopping by a factor. If I remember correctly, the factor typically assumes a linear dependence of the optimal number of trees on the data size; i.e., in 10-fold CV this would be a rescaling factor of 1.1. But there is no solid justification for this. As Florian also pointed out, the dependence around the optimum is typically reasonably flat, so a few trees more or less will not have a dramatic effect.
Two suggestions:
do k-fold validation instead of a single train-validation split. This will let you evaluate how stable the estimate of the optimal number of trees is. If it fluctuates a lot between folds, do not rely on such an estimate :)
fix the size of the validation sample and re-train your model with early stopping on a gradually increasing training set. This will let you evaluate how the number of trees depends on the sample size and extrapolate it to the full sample size (a rough sketch of both ideas follows below).
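Putting both answers together, here is a rough sketch with LightGBM's scikit-learn API (the 5-fold setup, the 1.1 factor and the other numbers are illustrative assumptions, not prescriptions): estimate the best iteration per fold with early stopping, check its stability, then retrain on all the data with a mildly rescaled count.

    import numpy as np
    import lightgbm as lgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold

    X, y = make_classification(n_samples=2000, random_state=0)  # stand-in data

    best_iters = []
    for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = lgb.LGBMClassifier(n_estimators=2000)
        model.fit(X[tr], y[tr], eval_set=[(X[va], y[va])],
                  callbacks=[lgb.early_stopping(50, verbose=False)])
        best_iters.append(model.best_iteration_)

    print("best iteration per fold:", best_iters)  # a large spread => don't trust it
    n_final = int(np.mean(best_iters) * 1.1)       # the rescaling heuristic
    final_model = lgb.LGBMClassifier(n_estimators=n_final).fit(X, y)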

How can I test my classifier for overfitting?

I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
Now, I want to make them better. I know from speaking with people that my classifier is 'overfitting' the data; what I am looking for is a solid way to prove this so that the next time I write a classifier I will be able to run a test and see if I am overfitting or underfitting.
What is the best way of doing this? I am open to all suggestion!
I've spent literally weeks googling this topic and found no canonical or trusted ways to do this effectively, so any response will be appreciated. I will be putting a bounty on this question.
Edit:
Let's assume my classifier spits out a .tsv containing:
the website UID<tab>the likelihood it is to be ephemeral or evergreen, 0 being ephemeral, 1 being evergreen<tab>whether the page is ephemeral or evergreen
The simplest way to check your classifier's "efficiency" is to perform a cross validation:
Take your data, let's call it X
Split X into K batches of equal sizes
For each i=1 to K:
Train your classifier on all batches but i'th
Test on i'th
Return the average result
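A minimal sketch of that procedure, assuming scikit-learn and a generic classifier in place of yours:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    X, y = make_classification(n_samples=400, random_state=0)  # stand-in data

    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))  # test on the held-out batch
    print("average CV accuracy:", np.mean(scores))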
One more important aspect: if your classifier uses any parameters, constants, thresholds etc. which are not trained but rather given by the user, you cannot just select the ones giving the best results in the above procedure. Choosing them has to be automated somehow inside the "Train your classifier on all batches but the i'th" step. In other words, you cannot use the testing data to fit any parameters of your model. Once this is done, there are four possible outcomes:
Training error is low, but much lower than testing error - overfitting
Both errors are low - ok
Both errors are high - underfitting
Training error is high but testing is low - error in implementation or very small dataset
There are many ways that people try to handle overfitting:
Cross-validation, you might also see it mentioned as x-validation
see lejlot's post for details
choose a simpler model
linear classifiers have high bias because the model must be linear, but correspondingly low variance in the fitted solution. This means that you wouldn't expect to see much difference in the final model given a large number of random training samples.
Regularization is a common practice to combat overfitting.
It is generally done by adding a term to the minimization function
Typically this term is the sum of squares of the model's weights because it is easy to differentiate.
Generally there is a constant C associated with the regularization term. Tuning this constant will increase or decrease the effect of regularization. A high weight applied to regularization generally helps with overfitting. C should always be greater than or equal to zero. (Note: some training packages apply 1/C as the regularization weight. In this case, the closer C gets to zero, the greater the weight applied to regularization.)
Regardless of the specifics, regularization works by reducing the variance of a model, biasing it toward solutions with small weights.
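As a sketch of the effect of C (scikit-learn assumed, which follows the 1/C convention noted above, so smaller C means stronger regularization):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_validate
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                               random_state=0)  # stand-in data

    for C in (0.01, 0.1, 1, 10, 100):
        res = cross_validate(SVC(kernel="rbf", C=C), X, y, cv=5, return_train_score=True)
        print(f"C={C:<6} train acc={res['train_score'].mean():.3f} "
              f"test acc={res['test_score'].mean():.3f}")
    # If training accuracy is near perfect while test accuracy lags, stronger
    # regularization (smaller C here) is one way to rein in the overfitting.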
Finally, boosting is a method of training that mysteriously/magically tends not to overfit. I'm not sure anyone has fully explained why, but it is a process of combining high-bias, low-variance simple learners into a low-bias model. It's pretty slick.
