Sklearn Pipeline: is there leakage/bias when including scaling in the pipeline? - python

In machine learning, you split the data into training data and test data.
In cross-validation, you split the training data into training sets and a validation set at each iteration.
"And if scaling is required, at each iteration of the CV, the means and standard deviations of the training sets (not the entire training data), excluding the validation set, are computed and used to scale the validation set, so that the scaling never includes information from the validation set."
My question is: when I include scaling in the pipeline, at each CV iteration, is scaling computed from the smaller training set (excluding the validation set) or from the entire training data (including the validation set)? Because if it computes the means and standard deviations from the entire training data, this will lead to estimation bias in the validation set.

I thought about this, too, and although I think that scaling with the full training data leaks some information between the training and validation data, I don't think it's that severe.
On the one hand, you shuffle the data anyway and assume that the distributions in all sets are the same, so you expect the means and standard deviations to be roughly the same. (Of course, this only holds in theory, by the law of large numbers.)
On the other hand, even if the means and standard deviations differ, the difference will usually not be significant.
In my opinion, yes, you might have some bias, but it should be negligible.
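For reference, a minimal sketch of the setup being discussed. When the scaler is a step in a Pipeline passed to cross_val_score (or GridSearchCV), scikit-learn clones and refits the whole pipeline on each training fold, so the scaler's means and standard deviations are computed without the validation fold. The dataset and classifier below are just illustrative choices:

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative data; any feature matrix X and labels y would do.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Scaler and classifier bundled into one estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The pipeline is refit per fold, so StandardScaler never sees the validation fold.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())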


Text classification - is it overfitting? How can I prove?

I have a multi-class classification problem and my data involves sequences of letters. It is labelled data (I used a label encoder to encode string labels to numeric). There could be partial strings for the same class. Many strings match, but some could be just slightly different.
I am preparing my data with k-mers and CountVectorizer (fitted on the train data and used to transform both train and test data). Depending on the combination of k-mer size and n-gram size, the dimension (feature size) varies between 8000+ and 35000+. I do not think there is a leak of test information into the training of the model.
I fit different algorithms on the train data and test them to review generalisation. The test scores (accuracy, F1-score, precision and recall) come out pretty high (more than 99%). Even though this is on the test set, do you think the model could be overfitting due to high dimensionality (curse of dimensionality)? I understand that if the training score is high and the model generalises poorly then it's overfitting, but here the test scores are very high. It's not about the models, since different algorithms give similar results; it's certainly about the data.
If I apply PCA to get 10 components, which cover 99% of the variance, the test score is high too. If I use SelectKBest to select just about the 10 best features, then the scores come down.
I'm really looking for your thoughts on how I can show that this is not overfitting. Should I always go for a reduced feature size (through selection or PCA) with such a high dimension? Thanks.
Regards,
Vijay
If your test score is high, then below are the possibilities:
Overlap in test and train data: this can happen if you have duplicate records and, while splitting, one falls into train and the other into test.
Data leak: the class label information is somehow encoded in the features. This is easy to check: train scores are almost 100% even with basic models. Check this resource to understand what a data leak is.
You really have succeeded in building a good model.
I suggest checking the above two possibilities first and then trying out K-fold cross-validation.
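As a quick check for the first possibility, one can look for identical records appearing in both splits. A minimal sketch, assuming the raw sequences live in pandas DataFrames named train_df and test_df with a column called "sequence" (all names are hypothetical):

import pandas as pd

# Hypothetical frames; substitute your own splits and column name.
train_df = pd.DataFrame({"sequence": ["ACGT", "ACGG", "TTGA"]})
test_df = pd.DataFrame({"sequence": ["ACGT", "GGGA"]})

# Test rows whose sequence also occurs in the training data.
overlap = test_df[test_df["sequence"].isin(train_df["sequence"])]
print(f"{len(overlap)} of {len(test_df)} test rows also appear in train")

If the overlap is substantial, the high test scores may simply reflect memorised duplicates rather than generalisation.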

Are there any rules of thumb for the relation between the number of iterations and training size for lightgbm?

When I train a classification model using lightgbm, I usually use a validation set and early stopping to determine the number of iterations.
Now I want to combine the training and validation sets to train a model (so I have more training examples) and use it to predict the test data. Should I change the number of iterations derived from the validation process?
Thanks!
As you said in your comment, this is not comparable to the number of epochs in deep learning, because deep learning training is usually stochastic.
With LGBM, all parameters and features being equal, adding 10% to 15% more training points means we can expect the trees to look alike: as you have more information your split values will be better, but it is unlikely to drastically change your model (this is less true if you use parameters such as bagging_fraction or if the added points come from a different distribution).
I have seen people multiply the number of iterations by 1.1 (can't find my sources, sorry). Intuitively it makes sense to add some trees as you potentially add information. Experimentally this value worked well, but the optimal value will depend on your model and data.
For a similar problem in deep learning with Keras, I use an early stopper and cross-validation with train and validation data, and let the model optimize itself against the validation data during training.
After each training I test the model on the test data and examine the mean accuracies. I also save the stopped_epoch from the EarlyStopping callback after each training. If the CV scores are satisfactory, I take the mean of the stopped epochs, do a full training (including all the data I have) for that mean number of epochs, and save the model.
I'm not aware of a well-established rule of thumb for such an estimate. As Florian has pointed out, sometimes people rescale the number of iterations obtained from early stopping by a factor. If I remember correctly, the factor typically assumes a linear dependence between the data size and the optimal number of trees, i.e. in 10-fold CV this would be a rescaling factor of 1.1. But there is no solid justification for this. As Florian also pointed out, the dependence around the optimum is typically reasonably flat, so a few trees more or less will not have a dramatic effect.
Two suggestions:
do k-fold validation instead of a single train-validation split. This will allow you to evaluate how stable the estimate of the optimal number of trees is. If it fluctuates a lot between folds, do not rely on such an estimate :)
fix the size of the validation sample and re-train your model with early stopping on a gradually increasing training set. This will allow you to evaluate the dependence of the number of trees on the sample size and extrapolate it to the full sample size.
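A minimal sketch of the early-stopping-then-rescale workflow described above. The dataset, learning rate and the 1.1 factor are illustrative assumptions rather than recommendations, and the early_stopping callback requires a reasonably recent LightGBM version:

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative data; replace with your own features and labels.
X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

# Step 1: find the number of iterations via early stopping on a validation set.
model = lgb.LGBMClassifier(n_estimators=10000, learning_rate=0.05)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(stopping_rounds=50)])
best_iter = model.best_iteration_

# Step 2: retrain on train + validation with ~10% extra trees (heuristic only).
final_model = lgb.LGBMClassifier(n_estimators=int(best_iter * 1.1),
                                 learning_rate=0.05)
final_model.fit(X, y)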

I need help understanding train_test_split?

from sklearn.model_selection import train_test_split
I have a question about the train_test_split function from sklearn. First, why do we split the data??? And where do we get the testing data from? Do we just chop the data in half and use some of it to train and some of it to test?? That doesn't make sense, since the data is already filled. If it is filled, then what are we predicting now?? I need help!
First, why do we split the data???
We split the data to isolate a portion of it for validation purposes. We use the non-isolated portion to fit the algorithm, and then test the fitted algorithm against the isolated portion.
where do we get the testing data from?
The testing data is actually part of your original dataset.
Do we just chop the data in half
We usually chop at the 20-40% mark.
If it is filled, then what are we predicting now??
You are not actually trying to predict the result directly. You are training the algorithm to fit the training set and using the testing set to see how accurate the algorithm is.
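A minimal sketch of the split itself; the 20% test size and the toy dataset are just illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Toy dataset purely for illustration.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)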
Your original dataset should be split up into training and testing data. For example, 80% of the data could be used for training and 20% for testing. The data is split so that there is held-out data on which to evaluate how well the model performs on unseen data.
Training: This data is used to build your model, e.g. finding the optimal coefficients in a Linear Regression model, or using the CART algorithm to create a Decision Tree.
Testing: This data is used to see how the model performs on unseen data, as it would in a real-world situation. This data should be left completely unseen until you would like to test your model to evaluate performance.
Extra notes on validation data:
To tune your model (for example, finding the best max_depth value for a decision tree), the training data should also be split into training and validation portions. K-fold cross-validation can be used here. The model is trained on the training portion and evaluated on the validation portion, and this is performed multiple times with cross-validation.
The results (e.g. MSE, F1, etc.) can then be evaluated on each fold and used to tune the hyperparameters. Using cross-validation to tune the hyperparameters ensures that the model is not being tuned against the test data.
Once the model is tuned, it can then be applied to the test data.
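A minimal sketch of that workflow: tune max_depth with cross-validation on the training portion, and only then touch the test set. The dataset and parameter grid are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation over the training data only.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 4, 5]}, cv=5)
search.fit(X_train, y_train)

# The test set is used exactly once, after tuning is finished.
print(search.best_params_, search.score(X_test, y_test))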

How to do GridSearchCV with train and test being different datasets?

I would like to find the best parameters for a RandomForest classifier (with scikit-learn) so that it generalises well to other datasets (which may not be i.i.d.).
I was thinking of doing a grid search using the whole training dataset while evaluating the scoring function on the other datasets.
Is there an easy way to do this in python/scikit-learn?
I don't think you can evaluate on a different data set. The whole idea behind GridSearchCV is that it splits your training set into n folds, trains on n-1 of those folds and evaluates on the remaining one, repeating the procedure until every fold has been "the odd one out". This saves you from having to set aside a separate validation set, so you can simply use a training set and a testing set.
If you can, you may simply merge the two datasets and perform GridSearchCV; this helps ensure generalization to the other dataset. If you are talking about generalization to future, unknown datasets, then this might not work, because there isn't a perfect dataset from which we can train a perfect model.
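For what it's worth, a minimal sketch of the usual compromise: tune with cross-validated GridSearchCV on the first dataset, then score the refitted best estimator on the second dataset. Both datasets below are synthetic stand-ins:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins for the two datasets.
X_a, y_a = make_classification(n_samples=1000, n_features=20, random_state=0)
X_b, y_b = make_classification(n_samples=500, n_features=20, random_state=1)

# Cross-validated tuning uses only dataset A.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"n_estimators": [100, 300],
                                  "max_depth": [None, 5, 10]}, cv=5)
search.fit(X_a, y_a)

# The refitted best estimator is then checked on dataset B.
print("CV score on A:", search.best_score_)
print("Score on B:", search.score(X_b, y_b))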

Identifying overfitting in a cross validated SVM when tuning parameters

I have an RBF SVM that I'm tuning with GridSearchCV. How do I tell whether my good results are actually good results or whether they are overfitting?
Overfitting is generally associated with high variance, meaning that the model parameters that would result from being fitted to some realized data set have a high variance from data set to data set. You collected some data, fit some model, got some parameters ... you do it again and get new data and now your parameters are totally different.
One consequence of this is that in the presence of overfitting, usually the training error (the error from re-running the model directly on the data used to train it) will be very low, or at least low in contrast to the test error (running the model on some previously unused test data).
One diagnostic suggested by Andrew Ng is to separate some of your data into a testing set. Ideally this should have been done from the very beginning, so that seeing model-fit results that include this data never has the chance to influence your decisions. But you can also do it after the fact, as long as you say so in your model discussion.
With the test data, you want to compute the same error or loss score that you compute on the training data. If training error is very low, but testing error is unacceptably high, you probably have overfitting.
Further, you can vary the size of your test data and generate a diagnostic graph. Let's say that you randomly sample 5% of your data, then 10%, then 15% ... on up to 30%. This will give you six different data points showing the resulting training error and testing error.
As you increase the training set size (decrease testing set size), the shape of the two curves can give some insight.
The test error will be decreasing and the training error will be increasing. The two curves should flatten out and converge with some gap between them.
If that gap is large, you are likely dealing with overfitting, and it suggests using a large training set and trying to collect more data if possible.
If the gap is small, or if the training error itself is already too large, it suggests model bias is the problem, and you should consider a different model class altogether.
Note that in the above setting you can also substitute k-fold cross-validation for the test-set approach. Then, to generate a similar diagnostic curve, you vary the number of folds (hence varying the size of the test sets). For a given value of k, each subset is used in turn for testing while the other (k-1) subsets are used for training, and the errors are averaged over the ways of assigning the folds. This gives you both a training-error and a testing-error metric for a given choice of k. As k becomes larger, the training set sizes become bigger (for example, if k=10, then training errors are reported on 90% of the data), so again you can see how the scores vary as a function of training set size.
The downside is that CV scores are already expensive to compute, and repeated CV for many different values of k makes it even worse.
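A minimal sketch of such a diagnostic using scikit-learn's learning_curve, which refits the model on increasing training-set sizes and reports both training and cross-validated test scores; the dataset and SVM settings are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Train/test scores at several training-set sizes, each estimated with 5-fold CV.
sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="rbf", C=1.0, gamma="scale"), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    # A persistent large gap between tr and te hints at overfitting.
    print(f"n={n:4d}  train={tr:.3f}  test={te:.3f}")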
One other cause of overfitting can be too large a feature space. In that case, you can try looking at the importance scores for each of your features. If you prune out some of the least important features, then re-do the above overfitting diagnostic and observe improvement, that is also some evidence that the problem is overfitting, and you may want to use a simpler set of features or a different model class.
On the other hand, if you still have high bias, it suggests the opposite: your model doesn't have enough feature space to adequately account for the variability of the data, so instead you may want to augment the model with even more features.
