I would like to find the best parameters for a RandomForest classifier (with scikit-learn) in a way that it generalises well to other datasets (which may not be iid).
I was thinking doing grid search using the whole training dataset while evaluating the scoring function on other datasets.
Is there an easy way to do this in python/scikit-learn?
I don't think you can evaluate on a different data set. The whole idea behind GridSearchCV is that it splits your training set into n folds, trains on n-1 of those folds and evaluates on the remaining one, repeating the procedure until every fold has been "the odd one out". This keeps you from having to set apart a specific validation set and you can simply use a training and a testing set.
If you can, you may simply merge the two datasets and perform GridSearchCV; this helps ensure the model generalizes to the other dataset. If you are talking about generalization to future unknown datasets, then this might not work, because there is no perfect dataset from which we can train a perfect model.
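A minimal sketch of that idea, assuming the two datasets are pandas DataFrames called df_a and df_b with a label column named target (all of these names are placeholders):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Merge the two datasets so the search sees examples from both distributions
merged = pd.concat([df_a, df_b], ignore_index=True)
X, y = merged.drop(columns="target"), merged["target"]

# Candidate hyper-parameters (illustrative values only)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)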
I'm a little bit confused about how GridSearchCV works with Train Test Split.
As far as I know, when creating models for the dataset I'm using, a paper used roc-auc.
I'm trying to replicate what this paper did, at least as well as I can. From reading a few other posts here, I've gathered that running GridSearchCV on the entire dataset is prone to overfitting, so we should split the data into a training partition and a testing partition. Then we should run GridSearchCV on the training partition with whatever model and parameters, fit it, and then get a score using the test part of the dataset we set aside.
Now, where I'm confused is with GridSearchCV. As far as I understand, it gives us scores for each of the folds that the data is split into while searching for parameters, and using best_score_ we can pull the best of these scores. I don't understand what the scores represent, or why you can pass in a scoring parameter to begin with, since the job of GridSearchCV is to always find the best possible parameters anyway. (Perhaps I'm making a poor assumption here, but I'm assuming that there is an objective best set of parameters, regardless of scoring method.) What I figured was that I would find the best parameters with GridSearchCV, then use said parameters to create and fit a model, and finally take that model and the partition I saved for testing and evaluate it using the roc-auc scoring method.
So in the end, does it matter (if at all) which scoring method I pass into GridSearchCV, since it will always look for the best set of parameters anyway, which I will then use to compute my final score on the testing partition?
This document may help.
Here you see that the scoring parameter allows you to choose from various metrics, such as roc_auc. See here for all of scikit-learn's metrics.
Optimizing over different metrics results in different optimal parameters. Just think about optimizing precision versus recall: optimizing precision leads to fewer false positives, while optimizing recall leads to fewer false negatives.
Also, in GridSearchCV, the CV stands for cross-validated. Train/test splitting happens inside this function; it's taken care of for you. You only have to provide the splitter as an argument to GridSearchCV, for example cv=StratifiedKFold(n_splits=5, shuffle=True).
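Putting those pieces together, here is a rough sketch, assuming X and y already hold your features and labels, and using a random forest and grid purely as placeholders:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score

# Hold out a test partition first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tune on the training partition, optimizing ROC AUC across stratified folds
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

# Final estimate on the untouched test partition, using the same metric as the paper
y_score = search.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_score))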
I'm really confused about when I should apply MinMaxScaler, or scaling in general, to my dataset: before splitting with train_test_split, or after splitting into training and validation sets (then fitting and transforming on X_train and only transforming X_test)? I'm really confused about what the general rule of thumb is for applying this...
There is no rule of thumb, but generally, you definitely want to scale before splitting your dataset.
I will tell you why. Logically, it would be confusing (for the model) at prediction time if you have a dataset and split it into three subsets without making sure that the subsets have more or less the same distribution. Now, if you are sure that the three subsets have the same distribution and are not imbalanced (which is hard to prove), then I see no problem with splitting before or after. I would never recommend it, though.
Hence, the logical choice is to scale first and then do the splitting. An alternative is to use StandardScaler instead of MinMaxScaler. I find it less sensitive and, in my experience, it is in most cases a better choice than MinMaxScaler.
You want to fit the scaler on the training set. After that, you scale the test and validation sets using the same statistics (i.e. the training mean and variance) before inference.
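A short sketch of that pattern, assuming X and y already exist; the same fit-on-train, transform-everything approach applies to MinMaxScaler and StandardScaler alike:

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Learn the min/max (or mean/variance for StandardScaler) from the training data only...
scaler = MinMaxScaler().fit(X_train)

# ...then reuse those statistics to transform both partitions
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)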
I have a flight delay dataset and am trying to split it into train and test sets before sampling. On-time cases are about 80% of the total data and delayed cases are about 20%.
Normally in machine learning the ratio of train to test set size is 8:2, but this data is quite imbalanced. So, in the extreme case, most of the train data could be on-time cases and most of the test data delayed cases, and accuracy would be poor.
So my question is: how can I properly split an imbalanced dataset into train and test sets?
Just playing with the ratio of train to test will probably not get you correct predictions and results.
If you are working with an imbalanced dataset, you should try a re-sampling technique to get better results. With imbalanced datasets the classifier tends to always "predict" the most common class without performing any real analysis of the features.
Also, use a different metric for performance measurement, such as the F1 score, in the case of an imbalanced dataset.
Please go through the links below; they will give you more clarity.
What is the correct procedure to split the Data sets for classification problem?
Cleveland heart disease dataset - can’t describe the class
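As a small illustration of the metric point, assuming y_test and y_pred already exist from your classifier:

from sklearn.metrics import f1_score, classification_report

# Accuracy is misleading at an 80/20 class ratio; F1 balances precision and recall
print(f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))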
Start from 50/50 and keep changing the split: 60/40, 70/30, 80/20, 90/10. Record all the results and draw your conclusion from them. In one of my projects on flight delay prediction, I used a 60/40 split and got 86.8% accuracy using an MLP neural network.
There are two approaches that you can take.
A simple one: no preprocessing of the dataset but careful sampling of the dataset so that both classes are represented in the same proportion in the test and train subsets. You can do it by splitting by class first and then randomly sampling from both sets.
from sklearn.model_selection import train_test_split
import pandas as pd

# Split the data by class first (assumes pandas objects, with dataY holding class labels 0 and 1)
XclassA, YclassA = dataX[dataY == 0], dataY[dataY == 0]
XclassB, YclassB = dataX[dataY == 1], dataY[dataY == 1]

# Sample each class separately so both keep the same 80/20 train/test proportions
XclassA_train, XclassA_test, YclassA_train, YclassA_test = train_test_split(XclassA, YclassA, test_size=0.2, random_state=42)
XclassB_train, XclassB_test, YclassB_train, YclassB_test = train_test_split(XclassB, YclassB, test_size=0.2, random_state=42)

# Recombine the per-class splits
Xclass_train = pd.concat([XclassA_train, XclassB_train])
Yclass_train = pd.concat([YclassA_train, YclassB_train])
Xclass_test = pd.concat([XclassA_test, XclassB_test])
Yclass_test = pd.concat([YclassA_test, YclassB_test])
A more involved, and arguably better, option: first try to balance your dataset. For that you can use one of many techniques (under-sampling, over-sampling, SMOTE, ADASYN, Tomek links, etc.). I recommend reviewing the methods of the imbalanced-learn package. Having done the balancing, you can use an ordinary test/train split with the usual methods, without any additional intermediary steps.
The second approach is better not only from the perspective of splitting the data but also for speed and even the ability to train a model at all (which for heavily imbalanced datasets is not guaranteed).
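A minimal sketch of that second approach, assuming the imbalanced-learn package is installed and X, y hold the original imbalanced data:

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Oversample the minority (delayed) class so both classes are equally represented
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)

# Ordinary split afterwards, with no extra intermediary steps
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)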
I'm currently working on a problem which compares three different machine learning algorithms performance on the same data-set. I divided the data-set into 70/30 training/testing sets and then performed grid search for the best parameters of each algorithm using GridSearchCV and X_train, y_train.
First question: am I supposed to perform the grid search on the training set, or should it be on the whole data-set?
Second question: I know that GridSearchCV uses K-fold in its implementation. Does that mean I performed cross-validation if I used the same X_train, y_train for all three algorithms I compare in GridSearchCV?
Any answer would be appreciated, thank you.
All estimators in scikit-learn whose name ends with CV perform cross-validation.
But you need to keep a separate test set for measuring the performance.
So you need to split your whole data into train and test sets. Forget about this test data for a while.
Then pass only this train data to the grid search. GridSearchCV will split the train data further into train and validation folds to tune the hyper-parameters passed to it, and finally refit the model on the whole train data with the best found parameters.
Now you need to test this model on the test data you kept aside at the beginning. This will give you a near real-world estimate of the model's performance.
If you use the whole data in GridSearchCV, the test data leaks into the parameter tuning, and then the final model may not perform that well on newer unseen data.
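A sketch of that workflow for one of your three algorithms (an SVC and its grid are used here purely as stand-ins):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# 70/30 split; the test set is kept aside until the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# GridSearchCV cross-validates on the training data only
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Score the refit best model on the held-out test set for the final comparison
print(search.score(X_test, y_test))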
You can look at my other answers which describe the GridSearch in more detail:
Model help using Scikit-learn when using GridSearch
scikit-learn GridSearchCV with multiple repetitions
Yes, GridSearchCV performs cross-validation. If I understand the concept correctly, you want to keep part of your data set unseen by the model in order to test it on it.
So you train your models on the training data set and test them on the testing data set.
Here I was doing almost the same - you might want to check it...
from sklearn.model_selection import train_test_split
I have a question about the train_test_split function from sklearn. First, why do we split the data??? And where do we get the testing data from? Do we just chop the data in half and use some of it to train and some of it to test?? That doesn't make sense, since the data is already filled in. If it is already filled in, then what are we predicting?? I need help!
First, why do we split the data???
We split the data to isolate a portion of it for validation purposes. We use the non-isolated portion to fit the algorithm, and then use the algorithm to test against the isolated portion.
Where do we get the testing data from?
The testing data is actually part of your original dataset.
Do we just chop the data in half
We usually chop at the 20-40% mark.
If it is filled, then what are we predicting now??
You are not actually trying to predict the result directly. You are training the algorithm to fit the training set and using the testing set to see how accurate the algorithm is.
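A small example of that chop, assuming X holds the features and y the already-known answers, with a logistic regression as an arbitrary stand-in model:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hold back 20% of the labelled data purely to measure accuracy later
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on data the model never saw during training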
Your original dataset should be split up into training and testing data. For example, 80% of the data could be used for training and 20% could be used for testing. The data is split so that there is data for the model to be evaluated on to see how well the model performs on unseen data.
Training: This data is used to build your model, e.g. finding the optimal coefficients in a Linear Regression model, or using the CART algorithm to create a Decision Tree.
Testing: This data is used to see how the model performs on unseen data, as it would in a real-world situation. This data should be left completely unseen until you would like to test your model to evaluate performance.
Extra notes on validation data:
To tune your model (for example, finding the best max_depth value for a decision tree), the training data should itself be divided into training and validation data. K-Folds Cross Validation can be used here: the model is trained on the training folds and evaluated on the validation fold, and this is repeated multiple times across the folds.
The results (e.g. MSE, F1, etc.) can then be evaluated on each fold and used to tune the hyperparameters. By using cross-validation to tune the hyperparameters, you ensure that the model is not overfitting to the test data.
Once the model is tuned, it can then be applied to the test data.
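A hedged sketch of the whole procedure, assuming X and y already exist and using max_depth of a decision tree as the example hyperparameter:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 1. Split off the test data and leave it untouched
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Tune max_depth with K-fold cross-validation on the training data only
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, None]},
    cv=5,
)
search.fit(X_train, y_train)

# 3. Only now apply the tuned model to the test data
print(search.best_params_)
print(search.score(X_test, y_test))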