I am trying to understand scikit-learn's GridSearchCV. I have a few basic questions about the use of cross-validation in GridSearchCV, and about how I should then use GridSearchCV's recommendations further.
Say I declare a GridSearchCV instance as below:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search is deprecated/removed

RFReg = RandomForestRegressor(random_state = 1)
param_grid = {
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [4, 5, 6, 7, 8, 9, 10]
}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=10)
CV_rfc.fit(X_train, y_train)
I have the following questions:
Say in the first iteration n_estimators = 100 and max_depth = 4 is selected for model building. Will the score for this model now be chosen with the help of 10-fold cross-validation?
a. My understanding of the process is as follows:
1. X_train and y_train will be split into 10 sets.
2. The model will be trained on 9 sets and tested on the 1 remaining set, and its score will be stored in a list, say score_list.
3. This process will be repeated 9 more times, and each of these 9 scores will be added to score_list, giving 10 scores in all.
4. Finally, the average of score_list will be taken, giving a final_score for the model with parameters n_estimators = 100 and max_depth = 4.
b. The above process will be repeated for all other possible combinations of n_estimators and max_depth, and each time we will get a final_score for that model.
c. The best model will be the model with the highest final_score, and we will get the corresponding best values of 'n_estimators' and 'max_depth' from CV_rfc.best_params_.
Is my understanding of GridSearchCV correct?
Now say I get the best model parameters as {'max_depth': 10, 'n_estimators': 100}. I declare an instance of the model as below:
RFReg_best = RandomForestRegressor(n_estimators = 100, max_depth = 10, random_state = 1)
I now have two options, and I want to know which of them is correct.
a. Use cross-validation on the entire dataset to see how well the model performs, as below:
import numpy as np
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RFReg_best, X, y, cv=10, scoring='neg_mean_squared_error')
rm_score = -scores
rm_score = np.sqrt(rm_score)
b. Fit the model on X_train, y_train and then test it on X_test, y_test:
from sklearn.metrics import mean_squared_error
RFReg_best.fit(X_train, y_train)
y_pred = RFReg_best.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))
Or are both of them correct?
Regarding (1), your understanding is indeed correct; one wording detail worth correcting is "better final_score" rather than "higher", since several performance metrics (anything measuring error, such as MSE, MAE, etc.) are lower-is-better.
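For completeness, note how scikit-learn handles the sign internally: its built-in scorers follow a greater-is-better convention, so error metrics are exposed negated (e.g. 'neg_mean_squared_error'). A small illustrative sketch on synthetic data, with a reduced grid for speed:
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# With an error metric, scikit-learn negates the score so that "higher is better"
# still holds internally; the raw MSE of the winner is -best_score_.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, random_state=1)
small_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5]}  # reduced grid, for speed only
search = GridSearchCV(RandomForestRegressor(random_state=1), small_grid,
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)
print(search.best_params_)   # combination with the best (i.e. lowest) MSE
print(-search.best_score_)   # mean cross-validated MSE of that combination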
Now, step (2) is more tricky; it requires taking a step back to check the whole procedure...
To start with, in general CV is used either for parameter tuning (your step 1) or for model assessment (i.e. what you are trying to do in step 2), which are indeed different things. Splitting your data from the very beginning into training & test sets as you have done here, and then sequentially performing step 1 (parameter tuning) and step 2b (model assessment on unseen data), is arguably the most "correct" procedure in principle (as for the bias you note in the comment, this is something we have to live with, since by default all our fitted models are "biased" toward the data used for their training, and this cannot be avoided).
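To make the distinction concrete, here is a minimal sketch of that combined procedure, reusing the names from the question (X and y are assumed to be the full feature matrix and target):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Step 0: a single train/test split, made once and never touched during tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Step 1: parameter tuning with CV on the training part only
param_grid = {'n_estimators': [100, 500], 'max_depth': [4, 6, 8]}
CV_rfc = GridSearchCV(RandomForestRegressor(random_state=1), param_grid,
                      cv=10, scoring='neg_mean_squared_error')
CV_rfc.fit(X_train, y_train)

# Step 2b: model assessment on the held-out test part, using the refitted best estimator
y_pred = CV_rfc.best_estimator_.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, y_pred)))  # test RMSE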
Nevertheless, since early on, practitioners have wondered whether they can avoid "sacrificing" a part of their precious data solely for testing (model assessment) purposes, and have tried to skip the model assessment part (and the test set itself) altogether, using the best results obtained from the parameter-tuning procedure (your step 1) as the model assessment. This is clearly cutting corners, but, as usual, the question is: how far off will the actual results be, and will they still be meaningful?
Again, in theory, what Vivek Kumar writes in his linked answer is correct:
If you use the whole data into GridSearchCV, then there would be leakage of test data into parameter tuning and then the final model may not perform that well on newer unseen data.
The (highly recommended) Applied Predictive Modeling book makes a relevant point on this (p. 78), which can be summarized as follows:
In short: if you use the whole X in step 1 and take the results of the tuning as your model assessment, there will indeed be a bias/leakage, but it is usually small, at least for moderately large training sets...
Wrapping-up:
The "most correct" procedure in theory is indeed the combination of your steps 1 and 2b
You can try to cut corners, using the whole training set X in step 1, and most probably you will still be within acceptable limits regarding your model assessment.
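If you do go the corner-cutting route of the second bullet, the number you report is simply the tuning score itself; a minimal sketch (with the same caveat about the small optimistic bias, and with X, y standing for all the available data):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Tune on all available data and quote the cross-validated score of the winning
# parameters as the (slightly optimistic) model assessment.
CV_all = GridSearchCV(RandomForestRegressor(random_state=1),
                      {'n_estimators': [100, 500], 'max_depth': [4, 6, 8]},
                      cv=10, scoring='neg_mean_squared_error')
CV_all.fit(X, y)
print(np.sqrt(-CV_all.best_score_))  # cross-validated RMSE of the best parameter combination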
Related
I am working on vehicle occupancy prediction and I am very new to this. I have used random forest regression to predict the occupancy values.
Jupyter notebook_Random forest
I have around 48 M rows and I have used all of the data for the prediction. The population and occupancy values were normalized because of their large magnitudes, and I made the predictions on the normalized data. I am sure the model is not good; how can I interpret the results from the RMSE and MAE? Also, the plot shows that the occupancy is not predicted well. Am I going about predicting vehicle occupancy in the correct way?
Kindly help me with the following:
Is random forest regression a good method to approach this problem?
How can I improve the model results?
How can I interpret the results from the outcome?
Is random forest regression a good method to approach this problem?
-> The model is just a tool and can of course be used. However, no one can say whether it is suitable or not, because we have not studied the distribution of the data. It is suggested that you also try logistic regression, support vector machine regression, etc.
How can I improve the model results?
-> I have several suggestions on how to improve: 1. Do not standardize without first checking whether the y column contains extreme values. 2. When calculating RMSE and MAE, use the original (un-normalized) y values, as sketched below. 3. Develop a deeper understanding of the business logic and add new features. 4. Read up on data processing and feature engineering (e.g. in blog posts on the topic).
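For suggestion 2, a minimal sketch of what that could look like; it assumes the target was scaled with some fitted scaler, and the y_scaler, y_test_scaled and y_pred_scaled names below are hypothetical placeholders, not from the notebook:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# y_test_scaled / y_pred_scaled are assumed to be 1-D arrays in the normalized space;
# y_scaler is the (hypothetical) scaler that was fitted on the original occupancy values.
y_test_orig = y_scaler.inverse_transform(y_test_scaled.reshape(-1, 1)).ravel()
y_pred_orig = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()

rmse = np.sqrt(mean_squared_error(y_test_orig, y_pred_orig))  # now in original occupancy units
mae = mean_absolute_error(y_test_orig, y_pred_orig)
print(rmse, mae)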
How can I interpret the results from the outcome?
-> Bad results do not necessarily mean the model has no value. You need to check whether the model is better than the existing methods and whether it produces more economic value; for example, error is a loss and accuracy is a gain.
Hope this helps.
You were recommended an XGBoost-based regressor, so you could also try a LightGBM-based one: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html
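A minimal sketch of what that could look like, assuming the usual X_train/X_test, y_train/y_test split from your notebook; the parameter values are illustrative, not tuned for your data:
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error

# Illustrative settings only; tune n_estimators / learning_rate / num_leaves for your data.
lgbm = LGBMRegressor(n_estimators=1000, learning_rate=0.05, num_leaves=31, random_state=34)
lgbm.fit(X_train, y_train)
pred = lgbm.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred)))  # RMSE on the held-out set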
You are getting an RMSE of 0.002175863553610834, which is really close to zero, so we can say that you have a good model. I don't think the model needs further improvement. If you still want to improve it, I think you should switch the algorithm to XGBoost and use regularization and early stopping to avoid overfitting.
from xgboost import XGBRegressor

# Heavy regularization (reg_alpha, reg_lambda) plus early stopping on a validation set.
model = XGBRegressor(n_estimators = 3000, learning_rate = 0.01, reg_alpha = 2, reg_lambda = 1, n_jobs = -1, random_state = 34, verbosity = 0)
evalset = [(X_train, y_train), (X_test, y_test)]
# Note: in xgboost >= 2.0, eval_metric and early_stopping_rounds are passed to the constructor instead of fit().
model.fit(X_train, y_train, eval_metric = 'rmse', eval_set = evalset, early_stopping_rounds = 5)
Suppose I iterate with the following code until I acquire an accuracy that I'm satisfied with:
import numpy as np
from sklearn.model_selection import train_test_split

x, y = # ... read in some data set ...
c = 3000 # iterate over some arbitrary range
for i in range(c):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=i)
    model = # ... initialize some classifier of choice ...
    model.fit(x_train, y_train)
    p = model.predict(x_test)
    p = np.round(p).reshape(-1)
    test_accuracy = np.mean(p == y_test) * 100
For a particular data set and range, say I build a classifier such that the training accuracy is 97% and the test accuracy is 96%. Can I truly claim the model is 96% accurate? For the same range and data set, I can also build a classifier where the training and test accuracies are 99% and 70%, respectively.
Since I have selected random_state based on the test-set accuracy, is the test set really a validation set here? I don't know why, but I think claiming that the first model is 96% accurate would not be true. What should I do instead in order to make a correct claim about the model's accuracy?
Is it bad practice to iterate over many random training & test set splits until a high accuracy is achieved?
Yes, this is bad practice. You should be evaluating on data that your model has never been trained on, and this wouldn't really be the case if you train many times to find the best train/test split.
You can put aside a test set before you train the model. Then you can create as many train/validation splits as you want and train the model multiple times. You would evaluate on the test set, on which the model was never trained.
You can also look into nested cross-validation.
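A minimal sketch of the nested version; the SVC estimator, its grid and the iris data are just placeholders, the point being that the inner search never sees the outer test folds:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)  # placeholder data set

# Inner loop: hyperparameter search; outer loop: unbiased performance estimate.
inner_search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
outer_scores = cross_val_score(inner_search, x, y, cv=5)
print(outer_scores.mean())  # accuracy estimated on data never used for tuning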
Kinda. There is cross-validation, which is similar to what you have described: the train/test split is randomised and the model trained each time, except that the final value quoted is the average test accuracy, not simply the best one. This sort of thing is done in tricky situations, e.g. with very small datasets.
In terms of the bigger picture, the test data should be representative of the training data and vice versa. Sure, you can cheese it that way, but if the atypical 'weird' cases are hidden in your training set and the test set is just full of easy cases (e.g. only the digit 0 for MNIST), then you aren't really achieving anything; you're only cheating yourself.
I'd like to use scikit-learn's GridSearchCV to perform a grid search and calculate the cross validation error using a predefined development and validation split (1-fold cross validation).
I'm afraid that I've done something wrong, because my validation accuracy is suspiciously high. Where I think I'm going wrong: I'm splitting up my training data into development and validation sets, training on the development set and recording the cross validation score on the validation set. My accuracy might be inflated because I am really training on a mix of the development and validation sets, then testing on the validation set. I'm not sure if I'm using scikit-learn's PredefinedSplit module correctly. Details below:
Following this answer, I did the following:
import numpy as np
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV  # sklearn.grid_search is deprecated
# I split up my data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
data[training_features], data[training_response], test_size=0.2, random_state=550)
# sanity check - dimensions of training and test splits
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# dimensions of X_train and y_train are (323430, 26) and (323430, 1) respectively
# dimensions of X_test and y_test are (80858, 26) and (80858, 1)
''' Now, I define indices for a pre-defined split.
this is a 323430 dimensional array, where the indices for the development
set are set to -1, and the indices for the validation set are set to 0.'''
validation_idx = np.repeat(-1, y_train.shape[0])
np.random.seed(550)
validation_idx[np.random.choice(validation_idx.shape[0],
                                int(round(.2 * validation_idx.shape[0])), replace=False)] = 0
# Now, create a list which contains a single tuple of two elements,
# which are arrays containing the indices for the development and
# validation sets, respectively.
validation_split = list(PredefinedSplit(validation_idx).split())
# sanity check
print(len(validation_split[0][0])) # outputs 258744
print(len(validation_split[0][0]) / float(validation_idx.shape[0])) # outputs .8
print(validation_idx.shape[0] == y_train.shape[0]) # True
print(set(validation_split[0][0]).intersection(set(validation_split[0][1]))) # set([])
Now, I run a grid search using GridSearchCV. My intention is that a model will be fit on the development set for each parameter combination over the grid, and the cross validation score will be recorded when the resulting estimator is applied to the validation set.
# a vanilla XGBoost model
from xgboost import XGBClassifier
model1 = XGBClassifier()
# create a parameter grid for the number of trees and depth of trees
n_estimators = range(300, 1100, 100)
max_depth = [8, 10]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
# A grid search.
# NOTE: I'm passing a PredefinedSplit object as an argument to the `cv` parameter.
grid_search = GridSearchCV(model1, param_grid,
                           scoring='neg_log_loss',
                           n_jobs=-1,
                           cv=validation_split,
                           verbose=1)
# fit the grid search; the fitted object is what I refer to as grid_result2 below
grid_result2 = grid_search.fit(X_train, y_train)
Now, here is where a red flag is raised for me. I use the best estimator found by the grid search to compute the accuracy on the validation set. It's very high - 0.89207865689639176. What's worse is that it's almost identical to the accuracy I get if I use the classifier on the development set (on which I just trained) - 0.89295597192591902. BUT - when I use the classifier on the true test set, I get a much lower accuracy, roughly .78:
from sklearn.metrics import accuracy_score

# accuracy score on the validation set. This yields .89207865
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][1]]),
               y_true=y_train[validation_split[0][1]])
# accuracy score when applied to the development set. This yields .8929559
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][0]]),
               y_true=y_train[validation_split[0][0]])
# finally, the score when applied to the test set. This yields .783
accuracy_score(y_pred=grid_result2.predict(X_test), y_true=y_test)
To me, the almost exact correspondence between the model's accuracy on the development and validation datasets, together with the significant loss in accuracy on the test set, is a clear sign that I'm accidentally training on the validation data, and thus my cross-validation score is not representative of the true accuracy of the model.
I can't seem to find where I went wrong - mostly because I don't know what GridSearchCV is doing under the hood when it receives a PredefinedSplit object as the argument to the cv parameter.
Any ideas where I went wrong? If you need more details/elaboration, please let me know. The code is also in this notebook on github.
Thanks!
You need to set refit=False (refit=True is the default), otherwise the grid search will refit the estimator on the whole dataset passed to it (ignoring cv) after the search completes.
Yes, there was a data leakage problem for the validation data. You need to set refit=False in GridSearchCV so that it does not refit on all of the data (development plus validation) at the end.
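A minimal sketch of that fix, reusing the names and indexing style from the question (with refit=False there is no best_estimator_, so the winning parameters are refit manually on the development indices only):
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# With refit=False the search does not retrain on development + validation combined.
grid_search = GridSearchCV(model1, param_grid, scoring='neg_log_loss',
                           n_jobs=-1, cv=validation_split, refit=False, verbose=1)
grid_search.fit(X_train, y_train)

# Refit the winning parameters on the development fold only, then score the validation fold.
dev_idx, val_idx = validation_split[0]
best_model = XGBClassifier(**grid_search.best_params_)
best_model.fit(X_train.iloc[dev_idx], y_train[dev_idx])
val_acc = accuracy_score(y_true=y_train[val_idx],
                         y_pred=best_model.predict(X_train.iloc[val_idx]))
print(val_acc)  # should now be much closer to the test-set accuracy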
As part of the Enron project, I built the attached model. Below is a summary of the steps.
The model below gives near-perfect scores:
cv = StratifiedShuffleSplit(n_splits = 100, test_size = 0.2, random_state = 42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # ---> with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    gcv.best_estimator_.predict(x_test)
The model below gives more reasonable but lower scores:
cv = StratifiedShuffleSplit(n_splits = 100, test_size = 0.2, random_state = 42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # ---> with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    gcv.best_estimator_.fit(x_train, y_train)
    gcv.best_estimator_.predict(x_test)
Used SelectKBest to score and sort the features, and tried combinations of the higher- and lower-scoring ones.
Used SVM with a GridSearch using a StratifiedShuffleSplit.
Used the best_estimator_ to predict and calculate the precision and recall.
The problem is that the estimator is spitting out perfect scores, in some cases 1.
But when I refit the best classifier on the training data and then run the test, it gives reasonable scores.
My doubt/question is: what exactly does GridSearchCV do with the test data after the split, using the ShuffleSplit object we pass in to it? I assumed it would not fit anything on the test data; if that were true, then when I predict using that same test data, it should not give such high scores, right? Since I used a random_state value, the ShuffleSplit should have created the same splits for the grid fit and for the predictions.
So, is using the same ShuffleSplit for the two steps wrong?
GridSearchCV, as @Gauthier Feuillen said, is used to search for the best parameters of an estimator for the given data.
Description of what GridSearchCV does:
1. gcv = GridSearchCV(pipe, clf_params, cv=cv)
2. gcv.fit(features, labels)
3. clf_params will be expanded to get all possible combinations separately, using ParameterGrid.
4. features will now be split into features_train and features_test using cv. The same is done for labels.
5. Now the grid-search estimator (pipe) will be trained using features_train and labels_train, and scored using features_test and labels_test.
6. For each possible combination of parameters from step 3, steps 4 and 5 will be repeated for the cv iterations. The average score across the cv iterations will be calculated and assigned to that parameter combination. This can be accessed via the cv_results_ attribute of the grid search.
7. For the parameters which give the best score, the internal estimator will be re-initialized with those parameters and refit on the whole data supplied to it (features and labels).
Because of that last step, you get different scores in the first and second approaches: in the first approach, all of the data was used for training and you are predicting on that same data, whereas the second approach makes predictions on previously unseen data.
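If the goal is an honest score without the manual refit loop, the cross-validated score already recorded by the search (each candidate was scored on folds it was not fitted on) is the safer number to quote; a minimal sketch reusing the names from the question:
from sklearn.model_selection import cross_val_score

# Score recorded during the search: each fold was scored on data the candidate
# was not fitted on, so the final whole-data refit does not inflate it.
print(gcv.best_score_)

# Equivalent check: cross_val_score clones the best estimator and refits it
# on each training fold before scoring the corresponding test fold.
print(cross_val_score(gcv.best_estimator_, features, labels, cv=cv).mean())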
Basically the grid search will:
Try every combination of your parameter grid
For each of them it will do a K-fold cross validation
Select the best available.
So your second case is the right one; otherwise you are actually predicting on data that you trained with (which is not the case in the second option, where you only keep the best parameters from your grid search).
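Spelled out, that search is roughly equivalent to this manual loop (an illustrative sketch of the idea, not what scikit-learn literally executes under the hood):
from sklearn.model_selection import ParameterGrid, cross_val_score

best_score, best_params = -float('inf'), None
for params in ParameterGrid(clf_params):                      # 1. every combination of the grid
    scores = cross_val_score(pipe.set_params(**params),
                             features, labels, cv=cv)         # 2. CV for that combination
    if scores.mean() > best_score:                            # 3. keep the best average score
        best_score, best_params = scores.mean(), params
print(best_params, best_score)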
I'm using the XGBoost scikit-learn wrapper for a Kaggle competition.
However, I'm getting this warning message:
$ python Script1.py
/home/sky/private/virtualenv15.0.1dev/myVE/local/lib/python2.7/site-packages/sklearn/cross_validation.py:516:
Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
% (min_labels, self.n_folds)), Warning)
According to another question on Stack Overflow:
"Check that you have at least 3 samples per class to be able to do StratifiedKFold cross validation with k == 3 (I think this is the default CV used by GridSearchCV for classification)."
And well, I don't have at least 3 samples per class.
So my questions are:
What are the alternatives?
Why can't I use cross-validation?
What can I use instead?
...
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}
grid_search = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1,
        n_estimators=3000,
        max_depth=15,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='multi:softmax',
        nthread=42,
        scale_pos_weight=1,
        seed=27),
    param_grid=param_test1, scoring='roc_auc', n_jobs=42, iid=False, cv=None, verbose=1)
...
grid_search.fit(train_x, place_id)
References:
One-shot learning with scikit-learn
Using a support vector classifier with polynomial kernel in scikit-learn
If you have a target/class with only one sample, that's too few for any model. What you can do is get another dataset, preferably as balanced as possible, since most models behave better on balanced sets.
If you cannot get another dataset, you will have to play with what you have. I would suggest you remove the sample that has the lonely target; you will then have a model which does not cover that target. If that does not fit your requirements, you need a new dataset.
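A minimal sketch of that filtering step, reusing train_x, place_id and grid_search from the question (it assumes they are NumPy arrays or pandas objects that accept boolean masks):
import numpy as np

# Keep only classes that have at least as many samples as the number of CV folds
# (3 here, matching the warning; newer scikit-learn defaults to 5 folds).
n_folds = 3
labels, counts = np.unique(place_id, return_counts=True)
keep = np.isin(place_id, labels[counts >= n_folds])

train_x_filtered = train_x[keep]
place_id_filtered = place_id[keep]
grid_search.fit(train_x_filtered, place_id_filtered)  # no "least populated class" warning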