hyperparameter tuning using GridSearchCV - python

I have a K nearest neighbour classifier which you can see below. From what I understand, GridSearchCV is testing the model with different values of k between 1-20. When I do y_pred=knn_grid_cv.predict(x_test) I get a bunch of y predictions, but which value of k (between 1-20) was used to obtain these predictions? Would it be the highest-scoring k value from the GridSearchCV?
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

x = football_df["Pace"].values.reshape(-1, 1)
print(x)
y = football_df["Position"].values.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=42)

param_grid = {"n_neighbors": np.arange(1, 20)}
knn = KNeighborsClassifier()
knn_grid_cv = GridSearchCV(knn, param_grid, cv=5)
knn_grid_cv.fit(x_train, y_train)

y_pred = knn_grid_cv.predict(x_test)
print(y_pred)

You are correct. The way you defined param_grid will test the performance of 19 different models (np.arange(1, 20) yields the values 1 through 19), each with a different value for n_neighbors. The best model is chosen as the one with the highest average cross-validated score; in the case of a KNeighborsClassifier, the default score metric is the mean accuracy.
In your case, that'd be the model with the highest mean accuracy across all five folds. Since refit=True by default, GridSearchCV then refits that best model on the whole training set, and knn_grid_cv.predict(x_test) uses this refitted estimator.
To see what value of n_neighbors was chosen, simply do:
# Option 1: print the parameters of the best classifier
print(knn_grid_cv.best_estimator_.get_params())
# Option 2: print results of all model combinations
import pandas as pd
res = pd.DataFrame(knn_grid_cv.cv_results_)
print(res)
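Because of the refit, predict() simply delegates to the best estimator. A minimal sketch to verify this, assuming the fitted knn_grid_cv from above:
# the chosen value of n_neighbors and its mean cross-validated accuracy
print(knn_grid_cv.best_params_)
print(knn_grid_cv.best_score_)
# predict() uses the best estimator refitted on the whole training set
import numpy as np
same = np.array_equal(knn_grid_cv.predict(x_test),
                      knn_grid_cv.best_estimator_.predict(x_test))
print(same)  # True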


How come you can get a permutation feature importance greater than 1?

Take this simple code:
import lightgbm as lgb
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
lgbr = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.15)
model_lgb = lgbr.fit(X_train, y_train)
r = permutation_importance(model_lgb, X_test, y_test, n_repeats=30, random_state=0)
for i in r.importances_mean.argsort()[::-1]:
    print(f"{i} {r.importances_mean[i]:.3f} +/- {r.importances_std[i]:.3f}")
When I run this on my dataset the top value is about 1.20.
But I thought that the permutation_importance mean for a feature was the amount that the score was changed on average by permuting the feature column so this can't be more than 1 can it?
What am I missing?
(I get the same issue if I replace lightgbm with xgboost, so I don't think it is specific to the particular regression method.)
But I thought that the permutation_importance mean for a feature was the amount that the score was changed on average by permuting the feature column[...]
Correct.
so this can't be more than 1 can it?
That depends on whether the score can "worsen" by more than 1. The default for the scoring parameter of permutation_importance is None, which uses the model's score method. For LGBMRegressor (and most regressors), that's the R2 score, which has a maximum of 1 but can take arbitrarily large negative values, so the score can indeed worsen by an arbitrarily large amount. Since the reported importance is the baseline score minus the mean score after permuting the column, it is unbounded above whenever the score is unbounded below.
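A minimal sketch of this, with toy numbers chosen purely for illustration: a sufficiently bad set of predictions drives R2 far below zero, so the drop from the baseline score easily exceeds 1.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
good_pred = np.array([1.1, 1.9, 3.2, 3.8])    # close to the truth, like the unpermuted model
bad_pred = np.array([9.0, -5.0, 12.0, -7.0])  # wildly off, like predictions after permuting a key feature

baseline = r2_score(y_true, good_pred)   # about 0.98
permuted = r2_score(y_true, bad_pred)    # about -62
print(baseline - permuted)               # far greater than 1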

What is the difference between grid.score(X_valid, y_valid) and grid.best_score_

While doing GridSearchCV, what is the difference between the scores obtained through grid.score(...) and grid.best_score_?
Kindly assume that a model, features, target, and param_grid are in place. Here is a part of the code I am very curious to know about.
grid = GridSearchCV(estimator=my_model, param_grid=params, cv=3,
                    return_train_score=True, scoring='neg_mean_squared_error')
grid.fit(X_train2, y_train2)
best_score_1 = grid.score(X_valid, y_valid)
best_score_2 = grid.best_score_
best_score_1 and best_score_2 give two different outputs.
I am trying to understand the difference between the two, as well as which of them should be considered the best score that came out of the given param_grid.
Following is the full function.
def apply_grid(df, model, features, target, params, test=False):
    '''
    Performs GridSearchCV after re-splitting the dataset, provides
    a comparison between the train MSE and the test MSE to check for
    generalization, and optionally deploys the best-found parameters
    on the test set as well.
    Args:
        df: DataFrame
        model: a model to use
        features: features to consider
        target: labels
        params: param_grid for optimization
        test: False by default; if True, predicts on the test set
    Returns:
        MSE scores on the models and a slice from cv_results_
        to compare the models' generalization performance
    '''
    my_model = model()
    # Split the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(df[features],
                                                        df[target], random_state=0)
    # Re-split the train dataset for GridSearchCV into train2 and valid
    # to keep the test set separate
    X_train2, X_valid, y_train2, y_valid = train_test_split(X_train,
                                                            y_train, random_state=0)
    # Use grid search to find the best parameters from the param_grid
    grid = GridSearchCV(estimator=my_model, param_grid=params, cv=3,
                        return_train_score=True, scoring='neg_mean_squared_error')
    grid.fit(X_train2, y_train2)
    # Evaluate on the validation set
    scores = grid.score(X_valid, y_valid)  # CONFUSION
    print('Best MSE through GridSearchCV: ', grid.best_score_)  # CONFUSION
    print('Best MSE through GridSearchCV: ', scores)
    print('I AM CONFUSED ABOUT THESE TWO OUTPUTS ABOVE. WHY ARE THEY DIFFERENT')
    print('Best Parameters: ', grid.best_params_)
    print('-' * 120)
    print('mean_test_score is rather mean_valid_score')
    report = pd.DataFrame(grid.cv_results_)
    # If test is True, deploy the best_params_ on the test set
    if test == True:
        my_model = model(**grid.best_params_)
        my_model.fit(X_train, y_train)
        predictions = my_model.predict(X_test)
        mse = mean_squared_error(y_test, predictions)
        print('TEST MSE with the best params: ', mse)
        print('-' * 120)
    return report[['mean_train_score', 'mean_test_score']]
UPDATED
As explained in the sklearn documentation, GridSearchCV takes the lists of parameter values you pass and tries every possible combination to find the best parameters.
To evaluate which parameters are best, it runs a k-fold cross-validation for each parameter combination. With k-fold cross-validation, the training set is divided into a training part and a validation part. If you choose, for example, cv=5, the dataset is divided into 5 non-overlapping folds, and each fold is used once as the validation set while all the others are used as the training set. For each parameter combination, GridSearchCV therefore computes the average validation score (accuracy or whatever metric you chose) over the 5 folds. At the end, the combination with the highest average validation score wins, and that average validation score is what gets stored in grid.best_score_.
On the other hand, grid.score(X_valid, y_valid) gives the score on the given data, provided the estimator has been refitted (refit=True, the default). This is not the average score over the 5 folds: the model with the best parameters is refitted on the whole training set, its predictions are computed on X_valid, and they are compared with y_valid to produce the score.
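A minimal, self-contained sketch of the difference (Ridge and the synthetic data here are just placeholders, not from the original post):
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

grid = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]}, cv=3,
                    scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

# mean cross-validated score of the best parameter combination,
# averaged over the 3 folds of the training data
print(grid.best_score_)

# score of the refitted best estimator on data it never saw during fitting
print(grid.score(X_valid, y_valid))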

difference between cross_val_score and KFold

I am learning machine learning and I have a doubt. Can anyone tell me the difference between:
from sklearn.model_selection import cross_val_score
and
from sklearn.model_selection import KFold
I think both are used for k-fold cross-validation, but I am not sure why there are two different pieces of code for the same task.
If there is something I am missing, please do let me know. (If possible, please explain the difference between these two methods.)
Thanks,
cross_val_score is a function which evaluates a model on data and returns the scores.
On the other hand, KFold is a class which lets you split your data into K folds.
So these are completely different things. You can make a K-fold split of the data and use it in cross-validation like this:
from xgboost import XGBRegressor

# create a splitter object
kfold = KFold(n_splits=10)

# define your model (any model)
model = XGBRegressor(**params)

# pass your model and the KFold object to cross_val_score
# to fit and get the RMSE of each fold of data
cv_score = cross_val_score(model,
                           X, y,
                           cv=kfold,
                           scoring='neg_root_mean_squared_error')
print(cv_score.mean(), cv_score.std())
cross_val_score evaluates a model using cross-validation: it splits the training set into distinct subsets called folds, then trains and evaluates the model repeatedly, picking a different fold for evaluation each time and training on the remaining folds.
cv_score = cross_val_score(model, data, target, scoring=scoring, cv=cv)
The KFold procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is used once as a held-out test set, while all the other folds collectively are used as the training dataset. A total of k models are fit and evaluated on the k held-out test sets, and the mean performance is reported.
cv = KFold(n_splits=10, random_state=1, shuffle=True)
cv_score = cross_val_score(model, data, target, scoring=scoring, cv=cv)
where model is the model you want to evaluate,
data is the training data,
target is the target variable,
the scoring parameter controls which metric is applied to the estimator, and cv is the number of splits (or a splitter object such as the KFold instance above).
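To make the distinction concrete, here is a minimal sketch (a RandomForestRegressor on synthetic data, purely as an example) of what cross_val_score does for you: KFold only yields index splits, while the looping, fitting, and scoring are what cross_val_score wraps into a single call.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

rmse_per_fold = []
for train_idx, test_idx in kfold.split(X):
    model = RandomForestRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    rmse_per_fold.append(np.sqrt(mean_squared_error(y[test_idx], preds)))

print(np.mean(rmse_per_fold), np.std(rmse_per_fold))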

How to choose n_estimators in RandomForestClassifier?

I'm building a random forest binary classifier in Python on a pre-processed dataset with 4898 instances, a 60-40 stratified split ratio, and 78% of the data belonging to one target label and the rest to the other. What value of n_estimators should I choose in order to achieve the most practically useful / best possible random forest classifier model? I plotted accuracy vs. n_estimators using the code snippet below. x_train and y_train are the features and target labels of the training set, and x_test and y_test are the features and target labels of the test set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

scores = []
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))

import matplotlib.pyplot as plt
%matplotlib inline

# plot the relationship between n_estimators and testing accuracy
# plt.plot(x_axis, y_axis)
plt.plot(range(1, 200), scores)
plt.xlabel('Value of n_estimators for Random Forest Classifier')
plt.ylabel('Testing Accuracy')
Here, it is visible that a high value of n_estimators gives a good accuracy score, but the curve fluctuates randomly even for nearby values of n_estimators, so I can't pick the best one precisely. I only want to know about tuning the n_estimators hyperparameter: how should I choose it? Should I use a ROC or CAP curve instead of accuracy_score? Thanks.
See https://github.com/dnishimoto/python-deep-learning/blob/master/Random%20Forest%20Tennis.ipynb for a RandomizedSearchCV example.
I used RandomizedSearchCV to find the best params for the random forest classifier.
n_estimators is the number of decision trees to use.
Try using XGBoost to get more accuracy.
from sklearn.model_selection import RandomizedSearchCV

parameter_grid = {'n_estimators': [1, 2, 3, 4, 5],
                  'max_depth': [2, 4, 6, 8, 10],
                  'min_samples_leaf': [1, 2, 4],
                  'max_features': [1, 2, 3, 4, 5, 6, 7, 8]}
number_models = 4

random_RandomForest_class = RandomizedSearchCV(
    estimator=pipeline['clf'],
    param_distributions=parameter_grid,
    n_iter=number_models,
    scoring='accuracy',
    n_jobs=2,
    cv=4,
    refit=True,
    return_train_score=True)

random_RandomForest_class.fit(X_train, y_train)
predictions = random_RandomForest_class.predict(X)

print("Accuracy Score", accuracy_score(y, predictions))
print("Best params", random_RandomForest_class.best_params_)
print("Best score", random_RandomForest_class.best_score_)
It is natural that a random forest will stabilize after some number of n_estimators (because, unlike boosting, there is no mechanism to "slow down" the fitting). Since there is no benefit to adding more weak tree estimators beyond that point, you can choose around 50.
Don't use grid search for this case - it is overkill - and since you set the candidate values arbitrarily, you may not end up with the optimum number.
For boosting, scikit-learn provides a staged_predict method with which you can measure the validation error at each stage of training to find the optimum number of trees.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y)

# try a big number for n_estimators
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=100)
gbrt.fit(X_train, y_train)

# calculate the error on the validation set at each boosting stage
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
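If you want to sanity-check the stabilization claim for the random forest itself, one option (a sketch on synthetic data, not part of the answer above) is to grow a single forest incrementally with warm_start=True and watch the out-of-bag score flatten out:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for the ~4898-instance, 78/22 dataset in the question
X, y = make_classification(n_samples=4898, weights=[0.78], random_state=0)

# warm_start=True keeps the trees already grown and only adds new ones;
# oob_score=True evaluates each sample with the trees that did not see it
rfc = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

for n in range(25, 201, 25):
    rfc.set_params(n_estimators=n)
    rfc.fit(X, y)
    print(n, rfc.oob_score_)
# the OOB score typically flattens out; past that point more trees add cost, not accuracy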
Is it only me, or do the existing answers not really address your question? In case you are still looking for how to get the accuracy score and the n_estimators value you want, maybe I can answer it.
First, you already answered it in your own code, in these lines.
scores = []
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))
As you can see, you already saved each accuracy_score into scores. So you just need to recall it by finding the maximum value in the scores list.
maxs = max(scores)
maxs_idx = scores.index(maxs)
best_n = maxs_idx + 1  # the loop starts at n_estimators=1, so shift the list index by one
Then just put a print statement in the final lines.
print(f"Accuracy Score: {maxs} with n_estimators: {best_n}")
I hope this solves your problem. Thanks as well: your code helped me create a way to find the best number of estimators too.

Scikit learn GridSearchCV AUC performance

I'm using GridSearchCV to identify the best set of parameters for a random forest classifier.
PARAMS = {
    'max_depth': [8, None],
    'n_estimators': [500, 1000]
}
rf = RandomForestClassifier()
clf = grid_search.GridSearchCV(estimator=rf, param_grid=PARAMS, scoring='roc_auc', cv=5, n_jobs=4)
clf.fit(data, labels)
where data and labels are respectively the full dataset and the corresponding labels.
Now, I compared the performance returned by the GridSearchCV (from clf.grid_scores_) with a "manual" AUC estimation:
aucs = []
for fold in range(0, n_folds):
    probabilities = []
    train_data, train_labels = read_data(train_file_fold)
    test_data, test_labels = read_data(test_file_fold)
    clf = RandomForestClassifier(n_estimators=1000, max_depth=8)
    clf = clf.fit(train_data, train_labels)
    predicted_probs = clf.predict_proba(test_data)
    for value in predicted_probs:
        for k, pr in enumerate(value):
            if k == 1:
                probabilities.append(pr)
    fpr, tpr, thresholds = metrics.roc_curve(test_labels, probabilities, pos_label=1)
    fold_auc = metrics.auc(fpr, tpr)
    aucs.append(fold_auc)

performance = np.mean(aucs)
where I manually pre-split the data into training and test sets (same 5-fold CV approach).
The AUC values returned by GridSearchCV are always higher than the ones calculated manually (e.g. 0.62 vs. 0.70) when using the same parameters for the RandomForest.
I know that different training and test splits might give you different performance, but this occurred consistently across 100 repetitions of the GridSearchCV. Interestingly, if I use accuracy instead of roc_auc as the scoring metric, the difference in performance is minimal and can be attributed to the fact that I use different training and test sets. Is this happening because the AUC value of GridSearchCV is estimated in a different way than by using metrics.roc_curve?
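For reference, a minimal sketch on synthetic data (not the poster's dataset) showing that the roc_auc scorer used by GridSearchCV measures the same quantity as the AUC computed manually from roc_curve on the positive-class probabilities:
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=500, max_depth=8, random_state=0).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# AUC via roc_curve + auc, as in the manual loop above
fpr, tpr, _ = metrics.roc_curve(y_te, probs, pos_label=1)
manual_auc = metrics.auc(fpr, tpr)

# AUC via roc_auc_score, which is what scoring='roc_auc' computes
scorer_auc = metrics.roc_auc_score(y_te, probs)

print(manual_auc, scorer_auc)  # identical up to floating-point noise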
