GridSearchCV with Scoring Function and Refit Parameter - python

My question seems to be similar to this one but there is no solid answer there.
I'm doing multi-class, multi-label classification, and for that I have defined my own scorers. However, to use the refit parameter and get the best parameters of the model at the end, I need to specify one of the scorer functions for refit. If I do so, I get an error: missing 1 required positional argument: 'y_pred'. y_pred should be the outcome of fit, but I am not sure where this issue is coming from or how to solve it.
Below is the code:
scoring = {'roc_auc_score': make_scorer(roc_auc_score),
           'precision_score': make_scorer(precision_score, average='samples'),
           'recall_score': make_scorer(recall_score, average='samples')}
params = {'estimator__n_estimators': [500, 800],
          'estimator__max_depth': [10, 50]}
model = xgb.XGBClassifier(n_jobs=4)
model = MultiOutputClassifier(model)
cls = GridSearchCV(model, params, cv=3, refit=make_scorer(roc_auc_score), scoring=scoring, verbose=3, n_jobs=-1)
model = cls.fit(x_train_ups, y_train_ups)
print(model.best_params_)

You should use refit="roc_auc_score", the name of the scorer in your dictionary. From the docs:
For multiple metric evaluation, this needs to be a str denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.
Using a callable for refit has a different purpose: the callable should take the cv_results_ dict and return the best_index_. That explains the error message: sklearn is trying to pass cv_results_ to your auc scorer function, but that function should take parameters y_true and y_pred.
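To make the two uses of refit concrete, here is a minimal sketch. It is not the asker's setup: it assumes a small synthetic binary problem from make_classification, with LogisticRegression standing in for the XGBoost MultiOutputClassifier and default averaging for precision and recall. The helper name pick_best_recall is made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, not the asker's data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

scoring = {
    "roc_auc_score": make_scorer(roc_auc_score),
    "precision_score": make_scorer(precision_score),
    "recall_score": make_scorer(recall_score),
}
params = {"C": [0.1, 1.0, 10.0]}

# refit as a string: it must be one of the keys of the scoring dict.
cls = GridSearchCV(LogisticRegression(max_iter=1000), params, cv=3,
                   scoring=scoring, refit="roc_auc_score")
cls.fit(X, y)
print(cls.best_params_)

# refit as a callable: it receives cv_results_ and returns the best index.
def pick_best_recall(cv_results):  # hypothetical helper
    return int(np.argmax(cv_results["mean_test_recall_score"]))

cls_callable = GridSearchCV(LogisticRegression(max_iter=1000), params, cv=3,
                            scoring=scoring, refit=pick_best_recall)
cls_callable.fit(X, y)
print(cls_callable.best_index_, cls_callable.best_params_)
```

In both cases cv_results_ still holds the mean test score of every metric in the dictionary, so the other scorers can be inspected afterwards.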

Related

Get all prediction values for each CV in GridSearchCV

I have a time-dependent data set, where I (as an example) am trying to do some hyperparameter tuning on a Lasso regression.
For that I use sklearn's TimeSeriesSplit instead of regular KFold CV, i.e. something like this:
tscv = TimeSeriesSplit(n_splits=5)
model = GridSearchCV(
    estimator=pipeline,
    param_grid={"estimator__alpha": np.linspace(0.05, 1, 50)},
    scoring="neg_mean_absolute_percentage_error",
    n_jobs=-1,
    cv=tscv,
    return_train_score=True,
)
model.fit(X_train, y_train)
With this I get a model, which I can then use for predictions etc. The idea behind that cross validation is based on this:
However, my issue is that I would actually like to have the predictions from all the test sets of all CV splits, and I have no idea how to get that out of the model.
If I look at cv_results_, I get the score (from the scoring parameter) for each split and each hyperparameter. But I can't find the predicted values for each test split, and I need them for some backtesting. I don't think it would be "fair" to use the final model to predict the previous values; I would expect some overfitting in that case.
So, is there any way for me to extract the predicted values for each split?
You can pass a custom scoring function to GridSearchCV. Inside it, you can predict outputs with the estimator that GridSearchCV fitted for that particular fold.
From the documentation, the scoring parameter is the
Strategy to evaluate the performance of the cross-validated model on the test set.
from sklearn.metrics import mean_absolute_percentage_error
def custom_scorer(clf, X, y):
    y_pred = clf.predict(X)
    # save y_pred somewhere
    return -mean_absolute_percentage_error(y, y_pred)
model = GridSearchCV(estimator=pipeline,
                     scoring=custom_scorer)
The X and y passed to the scorer come from the test set of the current split. clf is the pipeline given to the estimator parameter, fitted on the training part of that split.
Obviously your estimator should implement the predict method (it should be a valid scikit-learn model). You can add other scorers alongside the custom one to avoid nonsense scores from the custom function.
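As a rough sketch of the "save y_pred somewhere" step, the scorer can append to a plain list. Everything here is an assumption for illustration: synthetic data, a bare Lasso instead of the asker's pipeline, and the names fold_predictions and collecting_scorer. It also assumes n_jobs=1, because with parallel worker processes the list in the main process would most likely stay empty.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in data; the offset keeps y away from zero for MAPE.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 10.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)

fold_predictions = []  # one record per (candidate, test split)

def collecting_scorer(clf, X_test, y_test):
    """Score on the test split and keep the predictions for backtesting."""
    y_pred = clf.predict(X_test)
    fold_predictions.append({
        "alpha": clf.get_params()["alpha"],
        "y_true": np.asarray(y_test),
        "y_pred": y_pred,
    })
    return -mean_absolute_percentage_error(y_test, y_pred)

search = GridSearchCV(
    estimator=Lasso(),
    param_grid={"alpha": [0.05, 0.5]},
    scoring=collecting_scorer,
    cv=TimeSeriesSplit(n_splits=3),
    n_jobs=1,  # assumption: single process so the list is shared
)
search.fit(X, y)

# Out-of-sample predictions of the winning alpha, one entry per split.
best = [r for r in fold_predictions if r["alpha"] == search.best_params_["alpha"]]
print(len(fold_predictions), len(best))
```

With return_train_score=True the scorer would also be called on the training splits, so in that case the records would need a way to tell the two apart.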

Tuned 3 parameters using grid search but the best_estimator_ has only 2 parameters

I am tuning a gradient boosted classifier using a pipeline and grid search.
My pipeline is
pipe = make_pipeline(StandardScaler(with_std=True, with_mean=True),
                     RFE(RandomForestClassifier(), n_features_to_select=15),
                     GradientBoostingClassifier(random_state=42, verbose=True))
The parameter grid is:
tuned_parameters = [{'gradientboostingclassifier__max_depth': range(3, 5),
                     'gradientboostingclassifier__min_samples_split': range(4, 6),
                     'gradientboostingclassifier__learning_rate': np.linspace(0.1, 1, 10)}]
The grid search is done as
grid = GridSearchCV(pipe, tuned_parameters, cv=5, scoring='accuracy', refit=True)
grid.fit(X_train, y_train)
After fitting the model on the training data, when I check grid.best_estimator_ I can only find 2 of the parameters that I am tuning (learning_rate and min_samples_split). I don't find the max_depth parameter in the best estimator.
grid.best_estimator_.named_steps['gradientboostingclassifier'] =
GradientBoostingClassifier(learning_rate=0.9, min_samples_split=5,
random_state=42, verbose=True)
But if I use grid.cv_results_ to find the best 'mean_test_score' and look up the corresponding parameters for that test score, then I can find max_depth in it.
inde = np.where(grid.cv_results_['mean_test_score'] == max(grid.cv_results_['mean_test_score']))
grid.cv_results_['params'][inde[-1][0]]
{'gradientboostingclassifier__learning_rate': 0.9,
 'gradientboostingclassifier__max_depth': 3,
 'gradientboostingclassifier__min_samples_split': 5}
My question now is: if I use the trained pipeline (the object is named 'grid' in my case), will it still use the 'max_depth' parameter or not?
Is it then better to use the 'best parameters' that gave me the best 'mean_test_score', taken from grid.cv_results_?
Your pipeline has been tuned on all three parameters that you specified. It is just that the best value for max_depth happens to be the default value. When printing the classifier, default values will not be included. Compare the following outputs:
print(GradientBoostingClassifier(max_depth=3)) # default
# output: GradientBoostingClassifier()
print(GradientBoostingClassifier(max_depth=5)) # not default
# output: GradientBoostingClassifier(max_depth=5)
In general, it is best practice to access the best parameters through the best_params_ attribute of the fitted GridSearchCV object, since this will always include all tuned parameters:
grid.best_params_
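A small sketch of the same point on toy data. The dataset, the reduced pipeline (no RFE step) and n_estimators=5 are assumptions chosen only to keep it fast. It shows that best_params_ lists every tuned key, and that get_params() on the fitted step reports max_depth even when the printed repr hides it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data and a shortened pipeline (assumptions for illustration).
X, y = make_classification(n_samples=90, n_features=6, random_state=0)
pipe = make_pipeline(StandardScaler(),
                     GradientBoostingClassifier(n_estimators=5, random_state=42))
param_grid = {
    "gradientboostingclassifier__max_depth": [3, 4],
    "gradientboostingclassifier__learning_rate": [0.1, 0.5],
}
grid = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy", refit=True)
grid.fit(X, y)

# Every tuned parameter appears here, default-valued or not.
print(grid.best_params_)

# The refitted step carries the chosen value even if repr() omits it.
gbc = grid.best_estimator_.named_steps["gradientboostingclassifier"]
print(gbc.get_params()["max_depth"])
```

Calling grid.predict(...) uses best_estimator_, so the tuned max_depth is applied either way.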

sklearn: give param to F1 score in gridsearchCV/Pipeline

I have set up a sklearn GridSearchCV with a Pipeline as the estimator. My problem is a multiclass classification, and I receive this error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
This happens because I use the F1 score without setting the average argument. My question is: where exactly should I pass this argument?
my code:
estimator = GridSearchCV(
    estimator=Pipeline(setting['layers']),
    param_grid=setting['hyper_parameters'],
    cv=cv,
    scoring=self.scoring,
    refit=self.refit_metric,
    n_jobs=n_jobs,
    return_train_score=True,
    verbose=True
)
and then:
estimator.fit(
    self.x_train,
    self.y_train
)
The error is raised on the .fit() line, but I guess I should pass the parameter when instantiating the GridSearchCV.
For the scoring parameter of GridSearchCV, you can simply pass a string such as 'f1_weighted'. That should do the trick. You can have a look at the sklearn docs for the other possible values.
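For illustration, here is a minimal sketch of two ways to get the average argument into the scorer: the predefined string, and make_scorer with an explicit keyword. The iris data, the two-step pipeline and the step names scale and clf are assumptions, not the asker's configuration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Three-class toy problem (assumption for illustration).
X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"clf__C": [0.1, 1.0]}

# Option 1: the predefined scorer name.
by_name = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_weighted").fit(X, y)

# Option 2: build the scorer yourself and pass average explicitly.
weighted_f1 = make_scorer(f1_score, average="weighted")
by_scorer = GridSearchCV(pipe, param_grid, cv=3, scoring=weighted_f1).fit(X, y)

print(by_name.best_params_, by_name.best_score_)
print(by_scorer.best_params_, by_scorer.best_score_)
```

Either form can also be used as a value in a scoring dictionary when several metrics are evaluated together with refit.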

Error when using scikit-learn to use pipelines

I am trying to perform scaling using StandardScaler and define a KNeighborsClassifier (i.e. create a pipeline of scaler and estimator).
Finally, I want to create a grid search cross-validator for the above, where param_grid is a dictionary containing n_neighbors as the hyperparameter and k_vals as its values.
def kNearest(k_vals):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
    svp = Pipeline([('ss', StandardScaler()),
                    ('knc', neighbors.KNeighborsClassifier())])
    parameters = {'n_neighbors': k_vals}
    clf = GridSearchCV(estimator=svp, param_grid=parameters, cv=skf)
    return clf
But doing this will give me an error saying that
Invalid parameter n_neighbors for estimator Pipeline. Check the list of available parameters with `estimator.get_params().keys()`.
I've read the documentation, but still don't quite get what the error indicates and how to fix it.
You are right, this is not exactly well-documented by scikit-learn. (Zero reference to it in the class docstring.)
If you use a pipeline as the estimator in a grid search, you need to use a special syntax when specifying the parameter grid. Specifically, you need to use the step name followed by a double underscore, followed by the parameter name as you would pass it to the estimator. I.e.
'<named_step>__<parameter>': value
In your case:
parameters = {'knc__n_neighbors': k_vals}
should do the trick.
Here knc is a named step in your pipeline. There is an attribute that shows these steps as a dictionary:
svp.named_steps
{'knc': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform'),
'ss': StandardScaler(copy=True, with_mean=True, with_std=True)}
And as your traceback alludes to:
svp.get_params().keys()
dict_keys(['memory', 'steps', 'ss', 'knc', 'ss__copy', 'ss__with_mean', 'ss__with_std', 'knc__algorithm', 'knc__leaf_size', 'knc__metric', 'knc__metric_params', 'knc__n_jobs', 'knc__n_neighbors', 'knc__p', 'knc__weights'])
Some official references to this:
The user guide on pipelines
Sample pipeline for text feature extraction and evaluation
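Putting the pieces together, a short runnable sketch of the corrected grid. The iris data and the candidate values [3, 5, 7] are assumptions for illustration; the step names ss and knc are the ones from the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # toy data (assumption)

svp = Pipeline([("ss", StandardScaler()),
                ("knc", KNeighborsClassifier())])

# Only the prefixed name is a valid key for the pipeline.
valid_keys = svp.get_params().keys()
print("knc__n_neighbors" in valid_keys, "n_neighbors" in valid_keys)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
clf = GridSearchCV(estimator=svp,
                   param_grid={"knc__n_neighbors": [3, 5, 7]},
                   cv=skf)
clf.fit(X, y)
print(clf.best_params_)
```

Checking a key against get_params().keys() before fitting is a cheap way to catch this kind of naming mistake.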

GridsearchCV to find the optimum parameter for BIRCH

I am using GridSearchCV to find the optimum parameters for BIRCH. My code is:
RAND_STATE = 50  # for reproducibility and consistency
folds = 3
k_fold = KFold(n_splits=folds, shuffle=True, random_state=RAND_STATE)
hyperparams = {"branching_factor": [50, 100, 200, 300, 400, 500, 600, 700, 800, 900],
               "n_clusters": [5, 7, 9, 11, 13, 17, 21],
               "threshold": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]}
birch = Birch()
def sil_score(ndata):
    labels = ensemble.predict(ndata)
    score = silhouette_score(ndata, labels)
    return score
sil_scorer = make_scorer(sil_score)
ensemble = GridSearchCV(estimator=birch, param_grid=hyperparams, scoring=sil_scorer, cv=k_fold, verbose=10, n_jobs=-1)
ensemble.fit(x)
print(ensemble)
best_parameters = ensemble.best_params_
print(best_parameters)
best_score = ensemble.best_score_
print(best_score)
However, the output gives me an error:
I am confused why the scorer is looking for 4 arguments when I already stated the required parameters for scoring in the sil_score function.
Your scoring function is incorrect. A function wrapped with make_scorer must have the signature sil_score(y_true, y_pred), where y_true are the ground-truth labels and y_pred are the predicted labels. You also do not need to call the ensemble object inside your scoring function to predict labels: GridSearchCV fits the estimator for each fold and hands it to the scorer on its own. Since clustering has no ground-truth labels and silhouette_score needs the data together with the predicted labels, it makes more sense to skip make_scorer and pass a callable with the signature scorer(estimator, X, y=None) as the scoring argument, which predicts on X and returns silhouette_score(X, labels).
Here is an example if you want to see how it works.
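One way this could look is sketched below, assuming synthetic blobs and a much smaller grid than in the question. The name silhouette_scorer and the fallback value of -1.0 for degenerate labelings are choices made here for illustration. The scorer is passed directly as scoring and takes the fitted estimator and the test fold.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption, not the asker's x).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

def silhouette_scorer(estimator, X_fold, y=None):
    """Unsupervised scorer: y is unused because there is no ground truth."""
    labels = estimator.predict(X_fold)
    if len(set(labels)) < 2:
        # silhouette_score needs at least two clusters; use the worst value.
        return -1.0
    return silhouette_score(X_fold, labels)

search = GridSearchCV(
    estimator=Birch(),
    param_grid={"n_clusters": [2, 3, 4], "threshold": [0.3, 0.5]},
    scoring=silhouette_scorer,  # plain callable, no make_scorer
    cv=KFold(n_splits=3, shuffle=True, random_state=50),
)
search.fit(X)  # no y for clustering

print(search.best_params_)
print(search.best_score_)
```

The estimator passed to the scorer is the one fitted on the training part of each fold, so there is no need to reference the GridSearchCV object from inside the function.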
