I have set up a sklearn GridSearchCV with a Pipeline as the estimator. My problem is multiclass classification, and I receive this error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
This is because I use the F1 score without setting the average argument. My question is: where exactly should I pass this argument to the object?
My code:
estimator = GridSearchCV(
    estimator=Pipeline(setting['layers']),
    param_grid=setting['hyper_parameters'],
    cv=cv,
    scoring=self.scoring,
    refit=self.refit_metric,
    n_jobs=n_jobs,
    return_train_score=True,
    verbose=True
)
and then:
estimator.fit(
    self.x_train,
    self.y_train
)
The error is raised on the .fit() line, but I guess I should pass the parameter when instantiating the GridSearchCV.
For the scoring parameter of GridSearchCV, you can just pass e.g. 'f1_weighted' as a string; that predefined scorer calls f1_score with average='weighted'. That should do the trick. You can have a look at the sklearn docs for the other possible string values.
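A minimal sketch of what that looks like in context; the pipeline steps and parameter grid here are placeholders, not your actual setting values:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder pipeline; substitute your own setting['layers'].
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])

search = GridSearchCV(
    estimator=pipe,
    param_grid={'clf__C': [0.1, 1.0, 10.0]},  # placeholder grid
    scoring='f1_weighted',  # string alias for f1_score(..., average='weighted')
    refit=True,
    cv=5,
)
# search.fit(x_train, y_train)  # no longer raises the average='binary' error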
I have a time-dependent data set where, as an example, I am trying to do some hyperparameter tuning on a Lasso regression.
For that I use sklearn's TimeSeriesSplit instead of regular KFold CV, i.e. something like this:
tscv = TimeSeriesSplit(n_splits=5)
model = GridSearchCV(
    estimator=pipeline,
    param_grid={"estimator__alpha": np.linspace(0.05, 1, 50)},
    scoring="neg_mean_absolute_percentage_error",
    n_jobs=-1,
    cv=tscv,
    return_train_score=True,
)
model.fit(X_train, y_train)
With this I get a model, which I can then use for predictions etc. The idea behind that cross-validation is the expanding-window scheme that TimeSeriesSplit produces.
However, my issue is that I would actually like to have the predictions from the test set of every CV split, and I have no idea how to get those out of the model.
If I look at cv_results_ I get the score (from the scoring parameter) for each split and each hyperparameter combination, but I don't seem to be able to find the predicted values for each sample in each test split. I actually need those for some backtesting: I don't think it would be "fair" to use the final model to predict the earlier values, since I would expect some kind of overfitting in that case.
So, is there any way for me to extract the predicted values for each split?
You can use a custom scoring function in GridSearchCV. Inside it, you can predict outputs with the estimator that GridSearchCV fitted on that particular fold.
From the documentation, the scoring parameter is:
Strategy to evaluate the performance of the cross-validated model on the test set.
from sklearn.metrics import mean_absolute_percentage_error

def custom_scorer(clf, X, y):
    y_pred = clf.predict(X)
    # save y_pred somewhere
    return -mean_absolute_percentage_error(y, y_pred)

model = GridSearchCV(estimator=pipeline,
                     scoring=custom_scorer)
The X and y passed to the scorer come from the test split of the current fold, and clf is the pipeline you gave to the estimator parameter.
Obviously your estimator should implement the predict method (i.e. be a valid scikit-learn model). You can also add other scorers alongside the custom one, to guard against nonsense scores coming from the custom function alone.
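A minimal sketch of one way to actually save those per-fold test predictions; the fold_predictions list and its handling are my own additions, not part of the GridSearchCV API:

import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

fold_predictions = []  # filled once per (fold, parameter combination) evaluation

def custom_scorer(clf, X, y):
    y_pred = clf.predict(X)
    # keep the held-out targets and predictions for later backtesting
    fold_predictions.append({'y_true': np.asarray(y), 'y_pred': y_pred})
    return -mean_absolute_percentage_error(y, y_pred)

Note that this only works with n_jobs=1: with parallel workers the list is filled in child processes and never reaches the parent. If you need parallelism, a more robust option is to loop over tscv.split(X_train) yourself, fit the pipeline on each training window, and predict the corresponding test window.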
My question seems to be similar to this one but there is no solid answer there.
I'm doing multi-class, multi-label classification, and for that I have defined my own scorers. However, to use the refit parameter and get the best parameters of the model at the end, we need to point refit at one of the scorer functions. If I do so, I get the error missing 1 required positional argument: 'y_pred'. I assumed y_pred would be the outcome of fit, but I'm not sure where this issue is coming from or how I can solve it.
Below is the code:
scoring = {'roc_auc_score': make_scorer(roc_auc_score),
           'precision_score': make_scorer(precision_score, average='samples'),
           'recall_score': make_scorer(recall_score, average='samples')}

params = {'estimator__n_estimators': [500, 800],
          'estimator__max_depth': [10, 50]}

model = xgb.XGBClassifier(n_jobs=4)
model = MultiOutputClassifier(model)

cls = GridSearchCV(model, params, cv=3, refit=make_scorer(roc_auc_score),
                   scoring=scoring, verbose=3, n_jobs=-1)
model = cls.fit(x_train_ups, y_train_ups)
print(model.best_params_)
You should use refit="roc_auc_score", the name of the scorer in your dictionary. From the docs:
For multiple metric evaluation, this needs to be a str denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.
Using a callable for refit has a different purpose: the callable should take the cv_results_ dict and return the best_index_. That explains the error message: sklearn is trying to pass cv_results_ to your auc scorer function, but that function should take parameters y_true and y_pred.
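A minimal sketch of the corrected call, reusing the scoring dict and params from the question:

cls = GridSearchCV(
    model,
    params,
    cv=3,
    scoring=scoring,        # the dict of named scorers defined above
    refit='roc_auc_score',  # refer to the scorer by its key, not via make_scorer(...)
    verbose=3,
    n_jobs=-1,
)
cls.fit(x_train_ups, y_train_ups)
print(cls.best_params_)  # best parameters according to the refit metric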
I am trying to perform scaling using StandardScaler and define a KNeighborsClassifier (i.e. create a pipeline of scaler and estimator).
Finally, I want to create a grid-search cross-validator for the above, where param_grid is a dictionary with n_neighbors as the hyperparameter and k_vals as its values.
def kNearest(k_vals):
    skf = StratifiedKFold(n_splits=5, random_state=23)
    svp = Pipeline([('ss', StandardScaler()),
                    ('knc', neighbors.KNeighborsClassifier())])
    parameters = {'n_neighbors': k_vals}
    clf = GridSearchCV(estimator=svp, param_grid=parameters, cv=skf)
    return clf
But doing this gives me an error saying:
Invalid parameter n_neighbors for estimator Pipeline. Check the list of available parameters with `estimator.get_params().keys()`.
I've read the documentation, but still don't quite get what the error indicates and how to fix it.
You are right, this is not exactly well-documented by scikit-learn. (Zero reference to it in the class docstring.)
If you use a pipeline as the estimator in a grid search, you need to use a special syntax when specifying the parameter grid. Specifically, you need to use the step name followed by a double underscore, followed by the parameter name as you would pass it to the estimator. I.e.
'<named_step>__<parameter>': value
In your case:
parameters = {'knc__n_neighbors': k_vals}
should do the trick.
Here knc is a named step in your pipeline. There is an attribute that shows these steps as a dictionary:
svp.named_steps
{'knc': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                             metric_params=None, n_jobs=1, n_neighbors=5, p=2,
                             weights='uniform'),
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True)}
And as your traceback alludes to:
svp.get_params().keys()
dict_keys(['memory', 'steps', 'ss', 'knc', 'ss__copy', 'ss__with_mean', 'ss__with_std', 'knc__algorithm', 'knc__leaf_size', 'knc__metric', 'knc__metric_params', 'knc__n_jobs', 'knc__n_neighbors', 'knc__p', 'knc__weights'])
Some official references to this:
The user guide on pipelines
Sample pipeline for text feature extraction and evaluation
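Putting it together, a corrected version of the function from the question might look like this; the shuffle=True is my addition, since recent scikit-learn versions require it when random_state is set on StratifiedKFold:

from sklearn import neighbors
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def kNearest(k_vals):
    # shuffle=True so that random_state has an effect in newer sklearn versions
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
    svp = Pipeline([('ss', StandardScaler()),
                    ('knc', neighbors.KNeighborsClassifier())])
    parameters = {'knc__n_neighbors': k_vals}  # <named_step>__<parameter>
    clf = GridSearchCV(estimator=svp, param_grid=parameters, cv=skf)
    return clf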
I am using GridSearchCV to find the optimal parameters for BIRCH. My code is:
RAND_STATE = 50  # for reproducibility and consistency
folds = 3
k_fold = KFold(n_splits=folds, shuffle=True, random_state=RAND_STATE)

hyperparams = {"branching_factor": [50, 100, 200, 300, 400, 500, 600, 700, 800, 900],
               "n_clusters": [5, 7, 9, 11, 13, 17, 21],
               "threshold": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]}

birch = Birch()

def sil_score(ndata):
    labels = ensemble.predict(ndata)
    score = silhouette_score(ndata, labels)
    return score

sil_scorer = make_scorer(sil_score)

ensemble = GridSearchCV(estimator=birch, param_grid=hyperparams,
                        scoring=sil_scorer, cv=k_fold, verbose=10, n_jobs=-1)
ensemble.fit(x)
print(ensemble)

best_parameters = ensemble.best_params_
print(best_parameters)

best_score = ensemble.best_score_
print(best_score)
However, the output gives me an error.
I am confused about why the scorer is looking for 4 arguments when I have already stated the required parameters for scoring in the sil_score function.
Your scoring function is incorrect. With make_scorer, the wrapped function should have the signature sil_score(y_true, y_pred), where y_true are the ground-truth labels and y_pred are the predicted labels. You also do not need to separately predict the labels with the ensemble object inside your scoring function; calling your ensemble to predict labels inside the scorer is not required at all. In your case it makes more sense to use silhouette_score directly as the scoring function: just pass silhouette_score as the scoring function and GridSearchCV will take care of predicting and scoring on its own.
Here is an example if you want to see how it works.
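As a hedged alternative (not from the answer above): if you want to keep full control of how the score is computed, a scorer callable with the (estimator, X, y) signature also works for clustering and avoids referring to the global ensemble object. A minimal sketch, reusing the hyperparams grid from the question:

from sklearn.cluster import Birch
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV, KFold

def sil_scorer(estimator, X, y=None):
    # GridSearchCV calls this with the fitted estimator and the held-out fold;
    # y is ignored since this is an unsupervised problem.
    labels = estimator.predict(X)
    return silhouette_score(X, labels)

ensemble = GridSearchCV(estimator=Birch(),
                        param_grid=hyperparams,  # the grid defined in the question
                        scoring=sil_scorer,
                        cv=KFold(n_splits=3, shuffle=True, random_state=50),
                        n_jobs=-1)
# ensemble.fit(x)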
I'm using GridSearchCV and a pipeline to classify some text documents. A code snippet:
clf = Pipeline([('vect', TfidfVectorizer()), ('clf', SVC())])

parameters = {'vect__ngram_range': [(1, 2)], 'vect__min_df': [2], 'vect__stop_words': ['english'],
              'vect__lowercase': [True], 'vect__norm': ['l2'], 'vect__analyzer': ['word'], 'vect__binary': [True],
              'clf__kernel': ['rbf'], 'clf__C': [100], 'clf__gamma': [0.01], 'clf__probability': [True]}

grid_search = GridSearchCV(clf, parameters, n_jobs=-2, refit=True, cv=10)
grid_search.fit(corpus, labels)
My problem is that when using grid_search.predict_proba(new_doc) and then wanting to find out what classes the probabilities correspond to with grid_search.classes_, I get the following error:
AttributeError: 'GridSearchCV' object has no attribute 'classes_'
What have I missed? I thought that if the last "step" in the pipeline was a classifier, then the return of GridSearchCV is also a classifier. Hence one can use the attributes of that classifier, e.g. classes_.
As mentioned in the comments above, grid_search.best_estimator_.classes_ returned an error message, since best_estimator_ is a pipeline that has no .classes_ attribute. However, by first accessing the classifier step of the pipeline I was able to use the classes_ attribute. Here is the solution:
grid_search.best_estimator_.named_steps['clf'].classes_
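To tie this back to predict_proba, a short sketch of how the two line up (assuming new_doc is a list of documents, as in the question):

import numpy as np

proba = grid_search.predict_proba(new_doc)  # shape (n_docs, n_classes)
classes = grid_search.best_estimator_.named_steps['clf'].classes_

# column i of proba corresponds to classes[i]
for doc_probs in proba:
    print(classes[np.argmax(doc_probs)], dict(zip(classes, doc_probs)))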
Try grid_search.best_estimator_.classes_.
The return of GridSearchCV is a GridSearchCV instance which is not really an estimator itself. Rather, it instantiates a new estimator for each parameter combination it tries (see the docs).
You may think the return value is a classifier because you can use methods such as predict or predict_proba when refit=True, but GridSearchCV.predict_proba actually looks like this (from the source):
def predict_proba(self, X):
    """Call predict_proba on the estimator with the best found parameters.

    Only available if ``refit=True`` and the underlying estimator supports
    ``predict_proba``.

    Parameters
    ----------
    X : indexable, length n_samples
        Must fulfill the input assumptions of the
        underlying estimator.
    """
    return self.best_estimator_.predict_proba(X)
Hope this helps.