GridSearchCV to find the optimum parameters for BIRCH - Python

I am using GridSearchCV to find the optimum parameters for BIRCH. My code is:
RAND_STATE = 50  # for reproducibility and consistency
folds = 3
k_fold = KFold(n_splits=folds, shuffle=True, random_state=RAND_STATE)

hyperparams = {"branching_factor": [50, 100, 200, 300, 400, 500, 600, 700, 800, 900],
               "n_clusters": [5, 7, 9, 11, 13, 17, 21],
               "threshold": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]}

birch = Birch()

def sil_score(ndata):
    labels = ensemble.predict(ndata)
    score = silhouette_score(ndata, labels)
    return score

sil_scorer = make_scorer(sil_score)

ensemble = GridSearchCV(estimator=birch, param_grid=hyperparams, scoring=sil_scorer,
                        cv=k_fold, verbose=10, n_jobs=-1)
ensemble.fit(x)
print(ensemble)

best_parameters = ensemble.best_params_
print(best_parameters)

best_score = ensemble.best_score_
print(best_score)
However, the output gives me an error: the score function is looking for 4 arguments. I am confused why that is, when I already stated the required parameters for scoring in the sil_score function.

Your scoring function is incorrect. The expected signature is sil_score(y_true, y_pred), where y_true are the ground-truth labels and y_pred are the predicted labels. You also do not need to predict the labels with the ensemble object inside your scoring function. In your case it makes more sense to use silhouette_score directly as the scoring function: just pass it as the scorer and GridSearchCV will take care of the predictions on its own.
Here is an example if you want to see how it works.
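For instance, here is a minimal sketch of this idea (my own illustration, not the example originally linked): GridSearchCV also accepts a plain callable with the (estimator, X, y) signature, which lets the silhouette score drive the search without referring to the search object from inside the scorer.

from sklearn.cluster import Birch
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV, KFold

def silhouette_scorer(estimator, X, y=None):
    # estimator is already fitted on the training fold; y is ignored (unsupervised)
    labels = estimator.predict(X)
    return silhouette_score(X, labels)

search = GridSearchCV(
    estimator=Birch(),
    param_grid={"branching_factor": [50, 100],   # a small grid just for illustration
                "n_clusters": [5, 7],
                "threshold": [0.3, 0.5]},
    scoring=silhouette_scorer,
    cv=KFold(n_splits=3, shuffle=True, random_state=50),
    n_jobs=-1,
)
# search.fit(x)  # x being the asker's data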

Related

Is there a parameter for GridSearchCV to select the best with the lowest difference between train and test set?

My goal is to get a well-fitted model (train and test set metrics differ by only 1% to 5%), because the Random Forest tends to overfit (with the default params, the train set F1 score for class 1 is 1.0).
The problem is that GridSearchCV only considers the test set metrics; it disregards the train set metrics. Therefore, the result is still an overfitted model.
What I've tried:
I tried to access the cv_results_ attribute, but there is a ton of output, I am not sure how to read it, and I believe we are not supposed to do that manually.
The code:
# model definition
rf_cv = GridSearchCV(estimator=rf_clf_default,
                     # what the user cares about is the model's ability to find class 1
                     scoring=make_scorer(score_func=f1_score, pos_label=1),
                     param_grid={'randomforestclassifier__n_estimators': [37, 38, 39, 100, 200],
                                 'randomforestclassifier__max_depth': [4, 5, 6, 10, 20, 30],
                                 'randomforestclassifier__min_samples_leaf': [2, 3, 4]},
                     return_train_score=True,
                     refit=True)

# ignore OneHotEncoder warning about unknown categories
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=UserWarning)
    # Train the algorithm
    rf_cv.fit(X=X_train, y=y_train)

# get the recall score for label 1
print("best recall score class 1", rf_cv.best_score_)
# get the best params
display("best parameters", rf_cv.best_params_)
You can provide a callable for the refit parameter:
Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.
For example, if you want to only consider hyperparameters whose train and test recall scores are within 0.05:
import pandas as pd

def my_refit_criteria(cv_results_):
    cv_frame = pd.DataFrame(cv_results_)
    candidate_mask = cv_frame['mean_train_recall'] - cv_frame['mean_test_recall'] < 0.05
    if candidate_mask.sum() > 0:
        candidates = cv_frame[candidate_mask]
    else:
        # if none, just pick the best
        candidates = cv_frame
    return candidates['mean_test_recall'].idxmax()

search = GridSearchCV(..., refit=my_refit_criteria)
(I haven't tested this; if you see errors let me know.)
There's a more complex example in the docs:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
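To tie this back to the question's setup: with a single unnamed scorer such as make_scorer(f1_score, pos_label=1), the cv_results_ keys are mean_train_score / mean_test_score rather than mean_train_recall / mean_test_recall, so the refit callable has to use those names. Here is a sketch under that assumption (reusing rf_clf_default, X_train and y_train from the question; untested):

import pandas as pd
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

def small_gap_refit(cv_results_):
    # prefer candidates whose train/test gap is below 0.05; fall back to all
    cv_frame = pd.DataFrame(cv_results_)
    gap = cv_frame['mean_train_score'] - cv_frame['mean_test_score']
    candidates = cv_frame[gap < 0.05] if (gap < 0.05).any() else cv_frame
    return candidates['mean_test_score'].idxmax()   # returned as best_index_

rf_cv = GridSearchCV(
    estimator=rf_clf_default,                        # pipeline from the question
    scoring=make_scorer(score_func=f1_score, pos_label=1),
    param_grid={'randomforestclassifier__n_estimators': [100, 200],
                'randomforestclassifier__max_depth': [4, 10, 30]},
    return_train_score=True,                         # needed so mean_train_score exists
    refit=small_gap_refit,
)
# rf_cv.fit(X_train, y_train)
Note that with a callable refit, best_estimator_ and best_params_ are set, but best_score_ is not available.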

Get all prediction values for each CV in GridSearchCV

I have a time-dependent data set, where I (as an example) am trying to do some hyperparameter tuning on a Lasso regression.
For that I use sklearn's TimeSeriesSplit instead of regular KFold CV, i.e. something like this:
tscv = TimeSeriesSplit(n_splits=5)

model = GridSearchCV(
    estimator=pipeline,
    param_grid={"estimator__alpha": np.linspace(0.05, 1, 50)},
    scoring="neg_mean_absolute_percentage_error",
    n_jobs=-1,
    cv=tscv,
    return_train_score=True,
)
model.fit(X_train, y_train)
With this I get a model, which I can then use for predictions etc. The idea behind that cross validation is based on this:
However, my issue is that I would actually like to have the predictions from all the test sets of all the CV splits, and I have no idea how to get that out of the model.
If I try cv_results_ I get the score (from the scoring parameter) for each split and each hyperparameter, but I don't seem to be able to find the prediction values for each observation in each test split. I actually need those for some backtesting: I don't think it would be "fair" to use the final model to predict the earlier values, since I would imagine there would be some kind of overfitting in that case.
So, is there any way for me to extract the predicted values for each split?
You can pass a custom scoring function to GridSearchCV. With that, you can predict outputs with the estimator that was fitted on that particular fold.
From the documentation, the scoring parameter is the:
Strategy to evaluate the performance of the cross-validated model on the test set.
from sklearn.metrics import mean_absolute_percentage_error

def custom_scorer(clf, X, y):
    y_pred = clf.predict(X)
    # save y_pred somewhere
    return -mean_absolute_percentage_error(y, y_pred)

model = GridSearchCV(estimator=pipeline,
                     scoring=custom_scorer)
The X and y passed to the scorer come from the test fold; clf is the pipeline given to the estimator parameter.
Obviously your estimator should implement the predict method (i.e., it should be a valid scikit-learn model). You can also add other scorers alongside the custom one, as a sanity check on the scores coming out of the custom function.
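To make the "save y_pred somewhere" part concrete, here is a minimal sketch (my own illustration, not from the answer above) that collects each fold's predictions in a list. It assumes n_jobs=1 so the scorer runs in the main process, and uses a plain Lasso in place of the question's pipeline:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

fold_predictions = []  # one entry per (candidate parameters, fold) evaluation

def custom_scorer(estimator, X, y):
    y_pred = estimator.predict(X)
    fold_predictions.append({"y_true": np.asarray(y), "y_pred": y_pred})
    return -mean_absolute_percentage_error(y, y_pred)

model = GridSearchCV(
    estimator=Lasso(),
    param_grid={"alpha": np.linspace(0.05, 1, 10)},
    scoring=custom_scorer,
    cv=TimeSeriesSplit(n_splits=5),
    n_jobs=1,  # keep everything in one process so the list actually gets filled
)
# model.fit(X_train, y_train)
# afterwards: len(fold_predictions) == n_candidates * n_splits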

Tuned 3 parameters using grid search but the best_estimator_ has only 2 parameters

I am tuning a gradient boosted classifier using a pipeline and grid search.
My pipeline is:
pipe = make_pipeline(StandardScaler(with_std=True, with_mean=True),
                     RFE(RandomForestClassifier(), n_features_to_select=15),
                     GradientBoostingClassifier(random_state=42, verbose=True))
The parameter grid is:
tuned_parameters = [{'gradientboostingclassifier__max_depth': range(3, 5),
                     'gradientboostingclassifier__min_samples_split': range(4, 6),
                     'gradientboostingclassifier__learning_rate': np.linspace(0.1, 1, 10)}]
The grid search is done as
grid = GridSearchCV(pipe, tuned_parameters, cv=5, scoring='accuracy', refit=True)
grid.fit(X_train, y_train)
After fitting the model on the train data, when I check grid.best_estimator_ I can only find two of the parameters (learning_rate and min_samples_split) that I am tuning. I don't find the max_depth parameter in the best estimator.
grid.best_estimator_.named_steps['gradientboostingclassifier']
# output: GradientBoostingClassifier(learning_rate=0.9, min_samples_split=5,
#                                    random_state=42, verbose=True)
But if I use grid.cv_results_ to find the best mean_test_score and then look up the corresponding parameters for that score, I can find max_depth in it:
inde = np.where(grid.cv_results_['mean_test_score'] == max(grid.cv_results_['mean_test_score']))
grid.cv_results_['params'][inde[-1][0]]
# output:
# {'gradientboostingclassifier__learning_rate': 0.9,
#  'gradientboostingclassifier__max_depth': 3,
#  'gradientboostingclassifier__min_samples_split': 5}
My doubt now is: if I use the trained pipeline (the object named grid in my case), will it still use the max_depth parameter or not?
Is it then better to use the 'best parameters' which gave me the best mean_test_score, taken from grid.cv_results_?
Your pipeline has been tuned on all three parameters that you specified. It is just that the best value for max_depth happens to be the default value. When printing the classifier, default values will not be included. Compare the following outputs:
print(GradientBoostingClassifier(max_depth=3)) # default
# output: GradientBoostingClassifier()
print(GradientBoostingClassifier(max_depth=5)) # not default
# output: GradientBoostingClassifier(max_depth=5)
In general, it is best-practice to access the best parameters by the best_params_ attribute of the fitted GridSearchCV object since this will always include all parameters:
grid.best_params_
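To address the asker's doubt directly: the refit pipeline does carry max_depth=3; it is only hidden by the repr. You can read it back explicitly, for example (a quick check using the objects from the question):

best_gbc = grid.best_estimator_.named_steps['gradientboostingclassifier']
print(best_gbc.get_params()['max_depth'])  # 3, even though the repr omits it
print(grid.best_params_)                   # includes gradientboostingclassifier__max_depth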

GridSearchCV with Scoring Function and Refit Parameter

My question seems to be similar to this one, but there is no solid answer there.
I'm doing multi-class, multi-label classification, and for that I have defined my own scorers. However, in order to use the refit parameter and get the best parameters of the model at the end, we need to pass one of the scorer functions to refit. If I do so, I get the error missing 1 required positional argument: 'y_pred'. y_pred should be the outcome of fit, but I'm not sure where this issue is coming from or how I can solve it.
Below is the code:
scoring = {'roc_auc_score': make_scorer(roc_auc_score),
           'precision_score': make_scorer(precision_score, average='samples'),
           'recall_score': make_scorer(recall_score, average='samples')}

params = {'estimator__n_estimators': [500, 800],
          'estimator__max_depth': [10, 50]}

model = xgb.XGBClassifier(n_jobs=4)
model = MultiOutputClassifier(model)

cls = GridSearchCV(model, params, cv=3, refit=make_scorer(roc_auc_score),
                   scoring=scoring, verbose=3, n_jobs=-1)
model = cls.fit(x_train_ups, y_train_ups)
print(model.best_params_)
print(model.best_params_)
You should use refit="roc_auc_score", the name of the scorer in your dictionary. From the docs:
For multiple metric evaluation, this needs to be a str denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.
Using a callable for refit has a different purpose: the callable should take the cv_results_ dict and return the best_index_. That explains the error message: sklearn is trying to pass cv_results_ to your auc scorer function, but that function should take parameters y_true and y_pred.
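In other words, only the refit argument needs to change; a minimal sketch based on the question's code:

cls = GridSearchCV(model, params, cv=3,
                   refit="roc_auc_score",   # the key name from the scoring dict
                   scoring=scoring, verbose=3, n_jobs=-1)
model = cls.fit(x_train_ups, y_train_ups)
print(model.best_params_)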

Scoring in GridSearchCV for XGBoost

I'm currently trying to analyze data for the first time using XGBoost. I want to find the best parameters using GridSearchCV. I want to minimize the root mean squared error, and to do this I used "rmse" as the eval_metric. However, scoring in grid search does not have such a metric. I found on this site that "neg_mean_squared_error" does the same, but it gives me different results than the RMSE: when I take the square root of the absolute value of "neg_mean_squared_error" I get a value of around 8.9, while a different function gives me an RMSE of about 4.4.
I don't know what goes wrong or how I can get these two functions to agree/give the same values.
Because of this problem, I get wrong values as best_params_, which give me a higher RMSE than some of the values I initially started with to tune.
Can anyone please explain how to score on the RMSE in the grid search, or why my code gives different values?
Thanks in advance.
def modelfit(alg, trainx, trainy, useTrainCV=True, cv_folds=10, early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(trainx, label=trainy)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='rmse', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    alg.fit(trainx, trainy, eval_metric='rmse')

    # Predict training set:
    dtrain_predictions = alg.predict(trainx)
    # dtrain_predprob = alg.predict_proba(trainy)[:, 1]
    print(dtrain_predictions)
    print(np.sqrt(mean_squared_error(trainy, dtrain_predictions)))

    # Print model report:
    print("\nModel Report")
    print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(trainy, dtrain_predictions)))


param_test2 = {
    'max_depth': [6, 7, 8],
    'min_child_weight': [2, 3, 4]
}

grid2 = GridSearchCV(estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=2000, max_depth=5,
                                                min_child_weight=2, gamma=0, subsample=0.8,
                                                colsample_bytree=0.8, objective='reg:linear',
                                                nthread=4, scale_pos_weight=1, random_state=4),
                     param_grid=param_test2, scoring='neg_mean_squared_error',
                     n_jobs=4, iid=False, cv=10, verbose=20)
grid2.fit(X_train, y_train)

# Mean cross-validated score of the best_estimator
print(grid2.best_params_, np.sqrt(np.abs(grid2.best_score_)))
print(np.sqrt(np.abs(grid2.score(X_train, y_train))))

modelfit(grid2.best_estimator_, X_train, y_train)
print(np.sqrt(np.abs(grid2.score(X_train, y_train))))
In GridSearchCV the scoring parameter is defined so that higher values are always better than lower values. In your example, neg_mean_squared_error is just the negated mean squared error, so you should not interpret it as an RMSE directly; rather, in your cross-validation you should compare values of neg_mean_squared_error, where a higher value is better than a lower one.
This behavior is mentioned in the scoring parameter section of the model evaluation documentation:
Scikit-Learn Scoring Parameter Documentation
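If you want the grid search to report something that reads directly as an RMSE, one option (my addition, assuming scikit-learn >= 0.22) is the built-in neg_root_mean_squared_error scorer; otherwise, take the square root of the negated best_score_. A sketch:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

grid2 = GridSearchCV(
    estimator=xgb.XGBRegressor(objective='reg:squarederror', random_state=4),
    param_grid={'max_depth': [6, 7, 8], 'min_child_weight': [2, 3, 4]},
    scoring='neg_root_mean_squared_error',  # higher (closer to 0) is better
    cv=10,
)
# after grid2.fit(X_train, y_train):
#   rmse = -grid2.best_score_
# or, with scoring='neg_mean_squared_error':
#   rmse = np.sqrt(-grid2.best_score_)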
It's because XGBRegressor.score returns the coefficient of determination (R²) of the prediction, not the RMSE.
