How to modify GridSearchCV fit() method - python

I would like to ask how I could modify the fit() method of GridSearchCV so that I can pass some extra parameters through to a custom cross-validator. Here is the code snippet that I'm using:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

# CombPurgedKFoldCV is a third-party combinatorial purged K-fold cross-validator (not part of scikit-learn)
skf = CombPurgedKFoldCV(n_splits=10, n_test_splits=2, embargo_td=pd.Timedelta(minutes=100))
clf = DecisionTreeClassifier(criterion='entropy', max_features='auto',
                             class_weight='balanced', min_weight_fraction_leaf=0.)
classifier = BaggingClassifier(base_estimator=clf, n_estimators=1000, max_features=1.,
                               max_samples=avgU, oob_score=True, n_jobs=1)
gs = GridSearchCV(estimator=classifier, param_grid=grid_param, scoring='f1', n_jobs=1, cv=skf)
gs.fit(X, y, pred_times, eval_times)
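One possible way around modifying GridSearchCV.fit() is to bind the extra arguments into the cross-validator before handing it to the search. This is only a minimal sketch, under the assumption that CombPurgedKFoldCV.split() accepts pred_times and eval_times keyword arguments; the wrapper name BoundTimesCV is made up for illustration:

class BoundTimesCV:
    """Wrap a splitter so that pred_times/eval_times are supplied up front."""
    def __init__(self, cv, pred_times, eval_times):
        self.cv = cv
        self.pred_times = pred_times
        self.eval_times = eval_times

    def split(self, X, y=None, groups=None):
        # Forward the bound time stamps to the wrapped splitter (assumed signature).
        return self.cv.split(X, y, pred_times=self.pred_times,
                             eval_times=self.eval_times)

    def get_n_splits(self, X=None, y=None, groups=None):
        # Forward if the wrapped splitter provides this method (an assumption here).
        return self.cv.get_n_splits(X, y, groups)

gs = GridSearchCV(estimator=classifier, param_grid=grid_param, scoring='f1',
                  n_jobs=1, cv=BoundTimesCV(skf, pred_times, eval_times))
gs.fit(X, y)  # no custom fit() needed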

Related

How to set AUC as scoring method while searching for hyperparameters?

I want to perform a random search for a classification problem, where the scoring method will be AUC instead of the accuracy score. Have a look at my code for reproducibility:
# Define imports and create data
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
x = np.random.normal(0, 1, 100)
y = np.random.binomial(1, 0.5, 100)  # binary target (one Bernoulli trial per sample)
### Let's define parameter grid
rf = RandomForestClassifier(random_state=0)
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=4)]
min_samples_split = [2, 5, 10]
param_grid = {'n_estimators': n_estimators,
              'min_samples_split': min_samples_split}
# Define model
clf = RandomizedSearchCV(rf,
                         param_grid,
                         random_state=0,
                         n_iter=3,
                         cv=5).fit(x.reshape(-1, 1), y)
Now, according to the documentation of RandomizedSearchCV, I can pass another argument, scoring, which chooses the metric used to evaluate the model. I tried to pass scoring = auc, but I got an error saying there is no such metric. Do you know what I have to do to get AUC instead of accuracy?
According to the documentation of RandomizedSearchCV, scoring can be a string or a callable. Here you can find all possible string values for the scoring parameter. You can also set scoring to a callable that computes AUC.
As explained by Danylo and this answer, you can set the search's optimization metric to ROC-AUC, so that it picks the parameter values maximizing it:
clf = RandomizedSearchCV(rf,
                         param_grid,
                         random_state=0,
                         n_iter=3,
                         cv=5,
                         scoring='roc_auc').fit(x.reshape(-1, 1), y)
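If you prefer the callable route mentioned above, a minimal sketch builds the scorer explicitly from roc_auc_score (names here are illustrative):

from sklearn.metrics import make_scorer, roc_auc_score

# Build an AUC scorer that uses predict_proba rather than hard class predictions.
# (In newer scikit-learn versions this option is spelled response_method='predict_proba'.)
auc_scorer = make_scorer(roc_auc_score, needs_proba=True)

clf = RandomizedSearchCV(rf,
                         param_grid,
                         random_state=0,
                         n_iter=3,
                         cv=5,
                         scoring=auc_scorer).fit(x.reshape(-1, 1), y)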

Adding feature scaling to a nested cross-validation with randomized search and recursive feature elimination

I have a classification task and want to use a repeated nested cross-validation to simultaneously perform hyperparameter tuning and feature selection. For this, I am running RandomizedSearchCV on RFECV using Python's sklearn library, as suggested in this SO answer.
However, I additionally need to scale my features and impute some missing values first. Those two steps should also be included into the CV framework to avoid information leakage between training and test folds. I tried to create a Pipeline to get there but I think it "destroys" my CV-nesting (i.e., performs the RFECV and random search separately from each other):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFECV
import scipy.stats as stats
from sklearn.utils.fixes import loguniform
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
# create example data with missings
Xtrain, ytrain = make_classification(n_samples = 500,
                                     n_features = 150,
                                     n_informative = 25,
                                     n_redundant = 125,
                                     random_state = 1897)
c = 10000 # number of missings
Xtrain.ravel()[np.random.choice(Xtrain.size, c, replace = False)] = np.nan # introduce random missings
folds = 5
repeats = 5
rskfold = RepeatedStratifiedKFold(n_splits = folds, n_repeats = repeats, random_state = 1897)
n_iter = 100
scl = StandardScaler()
imp = KNNImputer(n_neighbors = 5, weights = 'uniform')
sgdc = SGDClassifier(loss = 'log', penalty = 'elasticnet', class_weight = 'balanced', random_state = 1897)
sel = RFECV(sgdc, cv = folds)
pipe = Pipeline([('scaler', scl),
                 ('imputer', imp),
                 ('selector', sel),
                 ('clf', sgdc)])
param_rand = {'clf__l1_ratio': stats.uniform(0, 1),
              'clf__alpha': loguniform(0.001, 1)}
rskfold_search = RandomizedSearchCV(pipe, param_rand, n_iter = n_iter, cv = rskfold, scoring = 'accuracy', random_state = 1897, verbose = 1, n_jobs = -1)
rskfold_search.fit(Xtrain, ytrain)
Does anyone know how to include scaling and imputation into the CV framework without losing the nesting of my RandomizedSearchCV and RFECV?
Any help is highly appreciated!
You haven't lost the nested cv.
You have a search object at the top level; when you call fit, it splits the data into multiple folds. Let's focus on one such train fold. Your pipeline gets fitted on that, so you scale and impute, then the RFECV gets it to split into inner folds. Finally a new estimator gets fitted on the outer training fold, and scored on the outer testing fold.
That means the RFE is getting perhaps a little leakage, since scaling and imputing happen before its splits. You can add them in a pipeline before the estimator, and use that pipeline as the RFE estimator. And since RFECV refits its estimator using the discovered optimal number of features and exposes that for predict and so on, you don't really need the second copy of sgdc; using just the one copy has the side effect of hyperparameter-tuning the selection as well:
scl = StandardScaler()
imp = KNNImputer(n_neighbors=5, weights='uniform')
sgdc = SGDClassifier(loss='log', penalty='elasticnet', class_weight='balanced', random_state=1897)
base_pipe = Pipeline([
    ('scaler', scl),
    ('imputer', imp),
    ('clf', sgdc),
])
# importance_getter tells RFECV where to find the coefficients inside the pipeline
sel = RFECV(base_pipe, cv=folds, importance_getter='named_steps.clf.coef_')
param_rand = {'estimator__clf__l1_ratio': stats.uniform(0, 1),
              'estimator__clf__alpha': loguniform(0.001, 1)}
rskfold_search = RandomizedSearchCV(sel, param_rand, n_iter=n_iter, cv=rskfold, scoring='accuracy', random_state=1897, verbose=1, n_jobs=-1)
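For completeness, a short usage sketch (not in the original answer) showing how the tuned selection can be inspected after fitting; with refit enabled, best_estimator_ is the refitted RFECV:

rskfold_search.fit(Xtrain, ytrain)

# The refit RFECV with the best-found hyperparameters
best_rfecv = rskfold_search.best_estimator_
print(rskfold_search.best_params_)   # tuned l1_ratio and alpha
print(best_rfecv.n_features_)        # number of features RFECV kept
print(best_rfecv.support_)           # boolean mask of selected features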

Combine GridSearchCV and StackingClassifier

I want to use StackingClassifier to combine some classifiers and then use GridSearchCV to optimize the parameters:
clf1 = RandomForestClassifier()
clf2 = LogisticRegression()
dt = DecisionTreeClassifier()
sclf = StackingClassifier(estimators=[clf1, clf2],final_estimator=dt)
params = {'randomforestclassifier__n_estimators': [10, 50],
          'logisticregression__C': [1, 2, 3]}
grid = GridSearchCV(estimator=sclf, param_grid=params, cv=5)
grid.fit(x, y)
But this throws an error:
'RandomForestClassifier' object has no attribute 'estimators_'
I have used n_estimators. Why does it warn me that there is no estimators_?
Usually GridSearchCV is applied to a single model, so I just need to write the parameter names of that single model in a dict.
I referred to this page https://groups.google.com/d/topic/mlxtend/5GhZNwgmtSg but it uses parameters from an earlier version. Even after changing to the new parameter names, it doesn't work.
By the way, where can I learn the details of the naming rules for these params?
First of all, the estimators need to be a list containing the models in tuples with the corresponding assigned names.
estimators = [('model1', model()),    # model() named 'model1' by myself
              ('model2', model2())]   # model2() named 'model2' by myself
Next, you need to use the names as they appear in sclf.get_params().
Also, the name is the same as the one you gave to the specific model in the above estimators list. So, here for model1 parameters you need:
params = {'model1__n_estimators': [5,10]} # model1__SOME_PARAM
Working toy example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
estimators = [('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
              ('logreg', LogisticRegression())]
sclf = StackingClassifier(estimators=estimators, final_estimator=DecisionTreeClassifier())
params = {'rf__n_estimators': [5,10]}
grid = GridSearchCV(estimator=sclf, param_grid=params, cv=5)
grid.fit(X, y)
After some trial and error, I think I have found a working solution.
The key to solving this problem is to use get_params() to learn the parameter names of the StackingClassifier.
I use another way to create sclf:
clf1 = RandomForestClassifier()
clf2 = LogisticRegression()
dt = DecisionTreeClassifier()
estimators = [('rf', clf1),
              ('lr', clf2)]
sclf = StackingClassifier(estimators=estimators, final_estimator=dt)
params = {'rf__n_estimators': list(range(100, 1000, 100)),
          'lr__C': list(range(1, 10, 1))}
grid = GridSearchCV(estimator=sclf, param_grid=params,verbose=2, cv=5,n_jobs=-1)
grid.fit(x, y)
In this way, I can name each base classifier and then set the params using those names.
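To see which parameter names are valid for the grid (this is exactly what get_params() gives you), a quick sketch:

# The prefixes come from the names in the estimators list ('rf', 'lr'),
# plus 'final_estimator__' for the meta-learner.
print(sorted(sclf.get_params().keys()))
# ... 'final_estimator__max_depth', 'lr__C', 'rf__n_estimators', ...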

Cross validation with multiple parameters using f1-score

I am trying to do feature selection using SelectKBest and to find the best tree depth for binary classification using the f1-score. I have created a scorer function to select the best features and to evaluate the grid search. A "__call__() missing 1 required positional argument: 'y_true'" error pops up when the classifier tries to fit the training data.
#Define scorer
f1_scorer = make_scorer(f1_score)
#Split data into training, CV and test set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state = 0)
#initialize tree and Select K-best features for classifier
kbest = SelectKBest(score_func=f1_scorer, k=all)
clf = DecisionTreeClassifier(random_state=0)
#create a pipeline for features to be optimized
pipeline = Pipeline([('kbest',kbest),('dt',clf)])
#initialize a grid search with features to be optimized
gs = GridSearchCV(pipeline,{'kbest__k': range(2,11), 'dt__max_depth':range(3,7)}, refit=True, cv=5, scoring = f1_scorer)
gs.fit(X_train,y_train)
#order best selected features into a single variable
selector = SelectKBest(score_func=f1_scorer, k=gs.best_params_['kbest__k'])
X_new = selector.fit_transform(X_train,y_train)
On the fit line I get a TypeError: __call__() missing 1 required positional argument: 'y_true'.
The problem is in the score_func which you have used for SelectKBest. score_func is a function that takes two arrays X and y and returns either a pair of arrays (scores, pvalues) or a single array of scores, but in your code you fed the callable f1_scorer as the score_func, which just takes y_true and y_pred and computes the f1 score. You can use one of chi2, f_classif or mutual_info_classif as your score_func for a classification task. Also, there is a minor bug in the parameter k for SelectKBest: it should have been "all" instead of all. I have modified your code to incorporate these changes:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score, make_scorer
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_classes=2,
                           n_informative=4, weights=[0.7, 0.3],
                           random_state=0)
f1_scorer = make_scorer(f1_score)

#Split data into training, CV and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
#initialize tree and Select K-best features for classifier
kbest = SelectKBest(score_func=f_classif)
clf = DecisionTreeClassifier(random_state=0)
#create a pipeline for features to be optimized
pipeline = Pipeline([('kbest',kbest),('dt',clf)])
gs = GridSearchCV(pipeline,{'kbest__k': range(2,11), 'dt__max_depth':range(3,7)}, refit=True, cv=5, scoring = f1_scorer)
gs.fit(X_train,y_train)
gs.best_params_
OUTPUT
{'dt__max_depth': 6, 'kbest__k': 9}
Also modify your last two lines as below:
selector = SelectKBest(score_func=f_classif, k=gs.best_params_['kbest__k'])
X_new = selector.fit_transform(X_train,y_train)
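As a side note (not part of the original answer): because refit=True, the search has already refit the whole pipeline with the best parameters, so the fitted SelectKBest can be pulled out of gs.best_estimator_ instead of being fit again:

# The fitted selector inside the refit best pipeline
kbest_fitted = gs.best_estimator_.named_steps['kbest']
X_new = kbest_fitted.transform(X_train)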

Using GridSearchCV with AdaBoost and DecisionTreeClassifier

I am attempting to tune an AdaBoost Classifier ("ABT") using a DecisionTreeClassifier ("DTC") as the base_estimator. I would like to tune both ABT and DTC parameters simultaneously, but am not sure how to accomplish this - pipeline shouldn't work, as I am not "piping" the output of DTC to ABT. The idea would be to iterate hyper parameters for ABT and DTC in the GridSearchCV estimator.
How can I specify the tuning parameters correctly?
I tried the following, which generated an error below.
[IN]
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.grid_search import GridSearchCV
param_grid = {dtc__criterion : ["gini", "entropy"],
              dtc__splitter : ["best", "random"],
              abc__n_estimators: [none, 1, 2]
             }
DTC = DecisionTreeClassifier(random_state = 11, max_features = "auto", class_weight = "auto",max_depth = None)
ABC = AdaBoostClassifier(base_estimator = DTC)
# run grid search
grid_search_ABC = GridSearchCV(ABC, param_grid=param_grid, scoring = 'roc_auc')
[OUT]
ValueError: Invalid parameter dtc for estimator AdaBoostClassifier(algorithm='SAMME.R',
base_estimator=DecisionTreeClassifier(class_weight='auto', criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
random_state=11, splitter='best'),
learning_rate=1.0, n_estimators=50, random_state=11)
There are several things wrong in the code you posted:
The keys of the param_grid dictionary need to be strings. You should be getting a NameError.
The key "abc__n_estimators" should just be "n_estimators": you are probably mixing this with the pipeline syntax. Here nothing tells Python that the string "abc" represents your AdaBoostClassifier.
None (and not none) is not a valid value for n_estimators. The default value (probably what you meant) is 50.
Here's the code with these fixes.
To set the parameters of your Tree estimator you can use the "__" syntax that allows accessing nested parameters.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.grid_search import GridSearchCV
param_grid = {"base_estimator__criterion" : ["gini", "entropy"],
"base_estimator__splitter" : ["best", "random"],
"n_estimators": [1, 2]
}
DTC = DecisionTreeClassifier(random_state = 11, max_features = "auto", class_weight = "auto",max_depth = None)
ABC = AdaBoostClassifier(base_estimator = DTC)
# run grid search
grid_search_ABC = GridSearchCV(ABC, param_grid=param_grid, scoring = 'roc_auc')
Also, 1 or 2 estimators does not really make sense for AdaBoost. But I'm guessing this is not the actual code you're running.
Hope this helps.
Trying to provide a shorter (and hopefully generic) answer.
If you want to grid search within a BaseEstimator for the AdaBoostClassifier e.g. varying the max_depth or min_sample_leaf of a DecisionTreeClassifier estimator, then you have to use a special syntax in the parameter grid.
abc = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())
parameters = {'base_estimator__max_depth': [i for i in range(2, 11, 2)],
              'base_estimator__min_samples_leaf': [5, 10],
              'n_estimators': [10, 50, 250, 1000],
              'learning_rate': [0.01, 0.1]}
clf = GridSearchCV(abc, parameters,verbose=3,scoring='f1',n_jobs=-1)
clf.fit(X_train,y_train)
So, note the 'base_estimator__max_depth' and 'base_estimator__min_samples_leaf' keys in the parameters dictionary. That is how you access the hyperparameters of a base estimator of an ensemble algorithm like AdaBoostClassifier when doing a grid search. Note the double-underscore (__) notation in particular. The other two keys in the parameters dict are regular AdaBoostClassifier parameters.
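A version note for readers on current scikit-learn (an addition, check against your installed version): GridSearchCV now lives in sklearn.model_selection, and since version 1.2 the AdaBoostClassifier argument base_estimator has been renamed to estimator, so the nested keys change accordingly. A small sketch:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# scikit-learn >= 1.2: the constructor argument is `estimator`,
# so nested parameters use the `estimator__` prefix instead.
abc = AdaBoostClassifier(estimator=DecisionTreeClassifier())
parameters = {'estimator__max_depth': [2, 4, 6, 8, 10],
              'estimator__min_samples_leaf': [5, 10],
              'n_estimators': [10, 50, 250, 1000],
              'learning_rate': [0.01, 0.1]}
clf = GridSearchCV(abc, parameters, scoring='f1', n_jobs=-1)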
