In the following code:
# Imports assumed for this snippet
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

rf_feature_imp = RandomForestClassifier(n_estimators=100)
feat_selection = SelectFromModel(rf_feature_imp, threshold=0.5)
clf = RandomForestClassifier(n_estimators=5000)

model = Pipeline([
    ('fs', feat_selection),
    ('clf', clf),
])

params = {
    'fs__threshold': [0.5, 0.3, 0.7],
    'fs__estimator__max_features': ['auto', 'sqrt', 'log2'],
    'clf__max_features': ['auto', 'sqrt', 'log2'],
}

gs = GridSearchCV(model, params, ...)
gs.fit(X, y)
What should be used for a prediction?
gs?
gs.best_estimator_?
or
gs.best_estimator_.named_steps['clf']?
What is the difference between these 3?
gs.predict(X_test) is equivalent to gs.best_estimator_.predict(X_test). Using either, X_test will be passed through your entire pipeline and it will return the predictions.
gs.best_estimator_.named_steps['clf'].predict(), however, covers only the last step of the pipeline. To use it, the feature-selection step must already have been applied: it only works if you have previously run your data through gs.best_estimator_.named_steps['fs'].transform().
Three equivalent methods for generating predictions are shown below:
Using gs directly.
pred = gs.predict(X_test)
Using best_estimator_.
pred = gs.best_estimator_.predict(X_test)
Calling each step in the pipeline individually.
X_test_fs = gs.best_estimator_.named_steps['fs'].transform(X_test)
pred = gs.best_estimator_.named_steps['clf'].predict(X_test_fs)
If you pass refit=True to GridSearchCV (which is the default anyway), then after the search the estimator with the best parameters is refit on the whole dataset, so you can simply call gs.predict(X_test) for prediction.
If refit=False was used when fitting the GridSearchCV object on your training set, then gs.best_estimator_ is not available; for prediction you have to refit an estimator yourself with the parameters found in gs.best_params_, as sketched below.
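A minimal sketch of that refit-by-hand workflow, assuming the model pipeline and params grid from the question and a separate X_train/X_test split:
from sklearn.base import clone
gs = GridSearchCV(model, params, refit=False)
gs.fit(X_train, y_train)
# With refit=False there is no gs.best_estimator_, so clone the original
# pipeline, apply the winning parameters, and refit it before predicting.
best_model = clone(model).set_params(**gs.best_params_)
best_model.fit(X_train, y_train)
pred = best_model.predict(X_test)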
Related
I am trying to build an outlier detector to find outliers in test data. That data varies a bit (more test channels, longer testing).
First I'm applying the train/test split because I wanted to use grid search on the train data to get the best results. This is time-series data from multiple sensors, and I removed the time column beforehand.
X shape : (25433, 17)
y shape : (25433, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.33,
                                                     random_state=0)
I standardize afterwards and then change the data into an int array, because GridSearch doesn't seem to like continuous data. This surely can be done better, but I want this to work before I optimize the coding.
# X: scale using statistics fitted on the training data only, then round and cast to int
mean = StandardScaler().fit(X_train)
X_train = mean.transform(X_train)
X_test = mean.transform(X_test)
X_train = np.round(X_train,2)*100
X_train = X_train.astype(int)
X_test = np.round(X_test,2)*100
X_test = X_test.astype(int)
# y: same treatment for the target
yeah = StandardScaler().fit(y_train)
y_train = yeah.transform(y_train)
y_test = yeah.transform(y_test)
y_train = np.round(y_train,2)*100
y_train = y_train.astype(int)
y_test = np.round(y_test,2)*100
y_test = y_test.astype(int)
I chose the IsolationForest because it's fast, gives pretty good results, and can handle huge data sets (I currently only use a chunk of the data for testing).
An SVM might also be an option I want to check out.
Then I set up the GridSearchCV:
clf = IForest(random_state=47, behaviour='new',
n_jobs=-1)
param_grid = {'n_estimators': [20,40,70,100],
'max_samples': [10,20,40,60],
'contamination': [0.1, 0.01, 0.001],
'max_features': [5,15,30],
'bootstrap': [True, False]}
fbeta = make_scorer(fbeta_score,
average = 'micro',
needs_proba=True,
beta=1)
grid_estimator = model_selection.GridSearchCV(clf,
param_grid,
scoring=fbeta,
cv=5,
n_jobs=-1,
return_train_score=True,
error_score='raise',
verbose=3)
grid_estimator.fit(X_train, y_train)
The Problem:
GridSearchCV needs a y argument, so I think this only works with supervised learning? If I run this, I get the following error that I don't understand:
ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
You can use GridSearchCV for unsupervised learning, but it's often tricky to define a scoring metric that makes sense for the problem.
Here's an example in the docs that uses grid search for KernelDensity, an unsupervised estimator. It works without issue because this estimator has a score method (docs).
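For illustration, a minimal sketch along those lines (the toy data and bandwidth grid here are just placeholders): since no scoring is passed, GridSearchCV falls back to KernelDensity's own score method (the total log-likelihood), so no y is needed.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

X_toy = np.random.RandomState(0).normal(size=(200, 2))  # toy data
kde_grid = GridSearchCV(KernelDensity(),
                        {'bandwidth': np.logspace(-1, 1, 20)},
                        cv=5)
kde_grid.fit(X_toy)  # unsupervised: no y argument
print(kde_grid.best_params_)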
In your case, since IsolationForest doesn't have a score method, you'll need to define a custom scorer to pass as the search's scoring method. There's an answer at this question, and also this question, but I don't think the metrics given there necessarily make sense. Unfortunately, I don't have a useful outlier detection metric in mind; that's a question better suited for the data science or statistics stackexchange sites.
I agree with @Ben Reiniger's answer, and it has good links to other SO posts on this topic.
You can try creating a custom scorer by assuming you can make use of y_train. This is not strictly unsupervised.
Here is one example where R2 score is used as a scoring metric.
from sklearn.metrics import r2_score
def scorer_f(estimator, X_train, Y_train):
    y_pred = estimator.predict(X_train)
    return r2_score(Y_train, y_pred)
Then you can use it as normal.
clf = IForest(random_state=47, behaviour='new',
n_jobs=-1)
param_grid = {'n_estimators': [20,40,70,100],
'max_samples': [10,20,40,60],
'contamination': [0.1, 0.01, 0.001],
'max_features': [5,15,30],
'bootstrap': [True, False]}
grid_estimator = model_selection.GridSearchCV(clf,
param_grid,
scoring=scorer_f,
cv=5,
n_jobs=-1,
return_train_score=True,
error_score='raise',
verbose=3)
grid_estimator.fit(X_train, y_train)
I have used RandomizedSearchCV to tune the parameters of my Random Forest model, as in the code cell below:
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(X_train[rf_cols], y_train)
It turns out that the rf_random model outperforms any of my manually trained models with the same parameters which were retrieved using
rf_random.best_params_
I need to reproduce the exact prediction that I have made using RandomizedSearchCV, but I am unable to do so, mainly due to two reasons:
best_params_ differ on each run
I am having trouble understanding how RandomizedSearchCV splits the data into train set and validation set, which means that it is nearly impossible for me to train a new model that behaves the same.
What can I do? What more information do I need to reproduce the results? Or is it even possible to reproduce results from RandomizedSearchCV, despite the fact that I have fixed my random_state to 42? Should I stick to GridSearchCV instead if I need to reproduce the results?
I believe you are looking for the best_estimator_ attribute of RandomizedSearchCV, which returns the fitted estimator that scored highest on the left-out data. To make the splits themselves reproducible, pass an explicitly seeded CV splitter:
kf = KFold(n_splits=3, shuffle=True, random_state=42)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
n_iter = 100, cv = kf, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(X_train[rf_cols], y_train)
tuned_rf = rf_random.best_estimator_
for train_index, test_index in kf.split(X_train[rf_cols], y_train):
# use train and test indices here
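If you also want to rebuild the model from scratch instead of reusing best_estimator_, a sketch (the random_state=42 below is only illustrative and should match whatever seed the original rf was created with):
# Recreate the winning configuration and refit it on the full training set,
# which is what RandomizedSearchCV does internally when refit=True.
reproduced_rf = RandomForestClassifier(random_state=42, **rf_random.best_params_)
reproduced_rf.fit(X_train[rf_cols], y_train)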
I am setting up a predictive analytics pipeline on some data, and I am in the process of model selection. My target variable is skewed, so I would like to log-transform it in order to increase the performance of my linear regression estimators. I came across the relatively new TransformedTargetRegressor of scikit-learn, and I thought I could use it as part of a pipeline. I am attaching my code
My initial attempt was to transform y_train before calling gs.fit(), by casting it to np.log1p(y_train). This works, and I can perform the nested cross-validation and return the metrics of interest for all estimators. However, I would like to be able to get R^2 and RMSE for the trained model on previously unseen data (validation set), and I understand that in order to do that, I need to use (for example) r2_score function on y_val, preds, where the predictions need to have been transformed back to the real values, i.e., preds = np.expm1(gs.predict(X_val))
### Create a pipeline
pipe = Pipeline([
    # the transformer stage is populated by the param_grid
    ('transformer', TransformedTargetRegressor(func=np.log1p, inverse_func=np.expm1)),
    ('reg', DummyEstimator())  # Placeholder Estimator
])
### Candidate learning algorithms and their hyperparameters
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
param_grid = [
{'transformer__regressor': Lasso(),
'reg': [Lasso()], # Actual Estimator
'reg__alpha': alphas},
{'transformer__regressor': LassoLars(),
'reg': [LassoLars()], # Actual Estimator
'reg__alpha': alphas},
{'transformer__regressor': Ridge(),
'reg': [Ridge()], # Actual Estimator
'reg__alpha': alphas},
{'transformer__regressor': ElasticNet(),
'reg': [ElasticNet()], # Actual Estimator
'reg__alpha': alphas,
'reg__l1_ratio': [0.25, 0.5, 0.75]}]
### Create grid search (Inner CV)
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=2, n_jobs=-1,
scoring=scoring, refit='r2', return_train_score=True)
### Fit
best_model = gs.fit(X_train, y_train)
### scoring metrics for outer CV
scoring = ['neg_mean_absolute_error', 'r2', 'explained_variance', 'neg_mean_squared_error']
### Outer CV
linear_cv_results = cross_validate(gs, X_train, y_train_transformed,
scoring=scoring, cv=5, verbose=3, return_train_score=True)
### Calculate mean metrics
train_r2 = (linear_cv_results['train_r2']).mean()
test_r2 = (linear_cv_results['test_r2']).mean()
train_mae = (-linear_cv_results['train_neg_mean_absolute_error']).mean()
test_mae = (-linear_cv_results['test_neg_mean_absolute_error']).mean()
train_exp_var = (linear_cv_results['train_explained_variance']).mean()
test_exp_var = (linear_cv_results['test_explained_variance']).mean()
train_rmse = (np.sqrt(-linear_cv_results['train_neg_mean_squared_error'])).mean()
test_rmse = (np.sqrt(-linear_cv_results['test_neg_mean_squared_error'])).mean()
Obviously this code snippet does not work, because apparently I cannot add TransformedTargetRegressor into my pipeline, since it does not have a transform method (I get this TypeError: All intermediate steps should be transformers and implement fit and transform).
Is there a "proper" way of doing this, or do I just have to take the log transformation of y_val on the fly when I want to call r2_score function etc?
No, because the original scikit-learn Pipeline does not change y or the number of samples in X and y during its steps.
Your use case is a little unclear. What is the need for the reg step if that same regressor is already added to the TransformedTargetRegressor?
Looking at the documentation of TransformedTargetRegressor, the regressor parameter accepts a regressor (which can also be a pipeline that performs some feature-selection operations on X with a regressor at the final stage). TransformedTargetRegressor then works as follows:
fit():
regressor.fit(X, func(y))
predict():
inverse_func(regressor.predict(X))
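For instance, a minimal standalone sketch of that wrapping behaviour (Ridge here is just an example regressor, and X_train, y_train, X_val are assumed to exist):
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

ttr = TransformedTargetRegressor(regressor=Ridge(), func=np.log1p, inverse_func=np.expm1)
ttr.fit(X_train, y_train)   # internally fits Ridge on np.log1p(y_train)
preds = ttr.predict(X_val)  # predictions are mapped back with np.expm1 automatically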
So there is no need to append that same regressor as a new step. Your model selection code can now be:
pipe = TransformedTargetRegressor(regressor=DummyEstimator(),
                                  func=np.log1p,
                                  inverse_func=np.expm1)
### Candidate learning algorithms and their hyperparameters
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
param_grid = [
{'transformer__regressor': Lasso(),
'transformer__regressor__alpha': alphas},
{'transformer__regressor': LassoLars(),
'transformer__regressor__alpha': alphas},
{'transformer__regressor': Ridge(),
'transformer__regressor__alpha': alphas},
{'transformer__regressor': ElasticNet(),
'transformer__regressor__alpha': alphas,
'transformer__regressor__l1_ratio': [0.25, 0.5, 0.75]}
]
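The search itself can then stay the same as in the question (a sketch, reusing its scoring list and refit metric):
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=2, n_jobs=-1,
                  scoring=scoring, refit='r2', return_train_score=True)
# Fit on the untransformed y_train; the wrapper applies log1p when fitting
# and expm1 when predicting, so predictions come back on the original scale.
best_model = gs.fit(X_train, y_train)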
I am using sklearn GridSearch to find best parameters for random forest classification using a predefined validation set. The scores from the best estimator returned by GridSearch do not match the scores obtained by training a separate classifier with the same parameters.
The data split definition
X = pd.concat([X_train, X_devel])
y = pd.concat([y_train, y_devel])
test_fold = -X.index.str.contains('train').astype(int)
ps = PredefinedSplit(test_fold)
The GridSearch definition
n_estimators = [10]
max_depth = [4]
grid = {'n_estimators': n_estimators, 'max_depth': max_depth}
rf = RandomForestClassifier(random_state=0)
rf_grid = GridSearchCV(estimator = rf, param_grid = grid, cv = ps, scoring='recall_macro')
rf_grid.fit(X, y)
The classifier definition
clf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
clf.fit(X_train, y_train)
The recall was calculated explicitly using sklearn.metrics.recall_score
y_pred_train = clf.predict(X_train)
y_pred_devel = clf.predict(X_devel)
uar_train = recall_score(y_train, y_pred_train, average='macro')
uar_devel = recall_score(y_devel, y_pred_devel, average='macro')
GridSearch:
uar train: 0.32189884516029466
uar devel: 0.3328299259976279
Random Forest:
uar train: 0.483040291148839
uar devel: 0.40706644557392435
What is the reason for such a mismatch?
There are multiple issues here:
Your input arguments to recall_score are reversed. The actual correct order is:
recall_score(y_true, y_pred)
But you are doing:
recall_score(y_pred_train, y_train, average='macro')
Correct that to:
recall_score(y_train, y_pred_train, average='macro')
You are doing rf_grid.fit(X, y) for grid-search. That means that after finding the best parameter combinations, the GridSearchCV will fit the whole data (whole X, ignoring the PredefinedSplit because that's only used during cross-validation in search of best parameters). So in essence, the estimator from GridSearchCV will have seen the whole data, so scores will be different from what you get when you do clf.fit(X_train, y_train)
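To make the comparison like-for-like, one option (a sketch reusing the names from the question) is to take the parameters found by the search and fit a fresh classifier on the training part only:
best_clf = RandomForestClassifier(random_state=0, **rf_grid.best_params_)
best_clf.fit(X_train, y_train)  # train on X_train only, exactly like clf
uar_train = recall_score(y_train, best_clf.predict(X_train), average='macro')
uar_devel = recall_score(y_devel, best_clf.predict(X_devel), average='macro')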
It's because in your GridSearchCV you are using recall_macro as the scoring function, which returns the macro-averaged recall score. See this link.
However, the default score method of RandomForestClassifier returns the mean accuracy, so that is why the scores are different (one is recall and the other is accuracy). See this link for more on the same.
I am implementing an example from the O'Reilly book "Introduction to Machine Learning with Python", using Python 2.7 and sklearn 0.16.
The code I am using:
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
param_grid = {"logisticregression_C": [0.001, 0.01, 0.1, 1, 10, 100], "tfidfvectorizer_ngram_range": [(1,1), (1,2), (1,3)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
The error being returned boils down to:
ValueError: Invalid parameter logisticregression_C for estimator Pipeline
Is this an error related to using make_pipeline in v0.16? What is causing this error?
There should be two underscores between the estimator name and its parameters in a Pipeline:
logisticregression__C. Do the same for tfidfvectorizer.
It is mentioned in the user guide here: https://scikit-learn.org/stable/modules/compose.html#nested-parameters.
See the example at https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py
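With the double underscores, the grid from the question becomes:
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}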
For a more general answer to using Pipeline in a GridSearchCV, the parameter grid for the model should start with whatever name you gave when defining the pipeline. For example:
# Pay attention to the name of the second step, i. e. 'model'
pipeline = Pipeline(steps=[
('preprocess', preprocess),
('model', Lasso())
])
# Define the parameter grid to be used in GridSearch
param_grid = {'model__alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(pipeline, param_grid)
search.fit(X_train, y_train)
In the pipeline, we used the name model for the estimator step. So, in the grid search, any hyperparameter for Lasso regression should be given with the prefix model__. The parameters in the grid depend on what name you gave in the pipeline. In plain-old GridSearchCV without a pipeline, the grid would be given like this:
param_grid = {'alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(Lasso(), param_grid)
You can find out more about GridSearch from this post.
Note that if you are using a pipeline with a voting classifier and a column selector, you will need multiple layers of names:
pipe1 = make_pipeline(ColumnSelector(cols=(0, 1)),
LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
SVC())
votingClassifier = VotingClassifier(estimators=[
('p1', pipe1), ('p2', pipe2)])
You will need a param grid that looks like the following:
param_grid = {
'p2__svc__kernel': ['rbf', 'poly'],
'p2__svc__gamma': ['scale', 'auto'],
}
p2 is the name of the pipe and svc is the default name of the classifier you create in that pipe. The third element is the parameter you want to modify.
You can always use model.get_params().keys() (in case you are using only a model) or pipeline.get_params().keys() (in case you are using a pipeline) to get the keys of the parameters you can adjust.
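For example, a quick sketch with the pipeline defined above:
# List every parameter name the grid search will accept for this pipeline;
# the output includes entries such as 'model__alpha'.
print(sorted(pipeline.get_params().keys()))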