I would like to make a prediction with a single tree of my random forest. However, if I wrap my pipeline in a TransformedTargetRegressor, .set_params() does not seem to work.
Please find below an example:
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
# loading data
boston = load_boston()
X = boston["data"]
Y = boston["target"]
# pipeline and training
pipe = Pipeline([
('scaler', StandardScaler()),
('model', RandomForestRegressor(n_estimators = 100, max_depth = 4, random_state = 0))
])
treg = TransformedTargetRegressor(regressor=pipe, transformer=StandardScaler())
treg.fit(X, Y)
# single tree from random forest
tree = treg.regressor_.named_steps['model'].estimators_[0]
x_sample = X[0:1]
print('baseline: ', treg.predict(x_sample))
x_scaled = treg.regressor_.named_steps['scaler'].transform(x_sample)
y_predicted = tree.predict(x_scaled)
y_transformed = treg.transformer_.inverse_transform([y_predicted])
print("internal pipeline changes: ", y_transformed)
new_model = treg.set_params(**{'regressor__model': tree})
y_predicted = new_model.predict(x_sample)
print('with set_params(): ', y_predicted)
The output that I am getting is shown below. I would expect 'with set_params()' to be the same as 'internal pipeline changes':
baseline: [26.41013313]
internal pipeline changes: [[30.02424242]]
with set_params(): [26.41013313]
TransformedTargetRegressor has a parameter regressor and an attribute regressor_. The former can be set with set_params and is considered a hyperparameter, but is not used in prediction; rather, it is cloned and fitted when the TTR is fitted, and stored in the regressor_ attribute.
So you cannot use set_params to update the fitted regressor_ attribute. (You can check that in your code: new_model.regressor_['model'] is still a random forest.) The best you can do is modify the attribute directly (though this is probably unorthodox, and in some situations may lead to other issues):
import copy
mod_model = copy.deepcopy(treg)
mod_model.regressor_.steps[-1] = ('model', tree)
y_predicted = mod_model.predict(x_sample)
print('with modifying regressor: ', y_predicted)
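To confirm the point about set_params (a quick check, not part of the original code): the fitted pipeline inside new_model still ends in the full forest, while the deep-copied version now ends in the single tree:
# the fitted regressor_ is untouched by set_params; only the deep copy was modified
print(type(new_model.regressor_['model']).__name__)  # RandomForestRegressor
print(type(mod_model.regressor_['model']).__name__)  # DecisionTreeRegressor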
Apparently, scikit-learn TransformedTargetRegressor objects don't allow you to change the regressor used for prediction unless you re-fit on the dataset after setting the new regressor with set_params. If you do this:
new_model = treg.set_params(**{'regressor__model': tree})
print(new_model)
you can see that the new parameters have been set. However, as you correctly discovered, the estimator used in predict is still the old one. If you want to change the estimator in the object, you can do:
new_model = treg.set_params(**{'regressor__model': tree})
new_model.fit(X, Y)
new_model.predict(x_sample)
And you can see that the prediction changes and uses the single tree to perform the estimation. If you are interested in the single tree's prediction and do not want to re-fit on the whole dataset, you can just call tree.predict() separately, as sketched below.
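For completeness, a sketch of calling the single tree on its own (this just repeats the 'internal pipeline changes' block from the question: the sample still has to go through the fitted scaler, and the tree's output back through the fitted target transformer):
# predict with the single tree, reusing the fitted scaler and target transformer
x_scaled = treg.regressor_.named_steps['scaler'].transform(x_sample)
y_tree = tree.predict(x_scaled)
print(treg.transformer_.inverse_transform(y_tree.reshape(-1, 1)))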
Related
I am trying to determine which alpha is the best in a Ridge Regression with scoring = 'neg_mean_squared_error'.
I have an array with some values for alpha ranging from 5e09 to 5e-03:
array([5.00000000e+09, 3.78231664e+09, 2.86118383e+09, 2.16438064e+09,
1.63727458e+09, 1.23853818e+09, 9.36908711e+08, 7.08737081e+08,
5.36133611e+08, 4.05565415e+08, 3.06795364e+08, 2.32079442e+08,
1.75559587e+08, 1.32804389e+08, 1.00461650e+08, 7.59955541e+07,
5.74878498e+07, 4.34874501e+07, 3.28966612e+07, 2.48851178e+07,
1.88246790e+07, 1.42401793e+07, 1.07721735e+07, 8.14875417e+06,
6.16423370e+06, 4.66301673e+06, 3.52740116e+06, 2.66834962e+06,
2.01850863e+06, 1.52692775e+06, 1.15506485e+06, 8.73764200e+05,
6.60970574e+05, 5.00000000e+05, 3.78231664e+05, 2.86118383e+05,
2.16438064e+05, 1.63727458e+05, 1.23853818e+05, 9.36908711e+04,
7.08737081e+04, 5.36133611e+04, 4.05565415e+04, 3.06795364e+04,
2.32079442e+04, 1.75559587e+04, 1.32804389e+04, 1.00461650e+04,
7.59955541e+03, 5.74878498e+03, 4.34874501e+03, 3.28966612e+03,
2.48851178e+03, 1.88246790e+03, 1.42401793e+03, 1.07721735e+03,
8.14875417e+02, 6.16423370e+02, 4.66301673e+02, 3.52740116e+02,
2.66834962e+02, 2.01850863e+02, 1.52692775e+02, 1.15506485e+02,
8.73764200e+01, 6.60970574e+01, 5.00000000e+01, 3.78231664e+01,
2.86118383e+01, 2.16438064e+01, 1.63727458e+01, 1.23853818e+01,
9.36908711e+00, 7.08737081e+00, 5.36133611e+00, 4.05565415e+00,
3.06795364e+00, 2.32079442e+00, 1.75559587e+00, 1.32804389e+00,
1.00461650e+00, 7.59955541e-01, 5.74878498e-01, 4.34874501e-01,
3.28966612e-01, 2.48851178e-01, 1.88246790e-01, 1.42401793e-01,
1.07721735e-01, 8.14875417e-02, 6.16423370e-02, 4.66301673e-02,
3.52740116e-02, 2.66834962e-02, 2.01850863e-02, 1.52692775e-02,
1.15506485e-02, 8.73764200e-03, 6.60970574e-03, 5.00000000e-03])
Then, I used RidgeCV to try and determine which of these values would be best:
ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error',
normalize = True, cv=KFold(10))
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
and I got ridgecv.alpha_ = 0.006609705742330144
However, I received a warning that normalize = True is deprecated and will be removed in version 1.2. The warning advised me to use Pipeline and StandardScaler instead. Then, following instructions on how to build a Pipeline, I did:
steps = [
('scalar', StandardScaler(with_mean=False)),
('model',RidgeCV(alphas=alphas, scoring = 'neg_mean_squared_error', cv=KFold(10)))
]
ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)
y_pred = ridge_pipe2.predict(X_test)
ridge_pipe2.named_steps.model.alpha_
Doing this way, I got ridge_pipe2.named_steps.model.alpha_ = 1.328043891473342
For a last check, I also used GridSearchCV as follows:
steps = [
('scalar', StandardScaler()),
('model',Ridge())
]
ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)
parameters = [{'model__alpha':alphas}]
grid_search = GridSearchCV(estimator = ridge_pipe,
param_grid = parameters,
scoring = 'neg_mean_squared_error',
cv = 10,
n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
grid_search.best_estimator_.get_params()['model__alpha']
where I got 1.328043891473342 (same as with the other Pipeline approach).
Therefore, my question: why does normalizing my dataset with normalize=True versus with StandardScaler() yield different best alpha values?
The corresponding warning message for ordinary Ridge makes an additional mention:
Set parameter alpha to: original_alpha * n_samples.
(I don't entirely understand why this is, but for now I'm willing to leave it. There should probably be a note along these lines added to the warning for RidgeCV.) Changing your alphas parameter in the second approach to [alph * X.shape[0] for alph in alphas] should work. The selected alpha_ will be different, but after rescaling it back with ridge_pipe2.named_steps.model.alpha_ / X.shape[0] I retrieve the same value as in the first approach (as well as the same rescaled coefficients).
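A sketch of that change (the names alphas, X_train and y_train come from the question; using X_train.shape[0] as n_samples is my assumption):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

# rescale the alpha grid by n_samples to mimic the old normalize=True behaviour
scaled_alphas = [alph * X_train.shape[0] for alph in alphas]
ridge_pipe2 = Pipeline([
    ('scaler', StandardScaler(with_mean=False)),
    ('model', RidgeCV(alphas=scaled_alphas, scoring='neg_mean_squared_error', cv=KFold(10)))
])
ridge_pipe2.fit(X_train, y_train)
# divide by n_samples again to compare with the normalize=True result
print(ridge_pipe2.named_steps.model.alpha_ / X_train.shape[0])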
(I've used the dataset shared in the linked question, and added the experiment to the notebook I created there.)
You need to ensure the same cross validation is used and scale without centering the data.
When you run with normalize=True, you get this as part of the warning:
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
Regarding the cv, if you check the documentation, RidgeCV by default performs leave-one-out cross validation:
Ridge regression with built-in cross-validation.
See glossary entry for cross-validation estimator.
By default, it performs efficient Leave-One-Out Cross-Validation.
So to get the same result, we can define a cross-validation to use:
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
kf = KFold(10)
X_train, y_train = datasets.make_regression()
alphas = [0.001,0.005,0.01,0.05,0.1]
ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error', normalize = True, cv=kf)
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
0.001
And use it in the pipeline:
steps = [
('scalar', StandardScaler(with_mean=False)),
('model',RidgeCV(alphas=alphas, scoring = 'neg_mean_squared_error',cv=kf))
]
ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)
ridge_pipe2.named_steps.model.alpha_
0.001
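For completeness, a sketch of what happens if the cv argument is left out in the pipeline variant (reusing the imports and data from the snippet above): RidgeCV then falls back to its efficient leave-one-out CV, which may select a different alpha than the KFold(10) runs.
loo_pipe = Pipeline([
    ('scalar', StandardScaler(with_mean=False)),
    ('model', RidgeCV(alphas=alphas, scoring='neg_mean_squared_error'))  # cv=None -> LOO
])
loo_pipe.fit(X_train, y_train)
print(loo_pipe.named_steps.model.alpha_)  # not necessarily 0.001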
I want to use StackingClassifier to combine some classifiers and then use GridSearchCV to optimize the parameters:
clf1 = RandomForestClassifier()
clf2 = LogisticRegression()
dt = DecisionTreeClassifier()
sclf = StackingClassifier(estimators=[clf1, clf2],final_estimator=dt)
params = {'randomforestclassifier__n_estimators': [10, 50],
'logisticregression__C': [1,2,3]}
grid = GridSearchCV(estimator=sclf, param_grid=params, cv=5)
grid.fit(x, y)
But this throws an error:
'RandomForestClassifier' object has no attribute 'estimators_'
I have set n_estimators. Why does it tell me there is no estimators_?
Usually GridSearchCV is applied to a single model, so I just need to write the names of that model's parameters in a dict.
I referred to this page https://groups.google.com/d/topic/mlxtend/5GhZNwgmtSg but it uses parameters from an earlier version. Even after changing to the new parameter names it doesn't work.
Btw, where can I learn the details of the naming rule of these params?
First of all, estimators needs to be a list of tuples, each containing a model along with the name you assign to it.
estimators = [('model1', model()), # model() named model1 by myself
('model2', model2())] # model2() named model2 by myself
Next, you need to use the names as they appear in sclf.get_params().
Also, the name is the same as the one you gave to the specific model in the above estimators list. So, here for model1 parameters you need:
params = {'model1__n_estimators': [5,10]} # model1__SOME_PARAM
Working toy example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=1000, n_features=4,
n_informative=2, n_redundant=0,
random_state=0, shuffle=False)
estimators = [('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
('logreg', LogisticRegression())]
sclf = StackingClassifier(estimators= estimators , final_estimator=DecisionTreeClassifier())
params = {'rf__n_estimators': [5,10]}
grid = GridSearchCV(estimator=sclf, param_grid=params, cv=5)
grid.fit(X, y)
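Regarding the naming rule: nested parameters always follow the <step or estimator name>__<parameter name> pattern, and the exact keys can be listed from the estimator itself, for example:
# list every tunable parameter name of the stacked model; these are the keys
# GridSearchCV expects in param_grid (e.g. 'rf__n_estimators', 'logreg__C',
# 'final_estimator__max_depth', ...)
for name in sorted(sclf.get_params().keys()):
    print(name)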
After some trial and error, I think I found a working solution.
The key to solving this problem is to use get_params() to learn the parameter names of the StackingClassifier.
I create sclf in a different way:
clf1 = RandomForestClassifier()
clf2 = LogisticRegression()
dt = DecisionTreeClassifier()
estimators = [('rf', clf1),
('lr', clf2)]
sclf = StackingClassifier(estimators=estimators,final_estimator=dt)
params = {'rf__n_estimators': list(range(100,1000,100)),
'lr__C': list(range(1,10,1))}
grid = GridSearchCV(estimator=sclf, param_grid=params,verbose=2, cv=5,n_jobs=-1)
grid.fit(x, y)
This way, I can name each base classifier and then set the params using those names.
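Once the search has finished, the chosen settings can be read off the fitted grid in the usual way (the values shown in the comments are just placeholders):
# inspect the outcome of the search
print(grid.best_params_)  # e.g. {'lr__C': 3, 'rf__n_estimators': 300}
print(grid.best_score_)   # mean cross-validated score of that combination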
I have managed to write some code doing a nested cross-validation using lightGBM as my regressor and wrapping everything with sklearn.pipeline.
Ultimately, I would now want to do feature selection (or really just get the features' importance for the final model) but I am wondering what is the best path to take from here. I guess there would be two possibilities:
1# Use this methodology to build a model (using .fit and .predict) using the best hyperparameters. Then check the importance of the features for this model.
2# Do feature selection in the inner folds of the nested CV, but I am unsure how to do this exactly.
I guess #1 would be the easiest, but I am unsure how to get the best hyperparameters for each outer fold.
This thread touches on it:
Putting together sklearn pipeline+nested cross-validation for KNN regression
But the selected answer drops cross_val_score altogether, meaning that it isn't nested cross-validation anymore (I would still like to perform the CV on the outer folds after getting the best hyperparameters on the inner folds).
So my problem is the following:
Can I get feature importances for each fold of the outer CV (I am aware that if I have 5 folds, I will get 5 different sets of feature importances)? And if yes, how?
Alternatively, should I just get the best hyperparameters for each fold (how?) and build a new model, without CV, on the whole dataset based on these hyperparameters?
Here is the code I have so far:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import scipy.stats as st
#Parameters for model building and reproducibility
X = X_age
y = y_age
RNGesus = 42
state = 13
outer_scoring = 'neg_mean_absolute_error'
inner_scoring = 'neg_mean_absolute_error'
#### Nested CV with Random gridsearch ####
# Pipeline with standard scaling and the regressor
regressors = [lgb.LGBMRegressor(random_state = state)]
continuous_transformer = Pipeline([('scaler', StandardScaler())])
preprocessor = ColumnTransformer([('cont',continuous_transformer, continuous_variables)], remainder = 'passthrough')
for reg in regressors:
    steps=[('preprocessor', preprocessor), ('regressor', reg)]
    pipeline = Pipeline(steps)
#inner and outer fold to be used
inner_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)
#Hyperparameters of the regressor to be optimized using randomized search
params = {
'regressor__max_depth': (3, 5, 7, 10),
'regressor__lambda_l1': st.uniform(0, 5),
'regressor__lambda_l2': st.uniform(0, 3)
}
#Pass the RandomizedSearchCV to cross_val_score
regression = RandomizedSearchCV(estimator = pipeline, param_distributions = params, scoring=inner_scoring, cv=inner_cv, n_iter=200, verbose= 3, n_jobs= -1)
nested_score = cross_val_score(regression, X= X, y= y, cv = outer_cv, scoring=outer_scoring)
print('\n MAE for lightGBM model predicting age: %.3f' % (abs(nested_score.mean())))
print('\n' + str(nested_score) + ' <- outer CV')
Edit: Stated the problem clearly.
I encountered problems importing the lightGBM module, so I couldn't run your code. But here is a post explaining how you cannot get the "winning" or optimal hyperparameters (nor the feature importances) out of nested cross-validation with cross_val_score. Briefly, the reason is that cross_val_score only returns the scores.
Can I get feature importances for each fold of the outer CV (I am aware that if I have 5 folds, I will get 5 different sets of feature importance)? And if yes, how?
The answer is no with cross_val_score alone. But if you follow the code from that post, you'll be able to get the feature importances inside the for loop, right after GSCV.fit(), from GSCV.best_estimator_ (for the LightGBM step the attribute is feature_importances_ in its scikit-learn API).
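A minimal sketch of that loop, adapted to the pipeline from the question (the names X, y, pipeline, params, inner_scoring, inner_cv and outer_cv are taken from the question's code):
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error

fold_importances, fold_scores, fold_params = [], [], []
for train_idx, test_idx in outer_cv.split(X):
    # use X.iloc[train_idx] / y.iloc[train_idx] instead if X and y are pandas objects
    search = RandomizedSearchCV(estimator=pipeline, param_distributions=params,
                                scoring=inner_scoring, cv=inner_cv, n_iter=200, n_jobs=-1)
    search.fit(X[train_idx], y[train_idx])
    best_pipe = search.best_estimator_  # pipeline refit on this outer fold's training part
    fold_importances.append(best_pipe.named_steps['regressor'].feature_importances_)
    fold_scores.append(mean_absolute_error(y[test_idx], best_pipe.predict(X[test_idx])))
    fold_params.append(search.best_params_)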
Alternatively, should I just get the best hyperparameters for each fold (how?) and build a new model without CV on the whole dataset, based on these hyperparameters?
This is exactly what that post is about: getting the "best" hyperparameters by nested CV. Ideally, you'll observe one combination of hyperparameters that wins all the time, and those are the hyperparameters you'll use for the final model (fitted on the entire training set). But when different "best" hyperparameter combinations appear across folds, there is no standard way to deal with it as far as I know.
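If the fold_params list from the sketch above is kept, a simple tally (purely illustrative) shows how consistent the winning settings were across the outer folds:
from collections import Counter

# count how often each sampled combination "won" an outer fold; with continuous
# distributions (st.uniform) exact repeats are unlikely, so in practice you may
# want to compare the parameters individually instead
tally = Counter(tuple(sorted(p.items())) for p in fold_params)
for combo, wins in tally.most_common():
    print(wins, dict(combo))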
I'm a beginner, and I have the following code below.
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
pca = PCA()
model = GaussianNB()
steps = [('pca', pca), ('model', model)]
pipeline = Pipeline(steps)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
modelwithpca = GridSearchCV(pipeline, param_grid= ,cv=cv)
modelwithpca.fit(X_train,y_train)
This is a local test; what I'm trying to accomplish is:
i. Perform PCA on the dataset
ii. Use Gaussian Naive Bayes with only the default parameters
iii. Use StratifiedShuffleSplit
So in the end I want the above steps to be carried over to another function that dumps the classifier, the dataset and the feature list to test for performance.
dump_classifier_and_data(modelwithpca, dataset, features)
In the param_grid part, I don't want to test any list of parameters. I just want to have the default parameters used in Gaussian Naive Bayes if that makes sense. What do I change?
Also should there be any changes as to how I instantiate the classifier objects?
The purpose of GridSearchCV is to test with different parameters for at least one thing in your pipeline (if you don't want to test for different parameters you don't need to use GridSearchCV).
So, in general, let's say you want to test different values of n_components for the PCA step. The format for using a pipeline with GridSearchCV would be the following:
gscv = GridSearchCV(pipeline, param_grid={'{step_name}__{parameter_name}': [possible values]}, cv=cv)
e.g.:
# this would perform cv for the 3 different values of n_components for pca
gscv = GridSearchCV(pipeline, param_grid={'pca__n_components': [3, 6, 10]}, cv=cv)
If you use GridSearchCV to tune only the PCA step as above, your GaussianNB model would of course keep its default parameters.
If you don't need parameter tuning, then GridSearchCV is not the way to go: using only your model's default parameters produces a parameter grid with a single combination, so it would be like just performing plain CV. It wouldn't make much sense to do it like this, but if I have understood your question correctly:
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
pca = PCA()
model = GaussianNB()
steps = [('pca', pca), ('model', model)]
pipeline = Pipeline(steps)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
# get the default parameters of your model and use them as a param_grid
modelwithpca = GridSearchCV(pipeline, param_grid={'model__' + k: [v] for k, v in model.get_params().items()}, cv=cv)
# will run 5 times as your cv is configured
modelwithpca.fit(X_train,y_train)
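As an aside (my own suggestion, not part of the question): if no tuning is wanted at all, the pipeline can be evaluated directly with cross_val_score and then fitted once before being handed to the dump function (X_train and y_train are the names from the question):
from sklearn.model_selection import cross_val_score, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

pipe = Pipeline([('pca', PCA()), ('model', GaussianNB())])
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# plain cross-validation with the default GaussianNB parameters, no grid search
scores = cross_val_score(pipe, X_train, y_train, cv=cv)
print(scores.mean())

# fit once on the training data before handing the classifier on
pipe.fit(X_train, y_train)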
Hope this helps, good luck!
I'm trying to fit a model that I've put together using Pipeline:
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
cross_validation_object = cross_validation.StratifiedKFold(Y, n_folds = 10)
scaler = MinMaxScaler(feature_range = [0,1])
logistic_fit = LogisticRegression()
pipeline_object = Pipeline([('scaler', scaler),('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01,0.1,1,10],
'model__penalty': ['l1','l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters, cv = cross_validation_object, scoring = 'accuracy')
grid_search_object.fit(X_train,Y_train)
My question: Is the best_estimator going to scale the test data based on the values in the training data? For example, if I call:
grid_search_object.best_estimator_.predict(X_test)
It will NOT try to fit the scaler on the X_test data, right? It will just transform it using the original parameters.
Thanks!
The predict methods never fit any data. In this case, exactly as you describe it, the best_estimator_ pipeline is going to scale based on the scaling it has learnt on the training set.
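A quick way to convince yourself (a sketch reusing the objects from the question; attribute names as in current scikit-learn): the MinMaxScaler inside best_estimator_ keeps the min/max it learnt during fitting, and these fitted attributes are untouched by predict:
best_pipe = grid_search_object.best_estimator_
scaler_step = best_pipe.named_steps['scaler']

before = scaler_step.data_min_.copy(), scaler_step.data_max_.copy()
best_pipe.predict(X_test)  # only transforms X_test with the stored parameters
after = scaler_step.data_min_, scaler_step.data_max_

print((before[0] == after[0]).all() and (before[1] == after[1]).all())  # True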