I have a sklearn pipeline with PolynomialFeatures() and LinearRegression() in series. My aim is to fit data to this using different degrees of the polynomial features and measure the score. The following is the code I use:
steps = [('polynomials',preprocessing.PolynomialFeatures()),('linreg',linear_model.LinearRegression())]
pipeline = pipeline.Pipeline(steps=steps)
scores = dict()
for i in range(2,6):
    params = {'polynomials__degree': i,'polynomials__include_bias': False}
    #pipeline.set_params(**params)
    pipeline.fit(X_train,y=yCO_logTrain,**params)
    scores[i] = pipeline.score(X_train,yCO_logTrain)
scores
I receive the error - TypeError: fit() got an unexpected keyword argument 'degree'.
Why is this error thrown even though the parameters are named in the format <estimator_name>__<parameter_name>?
As per sklearn.pipeline.Pipeline documentation:
**fit_params : dict of string -> object
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
This means that parameters passed this way go directly to the .fit() method of step s. If you check the PolynomialFeatures documentation, the degree argument is used when constructing the PolynomialFeatures object, not in its .fit() method.
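Constructor arguments like degree can instead be changed with set_params before fitting, which is what the commented-out line in the question does. A minimal sketch of that loop, assuming synthetic data in place of X_train and yCO_logTrain (the data, the names pipe and y_train, and the random seed are illustrative, not from the question):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the question's training data (assumption)
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(100, 2))
y_train = X_train[:, 0] ** 3 - 2 * X_train[:, 1] + rng.normal(scale=0.1, size=100)

pipe = Pipeline(steps=[('polynomials', PolynomialFeatures()),
                       ('linreg', LinearRegression())])

scores = {}
for degree in range(2, 6):
    # Constructor parameters are set on the step, not passed to fit()
    pipe.set_params(polynomials__degree=degree,
                    polynomials__include_bias=False)
    pipe.fit(X_train, y_train)
    scores[degree] = pipe.score(X_train, y_train)
```

Note that the pipeline object is named pipe here so that it does not shadow the sklearn.pipeline module, as the assignment pipeline = pipeline.Pipeline(...) in the question does.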
If you want to try different hyperparameters for estimators/transformers within a pipeline, you could use GridSearchCV as shown here. Here's an example code from the link:
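from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# calibrated_forest is not defined in the snippet; this definition is an assumption.
# Newer scikit-learn releases name this argument estimator instead of base_estimator;
# if you use one, rename it here and in the param_grid key below.
calibrated_forest = CalibratedClassifierCV(
    base_estimator=RandomForestClassifier(n_estimators=10))
pipe = Pipeline([
    ('select', SelectKBest()),
    ('model', calibrated_forest)])
param_grid = {
    'select__k': [1, 2],
    'model__base_estimator__max_depth': [2, 4, 6, 8]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)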
I am trying to determine which alpha is the best in a Ridge Regression with scoring = 'neg_mean_squared_error'.
I have an array with some values for alpha ranging from 5e09 to 5e-03:
array([5.00000000e+09, 3.78231664e+09, 2.86118383e+09, 2.16438064e+09,
1.63727458e+09, 1.23853818e+09, 9.36908711e+08, 7.08737081e+08,
5.36133611e+08, 4.05565415e+08, 3.06795364e+08, 2.32079442e+08,
1.75559587e+08, 1.32804389e+08, 1.00461650e+08, 7.59955541e+07,
5.74878498e+07, 4.34874501e+07, 3.28966612e+07, 2.48851178e+07,
1.88246790e+07, 1.42401793e+07, 1.07721735e+07, 8.14875417e+06,
6.16423370e+06, 4.66301673e+06, 3.52740116e+06, 2.66834962e+06,
2.01850863e+06, 1.52692775e+06, 1.15506485e+06, 8.73764200e+05,
6.60970574e+05, 5.00000000e+05, 3.78231664e+05, 2.86118383e+05,
2.16438064e+05, 1.63727458e+05, 1.23853818e+05, 9.36908711e+04,
7.08737081e+04, 5.36133611e+04, 4.05565415e+04, 3.06795364e+04,
2.32079442e+04, 1.75559587e+04, 1.32804389e+04, 1.00461650e+04,
7.59955541e+03, 5.74878498e+03, 4.34874501e+03, 3.28966612e+03,
2.48851178e+03, 1.88246790e+03, 1.42401793e+03, 1.07721735e+03,
8.14875417e+02, 6.16423370e+02, 4.66301673e+02, 3.52740116e+02,
2.66834962e+02, 2.01850863e+02, 1.52692775e+02, 1.15506485e+02,
8.73764200e+01, 6.60970574e+01, 5.00000000e+01, 3.78231664e+01,
2.86118383e+01, 2.16438064e+01, 1.63727458e+01, 1.23853818e+01,
9.36908711e+00, 7.08737081e+00, 5.36133611e+00, 4.05565415e+00,
3.06795364e+00, 2.32079442e+00, 1.75559587e+00, 1.32804389e+00,
1.00461650e+00, 7.59955541e-01, 5.74878498e-01, 4.34874501e-01,
3.28966612e-01, 2.48851178e-01, 1.88246790e-01, 1.42401793e-01,
1.07721735e-01, 8.14875417e-02, 6.16423370e-02, 4.66301673e-02,
3.52740116e-02, 2.66834962e-02, 2.01850863e-02, 1.52692775e-02,
1.15506485e-02, 8.73764200e-03, 6.60970574e-03, 5.00000000e-03])
Then, I used RidgeCV to try and determine which of these values would be best:
ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error',
normalize = True, cv=KFold(10))
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
and I got ridgecv.alpha_ = 0.006609705742330144
However, I received a warning that normalize = True is deprecated and will be removed in version 1.2. The warning advised me to use Pipeline and StandardScaler instead. Then, following instructions of how to do a Pipeline, I did:
steps = [
('scalar', StandardScaler(with_mean=False)),
('model',RidgeCV(alphas=alphas, scoring = 'neg_mean_squared_error', cv=KFold(10)))
]
ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)
y_pred = ridge_pipe2.predict(X_test)
ridge_pipe2.named_steps.model.alpha_
Doing this way, I got ridge_pipe2.named_steps.model.alpha_ = 1.328043891473342
For a last check, I also used GridSearchCV as follows:
steps = [
('scalar', StandardScaler()),
('model',Ridge())
]
ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)
parameters = [{'model__alpha':alphas}]
grid_search = GridSearchCV(estimator = ridge_pipe,
param_grid = parameters,
scoring = 'neg_mean_squared_error',
cv = 10,
n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
grid_search.best_params_
where I got grid_search.best_params_['model__alpha'] = 1.328043891473342 (same as the other Pipeline approach).
Therefore, my question: why does normalizing my dataset with normalize=True or with StandardScaler() yield different best alpha values?
The corresponding warning message for ordinary Ridge makes an additional mention:
Set parameter alpha to: original_alpha * n_samples.
(I don't entirely understand why this is, but for now I'm willing to leave it. There should probably be a note added to the RidgeCV warning along these lines.) Changing your alphas parameter in the second approach to [alph * X.shape[0] for alph in alphas] should work. The selected alpha_ will be different, but after rescaling it again as ridge_pipe2.named_steps.model.alpha_ / X.shape[0] I retrieve the same value as in the first approach (as well as the same rescaled coefficients).
(I've used the dataset shared in the linked question, and added the experiment to the notebook I created there.)
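One way to see where the n_samples factor can come from: normalize=True centered each column and divided it by its L2 norm, while StandardScaler divides by the standard deviation, and the L2 norm of a centered column is sqrt(n_samples) times its standard deviation. The features therefore differ by a factor of sqrt(n_samples), which the squared penalty turns into a factor of n_samples on alpha. The sketch below checks this on synthetic data by emulating the old normalization by hand (the dataset, the alpha value and all variable names are assumptions for illustration, and it uses plain Ridge rather than RidgeCV):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=50, n_features=5, noise=10.0, random_state=0)
n_samples = X.shape[0]
alpha = 0.5

# Emulate the removed normalize=True: center, then divide by the column L2 norm
X_centered = X - X.mean(axis=0)
X_l2 = X_centered / np.linalg.norm(X_centered, axis=0)
ridge_l2 = Ridge(alpha=alpha).fit(X_l2, y)

# StandardScaler divides by the standard deviation instead,
# so alpha has to be multiplied by n_samples to compensate
X_std = StandardScaler(with_mean=False).fit_transform(X)
ridge_std = Ridge(alpha=alpha * n_samples).fit(X_std, y)

same_predictions = np.allclose(ridge_l2.predict(X_l2), ridge_std.predict(X_std))
same_rescaled_coefs = np.allclose(ridge_l2.coef_,
                                  np.sqrt(n_samples) * ridge_std.coef_)
```

Both flags should come out True, matching the observation above that the rescaled alpha and the rescaled coefficients agree.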
You need to ensure the same cross validation is used and scale without centering the data.
When you run with normalize=True, you get this as part of the warning:
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
Regarding the cv, if you check the documentation, RidgeCV by default performs leave-one-out cross-validation:
Ridge regression with built-in cross-validation.
See glossary entry for cross-validation estimator.
By default, it performs efficient Leave-One-Out Cross-Validation.
So to get the same result, we can define the cross-validation splitter explicitly:
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
kf = KFold(10)
X_train, y_train = datasets.make_regression()
alphas = [0.001,0.005,0.01,0.05,0.1]
ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error', normalize = True, cv=KFold(10))
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
0.001
And use it in the pipeline:
steps = [
('scalar', StandardScaler(with_mean=False)),
('model',RidgeCV(alphas=alphas, scoring = 'neg_mean_squared_error',cv=kf))
]
ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)
ridge_pipe2.named_steps.model.alpha_
0.001
Since SVR supports only a single output, I am trying to employ SVR on my model which has 6 inputs and 19 outputs using MultiOutputRegressor.
I am starting with hyper-parameter tuning. However, I am getting the error below. How can I modify my code to support MultiOutputRegressor?
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import RandomizedSearchCV
svr = SVR()
svr_regr = MultiOutputRegressor(svr)
from sklearn.model_selection import KFold
kfold_splitter = KFold(n_splits=6, random_state = 0,shuffle=True)
#On each iteration, the algorithm will choose a different combination of the hyperparameters.
svr_random = RandomizedSearchCV(svr_regr,
                                param_distributions = {'kernel': ('linear','poly','rbf','sigmoid'),
                                                       'C': [1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8,8.5,9,9.5,10],
                                                       'degree': [3,8],
                                                       'coef0': [0.01,0.1,0.5],
                                                       'gamma': ('auto','scale'),
                                                       'tol': [1e-3, 1e-4, 1e-5, 1e-6]},
                                n_iter=100,
                                cv=kfold_splitter,
                                n_jobs=-1,
                                random_state=42,
                                scoring='r2')
svr_random.fit(X_train, y_train)
print(svr_random.best_params_)
Error:
ValueError: Invalid parameter kernel for estimator MultiOutputRegressor(estimator=SVR()). Check the list of available parameters with `estimator.get_params().keys()`.
After getting the optimum parameters:
SVR_model = svr_regr (kernel='rbf',C=10,
coef0=0.01,degree=3,
gamma='auto',tol=1e-6,random_state=42)
SVR_model.fit(X_train, y_train)
SVR_model_y_predict = SVR_model.predict((X_test))
SVR_model_y_predict
Error after getting the optimum parameters:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/mm/r4gnnwl948zclfyx12w803040000gn/T/ipykernel_96269/769104914.py in <module>
----> 1 SVR_model = svr_regr (estimator__kernel='rbf',estimator__C=10,
2 estimator__coef0=0.01,estimator__degree=3,
3 estimator__gamma='auto',estimator__tol=1e-6,random_state=42)
4
5
TypeError: 'MultiOutputRegressor' object is not callable
I tried to reproduce a simple example of MultiOutputRegressor without using GridSearchCV (i.e. just the fit and predict methods), which seemed to work fine. The error message:
Check the list of available parameters with estimator.get_params().keys()
suggests that the parameters you are optimising in the search, i.e. through param_distributions, don't match the parameters accepted by MultiOutputRegressor. Looking at the API reference, there are only a few parameters that MultiOutputRegressor takes, and the parameters you are trying to pass through to SVR, e.g. C and tol, belong to the support vector machine estimator.
You should be able to pass parameters through to SVR via nested parameters, similar to how it's done in a pipeline: MultiOutputRegressor exposes the wrapped model as its estimator parameter, so the keys take the form estimator__<parameter_name>.
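A minimal sketch of that nested-parameter approach, assuming a small synthetic multi-output dataset and a reduced search space so it runs quickly (the data, the grid values and the names search and final_model are illustrative, not taken from the question):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Synthetic stand-in: 6 inputs, 3 outputs (the question has 19 outputs)
X, y = make_regression(n_samples=60, n_features=6, n_targets=3, random_state=0)

svr_regr = MultiOutputRegressor(SVR())

# Keys are prefixed with estimator__ because SVR is the 'estimator'
# parameter of MultiOutputRegressor
param_distributions = {
    'estimator__kernel': ['linear', 'rbf'],
    'estimator__C': [1, 5, 10],
    'estimator__gamma': ['auto', 'scale'],
}
search = RandomizedSearchCV(
    svr_regr, param_distributions, n_iter=5,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring='r2', random_state=42)
search.fit(X, y)

# Build the final model with set_params instead of calling svr_regr(...)
final_model = MultiOutputRegressor(SVR()).set_params(**search.best_params_)
final_model.fit(X, y)
predictions = final_model.predict(X)
```

The second error in the question ('MultiOutputRegressor' object is not callable) comes from calling the instance svr_regr like a function; set_params, or simply search.best_estimator_, avoids that. Also note that SVR has no random_state parameter, so that argument would have to be dropped.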
I know this appears to be a common problem and that it's based on the specific name of the parameters, but I'm still getting an error after looking at the keys.
steps=[('classifier', svm.SVC(decision_function_shape="ovo"))]
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'estimator__classifier__C':[1, 10, 100],
'estimator__classifier__gamma':[0.001, 0.0001]}
# Instantiate the GridSearchCV object: cv
SVM = GridSearchCV(pipeline, parameters, cv = 5)
_ = SVM.fit(X_train,y_train)
Which I then get:
ValueError: Invalid parameter estimator for estimator ... Check the list of available parameters with `estimator.get_params().keys()`.
So I then look at SVM.get_params().keys() and get the following group, including the two I'm using. What am I missing?
cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__preprocessor
estimator__classifier
estimator__preprocessor__n_jobs
estimator__preprocessor__remainder
estimator__preprocessor__sparse_threshold
estimator__preprocessor__transformer_weights
estimator__preprocessor__transformers
estimator__preprocessor__verbose
estimator__preprocessor__scale
estimator__preprocessor__onehot
estimator__preprocessor__scale__memory
estimator__preprocessor__scale__steps
estimator__preprocessor__scale__verbose
estimator__preprocessor__scale__scaler
estimator__preprocessor__scale__scaler__copy
estimator__preprocessor__scale__scaler__with_mean
estimator__preprocessor__scale__scaler__with_std
estimator__preprocessor__onehot__memory
estimator__preprocessor__onehot__steps
estimator__preprocessor__onehot__verbose
estimator__preprocessor__onehot__onehot
estimator__preprocessor__onehot__onehot__categories
estimator__preprocessor__onehot__onehot__drop
estimator__preprocessor__onehot__onehot__dtype
estimator__preprocessor__onehot__onehot__handle_unknown
estimator__preprocessor__onehot__onehot__sparse
estimator__classifier__C
estimator__classifier__break_ties
estimator__classifier__cache_size
estimator__classifier__class_weight
estimator__classifier__coef0
estimator__classifier__decision_function_shape
estimator__classifier__degree
estimator__classifier__gamma
estimator__classifier__kernel
estimator__classifier__max_iter
estimator__classifier__probability
estimator__classifier__random_state
estimator__classifier__shrinking
estimator__classifier__tol
estimator__classifier__verbose
estimator
iid
n_jobs
param_grid
pre_dispatch
refit
return_train_score
scoring
verbose
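Your param grid should use classifier__C and classifier__gamma. Just drop the estimator__ prefix: the keys in param_grid are resolved against the pipeline you pass to GridSearchCV, and you named your SVC step classifier in that pipeline. The estimator__ prefix shows up in your list because you called get_params() on the GridSearchCV object, which holds the pipeline as its estimator parameter.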
parameters = {'classifier__C':[1, 10, 100],
'classifier__gamma':[0.001, 0.0001]}
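The difference between the two sets of keys can be checked directly. The sketch below builds the single-step pipeline from the question and compares get_params() on the pipeline with get_params() on the GridSearchCV wrapper (the variable names search, pipeline_keys and search_keys are assumptions for illustration):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([('classifier', SVC(decision_function_shape='ovo'))])
parameters = {'classifier__C': [1, 10, 100],
              'classifier__gamma': [0.001, 0.0001]}
search = GridSearchCV(pipeline, parameters, cv=5)

# Keys valid in param_grid: those of the estimator handed to GridSearchCV
pipeline_keys = set(pipeline.get_params().keys())

# Keys of the search object itself: the pipeline's keys gain an estimator__ prefix
search_keys = set(search.get_params().keys())

grid_is_valid = all(key in pipeline_keys for key in parameters)
```

So the list to consult when writing param_grid is pipeline.get_params().keys(), not SVM.get_params().keys().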
I am trying to perform scaling using StandardScaler and define a KNeighborsClassifier(Create pipeline of scaler and estimator)
Finally, I want to create a Grid Search cross validator for the above where param_grid will be a dictionary containing n_neighbors as hyperparameter and k_vals as values.
def kNearest(k_vals):
    skf = StratifiedKFold(n_splits=5, random_state=23)
    svp = Pipeline([('ss', StandardScaler()),
                    ('knc', neighbors.KNeighborsClassifier())])
    parameters = {'n_neighbors': k_vals}
    clf = GridSearchCV(estimator=svp, param_grid=parameters, cv=skf)
    return clf
But doing this will give me an error saying that
Invalid parameter n_neighbors for estimator Pipeline. Check the list of available parameters with `estimator.get_params().keys()`.
I've read the documentation, but still don't quite get what the error indicates and how to fix it.
You are right, this is not exactly well-documented by scikit-learn. (Zero reference to it in the class docstring.)
If you use a pipeline as the estimator in a grid search, you need to use a special syntax when specifying the parameter grid. Specifically, you need to use the step name followed by a double underscore, followed by the parameter name as you would pass it to the estimator. I.e.
'<named_step>__<parameter>': value
In your case:
parameters = {'knc__n_neighbors': k_vals}
should do the trick.
Here knc is a named step in your pipeline. There is an attribute that shows these steps as a dictionary:
svp.named_steps
{'knc': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform'),
'ss': StandardScaler(copy=True, with_mean=True, with_std=True)}
And as your traceback alludes to:
svp.get_params().keys()
dict_keys(['memory', 'steps', 'ss', 'knc', 'ss__copy', 'ss__with_mean', 'ss__with_std', 'knc__algorithm', 'knc__leaf_size', 'knc__metric', 'knc__metric_params', 'knc__n_jobs', 'knc__n_neighbors', 'knc__p', 'knc__weights'])
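Putting this together, a corrected version of the function might look like the sketch below. It assumes the iris dataset as stand-in data and adds shuffle=True to StratifiedKFold, since recent scikit-learn versions reject a random_state when shuffling is off; both choices, and the name k_nearest, are assumptions rather than part of the question:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def k_nearest(k_vals):
    """Grid search over n_neighbors for a scaler + k-NN pipeline."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
    svp = Pipeline([('ss', StandardScaler()),
                    ('knc', KNeighborsClassifier())])
    # '<named_step>__<parameter>': the step is 'knc', the parameter 'n_neighbors'
    parameters = {'knc__n_neighbors': k_vals}
    return GridSearchCV(estimator=svp, param_grid=parameters, cv=skf)


X, y = load_iris(return_X_y=True)
clf = k_nearest([1, 3, 5]).fit(X, y)
best_k = clf.best_params_['knc__n_neighbors']
```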
Some official references to this:
The user guide on pipelines
Sample pipeline for text feature extraction and evaluation
I am implementing an example from the O'Reilly book "Introduction to Machine Learning with Python", using Python 2.7 and sklearn 0.16.
The code I am using:
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
param_grid = {"logisticregression_C": [0.001, 0.01, 0.1, 1, 10, 100], "tfidfvectorizer_ngram_range": [(1,1), (1,2), (1,3)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
The error being returned boils down to:
ValueError: Invalid parameter logisticregression_C for estimator Pipeline
Is this an error related to using make_pipeline from v0.16? What is causing this error?
There should be two underscores between the estimator name and its parameter in a Pipeline, i.e. logisticregression__C. Do the same for tfidfvectorizer.
It is mentioned in the user guide here: https://scikit-learn.org/stable/modules/compose.html#nested-parameters.
See the example at https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py
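As a quick check, make_pipeline names each step after the lowercased class name, and the grid keys join that name to the parameter with a double underscore. A small sketch using the pipeline and grid values from the question (step_names and grid_is_valid are illustrative names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Step names generated by make_pipeline
step_names = [name for name, _ in pipe.steps]

# Double underscore between step name and parameter name
param_grid = {
    "logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
}

# Every key in the grid must be a parameter the pipeline knows about
grid_is_valid = all(key in pipe.get_params() for key in param_grid)
```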
For a more general answer to using Pipeline in a GridSearchCV, the parameter grid for the model should start with whatever name you gave when defining the pipeline. For example:
# Pay attention to the name of the second step, i. e. 'model'
pipeline = Pipeline(steps=[
('preprocess', preprocess),
('model', Lasso())
])
# Define the parameter grid to be used in GridSearch
param_grid = {'model__alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(pipeline, param_grid)
search.fit(X_train, y_train)
In the pipeline, we used the name model for the estimator step. So, in the grid search, any hyperparameter for Lasso regression should be given with the prefix model__. The parameter names in the grid depend on the names you gave the steps in the pipeline. In plain-old GridSearchCV without a pipeline, the grid would be given like this:
param_grid = {'alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(Lasso(), param_grid)
You can find out more about GridSearch from this post.
Note that if you are using a pipeline with a voting classifier and a column selector, you will need multiple layers of names:
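# ColumnSelector is not part of scikit-learn; mlxtend is assumed to be its source here
from mlxtend.feature_selection import ColumnSelector
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
pipe1 = make_pipeline(ColumnSelector(cols=(0, 1)),
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
                      SVC())
votingClassifier = VotingClassifier(estimators=[
    ('p1', pipe1), ('p2', pipe2)])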
You will need a param grid that looks like the following:
param_grid = {
'p2__svc__kernel': ['rbf', 'poly'],
'p2__svc__gamma': ['scale', 'auto'],
}
p2 is the name of the pipe and svc is the default name of the classifier you create in that pipe. The third element is the parameter you want to modify.
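The layered names can be verified with get_params() on the voting classifier. The sketch below swaps ColumnSelector for scikit-learn's StandardScaler so that it only depends on scikit-learn, and uses the iris dataset as stand-in data; both substitutions, and the names voting and svc_keys, are assumptions made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# StandardScaler stands in for ColumnSelector (assumption)
pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
pipe2 = make_pipeline(StandardScaler(), SVC())
voting = VotingClassifier(estimators=[('p1', pipe1), ('p2', pipe2)])

# All tunable SVC parameters carry the prefix p2__svc__
svc_keys = sorted(k for k in voting.get_params() if k.startswith('p2__svc__'))

param_grid = {
    'p2__svc__kernel': ['rbf', 'poly'],
    'p2__svc__gamma': ['scale', 'auto'],
}
X, y = load_iris(return_X_y=True)
search = GridSearchCV(voting, param_grid, cv=3).fit(X, y)
```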
You can always call model.get_params().keys() (if you are using a bare model) or pipeline.get_params().keys() (if you are using a pipeline) to list the parameter keys you can adjust.