Problem with SelectKBest method in pipeline - python

I am trying to solve a problem where I use the KNN algorithm for classification. While building the pipeline, I decided to add SelectKBest, but I get the error below:
All intermediate steps should be transformers and implement fit and transform.
I don't know if I can use this selection algorithm with KNN, but I tried with SVM as well and got the same result. Here is my code:
sel = SelectKBest('chi2',k = 3)
clf = kn()
s = ss()
step = [('scaler', s), ('kn', clf), ('sel',sel)]
pipeline = Pipeline(step)
parameter = {'kn__n_neighbors':range(1,40,1), 'kn__weights':['uniform','distance'], 'kn__p':[1,2] }
kfold = StratifiedKFold(n_splits=5, random_state=0)
grid = GridSearchCV(pipeline, param_grid = parameter, cv=kfold, scoring = 'accuracy', n_jobs = -1)
grid.fit(x_train, y_train)

The order of the operations in the pipeline, as determined in steps, matters; from the docs:
steps : list
List of (name, transform) tuples (implementing fit/transform) that are chained, in the order in which they are chained, with the last object an estimator.
The error comes from the order of your steps:
step = [('scaler', s), ('kn', clf), ('sel',sel)]
Here the classifier kn sits in an intermediate position even though intermediate steps must be transformers (implementing fit/transform), while SelectKBest, which is a transformer, is placed last where the predictor should go.
I guess you don't really want to perform feature selection after you have fitted the model...
Change it to:
step = [('scaler', s), ('sel', sel), ('kn', clf)]
and you should be fine.
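For reference, here is a minimal sketch of the corrected ordering, assuming kn and ss are your aliases for KNeighborsClassifier and StandardScaler. Note two small adjustments that are not in the original question: SelectKBest expects the score function itself rather than the string 'chi2', and because chi2 requires non-negative inputs (which standard scaling breaks), f_classif is used here instead.

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, GridSearchCV

sel = SelectKBest(f_classif, k=3)   # score function passed as a callable, not a string
clf = KNeighborsClassifier()
s = StandardScaler()

# Transformers first, the predictor last
step = [('scaler', s), ('sel', sel), ('kn', clf)]
pipeline = Pipeline(step)

parameter = {'kn__n_neighbors': range(1, 40, 1),
             'kn__weights': ['uniform', 'distance'],
             'kn__p': [1, 2]}
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # shuffle=True is required with random_state on recent scikit-learn
grid = GridSearchCV(pipeline, param_grid=parameter, cv=kfold, scoring='accuracy', n_jobs=-1)
grid.fit(x_train, y_train)   # x_train / y_train as defined in your script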

So, I didn't think the order of the pipeline was important, but then I found out that every intermediate member of the pipeline has to be able to fit/transform and only the last one should be the estimator. I changed the order of the pipeline by making clf the last step. Problem solved.

Scikit Learn Pipeline: Calling .fit() and .score() vs cross_val_score()

Imagine we have the following pipeline:
example_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=len(X.columns)-5)),
    ('classifier', KNeighborsClassifier())
])
Now we want to get the performance of the pipeline with:
# 1)
cross_val_score(example_pipe, X, y, cv=5, scoring='accuracy').mean()
# 2)
example_pipe.fit(X_train, y_train)
example_pipe.score(X_test, y_test)
How is the first different from the second with regard to the score we get (apart, of course, from the fact that it does cross-validation)? Do we have to call example_pipe.fit() before using cross_val_score()?
I've found the following methods in the documentation, but it's a bit confusing because I thought that calling .fit() already implies calling .transform().
fit(X[, y]) --> Fit the model
fit_predict(X[, y]) --> Applies fit_predict of last step in pipeline after transforms
fit_transform(X[, y]) --> Fit the model and transform with the final estimator
score(X[, y, sample_weight]) --> Apply transforms, and score with the final estimator
Do we have to call example_pipe.fit() before using cross_val_score()?
If you go to the Scikit-Learn documentation, you will find the answer:
cross_val_score fits (a clone of) your example_pipe on each training fold itself and then computes the cross-validation score, so there is no need to call fit beforehand.
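Conceptually, it does something like the rough sketch below for each fold. This is not the actual library code, and it assumes X and y are the pandas objects used in the question:

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5)        # cv=5 becomes stratified k-fold for classifiers
scores = []
for train_idx, test_idx in cv.split(X, y):
    fold_pipe = clone(example_pipe)     # a fresh, unfitted copy of the pipeline
    fold_pipe.fit(X.iloc[train_idx], y.iloc[train_idx])
    scores.append(fold_pipe.score(X.iloc[test_idx], y.iloc[test_idx]))
print(np.mean(scores))                  # comparable to cross_val_score(...).mean()

The second approach, by contrast, fits once on X_train and reports a single score on X_test. Your own example_pipe object is never fitted by cross_val_score, because only clones of it are.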

Feature selection: after or during nested cross-validation?

I have managed to write some code doing a nested cross-validation using lightGBM as my regressor and wrapping everything with sklearn.pipeline.
Ultimately, I would now want to do feature selection (or really just get the features' importance for the final model) but I am wondering what is the best path to take from here. I guess there would be two possibilities:
#1 Use this methodology to build a model (using .fit and .predict) with the best hyperparameters, then check the importance of the features for this model.
#2 Do feature selection in the inner fold of the nested CV, but I am unsure how to do this exactly.
I guess #1 would be the easiest, but I am unsure how to get the best hyperparameters for each outer fold.
This thread touches on it:
Putting together sklearn pipeline+nested cross-validation for KNN regression
But the selected answer drops cross_val_score altogether, meaning that it isn't nested cross-validation anymore (I would still like to perform the CV on the outer folds after getting the best hyperparameters on the inner folds).
So my problem is the following:
Can I get feature importances for each fold of the outer CV (I am aware that if I have 5 folds, I will get 5 different sets of feature importance)? And if yes, how?
Alternatively, should I just get the best hyperparameters for each fold (how?) and build a new model without CV on the whole dataset, based on these hyperparameters?
Here is the code I have so far:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import scipy.stats as st

# Parameters for model building and reproducibility
X = X_age
y = y_age
RNGesus = 42
state = 13
outer_scoring = 'neg_mean_absolute_error'
inner_scoring = 'neg_mean_absolute_error'

#### Nested CV with randomized search ####
# Pipeline with standard scaling and the regressor
regressors = [lgb.LGBMRegressor(random_state=state)]
continuous_transformer = Pipeline([('scaler', StandardScaler())])
preprocessor = ColumnTransformer([('cont', continuous_transformer, continuous_variables)], remainder='passthrough')

for reg in regressors:
    steps = [('preprocessor', preprocessor), ('regressor', reg)]
    pipeline = Pipeline(steps)

    # Inner and outer folds to be used
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)

    # Hyperparameters of the regressor to be optimized using randomized search
    params = {
        'regressor__max_depth': (3, 5, 7, 10),
        'regressor__lambda_l1': st.uniform(0, 5),
        'regressor__lambda_l2': st.uniform(0, 3)
    }

    # Pass the RandomizedSearchCV to cross_val_score
    regression = RandomizedSearchCV(estimator=pipeline, param_distributions=params, scoring=inner_scoring,
                                    cv=inner_cv, n_iter=200, verbose=3, n_jobs=-1)
    nested_score = cross_val_score(regression, X=X, y=y, cv=outer_cv, scoring=outer_scoring)

    print('\n MAE for lightGBM model predicting age: %.3f' % abs(nested_score.mean()))
    print('\n' + str(nested_score) + ' <- outer CV')
Edit: Stated the problem clearly.
I encountered problems importing the lightGBM module, so I couldn't run your code. But here is a post explaining why you cannot get the "winning" (optimal) hyperparameters, or the feature importances, out of nested cross-validation via cross_val_score. Briefly, the reason is that cross_val_score only returns the scores.
Can I get feature importances for each fold of the outer CV (I am aware that if I have 5 folds, I will get 5 different sets of feature importance)? And if yes, how?
The answer is no with cross_val_score. But if you follow the code from that post, you'll be able to get the feature importances simply from GSCV.best_estimator_ (e.g. its feature_importances_ attribute) inside the for loop, after GSCV.fit().
Alternatively, should I just get the best hyperparameters for each fold (how?) and build a new model without CV on the whole dataset, based on these hyperparameters?
This is exactly what that post is talking about: getting you the "best" hyperparameters by nested cv. Ideally, you'll observe one combination of hyperparameters that wins all the time and that is the hyperparameters you'll use for the final model (with the entire training set). But when different "best" hyperparameter combinations appear during cv, there is no standard way to deal with it as far as I know.
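As a hypothetical sketch (reusing pipeline, params, inner_cv and outer_cv from your code, and assuming X and y are pandas objects; this is not the exact code from that post), you can run the inner search yourself on each outer fold and record both the winning hyperparameters and the feature importances of the refitted pipeline:

for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]

    search = RandomizedSearchCV(estimator=pipeline, param_distributions=params,
                                scoring=inner_scoring, cv=inner_cv, n_iter=200, n_jobs=-1)
    search.fit(X_tr, y_tr)

    best_pipe = search.best_estimator_            # refit on the whole outer-train split
    importances = best_pipe.named_steps['regressor'].feature_importances_
    outer_mae = abs(search.score(X_te, y_te))     # neg_mean_absolute_error on the held-out fold
    print(fold, search.best_params_, outer_mae)

The outer_mae values collected this way play the role of nested_score in your original code, while best_params_ and importances give you one set per outer fold.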

Pass object attribute from previous sklearn pipeline step as argument to next step method

tl;dr: Is there any way to call .get_feature_names() on the fitted transformer from the previous step of the pipeline and use the result as a hyperparameter in the next step of the pipeline?
I have a Pipeline that includes fitting and transforming text data with TfidfVectorizer, and then runs a RandomForestClassifier. I want to GridSearchCV across various levels of max_features in the classifier, based on the number of features that the transformation produced from the text.
# setup pipeline
pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4,
                             min_df=3,
                             norm='l1',
                             stop_words='english',
                             use_idf=False)),
    ('rf', RandomForestClassifier(random_state=1,
                                  criterion='entropy',
                                  n_estimators=800))
])

# setup parameter grid
params = {
    'rf__max_features': np.arange(1, len(vect.get_feature_names()), 1)
}
Instantiating returns the following error:
NameError: name 'vect' is not defined
Edit:
This becomes more relevant (though it isn't shown in the sample code) if I were also varying a parameter of the TfidfVectorizer such as ngram_range; one can see how that would change the number of features passed to the next step...
The parameter grid gets populated before anything in the pipeline is fitted, so you can't do this directly. You might be able to monkey-patch the gridsearch, like here, but I'd expect it to be substantially harder since your second parameter depends on the results of fitting the first step.
I think the best approach, while it won't produce exactly what you're after, is to just use fractional values for max_features, i.e. a percentage of the columns coming out of the vectorizer.
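For example (the values below are only illustrative), RandomForestClassifier interprets a float max_features as a fraction of the input columns:

params = {
    'rf__max_features': [0.05, 0.1, 0.2, 0.4, 0.6, 0.8]   # fractions of the vectorizer's output features
}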
If you really want a score for every integer max_features, I think the easiest way may be to have two nested grid searches, the inner one only instantiating the parameter space when its fit is called:
estimator = RandomForestClassifier(
    random_state=1,
    criterion='entropy',
    n_estimators=800
)

class MySearcher(GridSearchCV):
    def fit(self, X, y):
        m = X.shape[1]
        self.param_grid = {'max_features': np.arange(1, m, 1)}
        return super().fit(X, y)

pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4,
                             min_df=3,
                             norm='l1',
                             stop_words='english',
                             use_idf=False)),
    ('rf', MySearcher(estimator=estimator,
                      param_grid={'fake': ['passes', 'check']}))
])
Now the search results will be awkwardly nested (best values of, say, ngram_range give you a refitted copy of pipe, whose second step will itself have a best value of max_features and a corresponding refitted random forest). Also, the data available for the inner search will be a bit smaller.

Error when using scikit-learn to use pipelines

I am trying to perform scaling using StandardScaler and define a KNeighborsClassifier (i.e. create a pipeline of scaler and estimator).
Finally, I want to create a grid search cross-validator for the above, where param_grid is a dictionary with n_neighbors as the hyperparameter and k_vals as its values.
def kNearest(k_vals):
    skf = StratifiedKFold(n_splits=5, random_state=23)
    svp = Pipeline([('ss', StandardScaler()),
                    ('knc', neighbors.KNeighborsClassifier())])
    parameters = {'n_neighbors': k_vals}
    clf = GridSearchCV(estimator=svp, param_grid=parameters, cv=skf)
    return clf
But doing this will give me an error saying that
Invalid parameter n_neighbors for estimator Pipeline. Check the list of available parameters with `estimator.get_params().keys()`.
I've read the documentation, but still don't quite get what the error indicates and how to fix it.
You are right, this is not exactly well-documented by scikit-learn. (Zero reference to it in the class docstring.)
If you use a pipeline as the estimator in a grid search, you need to use a special syntax when specifying the parameter grid. Specifically, you need to use the step name followed by a double underscore, followed by the parameter name as you would pass it to the estimator. I.e.
'<named_step>__<parameter>': value
In your case:
parameters = {'knc__n_neighbors': k_vals}
should do the trick.
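As a sketch, your function with only that one line changed (the rest of your code kept as is):

def kNearest(k_vals):
    skf = StratifiedKFold(n_splits=5, random_state=23)   # recent scikit-learn may also require shuffle=True here
    svp = Pipeline([('ss', StandardScaler()),
                    ('knc', neighbors.KNeighborsClassifier())])
    parameters = {'knc__n_neighbors': k_vals}   # <named_step>__<parameter>
    clf = GridSearchCV(estimator=svp, param_grid=parameters, cv=skf)
    return clf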
Here knc is a named step in your pipeline. There is an attribute that shows these steps as a dictionary:
svp.named_steps
{'knc': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                             metric_params=None, n_jobs=1, n_neighbors=5, p=2,
                             weights='uniform'),
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True)}
And as your traceback alludes to:
svp.get_params().keys()
dict_keys(['memory', 'steps', 'ss', 'knc', 'ss__copy', 'ss__with_mean', 'ss__with_std', 'knc__algorithm', 'knc__leaf_size', 'knc__metric', 'knc__metric_params', 'knc__n_jobs', 'knc__n_neighbors', 'knc__p', 'knc__weights'])
Some official references to this:
The user guide on pipelines
Sample pipeline for text feature extraction and evaluation

Invalid parameter for sklearn estimator pipeline

I am implementing an example from the O'Reilly book "Introduction to Machine Learning with Python", using Python 2.7 and sklearn 0.16.
The code I am using:
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
param_grid = {"logisticregression_C": [0.001, 0.01, 0.1, 1, 10, 100], "tfidfvectorizer_ngram_range": [(1,1), (1,2), (1,3)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
The error being returned boils down to:
ValueError: Invalid parameter logisticregression_C for estimator Pipeline
Is this an error related to using make_pipeline in v0.16? What is causing this error?
There should be two underscores between the estimator name and its parameter when tuning a Pipeline:
logisticregression__C. Do the same for tfidfvectorizer (tfidfvectorizer__ngram_range).
It is mentioned in the user guide here: https://scikit-learn.org/stable/modules/compose.html#nested-parameters.
See the example at https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py
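Applied to the code in the question, the grid becomes:

param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}

The prefixes logisticregression and tfidfvectorizer are the step names that make_pipeline generates automatically (the lowercased class names).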
For a more general answer to using Pipeline in a GridSearchCV, the parameter grid for the model should start with whatever name you gave when defining the pipeline. For example:
# Pay attention to the name of the second step, i.e. 'model'
pipeline = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', Lasso())
])

# Define the parameter grid to be used in GridSearch
param_grid = {'model__alpha': np.arange(0, 1, 0.05)}

search = GridSearchCV(pipeline, param_grid)
search.fit(X_train, y_train)
In the pipeline, we used the name model for the estimator step. So, in the grid search, any hyperparameter for the Lasso regression should be given with the prefix model__. The parameters in the grid depend on what names you gave in the pipeline. In plain-old GridSearchCV without a pipeline, the grid would be given like this:
param_grid = {'alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(Lasso(), param_grid)
You can find out more about GridSearch from this post.
Note that if you are using a pipeline with a voting classifier and a column selector, you will need multiple layers of names:
pipe1 = make_pipeline(ColumnSelector(cols=(0, 1)),
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
                      SVC())
votingClassifier = VotingClassifier(estimators=[
    ('p1', pipe1), ('p2', pipe2)])
You will need a param grid that looks like the following:
param_grid = {
    'p2__svc__kernel': ['rbf', 'poly'],
    'p2__svc__gamma': ['scale', 'auto'],
}
p2 is the name of the pipe and svc is the default name of the classifier you create in that pipe. The third element is the parameter you want to modify.
You can always use model.get_params().keys() (if you are using just a model) or pipeline.get_params().keys() (if you are using a pipeline) to list the parameter keys you can adjust.
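For instance, with the voting classifier above you could confirm the prefixes like this (output abbreviated and illustrative):

sorted(votingClassifier.get_params().keys())
# [..., 'p2__svc__C', 'p2__svc__gamma', 'p2__svc__kernel', ...]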
