What is the difference between pipeline and make_pipeline in scikit? - python

I got this from the sklearn webpage:
Pipeline: Pipeline of transforms with a final estimator.
make_pipeline: Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor.
But I still do not understand when I have to use each one. Can anyone give me an example?

The only difference is that make_pipeline generates names for steps automatically.
Step names are needed e.g. if you want to use a pipeline with model selection utilities (e.g. GridSearchCV). With grid search you need to specify parameters for various steps of a pipeline:
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
Compare it with make_pipeline:
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
So, with Pipeline:
names are explicit, you don't have to figure them out if you need them;
name doesn't change if you change estimator/transformer used in a step, e.g. if you replace LogisticRegression() with LinearSVC() you can still use clf__C.
make_pipeline:
shorter and arguably more readable notation;
names are auto-generated using a straightforward rule (lowercase name of an estimator).
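You can inspect the auto-generated names yourself; a minimal sketch, reusing the classes from above:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(), LogisticRegression())
# Step names are the lowercased class names:
print([name for name, _ in pipe.steps])  # ['countvectorizer', 'logisticregression']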
When to use them is up to you :) I prefer make_pipeline for quick experiments and Pipeline for more stable code; a rule of thumb: IPython Notebook -> make_pipeline; Python module in a larger project -> Pipeline. But it is certainly not a big deal to use make_pipeline in a module or Pipeline in a short script or a notebook.

Related

GridSearchCV with n_jobs=-1 is not working for Decision Tree/Random Forest classification

I am trying to use GridSearchCV to evaluate different models with different parameter sets. Logistic Regression and k-NN do not cause a problem, but Decision Tree, Random Forest, and some other types of classifiers do not work when n_jobs=-1.
for classifier, paramSet, classifierName in zip(list_classifiers, list_paramSets, list_clfNames):
    gs = GridSearchCV(
        estimator=classifier,
        param_grid=paramSet,
        cv=10,
        n_jobs=-1
    )
    gs.fit(X_train, y_train)
    plot_learning_curve(gs, "Learning Curve", X_train, y_train, n_jobs=-1)
I am working on Google Colab, and neither of the solutions proposed below solved my problem.
from sklearn.externals.joblib import parallel_backend

clf = GridSearchCV(...)
with parallel_backend('threading', n_jobs=-1):
    clf.fit(x_train, y_train)
if __name__ == "__main__":
    import multiprocessing as mp
    mp.set_start_method('forkserver', force=True)  # 'spawn' has also failed
    # grid search and fit here
Here is my source code: https://github.com/bahadirbasaran/pulsarDetection/blob/master/main.ipynb
The error log:
Any help will be appreciated!
You shouldn't need any additional calls to create threads; your first code snippet should work. If you call:
import multiprocessing
n_cpus = multiprocessing.cpu_count()
what does it return?
If you use that and then pass n_jobs=n_cpus or n_jobs=n_cpus-1 (if you don't want your computer pegged), see if that works.
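For example, a quick sketch of that suggestion, reusing classifier, paramSet, X_train, and y_train from the question's loop:
import multiprocessing

n_cpus = multiprocessing.cpu_count()
# Leave one core free so the machine stays responsive:
gs = GridSearchCV(estimator=classifier, param_grid=paramSet, cv=10, n_jobs=n_cpus - 1)
gs.fit(X_train, y_train)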
From the sklearn logistic regression documentation:
n_jobs : int, default=None. Number of CPU cores used when parallelizing over classes if multi_class='ovr'. This parameter is ignored when the solver is set to 'liblinear' regardless of whether 'multi_class' is specified or not.
So it may be that the models that are working are actually not using more than 1 CPU.
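A quick way to see this (a sketch; whether a warning is emitted depends on your scikit-learn version):
from sklearn.linear_model import LogisticRegression

# With solver='liblinear', n_jobs is ignored, so this model
# trains on a single core despite n_jobs=-1:
clf = LogisticRegression(solver='liblinear', n_jobs=-1)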
Hope some of these ideas help.

All intermediate steps should be transformers and implement fit and transform

I am implementing a pipeline that first selects the important features and then trains my random forest classifier on those features. Following is my code.
m = ExtraTreesClassifier(n_estimators=10)
m.fit(train_cv_x, train_cv_y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(train_cv_x)
clf = RandomForestClassifier(5000)

model = Pipeline([('m', m), ('sel', sel), ('X_new', X_new), ('clf', clf)])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}
gs = GridSearchCV(model, params)
gs.fit(train_cv_x, train_cv_y)
So X_new contains the new features selected via SelectFromModel and sel.transform. Then I want to train my RF using the newly selected features.
I am getting the following error:
All intermediate steps should be transformers and implement fit and transform,
ExtraTreesClassifier ...
Like the traceback says: each step in your pipeline needs to have a fit() and transform() method (except the last, which just needs fit()). This is because a pipeline chains together transformations of your data at each step.
X_new = sel.transform(train_cv_x) produces an array, not an estimator, so it doesn't meet this criterion.
In fact, it looks like based on what you're trying to do, you can leave this step out. Internally, ('sel', sel) already does this transformation--that's why it's included in the pipeline.
Secondly, ExtraTreesClassifier (the first step in your pipeline), doesn't have a transform() method, either. You can verify that here, in the class docstring. Supervised learning models aren't made for transforming data; they're made for fitting on it and predicting based off that.
What type of classes are able to do transformations?
Ones that scale your data. See preprocessing and normalization.
Ones that transform your data (in some other way than the above). Decomposition and other unsupervised learning methods do this.
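A quick sanity check you can run on any candidate step (a sketch):
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler

for est in (ExtraTreesClassifier(), SelectFromModel(ExtraTreesClassifier()), StandardScaler()):
    # Only steps exposing transform() can sit in the middle of a pipeline:
    print(type(est).__name__, hasattr(est, 'transform'))
# ExtraTreesClassifier False
# SelectFromModel True
# StandardScaler True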
Without reading between the lines too much about what you're trying to do here, this would work for you:
First split x and y using train_test_split. The test dataset produced by this is held out for final testing; within GridSearchCV's cross-validation, the training set will be further split into smaller training and validation folds.
Build a pipeline that satisfies what your traceback is trying to tell you.
Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.
Roughly, that would look like this:
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444),
                      threshold='mean')
clf = RandomForestClassifier(n_estimators=5000, random_state=444)
model = Pipeline([('sel', sel), ('clf', clf)])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}
gs = GridSearchCV(model, params)
gs.fit(X_train, y_train)

# How well do your hyperparameter optimizations generalize
# to unseen test data?
gs.score(X_test, y_test)
Two examples for further reading:
Pipelining: chaining a PCA and a logistic regression
Sample pipeline for text feature extraction and evaluation
You may also get the error in the title if you were oversampling or undersampling your data using imblearn module and fitting it into a model in a pipeline. If you got this message, then it means you have imported sklearn.pipeline.Pipeline. Import imblearn.pipeline.Pipeline instead and you're golden. For example,
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('o', SMOTE()), ('svc', SVC())])
The problem is, if you're sampling your data, the intermediate steps obviously need to sample the data as well, which is not supported by sklearn's Pipeline but is supported by imblearn's Pipeline.
This has happened because every intermediate step you pass in a pipeline must have both a fit and a transform method.
m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
Here m fails in the pipeline because ExtraTreesClassifier does not have a transform method.
So change the order of the pipeline and add another transformer for the first step, as sketched below.
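For example, a sketch of such a reordering (mirroring the accepted answer above), where SelectFromModel wraps the tree model so the first step is a real transformer:
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('sel', SelectFromModel(ExtraTreesClassifier(n_estimators=10))),  # has transform()
    ('clf', RandomForestClassifier(n_estimators=5000)),               # final estimator
])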

How to optimize a sklearn pipeline, using XGboost, for a different `eval_metric`?

I'm trying to use XGBoost and optimize eval_metric as auc (as described here).
This works fine when using the classifier directly, but fails when I'm trying to use it as a pipeline.
What is the correct way to pass a .fit argument to the sklearn pipeline?
Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from xgboost import XGBClassifier
import xgboost
import sklearn
print('sklearn version: %s' % sklearn.__version__)
print('xgboost version: %s' % xgboost.__version__)
X, y = load_iris(return_X_y=True)
# Without using the pipeline:
xgb = XGBClassifier()
xgb.fit(X, y, eval_metric='auc') # works fine
# Making a pipeline with this classifier and a scaler:
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
# using the pipeline, but not optimizing for 'auc':
pipe.fit(X, y) # works fine
# however this does not work (even after correcting the underscores):
pipe.fit(X, y, classifier__eval_metric='auc') # fails
The error:
TypeError: before_fit() got an unexpected keyword argument 'classifier__eval_metric'
Regarding the version of xgboost:
xgboost.__version__ shows 0.6
pip3 freeze | grep xgboost shows xgboost==0.6a2.
The error is because you are using a single underscore between the estimator name and its parameter when using it in a pipeline. It should be two underscores.
From the documentation of Pipeline.fit(), we see that the correct way of supplying params in fit:
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
So in your case, the correct usage is:
pipe.fit(X_train, y_train, classifier__eval_metric='auc')
(Notice two underscores between name and param)
When the goal is to optimize, I suggest using the sklearn wrapper and GridSearchCV:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
It would look like this:
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
score = 'roc_auc'
pipe.fit(X, y)
param = {
    'classifier__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # just as an example
}
gsearch = GridSearchCV(estimator=pipe, param_grid=param, scoring=score)
Then fit, which runs cross-validation internally:
gsearch.fit(X, y)
And you get the best params and the best score:
gsearch.best_params_, gsearch.best_score_

Scaling data in RFECV with scikit-learn

It is common to scale the training and testing data separately before training a classifier and making predictions.
I want to embed the aforementioned process in RFECV which runs CV tests thus I tried the following:
Scale the whole data set up front with X_scaled = preprocessing.scale(X), where X is the entire data set. But this way the training and testing data are not scaled separately, so this option is ruled out.
The other way I tried is to pass:
scaling_svm = Pipeline([('scaler', preprocessing.StandardScaler()),
                        ('svm', LinearSVC(penalty=penalty, dual=False, class_weight='auto'))])
as parameter to the argument in RFECV :
rfecv = RFECV(estimator=scaling_svm, step=1, cv=StratifiedKFold(y, 7),
              scoring=score, verbose=0)
However, I got an error since RFECV needs the estimator to have attribute .coef_.
What am I supposed to do? Any help would be appreciated.
A bit late to the party, admittedly, but if anyone is interested you can create a customised pipeline as follows:
from sklearn.pipeline import Pipeline

class RfePipeline(Pipeline):
    @property
    def coef_(self):
        # Expose the final estimator's coefficients so RFECV can find them
        return self._final_estimator.coef_
And then replace Pipeline with RfePipeline in your code.
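A sketch of how that might be wired up, assuming X and y are your data and reusing the names from the question:
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# The subclass exposes coef_ from its final step, so RFECV's
# importance lookup now works through the pipeline:
scaling_svm = RfePipeline([('scaler', StandardScaler()),
                           ('svm', LinearSVC(dual=False))])
rfecv = RFECV(estimator=scaling_svm, step=1, cv=5)
rfecv.fit(X, y)
In newer scikit-learn versions (0.24+) you can alternatively pass importance_getter='named_steps.svm.coef_' to RFECV instead of subclassing.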
See similar question here.

Getting model attributes from pipeline

I typically get PCA loadings like this:
pca = PCA(n_components=2)
X_t = pca.fit(X).transform(X)
loadings = pca.components_
If I run PCA using a scikit-learn pipeline:
from sklearn.pipeline import Pipeline
pipeline = Pipeline(steps=[
    ('scaling', StandardScaler()),
    ('pca', PCA(n_components=2))
])
X_t = pipeline.fit_transform(X)
is it possible to get the loadings?
Simply trying loadings = pipeline.components_ fails:
AttributeError: 'Pipeline' object has no attribute 'components_'
(Also interested in extracting attributes like coef_ from pipelines.)
Did you look at the documentation? http://scikit-learn.org/dev/modules/pipeline.html
I feel it is pretty clear.
Update: since version 0.21 you can just use square brackets:
pipeline['pca']
or indices
pipeline[1]
There are two ways to get to the steps in a pipeline, either using indices or using the string names you gave:
pipeline.named_steps['pca']
pipeline.steps[1][1]
This will give you the PCA object, on which you can get components.
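Putting that together for the loadings question, a minimal sketch:
# After fitting the pipeline, pull the loadings off the PCA step:
loadings = pipeline.named_steps['pca'].components_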
With named_steps you can also use attribute access with a . which allows autocompletion:
pipeline.named_steps.pca.<tab here gives autocomplete>
Using Neuraxle
Working with pipelines is simpler using Neuraxle. For instance, you can do this:
from neuraxle.pipeline import Pipeline
# Create and fit the pipeline:
pipeline = Pipeline([
    StandardScaler(),
    PCA(n_components=2)
])
pipeline, X_t = pipeline.fit_transform(X)
# Get the components:
pca = pipeline[-1]
components = pca.components_
You can access your PCA in any of these three ways, as you wish:
pipeline['PCA']
pipeline[-1]
pipeline[1]
Neuraxle is a pipelining library built on top of scikit-learn to take pipelines to the next level. It makes it easy to manage spaces of hyperparameter distributions, nested pipelines, saving and reloading, REST API serving, and more. It is also designed to accommodate deep learning algorithms and parallel computing.
Nested pipelines:
You could have pipelines within pipelines as below.
# Create and fit the pipeline:
pipeline = Pipeline([
    StandardScaler(),
    Identity(),
    Pipeline([
        Identity(),  # Note: an Identity step is a step that does nothing.
        Identity(),  # We use it here for demonstration purposes.
        Identity(),
        Pipeline([
            Identity(),
            PCA(n_components=2)
        ])
    ])
])
pipeline, X_t = pipeline.fit_transform(X)
Then you'd need to do this:
# Get the components:
pca = pipeline["Pipeline"]["Pipeline"][-1]
components = pca.components_
