Getting model attributes from pipeline - python

I typically get PCA loadings like this:
pca = PCA(n_components=2)
X_t = pca.fit(X).transform(X)
loadings = pca.components_
If I run PCA using a scikit-learn pipeline:
from sklearn.pipeline import Pipeline
pipeline = Pipeline(steps=[
('scaling',StandardScaler()),
('pca',PCA(n_components=2))
])
X_t=pipeline.fit_transform(X)
is it possible to get the loadings?
Simply trying loadings = pipeline.components_ fails:
AttributeError: 'Pipeline' object has no attribute 'components_'
(Also interested in extracting attributes like coef_ from pipelines.)

Did you look at the documentation: http://scikit-learn.org/dev/modules/pipeline.html
I feel it is pretty clear.
Update: in 0.21 you can use just square brackets:
pipeline['pca']
or indices
pipeline[1]
There are two ways to get to the steps in a pipeline, either using indices or using the string names you gave:
pipeline.named_steps['pca']
pipeline.steps[1][1]
Either one returns the fitted PCA object, from which you can read components_.
With named_steps you can also use attribute access with a . which allows autocompletion:
pipeline.named_steps.pca.<tab here gives autocomplete>
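Putting the pieces together, a minimal end-to-end sketch might look like the following. The random data and the second pipeline with LogisticRegression (added to illustrate coef_) are assumptions made for illustration; they are not part of the original question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 60 samples, 4 features, binary target.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipeline = Pipeline(steps=[
    ("scaling", StandardScaler()),
    ("pca", PCA(n_components=2)),
])
X_t = pipeline.fit_transform(X)

# Fitted attributes live on the individual step, not on the Pipeline itself.
loadings = pipeline.named_steps["pca"].components_   # access by name
same_loadings = pipeline.steps[1][1].components_     # access by position

# The same idea works for coef_ on a final estimator.
clf_pipeline = Pipeline(steps=[
    ("scaling", StandardScaler()),
    ("clf", LogisticRegression()),
])
clf_pipeline.fit(X, y)
coefficients = clf_pipeline.named_steps["clf"].coef_

print(loadings.shape, coefficients.shape)
```

Both lookups return the same fitted step, so which one to use is mostly a matter of whether you prefer names or positions.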

Using Neuraxle
Working with pipelines is simpler using Neuraxle. For instance, you can do this:
from neuraxle.pipeline import Pipeline
# Create and fit the pipeline:
pipeline = Pipeline([
StandardScaler(),
PCA(n_components=2)
])
pipeline, X_t = pipeline.fit_transform(X)
# Get the components:
pca = pipeline[-1]
components = pca.components_
You can access the PCA step in any of these three ways:
pipeline['PCA']
pipeline[-1]
pipeline[1]
Neuraxle is a pipelining library built on top of scikit-learn to take pipelines to the next level. It allows easily managing spaces of hyperparameter distributions, nested pipelines, saving and reloading, REST API serving, and more. The whole thing is made to also use Deep Learning algorithms and to allow parallel computing.
Nested pipelines:
You could have pipelines within pipelines as below.
# Create and fit the pipeline:
pipeline = Pipeline([
StandardScaler(),
Identity(),
Pipeline([
Identity(), # Note: an Identity step is a step that does nothing.
Identity(), # We use it here for demonstration purposes.
Identity(),
Pipeline([
Identity(),
PCA(n_components=2)
])
])
])
pipeline, X_t = pipeline.fit_transform(X)
Then you'd need to do this:
# Get the components:
pca = pipeline["Pipeline"]["Pipeline"][-1]
components = pca.components_

Related

Inspecting a 'fitted sk-learn' pipeline still results in 'TFIdfVectorizer not fitted yet'

This is an uncertainty I have about scikit-learn's Pipelines. Whenever I create a pipeline in scikit-learn and make predictions with it, I find that I can't actually examine the intermediate steps of the pipeline. The predictions work and I get my scores, but if I want to get the feature importances, or examine what a tf-idf vectorizer's features are, the pipeline is reported as not fitted (even though I already trained it and just used it for inference).
To take an example, calling .fit() on the following snippet from scikit-learn's documentation works for prediction, but I get the same 'not fitted' problem when I want to inspect the pipeline's tfidf step.
pipeline = Pipeline([
# Extract the subject & body
('subjectbody', SubjectBodyExtractor()),
# Use ColumnTransformer to combine the features from subject and body
('union', ColumnTransformer(
[
# Pulling features from the post's subject line (first column)
('subject', TfidfVectorizer(min_df=50), 0),
# Pipeline for standard bag-of-words model for body (second column)
('body_bow', Pipeline([
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
]), 1),
# Pipeline for pulling ad hoc features from post's body
('body_stats', Pipeline([
('stats', TextStats()), # returns a list of dicts
('vect', DictVectorizer()), # list of dicts -> feature matrix
]), 1),
],
# weight components in ColumnTransformer
transformer_weights={
'subject': 0.8,
'body_bow': 0.5,
'body_stats': 1.0,
}
)),
# Use a SVC classifier on the combined features
('svc', LinearSVC(dual=False)),
], verbose=True)
After fitting the pipeline on the data (as is done in the link), when I try to access the vectorizer using
pipeline.named_steps.union.transformers[1][1].named_steps['tfidf'].get_feature_names()
it claims 'Vocabulary not fitted or provided'.
So, is this a misunderstanding I have of pipelines? Are we not supposed to access the intermediate steps? Or does some setting need to be configured?
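You need to access the fitted transformers via .transformers_ (with a trailing underscore). ColumnTransformer clones its transformers when it is fitted, so .transformers still holds the unfitted objects you passed in, while .transformers_ holds the fitted copies.
So: pipeline.named_steps.union.transformers_[1][1].named_steps['tfidf'].get_feature_names()
(In scikit-learn 1.0 and later the method is called get_feature_names_out().)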
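The difference between transformers and transformers_ can be seen on a small ColumnTransformer. This is only a sketch: the toy two-column text array and the step names "subject" and "body" are assumptions made for illustration, and it checks vocabulary_ rather than calling a feature-name method so that it does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy two-column text data (assumption): column 0 = subject, column 1 = body.
X = np.array([
    ["apple pie", "how to bake an apple pie"],
    ["car repair", "fixing a flat tire on a car"],
    ["apple cider", "pressing apples for cider"],
    ["car wash", "washing the car at home"],
], dtype=object)

union = ColumnTransformer([
    ("subject", TfidfVectorizer(), 0),
    ("body", TfidfVectorizer(), 1),
])
union.fit(X)

unfitted = union.transformers[0][1]   # the object passed to the constructor
fitted = union.transformers_[0][1]    # the clone that was actually fitted

print(hasattr(unfitted, "vocabulary_"))   # the original object was never fitted
print(sorted(fitted.vocabulary_))         # the fitted clone has a vocabulary
print(union.named_transformers_["subject"] is fitted)
```

named_transformers_ gives the same fitted objects by name, which tends to be easier to read than positional indexing.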

All intermediate steps should be transformers and implement fit and transform

I am implementing a pipeline using important features selection and then using the same features to train my random forest classifier. Following is my code.
m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(train_cv_x)
clf = RandomForestClassifier(5000)
model = Pipeline([('m', m),('sel', sel),('X_new', X_new),('clf', clf),])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}
gs = GridSearchCV(model, params)
gs.fit(train_cv_x,train_cv_y)
So X_new are the new features selected via SelectFromModel and sel.transform. Then I want to train my RF using the new features selected.
I am getting the following error:
All intermediate steps should be transformers and implement fit and transform,
ExtraTreesClassifier ...
Like the traceback says: each step in your pipeline needs to have a fit() and transform() method (except the last, which just needs fit()). This is because a pipeline chains together transformations of your data at each step.
X_new, the result of sel.transform(train_cv_x), is an array of data rather than an estimator, so it doesn't meet this criterion.
In fact, it looks like based on what you're trying to do, you can leave this step out. Internally, ('sel', sel) already does this transformation--that's why it's included in the pipeline.
Secondly, ExtraTreesClassifier (the first step in your pipeline), doesn't have a transform() method, either. You can verify that here, in the class docstring. Supervised learning models aren't made for transforming data; they're made for fitting on it and predicting based off that.
What type of classes are able to do transformations?
Ones that scale your data. See preprocessing and normalization.
Ones that transform your data (in some other way than the above). Decomposition and other unsupervised learning methods do this.
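A quick way to check whether an object can serve as an intermediate step is to look for a transform method. The sketch below does that for three estimators and then reproduces the error with a deliberately invalid pipeline; the random data and the small n_estimators values are assumptions chosen only to keep it fast.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

candidates = {
    "StandardScaler": StandardScaler(),
    "SelectFromModel": SelectFromModel(ExtraTreesClassifier(n_estimators=10)),
    "ExtraTreesClassifier": ExtraTreesClassifier(n_estimators=10),
}
# Only objects with a transform method can be intermediate pipeline steps.
can_transform = {name: hasattr(est, "transform") for name, est in candidates.items()}
print(can_transform)

# Random stand-in data (assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 5))
y = rng.randint(0, 2, size=40)

# A bare classifier as a non-final step is rejected with a TypeError.
try:
    bad_pipeline = Pipeline([
        ("m", ExtraTreesClassifier(n_estimators=10)),
        ("clf", RandomForestClassifier(n_estimators=10)),
    ])
    bad_pipeline.fit(X, y)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

Wrapping the classifier in SelectFromModel is what gives it a transform method, which is why that wrapper can sit in the middle of a pipeline while the classifier alone cannot.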
Without reading between the lines too much about what you're trying to do here, this would work for you:
1. First split x and y using train_test_split. The test dataset produced by this is held out for final testing, and the train dataset within GridSearchCV's cross-validation will be further broken out into smaller train and validation sets.
2. Build a pipeline that satisfies what your traceback is trying to tell you.
3. Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.
Roughly, that would look like this:
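from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)
# SelectFromModel wraps the tree ensemble so that it can act as a transformer
sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444),
                      threshold='mean')
clf = RandomForestClassifier(n_estimators=5000, random_state=444)
model = Pipeline([('sel', sel), ('clf', clf)])
# 'auto' (equivalent to 'sqrt' for classifiers) was removed in scikit-learn 1.3
params = {'clf__max_features': ['sqrt', 'log2']}
gs = GridSearchCV(model, params)
gs.fit(X_train, y_train)
# How well do your hyperparameter optimizations generalize
# to unseen test data?
gs.score(X_test, y_test)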
Two examples for further reading:
Pipelining: chaining a PCA and a logistic regression
Sample pipeline for text feature extraction and evaluation
You may also get the error in the title if you were oversampling or undersampling your data using imblearn module and fitting it into a model in a pipeline. If you got this message, then it means you have imported sklearn.pipeline.Pipeline. Import imblearn.pipeline.Pipeline instead and you're golden. For example,
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC
pipe = Pipeline([('o', SMOTE()), ('svc', SVC())])
The problem is, if you're sampling your data, the intermediate steps obviously need to sample the data as well, which is not supported by sklearn's Pipeline but is supported by imblearn's Pipeline.
This has happened because the first transformer you pass in a pipeline must have both a fit and transform method.
m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
Here m is an ExtraTreesClassifier, which has no transform method, so it fails as an intermediate step in the pipeline.
So reorder the pipeline and use a transformer as its first step.

How to Get feature_importance when using sklearn2pmml

I trained a GBDT model named 'GB' with scikit-learn in Python, and I want to export the trained model to a PMML file. But I ran into this problem:
1. If I put the already trained 'GB' model into a PMMLPipeline and export it with sklearn2pmml, like below:
GB = GradientBoostingClassifier(n_estimators=100,learning_rate=0.05)
GB.fit(train[list(x_features)], train['Target'])
GB_pipeline = PMMLPipeline([("classifier",GB)])
sklearn2pmml.sklearn2pmml(GB_pipeline,pmml='GB.pmml')
importance=GB.feature_importances_
I get the warning 'The 'active_fields' attribute is not set', and all the feature names are lost in the exported PMML file.
2. If I instead train the model directly in the PMMLPipeline, GB_pipeline has no feature_importances_ attribute, so I cannot inspect the feature importances of the model. Like below:
GB_pipeline = PMMLPipeline([("classifier",GradientBoostingClassifier(n_estimators=100,learning_rate=0.05))])
GB_pipeline.fit(train[list(x_features)], train['Target'])
sklearn2pmml.sklearn2pmml(GB_pipeline,pmml='GB.pmml')
What should I do so that I can both inspect the feature importances of the model and keep the feature names in the exported PMML file?
Thank you very much!
Important points:
1. Instantiate the classifier outside of the pipeline.
2. Instantiate the (PMML-) pipeline and insert this classifier into it.
3. Fit this pipeline as a whole.
4. Print the feature importances of this classifier, and export this pipeline into a PMML document.
In your first code example, you're fitting the classifier, but you should be fitting the pipeline as a whole - hence the warning that the internal state of the pipeline is incomplete. In your second code example, you don't have a direct reference to the classifier (however, you could obtain it by "parsing" the last step of the fitted pipeline).
A complete example based on the Iris dataset:
import pandas
iris_df = pandas.read_csv("Iris.csv")
from sklearn.ensemble import GradientBoostingClassifier
from sklearn2pmml import sklearn2pmml, PMMLPipeline
gbt = GradientBoostingClassifier()
pipeline = PMMLPipeline([
("classifier", gbt)
])
pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])
print (gbt.feature_importances_)
sklearn2pmml(pipeline, "GBTIris.pmml", with_repr = True)
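If you only hold the fitted pipeline and not a separate reference to the classifier, the last step can be pulled back out of it. The sketch below uses a plain scikit-learn Pipeline and the bundled iris data as stand-ins (assumptions, so that it runs without sklearn2pmml or a CSV file); PMMLPipeline subclasses Pipeline, so the same step access should apply to it.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

iris = load_iris()

# The classifier is created inline, so no outside reference to it exists.
pipeline = Pipeline([
    ("classifier", GradientBoostingClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(iris.data, iris.target)

# Recover the fitted classifier from the last (name, estimator) pair.
name, fitted_classifier = pipeline.steps[-1]
importances = fitted_classifier.feature_importances_

for feature, importance in zip(iris.feature_names, importances):
    print(f"{feature}: {importance:.3f}")
```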
If you came here, like me, wanting to include the importances in the PMML file exported from Python, then I have good news.
From searching the internet I learned that we have to set the importances field manually on the RF model in Python so that it gets stored inside the PMML.
TL;DR Here is the code:
# Keep a reference to the model object outside the pipeline (this is the trick)
RFModel = RandomForestRegressor()
# Make the pipeline as usual
column_trans = ColumnTransformer([
('onehot', OneHotEncoder(drop='first'), ["sex", "smoker", "region"]),
('Stdscaler', StandardScaler(), ["age", "bmi"]),
('MinMxscaler', MinMaxScaler(), ["children"])
])
pipeline = PMMLPipeline([
('col_transformer', column_trans),
('model', RFModel)
])
# Fit the pipeline
pipeline.fit(X, y)
# Store the importances in a temporary variable
importances = RFModel.feature_importances_
# Assign them in the MODEL ITSELF (The main part)
RFModel.pmml_feature_importances_ = importances
# Finally save the model as usual
sklearn2pmml(pipeline, r"path\file.pmml")
Now, you will see the importances in the PMML file!!
Reference from: Openscoring
Another way to do this is to reach the model through the PMML pipeline. This is very similar to Aayush Shah's answer, but here we actually use the PMMLPipeline to see the importances. See below:
model = DecisionTreeClassifier()
pmml_pipeline = PMMLPipeline([
("preprocessing",preprocessing_step),
('decisiontree',model)
])
# access to your model using pmml_pipeline[1] , then call feature importances
pmml_pipeline[1].feature_importances_

What is the difference between pipeline and make_pipeline in scikit?

I got this from the sklearn webpage:
Pipeline: Pipeline of transforms with a final estimator
Make_pipeline: Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor.
But I still do not understand when I have to use each one. Can anyone give me an example?
The only difference is that make_pipeline generates names for steps automatically.
Step names are needed e.g. if you want to use a pipeline with model selection utilities (e.g. GridSearchCV). With grid search you need to specify parameters for various steps of a pipeline:
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
Compare it with make_pipeline:
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
So, with Pipeline:
names are explicit, you don't have to figure them out if you need them;
name doesn't change if you change estimator/transformer used in a step, e.g. if you replace LogisticRegression() with LinearSVC() you can still use clf__C.
make_pipeline:
shorter and arguably more readable notation;
names are auto-generated using a straightforward rule (the lowercased class name of the estimator).
When to use them is up to you :) I prefer make_pipeline for quick experiments and Pipeline for more stable code; a rule of thumb: IPython Notebook -> make_pipeline; Python module in a larger project -> Pipeline. But it is certainly not a big deal to use make_pipeline in a module or Pipeline in a short script or a notebook.
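To see the naming rule in action, a small sketch like the following prints the step names that each constructor produces. The StandardScaler pair is an extra case added here (an assumption beyond the answer above) to show what happens when the same estimator type appears twice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

explicit = Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())])
implicit = make_pipeline(CountVectorizer(), LogisticRegression())

explicit_names = [name for name, _ in explicit.steps]
implicit_names = [name for name, _ in implicit.steps]
print(explicit_names)
print(implicit_names)

# Repeated estimator types get a numeric suffix so that names stay unique.
repeated = make_pipeline(StandardScaler(), StandardScaler())
repeated_names = [name for name, _ in repeated.steps]
print(repeated_names)

# The step names are the prefixes you use in a parameter grid.
has_explicit_param = "clf__C" in explicit.get_params()
has_implicit_param = "logisticregression__C" in implicit.get_params()
print(has_explicit_param, has_implicit_param)
```

Checking get_params() like this is a handy way to find the right key before building a param_grid.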

Scaling data in RFECV with scikit-learn

It is common to scale the training and testing data separately before training and prediction in a classification task.
I want to embed this process in RFECV, which runs cross-validation, so I tried the following:
First, I ran
X_scaled = preprocessing.scale(X) up front, where X is the whole data set. But that way the training and testing data are not scaled separately, which is not what I want.
The other way I tried is to pass:
scaling_svm = Pipeline([('scaler', preprocessing.StandardScaler()),
('svm',LinearSVC(penalty=penalty, dual=False, class_weight='auto'))])
as the estimator argument of RFECV:
rfecv = RFECV(estimator=scaling_svm, step=1, cv=StratifiedKFold(y, 7),
scoring=score, verbose=0)
However, I got an error since RFECV needs the estimator to have attribute .coef_.
What am I supposed to do? Any help would be appreciated.
A bit late to the party, admittedly, but if anyone is interested you can create a customised pipeline as follows:
from sklearn.pipeline import Pipeline
class RfePipeline(Pipeline):
    @property
    def coef_(self):
        # Expose the final estimator's coefficients so RFECV can find them
        return self._final_estimator.coef_
And then replace Pipeline with RfePipeline in your code.
See similar question here.
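As an alternative to subclassing, newer scikit-learn releases (0.24 and later, as far as I know) let RFECV take an importance_getter argument that names where the coefficients live inside the estimator. The sketch below assumes such a version; the synthetic data from make_classification, the 3-fold split and the "accuracy" scorer are stand-ins for the question's own data and settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X, y = make_classification(n_samples=120, n_features=8, n_informative=3,
                           random_state=0)

scaling_svm = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", LinearSVC(dual=False)),
])

rfecv = RFECV(
    estimator=scaling_svm,
    step=1,
    cv=StratifiedKFold(n_splits=3),
    scoring="accuracy",
    # Dotted path, resolved on the fitted pipeline, to the SVM coefficients.
    importance_getter="named_steps.svm.coef_",
)
rfecv.fit(X, y)

print(rfecv.n_features_)
print(rfecv.support_)
```

Because the scaler sits inside the pipeline, it is refitted on each training fold, which is the separation of training and testing data that the question asks for.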
