Getting feature_importances_ after getting optimal TPOT pipeline? - python

I've read through a few pages, but I need someone to help explain how to make this work.
I'm using TPOTRegressor() to get an optimal pipeline, but from there I would love to be able to plot the .feature_importances_ of the pipeline it returns:
best_model = TPOTRegressor(cv=folds, generations=2, population_size=10, verbosity=2, random_state=seed) #memory='./PipelineCache', memory='auto',
best_model.fit(X_train, Y_train)
feature_importance = best_model.fitted_pipeline_.steps[-1][1].feature_importances_
I saw this kind of setup in a now-closed issue on GitHub, but currently I get the error:
Best pipeline: LassoLarsCV(input_matrix, normalize=True)
Traceback (most recent call last):
File "main2.py", line 313, in <module>
feature_importance = best_model.fitted_pipeline_.steps[-1][1].feature_importances_
AttributeError: 'LassoLarsCV' object has no attribute 'feature_importances_'
So, how would I get these feature importances from the optimal pipeline, regardless of which one it lands on? Or is this even possible? Or does someone have a better way of going about trying to plot feature importances from a TPOT run?
Thanks!
UPDATE
For clarification, what is meant by feature importance is a measure of how important each feature (the X columns) of your dataset is in determining the predicted (Y) label, with a bar chart used to plot each feature's level of importance in arriving at its predictions. TPOT doesn't do this directly (I don't think), so I was thinking I'd grab the pipeline it came up with, re-run it on the training data, and then somehow use its .feature_importances_ to graph the feature importances, since these are all sklearn regressors I'm using?

Very nice question.
You just need to fit the best model again in order to get the feature importances.
best_model.fit(X_train, Y_train)
extracted_best_model = best_model.fitted_pipeline_.steps[-1][1]
The last line returns the best model based on the CV.
You can then use:
extracted_best_model.fit(X_train, Y_train)
to train it. If the best model has the desired attribute, then you will be able to access it after extracted_best_model.fit(X_train, Y_train).
More details (in the comments) and a toy example:
from tpot import TPOTRegressor
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25)

# Reduce the training set for time's sake
X_train = X_train[:100, :]
y_train = y_train[:100]

# Fit the TPOT pipeline
tpot = TPOTRegressor(cv=2, generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)

# Get the best model (the final step of the fitted pipeline)
extracted_best_model = tpot.fitted_pipeline_.steps[-1][1]
print(extracted_best_model)
AdaBoostRegressor(base_estimator=None, learning_rate=0.5, loss='square',
                  n_estimators=100, random_state=None)
# Train the extracted best model on THE WHOLE DATASET.
# You need to use the whole dataset in order to get feature importances for
# all the features in your data; fitting on X_train alone would only yield
# importances for the training subset.
extracted_best_model.fit(X, y)
# Access its feature importances
extracted_best_model.feature_importances_
# Plot them as a bar chart
positions = range(extracted_best_model.feature_importances_.shape[0])
plt.bar(positions, extracted_best_model.feature_importances_)
plt.show()
IMPORTANT NOTE: In the above example, the best model found by the pipeline was AdaBoostRegressor(base_estimator=None, learning_rate=0.5, loss='square'). This model indeed has the attribute feature_importances_.
In the case where the best model does not have a feature_importances_ attribute, the exact same code will not work. You will need to read the docs and check the attributes of whichever best model is returned. E.g. if the best model were LassoCV, you would use the coef_ attribute instead.
Output: a bar chart of the per-feature importances.
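As a further illustration of the note above: since the pipeline in the question actually ended in LassoLarsCV, here is a minimal sketch of the same idea using coef_ (the coefficient magnitudes serve as a rough importance measure for a linear model, and are only comparable across features on similar scales):
import numpy as np
import matplotlib.pyplot as plt
# extracted_best_model is assumed to be the fitted LassoLarsCV pulled out of
# tpot.fitted_pipeline_.steps[-1][1], as above
coefs = np.abs(extracted_best_model.coef_)  # magnitude of each coefficient
positions = range(coefs.shape[0])
plt.bar(positions, coefs)
plt.show()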

Related

How to pickle or otherwise save an RFECV model after fitting for rapid classification of novel data

I am generating a predictive model for cancer diagnosis from a moderately large dataset (>4500 features).
I have got the rfecv to work, providing me with a model that I can evaluate nicely using ROC curves, confusion matrices etc., and which is performing acceptably for classifying novel data.
Please find a truncated version of my code below.
logo = LeaveOneGroupOut()
model = RFECV(LinearDiscriminantAnalysis(), step=1, cv=logo.split(X, y, groups=trial_number))
model.fit(X, y)
As I say, this works well and provides a model I'm happy with. The trouble is, I would like to be able to save this model so that I don't need to do the lengthy retraining every time I want to evaluate new data.
When I have tried to pickle a standard LDA or other model object, this has worked fine. When I try to pickle this RFECV object, however, I get the following error:
Traceback (most recent call last):
File "/rds/general/user/***/home/data_analysis/analysis_report_generator.py", line 56, in <module>
pickle.dump(key, file)
TypeError: cannot pickle 'generator' object
In trying to address this, I have spent a long time trying to RTFM, googling extensively, and digging as deep as I dared into Stack Overflow, without any luck.
I would be grateful if anyone could identify what I could do to pickle this model successfully for future extraction and re-use, or whether there is an equivalent way to save the parameters of the feature-extracted LDA model for rapid analysis of new data.
This occurs because LeaveOneGroupOut().split(X, y, groups=groups) returns a generator object, and generator objects cannot be pickled.
To make the model picklable, you'd have to materialize the generator into a finite list of splits, as shown below, or replace it with a CV object such as StratifiedKFold, which does not have this issue.
rfecv = RFECV(
    # ...
    cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
)
MRE putting all the pieces together (here I've assigned groups randomly):
import pickle

from numpy.random import default_rng
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFECV
from sklearn.model_selection import LeaveOneGroupOut

rng = default_rng()
X, y = make_classification(n_samples=500, n_features=15, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, class_sep=0.8, random_state=0)
groups = rng.integers(0, 5, size=len(y))

rfecv = RFECV(
    estimator=LinearDiscriminantAnalysis(),
    step=1,
    cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
    scoring="accuracy",
    min_features_to_select=1,
    n_jobs=4,
)
rfecv.fit(X, y)

with open("rfecv_lda.pickle", "wb") as fh:
    pickle.dump(rfecv, fh)
Side note: A better method would be to avoid pickling the RFECV in the first place. rfecv.transform(X) masks feature columns that the search deemed unnecessary. If you have >4500 features and only need 10, you might want to simplify your data pipeline elsewhere.
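For example, here is a minimal sketch of that alternative, continuing from the fitted rfecv above (X_new stands in for hypothetical novel data): persist only the boolean support mask and a final estimator refit on the reduced features, instead of the whole RFECV object.
# Keep only the selected columns and refit a plain LDA on them
mask = rfecv.support_  # boolean mask of the retained features
final_lda = LinearDiscriminantAnalysis().fit(X[:, mask], y)
# Persist the small pieces instead of the RFECV object
with open("lda_reduced.pickle", "wb") as fh:
    pickle.dump({"mask": mask, "model": final_lda}, fh)
# Later: reload and classify novel data
with open("lda_reduced.pickle", "rb") as fh:
    saved = pickle.load(fh)
predictions = saved["model"].predict(X_new[:, saved["mask"]])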

How to use KNeighborsClassifier in BaggingClassifier & How to solve "KNN doesn't support sample weights issue"

I am new to Sklearn, and I am trying to combine KNN, Decision Tree, SVM, and Gaussian NB for BaggingClassifier.
Part of my code looks like this:
best_KNN = KNeighborsClassifier(n_neighbors=5, p=1)
best_KNN.fit(X_train, y_train)
majority_voting = VotingClassifier(estimators=[('KNN', best_KNN), ('DT', best_DT), ('SVM', best_SVM), ('gaussian', gaussian_NB)], voting='hard')
majority_voting.fit(X_train, y_train)
bagging = BaggingClassifier(base_estimator=majority_voting)
bagging.fit(X_train, y_train)
But this causes an error saying:
TypeError: Underlying estimator KNeighborsClassifier does not support sample weights.
The "bagging" part worked fine if I remove KNN.
Does anyone have any idea to solve this issue? Thank you for your time.
In BaggingClassifier you can only use base estimators that support sample weights, because bagging passes a sample_weight parameter down to the estimator's fit method.
You can list all the classifiers that do support it like this:
import inspect
from sklearn.utils import all_estimators  # was sklearn.utils.testing in older versions

for name, clf in all_estimators(type_filter='classifier'):
    if 'sample_weight' in inspect.signature(clf.fit).parameters:
        print(name)
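If you specifically want to keep KNN in the ensemble, one possible workaround (a sketch, not part of the original answer) is a thin subclass whose fit() accepts and silently ignores sample_weight, so the support check passes. The caveat is that the weights then have no effect on the KNN member.
from sklearn.neighbors import KNeighborsClassifier

class SampleWeightlessKNN(KNeighborsClassifier):
    """KNN whose fit() accepts a sample_weight argument and ignores it."""
    def fit(self, X, y, sample_weight=None):
        # sample_weight is deliberately discarded: KNN has no use for it,
        # but accepting the parameter satisfies the ensemble's check.
        return super().fit(X, y)

best_KNN = SampleWeightlessKNN(n_neighbors=5, p=1)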

All intermediate steps should be transformers and implement fit and transform

I am implementing a pipeline that selects the important features and then uses those same features to train my random forest classifier. Following is my code.
m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(train_cv_x)
clf = RandomForestClassifier(5000)
model = Pipeline([('m', m),('sel', sel),('X_new', X_new),('clf', clf),])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}
gs = GridSearchCV(model, params)
gs.fit(train_cv_x,train_cv_y)
So X_new are the new features selected via SelectFromModel and sel.transform. Then I want to train my RF using the new features selected.
I am getting the following error:
All intermediate steps should be transformers and implement fit and transform,
ExtraTreesClassifier ...
Like the traceback says: each step in your pipeline needs to have a fit() and transform() method (except the last, which just needs fit()). This is because a pipeline chains together transformations of your data at each step.
sel.transform(train_cv_x) is not an estimator and doesn't meet this criterion.
In fact, it looks like based on what you're trying to do, you can leave this step out. Internally, ('sel', sel) already does this transformation; that's why it's included in the pipeline.
Secondly, ExtraTreesClassifier (the first step in your pipeline) doesn't have a transform() method either, as you can verify in its class docstring. Supervised learning models aren't made for transforming data; they're made for fitting on it and predicting based off that.
What type of classes are able to do transformations?
Ones that scale your data. See preprocessing and normalization.
Ones that transform your data (in some other way than the above). Decomposition and other unsupervised learning methods do this.
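As an illustration (not from the original answer), the contract a non-final pipeline step must satisfy is roughly this, shown here with a hypothetical no-op transformer:
from sklearn.base import BaseEstimator, TransformerMixin

class IdentityTransformer(BaseEstimator, TransformerMixin):
    """A minimal valid pipeline step: it implements fit() and transform()."""
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        return X  # pass the data through unchanged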
Without reading between the lines too much about what you're trying to do here, this would work for you:
1. First split x and y using train_test_split. The test dataset produced by this is held out for final testing, and the train dataset within GridSearchCV's cross-validation will be further broken out into smaller train and validation sets.
2. Build a pipeline that satisfies what your traceback is trying to tell you.
3. Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.
Roughly, that would look like this:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444),
                      threshold='mean')
clf = RandomForestClassifier(n_estimators=5000, random_state=444)
model = Pipeline([('sel', sel), ('clf', clf)])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}

gs = GridSearchCV(model, params)
gs.fit(X_train, y_train)

# How well do your hyperparameter optimizations generalize
# to unseen test data?
gs.score(X_test, y_test)
Two examples for further reading:
Pipelining: chaining a PCA and a logistic regression
Sample pipeline for text feature extraction and evaluation
You may also get the error in the title if you are oversampling or undersampling your data using the imblearn module and fitting it into a model in a pipeline. If you get this message, it means you have imported sklearn.pipeline.Pipeline. Import imblearn.pipeline.Pipeline instead and you're golden. For example:
from imblearn.pipeline import Pipeline
pipe = Pipeline([('o', SMOTE()), ('svc', SVC())])
The problem is, if you're sampling your data, the intermediate steps obviously need to sample the data as well, which is not supported by sklearn's Pipeline but is supported by imblearn's Pipeline.
This has happened because the first transformer you pass in a pipeline must have both a fit and transform method.
m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
Here m does not have a transform method, because the ExtraTreesClassifier model does not have one, and so it fails in the pipeline.
So change the order of the steps and use an actual transformer, not the classifier, as the first step in the pipeline.

How to Get feature_importance when using sklearn2pmml

I trained a GBDT model named 'GB' in Python sklearn, and I want to export this trained model to a PMML file. But I ran into this problem:
1. If I put the trained 'GB' model into a PMMLPipeline and use sklearn2pmml to export the model, like below:
GB = GradientBoostingClassifier(n_estimators=100, learning_rate=0.05)
GB.fit(train[list(x_features)], train['Target'])
GB_pipeline = PMMLPipeline([("classifier", GB)])
sklearn2pmml.sklearn2pmml(GB_pipeline, pmml='GB.pmml')
importance = GB.feature_importances_
there is a warning, 'The 'active_fields' attribute is not set', and I lose all the feature names in the exported PMML file.
2. If I train the model directly in the PMMLPipeline, then since there is no feature_importances_ attribute on GB_pipeline, I cannot observe the feature importances of the model. Like below:
GB_pipeline = PMMLPipeline([("classifier", GradientBoostingClassifier(n_estimators=100, learning_rate=0.05))])
GB_pipeline.fit(train[list(x_features)], train['Target'])
sklearn2pmml.sklearn2pmml(GB_pipeline, pmml='GB.pmml')
What should I do so that I can both observe the feature importances of the model and keep the feature names in the exported PMML file?
Thank you very much!
Important points:
Instantiate the classifier outside of pipeline
Instantiate the (PMML-) pipeline, insert this classifier into it.
Fit this pipeline as a whole.
Print the feature importances of this classifier, and export this pipeline into a PMML document.
In your first code example, you're fitting the classifier, but you should be fitting the pipeline as a whole - hence the warning that the internal state of the pipeline is incomplete. In your second code example, you don't have a direct reference to the classifier (however, you could obtain it by "parsing" the last step of the fitted pipeline).
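For the second case, a short sketch of that "parsing": the classifier sits in the final (name, estimator) step of the fitted pipeline.
# Grab the estimator from the last step of the fitted PMMLPipeline
clf = GB_pipeline.steps[-1][1]
print(clf.feature_importances_)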
A complete example based on the Iris dataset:
import pandas

iris_df = pandas.read_csv("Iris.csv")

from sklearn.ensemble import GradientBoostingClassifier
from sklearn2pmml import sklearn2pmml, PMMLPipeline

gbt = GradientBoostingClassifier()
pipeline = PMMLPipeline([
    ("classifier", gbt)
])
pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])
print(gbt.feature_importances_)
sklearn2pmml(pipeline, "GBTIris.pmml", with_repr = True)
If you have come here, like me, to include the importances inside the PMML file exported from Python, then I have good news.
After searching the internet, I learned that we have to set the importances field manually on the RF model in Python so that they get stored inside the PMML.
TL;DR Here is the code:
# Keep a reference to the model object outside the pipeline; that is the trick
RFModel = RandomForestRegressor()
# Make the pipeline as usual
column_trans = ColumnTransformer([
    ('onehot', OneHotEncoder(drop='first'), ["sex", "smoker", "region"]),
    ('Stdscaler', StandardScaler(), ["age", "bmi"]),
    ('MinMxscaler', MinMaxScaler(), ["children"])
])
pipeline = PMMLPipeline([
    ('col_transformer', column_trans),
    ('model', RFModel)
])
# Fit the pipeline
pipeline.fit(X, y)
# Store the importances in a temporary variable
importances = RFModel.feature_importances_
# Assign them on the MODEL ITSELF (the main part)
RFModel.pmml_feature_importances_ = importances
# Finally save the model as usual
sklearn2pmml(pipeline, r"path\file.pmml")
Now, you will see the importances in the PMML file!!
Reference from: Openscoring
Another way to do this is by referring to the model inside the PMML pipeline, very similar to Aayush Shah's answer, but here we are actually using the PMMLPipeline to see the importances. See below:
model = DecisionTreeClassifier()
pmml_pipeline = PMMLPipeline([
    ("preprocessing", preprocessing_step),
    ('decisiontree', model)
])
# Access the model via pmml_pipeline[1], then read its feature importances
pmml_pipeline[1].feature_importances_

Scaling data in RFECV with scikit-learn

It is common to scale the training and testing data separately before the training and prediction stages of a classification task.
I want to embed the aforementioned process in RFECV, which runs CV tests internally, so I tried the following:
First, do X_scaled = preprocessing.scale(X), where X is the whole data set. But by doing so, the training and testing data are not scaled separately, which is not what I want.
The other way I tried is to pass:
scaling_svm = Pipeline([('scaler', preprocessing.StandardScaler()),
                        ('svm', LinearSVC(penalty=penalty, dual=False, class_weight='auto'))])
as the estimator argument to RFECV:
rfecv = RFECV(estimator=scaling_svm, step=1, cv=StratifiedKFold(y, 7),
              scoring=score, verbose=0)
However, I got an error, since RFECV needs the estimator to have a .coef_ attribute.
What am I supposed to do? Any help would be appreciated.
A bit late to the party, admittedly, but if anyone is interested you can create a customised pipeline as follows:
from sklearn.pipeline import Pipeline

class RfePipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
And then replace Pipeline with RfePipeline in your code.
See similar question here.
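A short usage sketch of the idea (with hypothetical X, y data; the deprecated class_weight='auto' is dropped here): RfePipeline exposes the coef_ of its final step, so RFECV accepts it, and the scaler is re-fit inside each CV fold.
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

scaling_svm = RfePipeline([('scaler', StandardScaler()),
                           ('svm', LinearSVC(penalty='l2', dual=False))])
rfecv = RFECV(estimator=scaling_svm, step=1, cv=7)
rfecv.fit(X, y)  # scaling now happens inside each CV fold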
