I'm trying to save all models generated by autosklearn, but I can only get the best model.
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import autosklearn.classification
# Load data
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = \
sklearn.model_selection.train_test_split(X, y, random_state=1)
automl = autosklearn.classification.AutoSklearnClassifier(
time_left_for_this_task=120,
per_run_time_limit=30,
tmp_folder='/tmp/autosklearn_classification_example_tmp',
output_folder='/tmp/autosklearn_classification_example_out',
)
automl.fit(X_train, y_train, dataset_name='breast_cancer')
# Show all models
print(automl.show_models())
# Here it uses the best model
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))
Is show_models() printing the models used in the best ensemble?
Is there any way to get other models?
Auto-sklearn pipelines can be inspected in several ways; one of them is PipelineProfiler, which lets you examine all of the best-performing pipelines.
To begin, install PipelineProfiler on your machine:
pip install pipelineprofiler
Then write the following code:
import PipelineProfiler
# automl is the AutoSklearnClassifier object that has already been fitted
profiler_data = PipelineProfiler.import_autosklearn(automl)
PipelineProfiler.plot_pipeline_matrix(profiler_data)
Your output should be a plot that shows all the pipelines.
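Alternatively, depending on your auto-sklearn version, the fitted classifier itself exposes inspection helpers. A minimal sketch, assuming a version that provides get_models_with_weights() and leaderboard():
# Assumes `automl` is the fitted AutoSklearnClassifier from the question.
# Each ensemble member is returned together with its weight in the ensemble.
for weight, model in automl.get_models_with_weights():
    print(weight, model)
# Recent versions also provide a leaderboard of all evaluated models.
print(automl.leaderboard())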
Assume that I want to apply several feature selection methods using sklearn pipeline. An example is provided below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
fs_pipeline = Pipeline([('vt', VarianceThreshold(0.01)),
('kbest', SelectKBest(chi2, k=5)),
])
X_new = fs_pipeline.fit_transform(X_train, y_train)
I get the selected features using the fit_transform method; if I call fit on the pipeline instead, I get the fitted pipeline object back.
Now, assume that I want to add an ML model to the pipeline, like below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = Pipeline([('vt', VarianceThreshold(0.01)),
('kbest', SelectKBest(chi2, k=5)),
('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
If I use the fit_transform method in the above code (model.fit_transform(X_train, y_train)), I get the error:
AttributeError: 'GradientBoostingClassifier' object has no attribute 'transform'
So I should use model.fit(X_train, y_train). But how can I be sure that the pipeline applied the fit_transform method for the feature selection steps?
A pipeline is meant for sequential data transformation, which requires successive calls to .fit_transform(). You can be sure that .fit_transform() is called on the intermediate steps (basically, on all steps but the last one) of a pipeline, as that's how it works by design.
Namely, when calling .fit() or .fit_transform() on a Pipeline instance, .fit_transform() is called sequentially on all steps but the last one, and the output of each call is passed as input to the next. On the very last step, either .fit() or .fit_transform() is called, depending on the method called on the pipeline itself; indeed, the last step is more commonly an estimator than a transformer (as with your GradientBoostingClassifier).
Whenever the last step is an estimator rather than a transformer, as in your case, you won't be able to call .fit_transform() on the pipeline instance, because the pipeline exposes the same methods as its final step, and estimators expose neither .transform() nor .fit_transform().
Summing up,
case with an estimator in the last step (you can only call .fit() on the pipeline); model.fit(X_train, y_train) means the following:
final_estimator.fit(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)...), y_train)
which in your case becomes
gbc.fit(kbest.fit_transform(vt.fit_transform(X_train, y_train), y_train), y_train)
case with a transformer in the last step (you can either call .fit() or .fit_transform() on the pipeline, but let's suppose you're calling .fit_transform()); model.fit_transform(X_train, y_train) means the following:
final_estimator.fit_transform(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)...), y_train)
Finally, here's the reference in the source code: https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/pipeline.py#L351
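To see this order of calls for yourself, here is a small sketch with a hypothetical LoggingTransformer (introduced here for illustration) that prints whenever its methods run:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer

# Hypothetical pass-through transformer that only logs its calls.
class LoggingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, name='t'):
        self.name = name

    def fit(self, X, y=None):
        print(f"{self.name}.fit called")
        return self

    def transform(self, X):
        print(f"{self.name}.transform called")
        return X

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([('t1', LoggingTransformer('t1')),
                 ('t2', LoggingTransformer('t2')),
                 ('gbc', GradientBoostingClassifier(random_state=0))])
pipe.fit(X, y)
Fitting the pipeline prints t1.fit, t1.transform, t2.fit, t2.transform before the classifier is fitted, confirming that the intermediate steps are fit-transformed in sequence.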
I've read through a few pages but need someone to help explain how to make this work.
I'm using TPOTRegressor() to get an optimal pipeline, but from there I would love to be able to plot the .feature_importances_ of the pipeline it returns:
best_model = TPOTRegressor(cv=folds, generations=2, population_size=10, verbosity=2, random_state=seed) #memory='./PipelineCache', memory='auto',
best_model.fit(X_train, Y_train)
feature_importance = best_model.fitted_pipeline_.steps[-1][1].feature_importances_
I saw this kind of setup in a now-closed issue on GitHub, but currently I get the error:
Best pipeline: LassoLarsCV(input_matrix, normalize=True)
Traceback (most recent call last):
File "main2.py", line 313, in <module>
feature_importance = best_model.fitted_pipeline_.steps[-1][1].feature_importances_
AttributeError: 'LassoLarsCV' object has no attribute 'feature_importances_'
So, how would I get these feature importances from the optimal pipeline, regardless of which one it lands on? Or is this even possible? Or does someone have a better way of going about trying to plot feature importances from a TPOT run?
Thanks!
UPDATE
For clarification, what is meant by feature importance is the determination of how important each feature (the X's) of your dataset is in determining the predicted (Y) label, using a bar chart to plot each feature's level of importance in coming up with its predictions. TPOT doesn't do this directly (I don't think), so I was thinking I'd grab the pipeline it came up with, re-run it on the training data, and then somehow use .feature_importances_ to graph the feature importances, as these are all sklearn regressors I'm using?
Very nice question.
You just need to fit the best model again in order to get the feature importances.
best_model.fit(X_train, Y_train)
extracted_best_model = best_model.fitted_pipeline_.steps[-1][1]
The last line returns the best model found during the CV search.
You can then use:
extracted_best_model.fit(X_train, Y_train)
to train it. If the best model has the desired attribute, you will be able to access it after calling extracted_best_model.fit(X_train, Y_train).
More details (in the code comments) and a toy example:
from tpot import TPOTRegressor
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75, test_size=0.25)
# reduce training rows for time's sake (all feature columns are kept)
X_train = X_train[:100, :]
y_train = y_train[:100]
# Fit the TPOT pipeline
tpot = TPOTRegressor(cv=2, generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
# Get the best model
extracted_best_model = tpot.fitted_pipeline_.steps[-1][1]
print(extracted_best_model)
# AdaBoostRegressor(base_estimator=None, learning_rate=0.5, loss='square',
#                   n_estimators=100, random_state=None)
# Train the extracted_best_model using THE WHOLE DATASET, so the
# importances reflect all of the training rows.
extracted_best_model.fit(X, y)
# Access its feature importances
extracted_best_model.feature_importances_
# Plot them using a bar plot
positions = range(extracted_best_model.feature_importances_.shape[0])
plt.bar(positions, extracted_best_model.feature_importances_)
plt.show()
IMPORTANT NOTE: In the above example, the best model found for the pipeline was AdaBoostRegressor(base_estimator=None, learning_rate=0.5, loss='square'). This model indeed has the attribute feature_importances_.
In the case where the best model does not have a feature_importances_ attribute, the exact same code will not work. You will need to read the docs and check the attributes of each returned best model. E.g. if the best model were LassoCV, you would use the coef_ attribute instead.
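If you want code that works regardless of which model TPOT lands on, a minimal sketch (assuming the final step exposes either feature_importances_ or coef_, as tree ensembles and linear models do) could branch on whichever attribute is present:
import numpy as np

# Fall back from feature_importances_ to coef_ so the same plotting code
# works for both tree ensembles and linear models.
final_step = tpot.fitted_pipeline_.steps[-1][1]
if hasattr(final_step, 'feature_importances_'):
    importances = final_step.feature_importances_
elif hasattr(final_step, 'coef_'):
    importances = np.abs(np.ravel(final_step.coef_))
else:
    raise AttributeError("final step exposes neither feature_importances_ nor coef_")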
I am implementing a pipeline that selects important features and then uses those same features to train my random forest classifier. Following is my code.
m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(train_cv_x)
clf = RandomForestClassifier(5000)
model = Pipeline([('m', m),('sel', sel),('X_new', X_new),('clf', clf),])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}
gs = GridSearchCV(model, params)
gs.fit(train_cv_x,train_cv_y)
So X_new contains the new features selected via SelectFromModel and sel.transform. Then I want to train my RF using the selected features.
I am getting the following error:
All intermediate steps should be transformers and implement fit and transform,
ExtraTreesClassifier ...
Like the traceback says: each step in your pipeline needs to have a fit() and transform() method (except the last, which just needs fit()). This is because a pipeline chains together transformations of your data at each step.
X_new = sel.transform(train_cv_x) is a NumPy array, not an estimator, and doesn't meet this criterion.
In fact, it looks like based on what you're trying to do, you can leave this step out. Internally, ('sel', sel) already does this transformation--that's why it's included in the pipeline.
Secondly, ExtraTreesClassifier (the first step in your pipeline), doesn't have a transform() method, either. You can verify that here, in the class docstring. Supervised learning models aren't made for transforming data; they're made for fitting on it and predicting based off that.
What type of classes are able to do transformations?
Ones that scale your data. See preprocessing and normalization.
Ones that transform your data (in some other way than the above). Decomposition and other unsupervised learning methods do this.
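As a quick sanity check (in recent scikit-learn versions), you can test for the transform method directly:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier

print(hasattr(StandardScaler(), 'transform'))        # True: scalers transform data
print(hasattr(PCA(), 'transform'))                   # True: decomposition transforms data
print(hasattr(ExtraTreesClassifier(), 'transform'))  # False: supervised models predict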
Without reading between the lines too much about what you're trying to do here, this would work for you:
First split x and y using train_test_split. The test dataset produced by this is held out for final testing, and the train dataset within GridSearchCV's cross-validation will be further broken out into smaller train and validation sets.
Build a pipeline that satisfies what your traceback is trying to tell you.
Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.
Roughly, that would look like this:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)
sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444),
threshold='mean')
clf = RandomForestClassifier(n_estimators=5000, random_state=444)
model = Pipeline([('sel', sel), ('clf', clf)])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}
gs = GridSearchCV(model, params)
gs.fit(X_train, y_train)
# How well do your hyperparameter optimizations generalize
# to unseen test data?
gs.score(X_test, y_test)
Two examples for further reading:
Pipelining: chaining a PCA and a logistic regression
Sample pipeline for text feature extraction and evaluation
You may also get the error in the title if you are oversampling or undersampling your data using the imblearn module and fitting it to a model in a pipeline. If you get this message, it means you have imported sklearn.pipeline.Pipeline. Import imblearn.pipeline.Pipeline instead and you're golden. For example,
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

pipe = Pipeline([('o', SMOTE()), ('svc', SVC())])
The problem is, if you're sampling your data, the intermediate steps obviously need to sample the data as well, which is not supported by sklearn's Pipeline but is supported by imblearn's Pipeline.
This has happened because the first transformer you pass in a pipeline must have both a fit and transform method.
m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
Here m does not have a transform method, as the ExtraTreesClassifier model does not implement one, so it fails in the pipeline.
So either change the order of the pipeline steps or use a proper transformer as the first step in the pipeline, as sketched below.
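A minimal sketch of that fix, mirroring the earlier answer: wrap the ExtraTreesClassifier in SelectFromModel so the first step is a genuine transformer.
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Wrap the (unfitted) ExtraTreesClassifier in SelectFromModel so the first
# step exposes fit/transform; the pipeline then fits everything in order.
sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10))
clf = RandomForestClassifier(n_estimators=5000)
model = Pipeline([('sel', sel), ('clf', clf)])
# train_cv_x / train_cv_y are the training data from the question.
model.fit(train_cv_x, train_cv_y)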
I'm working on an online course and used the provided code to create two different predictive models. I played around with them a bit to see if I could get different results, which went fine.
My concern is why they are giving me identical accuracy and cross validation scores.
(Screenshots of the Logit Regression and Decision Tree outputs omitted; both showed the same accuracy and cross-validation score.)
I've provided a snippet of code below, as well as the classification function that I took from Python 2.x code. Currently, I am working with Python 3.x and scikit-learn 0.17.1.
# import models from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold # for K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics
import numpy as np

# generic fxn for making a classification model and assessing performance
def classification_model(model, data, predictors, outcome):
    model.fit(data[predictors], data[outcome])  # fit model
    predictions = model.predict(data[predictors])  # make predictions on training set

    # print accuracy
    accuracy = metrics.accuracy_score(predictions, data[outcome])
    print("Accuracy: %s" % "{0:.3%}".format(accuracy))

    # k-fold cross validation, 5 folds
    kf = KFold(data.shape[0], n_folds=5)
    error = []
    for train, test in kf:
        train_predictors = data[predictors].iloc[train, :]  # filter training data
        train_target = data[outcome].iloc[train]
        model.fit(train_predictors, train_target)
        # record score from each run
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))
    print("Cross-Validation Score: %s" % "{0:.3%}".format(np.mean(error)))

    # fit model (again) so it can be referred to outside the fxn
    model.fit(data[predictors], data[outcome])
Am I using outdated parameter names? Or is there something wrong with the classifier function?
I'm currently not getting any errors from the code I'm running, but getting the exact same accuracy and cross-validation scores is concerning.
I'm trying to write a unit test for some of my code that uses scikit-learn. However, my unit tests seem to be non-deterministic.
AFAIK, the only places in my code where scikit-learn uses any randomness are in its LogisticRegression model and its train_test_split, so I have the following:
RANDOM_SEED = 5
self.lr = LogisticRegression(random_state=RANDOM_SEED)
X_train, X_test, y_train, test_labels = train_test_split(docs, labels, test_size=TEST_SET_PROPORTION, random_state=RANDOM_SEED)
But this doesn't seem to work -- even when I pass a fixed docs and a fixed labels, the prediction probabilities on a fixed validation set vary from run to run.
I also tried adding a numpy.random.seed(RANDOM_SEED) call at the top of my code, but that didn't seem to work either.
Is there anything I'm missing? Is there a way to pass a seed to scikit-learn in a single place, so that seed is used throughout all of scikit-learn's invocations?
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
(X, y) = iris.data, iris.target
RANDOM_SEED = 5
lr = linear_model.LogisticRegression(random_state=RANDOM_SEED)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=RANDOM_SEED)
lr.fit(X_train, y_train)
lr.score(X_test, y_test)
produced 0.93333333333333335 several times now. The way you did it seems fine. Another way is to set np.random.seed(), or to use Sacred for documented randomness. Using random_state is what the docs describe:
If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in unit tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
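A minimal sketch of that pattern, using sklearn.utils.check_random_state with a hypothetical helper function (noisy_sample is a name introduced here for illustration):
import numpy as np
from sklearn.utils import check_random_state

# Accept a random_state argument and draw all randomness from the resulting
# RandomState object instead of the global numpy RNG.
def noisy_sample(X, n, random_state=None):
    rng = check_random_state(random_state)
    idx = rng.choice(len(X), size=n, replace=False)
    return X[idx]

X = np.arange(100).reshape(50, 2)
print(noisy_sample(X, 3, random_state=5))  # reproducible across runs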