How to save sklearn model parameters in a json file? - python

Suppose I have a very simple machine learning model as follows:
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
X, y = load_diabetes(return_X_y=True)
estimators = [
    ('lr', RidgeCV()),
    ('svr', LinearSVR(random_state=42))
]
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=RandomForestRegressor(n_estimators=10,
                                          random_state=42)
)
steps = [
    ("preprocessing", StandardScaler()),
    ("regression", reg)
]
pipe = Pipeline(steps)
What I would like to do is store the whole model's parameters as a JSON file. By that I mean the following information is saved in the JSON file:
Pipeline(steps=[('preprocessing', StandardScaler()),
                ('regression',
                 StackingRegressor(estimators=[('lr',
                                                RidgeCV(alphas=array([ 0.1, 1. , 10. ]))),
                                               ('svr',
                                                LinearSVR(random_state=42))],
                                   final_estimator=RandomForestRegressor(n_estimators=10,
                                                                         random_state=42)))])
When I use json.dumps(pipe) I get the error Object of type Pipeline is not JSON serializable. Any idea how one can do that?

The output you've given is just the string representation, so str(pipe) will produce that. It has a bit of formatting for spacing and newlines, so perhaps not ideal for storage.
You could use pipe.get_params() to retrieve a dictionary of parameters. Unlike the string, that will include all the default parameters as well; to get only the non-default ones you could use the private method _changed_params from sklearn.utils._pprint (source), but being private you'd need to be careful about breaking changes. There might be a more public method along the lines of pretty-printing and display of estimators that gives you what you need.
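For example, a minimal sketch (my own addition, assuming a human-readable record of the parameters is enough; this won't reconstruct the estimator automatically):
import json
# get_params(deep=True) returns nested parameters under step__param keys, but the
# values include estimator objects and numpy arrays, which json cannot serialize.
params = pipe.get_params(deep=True)
# Fall back to the string representation for anything non-serializable.
with open("pipe_params.json", "w") as fh:
    json.dump(params, fh, indent=2, default=str)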

Related

How to pickle or otherwise save an RFECV model after fitting for rapid classification of novel data

I am generating a predictive model for cancer diagnosis from a moderately large dataset (>4500 features).
I have got the RFECV to work, providing me with a model that I can evaluate nicely using ROC curves, confusion matrices, etc., and which performs acceptably when classifying novel data.
Please find a truncated version of my code below.
logo = LeaveOneGroupOut()
model = RFECV(LinearDiscriminantAnalysis(), step=1, cv=logo.split(X, y, groups=trial_number))
model.fit(X, y)
As I say, this works well and provides a model I'm happy with. The trouble is, I would like to be able to save this model so that I don't need to do the lengthy retraining every time I want to evaluate new data.
When I have tried to pickle a standard LDA or other model object, this has worked fine. When I try to pickle this RFECV object, however, I get the following error:
Traceback (most recent call last):
File "/rds/general/user/***/home/data_analysis/analysis_report_generator.py", line 56, in <module>
pickle.dump(key, file)
TypeError: cannot pickle 'generator' object
In trying to address this, I have spent a long time trying to RTFM, googling extensively, and digging as deep as I dared into Stack Overflow, without any luck.
I would be grateful if anyone could identify what I could do to pickle this model successfully for future extraction and re-use, or whether there is an equivalent way to save the parameters of the feature-extracted LDA model for rapid analysis of new data.
This occurs because LeaveOneGroupOut().split(X, y, groups=groups) returns a generator object—which cannot be pickled for reasons previously discussed.
To pickle it, you'd have to materialize the generator into a finite list of splits, with something like the following, or replace it with StratifiedKFold, which does not have this issue.
rfecv = RFECV(
    # ...
    cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
)
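Alternatively, a minimal sketch of the StratifiedKFold option mentioned above (the n_splits value is my assumption; note it ignores the group structure that LeaveOneGroupOut respects):
from sklearn.model_selection import StratifiedKFold
rfecv = RFECV(
    # ...
    cv=StratifiedKFold(n_splits=5),  # a CV splitter object, not a generator, so the fitted RFECV can be pickled
)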
MRE putting all the pieces together (here I've assigned groups randomly):
import pickle
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut
from numpy.random import default_rng
rng = default_rng()
X, y = make_classification(n_samples=500, n_features=15, n_informative=3, n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1, class_sep=0.8, random_state=0)
groups = rng.integers(0, 5, size=len(y))
rfecv = RFECV(
    estimator=LinearDiscriminantAnalysis(),
    step=1,
    cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
    scoring="accuracy",
    min_features_to_select=1,
    n_jobs=4,
)
rfecv.fit(X, y)
with open("rfecv_lda.pickle", "wb") as fh:
    pickle.dump(rfecv, fh)
Side note: A better method would be to avoid pickling the RFECV in the first place. rfecv.transform(X) drops the feature columns that the search deemed unnecessary. If you have >4500 features and only need 10, you might want to simplify your data pipeline elsewhere.
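For instance, a minimal sketch (my own addition, not part of the original answer) that pickles only the boolean mask of selected features plus a refit LDA, instead of the whole RFECV object:
import pickle
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Keep only what is needed at prediction time: the column mask and a final model.
support_mask = rfecv.support_  # boolean mask of the features RFECV selected
final_lda = LinearDiscriminantAnalysis().fit(X[:, support_mask], y)
with open("lda_selected_features.pickle", "wb") as fh:
    pickle.dump({"support_mask": support_mask, "model": final_lda}, fh)
# Later, on new data:
# saved = pickle.load(open("lda_selected_features.pickle", "rb"))
# predictions = saved["model"].predict(X_new[:, saved["support_mask"]])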

Extract log probabilities from MultinomialNB

I have a scikit-learn Pipeline made of a feature extractor and a VotingClassifier, which contains MultinomialNB and some other models. When I train MultinomialNB separately I can extract the log probabilities using nb.feature_log_prob_, but inside the pipeline this attribute is missing.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
vclf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', VotingClassifier(
        estimators=[
            ('nb', MultinomialNB()),
            [...]
        ]
    ))
])
vclf.fit(train_X, train_y)
nb = vclf.named_steps['clf'].estimators[0][1]
nb.feature_log_prob_
AttributeError: 'MultinomialNB' object has no attribute 'feature_log_prob_'
According to the documentation, estimators_ is the correct attribute to access the list of fitted sub-estimators of the VotingClassifier. Your code should therefore look like this:
nb = vclf.named_steps['clf'].estimators_[0]
print(nb.feature_log_prob_)
The MultinomialNB you accessed via estimators was not fitted and therefore did not have the feature_log_prob_ attribute. That is where the error came from.
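As a side note (a hedged alternative, not from the original answer): a fitted VotingClassifier also exposes its sub-estimators by name via named_estimators_, which avoids relying on list order:
# Access the fitted MultinomialNB by the name it was given in the estimators list.
nb = vclf.named_steps['clf'].named_estimators_['nb']
print(nb.feature_log_prob_)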

How to use the imbalanced library with sklearn pipeline?

I am trying to solve a text classification problem. I want to create a baseline model using MultinomialNB.
My data is highly imbalanced for a few categories, hence I decided to use the imbalanced-learn library with the sklearn pipeline, following the tutorial.
The model is failing and giving an error after introducing the two sampling stages in the pipeline as suggested in the docs.
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()
pipe = make_pipeline_imb([('vect', CountVectorizer(max_features=100000,
                                                   ngram_range=(1, 2),
                                                   tokenizer=tokenize_and_stem)),
                          ('tfidf', TfidfTransformer(use_idf=True)),
                          ('enn', EditedNearestNeighbours()),
                          ('renn', RepeatedEditedNearestNeighbours()),
                          ('clf-gnb', MultinomialNB())])
Error:
TypeError: Last step of Pipeline should implement fit. '[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
Can someone please help here? I am also open to using a different implementation (Boosting/SMOTE) as well.
It seems that make_pipeline from imblearn doesn't support named steps the way Pipeline in sklearn does; it takes the estimators directly. From the imblearn documentation:
*steps : list of estimators.
You should modify your code to:
pipe = make_pipeline_imb(CountVectorizer(max_features=100000,
                                         ngram_range=(1, 2),
                                         tokenizer=tokenize_and_stem),
                         TfidfTransformer(use_idf=True),
                         EditedNearestNeighbours(),
                         RepeatedEditedNearestNeighbours(),
                         MultinomialNB())
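If you do want named steps, a hedged alternative (assuming a reasonably recent imbalanced-learn): imblearn.pipeline.Pipeline accepts the same (name, estimator) tuples as sklearn's Pipeline while still allowing samplers as intermediate steps:
from imblearn.pipeline import Pipeline as ImbPipeline
pipe = ImbPipeline([
    ('vect', CountVectorizer(max_features=100000,
                             ngram_range=(1, 2),
                             tokenizer=tokenize_and_stem)),  # tokenize_and_stem is the asker's own tokenizer
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('enn', EditedNearestNeighbours()),
    ('renn', RepeatedEditedNearestNeighbours()),
    ('clf-gnb', MultinomialNB()),
])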

How to Get feature_importance when using sklearn2pmml

I have trained a GBDT model named 'GB' in Python sklearn, and I want to export this trained model into a PMML file. But I run into this problem:
1. If I try to put the trained 'GB' model into a PMMLPipeline and use sklearn2pmml to export the model, like below:
GB = GradientBoostingClassifier(n_estimators=100, learning_rate=0.05)
GB.fit(train[list(x_features)], train['Target'])
GB_pipeline = PMMLPipeline([("classifier", GB)])
sklearn2pmml.sklearn2pmml(GB_pipeline, pmml='GB.pmml')
importance = GB.feature_importances_
there is a warning 'The 'active_fields' attribute is not set', and I lose all the feature names in the exported PMML file.
2. If I try to train the model directly in the PMMLPipeline, there is no feature_importances_ attribute on GB_pipeline, so I cannot observe the feature importances of the model, like below:
GB_pipeline = PMMLPipeline([("classifier", GradientBoostingClassifier(n_estimators=100, learning_rate=0.05))])
GB_pipeline.fit(train[list(x_features)], train['Target'])
sklearn2pmml.sklearn2pmml(GB_pipeline, pmml='GB.pmml')
What should I do so that I can both observe the feature importances of the model and keep the feature names in the exported PMML file?
Thank you very much!
Important points:
Instantiate the classifier outside of the pipeline.
Instantiate the (PMML) pipeline and insert this classifier into it.
Fit this pipeline as a whole.
Print the feature importances of this classifier, and export this pipeline into a PMML document.
In your first code example, you're fitting the classifier, but you should be fitting the pipeline as a whole - hence the warning that the internal state of the pipeline is incomplete. In your second code example, you don't have a direct reference to the classifier (however, you could obtain it by "parsing" the last step of the fitted pipeline).
A complete example based on the Iris dataset:
import pandas
iris_df = pandas.read_csv("Iris.csv")
from sklearn.ensemble import GradientBoostingClassifier
from sklearn2pmml import sklearn2pmml, PMMLPipeline
gbt = GradientBoostingClassifier()
pipeline = PMMLPipeline([
    ("classifier", gbt)
])
pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])
print(gbt.feature_importances_)
sklearn2pmml(pipeline, "GBTIris.pmml", with_repr=True)
If you have come here, like me, to include the importances inside the pipeline exported from Python to PMML, then I have good news.
After searching the internet, I came to know that we have to set the importances field manually on the RF model in Python; then they can be stored inside the PMML.
TL;DR Here is the code:
# Keep the model object outside, which is the trick
RFModel = RandomForestRegressor()
# Make the pipeline as usual
column_trans = ColumnTransformer([
    ('onehot', OneHotEncoder(drop='first'), ["sex", "smoker", "region"]),
    ('Stdscaler', StandardScaler(), ["age", "bmi"]),
    ('MinMxscaler', MinMaxScaler(), ["children"])
])
pipeline = PMMLPipeline([
    ('col_transformer', column_trans),
    ('model', RFModel)
])
# Fit the pipeline
pipeline.fit(X, y)
# Store the importances in a temporary variable
importances = RFModel.feature_importances_
# Assign them to the MODEL ITSELF (the main part)
RFModel.pmml_feature_importances_ = importances
# Finally save the model as usual
sklearn2pmml(pipeline, r"path\file.pmml")
Now, you will see the importances in the PMML file!!
Reference from: Openscoring
Another way to do this is by referring to the model inside the PMML pipeline, very similar to Aayush Shah's answer, but here we actually use the PMMLPipeline to reach the importances. See below:
model = DecisionTreeClassifier()
pmml_pipeline = PMMLPipeline([
    ("preprocessing", preprocessing_step),
    ('decisiontree', model)
])
# access your model using pmml_pipeline[1], then call feature_importances_
pmml_pipeline[1].feature_importances_
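A hedged note: since PMMLPipeline follows the sklearn Pipeline interface, the step should also be reachable by name, which is a bit more robust than positional indexing:
# Equivalent lookup by step name rather than position.
pmml_pipeline.named_steps['decisiontree'].feature_importances_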

How to optimize a sklearn pipeline, using XGboost, for a different `eval_metric`?

I'm trying to use XGBoost and optimize the eval_metric as auc (as described here).
This works fine when using the classifier directly, but fails when I try to use it in a pipeline.
What is the correct way to pass a .fit argument to the sklearn pipeline?
Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from xgboost import XGBClassifier
import xgboost
import sklearn
print('sklearn version: %s' % sklearn.__version__)
print('xgboost version: %s' % xgboost.__version__)
X, y = load_iris(return_X_y=True)
# Without using the pipeline:
xgb = XGBClassifier()
xgb.fit(X, y, eval_metric='auc') # works fine
# Making a pipeline with this classifier and a scaler:
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
# using the pipeline, but not optimizing for 'auc':
pipe.fit(X, y) # works fine
# however this does not work (even after correcting the underscores):
pipe.fit(X, y, classifier__eval_metric='auc') # fails
The error:
TypeError: before_fit() got an unexpected keyword argument 'classifier__eval_metric'
Regarding the version of xgboost:
xgboost.__version__ shows 0.6
pip3 freeze | grep xgboost shows xgboost==0.6a2.
The error is because you are using a single underscore between the estimator name and its parameter when using it in a pipeline. It should be two underscores.
From the documentation of Pipeline.fit(), we see that the correct way of supplying params in fit is:
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
So in your case, the correct usage is:
pipe.fit(X_train, y_train, classifier__eval_metric='auc')
(Notice two underscores between name and param)
When the goal is to optimize, I suggest using the sklearn wrapper and GridSearchCV:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
It looks like
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
score = 'roc_auc'
pipe.fit(X, y)
param = {
    'classifier__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # just as an example
}
gsearch = GridSearchCV(estimator=pipe, param_grid=param, scoring=score)
GridSearchCV then performs the cross-validation when you fit it:
gsearch.fit(X, y)
And you get the best params and the best score:
gsearch.best_params_, gsearch.best_score_
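Putting it together, a minimal sketch (my own addition, assuming a current scikit-learn where GridSearchCV lives in sklearn.model_selection, and a binary dataset so that the 'roc_auc' scorer applies directly):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
# Note the double underscore between the step name and the parameter.
param_grid = {'classifier__max_depth': [2, 4, 6]}
gsearch = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring='roc_auc', cv=5)
gsearch.fit(X, y)
print(gsearch.best_params_, gsearch.best_score_)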
