transfer a sklearn random forest model to a new server - python

I built a model with sklearn's RandomForestClassifier on an old server and now I need to migrate it to another server. How can I transfer the model to the new server? Which Python package should I use: pickle or joblib? Thanks!

Use "joblib".
Suppose your model is in a variable "my_model".
Then the 'joblib' code would go like this:
# On your development machine
from joblib import dump
dump(my_model, 'model.joblib')
# On your new machine, following code would go to load the model
from joblib import load
my_model = load('model.joblib')
Note: replace 'model.joblib' with the path to the model.joblib file.

pickle is the way to go:
from sklearn.linear_model import LogisticRegression
import pickle
# Fit the model on training set
model = LogisticRegression()
model.fit(X_train, Y_train) # fit on some data ...
# save the model to disk
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, Y_test) # score on some test data
print(result)

Related

Cannot load logistic regression models in python2

I am training and storing logistic regression models in Python 3 using sklearn. For storing, I use the pickle module as shown below:
filename = 'models/logistic_regression/protocol_2/test/{}_pixels.p'.format(i//2)
pickle.dump(clf, open(filename, 'wb'), protocol=2)
Then, in another script, I load the models in Python 2. The loading is done with the following code:
f = open(model_names[i//2-1], 'rb')
clf = pickle.load(f)
However, I get the error
ImportError: No module named _logistic
Could someone tell me why I cannot load the model? Thanks in advance.
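One likely cause, offered as a hedged note: the module sklearn.linear_model._logistic only exists in scikit-learn 0.22 and later, while a Python 2 environment can run at most scikit-learn 0.20, so the two interpreters almost certainly see different scikit-learn module layouts when unpickling. A minimal sketch for checking and recording the version (the payload dict is illustrative only; clf and filename are the names used above):
import pickle
import sklearn

# Run this on both machines; the versions must expose the same module layout.
print(sklearn.__version__)

# When saving, store the version next to the estimator so a mismatch can be
# detected explicitly at load time.
payload = {"sklearn_version": sklearn.__version__, "model": clf}
with open(filename, 'wb') as f:
    pickle.dump(payload, f, protocol=2)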

use a saved trained model to predict on new dataset

I am using theano, sklearn and numpy in Python. I found this code for saving my trained network and predicting on my new dataset at this link: https://github.com/lzhbrian/RBM-DBN-theano-DL4J/blob/master/src/theano/code/logistic_sgd.py. The part of the code I am using is this:
"""
An example of how to load a trained model and use it
to predict labels.
"""
def predict():
# load the saved model
classifier = pickle.load(open('best_model.pkl'))
# compile a predictor function
predict_model = theano.function(
inputs=[classifier.input],
outputs=classifier.y_pred)
# We can test it on some examples from test test
dataset='mnist.pkl.gz'
datasets = load_data(dataset)
test_set_x, test_set_y = datasets[2]
test_set_x = test_set_x.get_value()
predicted_values = predict_model(test_set_x[:10])
print("Predicted values for the first 10 examples in test set:")
print(predicted_values)
if __name__ == '__main__':
sgd_optimization_mnist()
The code for the neural network model I want to save, load, and predict with is at https://github.com/aseveryn/deep-qa. I could save and load the model with cPickle, but I continuously get errors in the '# compile a predictor function' part:
predict_model = theano.function(inputs=[classifier.input], outputs=classifier.y_pred)
Actually I am not certain what I need to put in the inputs according to my code. Which one is right?
inputs=[main.predict_prob_batch.batch_iterator], outputs=test_nnet.layers[-1].y_pred
inputs=[predict_prob_batch.batch_iterator], outputs=test_nnet.layers[-1].y_pred
inputs=[MiniBatchIteratorConstantBatchSize.dataset], outputs=test_nnet.layers[-1].y_pred
inputs=[sgd_trainer.MiniBatchIteratorConstantBatchSize.dataset], outputs=test_nnet.layers[-1].y_pred
or none of them?
For each of them I tried, I got one of these errors:
ImportError: No module named MiniBatchIteratorConstantBatchSize
or
NameError: global name 'predict_prob_batch' is not defined
I would really appreciate it if you could help me.
I also used these commands to run the code, but the errors persist:
python -c 'from run_nnet import predict; from sgd_trainer import MiniBatchIteratorConstantBatchSize; from MiniBatchIteratorConstantBatchSize import dataset; print predict()'
python -c 'from run_nnet import predict; from sgd_trainer import *; from MiniBatchIteratorConstantBatchSize import dataset; print predict()'
Thank you, and please let me know if you know a better way to predict on a new dataset with the loaded trained model.

How to Get feature_importance when using sklearn2pmml

I trained a GBDT model named 'GB' in Python sklearn, and I want to export this trained model to a PMML file. But I run into this problem:
1. If I put the trained 'GB' model into a PMMLPipeline and use sklearn2pmml to export the model, like below:
GB = GradientBoostingClassifier(n_estimators=100, learning_rate=0.05)
GB.fit(train[list(x_features)], train['Target'])
GB_pipeline = PMMLPipeline([("classifier", GB)])
sklearn2pmml.sklearn2pmml(GB_pipeline, pmml='GB.pmml')
importance = GB.feature_importances_
there is a warning 'The 'active_fields' attribute is not set', and I lose all the feature names in the exported PMML file.
2. And if I try to train the model directly in the PMMLPipeline, then since there is no feature_importances_ attribute on GB_pipeline, I cannot observe the feature importances of this model, like below:
GB_pipeline = PMMLPipeline([("classifier",GradientBoostingClassifier(n_estimators=100,learning_rate=0.05))])
PMMLPipeline.fit(train[list(x_features),Train['Target']])
sklearn2pmml.sklearn2pmml(GB_pipeline,pmml='GB.pmml')
What should I do so that I can both observe the feature importances of the model and keep the feature names in the exported PMML file?
Thank you very much!
Important points:
Instantiate the classifier outside of the pipeline.
Instantiate the (PMML) pipeline and insert this classifier into it.
Fit this pipeline as a whole.
Print the feature importances of this classifier, and export this pipeline into a PMML document.
In your first code example, you're fitting the classifier, but you should be fitting the pipeline as a whole - hence the warning that the internal state of the pipeline is incomplete. In your second code example, you don't have a direct reference to the classifier (however, you could obtain it by "parsing" the last step of the fitted pipeline).
A complete example based on the Iris dataset:
import pandas
iris_df = pandas.read_csv("Iris.csv")
from sklearn.ensemble import GradientBoostingClassifier
from sklearn2pmml import sklearn2pmml, PMMLPipeline
gbt = GradientBoostingClassifier()
pipeline = PMMLPipeline([
    ("classifier", gbt)
])
pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])
print(gbt.feature_importances_)
sklearn2pmml(pipeline, "GBTIris.pmml", with_repr = True)
If you have come here like me to include the importances inside the pipeline exported from Python to PMML, then I have good news.
I searched for it on the internet and learned that we have to set the importances field manually on the RF model in Python; then it is able to store them inside the PMML.
TL;DR Here is the code:
# Imports needed to make this snippet self-contained
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn2pmml import sklearn2pmml, PMMLPipeline

# Keep the model object outside of the pipeline, which is the trick
RFModel = RandomForestRegressor()
# Make the pipeline as usual
column_trans = ColumnTransformer([
    ('onehot', OneHotEncoder(drop='first'), ["sex", "smoker", "region"]),
    ('Stdscaler', StandardScaler(), ["age", "bmi"]),
    ('MinMxscaler', MinMaxScaler(), ["children"])
])
pipeline = PMMLPipeline([
    ('col_transformer', column_trans),
    ('model', RFModel)
])
# Fit the pipeline
pipeline.fit(X, y)
# Store the importances in a temporary variable
importances = RFModel.feature_importances_
# Assign them on the MODEL ITSELF (the main part)
RFModel.pmml_feature_importances_ = importances
# Finally save the model as usual
sklearn2pmml(pipeline, r"path\file.pmml")
Now, you will see the importances in the PMML file!!
Reference from: Openscoring
Another way to do this is by referring to the model through the PMML pipeline, very similar to Aayush Shah's answer, but here we actually use the PMMLPipeline object to look at the importances. See below:
model = DecisionTreeClassifier()
pmml_pipeline = PMMLPipeline([
    ("preprocessing", preprocessing_step),
    ('decisiontree', model)
])
# access your model using pmml_pipeline[1], then call feature_importances_
pmml_pipeline[1].feature_importances_

How to save final model using keras?

I use KerasClassifier to train the classifier.
The code is below:
import numpy
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataframe = read_csv("iris.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:4].astype(float)
Y = dataset[:,4]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
#print("encoded_Y")
#print(encoded_Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)
#print("dummy_y")
#print(dummy_y)
# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(4, input_dim=4, init='normal', activation='relu'))
    #model.add(Dense(4, init='normal', activation='relu'))
    model.add(Dense(3, init='normal', activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

estimator = KerasClassifier(build_fn=baseline_model, nb_epoch=200, batch_size=5, verbose=0)
#global_model = baseline_model()
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, dummy_y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
But how do I save the final model for future prediction?
I usually use the code below to save a model:
# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")
But I don't know how to insert the model-saving code into the KerasClassifier code.
Thank you.
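One hedged option, assuming the keras.wrappers.scikit_learn.KerasClassifier used above (which stores the Keras model it builds on its model attribute after fitting); note that cross_val_score clones the estimator for each fold, so the wrapper must be fitted once explicitly to have a model to save:
# Sketch only: fit the wrapper once on all the data, then save the underlying
# Keras model that the wrapper exposes as .model.
estimator.fit(X, dummy_y)
estimator.model.save("final_model.h5")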
The model has a save method, which saves all the details necessary to reconstitute the model. An example from the keras documentation:
from keras.models import load_model
model.save('my_model.h5') # creates a HDF5 file 'my_model.h5'
del model # deletes the existing model
# returns a compiled model
# identical to the previous one
model = load_model('my_model.h5')
You can save the model in JSON format and the weights in HDF5 format.
# keras library import for Saving and loading model and weights
from keras.models import model_from_json
from keras.models import load_model
# serialize model to JSON
# the keras model which is trained is defined as 'model' in this example
model_json = model.to_json()
with open("model_num.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model_num.h5")
files "model_num.h5" and "model_num.json" are created which contain our model and weights
To use the same trained model for further testing you can simply load the hdf5 file and use it for the prediction of different data.
here's how to load the model from saved files.
# load json and create model
json_file = open('model_num.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model_num.h5")
print("Loaded model from disk")
loaded_model.save('model_num.hdf5')
loaded_model=load_model('model_num.hdf5')
To predict on different data, pass an array of test samples:
loaded_model.predict_classes(your_test_data)
You can use model.save(filepath) to save a Keras model into a single HDF5 file which will contain:
the architecture of the model, allowing you to re-create the model,
the weights of the model,
the training configuration (loss, optimizer),
the state of the optimizer, allowing you to resume training exactly where you left off.
In your Python code, the last line should probably be:
model.save("m.hdf5")
This allows you to save the entirety of the state of a model in a single file.
Saved models can be reinstantiated via keras.models.load_model().
The model returned by load_model() is a compiled model ready to be used (unless the saved model was never compiled in the first place).
model.save() arguments:
filepath: String, path to the file to save the model to.
overwrite: Whether to silently overwrite any existing file at the target location, or provide the user with a manual prompt.
include_optimizer: If True, save optimizer's state together.
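For illustration, a call that passes these arguments explicitly might look like this (the file name is just an example):
# Example only: overwrite any existing file and keep the optimizer state so
# training can later resume from exactly this point.
model.save("m.hdf5", overwrite=True, include_optimizer=True)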
You can save and load the model in this way:
from keras.models import Sequential, load_model
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy
# To save model
model.save('my_model_01.hdf5')
# To load the model
custom_objects={'CRF': CRF,'crf_loss':crf_loss,'crf_viterbi_accuracy':crf_viterbi_accuracy}
# To load a persisted model that uses the CRF layer
model1 = load_model("/home/abc/my_model_01.hdf5", custom_objects = custom_objects)
Generally, we save the model and weights in the same file by calling the save() function.
For saving,
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=["accuracy"])
model.fit(X_train, Y_train,
          batch_size=32,
          epochs=10,
          verbose=2,
          validation_data=(X_test, Y_test))
# here the filename is "my_model"; you can choose whatever you want
model.save("my_model.h5")  # using the h5 extension
print("model saved!!!")
For Loading the model,
from keras.models import load_model
model = load_model('my_model.h5')
model.summary()
In this case, we can simply save and load the model without re-compiling our model again.
Note - This is the preferred way for saving and loading your Keras model.
Saving a Keras model:
model = ... # Get model (Sequential, Functional Model, or Model subclass)
model.save('path/to/location')
Loading the model back:
from tensorflow import keras
model = keras.models.load_model('path/to/location')
For more information, read the documentation.
You can save the best model using keras.callbacks.ModelCheckpoint()
Example:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_checkpoint_callback = keras.callbacks.ModelCheckpoint("best_Model.h5", save_best_only=True)
history = model.fit(x_train, y_train,
                    epochs=10,
                    validation_data=(x_valid, y_valid),
                    callbacks=[model_checkpoint_callback])
This will save the best model in your working directory.
Since the Keras syntax for saving a model has changed over the years, I will post a fresh answer. In principle the earliest answer by bogatron, posted Mar 13 '17 at 12:10, is still good if you want to save your model, including the weights, into one file.
model.save("my_model.h5")
This will save the model in the older Keras H5 format.
However, there is a new format, the TensorFlow SavedModel format, which will be used if you do not specify the extension .h5, .hdf5 or .keras after the filename.
The syntax in this case is
model.save("path/to/folder")
If the given folder name does not yet exist, it will be created. Two files and two folders will be created within this folder:
keras_metadata.pb, saved_model.pb, assets, variables
For now you can still decide whether you want to store your model in one single file or in a folder containing files and folders. (See the Keras documentation at www.tensorflow.org.)
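For completeness, a small sketch of loading a model saved in the SavedModel folder format (the same load_model call works for a single .h5 file):
from tensorflow import keras

# Loading works the same way whether the path points to a .h5 file or to a
# SavedModel folder created by model.save("path/to/folder").
model = keras.models.load_model("path/to/folder")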

Save classifier to disk in scikit-learn

How do I save a trained Naive Bayes classifier to disk and use it to predict data?
I have the following sample program from the scikit-learn website:
from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print "Number of mislabeled points : %d" % (iris.target != y_pred).sum()
Classifiers are just objects that can be pickled and dumped like any other. To continue your example:
import cPickle

# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    cPickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = cPickle.load(fid)
Edit: if you are using a sklearn Pipeline in which you have custom transformers that cannot be serialized by pickle (nor by joblib), then using Neuraxle's custom ML Pipeline saving is a solution where you can define your own custom step savers on a per-step basis. The savers are called for each step if defined upon saving, and otherwise joblib is used as default for steps without a saver.
You can also use joblib.dump and joblib.load which is much more efficient at handling numerical arrays than the default python pickler.
Joblib is included in scikit-learn:
>>> import joblib
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import SGDClassifier
>>> digits = load_digits()
>>> clf = SGDClassifier().fit(digits.data, digits.target)
>>> clf.score(digits.data, digits.target) # evaluate training error
0.9526989426822482
>>> filename = '/tmp/digits_classifier.joblib.pkl'
>>> _ = joblib.dump(clf, filename, compress=9)
>>> clf2 = joblib.load(filename)
>>> clf2
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
shuffle=False, verbose=0, warm_start=False)
>>> clf2.score(digits.data, digits.target)
0.9526989426822482
Edit: in Python 3.8+ it's now possible to use pickle for efficient pickling of object with large numerical arrays as attributes if you use pickle protocol 5 (which is not the default).
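A minimal sketch of that approach (requires Python 3.8+, where protocol 5 exists but is not the default; the file name is just an example):
import pickle

# Explicitly request pickle protocol 5 for efficient handling of the large
# numpy arrays held by the fitted estimator.
with open('/tmp/digits_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f, protocol=5)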
What you are looking for is called model persistence in sklearn terms, and it is documented in the introduction and in the model persistence sections of the documentation.
So you have initialized your classifier and trained it for a long time with
clf = some.classifier()
clf.fit(X, y)
After this you have two options:
1) Using Pickle
import pickle
# now you can save it to a file
with open('filename.pkl', 'wb') as f:
    pickle.dump(clf, f)

# and later you can load it
with open('filename.pkl', 'rb') as f:
    clf = pickle.load(f)
2) Using Joblib
from sklearn.externals import joblib
# now you can save it to a file
joblib.dump(clf, 'filename.pkl')
# and later you can load it
clf = joblib.load('filename.pkl')
Once more, it is helpful to read the above-mentioned links.
In many cases, particularly with text classification, it is not enough to store just the classifier; you'll also need to store the vectorizer so that you can vectorize your input in the future.
import pickle

with open('model.pkl', 'wb') as fout:
    pickle.dump((vectorizer, clf), fout)

Future use case:

with open('model.pkl', 'rb') as fin:
    vectorizer, clf = pickle.load(fin)

X_new = vectorizer.transform(new_samples)
X_new_preds = clf.predict(X_new)
Before dumping the vectorizer, one can delete the stop_words_ property of vectorizer by:
vectorizer.stop_words_ = None
to make dumping more efficient.
Also, if your classifier's parameters are sparse (as in most text classification examples), you can convert the parameters from dense to sparse, which will make a huge difference in terms of memory consumption, loading, and dumping. Sparsify the model by:
clf.sparsify()
This will automatically work for SGDClassifier, but if you know your model is sparse (lots of zeros in clf.coef_), then you can manually convert clf.coef_ into a scipy CSR sparse matrix by:
clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)
and then you can store it more efficiently.
sklearn estimators implement methods to make it easy for you to save the relevant trained properties of an estimator. Some estimators implement __getstate__ methods themselves, but others, like the GMM, just use the base implementation, which simply saves the object's inner dictionary:
def __getstate__(self):
    try:
        state = super(BaseEstimator, self).__getstate__()
    except AttributeError:
        state = self.__dict__.copy()

    if type(self).__module__.startswith('sklearn.'):
        return dict(state.items(), _sklearn_version=__version__)
    else:
        return state
The recommended method to save your model to disc is to use the pickle module:
from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
X = iris.data[:100, :2]
y = iris.target[:100]
model = SVC()
model.fit(X,y)
import pickle

with open('mymodel', 'wb') as f:
    pickle.dump(model, f)
However, you should save additional data so you can retrain your model in the future, or suffer dire consequences (such as being locked into an old version of sklearn).
From the documentation:
In order to rebuild a similar model with future versions of
scikit-learn, additional metadata should be saved along the pickled
model:
The training data, e.g. a reference to an immutable snapshot
The python source code used to generate the model
The versions of scikit-learn and its dependencies
The cross validation score obtained on the training data
This is especially true for Ensemble estimators that rely on the tree.pyx module written in Cython (such as IsolationForest), since it creates a coupling to the implementation, which is not guaranteed to be stable between versions of sklearn. It has seen backwards-incompatible changes in the past.
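A small hedged sketch of bundling such metadata with the pickled model (all field values below are placeholders, not real values):
import pickle
import sklearn

# Placeholder metadata alongside the fitted model, following the list above.
metadata = {
    "sklearn_version": sklearn.__version__,
    "training_data": "data/train_snapshot_v1.csv",  # reference to an immutable snapshot
    "source_code": "train_model.py",                 # script used to generate the model
    "cv_score": None,                                # fill in the CV score you measured
}
with open('model_with_metadata.pkl', 'wb') as f:
    pickle.dump({"model": model, "metadata": metadata}, f)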
If your models become very large and loading becomes a nuisance, you can also use the more efficient joblib. From the documentation:
In the specific case of the scikit, it may be more interesting to use
joblib’s replacement of pickle (joblib.dump & joblib.load), which is
more efficient on objects that carry large numpy arrays internally as
is often the case for fitted scikit-learn estimators, but can only
pickle to the disk and not to a string:
sklearn.externals.joblib has been deprecated since 0.21 and will be removed in v0.23:
/usr/local/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15:
FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will
be removed in 0.23. Please import this functionality directly from
joblib, which can be installed with: pip install joblib. If this
warning is raised when loading pickled models, you may need to
re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=FutureWarning)
Therefore, you need to install joblib:
pip install joblib
and finally write the model to disk:
import joblib
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
digits = load_digits()
clf = SGDClassifier().fit(digits.data, digits.target)
with open('myClassifier.joblib.pkl', 'wb') as f:
    joblib.dump(clf, f, compress=9)

Now in order to read the dumped file all you need to run is:

with open('myClassifier.joblib.pkl', 'rb') as f:
    my_clf = joblib.load(f)
