I need to convert my random forest model to PMML format in Python. I've installed sklearn2pmml from GitHub and tried to create a PMML file. I run the code below:
import pandas
import sklearn_pandas
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pandas.concat((pandas.DataFrame(iris.data[:, :], columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]), pandas.DataFrame(iris.target, columns = ["species"])), axis = 1)
iris_mapper = sklearn_pandas.DataFrameMapper([('sepal_length', None),
                                              ('sepal_width', None),
                                              ('petal_length', None),
                                              ('petal_width', None),
                                              ('species', None)])
iris = iris_mapper.fit_transform(iris_df)
from sklearn.ensemble import RandomForestClassifier
iris_X = iris[:, 0:4]
iris_y = iris[:, 4]
iris_classifier = RandomForestClassifier(n_estimators=10)
iris_classifier.fit(iris_X, iris_y)
from sklearn2pmml import sklearn2pmml
sklearn2pmml(iris_classifier, iris_mapper, "randomforest.pmml")
However, I get an error:
TypeError: The pipeline object is not an instance of PMMLPipeline
Any suggestion what I am missing, or is there another way to create PMML format?
TypeError: The pipeline object is not an instance of PMMLPipeline
The first argument of the sklearn2pmml function call must be an instance of sklearn2pmml.PMMLPipeline. You're passing an instance of sklearn.ensemble.RandomForestClassifier instead.
Any suggestion what I am missing, or is there another way to create PMML format?
You're pairing a prehistoric code example with the latest version of the sklearn2pmml library. These are your options:
Upgrade the code example to the latest sklearn2pmml library version. Please take two minutes to read through the "Usage" section of its README file.
Downgrade the sklearn2pmml library to version 0.13.0 (or older).
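For the first option, here is a minimal sketch of the current-style usage (assuming a recent sklearn2pmml version; the step names "mapper" and "classifier" are arbitrary) that keeps the DataFrameMapper as a preprocessing step inside a PMMLPipeline:
import pandas
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

iris = load_iris()
iris_df = pandas.DataFrame(iris.data, columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"])
iris_df["species"] = iris.target

# Map only the feature columns; the label column is passed separately to fit()
iris_mapper = DataFrameMapper([(["sepal_length", "sepal_width", "petal_length", "petal_width"], None)])
pipeline = PMMLPipeline([
    ("mapper", iris_mapper),
    ("classifier", RandomForestClassifier(n_estimators = 10))
])
pipeline.fit(iris_df, iris_df["species"])
sklearn2pmml(pipeline, "randomforest.pmml", with_repr = True)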
sklearn2pmml() needs a PMMLPipeline model, so try wrapping iris_classifier in a PMMLPipeline like this:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn2pmml.pipeline import PMMLPipeline

d = load_iris()
iris_X = d.data
iris_y = d.target

iris_classifier = RandomForestClassifier(n_estimators=10)
pipeline_model = PMMLPipeline([('iris_classifier', iris_classifier)]).fit(iris_X, iris_y)

from sklearn2pmml import sklearn2pmml
sklearn2pmml(pipeline_model, 'rfc.pmml', with_repr = True)
I'm trying to train and save a decision tree model with the following code:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.externals import joblib
music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre'])
y = music_data['genre']
model = DecisionTreeClassifier()
model.fit(X, y)
joblib.dump(model, 'music-recommender.joblib' )
This is the output:
Traceback (most recent call last)
<ipython-input-28-802088a85507> in <module>
1 import pandas as pd
2 from sklearn.tree import DecisionTreeClassifier
----> 3 from sklearn.externals import joblib
4
5 music_data = pd.read_csv('music.csv')
ImportError: cannot import name 'joblib' from 'sklearn.externals' (C:\Users\christian\anaconda3\lib\site-packages\sklearn\externals\__init__.py)
I would appreciate it if someone could help out. Is there another syntax that I should use to import joblib from sklearn?
I had the same problem; I never knew that joblib was already installed on my machine. Just replace this line of code:
from sklearn.externals import joblib
with
import joblib
The issue here is that you are trying to import joblib from scikit-learn, not reading the CSV file (if you're wondering). Install joblib using pip install joblib and import it like this:
import joblib
joblib.dump(model, 'music-recommender.joblib')
Remember to remove the from sklearn.externals import line. You can find more details in the official scikit-learn documentation.
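Putting the fix together, a minimal sketch of the corrected script (assuming the same music.csv layout as in the question):
import pandas as pd
import joblib
from sklearn.tree import DecisionTreeClassifier

music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre'])
y = music_data['genre']

model = DecisionTreeClassifier()
model.fit(X, y)

# Persist the trained model with the standalone joblib package
joblib.dump(model, 'music-recommender.joblib')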
I'm trying to learn how to create a machine learning API with Flask. However, following this tutorial, I get the following error when I run the command python app.py:
Traceback (most recent call last):
File "C:\Users\Breno\Desktop\flask-api\app.py", line 24, in <module>
model = p.load(open(modelfile, 'rb'))
ModuleNotFoundError: No module named 'sklearn.tree.tree'
My code:
from flask import Flask, request, redirect, url_for, flash, jsonify
import numpy as np
import pickle as p
import pandas as pd
import json
#from sklearn.tree import DecisionTreeClassifier

app = Flask(__name__)

@app.route('/api/', methods=['POST'])
def makecalc():
    j_data = request.get_json()
    prediction = np.array2string(model.predict(j_data))
    return jsonify(prediction)

if __name__ == '__main__':
    modelfile = 'models/final_prediction.pickle'
    model = p.load(open(modelfile, 'rb'))
    app.run(debug=True, host='0.0.0.0')
Could someone help me please?
Pickles are not necessarily compatible across scikit-learn versions, so this behavior is expected (and the use case is not supported). For more details, see https://scikit-learn.org/dev/modules/model_persistence.html#model-persistence. Replace pickle with joblib.
For example:
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> X, y= datasets.load_iris(return_X_y=True)
>>> clf.fit(X, y)
SVC()
>>> from joblib import dump, load
>>> dump(clf, open('filename.joblib','wb'))
>>> clf2 = load(open('filename.joblib','rb'))
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0
For anyone coming across this issue (perhaps dealing with code written long ago), sklearn.tree.tree is now under sklearn.tree (as of v0.24). This can be seen from the import deprecation warning:
from sklearn.tree.tree import BaseDecisionTree
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.tree.tree module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.tree. Anything that cannot be imported from sklearn.tree is now part of the private API.
warnings.warn(message, FutureWarning)
Instead, use:
from sklearn.tree import BaseDecisionTree
The problem is with the version of sklearn. The sklearn.tree.tree module was removed in version 0.24. Most probably, your model was generated with an older version. Try installing an older version of scikit-learn:
pip uninstall scikit-learn
pip install scikit-learn==0.20.4
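To confirm the downgrade took effect (and to match the installed version against the one that produced the pickle), a quick check is:
python -c "import sklearn; print(sklearn.__version__)"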
I need to perform inference for a cuml.dask.ensemble.RandomForestClassifier on a GPU-less Windows virtual machine where rapids/cuml can't be installed.
My plan is to use treelite: import the model into treelite, generate a shared library (a .dll file for Windows), and then use treelite_runtime.Predictor to load the shared library and perform inference on the target machine.
The problem is that I have no idea how to import the RandomForestClassifier model into treelite to create a treelite model.
I have tried convert_to_treelite_model, but the object it returns isn't a treelite model and I don't know how to use it.
See the attached code (executed under Linux, so I use the gcc toolchain and generate a .so file).
I get the exception "'cuml.fil.fil.TreeliteModel' object has no attribute 'export_lib'" when I try to call the export_lib function.
import numpy as np
import pandas as pd
import cudf
from sklearn import model_selection, datasets
from cuml.dask.common import utils as dask_utils
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF
import treelite
import treelite_runtime
if __name__ == '__main__':
    # This will use all GPUs on the local host by default
    cluster = LocalCUDACluster(threads_per_worker=1)
    c = Client(cluster)

    # Query the client for all connected workers
    workers = c.has_what().keys()
    n_workers = len(workers)
    n_streams = 8  # Performance optimization

    # Data parameters
    train_size = 10000
    test_size = 100
    n_samples = train_size + test_size
    n_features = 10

    # Random Forest building parameters
    max_depth = 6
    n_bins = 16
    n_trees = 100

    X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                        n_clusters_per_class=1, n_informative=int(n_features / 3),
                                        random_state=123, n_classes=5)
    X = X.astype(np.float32)
    y = y.astype(np.int32)
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size)

    n_partitions = n_workers

    # First convert to cudf (with real data, you would likely load in cuDF format to start)
    X_train_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X_train))
    y_train_cudf = cudf.Series(y_train)
    X_test_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X_test))

    # Partition with Dask
    # In this case, each worker will train on 1/n_partitions fraction of the data
    X_train_dask = dask_cudf.from_cudf(X_train_cudf, npartitions=n_partitions)
    y_train_dask = dask_cudf.from_cudf(y_train_cudf, npartitions=n_partitions)
    x_test_dask = dask_cudf.from_cudf(X_test_cudf, npartitions=n_partitions)

    # Persist to cache the data in active memory
    X_train_dask, y_train_dask, x_test_dask = dask_utils.persist_across_workers(c, [X_train_dask, y_train_dask, x_test_dask], workers=workers)

    cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)
    cuml_model.fit(X_train_dask, y_train_dask)
    wait(cuml_model.rfs)  # Allow asynchronous training tasks to finish

    # HACK: comb_model is None if a prediction isn't performed before calling 'get_combined_model'.
    # I don't know why...
    cuml_y_pred = cuml_model.predict(x_test_dask).compute()
    cuml_y_pred = cuml_y_pred.to_array()
    del cuml_y_pred

    comb_model = cuml_model.get_combined_model()
    treelite_model = comb_model.convert_to_treelite_model()
    toolchain = 'gcc'
    treelite_model.export_lib(toolchain=toolchain, libpath='./mymodel.so', verbose=True)  # <----- EXCEPTION!

    del cuml_model
    del treelite_model

    predictor = treelite_runtime.Predictor('./mymodel.so', verbose=True)
    y_pred = predictor.predict(X_test)
    # ......
Notes: I'm trying to run the code on an Ubuntu box with 2 NVIDIA RTX2080ti GPUs, using the following library versions:
cudatoolkit 10.1.243
cudnn 7.6.0
cudf 0.15.0
cuml 0.15.0
dask 2.30.0
dask-core 2.30.0
dask-cuda 0.15.0
dask-cudf 0.15.0
rapids 0.15.1
treelite 0.92
treelite-runtime 0.92
At the moment Treelite does not have a serialization method that can be directly used. We have an internal serialization method that we use to pickle cuML's RF model.
I would recommend creating a feature request in Treelite's github repo (https://github.com/dmlc/treelite) and requesting a feature for serializing and deserializing Treelite models.
Furthermore, the output of the convert_to_treelite_model function is a Treelite model. It is displayed as:
In [2]: treelite_model
Out[2]: <cuml.fil.fil.TreeliteModel at 0x7f11ceeca840>
This is because we expose the C++ Treelite code in Cython to have direct access to Treelite's C++ handle.
I have tried pickling and unpickling in JupyterLab and it seems to work as it's supposed to, but when I run my app.py it gives me the following error.
C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Python 3.7\fakenews\venv\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.linear_model.passive_aggressive module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.linear_model. Anything that cannot be imported from sklearn.linear_model is now part of the private API.
warnings.warn(message, FutureWarning)
Traceback (most recent call last):
File "app.py", line 9, in <module>
model = pickle.load(open('model.pkl', 'rb'))
_pickle.UnpicklingError: invalid load key, '\x17'.
Here are my files.
Model.py
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
#Read the data
df=pd.read_csv('C:\\Users\\Hp\\Desktop\\mini project\\news\\news.csv')
#Get shape and head
df.shape
df.head()
#DataFlair - Get the labels
labels=df.label
labels.head()
#DataFlair - Split the dataset
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)
#DataFlair - Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
#DataFlair - Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train)
tfidf_test=tfidf_vectorizer.transform(x_test)
#DataFlair - Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)
#DataFlair - Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')
#DataFlair - Build confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])
App.py
import numpy as np
from flask import Flask, request, jsonify, render_template
import pickle
import pandas as pd
app = Flask(__name__)
model = pickle.load(open('model.pkl', 'rb'))
session.clear()
@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    news = request.form["newsT"]
    test1 = pd.Series(news, index=[11000])
    prediction = model.predict(test1)
    return render_template('index.html', prediction_text='Sales should be $ {}'.format(prediction))

if __name__ == "__main__":
    app.run(debug=True)
I used this code to pickle the model:
import joblib

# Save the model as a pickle in a file
joblib.dump(pac, 'model.pkl')

# Load the model from the file
pac_from_joblib = joblib.load('model.pkl')

# Use the loaded model to make predictions
pac_from_joblib.predict(tfidf_test)
This works fine in the lab but seems to give an unpickling error when loading in app.py.
I am fairly new to this field and am unable to figure out what's wrong even after extensive searching online.
It seems like a serialization mismatch: you are using joblib to save the model but then trying to load that same model via the pickle library. Try loading the model again using joblib:
model = joblib.load('model.pkl')
I hope it helps.
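More generally, the dump and load calls must come from the same library. Here is a minimal sketch of the two matched pairs (reusing the pac classifier and tfidf_test matrix from Model.py above):
import pickle
import joblib

# Pair 1: joblib.dump <-> joblib.load
joblib.dump(pac, 'model.pkl')
model = joblib.load('model.pkl')

# Pair 2: pickle.dump <-> pickle.load
with open('model2.pkl', 'wb') as f:
    pickle.dump(pac, f)
with open('model2.pkl', 'rb') as f:
    model = pickle.load(f)

# Either loaded model predicts as before
model.predict(tfidf_test)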
I'm trying to deploy an sklearn model on Amazon SageMaker. I'm working through the example provided in their documentation and am getting the above error when I deploy the model.
I'm following the instructions provided in this notebook, and so far have just copied and pasted the code that they have.
Right now, this is the exact code I have in my jupyter notebook:
# S3 prefix
prefix = 'Scikit-iris'
import sagemaker
from sagemaker import get_execution_role
sagemaker_session = sagemaker.Session()
# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()
import numpy as np
import os
from sklearn import datasets
# Load Iris dataset, then join labels and features
iris = datasets.load_iris()
joined_iris = np.insert(iris.data, 0, iris.target, axis=1)
# Create directory and write csv
os.makedirs('./iris', exist_ok=True)
np.savetxt('./iris/iris.csv', joined_iris, delimiter=',', fmt='%1.1f, %1.3f, %1.3f, %1.3f, %1.3f')
WORK_DIRECTORY = 'data'
train_input = sagemaker_session.upload_data(WORK_DIRECTORY, key_prefix="{}/{}".format(prefix, WORK_DIRECTORY) )
from sagemaker.sklearn.estimator import SKLearn
script_path = 'scikit_learn_iris.py'
sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session,
    framework_version='0.20.0',
    hyperparameters={'max_leaf_nodes': 30})
sklearn.fit({'train': train_input})
sklearn.deploy(instance_type='ml.m4.xlarge',
               initial_instance_count=1)
And at that point I get the error message.
The contents of 'scikit_learn_iris.py' look like this:
import argparse
import pandas as pd
import os
import numpy as np
from sklearn import tree
from sklearn.externals import joblib
if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here. In this simple example we are just including one hyperparameter.
    parser.add_argument('--max_leaf_nodes', type=int, default=-1)

    # SageMaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()

    # Take the set of files and read them all into a single pandas dataframe
    input_files = [os.path.join(args.train, file) for file in os.listdir(args.train)]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [pd.read_csv(file, header=None, engine="python") for file in input_files]
    train_data = pd.concat(raw_data)

    # labels are in the first column
    train_y = train_data.ix[:, 0].astype(np.int)
    train_X = train_data.ix[:, 1:]

    # We determine the number of leaf nodes using the hyper-parameter above.
    max_leaf_nodes = args.max_leaf_nodes

    # Now use scikit-learn's decision tree classifier to train the model.
    clf = tree.DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
    clf = clf.fit(train_X, train_y)

    # Save the decision tree model.
    joblib.dump(clf, os.path.join(args.model_dir, "model.joblib"))
My CloudWatch logs look like this: [CloudWatch log screenshot omitted]
Based on the error from the CloudWatch logs, the script is missing the model_fn definition as provided in the notebook. I've repeated the function here for convenience:
def model_fn(model_dir):
    return joblib.load(os.path.join(model_dir, "model.joblib"))
Try appending this to the bottom of your script and re-running your notebook.
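Once model_fn is in place, the deploy call should succeed. Here is a minimal sketch of invoking the resulting endpoint (the predictor variable and the sample feature row are illustrative, not taken from the notebook):
# deploy() returns a predictor object bound to the new endpoint
predictor = sklearn.deploy(instance_type='ml.m4.xlarge',
                           initial_instance_count=1)

# Send one row of iris features (label column excluded) for inference
print(predictor.predict([[5.1, 3.5, 1.4, 0.2]]))

# Delete the endpoint when finished to avoid ongoing charges
predictor.delete_endpoint()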