I have a question about coremltools.
I want to convert a trained XGBoost classifier model into a Core ML model.
import coremltools
import xgboost as xgb
X, y = get_data()
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X, y)
coreml_model = coremltools.converters.xgboost.convert(xgb_model)
coreml_model.save('my_model.mlmodel')
The error is as follows:
>>> coremltools.converters.xgboost.convert(xgb_model)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/karas/.pyenv/versions/anaconda2-4.3.0/lib/python2.7/site-packages/coremltools/converters/xgboost/_tree.py", line 51, in convert
return _MLModel(_convert_tree_ensemble(model, feature_names, target, force_32bit_float = force_32bit_float))
File "/Users/karas/.pyenv/versions/anaconda2-4.3.0/lib/python2.7/site-packages/coremltools/converters/xgboost/_tree_ensemble.py", line 143, in convert_tree_ensemble
raise TypeError("Unexpected type. Expecting XGBoost model.")
TypeError: Unexpected type. Expecting XGBoost model.
Quick solution:
coreml_model = coremltools.converters.xgboost.convert(xgb_model._Booster)
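Note that ._Booster is a private attribute; recent XGBoost versions expose the same object through a public accessor, so the following should work equally well (assuming your XGBoost version provides get_booster()):

coreml_model = coremltools.converters.xgboost.convert(xgb_model.get_booster())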
More about this converter:
I just encountered this problem myself, so I debugged into _tree_ensemble.py, and here is what I found:
The first parameter, model, must be an xgboost.core.Booster, an xgboost.XGBRegressor, or the path to a .json dump of either. Also, if you pass a .json file, the second parameter, feature_names, must be provided.
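For example, converting from a saved model dump might look like this (a minimal sketch; the file name and feature names are hypothetical placeholders):

import coremltools

# When passing a .json model dump, feature_names must be given explicitly
coreml_model = coremltools.converters.xgboost.convert(
    'xgb_model_dump.json',
    feature_names=['f0', 'f1', 'f2'])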
Also, according to the Python examples on GitHub, there is another way you can get a model:
import numpy as np
import scipy.sparse
import pickle
import xgboost as xgb
### simple example
# load data from a text file; xgboost also accepts a binary buffer it generated
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
# specify parameters via a dict; definitions are the same as in the C++ version
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
# specify validation sets to watch performance
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2
booster = xgb.train(param, dtrain, num_round, watchlist)
Note that booster here is an xgboost.core.Booster.
Then you can do:
import coremltools
coreml_model = coremltools.converters.xgboost.convert(booster)
coreml_model.save('my_model.mlmodel')
Related
This document shows that a model trained with the native XGBoost API can be sliced with the following code:
from sklearn.datasets import make_classification
import xgboost as xgb

# generate toy multi-class data and define the variables the snippet assumes
X, y = make_classification(n_samples=100, n_informative=5, n_classes=3)
dtrain = xgb.DMatrix(data=X, label=y)
num_boost_round = 10
booster = xgb.train({
    'num_parallel_tree': 4, 'subsample': 0.5, 'num_class': 3},
    num_boost_round=num_boost_round, dtrain=dtrain)
sliced: xgb.Booster = booster[3:7]
I tried it and it worked.
Since XGBoost provides a scikit-learn wrapper interface, I tried something like this:
from xgboost import XGBClassifier
clf_xgb = XGBClassifier().fit(X_train, y_train)
clf_xgb_sliced: clf_xgb.Booster = booster[3:7]
But I got the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-18-84155815d877> in <module>
----> 1 clf_xgb_sliced: clf_xgb.Booster = booster[3:7]
AttributeError: 'XGBClassifier' object has no attribute 'Booster'
Since XGBClassifier has no attribute 'Booster', is there any way to slice an XGBClassifier (or XGBRegressor) model trained through the scikit-learn wrapper interface?
The problem is the type hint you are giving, clf_xgb.Booster: the annotation is evaluated at runtime, and XGBClassifier has no attribute Booster, hence the AttributeError. Use xgb.Booster as the hint and slice the underlying booster returned by get_booster(). Try:
clf_xgb_sliced: xgb.Booster = clf_xgb.get_booster()[3:7]
instead.
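Putting it all together (a minimal sketch; the make_classification parameters and n_estimators are illustrative, and booster slicing requires a reasonably recent XGBoost release):

import xgboost as xgb
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_samples=100)
clf_xgb = xgb.XGBClassifier(n_estimators=10).fit(X_train, y_train)

# get_booster() exposes the underlying xgboost.core.Booster,
# which supports slicing by boosting round
clf_xgb_sliced: xgb.Booster = clf_xgb.get_booster()[3:7]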
I have a Google Cloud Platform account with a Kubeflow pipeline. The first component of the pipeline preprocesses some data, and the second one trains a model (a scikit-learn DecisionTreeClassifier) with that preprocessed data. For the purpose of showing a code sample, the snippet below is a simple modification of the pipeline's second component:
import logging
import pandas as pd
import os
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics, datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X = iris.data
y = iris.target
x_train_data, x_test_data, y_train_data, y_test_data = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)
print("Creating model")
model = DecisionTreeClassifier()
print(f"Training model ({type(model)})")
model.fit(x_train_data, y_train_data)
print("Evaluating model")
y_train_pred = model.predict(x_train_data)
print("y_train_pred: ", y_train_pred.shape)
y_test_pred = model.predict(x_test_data)
print("y_test_pred: ", y_test_pred.shape)
train_accuracy = metrics.accuracy_score(y_train_data, y_train_pred)
train_classification_report = metrics.classification_report(y_train_data, y_train_pred)
print("\nTraining result:")
print(f"Accuracy:\t{train_accuracy}")
print(f"Classification report:\t{type(train_classification_report)}\n{train_classification_report}")
test_accuracy = metrics.accuracy_score(y_test_data, y_test_pred)
test_classification_report = metrics.classification_report(y_test_data, y_test_pred)
print("\nTesting result:")
print(f"Accuracy:\t{test_accuracy}")
print(f"Classification report:\t{type(test_classification_report)}\n{test_classification_report}")
print("\nDONE !\n")
Here, instead of loading the preprocessed data, I'm using the scikit-learn Iris dataset, but the outcome is exactly the same. Everything seems to work as intended, and every print statement appears on the Kubeflow platform output console as expected. However, after the second component finishes executing (after the last print is correctly shown on the output console), an error appears:
Traceback (most recent call last):
File "<string>", line 181, in <module>
File "<string>", line 151, in _serialize_str
TypeError: Value "None" has type "<class 'NoneType'>" instead of str.
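For context: this traceback comes from the wrapper code that Kubeflow Pipelines generates around a lightweight Python component, and the failure happens while serializing the component's declared outputs. A hypothetical sketch that reproduces it (assuming KFP v1 lightweight components; all names are illustrative):

from kfp.components import create_component_from_func

# The signature promises a str output, but the function returns None,
# so the generated wrapper fails after the last print with:
# TypeError: Value "None" has type "<class 'NoneType'>" instead of str.
def train_model() -> str:
    print("\nDONE !\n")

train_op = create_component_from_func(train_model)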
Do you have any idea why this is happening?
Am I doing something wrong, or is it a Google Cloud / Kubeflow Pipelines problem?
Thanks in advance!
I'm a beginner in machine learning, and I'm trying to learn by working through Kaggle's Titanic problem. From what I know, I've made sure that the metrics are in sync with one another, but of course I blame myself for this problem and not Python. However, I still couldn't find the source of the error, and the Spyder IDE is no help.
This is my code:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
"""Assigning the train & test datasets' adresses to variables"""
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
test_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\test.csv"
"""Using pandas' read_csv() function to read the datasets
and then assigning them to their own variables"""
train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)
"""Using pandas' factorize() function to represent genders (male/female)
with binary values (0/1)"""
train_data['Sex'] = pd.factorize(train_data.Sex)[0]
test_data['Sex'] = pd.factorize(test_data.Sex)[0]
"""Replacing missing values in the training and test dataset with 0"""
train_data.fillna(0.0, inplace = True)
test_data.fillna(0.0, inplace = True)
"""Selecting features for training"""
columns_of_interest = ['Pclass', 'Sex', 'Age']
"""Dropping missing/NaN values from the training dataset"""
filtered_titanic_data = train_data.dropna(axis=0)
"""Using the predictory features in the data to handle the x axis"""
x = filtered_titanic_data[columns_of_interest]
"""The survival (what we're trying to find) is the y axis"""
y = filtered_titanic_data.Survived
"""Splitting the train data with test"""
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
"""Assigning the DecisionTreeRegressor model to a variable"""
titanic_model = DecisionTreeRegressor()
"""Fitting the x and y values with the model"""
titanic_model.fit(train_x, train_y)
"""Predicting the x-axis"""
val_predictions = titanic_model.predict(val_x)
"""Assigning the feature columns from the test to a variable"""
test_x = test_data[columns_of_interest]
"""Predicting the test by feeding its x axis into the model"""
test_predictions = titanic_model.predict(test_x)
"""Printing the prediction"""
print(val_predictions)
"""Checking for the accuracy"""
print(accuracy_score(val_y, val_predictions))
"""Printing the test prediction"""
print(test_predictions)
and this is the stack trace:
Traceback (most recent call last):
File "<ipython-input-3-73797c87986e>", line 1, in <module>
runfile('C:/Users/Omar/Downloads/Kaggle Competition/Titanic.py', wdir='C:/Users/Omar/Downloads/Kaggle Competition')
File "C:\Users\Omar\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\Users\Omar\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Omar/Downloads/Kaggle Competition/Titanic.py", line 58, in <module>
print(accuracy_score(val_y, val_predictions))
File "C:\Users\Omar\Anaconda3\lib\site-packages\sklearn\metrics\classification.py", line 176, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "C:\Users\Omar\Anaconda3\lib\site-packages\sklearn\metrics\classification.py", line 81, in _check_targets
"and {1} targets".format(type_true, type_pred))
ValueError: Classification metrics can't handle a mix of binary and continuous targets
You are trying to use a regression algorithm (DecisionTreeRegressor) for a binary classification problem. The regression model, as expected, gives continuous outputs, but accuracy_score, where the error actually happens:
File "C:/Users/Omar/Downloads/Kaggle Competition/Titanic.py", line 58, in <module>
print(accuracy_score(val_y, val_predictions))
expects binary ones, hence the error.
For starters, change your model to
from sklearn.tree import DecisionTreeClassifier
titanic_model = DecisionTreeClassifier()
You are using a DecisionTreeRegressor, which, as the name says, is a regression model. The Kaggle Titanic problem is a classification problem, so you should use a DecisionTreeClassifier.
As for why your code throws an error: val_y has binary values (0/1), whereas val_predictions has continuous values, because you used a regressor model.
Classification needs discrete labels, since it predicts a class (one of the known labels), while regression works with continuous data. Since your output is a class label, you need to perform classification.
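To see the mismatch concretely, here is a minimal sketch with toy data (the arrays are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 1, 0, 1])

reg = DecisionTreeRegressor(max_depth=1).fit(X, y)
preds = reg.predict(X)  # continuous leaf means such as 0.667, not 0/1 labels
print(preds)
# accuracy_score(y, preds)  # would raise the same ValueError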
I am trying to implement a nearest neighbor classifier in Turi Create, but I am unsure about this error I am getting. It occurs when I create the actual model. I am using Python 3.6, if that helps.
Error:
Traceback (most recent call last):
File "/Users/PycharmProjects/turi/turi.py", line 51, in <module>
iris_cross()
File "/Users/PycharmProjects/turi/turi.py", line 37, in iris_cross
clf = tc.nearest_neighbor_classifier(train_data, target='4', features=features)
TypeError: 'module' object is not callable
Code:
import turicreate as tc
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
import time
import numpy as np
#Iris Classification Cross Validation
def iris_cross():
    iris = datasets.load_iris()
    features = ['0', '1', '2', '3']
    target = iris.target_names
    x = iris.data
    y = iris.target.astype(int)
    undata = np.column_stack((x, y))
    data = tc.SFrame(pd.DataFrame(undata))
    print(data)
    train_data, test_data = data.random_split(.8)
    clf = tc.nearest_neighbor_classifier(train_data, target='4', features=features)
    print('done')

iris_cross()
You have to actually call the create() method of the nearest_neighbor_classifier module. See the library API.
Run the following line of code instead:
clf = tc.nearest_neighbor_classifier.create(train_data, target='4', features=features)
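Once created, the model can make predictions on the held-out split, for example (a short sketch reusing the question's test_data; classify() should return the predicted class and a probability for each row):

predictions = clf.classify(test_data)
print(predictions)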
I am trying to implement a cross-validation scheme on grouped data. I was hoping to use the GroupKFold method, but I keep getting an error. What am I doing wrong?
The code (slightly different from the one I used: I had different data, so I used a larger n_splits, but everything else is the same):
from sklearn import metrics
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.grid_search import GridSearchCV
from xgboost import XGBRegressor
#generate data
x=np.array([0,1,2,3,4,5,6,7,8,9,10,11,12,13])
y= np.array([1,2,3,4,5,6,7,1,2,3,4,5,6,7])
group = np.array([1,0,1,1,2,2,2,1,1,1,2,0,0,2])
#grid search
gkf = GroupKFold( n_splits=3).split(x,y,group)
subsample = np.arange(0.3,0.5,0.1)
param_grid = dict( subsample=subsample)
rgr_xgb = XGBRegressor(n_estimators=50)
grid_search = GridSearchCV(rgr_xgb, param_grid, cv=gkf, n_jobs=-1)
result = grid_search.fit(x, y)
The error:
Traceback (most recent call last):
File "<ipython-input-143-11d785056a08>", line 8, in <module>
result = grid_search.fit(x, y)
File "/home/student/anaconda/lib/python3.5/site-packages/sklearn/grid_search.py", line 813, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "/home/student/anaconda/lib/python3.5/site-packages/sklearn/grid_search.py", line 566, in _fit
n_folds = len(cv)
TypeError: object of type 'generator' has no len()
changing the line
gkf = GroupKFold( n_splits=3).split(x,y,group)
to
gkf = GroupKFold( n_splits=3)
does not work either. The error message is then:
'GroupKFold' object is not iterable
The split method of GroupKFold yields the (train indices, test indices) pairs one at a time, as a generator. The old sklearn.grid_search.GridSearchCV needs to know the number of folds up front (the traceback shows it calling len(cv)), so call list() on the split result to materialize all the pairs:
gkf = list(GroupKFold( n_splits=3).split(x,y,group))
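Alternatively, newer scikit-learn versions let you pass the splitter object itself to sklearn.model_selection.GridSearchCV (the sklearn.grid_search module used above is deprecated) and supply the groups at fit time. A sketch, assuming the same x, y, group, rgr_xgb, and param_grid as above:

from sklearn.model_selection import GridSearchCV, GroupKFold

# the splitter is passed whole; groups go to fit(), and the folds are
# computed internally, so no generator/len problem arises
grid_search = GridSearchCV(rgr_xgb, param_grid, cv=GroupKFold(n_splits=3), n_jobs=-1)
result = grid_search.fit(x, y, groups=group)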