I am trying to train an XGBoost classification model, which I have done several times before. This time I am running a hyperparameter grid search with cross-validation using xgboost.cv. Every time I run my code it raises a KeyError (traceback below).
I also tried plain xgboost.train with some default parameters; when I use the resulting model to predict on the same DMatrix, every prediction is null.
Here are my DMatrix objects. Four features have missing values, so I specified missing=np.nan in DMatrix.
xgbmat_train = xgb.DMatrix(X_train.values, label=Y_train.values, missing=np.nan, weight=train_weights)
xgbmat_test = xgb.DMatrix(X_test.values, label=Y_test.values, missing=np.nan, weight=test_weights)
These are my initial parameters
initial_params = {'learning_rate':0.1,'n_estimators':1000,'objective':'binary:logistic','booster':'gbtree','reg_alpha':0,
'reg_lambda':1,'max_depth':5,'min_child_weight':1,'gamma':0,'subsample':0.8,'colsample_bytree':0.8,
'scale_pos_weight':1,'missing':np.nan,'seed':27,'eval_metric':'auc','n_jobs':32,'silent':True}
These are my gridsearch parameters
gridsearch_params = [(max_depth, min_child_weight)
                     for max_depth in range(4, 10)
                     for min_child_weight in range(1, 6)]
Below is the loop where I am doing a gridsearch
max_auc = 0.0
best_params = ''
print(gc.collect())
for max_depth, min_child_weight in gridsearch_params:
    print(gc.collect())
    print("CV with max_depth = {}, min_child_weight={}".format(max_depth, min_child_weight))
    initial_params['max_depth'] = max_depth
    initial_params['min_child_weight'] = min_child_weight
    cv_results = xgb.cv(initial_params,
                        xgbmat_train,
                        num_boost_round = 200,
                        seed = 42,
                        stratified = True,
                        shuffle=True,
                        nfold=3,
                        metrics={'auc'},
                        early_stopping_rounds = 50)
    mean_auc = cv_results['test-auc-mean'].max()
    boost_rounds = cv_results['test-auc-mean'].argmax()
    cv_results = cv_results.append(cv_results)
    if mean_auc > max_auc:
        max_auc = mean_auc
        best_params = (max_depth, min_child_weight)
    print(gc.collect())
    print(cv_results)
    print(mean_auc)
    print(boost_rounds)
print("Best param: {}, {}, aucpr: {}".format(best_params[0], best_params[1], max_auc))
This is the error I am getting while running the above code
KeyError Traceback (most recent call last)
<ipython-input-15-f546ef27594f> in <module>
15 nfold=3,
16 metrics={'auc'},
---> 17 early_stopping_rounds = 50)
18 mean_auc = cv_results['test-auc-mean'].max()
19 boost_rounds = cv_results['test-auc-mean'].argmax()
~/anaconda3/lib/python3.7/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks, shuffle)
461 end_iteration=num_boost_round,
462 rank=0,
--> 463 evaluation_result_list=res))
464 except EarlyStopException as e:
465 for k in results:
~/anaconda3/lib/python3.7/site-packages/xgboost/callback.py in callback(env)
243 best_msg=state['best_msg'])
244 elif env.iteration - best_iteration >= stopping_rounds:
--> 245 best_msg = state['best_msg']
246 if verbose and env.rank == 0:
247 msg = "Stopping. Best iteration:\n{}\n\n"
KeyError: 'best_msg'
I tried filling the NAs with -9999.0 and passing the same value as the missing argument to DMatrix, but it throws the same error. I am up against a hard deadline; any help will be deeply appreciated.
Related
I am trying different machine learning projects from Kaggle to make myself better. Here is the model that I am using:
from sklearn.ensemble import RandomForestClassifier
y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
model = RandomForestClassifier(n_estimators = 100, max_depth = 5, random_state = 1)
model.fit = (X, y)
predictions = model.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index = False)
print('Your submission was successfully saved!')
Here is the error I get:
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
/tmp/ipykernel_33/1528591149.py in <module>
9 forest_clf = RandomForestClassifier(n_estimators = 100, max_depth = 5, random_state = 1)
10 forest_clf.fit = (X, y)
---> 11 predictions = forest_clf.predict(X_test)
12
13 output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
/opt/conda/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in predict(self, X)
806 The predicted classes.
807 """
--> 808 proba = self.predict_proba(X)
809
810 if self.n_outputs_ == 1:
/opt/conda/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in predict_proba(self, X)
846 classes corresponds to that in the attribute :term:`classes_`.
847 """
--> 848 check_is_fitted(self)
849 # Check data
850 X = self._validate_X_predict(X)
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
1220
1221 if not fitted:
-> 1222 raise NotFittedError(msg % {"name": type(estimator).__name__})
1223
1224
NotFittedError: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
I think this is an example of the estimator cloning itself, but I am not sure which line is the issue here. This is the Titanic project that is seen on Kaggle, whose tutorial code I have copied amidst trying to learn. Any help is appreciated.
As @Blackgaurd pointed out, just change model.fit = (X, y) to model.fit(X, y).
Your current code assigns a tuple to model.fit, which overwrites the fit method of your Random Forest Classifier instead of calling it, so the model is never trained.
Full code of yours with correction:
from sklearn.ensemble import RandomForestClassifier
y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
model = RandomForestClassifier(n_estimators = 100, max_depth = 5, random_state = 1)
model.fit(X, y) # <- line of code fixed
predictions = model.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index = False)
print('Your submission was successfully saved!')
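To see why the original line fails without needing the Titanic data, here is a minimal sketch. It assumes a hypothetical stand-in class, `TinyModel`, which is not part of scikit-learn and only mimics the fit/predict pattern. Assigning to `fit` stores a tuple on the instance and shadows the method, so no training code ever runs:

```python
class TinyModel:
    """Hypothetical stand-in for an estimator; not a scikit-learn class."""

    def __init__(self):
        self.fitted = False

    def fit(self, X, y):
        # A real estimator would learn from X and y here.
        self.fitted = True
        return self

    def predict(self, X):
        if not self.fitted:
            raise RuntimeError("This TinyModel instance is not fitted yet.")
        return [0 for _ in X]


X, y = [[1], [2], [3]], [0, 1, 0]

broken = TinyModel()
broken.fit = (X, y)                          # assignment: stores a tuple, calls nothing
broken_fit_type = type(broken.fit).__name__  # the attribute is now a tuple

working = TinyModel()
working.fit(X, y)                            # call: actually runs the training code
predictions = working.predict(X)
```

Calling `broken.predict(X)` raises the "not fitted" error, which mirrors the NotFittedError in the question.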
I'm trying to build a classifier with XGBoost and tune it with RandomizedSearchCV.
Here is the code of my function:
def xgboost_classifier_rscv(x,y):
    from scipy import stats
    from xgboost import XGBClassifier
    from sklearn.metrics import fbeta_score, make_scorer, recall_score, accuracy_score, precision_score
    from sklearn.model_selection import StratifiedKFold, GridSearchCV, RandomizedSearchCV
    #splitting the dataset into training and test parts
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    #bag of words implementation
    cv = CountVectorizer()
    x_train = cv.fit_transform(x_train).toarray()
    #TF-IDF implementation
    vector = TfidfTransformer()
    x_train = vector.fit_transform(x_train).toarray()
    x_test = cv.transform(x_test)
    scorers = {
        'f1_score':make_scorer(f1_score),
        'precision_score': make_scorer(precision_score),
        'recall_score': make_scorer(recall_score),
        'accuracy_score': make_scorer(accuracy_score)
    }
    param_dist = {'n_estimators': stats.randint(150, 1000),
                  'learning_rate': stats.uniform(0.01, 0.59),
                  'subsample': stats.uniform(0.3, 0.6),
                  'max_depth': [3, 4, 5, 6, 7, 8, 9],
                  'colsample_bytree': stats.uniform(0.5, 0.4),
                  'min_child_weight': [1, 2, 3, 4]
                  }
    n_folds = numFolds)
    skf = StratifiedKFold(n_splits=3, shuffle = True)
    gridCV = RandomizedSearchCV(xgb_model,
                                param_distributions = param_dist,
                                cv = skf,
                                n_iter = 5,
                                scoring = scorers,
                                verbose = 3,
                                n_jobs = -1,
                                return_train_score=True,
                                refit = precision_score)
    gridCV.fit(x_train,y_train)
    best_pars = gridCV.best_params_
    print("best params : ", best_pars)
    xgb_predict = gridCV.predict(x_test)
    xgb_pred_prob = gridCV.predict_proba(x_test)
    print('best scores : ', gridCV.grid_scores_)
    scores = [x[1] for x in gridCV.grid_scores_]
    print("best scores : ", scores)
    return y_test, xgb_predict, xgb_pred_prob
When I run the code, I get an error, reported below:
TypeError Traceback (most recent call last)
<ipython-input-30-9adf84d48e5c> in <module>
1 print("********** Xgboost classifier *************")
2 start_time = time.monotonic()
----> 3 y_test, xgb_predict, xgb_pred_prob = xgboost_classifier_rscv(x,y)
4 end_time = time.monotonic()
5 print("the time consumed is : ", timedelta(seconds=end_time - start_time))
<ipython-input-29-e0c6ae026076> in xgboost_classifier_rscv(x, y)
70 # verbose=3, random_state=1001, refit='precision_score' )
71
---> 72 gridCV.fit(x_train,y_train)
73 best_pars = gridCV.best_params_
74 print("best params : ", best_pars)
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
858 # parameter set.
859 if callable(self.refit):
--> 860 self.best_index_ = self.refit(results)
861 if not isinstance(self.best_index_, numbers.Integral):
862 raise TypeError('best_index_ returned is not an integer')
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
TypeError: precision_score() missing 1 required positional argument: 'y_pred'
When I do the same thing but with GridSearchCV instead of RandomizedSearchCV, the code runs without any problems!
refit should be the string 'precision_score' (in quotes), which is the key of the scorer in your scorers dict, not the function precision_score. Like this:
gridCV = RandomizedSearchCV(xgb_model,
param_distributions = param_dist,
cv = skf,
n_iter = 5,
scoring = scorers,
verbose = 3,
n_jobs = -1,
return_train_score=True,
refit = 'precision_score')
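The traceback shows why the bare function fails: when refit is callable, scikit-learn calls it with the cv_results_ dictionary and expects an integer index back, so precision_score receives one argument instead of y_true and y_pred. If you did want a callable, it could look roughly like the sketch below. The function name `pick_best_precision` and the sample numbers are made up for illustration, and the key `mean_test_precision_score` assumes the scorer is named 'precision_score' as in your scorers dict.

```python
def pick_best_precision(cv_results):
    """Return the index of the candidate with the highest mean test precision."""
    scores = list(cv_results["mean_test_precision_score"])
    return scores.index(max(scores))


# Made-up stand-in for the dictionary that is passed to a callable refit.
fake_cv_results = {
    "mean_test_precision_score": [0.71, 0.78, 0.74],
    "mean_test_recall_score": [0.65, 0.60, 0.69],
}

best_index = pick_best_precision(fake_cv_results)
```

You would then pass refit=pick_best_precision. For the common case of "refit on the best value of one metric", the string form is simpler.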
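Another error:
grid_scores_ has been removed from scikit-learn, so use cv_results_ instead (in the third- and fourth-to-last lines of your function). cv_results_ is a dictionary keyed by result name, so index it by key rather than iterating over it:
print('best scores : ', gridCV.cv_results_)
scores = gridCV.cv_results_['mean_test_precision_score']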
One more error:
xgb_model is never defined, so create it before building the search:
xgb_model = XGBClassifier(n_jobs = -1, random_state = 42)
I am working on a multi-class classification problem using xgboost.
The shape of my data is
print(train_ohe.shape, test_ohe.shape)
# (43266, 190) (18543, 190)
Custom F1 eval function and model training code
def f1_eval(y_pred, dtrain):
    y_true = dtrain.get_label()
    err = 1-f1_score(y_true, np.round(y_pred),average='weighted')
    return 'f1_err', err

def train_model(algo,train,test,predictors,useTrainCV=True,
                cv_folds=5,early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = algo.get_params()
        xgb_train = xgb.DMatrix(train[predictors].values,label=train[target].values)
        xgb_test = xgb.DMatrix(test[predictors].values)
        print(xgb_train.num_row())
        print(xgb_test.num_row())
        cv_result = xgb.cv(xgb_param,
                           train,
                           num_boost_round=xgb_param['n_estimators'],
                           nfold=cv_folds,
                           metrics='f1_eval',
                           early_stopping_rounds=early_stopping_rounds)
        algo.set_params(n_estimators=cv_result.shape[0])
    # Fit algorithm on data
    algo.fit(train[predictors],train[target],eval_metric=f1_eval)
    # Predict train data
    train_predictions = algo.predict(train[predictors])
    train_pred_prob = algo.predict_proba(train[predictors])[:,1]
    # Report model performance
    print("Model performance")
    print("F1 Score Train {}".format(f1_score(train[target].values,train_predictions)))
    # Predict test data
    test_predictions = algo.predict(test[predictors])
    # Performance
    print("F1 Score Test {}".format(f1_score(test[target].values,test_predictions)))
Here is my XgbClassifier code. Trying to find the number of estimators for a high learning rate.
target = 'Complaint-Status'
predictors = [x for x in train_ohe.columns if x not in target]
xgb1 = XGBClassifier(learning_rate=0.1,
n_estimators=1000,
max_depth=5,
min_child_weight=1,
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
objective='multi:softmax',
nthread=8,
scale_pos_weight=1,
seed=145)
train_model(xgb1, train_ohe, test_ohe, predictors)
I am getting the following AttributeError, saying 'DataFrame' object has no attribute 'num_row', at the xgb.cv line in the train_model function.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-116-5933227c171d> in <module>
18 seed=145)
19 print(xgb1.get_params())
---> 20 train_model(xgb1, train_ohe, test_ohe, predictors)
21 # xgb_param = xgb1.get_params()
22 # cv_folds=5
<ipython-input-114-a9df39c19abf> in train_model(algo, train, test, predictors, useTrainCV, cv_folds, early_stopping_rounds)
19 nfold=cv_folds,
20 metrics='f1_eval',
---> 21 early_stopping_rounds=early_stopping_rounds)
22 algo.set_params(n_estimators=cv_result.shape[0])
23
/opt/virtual_env/py3/lib/python3.6/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks, shuffle)
413 results = {}
414 cvfolds = mknfold(dtrain, nfold, params, seed, metrics, fpreproc,
--> 415 stratified, folds, shuffle)
416
417 # setup callbacks
/opt/virtual_env/py3/lib/python3.6/site-packages/xgboost/training.py in mknfold(dall, nfold, param, seed, evals, fpreproc, stratified, folds, shuffle)
246 # Do standard k-fold cross validation
247 if shuffle is True:
--> 248 idx = np.random.permutation(dall.num_row())
249 else:
250 idx = np.arange(dall.num_row())
/opt/virtual_env/py3/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
4374 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4375 return self[name]
-> 4376 return object.__getattribute__(self, name)
4377
4378 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'num_row'
I found your post while searching for the same error.
The second argument, train, in this call:
cv_result = xgb.cv(xgb_param,
train,
num_boost_round=xgb_param['n_estimators'],
nfold=cv_folds,
metrics='f1_eval',
early_stopping_rounds=early_stopping_rounds)
algo.set_params(n_estimators=cv_result.shape[0])
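should be an xgb.DMatrix rather than a pandas DataFrame, for example
train = xgb.DMatrix(X_train, y_train)
Your function already builds xgb_train, so passing that instead of train should work. Hope this helps.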
I am trying to predict item prices using DNNRegressor, and I can't figure out this error that keeps coming up. I created TensorFlow numeric and categorical feature columns from a pandas DataFrame and fed them into the DNNRegressor. There is not much help online regarding this particular error.
Please help me fix this error. Thanks.
AttributeError Traceback (most recent call last)
<ipython-input-27-790ecef8c709> in <module>()
92
93 if __name__ == '__main__':
---> 94 main()
<ipython-input-27-790ecef8c709> in main()
81 # learning_rate=0.1, l1_regularization_strength=0.001))
82 est = tf.estimator.DNNRegressor(feature_columns = feature_columns, hidden_units = [10, 10], model_dir = 'data')
---> 83 est.train(input_fn = get_train_input_fn(Xtrain, ytrain), steps = 500)
84 scores = est.evaluate(input_fn = get_test_input_fn(Xtest, ytest))
85 print('Loss Score: {0:f}' .format(scores['average_loss']))
C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py in train(self, input_fn, hooks, steps, max_steps)
239 hooks.append(training.StopAtStepHook(steps, max_steps))
240
--> 241 loss = self._train_model(input_fn=input_fn, hooks=hooks)
242 logging.info('Loss for final step: %s.', loss)
243 return self
C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py in _train_model(self, input_fn, hooks)
628 input_fn, model_fn_lib.ModeKeys.TRAIN)
629 estimator_spec = self._call_model_fn(features, labels,
--> 630 model_fn_lib.ModeKeys.TRAIN)
631 ops.add_to_collection(ops.GraphKeys.LOSSES, estimator_spec.loss)
632 all_hooks.extend(hooks)
C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py in _call_model_fn(self, features, labels, mode)
613 if 'config' in model_fn_args:
614 kwargs['config'] = self.config
--> 615 model_fn_results = self._model_fn(features=features, **kwargs)
616
617 if not isinstance(model_fn_results, model_fn_lib.EstimatorSpec):
C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\estimator\canned\dnn.py in _model_fn(features, labels, mode, config)
389 dropout=dropout,
390 input_layer_partitioner=input_layer_partitioner,
--> 391 config=config)
392 super(DNNRegressor, self).__init__(
393 model_fn=_model_fn, model_dir=model_dir, config=config)
C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\estimator\canned\dnn.py in _dnn_model_fn(features, labels, mode, head, hidden_units, feature_columns, optimizer, activation_fn, dropout, input_layer_partitioner, config)
100 net = feature_column_lib.input_layer(
101 features=features,
--> 102 feature_columns=feature_columns)
103
104 for layer_id, num_hidden_units in enumerate(hidden_units):
C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\feature_column\feature_column.py in input_layer(features, feature_columns, weight_collections, trainable)
205 ValueError: if an item in `feature_columns` is not a `_DenseColumn`.
206 """
--> 207 _check_feature_columns(feature_columns)
208 for column in feature_columns:
209 if not isinstance(column, _DenseColumn):
C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\feature_column\feature_column.py in _check_feature_columns(feature_columns)
1660 name_to_column = dict()
1661 for column in feature_columns:
-> 1662 if column.name in name_to_column:
1663 raise ValueError('Duplicate feature column name found for columns: {} '
1664 'and {}. This usually means that these columns refer to '
C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\feature_column\feature_column.py in name(self)
2451 @property
2452 def name(self):
-> 2453 return '{}_indicator'.format(self.categorical_column.name)
2454
2455 def _transform_feature(self, inputs):
AttributeError: 'str' object has no attribute 'name'
And below is code:
def get_train_input_fn(Xtrain, ytrain):
    return tf.estimator.inputs.pandas_input_fn(
        x = Xtrain,
        y = ytrain,
        batch_size = 30,
        num_epochs = None,
        shuffle = True)

def get_test_input_fn(Xtest, ytest):
    return tf.estimator.inputs.pandas_input_fn(
        x = Xtest,
        y = ytest,
        batch_size = 32,
        num_epochs = 1,
        shuffle = False)

def main():
    Xtrain, Xtest, ytrain, ytest = train_test_split(merc, ytr, test_size = 0.4, random_state = 42)
    feature_columns = []
    brand_rating = tf.feature_column.numeric_column('brand_rating')
    feature_columns.append(brand_rating)
    sentiment = tf.feature_column.numeric_column('description_polarity')
    feature_columns.append(sentiment)
    item_condition = tf.feature_column.numeric_column('item_condition_id')
    feature_columns.append(item_condition)
    shipping = tf.feature_column.indicator_column('shipping')
    feature_columns.append(shipping)
    name = tf.feature_column.embedding_column('item_name', 34) #(column name, dimension(no. of unique values ** 0.25))
    feature_columns.append(name)
    general = tf.feature_column.categorical_column_with_hash_bucket('General', 12)
    feature_columns.append(general)
    sc1 = tf.feature_column.categorical_column_with_hash_bucket('SC1', 120)
    feature_columns.append(sc1)
    sc2 = tf.feature_column.categorical_column_with_hash_bucket('SC2', 900)
    feature_columns.append(sc2)
    print(feature_columns)
    #est = tf.estimator.DNNRegressor(feature_columns, hidden_units = [10, 10], optimizer=tf.train.ProximalAdagradOptimizer(
    #    learning_rate=0.1, l1_regularization_strength=0.001))
    est = tf.estimator.DNNRegressor(feature_columns = feature_columns, hidden_units = [10, 10], model_dir = 'data')
    est.train(input_fn = get_train_input_fn(Xtrain, ytrain), steps = 500)
The first argument to tf.feature_column.embedding_column must be a categorical column, not a string. See API spec.
The offending line in your code is:
tf.feature_column.embedding_column('item_name', 34)
After using
general = tf.feature_column.categorical_column_with_hash_bucket('General', 12)
and other feature_column.categorical_column_with..., you should use
general_indicator = tf.feature_column.indicator_column(general)
and then append it to your feature_columns list.
feature_columns.append(general_indicator)
I am training an XGBClassifier on my training set.
My training features are a numpy array of shape (45001, 10338), and my training labels are a numpy array of shape (45001,). I have 1161 unique labels, so I label-encoded them.
The documentation clearly says that I can create a DMatrix from a numpy array, so I am passing the training features and labels above directly as numpy arrays. But I am getting the following error:
---------------------------------------------------------------------------
XGBoostError Traceback (most recent call last)
<ipython-input-30-3de36245534e> in <module>()
13 scale_pos_weight=1,
14 seed=27)
---> 15 modelfit(xgb1, train_x, train_y)
<ipython-input-27-9d215eac135e> in modelfit(alg, train_data_features, train_labels, useTrainCV, cv_folds, early_stopping_rounds)
6 xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
7 cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
----> 8 metrics='auc',early_stopping_rounds=early_stopping_rounds)
9 alg.set_params(n_estimators=cvresult.shape[0])
10
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks)
399 for fold in cvfolds:
400 fold.update(i, obj)
--> 401 res = aggcv([f.eval(i, feval) for f in cvfolds])
402
403 for key, mean, std in res:
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in <listcomp>(.0)
399 for fold in cvfolds:
400 fold.update(i, obj)
--> 401 res = aggcv([f.eval(i, feval) for f in cvfolds])
402
403 for key, mean, std in res:
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in eval(self, iteration, feval)
221 def eval(self, iteration, feval):
222 """"Evaluate the CVPack for one iteration."""
--> 223 return self.bst.eval_set(self.watchlist, iteration, feval)
224
225
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in eval_set(self, evals, iteration, feval)
865 _check_call(_LIB.XGBoosterEvalOneIter(self.handle, iteration,
866 dmats, evnames, len(evals),
--> 867 ctypes.byref(msg)))
868 return msg.value
869 else:
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in _check_call(ret)
125 """
126 if ret != 0:
--> 127 raise XGBoostError(_LIB.XGBGetLastError())
128
129
XGBoostError: b'[19:12:58] src/metric/rank_metric.cc:89: Check failed: (preds.size()) == (info.labels.size()) label size predict size not match'
Please find my model Code below:
def modelfit(alg, train_data_features, train_labels,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgb_param['num_class'] = 1161
        xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics='auc',early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
    #Fit the algorithm on the data
    alg.fit(train_data_features, train_labels, eval_metric='auc')
    #Predict training set:
    dtrain_predictions = alg.predict(train_data_features)
    dtrain_predprob = alg.predict_proba(train_data_features)[:,1]
    #Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(train_labels, dtrain_predictions))
Where am I going wrong here?
My classifier is as follows:
xgb1 = xgb.XGBClassifier(
learning_rate =0.1,
n_estimators=50,
max_depth=5,
min_child_weight=1,
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
objective='multi:softmax',
nthread=4,
scale_pos_weight=1,
seed=27)
EDIT - 2
After changing evaluation metric,
---------------------------------------------------------------------------
XGBoostError Traceback (most recent call last)
<ipython-input-9-30c62a886c2e> in <module>()
13 scale_pos_weight=1,
14 seed=27)
---> 15 modelfit(xgb1, train_x_trail, train_y_trail)
<ipython-input-8-9d215eac135e> in modelfit(alg, train_data_features, train_labels, useTrainCV, cv_folds, early_stopping_rounds)
6 xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
7 cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
----> 8 metrics='auc',early_stopping_rounds=early_stopping_rounds)
9 alg.set_params(n_estimators=cvresult.shape[0])
10
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks)
398 evaluation_result_list=None))
399 for fold in cvfolds:
--> 400 fold.update(i, obj)
401 res = aggcv([f.eval(i, feval) for f in cvfolds])
402
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in update(self, iteration, fobj)
217 def update(self, iteration, fobj):
218 """"Update the boosters for one iteration"""
--> 219 self.bst.update(self.dtrain, iteration, fobj)
220
221 def eval(self, iteration, feval):
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in update(self, dtrain, iteration, fobj)
804
805 if fobj is None:
--> 806 _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, iteration, dtrain.handle))
807 else:
808 pred = self.predict(dtrain)
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in _check_call(ret)
125 """
126 if ret != 0:
--> 127 raise XGBoostError(_LIB.XGBGetLastError())
128
129
XGBoostError: b'[03:43:03] src/objective/multiclass_obj.cc:42: Check failed: (info.labels.size()) != (0) label set cannot be empty'
The original error occurs because this metric was not designed for multi-class classification (see here).
You could use the scikit-learn wrapper of xgboost to work around this issue. I modified your code to use this wrapper and produce a similar function. I am not sure why you are doing a grid search, though, since you are not enumerating over parameters; you are just using the parameters you specified in xgb1. Here is the modified code:
import xgboost as xgb
import sklearn
import numpy as np
from sklearn.model_selection import GridSearchCV
def modelfit(alg, train_data_features, train_labels,useTrainCV=True, cv_folds=5):
    if useTrainCV:
        params=alg.get_xgb_params()
        # Wrap each value in a one-element list so GridSearchCV gets a single-point grid
        xgb_param=dict([(key,[params[key]]) for key in params])
        boost = xgb.sklearn.XGBClassifier()
        cvresult = GridSearchCV(boost,xgb_param,cv=cv_folds)
        # Fit on the data passed to the function, not on the globals X and y
        cvresult.fit(train_data_features, train_labels)
        alg=cvresult.best_estimator_
    #Fit the algorithm on the data
    alg.fit(train_data_features, train_labels)
    #Predict training set:
    dtrain_predictions = alg.predict(train_data_features)
    dtrain_predprob = alg.predict_proba(train_data_features)[:,1]
    #Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % sklearn.metrics.accuracy_score(train_labels, dtrain_predictions))
xgb1 = xgb.sklearn.XGBClassifier(
learning_rate =0.1,
n_estimators=50,
max_depth=5,
min_child_weight=1,
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
objective='multi:softmax',
nthread=4,
scale_pos_weight=1,
seed=27)
X=np.random.normal(size=(200,30))
y=np.random.randint(0,5,200)
modelfit(xgb1, X, y)
The output that I get is
Model Report
Accuracy : 1
Note that I used a much smaller data size. With the size that you mentioned, the algorithm may be very slow.
The error occurs because you are trying to use the AUC evaluation metric for multiclass classification, but AUC here is only applicable to two-class problems. In the xgboost implementation, "auc" expects the prediction size to match the label size, whereas your multiclass prediction size would be 45001*1161. Use one of the multiclass metrics, "mlogloss" or "merror", instead.
P.S.: Currently, xgboost would be rather slow with so many classes, as there is some inefficiency in prediction caching during training.
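For reference, the two suggested metrics are simple to compute by hand. The sketch below is a plain-Python illustration, not xgboost's implementation: merror is taken as the fraction of wrong predictions, and mlogloss as the mean negative log of the probability assigned to the true class. The function names and the toy data are made up. It also shows the size mismatch described above: each row carries one probability per class, so there are rows × classes predictions but only one label per row.

```python
import math


def merror(probs, labels):
    """Fraction of rows whose most probable class differs from the label."""
    wrong = sum(1 for row, label in zip(probs, labels)
                if row.index(max(row)) != label)
    return wrong / len(labels)


def mlogloss(probs, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for row, label in zip(probs, labels):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(labels)


# Toy data: 4 rows, 3 classes (made up for illustration).
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.6, 0.3],
         [0.2, 0.5, 0.3],
         [0.3, 0.3, 0.4]]
labels = [0, 1, 2, 2]

n_predictions = sum(len(row) for row in probs)  # rows * classes
n_labels = len(labels)                          # one label per row
err = merror(probs, labels)
loss = mlogloss(probs, labels)
```

With the shapes from the question, the same counting gives 45001 labels against 45001*1161 predicted probabilities, which is the mismatch the "auc" check complains about.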