Reverse one-hot encoded labels in XGBoost?

I'm trying to print the accuracy score for an XGBoost multi-class classifier, but I'm stuck on this error:
ValueError: Classification metrics can't handle a mix of multilabel-indicator and binary targets
I think y_test needs to not be one-hot encoded when it is passed to accuracy_score()? But everything I've tried creates more errors. Any idea how I can get this to work?
Code:
X = X.reshape(X.shape[0], -1)
print(X.shape)

# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state=42, stratify=y)
dtrain = xgb.DMatrix(data=x_train, label=y_train)
dtest = xgb.DMatrix(data=x_test, label=y_test)
eval_list = [(dtest, 'eval')]

# Train the model
params = {
    'max_depth': 3,
    'objective': 'multi:softmax',
    'num_class': 3,
    'tree_method': 'gpu_hist'
}
model = xgb.train(params, dtrain, evals=eval_list, early_stopping_rounds=20, verbose_eval=True)

# Evaluate predictions
y_pred = model.predict(dtest)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Adding argmax to y_test seemed to work:
accuracy = accuracy_score(y_test.argmax(axis=1), predictions)
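argmax works here because taking the index of the 1 in each one-hot row recovers the integer class label, which matches the integer class ids that multi:softmax predicts. A minimal illustration:
import numpy as np

y_onehot = np.array([[0, 0, 1],
                     [1, 0, 0],
                     [0, 1, 0]])

# The column index of the 1 in each row is the original class label
print(y_onehot.argmax(axis=1))  # [2 0 1]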

Related

How can I use cross-validation with ensemble models (like voting)?

I am trying to use a voting model from the ensemble methods. How can I add cross-validation to it?
Thanks
model_1 = SGDClassifier(random_state=0)
model_2 = DecisionTreeClassifier(criterion='entropy', max_depth=12, max_leaf_nodes=35, splitter='best')
model_3 = GradientBoostingClassifier(n_estimators=100, random_state=42)
model_4 = MLPClassifier(random_state=42)
# model_3 = KNeighborsClassifier(n_neighbors=2)

X = df1.drop('product', axis=1)
y = df1['product']
X_new = res_fit.transform(X)
# X_new = pd.DataFrame(X_new, columns=X.iloc[:, res_fit.support_].columns)

y_pred = cross_val_predict(model, X_new, y, cv=10)  # 'model' is not defined at this point
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.25, random_state=42)

model_5 = VotingClassifier([('SGD', model_1),
                            ('Tree', model_2),
                            ('GradBoost', model_3),
                            ('MLP', model_4)],
                           voting='hard')

for model in (model_1, model_2, model_3, model_4, model_5):
    model.fit(X_train, y_train)
    print(model.__class__.__name__, model.score(X_test, y_test))
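One way to cross-validate the ensemble itself is to pass the VotingClassifier straight to cross_val_score, which refits every base model on each fold. A minimal sketch, assuming X_new and y are prepared as above (the undefined model in the cross_val_predict call should likewise be replaced by model_5):
from sklearn.model_selection import cross_val_score

# 10-fold CV on the voting ensemble itself; each fold refits every base model
scores = cross_val_score(model_5, X_new, y, cv=10)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
cross_val_predict(model_5, X_new, y, cv=10) works the same way if you need the out-of-fold predictions rather than per-fold scores.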

Plotting confusion matrix

I am trying to plot a confusion matrix as a heatmap that contains both the percentages and the raw counts. I am using the code from this link: https://gist.github.com/mesquita/f6beffcc2579c6f3a97c9d93e278a9f1
The error message:
cm = confusion_matrix(y_true, y_pred, labels=labels)
File "C:\Users\XX\anaconda3\envs\yamnet\lib\site-packages\sklearn\metrics\_classification.py", line 316, in confusion_matrix
raise ValueError("At least one label specified must be in y_true")
ValueError: At least one label specified must be in y_true
This is my code:
encoder = LabelBinarizer()
print("encoder", encoder)
labels = encoder.fit_transform(y)
print("labels", labels)
print("y", y)

# Save the names of the classes for future use.
np.save(fname, encoder.classes_)
num_classes = len(np.unique(y))

# Generate the model
general_model = generate_model(num_classes, num_hidden=num_hidden,
                               activation=activation)
general_model.compile(optimizer=optimizer, loss='categorical_crossentropy',
                      metrics=['accuracy'])

# Create some callbacks
callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath=fname, monitor='val_loss',
                                                save_best_only=True),
             tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.9,
                                                  patience=15, verbose=1,
                                                  min_lr=0.000001)]

X, labels = shuffle(X, labels)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.20)
history = general_model.fit(X_train, y_train, epochs=epochs, validation_split=0.20,
                            batch_size=batch_size, callbacks=callbacks, verbose=1)
score = general_model.evaluate(X_test, y_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

y_pred = general_model.predict(X_test)
y_pred = y_pred.argmax(axis=1)
y_test = y_test.argmax(axis=1)
cm_analysis(y_test, y_pred, "ConfusionMatrix", y, X, ymap=None, figsize=(17, 17))
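A likely cause, judging only from the snippet: the labels given to confusion_matrix are the original class names, while y_test and y_pred have been argmaxed into integer indices, so none of the specified labels occurs in y_true. Since LabelBinarizer orders its one-hot columns by encoder.classes_, one sketch of a fix is to map the indices back to names before building the matrix (the mapped arrays can then be fed to cm_analysis the same way):
from sklearn.metrics import confusion_matrix

# Map the argmax indices back to the original class names;
# LabelBinarizer's columns follow the order of encoder.classes_
class_names = encoder.classes_
y_test_names = class_names[y_test]
y_pred_names = class_names[y_pred]
cm = confusion_matrix(y_test_names, y_pred_names, labels=class_names)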

How to find top-k accuracy for SVM and Naive Bayes

Mine is a multi-class classification problem with 50 classes. I am attempting to find the top-k categorical accuracy for SVM and NB algorithms.
X_train, X_test, y_train, y_test = train_test_split(sentences, labels, test_size=0.3, random_state=42)

nb = Pipeline([('vect', CountVectorizer(min_df=1, dtype=np.int32, vocabulary=vocab_data, ngram_range=(1, 2))),
               ('tfidf', TfidfTransformer(use_idf=False)),
               ('chi', SelectKBest(chi2, k='all')),
               ('clf', OneVsRestClassifier(MultinomialNB(alpha=0.001))),
               ])

nb.fit(X_train, y_train)  # fit the pipeline before predicting
y_pred = nb.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
I am able to find the accuracy, precision, and recall values. Is there a way to find the top-k accuracy?
from sklearn.metrics import top_k_accuracy_score

probs = nb.predict_proba(X_test)
print('accuracy %s' % top_k_accuracy_score(y_test, probs, k=5))
This prints the top-k accuracy.
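One caveat worth noting: top_k_accuracy_score was added in scikit-learn 0.24, and it infers the class set from y_test unless told otherwise; if a test split is missing some of the 50 classes, pass the full label set explicitly (a sketch using the pipeline above):
from sklearn.metrics import top_k_accuracy_score

probs = nb.predict_proba(X_test)
# nb.classes_ is the column order of predict_proba; passing it keeps the
# score correct even if y_test happens to miss some of the 50 classes
print('top-5 accuracy %s' % top_k_accuracy_score(y_test, probs, k=5,
                                                 labels=nb.classes_))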

Error in SVM prediction with two classes for the prediction variable

I want to apply the SVM classifier to my problem, where the prediction vector has two classes. SVM raises a "bad input" error when I try to pass such a prediction vector. Is it possible to provide such input to SVM? If not, how can I cope with this issue?
# Build a one-hot indicator matrix from the integer labels
Y = np.zeros((len(y), max(y) + 1))
for i in range(len(y)):
    Y[i, y[i]] = 1

from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf.get_n_splits(X)
print(kf)  # prints: KFold(n_splits=3, random_state=None, shuffle=False)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)  # fails: y_train is a 2-D one-hot matrix
y_pred = classifier.predict(X_test)
Matrix Y is the one-hot indicator matrix built above (screenshot omitted).
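scikit-learn's SVC takes a 1-D array of class labels, not a one-hot indicator matrix, which is what triggers the "bad input" complaint. A minimal sketch of the usual workaround, reusing the variables from the loop above:
import numpy as np
from sklearn.svm import SVC

# SVC expects a 1-D vector of class labels, not a one-hot matrix;
# recover the labels from the indicator rows (or just use the original y)
y_train_flat = np.argmax(y_train, axis=1)
y_test_flat = np.argmax(y_test, axis=1)

classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train_flat)
y_pred = classifier.predict(X_test)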

Confusion matrix and classification report with StratifiedKFold

I am using StratifiedKFold to check the performance of my classifier. I have two classes and I am trying to build a Logistic Regression classifier.
Here is my code:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
r = []  # per-fold scores
for train_index, test_index in skf.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    r.append(score)
    print(score)
print(np.mean(r))
I can print the score of each fold, but I couldn't figure out how to print the confusion matrix and classification report. If I just add a print statement inside the loop,
print(confusion_matrix(y_test, y_pred))
it will print it 10 times, but I want one report and one matrix for the final performance of the classifier.
Any help on how to calculate the matrix and the report would be appreciated. Thanks
Cross-validation is used to assess the performance of particular models or hyperparameters across different splits of a dataset. At the end you don't have a single final performance per se; you have the individual performance of each split and the aggregated performance across splits. You could use the tn, fn, fp, tp counts of each fold to build an aggregated precision, recall, sensitivity, and so on, but you could also just use the predefined functions for those metrics in sklearn and aggregate them at the end.
e.g.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accs, precs, recs = [], [], []
for train_index, test_index in skf.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    accs.append(acc)
    precs.append(prec)
    recs.append(rec)
    print(f'Accuracy: {acc}, Precision: {prec}, Recall: {rec}')

print(f'Mean Accuracy: {np.mean(accs)}, Mean Precision: {np.mean(precs)}, Mean Recall: {np.mean(recs)}')
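If the goal is a single confusion matrix and classification report across all ten folds, another common pattern is to collect every fold's true labels and out-of-fold predictions and compute both once at the end; a sketch under the same assumptions about x and y:
from sklearn.metrics import classification_report, confusion_matrix

all_true, all_pred = [], []
for train_index, test_index in skf.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    tfidf = TfidfVectorizer()
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(tfidf.fit_transform(x_train), y_train)
    all_true.extend(y_test)
    all_pred.extend(clf.predict(tfidf.transform(x_test)))

# One matrix and one report over every out-of-fold prediction
print(confusion_matrix(all_true, all_pred))
print(classification_report(all_true, all_pred))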
