I have a multi-class classification problem with 50 classes. I am attempting to find the top-k categorical accuracy for the SVM and NB algorithms.
X_train, X_test, y_train, y_test = train_test_split(sentences, labels, test_size=0.3, random_state = 42)
nb = Pipeline([('vect', CountVectorizer(min_df=1, dtype=np.int32, vocabulary=vocab_data, ngram_range=(1, 2))),
               ('tfidf', TfidfTransformer(use_idf=False)),
               ('chi', SelectKBest(chi2, k='all')),
               ('clf', OneVsRestClassifier(MultinomialNB(alpha=0.001))),
               ])
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print('accuracy %s' % accuracy_score(y_test, y_pred))
I am able to find the accuracy, precision, and recall values. Is there a way to find the top-k accuracy?
from sklearn.metrics import top_k_accuracy_score  # available in scikit-learn >= 0.24

probs = nb.predict_proba(X_test)
print('top-5 accuracy %s' % top_k_accuracy_score(y_test, probs, k=5))
This prints the top-k accuracy.
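If your scikit-learn version predates top_k_accuracy_score (it was added in 0.24), here is a minimal sketch of computing top-k accuracy by hand; it assumes probs is the (n_samples, n_classes) output of predict_proba above, with columns ordered as in nb.classes_.

import numpy as np

def top_k_accuracy(y_true, probs, classes, k=5):
    # indices of the k highest-probability classes for each sample
    top_k_idx = np.argsort(probs, axis=1)[:, -k:]
    # map column indices back to class labels and check whether the true label is among them
    top_k_labels = classes[top_k_idx]
    hits = [true in row for true, row in zip(y_true, top_k_labels)]
    return np.mean(hits)

print('top-5 accuracy %s' % top_k_accuracy(np.asarray(y_test), probs, np.asarray(nb.classes_), k=5))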
When I run the following code, I get this error. How can I solve it?
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

model = KNeighborsClassifier(n_neighbors=19)
#model.fit(X_train, Y_train.values.ravel())
model.fit(X_train, Y_train)
prediction = model.predict(X_test)
print("KNN")
print("Accuracy = ", accuracy_score(Y_test, prediction))
print("Confusion Matrix =")
print(confusion_matrix(Y_test, prediction))
print(classification_report(Y_test, prediction))
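The error text isn't included, but the commented-out line hints at one common cause (this is an assumption on my part, not something stated in the question): scikit-learn expects a 1-D array of labels, so a single-column Y_train DataFrame usually needs to be flattened before fitting.

# Sketch, assuming the error is about the label shape: flatten the
# one-column Y_train into the 1-D vector that fit() expects.
model = KNeighborsClassifier(n_neighbors=19)
model.fit(X_train, Y_train.values.ravel())
prediction = model.predict(X_test)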
I want to apply the SVM classifier to my problem, where the prediction vector has two classes. SVM shows a "bad input" error when I try to pass in such a prediction vector. Is it possible to provide such input to SVM? If not, how can I cope with this issue?
Y = np.zeros((len(y), max(y)+1))
for i in range(len(y)):
    Y[i, y[i]] = 1
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf.get_n_splits(X)
print(kf)
# KFold(n_splits=3, random_state=None, shuffle=False)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Matrix Y appears as below:
[image: the one-hot encoded matrix Y]
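For reference, a minimal sketch of the usual workaround (my assumption, not part of the original post): SVC expects a 1-D array of class labels rather than a one-hot indicator matrix, so the rows of Y can be collapsed back to class indices before fitting.

import numpy as np
from sklearn.svm import SVC

# Sketch: convert the one-hot rows of Y back to integer class labels for SVC.
y_train_labels = np.argmax(y_train, axis=1)
y_test_labels = np.argmax(y_test, axis=1)

classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train_labels)
y_pred = classifier.predict(X_test)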
I'm trying to print the accuracy score for an XGBoost multilabel classifier. However, I'm stuck on this error:
ValueError: Classification metrics can't handle a mix of multilabel-indicator and binary targets
I think y_test should not be one-hot encoded when passed to accuracy_score()? But everything I've tried creates more errors. Any idea how I can get this to work?
Code:
X = X.reshape(X.shape[0], -1)
print(X.shape)
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state=42, stratify=y)
dtrain = xgb.DMatrix(data=x_train, label=y_train)
dtest = xgb.DMatrix(data=x_test, label=y_test)
eval_list = [(dtest, 'eval')]
# Set the training parameters
params = {
    'max_depth': 3,
    'objective': 'multi:softmax',
    'num_class': 3,
    'tree_method': 'gpu_hist'
}
# Train the model
model = xgb.train(params, dtrain, evals=eval_list, early_stopping_rounds=20, verbose_eval=True)
# Evaluate predictions
y_pred = model.predict(dtest)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Adding argmax to y_test seemed to work:
accuracy = accuracy_score(y_test.argmax(axis=1), predictions)
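For context, here is a small illustration with toy arrays of my own (not the original data) of why the mix of target types fails and why argmax fixes it: accuracy_score needs both arguments to be 1-D arrays of class labels.

import numpy as np
from sklearn.metrics import accuracy_score

y_onehot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])  # multilabel-indicator format
preds = [0, 1, 1]                                        # plain class labels

# accuracy_score(y_onehot, preds) raises the same ValueError;
# collapsing the one-hot rows to class indices makes the two formats match.
print(accuracy_score(y_onehot.argmax(axis=1), preds))  # 0.666...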
I have a dataset with >16k vectors (21 dimensions).
I use 80% for training and 20% for testing.
I implemented a neural network and Naive Bayes with the above dataset.
Get the dataset and split it
data_set = np.loadtxt("./data/_vector21.csv", delimiter=",")
inp_vec = data_set[:, 1:22]
out_vec = data_set[:, 22:]
# Split dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inp_vec, out_vec, test_size=0.2) # 80% training and 20% test
Neural Network
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
mlp = MLPClassifier(hidden_layer_sizes=(21, 100, 100, 6), max_iter=1000)
mlp.fit(X_train, y_train)
predictions = mlp.predict(X_test)
print("\nAccuracy: %.2f%%\n" % (accuracy_score(y_test, predictions)*100))
# Accuracy: 61.26%
Naive Bayes
# Create a Gaussian Classifier
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
model = GaussianNB()
# Train the model using the training sets
model.fit(X_train, y_train)
# Predict the response for test dataset
y_pred = model.predict(X_test)
print("\nAccuracy: %.3f%%" % (metrics.accuracy_score(y_test, y_pred)*100))
#Accuracy: 34.050%
I expected the outputs of the neural network model and the Naive Bayes model to be closer.
Can anyone tell me what I did wrong and how I can fix it?
I am using StratifiedKFold to check the performance of my classifier. I have two classes, and I am trying to build a Logistic Regression classifier.
Here is my code:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
r = []
for train_index, test_index in skf.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    r.append(score)
    print(score)
print(np.mean(r))
I can print the performance score, but I couldn't figure out how to print the confusion matrix and classification report. If I just add a print statement inside the loop,
print(confusion_matrix(y_test, y_pred))
it will print it 10 times, but I want a report and a matrix for the final performance of the classifier.
Any help on how to calculate the matrix and the report would be appreciated. Thanks.
Cross-validation is used to assess the performance of particular models or hyperparameters across different splits of a dataset. At the end you don't have a single final performance per se; you have the individual performance of each split and the aggregated performance across splits. You could potentially use the tn, fn, fp, tp counts from each split to create aggregated precision, recall, sensitivity, etc., but you could also just use the predefined functions for those metrics in sklearn and aggregate them at the end.
e.g.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accs, precs, recs = [], [], []
for train_index, test_index in skf.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    accs.append(acc)
    precs.append(prec)
    recs.append(rec)
    print(f'Accuracy: {acc}, Precision: {prec}, Recall: {rec}')

print(f'Mean Accuracy: {np.mean(accs)}, Mean Precision: {np.mean(precs)}, Mean Recall: {np.mean(recs)}')
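Since the question also asked for an overall confusion matrix and classification report, one option (a sketch of my own, reusing x, y, and skf from above) is to collect the out-of-fold predictions and compute both once at the end.

from sklearn.metrics import confusion_matrix, classification_report

all_true, all_pred = [], []
for train_index, test_index in skf.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    tfidf = TfidfVectorizer()
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(tfidf.fit_transform(x_train), y_train)
    y_pred = clf.predict(tfidf.transform(x_test))
    # keep each fold's true labels and predictions for one final summary
    all_true.extend(y_test)
    all_pred.extend(y_pred)

print(confusion_matrix(all_true, all_pred))
print(classification_report(all_true, all_pred))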