cross_val_score in Python's scikit-learn makes it convenient to generate a variety of model performance metrics. This is what I use to get ROC-AUC and recall for a binary classification model.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log = LogisticRegression(class_weight='balanced')

# X is the feature matrix, y the binary target vector
auc = cross_val_score(log, X, y, scoring='roc_auc')
print("ROC-AUC (Mean): " + str(round(100 * auc.mean(), 2)) + "%" + " (Standard Deviation): " + str(round(100 * auc.std(), 2)) + "%")

recall = cross_val_score(log, X, y, scoring='recall')
print("RECALL (Mean): " + str(round(100 * recall.mean(), 2)) + "%" + " (Standard Deviation): " + str(round(100 * recall.std(), 2)) + "%")
For the same binary classification model, how can one incorporate a metric for calculating precision-recall AUC within cross_val_score?
I think you should look into precision_recall_curve(); it computes precision-recall pairs for different probability thresholds.
Try the following approach:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_curve, auc

FOLDS = 6

logistic_model = LogisticRegression(class_weight='balanced')  # the model from the question

k_fold = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
for i, (train_index, test_index) in enumerate(k_fold.split(X)):
    Xtrain, Xtest = X[train_index], X[test_index]
    ytrain, ytest = y[train_index], y[test_index]
    logistic_model.fit(Xtrain, ytrain)
    # predicted probability of the positive class
    pred_proba = logistic_model.predict_proba(Xtest)
    precision, recall, _ = precision_recall_curve(ytest, pred_proba[:, 1])
    # PR-AUC for this fold
    pr_auc = auc(recall, precision)
    ...
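If you want the metric directly from cross_val_score, scikit-learn also ships an 'average_precision' scorer; average precision summarizes the precision-recall curve and is a close relative of PR-AUC. A minimal sketch, assuming the same X, y and the log estimator from the question:

from sklearn.model_selection import cross_val_score

pr = cross_val_score(log, X, y, scoring='average_precision')
print("Average precision (mean): " + str(round(100 * pr.mean(), 2)) + "%")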
I am trying to understand cross-validation score versus accuracy score. I got accuracy score = 0.79 and cross-validation score = 0.73. As far as I know, these scores should be very close to each other. What can I say about my model just by looking at these scores?
sonar_x = df_2.iloc[:, 0:61].values.astype(int)
sonar_y = df_2.iloc[:, 62:].values.ravel().astype(int)

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

x_train, x_test, y_train, y_test = train_test_split(sonar_x, sonar_y, test_size=0.33, random_state=0)

rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

folds = KFold(n_splits=10, shuffle=False)  # random_state only applies when shuffle=True

scores = []
for n_fold, (train_index, valid_index) in enumerate(folds.split(sonar_x, sonar_y)):
    print('\n Fold ' + str(n_fold + 1) +
          ' \n\n train ids :' + str(train_index) +
          ' \n\n validation ids :' + str(valid_index))

    x_train, x_valid = sonar_x[train_index], sonar_x[valid_index]
    y_train, y_valid = sonar_y[train_index], sonar_y[valid_index]

    rf.fit(x_train, y_train)
    y_pred = rf.predict(x_test)
    acc_score = accuracy_score(y_test, y_pred)
    scores.append(acc_score)
    print('\n Accuracy score for Fold ' + str(n_fold + 1) + ' --> ' + str(acc_score) + '\n')

print(scores)
print('Avg. accuracy score :' + str(np.mean(scores)))

## Cross-validation score
scores = cross_val_score(rf, sonar_x, sonar_y, cv=10)
print(scores.mean())
You have a bug in your code that accounts for the gap.
You are training on each fold's training split, but evaluating against a fixed test set.
These two lines in the for loop:
y_pred = rf.predict(x_test)
acc_score = accuracy_score(y_test, y_pred)
Should be:
y_pred = rf.predict(x_valid)
acc_score = accuracy_score(y_valid, y_pred)
Since your hand-written cross-validation evaluates against a fixed x_test and y_test, samples from that test set end up in the training folds of some iterations; that leakage accounts for the overly optimistic result in the overall average.
If you correct this, the values should come closer, because you are doing the same, conceptually speaking, as cross_val_score does.
They might not match exactly though, due to randomness and the size of your dataset, which is quite small.
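One more source of divergence worth knowing (a sketch, assuming rf, sonar_x and sonar_y as above): with an integer cv and a classifier, cross_val_score uses stratified folds by default, whereas your loop uses plain KFold. Passing the same splitter object to cv makes the comparison exact:

from sklearn.model_selection import KFold, cross_val_score

# reuse the identical folds in cross_val_score that the manual loop used
folds = KFold(n_splits=10, shuffle=False)
scores = cross_val_score(rf, sonar_x, sonar_y, cv=folds)
print(scores.mean())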
Finally, if you just want to get one test score, the KFold part is not needed and you can do:
x_train,x_test,y_train,y_test=train_test_split(sonar_x,sonar_y,test_size=0.33,random_state=0)
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
acc_score = accuracy_score(y_test, y_pred)
This result is less robust than the cross-validated one, since you split the dataset just once; you can therefore get better or worse results by chance, depending on how easy or hard the train-test split produced by the random seed happens to be.
I have used SVM's LinearSVC for training and testing the data. I'm able to get the accuracy for SVM on my dataset, but in addition to accuracy I need precision and recall. Can anyone suggest how to calculate these?
My code:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

with open("/Users/abc/Desktop/reviews.txt") as f:
    reviews = f.read().split("\n")
with open("/Users/abc/Desktop/labels.txt") as f:
    labels = f.read().split("\n")

reviews_tokens = [review.split() for review in reviews]

onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)

X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=None)

lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)
score = lsvm.score(onehot_enc.transform(X_test), y_test)
print("Score of SVM:", score)
You can do it like this:

from sklearn.metrics import confusion_matrix

# apply the same one-hot encoding used at training time before predicting
predicted_y = lsvm.predict(onehot_enc.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

Refer to the confusion_matrix documentation for more info.
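Alternatively, scikit-learn provides these metrics directly. A minimal sketch reusing lsvm and onehot_enc from the question; the pos_label value is an assumption, since the class names in labels.txt aren't shown:

from sklearn.metrics import precision_score, recall_score

predicted_y = lsvm.predict(onehot_enc.transform(X_test))
# pos_label='positive' is hypothetical; set it to whichever class counts as positive in your data
print("Precision:", precision_score(y_test, predicted_y, pos_label='positive'))
print("Recall:", recall_score(y_test, predicted_y, pos_label='positive'))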
Problem: I need to train a classifier (in MATLAB) to classify multiple levels of signal noise.
So I trained a multi-class SVM in MATLAB using fitcecoc and obtained an accuracy of 92%.
Then I trained a multi-class SVM using sklearn.svm.SVC in Python, but it seems that however I fiddle with the parameters, I cannot achieve more than 69% accuracy.
30% of the data was held back and used to verify the training; the confusion matrices can be seen below.
MATLAB confusion matrix
Python confusion matrix
So if anyone has experience with svm.SVC multi-class training and can see a problem in my code, or has a suggestion, it would be greatly appreciated.
Python code:
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
#from sklearn import preprocessing

#### SET fitting parameters here
C = 100
gamma = 1e-8

#### SET WEIGHTS HERE
C0_Weight = 1 * C
C1_weight = 1 * C
C2_weight = 1 * C
C3_weight = 1 * C
C4_weight = 1 * C
#####

X = np.genfromtxt('data/features.csv', delimiter=',')
Y = np.genfromtxt('data/targets.csv', delimiter=',')
print('feature data is of size: ' + str(X.shape))
print('target data is of size: ' + str(Y.shape))

# SPLIT X AND Y INTO TRAINING AND TEST SET
test_size = 0.3
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=0)

svc = svm.SVC(C=C, kernel='rbf', gamma=gamma,
              class_weight={0: C0_Weight, 1: C1_weight, 2: C2_weight,
                            3: C3_weight, 4: C4_weight},
              cache_size=1000)
svc.fit(X_train, Y_train)

scores = cross_val_score(svc, X_train, Y_train, cv=10)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Out = svc.predict(x_test)
np.savetxt("data/testPredictions.csv", Out, delimiter=",")
np.savetxt("data/testTargets.csv", y_test, delimiter=",")

# calculate accuracy in test data
Hits = 0
HitsOverlap = 0
for idx, val in enumerate(Out):
    Hits += int(y_test[idx] == Out[idx])
    HitsOverlap += (int(y_test[idx] == Out[idx]) + int(y_test[idx] == (Out[idx] - 1))
                    + int(y_test[idx] == (Out[idx] + 1)))
print("Accuracy in testset: ", Hits * 100 / (11595 * test_size))
print("Accuracy in testset w. overlap: ", HitsOverlap * 100 / (11595 * test_size))
To those curious how I got the parameters: they were found with GridSearchCV (which increased the accuracy from 40% to 69%).
Any help or suggestions are greatly appreciated.
After much hair-pulling, the answer was found here: http://neerajkumar.org/writings/svm/
When the inputs were scaled with StandardScaler(), svm.SVC produced results superior to MATLAB!
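A minimal sketch of that fix, assuming the X_train, Y_train split from the question. Putting the scaler and the SVC in a Pipeline means the scaling is re-fit on each training fold only (and note that C and gamma typically need re-tuning after scaling, e.g. with GridSearchCV):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# the scaler is fit on each training fold and applied to the matching validation fold
model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
scores = cross_val_score(model, X_train, Y_train, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))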
I have written the following code to import data vectors from a file and test the performance of an SVM classifier (using sklearn and Python).
However, the classifier's performance is lower than that of any other classifier (a neural net, for example, gives 98% accuracy on the test data, but this gives 92% at best). In my experience, SVM should produce better results for this kind of data.
Am I possibly doing something wrong?
import numpy as np

def buildData(featureCols, testRatio):
    f = open("car-eval-data-1.csv")
    data = np.loadtxt(fname=f, delimiter=',')
    X = data[:, :featureCols]  # select columns 0:featureCols-1
    y = data[:, featureCols]   # select column featureCols
    n_points = y.size
    print("Imported " + str(n_points) + " lines.")

    ### split into train/test sets
    split = int((1 - testRatio) * n_points)
    X_train = X[0:split, :]
    X_test = X[split:, :]
    y_train = y[0:split]
    y_test = y[split:]
    return X_train, y_train, X_test, y_test

def buildClassifier(features_train, labels_train):
    from sklearn import svm
    #clf = svm.SVC(kernel='linear', C=1.0, gamma=0.1)
    #clf = svm.SVC(kernel='poly', degree=3, C=1.0, gamma=0.1)
    clf = svm.SVC(kernel='rbf', C=1.0, gamma=0.1)
    clf.fit(features_train, labels_train)
    return clf

def checkAccuracy(clf, features, labels):
    from sklearn.metrics import accuracy_score
    pred = clf.predict(features)
    accuracy = accuracy_score(labels, pred)
    return accuracy

features_train, labels_train, features_test, labels_test = buildData(6, 0.3)
clf = buildClassifier(features_train, labels_train)
trainAccuracy = checkAccuracy(clf, features_train, labels_train)
testAccuracy = checkAccuracy(clf, features_test, labels_test)
print("Training Items: " + str(labels_train.size) + ", Test Items: " + str(labels_test.size))
print("Training Accuracy: " + str(trainAccuracy))
print("Test Accuracy: " + str(testAccuracy))

i = 0
while i < labels_test.size:
    # predict expects a 2-D array, so reshape the single sample
    pred = clf.predict(features_test[i].reshape(1, -1))
    print("F(" + str(i) + ") : " + str(features_test[i]) + " label= " + str(labels_test[i]) + " pred= " + str(pred))
    i = i + 1
How is it possible to do multi-class classification if it does not do it by default?
p.s. my data is of the following format (last column is the class):
2,2,2,2,2,1,0
2,2,2,2,1,2,0
0,2,2,5,2,2,3
2,2,2,4,2,2,1
2,2,2,4,2,0,0
2,2,2,4,2,1,1
2,2,2,4,1,2,1
0,2,2,5,2,2,3
I found the problem after a long time and am posting it in case someone needs it.
The problem was that the data import function wasn't shuffling the data. If the data is somehow sorted, there is the risk that you train the classifier on some data and test it on totally different data. In the neural-net case, MATLAB was used, which automatically shuffles the input data.
def buildData(filename, featureCols, testRatio):
    f = open(filename)
    data = np.loadtxt(fname=f, delimiter=',')
    np.random.shuffle(data)  # randomize the order
    X = data[:, :featureCols]  # select columns 0:featureCols-1
    y = data[:, featureCols]   # select column featureCols
    n_points = y.size
    print("Imported " + str(n_points) + " lines.")

    ### split into train/test sets
    split = int((1 - testRatio) * n_points)
    X_train = X[0:split, :]
    X_test = X[split:, :]
    y_train = y[0:split]
    y_test = y[split:]
    return X_train, y_train, X_test, y_test
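Note that scikit-learn's train_test_split shuffles by default, so a minimal alternative to the hand-rolled split (assuming X and y as above) is:

from sklearn.model_selection import train_test_split

# shuffle=True is the default, which avoids the sorted-data trap
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)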
I'm using scikit to perform logistic regression on spam/ham data.
X_train is my training data and y_train the labels ('spam' or 'ham'), and I trained my LogisticRegression this way:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
If I want to get the accuracies for a 10 fold cross validation, I just write:
accuracy = cross_val_score(classifier, X_train, y_train, cv=10)
I thought it was possible to also calculate the precision and recall by simply adding one parameter, this way:
precision = cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall')
But it results in a ValueError:
ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4')
Is it related to the data (should I binarize the labels?) or did they change the cross_val_score function?
Thank you in advance!
To compute recall and precision, the labels do indeed have to be binarized, this way:

from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
y_train = lb.fit_transform(y_train).ravel()  # map 'ham'/'spam' to 0/1 and flatten to 1-D

To go further, I was surprised that I didn't have to binarize the labels when I wanted to calculate the accuracy:

accuracy = cross_val_score(classifier, X_train, y_train, cv=10)

That's simply because the accuracy formula doesn't need to know which class is considered positive or negative: (TP + TN) / (TP + TN + FN + FP). TP and TN are indeed exchangeable there, which is not the case for recall, precision and F1.
I encountered the same problem, and I solved it with:

# precision, recall and F1
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
# flatten the (n, 1) binarized output to a 1-D array of 0/1 labels
y_train = np.array([number[0] for number in lb.fit_transform(y_train)])

recall = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall', np.mean(recall), recall)
precision = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision', np.mean(precision), precision)
f1 = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1', np.mean(f1), f1)
The syntax you showed above is correct. It looks like a problem with the data you're using. The labels don't need to be binarized, as long as they're not continuous numbers.
You can prove out the same syntax with a different dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X_train = iris['data']
y_train = iris['target']
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision'))
print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall'))
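A caveat worth noting: on recent scikit-learn versions the plain 'precision' and 'recall' scorers assume a binary target, so for a multi-class dataset like iris you would use an averaged variant:

# macro-averaged scorers handle multi-class targets
print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision_macro'))
print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall_macro'))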
You can use cross-validation like this to get the F1-score and recall:

from time import time
from sklearn.model_selection import cross_val_score

print('10-fold cross validation:\n')
start_time = time()
scores = cross_val_score(clf, X, y, cv=10, scoring='f1')
recall_scores = cross_val_score(clf, X, y, cv=10, scoring='recall')
print("f1: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), 'DecisionTreeClassifier'))
print("---Classifier %s took %s seconds ---" % ('DecisionTreeClassifier', (time() - start_time)))
For more scoring parameters, see the scoring section of the scikit-learn documentation.
You should specify which of the two labels is positive (it could be 'ham'):

from sklearn.metrics import make_scorer, precision_score

precision = make_scorer(precision_score, pos_label="ham")
precision_scores = cross_val_score(classifier, X_train, y_train, cv=10, scoring=precision)
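The same pattern works for recall; a minimal sketch assuming the same classifier and data:

from sklearn.metrics import make_scorer, recall_score

recall = make_scorer(recall_score, pos_label="ham")
recall_scores = cross_val_score(classifier, X_train, y_train, cv=10, scoring=recall)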