I am trying to compute the accuracy and the ROC curve for a LinearSVC, but I'm not sure how to get the probabilities needed for the ROC.
I have this for calculating the accuracy; y_pred gives me hard predictions.
svm = LinearSVC()
y_pred = cross_val_predict(svm, X, y, cv=5)
For calculating the probabilities, I have this:
clf = CalibratedClassifierCV(svm, cv=5)
scores = cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:,1]
I am not sure about the above two lines because the cv=5 parameter seems to be repeated. Any ideas on how to combine cross_val_predict and CalibratedClassifierCV? I don't have a separate test set. SVC with a linear kernel gives me different results, and I only want to use LinearSVC.
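To make the question concrete, here is a minimal sketch of what I think the combined setup would look like (the inner cv calibrates the LinearSVC on held-out folds, the outer cv produces out-of-fold probabilities; thresholding at 0.5 to get hard labels assumes y is a binary 0/1 array):
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, roc_auc_score
# Inner cv: used by CalibratedClassifierCV to fit the probability calibration.
calibrated_svm = CalibratedClassifierCV(LinearSVC(), cv=3)
# Outer cv: cross_val_predict returns out-of-fold probabilities for every sample.
proba = cross_val_predict(calibrated_svm, X, y, cv=5, method='predict_proba')[:, 1]
# Hard labels from the same out-of-fold probabilities (assumes binary 0/1 labels).
y_pred = (proba >= 0.5).astype(int)
print("accuracy:", accuracy_score(y, y_pred))
print("ROC AUC:", roc_auc_score(y, proba))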
I'm using sklearn's RFECV to arrive at the optimal set of features for my classification problem. I have X with 217 numerical features and a binary label y. I determine the optimal set of features like so:
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
min_features_to_select = 3
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=50, random_state=1)
model = LogisticRegression(penalty='l1', solver="liblinear")
rfecv = RFECV(estimator=model, step=1, cv=cv,
              scoring="roc_auc", min_features_to_select=min_features_to_select)
rfecv.fit(X, y)
To visualize the RFECV process, I plot the number of features selected against the grid scores of the RFECV:
import matplotlib.pyplot as plt
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (roc_auc)")
plt.plot(range(min_features_to_select, len(rfecv.grid_scores_) + min_features_to_select),
         rfecv.grid_scores_)
plt.show()
refcv_supp = rfecv.get_support()
optimal_features = list(features[refcv_supp])  # `features` is assumed to hold the column names of X
Which looks like this (plot of cross-validation score vs. number of features selected; image not shown):
The grid_scores of the RFECV lie around 0.660 for every feature subset. I interpret this score as the roc_auc performance of the model used inside the RFECV with each particular set of features.
Now when I isolate the optimal set of features and feed it into the exact same cross-validated model, the performance does not look at all like the scores depicted in the graph...
import numpy as np
from sklearn.model_selection import cross_val_score
refcv_supp = rfecv.get_support()
optimal_features = list(features[refcv_supp])
X = X.loc[:, np.array(optimal_features)]
model = LogisticRegression(penalty='l1', solver="liblinear")
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=100, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1, error_score='raise')
score = np.mean(n_scores)
std = np.std(n_scores)
This results in a mean roc_auc of 0.896 with a std of 0.046. Can anyone help me understand what is going wrong here?
I've created a model that is trained on the Titanic dataset, and I want to see an accuracy percentage for my model. I've done this before, but sadly I don't remember how. I looked on the internet and couldn't find anything; either I entered the wrong search terms, or there isn't anything there.
# the tts function is `train_test_split` from `sklearn.model_selection`
from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import RandomForestRegressor
train_X, val_X, train_y, val_y = tts(X, y, random_state=0)  # y is the state of survival
forest_model = RandomForestRegressor(random_state=0)
forest_model.fit(train_X, train_y)
val_predictions = forest_model.predict(val_X)
How can I calculate the accuracy?
I wonder why you are using RandomForestRegressor, as the titanic dataset can be formulated as a binary classification problem. Assuming that is a mistake and you use a RandomForestClassifier instead, you can measure accuracy with:
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(val_y, val_predictions)
However, it is often better to use K-fold cross-validation, which gives you a more reliable accuracy estimate:
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(forest_model, X, y, cv=10, scoring='accuracy') #10-fold cross validation, notice that I am giving X,y not X_train, y_train
K-fold cross-validation with cv=10 gives you 10 accuracy values, as it divides the data into 10 folds (i.e. parts). You can then get the mean and standard deviation of these accuracy values as follows:
>>> import numpy as np
>>> accuracy = np.array(cross_val_score(forest_model, X, y, cv=10, scoring='accuracy'))
>>> accuracy.mean() #mean of accuracies
0.81
>>> accuracy.std() #standard deviation of accuracies
0.04
You can also use other scoring metrics such as F1-score, precision, recall, cohen_kappa_score, etc.
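For instance, a minimal sketch of switching the metric (this assumes forest_model is a RandomForestClassifier as suggested above; 'f1' is a built-in scorer string, while metrics like cohen_kappa_score need make_scorer):
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, cohen_kappa_score
# Built-in scorer strings such as 'f1', 'precision' or 'recall' can be passed directly.
f1_scores = cross_val_score(forest_model, X, y, cv=10, scoring='f1')
# Metrics without a built-in scorer string are wrapped with make_scorer.
kappa_scores = cross_val_score(forest_model, X, y, cv=10, scoring=make_scorer(cohen_kappa_score))
print(f1_scores.mean(), kappa_scores.mean())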
You can also calculate the model score to check its performance (note that score takes the validation features and true labels, not the predictions):
forest_model.score(val_X, val_y)
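Note that forest_model in the question is a RandomForestRegressor, whose score method returns R² rather than classification accuracy. A sketch of the classifier version (RandomForestClassifier is assumed, as in the earlier answer):
from sklearn.ensemble import RandomForestClassifier
forest_model = RandomForestClassifier(random_state=0)
forest_model.fit(train_X, train_y)
print(forest_model.score(val_X, val_y))  # mean accuracy on the validation split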
I am using sklearn's GridSearchCV to find the best parameters for random forest classification using a predefined validation set. The scores from the best estimator returned by GridSearchCV do not match the scores obtained by training a separate classifier with the same parameters.
The data split definition
X = pd.concat([X_train, X_devel])
y = pd.concat([y_train, y_devel])
test_fold = -X.index.str.contains('train').astype(int)  # train rows get -1 (never used as a test fold), devel rows get 0
ps = PredefinedSplit(test_fold)
The GridSearch definition
n_estimators = [10]
max_depth = [4]
grid = {'n_estimators': n_estimators, 'max_depth': max_depth}
rf = RandomForestClassifier(random_state=0)
rf_grid = GridSearchCV(estimator=rf, param_grid=grid, cv=ps, scoring='recall_macro')
rf_grid.fit(X, y)
The classifier definition
clf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
clf.fit(X_train, y_train)
The recall was calculated explicitly using sklearn.metrics.recall_score
y_pred_train = clf.predict(X_train)
y_pred_devel = clf.predict(X_devel)
uar_train = recall_score(y_train, y_pred_train, average='macro')
uar_devel = recall_score(y_devel, y_pred_devel, average='macro')
GridSearch:
uar train: 0.32189884516029466
uar devel: 0.3328299259976279
Random Forest:
uar train: 0.483040291148839
uar devel: 0.40706644557392435
What is the reason for such a mismatch?
There are multiple issues here:
Your input arguments to recall_score are reversed. The correct order is:
recall_score(y_true, y_pred)
But you are doing:
recall_score(y_pred_train, y_train, average='macro')
Correct that to:
recall_score(y_train, y_pred_train, average='macro')
You are doing rf_grid.fit(X, y) for the grid search. That means that after finding the best parameter combination, GridSearchCV refits the estimator on the whole data (all of X; the PredefinedSplit is only used during cross-validation to search for the best parameters). So in essence, the estimator returned by GridSearchCV has seen the whole data, and its scores will differ from what you get when you do clf.fit(X_train, y_train).
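One way to check this is to keep both estimators trained on the same data; a rough sketch, reusing X, y, X_train, y_train, X_devel, y_devel, ps, and grid from the question (refit=False stops GridSearchCV from refitting on all of X):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import recall_score
# Run the search without the final refit on the whole of X.
search = GridSearchCV(estimator=RandomForestClassifier(random_state=0), param_grid=grid,
                      cv=ps, scoring='recall_macro', refit=False)
search.fit(X, y)
print("grid-search devel recall:", search.cv_results_['mean_test_score'])
# Fit the same parameters on the training portion only, like the standalone classifier.
clf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
clf.fit(X_train, y_train)
print("manual devel recall:", recall_score(y_devel, clf.predict(X_devel), average='macro'))
Scored this way, the two devel numbers should essentially match, since both models see only the 'train' rows during fitting.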
It's because in your GridSearchCV you are using recall_macro as the scoring function, which returns the macro-averaged recall score. See this link.
However, the default score method of your RandomForestClassifier returns the mean accuracy. That is why the scores are different (one is recall, the other is accuracy). See this link for more info on the same.
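To see just the metric difference, you can score the devel data both ways (a sketch reusing rf_grid, which was fitted with the default refit=True, and clf from the question):
# GridSearchCV.score applies the scorer it was given (macro-averaged recall here).
print(rf_grid.score(X_devel, y_devel))
# RandomForestClassifier.score reports its default metric, mean accuracy.
print(clf.score(X_devel, y_devel))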
I am doing isolation forest clustering on the Mulcross database, which has 2 classes. I split my data into training and test sets and try to compute the accuracy score, the roc_auc_score, and the confusion_matrix on my test set. There are two problems: first, in a clustering method I should not use the labels during training, which means "y_train" should not be used, but I have not found another way to evaluate my model. Second, the results I get are wrong.
My problem is how to evaluate a clustering model like isolation forest.
Here is my code:
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix
df = pd.read_csv('db.csv')
y_true = df['Target']
df_data = df.drop('Target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(df_data, y_true, test_size=0.3, random_state=42)
alg = IsolationForest(n_estimators=100, max_samples=256, contamination=0.1, max_features=1.0,
                      bootstrap=False, n_jobs=-1, random_state=42, verbose=0, behaviour="new")
model = alg.fit(X_train, y_train)  # IsolationForest ignores y here; it is fit on X_train alone
preds = alg.predict(X_test)        # returns +1 for predicted inliers and -1 for predicted outliers
print("#############################\n#############################")
print(accuracy_score(y_test, preds))
print(roc_auc_score(y_test, preds))
cm = confusion_matrix(y_test, preds)
print(cm)
print("#############################\n#############################")
I do not understand why you are clustering and then dividing the data into training/testing sets. It seems to me like you are mixing classification and clustering, or something like that. If you have labels, try a supervised method; easy wins are xgboost, random forest, GLM, logistic regression, etc.
If you want to evaluate clustering methods, you can investigate the inter- and intra-cluster distances. At the end of the day, you want small and well-separated clusters. You can also look at a metric called the silhouette score.
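For example, a minimal sketch of the silhouette score, using the X_test and preds from the question (any set of cluster/outlier labels works):
from sklearn.metrics import silhouette_score
# Takes the features and the assigned labels (here +1/-1 from IsolationForest.predict);
# values near 1 mean tight, well-separated clusters, values near 0 mean overlapping ones.
print("silhouette:", silhouette_score(X_test, preds))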
You can also try:
print("Accuracy:", list(preds).count(1) / preds.shape[0])  # fraction of test points predicted as inliers (+1); this is only a true accuracy if the test set contains exclusively normal samples
Also, look here for some more details.
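If you do want to score preds against the ground-truth labels, the +1/-1 output of IsolationForest first has to be mapped onto the same encoding as y_test. A sketch, under the assumption that the Target column uses 1 for anomalies and 0 for normal points:
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix
# predict returns +1 for inliers and -1 for outliers; map to 0 (normal) and 1 (anomaly).
preds_01 = np.where(preds == -1, 1, 0)
print(accuracy_score(y_test, preds_01))
print(confusion_matrix(y_test, preds_01))
# For ROC AUC, a continuous anomaly score is more informative than hard labels;
# score_samples is low for abnormal points, so negate it.
print(roc_auc_score(y_test, -alg.score_samples(X_test)))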
So I have a very challenging dataset to work with, but even with that in mind, the ROC curves I am getting as a result seem quite bizarre and look wrong.
Below is my code. I have used the scikitplot library (skplt) for plotting ROC curves after passing in my predictions and the ground-truth labels, so I cannot reasonably be getting that part wrong. Is there something crazily obvious that I am missing here?
import matplotlib.pyplot as plt
import scikitplot as skplt
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVR
from sklearn import feature_selection
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score
# My dataset (load_new_dataset is my own helper) - note that m (number of examples) is 115.
# These are histograms that already sum to 1, so I am doubtful that further preprocessing is necessary.
X, y = load_new_dataset(positives, positive_files, m=115, upper=21, range_size=10, display_plot=False)
# Partition - class balance is 0.87 : 0.13 for negative and positive classes respectively
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, stratify=y)
# Pick a baseline classifier - Naive Bayes
nb = GaussianNB()
# Very large class imbalance, so use stratified K-fold cross-validation.
cross_val = StratifiedKFold(n_splits=10)
# Use RFE for feature selection
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
# Create pipeline, nothing fancy here
clf = Pipeline(steps=[("feature selection", selector), ("classifier", nb)])
# Score using F1-score due to class imbalance - accuracy unlikely to be meaningful
scores = cross_val_score(clf, X_train, y_train, cv=cross_val,
                         scoring=make_scorer(f1_score, average='micro'))
# Fit and make predictions. Use these to plot ROC curves.
print(scores)
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
skplt.metrics.plot_roc_curve(y_test, y_pred)
plt.show()
And below is the starkly binary ROC curve: [ROC plot not shown]
I understand that I can't expect outstanding performance with such a challenging dataset, but even so I cannot fathom why I am getting such a binary result, particularly for the ROC curves of the individual classes. No, I cannot get more data, although I sincerely wish I could. If this really is valid code, then I will just have to make do with it and perhaps report the micro-average F1 score, which does not look too bad.
For reference, using the make_classification function from sklearn in the code snippet below, I get the following ROC curve:
# Randomly generate a dataset with similar characteristics (size, class balance, num_features)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=103, n_features=21, random_state=0, n_classes=2,
                           weights=[0.87, 0.13], n_informative=5, n_clusters_per_class=3)
positives = np.where(y == 1)
X_minority, X_majority, y_minority, y_majority = np.take(X, positives, axis=0), \
                                                 np.delete(X, positives, axis=0), \
                                                 np.take(y, positives, axis=0), \
                                                 np.delete(y, positives, axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, stratify=y)
# Cross-validation again
cross_val = StratifiedKFold(n_splits=10)
# Use Naive Bayes again for consistency
clf = GaussianNB()
# Likewise for the evaluation metric
scores = cross_val_score(clf, X_train, y_train, cv=cross_val,
                         scoring=make_scorer(f1_score, average='micro'))
print(scores)
# Fit, predict, plot results
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
skplt.metrics.plot_roc_curve(y_test, y_pred)
plt.show()
Am I doing something wrong? Or is this what I should expect given these characteristics?
Thanks to Stev's kind suggestion of increasing the test size, the resulting curves I ended up getting were far smoother and exhibited much less variance. Using SMOTE in this case was also very helpful and I would advise it (using imblearn perhaps) for anyone else with a similar issue.
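Roughly, the two changes looked like this (a sketch only; the exact split size and SMOTE settings here are illustrative, not the ones I tuned):
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
# A larger test split gives a smoother, less noisy ROC estimate on ~115 samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y)
# Oversampling inside the imblearn pipeline keeps SMOTE away from the test data.
clf = ImbPipeline(steps=[("smote", SMOTE()), ("classifier", GaussianNB())])
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
skplt.metrics.plot_roc_curve(y_test, y_pred)
plt.show()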