Optimize "recall" only when "precision">0.9 using Optuna - python

I have a lower bound on the precision of my model, say a Logistic Regression, of 0.9, so I want the maximum recall subject to precision > 0.9.
Recall is here defined as the fraction of the dataset I still predict on while keeping a precision of at least 0.9, i.e. we drop predictions with low confidence (predict_proba).
I have used Optuna before to find optimal hyper-parameters, but I cannot figure out how to use it with this condition.
Right now I have the following code:
import optuna
from sklearn.linear_model import LogisticRegression

def objective(trial):
    c = trial.suggest_float("C", 0.001, 10)
    model = LogisticRegression(C=c)
    model.fit(X_train, y_train)
    pred_proba = model.predict_proba(X_val)
    pred = model.predict(X_val)
    thrs = trial.suggest_float("threshold", 0.5, 0.9)
    keep_idx = (pred_proba >= thrs).any(axis=1)  # predictions confident enough to keep
    recall = keep_idx.mean()  # ratio of predictions we are still making
    precision = (pred[keep_idx] == y_val[keep_idx]).mean()  # accuracy of the kept predictions
    if precision > 0.9:
        return recall
    else:
        return 0
But I assume we are giving Optuna a hard time here, since it gets no feedback about the parameter space when precision < 0.9; it just receives the same value, namely 0.
Is this the correct way of doing it, or is there a better way?
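One way to avoid that flat zero (this is only a sketch of an idea, not an established recipe) is to return a value that still ranks infeasible trials by how close they get to the precision target, so the sampler gets graded feedback instead of a constant. The data variables are the same (assumed) X_train, y_train, X_val, y_val as above, and the study is created with direction="maximize":

import optuna
from sklearn.linear_model import LogisticRegression

def objective(trial):
    c = trial.suggest_float("C", 0.001, 10)
    thrs = trial.suggest_float("threshold", 0.5, 0.9)
    model = LogisticRegression(C=c).fit(X_train, y_train)
    pred_proba = model.predict_proba(X_val)
    pred = model.predict(X_val)
    keep_idx = (pred_proba >= thrs).any(axis=1)
    recall = keep_idx.mean()
    precision = (pred[keep_idx] == y_val[keep_idx]).mean()
    if precision > 0.9:
        return recall           # feasible: maximize recall as before
    return precision - 0.9      # infeasible: negative, but larger the closer we get to 0.9

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)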

Related

How do I calculate the naive accuracy on a confusion matrix?

[confusion matrix image]
I have an issue where I'm trying to compute the test accuracy for a naive classifier that always predicts ŷ = −1.
I have already calculated the test accuracy of the classifier, based on the confusion matrix attached above, using (TN + TP)/n. But how do I calculate the naive value?
accuracy = (109112+3805)/127933
naive_accuracy = # TODO: Compute the accuracy of the naive classifier
It is actually the same formula. You just need to notice that your naive classifier never gives positive answers, so TP = 0, and TN will be equal to the total number of negatives: TN = 123324.
So naive_accuracy = (TN + TP)/n = (123324 + 0)/127933 ≈ 0.964.
And yes, this is a case where the naive classifier actually shows better accuracy than the one described by the confusion matrix you are referring to. This is due to the data imbalance problem: there are roughly 30 times more negative examples than positive ones. This is why accuracy is not applicable in that setting. Please check out the precision, recall and F-score metrics if you need a meaningful result.
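For concreteness, here is the same arithmetic as a small Python check (the numbers are taken directly from the question and the answer above; which count is TP and which is TN follows from the class totals):

n = 127933                      # total number of test samples
tp, tn = 3805, 109112           # diagonal of the original classifier's confusion matrix
accuracy = (tn + tp) / n        # ≈ 0.883

# naive classifier that always predicts -1: TP = 0, TN = all negatives
tp_naive, tn_naive = 0, 123324
naive_accuracy = (tn_naive + tp_naive) / n   # ≈ 0.964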

Random forest with Grid Search having worst performance than simple Random Forest

I am training a model using a simple Random Forest, and then another model on the exact same dataset using a Random Forest with Grid Search. Supposedly, since Grid Search looks for the best combination of values, the performance of the latter should be higher, but the opposite is happening.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Random Forest
clf = RandomForestClassifier()
model = clf.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Model metrics
results = classifmodel_Metrics('rf', model, y_test, y_pred)
list_of_results.append(results)

# GridSearchCV
clf = RandomForestClassifier()
parameter_grid = {'n_estimators': [50, 100, 150, 250, 500, 1000, 1500, 2000, 2500, 3000],
                  'max_depth': [1, 2, 3, 4, 5, 6]}
gridSearch = GridSearchCV(clf, parameter_grid, cv=5, n_jobs=1, verbose=5)
gridSearchResults = gridSearch.fit(X, y)
print(gridSearchResults.best_estimator_)

clf = gridSearchResults.best_estimator_
model = clf.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Model metrics
results = classifmodel_Metrics('rfopt', model, y_test, y_pred)
list_of_results.append(results)
print(list_of_results)
Does anyone know why this is happening? Is something wrong with my code, or is it something that can sporadically happen?
The function I use to calculate my model performance is this, F1 being the value I use for reference (the higher the F1, the better the model):
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def classifmodel_Metrics(modelName, model, actual, predicted):
    classes = list(np.unique(np.concatenate((actual, predicted))))
    confMtx = confusion_matrix(actual, predicted)
    print("Confusion Matrix")
    print(confMtx)
    report = classification_report(actual, predicted, output_dict=True)
    precision = report["macro avg"]["precision"]
    recall = report["macro avg"]["recall"]
    f1 = report["macro avg"]["f1-score"]  # weighted average of precision and recall
    res = pd.Series({
        "ModelName": modelName,
        "Model": model,
        "accuracy": round(accuracy_score(actual, predicted), 3),
        "precision": round(precision, 3),
        "recall": round(recall, 3),
        "f1": round(f1, 3)
    })
    if len(classes) == 2:
        print("\naccuracy: {0:.2%}".format(round(accuracy_score(actual, predicted), 3)))
        print("\nprecision: {0:.2%}".format(precision))
        print("\nrecall: {0:.2%}".format(recall))
        print("\nf1: {0:.2%}".format(f1))
    else:
        print("\n", classification_report(actual, predicted))
    return res
It's likely that your parameter grid is just not capturing enough depth for your model to learn well.
You have:
parameter_grid={'n_estimators':[50,100,150,250,500,1000,1500,2000,2500,3000],
'max_depth':[1,2,3,4,5,6]}
Here you are limiting your model to at most 64 leaves (2^6) by capping the max depth at 6. In contrast, scikit-learn's default (max_depth=None) grows each tree as deep as necessary, expanding nodes until every leaf is pure, potentially down to one sample per leaf.
To improve performance I would use more depth options. It's also highly unlikely that you need so many trees in your forest (diminishing returns). Try something like this instead:
parameter_grid={'n_estimators':[64, 128, 256],
'max_depth':[2, 4, 8, 16, 36, 64]}
If your best model is topping out the max depth, you can try increasing the limit you are testing.
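As an illustration (reusing the variable names from the question, which are assumed to refer to your own data), the revised grid plugs in like this, with the search fitted on the training split:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parameter_grid = {'n_estimators': [64, 128, 256],
                  'max_depth': [2, 4, 8, 16, 36, 64]}
gridSearch = GridSearchCV(RandomForestClassifier(), parameter_grid, cv=5, n_jobs=-1, verbose=5)
gridSearch.fit(X_train, y_train)      # run the search on the training data
print(gridSearch.best_params_)        # check whether max_depth hits the upper bound
clf = gridSearch.best_estimator_      # refit on X_train, y_train by default (refit=True)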

How to predict Precision, Recall and F1 score after training SSD

I am new to deep learning and I want to be able to evaluate a model which underwent training for a certain number of epochs using the F1 score. I believe that requires calculating precision and recall first.
The model trained is SSD-300-Tensorflow. Is there code or something that can produce these results? I am not using scikit-learn or anything, as I am not sure whether that is required to calculate the score, but any guidance is appreciated.
There is a file for evaluation in the folder tf_extended called metrics.py, which has code for precision and recall. After training, I have the checkpoints in the logs folder. How can I calculate my metrics? I am using Google Colab due to hardware limitations (GPU issues).
If you are using the TensorFlow Object Detection API, it provides a way of running model evaluation that can be configured for different metrics. A tutorial on how to do this is here.
The COCO evaluation metrics include analogous measures of precision and recall for object detection use cases. A good overview of these metrics is here. The concepts of precision and recall need to be adapted somewhat for object detection scenarios, because you have to define "how closely" a predicted bounding box needs to match the ground-truth bounding box to be considered a true positive.
I am not certain what the analog for F1 score would be for object detection scenarios. Typically, I have seen models compared using mAP as the single evaluation metric.
You should first calculate the false positives, false negatives, true positives and true negatives. To obtain these values you must evaluate your model on the test dataset. This link might help.
With these formulas you can calculate precision and recall, and here is some example code:
import numpy as np

y_hat = []
y = []
threshold = 0.5

for data, label in test_dataset:
    y_hat.extend(model.predict(data))
    y.extend(label.numpy()[:, 1])

y_hat = np.asarray(y_hat)
y = np.asarray(y)
m = len(y)
y_hat = np.asarray([1 if i > threshold else 0 for i in y_hat[:, 1]])  # binarize the positive-class score

true_positive = np.logical_and(y, y_hat).sum()
true_negative = np.logical_and(np.logical_not(y_hat), np.logical_not(y)).sum()
false_positive = np.logical_and(np.logical_not(y), y_hat).sum()
false_negative = np.logical_and(np.logical_not(y_hat), y).sum()

total = true_positive + true_negative + false_negative + false_positive
assert total == m

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
accuracy = (true_positive + true_negative) / total
f1score = 2 * precision * recall / (precision + recall)
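If you want a sanity check against a library implementation, the same numbers can be reproduced with scikit-learn (assuming the y and y_hat arrays built in the snippet above):

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

print(precision_score(y, y_hat))  # should match precision
print(recall_score(y, y_hat))     # should match recall
print(accuracy_score(y, y_hat))   # should match accuracy
print(f1_score(y, y_hat))         # should match f1score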

Why not use mean squared error for classification problems?

I am trying to solve a simple binary classification problem using LSTM. I am trying to figure out the correct loss function for the network. The issue is that when I use binary cross-entropy as the loss function, the loss value for training and testing is relatively high compared to using the mean squared error (MSE) function.
Upon research, I came across justifications that binary cross-entropy should be used for classification problems and MSE for regression problems. However, in my case, I am getting better accuracy and a lower loss value with MSE for binary classification.
I am not sure how to justify these obtained results. Why not use mean squared error for classification problems?
I would like to show it using an example.
Assume a 6 class classification problem.
Assume,
True probabilities = [1, 0, 0, 0, 0, 0]
Case 1:
Predicted probabilities = [0.2, 0.16, 0.16, 0.16, 0.16, 0.16]
Case 2:
Predicted probabilities = [0.4, 0.5, 0.1, 0, 0, 0]
The MSE in Case 1 and Case 2 is 0.128 and 0.1033 respectively.
Although Case 1 correctly predicts class 1 for the instance (the highest probability is on the true class), its loss is higher than the loss in Case 2, which predicts the wrong class.
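Those two values can be verified with a couple of lines of NumPy (the arrays below are just the numbers from the example above):

import numpy as np

y_true = np.array([1, 0, 0, 0, 0, 0])
case1 = np.array([0.2, 0.16, 0.16, 0.16, 0.16, 0.16])
case2 = np.array([0.4, 0.5, 0.1, 0.0, 0.0, 0.0])

print(np.mean((y_true - case1) ** 2))  # 0.128
print(np.mean((y_true - case2) ** 2))  # ≈ 0.1033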
The answer is right there in your question: the value of the binary cross-entropy loss is higher than the RMSE loss.
Case 1 (Large Error):
Let's say your model predicted 1e-7 and the actual label is 1.
Binary cross-entropy loss will be -log(1e-7) = 16.11.
Root mean square error will be sqrt((1 - 1e-7)^2) ≈ 1.0.
Case 2 (Small Error):
Let's say your model predicted 0.94 and the actual label is 1.
Binary cross-entropy loss will be -log(0.94) = 0.06.
Root mean square error will be sqrt((1 - 0.94)^2) = 0.06.
In Case 1, when the prediction is far off from reality, the BCE loss is large compared to the RMSE. A large loss value means large gradients, so the optimizer takes a larger step in the direction opposite to the gradient, which results in a relatively larger reduction in loss.
Though #nerd21 gives a good example of why MSE is a bad loss function for 6-class classification, the same argument does not carry over to binary classification.
Let's consider just binary classification. The label is [1, 0]; one prediction is h1 = [p, 1-p], the other is h2 = [q, 1-q], so their MSEs (sums of squared errors) are:
L1 = 2*(1-p)^2, L2 = 2*(1-q)^2
Assume h1 is a misclassification, i.e. p < 1-p, so 0 < p < 0.5;
assume h2 is a correct classification, i.e. q > 1-q, so 0.5 < q < 1.
Then L1 - L2 = 2(p-q)(p+q-2) > 0 is guaranteed:
p < q, so p - q < 0;
p + q < 0.5 + 1 = 1.5, so p + q - 2 < -0.5 < 0;
the product of two negative factors is positive, so L1 - L2 > 0, i.e. L1 > L2.
This means that for binary classification with MSE as the loss function, a misclassification always incurs a larger loss than a correct classification.
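A quick numerical check of this inequality (just a brute-force sweep over p and q in the ranges assumed above):

import numpy as np

y = np.array([1, 0])
for p in np.linspace(0.01, 0.49, 25):       # h1 misclassifies: p < 0.5
    for q in np.linspace(0.51, 0.99, 25):   # h2 classifies correctly: q > 0.5
        L1 = np.sum((y - np.array([p, 1 - p])) ** 2)  # = 2*(1-p)^2
        L2 = np.sum((y - np.array([q, 1 - q])) ** 2)  # = 2*(1-q)^2
        assert L1 > L2  # the misclassified prediction always has the larger loss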
I'd like to share my understanding of the MSE and binary cross-entropy functions.
In the case of classification, we take the argmax of the predicted probabilities for each instance.
Now, consider an example of a binary classifier where the model predicts the probabilities as [0.49, 0.51]. In this case, the model will return 1 as the prediction.
Now, assume that the actual label is also 1.
In such a case, if MSE is computed on that hard prediction, it will return 0 as the loss value, whereas binary cross-entropy, computed on the probabilities, will return some "tangible" value.
And if the trained model somehow predicts this kind of probability for all data samples, then binary cross-entropy effectively returns a big accumulated loss value, whereas MSE returns 0.
According to MSE it is a perfect model, but in reality it is not that good a model, which is why we should not use MSE for classification.

High AUC but bad predictions with imbalanced data

I am trying to build a classifier with LightGBM on a very imbalanced dataset. Imbalance is in the ratio 97:3, i.e.:
Class
0 0.970691
1 0.029309
The params I used and the code for training are shown below.
import lightgbm as lgb
import numpy as np

lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.1,
    'is_unbalance': 'true',  # because the training data is unbalanced (replaced with scale_pos_weight)
    'num_leaves': 31,  # we should let it be smaller than 2^(max_depth)
    'max_depth': 6,  # -1 means no limit
    'subsample': 0.78
}

# Cross-validate
cv_results = lgb.cv(lgb_params, dtrain, num_boost_round=1500, nfold=10,
                    verbose_eval=10, early_stopping_rounds=40)
nround = cv_results['auc-mean'].index(np.max(cv_results['auc-mean']))
print(nround)

model = lgb.train(lgb_params, dtrain, num_boost_round=nround)

preds = model.predict(test_feats)
preds = [1 if x >= 0.5 else 0 for x in preds]
I ran CV to get the best model and the best round. I got 0.994 AUC on CV and a similar score on the validation set.
But when I predict on the test set I am getting very bad results. I am sure that the train set is sampled correctly.
What parameters need to be tuned? What is the reason for the problem? Should I resample the dataset so that the majority class is reduced?
The issue is that, despite the extreme class imbalance in your dataset, you are still using the "default" threshold of 0.5 when deciding the final hard classification in
preds = [1 if x >= 0.5 else 0 for x in preds]
This should not be the case here.
This is a rather big topic, and I strongly suggest you do your own research (try googling for threshold or cut off probability imbalanced data), but here are some pointers to get you started...
From a relevant answer at Cross Validated (emphasis added):
Don't forget that you should be thresholding intelligently to make predictions. It is not always best to predict 1 when the model probability is greater 0.5. Another threshold may be better. To this end you should look into the Receiver Operating Characteristic (ROC) curves of your classifier, not just its predictive success with a default probability threshold.
From a relevant academic paper, Finding the Best Classification Threshold in Imbalanced Classification:
2.2. How to set the classification threshold for the testing set
Prediction results are ultimately determined according to prediction probabilities. The threshold is typically set to 0.5. If the prediction probability exceeds 0.5, the sample is predicted to be positive; otherwise, negative. However, 0.5 is not ideal for some cases, particularly for imbalanced datasets.
The post Optimizing Probability Thresholds for Class Imbalances from the (highly recommended) Applied Predictive Modeling blog is also relevant.
Take home lesson from all the above: AUC is seldom enough, but the ROC curve itself is often your best friend...
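As a deliberately minimal sketch of that advice, one common recipe is to pick the threshold from the ROC curve on a held-out validation set, e.g. by maximizing Youden's J = TPR − FPR; the names X_val and y_val below are assumptions, not part of the original code:

import numpy as np
from sklearn.metrics import roc_curve

probs = model.predict(X_val)                 # LightGBM returns P(class 1) for the binary objective
fpr, tpr, thresholds = roc_curve(y_val, probs)
best_thr = thresholds[np.argmax(tpr - fpr)]  # threshold maximizing Youden's J
preds = [1 if p >= best_thr else 0 for p in probs]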
On a more general level regarding the role of the threshold itself in the classification process (which, according to my experience at least, many practitioners get wrong), check also the Classification probability threshold thread (and the provided links) at Cross Validated; key point:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
