Precision, Recall, F-score requiring equal inputs - python

I am computing precision, recall, and F-score with scikit-learn, using:
from sklearn.metrics import precision_score
Then:
y_true = np.array(["one", "two", "three"])
y_pred = np.array(["one", "two"])
precision = precision_score(y_true, y_pred, average=None)
print(precision)
The error returned is:
ValueError: Found input variables with inconsistent numbers of samples: [3, 2]
Given that the input arrays have different lengths, why does scikit-learn require an equal number of inputs? Particularly when evaluating recall, which I would have thought involves making more guesses than there are answers.
I could implement my own metrics, or just truncate the arrays so they match, but I want to be sure there is no underlying reason why I should not.

It really depends on what y_true and y_pred mean in your case. Generally, though, y_true is a vector indicating the true value for every element of y_pred. That does not seem to be your case, and to use scikit-learn's metrics you would need to put your data in that format.
So in the case of binary classification, precision will be:
# assuming y_true and y_pred are 0/1 arrays of the same length
correct_classifications = (y_true == y_pred).astype(int)
precision = sum(y_pred * correct_classifications) / sum(y_pred)
Here you see that you need y_true and y_pred to be the same length.
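For instance, a minimal sketch with made-up 0/1 labels (not your string labels) showing the manual formula and scikit-learn's precision_score agreeing once the arrays are aligned one prediction per true label:

import numpy as np
from sklearn.metrics import precision_score

# Hypothetical 0/1 labels, one prediction per true label, aligned by position.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

correct_classifications = (y_true == y_pred).astype(int)
manual_precision = sum(y_pred * correct_classifications) / sum(y_pred)

print(manual_precision)                  # 0.666...
print(precision_score(y_true, y_pred))   # 0.666... as well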

That is quite simply because sklearn is playing it safe here.
It generally doesn't make sense not to produce predictions for 100% of the test set.
Let's say you have 1M data points in your dataset but you only predict 200k of them: are those the first 200k points? The last? Spread all over? How would the library know which prediction matches which label?
You have to have a 1:1 correspondence at the input of the metrics calculation. If you don't have predictions for some points, throw them out, as sketched below (but make sure you know why those predictions are missing in the first place, in case it is a problem with your pipeline). You don't want to claim 100% recall at 1% precision when in the end you only produced predictions for 10% of the dataset.
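If you do throw out unpredicted points, a minimal sketch might look like this (hypothetical labels, with np.nan standing in for missing predictions); the key point is to report the coverage alongside the metrics:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels; np.nan marks points the pipeline never predicted.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, np.nan, 1, np.nan, 0])

mask = ~np.isnan(y_pred)    # keep only the points that were actually predicted
coverage = mask.mean()      # fraction of the dataset you predicted on (here 4 of 6)

precision = precision_score(y_true[mask], y_pred[mask].astype(int))
recall = recall_score(y_true[mask], y_pred[mask].astype(int))
print(precision, recall, coverage)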


How do I calculate the naive accuracy from a confusion matrix?

[image: confusion matrix]
I have an issue where I'm trying to compute the test accuracy for a naive classifier that always predicts ŷ = −1.
I have already calculated the test accuracy of the classifier based on the confusion matrix attached above, using (TN + TP)/n. But how do I calculate the naive value?
accuracy = (109112+3805)/127933
naive_accuracy = # TODO: Compute the accuracy of the naive classifier
It is actually the same formula. You just have to notice that your naive classifier never gives positive answers, so TP = 0, and TN will be equal to the total number of negatives: TN = 123324.
So naive_accuracy = (TN + TP)/n = (123324 + 0)/127933.
And yes, this is a case where the naive classifier actually shows better accuracy than the classifier behind the confusion matrix you are referring to. This is due to the class imbalance problem: there are roughly 30 times more negative examples than positive ones, which is why accuracy is not a suitable metric in this setting. Please look at precision, recall, and F-score if you need a meaningful result.
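Putting the numbers quoted above into a small sketch:

n = 127933                   # total number of samples
TP, TN = 3805, 109112        # from the confusion matrix in the question
negatives = 123324           # total number of true negatives in the data

accuracy = (TN + TP) / n               # ~0.883
naive_accuracy = (negatives + 0) / n   # naive classifier: TP = 0, TN = all negatives, ~0.964
print(accuracy, naive_accuracy)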

precision_score warning results in score = 0 (sklearn)

I am using precision_score in sklearn to evaluate the result of an outlier detection algorithm.
I trained on one class only and predict on unseen data, so the label for that one class is just 0 all the way.
I have found the following:
There are two columns, truth and predicted.
(I used a label encoder to tidy up the numbers: Local Outlier Factor outputs 1 for inliers and -1 for outliers, so I encoded them as 0s and 1s, and did the same for the truth.)
However, the evaluation reports an accuracy of 1 but a precision of 0, even though it can be clearly seen that the predicted values match the truth completely. I would expect to get a score of 1 for both. It also comes with the warning below:
What should I do, or what should I read, to mitigate this issue?
The documentation explains that with only two classes, it treats it as a binary problem. Precision is about true positives (guessing 1 when the answer is 1). You don’t have any—just true negatives (guessing 0 when the answer is 0).
If you’re really unhappy with that outcome, you can use the zero_division argument:
precision_score(truth, predicted, zero_division=1)
That way, you’ll get the 1 you want.
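For illustration, here is a minimal reproduction with made-up all-negative labels showing both behaviours:

from sklearn.metrics import precision_score

truth     = [0, 0, 0, 0]
predicted = [0, 0, 0, 0]

# No predicted positives, so precision is undefined; sklearn warns and returns 0.0.
print(precision_score(truth, predicted))
# With zero_division=1 the undefined case is reported as 1.0 instead.
print(precision_score(truth, predicted, zero_division=1))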

How can I predict the expected value and the variance simultaneously with a neural network?

I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network that has two outputs - the first output should approximate the deterministic part - the function, and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there were an example in Python for TensorFlow, but I'm also interested in general answers. I'm also not quite clear how I could write something like that in Python code; none of the examples I found so far show how to address individual outputs from the loss function.)
You can use dropout for that. With a dropout layer you can make several different predictions based on different settings of which nodes dropped out. Then you can simply count the outcomes and interpret the result as a measure for uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." International Conference on Machine Learning, 2016.
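A minimal Monte-Carlo-dropout sketch along those lines (the toy model, input shapes, and sample count are assumptions; the key point is calling the model with training=True so dropout stays active at prediction time):

import numpy as np
import tensorflow as tf

# Toy model; in practice you would train it first.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),
])

x = np.random.rand(32, 10).astype("float32")   # hypothetical inputs
samples = np.stack([model(x, training=True).numpy() for _ in range(100)])

pred_mean = samples.mean(axis=0)   # point estimate per input
pred_std = samples.std(axis=0)     # spread across dropout samples, used as an uncertainty estimate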
Since I found nothing simple to implement, I wrote something myself that models this explicitly: a custom loss function that tries to predict mean and variance. It seems to work, but I'm not quite sure how well it works out in practice, and I'd appreciate feedback. This is my loss function:
import tensorflow as tf
from tensorflow.keras import backend as K

def meanAndVariance(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Loss that has the even-indexed values in the last axis of y_pred
    approximate each value in y_true, and the odd-indexed values approximate
    the variance of that prediction."""
    y_pred = tf.convert_to_tensor(y_pred)
    y_true = tf.cast(y_true, y_pred.dtype)
    mean = y_pred[..., 0::2]       # even positions: predicted means
    variance = y_pred[..., 1::2]   # odd positions: predicted variances
    res = K.square(mean - y_true) + K.square(variance - K.square(mean - y_true))
    return K.mean(res, axis=-1)
The output dimension is twice the label dimension: the mean and variance of each value in the label. The loss consists of two parts: a mean squared error term that has the mean output approximate the label value, and a second term that has the variance output approximate the squared difference between the label and the predicted mean.
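A minimal usage sketch (the layer sizes and input dimension are made up): the network's last layer must produce 2 * label_dim values, interleaved as mean/variance pairs, and the model is compiled with the custom loss above:

import tensorflow as tf

label_dim = 1   # dimension of the target
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(2 * label_dim),   # even outputs: means, odd outputs: variances
])
model.compile(optimizer="adam", loss=meanAndVariance)
# model.fit(x_train, y_train, ...) with y_train of shape (n_samples, label_dim)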
When using dropout to estimate the uncertainty (or any other stochastic regularization method), make sure to also check out our recent work on a sampling-free approximation of Monte-Carlo dropout.
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow your idea: treat the activations as random variables and then propagate mean and variance to the output layer using error propagation. Consequently, we obtain two outputs, the mean and the variance.

How to use F-score as error function to train neural networks?

I am pretty new to neural networks. I am training a network in TensorFlow, but the number of positive examples is much, much smaller than the number of negative examples in my dataset (it is a medical dataset).
So, I know that F-score calculated from precision and recall is a good measure of how well the model is trained.
I have used error functions like cross-entropy loss or MSE before, but they are all based on accuracy calculation (if I am not wrong). But how do I use this F-score as an error function? Is there a tensorflow function for that? Or I have to create a new one?
Thanks in advance.
It appears that approaches for optimising directly for these types of metrics have been devised and used successfully, improving scores and/or training times:
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77289
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/70328
https://www.kaggle.com/rejpalcz/best-loss-function-for-f1-score-metric
One such method involves using sums of probabilities, in place of counts, for the true positives, false positives, and false negatives. For example, the F-beta loss (the generalisation of F1) can be calculated with PyTorch as follows:
def forward(self, y_logits, y_true):
    # self.sigmoid, self.beta, and self.epsilon are attributes of the surrounding loss module.
    y_pred = self.sigmoid(y_logits)
    # "Soft" counts: sums of predicted probabilities instead of hard 0/1 counts.
    TP = (y_pred * y_true).sum(dim=1)
    FP = (y_pred * (1 - y_true)).sum(dim=1)
    FN = ((1 - y_pred) * y_true).sum(dim=1)
    fbeta = (1 + self.beta**2) * TP / ((1 + self.beta**2) * TP + (self.beta**2) * FN + FP + self.epsilon)
    fbeta = fbeta.clamp(min=self.epsilon, max=1 - self.epsilon)
    return 1 - fbeta.mean()
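Since forward relies on self.sigmoid, self.beta, and self.epsilon being defined, here is a self-contained sketch wrapping it into a module (the class name FBetaLoss and the default hyperparameters are my own assumptions):

import torch
import torch.nn as nn

class FBetaLoss(nn.Module):
    """Soft F-beta loss: sums of predicted probabilities replace hard TP/FP/FN counts."""
    def __init__(self, beta: float = 1.0, epsilon: float = 1e-7):
        super().__init__()
        self.beta = beta          # beta > 1 favours recall, beta < 1 favours precision
        self.epsilon = epsilon    # guards against division by zero
        self.sigmoid = nn.Sigmoid()

    def forward(self, y_logits, y_true):
        y_pred = self.sigmoid(y_logits)
        TP = (y_pred * y_true).sum(dim=1)
        FP = (y_pred * (1 - y_true)).sum(dim=1)
        FN = ((1 - y_pred) * y_true).sum(dim=1)
        fbeta = (1 + self.beta**2) * TP / (
            (1 + self.beta**2) * TP + (self.beta**2) * FN + FP + self.epsilon)
        fbeta = fbeta.clamp(min=self.epsilon, max=1 - self.epsilon)
        return 1 - fbeta.mean()

# Usage with hypothetical logits and multi-hot targets of shape (batch, n_classes):
loss_fn = FBetaLoss(beta=1.0)
loss = loss_fn(torch.randn(8, 5), torch.randint(0, 2, (8, 5)).float())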
An alternative method is described in this paper:
https://arxiv.org/abs/1608.04802
The approach taken optimises for a lower bound on the statistic. Other metrics such as AUROC and AUCPR are also discussed. An implementation in TF of such an approach can be found here:
https://github.com/tensorflow/models/tree/master/research/global_objectives
I think you are confusing model evaluation metrics for classification with training losses.
Accuracy, precision, F-scores etc. are evaluation metrics computed from binary outcomes and binary predictions.
For model training, you need a function that compares a continuous score (your model output) with a binary outcome - like cross-entropy. Ideally, this is calibrated such that it is minimised if the predicted mean matches the population mean (given covariates). These rules are called proper scoring rules, and the cross-entropy is one of them.
Also check the thread is-accuracy-an-improper-scoring-rule-in-a-binary-classification-setting
If you want to weigh positive and negative cases differently, two methods are:
Oversample the minority class and correct the predicted probabilities when predicting on new examples. For fancier methods, check the under-sampling module of imbalanced-learn to get an overview.
Use a different proper scoring rule as the training loss. This allows you, e.g., to build in asymmetry in how you treat positive and negative cases while preserving calibration. Here is a review of the subject.
In practice I recommend just using simple oversampling, as in the sketch below.
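A minimal sketch of the oversampling route with imbalanced-learn (the toy dataset stands in for your real features and labels):

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Toy imbalanced data standing in for your real X, y.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Duplicate minority-class samples until the classes are balanced.
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
# Train on (X_resampled, y_resampled) with an ordinary proper scoring rule (e.g. cross-entropy),
# and remember to correct the predicted probabilities when scoring new, unsampled data.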
The loss value and the accuracy are different concepts: the loss value is used to train the NN, whereas accuracy (or other metrics) is used to evaluate the result of training.

How do sklearn SGDClassifier model thresholds relate to model scores?

I've trained a model and identified a 'threshold' that I'd like to deploy it at, but I'm having trouble understanding how the threshold relates to the score.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_curve

X = labeled_data[features].reset_index(drop=True)
Y = np.array(labeled_data['fraud'].reset_index(drop=True))
# (train/test etc.; settle on an acceptable model)
grad_des = SGDClassifier(alpha=alpha_optimum, l1_ratio=l1_optimum, loss='log')
grad_des.fit(X, Y)
score_Y = grad_des.predict_proba(X)
precision, recall, thresholds = precision_recall_curve(Y, score_Y[:, 1])
Alright, so now I plot precision and recall against threshold and decide I want my threshold to be 0.4.
What is the threshold?
My model coefficients, which I understand 'score' events by computing coefficients['x'] * event_values['x'] and summing over the features, add up to 29. The threshold, however, is between 0 and 1.
How am I to understand the translation from the threshold to what is, I guess, a raw score? Would an event with a 1 for all features (all are binary) have a calculated score of 29, since that is the sum of all coefficients?
Do I need to compute this 'raw' score for all events and then plot precision against that instead of against the threshold?
Edit and Update:
So my question hinged on a lack of understanding of the logistic function, as Mikhail Korobov pointed out below. Regardless of the 'raw score', the logistic function forces the value into the [0, 1] range.
In order to 'unwrap' that value back into the 'raw score' I was looking for, I can compute scipy.special.logit(0.8) - grad_des.intercept_, which returns the 'score' of the row.
Probabilities are not just coefficients['x'] * event_values['x']: a logistic function is applied to these raw scores to get probability values in the [0, 1] range.
The predict_proba method returns these probabilities; they are in the range [0, 1].
To get a concrete yes/no prediction, one has to choose a probability threshold. An obvious and sane choice is 0.5: if the probability is greater than 0.5, predict "yes", otherwise predict "no". This is what the .predict() method does.
precision_recall_curve tries different probability thresholds and computes precision and recall for each of them. If, based on those precision and recall scores, you believe some other threshold is better for your application, you can use it instead of 0.5, e.g. bool_prediction = score_Y[:, 1] > threshold.
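To make the mapping explicit, here is a small sketch (building on the fitted grad_des and X from the question) relating the raw linear score, the probability, and a custom threshold of 0.4:

from scipy.special import expit, logit

# Raw linear score: sum of coefficient * feature value, plus the intercept.
raw_scores = grad_des.decision_function(X)

# The logistic function maps the raw score into [0, 1]; this reproduces predict_proba.
probs = expit(raw_scores)          # equals grad_des.predict_proba(X)[:, 1]

# Apply a custom probability threshold instead of the default 0.5:
threshold = 0.4
bool_prediction = probs > threshold

# The equivalent cutoff on the raw score is simply logit(threshold);
# the question's edit subtracts grad_des.intercept_ to recover the coefficient-only part.
raw_threshold = logit(threshold)   # compare raw_scores > raw_threshold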
