precision_score warning results in score = 0 in sklearn - Python

I am using precision_score in sklearn to evaluate the result of the outlier detection algorithm.
I trained on one class only and predict on unseen data, so the truth label is 0 throughout.
I have found the following:
There are two columns, truth and predicted.
(I used a label encoder to tidy the values: Local Outlier Factor outputs 1 for inliers and -1 for outliers, so I encoded them as 0s and 1s, and did the same for the truth.)
However, the accuracy comes out as 1 while the precision is 0, even though the predicted values clearly match the truth completely. I would expect a score of 1 for both metrics. It also comes with the warning below:
What should I do, or are there any links I should read, to mitigate this issue?

The documentation explains that with only two classes, precision_score treats the problem as binary. Precision is about true positives (predicting 1 when the answer is 1). You don't have any of those, only true negatives (predicting 0 when the answer is 0).
If you’re really unhappy with that outcome, you can use the zero_division argument:
precision_score(truth, predicted, zero_division=1)
That way, you’ll get the 1 you want.
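For illustration, here is a minimal sketch with toy data (not the asker's actual columns) showing how the two metrics diverge and how zero_division changes the result:
from sklearn.metrics import accuracy_score, precision_score

# Toy data: only the negative class (0) appears, and the predictions match exactly.
truth     = [0, 0, 0, 0, 0]
predicted = [0, 0, 0, 0, 0]

print(accuracy_score(truth, predicted))                    # 1.0
print(precision_score(truth, predicted))                   # 0.0, plus an UndefinedMetricWarning
print(precision_score(truth, predicted, zero_division=1))  # 1.0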

Related

How can I use the test_proportion data in a machine learning model?

I have data with 4000 CNN features, and it is a binary classification problem. All I know about the test data is the proportions of 1 and 0. How can I tell my model to predict the test labels using these proportions? (Is there a way to say: in order to reach these proportions, I will label this instance 0?)
How can I use this to increase accuracy? In my case the training data consists mostly of 1s (85%) and 0s (15%).
However, in my test data the proportion of 1s is given as 38%, so it is very different from the training data.
I worked a bit on balancing the data and it helped. However, my model still predicts 1 for nearly all of the data. It may also be due to an adaptation problem.
As #birdwatch suggested, I decreased the threshold for the 0 class and tried to increase the count of 0 labels in the predictions.
# Predicting the Test set results
y_pred = classifier.predict_proba(X_test)
threshold = 0.3
y_pred[:, 0] = (y_pred[:, 0] < threshold).astype('int')
Before, the class counts were as follows:
1 : 8906
0 : 2968
After changing the threshold they are:
1 : 3221
0 : 8653
However, is there any other way to use the test proportions that guarantees the result?
There isn't any sensible way to do that; it would create a strange bias in the model. One thing you could do is accept the less likely outcome only if it has a high enough score. Normally you'd use a 0.5 threshold, but here you might take e.g. 0.7.
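A minimal sketch of that suggestion, assuming classifier and X_test are the fitted model and test features from the question, and that classifier.classes_ is [0, 1]:
import numpy as np

proba = classifier.predict_proba(X_test)   # column 0 = P(class 0), column 1 = P(class 1)

# Accept the less likely class (0) only when its score clears a stricter bar than 0.5.
threshold = 0.7
y_pred = np.where(proba[:, 0] >= threshold, 0, 1)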

What is the negative mean absolute error in scikit-learn?

I am trying to train a model using scikit-learn's SVM module. For scoring, I could not find mean_absolute_error (MAE); however, negative_mean_absolute_error (NMAE) does exist. What is the difference between these two metrics? Let's say I get the following results for two models:
model 1 (NMAE = -2.6), model 2 (NMAE = -3.0)
Which model is better? Is it model 1?
Moreover, how does the negative compare to the positive? Say the following:
model 1 (NMAE = -1.7), model 2 (MAE = 1.4)
Here, which model is better?
As its name implies, negative MAE is simply the negative of the MAE, which is by definition a non-negative quantity. And since MAE is an error metric (i.e. the lower the better), negative MAE is the opposite: a value of -2.6 is better than a value of -3.0.
Just remove the negative signs and treat them as MAE values (which arguably also answers your second question).
Keep in mind that MAE is always available in scikit-learn as a general metric (docs).
I would like to add that this negated error is also helpful for finding the best algorithm when you are comparing multiple algorithms through GridSearchCV().
This is because, after training, GridSearchCV() ranks all the algorithms (estimators) and tells you which one is best. sklearn always ranks the estimator with the higher score higher, which would be misleading for error metrics such as MAE (along with MSE and a few others), where lower is better.
To deal with this, the library flips the sign of the error, so the highest MAE is ranked lowest and vice versa.
So to answer your question: -2.6 is better than -3.0 because the actual MAE is 2.6 and 3.0 respectively.
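As a small illustration of the sign convention (the toy data and SVR model here are chosen just for the sketch):
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# The scorer is registered as 'neg_mean_absolute_error': higher (closer to zero) is better,
# so model selection can always maximize the score.
scores = cross_val_score(SVR(), X, y, scoring="neg_mean_absolute_error", cv=5)
print(scores)          # negative values
print(-scores.mean())  # flip the sign to report the usual MAE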

Text classification for multiple labels

I am doing text classification with a convolutional neural network. I used health documents (ICD-9-CM codes) for my project, and I used the same model as dennybritz, but my data has 36 labels. I used one-hot encoding to encode my labels.
Here is my problem: when I run data that has one label per document, the accuracy is good, ranging from 0.8 to 1. If I run data that has more than one label, the accuracy is significantly reduced.
For example: a document with a single label, "782.0": [0 0 1 0 ... 0],
a document with multiple labels, "782.0 V13.09 593.5": [1 0 1 0 ... 1].
Could anyone suggest why this happens and how to improve it?
The label encoding seems correct. If you have multiple correct labels, [1 0 1 0 ... 1] looks totally fine. The loss function used in Denny's post is tf.nn.softmax_cross_entropy_with_logits, which is a loss function for multi-class (single-label) problems:
Computes softmax cross entropy between logits and labels. Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class).
In a multi-label problem, you should use tf.nn.sigmoid_cross_entropy_with_logits:
Computes sigmoid cross entropy given logits.
Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time.
The input to the loss function would be logits (WX) and targets (labels).
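A sketch of the suggested change in the same TF1 style as the post; the attribute names self.scores and self.input_y follow Denny's model and are assumptions here:
losses = tf.nn.sigmoid_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
# one independent loss per (example, label) pair, averaged over the batch and labels
self.loss = tf.reduce_mean(losses)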
Fix the accuracy measure
In order to measure the accuracy correctly for a multi-label problem, the code below needs to be changed.
# Calculate Accuracy
with tf.name_scope("accuracy"):
    correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
The logic of correct_predictions above is incorrect when you can have multiple correct labels. For example, say num_classes=4 and labels 0 and 2 are correct, so your input_y=[1, 0, 1, 0]. Then tf.argmax(self.input_y, 1) has to break the tie between index 0 and index 2. I am not sure how tf.argmax breaks ties, but if it does so by choosing the smaller index, a prediction of label 2 is always considered wrong, which definitely hurts your accuracy measure.
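One possible fix, sketched here under the assumption that self.scores holds the logits and that a 0.5 sigmoid threshold decides each label independently (this is not Denny's code):
with tf.name_scope("accuracy"):
    # threshold each label independently instead of taking a single argmax
    predicted_labels = tf.cast(tf.sigmoid(self.scores) >= 0.5, tf.float32)
    correct_predictions = tf.equal(predicted_labels, self.input_y)
    # element-wise accuracy over all (example, label) pairs
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")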
In a multi-label problem, precision and recall are actually better metrics than accuracy. You can also consider using precision@k (tf.nn.in_top_k) to report classifier performance.

Scikit learn linear regression predicting labels

I am trying to use scikit-learn to perform linear regression on labeled time-series data.
My data format is data = (timestamp, value, label).
The labels that are assigned to my data are either 0 or 1.
I tried to follow this example from the scikit-learn website.
My questions:
1 - Where are the labels of the training data in the example? Are they in diabetes_y_train?
2 - What are the return values of the method predict()? In my code, it returns an array of n_samples predicted values in the range [0, 1]. However, I expected it to return binary values of either 0 or 1 (no intermediate values).
1 - diabetes_y_train contains the labels for the training set.
2 - You are using a regression function, so it is correct that you get continuous values. If you want binary output, you are not solving a regression problem but a classification one: you can either set a threshold to discretise the predictions or use one of the classifiers offered by sklearn.
1 - Yes
2 - predict() calculates a floating-point number, because the example is trying to predict a floating-point value, not a binary one. So there is no yes/no answer, just a predicted value; to estimate the error, the differences are squared and averaged in np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2).
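A minimal sketch of the two options from the first answer, with X_train, y_train and X_test standing in for the asker's own split:
from sklearn.linear_model import LinearRegression, LogisticRegression

# Option 1: keep the regression and discretise its continuous output with a threshold.
regr = LinearRegression().fit(X_train, y_train)
y_pred = (regr.predict(X_test) >= 0.5).astype(int)

# Option 2: treat it as a classification problem and use a classifier directly.
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)   # already 0 or 1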

How does sklearn calculate the area under the roc curve for two binary inputs?

I noticed that sklearn has the following function:
sklearn.metrics.roc_auc_score()
which takes as input ground_truth and prediction.
For example,
ground_truth = [1,1,0,0,0]
prediction = [1,1,0,0,0]
sklearn.metrics.roc_auc_score(ground_truth, prediction) returns 1
My problem is that I can't figure out how sklearn calculates the area under the ROC curve with two binary inputs. Isn't the ROC curve derived by moving the class assignment threshold, and calculating the false alarm and hit rate for each threshold? With two binary inputs, shouldn't you only have one (false alarm, hit rate) measurement?
Many thanks!
You're correct that with binary predictions you'll only have a single threshold/measurement for the curve. I didn't understand it myself, so I ran the code with a ton of print statements, both for the sklearn tutorial and then with a purely binary example. All the magic happens in sklearn.metrics._binary_clf_curve.
The "thresholds" are distinct prediction scores. For any binary classifier that outputs purely ones and zeros you're going to get two thresholds - 1 and 0 (they're sorted internally from highest to lowest). At the 1 threshold, a prediction score of >=1 is true and anything below that (only 0 in this case) is considered false, and the TP and FP rates are calculated from that. In all cases, the last threshold categorizes everything as true so the TP and FP rates will both be 1.
It appears then that to generate a correct ROC curve for a sklearn classifier you'd use clf.predict_proba() rather than predict(). Or maybe predict_log_proba()? I'm not sure whether it would make any difference.
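A small sketch contrasting the two kinds of input (the hard predictions are the toy data from the question; clf, X_test and y_test are placeholders):
from sklearn.metrics import roc_auc_score

ground_truth = [1, 1, 0, 0, 0]
hard_predictions = [1, 1, 0, 0, 0]                       # binary labels: only two distinct "scores"
print(roc_auc_score(ground_truth, hard_predictions))     # 1.0

# With a fitted classifier, pass the score of the positive class instead:
# scores = clf.predict_proba(X_test)[:, 1]
# print(roc_auc_score(y_test, scores))
# predict_log_proba would give the same AUC, because log is monotonic and
# ROC/AUC only depend on the ranking of the scores.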
