I am trying to train a model using scikit-learn's SVM module. For the scoring, I could not find mean_absolute_error (MAE); however, neg_mean_absolute_error (NMAE) does exist. What is the difference between these two metrics? Let's say I get the following results for two models:
model 1 (NMAE = -2.6), model 2 (NMAE = -3.0)
Which model is better? Is it model 1?
Moreover, how does a negative value compare to a positive one? Say I have the following:
model 1 (NMAE = -1.7), model 2 (MAE = 1.4)
Here, which model is better?
As its name implies, negative MAE is simply the negative of the MAE, which by definition is a non-negative quantity. And since MAE is an error metric (i.e. the lower the better), negative MAE is the opposite: a value of -2.6 is better than a value of -3.0.
Just remove the negative signs and treat them as MAE values (which arguably also answers your second question).
Keep in mind that MAE is always available in scikit-learn as a general metric (docs).
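A minimal sketch of the relationship between the scorer and the metric (SVR and synthetic data are used here purely for illustration):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# 'neg_mean_absolute_error' returns the *negative* MAE, so values closer to 0 are better;
# flipping the sign recovers the plain MAE.
nmae_scores = cross_val_score(SVR(), X, y, cv=5, scoring='neg_mean_absolute_error')
print(nmae_scores.mean(), (-nmae_scores).mean())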
I would like to add that this negative error is also helpful for finding the best algorithm when you are comparing multiple algorithms through GridSearchCV().
This is because, after training, GridSearchCV() ranks all the algorithms (estimators) and tells you which one is the best. sklearn always ranks the estimator with the higher score higher, which is not what you want for an error function such as MAE (along with MSE and a few others).
To deal with this, the library flips the sign of the error, so the highest MAE is ranked lowest and vice versa.
So to answer your question: -2.6 is better than -3.0 because the actual MAE values are 2.6 and 3.0, respectively.
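For the GridSearchCV() case, here is a hedged sketch (reusing X and y from the snippet above) showing that the best-ranked candidate is the one with the least-negative score, i.e. the lowest MAE:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

grid = GridSearchCV(SVR(), param_grid={'C': [0.1, 1, 10]},
                    scoring='neg_mean_absolute_error', cv=5)
grid.fit(X, y)
# best_score_ is negative; -grid.best_score_ is the MAE of the best candidate
print(grid.best_params_, grid.best_score_)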
Related
I am using precision_score in sklearn to evaluate the results of an outlier detection algorithm.
I trained on one class only and predict on unseen data, so the label for the one class is just 0 all the way.
I have found the following:
There are two columns, truth and predicted.
(I used a label encoder to tidy up the numbers: Local Outlier Factor outputs 1 for inliers and -1 for outliers, so I encoded them into 0s and 1s, and did the same for the truth.)
However, the accuracy score comes back as 1 while the precision score is 0, even though the predictions clearly match the truth completely. I would expect both scores to be 1. The call also comes with the warning below:
What should I do, or are there any links I should read, to mitigate this issue?
The documentation explains that with only two classes, precision_score treats the problem as binary. Precision is about true positives (predicting 1 when the answer is 1). You don't have any of those; you only have true negatives (predicting 0 when the answer is 0).
If you’re really unhappy with that outcome, you can use the zero_division argument:
precision_score(truth, predicted, zero_division=1)
That way, you’ll get the 1 you want.
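Here is a minimal reproduction with hypothetical all-zero data, showing both behaviours:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

truth = np.zeros(10, dtype=int)      # hypothetical: every true label is 0
predicted = np.zeros(10, dtype=int)  # hypothetical: every prediction is 0

print(accuracy_score(truth, predicted))                    # 1.0
print(precision_score(truth, predicted))                   # 0.0, plus an UndefinedMetricWarning
print(precision_score(truth, predicted, zero_division=1))  # 1.0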
I used GridSearchCV to do cross-validation across k folds to tune my hyperparameters. The mean results, which should be the mean over the individual folds, look wrong in the cv_results_ attribute. Here is my code:
gscv = GridSearchCV(n_jobs=n_jobs, cv=train_test_iterable, estimator=pipeline,
                    param_grid=param_grid, verbose=10,
                    scoring=['accuracy', 'precision', 'recall', 'f1'], refit='f1',
                    return_train_score=return_train_score, error_score=error_score)
gscv.fit(X, Y)
gscv.cv_results_
cv_results_ contains the following JSON (displayed here as a table):
mean_test_f1   split0_test_f1   split1_test_f1   Actual Mean
0.934310796    0.935603198      0.933665455      0.934634326
0.931279716    0.908430118      0.942689316      0.925559717
0.927683609    0.912005672      0.935512149      0.923758911
0.680908006    0.741198823      0.650802701      0.696000762
0.680908006    0.741198823      0.650802701      0.696000762
0.646005028    0.684483208      0.626791532      0.65563737
0.840273248    0.847484083      0.836672627      0.842078355
0.837160828    0.847484083      0.832006068      0.839745075
0.833637       0.842109375      0.829406448      0.835757911
You can see above that mean_test_f1 is not the mean of the two folds split0_test_f1 and split1_test_f1; the actual mean is shown in the last column.
Note: f1 refers to the F1 score.
Has anyone faced a similar issue?
I think what you're seeing is a weighted mean, not a simple average.
Try setting iid=False in GridSearchCV(...) and compare.
According to the documentation:
iid : boolean, default=True
If True, the data is assumed to be identically distributed across
the folds, and the loss minimized is the total loss per sample,
and not the mean loss across the folds.
So when iid is True (the default), the averaging of test scores includes per-fold weights, as specified here in the source code:
_store('test_%s' % scorer_name, test_scores[scorer_name],
splits=True, rank=True,
weights=test_sample_counts if iid else None)
Please note that train scores are not affected by this, so also cross-check the mean of the train scores.
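A small numpy sketch of that weighting, using the first row of the table above and hypothetical fold sizes (the real test_sample_counts come from your train_test_iterable):
import numpy as np

split_scores = np.array([0.935603198, 0.933665455])
test_sample_counts = np.array([400, 600])  # hypothetical fold sizes

simple_mean = split_scores.mean()                                     # the "Actual Mean" column
weighted_mean = np.average(split_scores, weights=test_sample_counts)  # what iid=True reports
print(simple_mean, weighted_mean)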
I am currently trying to vary the threshold of a random forest classifier in order to plot a ROC curve. I was under the impression that the only way to do this for a random forest is through the class_weight parameter. I have been able to do this successfully, increasing and decreasing precision, recall, and the true positive and false positive rates; however, I am not sure what I am actually doing. Currently I have the following:
rfc = RandomForestClassifier(n_jobs=-1, oob_score=True, n_estimators=50,
                             max_depth=40, min_samples_split=100,
                             min_samples_leaf=80, class_weight={0: .4, 1: .9})
What are the .4 and .9 actually referring to? I thought it meant that 40% of the data set is 0s and 90% is 1s; however, this obviously makes no sense (it is over 100%). What is it actually doing? Thanks!
Class weights typically do not need to be normalised to sum to 1 (it is only the ratio of the class weights that matters, so requiring that they sum to 1 would not actually be a restriction).
So setting the class weights to 0.4 and 0.9 is equivalent to assuming a split of class labels in the data of 0.4 / (0.4 + 0.9) to 0.9 / (0.4 + 0.9), i.e. roughly 30% belonging to class 0 and 70% belonging to class 1.
An alternative way to view differing class weights is as a way of penalising mistakes on one class more strongly than on the other, while still assuming balanced numbers of labels in the data. In your example, it would be 9/4 times worse to misclassify a 1 as a 0 than it would be to misclassify a 0 as a 1.
The easiest way (in my experience) to vary the discrimination threshold of any scikit-learn classifier is to use predict_proba(). Rather than returning a single output class, this returns the probability of membership in each class (concretely, for a random forest it outputs the proportion of samples of each class in the leaf nodes reached during classification, averaged over all trees). Once you have these probabilities, it is easy to implement your own final classification step by comparing the probability for each class to some threshold that you can change.
probs = RF.predict_proba(X)  # shape: [num_samples, num_classes]
for t in range(0, 100):
    threshold = t / 100.0
    # for a binary problem, compare the positive-class probability to the threshold
    classes = (probs[:, 1] > threshold).astype(int)
    # further analysis here as desired
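Since the original goal was a ROC curve, note that roc_curve will sweep the thresholds for you once you have the positive-class probabilities. A hedged sketch, reusing RF and X from above and assuming y holds the true labels:
from sklearn.metrics import auc, roc_curve

probs_pos = RF.predict_proba(X)[:, 1]           # positive-class probabilities
fpr, tpr, thresholds = roc_curve(y, probs_pos)  # tries every distinct threshold
print("AUC:", auc(fpr, tpr))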
I've trained a model and identified a 'threshold' that I'd like to deploy it at, but I'm having trouble understanding how the threshold relates to the score.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_curve

X = labeled_data[features].reset_index(drop=True)
Y = np.array(labeled_data['fraud'].reset_index(drop=True))

# (train/test etc. ... settle on an acceptable model)
grad_des = SGDClassifier(alpha=alpha_optimum, l1_ratio=l1_optimum, loss='log')
grad_des.fit(X, Y)
score_Y = grad_des.predict_proba(X)
precision, recall, thresholds = precision_recall_curve(Y, score_Y[:, 1])
Alright, so now I plot precision and recall vs. threshold and decide I want my threshold to be 0.4.
What is threshold?
My model's coefficients, which I understand 'score' events by computing sum(coefficients[x] * event_values[x]), add up to 29, yet the threshold is between 0 and 1.
How am I to understand the translation from the threshold to what is, I guess, a raw score? Would an event with a 1 for every feature (all features are binary) have a calculated score of 29, since that is the sum of all the coefficients?
Do I need to compute this 'raw' score metric for all events and then plot that against precision instead of threshold?
Edit and Update:
So my question hinged on a lack of understanding of the logistic function, as Mikhail Korobov pointed out below. Regardless of the 'raw score', the logistic function forces the value into the [0, 1] range.
In order to 'unwrap' that value back into the 'raw score' I was looking for, I can do scipy.special.logit(0.8) - grad_des.intercept_, which returns the 'score' of the row.
Probabilities are not just coefficients['x'] * event_values['x']; a logistic function is applied to these raw scores to get probability values in the [0, 1] range.
The predict_proba method returns these probabilities; they are in the range [0, 1].
To get a concrete yes/no prediction, one has to choose a probability threshold. An obvious and sane choice is 0.5: if the probability is greater than 0.5, predict "yes", otherwise predict "no". This is what the .predict() method does.
precision_recall_curve tries different probability thresholds and computes precision and recall for them. If, based on the precision and recall scores, you believe some other threshold is better for your application, you can use it instead of 0.5, e.g. bool_prediction = score_Y[:, 1] > threshold.
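A small sketch of the relationship described above, reusing grad_des and X from the question and assuming a binary problem with logistic loss (scipy assumed to be available):
import numpy as np
from scipy.special import expit, logit

raw = grad_des.decision_function(X)  # sum(coef * x) + intercept, the unbounded "raw score"
prob = expit(raw)                    # the logistic function squashes it into [0, 1]
print(np.allclose(prob, grad_des.predict_proba(X)[:, 1]))  # True for binary logistic loss

# A probability threshold of 0.4 corresponds to a raw-score threshold of logit(0.4):
custom_prediction = (raw >= logit(0.4)).astype(int)  # same result as prob >= 0.4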
I'm using a linear regression model to predict weather data for one year. The prediction is done using Python's sklearn library. The problem is that I need to find the accuracy of the prediction. After a quick internet search I found that R^2 is the way to measure accuracy. I calculated the r value as follows:
r value: 0.0919309031356
Coefficients: [-20.01071429  0.]
Residual sum of squares: 19331.78
Variance score: -0.23
The problem is that I need to show the accuracy as a percentage. How do I do that? Do I need to use a tool to find out the accuracy?
Maybe this question is more complicated than I think, but why not just
r = str((r**2) * 100) + '%'
For regression problems, you can use the following metrics to determine the quality of the fit (http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics):
Mean squared error: the fit is good when the value is as low as possible.
R^2 score: the fit is good when the value is 1 or close to it.
You can also calculate prediction error with:
(Actual value - Predicted value)/Actual value.
However, I am not sure if this is a common metric to evaluate a linear regression fit.
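A short sketch of those metrics, assuming y_true and y_pred are hypothetical arrays holding the actual and predicted values from your fitted model:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# y_true, y_pred: hypothetical arrays of actual / predicted values
mse = mean_squared_error(y_true, y_pred)  # lower is better
r2 = r2_score(y_true, y_pred)             # 1.0 means a perfect fit

# The relative prediction error mentioned above, averaged and expressed as a
# percentage (essentially the mean absolute percentage error):
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100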