I'm using a linear regression model to predict one year of weather data. The prediction is done with Python's sklearn library. The problem is that I need to find the accuracy of the prediction. After a quick internet search I found out that r^2 is the usual way to measure it. I calculated the r value as follows:
r value
0.0919309031356
Coefficients:
[-20.01071429 0. ]
Residual sum of squares: 19331.78
Variance score: -0.23
The problem is that I need to show the accuracy as a percentage. How do I do that? Do I need to use a separate tool to find the accuracy?
Maybe this question is more complicated than I think, but why not just
r = str((r**2) * 100) + '%'
For regression problems, you can use the following metrics to determine the quality of the fit (http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics):
Mean squared error: the lower the value, the better the fit.
R^2 score: the fit is good when the value is 1 or close to it.
You can also calculate prediction error with:
(Actual value - Predicted value)/Actual value.
However, I am not sure if this is a common metric to evaluate a linear regression fit.
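As a minimal sketch (the y_true and y_pred arrays here are made-up example values), these metrics can be computed with sklearn.metrics:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# made-up example values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)      # lower is better
r2 = r2_score(y_true, y_pred)                 # 1.0 means a perfect fit
relative_error = (y_true - y_pred) / y_true   # per-sample prediction error

print(mse, r2, relative_error)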
[Confusion matrix image attached]
I have an issue where I'm trying to compute the test accuracy for a naive classifier that always predicts ŷ = -1.
I have already calculated the test accuracy of the classifier based on the confusion matrix attached above, using (TN + TP)/n. But how do I calculate the accuracy of the naive classifier?
accuracy = (109112+3805)/127933
naive_accuracy = # TODO: Compute the accuracy of the naive classifier
It is actually the same formula. You should just notice that your naive classifier never gives positive answers, so TP = 0. TN will be equal to the total number of negatives: TN = 123324.
So naive_accuracy = (TN + TP)/n = (123324 + 0)/127933.
And yes, this is a case where the naive classifier actually shows better accuracy than the one you computed from the confusion matrix. This is due to a class imbalance problem: there are roughly 30 times more negative examples than positive ones, which is why accuracy is not a useful metric in this setting. Please check out the precision, recall and F1-score metrics if you need a meaningful result.
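Plugging in the numbers (a small sketch using only the counts given above):

n = 127933
tp, tn = 3805, 109112                 # from the attached confusion matrix
accuracy = (tn + tp) / n              # ~0.883

# the naive classifier never predicts the positive class, so TP = 0
# and TN equals the total number of negatives
naive_tn = 123324
naive_accuracy = (naive_tn + 0) / n   # ~0.964, higher despite being useless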
I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming a Gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network that has two outputs - the first output should approximate the deterministic part (the function), and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there was an example with Python for TensorFlow, but I'm also interested in general answers. I'm also not quite clear how I could write something like that in Python code - none of the examples I found so far show how to address individual outputs from the loss function.)
You can use dropout for that. With a dropout layer you can make several different predictions for the same input, each with a different random set of nodes dropped out. You can then look at the spread of those predictions and interpret it as a measure of uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." International Conference on Machine Learning, 2016.
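For a concrete illustration, here is a minimal Monte-Carlo dropout sketch in TensorFlow/Keras; the architecture and input shape are made up, and the key point is only that dropout stays active at prediction time via training=True:

import numpy as np
import tensorflow as tf

# hypothetical model; what matters is that it contains a Dropout layer
inputs = tf.keras.Input(shape=(10,))
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)

x_new = np.random.randn(1, 10).astype("float32")

# run the same input many times with dropout enabled (training=True)
samples = np.stack([model(x_new, training=True).numpy() for _ in range(100)])
prediction = samples.mean(axis=0)    # predictive mean
uncertainty = samples.std(axis=0)    # spread across dropout masks as an uncertainty estimate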
Since I've found nothing simple to implement, I wrote something myself that models this explicitly: here is a custom loss function that tries to predict mean and variance. It seems to work, but I'm not quite sure how well it works out in practice, and I'd appreciate feedback. This is my loss function:
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.python.ops import math_ops

def meanAndVariance(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Loss that has the values in the last axis of y_pred approximate
    the mean and variance of each value in the last axis of y_true."""
    y_pred = tf.convert_to_tensor(y_pred)
    y_true = math_ops.cast(y_true, y_pred.dtype)
    mean = y_pred[..., 0::2]      # even indices hold the predicted means
    variance = y_pred[..., 1::2]  # odd indices hold the predicted variances
    res = K.square(mean - y_true) + K.square(variance - K.square(mean - y_true))
    return K.mean(res, axis=-1)
The output dimension is twice the label dimension - the mean and variance of each value in the label. The loss function consists of two parts: a mean squared error term that has the mean output approximate the label value, and a second term that has the variance output approximate the squared deviation of the label from the predicted mean.
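For context, here is a minimal sketch of how such a loss could be plugged into a model (the layer sizes and label_dim are made up for illustration, and note that nothing here constrains the variance output to be non-negative):

import tensorflow as tf

label_dim = 1  # hypothetical: one scalar label per sample
inputs = tf.keras.Input(shape=(10,))
hidden = tf.keras.layers.Dense(32, activation="relu")(inputs)
# 2 * label_dim outputs: even indices are the means, odd indices the variances,
# matching the slicing in meanAndVariance above
outputs = tf.keras.layers.Dense(2 * label_dim)(hidden)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss=meanAndVariance)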
When using dropout to estimate the uncertainty (or any other stochastic regularization method), make sure to also check out our recent work on a sampling-free approximation of Monte-Carlo dropout.
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow your idea: treat the activations as random variables and then propagate mean and variance to the output layer using error propagation. Consequently, we obtain two outputs - the mean and the variance.
I am trying to train a model using scikit-learn's SVM module. For the scoring, I could not find mean_absolute_error (MAE); however, neg_mean_absolute_error (NMAE) does exist. What is the difference between these 2 metrics? Let's say I get the following results for 2 models:
model 1 (NMAE = -2.6), model 2 (NMAE = -3.0)
Which model is better? Is it model 1?
Moreover, how does a negative value compare to a positive one? Say the following:
model 1 (NMAE = -1.7), model 2 (MAE = 1.4)
Here, which model is better?
As its name implies, negative MAE is simply the negative of the MAE, which by definition is a non-negative quantity. And since MAE is an error metric, i.e. the lower the better, negative MAE is the opposite: the higher the better, so a value of -2.6 is better than a value of -3.0.
Just remove the negative signs and treat them as MAE values (which arguably also answers your second question).
Keep in mind that MAE is always available in scikit-learn as a general metric (docs).
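For example (the y_true and y_pred values are made up), the metric itself is available directly, and the negative version is just its sign-flipped counterpart used for scoring:

from sklearn.metrics import mean_absolute_error

# made-up example values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)   # 0.5, lower is better
neg_mae = -mae                              # what the 'neg_mean_absolute_error' scorer reports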
I would like to add here that this negative error is also helpful for finding the best algorithm when you are comparing multiple algorithms through GridSearchCV().
This is because, after training, GridSearchCV() ranks all the algorithms (estimators) and tells you which one is the best. sklearn always ranks the estimator with the higher score first, which is the opposite of what you want for error metrics such as MAE (along with MSE and a few others).
To deal with this, the library flips the sign of the error, so the highest MAE will be ranked lowest and vice versa.
So to answer your question: -2.6 is better than -3.0, because the actual MAE values are 2.6 and 3.0, respectively.
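As a small sketch of that behaviour (the SVR parameters and synthetic data are arbitrary), best_score_ is reported as negative MAE, so negating it gives the actual error of the winning model:

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVR(), param_grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)    # negative MAE: the value closest to 0 wins
print(-search.best_score_)   # the actual MAE of the best estimator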
I am trying to use scikit-learn to perform linear regression on labeled time-series data.
My data format is data=(timestamp,value,label)
The labels that are assigned to my data are either 0 or 1.
I tried to follow this example from the scikit-learn website.
My questions:
1 - Where are the labels of the training data in the example? Are they in diabetes_y_train?
2 - What are the return values of the method predict()? In my code it returns an array of n_samples predicted values in the range [0,1]. However, I expected it to return binary values of either 0 or 1 (no intermediate values).
1 - diabetes_y_train contains the labels for the training data.
2 - You are using a regression function, so it is expected that the output is continuous. If you want binary output, you are not solving a regression problem but a classification one. You can either set a threshold to discretise the predictions or use one of the classifiers offered by sklearn.
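A small sketch of both options, using made-up toy data in place of your (timestamp, value, label) rows:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# made-up toy data: y is a binary label (0 or 1)
rng = np.random.RandomState(0)
X_train = rng.randn(100, 3)
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.randn(10, 3)

# Option 1: keep the regression and threshold its continuous output
regr = LinearRegression().fit(X_train, y_train)
binary_pred = (regr.predict(X_test) >= 0.5).astype(int)

# Option 2: use a classifier, which predicts 0/1 directly
clf = LogisticRegression().fit(X_train, y_train)
class_pred = clf.predict(X_test)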
1 - Yes
2 - predict() calculates a floating point number, because the example is trying to predict a floating point value and not a binary value. So there is no yes/no answer, just a predicted value; to estimate the error, the differences are squared and averaged in np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2).
I noticed that sklearn has the following function:
sklearn.metrics.roc_auc_score()
which takes as input ground_truth and prediction.
For example,
ground_truth = [1,1,0,0,0]
prediction = [1,1,0,0,0]
sklearn.metrics.roc_auc_score(ground_truth, prediction) returns 1
My problem is that I can't figure out how sklearn calculates the area under the ROC curve with two binary inputs. Isn't the ROC curve derived by moving the class assignment threshold, and calculating the false alarm and hit rate for each threshold? With two binary inputs, shouldn't you only have one (false alarm, hit rate) measurement?
Many thanks!
You're correct that with binary predictions you'll only have a single threshold/measurement for the curve. I didn't understand it myself, so I ran the code with a ton of print statements, both for the sklearn tutorial and then with a purely binary example. All the magic is happening in sklearn.metrics._binary_clf_curve.
The "thresholds" are distinct prediction scores. For any binary classifier that outputs purely ones and zeros you're going to get two thresholds - 1 and 0 (they're sorted internally from highest to lowest). At the 1 threshold, a prediction score of >=1 is true and anything below that (only 0 in this case) is considered false, and the TP and FP rates are calculated from that. In all cases, the last threshold categorizes everything as true so the TP and FP rates will both be 1.
It appears, then, that to generate a proper ROC curve for a sklearn classifier you'd use clf.predict_proba() rather than predict(). Or maybe predict_log_proba()? I'm not sure if it would make any difference.
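To illustrate the difference (the data and classifier below are made up), roc_curve on hard 0/1 predictions yields only a couple of thresholds, while probability scores from predict_proba() give a proper curve:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# hard 0/1 predictions: only the distinct predicted values act as thresholds
fpr_hard, tpr_hard, thr_hard = roc_curve(y, clf.predict(X))
print(thr_hard)   # very few thresholds, so the "curve" is a single corner point

# probability scores: many thresholds, a proper curve
scores = clf.predict_proba(X)[:, 1]
fpr, tpr, thr = roc_curve(y, scores)
print(roc_auc_score(y, scores))

As for predict_log_proba(): since the log is monotonic, it preserves the ranking of the scores, so the resulting ROC curve and AUC are the same as with predict_proba().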