Targeting a specific metric to optimize in tensorflow - python

Is there any way to target a specific metric to optimize using the built-in TensorFlow optimizers? If not, how can this be achieved? For example, if I want to focus specifically on maximizing the F-score of my classifier, is it possible to do so in TensorFlow?
estimator = tf.estimator.LinearClassifier(
    feature_columns=feature_cols,
    config=my_checkpointing_config,
    model_dir=output_dir,
    optimizer=lambda: tf.train.FtrlOptimizer(
        learning_rate=tf.train.exponential_decay(
            learning_rate=0.1,
            global_step=tf.train.get_or_create_global_step(),
            decay_steps=1000,
            decay_rate=0.96)))
I am trying to optimize my classifier specifically for a better F-score. Despite using the decaying learning_rate and 300 training steps, I am getting inconsistent results. While checking the metrics in the logs, I found the behavior of precision, recall and accuracy to be very erratic. Despite increasing the number of training steps, there was no significant improvement. So I thought that if I could make the optimizer focus more on improving the F-score as a whole, I might get better results. Hence the question. Is there something that I am missing?

In classification settings, optimizers minimize the loss, e.g. cross-entropy; quantities like accuracy, F-score, precision, recall etc. are essentially business metrics, and they are not (and cannot be) directly optimized during training.
This is a question that pops up rather frequently here in SO in various disguises; here are some threads which will hopefully help you disentangle the concepts (although they refer to accuracy, precision, and recall, the argument is exactly the same for the F-score):
Loss & accuracy - Are these reasonable learning curves?
Cost function training target versus accuracy desired goal
Is there an optimizer in keras based on precision or recall instead of loss?
The bottom line, adapting one of my own (linked) answers:
Loss and metrics like accuracy or F-score are different things; roughly speaking, metrics like accuracy & F-score are what we are actually interested in from a business perspective, while the loss is the objective function that the learning algorithms (optimizers) are trying to minimize from a mathematical perspective. Even more roughly, you can think of the loss as the "translation" of the business objective (accuracy, F-score, etc.) into the mathematical domain, a translation which is necessary in classification problems (in regression ones, the loss and the business objective are usually the same, or at least can be in principle, e.g. the RMSE)...
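To make the split concrete, here is a minimal hedged sketch using the Keras API rather than the estimator from the question (the layer sizes and input shape are placeholders): the optimizer only ever sees the cross-entropy loss, precision and recall are merely tracked, and the F-score is derived from them after evaluation.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',  # this is what the optimizer actually minimizes
    metrics=[tf.keras.metrics.Precision(name='precision'),
             tf.keras.metrics.Recall(name='recall')])  # these are only reported, never optimized

# After training, e.g.:
# results = model.evaluate(x_val, y_val, return_dict=True)
# f1 = 2 * results['precision'] * results['recall'] / (
#          results['precision'] + results['recall'] + 1e-7)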

One could technically adjust the threshold that distinguishes between class 1 and class 0. For example, in logistic regression, lowering the threshold from 0.5 to 0.3 flags more samples as positive, so recall increases while precision decreases, and vice versa. But as others have mentioned, this is not the same as optimizing ("minimizing") the loss function.
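A rough sketch of that threshold adjustment with scikit-learn on synthetic data (the dataset and the 0.3 cutoff below are only for illustration):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

for threshold in (0.5, 0.3):  # lowering the cutoff flags more samples as positive
    pred = (proba >= threshold).astype(int)
    print(threshold, precision_score(y, pred), recall_score(y, pred))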

Related

How to evaluate Pytorch model using metrics like precision and recall?

I have trained a simple PyTorch neural network on some data, and now wish to test and evaluate it using metrics like accuracy, recall, f1 and precision. I searched the PyTorch documentation thoroughly and could not find any classes or functions for these metrics. I then tried converting the predicted labels and the actual labels to numpy arrays and using scikit-learn's metrics, but the predicted labels don't seem to be either 0 or 1 (my labels); they are instead continuous values. Because of this, scikit-learn's metrics don't work.
Fast.ai documentation didn't make much sense either; I could not understand which class to inherit from for precision etc. (although I was able to calculate accuracy). Any help would be much appreciated.
Usually, in a binary classification setting, your neural network outputs the probability that the event occurs (e.g., if you are using a sigmoid activation and a single neuron at the output layer), which is a continuous value between 0 and 1. To evaluate the precision and recall of your model (e.g., with scikit-learn's precision_score and recall_score), you need to convert your model's probabilities into binary values. This is achieved by specifying a threshold on your model's probability. (For an overview of thresholding, please take a look at this reference: https://developers.google.com/machine-learning/crash-course/classification/thresholding)
Scikit-learn's precision_recall_curve (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html) is commonly used to understand how precision and recall metrics behave for different probability thresholds. By analysing the precision and recall values per threshold, you will be able to specify the best threshold for your problem (you may want higher precision, so you will aim for higher thresholds, e.g., 90%; or you may want to have a balanced precision and recall, and you will need to check the threshold that returns the best f1 score for your problem). A good overview on the topic may be found in the following reference: https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/
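As a rough sketch of the above (model and val_loader are hypothetical names for your trained network and validation DataLoader; a single sigmoid output is assumed), collect the probabilities, binarise them at a fixed cutoff for precision_score/recall_score, and use precision_recall_curve to locate the F1-maximising threshold:
import numpy as np
import torch
from sklearn.metrics import precision_score, recall_score, precision_recall_curve

model.eval()  # hypothetical trained model with a single sigmoid output
probs, labels = [], []
with torch.no_grad():
    for xb, yb in val_loader:  # hypothetical validation DataLoader
        probs.append(model(xb).squeeze(-1).cpu().numpy())
        labels.append(yb.cpu().numpy())
probs, labels = np.concatenate(probs), np.concatenate(labels)

preds = (probs >= 0.5).astype(int)  # binarise at a fixed 0.5 cutoff
print(precision_score(labels, preds), recall_score(labels, preds))

precision, recall, thresholds = precision_recall_curve(labels, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-8)
print("best F1 threshold:", thresholds[f1[:-1].argmax()])  # last P/R pair has no threshold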
I hope this may be of help.

how to select the metric to optimize in sklearn's fit function?

When using tensorflow to train a neural network I can set the loss function arbitrarily. Is there a way to do the same in sklearn when training a SVM? Let's say I want my classifier to only optimize sensitivity (regardless of the sense of it), how would I do that?
This is not possible with Support Vector Machines, as far as I know. With other models you might either change the loss that is optimized, or change the classification threshold on the predicted probability.
SVMs, however, minimize the hinge loss, and they do not model the probability of classes but rather their separating hyperplane, so there is not much room for manual adjustments.
If you need to focus on sensitivity or specificity, use a different model that allows maximizing that function directly, or that allows predicting class probabilities (think logistic regression or tree-based methods, for example).
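For instance, a hedged scikit-learn sketch on synthetic data: re-weighting the positive class in a logistic regression (or lowering the decision threshold on predict_proba) pushes sensitivity up, at the cost of specificity; the class weights below are arbitrary.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(class_weight={0: 1, 1: 5}, max_iter=1000).fit(X, y)

print("sensitivity, plain:   ", recall_score(y, plain.predict(X)))
print("sensitivity, weighted:", recall_score(y, weighted.predict(X)))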

Does tf.keras.metrics.AUC work on multi-class problems?

I have a multi-class classification problem and I want to measure AUC on training and test data.
tf.keras has an AUC metric implemented (tf.keras.metrics.AUC), but I'm not able to tell whether this metric can safely be used in multi-class problems. Even the example "Classification on imbalanced data" on the official web page is dedicated to a binary classification problem.
I have implemented a CNN model that predicts six classes, with a softmax layer that gives the probabilities of all the classes. I used this metric as follows
self.model.compile(loss='categorical_crossentropy',
                   optimizer=Adam(hp.get("learning_rate")),
                   metrics=['accuracy', AUC()])
and the code executed without any problem. However, sometimes I see results that are quite strange to me. For example, the model reported an accuracy of 0.78333336 and an AUC of 0.97327775. Is this possible? Can a model have a low accuracy and such a high AUC?
I worry that, although the code does not raise any error, the AUC metric is being computed incorrectly.
Can somebody confirm whether or not this metric supports multi-class classification problems?
tf.keras.metrics.AUC takes a boolean multi_label argument.
If True (not the default), multi-label data is treated as such, and AUC is computed separately for each label and then averaged across labels.
When False (the default), the data is flattened into a single label before AUC computation; each label-prediction pair is then treated as an individual data point.
The documentation recommends setting it to False for multi-class (single-label) data; multi_label=True is intended for genuinely multi-label targets, e.g. tf.keras.metrics.AUC(multi_label=True).
See the AUC Documentation for more details.
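A small sketch of the two settings for a six-class softmax model like the one described (the layer sizes below are placeholders):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(6, activation='softmax'),
])

# multi_label=False (the default): labels/predictions are flattened before the
# AUC is computed - what the docs recommend for multi-class, single-label data.
# multi_label=True would instead compute one AUC per label and average them,
# which is meant for genuinely multi-label targets.
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy',
                       tf.keras.metrics.AUC(multi_label=False, name='auc')])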
AUC can have a higher score than accuracy.
Additionally, you can use the ROC curve to decide the cutoff threshold for a binary classifier (this cutoff is 0.5 by default). Though there are more technical ways to choose this cutoff, you could simply sweep it from 0 to 1 and find the value which maximizes your accuracy (this is a naive solution; I recommend reading https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/One_ROC_Curve_and_Cutoff_Analysis.pdf for an in-depth explanation of cutoff analysis).
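The naive sweep just mentioned might look like this on synthetic data (the grid of 101 cutoffs is arbitrary):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

cutoffs = np.linspace(0.0, 1.0, 101)
accuracies = [accuracy_score(y, (proba >= c).astype(int)) for c in cutoffs]
print("cutoff maximizing accuracy:", cutoffs[int(np.argmax(accuracies))])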

training loss decreases while dev loss increases

I'm observing the following patterns in a one-layer CNN, binary classification model:
Training loss decreases while dev loss increases with the number of steps
Training accuracy increases while dev accuracy decreases with the number of steps
Based on past SO questions and literature review, it seems that these patterns are indicative of over-fitting (the model performs well in training, but cannot generalize to new examples).
The graphs below illustrate the loss and accuracy with respect to the number of steps in training.
In both,
The orange line represents the summary of the dev set performance.
The blue line represents the summary of the training set performance.
Loss:
Accuracy:
Traditional remedies I've considered, and my observations about them:
Adding L2 regularization: I've tried many L2 regularization coefficients -- from 0.0 to 4.5; all of these tests yield a similar pattern by the 5,000th step in both loss and accuracy.
Cross-validation: It seems that the role of cross-validation is widely misunderstood online. As this answer states, cross-validation is for model checking, not model building. Indeed, cross-validation would be a way to check whether the model generalizes well. And actually, the graphs I show are from one fold of a 4-fold cross-validation. If I observe a similar pattern in the loss/accuracy in all the folds, what other insight does cross-validation offer beyond confirming that the model does not generalize well?
Early stopping: This would seem the most intuitive, but the loss graph seems to indicate that the loss levels out only after a divergence in the dev set loss is observed; the starting point of this early stop, then, doesn't seem easy to decide.
Data: The amount of labeled data I have available is limited, so training on more data is not an option right now.
All this said, what I am asking is:
If the patterns observed in the loss and accuracy are indeed indicative of over-fitting, are there any other methods to counteract over-fitting that I haven't considered?
If these patterns are not indicative of over-fitting, what else could they mean?
Thanks -- any insight would be much appreciated.
I think that you are totally on the right track. Looks like classic over-fitting.
One option is adding dropout if you don't already have it. It falls into the category of regularization, but it is more commonly used now than L1 and L2 regularization.
Changing the model architecture could get better results, but it's hard to say what specifically would be best. It could help to make it deeper with more layers and possibly some pooling layers. It will likely still overfit, but you might get a higher accuracy on the dev set before that happens.
Getting more data may be one of the best things you could do. If you can't get more data, you can try to augment it. You can also try cleaning the data to remove noise, which can help prevent the model from fitting to that noise.
You may ultimately want to set up a hyperparameter optimization search. This, however, can be slow for neural networks, which already take a while to train. Make sure you hold out a test set before hyperparameter tuning.
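For the dropout suggestion, here is a hedged Keras sketch of a one-layer binary CNN (the input shape, filter count and dropout rate are made-up placeholders, not the asker's actual architecture):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(64, kernel_size=5, activation='relu',
                           input_shape=(200, 128)),  # placeholder sequence shape
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dropout(0.5),  # randomly zeroes 50% of activations, during training only
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])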

How can I test my classifier for overfitting?

I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
Now, I want to make them better. I know from speaking with people that my classifier is 'overfitting' the data; what I am looking for is a solid way to prove this so that the next time I write a classifier I will be able to run a test and see if I am overfitting or underfitting.
What is the best way of doing this? I am open to all suggestion!
I've spent literally weeks googling this topic and found no canonical or trusted ways to do this effectively, so any response will be appreciated. I will be putting a bounty on this question.
Edit:
Let's assume my classifier spits out a .tsv containing:
the website UID<tab>the likelihood it is to be ephemeral or evergreen, 0 being ephemeral, 1 being evergreen<tab>whether the page is ephemeral or evergreen
The simplest way to check your classifier's "efficiency" is to perform cross-validation:
Take your data; let's call it X
Split X into K batches of equal size
For each i = 1 to K:
Train your classifier on all batches but the i-th
Test on the i-th
Return the average result
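The same procedure expressed with scikit-learn (LogisticRegression and the synthetic data are placeholders; any estimator with a fit/predict interface works):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average result over the K=5 folds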
One more important aspect: if your classifier uses any parameters, constants, thresholds etc. which are not trained but rather set by the user, you cannot just select the ones giving the best results in the above procedure. This has to be automated somehow inside the "train your classifier on all batches but the i-th" step. In other words, you cannot use the testing data to fit any parameters of your model. Once this is done, there are four possible outcomes:
Training error is low but much lower than the testing error - overfitting
Both errors are low - ok
Both errors are high - underfitting
Training error is high but testing is low - error in implementation or very small dataset
There are many ways that people try to handle overfitting:
Cross-validation, you might also see it mentioned as x-validation
see lejlot's post for details
choose a simpler model
linear classifiers have high bias because the model must be linear, but lower variance in the optimal solution precisely because of that high bias. This means you wouldn't expect to see much difference in the final model given a large number of random training samples.
Regularization is a common practice to combat overfitting.
It is generally done by adding a term to the minimization function
Typically this term is the sum of squares of the model's weights because it is easy to differentiate.
Generally there is a constant C associated with the regularization term. Tuning this constant will increase or decrease the effect of regularization. A high weight applied to regularization generally helps with overfitting. C should always be greater than or equal to zero. (Note: some training packages apply 1/C as the regularization weight. In this case, the closer C gets to zero, the greater the weight applied to regularization.)
Regardless of the specifics, regularization works by reducing the variance of a model by biasing it toward solutions with small weights.
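As a small illustration, scikit-learn's LogisticRegression uses exactly the 1/C convention from the note above (smaller C means stronger regularization); the data here is synthetic:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
for C in (0.01, 1.0, 100.0):  # smaller C = heavier regularization penalty
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    print(C, clf.score(X, y))  # training accuracy tends to rise as C grows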
Finally, boosting is a method of training that mysteriously/magically does not seem to overfit. Not sure if anyone has fully explained why, but it is a process of combining high-bias, low-variance simple learners into a higher-variance, lower-bias model. It's pretty slick.
