Validation and Testing in TensorFlow Estimator vs. Keras - python

I've read answers here and am trying to understand how training, validation and testing map to the TensorFlow Estimator API and the Keras API.
A: Tensorflow
The tf.estimator.train_and_evaluate function takes a train_spec and an eval_spec.
Here, does evaluate mean validation or testing in the above terminology?
If it's testing, where do I specify a validation set?
B: Keras
In Keras, this seems clearer: model.fit takes a validation_data argument, which is for the validation set, and there is a separate function, model.evaluate, to which we provide the test set. Is this correct?

In practice the terms "test set" and "validation set" are often used interchangeably (or flipped from how they are described above). As a result, it has become common to refer to the set used during training as either the test set or the validation set. To disambiguate, the set that is set aside for hyperparameter tuning (here described as the validation set) is generally referred to as the holdout set. (source)
Based on this definition, you can keep things simple. For example, suppose the first dataset is "train", the second is "validation" (as in Keras), used to evaluate the model at the end of each epoch during training, and the final dataset is "test".
Once training has finished, you can check the model by running model.predict on the test dataset to see how it performs on unseen data.
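As a rough illustration of that three-way split, here is a minimal sketch in plain Python. The helper name three_way_split and the 60/20/20 proportions are my own assumptions, not something from Keras; the comments only indicate where each part would typically be passed.

```python
import random


def three_way_split(items, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle items and split them into train, validation and test lists.

    Hypothetical helper for illustration; the fractions are arbitrary.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))

# With a Keras model the three parts would typically be used like this:
#   model.fit(x_train, y_train, validation_data=(x_val, y_val))  # train + validation
#   model.evaluate(x_test, y_test)                                # final check on the test set
```

The point is only that the test part is never touched until training and tuning are finished.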

Related

Tensorflow estimator: Switching to careful_interpolation to get the correct PR-AUC of a model

In my project, I am using the premade estimator DNNClassifier.
Here is my estimator:
model = tf.estimator.DNNClassifier(
    hidden_units=network,
    feature_columns=feature_cols,
    n_classes=2,
    activation_fn=tf.nn.relu,
    optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=0.001
    ),
    config=chk_point_run_config,
    model_dir=MODEL_CHECKPOINT_DIR
)
When I evaluate the model using eval_res = model.evaluate(..),
I get the following warning:
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
How can I switch to careful_interpolation to get the correct results from the evaluate() method?
Tensorflow version: 1.8
Unfortunately, the use of a pre-made estimator leaves little freedom for customizing the evaluation process. Currently, a DNNClassifier does not seem to provide a means to adjust the evaluation metrics, likewise for other estimators.
Although not ideal, one solution is to augment an estimator with the desired metrics using tf.contrib.estimator.add_metrics, which will replace the old metric if the exact same key is assigned to the new one:
If there is a name conflict between this and estimators existing metrics, this will override the existing one.
It comes with the advantage of working for any estimator that produces probabilistic predictions, at the expense of still calculating the overridden metric for each evaluation. A DNNClassifier estimator provides logistic values (between 0 and 1) under the key 'logistic' (the list of possible keys in canned estimators is here). This might not always be the case for other estimator heads, but alternatives may be available: in a multi-label classifier built with tf.contrib.estimator.multi_label_head, 'logistic' is not available, but 'probabilities' can be used instead.
Hence, the code would look like this:
import tensorflow as tf  # TensorFlow 1.x, as in the question

def metric_auc(labels, predictions):
    # Reusing the key 'auc_precision_recall' overrides the built-in metric.
    return {
        'auc_precision_recall': tf.metrics.auc(
            labels=labels, predictions=predictions['logistic'], num_thresholds=200,
            curve='PR', summation_method='careful_interpolation')
    }

estimator = tf.estimator.DNNClassifier(...)
estimator = tf.contrib.estimator.add_metrics(estimator, metric_auc)
When evaluating, the warning message will still appear, but the AUC with careful interpolation will be computed shortly afterwards. Assigning this metric to a different key would also allow you to check the discrepancy between the two summation methods. My tests on a multi-label logistic regression task show that the measurements may indeed be slightly different: auc_precision_recall = 0.05173396, auc_precision_recall_careful = 0.05059402.
There is also a reason why the default summation method is still 'trapezoidal', in spite of the documentation suggesting that careful interpolation is "strictly preferred". As commented in pull request #19079, the change would be significantly backwards incompatible. Subsequent comments on the same pull request suggest the workaround above.

Computing TF-IDF on the whole dataset or only on training data?

In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author pre-processes the data with scikit-learn's fit_transform function to get the TF-IDF features of the text for training. The author passes all the text data to the function before separating it into train and test. Is this correct, or must we separate the data first and then perform fit_transform on train and transform on test?
According to the documentation of scikit-learn, fit() is used in order to
Learn vocabulary and idf from training set.
On the other hand, fit_transform() is used in order to
Learn vocabulary and idf, return term-document matrix.
while transform()
Transforms documents to document-term matrix.
On the training set you need to apply both fit() and transform() (or just fit_transform(), which essentially joins both operations). On the testing set, however, you only need to transform() the testing instances (i.e. the documents).
Remember that the training set is used for learning (which is achieved through fit()), while the testing set is used to evaluate whether the trained model can generalise well to new, unseen data points.
For more details you can refer to the article fit() vs transform() vs fit_transform()
"Author gives all text data before separating train and test to function. Is it a true action or we must separate data first then perform tfidf fit_transform on train and transform on test?"
I would consider this as already leaking some information about the test set into the training set.
I always follow the rule that, before any pre-processing, the first thing to do is to separate the data and create a hold-out set.
Since we are talking about text data, we have to make sure that the model is trained only on the vocabulary of the training set. When a model is deployed in real life, it will encounter words that it has never seen before, so the evaluation on the test set has to reflect that.
We have to make sure that words appearing only in the test set are not part of the model's vocabulary.
Hence we have to use fit_transform on the training data and transform on the test data.
If you think about doing cross validation, then you can use this logic across all the folds.
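To make the leakage argument concrete, here is a minimal pure-Python sketch. TinyTfidf is a toy stand-in that I made up for illustration; it is not scikit-learn's TfidfVectorizer and skips things like normalisation. It only shows that the vocabulary and idf weights are learned in fit() from the training documents, and that transform() reuses them on the test documents.

```python
import math
from collections import Counter


class TinyTfidf:
    """Toy TF-IDF vectorizer (hypothetical, for illustration only)."""

    def fit(self, docs):
        tokenized = [doc.lower().split() for doc in docs]
        # Document frequencies are counted on the documents passed to fit() only.
        df = Counter(term for tokens in tokenized for term in set(tokens))
        self.vocabulary_ = {term: i for i, term in enumerate(sorted(df))}
        n_docs = len(tokenized)
        # A smoothed idf: ln((1 + n) / (1 + df)) + 1
        self.idf_ = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in df}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            counts = Counter(doc.lower().split())
            row = [0.0] * len(self.vocabulary_)
            for term, count in counts.items():
                if term in self.vocabulary_:  # words unseen during fit() are dropped
                    row[self.vocabulary_[term]] = count * self.idf_[term]
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the zebra sat"]

vectorizer = TinyTfidf()
X_train = vectorizer.fit_transform(train_docs)  # learn vocabulary and idf on train only
X_test = vectorizer.transform(test_docs)        # reuse them; "zebra" is ignored
```

If fit_transform had been called on train_docs + test_docs instead, "zebra" would end up in the vocabulary and the idf weights would depend on the test documents, which is exactly the leak described above.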

Different validation accuracy when using the Keras function fit_generator() and when predicting on every individual picture?

Recently, I used Keras to train a network to classify pictures, using the Keras function model.fit_generator() to fit my model. fit_generator() automatically runs the model on the validation data and returns a validation accuracy at the end of each epoch.
But something odd happened: when I used the model to predict the validation data and compared the results with the correct classes, the validation accuracy was lower than the one reported by fit_generator().
I have two hypotheses:
1. I use a generator to get data from dictionary, so I assume that within a single epoch the generator may repeatedly fetch data that the model fits very well, so the accuracy may be higher.
2. Keras may use some tricks or preprocess the data during validation, which increases the accuracy.
I tried looking through the Keras source code and documentation, but nothing helped. I would be very thankful if anyone could give me some advice about this problem.

Tensorflow Object Detection API validation vs test set

I recently started looking into the Tensorflow Object Detection API and have a question on the validation set:
Is the validation used at all for the model training?
For instance are the weights of the model selected based on the accuracy on the validation set?
I am trying to figure out whether I need to have an independent test set (different from the evaluation set) to get unbiased results on the model performance, or can use the validation set for that.
Thank you!
The validation dataset (the test.record) is not used in the training.
It is always better to have a validation dataset, for example to detect overfitting.

Confused with respect to the working of GridSearchCV

GridSearchCV implements a fit method in which it performs n-fold cross validation to determine the best parameters. After this we can directly apply the best estimator to the testing data using predict(), following this link: http://scikit-learn.org/stable/auto_examples/grid_search_digits.html
It says there "The model is trained on the full development set".
However, we have only applied n-fold cross validation here. Is the classifier somehow also training itself on the entire data? Or is it just choosing the best trained estimator, with the best parameters, from among the n folds when applying predict?
If you want to use predict, you'll need to set 'refit' to True. From the documentation:
refit : boolean
Refit the best estimator with the entire dataset.
If “False”, it is impossible to make predictions using
this GridSearchCV instance after fitting.
It looks like it is true by default, so in the example, predict is based on the whole training set.
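The procedure can be sketched roughly as follows. This is a plain-Python illustration under my own assumptions, not scikit-learn's implementation: KNN1D, accuracy and grid_search are hypothetical names, and the toy 1-D nearest-neighbour model just stands in for any estimator with a hyperparameter.

```python
from collections import Counter


class KNN1D:
    """Toy 1-D k-nearest-neighbours classifier (illustration only)."""

    def __init__(self, k):
        self.k = k

    def fit(self, xs, ys):
        self.xs, self.ys = list(xs), list(ys)
        return self

    def predict(self, xs):
        preds = []
        for x in xs:
            nearest = sorted(zip(self.xs, self.ys), key=lambda p: abs(p[0] - x))[:self.k]
            preds.append(Counter(y for _, y in nearest).most_common(1)[0][0])
        return preds


def accuracy(model, xs, ys):
    return sum(p == y for p, y in zip(model.predict(xs), ys)) / len(ys)


def grid_search(xs, ys, ks, n_folds=3, refit=True):
    scores = {}
    for k in ks:
        fold_scores = []
        for fold in range(n_folds):
            # Each fold trains on n_folds - 1 parts and scores on the held-out part.
            val_idx = [i for i in range(len(xs)) if i % n_folds == fold]
            train_idx = [i for i in range(len(xs)) if i % n_folds != fold]
            model = KNN1D(k).fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
            fold_scores.append(
                accuracy(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
        scores[k] = sum(fold_scores) / n_folds
    best_k = max(scores, key=scores.get)
    # The fold models are thrown away; with refit, a fresh model using the
    # best parameter is trained on the entire development set.
    best_model = KNN1D(best_k).fit(xs, ys) if refit else None
    return best_k, best_model, scores


xs = list(range(12))
ys = [0] * 6 + [1] * 6
best_k, best_model, scores = grid_search(xs, ys, ks=[1, 3])
_, no_model, _ = grid_search(xs, ys, ks=[1, 3], refit=False)
```

So the cross-validation folds are only used to score each parameter setting; the estimator that predict() uses afterwards is a new one, fitted on all of the data passed to fit.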
