Let's say that I want to fine-tune one of the TensorFlow Hub image feature vector modules. The problem is that, in order to fine-tune a module, it has to be loaded like this:
module = hub.Module("https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/3", trainable=True, tags={"train"})
Here I'm assuming the module is ResNet-50.
In other words, the module is imported with the trainable flag set to True and with the "train" tag. Now, if I want to validate the model (run inference on the validation set to measure its performance), I can't switch off batch norm, because of the "train" tag and the trainable flag.
Please note that this question has already been asked here: Tensorflow hub fine-tune and evaluate, but no answer has been provided.
I also raised a GitHub issue about it.
Looking forward to your help!
With hub.Module for TF1, the situation is as you say: either the training or the inference graph is instantiated, and there is no good way to import both and share variables between them in a single tf.Session. That's informed by the approach used by Estimators and many other training scripts in TF1 (esp. distributed ones): there's a training Session that produces checkpoints, and a separate evaluation Session that restores model weights from them. (The two will likely also differ in the dataset they read and the preprocessing they perform.)
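To make that concrete, here is a minimal sketch of the TF1 pattern just described (the handle, input size, and checkpoint path are placeholders for illustration): one graph and Session for training that writes checkpoints, and a separate graph and Session that imports the module's inference version and restores the weights for validation.

import tensorflow as tf
import tensorflow_hub as hub

handle = "https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/3"

# Training graph: import the module with the "train" tag so batch norm runs in training mode.
train_graph = tf.Graph()
with train_graph.as_default():
    images = tf.placeholder(tf.float32, [None, 224, 224, 3])
    module = hub.Module(handle, trainable=True, tags={"train"})
    features = module(images)
    # ... add classification head, loss and train_op here ...
    train_saver = tf.train.Saver()

# Evaluation graph: import the module without the "train" tag to get the inference graph.
eval_graph = tf.Graph()
with eval_graph.as_default():
    images = tf.placeholder(tf.float32, [None, 224, 224, 3])
    module = hub.Module(handle)
    features = module(images)
    # ... add the same classification head and the eval metrics here ...
    eval_saver = tf.train.Saver()

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop ...
    train_saver.save(sess, "/tmp/finetuned/model.ckpt")

with tf.Session(graph=eval_graph) as sess:
    eval_saver.restore(sess, "/tmp/finetuned/model.ckpt")
    # ... run the validation set through the inference graph ...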
With TF2 and its emphasis on Eager mode, this has changed. TF2-style Hub modules (as found at https://tfhub.dev/s?q=tf2-preview) are really just TF2-style SavedModels, and these don't come with multiple graph versions. Instead, the __call__ function on the restored top-level object takes an optional training=... parameter if the train/inference distinction is required.
With this, TF2 should match your expectations. See the interactive demo tf2_image_retraining.ipynb and the underlying code in tensorflow_hub/keras_layer.py for how it can be done. The TF Hub team is working on making a more complete selection of modules available for the TF2 release.
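As a rough sketch of that TF2-style usage (the handle, image size, and class count below are just an example, not something prescribed by the answer): hub.KerasLayer wraps the SavedModel, trainable=True enables fine-tuning, and Keras passes training=True during fit() and training=False during evaluate()/predict(), so batch norm behaves correctly in each phase.

import tensorflow as tf
import tensorflow_hub as hub

num_classes = 5  # hypothetical number of target classes

# One of the tf2-preview feature vector modules; swap in the handle you actually need.
feature_extractor = hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4",
    input_shape=(224, 224, 3),
    trainable=True)  # allow fine-tuning of the module's weights

model = tf.keras.Sequential([
    feature_extractor,
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(...) calls the hub layer with training=True,
# while model.evaluate(...) and model.predict(...) call it with training=False,
# so batch norm switches modes without juggling separate graphs.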
Related
I'm new to this topic, so forgive my lack of knowledge. There is a very good model called Inception ResNet v2 that basically works like this: the input is an image and the output is a list of predictions with their positions and bounding rectangles. I find this very useful, and I thought of using the already-trained model to recognize things that it currently can't (for example, whether a human is wearing a mask or not). In short, I wanted to add a new recognition class to the model.
import tensorflow as tf
import tensorflow_hub as hub
mod = hub.load("https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1")
mod is an object of type tensorflow.python.training.tracking.tracking.AutoTrackable. Reading the documentation (which is only available in the source code) was a bit hard without context, so I tried to inspect some of its properties to see if I could figure it out by myself.
And well, I didn't. How can I see the network, the layers, the weights, the fit methods? Is it all abstracted away? Can I convert it to Keras? I want to experiment with it, see if I can modify it, and see if I could export the model to another representation, for example PyTorch.
I wanted to do this because I thought it would be better to modify an already-working model instead of creating one from scratch, and also because I'm not good at training models myself.
I've run into this issue too. The TensorFlow Hub guide says:
This error frequently arises when loading models in TF1 Hub format with the hub.load() API in TF2. Adding the correct signature should fix this problem.
mod = hub.load(handle).signatures['default']
As an example, you can see this notebook.
You can dir() the loaded model asset to see what's defined on it:
m = hub.load(handle)
dir(m)
As mentioned in the other answer, you can also look at the signatures with print(m.signatures)
Hub models are SavedModel assets and do not have a keras .fit method on them. If you want to train the model from scratch, you'll need to go to the source code.
Some models have more extensive exported interfaces including access to individual layers, but this model does not.
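For completeness, here is a rough sketch of how such a detector is typically called through its 'default' signature (the image path is a placeholder, and the output keys used below are assumptions for this particular model; check print(m.signatures) and the module page to confirm them):

import tensorflow as tf
import tensorflow_hub as hub

detector = hub.load("https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1").signatures['default']

# Load one image as float32 in [0, 1] with a batch dimension.
img = tf.io.read_file("example.jpg")
img = tf.image.decode_jpeg(img, channels=3)
img = tf.image.convert_image_dtype(img, tf.float32)[tf.newaxis, ...]

result = detector(img)
result = {key: value.numpy() for key, value in result.items()}

print(result.keys())                   # e.g. detection_boxes, detection_scores, detection_class_entities, ...
print(result["detection_scores"][:5])  # assumed key name for the top detections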
I would like to evaluate a custom-trained Tensorflow object detection model on a new test set using Google Cloud.
I obtained the initial checkpoints from:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
I know that the Tensorflow object-detection API allows me to run training and evaluation simultaneously by using:
https://github.com/tensorflow/models/blob/master/research/object_detection/model_main.py
To start such a job, I submit the following ML Engine job:
gcloud ml-engine jobs submit training [JOBNAME]
--runtime-version 1.9
--job-dir=gs://path_to_bucket/model-dir
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,pycocotools-2.0.tar.gz
--module-name object_detection.model_main
--region us-central1
--config object_detection/samples/cloud/cloud.yml
--
--model_dir=gs://path_to_bucket/model_dir
--pipeline_config_path=gs://path_to_bucket/data/model.config
However, after I have successfully transfer-trained a model, I would like to calculate performance metrics, such as COCO mAP (http://cocodataset.org/#detection-eval) or PASCAL mAP (http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf), on a new test data set which has not been used before (neither during training nor during evaluation).
I have seen that there is a flag in model_main.py that looks relevant:
flags.DEFINE_string(
    'checkpoint_dir', None, 'Path to directory holding a checkpoint. If '
    '`checkpoint_dir` is provided, this binary operates in eval-only mode, '
    'writing resulting metrics to `model_dir`.')
But I don't know whether this really implies that model_main.py can be run in evaluation-only mode. If yes, how should I submit the ML Engine job?
Alternatively, are there any functions in the TensorFlow API that allow me to evaluate an existing output dictionary (containing bounding boxes, class labels, scores) with COCO and/or PASCAL mAP? If there are, I could easily read a TensorFlow record file locally, run inference, and then evaluate the output dictionary.
I know how to obtain these metrics for the evaluation data set that is evaluated during training in model_main.py. However, from my understanding I should still report model performance on a new test data set, since I compare multiple models and do some hyper-parameter optimization, and thus should not report results on the evaluation data set, am I right? On a more general note: I really cannot comprehend why one would switch from separate training and evaluation scripts (as in the legacy code) to a combined training and evaluation script.
Edit:
I found two related posts. However I do not think that the answers provided are complete:
how to check both training/eval performances in tensorflow object_detection
How to evaluate a pretrained model in Tensorflow object detection api
The latter was written while TF's object detection API still had separate evaluation and training scripts. This is not the case anymore.
Thank you very much for any help.
If you specify checkpoint_dir and set run_once to true, then it should run evaluation exactly once on the eval dataset. I believe the metrics will be written to the model_dir and should also appear in your console logs. I usually just run this on my local machine (since it's only doing one pass over the dataset), and it is not a distributed job. Unfortunately I haven't tried running this particular codepath on CMLE.
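For reference, a local eval-only invocation would then look roughly like this (the paths are placeholders; checkpoint_dir and run_once are the flags defined in model_main.py):

python object_detection/model_main.py \
    --pipeline_config_path=path/to/model.config \
    --model_dir=path/to/eval_output_dir \
    --checkpoint_dir=path/to/trained_checkpoints \
    --run_once=True

Pointing the eval_input_reader in the pipeline config at the new test TFRecord is what makes this evaluate the held-out test set rather than the original validation data.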
Regarding why we have a combined script: from the perspective of the Object Detection API, we were trying to write things in the tf.Estimator paradigm, but you are right that, personally, I found it a bit easier when the two functionalities lived in separate binaries. If you want, you can always wrap up this functionality in another binary :)
I've been using tensorflow for a while now. At first I had stuff like this:
def myModel(training):
    with tf.variable_scope('model', reuse=not training):
        # ... build the model here ...
        return model

training_model = myModel(True)
validation_model = myModel(False)
Mostly because I started with some MOOCs that taught me to do that. But they also didn't use TFRecords or queues, and I didn't know why I was using two separate models. I tried building only one and feeding the data with feed_dict: everything worked.
Ever since, I've usually been using only one model. My inputs are always placeholders and I just feed in either training or validation data.
Lately, I've noticed some weird behavior in models that use tf.layers.dropout and tf.layers.batch_normalization. Both functions have a 'training' parameter that I use with a tf.bool placeholder. I've seen tf.layers used generally with a tf.estimator.Estimator, but I'm not using one. I've read the Estimator code and it appears to create two different graphs for training and validation. Maybe those issues arise from not having two separate models, but I'm still skeptical.
Is there a clear reason I'm not seeing that implies that two separate-equivalent models have to be used?
You do not have to use two neural nets for training and validation. After all, as you noticed, TensorFlow helps you keep a single, monolithic train-and-validate net by allowing the training parameter of some layers to be a placeholder.
However, why wouldn't you? By having separate nets for training and validation, you set yourself on the right path and future-proof your code. Your training and validation nets might be identical today, but you might later see some benefit in having distinct nets, such as different inputs, different outputs, or removed intermediate layers.
Also, because variables are shared between them, having distinct training and validation nets comes at almost no penalty.
So, keeping a single net is fine; in my experience, though, any project beyond playful experimentation is likely to implement a distinct validation net at some point, and TensorFlow makes it easy to do just that with minimal penalty.
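Just to make the single-net option concrete, here is a minimal sketch with arbitrary names and sizes: a boolean training placeholder drives tf.layers.dropout and tf.layers.batch_normalization. One detail that is often missed, and that can produce the kind of weird behavior described in the question, is that batch norm's moving statistics only get updated if the UPDATE_OPS run together with the train op.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 64])
labels = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder_with_default(False, shape=(), name="is_training")

h = tf.layers.dense(x, 128, activation=tf.nn.relu)
h = tf.layers.batch_normalization(h, training=is_training)
h = tf.layers.dropout(h, rate=0.5, training=is_training)
logits = tf.layers.dense(h, 10)

loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# Batch norm keeps its moving mean/variance updates in UPDATE_OPS; they must run with the
# train op, otherwise the statistics used at inference time never get updated.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# Training step: feed is_training=True; validation step: leave the default (False).
# sess.run(train_op, feed_dict={x: batch_x, labels: batch_y, is_training: True})
# sess.run(loss, feed_dict={x: val_x, labels: val_y})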
tf.estimator.Estimator classes indeed create a new graph for each invocation, and this has been the subject of furious debates; see this issue on GitHub. Their approach is to build the graph from scratch on each train, evaluate and predict invocation and restore the model from the last checkpoint. There are clear downsides to this approach, for example:
A loop that calls train and evaluate will create two new graphs on every iteration.
One can't easily evaluate while training (there are workarounds, such as train_and_evaluate, sketched below, but this doesn't look very nice).
I tend to agree that having the same graph and model for all actions is convenient, and I usually go with this solution. But in a lot of cases, when using a high-level API like tf.estimator.Estimator, you don't deal with the graph and variables directly, so you shouldn't care how exactly the model is organized.
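For reference, the train_and_evaluate workaround mentioned above looks roughly like this (my_model_fn, train_input_fn and eval_input_fn are placeholders for your own functions); it alternates training and evaluation for you, but each phase still rebuilds the graph and restores from the latest checkpoint:

import tensorflow as tf

estimator = tf.estimator.Estimator(model_fn=my_model_fn, model_dir="/tmp/model_dir")

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,
                                  steps=None,         # None = evaluate on the whole eval set
                                  throttle_secs=600)  # at most one evaluation every 10 minutes

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)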
How can I extend the Horovod example that uses tf.train.MonitoredTrainingSession to instead use tf.estimator.Estimator? I am using Tensorflow 1.4.0.
Here is an example that closely resembles my current code.
I want to use this together with hyperopt, and I like how I can easily do something like
tf.contrib.learn.learn_runner.run(
experiment_fn=_create_my_experiment,
run_config=run_config,
schedule="train_and_evaluate",
hparams=hparams)
to train with different hyperparameters, hparams. This also gives me separate TensorBoard log directories for the training and validation sets, and I'd like this to hold for a Horovod solution as well. I played around with a tf.train.SingularMonitoredSession(hooks=hooks, config=config) where hooks contains a tf.train.SummarySaverHook, but I could only make it work nicely with the training set.
A TensorFlow Estimator example has been added to the Horovod repo.
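In the meantime, the gist of adapting an Estimator setup to Horovod looks roughly like the following sketch (my_network and train_input_fn are placeholders for your own code, and only the training path is covered): initialize Horovod, pin each process to one GPU, wrap the optimizer, broadcast the initial variables from rank 0, and let only rank 0 write checkpoints and summaries.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin this process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

def model_fn(features, labels, mode):
    logits = my_network(features)  # your model here
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    # Scale the learning rate by the number of workers and wrap the optimizer.
    optimizer = tf.train.AdamOptimizer(1e-3 * hvd.size())
    optimizer = hvd.DistributedOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

# Only rank 0 writes checkpoints and TensorBoard summaries, so they don't clash.
model_dir = "/tmp/model_dir" if hvd.rank() == 0 else None

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=model_dir,
    config=tf.estimator.RunConfig(session_config=config))

# Broadcast the initial variables from rank 0 to all other processes.
bcast_hook = hvd.BroadcastGlobalVariablesHook(0)
estimator.train(input_fn=train_input_fn, steps=10000 // hvd.size(), hooks=[bcast_hook])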
I'm studying different object detection algorithms for my interest.
The main reference is Andrej Karpathy's object detection slides here.
I would like to start from some reference, in particular something that allows me to directly test some of the networks mentioned there on my data (mainly consisting of onboard cameras from car and bike races).
Unfortunately, I have already used some pretrained networks (a repo forked from JunshengFu's, where I slightly adapted YOLO to my use case), but the classification accuracy is rather poor, I guess because there were not many training instances of racing cars like Formula 1.
For this reason I would like to retrain the networks, and this is where I'm finding the most issues:
properly training some of the networks requires hardware (powerful GPUs) or time that I don't have, so I was wondering whether I could retrain just some part of the network, in particular the classification part, and whether there is any repo that already allows that.
Thank you in advance
That is called fine-tuning of the network, or transfer learning. Basically you can do that for any network you find (with a similar problem domain, of course), and then, depending on the amount of data you have, you either fine-tune the whole network or freeze some layers and train only the last ones. In your case you would probably need to freeze the whole network except the last fully-connected layers (which you will actually replace with new ones matching your number of classes), which perform the classification. I don't know what library you use, but TensorFlow has an official tutorial on transfer learning. However, it's not very clear, to be honest.
A more user-friendly tutorial by an enthusiast can be found here: tutorial. There you can also find a code repository. One correction you need, though, is that the author fine-tunes the whole network, while if you want to freeze some layers you will need to get the list of trainable variables, remove those you want to freeze, and pass the resulting list to the optimizer (so it ignores the removed variables), like the following:
all_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='InceptionResnetV2')
to_train = all_vars[-6:]  # better to pick them by name explicitly, but this still works
optimizer = tf.train.AdamOptimizer(learning_rate=0.0001)
train_op = slim.learning.create_train_op(total_loss, optimizer, variables_to_train=to_train)
Furthermore, TensorFlow has a so-called model zoo (a collection of trained models you can use for your purposes and for transfer learning). You can find it here.