I have been training a custom object detector using the TensorFlow Object Detection API (network: SSD MobileNet V1). I have seen screenshots of TensorBoard in which the accuracy of the network is shown; however, I get a bunch of other metrics displayed, but no accuracy. Are there any specific steps that need to be taken to display the accuracy in TensorBoard?
I am using the updated model_main.py and the following Python command:
python model_main.py \
--pipeline_config_path=training/ssd_mobilenet_v1_coco.config \
--model_dir=training \
--num_train_steps=560000 \
--num_eval_steps=3 \
--alsologtostderr
How long have you been training it for?
It will run for a number of steps before doing the first evaluation. Precision and recall data should show up once you wait a bit longer.
To display precision and recall in TensorBoard, you should run this command after training your model, so that the training folder already contains the checkpoints of your trained model:
python model_main_tf2.py \
--pipeline_config_path=training/ssd_mobilenet_v1_coco.config \
--model_dir=training \
--checkpoint_dir=training
This command will generate a folder named eval inside the training folder. To display the results, point TensorBoard at the path to that eval folder (training/eval).
Note: this command is for the TensorFlow 2 Object Detection API.
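For example, here is a minimal sketch (assuming the layout above, where the eval events land in training/eval) that starts TensorBoard on that folder from Python:
# Point TensorBoard at the eval events written by the eval-only run above.
# Assumes checkpoints live in ./training and eval events in ./training/eval.
import subprocess
subprocess.run(["tensorboard", "--logdir=training/eval"])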
Related
I am using the TensorFlow 2.x Object Detection API. I have trained a deep learning model from the model zoo on my dataset. I am using Google Colab. After training, I now want to evaluate my model. I am using the COCO detection metrics. I used the following script to evaluate my model:
!python3 model_main_tf2.py \
--model_dir=path/to/model_directory \
--pipeline_config_path=path/to/pipeline_config_file \
--checkpoint_dir=path/to/checkpoint_directory
After running the above code, I get the mean average precision (mAP) and average recall (AR) for the latest checkpoint on my test set. But for academic purposes, I want to get these metrics for all the checkpoints to get a graph of how my model has improved over time. Is there a possible way to do that? Or is it possible to train and evaluate at the same time in the TensorFlow 2 Object Detection API? I am a beginner in this field, so kindly help me out with this issue. Thank you.
I am facing the same problem, so I had an idea: we can run the model_main_tf2.py command you mentioned to evaluate the model, but change which checkpoint gets evaluated by editing the first line of the checkpoint file:
model_checkpoint_path: "ckpt-1"
then
model_checkpoint_path: "ckpt-2"
then
model_checkpoint_path: "ckpt-3"
...
For each checkpoint you will get a .tfevents file, so you can then open TensorBoard pointing to the directory that contains all the .tfevents files and see how the model improves over time.
I only saved the last 3 checkpoints on my computer, so I can't see the progress from the beginning (my fault), but if you have all the checkpoints, try what I suggest.
See my graph evaluating the last 3 checkpoints.
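If you kept all the checkpoints, the manual editing can also be scripted. Here is a rough sketch of that idea (mine, not from the thread; the folder name, checkpoint count, and eval_timeout value are placeholders to adapt):
# Evaluate each checkpoint in turn by rewriting the first line of the
# `checkpoint` file and re-running model_main_tf2.py in eval-only mode.
# Every run appends a .tfevents file, so TensorBoard can later plot the
# metrics over all checkpoints.
import os
import subprocess

MODEL_DIR = "training"                 # placeholder: your model directory
PIPELINE = "training/pipeline.config"  # placeholder: your pipeline config
NUM_CHECKPOINTS = 10                   # placeholder: how many ckpt-N you kept

for i in range(1, NUM_CHECKPOINTS + 1):
    # Point the checkpoint file at ckpt-i, as described above.
    with open(os.path.join(MODEL_DIR, "checkpoint"), "w") as f:
        f.write('model_checkpoint_path: "ckpt-%d"\n' % i)
    # Evaluate just this checkpoint; a short --eval_timeout makes the run
    # exit soon after the evaluation instead of waiting for new checkpoints.
    subprocess.run([
        "python", "model_main_tf2.py",
        "--pipeline_config_path=" + PIPELINE,
        "--model_dir=" + MODEL_DIR,
        "--checkpoint_dir=" + MODEL_DIR,
        "--eval_timeout=60",
    ], check=True)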
You should have an eval directory including an events.out.tfevents file under your model directory. You can run !tensorboard --logdir=path/to/eval/directory to access the graphs.
You can run training with the same snippet you have, except without the checkpoint_dir, and you can open another terminal to run evaluation like you're currently doing.
I have been trying to train an object detection model for the past 2 months and have finally succeeded by following this tutorial.
Here is my colab which contains all my work.
The problem is, the training loss is shown, and it is decreasing on average, but the validation loss is not.
In the pipeline.config file, I did input the evaluation TFRecord file (which I assumed to be the validation data input), like this:
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
}
I also read through model_main_tf2.py, which does not seem to evaluate while training, but only evaluates when checkpoint_dir is provided.
Hence, I have only been able to monitor the loss on the training set and not the loss on the validation set.
As a result, I have no clue about overfitting or underfitting.
Have any of you managed to use model_main_tf2.py successfully to view validation loss?
Also, it would be nice to see the mAP score with training.
I know Keras training allows all these things to be seen on TensorBoard, but the OD API seems to be much harder.
Thank you for your time, if you are still confused about something please let me know.
You have to open another terminal and run this command:
python model_main_tf2.py \
--model_dir=models/my_ssd_resnet50_v1_fpn \
--pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config \
--checkpoint_dir=models/my_ssd_resnet50_v1_fpn
This API tutorial is unclear on that topic. I had the exact same issue.
It turns out that the evaluation process is not included in the training loop; you must launch it in parallel.
It will wait and print "Waiting for new checkpoint", which means that when you launch a training run with:
python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config # note that the checkpoint_dir argument is not there
It's going to run the evaluation once every eval_interval_secs in your eval_config.
According to the documentation, the eval metrics will then be stored next to your checkpoints, inside an eval_0 directory, which you will be able to plot in TensorBoard.
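In case it helps, here is a rough sketch (mine, not from the docs) of launching both processes from a single Python script instead of two terminals; the paths are the placeholders from the commands above, and you may still want to pin each process to its own GPU (e.g. via CUDA_VISIBLE_DEVICES) for the memory reason mentioned below:
# A rough sketch of running training and the eval-only loop side by side.
import subprocess

MODEL_DIR = "models/my_ssd_resnet50_v1_fpn"
PIPELINE = "models/my_ssd_resnet50_v1_fpn/pipeline.config"

# Training: no --checkpoint_dir, so the script trains and writes checkpoints.
train = subprocess.Popen([
    "python", "model_main_tf2.py",
    "--model_dir=" + MODEL_DIR,
    "--pipeline_config_path=" + PIPELINE,
])

# Evaluation: --checkpoint_dir switches the script to eval-only mode; it polls
# the model directory for new checkpoints and writes eval event files there.
evaluation = subprocess.Popen([
    "python", "model_main_tf2.py",
    "--model_dir=" + MODEL_DIR,
    "--pipeline_config_path=" + PIPELINE,
    "--checkpoint_dir=" + MODEL_DIR,
])

train.wait()
evaluation.terminate()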
I do agree that this was a bit hard to understand, as it is not very clear in the documentation, and it is not very convenient either, since I had to allocate another GPU for the evaluation to avoid the CUDA out-of-memory issue.
Have a nice day
I would like to evaluate a custom-trained Tensorflow object detection model on a new test set using Google Cloud.
I obtained the initial checkpoints from:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
I know that the Tensorflow object-detection API allows me to run training and evaluation simultaneously by using:
https://github.com/tensorflow/models/blob/master/research/object_detection/model_main.py
To start such a job, I submit the following ML Engine job:
gcloud ml-engine jobs submit training [JOBNAME] \
    --runtime-version 1.9 \
    --job-dir=gs://path_to_bucket/model-dir \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-central1 \
    --config object_detection/samples/cloud/cloud.yml \
    -- \
    --model_dir=gs://path_to_bucket/model_dir \
    --pipeline_config_path=gs://path_to_bucket/data/model.config
However, after I have successfully transfer-trained a model, I would like to calculate performance metrics such as COCO mAP (http://cocodataset.org/#detection-eval) or PASCAL mAP (http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf) on a new test data set which has not been previously used (neither during training nor during evaluation).
I have seen that there is a flag in model_main.py:
flags.DEFINE_string(
    'checkpoint_dir', None, 'Path to directory holding a checkpoint. If '
    '`checkpoint_dir` is provided, this binary operates in eval-only mode, '
    'writing resulting metrics to `model_dir`.')
But I don't know whether this really implies that model_main.py can be run in an exclusive evaluation mode. If yes, how should I submit the ML Engine job?
Alternatively, are there any functions in the Tensorflow API which allow me to evaluate an existing output dictionary (containing bounding boxes, class labels, scores) based on COCO and/or PASCAL mAP? If there are, I could easily read in a Tensorflow record file locally, run inference and then evaluate the output dictionary.
I know how to obtain these metrics for the evaluation data set that is evaluated during training in model_main.py. However, from my understanding I should still report model performance on a new test data set, since I compare multiple models and implement some hyper-parameter optimization, and thus I should not report on the evaluation data set, am I right? On a more general note: I really cannot comprehend why one would switch from separate training and evaluation (as in the legacy code) to a combined training and evaluation script.
Edit:
I found two related posts. However I do not think that the answers provided are complete:
how to check both training/eval performances in tensorflow object_detection
How to evaluate a pretrained model in Tensorflow object detection api
The latter has been written while TF's object detection API still had separate evaluation and training scripts. This is not the case anymore.
Thank you very much for any help.
If you specify the checkpoint_dir and set run_once to true, then it should run evaluation exactly once on the eval dataset. I believe the metrics will be written to the model_dir and should also appear in your console logs. I usually just run this on my local machine (since it's just doing one pass over the dataset) and it is not a distributed job. Unfortunately I haven't tried running this particular codepath on CMLE.
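To make that concrete, here is a hedged sketch of the local eval-only invocation described above (the GCS paths are the placeholders from the question; to score a brand-new test set you would first point the eval_input_reader in the pipeline config at that set's TFRecord):
# Run model_main.py locally in eval-only mode: --checkpoint_dir enables
# eval-only operation, and --run_once makes it do a single pass over the
# eval input instead of waiting for new checkpoints.
import subprocess

subprocess.run([
    "python", "object_detection/model_main.py",
    "--pipeline_config_path=gs://path_to_bucket/data/model.config",
    "--model_dir=gs://path_to_bucket/model_dir",
    "--checkpoint_dir=gs://path_to_bucket/model_dir",
    "--run_once=True",
], check=True)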
Regarding why we have a combined script... from the perspective of the Object Detection API, we were trying to write things in the tf.Estimator paradigm --- but you are right that personally I found it a bit easier when the two functionalities lived in separate binaries. If you want, you can always wrap up this functionality in another binary :)
I have used Keras to fine-tune MobileNet V1. Now I have model.h5 and I need to convert it to TensorFlow Lite to use it in an Android app.
I use the TFLite conversion script tflite_convert. I can convert it without quantization, but I need more performance, so I need to apply quantization.
If I run this script:
tflite_convert --output_file=model_quant.tflite \
--keras_model_file=model.h5 \
--inference_type=QUANTIZED_UINT8 \
--input_arrays=input_1 \
--output_arrays=predictions/Softmax \
--mean_values=128 \
--std_dev_values=127 \
--input_shape="1,224,224,3"
It fails:
F tensorflow/contrib/lite/toco/tooling_util.cc:1634] Array
conv1_relu/Relu6, which is an input to the DepthwiseConv operator
producing the output array conv_dw_1_relu/Relu6, is lacking min/max
data, which is necessary for quantization. If accuracy matters, either
target a non-quantized output format, or run quantized training with
your model from a floating point checkpoint to change the input graph
to contain min/max information. If you don't care about accuracy, you
can pass --default_ranges_min= and --default_ranges_max= for easy
experimentation.
Aborted (core dumped)
If I use default_ranges_min and default_ranges_max (so-called "dummy quantization"), it works, but the result is only useful for debugging performance without accuracy, as described in the error log.
So what do I need to do to make the Keras model correctly quantizable? Do I need to find the best default_ranges_min and default_ranges_max? How? Or does it require changes in the Keras training phase?
Library versions:
Python 3.6.4
TensorFlow 1.12.0
Keras 2.2.4
Unfortunately, TensorFlow does not yet provide tooling for post-training per-layer quantization in flatbuffer (TFLite), only in protobuf. The only available way right now is to introduce fakeQuantization layers in your graph and re-train / fine-tune your model on the training set or a calibration set. This is called "quantization-aware training".
Once the fakeQuant layers are introduced, you can feed in the training set, and TF will use them in the feed-forward pass as simulated quantization layers (fp32 datatypes that represent 8-bit values) while back-propagating with full-precision values. This way, you can recover the accuracy loss caused by quantization.
In addition, the fakeQuant layers will capture the per-layer or per-channel ranges through a moving average and store them in min/max variables.
Later, you can extract the graph definition and get rid of the fakeQuant nodes through the freeze_graph tool.
Finally, the model can be fed into tf_lite_converter (fingers crossed it won't break) to extract the uint8 tflite model with the captured ranges.
A very good white paper explaining all of this is provided by Google here: https://arxiv.org/pdf/1806.08342.pdf
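For reference, here is a very rough sketch of that flow with TF 1.12's tf.contrib.quantize (my own illustration, not a drop-in recipe; wiring it into a loaded .h5 model can need extra care around batch-norm folding and variable initialisation):
import tensorflow as tf
from tensorflow import keras

K = keras.backend

# 1) Load the fine-tuned floating-point Keras model.
model = keras.models.load_model("model.h5")

# 2) Rewrite the graph with fakeQuant nodes that collect min/max ranges
#    while training.
sess = K.get_session()
tf.contrib.quantize.create_training_graph(input_graph=sess.graph, quant_delay=0)

# 3) Initialise only the variables added by the rewrite (the min/max trackers),
#    then fine-tune briefly so the ranges settle.
quant_vars = [v for v in tf.global_variables() if "quant" in v.name]
sess.run(tf.variables_initializer(quant_vars))
# model.fit(train_images, train_labels, epochs=1)

# 4) For export: rebuild the inference graph with
#    tf.contrib.quantize.create_eval_graph(), freeze it with freeze_graph,
#    and only then run tflite_convert with --inference_type=QUANTIZED_UINT8.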
Hope that helps.
I'm new to TensorFlow. I am currently working on a project that uses the TF Object Detection API. I'm training a model with two classes on my custom images. So far I have successfully run train.py and eval.py and executed TensorBoard at the same time to see how the training process is progressing.
Here is the image of my work:
How do I display a graph in which I can see the accuracy of the model being developed?
Any help is appreciated!
You can do this by switching from the deprecated train.py and eval.py to the more recent model_main.py. It interleaves an evaluation schedule into the training session, so you can evaluate the training progress of the model without doing so manually.
The flags of model_main.py are very similar to those of train.py, and you can see an example here:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md#running-the-training-job
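For reference, a rough sketch of the invocation that guide describes (paths and step counts are placeholders; run it from the models/research directory):
# One job alternates training with periodic evaluation; point TensorBoard at
# --model_dir and both the training and eval curves (including mAP) show up.
import subprocess

subprocess.run([
    "python", "object_detection/model_main.py",
    "--pipeline_config_path=path/to/your/pipeline.config",
    "--model_dir=path/to/model_dir",
    "--num_train_steps=50000",
    "--sample_1_of_n_eval_examples=1",
    "--alsologtostderr",
], check=True)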