I'm a TensorFlow newbie and I'm trying to train a one-class model for object detection. In particular, I'm trying to recognize an arrow like the following:
I need very fast recognition, so I started wondering whether a pre-trained model might already contain such a shape.
Unfortunately I didn't find anything similar, and therefore I started my own training for the arrow, using faster_rcnn_inception_v2_coco_2018_01_28 as the model.
I'm using its pipeline config, and I'm using its fine_tune_checkpoint as well. Is this right, considering that I have to train a completely different object?
The result is a model with very good accuracy but very low speed. I need to increase the framerate, and I don't yet understand whether a lower "training loss" means a higher "object recognition speed" or not.
Any suggestions on how I could speed up the detection?
I'm using its pipeline config, and I'm using its fine_tune_checkpoint as well. Is this right, considering that I have to train a completely different object?
Yes! Whenever you want to change the output of a deep NN, you should start from a pretrained model. Training a model from scratch can take several weeks, and you will never be able to generate enough data on your own. Taking a pretrained model and fine-tuning it is the way to go.
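Concretely, in the pipeline config you typically only change the class count and point fine_tune_checkpoint at the downloaded checkpoint. A rough sketch (paths are placeholders, unrelated fields omitted):

    model {
      faster_rcnn {
        num_classes: 1   # just the arrow class
      }
    }
    train_config {
      fine_tune_checkpoint: "faster_rcnn_inception_v2_coco_2018_01_28/model.ckpt"
      from_detection_checkpoint: true
    }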
I don't yet understand whether a lower "training loss" means a higher "object recognition speed" or not.
No. Training loss just tells you how well your model performs on the training set.
The issue you are having is a classic speed vs. accuracy trade-off. I encourage you to take a look at this table and find a model which is fast enough for you (i.e. lowest run-time) but has decent accuracy. I would check SSD here first.
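To compare candidates concretely, you can time the detection call itself on your own hardware. A minimal sketch, assuming a TF2 SavedModel export of the detector (the path and input size are placeholders):

    import time
    import numpy as np
    import tensorflow as tf

    # Load the exported detection model (path is a placeholder).
    detect_fn = tf.saved_model.load("exported_model/saved_model")

    # Dummy input at the resolution you actually use at inference time.
    image = tf.constant(np.random.randint(0, 255, (1, 600, 600, 3), dtype=np.uint8))

    detect_fn(image)  # warm-up run
    runs = 50
    start = time.time()
    for _ in range(runs):
        detect_fn(image)
    print("approx FPS:", runs / (time.time() - start))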
The result is a model with very good accuracy but very low speed.
How many FPS does your algorithm achieve?
Since you already have a prepared dataset, I would suggest using Tiny-YOLO, which performs at 244 FPS on the COCO dataset: https://pjreddie.com/darknet/yolo/
Preparing a training dataset for Tiny-YOLO is very easy if you use this repository (the expected label format is sketched below).
And
I don't yet understand whether a lower "training loss" means a higher "object recognition speed"
Training loss has nothing to do with inference speed.
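On the dataset-preparation point: Darknet/Tiny-YOLO expects one .txt label file per image, with one line per box in the form "class x_center y_center width height", all normalized to [0, 1]. A small conversion sketch, assuming you already have pixel-coordinate boxes (file names are placeholders):

    def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
        """Convert a pixel-coordinate box to a Darknet/YOLO label line."""
        x_center = (xmin + xmax) / 2.0 / img_w
        y_center = (ymin + ymax) / 2.0 / img_h
        width = (xmax - xmin) / img_w
        height = (ymax - ymin) / img_h
        return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

    # Example: one arrow box in a 640x480 image, written next to the image file.
    with open("image_0001.txt", "w") as f:
        f.write(to_yolo_line(0, 100, 150, 220, 300, 640, 480) + "\n")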
Related
I am trying to train a custom object detector. Is there a limit on the number of target classes that the yolov5 architecture can be trained on?
For example, the COCO dataset has 80 target classes; suppose I have 500 object types to detect, is it advisable to use yolov5?
Can this be explained with reasons?
You can add as many classes as you want to any network.
The yolo architecture is known for prioritizing inference time over accuracy. Although it achieves good results on conventional datasets, the yolo model is built for speed.
But essentially you want a network that has a good backbone (deep and wide) that can really obtain rich features from your image.
From my experience, there is really no straightforward answer. It also depends on your dataset, e.g. whether you have large/medium/small objects to detect. I really recommend trying out different models, because every single model will perform differently on custom datasets; from there you select the best one. The state-of-the-art model is not necessarily the best model for transfer learning and fine-tuning.
For me, YOLO and the other single-shot detectors were the ones that worked best for fine-tuning (RetinaNet was best for my use cases so far). They are good for hyperparameter tuning because you can train them fast and test what works and what doesn't. With two-stage detectors (Faster-RCNN etc.) I never achieved good overall results, mainly because the training process is different and much slower.
I recommend reading this article; it explains both architecture types with their pros and cons.
Additionally, if you want to train a model for more than 500 classes, the Tensorflow Object Detection API has pre-trained models for the OpenImages dataset (600 classes), and there is Detectron2 on the LVIS dataset (1200 classes). I recommend starting with models that were trained on a higher number of classes if you want to fine-tune to a similar number of classes in your dataset.
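To see why the class count itself is not a hard limit: in a YOLO-style head it only changes the number of output channels, roughly num_anchors * (5 + num_classes) per grid cell. A toy Keras sketch (the backbone, input size, and anchor count are placeholders, not yolov5's actual implementation):

    import tensorflow as tf

    num_classes = 500  # e.g. your 500 object types
    num_anchors = 3    # placeholder anchor count per grid cell

    # Placeholder backbone; any feature extractor would do here.
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=(416, 416, 3), include_top=False, weights=None)

    # YOLO-style head: per anchor, 4 box coordinates + 1 objectness + class scores.
    head = tf.keras.layers.Conv2D(num_anchors * (5 + num_classes), kernel_size=1)

    model = tf.keras.Model(backbone.input, head(backbone.output))
    print(model.output_shape)  # (None, 13, 13, 1515)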
I tried image classification using a trained model and it's working well, but some images are not recognized correctly; in those cases I have to get the image and its label from users. So my question is: is it possible to add new data to an already trained model?
No, during inference time you use the weights of the trained model for predictions, which basically means that at the time your model is deployed the capabilities of your image classifier are fixed by the weights. If you wish to improve your model, you would have to retrain it with the new data. However, there is another paradigm of learning called "online learning" where the model is continuously learning and modifying the weights. In this case your weights are not fixed and your model keeps updating them with each training input. However, afaik this is not usually recommended for CNNs, because the backward pass of gradients is computationally intensive and your inference will be slow because of it.
No model can predict with 100% accuracy; if it did, it would be an ideal model. If you want to add more data to your trained model, you have to retrain the model with the new data. Having more data is always a good idea: it lets the data speak for itself instead of relying on assumptions and weak correlations, and more data results in better, more accurate models. So if you want better accuracy, you have to train your model with more data. Without retraining, you can't add data to your trained model.
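In practice, retraining on the corrected samples can be as simple as loading the saved model and continuing training on the new (or combined) data. A rough Keras sketch, with the model path and the new samples as placeholders:

    import numpy as np
    import tensorflow as tf

    # Load the previously trained classifier (path is a placeholder).
    model = tf.keras.models.load_model("my_classifier.h5")

    # Placeholder stand-ins for the user-provided images and labels;
    # in practice, preprocess them exactly like the original training data.
    new_images = np.random.rand(64, 224, 224, 3).astype("float32")
    new_labels = np.random.randint(0, 2, size=(64,))

    # Continue training from the existing weights, usually with a low learning rate.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(new_images, new_labels, batch_size=16, epochs=5)
    model.save("my_classifier_updated.h5")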
I am trying out an object detection model on a custom dataset. I want it to recognize a specific piece of metal from my garage. I took about 32 photos and labelled them. The training goes well, but the loss only gets down to about 10%; after that it decreases very slowly, so I have to stop it. I then ran the model on a camera, but it has no accuracy. Could it be because I have only 32 images of the object? I have tried YoloV2 and Faster RCNN.
It is unlikely that your model has no accuracy on the camera just because you only have 32 images.
Anyway, you previously got down to about 10% loss (which seems to be about 90% accuracy), so it should work; I think the problem is not the number of images.
After training your model you need to save the weights of the trained model.
Make sure that you deployed the trained model, and that you are not running an untrained model from scratch.
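As a generic illustration of that point (a Keras-style sketch with a toy model; with YoloV2 or the Object Detection API the equivalent step is exporting and restoring the trained checkpoint rather than these exact calls):

    import tensorflow as tf

    # Toy model standing in for your detector.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # ... train ...
    model.save_weights("trained_weights.h5")

    # Later, at inference time: rebuild the same architecture and
    # load the trained weights before running on the camera feed.
    model.load_weights("trained_weights.h5")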
Just labeling will not help in object detection. What you are doing is image classification but expecting results of object detection.
Object detection requires bounding box annotations and changes in the loss function which is to be fed to the model during each backpropagation step.
You need some tools to do data annotation first, then adapt your YoloV2/Faster-RCNN code along with the loss function. Train it well and try using image augmentation to generate some more images, because 32 images are too few; otherwise you might fall into the pitfall of getting high training accuracy but low test accuracy, since training models on too few images often leads to unexpected overfitting.
Only then should you try running it on the camera.
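On the augmentation suggestion: with only 32 images you can cheaply generate variations on the fly. A minimal tf.image sketch (the transforms are examples only; for detection you must also transform the bounding boxes accordingly, which this toy snippet omits, and the path is a placeholder):

    import tensorflow as tf

    def augment(image):
        """Simple photometric/geometric augmentations for a single image tensor."""
        image = tf.image.random_flip_left_right(image)
        image = tf.image.random_brightness(image, max_delta=0.2)
        image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
        return image

    # Build an augmented dataset from a folder of images.
    files = tf.data.Dataset.list_files("dataset/images/*.jpg")
    images = files.map(lambda p: tf.image.decode_jpeg(tf.io.read_file(p), channels=3))
    augmented = images.map(augment)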
I have an image classification task to solve, but the conditions are quite simple/favorable:
There are only two classes (either good or not good)
The images always show the same kind of piece (either with or w/o fault)
That piece is always filmed from the same angle & distance
I have at least 1000 sample images for both classes
So I thought it should be easy to come up with a good CNN solution - and it was. I created a VGG16-based model with a custom classifier (Keras/TF). Via transfer learning I was able to achieve up to 100% validation accuracy during model training, so all is fine on that end.
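Roughly, the setup looks like this (simplified sketch; the exact classifier head and hyperparameters here are placeholders, not the precise values I used):

    import tensorflow as tf

    # Frozen VGG16 backbone with a small custom classifier on top.
    base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                       input_shape=(224, 224, 3))
    base.trainable = False

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # good / not good
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])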
Out of curiosity, and because the VGG-based approach seems a bit "slow", I also wanted to try a more modern model architecture as the base, so I did the same with ResNet50V2 and Xception. I trained them similarly to the VGG-based model and tried several hyperparameter modifications etc. However, I was not able to achieve a validation accuracy better than 95% - so much worse than with the "old" VGG architecture.
Hence my question is: given these "simple" (always the same) images and only two classes, is the VGG model probably a better base than a modern network like ResNet or Xception? Or is it more likely that I messed something up in my model, or simply didn't get the training/hyperparameters right?
I trained a custom person detector using Tensorflow and Inception's pretrained model. After a few thousand steps and an average loss of around 2-1, I stopped the training and tested it on a live video. The result was quite good, with only a few false positives. It could detect some people but not everyone, so I decided to continue training the model until I got an average loss below 1, then tested it again. It now detects almost everything as a person, even the whole frame of the video, even when there is no object present. The model seems to work great on pictures but not on videos. Is that overfitting?
Sorry, I forgot how many steps it was. I accidentally deleted the training folder that contained the ckpt and tfevents files.
edit: I forgot that I am also training the same model with the same dataset but a higher batch size in the cloud as a backup, which is now at a higher step count. I'll edit the post later and provide the info from TensorBoard once I've finished downloading and testing the model from the cloud.
edit2: I downloaded the model trained for 200k steps from the cloud and it is working; it detects people, but sometimes recognizes the whole frame as "person" for less than a second when I am moving the camera. I guess this could be improved by continuing to train the model.
Total Loss on tensorboard
For now, I'll just continue the training on the cloud and try to document every result of my tests. I'll also try to resize some images in my dataset, train it on my local machine using MobileNet, and compare the results from the two models.
As you say the model did well with fewer training iterations, I guess the pre-trained model could already detect the person class, and your training set made the detection worse.
The model seems to work great on pictures but not on videos
If your single pictures are detected fine, then videos should work too. The only difference can come from the video's resolution and quality, so compare the resolution of your images with that of the video frames.
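A quick way to check that (a small OpenCV sketch; the file paths are placeholders):

    import cv2

    # Compare the resolution of a test image with the video frames.
    image = cv2.imread("test_image.jpg")
    print("image shape:", image.shape)  # (height, width, channels)

    cap = cv2.VideoCapture("test_video.mp4")
    ok, frame = cap.read()
    if ok:
        print("video frame shape:", frame.shape)
    cap.release()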
Is that overfitting?
Regarding the images and videos you are talking about: if the images were used in training, you should not use them to evaluate the model. If the model is overfitted, it will detect the training images but not any other ones.
As you say, the model produces too many detections. I think this is not because of overfitting; it may be about your dataset. I think:
You may have too little data to train on.
The network may be too big and complicated for the amount of data; try a smaller network like VGG, inception_v1 (SSD MobileNet), etc.
The image resolution used in the training set may be very different from that of the evaluation images.
Learning rate is important, but I think in your case it's fine.
I think you should carefully check the dataset you used for training and use as much data as you can. These are the things I have generally run into and wasted time on.