I'm having trouble finding the best network and configuration to detect small-scale objects. So far I have gotten very low mAPs on small objects (I am trying to detect traffic signs using the Mapillary dataset).
I have tried Faster R-CNN with a ResNet-101 backbone (resizing the input to 1024) and SSD ResNet-101 with FPN (resizing the input to 1024).
I did not find a pre-trained Faster R-CNN model with FPN, so I could not try that.
What do you think would be the best network and configuration to detect small objects?
Thank you.
The models you mentioned are built for speed. With small object detection, you usually care more about the accuracy of the model, so you should probably use bigger models that sacrifice speed for accuracy (mAP). If you want to use TensorFlow 2, here is an overview of the available models. Also, for small object detection you should keep the resolution high, as you said. You could also split each image into multiple crops and run detection on each portion, as in the sketch below.
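For instance, here is a minimal tiling sketch (assuming PIL is installed; the tile size and overlap are illustrative values, not tuned recommendations):

```python
# Minimal image-tiling sketch: split a large image into overlapping crops
# so small objects keep more pixels when each crop is fed to the detector.
# tile=512 and overlap=128 are illustrative, not tuned recommendations.
from PIL import Image

def tile_image(path, tile=512, overlap=128):
    img = Image.open(path)
    w, h = img.size
    step = tile - overlap
    tiles = []
    for top in range(0, max(h - overlap, 1), step):
        for left in range(0, max(w - overlap, 1), step):
            box = (left, top, min(left + tile, w), min(top + tile, h))
            tiles.append((box, img.crop(box)))
    # keep each box offset so detections can be mapped back to
    # full-image coordinates afterwards
    return tiles
```

Remember to shift each crop's detections back by its (left, top) offset and to run non-maximum suppression over the merged results, since objects in the overlap regions will be detected twice.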
So I disagree with @Akash Desai about SSD, but I also think Detectron2 keeps more up to date with state-of-the-art models for better performance. So if you don't care about the framework, maybe switch to Detectron2.
SSD is good at detecting small as well as large targets, because it makes predictions on each of its feature maps at several scales.
You resized the images to 1024? In that case the model will take more time to train on the dataset, so keep the image size small, e.g. 460x460.
You can also try Detectron2; it's faster and simpler than TensorFlow.
https://colab.research.google.com/github/Tony607/detectron2_instance_segmentation_demo/blob/master/Detectron2_custom_coco_data_segmentation.ipynb
Related
I have a real-time problem where the goal is to detect 9 object classes. As far as I understand, YOLO has promising results on real-time object detection problems, so I am searching for good instructions on training a pre-trained YOLO model with my own custom dataset.
I have my dataset and it is already labeled, with bounding-box coordinates in .txt files in YOLO format. However, it is hard to find good instructions on the web about YOLO custom dataset training for your own object detection problem, since most instructions use generic datasets such as COCO or PASCAL, or are not detailed enough to implement an object detection model on your own dataset.
TL;DR
My question is: are there some handy instructions for implementing YOLO object detection on your own dataset? I am looking for frameworks to implement the YOLO model rather than the Darknet C implementation, since I am more familiar with Python, so it would be perfect if you could provide a PyTorch or TensorFlow implementation.
It would be even more appreciated if you have already implemented YOLOv3/v4 on your own dataset with the help of instructions you found on the web and are willing to share them.
Thanks in advance.
For training purposes I would highly recommend AlexeyAB's repository, as it's highly optimised for accuracy and speed, although it is also written in C. As far as testing and deployment are concerned, you have a lot of options:
OpenCV's DNN module: refer to this article.
Tensorflow Model
Pytorch Model
Out of these, OpenCV's DNN implementation is the fastest for testing/inference.
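As a starting point, here is a minimal inference sketch with OpenCV's DNN module; the file names yolov4.cfg, yolov4.weights, and test.jpg are placeholders for a Darknet model trained with AlexeyAB's repository:

```python
# Minimal YOLO inference via OpenCV's DNN module (OpenCV >= 4.x).
# The .cfg/.weights/.jpg file names below are placeholders.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")

img = cv2.imread("test.jpg")
h, w = img.shape[:2]

# Darknet models expect RGB input scaled to [0, 1]
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

for out in outputs:
    for det in out:  # det = [cx, cy, bw, bh, objectness, per-class scores...]
        scores = det[5:]
        cls = int(np.argmax(scores))
        conf = float(scores[cls])
        if conf > 0.5:
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            print(cls, conf, (cx - bw / 2, cy - bh / 2, bw, bh))
```

For real use you would also run cv2.dnn.NMSBoxes over the collected boxes to suppress overlapping detections.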
I am trying to train a custom object detector. Is there a limit on the number of target classes that the YOLOv5 architecture can be trained on?
For example, the COCO dataset has 80 target classes; suppose I have 500 object types to detect, is it advisable to use YOLOv5?
Can this be explained with reasons?
You can add as many classes as you want to any network.
The YOLO architecture is known for prioritizing inference time over raw accuracy. Although it achieves good results on conventional datasets, the YOLO model is built for speed.
But essentially you want a network with a good backbone (deep and wide) that can extract rich features from your images.
From my experience, there is really no straightforward answer. It also depends on your dataset, e.g. whether you have large/medium/small objects to detect. I really recommend trying out different models, because every model performs differently on custom datasets; from there you select the best one. The state-of-the-art model is not necessarily the best model for transfer learning and fine-tuning.
YOLO and the other single-shot detectors, for me, were the ones that worked best for fine-tuning purposes (RetinaNet was best for my use cases so far). They are good for hyperparameter tuning because you can train them fast and test what works and what doesn't. With two-stage detectors (Faster R-CNN etc.) I never achieved overall good results, mainly because the training process is different and much slower.
I recommend you read this article; it explains both architecture types, with their pros and cons.
Additionally, if you want to train a model for more than 500 classes, the TensorFlow Object Detection API has pre-trained models for the OpenImages dataset (600 classes), and Detectron2 has models trained on the LVIS dataset (~1200 classes). I recommend starting with a model that was trained on a number of classes similar to the number in your own dataset, for example as in the sketch below.
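For example, a hedged Detectron2 fine-tuning sketch along those lines (the dataset name "my_dataset" is hypothetical and must be registered with Detectron2's DatasetCatalog first; the 500-class head and solver values are illustrative):

```python
# Fine-tune a Detectron2 model pre-trained on LVIS, swapping its head
# for a custom number of classes. Dataset name and solver values are
# illustrative assumptions, not recommendations.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "LVISv0.5-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "LVISv0.5-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml")
cfg.DATASETS.TRAIN = ("my_dataset",)   # hypothetical, registered beforehand
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 500  # replaces the ~1200-class LVIS head
cfg.SOLVER.MAX_ITER = 10000            # illustrative

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```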
I'm new to the computer vision world. I'm trying to create a script whose objective is to gather data from a dataset of images.
I'm interested in what kinds of objects are in those images, and I want a summary of them in a JSON file for every image.
I've checked out some YOLO implementations, but the ones I've seen are almost always based on COCO with its 80 classes, or use a custom dataset.
I've seen that there are models like InceptionV3 that are capable of classifying 1000 classes. But per my understanding, image classification is different from object detection.
Is there a way to use those big-dataset classification models for object detection?
Or any other suggestion?
Unfortunately, I do not know where the breaking point is; it will of course depend on acceptable evaluation metrics and training data size.
From a technical point of view, there is no hard limit. If you go to extremes, there could be Core ML model size issues and memory issues during inference, but that will only happen for an extremely large number of classes.
From a modeling perspective (a problem that will appear much earlier than the technical limitation) it is not as clear. As you increase the number of classes, you increase the risk of classification mistakes. That said, the severity of many of those mistakes should simultaneously go down, since you will have more and more classes that are naturally similar (breeds of dogs, etc.). The original YOLO9000 paper (https://arxiv.org/pdf/1612.08242.pdf) trained a model on 9000+ classes with reasonable results (lots of mistakes of course, but still impressive). They trained it on a combination of detection and classification data, so if they had actually had detection data for all 9000 classes, the results would presumably be even better.
In your experiment, it sounds like 50-60 classes were OK (thanks for giving us a sample point!). Anything below 100 is definitely tried and true, as long as you have the data. But will 300 do OK? Will 1000? Theoretically, I would say yes, if you can provide enough training data and you adjust your expectation of what a good evaluation metric is, since you know you'll make more mistakes. For instance, for classification with 1000 classes it is common to report top-5 accuracy (that is, the correct label is among your top 5 predicted classes for a sample), as illustrated below.
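For reference, a tiny NumPy illustration of the top-5 accuracy metric just mentioned (assuming a score matrix of shape (num_samples, num_classes) and integer ground-truth labels):

```python
import numpy as np

def top5_accuracy(scores, labels):
    # indices of the 5 highest-scoring classes for each sample
    top5 = np.argsort(scores, axis=1)[:, -5:]
    # a hit if the true label appears anywhere among those 5
    hits = np.any(top5 == labels[:, None], axis=1)
    return hits.mean()
```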
Here is a useful link - https://github.com/apple/turicreate/issues/968
First, to level-set on terminology:
Image-classification neural networks, such as Inception and ResNet, classify an entire image based upon the classes the network was trained on. So if the image contains a dog, the classifier will most likely return the class "dog" with a higher confidence score than the other classes the network was trained on. To train such a network, it's simple enough to group images of the same class (all images with a dog) into folders as inputs. ImageNet and Pascal VOC are examples of public labeled datasets for image classification.
Object-detection neural networks, on the other hand, such as SSD and YOLO, return a set of coordinates that indicate a bounding box, plus a confidence score, for each class (object) that is detected, based upon what the network was trained with. To train such a network, each object in an image must be annotated with a set of coordinates that correspond to the bounding box of the class (object). The COCO dataset, for example, is an annotated dataset of 80 classes (objects) with coordinates corresponding to the bounding box around each object. Another popular dataset is Objects365, which contains 365 classes.
Another important type of neural network that the COCO dataset provides annotations for is the instance segmentation model, such as Mask R-CNN. These models provide pixel-level classification and are extremely compute-intensive, but critical for use cases such as self-driving cars. If you search for Detectron2 tutorials, you will find several great learning examples of training a Mask R-CNN network on the COCO dataset.
So, to answer your question: yes, you can use the COCO dataset (amongst many other options publicly available on the web) for object detection, or you can create your own dataset with a little effort by annotating it with bounding boxes around the object classes you want to train on. Try googling 'using coco to train ssd model' to get some easy-to-follow tutorials. SSD stands for single-shot detector and is an alternative neural network architecture to YOLO. A sketch of the per-image JSON summary you described is below.
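For instance, a minimal sketch using torchvision's COCO-pretrained Faster R-CNN (the images/ folder and the 0.5 score threshold are assumptions; labels come out as raw COCO category IDs, which you would map to names yourself):

```python
# Run a COCO-pretrained detector over a folder of images and write
# one JSON summary per image. Folder name and threshold are assumptions.
import json
from pathlib import Path

import torch
from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

for path in Path("images").glob("*.jpg"):
    img = to_tensor(Image.open(path).convert("RGB"))
    with torch.no_grad():
        out = model([img])[0]  # dict with 'boxes', 'labels', 'scores'
    detections = [
        {"label": int(l), "score": float(s), "box": [float(c) for c in b]}
        for b, l, s in zip(out["boxes"], out["labels"], out["scores"])
        if float(s) > 0.5
    ]
    path.with_suffix(".json").write_text(json.dumps(detections, indent=2))
```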
Hello everybody,
My objective is to detect people and cars (day and night) in images of size 1920x1080. For this I use the TensorFlow Object Detection API with an SSD MobileNet model. I annotated 1000 images (900 for training, 100 for evaluation) from 7 different cameras, and I launch training with an image size of 960x540. My model does not converge and I do not know what to do. Should I make different classes for day and night objects?
In a tutorial on face detection with the TensorFlow API, they use a dataset with images containing only faces, then use the model on complex scenes. Is this a good idea, given that a model like SSD also learns from negative examples?
Thank you
(sources: https://blog.usejournal.com/face-detection-for-cctv-surveillance-6b8851ca3751)
What do you mean by "not converge"? Are you referring to the train/validation loss?
In this case, the first thing that comes to my mind is to reduce the learning rate (I had a similar problem).
You can do it by modifying your configuration file: in the "train_config" section you'll find the value "initial_learning_rate".
Try setting it to a lower value (like an order of magnitude lower) and see if that helps; an illustrative excerpt is below.
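For illustration, this is roughly what the relevant block looks like in an SSD pipeline config; the exact nesting depends on the optimizer in your file, and the numbers are examples only:

```
train_config {
  batch_size: 24
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.0004  # e.g. lowered from 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
    }
  }
  # ... rest of train_config unchanged
}
```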
I have millions of images to run inference on. I know how to write my own code to create batches and forward them through a trained network using the MXNet Module API in order to get the predictions. However, creating the batches leads to a lot of data manipulation that is not especially optimized.
Before doing any optimization myself, I would like to know if there are recommended approaches for batch predictions/inference. More specifically, since this is a common use case, I was wondering if there is an interface/API that can do the usual image pre-processing, batch creation, and inference given a trained model (i.e. symbol file & epoch checkpoint)?
If you are using a standard pre-trained model, I would highly recommend taking a look at the GluonCV project, a toolkit for computer vision based on Apache MXNet.
They have really nice implementations of state-of-the-art models, sometimes even beating the results published in the original scientific papers. What is also nice is that they provide the data pre-processing code, which, as far as I understand, is what you are looking for (see the gluoncv.data.transforms.presets package).
I don't know which kind of inference you want to do (image classification, segmentation, etc.), but take a look at the list of tutorials and you will most probably find the one you need.
Other than that, optimizing for wall-clock time requires making sure that your GPU is 100% utilized. You may find it useful to watch this video for tips and tricks on optimizing performance. It discusses training, but the same techniques apply to inference. A minimal starting point is sketched below.
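As a concrete starting point, a minimal GluonCV inference sketch (the model name and image paths are illustrative; the preset load_test applies the same resizing/normalization the model was trained with):

```python
# Minimal GluonCV inference loop; model name and file names are examples.
import mxnet as mx
from gluoncv import model_zoo, data

ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()
net = model_zoo.get_model("yolo3_darknet53_coco", pretrained=True, ctx=ctx)

for path in ["img1.jpg", "img2.jpg", "img3.jpg"]:
    # preset transform: resize short side, normalize, convert to NCHW tensor
    x, orig_img = data.transforms.presets.yolo.load_test(path, short=512)
    class_ids, scores, boxes = net(x.as_in_context(ctx))
```

For true batching you can stack same-shaped transformed tensors along the batch axis before the forward pass; keeping the GPU fed with the next batch while the current one runs is where most of the wall-clock gains come from.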