How to tile images on inference with Yolo Object Detection

I've recently trained a custom YOLOv5 model to recognize animals on safari.
Animals on safari are far away most of the time, so after the images are resized to 640x640, most of the animals are too small to be detected.
I've researched the technique of tiling: taking a large image and splitting it into a grid of smaller images (5x5, for example), which also means inference does not take up as much RAM as running it on the original large image.
However, I haven't found any instructions on how to do this for real-time inference.
The model I'm using is YOLOv5, trained with PyTorch.
Does anyone know how to do tiling on real-time inference?
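
For illustration, here is a minimal sketch of what tiled inference could look like. It is not taken from the YOLOv5 repository. The names `tile_starts`, `detect_tiled` and `detect_fn` are hypothetical, and `detect_fn` is assumed to be any callable that takes an image crop (a NumPy array) and returns rows of `(x1, y1, x2, y2, conf, cls)` in crop pixel coordinates. The tile size of 640 and the overlap of 64 pixels are assumptions as well.

```python
import numpy as np

def tile_starts(length, tile, overlap):
    """Start offsets so that tiles of size `tile` cover `length` pixels."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the border
    return starts

def detect_tiled(image, detect_fn, tile=640, overlap=64):
    """Run `detect_fn` on overlapping tiles and map boxes back to the full image.

    `detect_fn` is a placeholder for the real model call (an assumption):
    it must return rows of (x1, y1, x2, y2, conf, cls) in tile coordinates.
    """
    h, w = image.shape[:2]
    boxes = []
    for y0 in tile_starts(h, tile, overlap):
        for x0 in tile_starts(w, tile, overlap):
            crop = image[y0:y0 + tile, x0:x0 + tile]
            for x1, y1, x2, y2, conf, cls in detect_fn(crop):
                # Shift tile-local coordinates by the tile's top-left corner.
                boxes.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, conf, cls))
    return boxes

if __name__ == "__main__":
    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # stand-in for a video frame
    fake_detector = lambda crop: [(10, 20, 30, 40, 0.9, 0)]
    print(len(detect_tiled(frame, fake_detector)))
```

For real-time use, the same function would be called once per frame. Objects that fall in the overlap between two tiles can be reported twice, so the merged boxes would normally go through non-maximum suppression afterwards. Using more tiles per frame costs proportionally more inference time, so the tile count is a trade-off against frame rate.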

Related

YOLOV5 is not giving prediction on live webcam?

I have trained a YOLOv5 model on a traffic sign dataset whose images have dimensions (1360, 800) and a size of about 600 KB each, but when I run real-time prediction on my laptop camera it is not able to detect those signs. The webcam images have dimensions (600, 450) and a size of about 280 KB. Is this problem due to the size or dimensions of the image?
One thing to keep in mind: I have no GPU, only a CPU, on my local PC, so I trained the model on Colab. It works fine on the test images from the dataset, which have the larger size and dimensions.
The trained YOLOv5 model works on test data from its own dataset, but it does not work on images I capture myself or on the live webcam of my PC.

Detecting small objects with Tensorflow 2 Object Detection API

I'm having problems finding the best network and configuration to detect small-scale objects. So far I have gotten very low mAPs on small objects (I am trying to detect traffic signs using the Mapillary dataset).
I have tried using Faster R-CNN 101 (resizing the input to 1024) and SSD 101 with FPN (resizing the input to 1024).
I did not find a pre-trained Faster R-CNN model with FPN, so I could not try that.
What do you think would be the best network and configuration to detect small objects?
Thank you.

The models you mentioned are built for speed. With small object detection, you often care more about the accuracy of the model, so you should probably use bigger models that sacrifice speed for accuracy (mAP). If you want to use TensorFlow 2, here is an overview of the available models. Also, for small object detection you should keep a high resolution, as you said. You could also split each image into multiple crops and run detection on each portion of the image.
So I disagree with @Akash Desai about SSD, but I also think that detectron2 keeps up better with state-of-the-art models and gives better performance. If you don't care about the framework, maybe switch to detectron2.

SSD is best for detecting small as well as large targets, because it makes predictions on every feature map.
You resized the images to 1024? In that case the model will take more time to train on the dataset, so keep the image size small, like 460*460.
You can also try detectron2; it's faster and simpler than TensorFlow.
https://colab.research.google.com/github/Tony607/detectron2_instance_segmentation_demo/blob/master/Detectron2_custom_coco_data_segmentation.ipynb

Object detection in 1080p with SSD Mobilenet (Tensorflow API)

Hello everybody,
My objective is to detect people and cars (day and night) in images of size 1920x1080. For this I use the TensorFlow Object Detection API with an SSD MobileNet model. I annotated 1000 images (900 for training, 100 for evaluation) from 7 different cameras, and I launch the training with an image size of 960x540. My model does not converge, and I do not know what to do. Should I make different classes for day and night objects?
In a tutorial on face detection with the TensorFlow API, they use a dataset of images containing only faces, then use the model on complex scenes. Is this a good idea, knowing that a model like SSD also learns from negative examples?
Thank you
(sources: https://blog.usejournal.com/face-detection-for-cctv-surveillance-6b8851ca3751)

What do you mean by "not converge"? Are you referring to the train/validation loss?
In that case, the first thing that comes to my mind is to reduce the learning rate (I had a similar problem).
You can do it by modifying your configuration file: in the "train_config" section you'll find the value "initial_learning_rate".
Try setting it to a lower value (say, an order of magnitude lower) and see if it helps.

Automatically make a composite image for cnn training

I would like to train a CNN for detection and classification of all kinds of signs (mainly laboratory and safety markers) using TensorFlow.
While I can gather enough training data for the classification training set, using e.g. the Bing API, I'm struggling to think of a way to get enough images for the object detection training set. Since these markers are mostly not publicly available, I thought I could make a composite of a natural scene image with the image of the marker itself to get a training set. Is there any way to do that automatically?
I looked at the TensorFlow data augmentation class, but it seems it only provides functionality for simpler data augmentation tasks.

You can do it with OpenCV as a preprocessing step.
The algorithm is as follows:
1. Randomly choose a combination of a natural scene image and a sign image.
2. Sample a random position in the natural scene image where the sign image will be pasted.
3. Paste the sign image at that position.
4. Keep the pasted image and the position as part of the training data.
Steps 1 and 2 can be done with Python's standard random module or with NumPy.
Step 3 can be done with opencv-python. See "overlay a smaller image on a larger image python OpenCv".
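
The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the function name `paste_sign` is made up, the images are NumPy arrays of the kind opencv-python returns from `cv2.imread`, and the paste is a plain opaque rectangle done with array slicing rather than any blending or transparency handling.

```python
import random
import numpy as np

def paste_sign(scene, sign, rng=random):
    """Paste `sign` into a copy of `scene` at a random position.

    Returns the composite image and the bounding box (x1, y1, x2, y2)
    of the pasted sign, which can serve as the detection label.
    `rng` can be the `random` module or a seeded `random.Random`.
    """
    scene_h, scene_w = scene.shape[:2]
    sign_h, sign_w = sign.shape[:2]
    if sign_h > scene_h or sign_w > scene_w:
        raise ValueError("sign must fit inside the scene")

    # Step 2: sample the top-left corner so the sign stays fully inside.
    x = rng.randint(0, scene_w - sign_w)
    y = rng.randint(0, scene_h - sign_h)

    # Step 3: paste by overwriting the region (opaque, no alpha blending).
    composite = scene.copy()
    composite[y:y + sign_h, x:x + sign_w] = sign

    # Step 4: the position becomes the label for this training image.
    return composite, (x, y, x + sign_w, y + sign_h)

if __name__ == "__main__":
    scene = np.zeros((480, 640, 3), dtype=np.uint8)    # stand-in for a scene photo
    sign = np.full((40, 40, 3), 255, dtype=np.uint8)   # stand-in for a sign image
    image, box = paste_sign(scene, sign)
    print(image.shape, box)
```

Step 1 would be a `random.choice` over the lists of scene and sign files. Randomly rescaling the sign before pasting is a natural extension, but it is not shown here.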

Trained model detects almost everything as one class after a long training

I trained a custom person detector using TensorFlow and Inception's pretrained model. After a few thousand steps and an average loss between 1 and 2, I stopped the training and tested it on a live video. The result was quite good, with only a few false positives. It could detect some people but not everyone, so I decided to continue training the model until I got an average loss below 1, then tested it again. It now detects almost everything as a person, even the whole frame of the video, even when there is no object present. The model seems to work great on pictures but not on videos. Is that overfitting?
Sorry, I forgot how many steps it was. I accidentally deleted the training folder that contains the ckpt and tfevents files.
edit: I forgot that I am also training the same model with the same dataset but a higher batch size in the cloud as a backup, and that run is now at a higher step count. I'll edit the post later and provide the info from TensorBoard once I've finished downloading and testing the model from the cloud.
edit2: I downloaded the model trained for 200k steps from the cloud and it is working. It detects people, but it sometimes recognizes the whole frame as "person" for less than a second when I am moving the camera. I guess this could be improved by continuing to train the model.
Total Loss on tensorboard
For now, I'll just continue the training in the cloud and try to document every result of my tests. I'll also try to resize some images in my dataset, train on my local machine using MobileNet, and compare the results from the two models.

Since you say the model did well with fewer training iterations, I guess the pre-trained model could already detect the person class and your training set made the detection worse.
"The model seems to work great on pictures but not on videos"
If your single pictures are detected fine, then videos should work too. The only difference can come from the video's image resolution and quality, so compare the resolution of the images with that of the video.
"Is that overfitting?"
If the images you are talking about were used in training, you should not use them to evaluate the model. If the model is overfitted, it will detect objects in the training images but not in any other ones.
Since you say the model produces too many detections, I think this is not caused by overfitting; it may be about your dataset. I think the likely causes are:
You have too little data to train on.
The network model is too big and complicated for the amount of data. Try a smaller network like VGG, inception_v1 (SSD MobileNet), etc.
The image resolution used in the training set is very different from that of the evaluation images.
The learning rate is important, but I think in your case it's fine.
I think you should carefully check the dataset you used for training and use as much data as you can. These are the things I have generally run into and lost time on.
