I am trying to understand the RPN (Region Proposal Network) in Faster R-CNN.
I understand the concept of the RPN as follows:
Pass the input images to the pre-trained CNN and get feature maps as output.
Make the feature maps a fixed size.
Extract anchors (3 different scales and ratios for every sliding-window position) from the fixed-size feature maps.
Use two 1×1 fully connected layers to predict background vs. object and the bounding-box coordinates (4 values).
Calculate the IoU of each anchor's bounding box with the ground-truth bounding box; if IoU > 0.7, the anchor contains an object, otherwise it is background.
The purpose of the RPN is to produce region proposals that contain objects.
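The IoU-based labelling step above can be sketched in plain Python. This is only a minimal sketch: the function names are made up for illustration, boxes are assumed to be (x1, y1, x2, y2) tuples, and the lower threshold of 0.3 with an "ignored" label in between is how I recall the Faster R-CNN paper describing it, which is slightly stricter than the simple rule listed above. The paper's extra rule for the best-matching anchor per ground-truth box is left out.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 = object, 0 = background, -1 = ignored.

    The thresholds are assumptions taken from my reading of the paper.
    """
    labels = []
    for anchor in anchors:
        # Best overlap of this anchor with any ground-truth box.
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best > pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)  # neither clearly object nor background
    return labels
```

With this kind of labelling the training targets are computed per anchor from the annotation file, so the network input can stay the whole image.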
But I do not understand the input and output structure.
For example, I have 50 images, each containing 5 to 6 objects, along with labelling information (the coordinates of each object).
How do I generate target values to train the RPN?
All the blogs show the architecture as feeding the entire image to the pre-trained CNN.
And as output, the RPN has to tell whether each anchor contains an object or not, and also predict the bounding box for the object in the anchor.
For this, how do I prepare the input and target/output values, as we do in a dog/cat or dog/cat/car classification problem?
Please correct me if I am wrong:
Do we have to crop all the objects in every image and do binary classification (object vs. background) to decide whether an anchor contains an object or not?
And do we have to give the ground-truth values as targets for every cropped object from all images in the dataset, so that the RPN is trained well enough to predict the bounding box for the object in every anchor?
I hope I have explained my doubts clearly.
Please help me to learn this concept. Thank you.
There is a tutorial on the web for drawing bounding boxes using R-CNN, where a VGG16 network is modified for this task (using transfer learning to take advantage of the inner layers that are already trained).
The modification consists of:
removing the classification layer
using a regression layer instead
The training involves images as inputs and [x1,y1,x2,y2] labelled outputs, each pair being a corner of the box, i.e. a description of a rectangular box around the object we want to detect.
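One detail that is easy to get wrong with [x1,y1,x2,y2] labels is keeping them consistent with the image size the network actually sees. The sketch below is an assumption about the preprocessing, not something taken from the tutorial: the helper names are hypothetical, sizes are (width, height) tuples, and 224x224 is just the usual VGG16 input size used as an example.

```python
def scale_box(box, orig_size, new_size):
    """Rescale an (x1, y1, x2, y2) box when the image is resized.

    orig_size and new_size are (width, height) tuples.
    """
    sx = new_size[0] / orig_size[0]
    sy = new_size[1] / orig_size[1]
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


def normalize_box(box, size):
    """Express a box as fractions of the image width and height."""
    w, h = size
    x1, y1, x2, y2 = box
    return (x1 / w, y1 / h, x2 / w, y2 / h)


# Example: a label made on a 640x480 photo, image resized to 224x224.
resized = scale_box((100, 50, 300, 200), (640, 480), (224, 224))
target = normalize_box(resized, (224, 224))
```

Normalized targets keep the regression outputs in a small range, which tends to be easier to fit than raw pixel coordinates.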
I have tried it, and so far the predicted coordinates have not been good. So my questions are:
1. Is the procedure of editing the CNN to create an R-CNN that outputs this vector (also in the link at the top) a correct approach for predicting a bounding box for a specific object?
2. I am trying this with MobileNet because it is lighter, so assuming 1. is correct, would this also be a "logically similar" idea?
I have an image segmentation problem. First I need to find a certain animal in an image containing multiple different animals. Then I need to find a certain feature on that animal. The first network, built to find the particular animal, is simply a U-Net doing binary classification. It reaches a Dice score of 96%.
Now I would like to use the mask from the first network to crop the original image around the animal. I would also need to crop the second ground-truth mask related to that image (this is the ground truth for the features). How can I retrieve a bounding box from the first predicted mask so that I can crop my images further?
I am coding in Python and using PyTorch and torchvision. I would like to avoid Keras and TensorFlow; any other library is welcome.
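To make the cropping step concrete, here is a minimal sketch in plain Python. It assumes the predicted mask has already been thresholded to 0/1 and converted to a list of rows (for a tensor, something like `mask.tolist()`); the function names are made up for illustration. For real tensors a vectorized route is preferable, and I believe torchvision provides `torchvision.ops.masks_to_boxes` for this, but check the documentation of your version.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels in a 2D mask.

    mask is a list of rows. Returns None when the mask is empty.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, v in enumerate(row) if v]
    return (min(cols), rows[0], max(cols), rows[-1])


def crop(image, bbox, margin=0):
    """Crop a list-of-rows image to bbox (inclusive), with optional margin."""
    x_min, y_min, x_max, y_max = bbox
    h, w = len(image), len(image[0])
    y0, y1 = max(0, y_min - margin), min(h, y_max + 1 + margin)
    x0, x1 = max(0, x_min - margin), min(w, x_max + 1 + margin)
    return [row[x0:x1] for row in image[y0:y1]]


mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
bbox = mask_to_bbox(mask)   # bounding box of the predicted animal
patch = crop(mask, bbox)    # the same bbox would be applied to the image and to the second mask
```

Applying the same bbox to the original image and to the feature ground-truth mask keeps the two crops aligned.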
I am learning object detection using R-CNN.
I have the images and the annotation file, which gives the bounding box for each object.
I understand these steps in R-CNN:
Use selective search to get the proposed regions
Resize all the regions to the same size
Feed those images to the CNN
Save the feature maps and feed them to an SVM for classification
In training, I took all the objects (only the objects from the images, not the background), fed them to the CNN, and then trained the SVM on the feature maps for classification.
Every blog says that R-CNN has three parts:
1st - selective search
2nd - CNN
3rd - BBox regression
But I cannot find a deep explanation of the BBox regression.
I understand that IoU (Intersection over Union) is used to check the BBox accuracy.
Could you please help me understand how BBox regression is used to get the coordinates of the object?
BBox regression works as described below.
As you mentioned, it happens in multiple stages.
Selective Search:
Generate an initial sub-segmentation, which produces many candidate (partial) regions.
Use greedy algorithm to recursively combine similar regions into larger ones.
Use the generated regions to produce the final candidate region proposals.
CNN and BBox Regression:
The regressor is a CNN with convolutional layers and fully connected layers, but the last fully connected layer does not apply sigmoid or softmax, which are typically used in classification because the outputs there correspond to probabilities. Instead, this CNN outputs four values (𝑟,𝑐,ℎ,𝑤), where (𝑟,𝑐) specify the position of the left corner and (ℎ,𝑤) the height and width of the window. To train this network, the loss function penalizes outputs that are very different from the labelled (𝑟,𝑐,ℎ,𝑤) in the training set.
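As a rough illustration of what the regressor is trained on, here is a small sketch. It goes one step beyond the description above: as far as I recall, the R-CNN paper does not regress absolute coordinates but offsets relative to each proposal, with log-scaled width and height. The (cx, cy, w, h) centre format, the function names and the plain squared-error loss are assumptions for this example.

```python
import math


def encode(proposal, gt):
    """Regression targets that map a proposal onto a ground-truth box.

    Both boxes are (cx, cy, w, h): centre coordinates, width, height.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def decode(proposal, t):
    """Apply predicted offsets t = (tx, ty, tw, th) to a proposal."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))


def squared_error(pred, target):
    """Sum of squared differences, the simplest regression loss."""
    return sum((p - q) ** 2 for p, q in zip(pred, target))


proposal = (50.0, 50.0, 20.0, 40.0)   # from selective search
gt = (55.0, 48.0, 30.0, 40.0)         # from the annotation file
target = encode(proposal, gt)         # what the network should output
refined = decode(proposal, target)    # recovers the ground-truth box
```

At test time the network's four outputs are passed through `decode` to turn a coarse proposal into the refined box.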
I want to detect small objects (9x9 px) in my images (around 1200x900) using neural networks. Searching the net, I've found several webpages with Keras code that uses customized layers for custom object classification. In this case, I've understood that you need to provide images where your object is alone. Although the training is good and it classifies them properly, unfortunately I haven't found how to later load this trained network to find objects in my big images.
On the other hand, I have found that I can do this using the cnn class in cv if I load the weights from the YOLOv3 network. In this case I provide the big images with the proper annotations, but the network is not well trained.
Given this context, could someone show me how to load weights in cnn that were trained with a customized network, and how to train that network?
After a lot of searching, I've found a better approach:
Cut your images into subimages (I cut mine into 2 rows and 4 columns).
Feed YOLO with these subimages and their proper annotations. I used YOLOv3-tiny with an input size of 960x960 for 10k steps. In my case, intensity and color were important, so random parameters such as hue, saturation and exposure were kept at 0. Use random angles. If your objects do not change in size, disable random at the yolo layers (random=0 in the cfg files; this option only randomizes the input size used for training at each step). For this, I'm using AlexeyAB's darknet fork. If you have some blurred objects, add blur=1 to the [net] properties in the cfg file (after hue). For blur you need AlexeyAB's fork, compiled with OpenCV (apart from CUDA if you can).
Calculate anchors with AlexeyAB's fork. Cluster_num is the number of anchor pairs you use. You can find it by opening your cfg and looking at any anchors= line. Anchors are the sizes of the boxes that darknet will use to predict the positions.
Change the cfg to use your new anchors. If you have fixed-size objects, the anchors will be very close in size. I left the ones for the bigger objects (first yolo layer) unchanged, but for the second layer, the tiny ones, I modified them and even removed 1 pair. If you remove some, then change the order in mask in [yolo] (in all [yolo] sections). Mask refers to the indices of the anchors, starting at 0. If you remove some, also change num= inside [yolo].
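The first step above (cutting the image and moving the annotations with it) can be sketched as follows. This is only an illustration under my own assumptions: boxes are (x1, y1, x2, y2) in pixels, each box goes to the tile that contains its centre, boxes crossing a border are simply clipped, and the function names are made up. The normalised (cx, cy, w, h) output is the label layout darknet expects, minus the class id.

```python
def tile_annotations(img_w, img_h, boxes, rows=2, cols=4):
    """Assign each (x1, y1, x2, y2) box to the tile containing its centre.

    Returns {(row, col): [boxes in tile-local pixel coordinates]}.
    """
    tile_w, tile_h = img_w / cols, img_h / rows
    tiles = {(r, c): [] for r in range(rows) for c in range(cols)}
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        c = min(int(cx // tile_w), cols - 1)
        r = min(int(cy // tile_h), rows - 1)
        ox, oy = c * tile_w, r * tile_h  # top-left corner of the tile
        local = (max(x1 - ox, 0), max(y1 - oy, 0),
                 min(x2 - ox, tile_w), min(y2 - oy, tile_h))
        tiles[(r, c)].append(local)
    return tiles


def to_yolo(box, tile_w, tile_h):
    """Convert a tile-local pixel box to normalised (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2 / tile_w, (y1 + y2) / 2 / tile_h,
            (x2 - x1) / tile_w, (y2 - y1) / tile_h)


# A 9x9 px object in a 1200x900 image, cut into 2 rows and 4 columns.
tiles = tile_annotations(1200, 900, [(310, 460, 319, 469)])
label = to_yolo(tiles[(1, 1)][0], 300, 450)
```

After tiling, the 9x9 object covers 3% of the tile width instead of 0.75% of the full image width, which is the point of the whole exercise.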
After this, detection is quite good. If you detect on a video, it can happen that some objects are lost in some frames. You can try to avoid this by using the lstm cfg. https://github.com/AlexeyAB/darknet/issues/3114
Now, if you also want to track them, you can apply a deep sort algorithm with your yolo pretrained network. For example, you can convert your pretrained network to keras using https://github.com/allanzelener/YAD2K (add this commit for tiny yolov3 https://github.com/allanzelener/YAD2K/pull/154/commits/e76d1e4cd9da6e177d7a9213131bb688c254eb20) and then use https://github.com/Qidian213/deep_sort_yolov3
As an alternative, you can train it with Mask R-CNN or any other Faster R-CNN-style algorithm and then look into deep sort.
Currently I am testing the YOLO9000 model for object detection. From the paper I understand that the image is split into 13x13 boxes and that in each box we calculate P(Object). But how can we calculate that? How can the model know whether there is an object in a box or not? Please, I need help to understand that.
I am using TensorFlow.
Thanks,
They train for the confidence score = P(object) * IOU. For the ground-truth box they take P(object)=1, and for the rest of the grid cells the ground-truth P(object) is zero. You are training your network to tell you whether there is some object at that grid location, i.e. output 0 if there is no object, output the IOU if there is a partial object, and output 1 if the object is present. So at test time, your model has become capable of telling whether there is an object at that location.
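Here is a minimal sketch of how such confidence targets could be built for one image. It is simplified on purpose and rests on assumptions of mine: a 416x416 input (which gives a 13x13 grid with 32 px cells), one predicted box per cell instead of several anchors, a ground-truth box assigned to the cell that contains its centre, and a hypothetical `pred_boxes` dict holding the box each cell currently predicts.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def confidence_targets(gt_boxes, pred_boxes, img_size=416, grid=13):
    """Build a grid x grid table of confidence targets.

    Cells without an object keep target 0 (P(object) = 0); the cell that
    contains a ground-truth centre gets 1 * IoU(prediction, ground truth).
    """
    cell = img_size / grid
    targets = [[0.0] * grid for _ in range(grid)]
    for gt in gt_boxes:
        cx, cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
        col = min(int(cx // cell), grid - 1)
        row = min(int(cy // cell), grid - 1)
        targets[row][col] = iou(pred_boxes[(row, col)], gt)
    return targets


gt = [(100, 100, 164, 164)]             # centre at (132, 132) -> cell (4, 4)
pred = {(4, 4): (100, 100, 164, 132)}   # current prediction covers half the object
targets = confidence_targets(gt, pred)
```

So the "label" for P(object) never has to be annotated by hand; it falls out of the box annotations and the grid layout.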
As they mention in the paper (2nd page, section 2), the confidence score is P(object) * IOU. But in that paragraph they also say that if there is an object the confidence score will be the IOU, and otherwise zero. So it's just a guideline.
There are 13x13 grid cells, true, but P(object) is calculated for each of 5x13x13 anchor boxes. From the YOLO9000 paper:
When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchor box.
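To make the 5x13x13 count concrete, here is a tiny sketch of the output layout as I understand it: every anchor box predicts 4 box coordinates, 1 objectness value and one score per class. The 20-class default is only an assumption for the example (a VOC-sized label set), and the function names are made up.

```python
def num_predictions(grid=13, num_anchors=5):
    """Number of anchor boxes that each get their own P(object)."""
    return grid * grid * num_anchors


def output_shape(grid=13, num_anchors=5, num_classes=20):
    """Shape of the final feature map: grid x grid x channels."""
    per_anchor = 4 + 1 + num_classes  # box, objectness, class scores
    return (grid, grid, num_anchors * per_anchor)


boxes = num_predictions()   # 845 objectness values per image
shape = output_shape()      # (13, 13, 125) for 20 classes
```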
I can't comment yet because I'm new here, but if you're wondering about test time, it works kind of like an RPN. At each grid cell, the 5 anchor boxes each predict a bounding box, which can be larger than the grid cell, and then non-maximum suppression is used to pick the top few boxes to do classification on.
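The non-maximum suppression step mentioned here can be sketched in a few lines. This is a generic greedy NMS under my own assumptions, not the darknet code: boxes are (x1, y1, x2, y2) tuples, scores are the predicted confidences, and the 0.5 overlap threshold is just a common default.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: return indices of the boxes to keep, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # the second box overlaps the first and is dropped
```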
P(object) is just a probability, the network doesn't "know" if there is really an object in there or not.
You can also look at the source code for the forward_region_layer method in region_layer.c and trace how the losses are calculated, if you're interested.
At test time, the YOLO network takes the IOU from a default set value, which is 0.5.