I am learning object detection using R-CNN...
I have the images and the annotation file which gives the bounding box for the object
I understand these steps in R-CNN:
Use selective search to get the region proposals
Warp all the regions to the same size
Feed those regions to a CNN
Save the feature maps and feed them to an SVM for classification
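For reference, a minimal sketch of the first two steps using OpenCV's selective search implementation (assuming opencv-contrib-python is installed; "image.jpg" is a placeholder path):

import cv2

image = cv2.imread("image.jpg")

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()   # faster but coarser; switchToSelectiveSearchQuality() also exists
rects = ss.process()               # proposals as (x, y, w, h)

# Warp each proposal to a fixed size before feeding it to the CNN
proposals = [cv2.resize(image[y:y + h, x:x + w], (224, 224))
             for (x, y, w, h) in rects[:2000]]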
For training, I took all the objects (only the object regions from the images, not the background), fed them to the CNN, and then trained an SVM on the resulting feature maps for classification.
Every blog says that R-CNN has three parts:
1st - selective search
2nd - CNN
3rd - BBox regression
But I can't find a deep explanation of the BBox regression part.
I understand IoU (Intersection over Union) for checking BBox accuracy.
Could you please help me understand how this BBox regression is used to get the coordinates of the object?
Let me explain how the BBox regression works, as outlined below.
As you mentioned, it happens in multiple stages.
Selective Search:
Generate an initial sub-segmentation, producing many candidate part regions.
Use a greedy algorithm to recursively combine similar regions into larger ones.
Use the generated regions to produce the final candidate region proposals.
CNN and BBox Regression:
The regressor is a CNN with convolutional layers and fully connected layers, but the last fully connected layer does not apply a sigmoid or softmax, which is what is typically used in classification since those values correspond to probabilities. Instead, this CNN outputs four values (x, y, h, w), where (x, y) specify the position of the left corner of the window and (h, w) its height and width. To train this NN, the loss function penalizes outputs that are very different from the labelled (x, y, h, w) in the training set.
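To make this concrete, here is a minimal Keras sketch of such a regressor (an illustration only, not the exact architecture from the R-CNN paper; region_crops and boxes are placeholder arrays):

from tensorflow.keras import layers, models

regressor = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(4),  # linear output: (x, y, h, w) are coordinates, not probabilities
])

# Mean squared error penalizes outputs that differ from the labelled (x, y, h, w)
regressor.compile(optimizer="adam", loss="mse")
# regressor.fit(region_crops, boxes, ...)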
Related
There is a tutorial on the web for drawing bounding boxes using R-CNN, where a VGG16 network is modified for this task (using transfer learning to take advantage of the fact that the inner layers are already trained).
The edit consists of:
removing the classification layer
using a regression layer instead
The training involves images as inputs and [x1, y1, x2, y2] labelled outputs, each pair being a corner of the bounding box, i.e. a description of a rectangular box around the object we want to detect.
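A rough sketch of how that edit might look in Keras (my own illustration, assuming the built-in VGG16 weights; train_images and train_boxes are placeholder arrays):

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # transfer learning: reuse the already-trained inner layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(4),  # regression layer replacing the classification head: [x1, y1, x2, y2]
])

model.compile(optimizer="adam", loss="mse")
# model.fit(train_images, train_boxes, ...)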
I have tried it, and so far haven't had any luck with the predicted coordinates. So my questions are:
Is the procedure of editing the CNN to create an R-CNN that outputs the vector (also in the link at the top) a correct approach for predicting a bounding box for a specific object?
I am trying this with MobileNet because it is lighter, so assuming 1 is correct, would this also be a "logically similar" idea?
I am trying to understand the RPN in Faster R-CNN.
I understand the concept of the RPN:
Pass the input image to a pre-trained CNN and get feature maps as output
Make the feature maps a fixed size
Extract anchors (3 different scales and ratios for every sliding window) from the fixed-size feature maps
Use two 1x1 fully connected heads to find background vs. object and the bounding box coordinates (4 values)
Calculate the IoU between each anchor box and the ground-truth boxes; if IoU > 0.7, the anchor contains an object, otherwise it is background
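For illustration, a rough NumPy sketch of the IoU-based anchor labelling in that last step (my own example, not from any particular Faster R-CNN implementation; all boxes are (x1, y1, x2, y2)):

import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    # 1 = object, 0 = background, -1 = ignored during training
    labels = -np.ones(len(anchors), dtype=np.int64)
    for i, anchor in enumerate(anchors):
        best = iou(anchor, gt_boxes).max()
        if best > pos_thresh:
            labels[i] = 1
        elif best < neg_thresh:
            labels[i] = 0
    return labels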
The purpose of the RPN is to give region proposals that contain objects.
But I do not understand the input and output structure.
For example, I have 50 images, each image having 5 to 6 objects, with labelling information (the coordinates of each object).
How do I generate target values to train the RPN?
All the blogs show the architecture as feeding the entire image to the pre-trained CNN.
And for the output of the RPN, the model has to tell whether each anchor contains an object or not, and also predict the bounding box for the object in that anchor.
For this, how do I prepare the input and target/output values, like we do in a dog/cat or dog/cat/car classification problem?
Please correct me if I am wrong:
Do we have to crop all the objects in every image and do binary classification (object vs. background) to classify whether an anchor contains an object?
And do we have to give the ground-truth box as the target for every cropped object from all images in the dataset, so that the RPN is trained to predict the bounding box for the object in every anchor?
I hope I have explained my doubts clearly.
Please help me learn this concept. Thank you!
I am working on a code to classify texts of scientific articles (using the title and the abstract). And for this I'm using an SVM, which delivers a good accuracy (83%). At the same time I used a CNN to classify the images of these articles. My idea is to merge the text classifier with the image classifier, to improve the accuracy.
Is it possible? If so, would you have some idea of how I could implement it, or some kind of guideline?
Thank you!
You could use a CNN to do both. For this you'd need two (or even three) inputs: one for the text (or two, where one is for the abstract and the other for the title) and the second for the image. Then you'd have some conv/max-pooling layers before you merge the branches at one point, and then plug in some additional CNN or dense layers.
You could also have multiple outputs in this model, e.g. a combined one, one for the text and one for the images. If you're using Keras you would need the functional API. A picture of an example model can be found here (they're using an LSTM in the example, but I guess you should stick to a CNN).
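A minimal sketch of such a two-input model with the Keras functional API (the layer sizes and input shapes below are placeholders, not recommendations):

from tensorflow.keras import layers, models

text_in = layers.Input(shape=(300,), name="text")       # e.g. a vectorized title+abstract
image_in = layers.Input(shape=(128, 128, 3), name="image")

t = layers.Dense(128, activation="relu")(text_in)

x = layers.Conv2D(32, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

merged = layers.concatenate([t, x])                     # merge the two branches
merged = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid", name="combined")(merged)

model = models.Model(inputs=[text_in, image_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])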
If you get probabilities from both classifiers, you can average them and take the combined result. However, a weighted average might be a better approach, in which case you can use a validation set to find a suitable value for the weight.
prob_svm = probability from SVM text classifier
prob_cnn = probability from CNN image classifier
prob_total = alpha * prob_svm + (1-alpha) * prob_cnn # fine-tune alpha with validation set
If you can get another classifier (maybe a different version of either of these two), you can also do majority voting, i.e. take the class that two or all three classifiers agree on.
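As a small sketch of tuning alpha on a validation set (assuming a binary problem and scikit-learn; svm_model, cnn_model, X_val_text, X_val_img and y_val are placeholders):

import numpy as np
from sklearn.metrics import accuracy_score

prob_svm = svm_model.predict_proba(X_val_text)[:, 1]  # SVC must be fitted with probability=True
prob_cnn = cnn_model.predict(X_val_img).ravel()

best_alpha, best_acc = 0.0, 0.0
for alpha in np.linspace(0, 1, 21):
    prob_total = alpha * prob_svm + (1 - alpha) * prob_cnn
    acc = accuracy_score(y_val, prob_total > 0.5)
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

The tuned alpha can then be reused when combining the two classifiers on the test set.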
I have a database of images that contains identity cards, bills and passports.
I want to classify these images into different groups (i.e identity cards, bills and passports).
From what I have read, one way to do this task is clustering (since it would be unsupervised).
The idea for me is like this: the clustering will be based on the similarity between images (i.e images that have similar features will be grouped together).
I know also that this process can be done by using k-means.
So the problem for me is about features and using images with K-means.
If anyone has done this before, or has a clue about it, could you please recommend some links to start with or suggest any features that could be helpful?
The simplest way to get good results is to break the problem into two parts:
Getting the features from the images: Using the raw pixels as features will give you poor results. Pass the images through a pre-trained CNN (you can get several of those online), then use the output of the last CNN layer (just before the fully connected layers) as the image features.
Clustering the features: Having got rich features for each image, you can run clustering on them (such as K-means).
I would recommend implementing (or rather reusing existing implementations of) 1 and 2 in Keras and scikit-learn respectively.
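A hedged sketch of those two steps, using Keras' pre-trained VGG16 as the feature extractor and scikit-learn's KMeans (image_paths is a placeholder list of file paths):

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.cluster import KMeans

extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")  # classifier head removed

def extract_features(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]                      # 512-d feature vector per image

X = np.stack([extract_features(p) for p in image_paths])
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)  # identity cards / bills / passports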
Label a few examples, and use classification.
Clustering is just as likely to give you clusters like "images with a blueish tint", "grayscale scans" and "warm color temperature". That is a quite reasonable way to cluster such images.
Furthermore, k-means is very sensitive to outliers. And you probably have some in there.
Since you want your clusters to correspond to certain human concepts, classification is what you need to use.
I have implemented Unsupervised Clustering based on Image Similarity using Agglomerative Hierarchical Clustering.
My use case had images of people, so I extracted the face embedding (i.e. feature) vector from each image. I used dlib for the face embeddings, so each feature vector was 128-dimensional.
In general, a feature vector can be extracted from each image. A pre-trained VGG or other CNN, with its final classification layer removed, can be used for feature extraction.
A dictionary with the IMAGE_FILENAME as KEY and the FEATURE_VECTOR as VALUE can be created for all the images in the folder. This makes the correlation between a filename and its feature vector easier to track.
Then create a single feature matrix, say X, which comprises the individual feature vectors of each image in the folder/group that needs to be clustered.
In my use case, X had the dimensions (NUMBER OF IMAGES IN THE FOLDER, 128), i.e. (number of images, size of each feature vector). For instance, shape of X: (50, 128).
This feature matrix can then be used to fit an agglomerative hierarchical clustering model. The distance threshold parameter needs to be fine-tuned empirically.
Finally, we can write code to identify which IMAGE_FILENAME belongs to which cluster.
In my case, there were about 50 images per folder, so this was a manageable solution. This approach was able to group the images of a single person into a single cluster. For example, 15 images of PERSON1 belong to CLUSTER 0, 10 images of PERSON2 belong to CLUSTER 2, and so on…
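A condensed sketch of this approach (feature extraction is abstracted into a hypothetical get_embedding(); dlib face embeddings or a truncated VGG would both fit there):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# KEY: IMAGE_FILENAME, VALUE: FEATURE_VECTOR
features = {fname: get_embedding(fname) for fname in image_filenames}
X = np.stack(list(features.values()))                   # shape: (number of images, 128)

# n_clusters=None lets the distance threshold decide how many clusters emerge
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=0.6)
labels = clusterer.fit_predict(X)

for fname, label in zip(features.keys(), labels):
    print(fname, "-> cluster", label)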
I am trying to see the feasibility of using TensorFlow to identify features in my image data. I have 50x50px grayscale images of nuclei that I would like to have segmented; the desired output would be either a 0 or 1 for each pixel: 0 for the background, 1 for the nucleus.
Example input: raw input data
Example label (what the "label"/real answer would be): output data (label)
Is it even possible to use TensorFlow to perform this type of machine learning on my dataset? I could potentially have thousands of images for the training set.
A lot of the examples have a label corresponding to a single category, for example a 10-element array [0,0,0,0,0,0,0,0,0,0] for the handwritten digit dataset, but I haven't seen many examples that would output a larger array. I would assume the label would be a 50x50 array?
Also, any ideas on the CPU processing time for this type of analysis?
Yes, this is possible with TensorFlow. In fact, there are many ways to approach it. Here's a very simple one:
Consider this to be a binary classification task. Each pixel needs to be classified as foreground or background. Choose a set of features by which each pixel will be classified. These features could be local features (such as a patch around the pixel in question) or global features (such as the pixel's location in the image). Or a combination of the two.
Then train a model of your choosing (such as a NN) on this dataset. Of course your results will be highly dependent upon your choice of features.
You could also take a graph-cut approach if you can represent that computation as a computational graph using the primitives that TensorFlow provides. You could then either skip TensorFlow's optimization functions such as backprop, or, if there are differentiable variables in your computation, use TF's optimizers to optimize them.
SoftmaxWithLoss() works for your image segmentation problem, if you reshape the predicted label and true label map from [batch, height, width, channel] to [N, channel].
In your case, your final predicted map will have channel = 2, and after reshaping, N = batch × height × width; then you can use SoftmaxWithLoss() or a similar loss function in TensorFlow to run the optimization.
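To make that concrete, a minimal TensorFlow sketch of the reshape-and-softmax idea (logits is a hypothetical [batch, 50, 50, 2] network output and labels the corresponding [batch, 50, 50] integer masks):

import tensorflow as tf

flat_logits = tf.reshape(logits, [-1, 2])   # [batch*height*width, channel]
flat_labels = tf.reshape(labels, [-1])      # 0 = background, 1 = nucleus

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=flat_labels, logits=flat_logits))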
See this question that may help.
Try using convolutional filters for the model: a stack of convolution and downsampling layers. The input should be the normalized pixel image and the output should be the mask. The last layer should be a softmax with loss. HTH.
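A hedged Keras sketch of that idea for the 50x50 grayscale inputs (the layer widths are arbitrary placeholders, and a per-pixel softmax with sparse categorical cross-entropy stands in for softmaxWithLoss):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(50, 50, 1)),                  # normalized grayscale pixels
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),                            # downsample 50 -> 25
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.UpSampling2D(),                            # back to 50x50
    layers.Conv2D(2, 1, activation="softmax"),        # per-pixel background/nucleus probabilities
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(images, masks, ...)  # masks: [batch, 50, 50] arrays of 0/1 labels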