One year ago I trained a model to detect flowers. One year later I am starting this project up again, but first I decided to make sure I still remembered the process by training a model to detect red and green crayons.
My process is more or less following this tutorial –
https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10
I have two labels, green and red. I have 200 training images and 20 test images.
I am using faster_rcnn_inception. I followed the steps and ran my model.
It detects the crayons about as well as you could expect with only 200 images; however, it can't tell the red and green crayons apart at all. I thought maybe I had screwed up the settings, but if I move a blue pen into the frame, the label still pops up!
Even when I feed it the training images, it classifies 99% of them as two green crayons, even though each image always contains two different crayons!
Can this model work with colour? Or is it converting the colour somehow and messing it up? Is colour hard to detect, and I just need more training images? Have I likely screwed up a setting, since it can’t even correctly classify the training images?
The config file I am using is here:
https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/faster_rcnn_inception_v2_pets.config
I've changed line 9 and line 130, and set line 108 to false.
In general, neural networks can detect colour.
But they often learn not to. Due to differences in colour temperature and perspective, different colours can produce the same or similar pixel-level values. Therefore, when training on larger datasets, networks tend to become highly colour-agnostic. Unfortunately, I can only speak from gut feeling and cannot provide an example or reference, but the picture above should give you a sense of why.
In your case the issue is further complicated by the fact that there is a competing task of predicting the object box. Because of that, during retraining the detection net can become insensitive to weak cues like colour.
To troubleshoot the situation I would recommend looking closely at your classification accuracy during retraining. As far as I can tell, the tutorial code only reports the loss value. You should expect that during retraining at least the training set gets overfit almost perfectly, i.e. the green and red crayons must become distinguishable. If not, it might make sense to train for longer or to decrease the learning rate.
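A quick sanity check along these lines is to run the exported graph over a few training images and look at the raw class ids and scores rather than just the loss. A minimal sketch, assuming a TF1-style frozen graph exported by the Object Detection API (the paths and file names are hypothetical):

import cv2
import numpy as np
import tensorflow as tf

# Load the exported frozen graph (hypothetical path).
graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile('inference_graph/frozen_inference_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    image = cv2.imread('train/image_0001.jpg')          # hypothetical training image
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    boxes, scores, classes = sess.run(
        ['detection_boxes:0', 'detection_scores:0', 'detection_classes:0'],
        feed_dict={'image_tensor:0': np.expand_dims(image_rgb, axis=0)})
    # The class ids (1 and 2 here) should correspond to the two labels in your label map.
    for cls, score in zip(classes[0][:5], scores[0][:5]):
        print(int(cls), float(score))

If even the training images never produce one high-scoring detection of each class, that supports the suspicion that the training setup or label map is the problem rather than colour itself.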
I will be using the values produced by this function as ground-truth labels for a computer vision task, where I will train a model on simulation data and test it on real-world data labelled with ArUco.
I calibrated my smartphone camera with a chessboard and got a reprojection error of 0.95 (I wasn't able to get lower than this; I tried all the options I could think of, such as ArUco and ChArUco boards, captured tens of images, and filtered out the bad ones, but nothing improved the error). I read somewhere that this is expected from smartphones; if not, please let me know.
Now I have placed my ArUco marker somewhere in the environment, captured images of it, and produced its pose relative to the camera using estimatePoseSingleMarker.
Upon drawing the axes, everything looks perfect and accurate. However, when I compare the pose values with the ones generated by simulating the same environment with the same object and camera, the values are actually quite different, especially the z value.
I'm 100% sure that my simulation environment has no error, so I presume this gap is caused by ArUco.
Do you suggest any solution? How can I predict the error of ArUco?
Is there any other possible way to collect ground-truth labels?
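For reference, the pose-estimation step I describe above is essentially the following (a sketch using the opencv-contrib aruco module; the file names, the dictionary choice, and the 5 cm marker length are placeholders for my actual setup):

import cv2
import numpy as np

# Calibration results from the chessboard procedure (hypothetical files).
camera_matrix = np.load('camera_matrix.npy')
dist_coeffs = np.load('dist_coeffs.npy')

frame = cv2.imread('marker_photo.jpg')                 # hypothetical capture
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

dictionary = cv2.aruco.Dictionary_get(cv2.aruco.DICT_6X6_250)
corners, ids, _ = cv2.aruco.detectMarkers(gray, dictionary)

if ids is not None:
    # The second argument is the marker side length in metres; the reported z scales
    # directly with it, so it has to match the printed marker exactly.
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, 0.05, camera_matrix, dist_coeffs)
    print('tvec (x, y, z):', tvecs[0].ravel())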
I'm currently working with SqueezeDet for detection purposes. I trained the network on synthetic data and it performs reasonably well (see the attached detection results).
For my project I would like to be able to visualize which parts of the input were more relevant to the detection process. So in the case of detecting a pedestrian, I'd assume that its pixels would be more important than, for example, the surroundings. I tried a couple of different methods, but none of them is fully satisfactory.
I did my own research and couldn't really find any papers that talk about visualization for object detection. So I implemented VisualBackProp; the results, however, don't look all too promising. If I instead compute the relevance, things look slightly better, but still not as expected.
I started thinking that perhaps the issue is related to the complexity of my outputs compared to a network that only deals with classification or, as in the VisualBackProp paper, just the prediction of a steering angle.
I was wondering if anyone has an idea of which visualization technique might best suit the detection task.
You could try just augmenting different areas of the image and see how it affects the detection confidence. For example, you could put the area containing the pedestrian on just a black background instead of the natural background to see how much the surroundings actually affect things. You could also add moderate to severe noise to select areas of the image and observe which areas correspond to the biggest change in detection confidence.
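A rough sketch of that idea (occlusion sensitivity); detection_confidence is a hypothetical stand-in for however you query SqueezeDet for the confidence of the detection you care about:

import numpy as np

def occlusion_map(image, detection_confidence, patch=32, stride=16):
    # Slide a black patch over the image and record how much the confidence drops.
    base = detection_confidence(image)
    h, w = image.shape[:2]
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = 0     # black out this region
            heat[i, j] = base - detection_confidence(occluded)
    return heat    # large values = regions whose removal hurts the detection most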
More directly, mathematically you seem to be interested in the gradient of detection confidence WRT pixel data. Depending on what deep learning platform you are using, if you run a single training iteration you may be able to obtain the gradients in the data layer (dL/dx) which will directly show these. This will only represent the effect of small changes to the pixel data - if you are aiming for more macroscopic insights than that, I think my first suggestion is probably your only option.
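As a toy TF1 illustration of that gradient idea (the single conv layer here is just a stand-in for the real network; with SqueezeDet you would use its actual input placeholder and the score tensor of the detection you care about):

import numpy as np
import tensorflow as tf

# Stand-in "network": any differentiable mapping from the image to a scalar score.
image_ph = tf.placeholder(tf.float32, shape=[1, 224, 224, 3], name='image_input')
score = tf.reduce_mean(tf.layers.conv2d(image_ph, 1, 3))

# dL/dx: gradient of the score with respect to the input pixels.
saliency_op = tf.gradients(score, image_ph)[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    image = np.random.rand(1, 224, 224, 3).astype(np.float32)   # dummy input
    grads = sess.run(saliency_op, feed_dict={image_ph: image})
    saliency = np.max(np.abs(grads[0]), axis=-1)                # per-pixel sensitivity
    print(saliency.shape)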
A bit of a theoretical question: I'd like someone to explain which colour space provides the best distances among similar-looking colours. I am trying to build a humidity detection system for dry fruits like almonds and peanuts using a normal RGB camera. I have tried the RGB and HSV colour spaces for EDA (please find the attachment). Currently I am unable to find a really big difference between the accepted and rejected samples. It would be a great help if someone could tell me what I should look for, and where.
The problem with this question is that you can't define "similar looking" without some metric value, and the metric value depends on the color space you choose.
That said, the CIELab color space is supposed to have been created with the aim of similar-looking colors having similar coordinates, and it is frequently used in object recognition. I haven't used it myself, though, so no personal experience.
For starters I would recommend treating the pixels associated with the dry fruits as 3D coordinates in the color space that you chose, and trying to apply a classification algorithm to these data points. Common algorithms that I can think of are linear discriminant analysis (LDA), support vector machines (SVM) and expectation maximization (EM). LDA and SVM belong to the supervised learning class, as they require labeled data, while EM is typically used for unsupervised clustering.
If your images are taken under different lighting conditions, a good choice of color space is one that separates the luminance value from the chromatic values, such as LUV.
Anyhow, it will be easier to answer this question if you provide example images.
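A minimal sketch of that pixel-classification idea, assuming you have cropped patches of accepted and rejected fruit (the file names and the choice of CIELab here are illustrative, not prescriptive):

import cv2
import numpy as np
from sklearn.svm import SVC

# Hypothetical crops containing only fruit pixels from each class.
accepted = cv2.imread('accepted_patch.png')
rejected = cv2.imread('rejected_patch.png')

acc_lab = cv2.cvtColor(accepted, cv2.COLOR_BGR2LAB).reshape(-1, 3)
rej_lab = cv2.cvtColor(rejected, cv2.COLOR_BGR2LAB).reshape(-1, 3)

X = np.vstack([acc_lab, rej_lab]).astype(np.float32)
y = np.hstack([np.zeros(len(acc_lab)), np.ones(len(rej_lab))])

clf = SVC(kernel='rbf').fit(X, y)
# clf.predict() can then label the pixels of a new image as accepted (0) or rejected (1).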
I am using Python and OpenCV to create face recognition with eigenfaces. I stumbled on a problem, since I don't know how to create the training set.
Do I need multiple pictures of the people I want to recognize (myself, for example), or do I need a lot of different faces to train my model?
First I tried training my model with 10 pictures of my face and 10 pictures of ScarJo's face, but the predictions were not working well.
Now I'm trying to train my model with 20 different faces (mine is one of them).
Am I doing it wrong, and if so, what am I doing wrong?
You can do both, actually. If you look at the FaceRecognizer train method, it takes in two arguments. The first is a list of pictures. The second is a list of labels (integers) that correspond to the pictures. Use the labels to designate which pictures are which faces. So in your case of just pictures of yourself, the labels would all be the same (0). The labels really matter in the case where there are pictures of both yourself and someone else. For example, here's what your labels might look like if you had pictures of both yourself and ScarJo:
faces = [scarjo_1, scarjo_2, me_1, me_2, scarjo_3]
labels = [ 0, 0, 1, 1, 0]
Notice how the last index in labels has a value of 0...the label which corresponds to ScarJo's face.
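Roughly how that fits together in code (a sketch assuming the opencv-contrib face module; EigenFaceRecognizer expects equally sized grayscale images, and the file names are placeholders):

import cv2
import numpy as np

def load_gray(path, size=(100, 100)):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.resize(img, size)

# Hypothetical image files; the order matches the labels below.
faces = [load_gray(p) for p in ['scarjo_1.png', 'scarjo_2.png',
                                'me_1.png', 'me_2.png', 'scarjo_3.png']]
labels = np.array([0, 0, 1, 1, 0], dtype=np.int32)

recognizer = cv2.face.EigenFaceRecognizer_create()
recognizer.train(faces, labels)

label, confidence = recognizer.predict(load_gray('unknown.png'))
print(label, confidence)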
I later found the answer and would like to share it in case someone else faces the same challenges.
You only need pictures of the different people you are trying to recognise. I created my training set with 30 images of each person (6 people) and figured out that histogram equalisation can play an important role both when creating the training set and later when recognising faces. Using histogram equalisation, the model's accuracy increased greatly. Another thing to consider is eye-axis alignment, so that all pictures have their eye axes aligned before they enter face recognition.
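The preprocessing I ended up with looks roughly like this (the file name is a placeholder; the eye alignment is only indicated with a comment, since it depends on how you detect the eyes):

import cv2

img = cv2.imread('face.png', cv2.IMREAD_GRAYSCALE)   # hypothetical training image
img = cv2.resize(img, (100, 100))
img = cv2.equalizeHist(img)   # spreads out intensity values, reducing lighting differences
# ...then rotate/warp so that both eyes lie on the same horizontal axis before training.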
I'm writing an OCR application to read characters from a screenshot image. Currently, I'm focusing only on digits. I'm partially basing my approach on this blog post: http://blog.damiles.com/2008/11/basic-ocr-in-opencv/.
I can successfully extract each individual character using some clever thresholding. Where things get a bit tricky is matching the characters. Even with fixed font face and size, there are some variables such as background color and kerning that cause the same digit to appear in slightly different shapes. For example, the below image is segmented into 3 parts:
Top: a target digit that I successfully extracted from a screenshot
Middle: the template: a digit from my training set
Bottom: the error (absolute difference) between the top and middle images
The parts have all been scaled (the distance between the two green horizontal lines represents one pixel).
You can see that despite both the top and middle images clearly representing a 2, the error between them is quite high. This causes false positives when matching other digits -- for example, it's not hard to see how a well-placed 7 can match the target digit in the image above better than the middle image can.
Currently, I'm handling this by having a heap of training images for each digit, and matching the target digit against those images, one-by-one. I tried taking the average image of the training set, but that doesn't resolve the problem (false positives on other digits).
I'm a bit reluctant to perform matching using a shifted template (it'd be essentially the same as what I'm doing now). Is there a better way to compare the two images than simple absolute difference? I was thinking of maybe something like the EMD (earth mover's distance, http://en.wikipedia.org/wiki/Earth_mover's_distance) in 2D: basically, I need a comparison method that isn't as sensitive to global shifting and small local changes (pixels next to a white pixel becoming white, or pixels next to a black pixel becoming black), but is sensitive to global changes (black pixels that are nowhere near white pixels becoming black, and vice versa).
Can anybody suggest a more effective matching method than absolute difference?
I'm doing all this in OpenCV using the C-style Python wrappers (import cv).
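For reference, the matching I'm doing now boils down to something like this, written with the newer cv2 interface for brevity (the file names are placeholders, and both crops are assumed to be the same size):

import cv2
import numpy as np

target = cv2.imread('target_digit.png', cv2.IMREAD_GRAYSCALE)
template = cv2.imread('template_2.png', cv2.IMREAD_GRAYSCALE)

error = cv2.absdiff(target, template)          # per-pixel absolute difference
score = int(np.sum(error))                     # lower = better match
print(score)                                   # very sensitive to one-pixel shifts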
I would look into using Haar cascades. I've used them for face detection/head tracking, and it seems like you could build up a pretty good set of cascades with enough '2's, '3's, '4's, and so on.
http://alereimondo.no-ip.org/OpenCV/34
http://en.wikipedia.org/wiki/Haar-like_features
OCR on noisy images is not easy, so simple approaches do not work well.
So I would recommend using HOG to extract features and an SVM to classify. HOG seems to be one of the most powerful ways to describe shapes.
The whole processing pipeline is implemented in OpenCV; however, I do not know the function names in the Python wrappers. You should be able to train with the latest haartraining.cpp, which actually supports more than Haar features: HOG and LBP as well.
And I think the latest code (from trunk) is much improved over the official release (2.3.1).
HOG usually needs just a fraction of the training data used by other recognition methods; however, if you want to classify shapes that are partially occluded (or missing), you should make sure to include some such shapes in training.
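A minimal sketch of the HOG + SVM idea (the window and cell sizes are made up for 20x20 digit crops, scikit-learn is used for the SVM just to keep the example short, and the random arrays stand in for real digit images and labels):

import cv2
import numpy as np
from sklearn.svm import SVC

# HOG over a 20x20 window with 10x10 blocks, 5x5 block stride and cells, 9 bins.
hog = cv2.HOGDescriptor((20, 20), (10, 10), (5, 5), (5, 5), 9)

def features(img):
    img = cv2.resize(img, (20, 20))
    return hog.compute(img).ravel()

# Placeholder data: replace with your real grayscale digit crops and their labels.
train_images = [np.random.randint(0, 256, (20, 20), dtype=np.uint8) for _ in range(50)]
train_labels = np.arange(50) % 10

X = np.array([features(im) for im in train_images])
clf = SVC(kernel='linear').fit(X, train_labels)
print(clf.predict(X[:5]))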
I can tell you from my experience and from reading several papers on character classification that a good way to start is by reading about principal component analysis (PCA), Fisher's linear discriminant analysis (LDA), and support vector machines (SVMs). These techniques are extremely useful for OCR, and it turns out that OpenCV already includes excellent implementations of PCA and SVMs. I haven't seen any OpenCV code examples for OCR, but you can use a modified version of face classification code to perform character classification. An excellent resource for face recognition code for OpenCV is this website.
Another Python library that I recommend is "scikits.learn" (now known as scikit-learn). It is very easy to send cvArrays to it and run machine learning algorithms on your data. A basic example of OCR using an SVM is here.
Another more complicated example using manifold learning for handwritten character recognition is here.
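In the same spirit as those examples, here is a small self-contained sketch using the modern scikit-learn API: PCA for dimensionality reduction followed by an SVM, run on the library's built-in digits dataset:

from sklearn import datasets, decomposition, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Project the 64-pixel images onto 30 principal components, then classify with an SVM.
pca = decomposition.PCA(n_components=30).fit(X_train)
clf = svm.SVC(kernel='rbf').fit(pca.transform(X_train), y_train)
print('accuracy:', clf.score(pca.transform(X_test), y_test))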