I'm currently working with SqueezeDet for detection purposes. I trained the network on synthetic data and it performs reasonably well.
For my project I would like to be able to visualize which parts of the input were more relevant for the detection process. So in the case of detecting a pedestrian, I'd assume that its pixels would be more important than, for example, the surroundings. I tried a couple of different methods, but none of them is fully satisfactory.
I did my own research and couldn't really find any papers that talk about visualization for object detection. So I implemented VisualBackProp; the results, however, don't look all too promising. If I compute the relevance instead, things look slightly better, but still not as expected.
I started thinking that perhaps the issue might be related to the complexity of my outputs, compared to a network that only deals with classification or, as in the VisualBackProp paper, just with predicting a steering angle.
I was wondering if anyone has an idea of which visualization technique might best suit the detection task.
You could try just perturbing different areas of the image and seeing how that affects the detection confidence. For example, you could put the area containing the pedestrian on a plain black background instead of the natural background to see how much the surroundings actually affect things. You could also add moderate to severe noise to selected areas of the image and observe which areas correspond to the biggest change in detection confidence.
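Here is a minimal sketch of that occlusion idea; `detect_confidence(image)` is a hypothetical helper (not part of SqueezeDet) that runs your trained model and returns the confidence of the detection you care about:

```python
import numpy as np

def occlusion_map(image, detect_confidence, patch=32, stride=16):
    """Slide a grey patch over the image and record how much the detection
    confidence drops at each position. detect_confidence(image) -> float is
    a hypothetical wrapper around your trained detector."""
    h, w = image.shape[:2]
    baseline = detect_confidence(image)
    heatmap = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = 128  # grey square
            heatmap[i, j] = baseline - detect_confidence(occluded)
    return heatmap  # large values = regions the detector relies on
```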
More directly, mathematically you seem to be interested in the gradient of detection confidence WRT pixel data. Depending on what deep learning platform you are using, if you run a single training iteration you may be able to obtain the gradients in the data layer (dL/dx) which will directly show these. This will only represent the effect of small changes to the pixel data - if you are aiming for more macroscopic insights than that, I think my first suggestion is probably your only option.
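If your framework exposes automatic differentiation, the gradient map can also be computed directly rather than through a training iteration. A rough TensorFlow 2 sketch (an assumption on my part; the original SqueezeDet code is TF1, so adapt accordingly), where `model(x)` is taken to return the scalar confidence you want to explain:

```python
import tensorflow as tf

def saliency(model, image):
    """Gradient of the detection confidence w.r.t. the input pixels.
    Assumes `model` is a callable (e.g. a Keras model) returning the
    scalar confidence of the detection to explain."""
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        confidence = model(x)
    grads = tape.gradient(confidence, x)                 # dL/dx
    return tf.reduce_max(tf.abs(grads), axis=-1)[0]      # per-pixel saliency
```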
Suppose I take a picture with a camera, so I know the distance from the camera to the object, such as a scale model of a house. I would like to turn this into a 3D model that I can maneuver around, so I can comment on different parts of the house.
If I sit down and think about taking more than one picture and labeling direction and distance, I could probably figure out how to do this myself, but I thought I would ask whether someone has a paper that may help explain more.
What language you explain in doesn't matter, as I am looking for the best approach.
Right now I am considering showing the house and letting the user provide some assistance for height, such as the distance from the camera to the top of a given part of the model. Given enough of this, it should be possible to start calculating heights for the rest, especially if there is a top-down image plus pictures from angles on the four sides to calculate relative heights.
Then I expect that parts will also need to differ in color to help separate out the various parts of the model.
As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Performing stereo reconstruction requires that pairs of images are taken that have a good amount of visible overlap of physical points. You need to find corresponding points such that you can then use triangulation to find the 3D co-ordinates of the points.
Epipolar geometry
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
the intrinsic camera parameters (requiring camera calibration),
the camera's position and rotation (its extrinsic parameters), and
8 or more physical points with matching known positions in two photos (when using the eight-point algorithm)
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
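As a concrete illustration of this route, here is a hedged OpenCV sketch that estimates the fundamental matrix from eight or more point matches and rectifies an uncalibrated image pair; if you do have the intrinsics K, the essential matrix follows as E = K^T F K:

```python
import cv2
import numpy as np

def rectify_pair(img1, img2, pts1, pts2):
    """Estimate the fundamental matrix from >= 8 point matches (pts1/pts2:
    Nx2 pixel arrays) and rectify an uncalibrated image pair."""
    pts1 = np.float32(pts1)
    pts2 = np.float32(pts2)
    # Normalised eight-point algorithm.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)
    h, w = img1.shape[:2]
    # Homographies that map epipolar lines to horizontal scan lines.
    ok, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, (w, h))
    rect1 = cv2.warpPerspective(img1, H1, (w, h))
    rect2 = cv2.warpPerspective(img2, H2, (w, h))
    return F, rect1, rect2
```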
Correspondence problem
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
Manual selection: have a person hand-select matching points.
Custom markers: place markers or use specific patterns/colours that you can easily identify.
Sum of squared differences: take a region around a point and find the closest-matching region in the other image (see the sketch below).
Graph cuts: a global optimisation technique based on optimisation using graph theory.
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
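To make the sum-of-squared-differences idea concrete, here is a small sketch that, for a single pixel of a rectified pair, searches along the same scan line for the best match; OpenCV's cv2.StereoBM_create does a dense version of this much faster:

```python
import numpy as np

def ssd_disparity(left, right, y, x, half=5, max_disp=64):
    """Sum-of-squared-differences match for pixel (y, x) of the left
    image, searched along the same scan line of a rectified right image
    (left/right: grayscale numpy arrays)."""
    patch = left[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp):
        xr = x - d
        if xr - half < 0:
            break
        cand = right[y - half:y + half + 1, xr - half:xr + half + 1].astype(float)
        cost = np.sum((patch - cand) ** 2)
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d  # disparity of the best match
```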
Multi-view reconstruction
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
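With OpenCV, this step is essentially a one-liner once you have the 3x4 projection matrices of the two cameras (from the calibration / essential-matrix decomposition step). A hedged sketch:

```python
import cv2
import numpy as np

def triangulate(P1, P2, pts1, pts2):
    """Triangulate matched image points (Nx2 arrays) given the 3x4
    projection matrices P1 and P2 of the two cameras; returns Nx3 points."""
    X = cv2.triangulatePoints(P1, P2, pts1.T.astype(float), pts2.T.astype(float))
    return (X[:3] / X[3]).T  # from homogeneous to Euclidean co-ordinates
```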
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
Alternatives
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the XBox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
Epipolar geometry references
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I found a PDF chapter on Multiple View Geometry that contains most of the critical theory. In fact the textbook Multiple View Geometry in Computer Vision should also be quite useful (sample chapters available here).
Here's a page describing a project on uncalibrated stereo reconstruction that seems to include some source code that could be useful. They find matching points in an automated manner using one of many feature detection techniques. If you want this part of the process to be automated as well, then SIFT feature detection is commonly considered to be an excellent non-real-time technique (since it's quite slow).
A paper about Scene Reconstruction from Multiple Uncalibrated Views.
A slideshow on Methods for 3D Reconstruction from Multiple Images (it has some more references below its slides towards the end).
A paper comparing different multi-view stereo reconstruction algorithms can be found here. It limits itself to algorithms that "reconstruct dense object models from calibrated views".
Here's a paper that goes into lots of detail for the case that you have stereo cameras that take multiple images: Towards robust metric reconstruction via a dynamic uncalibrated stereo head. They then find methods to self-calibrate the cameras.
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.
Research has made significant progress, and these days it is possible to obtain pretty good-looking 3D shapes from 2D images. For instance, our recent research work, titled "Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks", took a big step towards solving the problem of obtaining 3D shapes from 2D images. In our work, we show that you can not only go from 2D to 3D directly and get a good, approximate 3D reconstruction, but also learn a distribution of 3D shapes in an efficient manner and generate/synthesize new 3D shapes. The figures in the paper show that we are able to do 3D reconstruction even from a single silhouette or depth map, compared side by side with the ground-truth 3D shapes.
The approach we took has some connections to cognitive science and the way the brain works: the model we built shares parameters across all shape categories instead of being specific to a single category. It also obtains consistent representations and takes the uncertainty of the input view into account when producing a 3D shape as output, so it is able to give meaningful results even for very ambiguous inputs. If you look at the papers citing ours, you can see even more progress just in terms of going from 2D images to 3D shapes.
This problem is known as Photogrammetry.
Google will supply you with endless references, just be aware that if you want to roll your own, it's a very hard problem.
Check out The Deadalus Project. Although that website does not contain a gallery with illustrative information about the solution, it posts several papers and info about the working method.
I watched a lecture by one of the main researchers of the project (Roger Hubbold), and the image results are quite amazing, although it is a complex and long problem. There are a lot of tricky details to take into account to get an approximation of the 3D data. Take, for example, extracting 3D information from wall surfaces, for which the heuristic works as follows: take a photo of the scene under normal illumination, then retake the picture from the same position with full flash, subtract the two images, divide the result by a pre-taken flash calibration image, apply a box filter to this new result, and then post-process to estimate depth values. The whole process is explained in detail in a paper that is also posted/referenced on the project website.
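Purely as an illustration of that heuristic (my paraphrase, not the project's code), the image arithmetic might look roughly like this:

```python
import cv2
import numpy as np

def flash_ratio_image(ambient, flashed, flash_calib, ksize=15):
    """Rough sketch of the flash/no-flash heuristic described above: the
    flash-only component, normalised by a calibration image, falls off
    with distance and can be post-processed into depth estimates."""
    flash_only = flashed.astype(np.float32) - ambient.astype(np.float32)
    ratio = flash_only / (flash_calib.astype(np.float32) + 1e-6)
    smoothed = cv2.blur(ratio, (ksize, ksize))  # box filter
    return smoothed  # further post-processing maps this to depth
```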
Google Sketchup (free) has a photo matching tool that allows you to take a photograph and match its perspective for easy modeling.
EDIT: It appears that you're interested in developing your own solution. I thought you were trying to obtain a 3D model of an image in a single instance. If this answer isn't helpful, I apologize.
Hope this helps if you are trying to construct a 3D volume from a 2D stack of images! You can use an open-source tool such as ImageJ Fiji, which comes with a 3D viewer plugin:
https://quppler.com/creating-a-classifier-using-image-j-fiji-for-3d-volume-data-preparation-from-stack-of-images/
Given an image of puzzle pieces on some background (the background depends on the difficulty of the task), recognize the number of pieces in it and classify each piece (for each piece, tell how many peninsulas and bays it has).
The background can be plain red, or it can be coloured.
I managed to solve (or almost solve) the easy version of this problem (red background) using OpenCV: Otsu thresholding -> dilating -> some smoothing convolutions -> findContours, and then passing the contours to some classifier.
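Concretely, the pipeline I used looks roughly like this (sketched with OpenCV 4; the thresholds and area cut-off are just placeholder values):

```python
import cv2

def find_pieces(path):
    """Easy-background pipeline: Otsu threshold, dilate, smooth, then
    extract the external contours of the pieces."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    mask = cv2.dilate(mask, None, iterations=2)
    mask = cv2.GaussianBlur(mask, (5, 5), 0)
    _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [c for c in contours if cv2.contourArea(c) > 500]  # drop small noise
```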
But I have serious difficulties when trying to solve the complicated version. The best I have achieved so far uses Otsu thresholding and erode + dilate; it appears that this thresholding method does not work so well for the harder background.
My dataset is very small (fewer than 10 images), so I guess it's not possible to use deep-learning segmentation techniques. But maybe I can use some pre-trained models?
This is my first CV problem, so I don't have much knowledge about it. I'm kind of out of ideas and am asking for your help.
Thanks!
I think you need to use shape-based matching. For instance, you compute gradients for each shape at each rotation angle and scale you will use, with some small step, and then train a neural network detector.
For an example implementation, check: https://github.com/meiqua/shape_based_matching
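As a crude illustration of the idea (the linked repo does this far more efficiently on quantised gradient orientations), you could brute-force rotations and scales with plain template matching on edge images:

```python
import cv2

def best_match(image, template, angles=range(0, 360, 10), scales=(0.8, 1.0, 1.2)):
    """Brute-force match of a shape template over rotations and scales,
    scored on Canny edge images with normalised cross-correlation."""
    img_edges = cv2.Canny(image, 50, 150)
    best = (-1.0, None)  # (score, (angle, scale, location))
    h, w = template.shape[:2]
    for scale in scales:
        scaled = cv2.resize(template, (int(w * scale), int(h * scale)))
        for angle in angles:
            centre = (scaled.shape[1] / 2, scaled.shape[0] / 2)
            M = cv2.getRotationMatrix2D(centre, angle, 1.0)
            rotated = cv2.warpAffine(scaled, M, (scaled.shape[1], scaled.shape[0]))
            tmpl_edges = cv2.Canny(rotated, 50, 150)
            res = cv2.matchTemplate(img_edges, tmpl_edges, cv2.TM_CCOEFF_NORMED)
            _, score, _, loc = cv2.minMaxLoc(res)
            if score > best[0]:
                best = (score, (angle, scale, loc))
    return best
```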
One year ago I trained a model to detect flowers. One year later, I am starting this project up again, but first I decided to make sure I still remembered how by training it to detect red and green crayons.
My process is more or less following this tutorial –
https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10
I have two labels, green and red. I have 200 training images and 20 test images.
I'm using faster_rcnn_inception. I followed the steps and ran my model.
It detects the crayons about as well as you could expect with only 200 images; however, it can't tell the red and green crayons apart at all. I thought maybe I had screwed up the settings, but if I move a blue pen into the frame, the label still pops up!
Even if I feed it the training images, it classifies 99% of them as two green crayons, even though each image always contains two differently coloured crayons!
Can this model work with colour? Or is it converting the colour somehow and messing it up? Is colour hard to detect, and do I just need more training images? Have I likely screwed up a setting, since it can't even correctly classify the training images?
The config file I am using is here:
https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/faster_rcnn_inception_v2_pets.config
I've changed line 9, line 130 and line 108 to false.
In general, neural networks can detect colour.
But often they learn not to. Due to differences in colour temperature and perspective, different colours can produce the same or similar pixel-level values. Therefore, when trained on larger datasets, networks tend to become highly colour-agnostic. Unfortunately, I can only speak from gut feeling and cannot provide an example or reference.
In your case, the issue is further complicated by the fact that there is a competing task of predicting the object box. Because of that, during retraining the detection net can become insensitive to weak cues like colour.
To troubleshoot, I would recommend looking closely at your classification accuracy during retraining. As far as I can tell, the tutorial code only reports the loss value. You should expect that during retraining at least the training set is overfit almost perfectly, i.e. the green and red crayons become distinguishable. If not, it might make sense to train for longer or decrease the learning rate.
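For example, you could run the retrained model back over its own training images and tally the predicted labels. `run_inference` below is a hypothetical wrapper around your exported detection graph; adapt it to the tutorial's inference code:

```python
from collections import Counter

def label_histogram(train_image_paths, run_inference, min_score=0.5):
    """Tally predicted labels over the training images.
    run_inference(path) -> list of (label, score) is a hypothetical
    wrapper around the exported detector."""
    counts = Counter()
    for path in train_image_paths:
        for label, score in run_inference(path):
            if score >= min_score:
                counts[label] += 1
    return counts  # e.g. Counter({'green': 398, 'red': 2}) would confirm the issue
```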
There are several packages and methods for segmentation in Python. However, if I know a priori that certain pixels (and no others) correspond to a particular object, how can I use that to segment other objects?
Which methods implemented in python would lend themselves to this approach?
Thanks.
You'll want to take a look at semi-automated image segmentation. Image segmentation from a semi-automated perspective means that you know beforehand which class certain pixels belong to - either foreground or background. Given this a priori information, the goal is to minimize an energy function that best segments the rest of the pixels into foreground and background.
The best two methods that I know of are Graph Cuts and Random Walks. If you want to study the fundamentals of both of them, you should read the canonical papers by Boykov (Graph Cuts) and Grady (Random Walks) respectively:
Graph Cuts - Boykov: http://www.csd.uwo.ca/~yuri/Papers/ijcv06.pdf
Random Walks - Grady: http://webdocs.cs.ualberta.ca/~nray1/CMPUT615/MRF/grady2006random.pdf
For Graph Cuts, OpenCV uses the GrabCut algorithm, which is an extension of the original Graph Cuts algorithm: http://en.wikipedia.org/wiki/GrabCut. Essentially, you draw a box around the object you want segmented, Gaussian Mixture Models are used to model the foreground and background, and the object is segmented from the background inside this box. Additionally, you can add foreground and background markers inside the box to further constrain the solution and ensure you get a good result.
Take a look at this official OpenCV tutorial for more details: http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_grabcut/py_grabcut.html
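A minimal GrabCut sketch with the OpenCV Python bindings (the file name and rectangle are placeholders). Pixels you already know to be foreground or background can be written straight into the mask with cv2.GC_FGD / cv2.GC_BGD and the algorithm re-run with cv2.GC_INIT_WITH_MASK:

```python
import cv2
import numpy as np

img = cv2.imread('input.jpg')              # placeholder input image
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)  # internal GMM state
fgd_model = np.zeros((1, 65), np.float64)
rect = (50, 50, 300, 400)                  # placeholder box around the object

cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Keep definite and probable foreground pixels.
result = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype('uint8')
segmented = cv2.bitwise_and(img, img, mask=result)
```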
For Random Walks, this is implemented in the scikit-image library and here's a great tutorial on how to get the segmentation up and running off of their official website: http://scikit-image.org/docs/dev/auto_examples/plot_random_walker_segmentation.html
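Since you already know which pixels belong to the object, the random walker fits your setup almost directly. A small sketch with scikit-image (the parameter values are just defaults to tweak):

```python
import numpy as np
from skimage.segmentation import random_walker

def segment_with_seeds(image, known_fg, known_bg):
    """image: 2D grayscale array; known_fg / known_bg: boolean masks of
    the pixels you already know (your a priori information)."""
    markers = np.zeros(image.shape, dtype=np.uint8)
    markers[known_bg] = 1   # label 1 = background seeds
    markers[known_fg] = 2   # label 2 = foreground seeds
    labels = random_walker(image, markers, beta=130, mode='bf')
    return labels == 2      # boolean mask of the segmented object
```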
Good luck!
I'm writing an OCR application to read characters from a screenshot image. Currently, I'm focusing only on digits. I'm partially basing my approach on this blog post: http://blog.damiles.com/2008/11/basic-ocr-in-opencv/.
I can successfully extract each individual character using some clever thresholding. Where things get a bit tricky is matching the characters. Even with fixed font face and size, there are some variables such as background color and kerning that cause the same digit to appear in slightly different shapes. For example, the below image is segmented into 3 parts:
Top: a target digit that I successfully extracted from a screenshot
Middle: the template: a digit from my training set
Bottom: the error (absolute difference) between the top and middle images
The parts have all been scaled (the distance between the two green horizontal lines represents one pixel).
You can see that despite both the top and middle images clearly representing a 2, the error between them is quite high. This causes false positives when matching other digits -- for example, it's not hard to see how a well-placed 7 can match the target digit in the image above better than the middle image can.
Currently, I'm handling this by having a heap of training images for each digit, and matching the target digit against those images, one-by-one. I tried taking the average image of the training set, but that doesn't resolve the problem (false positives on other digits).
I'm a bit reluctant to perform matching using a shifted template (it'd be essentially the same as what I'm doing now). Is there a better way to compare the two images than simple absolute difference? I was thinking of maybe something like the EMD (earth mover's distance, http://en.wikipedia.org/wiki/Earth_mover's_distance) in 2D: basically, I need a comparison method that isn't as sensitive to global shifting and small local changes (pixels next to a white pixel becoming white, or pixels next to a black pixel becoming black), but is sensitive to global changes (black pixels that are nowhere near white pixels becoming black, and vice versa).
Can anybody suggest a more effective matching method than absolute difference?
I'm doing all this in OpenCV using the C-style Python wrappers (import cv).
I would look into using Haar cascades. I've used them for face detection/head tracking, and it seems like you could build up a pretty good set of cascades with enough '2's, '3's, '4's, and so on.
http://alereimondo.no-ip.org/OpenCV/34
http://en.wikipedia.org/wiki/Haar-like_features
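Once you have trained one cascade per digit (e.g. with OpenCV's cascade training tools), using it looks roughly like this; the cascade file name is hypothetical:

```python
import cv2

# Hypothetical cascade trained on examples of the digit '2';
# you would load one cascade per digit.
cascade = cv2.CascadeClassifier('cascade_digit_2.xml')

gray = cv2.imread('screenshot.png', cv2.IMREAD_GRAYSCALE)
hits = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
for (x, y, w, h) in hits:
    cv2.rectangle(gray, (x, y), (x + w, y + h), 255, 1)  # mark detections
```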
OCR on noisy images is not easy - simple approaches do not work well.
So I would recommend using HOG to extract features and an SVM to classify. HOG seems to be one of the most powerful ways to describe shapes.
The whole processing pipeline is implemented in OpenCV; however, I do not know the function names in the Python wrappers. You should be able to train with the latest haartraining.cpp - it actually supports more than Haar features; HOG and LBP as well.
And I think the latest code (from trunk) is much improved over the official release (2.3.1).
HOG usually needs just a fraction of the training data used by other recognition methods; however, if you want to classify shapes that are partially occluded (or missing), you should make sure you include some such shapes in the training set.
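In case it helps, here's a rough sketch of the HOG + SVM route using the modern cv2 bindings together with scikit-learn rather than OpenCV's own SVM; the window/cell sizes are assumptions to tune for your glyphs:

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def train_hog_svm(digits, labels):
    """digits: list of fixed-size (assumed 28x28) grayscale character
    images; labels: the corresponding digit values."""
    hog = cv2.HOGDescriptor((28, 28), (14, 14), (7, 7), (7, 7), 9)
    feats = np.array([hog.compute(d).ravel() for d in digits])
    clf = SVC(kernel='rbf', C=10, gamma='scale')
    clf.fit(feats, labels)
    return hog, clf

def predict(hog, clf, char_img):
    """Classify a single segmented character image."""
    return clf.predict(hog.compute(char_img).reshape(1, -1))[0]
```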
I can tell you from my experience and from reading several papers on character classification that a good way to start is by reading about Principal Component Analysis (PCA), Fisher's Linear Discriminant Analysis (LDA), and Support Vector Machines (SVMs). These are classification methods that are extremely useful for OCR, and it turns out that OpenCV already includes excellent implementations of PCA and SVMs. I haven't seen any OpenCV code examples for OCR, but you can use a modified version of face classification to perform character classification. An excellent resource for face recognition code for OpenCV is this website.
Another Python library I recommend is "scikits.learn" (now known as scikit-learn). It is very easy to send cvArrays to it and run machine learning algorithms on your data. A basic example of OCR using an SVM is here.
Another more complicated example using manifold learning for handwritten character recognition is here.
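To give a flavour of the PCA + SVM route with the current scikit-learn API (its built-in 8x8 digits stand in for your extracted characters):

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Built-in 8x8 digit images as a stand-in for your segmented characters.
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Project onto the top principal components, then classify with an RBF SVM.
clf = make_pipeline(PCA(n_components=30), SVC(kernel='rbf', gamma='scale'))
clf.fit(X_train, y_train)
print('accuracy:', clf.score(X_test, y_test))
```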