I am very new to this area. I want to improve and I need your advice. I want to detect objects and find the distances between the objects and my camera, using a phone camera. What should I learn in order to achieve this? Any advice would be appreciated.
If you want the following: "a single picture, taken with any camera, at any distance, and calculate the distance given an image", then I fear that might be impossible, because a single view carries no depth information. It would be practically impossible for a neural network to guess how far away an object is just from how big it appears in the image. Retrieved from Wikipedia:
Depth perception arises from a variety of depth cues. These are
typically classified into binocular cues that are based on the receipt
of sensory information in three dimensions from both eyes and
monocular cues that can be represented in just two dimensions and
observed with just one eye
With that out of the way: you did say YOUR camera. Using a specific camera changes things. If you know the focal length and angle of view, that would help a lot. Here are some links to illustrate that:
focal length
angle of view
Maybe you can calculate your way out of this, but you will need some constraints or calibration, one way or another. Hope I helped a bit.
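To make the focal length / angle of view remark concrete, here is a minimal sketch using the pinhole camera model. It assumes you know the real height of the object and that it roughly faces the camera; the function names and the example numbers (image width, field of view, object size) are made up for illustration.

```python
import math

def focal_length_px(image_width_px, horizontal_fov_deg):
    """Focal length in pixels from the image width and the horizontal angle of view."""
    return (image_width_px / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)

def distance_to_object(focal_px, real_height, height_in_image_px):
    """Pinhole model: distance = focal length * real height / height in pixels.
    The result is in the same unit as real_height."""
    return focal_px * real_height / height_in_image_px

# Hypothetical phone camera: 4000 px wide image, 66 degree horizontal angle of view.
f_px = focal_length_px(4000, 66.0)
# Hypothetical object: 1.70 m tall, spanning 850 px in the image.
print(distance_to_object(f_px, 1.70, 850.0))
```

This only works under the constraint mentioned above: without a known object size (or some other calibration), the same image height is consistent with many distances.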
I am working on an object detection project. To measure an object's dimensions correctly I am using a coin for reference, and to measure accurately I need a bird's-eye view of this image.
[Image Here]
Disclaimer: This approach is not mathematically complete nor exact, I know. Although I hope someone will find it useful for real life applications or has some positive ideas how to improve it.
As you can see from the discussion, you can't get an accurate estimate of the vanishing point or the horizon from just one coin, because a circle can be projected to the same ellipse for different vanishing points. However, if there are two coins of the same size at the bottom center and top center of the image, it should be manageable to get acceptable accuracy:
If your use case allows it, you can make assumptions that lower the accuracy but make it easier to implement:
Assume that the plane's normal vector is parallel to the yz-plane of your image, i.e the camera is held in a "normal" way and - in relation to the plane - not tilted to the left or right.
Assume that the two coins are placed in the middle of the picture.
With this you can:
Extract the two ellipses.
Get the two tangents of both ellipses left and right.
Get the two horizontal tangents of the bigger ellipse.
Finally, get the four points where the tangents intersect.
Use the four points as input to warpPerspective as described here.
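The last step can be sketched as follows. This is a NumPy-only illustration of how the 3x3 perspective matrix is obtained from four point pairs; the point coordinates are invented, and extracting the ellipses and tangents is not shown. In OpenCV, cv2.getPerspectiveTransform computes the same matrix, which you then pass to cv2.warpPerspective.

```python
import numpy as np

def homography_from_4_points(src, dst):
    """Solve for the 3x3 matrix H with dst ~ H @ src (direct linear transform)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # H (flattened) is the null vector of this 8x9 system.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map N x 2 points through H, including the perspective division."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical tangent intersections (a trapezoid in the photo) ...
src = np.array([[300.0, 200.0], [500.0, 200.0], [650.0, 600.0], [150.0, 600.0]])
# ... mapped to a square in the bird's-eye view.
dst = np.array([[0.0, 0.0], [400.0, 0.0], [400.0, 400.0], [0.0, 400.0]])
H = homography_from_4_points(src, dst)
print(apply_homography(H, src))
```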
Of course, if we are talking about a mobile app, then sensor and camera data from the phone could help without bothering the user too much.
If I take a picture with a camera, so I know the distance from the camera to the object, such as a scale model of a house, I would like to turn this into a 3D model that I can maneuver around so I can comment on different parts of the house.
If I sit down and think about taking more than one picture, labeling direction, and distance, I should be able to figure out how to do this, but, I thought I would ask if someone has some paper that may help explain more.
What language you explain in doesn't matter, as I am looking for the best approach.
Right now I am considering showing the house, then the user can put in some assistance for height, such as distance from the camera to the top of that part of the model, and given enough of this it would be possible to start calculating heights for the rest, especially if there is a top-down image, then pictures from angles on the four sides, to calculate relative heights.
Then I expect that parts will also need to differ in color to help separate out the various parts of the model.
As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Performing stereo reconstruction requires that pairs of images are taken that have a good amount of visible overlap of physical points. You need to find corresponding points such that you can then use triangulation to find the 3D co-ordinates of the points.
Epipolar geometry
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
the intrinsic camera parameters (requiring camera calibration),
the camera's position and rotation (its extrinsic parameters), and
8 or more physical points with matching known positions in two photos (when using the eight-point algorithm)
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
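As a rough sketch of the eight-point algorithm mentioned above, here is the normalised variant in NumPy. It assumes you already have at least 8 matched pixel positions in both photos (finding them is the correspondence problem) and that the matches contain no outliers; the function names are my own.

```python
import numpy as np

def _normalise(pts):
    """Shift to zero mean and scale so the mean distance from the origin is sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    return pts_h @ T.T, T

def fundamental_eight_point(pts1, pts2):
    """Estimate F such that x2^T F x1 = 0, from N >= 8 matches (N x 2 arrays)."""
    x1, T1 = _normalise(np.asarray(pts1, dtype=float))
    x2, T2 = _normalise(np.asarray(pts2, dtype=float))
    # Each match gives one linear equation in the 9 entries of F.
    A = np.einsum('ni,nj->nij', x2, x1).reshape(len(x1), 9)
    _, _, vt = np.linalg.svd(A)
    F = vt[-1].reshape(3, 3)
    # A valid fundamental matrix has rank 2: zero the smallest singular value.
    u, s, vt = np.linalg.svd(F)
    F = u @ np.diag([s[0], s[1], 0.0]) @ vt
    # Undo the normalisation.
    F = T2.T @ F @ T1
    return F / np.linalg.norm(F)
```

With real (noisy, partly wrong) matches you would normally wrap this in a robust scheme such as RANSAC rather than call it once on all points.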
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
Correspondence problem
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
Manual selection: have a person hand-select matching points.
Custom markers: place markers or use specific patterns/colours that you can easily identify.
Sum of squared differences: take a region around a point and find the closest whole matching region in the other image.
Graph cuts: a global optimisation technique based on optimisation using graph theory.
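To illustrate the sum of squared differences idea from the list above, here is a minimal local matcher in NumPy. It assumes the two images are already rectified (so the match lies on the same row) and greyscale; the window size and disparity range are arbitrary choices of mine.

```python
import numpy as np

def ssd_match_along_row(left, right, row, col, half=3, max_disparity=32):
    """Return the disparity d that minimises the SSD between the patch around
    (row, col) in `left` and the patch around (row, col - d) in `right`."""
    patch = left[row - half:row + half + 1, col - half:col + half + 1].astype(float)
    best_d, best_cost = 0, np.inf
    for d in range(max_disparity + 1):
        c = col - d
        if c - half < 0:
            break  # candidate window would leave the image
        cand = right[row - half:row + half + 1, c - half:c + half + 1].astype(float)
        cost = np.sum((patch - cand) ** 2)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

This is the fast but fragile end of the spectrum: it struggles in textureless or repetitive regions, which is where the global techniques earn their cost.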
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
Multi-view reconstruction
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
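A minimal sketch of the triangulation step, assuming you already have the two 3x4 camera projection matrices (intrinsics times [R | t]) and one matched pixel position in each image. This is the simple linear (DLT) method; the names are my own.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear triangulation of one 3D point from two views.
    P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel positions."""
    # Each view contributes two linear equations in the homogeneous point X.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

With noisy matches the two rays do not intersect exactly, and this returns a least-squares compromise in an algebraic sense.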
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
Alternatives
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the XBox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
Epipolar geometry references
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I found a PDF chapter on Multiple View Geometry that contains most of the critical theory. In fact the textbook Multiple View Geometry in Computer Vision should also be quite useful (sample chapters available here).
Here's a page describing a project on uncalibrated stereo reconstruction that seems to include some source code that could be useful. They find matching points in an automated manner using one of many feature detection techniques. If you want this part of the process to be automated as well, then SIFT feature detection is commonly considered to be an excellent non-real-time technique (since it's quite slow).
A paper about Scene Reconstruction from Multiple Uncalibrated Views.
A slideshow on Methods for 3D Reconstruction from Multiple Images (it has some more references below its slides towards the end).
A paper comparing different multi-view stereo reconstruction algorithms can be found here. It limits itself to algorithms that "reconstruct dense object models from calibrated views".
Here's a paper that goes into lots of detail for the case that you have stereo cameras that take multiple images: Towards robust metric reconstruction via a dynamic uncalibrated stereo head. They then find methods to self-calibrate the cameras.
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.
Research has made significant progress, and these days it is possible to obtain pretty good-looking 3D shapes from 2D images. For instance, our recent research work, titled "Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks", took a big step towards solving the problem of obtaining 3D shapes from 2D images. In our work, we show that you can not only go from 2D to 3D directly and get a good, approximate 3D reconstruction, but you can also learn a distribution of 3D shapes in an efficient manner and generate/synthesize 3D shapes. Below is an image of our work showing that we are able to do 3D reconstruction even from a single silhouette or depth map (on the left). The ground-truth 3D shapes are shown on the right.
The approach we took has some contributions related to cognitive science or the way the brain works: the model we built shares parameters for all shape categories instead of being specific to only one category. Also, it obtains consistent representations and takes the uncertainty of the input view into account when producing a 3D shape as output. Therefore, it is able to naturally give meaningful results even for very ambiguous inputs. If you look at the citation to our paper you can see even more progress just in terms of going from 2D images to 3D shapes.
This problem is known as Photogrammetry.
Google will supply you with endless references, just be aware that if you want to roll your own, it's a very hard problem.
Check out The Daedalus Project. Although that website does not contain a gallery with illustrative information about the solution, it posts several papers and info about the working method.
I watched a lecture from one of the main researchers of the project (Roger Hubbold), and the image results are quite amazing! It is a complex and long problem, though, with a lot of tricky details to take into account to get an approximation of the 3D data. Take for example the 3D information from wall surfaces, for which the heuristic works as follows: take a photo of the scene with normal illumination, then retake the picture from the same position with full flash active, subtract the two images and divide the result by a pre-taken flash calibration image, apply a box filter to this new result, and then post-process to estimate depth values. The whole process is explained in detail in this paper (which is also posted/referenced on the project website).
Google Sketchup (free) has a photo matching tool that allows you to take a photograph and match its perspective for easy modeling.
EDIT: It appears that you're interested in developing your own solution. I thought you were trying to obtain a 3D model of an image in a single instance. If this answer isn't helpful, I apologize.
Hope this helps if you are trying to construct a 3D volume from a 2D stack of images! You can use an open source tool such as ImageJ Fiji, which comes with a 3D viewer plugin.
https://quppler.com/creating-a-classifier-using-image-j-fiji-for-3d-volume-data-preparation-from-stack-of-images/
I have a photo taken with a camera (whose focal length, principal point, and distortion coefficients I know). The photo has an 8cm x 8cm post-it on a table and the center of the post-it is the origin (0, 0), again in cm. I've also indicated the positive-y axis on the post-it.
From this information is it possible to compute the location of the camera and the vector in which the camera is looking in Python using OpenCV? If someone has a snippet of code that does that (assuming you know the coordinates of the post-it corners already) that would be amazing!
Use OpenCV's solvePnP, specifying SOLVEPNP_IPPE_SQUARE in the flags. With only 4 points (and a post-it) the solution will be quite sensitive to how accurately you mark their images, so ask yourself whether you really need the camera pose and location for your application, and how accurately. E.g., if you just want to make a flat CG "sticker" stay fixed on the table while the camera moves, all you need is to estimate a homography, a much simpler task.
It does look like you have all the information required. The marker you use can be easily segmented. Shape analysis will provide corners. I did something similar to get basic eyesight tracking:
Here is a complete example.
Segmentation result for the example:
Please notice, accuracy really matters, so it might be useful to rely on several sets of points.
I have an image captured by android camera. Is it possible to calculate depth of object in the image ? Image contains object and background only. Any suggestion, explanation or links that you think can help me will be appreciated.
OpenCV is the library you need.
I did some depth identification of water levels against a pure white background a few days ago. Generally, if you want to identify the depth, you can convert the question into identifying the edge where the colors change. In this case, you can convert the color pictures to grey and identify the white-black-grey interface. OpenCV is capable of doing the job at high speed.
Hope it helps. Let me know if you need further help.
Edits:
If you want to find the actual depths, you need to project the coordinate system of your pictures to the real world, or vice versa. To do it, you have to know a fixed location as your reference and the relationship between pixels and real distances.
What I did is find the fixed location and set it as zero. Afterwards, I measured the real length of an object in the picture, and also counted the number of pixels it spans. From these I obtained the relationship between pixels and real distances.
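The pixel-to-distance relationship described above boils down to a single scale factor. A minimal sketch, assuming the reference object and the thing you measure lie in the same plane, roughly parallel to the image sensor (otherwise perspective breaks the constant scale); the names and numbers are invented:

```python
def cm_per_pixel(reference_length_cm, reference_length_px):
    """Scale factor from a reference object of known real length."""
    return reference_length_cm / reference_length_px

def pixel_row_to_level_cm(row_px, zero_row_px, scale_cm_per_px):
    """Height above the zero reference; image rows grow downwards."""
    return (zero_row_px - row_px) * scale_cm_per_px

# Hypothetical reference: a 10 cm ruler that spans 200 px in the picture.
scale = cm_per_pixel(10.0, 200.0)
# Hypothetical detected edge at row 300, with the zero reference at row 700.
print(pixel_row_to_level_cm(300, 700, scale))
```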
Note that these procedures may involve errors in the identification. I did it very carefully and the error was acceptable in my case.
With only one image, accurate depth estimation is nearly impossible. However, there are various methods of estimating depth under certain assumptions or given the camera calibration matrix. As mentioned by @WenlongLiu, OpenCV is a very good place to start.
I'd like to determine the position and orientation of a stereo camera relative to its previous position in world coordinates. I'm using a bumblebee XB3 camera and the motion between stereo pairs is on the order of a couple feet.
Would this be on the correct track?
Obtain rectified image for each pair
Detect/match feature points in the rectified images
Compute Fundamental Matrix
Compute Essential Matrix
Thanks for any help!
Well, it sounds like you have a fair understanding of what you want to do! Having a pre-calibrated stereo camera (like the Bumblebee) will then deliver up point-cloud data when you need it - but it also sounds like you basically want to also use the same images to perform visual odometry (certainly the correct term) and provide absolute orientation from a last known GPS position, when the GPS breaks down.
First things first - I wonder if you've had a look at the literature for some more ideas: As ever, it's often just about knowing what to google for. The whole idea of "sensor fusion" for navigation - especially in built up areas where GPS is lost - has prompted a whole body of research. So perhaps the following (intersecting) areas of research might be helpful to you:
Navigation in 'urban canyons'
Structure-from-motion for navigation
SLAM
Ego-motion
Issues you are going to encounter with all these methods include:
Handling static vs. dynamic scenes (i.e. ones that change purely based on the camera motion - c.f. others that change as a result of independent motion occurring in the scene: trees moving, cars driving past, etc.).
Relating amount of visual motion to real-world motion (the other form of "calibration" I referred to - are objects small or far away? This is where the stereo information could prove extremely handy, as we will see...)
Factorisation/optimisation of the problem - especially with handling accumulated error along the path of the camera over time and with outlier features (all the tricks of the trade: bundle adjustment, ransac, etc.)
So, anyway, pragmatically speaking, you want to do this in python (via the OpenCV bindings)?
If you are using OpenCV 2.4 the (combined C/C++ and Python) new API documentation is here.
As a starting point I would suggest looking at the following sample:
/OpenCV-2.4.2/samples/python2/lk_homography.py
Which provides a nice instance of basic ego-motion estimation from optic flow using the function cv2.findHomography.
Of course, this homography H only applies if the points are co-planar (i.e. lying on the same plane under the same projective transform - so it'll work on videos of nice flat roads). BUT - by the same principle we could use the Fundamental matrix F to represent motion in epipolar geometry instead. This can be calculated by the very similar function cv2.findFundamentalMat.
Ultimately, as you correctly specify above in your question, you want the Essential matrix E - since this is the one that operates in actual physical coordinates (not just mapping between pixels and epipolar lines). I always think of the Fundamental matrix as a generalisation of the Essential matrix in which the (inessential) knowledge of the camera intrinsic calibration (K) is omitted, and vice versa.
Thus, the relationships can be formally expressed as:
E = K'^T F K
So, you'll need to know something of your stereo camera calibration K after all! See the famous Hartley & Zisserman book for more info.
You could then, for example, decompose the Essential matrix (via its SVD, or with cv2.decomposeEssentialMat or cv2.recoverPose in OpenCV 3.0 and later) to recover your R orientation and t displacement. Note that from E alone, t is only determined up to scale.
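The relationship E = K'^T F K and the recovery of R and t can be sketched in plain NumPy as below. This follows the standard SVD-based decomposition from the Hartley & Zisserman book; the function names are my own, and picking the physically valid combination (points in front of both cameras) is left out.

```python
import numpy as np

def essential_from_fundamental(F, K1, K2):
    """E = K2^T F K1, with K1 and K2 the intrinsics of the first and second view."""
    return K2.T @ F @ K1

def decompose_essential(E):
    """Return the two candidate rotations and the translation direction.
    The translation has unit length and an ambiguous sign."""
    U, _, Vt = np.linalg.svd(E)
    # E is only defined up to sign, so flip factors to get proper rotations.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]
    return R1, R2, t
```

That leaves four (R, t) candidates: R1 or R2, each with +t or -t. The usual way to choose is to triangulate a few matches and keep the candidate that puts them in front of both cameras. With a stereo rig, the known baseline can then be used to fix the metric scale of t.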
Hope this helps! One final word of warning: this is by no means a "solved problem" for the complexities of real world data - hence the ongoing research!