So if you take a pinhole camera and make it as the origin of our plane(3D) and a pixel from the image plane and connect the two with a straight line it should make a vector, which has direction and a length. Think of this as the path followed by the light reflected from an object into the camera lens. And I want to calculate this. I think we have to use the cameras intrinsic properties for this.
Below is a statement that made me think about it all.
In a pinhole camera model, each pixel defines a direction vector in 3D space, specifically the vector from the projection center through the pixel's position on the image plane.
Here is a diagram better explaining this.
I want to calculate the three red lines, and known parameters would be, I guess, the camera position(origin) and the image pixel value, and the intrinsic camera parameters.
Related
I am working on an application using an IFM 3D camera to identify parts prior to a robot pickup. Currently I am able to find the centroid of these objects using contours from a depth image and from there calculate the center point of these objects in pixel space.
My next task is to then transform the 2D centroid coordinates to a 3D point in 'real' space. I am able to train the robot such that it's coordinate frame is either at the center of the image or at the traditional (0,0) point of an image (top left).
The 3D camera I am using provides both an intrinsic and extrinsic matrix. I know I need to use some combination of these matrices to project my centroid into three space but the following questions remain:
My current understanding from googling is the intrinsic matrix is used to fix lens distortion (barrel and pinhole warping, etc.) whereas the extrinsic matrix is used to project points into the real world. Is this simplified assumption correct?
How can a camera supply a single extrinsic matrix? I know traditionally these matrices are found using the checkerboard corners method but are these not dependent on the height of the camera?
Is the solution as simple as taking the 3x4 extrinsic matrix and multiplying it by a 3x1 matrix [x, y, 1] and if so, will the returned values be relative to the camera center or the traditional (0,0) point of an image.
Thanks in advance for any insight! Also if it's any consolation I am doing everything in python and openCV.
No. I suggest you read the basics in Multiple View Geometry of Hartley and Zisserman, freely available in the web. Dependent on the camera model, the intrinsics contain different parameters. For the pinhole camera model, these are the focal length and the principal point.
The only reason why you maybe could directly transform your 2D centroid to 3D is that you use a 3D camera. Read the manual of the camera, it should be explained how the relation between 2D and 3D coordinates is given for your specific model.
If you have only image data, you can only compute a 3D point from at least two views.
No, of course not. Please don't be lazy and start reading the basics about camera projection instead of asking for others to explain the common basics that are written down everywhere in the web and literature.
I am trying to find the ground-plane coordinates of contour of an image. I have been able to detect the contour and get the x, y, w and h of the contour, but I now need it on a 2D ground plane coordinate system. I am trying to perform camera calibration to find out data about the camera so that I can perform homography to get the ground plane data. However I am using pre-made videos so I cannot use the checkerboard solution that I am finding online. Does anyone have any ideas?
Below is an example image:
For objects far away, the height of the bridge in the foreground is very small compared to their distance from the camera, so a homography computed from points on the bridge will be approximately correct far away, provided the bridge is approximately horizontal (no roll). This is another way of saying that the images all horizontal planes pass through the horizon.
For objects nearby, but not on the bridge, the above approximation will suffer from parallax error. Unless you have an object of known scale on the plane of interest (the sea), or an estimate of the distance of the bridge from that plane, there is no information available to resolve depth - as far as you can tell the bridge could be 100m or 1mm above the sea.
I have a single camera that I can move around. I have the intrinsic parameter matrix and the extrinsic parameter matrix for each camera orientation. For object detection, I use YOLO and I get 2D bounding boxes in the image plane. My plan is to use a temporal pair of images, with the detected object in it, to triangulate the midpoint of the resulting 2D bounding box around the object.
Right now, I use two images that are 5 frames apart. That means, the first frame has the object in it and the second frame has the same object in it after a few milliseconds. I use cv2.triangulatePoints to get the corresponding 3D point for the 2D midpoint of the bounding box.
My main problem is that when the camera is more or less steady, the resulting distance value is accurate (within a few centimeters). However, when I move the camera around, the resulting distance value for the object starts varying quite a bit (the object is static and never moves, only the camera looking at it moves). I can't seem to understand why this is the case.
For cv2.triangulatePoints, I get the relative rotation matrix between the two temporal camera orientations (R = R2R1) and then get the relative translation (t = t2 - Rt1). P1 and P2 are the final projection matrices (P1 for the camera at an earlier position and P2 for the camera at a later position). P1 = K[I|0] and P2 = K[R|t], where K is the 3x3 intrinsic parameter matrix, I is a 3x3 identity matrix, and 0 is 3x1 vector of zeros.
Should I use a temporal gap of 10 frames or is using this method to localize objects using a single camera never accurate?
The centers of the bounding boxes are not guaranteed to be the projections of a single scene (3d) point, even with a perfect track, unless additional constraints are added. For example, that the tracked object is planar, or that the vertexes of the bounding boxes track points that are on a plane. Things get more complicated when tracking errors are present.
If you really need to triangulate the box centers (do you? are you sure you can't achieve your goals using only well-matched projections?), you could use a small area around the center in one box as a pattern, and track it using a point tracker (e.g. one based on the Lucas-Kanade algorithm, or one based on normalized cross-correlation) in the second image, using the second box to constrain the tracker search window.
Then you may need to address the accuracy of your camera motion estimation - if errors are significant your triangulations will go nowhere. Bundle adjustment may need to become your friend.
OpenCV provides methods to calibrate a camera. I want to know if it also has a way to simply generate a view projection matrix if and when the parameters are known.
i.e I know the camera position, rotation, up, FOV... and whatever else is needed, then call MagicOpenCVCamera(parameters) and obtain a 4x4 transformation matrix.
I have searched this up but I can only find information about calibrating the camera, not about creating one if you already know the parameters.
The projection matrix is simply a 3x4 matrix whose [0:3,0:3] left square is occupied by the product K.dot(R) of the camera intrinsic calibration matrix K and its camera-from-world rotation matrix R, and the last column is K.dot(t), where t is the camera-from-world translation. To clarify, R is the matrix that brings into camera coordinates a vector decomposed in world coordinates, and t is the vector whose tail is at the camera center, and whose tip is at the world origin.
The OpenCV calibration routines produce the camera orientations as rotation vectors, not matrices, but you can use cv.Rodrigues to convert them.
I'm trying to find the relative position of the camera to the chessboard (or the other way around) - I feel OK with converting between different coordinate systems, e.g. as suggested here. I decided to use chessboard not only for calibration but actual position determination as well at this stage, since I can use the findChessboardCorners to get the imagePoints (and this works OK).
I've read a lot on this topic and feel that I understand the solvePnP outputs (even though I'm completely new to openCV and computer vision in general). Unfortunately, the results I get from solvePnP and physically measuring the test set-up are different: translation in z-direction is off by approx. 25%. x and y directions are completely wrong - several orders of magnitude and different direction than what I've read to be the camera coordinate system (x pointing up the image, y to the right, z away from the camera). The difference persists if I convert tvec and rvec to camera pose in world coordinates.
My questions are:
What are the directions of camera and world coordinate systems' axes?
Does solvePnP output the translation in the same units as I specify the objectPoints?
I specified the world origin as the first of the objectPoints (one of the chessboard corners). Is that OK and is tvec the translation to exactly that point from the camera coordinates?
This is my code (I attach it pro forma as it does not throw any exceptions etc.). I used grayscale images to get the camera intrinsics matrix and distortion coefficients during calibration so decided to perform localisation in grayscale as well. chessCoordinates is a list of chessboard points location in mm with respect to the origin (one of the corner points). camMatrix and distCoefficients come from calibration (performed using the same chessboard and objectPoints).
camCapture=cv2.VideoCapture(0) # Take a picture of the target to get the imagePoints
tempImg=camCapture.read()
imgPts=[]
tgtPts=[]
tempImg=cv2.cvtColor(tempImg[1], cv2.COLOR_BGR2GRAY)
found_all, corners = cv2.findChessboardCorners(tempImg, chessboardDim )
imgPts.append(corners.reshape(-1, 2))
tgtPts.append(np.array(chessCoordinates, dtype=np.float32))
retval,myRvec,myTvec=cv2.solvePnP(objectPoints=np.array(tgtPts), imagePoints=np.array(imgPts), cameraMatrix=camMatrix, distCoeffs=distCoefficients)
The camera coordinates are the same as image coordinates. So You have x axe pointing in the right side from the camera, y axe pointing down, and z pointing in the direction camera is faced. This is a clockwise axe system, and the same would apply to the chessboard, so if You specified the origin in, lets say, upper right corner of the chessboard, x axe goes along the longer side to the right and y along shorter side of the chessboard, z axe would be pointing downward, to the ground.
Solve PnP outputs the translation in the same units as the units in which You specified the length of chessboard fields, but it might also use units specified in camera calibration, as it uses the camera matrix.
Tvec points to the origin of the world coordinates in which You placed the calibration object. So if You placed the first object point in (0,0), thats where tvec will point to.
What are the directions of camera and world coordinate systems' axes?
The 0,0,0 corner on the boards is so that the X & Y axis are towards the rest of the corner points. The Z axis is always pointing away from the board. This means that it's usually pointing somewhat in the direction of the camera.
Does solvePnP output the translation in the same units as I specify the objectPoints?
Yes
I specified the world origin as the first of the objectPoints (one of the chessboard corners). Is that OK and is tvec the translation to exactly that point from the camera coordinates?
Yes, this is pretty common. In most of the cases, the first cam corner is set as 0,0,0 and subsequent corners being set at the z=0 plane (eg; (1,0,0) , (0,1,0), etc).
The tvec, combined with the rotation, points towards that point from the board coordinate frame toward the camera. In short; the tvec & rvec provide you with the inverse translation (world -> camera). With some basic geometry you can calculate the transformation that puts camera -> world.