Given: Mesh, Source Camera - I have intrinsic and extrinsic parameters, Image coordinate 2d
Output: 3D point, which is the intersection of a ray from camera center, through the 2d point on the image plane and the mesh. (I'm trying to find the 3d point on the mesh)
This is the process:
From Multiple View Geometry in Computer Vision book:
I have constructed the equation (6.14).
I'm not sure how to continue and get the 3D point that lies on the mesh (I also need the point that is closest to the camera).
I thought that it can be done in the following way:
Iterate over all the vertices and compute the distance between each vertex and the line. The vertices whose distance is zero or close to zero lie on the line. To find the closest of those vertices, I guess I compute the distance between the camera center and each of them, and the smallest distance marks the closest point?
Quick update: This repo does seem to work with the rays:
I guess the bug now lies in getting the point D right..
As Grillteller pointed out in the comment, this is a ray intersection problem with the 3D mesh. As far as I know, there is no known quick way to determine the intersection for an arbitrary mesh. In your problem context you should use ray tracing, which Grillteller also pointed out. It has serious performance issues, although it gives a lot of shading possibilities.
To find the intersection of a ray and a mesh, the Ray Tracing algorithm typically uses different acceleration structures. Often such structures are a partition of space by trees:
KD-tree for Ray Tracing
BSP-tree for Ray Tracing
Octree for Ray Tracing
This presentation explains some of these and other approaches very well.
P.S.: If you only need a simple visualization, it would be better to reverse the problem: for each mesh element, perform rasterisation.
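For a single ray, the brute-force baseline that the acceleration structures above speed up can be sketched as follows. This is a minimal NumPy version of the Möller-Trumbore ray-triangle test applied to every face, not the method of any particular library; the function name `nearest_ray_mesh_hit` and the `vertices`/`faces` array layout are assumptions made for the illustration.

```python
import numpy as np

def nearest_ray_mesh_hit(origin, direction, vertices, faces, eps=1e-9):
    """Return (point, face_index, distance) of the closest hit, or None.

    vertices: (n, 3) float array, faces: (m, 3) integer array (assumed layout).
    """
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)

    tri = np.asarray(vertices, dtype=float)[np.asarray(faces)]  # (m, 3, 3)
    v0 = tri[:, 0]
    e1 = tri[:, 1] - v0
    e2 = tri[:, 2] - v0

    # Moller-Trumbore, vectorised over all triangles
    pvec = np.cross(direction, e2)
    det = np.einsum('ij,ij->i', e1, pvec)
    ok = np.abs(det) > eps              # skip triangles parallel to the ray
    inv_det = np.zeros_like(det)
    inv_det[ok] = 1.0 / det[ok]

    tvec = origin - v0
    u = np.einsum('ij,ij->i', tvec, pvec) * inv_det
    qvec = np.cross(tvec, e1)
    v = np.einsum('ij,j->i', qvec, direction) * inv_det
    t = np.einsum('ij,ij->i', e2, qvec) * inv_det

    # barycentric coordinates inside the triangle, hit in front of the origin
    hit = ok & (u >= 0) & (v >= 0) & (u + v <= 1) & (t > eps)
    if not hit.any():
        return None
    idx = np.flatnonzero(hit)
    best = idx[np.argmin(t[idx])]       # smallest t = closest to the camera
    return origin + t[best] * direction, int(best), float(t[best])
```

The cost is linear in the number of triangles per ray, which is exactly why the tree structures listed above are used for large meshes or many rays.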
I found another implementation called trimesh using python.
You need to read the installation guide, and then you are able to load your meshes via:
import numpy as np
import trimesh
# attach to logger so trimesh messages will be printed to console
trimesh.util.attach_to_log()
mesh = trimesh.load('models/CesiumMilkTruck.glb', force='mesh')
I found the relevant lines to import a camera into a scene as trimesh.scene.Camera.
Then you can use the function cameras_to_rays(camera) (line 417) to "return one ray per pixel, as set in camera.resolution".
So now you have the rays for every pixel and the mesh, and can create a RayMeshIntersector. Then you can use intersects_location (line 75) to calculate the Cartesian coordinates of the points where each ray hits the mesh.
I found an example for your purpose here:
"""
A very simple example of using scene cameras to generate
rays for image reasons.

Install `pyembree` for a speedup (600k+ rays per second)
"""
from __future__ import division
import PIL.Image
import trimesh
import numpy as np

if __name__ == '__main__':
    # test on a simple mesh
    mesh = trimesh.load('../models/featuretype.STL')
    # scene will have automatically generated camera and lights
    scene = mesh.scene()
    # any of the automatically generated values can be overridden
    # set resolution, in pixels
    scene.camera.resolution = [640, 480]
    # set field of view, in degrees
    # make it relative to resolution so pixels per degree is same
    scene.camera.fov = 60 * (scene.camera.resolution /
                             scene.camera.resolution.max())
    # convert the camera to rays with one ray per pixel
    origins, vectors, pixels = scene.camera_rays()
    # do the actual ray-mesh queries
    points, index_ray, index_tri = mesh.ray.intersects_location(
        origins, vectors, multiple_hits=False)
    # for each hit, find the distance along its vector
    depth = trimesh.util.diagonal_dot(points - origins[0],
                                      vectors[index_ray])
    # find pixel locations of actual hits
    pixel_ray = pixels[index_ray]
    # create a numpy array we can turn into an image
    # doing it with uint8 creates an `L` mode greyscale image
    a = np.zeros(scene.camera.resolution, dtype=np.uint8)
    # scale depth against range (0.0 - 1.0)
    # (np.ptp is used because ndarray.ptp() was removed in NumPy 2.0)
    depth_float = ((depth - depth.min()) / np.ptp(depth))
    # convert depth into 0 - 255 uint8
    depth_int = (depth_float * 255).round().astype(np.uint8)
    # assign depth to correct pixel locations
    a[pixel_ray[:, 0], pixel_ray[:, 1]] = depth_int
    # create a PIL image from the depth queries
    img = PIL.Image.fromarray(a)
    # show the resulting image
    img.show()
    # create a raster render of the same scene using OpenGL
    # rendered =
The problem in the question is to find the closest point on a 3D mesh that is visible at a specific 2D point of the screen, which is part of the ray tracing technique. The ray in question is uniquely defined by the camera location (the ray's origin) and the pixel location through which the ray passes. So knowing both of them allows one to specify the ray and find its intersection (if any) with the triangular surface.
It is a rather computationally expensive task, especially for high-resolution screens (millions of pixels) and detailed meshes (millions of triangles), so a number of highly optimized software libraries were developed for it, for example:
Nvidia OptiX uses GPU for fast finding of ray-surface intersections. One can find a wrapper library for python.
Intel Embree does the same on x86 processors. Python wrappers: python-embree and pyembree. The latter is a dependency of trimesh for fast queries.
There are also libraries with a Python interface that do not come from hardware vendors and can quickly find ray-mesh collisions, e.g. MeshLib.
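The ray described above (origin at the camera location, passing through a given pixel) can be built from the intrinsic and extrinsic parameters before handing it to any of these libraries. The sketch below assumes a pinhole camera without lens distortion and the world-to-camera convention x_cam = R x_world + t; the function name `pixel_to_ray` is made up for the illustration, and if your extrinsics are camera-to-world the inversion step has to be dropped.

```python
import numpy as np

def pixel_to_ray(K, R, t, u, v):
    """World-frame ray through pixel (u, v).

    Assumes a pinhole model with x_cam = R @ x_world + t (world-to-camera).
    Returns (origin, unit_direction).
    """
    K = np.asarray(K, dtype=float)
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float).reshape(3)

    origin = -R.T @ t                                   # camera centre in world frame
    d_cam = np.linalg.solve(K, np.array([u, v, 1.0]))   # K^-1 [u, v, 1]^T
    d_world = R.T @ d_cam                               # rotate into world frame
    return origin, d_world / np.linalg.norm(d_world)
```

The returned origin and direction are what a ray-mesh query expects as its ray origin and ray direction; the first hit along the ray is the visible 3D point.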
I am trying to build a structured-light 3D scanner.
Camera calibration
Camera calibration is a copy of the official OpenCV tutorial. As a result I have the camera intrinsic parameters (camera matrix).
Projector calibration
Projector calibration may not be correct, but the process was: the projector shows a chessboard pattern and the camera takes some photos from different angles. The images are undistorted with cv.undistort using the camera parameters, and the resulting images are used for calibration following the official OpenCV tutorial. As a result I have the projector intrinsic parameters (projector matrix).
Rotation and Translation
From cv.calibrateCamera I get rotation and translation vectors as results, but the number of vectors equals the number of images, and I think they are not the correct ones because I moved the camera and projector during calibration.
My new idea is to project the chessboard onto the scanning background and perform the calibration, and in this way I will have the rotation vector and the translation vector. I don't know whether that is the correct way.
Process of scanning is:
Generate patterns -> undistort patterns with projector matrix -> project patterns and take photos with camera -> undistort taken photos with camera matrix
Camera-projector pixels map
I use a GrayCode pattern with cv.graycode.getProjPixel and get a pixel mapping between camera and projector. My projector does not have a very high resolution and the last patterns are not very readable. I will create a custom function that generates the mapping without the last patterns.
I don't know how to get the depth map (Z) from all this information. My confusion is because there are 3 coordinate systems: camera, projector and world.
How can I find Z with code? Can I just get Z from the pixel mapping between image and pattern?
Information that I have:
p(x,y,1) = R*q(x,y,z) + T - where p is the image point, q is the real-world point (maybe), and R and T are the rotation vector and translation vector. How do I find R and T?
Z = B*f/(x-x') - where Z is the depth coordinate, B is the baseline (distance between camera and projector; I can measure it by hand, but maybe this is not the way), f is the focal length, and (x-x') is the distance between the camera pixel and the projector pixel. I don't know how to get the baseline. Maybe it is the translation vector?
I tried to pick 4 meaningful points, use them in cv.getPerspectiveTransform, and use the result in cv.reprojectImageTo3D. But cv.getPerspectiveTransform returns a 3x3 matrix, while cv.reprojectImageTo3D uses Q, a 4x4 perspective transformation matrix that can be obtained with stereoRectify.
Similar Questions:
How is point cloud data acquired from the structured light 3D scanning? - The answer is that you need to define a vector that goes from the camera perspective center to the pixel in the image and then rotate this vector by the camera rotation. But I don't know how to define/find this vector, and the rotation vector is needed.
How to compute the rotation and translation between 2 cameras? - The question is about R and T between two cameras, but almost everywhere it is written that a projector is an inverse camera. One good answer is: "The only thing you have to do is to make sure that the calibration chessboard is seen by both of the cameras." But I think that if I project a chessboard pattern it will be additionally distorted by the wall (projective transformation).
There are many other resources and I will update the list in a comment. I am missing something and I can't figure out how to implement it.
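Let's assume p(x, y) is the image point and the disparity is (x - x'). You can obtain the 3D point, and with it the depth, as:
import cv2
import numpy as np
disparity = x - x_  # x - x'
point_and_disparity = np.array([[[x, y, disparity]]], dtype=np.float32)
# q_matrix is the 4x4 Q matrix from cv2.stereoRectify; the result holds (X, Y, Z)
depth = cv2.perspectiveTransform(point_and_disparity, q_matrix)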
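To make the link to Z = B*f/(x-x') explicit, here is a NumPy-only sketch of what that perspective transform computes. It assumes a rectified horizontal pair and the Q layout documented for cv2.stereoRectify; the helper names `make_q` and `reproject_point` and the numbers in the usage note are illustrative assumptions, not values from the question.

```python
import numpy as np

def make_q(f, cx, cy, tx, cx_right=None):
    """Build a 4x4 disparity-to-depth matrix in the stereoRectify layout.

    f: focal length in pixels, (cx, cy): principal point of the first view,
    tx: x-translation between the views (negative baseline in the usual
    left-right setup), cx_right: principal point x of the second view.
    """
    if cx_right is None:
        cx_right = cx
    return np.array([
        [1.0, 0.0, 0.0, -cx],
        [0.0, 1.0, 0.0, -cy],
        [0.0, 0.0, 0.0, f],
        [0.0, 0.0, -1.0 / tx, (cx - cx_right) / tx],
    ])

def reproject_point(x, y, disparity, Q):
    """Apply Q to (x, y, disparity, 1) and dehomogenise to (X, Y, Z)."""
    X, Y, Z, W = Q @ np.array([x, y, disparity, 1.0])
    return np.array([X, Y, Z]) / W
```

With f = 800 px, a baseline of 0.1 (so tx = -0.1) and a disparity of 20 px, the Z component is 800 * 0.1 / 20 = 4.0, the same value the closed-form formula gives.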
I am working on an application using an IFM 3D camera to identify parts prior to a robot pickup. Currently I am able to find the centroid of these objects using contours from a depth image and from there calculate the center point of these objects in pixel space.
My next task is then to transform the 2D centroid coordinates to a 3D point in 'real' space. I am able to train the robot such that its coordinate frame is either at the center of the image or at the traditional (0,0) point of an image (top left).
The 3D camera I am using provides both an intrinsic and extrinsic matrix. I know I need to use some combination of these matrices to project my centroid into three space but the following questions remain:
My current understanding from googling is that the intrinsic matrix is used to fix lens distortion (barrel and pincushion warping, etc.) whereas the extrinsic matrix is used to project points into the real world. Is this simplified assumption correct?
How can a camera supply a single extrinsic matrix? I know traditionally these matrices are found using the checkerboard corners method but are these not dependent on the height of the camera?
Is the solution as simple as taking the 3x4 extrinsic matrix and multiplying it by a 3x1 matrix [x, y, 1]? If so, will the returned values be relative to the camera center or to the traditional (0,0) point of an image?
Thanks in advance for any insight! Also, if it helps, I am doing everything in Python and OpenCV.
No. I suggest you read the basics in Multiple View Geometry by Hartley and Zisserman, freely available on the web. Depending on the camera model, the intrinsics contain different parameters. For the pinhole camera model, these are the focal length and the principal point.
The only reason why you maybe could directly transform your 2D centroid to 3D is that you use a 3D camera. Read the manual of the camera, it should be explained how the relation between 2D and 3D coordinates is given for your specific model.
If you have only image data, you can only compute a 3D point from at least two views.
No, of course not. Please don't be lazy and start reading the basics of camera projection instead of asking others to explain the common basics that are written down everywhere on the web and in the literature.
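For readers who do have a depth value per pixel, as a 3D camera provides, the pinhole relation mentioned above can be sketched like this. It assumes an undistorted pinhole model, depth measured along the optical axis, and an extrinsic pair (R, t) with p_cam = R p_world + t; the camera manual decides which conventions actually apply, and the function names are made up for the illustration.

```python
import numpy as np

def deproject(u, v, depth, K):
    """Pixel (u, v) with depth Z -> 3D point in the camera frame (pinhole model)."""
    fx, fy = K[0, 0], K[1, 1]      # focal lengths in pixels
    cx, cy = K[0, 2], K[1, 2]      # principal point
    return np.array([(u - cx) * depth / fx,
                     (v - cy) * depth / fy,
                     depth])

def camera_to_world(p_cam, R, t):
    """Invert p_cam = R @ p_world + t (assumed extrinsic convention)."""
    return np.asarray(R, dtype=float).T @ (np.asarray(p_cam, dtype=float)
                                           - np.asarray(t, dtype=float))
```

Note that the intrinsics enter through the focal length and principal point, not through a multiplication of the extrinsic matrix with [x, y, 1]; without the depth value the pixel only defines a ray.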
I have a single camera that I can move around. I have the intrinsic parameter matrix and the extrinsic parameter matrix for each camera orientation. For object detection, I use YOLO and I get 2D bounding boxes in the image plane. My plan is to use a temporal pair of images, with the detected object in it, to triangulate the midpoint of the resulting 2D bounding box around the object.
Right now, I use two images that are 5 frames apart. That means, the first frame has the object in it and the second frame has the same object in it after a few milliseconds. I use cv2.triangulatePoints to get the corresponding 3D point for the 2D midpoint of the bounding box.
My main problem is that when the camera is more or less steady, the resulting distance value is accurate (within a few centimeters). However, when I move the camera around, the resulting distance value for the object starts varying quite a bit (the object is static and never moves, only the camera looking at it moves). I can't seem to understand why this is the case.
For cv2.triangulatePoints, I get the relative rotation matrix between the two temporal camera orientations (R = R2 R1^T) and then get the relative translation (t = t2 - R t1). P1 and P2 are the final projection matrices (P1 for the camera at an earlier position and P2 for the camera at a later position). P1 = K[I|0] and P2 = K[R|t], where K is the 3x3 intrinsic parameter matrix, I is a 3x3 identity matrix, and 0 is a 3x1 vector of zeros.
Should I use a temporal gap of 10 frames or is using this method to localize objects using a single camera never accurate?
The centers of the bounding boxes are not guaranteed to be the projections of a single scene (3D) point, even with a perfect track, unless additional constraints are added, for example that the tracked object is planar, or that the vertices of the bounding boxes track points that are on a plane. Things get more complicated when tracking errors are present.
If you really need to triangulate the box centers (do you? are you sure you can't achieve your goals using only well-matched projections?), you could use a small area around the center in one box as a pattern, and track it using a point tracker (e.g. one based on the Lucas-Kanade algorithm, or one based on normalized cross-correlation) in the second image, using the second box to constrain the tracker search window.
Then you may need to address the accuracy of your camera motion estimation - if errors are significant your triangulations will go nowhere. Bundle adjustment may need to become your friend.
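To check the pose bookkeeping independently of the tracking, a small NumPy sketch of the relative pose and a linear (DLT) triangulation can be run on synthetic data. It assumes world-to-camera extrinsics x_c = R_i X + t_i for both frames; `relative_pose` and `triangulate_dlt` are illustrative names, and the DLT is a stand-in for cv2.triangulatePoints rather than its actual implementation.

```python
import numpy as np

def relative_pose(R1, t1, R2, t2):
    """Pose of camera 2 relative to camera 1, for x_c = R_i @ X + t_i."""
    R = R2 @ R1.T
    t = t2 - R @ t1
    return R, t

def triangulate_dlt(P1, P2, x1, x2):
    """Linear triangulation of one point from two 3x4 projection matrices.

    x1, x2: pixel coordinates (u, v) of the same scene point in each view.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                      # null vector of A, homogeneous point
    return X[:3] / X[3]
```

With P1 = K[I|0] and P2 = K[R|t], the triangulated point is expressed in the frame of the first camera, so its norm is the distance from that camera. Feeding exact synthetic projections through these functions and then adding pixel or pose noise shows how quickly the result degrades when the baseline between the two frames is short.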
I have two questions relating to stereo calibration with opencv. I have many pairs of calibration images like these:
Across the set of calibration images the distance of the chessboard away from the camera varies, and it is also rotated in some shots.
From within this scene I would like to map pairs of image coordinates (x,y) and (x',y') onto object coordinates in a global frame: (X,Y,Z).
In order to calibrate the system I have detected pairs of image coordinates of all chessboard corners using cv2.findChessboardCorners(). From reading Hartley's Multiple View Geometry in Computer Vision I gather I should be able to calibrate this system up to a scale factor without actually specifying the object points of the chessboard corners. First question: Is this correct?
Investigating cv2's capabilities, the closest thing I've found is cv2.stereoCalibrate(objectpoints,imagepoints1,imagepoints2).
I have obtained imagepoints1 and imagepoints2 from cv2.findChessboardCorners. Apparently from the images shown I can approximately extract (X,Y,Z) relative to the frame on the calibration board (by design), which would allow me to apply cv2.stereoCalibrate(). However, I think this will introduce error, and it prevents me from using all of the rotated photos of the calibration board which I have. Second question: Can I calibrate without object points using opencv?
No. You must specify the object points. Note that they need not change across the image sequence, since you can interpret the change as due to camera motion relative to the target. Also, you can (should) assume that Z=0 for a planar target like yours. You may specify X,Y up to scale, and thus obtain after calibration translations up to scale.
Clarification: by "need not change across the image sequence" I mean that you can assume the target fixed in the world frame, and interpret the relative motion as due to the camera only. The world frame itself, absent a better prior, can be defined by the pose of the target in any one of the images (say, the first one). Obviously, I do not mean that the pose of the target relative to the camera does not change - in fact, it must change in order to obtain a calibration. If you do have a better prior, you should use it. For example, if the target moves on a turntable, you should solve directly for the parameters of the cylindrical motion, since there are fewer of them (one constant axis, one constant radius, plus one angle per image, rather than 6 parameters per image).
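The object points described above (fixed across all images, Z = 0, X and Y known up to scale) can be generated once and reused for every view. The sketch below assumes a board with `cols` x `rows` inner corners ordered row by row; the function name and the `square_size` parameter are illustrative, and leaving `square_size` at 1.0 gives translations in units of one square.

```python
import numpy as np

def chessboard_object_points(cols, rows, square_size=1.0):
    """(rows*cols, 3) float32 array of planar target points with Z = 0.

    Points are ordered row by row, with x changing fastest.
    """
    grid = np.zeros((rows * cols, 3), dtype=np.float32)
    grid[:, :2] = np.mgrid[0:cols, 0:rows].T.reshape(-1, 2)
    return grid * square_size

# one copy of the same object points per calibration image
object_points = [chessboard_object_points(9, 6) for _ in range(3)]
```

Because the same array is passed for every image, rotated views of the board are usable as well: the change in appearance is absorbed by the per-image pose.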
I have this image :
I don’t know exactly what kind of projection it is; judging by the shape, I guess equirectangular or Mercator. It's the texture for an attitude indicator.
I want to draw an orthographic projection, or maybe a General Perspective projection (whichever looks better), of it according to a direction vector defined by two angles (heading and pitch). This direction defines a point on the sphere, and this point should be the center of the projection.
I want it to look from the pilot point of view, so only half of the sphere should be drawn.
I use python, and I have not yet chosen a graphic library, I will probably be using pygame though.
I’ve found something related, but it uses OpenGL and I have no experience with it. I can try it if needed.
How should I do that? I can probably draw it manually by computing every pixel from the projection formulas, but I think there are library tools that do this efficiently (probably hardware accelerated?).
For an all-Python solution (using numpy/scipy array ops, which will be faster than any explicit per-pixel looping), this:
#!/usr/bin/env python
import math
import numpy as np
import scipy
import scipy.misc
import scipy.ndimage.interpolation
import subprocess
for frame in range(0, frames):
    # Image pixel co-ordinates
    # Compute z of sphere hit position, if pixel's ray hits
    # Some spin and tilt to make things interesting
    # Rotate the hit points
    # Compute map position of hit
    latitude = np.where(hit, (0.5 + np.arcsin(y) / np.pi) * src.shape[0], 0.0)
    # Resample, and zap non-hit pixels
    for channel in [0, 1, 2]:
    # Save to f0000.png, f0001.png, ...
# Use imagemagick to make an animated gif
subprocess.call('convert -delay 10 f????.png anim.gif', shell=True)
will get you
OpenGL is really the place to be doing this sort of pixel wrangling though, especially if it's for anything interactive.
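As a compact, NumPy-only illustration of the steps named in the comments above (pixel grid, sphere hit, rotation, map lookup), the following sketch renders one orthographic frame. It is not the original script: it assumes an equirectangular texture, uses nearest-neighbour sampling instead of interpolation, and the function name `render_sphere` and the sign conventions for `heading` and `pitch` are assumptions that may need flipping for a real attitude indicator.

```python
import numpy as np

def render_sphere(texture, size, heading=0.0, pitch=0.0):
    """Orthographic view of a unit sphere textured with an equirectangular image.

    texture: (h, w) or (h, w, channels) array; heading and pitch in radians.
    Pixels whose ray misses the sphere are left at zero.
    """
    h, w = texture.shape[:2]

    # pixel centres mapped to [-1, 1] in both directions
    p = (np.arange(size) + 0.5) * 2.0 / size - 1.0
    hx, hy = np.meshgrid(p, p)

    # z of the hit on the near hemisphere (viewer looks along +z)
    r2 = hx * hx + hy * hy
    hit = r2 <= 1.0
    hz = -np.sqrt(np.where(hit, 1.0 - r2, 0.0))

    # rotate the hit points: tilt about x, then spin about y
    ch, sh = np.cos(heading), np.sin(heading)
    cp, sp = np.cos(pitch), np.sin(pitch)
    spin = np.array([[ch, 0.0, sh], [0.0, 1.0, 0.0], [-sh, 0.0, ch]])
    tilt = np.array([[1.0, 0.0, 0.0], [0.0, cp, sp], [0.0, -sp, cp]])
    xyz = np.dstack([hx, hy, hz]) @ tilt.T @ spin.T
    x, y, z = xyz[..., 0], xyz[..., 1], xyz[..., 2]

    # latitude/longitude of each hit, scaled to texture row/column
    row = (0.5 + np.arcsin(np.clip(y, -1.0, 1.0)) / np.pi) * (h - 1)
    col = (1.0 + np.arctan2(z, x) / np.pi) * 0.5 * (w - 1)

    out = np.zeros((size, size) + texture.shape[2:], dtype=texture.dtype)
    out[hit] = texture[row[hit].round().astype(int),
                       col[hit].round().astype(int)]
    return out
```

Only the near hemisphere is drawn, which matches the pilot's point of view asked for in the question; the resulting array can be turned into a pygame surface or saved with any image library.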
I glanced at the code in the "Off-Center Map Projections" stuff you linked...
As a starting point for you, I'd say it was pretty good, especially if you want to achieve this with any sort of efficiency in PyGame as offloading any sort of per-pixel operations to OpenGL will be much faster than they'll ever be in Python.
Obviously, to get any further you'll need to understand the OpenGL. The projection is implemented in the GLSL code (the stuff in the string passed to mod_program.ShaderFragment); the atan and asin there shouldn't be a surprise if you've read up on equirectangular projections.
However, to get to what you want, you'll have to figure out how to render a sphere instead of the viewport-filling quad (rendered at glBegin(GL_QUADS);). Alternatively, stick with the screen-filling quad and do a ray-sphere intersection in the shader code too (which is effectively what the Python code in my other answer does).