3D Vision Made Easy

Published on June 21, 2017 by TIS Marketing.

3D Data Acquisition: Passive and Active Techniques

Whether it is the industrial smart robot in the age of IIoT using three-dimensional data to orient itself in its working space, the reverse vending machine counting empty bottles in a case, or the surface inspection system alerting personnel to the smallest material defect - three-dimensional information about the environment and the objects in it, acquired by modern 3D sensors, is central to many industrial applications of the future.

Currently, there is a variety of technologies on the market which can be used to collect three-dimensional information from a scene. A critical distinction among them, however, is between active and passive techniques: active techniques such as lidar (light detection and ranging) or time-of-flight (ToF) sensors use an active light source to obtain distance information; passive techniques rely solely upon camera-acquired image data - similar to depth perception in the human visual system.

Too little computing power, high prices and imprecise results put the brakes on early 3D systems in many applications. Thanks, however, to improvements in computer performance and high-resolution sensors, the technology is finding its way into more and more applications.

Each of these techniques has its advantages and disadvantages: while time-of-flight systems as a rule require less computational power and place few restrictions on scene structure, the maximum spatial resolution of current ToF systems (800 x 600 pixels) is relatively low, and their outdoor use is very limited due to infrared radiation from the sun. Newer sensors on the market have now enabled passive multi-view stereo vision systems to offer very high spatial resolution; they are, however, processor intensive and perform poorly when confronted with low-contrast or repetitive textures. Nevertheless, today's computational resources as well as optional pattern projectors make real-time operation of stereo systems at high spatial and depth resolutions possible. Precisely for this reason, passive multi-view stereo systems are among the most popular and flexible systems for the acquisition of 3D information.

Multi-view stereo systems consist of two or more cameras which simultaneously record data from a scene. If the cameras are calibrated and a real-world point in the scene can be located as a pixel in each camera's image, the point's three-dimensional position can be reconstructed from those pixels via triangulation. The achievable precision depends on the distance between the cameras (the baseline), the convergence angle between the cameras, the sensor's pixel size and the focal length. The essential tasks of calibration and correspondence matching alone place great demands on the underlying image processing algorithms.
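
To make the triangulation geometry concrete, consider an idealized parallel stereo rig: with the focal length f in pixels, the baseline B and the disparity d, the depth follows as Z = f·B/d, and a disparity error of Δd pixels causes a depth error of roughly Z²/(f·B)·Δd. A minimal numerical sketch (all values below are hypothetical):

```cpp
#include <cstdio>

// Idealized parallel stereo rig: depth from disparity (illustrative sketch).
// Z = f * B / d, with f in pixels, B in meters, d in pixels.
int main() {
    const double f = 1200.0;  // focal length in pixels (hypothetical)
    const double B = 0.10;    // baseline in meters (hypothetical)
    const double d = 48.0;    // measured disparity in pixels (hypothetical)

    const double Z = f * B / d;  // depth in meters
    // A matching error of dd pixels causes a depth error of roughly
    // Z^2 / (f * B) * dd, i.e. precision degrades quadratically with distance.
    const double dd = 0.25;      // assumed sub-pixel matching accuracy
    const double dZ = Z * Z / (f * B) * dd;

    std::printf("depth Z = %.3f m, depth uncertainty ~ %.4f m\n", Z, dZ);
    return 0;
}
```

This quadratic growth of the depth error is one reason why baseline and focal length must be matched to the intended working distance.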

Stereo Vision Systems in Real-time Use

Camera calibration determines the position and orientation of the individual cameras (the external parameters) as well as the focal length, principal point and distortion parameters (the internal parameters), the latter being significantly influenced by the selected lenses.

Camera calibration is usually performed using a two-dimensional calibration pattern, such as a checkerboard or a grid of dots, in which control points can be easily and reliably detected and whose dimensions, such as the distances between control points, are precisely known. Next, image sequences of the calibration pattern are recorded with varying pattern positions and orientations. Image processing algorithms then detect the control points of the calibration pattern in the individual images: edge and corner detection algorithms serve as the basis for a checkerboard pattern, for example, and blob detection algorithms for a dot pattern. This produces a multitude of 3D-2D correspondences between the calibration object and the individual images. Based on these correspondences, an optimization process subsequently delivers the camera parameters.

Example detection results for a calibration pattern in various positions and orientations. Via the detected control points of the calibration pattern, the camera's internal and external parameters can be determined.
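
As an illustration of the workflow just described, here is a minimal single-camera calibration sketch using the open-source OpenCV library (an assumption here; the article does not prescribe a toolkit, and the board size and file names below are hypothetical):

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Illustrative calibration sketch. Assumes a checkerboard with 9x6 inner
// corners and 25 mm squares, photographed in several poses.
int main() {
    const cv::Size boardSize(9, 6);
    const float squareSize = 0.025f;  // in meters

    // 3D control points of the pattern in its own coordinate system.
    std::vector<cv::Point3f> objectCorners;
    for (int r = 0; r < boardSize.height; ++r)
        for (int c = 0; c < boardSize.width; ++c)
            objectCorners.emplace_back(c * squareSize, r * squareSize, 0.f);

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    cv::Size imageSize;

    const char* files[] = {"view0.png", "view1.png", "view2.png"};  // hypothetical
    for (const char* f : files) {
        cv::Mat img = cv::imread(f, cv::IMREAD_GRAYSCALE);
        if (img.empty()) continue;
        imageSize = img.size();
        std::vector<cv::Point2f> corners;
        // Corner detection yields the 3D-2D correspondences described above.
        if (cv::findChessboardCorners(img, boardSize, corners)) {
            cv::cornerSubPix(img, corners, cv::Size(11, 11), cv::Size(-1, -1),
                             cv::TermCriteria(cv::TermCriteria::EPS +
                                              cv::TermCriteria::COUNT, 30, 0.001));
            imagePoints.push_back(corners);
            objectPoints.push_back(objectCorners);
        }
    }

    // Optimization over all correspondences yields the internal parameters
    // (K, dist) and the per-view external parameters (rvecs, tvecs).
    cv::Mat K, dist;
    std::vector<cv::Mat> rvecs, tvecs;
    double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                     K, dist, rvecs, tvecs);
    (void)rms;  // mean reprojection error in pixels
    return 0;
}
```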

While the calibration is run only once (assuming the camera parameters do not change during system operation), the significantly more processor-intensive task of finding correspondences between the views must be carried out for each image in order to deliver the scene's 3D information. In the case of a stereo system, correspondences between two views are identified. In preprocessing, the images are usually undistorted by means of the internal distortion parameters. For a pixel in the reference image, a search is then made for the corresponding point in the target image which represents the same 3D coordinate in the observed scene. Assuming Lambertian reflectance (i.e. a perfectly diffuse surface), local regions around corresponding points in the reference and target image should be very similar. A similarity measure is therefore computed between the reference region and each candidate target region; the well-established normalized cross-correlation is one such similarity measure.
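
As an illustration, here is a minimal sketch of a zero-mean variant of the normalized cross-correlation between two equally sized grayscale patches, written against OpenCV types (an assumption; the article does not prescribe an implementation):

```cpp
#include <opencv2/core.hpp>
#include <cmath>

// Zero-mean normalized cross-correlation between two equally sized 8-bit
// grayscale patches (illustrative sketch). Returns a value in [-1, 1];
// 1 means identical up to brightness and contrast.
double ncc(const cv::Mat& a, const cv::Mat& b) {
    CV_Assert(a.size() == b.size() && a.type() == CV_8UC1 && b.type() == CV_8UC1);
    const cv::Scalar ma = cv::mean(a), mb = cv::mean(b);
    double num = 0.0, da = 0.0, db = 0.0;
    for (int y = 0; y < a.rows; ++y)
        for (int x = 0; x < a.cols; ++x) {
            const double va = a.at<uchar>(y, x) - ma[0];
            const double vb = b.at<uchar>(y, x) - mb[0];
            num += va * vb;  // cross term
            da += va * va;   // reference-patch energy
            db += vb * vb;   // target-patch energy
        }
    const double denom = std::sqrt(da * db);
    return denom > 0.0 ? num / denom : 0.0;  // constant patches: undefined, return 0
}
```

Subtracting the patch means makes the measure robust against brightness differences between the two cameras, which is why this variant is popular in practice.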

Correspondence Points

Not every point of the target image needs to be examined: geometrically, the potentially corresponding points lie on a line in the views, a so-called epipolar line. Correspondences need only be searched for along these epipolar lines. To accelerate the search further, the undistorted input images are often rectified: the input images are transformed so that all corresponding epipolar lines share the same vertical image coordinate. Accordingly, for any given point in the reference image, one need only search along the line with the same vertical coordinate when looking for correspondences in the target image. While the algorithmic complexity of the search remains the same, the prior rectification allows for a more efficient implementation. Furthermore, if the minimum and maximum working distances of the scene are known, the search range along the epipolar lines can be restricted further, accelerating the search.

Above: original image pair from The Imaging Source's stereo vision system. Below: rectified image pair. For a point in the reference image (below, left), a corresponding point need only be searched for along the same image line in the target image (shown lower right as a red line for demonstration purposes).
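
A minimal rectification sketch using OpenCV (again an assumption, not the SDK discussed below): given the calibrated internal parameters and the relative pose between the cameras, the views are warped so that the epipolar lines become image rows:

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>

// Illustrative rectification sketch. K1/d1 and K2/d2 are the calibrated
// intrinsics and distortion coefficients; R and T describe the pose of
// camera 2 relative to camera 1.
void rectifyPair(const cv::Mat& K1, const cv::Mat& d1,
                 const cv::Mat& K2, const cv::Mat& d2,
                 const cv::Mat& R, const cv::Mat& T,
                 const cv::Size& size,
                 const cv::Mat& left, const cv::Mat& right,
                 cv::Mat& leftRect, cv::Mat& rightRect, cv::Mat& Q) {
    cv::Mat R1, R2, P1, P2;
    cv::stereoRectify(K1, d1, K2, d2, size, R, T, R1, R2, P1, P2, Q);

    cv::Mat map1x, map1y, map2x, map2y;
    cv::initUndistortRectifyMap(K1, d1, R1, P1, size, CV_32FC1, map1x, map1y);
    cv::initUndistortRectifyMap(K2, d2, R2, P2, size, CV_32FC1, map2x, map2y);

    // After remapping, corresponding epipolar lines are horizontal image rows.
    cv::remap(left, leftRect, map1x, map1y, cv::INTER_LINEAR);
    cv::remap(right, rightRect, map2x, map2y, cv::INTER_LINEAR);
}
```

The Q matrix returned here reprojects disparities into metric 3D coordinates, which becomes relevant for the point cloud below.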

Once all candidate regions along the epipolar line have been compared with the reference region, the candidate with the greatest similarity is, as a rule (in the case of local stereo algorithms), selected as the final correspondence. When the correspondence search is complete, and assuming a unique correspondence has been found, every pixel of the reference image (in a rectified stereo vision system) carries distance information in the form of the disparity - in other words, the offset in pixels along the epipolar line. The result is known as the disparity image or disparity map.
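
As an illustration of such a local stereo algorithm, here is a minimal block-matching sketch on a rectified pair, again using OpenCV (an assumption; the parameters and file names are hypothetical):

```cpp
#include <opencv2/calib3d.hpp>   // StereoBM
#include <opencv2/imgcodecs.hpp>

// Illustrative local block matching on a rectified pair. For each
// reference pixel, candidate windows along the same image row are
// compared and the best-matching offset becomes the disparity.
int main() {
    cv::Mat left  = cv::imread("left_rect.png",  cv::IMREAD_GRAYSCALE);   // hypothetical
    cv::Mat right = cv::imread("right_rect.png", cv::IMREAD_GRAYSCALE);   // hypothetical

    const int numDisparities = 128;  // search range, from the working distances
    const int blockSize = 15;        // matching window size
    auto bm = cv::StereoBM::create(numDisparities, blockSize);

    cv::Mat disp16;                  // fixed-point disparities, scaled by 16
    bm->compute(left, right, disp16);

    cv::Mat disparity;
    disp16.convertTo(disparity, CV_32F, 1.0 / 16.0);  // disparity in pixels
    return 0;
}
```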

With the help of the previously calibrated internal and external parameters, the disparity can in turn be converted into actual metric distance information. If the distance is calculated for every point where a disparity could be estimated, the result is a three-dimensional model in the form of what is known as a point cloud. In the case of low-contrast or repetitive patterns in a scene, local stereo techniques can deliver less reliable disparity estimates, since the target view will contain many points with a low uniqueness value. Global stereo techniques can help in such cases but are considerably more processor intensive, as they place additional demands on the final disparity map (e.g. in the form of a smoothness constraint that penalizes discontinuities). Often it is simpler to project an artificial structure onto the object to make the correspondences unambiguous (projected texture stereo). The projector need not be calibrated with reference to the cameras, since it serves only as a source of artificial texture.

Visualization of the disparity estimate and the final point cloud using an SDK from The Imaging Source. Left: disparity map relative to the reference image. Middle: 3D view of the textured point cloud. Right: color-coded point cloud showing distance from the camera.
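
As an illustration of the last two points, here is a sketch that combines a smoothness-constrained (semi-global) matcher with the disparity-to-point-cloud conversion, using OpenCV as above (an assumption; the Q matrix comes from the rectification step, and the parameters are hypothetical):

```cpp
#include <opencv2/calib3d.hpp>

// Illustrative sketch: semi-global matching penalizes disparity
// discontinuities (the smoothness constraint mentioned above), and the
// Q matrix from cv::stereoRectify converts the disparity map into a
// metric point cloud.
cv::Mat disparityToPointCloud(const cv::Mat& left, const cv::Mat& right,
                              const cv::Mat& Q) {
    auto sgbm = cv::StereoSGBM::create(0 /*minDisparity*/,
                                       128 /*numDisparities*/,
                                       5 /*blockSize*/);
    sgbm->setP1(8 * 5 * 5);   // penalty for small disparity changes
    sgbm->setP2(32 * 5 * 5);  // penalty for large discontinuities

    cv::Mat disp16, disparity;
    sgbm->compute(left, right, disp16);
    disp16.convertTo(disparity, CV_32F, 1.0 / 16.0);  // disparity in pixels

    cv::Mat points3d;  // CV_32FC3: one (X, Y, Z) triple per pixel
    cv::reprojectImageTo3D(disparity, points3d, Q,
                           /*handleMissingValues=*/true);
    return points3d;   // pixels without a valid disparity get a large Z
}
```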

Acceleration via GPUs

When high frame rates and high spatial resolutions are needed, modern GPUs can significantly accelerate the calculation of 3D information. For the final integration of a stereo vision system into an existing environment, The Imaging Source relies on modular solutions: the acquisition of 3D data can be achieved using either The Imaging Source's own C++ SDK with optional GPU acceleration in connection with cameras from The Imaging Source or MVTec's HALCON programming environment. While the SDK allows for the easy calibration of stereo vision systems as well as the acquisition and visualization of the 3D data, HALCON offers additional modalities such as hand-eye calibration for the integration of robotic systems and additional algorithms such as the registration of CAD models in relation to acquired 3D data.
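
As an illustration of such GPU acceleration (an assumption, unrelated to the SDKs named above), OpenCV offers CUDA-based block matching when the library has been built with CUDA support:

```cpp
#include <opencv2/cudastereo.hpp>  // requires an OpenCV build with CUDA
#include <opencv2/imgcodecs.hpp>

// Illustrative GPU block-matching sketch: the matching cost is
// evaluated for many pixels in parallel on the graphics card.
int main() {
    cv::Mat left  = cv::imread("left_rect.png",  cv::IMREAD_GRAYSCALE);   // hypothetical
    cv::Mat right = cv::imread("right_rect.png", cv::IMREAD_GRAYSCALE);   // hypothetical

    cv::cuda::GpuMat dLeft(left), dRight(right), dDisp;  // upload to the GPU
    auto bm = cv::cuda::createStereoBM(128 /*numDisparities*/, 19 /*blockSize*/);
    bm->compute(dLeft, dRight, dDisp);

    cv::Mat disparity;
    dDisp.download(disparity);  // copy the result back to host memory
    return 0;
}
```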

The above article, written by Dr. Oliver Fleischmann (Project Manager at The Imaging Source), was published in the May edition of the German-language industry journal Computer&AUTOMATION under the title, "3D-Sehen leicht gemacht." Please click these links to find additional information about the IC 3D Stereo Camera System and about the IC 3D SDK.