Organization: Carnegie Mellon University
Principal Investigator: Hans Moravec
Date: February 24, 2002
Since 1999 we've been working full-time to develop laboratory prototype sensor-based software for utility mobile robots for industrial transport, floor maintenance, security etc., that matches the months-between-error reliability of existing industrial robots without requiring their expensive worksite preparation or site-specific programming. Our machines will navigate employing a dense 3D awareness of their surroundings, be tolerant of route surprises, and be easily placed by ordinary workers in entirely new routes or work areas. The long-elusive combination of easy installation and reliability should greatly expand cost-effective niches for mobile robots, and make possible a growing market that can itself sustain further development.
Our system is built around 3D grids of spatial occupancy evidence, a technique we have been developing since 1984, following a prior decade of robot navigation work using a different method. 2D versions of the grid approach found favor in many successful research mobile robots, but seem short of commercial reliability. 3D grids, with at least 1,000 times as much world data, were computationally infeasible until 1992, when we combined increased computer power with a 100x speedup from representational, organizational and coding innovations. In 1996 we wrote a preliminary stereoscopic front end for our fast 3D grid code, and the gratifying results convinced us of the feasibility of the approach, given at least 1,000 MIPS of computer power. This contract enables us to extend that start towards a universally convincing demonstration, just as the requisite computing power arrives.
In earlier reports we described a greatly improved camera calibration program and a sensor model learning approach, guided by color consistency, that produced near-photorealistic 3D maps. In July we reported on a new, meticulously prepared set of 606 images from a hallway traverse by a surrogate robot, designed to overcome the limitations of two older data sets, and mentioned that we were reconstructing the system in modular C++, both to incorporate improved techniques and to rescue us from a plain C program that was growing into an impenetrable tangle.
We are happy to report that, as of February 2002, a first draft of the new program is complete, and the first results with the new data gratifyingly exceed the best realism we had achieved previously. Many internal images of our new grids can truly be mistaken for photographs of a real location, and are clearly superior for navigation planning. Defects remain at the outer edges of the grid, especially at the upper extreme of the robot's fields of view. Contemplating these has produced a number of new ideas, two of which are particularly exciting: both are readily implementable and very likely to bring substantial further improvements, especially in the most difficult grid regions. We nickname them "Two Thresholds" and "Prior Probe". They are outlined in the technical results section below, and we look forward to trying them almost immediately.
Besides pursuing new research opportunities as they present themselves, we are following a long-term strategy towards practical free-ranging robot applications. The new position-calibrated data allows us temporarily to avoid the problem of registering uncertain viewpoints. The next phase of the project, however, will try to derive equally good maps from image sequences collected by imprecisely traveling robots. Scott Crosby, who joined the project last year, has developed a program that can geometrically register a partial map built from a single robot location with a more complete working map. Our most recent maps have cell dimensions 512x512x128, thus contain 32 million cells. Scott's program, in less than a second, selectively samples about 30 thousand of those to participate in the registration process, and, in versions of the new map data randomly displaced by a few centimeters and degrees, manages to recover the correct position and orientation to within a few millimeters displacement for the worst case cells at several meters range. Further improvements are expected, but the present performance would already be sufficient to work with robot data.
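The flavor of Scott's sampling-based registration can be conveyed by a deliberately simplified 2D Python sketch (the actual program works on 3D grids in C++, samples about 30 thousand of 32 million cells, and its search strategy is surely more refined; the function name and search structure here are ours):

```python
import numpy as np

def register_2d(ref, sample_pts, shifts, angles):
    """Hypothetical 2D sketch of registration by sampling: transform the
    sampled occupied cells by each candidate rotation and shift, and keep
    the transform whose cells land on the most occupancy evidence in the
    reference grid."""
    best_xf, best_score = None, -np.inf
    h, w = ref.shape
    for th in angles:
        c, s = np.cos(th), np.sin(th)
        rot = sample_pts @ np.array([[c, -s], [s, c]]).T
        for dx, dy in shifts:
            pts = np.rint(rot + (dx, dy)).astype(int)
            ok = ((pts[:, 0] >= 0) & (pts[:, 0] < h) &
                  (pts[:, 1] >= 0) & (pts[:, 1] < w))
            score = ref[pts[ok, 0], pts[ok, 1]].sum()
            if score > best_score:
                best_xf, best_score = (dx, dy, th), score
    return best_xf
```

Given a sampled map displaced by a known offset, the best-scoring transform recovers that offset, mirroring the recovery of randomly displaced map data described above.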
When we are satisfied with mapping of uncalibrated results, we will attempt autonomous runs, with new code that chooses paths as it incrementally constructs maps. When the autonomous runs go satisfactorily, we will add code to orchestrate full demonstration applications like patrol, delivery and cleaning. The work then should be ready for commercialization, initially in industrial applications, subsequently for a larger home market.
In 1999 we hoped to squeeze this work into three years of development. In 2.5 years we've accomplished about half the agenda, exceeding our quality goals but slipping our schedule because the technical problems and opportunities turned out to be richer than outlined. We are more confident than ever that we can achieve the overall goals, at the cost of sustaining the current effort for an additional two or three years. I'd like to emphasize that the extra time reflects solely the richness of the problem. The first phase of the project saw construction of a comfortable software environment that allows us to focus undistractedly on developing, experimenting with and testing new mapping and navigation ideas, an effort we pursue efficiently and single-mindedly.
We built a sensor head with three cameras and a textured light source, to enable stereoscopic ranging of edges in any orientation and also blank surfaces.
We collected a position-calibrated data set of both textured and untextured image triplets, 100 views in four compass directions along a 10 meter journey. The textured images are used for stereoscopic matching, the regularly lighted ones for coloring/learning.
The data set also includes two overhead views covering much of the traverse. The overheads are not used in the basic map-building program (since a wandering robot would not have such available), but function in training as a kind of ground truth by adding color variance during the learning process, penalizing sensor models that make poor choices even in parts of the map hidden from the ground-level viewpoints.
As mentioned in earlier reports, our learning works by projecting colors from the run's real camera images onto corresponding views of a grid's occupied cells. In a perfect grid each occupied cell would correspond to an actual object in the real scene with a definite color, and the cell would receive that same color from multiple images, despite their greatly different points of view. The internal color variance of all the occupied cells would thus be low. If the grid, by error, contained an extra cell not corresponding to a real feature, that cell would be painted in the color of its background in the real scene. That background would differ between points of view, so the extra cell develops a high color variance. A missing cell, on the other hand, causes the color of its corresponding real object to be sprayed onto various background cells, disturbing their color uniformity. Thus either kind of imperfection raises the average color variance. This measure proved extremely effective as the learning criterion for adjusting a dozen parameters shaping the evidence rays that build grids from our previous data sets. Such tuning was especially important with the previous data because the adjustments mitigated high error rates and other limitations of simple two-camera stereoscopic ranging. Trinocular color images, textured lighting, improved algorithms and greater processing power give the new system much better stereoscopy, achieving about 90% correct ranges on the 10,000 image patches selected per view. Yet learning still garners significant improvements over reasonable hand adjustment of the sensor model.
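The color-variance criterion itself can be sketched in a few lines. This is illustrative Python, not the project's C++; the per-cell lists of projected colors stand in for the projection machinery:

```python
import numpy as np

def map_color_score(cell_colors):
    """Illustrative sketch of the learning criterion described above:
    each occupied cell collects the RGB colors projected onto it from
    several camera views; a good grid has low average color variance."""
    variances = []
    for colors in cell_colors:            # one list of (r, g, b) per occupied cell
        c = np.asarray(colors, dtype=float)
        variances.append(c.var(axis=0).sum())  # total variance over channels
    return float(np.mean(variances))
```

A cell that receives the same color from every view contributes zero; a spurious cell painted with differing backgrounds contributes heavily, so lower scores indicate better grids.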
To best illustrate the grids, we've made simulated runthroughs that parallel the camera image summary runthrough linked above. As there, the image represents a 180 degree horizontal cylindrical projection. The vertical field of view is a bit over 70 degrees. Of course, with 3D synthetic data there is no need for gaps in the panorama. The first movie shows the view from knee-level height, corresponding to the height of the physical cameras, which provides about the best-looking vantage point. The learning process worked very hard to make the grid look right from just this perspective.
The worst view, from about eye-height, traverses the upper boundary of the grid, seen only at several meters at shallow angle by the uppermost camera fields of view. It is further contaminated by systematic stereoscopic errors across an occlusion, where a far wall is seen behind a near outer corner. The stereo process often returns a "compromise" distance, placed squarely in the open hallway between the wall and the corner. This problem inspired the stereoscopic "Prior Probe" idea outlined below.
Perhaps the most relevant vantage point is a mouse's, just above the floor. That level holds the information most relevant for navigation. We are happy the system already does well there.
The suitability of the grid for navigation is probably best shown in plan view. The following image was created by a program that mapped each vertical column of cells in the grid to an image pixel. The color of the pixel is the color of the topmost cell in the largest cluster of occupied cells in the column. The plants in the scene are rendered dark because the low cameras saw mostly the dark shadowed underside of the topmost leaves.
The greatly improved quality of the new grid compared to our previous results is largely due to greatly improved stereoscopy. Our system ranges small localized image features, an approach compatible with the complex varied scenes typical of navigated spaces, but now picks up to 10,000 patches per stereoscopic glimpse. These points are chosen by a simple "interest operator" that picks regions of maximal color variance, thus distinctiveness, in small neighborhoods. The use of three equilaterally placed cameras for matching means variation in any image direction suffices for matching. In the past our operators demanded variance in certain directions - no more. The search points are chosen symmetrically from the three cameras: each directional ray in the field of view is assigned to the camera that maximizes the visibility of that ray from the other two cameras. This almost eliminates cases where a sought feature in the starting camera image is beyond the image boundary of another camera. With three equilaterally placed cameras, the interest regions divide the image space into three 120 degree segments, which the program determines automatically.
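A minimal Python sketch of such an interest operator, scoring small neighborhoods by total color variance (fixed non-overlapping windows here are our simplification; the project's neighborhood scheme is not specified):

```python
import numpy as np

def interest_points(img, patch=5, n=4):
    """Hypothetical sketch of the 'interest operator': rank small
    neighborhoods of a color image by total color variance, thus
    distinctiveness, and keep the n highest-scoring ones."""
    h, w, _ = img.shape
    scores = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            win = img[y:y + patch, x:x + patch].reshape(-1, 3).astype(float)
            scores.append((win.var(axis=0).sum(), y, x))
    scores.sort(reverse=True)                 # most distinctive first
    return [(y, x) for _, y, x in scores[:n]]
```

Because the variance is taken over all three color channels jointly, texture in any image direction (and any color channel) raises a patch's score, in the spirit of the direction-independence noted above.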
The new stereoscopic matching process has several innovations. We use three cameras on an equilateral mount and match in full color, searching successive range hypotheses simultaneously from each feature's "interest" camera to the other two cameras. Most of the search proceeds in steps limited by image pixel resolution, but for very short ranges, thus high disparity, the step size becomes instead determined by grid voxel size. The match windows are round (not square), in recent runs having an area of 65 pixels. (We experimented with adding a match between the two "non-interest" cameras' trial patches into the comparison, but abandoned the idea when we found it sometimes introduces errors when unrelated patches of the scene happen to match.)
When we first activated the new code we were greatly alarmed by fewer than 50% correct ranges. Hardly any of the features picked on the "gray noise" carpet or on the low-contrast wood-grain doors were ranged correctly. The carpet problem was tracked to near misses of the correct feature in the search process. We had taken great care to precisely calibrate the optical and mechanical geometry of the cameras, but that alone turned out to be insufficient. Calibration errors, mechanical shifts in the lens or camera mounts, or asymmetric focus blurring left a residual pixel, sometimes two, of displacement from the ideal. Such errors turned out to be much more significant in a three-camera system. A near miss in a two-camera setup tends to produce a correlation dip that can still register a correct range. With three cameras the slightly misplaced dips from two searches must coincide, which they do only infrequently. We solved this problem by adding an additional search to a radius of two pixels from the nominal range-hypothesis track. Correct carpet matches soared to above 90%, at a slight cost in range precision and a 5x cost in computing speed. (On a 1000 MHz G4, our pure C++ program processes 10,000-feature images into a 3D map in 11 seconds, 10 of them now spent in correlation. Fortunately stereoscopic correlation is a good candidate for speedup by a signal processor such as the AltiVec unit in the G4.) The door errors also derived from calibration limitations, but this time in image color. We had made an effort to color-correct the camera images, and planned to compare patches by simple sum of squares of differences, reasoning that it best preserved information: for instance dark and light patches would remain distinct. But the color adjustments left residual differences, and those differences were greater than the subtle door grain variations. A term in the comparison to subtract out the difference between the patch means solved the problem.
By clipping that term to the equivalent of about 20 gray level steps (out of 255 for 8-bit pixels) the program preserves the ability to tell dark from light.
The correlation is now so good that a simple superposition of the range rays themselves results in a plausibly usable navigational map. This was never the case with the old program.
We derive rays of evidence to be added to the grid map from stereoscopic match curves by first converting the match curve into a probability curve, using a combination of empirical data and a simple image noise model with a parameter set by the sensor model learning process. The components entering into the conversion are a histogram H of the actual match values encountered in a particular feature search, all but one of which (the correct match) represent values that this feature might elicit by chance, and the match value we expect from the correct position, derived by assuming the RMS value would be a perfect zero but for noise in the image. We model the noise by a Gaussian G with a variance parameter set by learning. The correlation diagrams above show the results when this parameter, in gray-level steps, is set to 0.25 (in red), 1 (in green) or 4 (in blue). In fact, our learning process, given the chance to vary this parameter in 0.1 increments, insists firmly that the correct value is 0.5. The old program used a simpler but similar model. In it the error was represented by an exponential distribution, and matches were done in monochrome, but there too the learning brought the noise parameter to around 0.5 gray levels. Hypothesize, as a step in a Bayesian inference, that the correct match occurs at a displacement k. The probability of observing the match curve we do is then the product of the probabilities of each point in the curve having the value it does. Points other than k can be said to have probabilities corresponding to their position in the match value histogram H. Point k's probability is given by the noise model G. The following slide outlines the steps in converting match value to probability.
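Since the product of H terms over the whole curve is a constant once the curve is observed, the relative probability of hypothesis k reduces to the prior times G(m_k)/H(m_k). An illustrative Python sketch of this conversion, with a crude 10-bin histogram standing in for the empirical H (the function name and binning are ours):

```python
import numpy as np

def match_to_probability(match_vals, sigma=0.5, prior=None):
    """Sketch of the match-curve-to-probability conversion described above.
    G models the match value expected at the true range as zero-mean image
    noise with standard deviation sigma (the learned ~0.5 gray levels);
    H, the histogram of observed match values, models what chance alone
    produces.  Posterior for each k ~ prior * G(m_k) / H(m_k)."""
    m = np.asarray(match_vals, dtype=float)
    G = np.exp(-0.5 * (m / sigma) ** 2)          # noise model at each point
    hist, edges = np.histogram(m, bins=10)       # empirical chance model
    idx = np.clip(np.searchsorted(edges, m) - 1, 0, len(hist) - 1)
    H = hist[idx] / len(m)
    if prior is None:
        prior = np.ones_like(m) / len(m)         # innocuous flat prior
    post = prior * G / np.maximum(H, 1e-9)
    return post / post.sum()                     # normalized probability curve
```

A curve with one near-zero match value among many poor ones converts, as expected, into a probability curve sharply peaked at the near-zero point.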
The curve is then convolved with a (learning parameter modified) stereoscopic geometric range uncertainty model to make an evidence ray backbone, which is swept (also in a parameter-modified way) to make a 2D cross-section, itself swept again in 3D to make a cone of evidence to be added to a grid map.
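A toy version of the backbone construction, assuming the range uncertainty model can be represented as a small discrete kernel (the learned, parameter-modified model is surely richer):

```python
import numpy as np

def evidence_backbone(range_probs, uncertainty_kernel):
    """Sketch of the convolution step above: smear the range-probability
    curve by a normalized range-uncertainty kernel to form the
    evidence-ray backbone."""
    k = np.asarray(uncertainty_kernel, dtype=float)
    return np.convolve(np.asarray(range_probs, dtype=float),
                       k / k.sum(), mode="same")
```

A probability spike at one range spreads into the kernel's shape while total probability is preserved, before the subsequent sweeps into a 2D cross-section and 3D cone.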
After applying Bayes' rule, the match-to-probability formula acquires an additional input, P(D=k), the prior probability that various positions along the stereo ray are occupied. In the first draft of the program described here, we made that prior an innocuous curve that slightly favors long distances over short ones. The occlusion error problem suggested a far more potent use for the term. This is the "Prior Probe" idea mentioned earlier. We can provide an informed distance prior by measuring the occupancy probability already accumulated in the grid along the stereo ray's line of sight, probing it before we construct new evidence to be added to it. There is a natural parameter relating the weight of evidence we accumulate in the grid to probability. It manifests itself as a scaling of the evidence values, or an exponent multiplier in a formula giving probability. If the parameter is small, probabilities cluster around 0.5; if large, they diverge to near 1 and 0. We will add this parameter to the learning suite. Evidence accumulates consistently at real scene objects as they are repeatedly stereoscopically ranged. There is less accumulation in regions of occlusion error, other things being equal, as the spurious matches distribute randomly over a large area. The weaker accumulation in error regions will produce a weak prior, discouraging further accumulation there relative to real objects along the line of sight. In the hallway occlusion example, we expect the hallway to become increasingly less likely to receive positive occupancy evidence as rangings accumulate. Range values will be pulled either to the foreground or the background. Of course, if the parameter were too big it might accrete evidence around randomly generated seeds. Too small, and it will have no effect. We expect the learning to tune it well, since both random blobs and trails of error as in the present grid spoil the color variance.
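The evidence-to-probability relation can be sketched directly; reading the accumulated evidence as scaled log-odds, so that probability is the logistic function of the scaled evidence, is our interpretation of the scaling parameter described above:

```python
import math

def occupancy_probability(evidence, scale):
    """Sketch of the evidence-to-probability mapping: accumulated
    evidence e is read as scaled log-odds, scale * e = log(p/(1-p)),
    so p is the logistic function of the scaled evidence.  'scale' is
    the learnable parameter: small values keep probabilities near 0.5,
    large ones push them toward 0 and 1."""
    return 1.0 / (1.0 + math.exp(-scale * evidence))
```

Zero evidence always maps to 0.5; the same positive evidence maps near 0.5 under a small scale and near 1 under a large one, which is the behavior the learning suite would tune.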
Learning, as presently implemented, works quite well for making images for presentation, and happens also to have made maps suitable for navigation, but the latter outcome is not enforced. Evidence is accumulated in the present implementation as floating point numbers imagined to represent occupancy probability coded as log(p/(1-p)). The grid cells are initialized to zero, and increased and decreased in value by learning-sculpted stereo evidence rays. For many purposes they can be left as continuous values. For learning, however, it is necessary to choose a threshold above which cells are considered opaque, to be colored, and below which they are transparent. For imaging we prefer a positive threshold, so that untouched cells are deemed transparent. We can then place our overhead cameras in untouched space and still expect them to see the scene. For navigation, we would probably treat untouched cells as occupied, and insist that paths be planned only through definitely empty regions. Learning presently is oblivious to the latter imperative, and could conceivably generate a good-looking, high-scoring grid with traversable areas left unknown rather than empty, making it useless for navigation. "Two Thresholds" addresses this problem, and more subtle ones, including, indirectly, occlusion error noise. Learning scores will be the average of two visualizations, one using an upper and the other a lower threshold, probably 0.5 and -0.5. Cells below the lower threshold are deemed empty, those above the upper value are assumed occupied, and those between are unknown. The upper threshold produces a lean grid, the lower a fat one, but both should be within reason. Areas that have been seen by the robot should be either occupied or empty, except at boundaries between the two. A grid where many seen cells are left unknown will score poorly. We think this might, by itself, tend to clear out some areas of marginal occupancy, like the occlusion noise.
One twist: overhead views cannot be used to color the lower threshold grid, as they may be uselessly encased in opaque unknown cells. But they will contribute to the upper threshold score, as now.
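The two-threshold reading of cell evidence amounts to a three-way classification; as a sketch (function and label names are ours):

```python
def classify_cells(evidence, lo=-0.5, hi=0.5):
    """Sketch of the 'Two Thresholds' reading of the grid: cells below
    the lower threshold are deemed empty, those above the upper
    occupied, and those in between unknown.  Seen cells left unknown
    would lower the learning score."""
    return ["empty" if e < lo else "occupied" if e > hi else "unknown"
            for e in evidence]
```

The upper threshold alone yields the lean grid, the lower alone the fat one; scoring both penalizes grids that leave seen regions stuck in the unknown band.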
Hardware differences in our three cameras and perhaps limitations in the beta camera control and capture software resulted in substantial differences in color response. We corrected most of these using selected images from the test run itself. In some views the cameras saw mostly blank walls orthogonal to the line of sight, in various kinds of lighting. We combined several of those images, registered so corresponding pixels in rectified images from the three cameras saw the same wall locations, and excised any picture areas with sharp boundaries. We were then able to plot comparative values for each primary color relating one camera to another. Using the left camera as reference, we found we were able to substantially correct the color in the other two with a one-parameter least-squares gamma curve. If the intensity range is 0:1 the curve has the form c' = 1-(1-c)^gamma. The plots, superimposed in their own color, and (faintly) the gamma curves for each camera are shown in the following diagram, along with a sample image before (above) and after (below) color correction.
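The one-parameter fit can be sketched as a least-squares search over gamma; the coarse grid search below is our simplification of whatever solver was actually used:

```python
import numpy as np

def fit_gamma(ref, cam, gammas=np.linspace(0.5, 2.0, 151)):
    """Sketch of the one-parameter color correction above: with
    intensities scaled to 0..1, find the gamma minimizing the squared
    difference between the reference camera's values and
    c' = 1 - (1 - c)^gamma applied to the other camera's values."""
    ref = np.asarray(ref, dtype=float)
    cam = np.asarray(cam, dtype=float)
    errs = [np.sum((ref - (1.0 - (1.0 - cam) ** g)) ** 2) for g in gammas]
    return float(gammas[int(np.argmin(errs))])
```

Note the curve fixes both endpoints (0 maps to 0, 1 to 1), so the single gamma parameter only reshapes the response in between, per primary color channel.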
We've described our calibration approach in earlier reports. We precisely point our cameras at a test pattern of many spots, and the program finds an optical center and a radial distortion function. By doing this at several distances we can also precisely solve for the focal plane of the lens. The calibration results are used to rectify run camera images to a specified ideal perspective geometry. The process is quite good. The following image shows (clockwise) our latest calibration setup, a raw image of the test pattern, a rectified version of the raw image, and a synthetic spot image showing exactly what the rectified image should look like.
As stereoscopic correlation so rudely informed us, the rectification is not perfect. Below is a display showing the difference between the rectified real image and the synthetic ideal, rendered in red and green. Asymmetric red and green color fringes show slight misplacement, especially in some corners, of up to two pixels.
We spent considerable time and effort attempting to squeeze the color component of our images into 8 rather than the original 16 bits (the other 8 bits are intensity), to save space and to accelerate matching. We ultimately abandoned the idea as too information-lossy. Below is a diagram of a symmetric hexagonal tiling of the color space into 217 cells. At right is a test image broken into orthogonal intensity and color components. The top image is intensity, the middle shows full pure color information, the bottom image color quantized by the hexagonal encoding. Reassembled with their intensity component, quantized color images look reasonable, but we decided the resolution loss risked compromising stereoscopic match quality.
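The 217-cell figure is consistent with a hexagonal tiling extending 8 rings out from a central cell, per the centered hexagonal numbers; a one-line check (our arithmetic, not the project's code):

```python
def hex_cell_count(radius):
    """Number of cells in a hexagonal tiling out to a given hex radius:
    the centered hexagonal number 3*r*(r+1) + 1.  Radius 8 gives the
    217 cells of the color-space tiling described above."""
    return 3 * radius * (radius + 1) + 1
```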
We, of course, continue to follow unexpected avenues the research presents, including many not mentioned. (Ok, one more: the two-pixel radius fine search now done by the correlator returns displacement errors keyed to individual pixels. If we average these over many images during a run, noise should be mitigated, and we could use the average displacements to construct a "rubber sheet" fine correction for the rectifications, that could be kept updated as the run proceeded. In this way we might reduce the need for fine search, perhaps to a radius of one pixel, and increase the ranging accuracy. The same approach could be used to field-adjust color correction).
Meanwhile we are making preparations to gather images from uncalibrated robot runs for the next phase of development. Research programmer Scott Crosby has already tested preliminary versions of the map registration code we will need to process this data, undergraduate Chris Schroeder is developing driver code for several test robots borrowed from other projects, and engineer Jesse Easudes is designing a mount to attach camera assemblies, laptop computer and laser light mottler to the test robots. Since registering fix-mounted overhead cameras with uncalibrated robot paths would be quite burdensome, we have decided instead to place a fourth "training" camera (not unlike bicycle training wheels) on the robot itself, mounted as high as possible above the stereoscopic cameras, looking down at some angle chosen to avoid views of the robot. Scott's registration program will tell us where the stereo cameras are; from that we can easily calculate where the rigidly attached training camera is at each stop. Advantageously, this approach will provide a much greater amount of training imagery.