Simultaneous Estimation of an Object's Class, Pose, and 3D Reconstruction from a Single Image

Project Summary

In this project we show how to estimate the class, 3D pose and 3D reconstruction of an object from a *single* 2D image. In contrast with other approacheds that address these task separately, we show that there is an advantage in addressing these tasks simultaneosly. In addition, we show that considering the objects in 3D is also beneficial.

Main Ideas

This project is based on the following three main ideas:

**1- Thinking in 3D:** Humans find remarkably easy to perceive three dimensional (3D) objects, even when presented with a single two dimensional image (e.g., when closing an eye or looking at a picture). This ability is essential to interact with the environment and to “understand” the observed images, which arguably can only be achieved when the underlying 3D structure of the scene is understood (see Fig. 1).

One ingredient contributing to the complexity of this task (of perceiving 3D objects in 2D images) is that it is inherently ambiguous, when no other constrains are imposed. Humans presumably solve it by relying on prior knowledge about the objects involved and the laws of nature that govern the physical interactions between objects and the creation of images. In this project we incorporate prior knowledge about different object classes by means of*3D Shape Priors*. A 3D shape prior for a particular class encodes the probability that an object of the class, after being registered to a canonical pose, covers each point in 3D space (see Fig. 2).

Figure 1: Examples of 2D images where the use of 3D prior knowledge is essential to understand the scene.

One ingredient contributing to the complexity of this task (of perceiving 3D objects in 2D images) is that it is inherently ambiguous, when no other constrains are imposed. Humans presumably solve it by relying on prior knowledge about the objects involved and the laws of nature that govern the physical interactions between objects and the creation of images. In this project we incorporate prior knowledge about different object classes by means of

Figure 2: 2D cuts of different 3D shape priors along vertical planes (a), for the classes cups and mugs (b). Two different subclasses of each class are shown. The color of each point indicates the probability that the point would be inside an object of the class (see color bar to interpret the colors).

Figure 3: Appearance (left) can be deceiving, since it can be changed (purposely or not) to hide the identity or class of the object. Shape (right), in contrast, is more difficult to camouflage. And in some cases it provides enough information not just to distinguish the class of the objects in the scene, but also the activities they are performing.

Summary of the proposed approach

To implement these ideas we proposed a family of inference algorithms that we refer to as *self-conscious* (SC) algorithms. SC algorithms are complementary to *Branch and Bound* algorithms and are based on a *hypothesize-and-verify paradigm*: multiple hypotheses are created, their log-probabilities are evaluated, and hypotheses are discarded as soon as they are proved worse (*i.e.*, their likelihood is lower) than some other hypothesis. In this project each hypothesis corresponds to a pair (Object Class, Object Pose) and thus finding the best hypothesis is equivalent to solving the classification and pose estimation problems (the 3D reconstruction is obtained as a by-product during the computation of the log-probability).

Since the number of hypotheses to consider could be potentially very large, and since the evaluation of each hypothesis' log-probability could be computationally very costly, it is essential to perform these evaluations very efficiently. To address these issues, we made use of two mechanisms. The first one is a*focus-of-attention-mechanism* (FoAM) to sensibly and dynamically allocate the computational *budget* among the different hypotheses to evaluate. This mechanism keeps track of the progress made for each computation cycle spent on each hypothesis, and decides on-the-fly where to spend new computation cycles in order to economically discard as many hypotheses as possible. The second mechanism is a *bounding mechanism* (BM) which computes *lower and upper bounds* for the log-probability of a hypothesis, instead of evaluating it *exactly*. These bounds are in general significantly less expensive to evaluate, and they are obtained for a given computational budget (allocated by the FoAM), which in turn defines the tightness of the bounds (*i.e.*, higher budgets result in tighter bounds). In some cases, these inexpensive bounds are already sufficient to discard a hypothesis. Otherwise, these bounds can be efficiently and progressively refined by spending extra computational cycles on them (Fig. 4).

Since the number of hypotheses to consider could be potentially very large, and since the evaluation of each hypothesis' log-probability could be computationally very costly, it is essential to perform these evaluations very efficiently. To address these issues, we made use of two mechanisms. The first one is a

Figure 4: Progressive refinement of the hypotheses' log-probability bounds in a SC algorithm. The bounds of one hypothesis are represented by the two lines of the same color. Cycles in which the bounds of a hypothesis are refined are indicated by the marker 'o'. Hypotheses 3 (cyan) and 4 (blue) are discarded during the 1st refinement cycle. During the 21st cycle Hypothesis 2 (green) is discarded, proving that Hypothesis 1 (red) is optimal. Note that it is not necessary to compute exactly the log-probabilities to select the best hypothesis: the bounds are sufficient and much cheaper to compute.

SC algorithms are very efficient because the interaction between the FoAM and the BM results in that most hypotheses are discarded with very little computation (Fig. 5). Moreover, they are guaranteed to find the optimal solution, even in probabilistic graphical models with loops where other methods fail. Furthermore, they are guaranteed to find the *set of all optimal solutions*, defined as those hypotheses whose log-probability's upper bound is greater than the maximum lower bound. This is a property that no other algorithm known to us can currently claim.

Figure 5: Summary of the results reported in [1]. Left: Original input image. Middle: Two views of the 3D reconstruction obtained for the input image. Right: Histogram of hypotheses grouped by bound pair computations spent on them. Note that very modest work is performed for most hypotheses (in particular 60% of the hypotheses were discarded with less than 10 bound pair computations).

Publications