3D Object Pose Estimation and Categorization
Project Summary
Object detection, pose estimation and categorization are core research problems in computer vision. Although humans solve these problems almost effortlessly, they have proved surprisingly resistant to decades of research. In this project, we develop two approaches to these fundamental problems. The first uses a new class of 3D object models, called 3D wireframe models, which are learned from images and then used for pose estimation. The second augments deep networks for object detection and classification so that they also perform 3D pose estimation. Applications of this research include autonomous navigation (detecting and localizing vehicles and pedestrians in 3D for cars) and robotics (identifying and interacting with objects, locating obstacles and determining room layout for navigation).
Object Localization and Pose Estimation using 3D Wireframe Models
A wireframe model is a sparse collection of 3D points, edges and surface normals defined only at a few points on the boundaries of the 3D object. The model is designed so that, when projected onto the image, it resembles a 2D HOG template for the object. It can therefore be matched to the image by performing fine-grained 3D pose estimation, which yields a 2D detection as a byproduct. In this project, we aim to design algorithms to (1) learn deformable wireframe models from 2D images and (2) use these models for holistic scene understanding (semantic segmentation with 3D pose and layout estimation). The proposed learning algorithm replaces the 3D reconstruction error with a pose estimation score, giving a new correspondence-free non-rigid structure-from-motion algorithm. The project also aims to design new top-down energy terms based on 3D wireframe models that combine semantic segmentation, 3D pose and layout in a CRF-based energy, so that these problems can be solved jointly in a principled manner, and to derive optimization strategies that solve the resulting formulations efficiently.
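The projection step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pinhole camera, a pose given by a rotation about the vertical axis plus a translation, and a handful of made-up 3D points. The names `rotation_about_y` and `project_points`, the focal length and the principal point are all hypothetical.

```python
import math


def rotation_about_y(angle_rad):
    """3x3 rotation matrix for a rotation about the vertical (y) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def project_points(points, R, t, focal, cx, cy):
    """Map 3D model points to pixels: X_cam = R X + t, then pinhole projection.

    focal, cx and cy are assumed camera intrinsics (in pixels).
    """
    pixels = []
    for X in points:
        # Rigid-body transformation from model to camera coordinates.
        Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
        if Xc[2] <= 0:
            raise ValueError("point is behind the camera")
        # Perspective division followed by a shift to the principal point.
        pixels.append((focal * Xc[0] / Xc[2] + cx,
                       focal * Xc[1] / Xc[2] + cy))
    return pixels


if __name__ == "__main__":
    # Hypothetical wireframe points (metres) on a box-like object.
    model = [(-1.0, 0.0, -2.0), (1.0, 0.0, -2.0), (1.0, 0.0, 2.0)]
    pose_R = rotation_about_y(math.radians(30.0))
    pose_t = (0.0, 1.5, 10.0)
    for u, v in project_points(model, pose_R, pose_t, 500.0, 320.0, 240.0):
        print(round(u, 1), round(v, 1))
```

In a matching pipeline of this kind, the projected points (and the projected edge directions) would be compared against features computed from the image, and the pose parameters would be searched to maximize the agreement.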

This work [1] introduces a new class of 3D object models, called 3D wireframe models, which allow for efficient 3D object localization and fine-grained 3D pose estimation from a single 2D image. The approach follows the classical paradigm of matching a 3D model to 2D observations. The 3D object model is composed of a set of 3D edge primitives learned from 2D object blueprints, which can be viewed as a 3D generalization of HOG features. The matching cost is obtained by applying a rigid-body transformation to the 3D object model, projecting it onto the image plane, and matching the projected model to HOG features extracted from the input image. We also introduce a very efficient branch-and-bound algorithm for finding the 3D pose that maximizes the matching score. For this, 3D integral images of quantized HOGs are used to evaluate, in constant time, the maximum attainable matching scores of individual model primitives. Experiments on three different car datasets show promising results, with the fastest test times below half a second.
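The constant-time evaluation relies on the integral-image idea, which the following minimal sketch illustrates for a 3D array. It is a generic summed-volume table, not the bound computation of [1]: the axis layout (row, column, orientation bin) and the function names `integral_volume` and `box_sum` are assumptions made for illustration.

```python
def integral_volume(vol):
    """Build a zero-padded 3D integral image of a nested-list volume.

    S[i][j][k] holds the sum of vol[a][b][c] over a < i, b < j, c < k.
    """
    A, B, C = len(vol), len(vol[0]), len(vol[0][0])
    S = [[[0] * (C + 1) for _ in range(B + 1)] for _ in range(A + 1)]
    for i in range(A):
        for j in range(B):
            for k in range(C):
                # Inclusion-exclusion over the seven already-filled neighbours.
                S[i + 1][j + 1][k + 1] = (
                    vol[i][j][k]
                    + S[i][j + 1][k + 1] + S[i + 1][j][k + 1] + S[i + 1][j + 1][k]
                    - S[i][j][k + 1] - S[i][j + 1][k] - S[i + 1][j][k]
                    + S[i][j][k]
                )
    return S


def box_sum(S, i0, i1, j0, j1, k0, k1):
    """Sum over the half-open box [i0, i1) x [j0, j1) x [k0, k1).

    Uses eight lookups regardless of the box size.
    """
    return (
        S[i1][j1][k1]
        - S[i0][j1][k1] - S[i1][j0][k1] - S[i1][j1][k0]
        + S[i0][j0][k1] + S[i0][j1][k0] + S[i1][j0][k0]
        - S[i0][j0][k0]
    )


if __name__ == "__main__":
    # Toy 2 x 2 x 2 volume standing in for quantized feature responses.
    volume = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    table = integral_volume(volume)
    print(box_sum(table, 0, 2, 0, 2, 0, 2))
```

The table is built once per image, after which any axis-aligned box query costs eight lookups. This is what makes it practical to score the many pose-space cells visited by a branch-and-bound search.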

In [2] we extended this work by using the wireframe models within a Conditional Random Field (CRF) for semantic segmentation. Specifically, we proposed new top-down potentials for image segmentation and pose estimation based on the shape and volume of a 3D wireframe object model. We show that these complex top-down potentials can be easily decomposed into standard forms for efficient inference in both the segmentation and pose estimation tasks. Experiments on a car dataset show that knowledge of the segmentation improves pose estimation and vice versa.
Deep networks have been extremely successful at detecting and classifying objects in 2D images. The goal of this project is to extend deep networks so that they can also reason in 3D, in particular for the joint tasks of 3D pose estimation and categorization. We studied the following research questions: (1) What is an appropriate representation for 3D object pose and, correspondingly, what is the correct problem formulation for this task? (2) What are good loss functions for training CNNs for pose estimation? (3) What should the network architecture be for these pose CNNs? Prior work treated pose estimation as a classification task, discretizing 3D pose into Euler angle bins and training a standard classification network with a cross-entropy loss. This ignores the continuous nature of the 3D pose space and its underlying geometry. In [3,4], we first concentrated on the task of estimating the orientation of the object (captured by a rotation matrix) given its category label and 2D localization (given by an oracle or obtained as the output of a detection system). In [5] we relaxed these constraints and solved for object orientation with unknown category and unknown localization. In [6] we further concentrated on the problem of designing more compact network architectures for object classification.
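The point about geometry can be illustrated with a minimal sketch. It compares a hypothetical 15-degree binning of a single angle with the geodesic distance on the rotation group, which is the rotation angle of R1^T R2 and equals arccos((trace(R1^T R2) - 1) / 2). The bin width and the helper names `rot_z`, `bin_index` and `geodesic_distance` are assumptions for illustration. The sketch shows a standard metric on rotations, not necessarily the exact loss used in [3,4].

```python
import math


def rot_z(angle_rad):
    """3x3 rotation matrix for a rotation about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def bin_index(angle_deg, bin_width_deg=15.0):
    """Hypothetical discretization of an angle in [-180, 180) into bins."""
    return int((angle_deg + 180.0) // bin_width_deg)


def geodesic_distance(R1, R2):
    """Angle (radians) of the relative rotation R1^T R2.

    trace(R1^T R2) equals the element-wise inner product of R1 and R2.
    """
    trace = sum(R1[i][j] * R2[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


if __name__ == "__main__":
    a_deg, b_deg = 179.0, -179.0
    Ra, Rb = rot_z(math.radians(a_deg)), rot_z(math.radians(b_deg))
    # The two poses fall into the last and the first bin respectively ...
    print(bin_index(a_deg), bin_index(b_deg))
    # ... although the rotations themselves are only two degrees apart.
    print(math.degrees(geodesic_distance(Ra, Rb)))
```

A classification loss treats all wrong bins as equally wrong, so the two nearly identical poses above would be penalized as heavily as opposite ones. A distance defined on rotation matrices does not have this problem.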
Work supported by NSF grant 1527340.
[1] E. Yoruk and R. Vidal.
A 3D wireframe model for efficient object localization and pose estimation.
In Workshop on 3D Representation and Recognition at IEEE International Conference on Computer Vision, 2013.
[2] S. Mahendran and R. Vidal.
arXiv, 2016.
[3] S. Mahendran, H. Ali and R. Vidal.
In Workshop on Deep Learning for Robotic Vision at Conference on Computer Vision and Pattern Recognition, 2017.
[4] S. Mahendran, H. Ali and R. Vidal.
In British Machine Vision Conference, 2018.
[5] S. Mahendran, H. Ali and R. Vidal.
In European Conference on Computer Vision, 2018.
[6] H. Lobel, R. Vidal and A. Soto.
In Computer Vision and Image Understanding, 2020.