JHU Johns Hopkins Computer Vision Machine Learning

Research Problems

Human activity is an inherently dynamic process, and it can be understood unambiguously only when its temporal evolution is taken into account. For example, a single image of a human pose could belong to any of several gaits, such as walking, running, speed walking, or walking with a limp, whereas a video, even over a very short time span, immediately reveals the exact nature of the gait. The dynamics of an action are therefore a crucial cue for identifying and understanding it. The goal of this research is to better understand the dynamics of human activity as captured in video and to use them to recognize the action in any given video.

Feature Selection

Selecting the right features to represent the dynamics of human motion is the first step towards developing a recognition paradigm. We cannot simply use intensity or color trajectories, as they are not representative of the true motion. In the past, researchers have used motion capture data; however, accurately extracting limbs and joints from a video without markers attached to the body is difficult. Moreover, there is always the question of whether the representation should describe the whole person, i.e. be global, or whether a collection of local representations, such as a parts-based approach, is better. Global motion models allow for a spatially compact and temporally consistent formulation. Their drawback is that they break down when the subject is heavily occluded, in which case parts-based or local approaches are more applicable. Local models, however, do not take the dynamics of the scene into account. Since our goal is to explicitly model the dynamics of human motion, we propose a global model.

Assuming that there is only one person moving in the scene and that the background is stationary, we first compute the optical flow for each frame of the video. We quantize the optical flow as shown in figure 2: each optical flow vector is binned according to its principal angle with the horizontal, and its contribution to the bin is weighted by the magnitude of the vector. This gives us, in each frame, a feature that is invariant to scale and to the direction of fronto-parallel motion: a Histogram of Oriented Optical Flow (HOOF). The time series of HOOF features represents the motion in the scene.
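The binning step can be illustrated with a minimal sketch. It assumes the flow field is already available as a list of per-pixel `(u, v)` components; the function name `hoof`, the default of 30 bins, the normalization to unit sum (one way to obtain scale invariance), and the uniform fallback for a motionless frame are all illustrative assumptions rather than details given above.

```python
import math


def hoof(flow_vectors, n_bins=30):
    """Histogram of Oriented Optical Flow for one frame (illustrative sketch).

    flow_vectors: iterable of (u, v) optical-flow components, one per pixel.
    Returns a list of n_bins non-negative weights that sum to 1.
    """
    hist = [0.0] * n_bins
    bin_width = math.pi / n_bins
    for u, v in flow_vectors:
        magnitude = math.hypot(u, v)
        if magnitude == 0.0:
            continue  # static pixels carry no direction
        # Angle with the horizontal, folded into [-pi/2, pi/2] so that
        # leftward and rightward versions of a motion fall in the same bin.
        theta = math.atan2(v, abs(u))
        index = min(int((theta + math.pi / 2) / bin_width), n_bins - 1)
        hist[index] += magnitude  # weight the vote by flow magnitude
    total = sum(hist)
    if total == 0.0:
        # Assumption: a frame with no motion gets a uniform histogram.
        return [1.0 / n_bins] * n_bins
    # Assumption: normalizing to unit sum removes dependence on scale.
    return [h / total for h in hist]
```

Calling `hoof` on every frame yields one histogram per frame, and the resulting sequence is the HOOF time series that the later modelling steps operate on.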

System Identification

Since HOOF features are histograms and do not lie in a Euclidean space, we cannot model a HOOF time series as a Linear Dynamical System (LDS). Instead, we model the temporal evolution of HOOF features with a non-linear dynamical system whose state evolves linearly, using kernels on the space of histograms,

$$x_{t+1} = A x_t + B v_t, \qquad \Phi(y_t) = C(x_t) + w_t,$$

where Φ is the implicit map of a kernel on the space of histograms. Some of the measures that can be used to define the kernel are the Bhattacharyya distance, the χ² distance, the Histogram Intersection kernel, and the Minimum Distance Pairwise Assignment. Using Kernel PCA, we identify the parameters y_mean, x₀, A, B and the covariances of the noise processes v and w, as well as the kernel principal components that represent the C function.
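To make the kernel choices concrete, here is a minimal sketch of three common histogram kernels together with the Gram-matrix centering that Kernel PCA starts from. These are standard textbook forms; whether they match the exact definitions used in this work is an assumption, and the function names and the `gamma` parameter are hypothetical. The Minimum Distance Pairwise Assignment is not sketched.

```python
import math


def bhattacharyya_kernel(h1, h2):
    """Bhattacharyya coefficient: sum of sqrt(h1_i * h2_i)."""
    return sum(math.sqrt(a * b) for a, b in zip(h1, h2))


def intersection_kernel(h1, h2):
    """Histogram intersection: sum of min(h1_i, h2_i)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def chi_square_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel, exp(-gamma * chi2(h1, h2))."""
    chi2 = 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(h1, h2) if a + b > 0)
    return math.exp(-gamma * chi2)


def centered_gram(histograms, kernel):
    """Gram matrix of a histogram sequence, centered in feature space.

    Centering subtracts the feature-space mean implicitly, which is the
    first step of Kernel PCA.
    """
    n = len(histograms)
    gram = [[kernel(a, b) for b in histograms] for a in histograms]
    row_mean = [sum(row) / n for row in gram]  # equals column mean (symmetric)
    total_mean = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]
```

The leading eigenvectors of the centered Gram matrix give the kernel principal components; projecting each frame onto them produces a low-dimensional state sequence from which a state-transition matrix can then be fitted.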

Classification of Human Activities

Once the system parameters have been identified, we need to define a measure of affinity between two given non-linear dynamical systems. A family of such measures between two Linear Dynamical Systems, the Binet-Cauchy kernels, was introduced by Vishwanathan et al. One of these kernels, the trace kernel between two ARMA models, can be evaluated as:

Since we cannot use an LDS to model HOOF trajectories, we need an affinity measure that is applicable to non-linear dynamical systems. We propose the Binet-Cauchy kernels for non-linear dynamical systems as:

Once an affinity measure has been defined between two non-linear dynamical systems, we use k-Nearest Neighbors (k-NN) classification. For more sophisticated classification, we can also use kernel SVMs with the previously defined kernels.
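A minimal sketch of k-NN classification from kernel values follows. It assumes the kernel is turned into a squared distance through d²(a, b) = k(a, a) + k(b, b) − 2 k(a, b), which is the usual kernel-induced distance and not necessarily the exact affinity used here; the function names and argument layout are hypothetical.

```python
from collections import Counter


def kernel_distance_sq(k_aa, k_bb, k_ab):
    """Squared distance induced by a kernel: k(a,a) + k(b,b) - 2 k(a,b)."""
    return k_aa + k_bb - 2.0 * k_ab


def knn_predict(k_test_train, k_test_test, k_train_diag, train_labels, k=1):
    """Label a test system by majority vote among its k nearest training systems.

    k_test_train: kernel values between the test system and each training system.
    k_test_test:  kernel value of the test system with itself.
    k_train_diag: kernel value of each training system with itself.
    """
    distances = sorted(
        (kernel_distance_sq(k_test_test, k_train_diag[i], k_test_train[i]), i)
        for i in range(len(train_labels))
    )
    votes = Counter(train_labels[i] for _, i in distances[:k])
    # Ties are resolved in favour of the label seen first, i.e. the nearest.
    return votes.most_common(1)[0][0]
```

With `k=1` this reduces to assigning the label of the single nearest training system.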

Results

We perform leave-one-out classification using the Binet-Cauchy kernels for non-linear dynamical systems on the HOOF trajectories of the Weizmann human action dataset. This dataset contains 10 actions, each performed by 9 persons, for a total of 90 videos. For each video, the HOOF feature time series was extracted and the corresponding system parameters were identified. For each test video, the Binet-Cauchy kernel for NLDS was computed against all the training videos, and the label of the nearest one was assigned to the test video. The results were averaged over all the videos using a leave-one-out scheme. The confusion matrix for the classification results is shown below.
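The evaluation protocol can be sketched as follows, assuming a precomputed matrix of pairwise kernel values between all videos and a nearest-neighbor rule based on the kernel-induced distance. The function name and the toy inputs in the usage note are hypothetical and unrelated to the reported results.

```python
def leave_one_out(gram, labels):
    """Leave-one-out 1-NN evaluation from a precomputed kernel matrix.

    gram[i][j] is the kernel value between videos i and j.
    Returns (classes, confusion, accuracy), where confusion[r][c] counts
    videos of true class classes[r] that were labelled classes[c].
    """
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    confusion = [[0] * len(classes) for _ in classes]
    n = len(labels)
    for test in range(n):
        best, best_dist = None, float("inf")
        for train in range(n):
            if train == test:
                continue  # the held-out video never votes for itself
            dist = gram[test][test] + gram[train][train] - 2.0 * gram[test][train]
            if dist < best_dist:
                best, best_dist = train, dist
        confusion[index[labels[test]]][index[labels[best]]] += 1
    accuracy = sum(confusion[i][i] for i in range(len(classes))) / n
    return classes, confusion, accuracy
```

For example, `leave_one_out([[1.0, 0.9], [0.9, 1.0]], ["run", "run"])` holds out each of the two videos in turn and labels it from the other one.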

Publications

[1] R. Chaudhry, A. Ravichandran, G. Hager and R. Vidal. Histograms of Oriented Optical Flow and Binet-Cauchy Kernels on Nonlinear Dynamical Systems for the Recognition of Human Actions. IEEE Conference on Computer Vision and Pattern Recognition, June 2009.

Acknowledgments

This work was partially supported by grants NSF CAREER 0447739, NSF CDI-1 0941463, and ARL Robotics-CTA 80014MC.