JHU Johns Hopkins Computer Vision Machine Learning

Research Problems

Visual phenomena such as human motion, lip articulation and dynamic textures are examples of inherently dynamic processes found throughout the environment. A picture is worth a thousand words, but a video is worth a million. A video carries far more information about a phenomenon than a single still picture. For example, an image depicting a certain pose of the human figure is consistent with several kinds of gait, e.g., walking, running or even speed walking, whereas a video, even over a very short time frame, immediately reveals to the viewer the exact nature of the gait. Hence the dynamics of an action are a crucial cue for identifying and understanding it.

As shown later, we look specifically at the dynamics of the lips when people, for example, utter their name as part of an identification phrase, or utter a digit from 0 to 9. We want to understand the dynamics behind the speech process as captured in videos and use them to recognize either the person or the spoken digit.

Feature Selection

Selecting the right features for the representation of lip dynamics is the first step towards developing a recognition paradigm. We cannot use intensity or color trajectories as they are not representative of the true lip dynamics. We also cannot use motion capture data as that is not suitable for a possible biometrics application. We hence use a 3-stage processing system consisting of the following steps:
(a) Remove the natural head motion during the speaking act
(b) Extract outer lip contours to collect six key-points on the lips and track these points throughout the video
(c) Interpolate from these six points to a total of 32 equidistant points on the lip contours and record two types of features:
1. Landmarks - The coordinates of these points,
2. Distances - The distance between corresponding landmarks on the upper lip and the lower lip.
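
Step (c) can be sketched as follows. This is a minimal illustration rather than our actual pipeline: it assumes that the upper and lower lip contours are each available as a polyline running from one mouth corner to the other, that linear interpolation along arc length is acceptable, and that the 32 points are split evenly between the two contours. The function names resample_polyline and lip_features are hypothetical.

```python
import numpy as np

def resample_polyline(points, k):
    """Resample a 2-D polyline to k points equally spaced along its arc length."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)  # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])            # arc length at each vertex
    targets = np.linspace(0.0, s[-1], k)                   # equally spaced arc lengths
    return np.column_stack([np.interp(targets, s, points[:, 0]),
                            np.interp(targets, s, points[:, 1])])

def lip_features(upper, lower, k=16):
    """Return (landmarks, distances) for one frame.

    Assumption: upper and lower are polylines ordered corner to corner,
    so the i-th resampled points on each contour correspond to each other.
    """
    up = resample_polyline(upper, k)
    lo = resample_polyline(lower, k)
    landmarks = np.concatenate([up, lo]).ravel()   # coordinates of all 2k points
    distances = np.linalg.norm(up - lo, axis=1)    # k upper-to-lower distances
    return landmarks, distances
```

Stacking the output of lip_features over all frames of a video gives one feature trajectory per video, which is the input to the identification step.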

System Identification

We model the temporal evolution of the different lip features using the linear dynamical systems (LDS) framework. More specifically, a sequence of feature trajectories is assumed to be a realization of a second-order stationary stochastic process, and hence it can be modeled with a state space model:
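
The equation itself is not reproduced in this text. A standard form of such a model, assumed here because it matches the parameter tuple M = (x₀, A, B, C, R) used in the next paragraph, is given below. The symbols y_t, x_t, v_t and w_t, as well as the Gaussian noise assumption, are ours for illustration.

```latex
\begin{aligned}
x_{t+1} &= A\,x_t + B\,v_t, & v_t &\sim \mathcal{N}(0, I),\\
y_t     &= C\,x_t + w_t,    & w_t &\sim \mathcal{N}(0, R).
\end{aligned}
```

Here y_t is the feature vector observed at frame t, x_t is the hidden state with initial value x₀, A governs the state dynamics, C maps states to observations, B shapes the process noise, and R is the covariance of the measurement noise.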

To find the parameters for the Linear Dynamical System M = (x₀, A, B, C, R) we consider the use of two popular Linear System Identification methods, namely:
1. N4SID
2. PCA-based suboptimal ID
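
The PCA-based approach can be sketched as follows. This is a minimal sketch under stated assumptions, not our exact implementation: the state trajectory is taken from a truncated SVD of the data matrix, the transition matrix is fitted by least squares, and the noise terms are estimated from the residuals. The function name identify_lds_pca, the absence of mean subtraction, and the normalization of the covariance estimates are all assumptions.

```python
import numpy as np

def identify_lds_pca(Y, n):
    """Suboptimal PCA-based identification of an order-n LDS.

    Y is a (p, T) array with one feature vector per column.
    Returns (x0, A, B, C, R).
    """
    Y = np.asarray(Y, dtype=float)
    T = Y.shape[1]
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n]                               # observation matrix, orthonormal columns
    X = s[:n, None] * Vt[:n, :]                # estimated states, shape (n, T)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])   # least-squares fit of x_{t+1} = A x_t
    V = X[:, 1:] - A @ X[:, :-1]               # one-step state residuals
    Q = V @ V.T / (T - 1)                      # process noise covariance estimate
    w, E = np.linalg.eigh(Q)
    B = E * np.sqrt(np.clip(w, 0.0, None))     # factor Q so that B @ B.T == Q
    W = Y - C @ X                              # reconstruction residuals
    R = W @ W.T / T                            # measurement noise covariance estimate
    return X[:, 0], A, B, C, R
```

The identified parameters are only defined up to a change of basis of the state space, which is one reason the affinity measures between systems must not depend on that basis.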

Classification of Lip Articulation

Once the system parameters have been identified using either of the above methods, we need to define a measure of affinity between two given linear dynamical systems. Over the past few years, a number of metrics have been proposed on the space of linear dynamical systems. Several of these distances are based on the notion of subspace angles between two systems, which are the angles between the subspaces spanned by the infinite observability matrices of the two systems. Since the infinite observability matrices cannot be formed in practice, we estimate the subspace angles indirectly through a generalized eigenvalue problem that reduces to solving a Lyapunov equation. Once the subspace angles {θ_i}, i = 1, ..., 2n, have been found, a number of distances can be defined:

Finsler distance
Martin distance
Frobenius distance
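
The formulas for these distances are not reproduced in this text. The sketch below uses the definitions commonly found in the literature, which we assume here: the Finsler distance is the largest angle, the squared Martin distance is -ln ∏ cos²θ_i, and the squared Frobenius distance is 2 Σ sin²θ_i. It also uses a simplified variant of the angle computation that compares only the observability subspaces of (A, C), which yields n angles for two order-n systems rather than the 2n angles mentioned above. All function names are hypothetical, and both systems are assumed stable, observable and of the same order.

```python
import numpy as np

def subspace_angles(A1, C1, A2, C2):
    """Angles between the infinite observability subspaces of two stable LDSs."""
    n1, n2 = A1.shape[0], A2.shape[0]
    # Stack the two systems: A = blkdiag(A1, A2), C = [C1 C2].
    A = np.block([[A1, np.zeros((n1, n2))], [np.zeros((n2, n1)), A2]])
    C = np.hstack([C1, C2])
    m = n1 + n2
    # Solve the discrete Lyapunov equation P = A^T P A + C^T C by vectorization.
    K = np.kron(A.T, A.T)
    P = np.linalg.solve(np.eye(m * m) - K, (C.T @ C).ravel()).reshape(m, m)
    P11, P12, P22 = P[:n1, :n1], P[:n1, n1:], P[n1:, n1:]
    # cos^2(theta_i) are the eigenvalues of P11^{-1} P12 P22^{-1} P12^T.
    M = np.linalg.solve(P11, P12) @ np.linalg.solve(P22, P12.T)
    cos2 = np.clip(np.real(np.linalg.eigvals(M)), 0.0, 1.0)
    return np.arccos(np.sqrt(cos2))

def finsler_distance(theta):
    return float(np.max(theta))                                 # largest angle

def martin_distance(theta):
    return float(np.sqrt(-np.sum(np.log(np.cos(theta) ** 2))))  # assumed definition

def frobenius_distance(theta):
    return float(np.sqrt(2.0 * np.sum(np.sin(theta) ** 2)))     # assumed definition
```

Because the angles depend only on the subspaces, two parameterizations of the same system that differ by a change of state basis are at distance zero, up to numerical error.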

Another method for defining a measure of affinity between two LDS is through a kernel. The Martin Kernel, defined as:

Martin kernel

comes directly from the associated Martin Distance and is immediately computable from the subspace angles. Chan and Vasconcelos proposed another metric based on the Kullback-Leibler Divergence between the probability distributions of the outputs of the dynamical systems:

Vishwanathan et al. introduced the family of Binet-Cauchy kernels for the analysis of dynamical systems. One of these kernels, called the trace kernel between two ARMA models, can be evaluated as:

Once an affinity measure has been defined between two linear dynamical systems, we use k-Nearest Neighbors (k-NN) classification in the case of distances, or k-Furthest Neighbors (k-FN) classification in the case of kernels, where a larger value indicates greater similarity. For more sophisticated classification, we also use kernel SVMs with the previously defined kernels.
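
The nearest-neighbor rule with leave-one-out evaluation can be sketched as follows for k = 1. This is an illustrative sketch, not our evaluation code: it assumes that the pairwise affinities between all videos have already been computed and stored in a square matrix, and the function name leave_one_out_error is hypothetical.

```python
import numpy as np

def leave_one_out_error(affinity, labels, is_kernel=False):
    """Leave-one-out error rate of 1-NN (distances) or 1-FN (kernels).

    affinity is an (N, N) matrix of pairwise distances or kernel values,
    labels is a length-N sequence of class labels.
    """
    affinity = np.asarray(affinity, dtype=float)
    labels = np.asarray(labels)
    # For kernels a larger value means more similar, so negate to reuse argmin.
    score = -affinity if is_kernel else affinity.copy()
    np.fill_diagonal(score, np.inf)       # a sample may not be matched to itself
    nearest = np.argmin(score, axis=1)    # index of the closest other sample
    return float(np.mean(labels[nearest] != labels))
```

Each sample is classified with the label of its closest neighbor among all other samples, and the returned value is the fraction of samples that are misclassified.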

Results

We performed a large number of experiments on both scenarios, namely 'Name' and 'Digit', using both the Landmarks features and the Distances features. To evaluate the performance of the different lip features, identification methods, distances and kernels for LDSs, and classification methods, we classified the data sets G4, G8, and G12 using leave-one-out classification. G12 is a group of 12 people whose lip motions were investigated, G4 consists of the first 4 of these 12 people, and G8 consists of the first 8. The following figure shows the group makeup:

Groups - First row of the figure shows G4, the first 2 rows make up G8 and all the figures together represent G12.

The tables at the end of this page show the classification errors of all groups in the name and digit scenarios for landmark trajectories, L, using the PCA-based identification method, with both 1-NN classification using 5 different affinity measures and SVM classification using 3 different kernels.

Dynamic Texture Categorization

In dynamic textures, the temporal evolution of image intensities is captured by a linear dynamical system whose parameters live in a Stiefel manifold, which is a non-Euclidean space. Boosting is a remarkably simple and flexible classification algorithm with widespread applications in computer vision. However, the application of boosting to non-Euclidean, infinite-length, and time-varying data, such as videos, is not straightforward. We present a novel boosting method for the recognition of visual dynamical processes. Our key contribution is the design of weak classifiers (features) that are formulated as linear dynamical systems. The main advantage of such features is that they can be applied to infinitely long sequences and that they can be computed efficiently by solving a set of Sylvester equations. This method can be applied to dynamic texture classification.

         1-NN                                SVM
PCA-L    d_F    d_M    d_f    d_KL   k_T     k_M    k_KL   k_T
G4       38     13     10     3      0       17     0      0
G8       63     41     25     23     33      45     15     33
G12      63     49     34     36     39      47     17     40
Name scenario

         1-NN                                SVM
PCA-L    d_F    d_M    d_f    d_KL   k_T     k_M    k_KL   k_T
G4       43     28     15     13     28      40     10     10
G8       55     46     30     34     45      41     13     36
G12      63     48     33     43     51      64     20     43
Digit scenario