Recognition of Visual Dynamic Processes

Research Problems
Visual phenomena such as human motion, lip articulation and dynamic textures are examples of inherently dynamic processes found ubiquitously in the environment. A picture is worth a thousand words, but a video is worth a million. It is well established that a video contains far more information about a phenomenon than a single still picture. For example, an image depicting a certain pose of the human figure can be consistent with several kinds of gait, e.g. walking, running or even speed walking, but a video, even over a very short time frame, can immediately reveal to the viewer the exact nature of the gait. Hence the dynamics of a particular action are a crucial tool in identifying and understanding the action.
As shown later, we specifically look at the dynamics of the lips when people, for example, utter their name as part of an identification phrase, or utter a digit from 0 to 9. We want to understand the dynamics behind the speech process as captured in videos and use them to recognize either the speaker or the spoken digit.

Feature Selection
Selecting the right features for the representation of lip dynamics is the first step towards developing a recognition paradigm. We cannot use intensity or color trajectories as they are not representative of the true lip dynamics. We also cannot use motion capture data as that is not suitable for a possible biometrics application. We hence use a 3-stage processing system consisting of the following steps:
(a) Remove the natural head motion during the speaking act
(b) Extract outer lip contours to collect six key-points on the lips and track these points throughout the video
(c) Interpolate from these six points to a total of 32 equidistant points on the lip contours and record two types of features: (1) Landmarks, the coordinates of these points, and (2) Distances, the distance between corresponding landmarks on the upper lip and the lower lip.
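Step (c) can be illustrated with a minimal sketch. It makes several assumptions that are not stated above: each lip is treated as an open polyline running from one mouth corner to the other (so the two corners are shared and four further key points give six in total), the 32 points are split evenly as 16 per lip, and resampling is piecewise linear in arc length. The function names `resample_polyline` and `lip_features` are hypothetical.

```python
import math

def resample_polyline(points, n):
    """Return n points equally spaced in arc length along an open polyline."""
    # Cumulative arc length at each vertex of the polyline.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]

    out = []
    seg = 0
    for k in range(n):
        target = total * k / (n - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else (target - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out

def lip_features(upper, lower, n_per_lip=16):
    """Landmark and distance features for one frame (assumed 16 points per lip)."""
    up = resample_polyline(upper, n_per_lip)
    lo = resample_polyline(lower, n_per_lip)
    # Landmarks: flattened (x, y) coordinates of all 32 resampled points.
    landmarks = [c for p in up + lo for c in p]
    # Distances: upper-to-lower distance between corresponding landmarks.
    distances = [math.hypot(a[0] - b[0], a[1] - b[1]) for a, b in zip(up, lo)]
    return landmarks, distances

# Toy frame: both lips share the two mouth corners (0, 0) and (4, 0).
upper = [(0.0, 0.0), (1.0, 1.0), (3.0, 1.0), (4.0, 0.0)]
lower = [(0.0, 0.0), (1.0, -1.0), (3.0, -1.0), (4.0, 0.0)]
landmarks, distances = lip_features(upper, lower)
```

Stacking these per-frame vectors over time gives the feature trajectories that are modeled in the next section. With shared corners, the first and last distance entries are zero by construction, so a real pipeline might drop them.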

[Figure: Lip contours, Landmark features, Distance features]

System Identification
We model the temporal evolution of the different lip features using the linear dynamical systems (LDS) framework. More specifically, a sequence of feature trajectories is assumed to be a realization of a second-order stationary stochastic process, and hence it can be modeled with a state space model:
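One standard form of such a model, consistent with the parameter tuple M = (x0, A, B, C, R), is sketched here. The Gaussian noise conventions are an assumption; x_t denotes the hidden state and y_t the observed feature vector at time t, with x_0 the initial state.

```latex
\begin{aligned}
x_{t+1} &= A\,x_t + B\,v_t, & v_t &\sim \mathcal{N}(0, I),\\
y_t     &= C\,x_t + w_t,    & w_t &\sim \mathcal{N}(0, R).
\end{aligned}
```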
To find the parameters for the Linear Dynamical System M = (x0, A, B, C, R) we consider the use of two popular Linear System Identification methods, namely:
1. N4SID
2. PCA-based suboptimal ID

Classification of Lip Articulation
Once the system parameters have been identified using either of the above methods, we need to define a measure of affinity between two given linear dynamical systems. Over the past few years, a number of metrics have been proposed on the space of linear dynamical systems. Several of these distances are based on the notion of subspace angles between two systems. These subspace angles are precisely the principal angles between the column spaces of the infinite observability matrices of the two systems. Since the infinite observability matrices cannot be formed explicitly in practice, we estimate the subspace angles indirectly through the solution of a generalized eigenvalue problem, which in turn requires solving a Lyapunov equation. Once the subspace angles, $\{\theta_i\}_{i=1}^{2n}$, have been found, a number of distances can be defined as:
Finsler distance
Martin distance
Frobenius distance
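As a hedged illustration, the sketch below uses the forms in which these three distances are commonly written in the literature on subspace-angle metrics: d_F = θ_max, d_M² = -ln ∏ cos² θ_i, and d_f² = 2 Σ sin² θ_i. These forms are an assumption here, normalization conventions vary between papers, and the function name `subspace_angle_distances` is hypothetical. The angles are taken as already computed, in radians.

```python
import math

def subspace_angle_distances(thetas):
    """Distances between two LDSs from their subspace angles (radians, in [0, pi/2))."""
    # Finsler distance: the largest subspace angle (assumed form).
    finsler = max(thetas)
    # Martin distance: d_M^2 = -ln prod_i cos^2(theta_i) (assumed form).
    martin_sq = -sum(math.log(math.cos(t) ** 2) for t in thetas)
    # Frobenius distance: d_f^2 = 2 * sum_i sin^2(theta_i) (assumed form).
    frobenius_sq = 2.0 * sum(math.sin(t) ** 2 for t in thetas)
    return {
        "finsler": finsler,
        "martin": math.sqrt(max(0.0, martin_sq)),
        "frobenius": math.sqrt(frobenius_sq),
    }

# Two systems whose subspace angles are both 45 degrees.
d = subspace_angle_distances([math.pi / 4, math.pi / 4])
```

All three quantities are zero when every angle is zero, which is the case when a system is compared with itself.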
Another method for defining a measure of affinity between two LDS is through a kernel. The Martin Kernel, defined as:
Martin kernel
comes directly from the associated Martin Distance and is immediately computable from the subspace angles. Chan and Vasconcelos proposed another metric based on the Kullback-Leibler Divergence between the probability distributions of the outputs of the dynamical systems:
Vishwanathan et al. introduced the family of Binet-Cauchy kernels for the analysis of dynamical systems. One of these kernels, called the trace kernel between two ARMA models, can be evaluated as:
Once an affinity measure has been defined between two linear dynamical systems, we use k-Nearest Neighbors (k-NN) classification in the case of distances, or k-Furthest Neighbors (k-FN) classification in the case of kernels. To perform more sophisticated classification, we also use kernel SVMs with the previously defined kernels.
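The nearest-neighbor step can be sketched as follows for the 1-NN, leave-one-out setting used in the results. This is a minimal illustration that assumes a precomputed symmetric matrix of pairwise distances between the identified systems; the function name `leave_one_out_1nn_error` and the toy data are hypothetical.

```python
def leave_one_out_1nn_error(dist, labels):
    """Fraction of samples misclassified by 1-NN when each sample is held out in turn."""
    n = len(labels)
    errors = 0
    for i in range(n):
        # Nearest neighbor of sample i among all other samples.
        j = min((k for k in range(n) if k != i), key=lambda k: dist[i][k])
        if labels[j] != labels[i]:
            errors += 1
    return errors / n

# Toy example: four sequences from two speakers, with small within-speaker distances.
dist = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 5.0, 5.0],
    [5.0, 5.0, 0.0, 1.0],
    [6.0, 5.0, 1.0, 0.0],
]
labels = ["a", "a", "b", "b"]
error_rate = leave_one_out_1nn_error(dist, labels)
```

For a kernel, where a larger value indicates a closer match, the same loop would select the neighbor with `max` instead of `min`.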

Results
We performed a large number of experiments on both scenarios, namely 'Name' and 'Digit', using both the Landmarks features and the Distances features. To evaluate the performance of the different lip features, identification methods, distances and kernels for LDSs, and classification methods, we classified the data sets G4, G8, and G12 using leave-one-out classification. G12 is a group of 12 people whose lip motions were investigated, G4 consists of the first 4 of these people, and G8 consists of the first 8. The following figure shows the group makeup:
Groups - The first row of the figure shows G4, the first 2 rows make up G8, and all the figures together represent G12.
The following tables show the classification errors of all groups in the name and digit scenarios for landmark trajectories, L, using the PCA-based identification method, with both 1-NN classification with 5 different distances and SVM classification with 3 different kernels.

Name scenario

      | 1-NN                    | SVM
PCA-L | dF | dM | df | dKL | kT | kM | kKL | kT
G4    | 38 | 13 | 10 |   3 |  0 | 17 |   0 |  0
G8    | 63 | 41 | 25 |  23 | 33 | 45 |  15 | 33
G12   | 63 | 49 | 34 |  36 | 39 | 47 |  17 | 40

Digit scenario

      | 1-NN                    | SVM
PCA-L | dF | dM | df | dKL | kT | kM | kKL | kT
G4    | 43 | 28 | 15 |  13 | 28 | 40 |  10 | 10
G8    | 55 | 46 | 30 |  34 | 45 | 41 |  13 | 36
G12   | 63 | 48 | 33 |  43 | 51 | 64 |  20 | 43