JHU Computer Vision Machine Learning

Project Summary

GEAR (Grounded Early Adaptive Rehabilitation) is a collaborative research effort between the University of Delaware, the University of California, Riverside, and Johns Hopkins University. It brings together robotics engineers, cognitive scientists, and physical therapists to design new rehabilitation environments and methods for young children with mobility disorders. The envisioned pediatric rehabilitation environment has three components: a portable harness system that partially compensates for body weight and facilitates the children’s mobility within a 10 x 10 ft area; a small humanoid robot that interacts socially with the children, engaging them in games designed to keep them at particular levels of physical activity; and a network of cameras that captures and identifies motion in the environment and informs the robot, so that the robot can adjust its behavior to that of the child.

The realization of this system presents new research challenges in pediatric rehabilitation, robot control, machine vision, and computational learning. One of them is developing activity recognition methods, which are essential for child-robot interaction. Our team aims to develop highly interpretable, structured representations and models of children’s movements that capture spatial and temporal relationships among moving body parts, actions, and activities, and that can be learned automatically from multimodal time-series data.

Activity Models

We have been working on a library of activity models designed specifically for children. As a first step toward this goal, we used datasets such as MSR Action 3D, MSR DailyActivity3D, and Berkeley MHAD, which were collected from adults performing various activities (e.g., hand waving, clapping, jumping, drinking). In particular, we developed so-called "moving poselets" [1], a library of movements, each associated with a specific body part configuration (e.g., hand moving forward). We used motion capture data from body parts to learn a library of moving poselets, as well as activity classifiers based on them. This work was published in the Chalearn Looking at People Workshop at the International Conference on Computer Vision (2015) [1]. We then extended this work to video data by developing a spatiotemporal convolutional neural network model for predicting fine-grained activities that can be decomposed into a sequence of actions. This work was presented at the European Conference on Computer Vision (2016) [2].

Multiview Action Classification

Our team at the University of Delaware designed the envisioned pediatric rehabilitation environment and used multiple cameras to record infants (7 to 24 months old) performing actions in it. These scenes contain not only infants but also robots and adults, and the infant is one of the smallest actors in the scene. The set-up is also challenging because the infants are often occluded by other actors or elements in the scene, so the information from a given camera is not always useful for action classification. From this multiview data we aim to classify the main motor actions seen in infant development: crawling, sitting, standing, and walking. We first addressed this problem with a multiple instance learning SVM scheme (MI-SVM), which treats views as instances of the same sample and takes into account that the action might not be observed in all of them. This work was published in the Journal of Neuroengineering and Rehabilitation [3]. More recently, we have been addressing the challenges posed by the complexity of the scene by using local features from spatial regions of interest in a detection-based multiview action classification scheme. We propose to leverage deep networks for feature extraction and classification, while introducing learnable fusion coefficients that weigh the importance of each view in the final prediction. This work was accepted for oral presentation at the International Conference on Pattern Recognition (ICPR 2020) [4].
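The two ways of combining views described above can be illustrated with a small sketch. It is not the implementation from [3] or [4]: the function names, the softmax normalization of the fusion coefficients, and all numbers are assumptions chosen for illustration. In a trained model the coefficients would be learned together with the network; here they are fixed by hand.

```python
import math

ACTIONS = ["crawling", "sitting", "standing", "walking"]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_scores, fusion_logits):
    """Weighted sum of per-view class scores.

    view_scores: one list of class scores per camera view.
    fusion_logits: one coefficient per view (hypothetical stand-ins for
    learnable fusion coefficients), normalized here with a softmax.
    """
    if len(view_scores) != len(fusion_logits):
        raise ValueError("need one fusion coefficient per view")
    weights = softmax(fusion_logits)
    fused = [0.0] * len(view_scores[0])
    for weight, scores in zip(weights, view_scores):
        for c, score in enumerate(scores):
            fused[c] += weight * score
    return fused


def max_over_views(view_scores):
    """Multiple-instance style aggregation: best score per class over views."""
    return [max(column) for column in zip(*view_scores)]


# Made-up scores for three cameras; the third view is assumed occluded,
# so its scores are uninformative.
view_scores = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
fusion_logits = [1.0, 1.0, -2.0]  # low coefficient for the occluded view

fused = fuse_views(view_scores, fusion_logits)
predicted = ACTIONS[max(range(len(fused)), key=fused.__getitem__)]
```

In this toy setting the occluded camera gets a small weight, so the fused prediction is driven by the two informative views. The max-over-views function shows the multiple-instance intuition that the action only needs to be visible in one view.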

People

R. Vidal, E. Mavroudi, C. Pacheco, L. Tao.

Acknowledgement

Work supported by NIH grant R01HD87133-01.

Publications

[1]

L. Tao and R. Vidal.

Moving Poselets: A Discriminative and Interpretable Skeletal Motion Representation for Action Recognition

In Chalearn Looking at People Workshop, International Conference on Computer Vision, December 2015.

[2]

C. Lea, A. Reiter, R. Vidal.

Segmental Spatio-Temporal CNNs for Fine-grained Action Segmentation and Classification

In European Conference on Computer Vision, October 2016.

[3]

E. Kokkoni, E. Mavroudi, A. Zehfroosh, J. C. Galloway, R. Vidal, J. Heinz, H. G. Tanner.

Gearing Smart Environments for Pediatric Motor Rehabilitation

Journal of Neuroengineering and Rehabilitation, vol. 17, no. 1, 2020.

[4]

C. Pacheco, E. Mavroudi, E. Kokkoni, H. G. Tanner, R. Vidal.

A Detection-based Approach to Multiview Action Classification in Infants

In 25th International Conference on Pattern Recognition (ICPR 2020), January 2021.