Multimodal Algorithms for Motor Imitation Assessment in Children with Autism
Project Summary
Computerized Assessment of Motor Imitation (CAMI) is a collaborative research project between the Kennedy Krieger Institute, the University of Washington, and Johns Hopkins University that brings together neurologists, biomedical engineers, and computer scientists to design, develop, and test an objective, reproducible, and highly scalable multimodal system that observes children performing a brief videogame-like motor imitation task and quantitatively assesses their motor imitation performance. The project will also investigate the validity of this assessment as a phenotypic biomarker for autism. Accomplishing this goal requires an interdisciplinary approach that combines expertise in autism, child development, computer vision, and machine learning.
Specifically, this project will: (1) design motor imitation tasks that are relevant for ASD assessment; (2) design, test, and validate a scalable and flexible system to collect and label multimodal data of children imitating a sequence of movements; (3) design a novel fine-grained representation of human movements that can be learned efficiently and is suitable for comparing children's movements to the movements they must imitate; (4) develop novel computer vision and metric learning algorithms for learning and comparing multimodal representations of human movements; and (5) use such metrics to generate candidate imitation scores that can serve as potential quantitative biomarkers for ASD.
The motor imitation assessment methods developed in this project could be used in a wide variety of applications beyond assessing children with ASD, such as providing imitation performance scores for video-based rehabilitation therapy, surgical skill assessment, athletic training, and other movement-based instruction.

Multimodal System for Monitoring Children with ASD
Recognizing the need for highly scalable systems for improving autism screening, diagnosis, and evaluation (for guiding intervention), this grant focuses on adapting our published 3D computerized assessment of motor imitation, which was highly successful at distinguishing autistic children from those without autism [1], to 2D systems [2]. It is therefore critical to collect new data that allows us not only to validate our prior work, but also to support the development of new algorithms. Toward that end, we designed and implemented systems for simultaneous 2D and 3D recording of participants' movements, including highly scalable devices (phone cameras and camcorders), in Co-I Mostofsky's laboratory at the Center for Neurodevelopmental and Imaging Research (CNIR). This has provided data for developing 2D pose estimation methods that are similarly effective at distinguishing autistic individuals and identifying specific behaviors.
Multimodal Representation of Human Movements
Understanding human movements in images and videos has played a central role in computer vision for several decades. For instance, body pose estimation has found numerous applications in action recognition, motion analysis, gaming, video surveillance, and, more recently, rehabilitation medicine. Much of the recent work in 2D pose estimation uses encoder-decoder architectures based on convolutional neural networks (CNNs) to predict body landmarks in an image or video. Such approaches perform very well on single images, but their performance deteriorates on high-resolution images and videos because they cannot capture long-range dependencies.

Vision Transformers (ViTs) have recently emerged as a powerful alternative to CNNs for processing both images and videos. ViTs use multi-head self-attention to recombine and process patch tokens based on the relationships between each pair of tokens; they can therefore model long-range dependencies and generate a global representation of the overall image. However, the computational complexity of ViTs grows quadratically with the number of input tokens, making them intractable for processing high-resolution images and long videos.

To reduce this complexity, we propose a multi-scale vision transformer, based on the Swin Transformer, that constructs hierarchical feature maps and restricts the computation of self-attention to local windows. Moreover, we introduce several efficient sparse transformer architectures for 2D human body pose estimation in images and videos. Since the main computational bottleneck is the number of tokens to be processed, we propose to reduce complexity while maintaining high-resolution representations by selecting a small number of informative body-part patches and dropping uninformative and background patches. We propose two strategies to achieve this goal: (1) a lightweight 2D pose estimation transformer that guides token selection, and (2) an adaptive patch selection network that automatically selects the most informative patches; a minimal sketch of this token-pruning idea is shown below. Experiments on two common 2D pose estimation benchmarks, COCO and MPII, demonstrate that the proposed methods yield significant improvements in speed and memory complexity while achieving accuracy comparable to state-of-the-art models.
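To make the token-pruning idea concrete, the PyTorch sketch below keeps only the top-scoring patch tokens before the expensive attention blocks. The linear scorer, the keep ratio, and all module names are illustrative placeholders, not the published architectures, which use a lightweight pose transformer or an adaptive patch selection network to rank patches.

```python
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    """Keep only the top-k most informative patch tokens.

    Hypothetical stand-in for the selection networks described above:
    a single linear layer scores each token, and low-scoring (mostly
    background) tokens are dropped before the costly attention blocks.
    """
    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-token informativeness score
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        B, N, D = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.scorer(tokens).squeeze(-1)     # (B, N)
        keep = scores.topk(k, dim=1).indices         # (B, k)
        keep = keep.unsqueeze(-1).expand(-1, -1, D)  # (B, k, D)
        return tokens.gather(1, keep)                # (B, k, D)

# Self-attention cost is quadratic in the token count, so keeping
# N/4 tokens cuts attention FLOPs in those layers by roughly 16x.
tokens = torch.randn(2, 1024, 256)        # e.g., a 32x32 patch grid
pruned = TokenSelector(dim=256)(tokens)
print(pruned.shape)                       # torch.Size([2, 256, 256])
```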
Learning and Comparing Movements for Imitation Assessment
Leveraging the additional efforts in recruitment and data collection, we are interested in studying the generalization performance of the proposed imitation assessment method. To this end, we evaluated the CAMI model published in [1] on a different sample: 53 new participants imitating two sequences of movements. In the absence of ground-truth annotations of imitation performance for this new dataset, we evaluated the discriminative ability of the CAMI scores with respect to the participants' autism diagnosis. The results show a drop in the discriminative ability of the imitation scores on this new dataset relative to previous data, which could be explained by the lower autism severity (measured by ADOS scores) observed in the new sample. These results attest to the open challenges in addressing different populations within the highly heterogeneous autism spectrum. We are currently identifying additional features that could be relevant for imitation assessment, such as the continuous relative phase (CRP), to strengthen the imitation assessment model used by CAMI.
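Since CRP is named as a candidate feature, a minimal sketch of one standard Hilbert-transform formulation is given below: each centered joint-angle trajectory is converted to an instantaneous phase, and CRP is the pointwise phase difference. The joint names, sampling rate, and preprocessing are illustrative assumptions, not CAMI's actual pipeline.

```python
import numpy as np
from scipy.signal import hilbert

def continuous_relative_phase(theta1: np.ndarray, theta2: np.ndarray) -> np.ndarray:
    """Continuous relative phase (degrees) between two joint-angle signals.

    Each centered signal's instantaneous phase is the angle of its
    analytic (Hilbert-transformed) signal; CRP is their pointwise
    difference, wrapped to [-180, 180).
    """
    phase1 = np.angle(hilbert(theta1 - theta1.mean()))
    phase2 = np.angle(hilbert(theta2 - theta2.mean()))
    crp = np.rad2deg(phase1 - phase2)
    return (crp + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)

# Toy example: two 1 Hz oscillations with a constant 45-degree lag
t = np.linspace(0, 5, 500)
elbow = np.sin(2 * np.pi * t)
wrist = np.sin(2 * np.pi * t - np.pi / 4)
print(continuous_relative_phase(elbow, wrist).mean())  # ~45 (up to edge effects)
```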
People
R. Vidal, S. Mostofsky, D. Lidstone, K. Kinfu, C. Pacheco.
Acknowledgement
Work supported by an NSF grant.
Publications
[1]
B. Tunçgenç, C. Pacheco, R. Rochowiak, R. Nicholas, S. Rengarajan, E. Zou, B. Messenger, R. Vidal, and S. H. Mostofsky.
Computerized Assessment of Motor Imitation as a Scalable Method for Distinguishing Children With Autism.
Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, vol. 6, no. 3, 2021.
[2]
D. E. Lidstone, R. Rochowiak, C. Pacheco, B. Tunçgenç, R. Vidal, and S. H. Mostofsky.
Automated and scalable Computerized Assessment of Motor Imitation (CAMI) in children with Autism Spectrum Disorder using a single 2D camera: A pilot study.
Research in Autism Spectrum Disorders, vol. 87, 2021.
[3]
T. G. George, R. Rochowiak, K. T. King, D. Lidstone, C. Pacheco, B. Tunçgenç, R. Vidal, S. H. Mostofsky, and A. T. Eggebrecht.
Illuminating brain function during gross motor imitation using high-density diffuse optical tomography (HD-DOT).
In Biophotonics Congress: Biomedical Optics, 2022.