This abstract is for a Work in Progress (WIP) paper and addresses adaptive computer-based learning and personal response systems with potential for mobile applications.
Given the decreasing cost of sensors and the increasing emphasis on harnessing multimodal data in education, researchers are exploring how to use this diverse data to improve student engagement and enhance academic performance [1]. Over the past several years, the performance of classifiers for each modality has improved significantly [2]. Several studies have investigated leveraging multimodal data and a combination of classifiers to model user engagement, knowledge and preferences [1].
This paper aims to leverage affect-aware classifiers and other personal data to better determine individualized optimal educational interactions, thereby replicating some of the benefits of one-on-one learning experiences. We present research focused on designing, developing, and evaluating a learning system that integrates facial expression, head pose, and task performance to construct rich models of users’ affective mannerisms, answer proficiencies, and interaction preferences. These models inform our selection of an optimal educational interaction, such as suggesting when to review content, proceed to new content, take a break, or provide emotional support, leading to highly adaptive and engaging educational experiences.
The proposed approach involves collecting webcam footage and interaction data between a user and the learning system. Features extracted from the webcam footage are fed to multiple modules, each designed for a particular task: head pose estimation, facial expression recognition, and American Sign Language (ASL) recognition. Although this system is designed to help parents of deaf children learn ASL, it is available to anyone who wants to learn the language. Moreover, the proposed system can be modified to teach content other than ASL.
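To make the module fan-out concrete, the sketch below shows one way per-frame webcam features might be dispatched to the three recognition modules. The class and function names (FrameFeatures, HeadPoseEstimator, process_frame, and the placeholder return values) are illustrative assumptions, not the system’s actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameFeatures:
    """Features extracted from a single webcam frame (placeholder fields)."""
    landmarks: List[float]   # facial / hand landmark coordinates
    timestamp: float         # capture time in seconds


class HeadPoseEstimator:
    def estimate(self, features: FrameFeatures) -> dict:
        # The real module would regress yaw/pitch/roll from the landmarks.
        return {"yaw": 0.0, "pitch": 0.0, "roll": 0.0}


class ExpressionRecognizer:
    def classify(self, features: FrameFeatures) -> str:
        # Placeholder: return a facial-expression label.
        return "neutral"


class ASLRecognizer:
    def recognize(self, features: FrameFeatures) -> str:
        # Placeholder: return the predicted sign for the current attempt.
        return "unknown_sign"


def process_frame(features: FrameFeatures) -> dict:
    """Fan a frame's features out to the three task-specific modules."""
    return {
        "head_pose": HeadPoseEstimator().estimate(features),
        "expression": ExpressionRecognizer().classify(features),
        "sign": ASLRecognizer().recognize(features),
    }


if __name__ == "__main__":
    frame = FrameFeatures(landmarks=[0.0] * 10, timestamp=0.0)
    print(process_frame(frame))
```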
The pose estimation and expression recognition modules feed the affect unit, which is designed to learn the user’s idiosyncratic emotional expressions. Answer accuracy is derived from the ASL hand-gesture recognition module and combined with interaction information, such as answer rate. These data are aggregated with the features generated by the affect unit. The aggregator fuses the diverse features and labels, evaluating the state of each in relation to user performance. The aggregated data is then passed to a recommendation agent, which, guided by criteria including answer accuracy, response rate, and user affect, selects the most suitable educational interaction.
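The following minimal sketch illustrates how such a recommendation agent could map the aggregator’s fused state to one of the educational interactions named above. The LearnerState fields, the rule-based policy, and the thresholds are assumptions made for illustration; the abstract does not specify the actual decision mechanism.

```python
from dataclasses import dataclass


@dataclass
class LearnerState:
    """Fused state produced by the aggregator (illustrative fields only)."""
    affect: str             # e.g. "frustrated", "engaged", "fatigued"
    answer_accuracy: float  # fraction of recent signs produced correctly
    response_rate: float    # answers per minute over a recent window


def recommend_interaction(state: LearnerState) -> str:
    """Rule-based stand-in for the recommendation agent.

    Thresholds and rules are illustrative placeholders, not the system's
    actual policy.
    """
    if state.affect == "frustrated":
        return "emotional_support"
    if state.affect == "fatigued" or state.response_rate < 1.0:
        return "take_break"
    if state.answer_accuracy < 0.6:
        return "review_content"
    return "advance_to_new_content"


if __name__ == "__main__":
    state = LearnerState(affect="engaged", answer_accuracy=0.85, response_rate=3.0)
    print(recommend_interaction(state))  # -> "advance_to_new_content"
```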
In this abstract, we have outlined three pivotal aspects that our research addresses: comprehending and modeling human multimodal data (i.e., affect states); flexibly adapting interactive user and task models; and optimizing the selection of educational interactions. Through this research, we hope to inform future work on using affect-aware, adaptive multimodal systems to optimize interactions and boost user engagement and curricular performance.
[1] Wilson Chango, Juan A. Lara, Rebeca Cerezo, and Cristóbal Romero. 2022. A review on data fusion in multimodal learning analytics and educational data mining. WIREs Data Mining and Knowledge Discovery.
[2] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In CVPR.