The purpose of this paper is to describe our research group’s motivations and methods for resolving ambiguity in engineering students’ interaction data. Students’ social interactions can be an integral part of their success in school. For example, researchers have identified positive correlations between students’ interaction levels and their sense of belonging, access to knowledge, and ability to work in a team. To identify these relationships, engineering educators often employ a method called Social Network Analysis (SNA). SNA enables a quantitative look at inherently qualitative social interactions via mathematical representations of real-world social networks. With such representations, researchers can extract network measures, like centrality, which estimates how connected a student is to other students. Such measures, when compared to students’ outcomes of interest (e.g., academic success or belonging), provide insights into which network characteristics relate to desired outcomes.
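To make the idea of a network measure concrete, the sketch below computes degree centrality, one common centrality variant, from a list of reported interactions. The choice of degree centrality, the function name, and the student labels are illustrative assumptions, not details taken from the study.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return each node's degree divided by (n - 1), the maximum possible degree.

    `edges` is an iterable of (student, student) pairs, treated as undirected.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # skip self-reports; they do not connect two students
        neighbors[u].add(v)
        neighbors[v].add(u)

    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical interaction data: each pair is one reported connection.
interactions = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
centrality = degree_centrality(interactions)
```

In this toy network, student C is connected to every other student and so receives the highest score, while student D, with a single connection, receives the lowest.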
However, the conclusions drawn from SNA are limited to the study’s scope. For example, online interaction data limits study conclusions to online networks. In face-to-face (f2f) networks, data collection and consolidation requirements (e.g., removing name variants and reducing missing network information) scale with study scope, making large-scale f2f SNA difficult. To balance authentic interactions with a manageable study scope, researchers often conduct SNA in small settings like single classrooms. Further, such studies often ask students to identify connections from a roster of enrolled students, which reduces the number of potential interactions. To analyze more authentic student networks, our research group is conducting an open-response, large-scale (1000+ nodes) study of a full cohort of first- and second-year engineering students’ f2f and online social networks over two years.
During this study, our research group found that the primary issue in data consolidation is reference ambiguity (i.e., differences in response spelling, formatting, etc.). Entity resolution (the process of assigning ambiguous references to real-life entities) is a valuable method that researchers have applied in some f2f network contexts, but that we have not observed in engineering education studies. To make entity resolution more available to engineering education researchers, this paper presents our development and deployment of a Python-based entity resolution module: EntityRAID (Entity Resolution for Ambiguous Interaction Data).
EntityRAID begins by initializing a key with high-confidence names (i.e., self-reported and/or registry names). After initializing the key, EntityRAID compares resolved names (names in the key) to remaining ambiguous names through the Levenshtein distance (literal string similarity) and Double Metaphone (phonetic similarity) algorithms and consolidates similar pairs of names via user-defined thresholds. Lastly, EntityRAID resolves the remaining ambiguous names using a low-confidence key of non-participant full names.
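The string-similarity step described above can be illustrated with a minimal sketch. This is not EntityRAID's code: the function names, the normalization step, the example names, and the threshold of 2 edits are all assumptions made for illustration. The Double Metaphone comparison is omitted here because it is not part of the Python standard library.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(name.lower().split())


def resolve(ambiguous: str, key: Sequence[str], max_distance: int = 2) -> Optional[str]:
    """Return the closest key name within max_distance edits, else None."""
    target = normalize(ambiguous)
    best_name, best_distance = None, max_distance + 1
    for candidate in key:
        distance = levenshtein(target, normalize(candidate))
        if distance < best_distance:
            best_name, best_distance = candidate, distance
    return best_name


# Hypothetical high-confidence key and ambiguous survey responses.
high_confidence_key = ["John Smith", "Jane Doe"]
matched = resolve("Jon  Smth", high_confidence_key)
unmatched = resolve("Alex Kim", high_confidence_key)
```

Here the misspelled response is two edits away from "John Smith" and is consolidated with it, while "Alex Kim" exceeds the threshold for every key entry and stays unresolved for a later pass. A stricter or looser threshold trades missed matches against false merges, which is why a user-defined value is useful.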
EntityRAID is posted in a public GitHub repository and will enable engineering educators to perform large-scale SNA at a reduced resource cost. This improvement should allow researchers to study more authentic student networks than before, generating broader and more generalizable conclusions about which social practices help students succeed.