2025 ASEE Annual Conference & Exposition

NSF IUSE: Handling Imbalanced Engineering Persistence Data in Machine Learning with Undersampling & SMOTE

Presented at NSF Grantees Poster Session I

Engineering student persistence remains low at undergraduate institutions across the United States (50-60% graduate within 6 years [1]), especially for underrepresented minority (URM) groups (~40% [2]), with the largest dropout rate occurring in the first year (18.5% [2]). Research has revealed that students decide to leave engineering for many interrelated reasons (e.g., [3], [4], [5], [6]), making it challenging to intervene and help many students at once. However, recent developments in machine learning (ML) technologies and motivation theory present a unique opportunity for substantial, equitable change: identifying individualized factors for intervention to promote persistence. Predictive ML models such as neural networks, random forests, and Bayesian networks can identify students at risk of leaving engineering, and ML explanation methods such as SHAP and LIME can indicate the reasons behind these predictions. Applied to engineering persistence data, these tools can predict which students might leave and why, so that interventions can be targeted to meet individual student needs.
Our funded NSF-IUSE project was awarded to test and compare the effectiveness of current cutting-edge ML models in predicting engineering attrition and identifying individualized targets for intervention. Over the course of three years, we will evaluate the accuracy and consistency of three common ML models using five years of retrospective data from a large southeastern university, and then test the generalization of the results on data from a similar institution.

Thus far, we have completed Phase 1 of the project, which included data preprocessing and preliminary predictive model testing. Our work has revealed several important considerations for using ML to predict engineering persistence. First, different models require different preprocessing styles for categorical data such as race and gender [BLIND]. Specifically, the assumptions behind naïve Bayes models conflict with the standard one-hot encoding procedure often used with neural network and random forest models: one-hot encoding splits a single categorical variable into mutually exclusive binary columns, which violates the conditional-independence assumption naïve Bayes places on its features.
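To make the encoding issue concrete, here is a minimal sketch of one-hot encoding in plain Python (illustrative only; the categories and helper function are not taken from the study's actual preprocessing code):

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one column is 1 per sample
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot(["A", "B", "A", "C"])
# categories -> ["A", "B", "C"]
# encoded[0] -> [1, 0, 0]
```

Because exactly one indicator column is 1 in each row, the columns of a one-hot block are perfectly dependent on each other, whereas naïve Bayes treats every input column as conditionally independent; a single integer-coded categorical feature fits that model's assumptions more naturally.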

This poster will present our more recent findings. We found that random undersampling and the Synthetic Minority Over-sampling Technique (SMOTE), when used together, effectively handle the imbalanced data intrinsic to first-year persistence. Random undersampling randomly removes samples of the majority class (i.e., retained students) from the dataset [7]. SMOTE, meanwhile, is an over-sampling technique in which synthetic samples are created for the minority class (i.e., withdrawn students) [8]. Combined, the two techniques balance the dataset and enhance the performance of ML classifiers. Our results show that these methods improve model F1 scores (a standard ML metric that will be defined in the paper) by 16% and are therefore worth using when predicting engineering persistence.
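The two resampling steps can be sketched as follows. This is a toy illustration, not the project's pipeline: the class sizes, features, and random seed are made up, and the SMOTE step is a simplified version of the algorithm in [8] (interpolate between a minority sample and one of its k nearest minority neighbors):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X_maj, n_keep):
    """Randomly keep n_keep samples of the majority class."""
    idx = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[idx]

def smote(X_min, n_new, k=3):
    """Create n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all minority samples
        neighbors = np.argsort(d)[1:k + 1]            # skip the sample itself
        j = rng.choice(neighbors)
        gap = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy imbalanced data: 20 "retained" (majority) vs. 5 "withdrawn" (minority)
X_majority = rng.normal(0.0, 1.0, size=(20, 2))
X_minority = rng.normal(3.0, 1.0, size=(5, 2))

X_maj_balanced = random_undersample(X_majority, 10)             # shrink majority
X_min_balanced = np.vstack([X_minority, smote(X_minority, 5)])  # grow minority
# Both classes now contribute 10 samples each.
```

In practice, the imbalanced-learn library provides production implementations of both steps (its SMOTE and RandomUnderSampler classes). For reference, F1 is the harmonic mean of precision and recall, 2PR / (P + R).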

References
[1] President’s Council of Advisors on Science and Technology (PCAST), “Engage to excel: Producing one million additional college graduates with degrees in science, technology, engineering, and mathematics,” Washington, DC: Office of the President, 2012.
[2] B. L. Yoder, “Engineering by the numbers: ASEE retention and time-to-graduation benchmarks for undergraduate engineering schools, departments and programs,” American Society for Engineering Education, 2017, Accessed: Jan. 05, 2023. [Online]. Available: https://ira.asee.org/wp-content/uploads/2017/07/2017-Engineering-by-the-Numbers-3.pdf.
[3] V. Tinto, Leaving college: Rethinking the causes and cures of student attrition. ERIC, 1987.
[4] V. Tinto and J. Cullen, “Dropout in Higher Education: A Review and Theoretical Synthesis of Recent Research,” Washington, D.C.: Office of Planning, Budgeting, and Evaluation, Department of Health, Education, and Welfare, Contract OEC-0-73-1409, 1973.
[5] J. Bean and S. B. Eaton, “The psychology underlying successful retention practices,” J Coll Stud Ret, vol. 3, no. 1, pp. 73–89, 2001, doi: 10.2190/6r55-4b30-28xg-l8u0.
[6] C. P. Veenstra, E. L. Dey, and G. D. Herrin, “A model for freshman engineering retention,” Adv Eng Educ, vol. 1, no. 3, 2009.
[7] H. He and Y. Ma, Imbalanced learning: Foundations, algorithms, and applications. 2013. doi: 10.1002/9781118646106.
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, 2002, doi: 10.1613/jair.953.

Authors
  1. Alvin Tran University of Louisville
  2. Christian Zuniga-Navarrete University of Louisville
  3. Dr. Xiaomei Wang University of Louisville
  4. Luis Segura University of Louisville
Note

The full paper will be available to logged-in, registered conference attendees once the conference starts on June 22, 2025, and to all visitors after the conference ends on June 25, 2025.