2025 ASEE Annual Conference & Exposition

NSF IUSE: Handling Imbalanced Engineering Persistence Data in Machine Learning with Undersampling & SMOTE

Presented at NSF Grantees Poster Session I

Engineering student persistence remains low at undergraduate institutions across the United States (50-60% graduate within 6 years [1]), especially for underrepresented minority (URM) groups (~40% [2]), with the largest dropout rate occurring in the first year (18.5% [2]). Research has revealed that students decide to leave engineering for many interrelated reasons (e.g., [3], [4], [5], [6]), making it challenging to intervene and help many students at once. However, recent developments in machine learning (ML) technologies and motivation theory present a unique opportunity for substantial, equitable change: identifying individualized factors for intervention to promote persistence. Predictive ML models such as neural networks, random forests, and Bayesian networks can identify students at risk of leaving engineering, and ML explanation methods such as SHAP and LIME can indicate the reasons behind these predictions. Applied to engineering persistence data, these tools can predict which students might leave and why, so that interventions can be targeted to meet individual student needs.
Our funded NSF-IUSE project was awarded to test and compare the effectiveness of current cutting-edge ML models in predicting engineering attrition and identifying individualized targets for intervention. Over the course of three years, we will evaluate the accuracy and consistency of three common ML models using five years of retrospective data from a large southeastern university, and then test the generalization of the results on data from a similar institution.

Thus far, we have completed Phase 1 of the project, which included data preprocessing and preliminary predictive model testing. Our work has revealed several important considerations for using ML to predict engineering persistence. First, different models require different preprocessing styles for categorical data such as race and gender [BLIND]. Specifically, the assumptions behind naïve Bayes models conflict with the standard one-hot encoding procedure often used with neural network and random forest models: one-hot encoding splits a single categorical variable into mutually exclusive binary columns, which violates the conditional-independence assumption naïve Bayes places on its features.
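To make the encoding issue concrete, here is a minimal sketch of one-hot encoding in plain Python (illustrative only; the categories and helper function are not taken from the study's actual preprocessing code):

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one column is 1 per sample
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot(["A", "B", "A", "C"])
# categories -> ["A", "B", "C"]
# encoded[0] -> [1, 0, 0]
```

Because exactly one indicator column is 1 in each row, the columns of a one-hot block are perfectly dependent on each other, whereas naïve Bayes treats every input column as conditionally independent; a single integer-coded categorical feature fits that model's assumptions more naturally.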

This poster will present our more recent findings. We found that random undersampling and the Synthetic Minority Over-sampling Technique (SMOTE), when used together, effectively handle the imbalanced data intrinsic to first-year persistence. Random undersampling randomly removes samples of the majority class (i.e., retained students) from the dataset [7]. SMOTE, meanwhile, is an over-sampling technique in which synthetic samples are created for the minority class (i.e., withdrawn students) [8]. Combined, the two techniques balance the dataset and enhance the performance of ML classifiers. Our results show that these methods improve model F1 scores (a standard ML metric that will be defined in the paper) by 16% and are therefore worth using when predicting engineering persistence.
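The two resampling steps can be sketched as follows. This is a toy illustration, not the project's pipeline: the class sizes, features, and random seed are made up, and the SMOTE step is a simplified version of the algorithm in [8] (interpolate between a minority sample and one of its k nearest minority neighbors):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X_maj, n_keep):
    """Randomly keep n_keep samples of the majority class."""
    idx = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[idx]

def smote(X_min, n_new, k=3):
    """Create n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all minority samples
        neighbors = np.argsort(d)[1:k + 1]            # skip the sample itself
        j = rng.choice(neighbors)
        gap = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy imbalanced data: 20 "retained" (majority) vs. 5 "withdrawn" (minority)
X_majority = rng.normal(0.0, 1.0, size=(20, 2))
X_minority = rng.normal(3.0, 1.0, size=(5, 2))

X_maj_balanced = random_undersample(X_majority, 10)             # shrink majority
X_min_balanced = np.vstack([X_minority, smote(X_minority, 5)])  # grow minority
# Both classes now contribute 10 samples each.
```

In practice, the imbalanced-learn library provides production implementations of both steps (its SMOTE and RandomUnderSampler classes). For reference, F1 is the harmonic mean of precision and recall, 2PR / (P + R).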

References
[1] President’s Council of Advisors on Science and Technology (PCAST), “Engage to excel: Producing one million additional college graduates with degrees in science, technology, engineering, and mathematics,” Washington, DC: Office of the President, 2012.
[2] B. L. Yoder, “Engineering by the numbers: ASEE retention and time-to-graduation benchmarks for undergraduate engineering schools, departments and programs,” American Society for Engineering Education, 2017, Accessed: Jan. 05, 2023. [Online]. Available: https://ira.asee.org/wp-content/uploads/2017/07/2017-Engineering-by-the-Numbers-3.pdf.
[3] V. Tinto, Leaving college: Rethinking the causes and cures of student attrition. ERIC, 1987.
[4] V. Tinto and J. Cullen, “Dropout in Higher Education: A Review and Theoretical Synthesis of Recent Research,” Washington, D.C.: Office of Planning, Budgeting, and Evaluation, Department of Health, Education, and Welfare, Contract OEC-0-73-1409, 1973.
[5] J. Bean and S. B. Eaton, “The psychology underlying successful retention practices,” J Coll Stud Ret, vol. 3, no. 1, pp. 73–89, 2001, doi: 10.2190/6r55-4b30-28xg-l8u0.
[6] C. P. Veenstra, E. L. Dey, and G. D. Herrin, “A model for freshman engineering retention,” Adv Eng Educ, vol. 1, no. 3, 2009.
[7] H. He and Y. Ma, Imbalanced learning: Foundations, algorithms, and applications. 2013. doi: 10.1002/9781118646106.
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, 2002, doi: 10.1613/jair.953.

Authors
  1. Alvin Tran University of Louisville
  2. Christian Zuniga-Navarrete University of Louisville
  3. Dr. Xiaomei Wang University of Louisville
  4. Luis Segura University of Louisville
Note

The full paper will be available to logged-in, registered conference attendees once the conference starts on June 22, 2025, and to all visitors after the conference ends on June 25, 2025.