2026 ASEE Annual Conference & Exposition

LLM Use to Evaluate Student Weekly Reflections in First Year Design Class

Presented at Computers in Education (CoED): Learning, Engagement & Inclusion (2 of 9) -- M408B

This empirical paper investigates whether Large Language Models (LLMs) can effectively assist professors in summarizing and scoring students' weekly reflections. Reflections support improved retention and enable instructors to identify students who may need personalized assistance. Driven by the need for scalable, timely, and actionable feedback analysis in large classes, this study examines the reliability of LLM summarizations. The LLM-generated scores are then validated by correlating them with students' self-reported quantitative improvements in a first-year engineering course.

Weekly student feedback collected from this course was pre-processed and analyzed using the LLM ChatGPT-4o-mini. The analysis included LLM-based summarization and rubric-based Likert scoring on a 1 to 5 scale. The rating categories were roadblocks, tone/mood, perceived learning, team sentiment, persistence, and engagement quality. These categories were based on the Likert responses and open-ended questions in the feedback form, in which students were asked to discuss breakthroughs and roadblocks on their design project, teamwork, and general feedback about the class. Manual analysis of past student reflections served as a benchmark for evaluating the accuracy and consistency of the LLM-generated summaries and sentiment scores. The comparison emphasized how effectively both methods identified students who may require additional support or intervention. After the LLM demonstrated performance comparable to manual grading, Spearman's correlation tests were used to examine how well the LLM-derived rubric scores align with students' end-of-quarter Likert responses, and to illustrate how these scores could be used to establish concurrent validity.
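As an illustration of the correlation step described above, the following is a minimal sketch of Spearman's rank correlation between two sets of 1 to 5 scores. It is not the authors' analysis code: the function names (`average_ranks`, `spearman_rho`) and the score lists are hypothetical, and the data is made up purely to show the calculation. Tied scores, which are common on a Likert scale, receive the average of the ranks they span.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one LLM rubric score and one end-of-quarter
# self-reported Likert response per student (both on a 1 to 5 scale).
llm_scores = [4, 2, 5, 3, 1, 4]
self_reports = [5, 2, 4, 3, 1, 4]

rho = spearman_rho(llm_scores, self_reports)
print(f"Spearman rho = {rho:.3f}")
```

A rank-based coefficient suits this setting because Likert scores are ordinal: it measures whether students ranked higher by one measure also tend to be ranked higher by the other, without assuming equal spacing between scale points. In practice a statistics library would typically be used so that a p-value is reported alongside the coefficient.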

Results from this study indicate that LLM-generated scoring based on rubrics is comparable to human scoring. These results suggest that LLMs can serve as effective, scalable graders of short reflection forms in large engineering classes when paired with clear rubrics and prompts. This work highlights how integrating LLM-generated reflection scores establishes a foundation for supporting timely interventions for at-risk students in the future.

Authors
  1. Anthony Gwun Hynn Chin, University of California, San Diego
  2. Dr. Nathan Delson, University of California, San Diego
  3. Jishan Kharbanda, University of California, Los Angeles
  4. Hollis Voinov, University of California, San Diego
  5. Andrea Tueanh Huynh, University of California, San Diego
Note

The full paper will be available to logged-in, registered conference attendees once the conference starts on June 21, 2026, and to all visitors after the conference ends on June 24, 2026.