The mastery learning framework emphasizes formative assessment, where students are given opportunities to fail, receive feedback, and improve until they reach proficiency.
In this approach, timely and consistent feedback is essential, yet remains a major challenge in large-enrollment programming courses.
Current mastery learning platforms typically rely on automated testing, which verifies correctness but offers little guidance on code quality or style.
As a result, students may achieve functional solutions without receiving formative insights until instructional staff intervene, creating a resource bottleneck.
This pilot study explores the integration of large language models (LLMs) into an existing mastery learning platform for programming courses.
Our system operates in three phases: (1) automated correctness testing through traditional test cases, (2) generation of targeted semantic feedback via a secure, institutionally hosted LLM API once a correctness threshold is achieved, and (3) post-processing to refine the feedback so that it remains constructive without over-directing the student’s problem-solving process.
The system enables instructors to customize the assessment criteria, aligning feedback with specific learning objectives.
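The three phases and the instructor-defined criteria described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the names `Rubric`, `run_tests`, `build_prompt`, `postprocess`, and `give_feedback`, the 0.8 threshold, and the rule of stripping fenced code from the model output are all assumptions made for the example. The `llm` argument stands in for a call to the institutionally hosted LLM API, whose interface is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rubric:
    """Instructor-defined assessment criteria (hypothetical structure)."""
    criteria: List[str]
    pass_threshold: float = 0.8  # assumed fraction of tests that must pass


def run_tests(tests: List[Callable[[], bool]]) -> float:
    """Phase 1: return the fraction of test cases that pass."""
    if not tests:
        return 0.0
    passed = 0
    for test in tests:
        try:
            if test():
                passed += 1
        except Exception:
            pass  # a test that raises counts as a failure
    return passed / len(tests)


def build_prompt(code: str, rubric: Rubric) -> str:
    """Phase 2 input: ask for hints tied to the instructor's criteria."""
    criteria = "\n".join(f"- {c}" for c in rubric.criteria)
    return (
        "Give brief, constructive feedback on the submission below.\n"
        "Address only these criteria and do not write a solution:\n"
        f"{criteria}\n\nSubmission:\n{code}"
    )


def postprocess(feedback: str) -> str:
    """Phase 3 (assumed rule): drop fenced code so feedback stays a hint."""
    fence = "`" * 3
    kept, in_fence = [], False
    for line in feedback.splitlines():
        if line.strip().startswith(fence):
            in_fence = not in_fence
            continue
        if not in_fence:
            kept.append(line)
    return "\n".join(kept).strip()


def give_feedback(
    code: str,
    tests: List[Callable[[], bool]],
    rubric: Rubric,
    llm: Callable[[str], str],
) -> dict:
    """Run the three phases; the LLM is called only above the threshold."""
    score = run_tests(tests)
    feedback: Optional[str] = None
    if score >= rubric.pass_threshold:
        feedback = postprocess(llm(build_prompt(code, rubric)))
    return {"score": score, "feedback": feedback}
```

In this sketch, gating the LLM call on the test score keeps grading deterministic, and submissions below the threshold receive only the score.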
We evaluated the system by deploying it in a graduate-level programming course in the machine learning track at Duke University, using a within-subject, two-stage survey design.
Baseline surveys captured student experiences with grade-only feedback, while post-deployment surveys assessed perceived clarity, usefulness, efficiency, and motivational impact of the LLM-augmented feedback.
Although this pilot study drew on a small cohort (N = 9), the qualitative results indicated that students preferred the AI-supported feedback over the grade-only baseline: they reported improved interpretability, reduced error diagnosis time, and lower perceived reliance on teaching assistants.
This work provides preliminary evidence that workflow-constrained LLM pipelines, rather than raw model capability alone, can deliver scalable, pedagogically aligned formative feedback.
The findings suggest a practical hybrid assessment model where deterministic grading ensures correctness, LLMs provide structured interpretive guidance, and instructors focus on higher-order conceptual learning.
The full paper will be available to logged-in, registered conference attendees once the conference starts on June 21, 2026, and to all visitors after the conference ends on June 24, 2026.