Online learning has become increasingly prevalent across all educational levels, with students frequently turning to Large Language Models (LLMs) for academic assistance. While AI-driven instructional tools promise to personalize education through real-time, accessible guidance, LLMs can hallucinate, misinterpret information, and deliver inaccurate responses with unwarranted confidence. Further, general-purpose LLMs lack pedagogical objectives and tend to give direct answers rather than scaffolded learning experiences. These limitations are especially salient in an engineering context, where existing AI tutors primarily target coding or mathematics rather than engineering topics and problem-solving methods.
This work describes and evaluates AI-Tutor, a novel AI tutor integrated into two undergraduate Electrical and Computer Engineering (ECE) courses at a large public research university in the southeastern United States. AI-Tutor is a web-based interface connected to an LLM supplied with course materials through retrieval-augmented generation (RAG). Guided by Self-Regulated Learning Theory, the system provides four core functions through carefully designed prompt structures: (1) checking homework solutions, (2) guiding students through homework solutions from the beginning, (3) answering open-ended conceptual questions, and (4) generating instructor-facing analytics. Students initiate interactions by asking questions about a homework problem or a course concept, and the tutor responds by breaking explanations into small, manageable steps that engage the student in text-based dialog. The instructor summary aggregates these interactions, identifying common issues and sticking points. Piloted in Spring 2025 and refined for Fall 2025, the tutor generated more than 10,000 interactions from 118 voluntary student users across the two courses during Fall 2025 as part of an IRB-approved study.
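To make the prompt-structure idea concrete, the sketch below shows one plausible way to combine a mode-specific system prompt with RAG-retrieved course material for the first three functions (the fourth, instructor-facing analytics, aggregates logged interactions rather than responding to a single student turn). The function names, prompt wording, and data layout are our own illustrative assumptions and are not drawn from the deployed system.

```python
# Illustrative sketch only: the paper does not publish the tutor's implementation.
# Names such as TutorRequest, build_tutor_prompt, and the prompt wording are
# hypothetical placeholders, not the actual system's code.

from dataclasses import dataclass

@dataclass
class TutorRequest:
    course_id: str          # which ECE course's materials to retrieve against
    mode: str               # "check_homework", "guide_homework", or "concept_question"
    student_message: str    # the student's question or attempted solution

# Scaffolding-oriented system prompts, one per tutoring mode, reflecting the
# Self-Regulated Learning goal of small, manageable steps rather than direct answers.
MODE_PROMPTS = {
    "check_homework": (
        "You are a tutor. Compare the student's work to the retrieved course materials. "
        "Point out the first error you find and ask a guiding question; do not give the full solution."
    ),
    "guide_homework": (
        "You are a tutor. Walk the student through the problem from the beginning, "
        "one small step per reply, pausing for the student to respond."
    ),
    "concept_question": (
        "You are a tutor. Explain the concept using the retrieved course materials, "
        "then ask a short follow-up question to check understanding."
    ),
}

def build_tutor_prompt(req: TutorRequest, retrieved_chunks: list[str]) -> list[dict]:
    """Assemble a chat-style prompt: system instructions + RAG context + student turn."""
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": MODE_PROMPTS[req.mode]},
        {"role": "system", "content": f"Course materials (retrieved):\n{context}"},
        {"role": "user", "content": req.student_message},
    ]
```

The resulting message list would then be sent to the underlying LLM; the retriever and the model call are omitted here because the study does not specify them.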
Using a hybrid human evaluation and LLM-as-judge methodology, we systematically evaluate Fall 2025 tutor responses across five dimensions: Technical Accuracy, Relevancy, Coherence, Evaluator-Perceived Quality, and Feasibility within Interface. Technical accuracy captures whether a response correctly uses RAG-supplied information and avoids typos, calculation errors, and hallucinations. Relevancy indicates whether a response answers the student's question. Coherence assesses logical flow within the response. Evaluator-perceived quality reflects whether the response matches the quality of one a human might provide, such as a graduate teaching assistant during office hours. Finally, feasibility within interface examines whether the tutor's responses are feasible given the current interface and features, i.e., whether they avoid directing students toward actions the system cannot perform. We also record instances in which student users explicitly express disagreement, confusion, or frustration with a response as a secondary marker of quality. This evaluation effort allows us to quantify error rates and characterize the types of errors that occur. These results indicate whether RAG-enhanced LLMs can provide reliable instructional support comparable to that of human teaching assistants.
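As a sketch of the LLM-as-judge portion of this methodology, the snippet below shows one plausible way to prompt a judge model to score a tutor response on the five dimensions and to parse its output. The rubric wording, the 1-to-5 scale, and the helper names are illustrative assumptions rather than the study's actual evaluation prompts; the call to the judge model itself is omitted.

```python
# Illustrative sketch only: rubric text and scoring scale are assumptions,
# not the study's published evaluation instrument.

import json

DIMENSIONS = [
    "technical_accuracy",            # correct use of RAG-supplied material; no hallucinations or calculation errors
    "relevancy",                     # does the response answer the student's question?
    "coherence",                     # logical flow within the response
    "evaluator_perceived_quality",   # comparable to a human TA's office-hours answer?
    "feasibility_within_interface",  # avoids suggesting actions the system cannot perform
]

JUDGE_TEMPLATE = """You are grading an AI tutor's reply to an engineering student.
Score each dimension from 1 (poor) to 5 (excellent) and return JSON only.

Dimensions: {dims}

Student question:
{question}

Tutor response:
{response}

Return: {{"scores": {{"<dimension>": <int>, ...}}, "notes": "<one-sentence justification>"}}
"""

def build_judge_prompt(question: str, response: str) -> str:
    """Fill the rubric template for a single (question, response) pair."""
    return JUDGE_TEMPLATE.format(dims=", ".join(DIMENSIONS), question=question, response=response)

def parse_judge_output(raw: str) -> dict[str, int]:
    """Parse the judge's JSON reply and keep only the expected dimensions."""
    data = json.loads(raw)
    return {d: int(data["scores"][d]) for d in DIMENSIONS}
```

In a hybrid design such as the one described, scores like these would typically be compared against human ratings on a shared sample of responses before the automated judge is applied at scale.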