As generative AI tools become integrated into learning environments, educators must understand how different large language models (LLMs) interpret and solve assessment items intended to measure human reasoning. This study mimics potential student-user conditions by treating the models as red-team participants, allowing educators to probe the relative utility of AI systems in correctly identifying problem structure, articulating reasoning steps, and producing valid solutions at each level of Bloom's taxonomy. Comparative results reveal variation in problem-solving logic and in the tendency to overproduce reasoning steps and solutions.
This study evaluates the performance of multiple LLMs, including Copilot, ChatGPT, and locally run Ollama instances, when tasked with answering Engineering Controls exam questions spanning the levels of Bloom's taxonomy. A total of 21 exam items (short-answer and design-based) were drawn from Engineering Controls textbooks, test banks, and other sources. Each item was classified by independent reviewers according to Bloom's levels (remember, understand, apply, analyze, evaluate, create). Three LLM configurations were compared: (1) Copilot GPT-4 (cloud-hosted, proprietary weights); (2) ChatGPT-5.2 container; (3) Ollama Llama-3 70B (locally hosted, open-weights model). Models received no feedback between runs to avoid iterative tuning. Each model's response to every test item was scored on a three-dimension rubric: Accuracy, Reasoning Transparency, and Bloom Alignment. The authors expect that higher-order cognitive tasks will produce lower-rated responses for accuracy and reasoning transparency.
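A minimal sketch of such an audit harness is shown below, assuming a locally hosted Ollama instance exposing its default REST endpoint (http://localhost:11434/api/generate); the item text, Bloom tags, model tag, and rubric field names are illustrative assumptions, not the study's actual instrument or scoring procedure.

```python
"""Illustrative sketch of an exam-item audit harness (assumptions noted in comments)."""
import json
import urllib.request

# Ollama's default local REST endpoint; adjust if the instance is hosted elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical exam items, each tagged with a Bloom's taxonomy level
# (the study's 21 items came from textbooks and test banks).
ITEMS = [
    {"id": 1, "bloom": "remember", "prompt": "Define a proportional controller."},
    {"id": 2, "bloom": "analyze",
     "prompt": "Analyze the stability of a unity-feedback system with open-loop gain K/(s(s+2))."},
]

def ask_model(prompt: str, model: str = "llama3:70b") -> str:
    """Send one exam item to the local model and return its full response text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def score_response(response: str) -> dict:
    """Placeholder scorer: in the study, reviewers assign rubric scores by hand;
    the three dimensions mirror Accuracy, Reasoning Transparency, and Bloom Alignment."""
    return {"accuracy": None, "reasoning_transparency": None, "bloom_alignment": None}

if __name__ == "__main__":
    for item in ITEMS:
        answer = ask_model(item["prompt"])
        print(item["id"], item["bloom"], score_response(answer))
```

Because the models receive no feedback between runs, each item is sent as an independent, single-turn prompt rather than as part of a continuing conversation.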
This study offers practical value for educators by (1) identifying likely points of misinterpretation in exam questions, (2) revealing common solution patterns that emerge in AI-assisted responses, and (3) demonstrating how locally hosted, in-house LLMs can be used to audit assessment items and evaluate their robustness.
http://orcid.org/0009-0007-1163-7444
The Pennsylvania State University