As generative AI tools become integrated into learning environments, educators must understand how different large language models (LLMs) interpret and solve assessment items intended to measure human reasoning. This study mimics potential student-user conditions by treating the models as red-team participants, allowing educators to probe the relative utility of AI systems in correctly identifying problem structure, articulating reasoning steps, and producing valid solutions at each level of Bloom's taxonomy. Comparative results reveal variation in problem-solving logic and in the tendency to overproduce reasoning steps and solutions.
This study evaluates the performance of multiple LLMs, including Copilot, ChatGPT, and locally run Ollama instances, when tasked with answering Engineering Controls exam questions spanning the levels of Bloom's taxonomy. A total of 21 exam items (short-answer and design-based) were drawn from Engineering Controls textbooks, test banks, and other sources. Each item was classified by independent reviewers according to Bloom's levels (remember, understand, apply, analyze, evaluate, create). Three LLM configurations were compared: (1) Copilot GPT-4 (cloud-hosted, proprietary weights); (2) ChatGPT-5.2 container; (3) Ollama Llama-3 70B (locally hosted, open-weights model). Models received no feedback between runs to avoid iterative tuning. Each model's response to every test item was scored on a three-dimension rubric: Accuracy, Reasoning Transparency, and Bloom Alignment. The authors expect that higher-order cognitive tasks will produce lower-rated responses for accuracy and reasoning transparency.
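A minimal sketch of such an audit harness is shown below, assuming a locally hosted Ollama instance exposing its default REST endpoint (http://localhost:11434/api/generate); the item text, Bloom tags, model tag, and rubric field names are illustrative assumptions, not the study's actual instrument or scoring procedure.

```python
"""Illustrative sketch of an exam-item audit harness (assumptions noted in comments)."""
import json
import urllib.request

# Ollama's default local REST endpoint; adjust if the instance is hosted elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical exam items, each tagged with a Bloom's taxonomy level
# (the study's 21 items came from textbooks and test banks).
ITEMS = [
    {"id": 1, "bloom": "remember", "prompt": "Define a proportional controller."},
    {"id": 2, "bloom": "analyze",
     "prompt": "Analyze the stability of a unity-feedback system with open-loop gain K/(s(s+2))."},
]

def ask_model(prompt: str, model: str = "llama3:70b") -> str:
    """Send one exam item to the local model and return its full response text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def score_response(response: str) -> dict:
    """Placeholder scorer: in the study, reviewers assign rubric scores by hand;
    the three dimensions mirror Accuracy, Reasoning Transparency, and Bloom Alignment."""
    return {"accuracy": None, "reasoning_transparency": None, "bloom_alignment": None}

if __name__ == "__main__":
    for item in ITEMS:
        answer = ask_model(item["prompt"])
        print(item["id"], item["bloom"], score_response(answer))
```

Because the models receive no feedback between runs, each item is sent as an independent, single-turn prompt rather than as part of a continuing conversation.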
This study offers practical value for educators by (1) identifying likely points of misinterpretation in exam questions, (2) revealing common solution patterns that emerge in AI-assisted responses, and (3) demonstrating how locally hosted, in-house LLMs can be used to audit assessment items and evaluate their robustness.
http://orcid.org/0009-0007-1163-7444
The Pennsylvania State University