2026 ASEE Annual Conference & Exposition

A Cross-Disciplinary Study Evaluating the Effectiveness of GenAI created Multiple Choice Questions

Presented at Computers in Education (CoED): Learning, Engagement & Inclusion (2 of 9) -- M408B

As GenAI tools become increasingly available in education, instructors are actively exploring how to leverage AI to manage growing workloads, particularly in assessment design, where creating and maintaining question banks for homework, quizzes, and exams requires considerable time and effort. However, the effectiveness of AI-generated assessments remains an open question, and instructors are uncertain about both the quality of the questions and the reliability of the AI system.

Multiple choice questions (MCQs) represent a particularly compelling use case for AI automation given their widespread use, objectivity, and need for large question banks. However, poorly engineered prompts can negate the advantages of using AI by generating questions that are flawed or irrelevant, that take more effort to evaluate, and that often need extensive edits. Systematic frameworks for evaluating the effectiveness of AI-generated MCQs, and empirical evidence of GenAI's adherence to quality guidelines, remain limited.

We introduce an AI-powered MCQ generation tool built into an online interactive textbook platform. The tool uses GPT-4.1 with structured prompts that incorporate explicit guardrails and learning science principles to generate high-quality MCQs from instructor-selected content.

To evaluate whether GenAI reliably adheres to these prompt-based guardrails, we present a multi-dimensional framework and apply it to over 500 questions generated for introductory courses in Computer Science, Mathematics, and Data Science. Our framework evaluates the efficacy of MCQs based on quality and relevance, each of which is defined by several quantifiable metrics. Additionally, we classify MCQs using Bloom's taxonomy to identify patterns in the cognitive complexity of AI-generated questions. We also investigate the correlation between an MCQ's Bloom's level and its efficacy to expose how AI handles question generation at different Bloom's levels.
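As an illustration only, the correlation step described above could be sketched as follows. The abstract does not specify the metrics, the scoring scale, or the statistic used, so everything here is an assumption: the numeric ordering of Bloom's levels, the equal-weight `efficacy` score, the choice of Spearman rank correlation, and the made-up sample records are all hypothetical and are not the authors' method or data.

```python
from statistics import mean

# Assumption: Bloom's levels treated as an ordinal scale (1 = lowest complexity).
BLOOM_LEVELS = {"remember": 1, "understand": 2, "apply": 3,
                "analyze": 4, "evaluate": 5, "create": 6}


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def efficacy(quality, relevance):
    """Hypothetical efficacy score: equal-weight mean of two scores in [0, 1]."""
    return (quality + relevance) / 2


# Made-up records for illustration; not results from the study.
questions = [
    {"bloom": "remember", "quality": 0.9, "relevance": 1.0},
    {"bloom": "understand", "quality": 0.8, "relevance": 0.9},
    {"bloom": "apply", "quality": 0.7, "relevance": 0.9},
    {"bloom": "analyze", "quality": 0.6, "relevance": 0.7},
]

levels = [BLOOM_LEVELS[q["bloom"]] for q in questions]
scores = [efficacy(q["quality"], q["relevance"]) for q in questions]
print(spearman(levels, scores))
```

A rank-based statistic is used in this sketch because Bloom's levels are ordinal rather than interval-valued; a different measure of association may be more appropriate depending on how the metrics are actually defined.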

Overall, our findings reveal the extent to which GenAI adheres to prompt-based guardrails and generates effective MCQs. We present both strengths and systematic failure patterns that inform best practices for GenAI-assisted assessment design.

Authors
  1. Erica Perich, zyBooks, A Wiley Brand
  2. Dr. Yamuna Rajasekhar, zyBooks, A Wiley Brand
Note

The full paper will be available to logged-in, registered conference attendees once the conference starts on June 21, 2026, and to all visitors after the conference ends on June 24, 2026.