Engineering education is rapidly integrating generative artificial intelligence (GenAI) tools that promise faster, more consistent assessment, yet their reliability in discipline-specific contexts remains uncertain. This mixed-methods study compared ChatGPT-4, Claude 3.5, and Perplexity AI across four undergraduate engineering assignments (two lower-level, two upper-level). Quantitative analyses (one-way ANOVA followed by Tukey’s HSD, α = .05) contrasted AI scores with expert grades, while qualitative feedback from faculty and students captured perceptions of clarity, fairness, and workload. ChatGPT-4 mirrored expert grades on complex tasks (|Δ| ≤ 3.5%), whereas Claude 3.5 and Perplexity AI under-scored upper-level work by as much as 27%. Stakeholders appreciated the rubric’s consistency and faster turnaround but criticized the models’ rigidity and opaque rationales. These findings support a hybrid approach in which AI tools provide baseline scores and instructors supply higher-order judgement. Further research should examine discipline-specific fine-tuning and the long-term impact of AI-assisted grading on student learning and educator workload.
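The quantitative comparison described above (one-way ANOVA followed by Tukey’s HSD at α = .05) can be illustrated with a minimal Python sketch. The score arrays below are hypothetical placeholders, not the study’s data, and the grader labels are illustrative only.

```python
# Minimal sketch (hypothetical data): one-way ANOVA followed by Tukey's HSD
# at alpha = .05 to compare AI-assigned scores against expert grades,
# mirroring the analysis approach described in the abstract.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Illustrative scores for one assignment (0-100 scale); not the study's data.
expert   = np.array([88, 92, 79, 85, 90, 83])
chatgpt4 = np.array([86, 90, 81, 84, 91, 80])
claude35 = np.array([70, 75, 62, 68, 73, 66])

# One-way ANOVA across the three graders.
f_stat, p_value = stats.f_oneway(expert, chatgpt4, claude35)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Post-hoc Tukey HSD to identify which graders differ significantly.
scores = np.concatenate([expert, chatgpt4, claude35])
groups = (["expert"] * len(expert)
          + ["chatgpt4"] * len(chatgpt4)
          + ["claude35"] * len(claude35))
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```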