2026 ASEE Annual Conference & Exposition

Assessing the Reliability of Large Language Models for Scientific Information Extraction

Presented at AI-Enhanced Learning Ecosystems in Engineering Education

Building on our prior work, "WIP: Leveraging AI for Literature Reviews" (Gong & Maitra, 2025), which introduced an AI-assisted, step-by-step guideline for new researchers, this study extends the framework to evaluate Retrieval-Augmented Generation (RAG) capabilities. We propose a systematic approach for assessing the RAG capabilities of a single large language model (LLM) using three quantitative metrics: correctness (factual accuracy), completeness (contextual coverage), and compliance (adherence to citation or formatting standards). These metrics are tested on literature-retrieval and summarization tasks drawn from perovskite solar-cell and additive-manufacturing datasets, allowing a comparison between AI-generated and human-verified reviews. By quantifying how reliably an LLM retrieves, attributes, and integrates external sources, the study establishes a reproducible method for benchmarking AI tools such as ChatGPT, Perplexity, and Elicit in research workflows.
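As a rough illustration of how the three metrics could be computed, the sketch below scores an AI-generated review against a set of human-verified claims. The abstract does not give formulas, so the definitions here are assumptions: correctness is taken as the share of generated claims that appear in the reference set, completeness as the share of reference claims that the review covers, and compliance as the share of generated claims that carry a citation. The `Claim` type and `score_review` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One normalized statement extracted from a review (hypothetical type)."""
    text: str    # normalized wording, so identical claims compare equal
    cited: bool  # True if the claim carries a source citation


def score_review(generated: list[Claim], reference: set[str]) -> dict[str, float]:
    """Score a generated review against human-verified reference claims.

    All three definitions are assumptions made for this sketch, not the
    formulas used in the study.
    """
    if not generated or not reference:
        raise ValueError("both generated and reference claims are required")

    generated_texts = {claim.text for claim in generated}
    supported = generated_texts & reference  # claims confirmed by the reference

    return {
        # share of distinct generated claims that the reference confirms
        "correctness": len(supported) / len(generated_texts),
        # share of reference claims that the generated review covers
        "completeness": len(supported) / len(reference),
        # share of generated claims that include a citation
        "compliance": sum(claim.cited for claim in generated) / len(generated),
    }


# Placeholder claims, not data from the study.
example_generated = [
    Claim("claim-a", cited=True),
    Claim("claim-b", cited=False),
    Claim("claim-x", cited=True),
    Claim("claim-y", cited=True),
]
example_reference = {"claim-a", "claim-b", "claim-c"}
example_scores = score_review(example_generated, example_reference)
```

Exact string matching stands in for the harder step of deciding whether two statements express the same claim, which in practice would need human judgment or a semantic comparison.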
The second phase explores multi-agent RAG collaboration, modeling interactions among multiple LLMs (e.g., GPT-4, Claude, Gemini) as agents with distinct “personalities.” Each agent is characterized by behavioral traits—willingness to change, cooperation, verification precedence, and acceptance of reasoning—that influence how information is exchanged and reconciled. Through multi-shot dialogue experiments, we visualize knowledge exchange networks that reveal patterns of consensus, contradiction, and reasoning acceptance across agents. Preliminary results show that inter-agent verification enhances completeness and factual alignment but may reduce efficiency when compliance constraints dominate. This multi-level RAG evaluation framework links algorithmic accuracy with collaborative reasoning behavior, offering new directions for teaching undergraduates how to critically evaluate, verify, and synthesize AI-generated research outputs within ethical and academically rigorous contexts.
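The agent model and the knowledge exchange network can be pictured with a small sketch. Everything below is an assumption made for illustration: the abstract names the four traits but does not say how they are scored or combined, so traits are represented as values between 0 and 1, and a toy rule using only two of them decides whether a disagreeing agent accepts another agent's reasoning. `Agent`, `exchange_outcome`, and `build_network` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """An LLM agent described by the four traits named above (scores assumed in [0, 1])."""
    name: str
    willingness_to_change: float
    cooperation: float               # not used by the toy rule below
    verification_precedence: float   # not used by the toy rule below
    acceptance_of_reasoning: float


def exchange_outcome(receiver: Agent, agrees: bool, argument_strength: float) -> str:
    """Classify one exchange as consensus, accepted reasoning, or contradiction.

    Assumed rule: an agent that disagrees accepts the sender's reasoning only
    if the argument is stronger than a threshold that falls as the receiver's
    willingness to change and acceptance of reasoning rise.
    """
    if agrees:
        return "consensus"
    threshold = 1.0 - receiver.willingness_to_change * receiver.acceptance_of_reasoning
    return "accepted" if argument_strength >= threshold else "contradiction"


def build_network(agents, exchanges):
    """Count directed edges (sender, receiver, outcome) over a dialogue.

    `exchanges` is an iterable of (sender, receiver, agrees, argument_strength).
    """
    by_name = {agent.name: agent for agent in agents}
    edges = Counter()
    for sender, receiver, agrees, strength in exchanges:
        outcome = exchange_outcome(by_name[receiver], agrees, strength)
        edges[(sender, receiver, outcome)] += 1
    return edges


# Placeholder agents and exchanges, not results from the study.
agent_a = Agent("agent_a", 0.8, 0.5, 0.5, 0.5)
agent_b = Agent("agent_b", 0.2, 0.5, 0.5, 0.5)
network = build_network(
    [agent_a, agent_b],
    [
        ("agent_a", "agent_b", False, 0.7),
        ("agent_b", "agent_a", False, 0.7),
        ("agent_a", "agent_b", True, 0.0),
    ],
)
```

The resulting edge counts are the kind of data a network plot could draw on, with edge labels separating consensus, contradiction, and accepted reasoning.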

Authors
  1. Luke Schneider Pennsylvania State University, Behrend College
  2. Debalina Maitra Kennesaw State University
  3. Jing Zhao Pennsylvania State University, Behrend College
  4. Dr. Jiawei Gong (ORCID: https://orcid.org/0000-0003-4318-9387) Pennsylvania State University, Behrend College
Note

The full paper will be available to logged-in, registered conference attendees once the conference starts on June 21, 2026, and to all visitors after the conference ends on June 24, 2026.